Both Eldric Client and Eldric Multi-API support multiple backends, so you can mix local inference with cloud APIs across your infrastructure.
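
As a sketch of what mixing backends looks like in practice, the snippet below sends the same request to a local vLLM server and to OpenAI's cloud API through their OpenAI-compatible endpoints. The URLs, keys, and model names are placeholders, not Eldric configuration.

```python
# Minimal sketch: one request, two backends, both via OpenAI-compatible
# endpoints. URLs, keys, and model names below are illustrative placeholders.
from openai import OpenAI

backends = {
    "local-vllm": OpenAI(base_url="http://localhost:8000/v1",
                         api_key="not-needed"),   # local server, no real key
    "openai-cloud": OpenAI(),  # reads OPENAI_API_KEY from the environment
}
models = {"local-vllm": "meta-llama/Meta-Llama-3-8B-Instruct",
          "openai-cloud": "gpt-4o"}

for name, client in backends.items():
    resp = client.chat.completions.create(
        model=models[name],
        messages=[{"role": "user", "content": "Say hello."}],
    )
    print(name, resp.choices[0].message.content)
```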

Local & Self-Hosted

Ollama

  • Port: 11434
  • REST API
  • Auto model discovery
  • Default backend
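
A quick check that a local Ollama server is reachable: the snippet below lists installed models (the "auto model discovery" endpoint) and runs a generation against the native REST API. The model name is an assumption; use one you have pulled.

```python
# Query a local Ollama server (default port 11434) via its native REST API.
import requests

# /api/tags lists every model Ollama has pulled locally.
models = requests.get("http://localhost:11434/api/tags").json()
print([m["name"] for m in models["models"]])

# Non-streaming generation; "llama3" is a placeholder model name.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3", "prompt": "Why is the sky blue?", "stream": False},
)
print(resp.json()["response"])
```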

vLLM

  • Port: 8000
  • OpenAI-compatible
  • PagedAttention
  • High throughput

llama.cpp

  • Port: 8080
  • REST + WebSocket
  • GGUF models
  • CPU + GPU
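
The native /completion endpoint of llama.cpp's server takes a raw prompt. A minimal call, assuming a llama-server instance on the default port:

```python
# Native llama.cpp server completion call (it also exposes /v1 OpenAI-style
# routes). The prompt and token budget here are arbitrary examples.
import requests

resp = requests.post(
    "http://localhost:8080/completion",
    json={"prompt": "The capital of France is", "n_predict": 32},
)
print(resp.json()["content"])
```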

HuggingFace TGI

  • Port: 8080
  • REST + gRPC
  • Tensor parallelism
  • Continuous batching
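
A minimal REST call against TGI's /generate endpoint, assuming a server on the port listed above:

```python
# Text Generation Inference returns the completion in "generated_text".
import requests

resp = requests.post(
    "http://localhost:8080/generate",
    json={"inputs": "What is deep learning?",
          "parameters": {"max_new_tokens": 64}},
)
print(resp.json()["generated_text"])
```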

LocalAI

  • Port: 8080
  • OpenAI-compatible
  • Multiple formats
  • CPU optimized

ExLlamaV2

  • Port: 5000
  • REST API
  • GPTQ/EXL2 quants
  • Fast inference

LMDeploy

  • Port: 23333
  • OpenAI-compatible
  • TurboMind engine
  • Quantization

MLC LLM

  • Port: 8080
  • REST API
  • Universal deploy
  • WebGPU support

Enterprise & ML Platforms

NVIDIA Triton

  • Port: 8000-8002
  • REST + gRPC
  • TensorRT optimization
  • Multi-framework
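
Triton's HTTP endpoint speaks the KServe v2 inference protocol (the same wire format KServe exposes; see the Specialized section below). In this sketch the model name, tensor name, shape, and data are placeholders for whatever your model repository defines:

```python
# KServe v2 inference request against Triton's HTTP port (8000).
# "my_model", "INPUT0", and the shape are illustrative assumptions.
import requests

payload = {
    "inputs": [{
        "name": "INPUT0",
        "shape": [1, 4],
        "datatype": "FP32",
        "data": [1.0, 2.0, 3.0, 4.0],
    }]
}
resp = requests.post("http://localhost:8000/v2/models/my_model/infer",
                     json=payload)
print(resp.json()["outputs"])
```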

NVIDIA NIM

  • Port: 8000
  • OpenAI-compatible
  • Optimized containers
  • Enterprise ready

TensorFlow Serving

  • Port: 8501/8500
  • REST + gRPC
  • Model versioning
  • Batch prediction
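
A minimal predict call against the REST port, assuming a model served under the name my_model:

```python
# TF Serving REST prediction: port 8501, model name and input shape are
# placeholders for whatever your SavedModel expects.
import requests

resp = requests.post(
    "http://localhost:8501/v1/models/my_model:predict",
    json={"instances": [[1.0, 2.0, 3.0]]},
)
print(resp.json()["predictions"])
```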

TorchServe

  • Port: 8080/8081
  • REST + gRPC
  • PyTorch native
  • Model archive

ONNX Runtime

  • Port: 8001
  • REST + gRPC
  • Cross-platform
  • Hardware agnostic

DeepSpeed-MII

  • Port: 28080
  • REST API
  • ZeRO-Inference
  • Low latency

BentoML

  • Port: 3000
  • REST + gRPC
  • Model packaging
  • Adaptive batching

Ray Serve

  • Port: 8000
  • REST API
  • Auto-scaling
  • Distributed

Cloud AI Services

AWS SageMaker

  • HTTPS endpoint
  • REST API
  • Auto-scaling
  • Multi-model

AWS Bedrock

  • HTTPS endpoint
  • REST API
  • Foundation models
  • Managed service
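
A minimal invoke_model call with boto3. The request body format is model-family specific; this sketch assumes an Anthropic model on Bedrock and AWS credentials already configured in your environment:

```python
# Invoke a Bedrock foundation model. Model ID and region are examples;
# the body shape below is specific to Anthropic models on Bedrock.
import json
import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")
resp = client.invoke_model(
    modelId="anthropic.claude-3-sonnet-20240229-v1:0",
    body=json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 256,
        "messages": [{"role": "user", "content": "Hello!"}],
    }),
)
print(json.loads(resp["body"].read())["content"][0]["text"])
```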

Azure ML

  • HTTPS endpoint
  • REST + SDK
  • Managed compute
  • MLflow integration

Azure OpenAI

  • HTTPS endpoint
  • OpenAI-compatible
  • Enterprise security
  • Regional deploy

Google Vertex AI

  • HTTPS endpoint
  • REST + gRPC
  • TPU support
  • Model Garden

Groq

  • HTTPS API
  • OpenAI-compatible
  • LPU inference
  • Ultra-fast

Together AI

  • HTTPS API
  • OpenAI-compatible
  • Open models
  • Fine-tuning

Fireworks AI

  • HTTPS API
  • OpenAI-compatible
  • Fast inference
  • Function calling

Anyscale

  • HTTPS API
  • OpenAI-compatible
  • Ray-based
  • Scalable

Replicate

  • HTTPS API
  • REST API
  • Model hosting
  • Pay-per-use

Model Provider APIs

OpenAI

  • HTTPS API
  • REST API
  • GPT-4, GPT-4o
  • Assistants API

Anthropic

  • HTTPS API
  • REST API
  • Claude models
  • Tool use
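
A minimal Messages API call with the official SDK; the model name is one example from the Claude family, and ANTHROPIC_API_KEY is read from the environment:

```python
# Direct Anthropic API call via the official SDK.
import anthropic

client = anthropic.Anthropic()  # uses ANTHROPIC_API_KEY
msg = client.messages.create(
    model="claude-3-5-sonnet-20240620",  # example Claude model
    max_tokens=256,
    messages=[{"role": "user", "content": "Hello, Claude"}],
)
print(msg.content[0].text)
```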

Google Gemini

  • HTTPS API
  • REST API
  • Gemini Pro/Ultra
  • Multimodal

Mistral AI

  • HTTPS API
  • OpenAI-compatible
  • Mistral/Mixtral
  • Function calling

Cohere

  • HTTPS API
  • REST API
  • Command models
  • Embeddings + Rerank

AI21 Labs

  • HTTPS API
  • REST API
  • Jurassic models
  • Specialized tasks

Specialized & Platform-Specific

MLX (Apple Silicon)

  • Port: 8080
  • REST API
  • Metal acceleration
  • Unified memory

KServe

  • Port: 8080
  • REST + gRPC
  • Kubernetes native
  • Serverless

Seldon Core

  • Port: 9000
  • REST + gRPC
  • ML deployment
  • A/B testing

OpenAI-Compatible

  • Any port
  • Custom endpoints
  • API key auth
  • Drop-in support
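
"Drop-in" means any backend that accepts this request shape on the wire works unchanged. The host, port, key, and model name below are placeholders:

```python
# Raw OpenAI-compatible chat request; everything here except the JSON
# shape and route is a placeholder for your own endpoint.
import requests

resp = requests.post(
    "http://my-backend:9999/v1/chat/completions",
    headers={"Authorization": "Bearer MY_API_KEY"},
    json={
        "model": "my-model",
        "messages": [{"role": "user", "content": "ping"}],
    },
)
print(resp.json()["choices"][0]["message"]["content"])
```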

Backend by Use Case

| Use Case         | Recommended Backends             | Why                                        |
|------------------|----------------------------------|--------------------------------------------|
| Development      | Ollama, LocalAI, LMDeploy        | Easy setup, free, local                    |
| Production API   | vLLM, TGI, Triton, NIM           | High throughput, batching, enterprise      |
| Edge / IoT       | llama.cpp, MLC LLM, ExLlamaV2    | CPU inference, small footprint, quantized  |
| Apple Silicon    | MLX, Ollama, MLC LLM             | Metal acceleration, unified memory         |
| Low Latency      | Groq, Fireworks, DeepSpeed-MII   | Optimized hardware, fast inference         |
| Enterprise Cloud | Azure OpenAI, Bedrock, Vertex AI | Compliance, SLA, managed                   |
| Open Models      | Together AI, Anyscale, Replicate | Llama, Mistral, open weights               |
| Kubernetes       | KServe, Seldon, Ray Serve        | Cloud-native, auto-scaling                 |

Availability

Eldric Client (CLI + GUI): Ollama, vLLM, llama.cpp, TGI, MLX, OpenAI-compatible endpoints

Eldric Multi-API: All 36 backends listed above, with a unified API, load balancing, and failover
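
Load balancing and failover are handled by Multi-API itself. Purely as an illustration of the failover idea (not Eldric's implementation), here is a client-side sketch over OpenAI-compatible endpoints; the endpoint URLs, key, and model names are placeholders:

```python
# Illustrative client-side failover across OpenAI-compatible endpoints.
from openai import OpenAI, OpenAIError

ENDPOINTS = [  # tried in order; URLs and model names are placeholders
    ("http://localhost:8000/v1", "local-model"),
    ("https://api.openai.com/v1", "gpt-4o"),
]

def chat_with_failover(prompt: str) -> str:
    last_err = None
    for base_url, model in ENDPOINTS:
        try:
            client = OpenAI(base_url=base_url, api_key="MY_KEY", timeout=10)
            resp = client.chat.completions.create(
                model=model, messages=[{"role": "user", "content": prompt}]
            )
            return resp.choices[0].message.content
        except OpenAIError as err:
            last_err = err  # fall through to the next backend
    raise RuntimeError(f"all backends failed: {last_err}")

print(chat_with_failover("Say hello."))
```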