Build, deploy, and orchestrate AI infrastructure at any scale. Controllers for orchestration, routers for load balancing, workers for inference, and edge gateways for external access.

What Can You Build?

Small Team / Startup

1 controller + 1 router + 2 workers. Internal API access. Cost: Your existing hardware.

Growing Company

Add an edge gateway for OpenWebUI access, scale out workers, and enable AI-powered routing for optimization.

Enterprise

Multi-region with secondary controllers. Geo-aware routing, automatic failover, compliance.

Global Scale

A CDN for AI: a primary controller orchestrating deployments worldwide, with 50+ edge locations and 100+ workers.

Architecture

External clients enter through the edge farm; internal clients talk to the routers directly.

  • Edge Farm (port 443): TLS, authentication, rate limiting
  • Controller (port 8880): orchestration, licensing, config sync, metrics
  • Routers (port 8881): load balancing, AI routing, failover
  • Workers (port 8890): inference backends such as vLLM, Ollama, TGI, Triton, llama.cpp, and custom engines
  • Data Workers (port 8895): connection pooling and schema discovery for SQLite (local/embedded), PostgreSQL (enterprise), MySQL/MariaDB, and IBM DB2 on z/OS (DRDA, ODBC/CLI mainframe integration)

Ports

  • 443: Edge Gateway
  • 8880: Controller
  • 8881: Router
  • 8890: Worker
  • 8895: Data Worker

Edge Gateway Features

Secure external access with TLS, API key authentication, and rate limiting.

🔒 TLS Termination: HTTPS on port 443

  • Let's Encrypt auto-renewal
  • Custom certificates (--cert cert.pem)
  • HTTP → HTTPS redirect
  • mTLS for workers

⚡ High Availability: farm mode clustering (--mode farm --peers)

  • Active-active edge nodes
  • Automatic failover
  • Health check peers
  • Session persistence

🌐 OpenWebUI Ready: OpenAI-compatible API (/v1/chat/completions)

  • API key authentication
  • Rate limiting (RPM)
  • Model allowlisting
  • Usage tracking
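
A minimal launch sketch tying these flags together. The binary name eldric-edge and the --controller flag are assumptions (only the router and worker binaries are named on this page); --cert, --mode farm, and --peers come from the feature list above, and the peer hostnames are placeholders.

# Sketch: edge gateway with TLS termination and farm-mode clustering
# (binary name and --controller flag are assumptions; other flags from the list above)
./eldric-edge --controller http://controller:8880 \
  --cert cert.pem \
  --mode farm --peers edge-2.example.com,edge-3.example.com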

Deployment Scenarios

1. Quick Start (Single Machine)

Controller (:8880) → Router (:8881) → Worker running Ollama (:8890)

Development setup on localhost
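
A minimal sketch of this layout, reusing the router and worker invocations documented later on this page; the controller is assumed to already be listening on localhost:8880 and Ollama on its default port 11434.

# Router on the same machine, pointing at the local controller
./eldric-router --strategy load_based --controller http://localhost:8880

# Worker exposing the local Ollama instance to the cluster
./eldric-workerd --controller http://localhost:8880 --backend ollama \
  --backend-url http://localhost:11434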

2. Team Setup (3 Servers)

Controller + Router on a management server • GPU Worker 1: RTX 4090, llama3.1:70b • GPU Worker 2: A100, vLLM, mixtral:8x7b (2 workers • 128 GB GPU)

Small team with dedicated GPUs
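
A sketch of the two worker commands for this topology, adapted from the configuration examples further down; hostnames are placeholders, and Ollama is an assumption for Worker 1 (the layout above only names the model).

# GPU Worker 1 (RTX 4090): llama3.1:70b, served here via Ollama (assumed backend)
./eldric-workerd --controller http://mgmt-server:8880 --backend ollama \
  --backend-url http://localhost:11434 --models llama3.1:70b

# GPU Worker 2 (A100): mixtral:8x7b served by vLLM
./eldric-workerd --controller http://mgmt-server:8880 --backend vllm \
  --backend-url http://localhost:8000 --models mixtral:8x7b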

3. Production with Edge Gateway

Internet / OpenWebUI → 🔒 Edge Gateway (:443) → Controller → Router 1 / Router 2 → Workers (HA pool)

HTTPS, API keys, rate limiting
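
A quick smoke test against this setup, assuming the standard Bearer-token header used by OpenAI-compatible APIs; the hostname and key are placeholders.

# List models exposed through the edge gateway (hostname and key are placeholders)
curl https://edge.example.com/v1/models \
  -H "Authorization: Bearer $ELDRIC_API_KEY"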

4. AI-Powered Smart Routing

A request for llama3.1:70b reaches the 🧠 AI Router, where a llama3.2:3b decision engine scores the candidates: RTX 4090 (45 ms latency), A100 ✓ (32 ms latency), H100 (busy, 8/10). Decision: "A100 worker: lowest latency 32ms, optimal for 70B model."

LLM-powered worker selection
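
Selecting this strategy on a router uses the same flag as the other strategies in the routing section below:

# Enable LLM-powered worker selection on a router
./eldric-router --strategy ai_routing --controller http://controller:8880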

5. Config Sync Across Cluster

The controller's config store holds the latest version (v3-7a8b) and syncs it to workers on each heartbeat: Worker 1 ✓ at v3-7a8b, Worker 2 ⟳ updating. Hash-based versioning, auto-sync on heartbeat (30s).

Automatic configuration distribution
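
One way to check what the cluster is currently distributing is the controller's config endpoint from the admin API listed later on this page; a sketch, assuming it answers plain GET requests:

# Inspect the current cluster configuration on the controller
curl http://controller:8880/api/v1/config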

6. Multi-Region Deployment

🇺🇸 US-West (primary, 4 workers) • 🇪🇺 EU-West (secondary, 3 workers) • 🌏 APAC (secondary, 2 workers). Real-time sync, geo-routing, and automatic failover across 9 workers in 3 regions.

Global HA with region-aware routing

Routing Strategies

Round Robin

  • Simple rotation
  • Even distribution
  • No overhead
  • Best for uniform load

Least Connections

  • Fewest active requests
  • Dynamic balancing
  • Prevents queue buildup
  • Good for varied tasks

Load Based

  • Real-time metrics
  • CPU/memory aware
  • Prevents hotspots
  • Default strategy

Latency Based

  • Fastest response time
  • Adaptive routing
  • Best for real-time
  • Tracks P95 latency

Random

  • Simple selection
  • No state needed
  • Good for testing
  • Minimal overhead

AI Routing

  • LLM-powered decisions
  • Context-aware selection
  • Model matching
  • Self-optimizing
# Configure routing strategy on router
./eldric-router --strategy load_based --controller http://controller:8880

# Available strategies: round_robin, least_connections, load_based,
# latency_based, random, ai_routing

Worker Backend Connectivity

Each worker connects to one or more AI inference backends. Workers automatically discover available models and report capabilities to the controller. Mix different backends across your cluster for optimal resource utilization.
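
To confirm which workers have registered and what they reported, the controller's admin API (listed in the API section below) can be queried; a sketch, assuming a plain GET is accepted:

# List workers registered with the controller
curl http://controller:8880/api/v1/workers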

Local & Self-Hosted Backends

Ollama

  • Port: 11434
  • REST API
  • Auto model discovery
  • Default backend

vLLM

  • Port: 8000
  • OpenAI-compatible
  • PagedAttention
  • High throughput

llama.cpp

  • Port: 8080
  • REST + WebSocket
  • GGUF models
  • CPU + GPU

HuggingFace TGI

  • Port: 8080
  • REST + gRPC
  • Tensor parallelism
  • Continuous batching

LocalAI

  • Port: 8080
  • OpenAI-compatible
  • Multiple formats
  • CPU optimized

ExLlamaV2

  • Port: 5000
  • REST API
  • GPTQ/EXL2 quants
  • Fast inference

LMDeploy

  • Port: 23333
  • OpenAI-compatible
  • TurboMind engine
  • Quantization

MLC LLM

  • Port: 8080
  • REST API
  • Universal deploy
  • WebGPU support

Enterprise & ML Platforms

NVIDIA Triton

  • Port: 8000-8002
  • REST + gRPC
  • TensorRT optimization
  • Multi-framework

NVIDIA NIM

  • Port: 8000
  • OpenAI-compatible
  • Optimized containers
  • Enterprise ready

TensorFlow Serving

  • Port: 8501/8500
  • REST + gRPC
  • Model versioning
  • Batch prediction

TorchServe

  • Port: 8080/8081
  • REST + gRPC
  • PyTorch native
  • Model archive

ONNX Runtime

  • Port: 8001
  • REST + gRPC
  • Cross-platform
  • Hardware agnostic

DeepSpeed-MII

  • Port: 28080
  • REST API
  • ZeRO-Inference
  • Low latency

BentoML

  • Port: 3000
  • REST + gRPC
  • Model packaging
  • Adaptive batching

Ray Serve

  • Port: 8000
  • REST API
  • Auto-scaling
  • Distributed

Cloud AI Services

AWS SageMaker

  • HTTPS endpoint
  • REST API
  • Auto-scaling
  • Multi-model

Azure ML

  • HTTPS endpoint
  • REST + SDK
  • Managed compute
  • MLflow integration

Google Vertex AI

  • HTTPS endpoint
  • REST + gRPC
  • TPU support
  • Model Garden

Groq

  • HTTPS API
  • OpenAI-compatible
  • LPU inference
  • Ultra-fast

Together AI

  • HTTPS API
  • OpenAI-compatible
  • Open models
  • Fine-tuning

Fireworks AI

  • HTTPS API
  • OpenAI-compatible
  • Fast inference
  • Function calling

Anyscale

  • HTTPS API
  • OpenAI-compatible
  • Ray-based
  • Scalable

Replicate

  • HTTPS API
  • REST API
  • Model hosting
  • Pay-per-use

Specialized & Platform-Specific

MLX (Apple Silicon)

  • Port: 8080
  • REST API
  • Metal acceleration
  • Unified memory

Mistral AI

  • HTTPS API
  • OpenAI-compatible
  • Mistral models
  • Function calling

Cohere

  • HTTPS API
  • REST API
  • Embeddings
  • Rerank

Anthropic

  • HTTPS API
  • REST API
  • Claude models
  • Tool use

OpenAI

  • HTTPS API
  • REST API
  • GPT models
  • Assistants

KServe

  • Port: 8080
  • REST + gRPC
  • Kubernetes native
  • Serverless

Seldon Core

  • Port: 9000
  • REST + gRPC
  • ML deployment
  • A/B testing

OpenAI-Compatible

  • Any port
  • Custom endpoints
  • API key auth
  • Drop-in support

Worker Configuration Examples

# Connect worker to local Ollama
./eldric-workerd --controller http://ctrl:8880 --backend ollama \
  --backend-url http://localhost:11434

# Connect worker to vLLM server
./eldric-workerd --controller http://ctrl:8880 --backend vllm \
  --backend-url http://gpu-server:8000 --models mixtral:8x7b

# Connect worker to NVIDIA Triton
./eldric-workerd --controller http://ctrl:8880 --backend triton \
  --backend-url http://triton-server:8000 --models llama3.1:70b

# Connect worker to TensorFlow Serving
./eldric-workerd --controller http://ctrl:8880 --backend tensorflow \
  --backend-url http://tf-server:8501 --models gpt-neo

# Connect worker to HuggingFace TGI
./eldric-workerd --controller http://ctrl:8880 --backend tgi \
  --backend-url http://tgi-server:8080 --models codellama:34b

# Connect worker to MLX on Mac
./eldric-workerd --controller http://ctrl:8880 --backend mlx \
  --backend-url http://localhost:8080 --models llama3.2:3b

# Connect to any OpenAI-compatible endpoint
./eldric-workerd --controller http://ctrl:8880 --backend openai \
  --backend-url https://api.example.com/v1 --api-key $API_KEY

OpenAI-Compatible API

A drop-in replacement for the OpenAI API that works with any compatible tool.

Tools

  • OpenWebUI
  • LangChain
  • LlamaIndex
  • Cursor

Endpoints

  • /v1/chat/completions
  • /v1/completions
  • /v1/models
  • /v1/embeddings

Features

  • Streaming
  • Tool calling
  • JSON mode
  • Vision

Admin

  • /api/v1/workers
  • /api/v1/routers
  • /api/v1/metrics
  • /api/v1/config
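
A sketch of a streaming chat completion against the first endpoint above; the gateway hostname, API key, and model name are placeholders, and the request body follows the standard OpenAI chat format.

# Streaming chat completion through the cluster (hostname, key, and model are placeholders)
curl https://edge.example.com/v1/chat/completions \
  -H "Authorization: Bearer $ELDRIC_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.1:70b", "messages": [{"role": "user", "content": "Hello"}], "stream": true}'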

Data Worker Integration

AI backends can directly query databases through Data Workers for real-time data access.

Direct Database Access

  • AI Workers query Data Workers via REST API
  • Connection pooling for high throughput
  • Schema discovery for context-aware queries
  • Parameterized queries prevent SQL injection

Supported Databases

  • SQLite - Built-in, local databases
  • PostgreSQL - Enterprise analytics
  • MySQL/MariaDB - Web applications
  • IBM DB2 - Mainframe and z/OS
# AI Worker requests data from database
curl -X POST http://data-worker:8895/api/v1/data/query \
  -H "Content-Type: application/json" \
  -d '{"source_id":"warehouse","sql":"SELECT * FROM sales WHERE region=$1","params":["EMEA"]}'
See the Data Worker documentation for full details.

Ready to Build Your AI Infrastructure?

Contact our team for a custom deployment plan.

Contact Sales