Multi-Region Model Serving Patterns (Triton, vLLM, TGI)
Your users don’t care which region served the token—only that it arrived fast, in order, and kept arriving during an outage. Here are proven patterns to do that across regions with Triton, vLLM, and TGI, plus how to prove resilience with the Availability Standard (CAM).
What you’ll get
- Three multi-region patterns with routing, rollout, and failover mechanics
- Latency/SLO budgets and token-streaming gotchas (stickiness, retries)
- A practical Triton vs vLLM vs TGI matrix
- Kubernetes/Helm and runtime flags you can paste in
- CAM pillar targets so you can certify the architecture’s resiliency
1) SLOs & constraints specific to multi-region LLM serving
SLOs
- p95 time-to-first-token (TTFT) and p95 time-between-tokens (TBTT) per region (derivation sketched below)
- Stream continuity: no duplicated/missing tokens during failover
- Availability: 99.9–99.99% monthly (A2–A3 in CAM terms)
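For reference, a minimal sketch of how TTFT/TBTT percentiles fall out of per-token timestamps; the timestamp source is whatever your serving gateway already records per stream, and the function names are illustrative.

# slo_metrics.py: deriving TTFT/TBTT from per-token timestamps (seconds)
import statistics

def ttft_ms(request_ts: float, token_ts: list[float]) -> float:
    """Time to first token, in milliseconds."""
    return (token_ts[0] - request_ts) * 1000.0

def tbtt_ms(token_ts: list[float]) -> list[float]:
    """Gaps between consecutive tokens, in milliseconds."""
    return [(b - a) * 1000.0 for a, b in zip(token_ts, token_ts[1:])]

def p95(samples: list[float]) -> float:
    # statistics.quantiles(n=20) returns 19 cut points; the last is the 95th percentile
    return statistics.quantiles(samples, n=20)[-1]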
Hard constraints
- Token streams need connection stickiness (consistent hashing on session, user, or prompt signature)
- Retries must be idempotent (include request IDs and token offsets)
- KV-cache locality dominates cost; route requests to where the KV cache lives rather than shipping KV caches between regions
2) Pattern A — Anycast Active-Active with Session Pinning (lowest latency)
When to use: Global consumer traffic, conversational AI, AR/voice—latency rules.
┌───────────┐ Anycast ┌───────────┐
User ───▶│ Global │━━━━━━━━━━━━━━━━━━▶│ Region A │─┐
│ Anycast │ └───────────┘ │
│ (BGP) │ ┌───────────┐ │ L4/L7
└───────────┘ Anycast │ Region B │─┤ routing
▲ ▲ ─ ─ ─ ─ ─ ─ ─ ─ └───────────┘ │
│ │ │
│ └──────── health/weight updates ◀─┘
Mechanics
- Routing: Anycast IPs at each region; L7 proxy (Envoy/NGINX) enforces consistent hashing on session_id or Authorization to pin streams.
- Back-pressure: If a region saturates, it sheds new sessions by failing health checks and lowering its advertised weight; existing streams keep running.
- KV cache: Region-local; no cross-region KV replication. Use a prompt-signature LRU (hash of the first N prompt tokens) to pre-warm hot prefixes (sketched after this list).
- Data: Feature reads from region-local cache; write-behind to quorum store (two regions + witness).
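A minimal sketch of that prompt-signature LRU; `engine.warm_prefix()` is a hypothetical hook standing in for whatever pre-warm path your serving engine exposes.

# prompt_prewarm.py: region-local hot-prompt tracking and pre-warm (sketch)
import hashlib
from collections import OrderedDict

N_TOKENS = 64          # signature covers the first N prompt tokens
LRU_CAPACITY = 10_000  # hot signatures tracked per region

def prompt_signature(tokens: list[int], n: int = N_TOKENS) -> str:
    """Stable hash of the first n token IDs of a prompt."""
    return hashlib.sha256(repr(tokens[:n]).encode()).hexdigest()

class HotPromptLRU:
    """Region-local LRU of hot prompt prefixes; share prefixes, never KV blocks."""
    def __init__(self, capacity: int = LRU_CAPACITY):
        self.capacity = capacity
        self._lru: OrderedDict[str, list[int]] = OrderedDict()

    def record(self, tokens: list[int]) -> None:
        sig = prompt_signature(tokens)
        self._lru[sig] = tokens[:N_TOKENS]
        self._lru.move_to_end(sig)
        if len(self._lru) > self.capacity:
            self._lru.popitem(last=False)   # evict least recently used

    def top_k(self, k: int = 100) -> list[list[int]]:
        """Most recently hot prefixes, exported to peer regions for pre-warm."""
        return list(self._lru.values())[-k:]

def prewarm(engine, peer_top_k: list[list[int]]) -> None:
    # Replay hot prefixes through the local engine so prefill/KV is warm
    # before real traffic lands; engine.warm_prefix() is a placeholder.
    for prefix in peer_top_k:
        engine.warm_prefix(prefix)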
Pros
- Best p95 latency, graceful brownouts, zero control-plane steps for failover.
Cons
- Requires Anycast + BGP competency; hard to keep model build parity without strong CI.
SLO budget (example)
- User → Edge POP (Anycast): 5–15 ms
- L7 proxy → server: 0.5–1.5 ms
- TTFT (prefill on GPU): 50–150 ms (model dependent)
- TBTT: 10–40 ms
# envoy-filter.yaml: HttpConnectionManager excerpt (listener filter-chain entry)
- name: envoy.filters.network.http_connection_manager
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
    stat_prefix: llm_ingress
    route_config:
      name: llm
      virtual_hosts:
      - name: llm
        domains: ["*"]
        routes:
        - match: { prefix: "/" }
          route:
            cluster: llm-backends
            hash_policy:   # requires RING_HASH or MAGLEV lb_policy on the llm-backends cluster
            - header:
                header_name: x-session-id
                terminal: true
            - cookie:
                name: session
                ttl: 86400s
3) Pattern B — Warm-Standby (Active-Passive) with Async Model Sync (cost saver)
When to use: Enterprise/internal apps, predictable hours, strict change windows.
Users ─▶ Region A (Active) ──┬──> Object Store (models)
▲ │
│ └──> Async sync → Region B (Standby)
└──────── Health/Failover switch (DNS low TTL or Anycast weight=0)
Mechanics
- Routing: DNS low TTL (15–30 s) or Anycast with manual weight change on failover.
- State: Region B runs at N+ε headroom with cold KV; pre-load popular prompts nightly.
- Cutover: Drain streams in A (graceful), new sessions land in B; existing streams optionally reconnect with offset.
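A hedged sketch of that cutover sequence; DnsClient, DRAIN_URL, and HEALTH_URL are assumptions standing in for your DNS/Anycast weight API and proxy endpoints.

# cutover.py: illustrative Pattern B failover sequence (not a real API)
import time
import urllib.request

DRAIN_URL  = "https://proxy.region-a.internal/drain"    # hypothetical
HEALTH_URL = "https://llm.region-b.internal/health"     # hypothetical
DNS_TTL_S  = 30

class DnsClient:
    """Placeholder for your DNS/Anycast weight API (Route 53, NS1, a BGP controller, ...)."""
    def set_weight(self, region: str, weight: int) -> None:
        raise NotImplementedError

def standby_is_healthy() -> bool:
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=2) as resp:
            return resp.status == 200
    except OSError:
        return False

def cutover(dns: DnsClient, drain_grace_s: int = 120) -> None:
    assert standby_is_healthy(), "never fail over onto a cold or unhealthy standby"
    # 1. Stop sending new sessions to A; existing streams keep running.
    dns.set_weight("region-a", 0)
    dns.set_weight("region-b", 100)
    # 2. Wait out the DNS TTL plus a grace period so in-flight streams finish.
    time.sleep(DNS_TTL_S + drain_grace_s)
    # 3. Ask A's proxy to drain what remains; clients reconnect to B with
    #    their request_id/offset (see section 7).
    urllib.request.urlopen(urllib.request.Request(DRAIN_URL, method="POST"), timeout=5)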
Pros: Cheapest steady-state yet retains rapid recovery.
Cons: A transient p95 spike during failover while the standby's KV cache warms.
Operational tip: Hourly canary traffic to standby to catch drift and ensure health.
4) Pattern C — Geo-Staged Rollouts (Shadow → Canary → Percent)
When to use: Frequent model releases; risk-managed experimentation.
- Shadow: Mirror read-only traffic to new model in one region; compare quality/latency.
- Canary: Route 1–5% live sessions to new model in two regions; guard by SLOs.
- Percent rollout: Ramp to 25/50/100% per region; halt on degradation.
package rollout

default allow = false

# Allow a RouteWeightChange only while the canary is within SLO and the
# artifact is signed; verify_cosign is a helper rule defined elsewhere.
allow {
    input.request.kind.kind == "RouteWeightChange"
    input.metrics.p95_ttft_delta_ms <= 30
    input.metrics.error_rate <= 0.5
    verify_cosign(input.artifact)
}
5) Triton vs vLLM vs TGI (what to use where)
- LLM token serving: vLLM/TGI excellent streaming; Triton strong via backends (best with TensorRT-LLM).
- Multi-model hosting: Triton is strongest (model repository, cross-framework ensembles); vLLM and TGI typically serve one base model per replica (plus LoRA adapters).
- Non-LLM modalities: Triton best (vision, ASR, custom backends).
- Observability: All export Prometheus metrics.
- Operational fit: vLLM = pure LLM farms; TGI = OSS + HF; Triton = mixed-modal enterprise & HW-tuned.
Pragmatic take: vLLM wins for raw LLM throughput and token streaming; TGI is a polished OSS stack tied to the HF ecosystem; Triton is the right call when you mix LLM + vision/ASR or want TensorRT-LLM acceleration and ensemble graphs.
6) Concrete configs (copy/paste)
# vLLM: OpenAI-compatible server with tensor parallelism and chunked prefill
python -m vllm.entrypoints.openai.api_server \
--model TheModelOrg/awesome-13b \
--tensor-parallel-size 2 \
--max-num-seqs 256 \
--gpu-memory-utilization 0.92 \
--enable-chunked-prefill \
--kv-cache-dtype fp8_e5m2 \
--served-model-name chat-13b \
--port 8000
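The vLLM command above exposes an OpenAI-compatible API, so a streaming client is a few lines; this assumes the openai Python package, and the dummy API key reflects that no auth is configured on the server.

# stream_client.py: consume the vLLM server above via /v1/chat/completions
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")

stream = client.chat.completions.create(
    model="chat-13b",   # matches --served-model-name above
    messages=[{"role": "user", "content": "Explain KV-cache locality in one paragraph."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()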
# TGI: sharded launcher
text-generation-launcher \
--model-id TheModelOrg/awesome-13b \
--port 8080 \
--num-shard 2 \
--max-batch-prefill-tokens 4096 \
--max-input-tokens 8192 \
--max-total-tokens 12288 \
--dtype bfloat16
# Triton: model repository layout
model_repository/
llm/
config.pbtxt # instance_group, dynamic_batching
vision/
config.pbtxt
pipeline/
config.pbtxt # ensemble: vision→embedding→llm prompt augment
# pipeline/config.pbtxt declares an ensemble DAG:
# run vision model → build prompt → call LLM.
7) Keeping streams correct across regions (the sticky bits)
- Consistent hashing on a stable key (user/session) to minimize cross-region flips.
- Resume tokens: client sends request_id + seen_tokens on reconnect; server trims duplicates.
- In-band heartbeats every N tokens so L7 can detect a dead stream fast.
- Idempotent logging with upsert(request_id) on analytics to avoid double counts after a retry (sketch below).
- HTTP/3 (QUIC) for lossy mobile networks; fewer head-of-line stalls vs TCP.
# resume.py: client-side resume logic; stream() is a placeholder for your
# streaming client (SSE/chunked HTTP against the serving endpoint).
from uuid import uuid4

req = {
    "id": str(uuid4()),       # stable request ID reused across reconnects
    "prompt": "Explain anycast routing.",
    "offset": 0,              # number of tokens already received
}
for token in stream("/v1/chat/completions", req):
    print(token, end="")
    req["offset"] += 1
# On reconnect: resend the same id with the current offset; the server
# resumes from that position and the client never sees duplicates.
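And a minimal sketch of the idempotent-logging idea; SQLite's upsert is used purely for illustration, so substitute your analytics store.

# idempotent_logging.py: upsert keyed on request_id so retries never double count
import sqlite3

db = sqlite3.connect("analytics.db")
db.execute("""
    CREATE TABLE IF NOT EXISTS completions (
        request_id TEXT PRIMARY KEY,
        tokens_out INTEGER NOT NULL,
        region     TEXT NOT NULL
    )
""")

def record_completion(request_id: str, tokens_out: int, region: str) -> None:
    """Safe to call again after a retry or failover: the row is replaced, not duplicated."""
    db.execute(
        """
        INSERT INTO completions (request_id, tokens_out, region)
        VALUES (?, ?, ?)
        ON CONFLICT(request_id) DO UPDATE SET
            tokens_out = excluded.tokens_out,
            region     = excluded.region
        """,
        (request_id, tokens_out, region),
    )
    db.commit()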
8) Model & feature consistency (don’t replicate pain)
- Weights/artifacts: signed & versioned in object storage; regions pull on demand.
- Feature stores: read-through caches per region; writes go to a quorum core (two regions + witness).
- Schema evolution: feature contracts are versioned; reject payloads that don’t match during rollout (see the sketch after the policy below).
package admission  # package name is illustrative

default allow = false

# cosign.verify here is a custom/external helper, not a Rego builtin
allow {
    input.kind == "Model"
    cosign.verify(input.image, "rekor.public")
    input.labels["approved"] == "true"
}
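A small sketch of the feature-contract check during rollout; the contract contents and field names are illustrative, not a real schema.

# feature_contract.py: reject payloads that don't match the versioned contract
CONTRACTS = {
    # contract_version -> required feature names and types (illustrative)
    "v3": {"user_tier": str, "locale": str, "history_len": int},
    "v4": {"user_tier": str, "locale": str, "history_len": int, "device": str},
}

def validate_features(payload: dict, contract_version: str) -> None:
    """Raise instead of silently coercing: a canary should fail loudly on drift."""
    contract = CONTRACTS.get(contract_version)
    if contract is None:
        raise ValueError(f"unknown feature contract {contract_version!r}")
    missing = contract.keys() - payload.keys()
    if missing:
        raise ValueError(f"payload missing features: {sorted(missing)}")
    for name, expected_type in contract.items():
        if not isinstance(payload[name], expected_type):
            raise TypeError(f"{name} should be {expected_type.__name__}")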
9) Observability: the four graphs that matter
- TTFT / TBTT by region (p50/p95/p99) — compare across A/B regions post-release.
- Saturation — GPU util %, queue depth, dropped sessions.
- Routing health — Anycast reachability, BGP flap count, DNS provider health.
- Correctness — token duplication/missing rate during failovers (should be ~0).
Export everything to Prometheus; set SLO burn alerts (multi-window, multi-burn).
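A sketch of the multi-window, multi-burn check for a 99.9% availability SLO over a 30-day window, using the commonly cited SRE-workbook thresholds; how the per-window error ratios are queried from Prometheus is left to your setup.

# burn_alert.py: multi-window, multi-burn-rate check for a 99.9% SLO (30-day window)
SLO_TARGET = 0.999
ERROR_BUDGET = 1.0 - SLO_TARGET          # 0.1%

# (long_window, short_window, burn_rate_threshold)
WINDOWS = [
    ("1h", "5m",  14.4),   # page: 2% of the 30-day budget burned in 1 hour
    ("6h", "30m",  6.0),   # page: 5% of the budget burned in 6 hours
    ("3d", "6h",   1.0),   # ticket: slow, steady burn
]

def burn_rate(error_ratio: float) -> float:
    """How fast the error budget is being consumed relative to the SLO."""
    return error_ratio / ERROR_BUDGET

def should_alert(error_ratio_by_window: dict[str, float]) -> list[str]:
    """Fire only when both the long and short window exceed the threshold,
    which filters out stale long-window burn that has already recovered."""
    fired = []
    for long_w, short_w, threshold in WINDOWS:
        if (burn_rate(error_ratio_by_window[long_w]) >= threshold
                and burn_rate(error_ratio_by_window[short_w]) >= threshold):
            fired.append(f"burn>{threshold}x over {long_w}/{short_w}")
    return fired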
10) CAM resiliency mapping (certify it)
For a multi-region active-active deployment serving public traffic:
- I-PWR = 3 — N+1 UPS/gensets per region; any one region can vanish.
- I-COOL = 3 — Redundant CRAH/liquid loops; free-cool where possible.
- I-NWK = 4 — Anycast across ≥2 ISPs per region; RPKI; dual DNS providers.
- I-DATA = 4 — Quorum registry & metadata; RPO 0 s; immutable backups.
- I-CTRL = 4 — Active-active control planes, GitOps, signed artifacts, independent admin domains.
Composite I-Score ≈ 3.6–4, which rounds to 4. For A3 workloads (99.98% availability / RTO 30 min / RPO 15 min), that maps to CAM Tier 3; for A2 workloads, CAM Tier 2–3, which this architecture typically exceeds.
11) Rollout cookbook (shadow → canary → percent)
- Prepare: Publish model vNEXT to the object store; sign it with Cosign.
- Shadow: Mirror 1% of traffic in Region A for 24 h; collect quality & TTFT deltas.
- Canary: Route 5% of sessions in A and B; block the rollout if p95 TTFT worsens by more than 30 ms or the error rate exceeds 0.5%.
- Ramp: Increase weights every 30 min: 5 → 25 → 50 → 100% (ramp loop sketched below).
- Finalize: Garbage-collect old weights; update feature contracts; rotate secrets.
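A sketch of that ramp loop with the same guardrails as the Rego gate in section 4; set_canary_weight and the metric getters are hypothetical hooks into your router and metrics store.

# ramp.py: shadow -> canary -> percent ramp with SLO guardrails (sketch)
import time

RAMP_STEPS = [5, 25, 50, 100]     # percent of live sessions per region
STEP_INTERVAL_S = 30 * 60         # ramp every 30 minutes
MAX_TTFT_DELTA_MS = 30
MAX_ERROR_RATE_PCT = 0.5

def guardrails_ok(get_p95_ttft_delta_ms, get_error_rate_pct) -> bool:
    return (get_p95_ttft_delta_ms() <= MAX_TTFT_DELTA_MS
            and get_error_rate_pct() <= MAX_ERROR_RATE_PCT)

def ramp(set_canary_weight, get_p95_ttft_delta_ms, get_error_rate_pct) -> bool:
    """Returns True if the rollout completed, False if it was halted."""
    for pct in RAMP_STEPS:
        set_canary_weight(pct)
        time.sleep(STEP_INTERVAL_S)
        if not guardrails_ok(get_p95_ttft_delta_ms, get_error_rate_pct):
            set_canary_weight(0)          # halt and roll back on degradation
            return False
    return True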
12) Cost notes (hidden traps)
- Inter-region egress will eat your lunch if you try to replicate KV caches; don’t.
- Over-pinning (too sticky) causes hotspots; bias the hash with an EWMA of each region's p95 (sketched below).
- Cold-start storms after deploy? Pre-warm by replaying top-K prompts to each region.
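A sketch of the EWMA bias: track smoothed p95 per region and convert it into placement weights for new sessions only; how the weights reach your hash ring is proxy-specific and the class here is illustrative.

# ewma_bias.py: bias new-session placement away from hot regions
ALPHA = 0.2   # EWMA smoothing factor

class RegionLatency:
    def __init__(self, regions: list[str]):
        self.ewma_p95_ms = {r: None for r in regions}

    def observe(self, region: str, p95_ms: float) -> None:
        prev = self.ewma_p95_ms[region]
        self.ewma_p95_ms[region] = (
            p95_ms if prev is None else ALPHA * p95_ms + (1 - ALPHA) * prev
        )

    def weights(self) -> dict[str, float]:
        """Lower EWMA p95 means a higher share of new sessions; pinned sessions stay put."""
        inv = {r: 1.0 / v for r, v in self.ewma_p95_ms.items() if v}
        total = sum(inv.values()) or 1.0
        return {r: w / total for r, w in inv.items()}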
13) Quick starts (Helm one-liners)
helm repo add vllm https://vllm.ai/helm-charts
helm install chat vllm/vllm \
--set model.repo=TheModelOrg/awesome-13b \
--set parallelism.tensor=2 \
--set batching.maxNumSeqs=256 \
--set serve.enableChunkedPrefill=true
helm repo add tgi https://hf.co/helm
helm install tgi tgi/text-generation-inference \
--set model=TheModelOrg/awesome-13b \
--set shards=2 \
--set dtype=bf16
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm install triton nvidia/triton-inference-server \
--set modelRepository=/models \
--set 'resources.limits.nvidia\.com/gpu=2'
14) Final guidance
- Pin sessions, not regions. Let Anycast/L7 send each session to the healthiest nearby region and keep it there.
- Warm smart, not heavy. Share signatures of hot prompts across regions; don’t sync KV caches.
- Prove resilience. Map your design to CAM pillars, hit I-NWK/I-DATA/I-CTRL = 4, and certify the tier.
One-liner: Multi-region model serving is a routing problem wrapped around a caching problem—with a CI problem in the middle. Solve all three, and your users will never notice which region answered.