Multi-Region Model Serving Patterns (Triton, vLLM, TGI)
Your users don’t care which region served the token—only that it arrived fast, in order, and kept arriving during an outage. Here are proven patterns to do that across regions with Triton, vLLM, and TGI, plus how to prove resilience with the Availability Standard (CAM).
What you’ll get
- Three multi-region patterns with routing, rollout, and failover mechanics
- Latency/SLO budgets and token-streaming gotchas (stickiness, retries)
- A practical Triton vs vLLM vs TGI matrix
- Kubernetes/Helm and runtime flags you can paste in
- CAM pillar targets so you can certify the architecture’s resiliency
1) SLOs & constraints specific to multi-region LLM serving
SLOs
- p95 time-to-first-token (TTFT) and p95 time-between-tokens (TBTT) per region (derivation sketched below)
- Stream continuity: no duplicated/missing tokens during failover
- Availability: 99.9–99.99% monthly (A2–A3 in CAM terms)
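For reference, a minimal sketch of how TTFT/TBTT percentiles fall out of per-token timestamps; the timestamp source is whatever your serving gateway already records per stream, and the function names are illustrative.

# slo_metrics.py: deriving TTFT/TBTT from per-token timestamps (seconds)
import statistics

def ttft_ms(request_ts: float, token_ts: list[float]) -> float:
    """Time to first token, in milliseconds."""
    return (token_ts[0] - request_ts) * 1000.0

def tbtt_ms(token_ts: list[float]) -> list[float]:
    """Gaps between consecutive tokens, in milliseconds."""
    return [(b - a) * 1000.0 for a, b in zip(token_ts, token_ts[1:])]

def p95(samples: list[float]) -> float:
    # statistics.quantiles(n=20) returns 19 cut points; the last is the 95th percentile
    return statistics.quantiles(samples, n=20)[-1]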
Hard constraints
- Token streams need connection stickiness (consistent hashing on session, user, or prompt signature)
- Retries must be idempotent (include request IDs and token offsets)
- KV-cache locality dominates cost; route requests to where the KV cache lives rather than shipping KV caches between regions
2) Pattern A — Anycast Active-Active with Session Pinning (lowest latency)
When to use: Global consumer traffic, conversational AI, AR/voice—latency rules.
┌───────────┐ Anycast ┌───────────┐
User ───▶│ Global │━━━━━━━━━━━━━━━━━━▶│ Region A │─┐
│ Anycast │ └───────────┘ │
│ (BGP) │ ┌───────────┐ │ L4/L7
└───────────┘ Anycast │ Region B │─┤ routing
▲ ▲ ─ ─ ─ ─ ─ ─ ─ ─ └───────────┘ │
│ │ │
│ └──────── health/weight updates ◀─┘
Mechanics
- Routing: Anycast IPs at each region; L7 proxy (Envoy/NGINX) enforces consistent hashing on session_id or Authorization to pin streams.
- Back-pressure: If a region saturates, it sheds new sessions by failing health checks and lowering its advertised weight; existing streams keep running.
- KV cache: Region-local; no cross-region KV replication. Use a prompt-signature LRU (hash of the first N prompt tokens) to pre-warm hot prefixes (sketched after this list).
- Data: Feature reads from region-local cache; write-behind to quorum store (two regions + witness).
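A minimal sketch of that prompt-signature LRU; `engine.warm_prefix()` is a hypothetical hook standing in for whatever pre-warm path your serving engine exposes.

# prompt_prewarm.py: region-local hot-prompt tracking and pre-warm (sketch)
import hashlib
from collections import OrderedDict

N_TOKENS = 64          # signature covers the first N prompt tokens
LRU_CAPACITY = 10_000  # hot signatures tracked per region

def prompt_signature(tokens: list[int], n: int = N_TOKENS) -> str:
    """Stable hash of the first n token IDs of a prompt."""
    return hashlib.sha256(repr(tokens[:n]).encode()).hexdigest()

class HotPromptLRU:
    """Region-local LRU of hot prompt prefixes; share prefixes, never KV blocks."""
    def __init__(self, capacity: int = LRU_CAPACITY):
        self.capacity = capacity
        self._lru: OrderedDict[str, list[int]] = OrderedDict()

    def record(self, tokens: list[int]) -> None:
        sig = prompt_signature(tokens)
        self._lru[sig] = tokens[:N_TOKENS]
        self._lru.move_to_end(sig)
        if len(self._lru) > self.capacity:
            self._lru.popitem(last=False)   # evict least recently used

    def top_k(self, k: int = 100) -> list[list[int]]:
        """Most recently hot prefixes, exported to peer regions for pre-warm."""
        return list(self._lru.values())[-k:]

def prewarm(engine, peer_top_k: list[list[int]]) -> None:
    # Replay hot prefixes through the local engine so prefill/KV is warm
    # before real traffic lands; engine.warm_prefix() is a placeholder.
    for prefix in peer_top_k:
        engine.warm_prefix(prefix)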
Pros
- Best p95 latency, graceful brownouts, zero control-plane steps for failover.
Cons
- Requires Anycast + BGP competency; hard to keep model build parity without strong CI.
SLO budget (example)
- User → Edge POP (Anycast): 5–15 ms
- L7 proxy → server: 0.5–1.5 ms
- TTFT (prefill on GPU): 50–150 ms (model dependent)
- TBTT: 10–40 ms
# envoy-filter.yaml: HttpConnectionManager excerpt (listener filter-chain entry)
- name: envoy.filters.network.http_connection_manager
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
    stat_prefix: llm_ingress
    route_config:
      name: llm
      virtual_hosts:
      - name: llm
        domains: ["*"]
        routes:
        - match: { prefix: "/" }
          route:
            cluster: llm-backends
            hash_policy:   # requires RING_HASH or MAGLEV lb_policy on the llm-backends cluster
            - header:
                header_name: x-session-id
                terminal: true
            - cookie:
                name: session
                ttl: 86400s
3) Pattern B — Warm-Standby (Active-Passive) with Async Model Sync (cost saver)
When to use: Enterprise/internal apps, predictable hours, strict change windows.
Users ─▶ Region A (Active) ──┬──> Object Store (models)
▲ │
│ └──> Async sync → Region B (Standby)
└──────── Health/Failover switch (DNS low TTL or Anycast weight=0)
Mechanics
- Routing: DNS low TTL (15–30 s) or Anycast with manual weight change on failover.
- State: Region B runs at N+ε headroom with cold KV; pre-load popular prompts nightly.
- Cutover: Drain streams in A (graceful), new sessions land in B; existing streams optionally reconnect with offset.
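A hedged sketch of that cutover sequence; DnsClient, DRAIN_URL, and HEALTH_URL are assumptions standing in for your DNS/Anycast weight API and proxy endpoints.

# cutover.py: illustrative Pattern B failover sequence (not a real API)
import time
import urllib.request

DRAIN_URL  = "https://proxy.region-a.internal/drain"    # hypothetical
HEALTH_URL = "https://llm.region-b.internal/health"     # hypothetical
DNS_TTL_S  = 30

class DnsClient:
    """Placeholder for your DNS/Anycast weight API (Route 53, NS1, a BGP controller, ...)."""
    def set_weight(self, region: str, weight: int) -> None:
        raise NotImplementedError

def standby_is_healthy() -> bool:
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=2) as resp:
            return resp.status == 200
    except OSError:
        return False

def cutover(dns: DnsClient, drain_grace_s: int = 120) -> None:
    assert standby_is_healthy(), "never fail over onto a cold or unhealthy standby"
    # 1. Stop sending new sessions to A; existing streams keep running.
    dns.set_weight("region-a", 0)
    dns.set_weight("region-b", 100)
    # 2. Wait out the DNS TTL plus a grace period so in-flight streams finish.
    time.sleep(DNS_TTL_S + drain_grace_s)
    # 3. Ask A's proxy to drain what remains; clients reconnect to B with
    #    their request_id/offset (see section 7).
    urllib.request.urlopen(urllib.request.Request(DRAIN_URL, method="POST"), timeout=5)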
Pros: Cheapest steady-state yet retains rapid recovery.
Cons: A transient p95 spike during failover while the standby's KV cache warms.
Operational tip: Hourly canary traffic to standby to catch drift and ensure health.
4) Pattern C — Geo-Staged Rollouts (Shadow → Canary → Percent)
When to use: Frequent model releases; risk-managed experimentation.
- Shadow: Mirror read-only traffic to new model in one region; compare quality/latency.
- Canary: Route 1–5% live sessions to new model in two regions; guard by SLOs.
- Percent rollout: Ramp to 25/50/100% per region; halt on degradation.
package rollout

default allow = false

# Allow a RouteWeightChange only while the canary is within SLO and the
# artifact is signed; verify_cosign is a helper rule defined elsewhere.
allow {
    input.request.kind.kind == "RouteWeightChange"
    input.metrics.p95_ttft_delta_ms <= 30
    input.metrics.error_rate <= 0.5
    verify_cosign(input.artifact)
}
5) Triton vs vLLM vs TGI (what to use where)
- LLM token serving: vLLM/TGI excellent streaming; Triton strong via backends (best with TensorRT-LLM).
- Multi-model hosting: Triton is strongest (model repository, cross-framework ensembles); vLLM and TGI typically serve one base model per replica (plus LoRA adapters).
- Non-LLM modalities: Triton best (vision, ASR, custom backends).
- Observability: All export Prometheus metrics.
- Operational fit: vLLM = pure LLM farms; TGI = OSS + HF; Triton = mixed-modal enterprise & HW-tuned.
Pragmatic take: vLLM wins for raw LLM throughput and token streaming; TGI is a polished OSS stack tied to the HF ecosystem; Triton is the right call when you mix LLM + vision/ASR or want TensorRT-LLM acceleration and ensemble graphs.
6) Concrete configs (copy/paste)
# vLLM: OpenAI-compatible server with tensor parallelism and chunked prefill
python -m vllm.entrypoints.openai.api_server \
--model TheModelOrg/awesome-13b \
--tensor-parallel-size 2 \
--max-num-seqs 256 \
--gpu-memory-utilization 0.92 \
--enable-chunked-prefill \
--kv-cache-dtype fp8_e5m2 \
--served-model-name chat-13b \
--port 8000
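The vLLM command above exposes an OpenAI-compatible API, so a streaming client is a few lines; this assumes the openai Python package, and the dummy API key reflects that no auth is configured on the server.

# stream_client.py: consume the vLLM server above via /v1/chat/completions
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")

stream = client.chat.completions.create(
    model="chat-13b",   # matches --served-model-name above
    messages=[{"role": "user", "content": "Explain KV-cache locality in one paragraph."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()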
# TGI: sharded launcher
text-generation-launcher \
--model-id TheModelOrg/awesome-13b \
--port 8080 \
--num-shard 2 \
--max-batch-prefill-tokens 4096 \
--max-input-tokens 8192 \
--max-total-tokens 12288 \
--dtype bfloat16
# Triton: model repository layout
model_repository/
llm/
config.pbtxt # instance_group, dynamic_batching
vision/
config.pbtxt
pipeline/
config.pbtxt # ensemble: vision→embedding→llm prompt augment
# pipeline/config.pbtxt declares an ensemble DAG:
# run vision model → build prompt → call LLM.
7) Keeping streams correct across regions (the sticky bits)
- Consistent hashing on a stable key (user/session) to minimize cross-region flips.
- Resume tokens: client sends request_id + seen_tokens on reconnect; server trims duplicates.
- In-band heartbeats every N tokens so L7 can detect a dead stream fast.
- Idempotent logging with upsert(request_id) on analytics to avoid double counts after a retry (sketch below).
- HTTP/3 (QUIC) for lossy mobile networks; fewer head-of-line stalls vs TCP.
# resume.py: client-side resume logic; stream() is a placeholder for your
# streaming client (SSE/chunked HTTP against the serving endpoint).
from uuid import uuid4

req = {
    "id": str(uuid4()),       # stable request ID reused across reconnects
    "prompt": "Explain anycast routing.",
    "offset": 0,              # number of tokens already received
}
for token in stream("/v1/chat/completions", req):
    print(token, end="")
    req["offset"] += 1
# On reconnect: resend the same id with the current offset; the server
# resumes from that position and the client never sees duplicates.
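And a minimal sketch of the idempotent-logging idea; SQLite's upsert is used purely for illustration, so substitute your analytics store.

# idempotent_logging.py: upsert keyed on request_id so retries never double count
import sqlite3

db = sqlite3.connect("analytics.db")
db.execute("""
    CREATE TABLE IF NOT EXISTS completions (
        request_id TEXT PRIMARY KEY,
        tokens_out INTEGER NOT NULL,
        region     TEXT NOT NULL
    )
""")

def record_completion(request_id: str, tokens_out: int, region: str) -> None:
    """Safe to call again after a retry or failover: the row is replaced, not duplicated."""
    db.execute(
        """
        INSERT INTO completions (request_id, tokens_out, region)
        VALUES (?, ?, ?)
        ON CONFLICT(request_id) DO UPDATE SET
            tokens_out = excluded.tokens_out,
            region     = excluded.region
        """,
        (request_id, tokens_out, region),
    )
    db.commit()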
8) Model & feature consistency (don’t replicate pain)
- Weights/artifacts: signed & versioned in object storage; regions pull on demand.
- Feature stores: read-through caches per region; writes go to a quorum core (two regions + witness).
- Schema evolution: feature contracts are versioned; reject payloads that don’t match during rollout (see the sketch after the policy below).
package admission  # package name is illustrative

default allow = false

# cosign.verify here is a custom/external helper, not a Rego builtin
allow {
    input.kind == "Model"
    cosign.verify(input.image, "rekor.public")
    input.labels["approved"] == "true"
}
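A small sketch of the feature-contract check during rollout; the contract contents and field names are illustrative, not a real schema.

# feature_contract.py: reject payloads that don't match the versioned contract
CONTRACTS = {
    # contract_version -> required feature names and types (illustrative)
    "v3": {"user_tier": str, "locale": str, "history_len": int},
    "v4": {"user_tier": str, "locale": str, "history_len": int, "device": str},
}

def validate_features(payload: dict, contract_version: str) -> None:
    """Raise instead of silently coercing: a canary should fail loudly on drift."""
    contract = CONTRACTS.get(contract_version)
    if contract is None:
        raise ValueError(f"unknown feature contract {contract_version!r}")
    missing = contract.keys() - payload.keys()
    if missing:
        raise ValueError(f"payload missing features: {sorted(missing)}")
    for name, expected_type in contract.items():
        if not isinstance(payload[name], expected_type):
            raise TypeError(f"{name} should be {expected_type.__name__}")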
9) Observability: the four graphs that matter
- TTFT / TBTT by region (p50/p95/p99) — compare across A/B regions post-release.
- Saturation — GPU util %, queue depth, dropped sessions.
- Routing health — Anycast reachability, BGP flap count, DNS provider health.
- Correctness — token duplication/missing rate during failovers (should be ~0).
Export everything to Prometheus; set SLO burn alerts (multi-window, multi-burn).
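A sketch of the multi-window, multi-burn check for a 99.9% availability SLO over a 30-day window, using the commonly cited SRE-workbook thresholds; how the per-window error ratios are queried from Prometheus is left to your setup.

# burn_alert.py: multi-window, multi-burn-rate check for a 99.9% SLO (30-day window)
SLO_TARGET = 0.999
ERROR_BUDGET = 1.0 - SLO_TARGET          # 0.1%

# (long_window, short_window, burn_rate_threshold)
WINDOWS = [
    ("1h", "5m",  14.4),   # page: 2% of the 30-day budget burned in 1 hour
    ("6h", "30m",  6.0),   # page: 5% of the budget burned in 6 hours
    ("3d", "6h",   1.0),   # ticket: slow, steady burn
]

def burn_rate(error_ratio: float) -> float:
    """How fast the error budget is being consumed relative to the SLO."""
    return error_ratio / ERROR_BUDGET

def should_alert(error_ratio_by_window: dict[str, float]) -> list[str]:
    """Fire only when both the long and short window exceed the threshold,
    which filters out stale long-window burn that has already recovered."""
    fired = []
    for long_w, short_w, threshold in WINDOWS:
        if (burn_rate(error_ratio_by_window[long_w]) >= threshold
                and burn_rate(error_ratio_by_window[short_w]) >= threshold):
            fired.append(f"burn>{threshold}x over {long_w}/{short_w}")
    return fired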
10) CAM resiliency mapping (certify it)
For a multi-region active-active deployment serving public traffic:
- I-PWR = 3 — N+1 UPS/gensets per region; any one region can vanish.
- I-COOL = 3 — Redundant CRAH/liquid loops; free-cool where possible.
- I-NWK = 4 — Anycast across ≥2 ISPs per region; RPKI; dual DNS providers.
- I-DATA = 4 — Quorum registry & metadata; RPO 0 s; immutable backups.
- I-CTRL = 4 — Active-active control planes, GitOps, signed artifacts, independent admin domains.
Composite I-Score ≈ 3.6–4, which rounds to 4. For A3 workloads (99.98% availability / RTO 30 min / RPO 15 min), that maps to CAM Tier 3; for A2 workloads, CAM Tier 2–3, which this architecture typically exceeds.
11) Rollout cookbook (shadow → canary → percent)
- Prepare: Publish model vNEXT to the object store; sign it with Cosign.
- Shadow: Mirror 1% of traffic in Region A for 24 h; collect quality & TTFT deltas.
- Canary: Route 5% of sessions in A and B; block the rollout if p95 TTFT worsens by more than 30 ms or the error rate exceeds 0.5%.
- Ramp: Increase weights every 30 min: 5 → 25 → 50 → 100% (ramp loop sketched below).
- Finalize: Garbage-collect old weights; update feature contracts; rotate secrets.
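A sketch of that ramp loop with the same guardrails as the Rego gate in section 4; set_canary_weight and the metric getters are hypothetical hooks into your router and metrics store.

# ramp.py: shadow -> canary -> percent ramp with SLO guardrails (sketch)
import time

RAMP_STEPS = [5, 25, 50, 100]     # percent of live sessions per region
STEP_INTERVAL_S = 30 * 60         # ramp every 30 minutes
MAX_TTFT_DELTA_MS = 30
MAX_ERROR_RATE_PCT = 0.5

def guardrails_ok(get_p95_ttft_delta_ms, get_error_rate_pct) -> bool:
    return (get_p95_ttft_delta_ms() <= MAX_TTFT_DELTA_MS
            and get_error_rate_pct() <= MAX_ERROR_RATE_PCT)

def ramp(set_canary_weight, get_p95_ttft_delta_ms, get_error_rate_pct) -> bool:
    """Returns True if the rollout completed, False if it was halted."""
    for pct in RAMP_STEPS:
        set_canary_weight(pct)
        time.sleep(STEP_INTERVAL_S)
        if not guardrails_ok(get_p95_ttft_delta_ms, get_error_rate_pct):
            set_canary_weight(0)          # halt and roll back on degradation
            return False
    return True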
12) Cost notes (hidden traps)
- Inter-region egress will eat your lunch if you try to replicate KV caches; don’t.
- Over-pinning (too sticky) causes hotspots; bias the hash with an EWMA of each region's p95 (sketched below).
- Cold-start storms after deploy? Pre-warm by replaying top-K prompts to each region.
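A sketch of the EWMA bias: track smoothed p95 per region and convert it into placement weights for new sessions only; how the weights reach your hash ring is proxy-specific and the class here is illustrative.

# ewma_bias.py: bias new-session placement away from hot regions
ALPHA = 0.2   # EWMA smoothing factor

class RegionLatency:
    def __init__(self, regions: list[str]):
        self.ewma_p95_ms = {r: None for r in regions}

    def observe(self, region: str, p95_ms: float) -> None:
        prev = self.ewma_p95_ms[region]
        self.ewma_p95_ms[region] = (
            p95_ms if prev is None else ALPHA * p95_ms + (1 - ALPHA) * prev
        )

    def weights(self) -> dict[str, float]:
        """Lower EWMA p95 means a higher share of new sessions; pinned sessions stay put."""
        inv = {r: 1.0 / v for r, v in self.ewma_p95_ms.items() if v}
        total = sum(inv.values()) or 1.0
        return {r: w / total for r, w in inv.items()}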
13) Quick starts (Helm one-liners)
helm repo add vllm https://vllm.ai/helm-charts
helm install chat vllm/vllm \
--set model.repo=TheModelOrg/awesome-13b \
--set parallelism.tensor=2 \
--set batching.maxNumSeqs=256 \
--set serve.enableChunkedPrefill=true
helm repo add tgi https://hf.co/helm
helm install tgi tgi/text-generation-inference \
--set model=TheModelOrg/awesome-13b \
--set shards=2 \
--set dtype=bf16
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm install triton nvidia/triton-inference-server \
--set modelRepository=/models \
--set 'resources.limits.nvidia\.com/gpu=2'
14) Final guidance
- Pin sessions, not regions. Let Anycast/L7 send each session to the healthiest nearby region and keep it there.
- Warm smart, not heavy. Share signatures of hot prompts across regions; don’t sync KV caches.
- Prove resilience. Map your design to CAM pillars, hit I-NWK/I-DATA/I-CTRL = 4, and certify the tier.
One-liner: Multi-region model serving is a routing problem wrapped around a caching problem—with a CI problem in the middle. Solve all three, and your users will never notice which region answered.