What this article delivers
- Three reference architectures (Retail AR, Industrial Vision, Fleet Telematics) with ASCII diagrams
- Latency budgets, SLOs, and CAM pillar mapping (I-PWR, I-COOL, I-NWK, I-DATA, I-CTRL)
- Bill of materials, deployment steps, and failure-mode notes
- Copy-paste snippets (Kubernetes/Helm, policy) you can adapt today
1) Shared assumptions and quick math
Latency budget model
End-to-end latency (RTT) ≈ Access(last-mile) + Metro + Backbone + DC-fabric + Inference + App logic
Speed of light in fiber ≈ 5 µs/km one way.
Every router/LB hop adds ~0.1–0.5 ms; TLS adds 1–2 RTT for handshake unless resumed.
P99 matters. Budget to P99, not P50.
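As a sanity check, the budget model above can be turned into a few lines of arithmetic. The hop cost, inference time, and TLS handling below are illustrative assumptions, not measurements:

```python
# Rough RTT estimator for the budget model above; all numbers illustrative.
FIBER_US_PER_KM = 5  # one-way light propagation in fiber, ~5 microseconds/km

def rtt_budget_ms(fiber_km, hops, hop_ms=0.3, inference_ms=10.0, app_ms=2.0,
                  tls_handshake_rtts=0):
    """Estimate end-to-end RTT in milliseconds."""
    propagation = 2 * fiber_km * FIBER_US_PER_KM / 1000.0  # out and back
    network = propagation + hops * hop_ms                  # routers/LBs
    # An unresumed TLS handshake costs 1-2 extra network round-trips
    return network * (1 + tls_handshake_rtts) + inference_ms + app_ms

# In-store inference: ~1 km of fiber, 4 hops, resumed TLS session
print(round(rtt_budget_ms(fiber_km=1, hops=4, inference_ms=9.0), 2))  # ~12 ms
```

Run the same function with your P99 inference time, not the P50, to budget against the tail.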
SLO taxonomy (use or adapt)
- Availability — % of successful requests within SLO latency window
- Latency — p50/p95/p99 target for “time to first token” (LLM) or “decision time” (vision)
- Freshness — Max model staleness (hours) allowed before redeploy
- Durability — RPO/RTO for authoritative data & artifacts
CAM in one minute
- A-Level: business tolerance for downtime/RPO (A0–A5).
- I-Score: mean of 5 pillars (0–5).
- CAM Tier: intersection of A-Level and I-Score (0–5).
Architectures below show target pillar scores and the resulting CAM Tier you can certify.
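For concreteness, the I-Score arithmetic can be sketched in a few lines. This covers only the mean-of-pillars step; the CAM Tier itself comes from intersecting the result with the A-Level in the CAM lookup table, which this sketch does not reproduce:

```python
# Hedged sketch of the I-Score arithmetic stated above: the mean of the
# five pillar scores, each 0-5. Pillar values here are illustrative.
def i_score(pillars):
    assert set(pillars) == {"I-PWR", "I-COOL", "I-NWK", "I-DATA", "I-CTRL"}
    return sum(pillars.values()) / 5

# Retail example (Architecture A, mid-range pillar scores)
retail = {"I-PWR": 2, "I-COOL": 2, "I-NWK": 3, "I-DATA": 3, "I-CTRL": 3}
print(i_score(retail))  # 2.6, i.e. I2 when rounding down
```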
2) Reference Architecture A — Retail AR “Pick-and-See” (sub-20 ms UX)
Use case: A shopper lifts a product; a phone/tablet overlay shows price, reviews, and in-stock alternatives within 20 ms.
[Phone/Tablet] --Wi-Fi 6E--> [In-Store Edge Pod] ==(Anycast)==> [Metro Edge PoP A/B]
       |                            |                                  |
 Camera + AR SDK              Triton/vLLM                  Model Registry (quorum)
       |                            |                                  |
 +-- tiny preproc         + Feature store read             + Object storage (core)
SLOs
- Latency: p95 ≤ 20 ms RTT store↔inference; p99 ≤ 35 ms
- Availability: 99.95% monthly
- Freshness: model ≤ 24 h old
- Durability: RPO ≤ 1 h (features & events)
Placement logic
Inference in-store (1–2 GPUs) for “first paint” with metro PoP as spillover/backup. Features/metadata from metro cache; authoritative store in two core colos.
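The spillover rule above can be sketched as a tiny routing predicate; the queue-depth threshold and health signal are hypothetical:

```python
# Hedged sketch of "first paint in-store, spill to metro"; the health flag
# and queue threshold are illustrative, not from a real deployment.
def pick_backend(store_gpu_healthy: bool, store_queue_depth: int,
                 max_queue: int = 8) -> str:
    if store_gpu_healthy and store_queue_depth < max_queue:
        return "in-store"   # 8-10 ms budget, fastest first paint
    return "metro-pop"      # spillover/backup, +8-12 ms round trip

print(pick_backend(True, 2))    # in-store
print(pick_backend(False, 0))   # metro-pop
```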
Latency budget (example)
| Component | Budget |
|---|---|
| Wi-Fi + local switch | 2 ms |
| In-store edge pod (LLM/vision) | 8–10 ms |
| Optional metro round-trip | 8–12 ms |
| Total | 18–24 ms |
Bill of Materials (per large store)
- Compute: 1× short-depth 2U server, 2× L4 or A10 class GPUs; 256 GB RAM
- Network: 2× WAN circuits (cable + 5G), SD-WAN CPE, Anycast-capable ingress (Envoy)
- Cooling: Fan-assisted short rack or wall-mount enclosure (I-COOL target 1–2)
- Storage: 2 TB NVMe for local warm cache; no durable state
- Security: TPM 2.0, secure boot, SPIFFE/SPIRE agent
CAM mapping (A-Level A2 typical)
- I-PWR 1–2 — Small UPS covers 10 min; app re-routes to metro if store drops
- I-COOL 1–2 — Enclosure fans; thermal alarms to fail to metro
- I-NWK 3 — Dual ISP or ISP + 5G; Anycast to two metros; dual DNS
- I-DATA 3 — Feature store cached; authoritative in two cores; immutable nightly backup
- I-CTRL 3 — GitOps; runners in metro & core; break-glass offline creds
Composite I-Score ≈ 2.2–2.6 → I2. At A-Level A2 this certifies CAM Tier 2 (I3 also yields Tier 2). To reach Tier 3, lift I-CTRL or I-NWK one notch.
Failure modes and behavior
- Store power loss: Fail to metro PoP; degrade to “lite overlay” without personalization.
- WAN flap: Keep serving from store; queue writes, eventually sync features.
- Model rollout gone bad: Blue-green with store→metro canary; auto-rollback via GitOps.
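The canary gate behind that auto-rollback reduces to one comparison; the thresholds below simply reuse this architecture's p99 budget and an assumed error-rate ceiling:

```python
# Illustrative store->metro canary gate: promote only if the canary meets
# the p99 latency SLO and an assumed error-rate ceiling.
def canary_verdict(p99_ms: float, error_rate: float,
                   p99_budget_ms: float = 35.0,
                   max_error_rate: float = 0.001) -> str:
    if p99_ms <= p99_budget_ms and error_rate <= max_error_rate:
        return "promote"
    return "rollback"   # GitOps controller reverts to the previous release

print(canary_verdict(22.0, 0.0002))  # promote
print(canary_verdict(48.0, 0.0002))  # rollback
```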
Deployment quick start (Helm)
# values-retail.yaml
ingress:
  anycast: true
  protocols: [h3, h2]
triton:
  image: nvcr.io/nvidia/tritonserver:24.05
  args: ["--exit-on-error=false", "--pinned-memory-pool-byte-size=268435456"]
  resources:
    limits: { nvidia.com/gpu: 1 }
featureStore:
  endpoint: http://feature-cache.metro.local:8080
mesh:
  spiffe: true
  mTLS: required
helm repo add retail-ai https://charts.retail.example
helm install store-edge retail-ai/retail-stack -f values-retail.yaml
3) Reference Architecture B — Industrial Vision “Stop the Arm” (≤ 20 ms deterministic)
Use case: High-speed line rejects defective parts; robotic arm must halt before contact.
[GigE Camera] -> [PoE Switch] -> [Edge GPU Node(s)] -> [PLC/SCADA]
                                        |                   ^
                               mTLS+PTP |                   | EtherCAT/Profinet
                                        v                   |
                              [Local Feature DB]            |
                        (bounded write-behind to core)
SLOs
- Decision time: p99 ≤ 20 ms from last frame to PLC signal
- Availability: 99.99% monthly (A-Level A3 or A4, depending on safety case)
- RPO/RTO: RPO ≤ 15 min (metadata), RTO ≤ 5 min for line restart
Placement logic
Everything critical stays on-prem: inference, feature cache, PLC link. Core/cloud only for analytics, retraining, and artifact storage.
Deterministic timing practices
- Use PTP (IEEE-1588) for sub-1 ms clock skew.
- Pin GPU/CPU interrupts; isolate NIC queues (RSS) for camera streams.
- On a single host, run gRPC over Unix domain sockets, or use shared memory, rather than TCP loopback.
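Tying these practices to the 20 ms SLO, a fail-safe wrapper can refuse to emit late decisions. Here `infer` and `plc_signal` are stand-ins for the real model and PLC interfaces, not actual APIs:

```python
import time

# Illustrative fail-safe wrapper: if inference misses the 20 ms decision
# budget, the PLC gets a "safe-stop" instead of a late verdict.
DECISION_BUDGET_S = 0.020  # p99 <= 20 ms from last frame to PLC signal

def decide(frame, infer, plc_signal):
    start = time.monotonic()
    verdict = infer(frame)                  # e.g. "pass" / "reject"
    elapsed = time.monotonic() - start
    if elapsed > DECISION_BUDGET_S:
        plc_signal("safe-stop")             # late decision: fail safe
        return "safe-stop"
    plc_signal(verdict)
    return verdict

# Fast path with a mock model that answers immediately
print(decide(b"frame", lambda f: "reject", lambda s: None))  # reject
```

A real implementation would also enforce the budget inside the inference call (e.g. a watchdog on the GPU stream), since a wrapper can only detect lateness after the fact.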
Bill of Materials (per line)
- Compute: 2× industrial GPU PCs (IP65 if needed), each with 1× L4/A2 (fanless if possible)
- Network: 1× PoE+ switch; 2× uplinks to plant core; OT/IT segmentation w/ firewall
- Cooling: Direct-to-chip kits or sealed fanless chassis (I-COOL 3)
- Power: UPS 2 kVA per node; plant micro-grid or generator (I-PWR 3–4)
- Storage: NVMe mirrored; WORM snapshot daily for config & models
- Safety: Interlock relays; SIL-rated PLC
CAM mapping
- I-PWR 3–4 — Genset or micro-grid + UPS with autonomy ≥ RTO
- I-COOL 3 — No thermal excursions during shift; redundant pump/fan where liquid cooling used
- I-NWK 3 — Dual uplinks; plant ring; out-of-band 4G for alerts; DNS split horizon
- I-DATA 3–4 — On-prem primary; second site async; nightly immutable backup; periodic restore drill
- I-CTRL 4 — GitOps; two admin teams (OT/IT) with RBAC; policy-as-code; break-glass tested
I-Score ≈ 3–4 → CAM Tier 3 (A3) or Tier 4 (A4 with I4).
Failure modes and behavior
- GPU node fails: Hot spare takes over the camera stream within 1 s via VIP; the PLC raises an alarm, but the line keeps running if the spare is healthy.
- WAN outage: No impact on line; metrics buffered; alerts via LTE.
- Cooling loop blockage: Thermal kill-switch halts inference; PLC puts line into safe state; maintenance SLA < 5 min.
Policy gate (OPA Rego) to block unsigned models
package admission

default allow = false

allow {
  input.request.kind.kind == "ModelDeployment"
  sig := input.request.object.spec.signature
  verify(sig, input.request.object.spec.digest)
  input.request.object.spec.tags[_] == "safety-approved"
}

verify(sig, digest) {
  # Abstracted: in practice, delegate to a cosign/Rekor verifier
  # (e.g. via an external data provider). Non-empty checks stand in here.
  sig != ""
  digest != ""
}
4) Reference Architecture C — Fleet Telematics “Sense → Decide → Sync” (≤ 70 ms assist; sub-second aggregation)
Use case: 100k vehicles run local safety inference; roadway MEC clusters aggregate events and serve personalized models back to the fleet.
[Vehicle Edge (ARM or x86)] --multi-IMSI/5G--> [MEC PoP A/B] ==Anycast==> [Regional Agg]
             |                                      |                          |
 Camera/LiDAR + tiny model           vLLM/Triton + KV cache        Model registry (quorum)
             |                                      |                          |
 +-- local decision (10–30 ms)                 Driver UI             Cloud core analytics
SLOs
- On-vehicle assist: p99 ≤ 30 ms (local only)
- MEC query: p95 ≤ 70 ms RTT vehicle↔MEC for map/traffic augment
- Availability: 99.9% monthly (A-Level A2/A3 by feature class)
- Freshness: regional map/model deltas ≤ 5 min
Placement logic
Primary decisions on vehicle to avoid WAN dependency. MEC holds regional embeddings, map tiles, and personalization features. Regional aggregator pushes model deltas; core runs training & analytics.
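The 5-minute freshness SLO reduces to a timestamp comparison. In practice the registry's delta metadata would be the source of truth, but a sketch looks like:

```python
import time

# Hedged sketch of the 5-minute freshness SLO for regional map/model
# deltas; a real check would read registry metadata, not a local clock.
FRESHNESS_BUDGET_S = 5 * 60

def is_fresh(last_delta_epoch_s: float, now_s: float = None) -> bool:
    now_s = time.time() if now_s is None else now_s
    return (now_s - last_delta_epoch_s) <= FRESHNESS_BUDGET_S

print(is_fresh(last_delta_epoch_s=1000, now_s=1000 + 240))  # True: 4 min old
print(is_fresh(last_delta_epoch_s=1000, now_s=1000 + 420))  # False: 7 min old
```

A stale result should trigger a pull from the regional aggregator, and page on repeated misses.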
Bill of Materials (per MEC PoP)
- Compute: 4× 2U servers, each 2× L4/A10; 512 GB RAM; NVMe cache 8 TB
- Network: Dual ISPs, Anycast VIPs; RPKI ROAs; DDoS scrubbing
- Power/Cooling: N+1 UPS; room or containerized pod; economizer if climate allows
- Security: HSM for signing model releases; SPIRE for workload identity
CAM mapping (for MEC)
- I-PWR 3 — N+1 UPS/genset; MEC must stay up during utility blips
- I-COOL 3 — Redundant CRAH or liquid-to-rack; free-cool if possible
- I-NWK 4 — Two ISPs + Anycast; secondary DNS; tolerate any one carrier outage
- I-DATA 3–4 — KV cache + quorum registry (two regions + MEC); RPO ≤ 15 min
- I-CTRL 3–4 — Active-active control-planes across two MECs; GitOps; signed releases
I-Score ≈ 3.2–3.6 → I3. For A2, CAM Tier 2–3; for A3, aim for I4 to hit Tier 3.
Failure modes and behavior
- Carrier down: Multi-IMSI modems prefer alt carrier; MEC Anycast shifts; degraded but safe.
- MEC loss: Vehicles operate locally; miss regional enrichments until next PoP reached.
- Model rollback: Signed model fails post-deploy checks; vehicle uses previous local model.
MEC bootstrap (Kubernetes)
# Control plane across two MEC sites
kubeadm init --control-plane-endpoint mec.anycast.local:6443 --upload-certs
kubeadm join mec-a ... --control-plane
kubeadm join mec-b ... --control-plane
# SPIRE server and agents (repo URL, chart names, and values are
# illustrative — check the SPIFFE Helm charts for the current ones)
helm repo add spiffe https://spiffe.github.io/helm-charts-hardened
helm install spire spiffe/spire-server -n spire --create-namespace
helm install spire-agent spiffe/spire-agent -n spire --set nodeAttestor=tpm
5) Decision matrix: where should your model run?
| Constraint | Core Cloud | Metro Edge PoP | In-Store/On-Prem Edge | On-Device |
|---|---|---|---|---|
| RTT target | > 70 ms | 15–40 ms | 3–20 ms | 0–10 ms |
| Data sovereignty | Medium | High (regional) | Very high | Highest |
| Cap-ex/ops | Lowest | Medium | Medium–High | Low per node, high at fleet scale |
| Control | Medium | High | Highest | High but fragmented |
| CAM pillar stress | I-DATA | I-NWK | I-PWR / I-COOL / I-CTRL | I-CTRL (supply chain) |
Pick the lowest-cost tier that meets your SLO. Prove it with CAM; invest only in the pillars that cap your Tier.
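That selection rule can be expressed directly against the RTT row of the matrix; the tier ordering and floors below are lifted from the table, cheapest first:

```python
# Sketch of "pick the lowest-cost tier that meets your SLO" using the RTT
# row of the matrix above. Floors are the low end of each tier's RTT range.
TIERS = [  # (name, lowest p95 RTT in ms the tier can realistically meet)
    ("core-cloud", 70),    # cheapest cap-ex/ops
    ("metro-edge", 15),
    ("on-prem-edge", 3),
    ("on-device", 0),      # lowest latency, fragmented control
]

def cheapest_tier(rtt_target_ms: float) -> str:
    # Tiers are listed cheapest-first; take the first whose floor fits.
    for name, floor_ms in TIERS:
        if rtt_target_ms >= floor_ms:
            return name
    return "on-device"

print(cheapest_tier(100))  # core-cloud
print(cheapest_tier(25))   # metro-edge
print(cheapest_tier(10))   # on-prem-edge
```

Data sovereignty, cap-ex, and pillar stress then act as tie-breakers on top of the latency cut.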
6) Implementation checklist (copy/paste into your runbook)
- Define SLOs (p95 latency, availability, freshness, durability) per user journey.
- Choose A-Level from the Availability Standard (A2 for “important,” A3/A4 for revenue-critical).
- Place compute: device → store → metro → core, stopping at the first tier that meets SLO.
- Score pillars with CAM rubric; list the cheapest improvements to hit target Tier.
- Ship identity: SPIFFE/SPIRE + mTLS; sign every artifact (images, models, policies).
- Test RTO/RPO: cut power, flap WAN, corrupt a replica; verify SLOs don’t break.
- Automate rollouts: GitOps, progressive delivery, canaries; fail closed on unsigned deploys.
- Attest continuously: export pillar KPIs to Grafana; publish green CAM badge.
7) Final take
Real-time AI isn’t a monolith. Some answers must come from the phone, some from the back room, some from the metro, and a few from the core. The right architecture is the one that meets your SLO at the lowest total cost, and the right proof is a CAM Tier you can show to customers, auditors, and your CFO.