What “Tier-0+” actually means
Tier-0+ is a pragmatic edge design point:
- Power: single utility feed, small line-interactive UPS (5–15 minutes). No genset.
- Cooling: ambient/fan-assisted enclosure; optionally rear-door HX or compact liquid loop.
- Space: wall-mount cabinet, micro-pod, or ¼-rack in a shared closet.
- Staffing: unstaffed; truck rolls only for swaps.
- Security: locked enclosure + tamper sensors; zero-trust overlay for all traffic.
The “+” is important: although each site is fragile, the system is not. We raise I-NWK, I-DATA, and I-CTRL high enough that the application’s end-to-end availability matches what a Tier-3 facility would deliver.
Design tenets for Tier-0+ fleets
- Stateless at the edge: treat nodes as disposable. If a box dies, traffic shifts elsewhere and state rehydrates.
- Durable core: authoritative data lives in two or more independent core sites (or regions).
- Anycast everything user-facing: fail-over by BGP announcement, not DNS TTL drama.
- Two admin domains minimum: no single operator or cloud account can brick production.
- Signed artifacts only: images, models, policies—verify before admit; roll back automatically.
- Evidence or it didn’t happen: runbooks, restore drills, and telemetry are part of the product.
Reference blueprint (Tier-0+ edge → metro PoP → core)
ASCII sketch:
 [Users/Devices]
        |  Anycast VIP (H3/H2)
┌───────────────┐          ┌────────────────┐          ┌─────────────────┐
│ Tier-0+ Edge  │ ==WAN==> │ Metro Edge PoP │ ==WAN==> │ Core Colos      │
│ (1–2 GPUs)    │ <==WAN== │ (2+ sites)     │ <==WAN== │ (2 independent) │
└───────────────┘          └────────────────┘          └─────────────────┘
        |                          |                           |
  stateless svc             cache/feature KV            quorum DB (3-way)
  sidecar mTLS              Anycast ingress             object store (WORM)
  SPIFFE agent              GitOps runners              HSM/signing service
Placement
- First-hop inference and pre/post-processing at Tier-0+ edge (fastest UX).
- Metro PoPs absorb overflow and act as cache/coordination layer.
- Core colos host the durable truth (databases, registries, WORM backups).
SLOs and A-Level to target
Typical real-time app SLOs:
- Latency p95: 20–50 ms depending on UX.
- Availability: 99.9–99.99 % (A2–A3); payments/ops control may be A4.
Pick A-Level per your BIA (see Availability Standard Section 5). This article assumes A2/A3 for most edge AI.
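A quick way to keep those availability targets honest is to convert them into an error budget; a minimal Python sketch (the 30-day month is my assumption):
# Downtime allowed per 30-day month for a given availability target.
def monthly_error_budget_minutes(availability: float, days: int = 30) -> float:
    total_minutes = days * 24 * 60
    return total_minutes * (1.0 - availability)

for target in (0.999, 0.9999):   # the A2/A3-style targets from the SLO list above
    budget = monthly_error_budget_minutes(target)
    print(f"{target:.4%} -> {budget:.1f} min/month of allowed downtime")
# 99.9 % leaves roughly 43 minutes per month; 99.99 % leaves about 4.3 minutes.
Those few minutes are what the fail-away mechanics in the rest of this article have to fit inside.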
CAM mapping: how Tier-0+ hits Tier-3-equivalent service
Goal: achieve CAM Tier 3 for an A2/A3 workload even when individual edge sites are fragile.
Suggested targets:
- I-PWR = 1–2 (small UPS, no genset)
- I-COOL = 1–2 (fan or compact liquid)
- I-NWK = 4 (two ISPs/carriers, Anycast, dual DNS, RPKI)
- I-DATA = 4 (multi-region quorum, RPO 0 s to core, immutable backups)
- I-CTRL = 4 (active-active control planes, signed releases, independent admins)
Composite I-Score = round((2 + 2 + 4 + 4 + 4) / 5) = round(3.2) = 3, which puts an A2/A3 workload at I3 in CAM Tier 2/3. Bump either I-NWK or I-CTRL to 5 (e.g., stronger independence) and the average still rounds to I3, so the Tier holds at 3. The key point: for stateless edge roles, three strong pillars neutralize two weak ones.
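Here is the same arithmetic as a tiny Python sketch you can rerun with your own pillar scores (the values are the suggested targets above; mapping the resulting I-Score to a Tier still comes from the CAM A-Level tables, not from this snippet):
# Composite I-Score = rounded mean of the five pillar scores.
pillars = {"I-PWR": 2, "I-COOL": 2, "I-NWK": 4, "I-DATA": 4, "I-CTRL": 4}

def composite_i_score(scores: dict[str, int]) -> int:
    return round(sum(scores.values()) / len(scores))

print(composite_i_score(pillars))                  # 3 -> I3
print(composite_i_score({**pillars, "I-NWK": 5}))  # still 3: one bumped pillar keeps I3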
Bill of materials (per Tier-0+ site)
- Compute: short-depth 1U/2U node, 1× L4/A10-class GPU (or smaller); 128–256 GB RAM; 2× NVMe (cache only).
- Power: 1.5 kVA line-interactive UPS, networked PDU; dry contacts to tamper sensor.
- Cooling: fan-assisted enclosure OR small D2C kit with CDU in cabinet (if >700 W sustained).
- Network: 1× primary DIA, 1× 5G multi-SIM or second DIA; SD-WAN/edge router with BGP; OOB LTE for management optional.
- Security: TPM 2.0; secure boot; lockable case; camera on door sensor if feasible.
Network plan (I-NWK 4 without drama)
- Two ISPs or ISP + 5G; physically diverse entrances where possible.
- BGP multihoming at metro PoPs; edges don’t need to speak BGP to the internet—terminate Anycast at the PoP and tunnel to edges.
- Dual DNS providers with DNSSEC; health-check from outside each ASN.
- RPKI ROAs on your prefixes; MANRS hygiene.
- QUIC/HTTP-3 at ingress to cut head-of-line blocking for token streaming.
Minimal FRR Anycast at PoP (illustrative):
router bgp 65010
 neighbor ISP1 peer-group
 neighbor ISP1 remote-as 64500
 neighbor ISP2 peer-group
 neighbor ISP2 remote-as 64510
 ! bind each upstream's real neighbor IPs to the ISP1/ISP2 peer-groups
 address-family ipv4 unicast
  ! 203.0.113.0/24 is your Anycast prefix
  network 203.0.113.0/24
  maximum-paths 2
 exit-address-family
!
! pull-up route so the locally originated Anycast prefix resolves
ip route 203.0.113.0/24 Null0 254
Data plan (I-DATA 4 with RPO 0 s to core)
- Edge: no durable truth. Use local NVMe only as a warm cache (models, tiles, embeddings).
- Metro: write-buffer/feature store with bounded queues that tolerate brief WAN loss (see the buffer sketch after this list).
- Core: three-way quorum (e.g., Postgres Patroni in two colos + witness, or CockroachDB/Yugabyte); RPO 0 s across quorum.
- Backups: WORM snapshots to third account/tenant; quarterly restore drills; hash attestations.
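Here is a minimal sketch of the metro write-buffer referenced above: a bounded queue in front of the core that sheds the oldest entries during a WAN outage instead of blocking the edge (queue depth, batch size, and the send_to_core callable are illustrative assumptions, not any particular product):
import collections, time

class BoundedWriteBuffer:
    """Metro-tier buffer: absorbs writes during brief WAN loss, flushes to core when it returns."""
    def __init__(self, send_to_core, max_items: int = 10_000):
        self.send_to_core = send_to_core                   # callable(batch) -> True on durable ack from core
        self.queue = collections.deque(maxlen=max_items)   # on overflow, the oldest entries are dropped

    def write(self, record: dict) -> None:
        record["buffered_at"] = time.time()
        self.queue.append(record)                          # never blocks the edge-facing path

    def flush(self, batch_size: int = 500) -> int:
        flushed = 0
        while self.queue:
            batch = [self.queue.popleft() for _ in range(min(batch_size, len(self.queue)))]
            if not self.send_to_core(batch):               # WAN still down: put the batch back and stop
                self.queue.extendleft(reversed(batch))
                break
            flushed += len(batch)
        return flushed
The point is the failure mode: when the core is unreachable the edge keeps answering, and the only thing at risk is the least-recent buffered telemetry, never the durable truth in the core.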
Restore drill snippet (Postgres example)
- Create new cluster in Colo-B from WORM snapshot timestamp T.
- Re-point read-only analytics to the restored follower; verify checksums and row counts (a verification sketch follows this list).
- Promote to read/write in staging; execute synthetic transactions; cut back.
- Record RTO, operator steps, anomalies; attach logs to Evidence Pack.
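The checksum/row-count step above is easy to script; a hedged Python sketch using psycopg2 (the table list, DSNs, and the per-table count-plus-hash comparison are illustrative assumptions):
import psycopg2   # assumes the psycopg2 driver is installed

TABLES = ["orders", "features", "audit_log"]   # hypothetical tables to verify

def table_fingerprint(dsn: str, table: str):
    """Row count plus an order-independent content hash for one table."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(f"SELECT count(*), md5(string_agg(t::text, '' ORDER BY t::text)) FROM {table} t")
        return cur.fetchone()   # (count, digest)

def verify_restore(prod_dsn: str, restored_dsn: str) -> bool:
    ok = True
    for table in TABLES:
        prod, restored = table_fingerprint(prod_dsn, table), table_fingerprint(restored_dsn, table)
        match = prod == restored
        ok = ok and match
        print(f"{table}: prod={prod} restored={restored} {'OK' if match else 'MISMATCH'}")
    return ok   # attach this output to the Evidence Pack alongside the RTO timings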
Control plane (I-CTRL 4 that heals itself)
- Two independent control-plane clusters (metro-A and metro-B) mirrored with GitOps; edges subscribe to both.
- Admission control enforces signatures (COSIGN) on images, models, policies; un-signed = denied.
- Two admin teams or at minimum two distinct cloud accounts/tenants; least-privilege RBAC; hardware-token MFA.
- Break-glass: smartcard stored offline; use audited vault; quarterly tested.
OPA admission rule (conceptual Rego):
package admission
default allow = false
allow {
  # evaluate Pod admission requests only
  input.request.kind.kind == "Pod"
  # every container image must verify against a trusted signature
  # (sigstore.verify_image stands in for your signature-check integration; it is not an OPA built-in)
  sigstore.verify_image(input.request.object.spec.containers[_].image)
  # and the workload must ride the stable release track
  input.request.object.metadata.labels["release.track"] == "stable"
}
Power and cooling (embracing “fragile but fast”)
- UPS runtime sized only for graceful drain and fail-over (10 minutes is typically enough).
- Consider fanless boxes up to ~300 W; above that, direct-to-chip liquid or a rear-door HX makes I-COOL=2 practical without adding gensets.
- Monitor inlet temp and ΔT; if envelope exceeded, fail traffic to metro automatically (health probe) and power down node.
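A minimal sketch of that last bullet as a local probe (the ΔT limit and metric sources are assumptions; the 45 °C / 2 min values mirror the runbook later in this article):
INLET_MAX_C = 45.0     # matches the runbook threshold later in this article
DELTA_T_MAX_C = 20.0   # assumed inlet-to-outlet rise limit for this enclosure
SUSTAINED_S = 120      # the excursion must persist this long before we act

class ThermalProbe:
    """Flags the node for evacuation once the thermal envelope has been exceeded for SUSTAINED_S."""
    def __init__(self):
        self.breach_since = None

    def check(self, inlet_c: float, outlet_c: float, now_s: float) -> str:
        breached = inlet_c > INLET_MAX_C or (outlet_c - inlet_c) > DELTA_T_MAX_C
        if not breached:
            self.breach_since = None
            return "healthy"
        if self.breach_since is None:
            self.breach_since = now_s
        if now_s - self.breach_since >= SUSTAINED_S:
            return "evacuate"   # cordon + drain, shift traffic to the metro PoP, then power down
        return "watch"
Wire the "evacuate" result into whatever cordons the node and withdraws it from the tunnel/Anycast pool; the power-down should only follow a confirmed drain.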
Security (zero-trust by default)
- SPIFFE/SPIRE for workload identity; mTLS everywhere (Envoy/Linkerd).
- Supply-chain: SBOMs (Syft/Grype), SLSA-attested builds; Rekor transparency logs for model artifacts.
- Device bootstrap: secure boot + TPM attestation → SPIRE join token; rotate SVIDs every 24 h.
Failure-mode drills (make it muscle memory)
Run these quarterly at a subset of sites (or all, if automated):
- Pull the plug at an edge cabinet. Expect: drain in < 30 s; Anycast shifts to nearest PoP; SLOs hold for ≥99 % of users.
- Delete the wrong model tag. Expect: admission controller blocks deploy; no traffic served from unsigned artifact.
- Kill metro PoP-A. Expect: PoP-B takes Anycast; edges reconnect; control-plane latency increases but SLO within budget.
- Corrupt a replica in core DB. Expect: checksum mismatch triggers rebuild; RPO 0 s maintained; WORM untouched.
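Each drill’s “Expect” line is easiest to judge from outside the fleet; here is a small probe sketch that measures user-facing availability while a drill runs (the Anycast URL, one-second cadence, and 99 % pass bar are assumptions matching the first drill above):
import time, urllib.request

def probe_during_drill(url: str, duration_s: int = 300, interval_s: float = 1.0) -> float:
    """Hit the Anycast VIP from outside the fleet for the length of the drill; return the success ratio."""
    ok = total = 0
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        total += 1
        try:
            with urllib.request.urlopen(url, timeout=2) as resp:
                ok += 1 if resp.status == 200 else 0
        except OSError:
            pass   # count as a failed request
        time.sleep(interval_s)
    return ok / max(total, 1)

# Example: the pull-the-plug drill passes if >= 99 % of probes succeeded.
ratio = probe_during_drill("https://edge.example.net/healthz")   # hypothetical Anycast endpoint
print(f"availability during drill: {ratio:.2%}  (pass: {ratio >= 0.99})")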
Cost reality check (why this beats “Tier-3 everywhere”)
Back-of-napkin for a 50-site fleet supporting 100k MAU:
- Tier-0+ site: $6–10k cap-ex (node + UPS + enclosure) + $250–$450/mo opex (power, DIA/5G).
- Two metro PoPs: $12–20k/mo each (space/power, dual ISPs).
- Two core colos: existing or $20–30k/mo incremental for DB quorum + object store.
Compare: building three Tier-3 minis at $2–3M cap-ex each and $60–90k/mo opex—yet still farther (higher RTT) from users. The Tier-0+ approach typically saves 40–60 % TCO while improving p95 latency 2–5×.
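Spelling the napkin out in code makes the assumptions easy to swap (a 3-year horizon and the midpoint of each quoted range are my choices):
# 3-year totals from the napkin figures above, using the midpoint of each quoted range.
MONTHS = 36

def tco(capex: float, opex_per_month: float, count: int = 1) -> float:
    return count * (capex + opex_per_month * MONTHS)

tier0_fleet = (
    tco(capex=8_000, opex_per_month=350, count=50)      # 50 Tier-0+ sites
    + tco(capex=0, opex_per_month=16_000, count=2)      # two metro PoPs
    + tco(capex=0, opex_per_month=25_000, count=2)      # incremental core colo spend
)
tier3_minis = tco(capex=2_500_000, opex_per_month=75_000, count=3)
print(f"Tier-0+ fleet : ${tier0_fleet:,.0f}")           # roughly $4.0M over three years
print(f"3x Tier-3 mini: ${tier3_minis:,.0f}")           # roughly $15.6M over three years
Swap in your own quotes and horizon; costs both designs share (engineering time, support, egress) are deliberately left out of the napkin.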
CAM certification path (what to file and when)
- Choose A-Level (A2 or A3).
- Score pillars with evidence (a manifest sketch follows this list):
  - I-PWR: UPS logs, SLD, autonomy calc.
  - I-COOL: thermal map, envelope adherence.
  - I-NWK: contracts, traceroutes, RPKI ROAs, dual DNS proof.
  - I-DATA: replica status, restore report with timings, WORM config.
  - I-CTRL: GitOps config, admission policies, COSIGN verify logs, break-glass test.
- Run cam-cli; target Tier 3. If computed tier < target, fix the cheapest pillar first (usually I-CTRL or I-NWK).
- Optional Platinum: stream live telemetry (UPS runtime, BGP sessions, quorum health, control-plane SLO) to Trust Hub.
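One way to keep the pillar evidence reviewable is a hashed manifest, as mentioned in the list above; a minimal sketch (the evidence/ directory layout and file names are hypothetical):
import hashlib, json, pathlib, time

def build_evidence_manifest(root: str) -> dict:
    """Walk an evidence directory laid out as <root>/<pillar>/<file> and hash every artifact."""
    manifest = {"generated_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()), "pillars": {}}
    for path in sorted(pathlib.Path(root).rglob("*")):
        if not path.is_file():
            continue
        pillar = path.relative_to(root).parts[0]   # e.g. "I-NWK", "I-DATA"
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        manifest["pillars"].setdefault(pillar, []).append(
            {"file": str(path.relative_to(root)), "sha256": digest}
        )
    return manifest

# Example layout (hypothetical): evidence/I-PWR/ups-log.csv, evidence/I-NWK/rpki-roas.json, ...
print(json.dumps(build_evidence_manifest("evidence"), indent=2))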
Pitfalls and how to avoid them
- Single DNS provider: halves your I-NWK. Add a second with DNSSEC and independent ASN.
- Backups in same cloud account: ransomware eats everything. Use cross-account WORM with MFA-delete.
- Mutable edge nodes: config drift causes weird outages. Bake immutable images; redeploy, don’t patch.
- Over-investing in UPS runtime: don’t. Spend on Anycast and GitOps to fail away, not on batteries to ride out hours.
Runbook snippet (operations you’ll actually use)
- Health probe policy: edge node marked unhealthy if inlet > 45 °C for 2 min, UPS on battery, or SPIRE SVID renewal fails → cordon + drain + evacuate.
- Traffic policy: 70/30 split across two metros; the PoP with lower p95 latency gets more weight; cap any single PoP at 80 % (a weighting sketch follows this list).
- Rollout: blue/green with 5 % canary at one metro; synthetic and real traffic checks; automatic rollback on p95 deterioration > 20 %.
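The traffic-policy bullet above boils down to a small weighting function; a sketch, assuming each metro reports a current p95 and that 80 % is the single-PoP cap:
MAX_SHARE = 0.80   # cap any single PoP at 80 % of traffic (runbook value above)

def metro_weights(p95_ms: dict[str, float]) -> dict[str, float]:
    """Weight PoPs inversely to their p95 latency, then clamp any PoP to the single-PoP cap."""
    inverse = {pop: 1.0 / latency for pop, latency in p95_ms.items()}
    total = sum(inverse.values())
    weights = {pop: value / total for pop, value in inverse.items()}
    over = {pop: share - MAX_SHARE for pop, share in weights.items() if share > MAX_SHARE}
    if over and len(weights) > 1:
        spill = sum(over.values())
        under = [pop for pop in weights if pop not in over]
        for pop in over:
            weights[pop] = MAX_SHARE
        for pop in under:
            weights[pop] += spill / len(under)   # hand the clamped excess to the other PoPs
    return weights

print(metro_weights({"metro-a": 28.0, "metro-b": 41.0}))   # ~0.59 / 0.41 for these example p95s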
Executive summary and one-liner
Design for the application, not the building. Put fast, disposable nodes where your users are; anchor truth in two independent cores; bind it with Anycast, quorum, and signed control. That’s how a fleet of Tier-0+ sites delivers Tier-3 service—with better latency and lower TCO.
One-liner: “Make sites cheap and the system smart; CAM will prove it’s just as reliable.”