What “Tier-0+” actually means
Tier-0+ is a pragmatic edge design point:
- Power: single utility feed, small line-interactive UPS (5–15 minutes). No genset.
- Cooling: ambient/fan-assisted enclosure; optionally rear-door HX or compact liquid loop.
- Space: wall-mount cabinet, micro-pod, or ¼-rack in a shared closet.
- Staffing: unstaffed; truck rolls only for swaps.
- Security: locked enclosure + tamper sensors; zero-trust overlay for all traffic.
The “+” is important: although each site is fragile, the system is not. We raise I-NWK, I-DATA, and I-CTRL high enough that the application’s end-to-end availability matches what a Tier-3 facility would deliver.
Design tenets for Tier-0+ fleets
- Stateless at the edge: treat nodes as disposable. If a box dies, traffic shifts elsewhere and state rehydrates.
- Durable core: authoritative data lives in two or more independent core sites (or regions).
- Anycast everything user-facing: fail-over by BGP announcement, not DNS TTL drama.
- Two admin domains minimum: no single operator or cloud account can brick production.
- Signed artifacts only: images, models, policies—verify before admit; roll back automatically.
- Evidence or it didn’t happen: runbooks, restore drills, and telemetry are part of the product.
Reference blueprint (Tier-0+ edge → metro PoP → core)
ASCII sketch:
 [Users/Devices]
        |  Anycast VIP (H3/H2)
┌───────────────┐          ┌────────────────┐          ┌─────────────────┐
│ Tier-0+ Edge  │ ==WAN==> │ Metro Edge PoP │ ==WAN==> │ Core Colos      │
│ (1–2 GPUs)    │ <==WAN== │ (2+ sites)     │ <==WAN== │ (2 independent) │
└───────────────┘          └────────────────┘          └─────────────────┘
        |                          |                           |
  stateless svc             cache/feature KV            quorum DB (3-way)
  sidecar mTLS              Anycast ingress             object store (WORM)
  SPIFFE agent              GitOps runners              HSM/signing service
Placement
- First-hop inference and pre/post-processing at Tier-0+ edge (fastest UX).
- Metro PoPs absorb overflow and act as cache/coordination layer.
- Core colos host the durable truth (databases, registries, WORM backups).
SLOs and A-Level to target
Typical real-time app SLOs:
- Latency p95: 20–50 ms depending on UX.
- Availability: 99.9–99.99 % (A2–A3); payments/ops control may be A4.
Pick A-Level per your BIA (see Availability Standard Section 5). This article assumes A2/A3 for most edge AI.
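A quick way to keep those availability targets honest is to convert them into an error budget; a minimal Python sketch (the 30-day month is my assumption):
# Downtime allowed per 30-day month for a given availability target.
def monthly_error_budget_minutes(availability: float, days: int = 30) -> float:
    total_minutes = days * 24 * 60
    return total_minutes * (1.0 - availability)

for target in (0.999, 0.9999):   # the A2/A3-style targets from the SLO list above
    budget = monthly_error_budget_minutes(target)
    print(f"{target:.4%} -> {budget:.1f} min/month of allowed downtime")
# 99.9 % leaves roughly 43 minutes per month; 99.99 % leaves about 4.3 minutes.
Those few minutes are what the fail-away mechanics in the rest of this article have to fit inside.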
CAM mapping: how Tier-0+ hits Tier-3-equivalent service
Goal: achieve CAM Tier 3 for an A2/A3 workload even when individual edge sites are fragile.
Suggested targets:
- I-PWR = 1–2 (small UPS, no genset)
- I-COOL = 1–2 (fan or compact liquid)
- I-NWK = 4 (two ISPs/carriers, Anycast, dual DNS, RPKI)
- I-DATA = 4 (multi-region quorum, RPO 0 s to core, immutable backups)
- I-CTRL = 4 (active-active control planes, signed releases, independent admins)
Composite I-Score = round((2 + 2 + 4 + 4 + 4) / 5) = round(3.2) = 3, which puts an A2/A3 workload at I3 in CAM Tier 2/3. Bump either I-NWK or I-CTRL to 5 (e.g., stronger independence) and the average still rounds to I3, so the Tier holds at 3. The key point: for stateless edge roles, three strong pillars neutralize two weak ones.
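Here is the same arithmetic as a tiny Python sketch you can rerun with your own pillar scores (the values are the suggested targets above; mapping the resulting I-Score to a Tier still comes from the CAM A-Level tables, not from this snippet):
# Composite I-Score = rounded mean of the five pillar scores.
pillars = {"I-PWR": 2, "I-COOL": 2, "I-NWK": 4, "I-DATA": 4, "I-CTRL": 4}

def composite_i_score(scores: dict[str, int]) -> int:
    return round(sum(scores.values()) / len(scores))

print(composite_i_score(pillars))                  # 3 -> I3
print(composite_i_score({**pillars, "I-NWK": 5}))  # still 3: one bumped pillar keeps I3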
Bill of materials (per Tier-0+ site)
- Compute: short-depth 1U/2U node, 1× L4/A10-class GPU (or smaller); 128–256 GB RAM; 2× NVMe (cache only).
- Power: 1.5 kVA line-interactive UPS, networked PDU; dry contacts to tamper sensor.
- Cooling: fan-assisted enclosure OR small D2C kit with CDU in cabinet (if >700 W sustained).
- Network: 1× primary DIA, 1× 5G multi-SIM or second DIA; SD-WAN/edge router with BGP; OOB LTE for management optional.
- Security: TPM 2.0; secure boot; lockable case; camera on door sensor if feasible.
Network plan (I-NWK 4 without drama)
- Two ISPs or ISP + 5G; physically diverse entrances where possible.
- BGP multihoming at metro PoPs; edges don’t need to speak BGP to the internet—terminate Anycast at the PoP and tunnel to edges.
- Dual DNS providers with DNSSEC; health-check from outside each ASN.
- RPKI ROAs on your prefixes; MANRS hygiene.
- QUIC/HTTP-3 at ingress to cut head-of-line blocking for token streaming.
Minimal FRR Anycast at PoP (illustrative):
router bgp 65010
 neighbor ISP1 peer-group
 neighbor ISP1 remote-as 64500
 neighbor ISP2 peer-group
 neighbor ISP2 remote-as 64510
 ! bind each upstream's real neighbor IPs to the ISP1/ISP2 peer-groups
 address-family ipv4 unicast
  ! 203.0.113.0/24 is your Anycast prefix
  network 203.0.113.0/24
  maximum-paths 2
 exit-address-family
!
! pull-up route so the locally originated Anycast prefix resolves
ip route 203.0.113.0/24 Null0 254
Data plan (I-DATA 4 with RPO 0 s to core)
- Edge: no durable truth. Use local NVMe only as a warm cache (models, tiles, embeddings).
- Metro: write-buffer/feature store with bounded queues that tolerate brief WAN loss (see the buffer sketch after this list).
- Core: three-way quorum (e.g., Postgres Patroni in two colos + witness, or CockroachDB/Yugabyte); RPO 0 s across quorum.
- Backups: WORM snapshots to third account/tenant; quarterly restore drills; hash attestations.
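Here is a minimal sketch of the metro write-buffer referenced above: a bounded queue in front of the core that sheds the oldest entries during a WAN outage instead of blocking the edge (queue depth, batch size, and the send_to_core callable are illustrative assumptions, not any particular product):
import collections, time

class BoundedWriteBuffer:
    """Metro-tier buffer: absorbs writes during brief WAN loss, flushes to core when it returns."""
    def __init__(self, send_to_core, max_items: int = 10_000):
        self.send_to_core = send_to_core                   # callable(batch) -> True on durable ack from core
        self.queue = collections.deque(maxlen=max_items)   # on overflow, the oldest entries are dropped

    def write(self, record: dict) -> None:
        record["buffered_at"] = time.time()
        self.queue.append(record)                          # never blocks the edge-facing path

    def flush(self, batch_size: int = 500) -> int:
        flushed = 0
        while self.queue:
            batch = [self.queue.popleft() for _ in range(min(batch_size, len(self.queue)))]
            if not self.send_to_core(batch):               # WAN still down: put the batch back and stop
                self.queue.extendleft(reversed(batch))
                break
            flushed += len(batch)
        return flushed
The point is the failure mode: when the core is unreachable the edge keeps answering, and the only thing at risk is the least-recent buffered telemetry, never the durable truth in the core.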
Restore drill snippet (Postgres example)
- Create new cluster in Colo-B from WORM snapshot timestamp T.
- Re-point read-only analytics to the restored follower; verify checksums and row counts (a verification sketch follows this list).
- Promote to read/write in staging; execute synthetic transactions; cut back.
- Record RTO, operator steps, anomalies; attach logs to Evidence Pack.
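The checksum/row-count step above is easy to script; a hedged Python sketch using psycopg2 (the table list, DSNs, and the per-table count-plus-hash comparison are illustrative assumptions):
import psycopg2   # assumes the psycopg2 driver is installed

TABLES = ["orders", "features", "audit_log"]   # hypothetical tables to verify

def table_fingerprint(dsn: str, table: str):
    """Row count plus an order-independent content hash for one table."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(f"SELECT count(*), md5(string_agg(t::text, '' ORDER BY t::text)) FROM {table} t")
        return cur.fetchone()   # (count, digest)

def verify_restore(prod_dsn: str, restored_dsn: str) -> bool:
    ok = True
    for table in TABLES:
        prod, restored = table_fingerprint(prod_dsn, table), table_fingerprint(restored_dsn, table)
        match = prod == restored
        ok = ok and match
        print(f"{table}: prod={prod} restored={restored} {'OK' if match else 'MISMATCH'}")
    return ok   # attach this output to the Evidence Pack alongside the RTO timings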
Control plane (I-CTRL 4 that heals itself)
- Two independent control-plane clusters (metro-A and metro-B) mirrored with GitOps; edges subscribe to both.
- Admission control enforces signatures (COSIGN) on images, models, policies; un-signed = denied.
- Two admin teams or at minimum two distinct cloud accounts/tenants; least-privilege RBAC; hardware-token MFA.
- Break-glass: smartcard stored offline; use audited vault; quarterly tested.
OPA admission rule (conceptual Rego):
package admission
default allow = false
allow {
  # evaluate Pod admission requests only
  input.request.kind.kind == "Pod"
  # every container image must verify against a trusted signature
  # (sigstore.verify_image stands in for your signature-check integration; it is not an OPA built-in)
  sigstore.verify_image(input.request.object.spec.containers[_].image)
  # and the workload must ride the stable release track
  input.request.object.metadata.labels["release.track"] == "stable"
}
Power and cooling (embracing “fragile but fast”)
- UPS runtime sized only for graceful drain and fail-over (10 minutes is typically enough).
- Consider fanless boxes up to ~300 W; above that, direct-to-chip liquid or a rear-door HX makes I-COOL=2 practical without adding gensets.
- Monitor inlet temp and ΔT; if envelope exceeded, fail traffic to metro automatically (health probe) and power down node.
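A minimal sketch of that last bullet as a local probe (the ΔT limit and metric sources are assumptions; the 45 °C / 2 min values mirror the runbook later in this article):
INLET_MAX_C = 45.0     # matches the runbook threshold later in this article
DELTA_T_MAX_C = 20.0   # assumed inlet-to-outlet rise limit for this enclosure
SUSTAINED_S = 120      # the excursion must persist this long before we act

class ThermalProbe:
    """Flags the node for evacuation once the thermal envelope has been exceeded for SUSTAINED_S."""
    def __init__(self):
        self.breach_since = None

    def check(self, inlet_c: float, outlet_c: float, now_s: float) -> str:
        breached = inlet_c > INLET_MAX_C or (outlet_c - inlet_c) > DELTA_T_MAX_C
        if not breached:
            self.breach_since = None
            return "healthy"
        if self.breach_since is None:
            self.breach_since = now_s
        if now_s - self.breach_since >= SUSTAINED_S:
            return "evacuate"   # cordon + drain, shift traffic to the metro PoP, then power down
        return "watch"
Wire the "evacuate" result into whatever cordons the node and withdraws it from the tunnel/Anycast pool; the power-down should only follow a confirmed drain.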
Security (zero-trust by default)
- SPIFFE/SPIRE for workload identity; mTLS everywhere (Envoy/Linkerd).
- Supply-chain: SBOMs (Syft/Grype), SLSA-attested builds; Rekor transparency logs for model artifacts.
- Device bootstrap: secure boot + TPM attestation → SPIRE join token; rotate SVIDs every 24 h.
Failure-mode drills (make it muscle memory)
Run these quarterly at a subset of sites (or all, if automated):
- Pull the plug at an edge cabinet. Expect: drain in < 30 s; Anycast shifts to nearest PoP; SLOs hold for ≥99 % of users.
- Delete the wrong model tag. Expect: admission controller blocks deploy; no traffic served from unsigned artifact.
- Kill metro PoP-A. Expect: PoP-B takes Anycast; edges reconnect; control-plane latency increases but SLO within budget.
- Corrupt a replica in core DB. Expect: checksum mismatch triggers rebuild; RPO 0 s maintained; WORM untouched.
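Each drill’s “Expect” line is easiest to judge from outside the fleet; here is a small probe sketch that measures user-facing availability while a drill runs (the Anycast URL, one-second cadence, and 99 % pass bar are assumptions matching the first drill above):
import time, urllib.request

def probe_during_drill(url: str, duration_s: int = 300, interval_s: float = 1.0) -> float:
    """Hit the Anycast VIP from outside the fleet for the length of the drill; return the success ratio."""
    ok = total = 0
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        total += 1
        try:
            with urllib.request.urlopen(url, timeout=2) as resp:
                ok += 1 if resp.status == 200 else 0
        except OSError:
            pass   # count as a failed request
        time.sleep(interval_s)
    return ok / max(total, 1)

# Example: the pull-the-plug drill passes if >= 99 % of probes succeeded.
ratio = probe_during_drill("https://edge.example.net/healthz")   # hypothetical Anycast endpoint
print(f"availability during drill: {ratio:.2%}  (pass: {ratio >= 0.99})")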
Cost reality check (why this beats “Tier-3 everywhere”)
Back-of-napkin for a 50-site fleet supporting 100k MAU:
- Tier-0+ site: $6–10k cap-ex (node + UPS + enclosure) + $250–$450/mo opex (power, DIA/5G).
- Two metro PoPs: $12–20k/mo each (space/power, dual ISPs).
- Two core colos: existing or $20–30k/mo incremental for DB quorum + object store.
Compare: building three Tier-3 minis at $2–3M cap-ex each and $60–90k/mo opex—yet still farther (higher RTT) from users. The Tier-0+ approach typically saves 40–60 % TCO while improving p95 latency 2–5×.
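Spelling the napkin out in code makes the assumptions easy to swap (a 3-year horizon and the midpoint of each quoted range are my choices):
# 3-year totals from the napkin figures above, using the midpoint of each quoted range.
MONTHS = 36

def tco(capex: float, opex_per_month: float, count: int = 1) -> float:
    return count * (capex + opex_per_month * MONTHS)

tier0_fleet = (
    tco(capex=8_000, opex_per_month=350, count=50)      # 50 Tier-0+ sites
    + tco(capex=0, opex_per_month=16_000, count=2)      # two metro PoPs
    + tco(capex=0, opex_per_month=25_000, count=2)      # incremental core colo spend
)
tier3_minis = tco(capex=2_500_000, opex_per_month=75_000, count=3)
print(f"Tier-0+ fleet : ${tier0_fleet:,.0f}")           # roughly $4.0M over three years
print(f"3x Tier-3 mini: ${tier3_minis:,.0f}")           # roughly $15.6M over three years
Swap in your own quotes and horizon; costs both designs share (engineering time, support, egress) are deliberately left out of the napkin.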
CAM certification path (what to file and when)
- Choose A-Level (A2 or A3).
- Score pillars with evidence (a manifest sketch follows this list):
  - I-PWR: UPS logs, SLD, autonomy calc.
  - I-COOL: thermal map, envelope adherence.
  - I-NWK: contracts, traceroutes, RPKI ROAs, dual DNS proof.
  - I-DATA: replica status, restore report with timings, WORM config.
  - I-CTRL: GitOps config, admission policies, COSIGN verify logs, break-glass test.
- Run cam-cli; target Tier 3. If computed tier < target, fix the cheapest pillar first (usually I-CTRL or I-NWK).
- Optional Platinum: stream live telemetry (UPS runtime, BGP sessions, quorum health, control-plane SLO) to Trust Hub.
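One way to keep the pillar evidence reviewable is a hashed manifest, as mentioned in the list above; a minimal sketch (the evidence/ directory layout and file names are hypothetical):
import hashlib, json, pathlib, time

def build_evidence_manifest(root: str) -> dict:
    """Walk an evidence directory laid out as <root>/<pillar>/<file> and hash every artifact."""
    manifest = {"generated_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()), "pillars": {}}
    for path in sorted(pathlib.Path(root).rglob("*")):
        if not path.is_file():
            continue
        pillar = path.relative_to(root).parts[0]   # e.g. "I-NWK", "I-DATA"
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        manifest["pillars"].setdefault(pillar, []).append(
            {"file": str(path.relative_to(root)), "sha256": digest}
        )
    return manifest

# Example layout (hypothetical): evidence/I-PWR/ups-log.csv, evidence/I-NWK/rpki-roas.json, ...
print(json.dumps(build_evidence_manifest("evidence"), indent=2))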
Pitfalls and how to avoid them
- Single DNS provider: halves your I-NWK. Add a second with DNSSEC and independent ASN.
- Backups in same cloud account: ransomware eats everything. Use cross-account WORM with MFA-delete.
- Mutable edge nodes: config drift causes weird outages. Bake immutable images; redeploy, don’t patch.
- Over-investing in UPS runtime: don’t. Spend on Anycast and GitOps to fail away, not on batteries to ride out hours.
Runbook snippet (operations you’ll actually use)
- Health probe policy: edge node marked unhealthy if inlet > 45 °C for 2 min, UPS on battery, or SPIRE SVID renewal fails → cordon + drain + evacuate.
- Traffic policy: 70/30 split across two metros; the PoP with lower p95 latency gets more weight; cap any single PoP at 80 % (a weighting sketch follows this list).
- Rollout: blue/green with 5 % canary at one metro; synthetic and real traffic checks; automatic rollback on p95 deterioration > 20 %.
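The traffic-policy bullet above boils down to a small weighting function; a sketch, assuming each metro reports a current p95 and that 80 % is the single-PoP cap:
MAX_SHARE = 0.80   # cap any single PoP at 80 % of traffic (runbook value above)

def metro_weights(p95_ms: dict[str, float]) -> dict[str, float]:
    """Weight PoPs inversely to their p95 latency, then clamp any PoP to the single-PoP cap."""
    inverse = {pop: 1.0 / latency for pop, latency in p95_ms.items()}
    total = sum(inverse.values())
    weights = {pop: value / total for pop, value in inverse.items()}
    over = {pop: share - MAX_SHARE for pop, share in weights.items() if share > MAX_SHARE}
    if over and len(weights) > 1:
        spill = sum(over.values())
        under = [pop for pop in weights if pop not in over]
        for pop in over:
            weights[pop] = MAX_SHARE
        for pop in under:
            weights[pop] += spill / len(under)   # hand the clamped excess to the other PoPs
    return weights

print(metro_weights({"metro-a": 28.0, "metro-b": 41.0}))   # ~0.59 / 0.41 for these example p95s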
Executive summary and one-liner
Design for the application, not the building. Put fast, disposable nodes where your users are; anchor truth in two independent cores; bind it with Anycast, quorum, and signed control. That’s how a fleet of Tier-0+ sites delivers Tier-3 service—with better latency and lower TCO.
One-liner: “Make sites cheap and the system smart; CAM will prove it’s just as reliable.”