Shade/docs/PRODUCTION-CHECKLIST.md
Sterister e6fdf31b49
release(v4.0.0): Shade GA — V3.x consolidation + audit prep
V3.1 → V3.12 consolidated and tagged for the first GA release. Wire
format unchanged from 0.4.x — 4.0 peers interoperate with 0.4.x peers
byte-for-byte. The version bump is semantic: audit-cycle complete,
opt-in surface fully exposed, threat model refreshed for every new
surface.

Highlights:
- All 24 @shade/* packages bumped to 4.0.0 in lockstep.
- CHANGELOG 4.0.0 section is the canonical manifest of what landed.
- THREAT-MODEL extended (§10 fingerprint gates, §11 WebRTC P2P, §12
  Web-Worker boundary) + residual-risks table refreshed.
- OpenAPI now covers all 27 routes: prekey, transfer, KT, inbox,
  bridge, observer, /metrics, /healthz, /ready.
- MIGRATION 0.3.x → 4.0 documented + smoke-tested against
  shade migrate-storage on a real SQLite DB.
- docs/audit/REVIEW-BUNDLE.md + SCOPE.md ready for external reviewer.
- scripts/soak.ts harness for the GA-stable 2-week soak window.
- All V*.md plans archived under docs/archive/ with Status: Done.
- Voice/Video carved out into V5.0; 4.0 audit focuses on the frozen
  non-realtime stack.

Tests: TS 1000/1000 + Kotlin 11/11 cross-platform vectors green.
Docker: gt.zyon.no/stian/shade-prekey:4.0.0 builds and reports
  version 4.0.0 on /health.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 18:35:35 +02:00


Shade Production Checklist

A flat punch-list for taking a Shade prekey server from "it boots" to "production-ready". Every item below is a hard gate — if you can't tick it, don't ship.

The deeper "why" behind each item lives in THREAT-MODEL.md, SECURITY.md, and docs/DEPLOYMENT.md. This file is the operator's checklist.

Scope: a single Shade prekey container (@shade/server) plus any consumer apps that talk to it. For E2EE file transfer hardening (max-size, retention, quotas), see the Hardening and Retention sections of docs/streams.md.


1. TLS termination

  • Public traffic is TLS 1.2+ only — Shade itself speaks plain HTTP and assumes a reverse proxy (Caddy, Traefik, nginx, Dokploy's built-in proxy) terminates TLS in front of it.
  • HSTS is on (Strict-Transport-Security: max-age=15552000).
  • The proxy is configured to pass the original Host header through so signed payloads bound to the canonical address don't trip the replay-window check on a mismatch.
  • Internal traffic between consumer apps and the prekey container runs on a private network (Docker bridge / VPC); the prekey port is not exposed to the public internet without TLS in front.

Why: identity signatures and observer bearer tokens travel in request bodies / headers. Without TLS, a network attacker can read the observer token and replay it for the full validity window, and can read the metadata (who registers, who fetches whose bundle). See THREAT-MODEL.md § 1 (network attacker).
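
The HSTS item above can be verified mechanically. The following is a minimal sketch, not part of Shade: the function names are hypothetical, and it only parses a header value you have already fetched from your proxy.

```typescript
// 15552000 s = 180 days, the max-age this checklist asks for.
const REQUIRED_MAX_AGE = 15552000;

// Extract max-age from a Strict-Transport-Security header value.
// Returns null when the directive is missing or not a plain integer.
function hstsMaxAge(headerValue: string): number | null {
  for (const directive of headerValue.split(";")) {
    const parts = directive.trim().split("=");
    const name = (parts[0] ?? "").trim().toLowerCase();
    if (name !== "max-age") continue;
    // The value may be quoted; strip one pair of surrounding quotes.
    const raw = (parts[1] ?? "").trim().replace(/^"|"$/g, "");
    return /^\d+$/.test(raw) ? Number(raw) : null;
  }
  return null;
}

// True only when the header is present and max-age meets the floor.
function hstsIsSufficient(headerValue: string | null): boolean {
  if (headerValue === null) return false;
  const maxAge = hstsMaxAge(headerValue);
  return maxAge !== null && maxAge >= REQUIRED_MAX_AGE;
}
```

In a smoke test you might pass it the result of `response.headers.get("strict-transport-security")` from a `fetch` against your public hostname.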

2. Backups

  • SQLite: scheduled sqlite3 /data/shade-prekeys.db ".backup ..." at least daily. The .db file plus its -wal and -shm companions together form the recovery unit; while the container is running, never copy the bare .db file directly, and use the online backup API instead.
  • Postgres: pg_dump (or your provider's snapshot) at least daily; verify a restore at least once per quarter.
  • Backups are stored on different infrastructure than the primary volume (different host / region / provider).
  • Backups are encrypted at rest (your storage provider's server-side encryption, age, or restic with a passphrase).
  • Restore drill: at least once before going live, restore the backup into a fresh volume and confirm /health is green and a registered identity is still resolvable.

Why: prekey records contain identity public keys and one-time prekeys. Losing them means new sessions can't be established to those identities until each user re-registers. Existing sessions keep ratcheting on the device-side state.
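
As an illustration of the SQLite item, here is a sketch of a helper that builds the argv for the sqlite3 CLI's `.backup` dot-command. The helper name and the timestamped file-naming scheme are assumptions made for this example; only the database path comes from this checklist.

```typescript
// Build the argv for an online SQLite backup using the sqlite3 CLI.
// The destination naming scheme (shade-prekeys-<UTC stamp>.db) is an assumption.
function sqliteBackupArgv(dbPath: string, destDir: string, now: Date): string[] {
  // ISO timestamp without milliseconds, with ":" replaced so it is file-name safe.
  const stamp = now.toISOString().replace(/\.\d{3}Z$/, "Z").replace(/:/g, "-");
  const dest = `${destDir.replace(/\/+$/, "")}/shade-prekeys-${stamp}.db`;
  // The path is single-quoted inside the dot-command, so reject embedded quotes.
  if (dest.includes("'")) {
    throw new Error("destination path must not contain a single quote");
  }
  return ["sqlite3", dbPath, `.backup '${dest}'`];
}
```

A scheduler (cron, a systemd timer, or a sidecar container) would run the returned command and then ship the resulting file to separate, encrypted storage.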

3. Observer token rotation

  • SHADE_OBSERVER_TOKEN is set to ≥ 16 chars of high-entropy random data (e.g. openssl rand -hex 32). The server logs a warning and disables the observer if the token is shorter.
  • The token is held in your secret manager (Dokploy secret, GitHub Actions secret, Vault, 1Password CLI), never committed to a compose file or .env checked into git.
  • The token is rotated on a schedule (recommended: every 90 days) and immediately if it has been shared with anyone who no longer needs access.
  • If you expose the dashboard publicly, you also gate it behind basic-auth at the proxy layer — bearer tokens are not revocation-friendly on their own.

Why: the observer dashboard exposes metadata about every active identity, registration timestamp, and recent activity. Anyone with the token can scrape the entire prekey directory.
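
A sketch of token generation and the two checks described above. It mirrors the stated 16-character floor and the 90-day recommendation; it is not the server's own validation code, and the function names are hypothetical.

```typescript
import { randomBytes } from "node:crypto";

const MIN_TOKEN_CHARS = 16; // floor stated in this checklist

// 32 random bytes, hex-encoded: 64 characters, equivalent to `openssl rand -hex 32`.
function generateObserverToken(): string {
  return randomBytes(32).toString("hex");
}

// Length-only check. It cannot measure entropy, so a long but guessable
// token still passes; generate tokens from a CSPRNG as above.
function tokenMeetsLengthFloor(token: string | undefined): boolean {
  return token !== undefined && token.length >= MIN_TOKEN_CHARS;
}

// True when the token is at least maxAgeDays old (default 90, per this checklist).
function rotationDue(lastRotated: Date, now: Date, maxAgeDays = 90): boolean {
  const ageMs = now.getTime() - lastRotated.getTime();
  return ageMs >= maxAgeDays * 24 * 60 * 60 * 1000;
}
```

The rotation check assumes you record the last rotation date somewhere your tooling can read, for example as metadata in the secret manager.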

4. SQLite vs PostgreSQL

Pick one and stick to it.

  • SQLite is the default. Use it when one Shade container is enough, you can tolerate downtime during backup snapshots, and your write rate is below ~500 req/s. Path: SHADE_PREKEY_DB_PATH, default /data/shade-prekeys.db.
  • PostgreSQL is for multi-replica deployments, shared infrastructure, or when you already operate a managed Postgres and want one fewer thing to back up. Path: SHADE_PREKEY_PG_URL. Tables are auto-created with shade_server_* prefix.
  • Whichever you pick, the database lives behind TLS for the connection (sslmode=require for Postgres) and on storage that is itself encrypted (LUKS, EBS encryption, managed-DB encryption).
  • You do not mix them in the same deployment. Setting SHADE_PREKEY_PG_URL overrides SQLite silently — pick one in compose.yml and document which.

Why: Shade does not encrypt the database itself (V3.2 will). Disk-level / volume-level encryption is the operator's responsibility until at-rest encryption ships.
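
The "pick one" rule and the sslmode requirement lend themselves to a pre-deploy check. The sketch below is a hypothetical helper that inspects the two environment variables named above; accepting `verify-ca` and `verify-full` alongside `require` is an assumption based on those libpq modes being stricter.

```typescript
type Env = Record<string, string | undefined>;

// Returns a list of human-readable problems; an empty list means the
// database-related settings look consistent.
function checkDatabaseConfig(env: Env): string[] {
  const problems: string[] = [];
  const sqlitePath = env["SHADE_PREKEY_DB_PATH"];
  const pgUrl = env["SHADE_PREKEY_PG_URL"];

  if (sqlitePath && pgUrl) {
    problems.push("both SHADE_PREKEY_DB_PATH and SHADE_PREKEY_PG_URL are set; Postgres silently wins");
  }
  if (pgUrl) {
    let sslmode: string | null;
    try {
      sslmode = new URL(pgUrl).searchParams.get("sslmode");
    } catch {
      problems.push("SHADE_PREKEY_PG_URL is not a parseable URL");
      return problems;
    }
    const tlsModes = ["require", "verify-ca", "verify-full"];
    if (sslmode === null || !tlsModes.includes(sslmode)) {
      problems.push("SHADE_PREKEY_PG_URL does not enforce TLS (expected sslmode=require or stricter)");
    }
  }
  return problems;
}
```

Running this against the rendered compose environment in CI would surface a mixed configuration before it reaches a host.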

5. Log level and structured logs

  • SHADE_LOG_LEVEL is set to info (production) or warn (high-traffic). Avoid debug in prod — it logs request bodies including signed payloads.
  • Logs are shipped to a retention-bounded sink (Loki, CloudWatch, Datadog) with redaction of Authorization headers and signed bodies if your sink doesn't already strip them.
  • You alert on error-level logs and on the absence of cleanup cycles (a stuck cleanup loop = unbounded DB growth).

Why: at debug level the server logs signature material. While Ed25519 signatures are not secrets per se, leaking them widens the replay-window blast radius and reveals timing patterns.
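
If your log sink does not redact for you, a small transform in the shipping pipeline can. This is a sketch under assumptions: the record shape (a `headers` object and a `body` field) is hypothetical and may not match the server's actual log schema.

```typescript
type LogRecord = Record<string, unknown>;

const REDACTED = "[redacted]";

// Return a copy of the record with the Authorization header and the
// request body replaced by a placeholder. The input is not mutated.
function redactLogRecord(record: LogRecord): LogRecord {
  const out: LogRecord = { ...record };

  const headers = out["headers"];
  if (typeof headers === "object" && headers !== null) {
    const cleaned: Record<string, unknown> = {};
    for (const [name, value] of Object.entries(headers)) {
      // Header names are case-insensitive.
      cleaned[name] = name.toLowerCase() === "authorization" ? REDACTED : value;
    }
    out["headers"] = cleaned;
  }

  if ("body" in out) {
    out["body"] = REDACTED;
  }
  return out;
}
```

Dropping the whole body is deliberately blunt; a field-level allowlist would preserve more context at the cost of more maintenance.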

6. Stale-identity cleanup parameters

  • SHADE_STALE_DAYS is set deliberately for your product. The default (30 days) is right for "active chat app"; "occasional use" apps should bump to 90+ to avoid surprise re-registration.
  • SHADE_CLEANUP_INTERVAL_HOURS is left at 24 unless you have a specific reason. Running cleanup more often does not free more space, and running it less often means a single missed cycle leaves stale records in place for days.
  • You watch the shade_cleanup_purged_total metric (Prometheus) and alert on a sudden 10× spike — that often signals a bug or a deployment that broke client-side activity timestamps.

Why: stale cleanup is the only thing keeping the prekey directory from growing forever. A misconfigured SHADE_STALE_DAYS = 0 would nuke every identity on every cycle. Bound the value at ≥ 1 in your deployment config.
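
One way to enforce the ≥ 1 bound is to validate the value in deployment tooling before the container starts. A minimal sketch, assuming the variable arrives as a string and that 30 is the default as stated above; the helper names are hypothetical.

```typescript
// Parse SHADE_STALE_DAYS, refusing anything that is not an integer >= 1.
// An unset or blank value falls back to the documented default of 30.
function parseStaleDays(raw: string | undefined, fallback = 30): number {
  if (raw === undefined || raw.trim() === "") return fallback;
  const days = Number(raw);
  if (!Number.isInteger(days) || days < 1) {
    throw new Error(`SHADE_STALE_DAYS must be an integer >= 1, got "${raw}"`);
  }
  return days;
}

// True when the purge count for the current cycle is at least 10x the
// baseline, the spike this checklist suggests alerting on.
function isPurgeSpike(currentPurged: number, baselinePurged: number): boolean {
  return baselinePurged > 0 && currentPurged >= 10 * baselinePurged;
}
```

Failing loudly on `0` is the point: a deploy that aborts is cheaper than a cleanup cycle that purges every identity.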

7. Secret rotation

  • Identity signing keys: each consumer rotates via the documented identity-rotation flow (7-day grace period for old sessions). Operators do not touch identity keys directly.
  • Observer token: see § 3.
  • Database credentials (Postgres only): rotate per your standard cadence, with the connection string supplied through the secret manager.
  • No long-lived API keys or service tokens are stored in the container image or volume.

8. Rate-limit and body-size caps

  • You have not lowered the built-in rate limits below the defaults (per-IP register/bundle and per-identity replenish/delete).
  • You have not raised the 64 KiB POST body limit. Prekey bundles fit comfortably; raising the limit only enables abuse.
  • Your reverse proxy enforces an additional connection / request-rate limit at the edge (Caddy rate_limit, Cloudflare, etc.) so a single noisy IP can't even reach Shade's per-route limits.

9. Health checks and metrics scrape

  • Container has a Docker HEALTHCHECK (the official image already ships one against /health).
  • /metrics is scraped by Prometheus / OpenTelemetry and retained ≥ 30 days.
  • Alerts are wired for: /health failing for > 2 min, request latency p99 > 1 s, error rate > 1 %, cleanup cycles missing for > 25 h.
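
The alert thresholds above can be written down as a single predicate, which is handy for unit-testing alert rules before translating them into your monitoring system. The snapshot shape and alert names below are hypothetical; only the thresholds come from this checklist.

```typescript
interface MetricsSnapshot {
  healthFailingSeconds: number;  // how long /health has been failing
  p99LatencySeconds: number;     // request latency, 99th percentile
  errorRate: number;             // fraction of requests that errored, 0..1
  hoursSinceLastCleanup: number; // time since the last completed cleanup cycle
}

// Return the names of every alert whose threshold is exceeded.
function firingAlerts(s: MetricsSnapshot): string[] {
  const alerts: string[] = [];
  if (s.healthFailingSeconds > 120) alerts.push("health-failing");
  if (s.p99LatencySeconds > 1) alerts.push("latency-p99");
  if (s.errorRate > 0.01) alerts.push("error-rate");
  if (s.hoursSinceLastCleanup > 25) alerts.push("cleanup-missing");
  return alerts;
}
```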

10. OpenAPI contract drift

  • CI runs the OpenAPI lint (bun test packages/shade-server/tests/openapi-lint.test.ts) on every PR — the spec must remain valid OpenAPI 3.1 with no dangling $refs.
  • Generated clients (Python, Go, Kotlin) are regenerated from the shipped spec on each release; mismatches between server and client are caught at integration test time, not production.
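
To make "no dangling $refs" concrete, here is a sketch of a checker for local references in an already-parsed spec. It is an illustration of the idea, not the contents of openapi-lint.test.ts, and it ignores external file or URL references.

```typescript
// Resolve a local JSON Pointer reference such as "#/components/schemas/Bundle".
// Returns undefined when any path segment is missing.
function resolveLocalRef(root: unknown, ref: string): unknown {
  let node: unknown = root;
  for (const rawToken of ref.slice(2).split("/")) {
    // Undo URI-fragment escaping, then JSON Pointer escaping (~1 is "/", ~0 is "~").
    const token = decodeURIComponent(rawToken).replace(/~1/g, "/").replace(/~0/g, "~");
    if (typeof node !== "object" || node === null || !(token in node)) {
      return undefined;
    }
    node = (node as Record<string, unknown>)[token];
  }
  return node;
}

// Walk the whole spec and collect every local $ref that does not resolve.
function danglingRefs(spec: unknown): string[] {
  const missing: string[] = [];
  const visit = (node: unknown): void => {
    if (Array.isArray(node)) {
      for (const item of node) visit(item);
      return;
    }
    if (typeof node !== "object" || node === null) return;
    for (const [key, value] of Object.entries(node)) {
      if (key === "$ref" && typeof value === "string") {
        if (value.startsWith("#/") && resolveLocalRef(spec, value) === undefined) {
          missing.push(value);
        }
      } else {
        visit(value);
      }
    }
  };
  visit(spec);
  return missing;
}
```

A CI test would load the shipped spec, call `danglingRefs`, and assert that the result is empty.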

Pre-flight summary

If you can answer "yes" to every box above, ship it. If you can't, write down which box and why before you do — that note belongs in your runbook so the next operator inherits the gap, not the surprise.