Shade Production Checklist
A flat punch-list for taking a Shade prekey server from "it boots" to "production-ready". Every item below is a hard gate — if you can't tick it, don't ship.
The deeper "why" behind each item lives in THREAT-MODEL.md,
SECURITY.md, and docs/DEPLOYMENT.md. This file is the operator's
checklist.
Scope: a single Shade prekey container (`@shade/server`) plus any consumer apps that talk to it. For E2EE file transfer hardening (max-size, retention, quotas), see the Hardening and Retention sections of docs/streams.md.
1. TLS termination
- Public traffic is TLS 1.2+ only — Shade itself speaks plain HTTP and assumes a reverse proxy (Caddy, Traefik, nginx, Dokploy's built-in proxy) terminates TLS in front of it.
- HSTS is on (`Strict-Transport-Security: max-age=15552000`).
- The proxy is configured to pass the original `Host` header through so signed payloads bound to the canonical address don't trip the replay-window check on a mismatch.
- Internal traffic between consumer apps and the prekey container runs on a private network (Docker bridge / VPC); the prekey port is not exposed to the public internet without TLS in front.
Why: identity signatures and observer bearer tokens travel in request bodies / headers. Without TLS, a network attacker can read the observer token and replay it for the full validity window, and can read the metadata (who registers, who fetches whose bundle). See THREAT-MODEL.md § 1 (network attacker).
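The HSTS item can be verified mechanically from whatever header your proxy returns. The following is a minimal sketch, not part of Shade: the function names are hypothetical, and the only value taken from the checklist is the 15552000-second minimum.

```typescript
// Minimum max-age from the checklist: 15552000 seconds (180 days).
const MIN_HSTS_MAX_AGE = 15552000;

// Extracts max-age (in seconds) from a Strict-Transport-Security header value.
// Returns null when the directive is missing or malformed.
function parseHstsMaxAge(headerValue: string): number | null {
  for (const directive of headerValue.split(";")) {
    const match = /^\s*max-age\s*=\s*"?(\d+)"?\s*$/i.exec(directive);
    if (match) return Number(match[1]);
  }
  return null;
}

// True only when the header is present and its max-age meets the minimum.
function hstsIsSufficient(headerValue: string | null): boolean {
  if (headerValue === null) return false;
  const maxAge = parseHstsMaxAge(headerValue);
  return maxAge !== null && maxAge >= MIN_HSTS_MAX_AGE;
}
```

In a smoke test you would pass in the `strict-transport-security` header read from a response to your public hostname; a `null` argument stands for "header absent".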
2. Backups
- SQLite: scheduled `sqlite3 /data/shade-prekeys.db ".backup ..."` at least daily. The `.db` file plus `-wal` and `-shm` together is the recovery unit; never copy the bare `.db` while the container is running without using the online backup API.
- Postgres: `pg_dump` (or your provider's snapshot) at least daily; verify a restore at least once per quarter.
- Backups are stored on different infrastructure than the primary volume (different host / region / provider).
- Backups are encrypted at rest (your storage provider's server-side encryption, age, or restic with a passphrase).
- Restore drill: at least once before going live, restore the backup into a fresh volume and confirm `/health` is green and a registered identity is still resolvable.
Why: prekey records contain identity public keys and one-time prekeys. Losing them means new sessions can't be established to those identities until each user re-registers. Existing sessions keep ratcheting on the device-side state.
3. Observer token rotation
- `SHADE_OBSERVER_TOKEN` is set to ≥ 16 chars of high-entropy random data (e.g. `openssl rand -hex 32`). The server logs a warning and disables the observer if the token is shorter.
- The token is held in your secret manager (Dokploy secret, GitHub Actions secret, Vault, 1Password CLI), never committed to a compose file or `.env` checked into git.
- The token is rotated on a schedule (recommended: every 90 days) and immediately if it has been shared with anyone who no longer needs access.
- If you expose the dashboard publicly, you also gate it behind basic-auth at the proxy layer — bearer tokens are not revocation-friendly on their own.
Why: the observer dashboard exposes metadata about every active identity, registration timestamp, and recent activity. Anyone with the token can scrape the entire prekey directory.
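The length and rotation rules lend themselves to a deploy-time guard. The sketch below is an assumption-laden illustration: `observerTokenProblems` and the `issuedAt` timestamp are hypothetical (you would track the issue date in your secret manager), and only the 16-character minimum and 90-day interval come from this checklist. A length check says nothing about entropy, so it complements a proper random generator without replacing it.

```typescript
const MIN_TOKEN_LENGTH = 16;   // below this the server disables the observer
const MAX_TOKEN_AGE_DAYS = 90; // recommended rotation interval
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Returns a list of human-readable problems; an empty list means the token passes.
function observerTokenProblems(token: string, issuedAt: Date, now: Date): string[] {
  const problems: string[] = [];
  if (token.length < MIN_TOKEN_LENGTH) {
    problems.push(`token is ${token.length} chars, need at least ${MIN_TOKEN_LENGTH}`);
  }
  const ageDays = (now.getTime() - issuedAt.getTime()) / MS_PER_DAY;
  if (ageDays > MAX_TOKEN_AGE_DAYS) {
    problems.push(`token is ${Math.floor(ageDays)} days old, rotate every ${MAX_TOKEN_AGE_DAYS}`);
  }
  return problems;
}
```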
4. SQLite vs PostgreSQL
Pick one and stick to it.
- SQLite is the default. Use it when one Shade container is enough, you can tolerate downtime during backup snapshots, and your write rate is below ~500 req/s. Path: `SHADE_PREKEY_DB_PATH`, default `/data/shade-prekeys.db`.
- PostgreSQL is for multi-replica deployments, shared infrastructure, or when you already operate a managed Postgres and want one fewer thing to back up. Path: `SHADE_PREKEY_PG_URL`. Tables are auto-created with the `shade_server_*` prefix.
- Whichever you pick, any network connection to the database uses TLS (`sslmode=require` for Postgres) and the database sits on storage that is itself encrypted (LUKS, EBS encryption, managed-DB encryption).
- You do not mix them in the same deployment. Setting `SHADE_PREKEY_PG_URL` overrides SQLite silently, so pick one in `compose.yml` and document which.
Why: Shade does not encrypt the database itself (V3.2 will). Disk-level / volume-level encryption is the operator's responsibility until at-rest encryption ships.
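Because the Postgres URL silently wins over the SQLite path, a pre-deploy check on the environment can catch an accidental mix. This is a hedged sketch: `storageConfigProblems` is a hypothetical helper, the environment is passed in as a plain record so the function stays testable, and the pattern accepts only `sslmode=require` as named above (a stricter mode such as `verify-full` would need to be added to the pattern).

```typescript
type Env = Record<string, string | undefined>;

// Flags the two storage misconfigurations described in this section.
function storageConfigProblems(env: Env): string[] {
  const problems: string[] = [];
  const pgUrl = env["SHADE_PREKEY_PG_URL"];
  const dbPath = env["SHADE_PREKEY_DB_PATH"];

  if (pgUrl && dbPath) {
    problems.push("both SHADE_PREKEY_PG_URL and SHADE_PREKEY_DB_PATH are set; Postgres overrides SQLite silently");
  }
  // Look for sslmode=require as a query parameter of the connection URL.
  if (pgUrl && !/[?&]sslmode=require(&|$)/.test(pgUrl)) {
    problems.push("SHADE_PREKEY_PG_URL does not set sslmode=require");
  }
  return problems;
}
```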
5. Log level and structured logs
- `SHADE_LOG_LEVEL` is set to `info` (production) or `warn` (high-traffic). Avoid `debug` in prod: it logs request bodies including signed payloads.
- Logs are shipped to a retention-bounded sink (Loki, CloudWatch, Datadog) with redaction of `Authorization` headers and signed bodies if your sink doesn't already strip them.
- You alert on `error`-level logs and on the absence of cleanup cycles (a stuck cleanup loop = unbounded DB growth).
Why: at `debug` level the server logs signature material. While Ed25519 signatures are not secrets per se, leaking them widens the replay-window blast radius and reveals timing patterns.
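If your sink does not redact on its own, a small transform in the shipping path can do it. The sketch below assumes a hypothetical `LogEntry` shape with optional `headers` and `body` fields; Shade's real log schema is not documented here, so treat the field names as placeholders to adapt.

```typescript
const REDACTED = "[redacted]";

// Hypothetical log-entry shape; adjust to whatever your log shipper receives.
interface LogEntry {
  level: string;
  msg: string;
  headers?: Record<string, string>;
  body?: unknown;
}

// Returns a copy with the Authorization header and any request body masked.
// The input entry is left unmodified.
function redactLogEntry(entry: LogEntry): LogEntry {
  const out: LogEntry = { level: entry.level, msg: entry.msg };
  if (entry.headers) {
    const headers: Record<string, string> = {};
    for (const [name, value] of Object.entries(entry.headers)) {
      headers[name] = name.toLowerCase() === "authorization" ? REDACTED : value;
    }
    out.headers = headers;
  }
  if (entry.body !== undefined) out.body = REDACTED;
  return out;
}
```

Masking the whole body is deliberately blunt: it avoids having to recognise which fields carry signed payloads, at the cost of losing body detail in the sink.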
6. Stale-identity cleanup parameters
- `SHADE_STALE_DAYS` is set deliberately for your product. The default (30 days) is right for "active chat app"; "occasional use" apps should bump to 90+ to avoid surprise re-registration.
- `SHADE_CLEANUP_INTERVAL_HOURS` is left at 24 unless you have a specific reason. Running cleanup more often does not free more space, and running it less often risks one cycle missing a day.
- You watch the `shade_cleanup_purged_total` metric (Prometheus) and alert on a sudden 10× spike, which often signals a bug or a deployment that broke client-side activity timestamps.
Why: stale cleanup is the only thing keeping the prekey directory from growing forever. A misconfigured `SHADE_STALE_DAYS = 0` would nuke every identity on every cycle. Bound the value at ≥ 1 in your deployment config.
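One way to enforce the lower bound is a guard in the deployment tooling that refuses to start with a value below 1. The helpers below are hypothetical and run on the operator's side; whether Shade itself accepts non-integer values is not stated in this checklist, so the integer requirement is an assumption of the sketch. The spike check mirrors the 10× alert rule.

```typescript
const DEFAULT_STALE_DAYS = 30; // documented default

// Parses SHADE_STALE_DAYS, falling back to the default when unset.
// Throws on anything that is not an integer >= 1 (assumption: integers only).
function parseStaleDays(raw: string | undefined): number {
  if (raw === undefined || raw.trim() === "") return DEFAULT_STALE_DAYS;
  const days = Number(raw);
  if (!Number.isInteger(days) || days < 1) {
    throw new Error(`SHADE_STALE_DAYS must be an integer >= 1, got "${raw}"`);
  }
  return days;
}

// True when this cycle purged at least 10x the usual per-cycle count.
// With no baseline yet the ratio is undefined, so the sketch stays quiet.
function isPurgeSpike(purgedThisCycle: number, baselinePerCycle: number): boolean {
  if (baselinePerCycle <= 0) return false;
  return purgedThisCycle >= 10 * baselinePerCycle;
}
```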
7. Secret rotation
- Identity signing keys: each consumer rotates via the documented identity-rotation flow (7-day grace period for old sessions). Operators do not touch identity keys directly.
- Observer token: see § 3.
- Database credentials (Postgres only): rotate per your standard cadence, with the connection string supplied through the secret manager.
- No long-lived API keys or service tokens are stored in the container image or volume.
8. Rate-limit and body-size caps
- You have not loosened the built-in rate limits (per-IP register/bundle and per-identity replenish/delete) beyond the defaults.
- You have not raised the 64 KiB POST body limit. Prekey bundles fit comfortably; raising the limit only enables abuse.
- Your reverse proxy enforces an additional connection / request-rate limit at the edge (Caddy `rate_limit`, Cloudflare, etc.) so a single noisy IP can't even reach Shade's per-route limits.
9. Health checks and metrics scrape
- Container has a Docker `HEALTHCHECK` (the official image already ships one against `/health`).
- `/metrics` is scraped by Prometheus / OpenTelemetry and retained ≥ 30 days.
- Alerts are wired for: `/health` failing for > 2 min, request latency p99 > 1 s, error rate > 1 %, cleanup cycles missing for > 25 h.
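The four alert thresholds above can be written down as one predicate, which is handy for unit-testing alert rules before translating them into your monitoring system's own syntax. The `HealthSample` shape and alert names are hypothetical; only the threshold values come from the list.

```typescript
// Hypothetical snapshot of the signals named in the alert list.
interface HealthSample {
  healthFailingMinutes: number;  // how long /health has been failing
  p99LatencySeconds: number;     // request latency, 99th percentile
  errorRate: number;             // fraction of requests, 0..1
  hoursSinceLastCleanup: number; // time since the last completed cleanup cycle
}

// Returns the names of every alert whose threshold is exceeded.
function firedAlerts(s: HealthSample): string[] {
  const alerts: string[] = [];
  if (s.healthFailingMinutes > 2) alerts.push("health-failing");
  if (s.p99LatencySeconds > 1) alerts.push("latency-p99");
  if (s.errorRate > 0.01) alerts.push("error-rate");
  if (s.hoursSinceLastCleanup > 25) alerts.push("cleanup-missing");
  return alerts;
}
```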
10. OpenAPI contract drift
- CI runs the OpenAPI lint (`bun test packages/shade-server/tests/openapi-lint.test.ts`) on every PR; the spec must remain valid OpenAPI 3.1 with no dangling `$ref`s.
- Generated clients (Python, Go, Kotlin) are regenerated from the shipped spec on each release; mismatches between server and client are caught at integration test time, not production.
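To make "no dangling `$ref`s" concrete, here is a minimal sketch of such a check over an already-parsed spec object. It is not the lint shipped in `openapi-lint.test.ts`, whose contents are not shown here; the function names are hypothetical, and the sketch resolves only local `#/...` JSON Pointer references, reporting anything else (external files, URLs) as unresolved.

```typescript
// Resolves a local reference such as "#/components/schemas/Bundle".
// Returns undefined when any path segment is missing.
function resolveLocalRef(root: unknown, ref: string): unknown {
  if (!ref.startsWith("#/")) return undefined;
  let node: unknown = root;
  for (const rawToken of ref.slice(2).split("/")) {
    // JSON Pointer escapes: ~1 stands for "/", ~0 stands for "~".
    const token = rawToken.replace(/~1/g, "/").replace(/~0/g, "~");
    if (typeof node !== "object" || node === null) return undefined;
    node = (node as Record<string, unknown>)[token];
  }
  return node;
}

// Walks the whole document and collects every $ref that does not resolve.
function findDanglingRefs(root: unknown): string[] {
  const dangling: string[] = [];
  const visit = (node: unknown): void => {
    if (Array.isArray(node)) {
      node.forEach((item) => visit(item));
      return;
    }
    if (typeof node !== "object" || node === null) return;
    for (const [key, value] of Object.entries(node as Record<string, unknown>)) {
      if (key === "$ref" && typeof value === "string") {
        if (resolveLocalRef(root, value) === undefined) dangling.push(value);
      } else {
        visit(value);
      }
    }
  };
  visit(root);
  return dangling;
}
```

A CI test would load the spec file, parse it, and assert that `findDanglingRefs` returns an empty array.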
Pre-flight summary
If you can answer "yes" to every box above, ship it. If you can't, write down which box and why before you do — that note belongs in your runbook so the next operator inherits the gap, not the surprise.