Shade/docs/PRODUCTION-CHECKLIST.md
Sterister e6fdf31b49
release(v4.0.0): Shade GA — V3.x consolidation + audit prep
V3.1 → V3.12 consolidated and tagged for the first GA release. Wire
format unchanged from 0.4.x — 4.0 peers interoperate with 0.4.x peers
byte-for-byte. The version bump is semantic: audit-cycle complete,
opt-in surface fully exposed, threat model refreshed for every new
surface.

Highlights:
- All 24 @shade/* packages bumped to 4.0.0 in lockstep.
- CHANGELOG 4.0.0 section is the canonical manifest of what landed.
- THREAT-MODEL extended (§10 fingerprint gates, §11 WebRTC P2P, §12
  Web-Worker boundary) + residual-risks table refreshed.
- OpenAPI now covers all 27 routes: prekey, transfer, KT, inbox,
  bridge, observer, /metrics, /healthz, /ready.
- MIGRATION 0.3.x → 4.0 documented + smoke-tested against
  shade migrate-storage on a real SQLite DB.
- docs/audit/REVIEW-BUNDLE.md + SCOPE.md ready for external reviewer.
- scripts/soak.ts harness for the GA-stable 2-week soak window.
- All V*.md plans archived under docs/archive/ with Status: Done.
- Voice/Video carved out into V5.0; 4.0 audit focuses on the frozen
  non-realtime stack.

Tests: TS 1000/1000 + Kotlin 11/11 cross-platform vectors green.
Docker: gt.zyon.no/stian/shade-prekey:4.0.0 builds and reports
  version 4.0.0 on /health.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 18:35:35 +02:00

# Shade Production Checklist
A flat punch-list for taking a Shade prekey server from "it boots" to
"production-ready". Every item below is a hard gate: if you can't tick it,
don't ship.

The deeper "why" behind each item lives in `THREAT-MODEL.md`,
`SECURITY.md`, and `docs/DEPLOYMENT.md`. This file is the operator's
checklist.
> Scope: a single Shade prekey container (`@shade/server`) plus any
> consumer apps that talk to it. For E2EE file transfer hardening
> (max-size, retention, quotas), see the **Hardening** and **Retention**
> sections of `docs/streams.md`.
---
## 1. TLS termination
- [ ] Public traffic is **TLS 1.2+ only** — Shade itself speaks plain HTTP
and assumes a reverse proxy (Caddy, Traefik, nginx, Dokploy's
built-in proxy) terminates TLS in front of it.
- [ ] HSTS is on (`Strict-Transport-Security: max-age=15552000`).
- [ ] The proxy is configured to pass the original `Host` header through
so signed payloads bound to the canonical address don't trip the
replay-window check on a mismatch.
- [ ] Internal traffic between consumer apps and the prekey container
runs on a private network (Docker bridge / VPC); the prekey port
is **not** exposed to the public internet without TLS in front.
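
The HSTS item above can be verified from a smoke test by parsing the header your proxy actually returns. A minimal sketch, assuming you already have the raw `Strict-Transport-Security` value in hand; `parseHstsMaxAge` and `hstsIsSufficient` are hypothetical helper names, not part of Shade.

```typescript
// Hypothetical helpers for a deployment smoke test; not part of Shade.

// Minimum max-age from the checklist item above (180 days, in seconds).
const MIN_HSTS_MAX_AGE = 15552000;

// Extract max-age (seconds) from a Strict-Transport-Security header value.
// Returns null when the header is missing or carries no valid max-age.
function parseHstsMaxAge(header: string | null | undefined): number | null {
  if (!header) return null;
  for (const directive of header.split(";")) {
    const [name, rawValue] = directive.trim().split("=");
    if (name.toLowerCase() !== "max-age" || rawValue === undefined) continue;
    // RFC 6797 allows the directive value to be a quoted string.
    const value = rawValue.trim().replace(/^"(.*)"$/, "$1");
    if (/^\d+$/.test(value)) return Number(value);
  }
  return null;
}

// True when the header is present and max-age meets the checklist minimum.
function hstsIsSufficient(header: string | null | undefined): boolean {
  const maxAge = parseHstsMaxAge(header);
  return maxAge !== null && maxAge >= MIN_HSTS_MAX_AGE;
}
```

In a smoke test you would pass in `response.headers.get("strict-transport-security")` from a `fetch` against your public hostname.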
> **Why:** identity signatures and observer bearer tokens travel in
> request bodies / headers. Without TLS, a network attacker can read
> the observer token and replay it for the full validity window, and
> can read the metadata (who registers, who fetches whose bundle).
> See `THREAT-MODEL.md § 1` (network attacker).
## 2. Backups
- [ ] **SQLite:** scheduled `sqlite3 /data/shade-prekeys.db ".backup ..."`
at least daily. The `.db` file plus `-wal` and `-shm` together is
the recovery unit; never copy the bare `.db` while the container
is running without using the online backup API.
- [ ] **Postgres:** `pg_dump` (or your provider's snapshot) at least
daily; verify a restore at least once per quarter.
- [ ] Backups are stored on different infrastructure than the primary
volume (different host / region / provider).
- [ ] Backups are encrypted at rest (your storage provider's
server-side encryption, age, or restic with a passphrase).
- [ ] **Restore drill:** at least once before going live, restore the
backup into a fresh volume and confirm `/health` is green and a
registered identity is still resolvable.
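
The "at least daily" requirement above is easy to assert from a monitoring job. A small sketch, assuming you can obtain the timestamp of the newest backup from your storage provider; `backupIsFresh` is a hypothetical helper, not part of Shade.

```typescript
// Hypothetical monitoring helper; not part of Shade.
const HOUR_MS = 60 * 60 * 1000;

// True when the newest backup is no older than maxAgeHours.
// 24 h matches "at least daily"; a timestamp in the future is treated as invalid.
function backupIsFresh(lastBackupAt: Date, now: Date, maxAgeHours = 24): boolean {
  const ageMs = now.getTime() - lastBackupAt.getTime();
  return ageMs >= 0 && ageMs <= maxAgeHours * HOUR_MS;
}
```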
> **Why:** prekey records contain identity public keys and one-time
> prekeys. Losing them means new sessions can't be established to those
> identities until each user re-registers. Existing sessions keep
> ratcheting on the device-side state.
## 3. Observer token rotation
- [ ] `SHADE_OBSERVER_TOKEN` is set to **≥ 16 chars** of high-entropy
random data (e.g. `openssl rand -hex 32`). The server logs a
warning and disables the observer if the token is shorter.
- [ ] The token is held in your secret manager (Dokploy secret, GitHub
Actions secret, Vault, 1Password CLI), **never** committed to a
compose file or `.env` checked into git.
- [ ] The token is rotated on a schedule (recommended: every 90 days)
and immediately if it has been shared with anyone who no longer
needs access.
- [ ] If you expose the dashboard publicly, you also gate it behind
basic-auth at the proxy layer — bearer tokens are not
revocation-friendly on their own.
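
A pre-deploy check can catch a weak token before the server disables the observer at startup. The sketch below is an assumption about how you might lint your own config. Only the 16-character minimum comes from this checklist; the distinct-character heuristic and the name `checkObserverToken` are illustrative, and Shade's own validation may differ.

```typescript
// Hypothetical pre-deploy check; Shade's own startup validation may differ.
const MIN_TOKEN_LENGTH = 16; // minimum length from the checklist item above

interface TokenCheck {
  ok: boolean;
  reason?: string;
}

function checkObserverToken(token: string | undefined): TokenCheck {
  if (!token) return { ok: false, reason: "SHADE_OBSERVER_TOKEN is not set" };
  if (token.length < MIN_TOKEN_LENGTH) {
    return { ok: false, reason: `token is shorter than ${MIN_TOKEN_LENGTH} chars` };
  }
  // Crude heuristic (an assumption, not a Shade rule): a random token uses
  // many distinct characters. Catches values such as "aaaaaaaaaaaaaaaa".
  const distinct = new Set(token.split("")).size;
  if (distinct < 8) return { ok: false, reason: "token looks low-entropy" };
  return { ok: true };
}
```

A value produced by `openssl rand -hex 32` is 64 hex characters long and passes both checks.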
> **Why:** the observer dashboard exposes metadata about every active
> identity, including registration timestamps and recent activity.
> Anyone with the token can scrape the entire prekey directory.
## 4. SQLite vs PostgreSQL
Pick one and stick to it.
- [ ] **SQLite** is the default. Use it when **one** Shade container is
enough, you can tolerate downtime during backup snapshots, and
your write rate is below ~500 req/s. Path: `SHADE_PREKEY_DB_PATH`,
default `/data/shade-prekeys.db`.
- [ ] **PostgreSQL** is for multi-replica deployments, shared
infrastructure, or when you already operate a managed Postgres
and want one fewer thing to back up. Connection string: `SHADE_PREKEY_PG_URL`.
Tables are auto-created with `shade_server_*` prefix.
- [ ] Whichever you pick, the database lives behind TLS for the
connection (`sslmode=require` for Postgres) and on storage that
is itself encrypted (LUKS, EBS encryption, managed-DB encryption).
- [ ] You do **not** mix them in the same deployment. Setting
`SHADE_PREKEY_PG_URL` overrides SQLite silently — pick one in
`compose.yml` and document which.
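
The last two items above lend themselves to a config lint in CI. A minimal sketch: the two environment variable names come from this checklist, while `lintDbConfig` and the assumption that TLS is requested through an `sslmode` query parameter on the connection string are illustrative.

```typescript
// Hypothetical config lint; env var names come from the checklist,
// everything else is illustrative.
interface DbEnv {
  SHADE_PREKEY_DB_PATH?: string;
  SHADE_PREKEY_PG_URL?: string;
}

function lintDbConfig(env: DbEnv): string[] {
  const problems: string[] = [];
  const pgUrl = env.SHADE_PREKEY_PG_URL;
  if (pgUrl && env.SHADE_PREKEY_DB_PATH) {
    problems.push("both SQLite path and Postgres URL are set; Postgres silently wins");
  }
  if (pgUrl) {
    // Look for sslmode in the query string; accept require and the
    // stricter verify-ca / verify-full modes.
    const match = /[?&]sslmode=([^&#]+)/.exec(pgUrl);
    const mode = match ? match[1] : undefined;
    if (mode !== "require" && mode !== "verify-ca" && mode !== "verify-full") {
      problems.push("Postgres URL does not enforce TLS (expected sslmode=require or stricter)");
    }
  }
  return problems;
}
```

An empty array means the database settings are consistent with the items above. Anything else is a message to surface in the pipeline log.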
> **Why:** Shade does **not** encrypt the database itself (V3.2 will).
> Disk-level / volume-level encryption is the operator's responsibility
> until at-rest encryption ships.
## 5. Log level and structured logs
- [ ] `SHADE_LOG_LEVEL` is set to `info` (production) or `warn`
(high-traffic). Avoid `debug` in prod — it logs request bodies
including signed payloads.
- [ ] Logs are shipped to a retention-bounded sink (Loki, CloudWatch,
Datadog) with **redaction of `Authorization` headers and signed
bodies** if your sink doesn't already strip them.
- [ ] You alert on `error`-level logs and on the absence of cleanup
cycles (a stuck cleanup loop = unbounded DB growth).
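
If your sink does not redact for you, a small transform in the shipping path can do it. This is a sketch under stated assumptions: the `LogRecord` shape and `redactLogRecord` are hypothetical and do not reflect Shade's actual log format, so adapt the field names to your logger.

```typescript
// Hypothetical log-record shape; adapt the field names to your logger.
type LogRecord = {
  level: string;
  msg: string;
  headers?: Record<string, string>;
  body?: unknown;
};

const REDACTED = "[REDACTED]";

// Returns a copy with the Authorization header and any request body masked.
function redactLogRecord(record: LogRecord): LogRecord {
  const out: LogRecord = { level: record.level, msg: record.msg };
  if (record.headers) {
    out.headers = {};
    for (const [name, value] of Object.entries(record.headers)) {
      // Header names are case-insensitive.
      out.headers[name] = name.toLowerCase() === "authorization" ? REDACTED : value;
    }
  }
  // Signed request bodies are dropped wholesale rather than filtered per field.
  if (record.body !== undefined) out.body = REDACTED;
  return out;
}
```

Dropping the whole body is deliberately blunt: it avoids having to keep a field-level allowlist in sync with the wire format.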
> **Why:** at `debug` level the server logs signature material. While
> Ed25519 signatures are not secrets per se, leaking them widens the
> replay-window blast radius and reveals timing patterns.
## 6. Stale-identity cleanup parameters
- [ ] `SHADE_STALE_DAYS` is set deliberately for your product. The
default (30 days) is right for "active chat app"; "occasional
use" apps should bump to 90+ to avoid surprise re-registration.
- [ ] `SHADE_CLEANUP_INTERVAL_HOURS` is left at 24 unless you have a
specific reason. Running cleanup more often does not free more
space, and running it less often lets stale records outlive
`SHADE_STALE_DAYS` by up to one full interval.
- [ ] You watch the `shade_cleanup_purged_total` metric (Prometheus) and
alert on a sudden 10× spike — that often signals a bug or a
deployment that broke client-side activity timestamps.
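
Both the lower bound on `SHADE_STALE_DAYS` and the 10× spike alert can be expressed as small guards in your deployment tooling. A sketch only: `resolveStaleDays` and `isPurgeSpike` are hypothetical helpers, and the parsing rules are an assumption about how you wire your config, not Shade's behaviour.

```typescript
// Hypothetical helpers; SHADE_STALE_DAYS and the 30-day default come from
// the checklist, the parsing rules are an assumption about your own tooling.
const DEFAULT_STALE_DAYS = 30;

// Parse a raw SHADE_STALE_DAYS value and refuse anything below 1 day.
function resolveStaleDays(raw: string | undefined): number {
  if (raw === undefined || raw.trim() === "") return DEFAULT_STALE_DAYS;
  const days = Number(raw);
  if (!Number.isInteger(days) || days < 1) {
    throw new Error(`SHADE_STALE_DAYS must be an integer >= 1, got "${raw}"`);
  }
  return days;
}

// Flag a cleanup cycle that purged at least `factor` times the recent average.
function isPurgeSpike(recentPurges: number[], latest: number, factor = 10): boolean {
  if (recentPurges.length === 0) return false;
  const mean = recentPurges.reduce((a, b) => a + b, 0) / recentPurges.length;
  // Use a floor of 1 so a quiet baseline of zero does not flag every purge.
  return latest >= Math.max(mean, 1) * factor;
}
```

`recentPurges` would be per-cycle increases of `shade_cleanup_purged_total`; in Prometheus itself the same comparison is usually written as an alert rule over the counter's rate.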
> **Why:** stale cleanup is the only thing keeping the prekey directory
> from growing forever. A misconfigured `SHADE_STALE_DAYS = 0` would
> nuke every identity on every cycle. Bound the value at ≥ 1 in your
> deployment config.
## 7. Secret rotation
- [ ] Identity signing keys: each consumer rotates via the documented
identity-rotation flow (7-day grace period for old sessions).
Operators do **not** touch identity keys directly.
- [ ] Observer token: see § 3.
- [ ] Database credentials (Postgres only): rotate per your standard
cadence, with the connection string supplied through the secret
manager.
- [ ] No long-lived API keys or service tokens are stored in the
container image or volume.
## 8. Rate-limit and body-size caps
- [ ] You have not loosened the built-in rate limits beyond the defaults
(per-IP register/bundle and per-identity replenish/delete).
- [ ] You have not raised the 64 KiB POST body limit. Prekey bundles
fit comfortably; raising the limit only enables abuse.
- [ ] Your reverse proxy enforces an additional connection / request-rate
limit at the edge (Caddy `rate_limit`, Cloudflare, etc.)
so a single noisy IP can't even reach Shade's per-route limits.
## 9. Health checks and metrics scrape
- [ ] Container has a Docker `HEALTHCHECK` (the official image already
ships one against `/health`).
- [ ] `/metrics` is scraped by Prometheus / OpenTelemetry and
retained ≥ 30 days.
- [ ] Alerts are wired for: `/health` failing for more than 2 min, request
latency p99 above 1 s, error rate above 1 %, cleanup cycles missing for
more than 25 h.
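
The four alert conditions above reduce to simple threshold checks. The sketch below evaluates them against a snapshot; the `HealthSnapshot` shape and `firingAlerts` are hypothetical, and in practice these would usually live as alert rules in your monitoring stack rather than in application code.

```typescript
// Hypothetical snapshot of values you would derive from /health and /metrics.
interface HealthSnapshot {
  healthFailingSeconds: number;  // how long /health has been failing
  p99LatencyMs: number;          // request latency, 99th percentile
  errorRate: number;             // fraction of requests that errored, 0..1
  hoursSinceLastCleanup: number; // time since the last completed cleanup cycle
}

// Thresholds taken from the checklist item above.
function firingAlerts(s: HealthSnapshot): string[] {
  const alerts: string[] = [];
  if (s.healthFailingSeconds > 120) alerts.push("health-down");
  if (s.p99LatencyMs > 1000) alerts.push("latency-p99");
  if (s.errorRate > 0.01) alerts.push("error-rate");
  if (s.hoursSinceLastCleanup > 25) alerts.push("cleanup-missing");
  return alerts;
}
```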
## 10. OpenAPI contract drift
- [ ] CI runs the OpenAPI lint (`bun test packages/shade-server/tests/openapi-lint.test.ts`)
on every PR — the spec must remain valid OpenAPI 3.1 with no
dangling `$ref`s.
- [ ] Generated clients (Python, Go, Kotlin) are regenerated from the
shipped spec on each release; mismatches between server and
client are caught at integration test time, not production.
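
To make "no dangling `$ref`s" concrete, the sketch below walks a parsed spec and reports local references that do not resolve. It is illustrative only and is not the project's `openapi-lint` test: `findDanglingRefs` and `resolveLocalRef` are hypothetical names, and external (non-`#/`) references are ignored.

```typescript
// Illustrative only: finds local "#/..." references that do not resolve.
// External references (files, URLs) are ignored.
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Resolve a JSON Pointer fragment such as "#/components/schemas/Bundle".
function resolveLocalRef(root: Json, ref: string): Json | undefined {
  let node: Json | undefined = root;
  for (const rawToken of ref.slice(2).split("/")) {
    // JSON Pointer escapes: ~1 is "/", ~0 is "~" (RFC 6901).
    const token = rawToken.replace(/~1/g, "/").replace(/~0/g, "~");
    if (node === null || typeof node !== "object") return undefined;
    node = Array.isArray(node) ? node[Number(token)] : node[token];
    if (node === undefined) return undefined;
  }
  return node;
}

// Recursively collect every local $ref whose target is missing.
function findDanglingRefs(root: Json, node: Json = root, found: string[] = []): string[] {
  if (node === null || typeof node !== "object") return found;
  if (Array.isArray(node)) {
    for (const item of node) findDanglingRefs(root, item, found);
    return found;
  }
  for (const [key, value] of Object.entries(node)) {
    if (key === "$ref" && typeof value === "string" && value.startsWith("#/")) {
      if (resolveLocalRef(root, value) === undefined) found.push(value);
    } else {
      findDanglingRefs(root, value, found);
    }
  }
  return found;
}
```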
---
## Pre-flight summary
If you can answer "yes" to every box above, ship it. If you can't,
write down which box and why before you do — that note belongs in your
runbook so the next operator inherits the gap, not the surprise.