# Shade Production Checklist
A flat punch-list for taking a Shade prekey server from "it boots" to
"production-ready". Every item below is a hard gate — if you can't tick it,
don't ship.
The deeper "why" behind each item lives in `THREAT-MODEL.md`,
`SECURITY.md`, and `docs/DEPLOYMENT.md`. This file is the operator's
checklist.
> Scope: a single Shade prekey container (`@shade/server`) plus any
> consumer apps that talk to it. For E2EE file transfer hardening
> (max-size, retention, quotas), see the **Hardening** and **Retention**
> sections of `docs/streams.md`.
---
## 1. TLS termination
- [ ] Public traffic is **TLS 1.2+ only** — Shade itself speaks plain HTTP
and assumes a reverse proxy (Caddy, Traefik, nginx, Dokploy's
built-in proxy) terminates TLS in front of it.
- [ ] HSTS is on (`Strict-Transport-Security: max-age=15552000`).
- [ ] The proxy is configured to pass the original `Host` header through
so signed payloads bound to the canonical address don't trip the
replay-window check on a mismatch.
- [ ] Internal traffic between consumer apps and the prekey container
runs on a private network (Docker bridge / VPC); the prekey port
is **not** exposed to the public internet without TLS in front.
> **Why:** identity signatures and observer bearer tokens travel in
> request bodies / headers. Without TLS, a network attacker can read
> the observer token and replay it for the full validity window, and
> can read the metadata (who registers, who fetches whose bundle).
> See `THREAT-MODEL.md § 1` (network attacker).
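
The HSTS item above can be spot-checked from a script. The following is a minimal sketch, assuming you already have the raw `Strict-Transport-Security` header value in hand (for example from `curl -sI` against your proxy); the function names are illustrative and not part of Shade.

```python
from typing import Optional

# 180 days in seconds, the max-age value given in the checklist.
MIN_HSTS_SECONDS = 15_552_000


def hsts_max_age(header_value: str) -> Optional[int]:
    """Return the max-age (seconds) from an HSTS header value, or None."""
    for directive in header_value.split(";"):
        name, _, value = directive.strip().partition("=")
        if name.lower() == "max-age":
            try:
                return int(value.strip().strip('"'))
            except ValueError:
                return None
    return None


def hsts_ok(header_value: Optional[str]) -> bool:
    """True if the header is present and max-age meets the checklist value."""
    if header_value is None:
        return False
    age = hsts_max_age(header_value)
    return age is not None and age >= MIN_HSTS_SECONDS
```

A header such as `max-age=31536000; includeSubDomains` passes; a missing header or a short `max-age` fails.
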
## 2. Backups
- [ ] **SQLite:** scheduled `sqlite3 /data/shade-prekeys.db ".backup ..."`
at least daily. The `.db` file plus `-wal` and `-shm` together is
the recovery unit; never copy the bare `.db` while the container
is running without using the online backup API.
- [ ] **Postgres:** `pg_dump` (or your provider's snapshot) at least
daily; verify a restore at least once per quarter.
- [ ] Backups are stored on different infrastructure than the primary
volume (different host / region / provider).
- [ ] Backups are encrypted at rest (your storage provider's
server-side encryption, age, or restic with a passphrase).
- [ ] **Restore drill:** at least once before going live, restore the
backup into a fresh volume and confirm `/health` is green and a
registered identity is still resolvable.
> **Why:** prekey records contain identity public keys and one-time
> prekeys. Losing them means new sessions can't be established to those
> identities until each user re-registers. Existing sessions keep
> ratcheting on the device-side state.
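
As an alternative to the `sqlite3` CLI, the online backup API is also reachable from Python's standard library. This is a minimal sketch, assuming the backup job can read the database file; the function names and the integrity check step are illustrative additions, not something Shade ships.

```python
import sqlite3
from pathlib import Path


def backup_sqlite(src_path: str, dest_path: str) -> None:
    """Copy a live SQLite database using the online backup API."""
    if not Path(src_path).is_file():
        # Guard: sqlite3.connect() would otherwise create an empty file.
        raise FileNotFoundError(src_path)
    src = sqlite3.connect(src_path)
    dest = sqlite3.connect(dest_path)
    try:
        with dest:
            # Consistent snapshot, safe while other connections write.
            src.backup(dest)
    finally:
        dest.close()
        src.close()


def verify_backup(dest_path: str) -> bool:
    """Run SQLite's built-in integrity check on the backup copy."""
    conn = sqlite3.connect(dest_path)
    try:
        row = conn.execute("PRAGMA integrity_check").fetchone()
        return row is not None and row[0] == "ok"
    finally:
        conn.close()
```

Running `verify_backup` right after each backup is a cheap partial check; it does not replace the full restore drill described above.
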
## 3. Observer token rotation
- [ ] `SHADE_OBSERVER_TOKEN` is set to **≥ 16 chars** of high-entropy
random data (e.g. `openssl rand -hex 32`). The server logs a
warning and disables the observer if the token is shorter.
- [ ] The token is held in your secret manager (Dokploy secret, GitHub
Actions secret, Vault, 1Password CLI), **never** committed to a
compose file or `.env` checked into git.
- [ ] The token is rotated on a schedule (recommended: every 90 days)
and immediately if it has been shared with anyone who no longer
needs access.
- [ ] If you expose the dashboard publicly, you also gate it behind
basic-auth at the proxy layer — bearer tokens are not
revocation-friendly on their own.
> **Why:** the observer dashboard exposes metadata about every active
> identity, including registration timestamps and recent activity.
> Anyone with the token can scrape the entire prekey directory.
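
The token rules above (length floor, 90-day rotation) can be expressed as a small pre-deploy check. This is a sketch under assumptions: the helper names are hypothetical, and how you record the last rotation date is up to your secret manager.

```python
import secrets
from datetime import date

MIN_TOKEN_CHARS = 16      # floor stated in the checklist
MAX_TOKEN_AGE_DAYS = 90   # recommended rotation interval


def new_observer_token(n_bytes: int = 32) -> str:
    """Generate a hex token; 32 random bytes gives 64 hex characters."""
    return secrets.token_hex(n_bytes)


def token_long_enough(token: str) -> bool:
    """Check only the length floor; entropy cannot be verified here."""
    return len(token) >= MIN_TOKEN_CHARS


def rotation_due(last_rotated: date, today: date,
                 max_age_days: int = MAX_TOKEN_AGE_DAYS) -> bool:
    """True once the token has reached the rotation interval."""
    return (today - last_rotated).days >= max_age_days
```

`secrets.token_hex(32)` produces output of the same shape as `openssl rand -hex 32`.
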
## 4. SQLite vs PostgreSQL
Pick one and stick to it.
- [ ] **SQLite** is the default. Use it when **one** Shade container is
enough, you can tolerate downtime during backup snapshots, and
your write rate is below ~500 req/s. Path: `SHADE_PREKEY_DB_PATH`,
default `/data/shade-prekeys.db`.
- [ ] **PostgreSQL** is for multi-replica deployments, shared
infrastructure, or when you already operate a managed Postgres
and want one fewer thing to back up. Connection URL:
`SHADE_PREKEY_PG_URL`. Tables are auto-created with the
`shade_server_*` prefix.
- [ ] Whichever you pick, any network connection to the database uses
TLS (`sslmode=require` for Postgres; SQLite is a local file with no
network hop), and the underlying storage is itself encrypted (LUKS,
EBS encryption, managed-DB encryption).
- [ ] You do **not** mix them in the same deployment. Setting
`SHADE_PREKEY_PG_URL` overrides SQLite silently — pick one in
`compose.yml` and document which.
> **Why:** Shade does **not** encrypt the database itself (V3.2 will).
> Disk-level / volume-level encryption is the operator's responsibility
> until at-rest encryption ships.
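
Because `SHADE_PREKEY_PG_URL` silently overrides SQLite, a pre-deploy lint over the environment can catch a mixed configuration. The sketch below is an assumption about how you might check it yourself; it only inspects the two variables named above and the `sslmode` query parameter, and is not Shade's own config loader.

```python
from typing import Dict, List
from urllib.parse import parse_qs, urlparse

# libpq sslmode values that force an encrypted connection.
TLS_SSLMODES = ("require", "verify-ca", "verify-full")


def check_db_env(env: Dict[str, str]) -> List[str]:
    """Return a list of human-readable problems; empty means OK."""
    problems: List[str] = []
    pg_url = env.get("SHADE_PREKEY_PG_URL", "")
    db_path = env.get("SHADE_PREKEY_DB_PATH", "")

    if pg_url and db_path:
        problems.append(
            "both SHADE_PREKEY_PG_URL and SHADE_PREKEY_DB_PATH are set; "
            "the Postgres URL wins silently"
        )
    if pg_url:
        query = parse_qs(urlparse(pg_url).query)
        sslmode = query.get("sslmode", [""])[0]
        if sslmode not in TLS_SSLMODES:
            problems.append("Postgres URL does not force TLS via sslmode")
    return problems
```

In CI this could run as `check_db_env(dict(os.environ))` and fail the job when the list is non-empty.
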
## 5. Log level and structured logs
- [ ] `SHADE_LOG_LEVEL` is set to `info` (production) or `warn`
(high-traffic). Avoid `debug` in prod — it logs request bodies
including signed payloads.
- [ ] Logs are shipped to a retention-bounded sink (Loki, CloudWatch,
Datadog) with **redaction of `Authorization` headers and signed
bodies** if your sink doesn't already strip them.
- [ ] You alert on `error`-level logs and on the absence of cleanup
cycles (a stuck cleanup loop = unbounded DB growth).
> **Why:** at `debug` level the server logs signature material. While
> Ed25519 signatures are not secrets per se, leaking them widens the
> replay-window blast radius and reveals timing patterns.
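
If your log sink does not strip sensitive fields, a redaction pass can sit between the container and the shipper. This is a minimal sketch that assumes logs arrive as one JSON object per line; the key names in `SENSITIVE_KEYS` are hypothetical, so match them to the fields your deployment actually emits.

```python
import json

REDACTED = "[REDACTED]"
# Hypothetical key names; adjust to the real log schema.
SENSITIVE_KEYS = {"authorization", "signature", "signed_body"}


def redact(value):
    """Recursively replace values stored under sensitive keys."""
    if isinstance(value, dict):
        return {
            key: REDACTED if key.lower() in SENSITIVE_KEYS else redact(val)
            for key, val in value.items()
        }
    if isinstance(value, list):
        return [redact(item) for item in value]
    return value


def redact_log_line(line: str) -> str:
    """Redact one JSON log line; non-JSON lines pass through unchanged."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return line
    return json.dumps(redact(record))
```

Passing non-JSON lines through unchanged is a design choice that favors not losing logs; a stricter pipeline might drop them instead.
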
## 6. Stale-identity cleanup parameters
- [ ] `SHADE_STALE_DAYS` is set deliberately for your product. The
default (30 days) is right for "active chat app"; "occasional
use" apps should bump to 90+ to avoid surprise re-registration.
- [ ] `SHADE_CLEANUP_INTERVAL_HOURS` is left at 24 unless you have a
specific reason. Running cleanup more often does not free more
space, and running it less often lets stale records outlive
`SHADE_STALE_DAYS` by up to one full interval.
- [ ] You watch the `shade_cleanup_purged_total` metric (Prometheus) and
alert on a sudden 10× spike — that often signals a bug or a
deployment that broke client-side activity timestamps.
> **Why:** stale cleanup is the only thing keeping the prekey directory
> from growing forever. A misconfigured `SHADE_STALE_DAYS=0` would
> nuke every identity on every cycle. Bound the value at ≥ 1 in your
> deployment config.
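
The lower bound on `SHADE_STALE_DAYS` can be enforced before the container starts. The following is a sketch of such a guard, assuming the value reaches you as a raw environment string; `parse_stale_days` is a hypothetical helper, and whether Shade itself rejects `0` is not something this checklist states.

```python
from typing import Optional

DEFAULT_STALE_DAYS = 30  # default stated in the checklist


def parse_stale_days(raw: Optional[str],
                     default: int = DEFAULT_STALE_DAYS) -> int:
    """Parse SHADE_STALE_DAYS and refuse values below 1."""
    if raw is None or raw.strip() == "":
        return default
    days = int(raw)  # raises ValueError on non-numeric input
    if days < 1:
        raise ValueError(
            f"SHADE_STALE_DAYS={days} would purge every identity; use >= 1"
        )
    return days
```

Calling `parse_stale_days(os.environ.get("SHADE_STALE_DAYS"))` in a deploy script turns a silent misconfiguration into a failed deploy.
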
## 7. Secret rotation
- [ ] Identity signing keys: each consumer rotates via the documented
identity-rotation flow (7-day grace period for old sessions).
Operators do **not** touch identity keys directly.
- [ ] Observer token: see § 3.
- [ ] Database credentials (Postgres only): rotate per your standard
cadence, with the connection string supplied through the secret
manager.
- [ ] No long-lived API keys or service tokens are stored in the
container image or volume.
## 8. Rate-limit and body-size caps
- [ ] You have not loosened the built-in rate limits beyond the defaults
(per-IP register/bundle and per-identity replenish/delete).
- [ ] You have not raised the 64 KiB POST body limit. Prekey bundles
fit comfortably; raising the limit only enables abuse.
- [ ] Your reverse proxy enforces an additional connection /
request-rate limit at the edge (Caddy `rate_limit`, Cloudflare, etc.)
so a single noisy IP can't even reach Shade's per-route limits.
## 9. Health checks and metrics scrape
- [ ] Container has a Docker `HEALTHCHECK` (the official image already
ships one against `/health`).
- [ ] `/metrics` is scraped by Prometheus / OpenTelemetry and
retained ≥ 30 days.
- [ ] Alerts are wired for: `/health` failing for more than 2 min,
request latency p99 above 1 s, error rate above 1 %, and cleanup
cycles missing for more than 25 h.
## 10. OpenAPI contract drift
- [ ] CI runs the OpenAPI lint (`bun test packages/shade-server/tests/openapi-lint.test.ts`)
on every PR — the spec must remain valid OpenAPI 3.1 with no
dangling `$ref`s.
- [ ] Generated clients (Python, Go, Kotlin) are regenerated from the
shipped spec on each release; mismatches between server and
client are caught at integration test time, not production.
---
## Pre-flight summary
If you can answer "yes" to every box above, ship it. If you can't,
write down which box and why before you do — that note belongs in your
runbook so the next operator inherits the gap, not the surprise.