# Shade Production Checklist

A flat punch-list for taking a Shade prekey server from "it boots" to
"production-ready". Every item below is a hard gate: if you can't tick it,
don't ship.

The deeper "why" behind each item lives in `THREAT-MODEL.md`,
`SECURITY.md`, and `docs/DEPLOYMENT.md`. This file is the operator's
checklist.

> Scope: a single Shade prekey container (`@shade/server`) plus any
> consumer apps that talk to it. For E2EE file transfer hardening
> (max-size, retention, quotas), see the **Hardening** and **Retention**
> sections of `docs/streams.md`.

---

## 1. TLS termination

- [ ] Public traffic is **TLS 1.2+ only**. Shade itself speaks plain HTTP
      and assumes a reverse proxy (Caddy, Traefik, nginx, Dokploy's
      built-in proxy) terminates TLS in front of it.
- [ ] HSTS is on (`Strict-Transport-Security: max-age=15552000`).
- [ ] The proxy is configured to pass the original `Host` header through
      so signed payloads bound to the canonical address don't trip the
      replay-window check on a mismatch.
- [ ] Internal traffic between consumer apps and the prekey container
      runs on a private network (Docker bridge / VPC); the prekey port
      is **not** exposed to the public internet without TLS in front.

> **Why:** identity signatures and observer bearer tokens travel in
> request bodies / headers. Without TLS, a network attacker can read
> the observer token and replay it for the full validity window, and
> can read the metadata (who registers, who fetches whose bundle).
> See `THREAT-MODEL.md § 1` (network attacker).

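The HSTS gate above can be verified mechanically from a smoke test. The sketch below is one way to do it, not part of Shade: `hstsMeetsPolicy` is a hypothetical helper that takes the raw `Strict-Transport-Security` header value (or `null` when the proxy did not send one) and checks that `max-age` is at least the value quoted in the checklist.

```typescript
// Minimum max-age from the checklist item (180 days, in seconds).
const MIN_HSTS_MAX_AGE = 15552000;

// Hypothetical helper: true when the header carries max-age >= MIN_HSTS_MAX_AGE.
function hstsMeetsPolicy(headerValue: string | null): boolean {
  if (headerValue === null) return false;
  for (const directive of headerValue.split(";")) {
    const parts = directive.trim().split("=");
    const name = (parts[0] ?? "").toLowerCase();
    const value = parts[1];
    if (name === "max-age" && value !== undefined) {
      // The value may be quoted; strip quotes before parsing.
      const seconds = Number(value.replace(/"/g, ""));
      return Number.isInteger(seconds) && seconds >= MIN_HSTS_MAX_AGE;
    }
  }
  return false; // no max-age directive present
}
```

In a smoke test you would pass it something like `response.headers.get("strict-transport-security")` from a request made through the public proxy, not directly against the container.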
## 2. Backups

- [ ] **SQLite:** scheduled `sqlite3 /data/shade-prekeys.db ".backup ..."`
      at least daily. The `.db` file plus `-wal` and `-shm` together is
      the recovery unit; never copy the bare `.db` while the container
      is running without using the online backup API.
- [ ] **Postgres:** `pg_dump` (or your provider's snapshot) at least
      daily; verify a restore at least once per quarter.
- [ ] Backups are stored on different infrastructure than the primary
      volume (different host / region / provider).
- [ ] Backups are encrypted at rest (your storage provider's
      server-side encryption, age, or restic with a passphrase).
- [ ] **Restore drill:** at least once before going live, restore the
      backup into a fresh volume and confirm `/health` is green and a
      registered identity is still resolvable.

> **Why:** prekey records contain identity public keys and one-time
> prekeys. Losing them means new sessions can't be established to those
> identities until each user re-registers. Existing sessions keep
> ratcheting on the device-side state.

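"At least daily" and "once per quarter" are easy to let slip, so it can help to check them from a monitoring job. The following is a minimal sketch under assumptions: the `BackupRecord` shape and `backupProblems` function are hypothetical, and the 92-day figure is just one reading of "quarter".

```typescript
const DAY_MS = 24 * 60 * 60 * 1000;

// Hypothetical record your backup job would maintain.
interface BackupRecord {
  lastBackupAt: Date;      // when the most recent backup finished
  lastRestoreTestAt: Date; // when a restore was last verified end to end
}

// Returns human-readable problems; an empty list means both gates pass.
function backupProblems(record: BackupRecord, now: Date): string[] {
  const problems: string[] = [];
  if (now.getTime() - record.lastBackupAt.getTime() > DAY_MS) {
    problems.push("last backup is older than 24 h");
  }
  // Assumption: "once per quarter" is approximated as 92 days.
  if (now.getTime() - record.lastRestoreTestAt.getTime() > 92 * DAY_MS) {
    problems.push("no verified restore in the last quarter");
  }
  return problems;
}
```

The function is pure on purpose: where the timestamps come from (object-store metadata, a marker file, a metrics gauge) depends on your backup tooling.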
## 3. Observer token rotation

- [ ] `SHADE_OBSERVER_TOKEN` is set to **≥ 16 chars** of high-entropy
      random data (e.g. `openssl rand -hex 32`). The server logs a
      warning and disables the observer if the token is shorter.
- [ ] The token is held in your secret manager (Dokploy secret, GitHub
      Actions secret, Vault, 1Password CLI), **never** committed to a
      compose file or `.env` checked into git.
- [ ] The token is rotated on a schedule (recommended: every 90 days)
      and immediately if it has been shared with anyone who no longer
      needs access.
- [ ] If you expose the dashboard publicly, you also gate it behind
      basic-auth at the proxy layer, because bearer tokens are not
      revocation-friendly on their own.

> **Why:** the observer dashboard exposes metadata about every active
> identity, registration timestamp, and recent activity. Anyone with
> the token can scrape the entire prekey directory.

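A deploy-time guard can catch a too-short or overdue token before the server silently disables the observer. This is a sketch, not Shade's own validation: `checkObserverToken` is a hypothetical helper, it only checks length (it cannot measure entropy), and the issue date is assumed to be tracked by your secret manager.

```typescript
const MIN_TOKEN_LENGTH = 16; // shorter tokens disable the observer, per the checklist
const ROTATION_DAYS = 90;    // recommended rotation cadence from the checklist
const MS_PER_DAY = 24 * 60 * 60 * 1000;

interface TokenCheck {
  lengthOk: boolean;    // token meets the documented minimum length
  rotationDue: boolean; // token is at or past the recommended age
}

// Hypothetical deploy-time check; issuedAt comes from wherever you track rotation.
function checkObserverToken(token: string, issuedAt: Date, now: Date): TokenCheck {
  const ageDays = (now.getTime() - issuedAt.getTime()) / MS_PER_DAY;
  return {
    lengthOk: token.length >= MIN_TOKEN_LENGTH,
    rotationDue: ageDays >= ROTATION_DAYS,
  };
}
```

A token produced by `openssl rand -hex 32` is 64 hex characters, so it clears the length gate with room to spare.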
## 4. SQLite vs PostgreSQL

Pick one and stick to it.

- [ ] **SQLite** is the default. Use it when **one** Shade container is
      enough, you can tolerate downtime during backup snapshots, and
      your write rate is below ~500 req/s. Path: `SHADE_PREKEY_DB_PATH`,
      default `/data/shade-prekeys.db`.
- [ ] **PostgreSQL** is for multi-replica deployments, shared
      infrastructure, or when you already operate a managed Postgres
      and want one fewer thing to back up. Connection URL:
      `SHADE_PREKEY_PG_URL`. Tables are auto-created with the
      `shade_server_*` prefix.
- [ ] Whichever you pick, the database lives on storage that is itself
      encrypted (LUKS, EBS encryption, managed-DB encryption). For
      Postgres, the connection also runs over TLS (`sslmode=require`).
- [ ] You do **not** mix them in the same deployment. Setting
      `SHADE_PREKEY_PG_URL` overrides SQLite silently, so pick one in
      `compose.yml` and document which.

> **Why:** Shade does **not** encrypt the database itself (V3.2 will).
> Disk-level / volume-level encryption is the operator's responsibility
> until at-rest encryption ships.

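Because the Postgres URL overrides SQLite silently, a pre-deploy check that makes the override loud can save a confusing afternoon. The sketch below mirrors the precedence described above; it is not the server's actual resolution code, and `resolveBackend` and its types are hypothetical names.

```typescript
type Backend =
  | { kind: "sqlite"; path: string }
  | { kind: "postgres"; url: string; tlsRequired: boolean };

interface Resolution {
  backend: Backend;
  warnings: string[];
}

// Hypothetical pre-deploy check over the environment you are about to ship.
function resolveBackend(env: Record<string, string | undefined>): Resolution {
  const warnings: string[] = [];
  const pgUrl = env["SHADE_PREKEY_PG_URL"];
  const dbPath = env["SHADE_PREKEY_DB_PATH"];
  if (pgUrl) {
    if (dbPath) {
      warnings.push("both SHADE_PREKEY_PG_URL and SHADE_PREKEY_DB_PATH are set; Postgres wins");
    }
    // verify-ca and verify-full are stricter libpq modes, so they also pass.
    const tlsRequired = /[?&]sslmode=(require|verify-ca|verify-full)(&|$)/.test(pgUrl);
    if (!tlsRequired) warnings.push("Postgres URL does not request sslmode=require");
    return { backend: { kind: "postgres", url: pgUrl, tlsRequired }, warnings };
  }
  // Documented SQLite default path.
  return { backend: { kind: "sqlite", path: dbPath ?? "/data/shade-prekeys.db" }, warnings };
}
```

Failing the deploy on any non-empty `warnings` list is one way to enforce "pick one and document which".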
## 5. Log level and structured logs

- [ ] `SHADE_LOG_LEVEL` is set to `info` (production) or `warn`
      (high-traffic). Avoid `debug` in prod: it logs request bodies
      including signed payloads.
- [ ] Logs are shipped to a retention-bounded sink (Loki, CloudWatch,
      Datadog) with **redaction of `Authorization` headers and signed
      bodies** if your sink doesn't already strip them.
- [ ] You alert on `error`-level logs and on the absence of cleanup
      cycles (a stuck cleanup loop = unbounded DB growth).

> **Why:** at `debug` level the server logs signature material. While
> Ed25519 signatures are not secrets per se, leaking them widens the
> replay-window blast radius and reveals timing patterns.

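If your log sink does not redact for you, a small transform in the shipping pipeline can. The sketch below assumes a hypothetical structured record shape (`LogRecord`); Shade's real log format may differ, so treat the field names as placeholders.

```typescript
// Hypothetical structured log record; field names are assumptions.
interface LogRecord {
  level: string;
  msg: string;
  headers?: Record<string, string>;
  body?: unknown;
}

const REDACTED = "[redacted]";

// Returns a copy with Authorization headers and request bodies masked.
// The input record is left untouched.
function redact(record: LogRecord): LogRecord {
  const out: LogRecord = { level: record.level, msg: record.msg };
  if (record.headers) {
    const headers: Record<string, string> = {};
    for (const [name, value] of Object.entries(record.headers)) {
      // Header names are case-insensitive.
      headers[name] = name.toLowerCase() === "authorization" ? REDACTED : value;
    }
    out.headers = headers;
  }
  // Bodies may carry signed payloads, so drop them wholesale.
  if (record.body !== undefined) out.body = REDACTED;
  return out;
}
```

Masking the whole body is deliberately blunt; a field-level allowlist is friendlier for debugging but easier to get wrong.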
## 6. Stale-identity cleanup parameters

- [ ] `SHADE_STALE_DAYS` is set deliberately for your product. The
      default (30 days) is right for "active chat app"; "occasional
      use" apps should bump to 90+ to avoid surprise re-registration.
- [ ] `SHADE_CLEANUP_INTERVAL_HOURS` is left at 24 unless you have a
      specific reason: running cleanup more often does not free more
      space, and running it less often risks one cycle missing a day.
- [ ] You watch the `shade_cleanup_purged_total` metric (Prometheus) and
      alert on a sudden 10× spike, which often signals a bug or a
      deployment that broke client-side activity timestamps.

> **Why:** stale cleanup is the only thing keeping the prekey directory
> from growing forever. A misconfigured `SHADE_STALE_DAYS = 0` would
> nuke every identity on every cycle. Bound the value at ≥ 1 in your
> deployment config.

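The "bound the value at ≥ 1" advice and the 10× spike alert can both be expressed as small guards in deployment tooling. This is a sketch of that idea, not the server's own parsing: `parseStaleDays` and `isPurgeSpike` are hypothetical helpers, and comparing consecutive cycle counts is only one way to define a spike.

```typescript
// Hypothetical deploy-time guard for SHADE_STALE_DAYS.
// Falls back to the documented default (30) when the variable is unset.
function parseStaleDays(raw: string | undefined, fallback = 30): number {
  if (raw === undefined || raw.trim() === "") return fallback;
  const days = Number(raw);
  if (!Number.isInteger(days) || days < 1) {
    throw new Error(`SHADE_STALE_DAYS must be an integer >= 1, got "${raw}"`);
  }
  return days;
}

// Compares purge counts from two consecutive cleanup cycles, e.g. deltas of
// shade_cleanup_purged_total. With a zero baseline the ratio is undefined,
// so this sketch reports no spike (an assumption; you may prefer to alert).
function isPurgeSpike(previousCycle: number, currentCycle: number, factor = 10): boolean {
  return previousCycle > 0 && currentCycle >= previousCycle * factor;
}
```

Running `parseStaleDays` in CI against the rendered compose environment rejects a `0` before it ever reaches a running container.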
## 7. Secret rotation

- [ ] Identity signing keys: each consumer rotates via the documented
      identity-rotation flow (7-day grace period for old sessions).
      Operators do **not** touch identity keys directly.
- [ ] Observer token: see § 3.
- [ ] Database credentials (Postgres only): rotate per your standard
      cadence, with the connection string supplied through the secret
      manager.
- [ ] No long-lived API keys or service tokens are stored in the
      container image or volume.

## 8. Rate-limit and body-size caps

- [ ] You have not lowered the built-in rate limits below the defaults
      (per-IP register/bundle and per-identity replenish/delete).
- [ ] You have not raised the 64 KiB POST body limit. Prekey bundles
      fit comfortably; raising the limit only enables abuse.
- [ ] Your reverse proxy enforces an additional connection / request-rate
      limit at the edge (Caddy `rate_limit`, Cloudflare, etc.)
      so a single noisy IP can't even reach Shade's per-route limits.

## 9. Health checks and metrics scrape

- [ ] Container has a Docker `HEALTHCHECK` (the official image already
      ships one against `/health`).
- [ ] `/metrics` is scraped by Prometheus / OpenTelemetry and
      retained ≥ 30 days.
- [ ] Alerts are wired for: `/health` failing for > 2 min, request
      latency p99 > 1 s, error rate > 1 %, and cleanup cycles missing
      for > 25 h.

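The four alert thresholds above are simple enough to state as code, which is handy for unit-testing alert rules before wiring them into Prometheus. The `Snapshot` shape and `firingAlerts` function below are hypothetical; only the thresholds come from the checklist.

```typescript
// Hypothetical point-in-time view assembled from your metrics backend.
interface Snapshot {
  healthFailingSeconds: number;  // how long /health has been failing (0 if healthy)
  p99LatencySeconds: number;     // request latency p99
  errorRate: number;             // fraction of failed requests, 0.01 = 1 %
  hoursSinceLastCleanup: number; // time since the last completed cleanup cycle
}

// Returns the names of the alerts that should fire for this snapshot.
function firingAlerts(s: Snapshot): string[] {
  const alerts: string[] = [];
  if (s.healthFailingSeconds > 120) alerts.push("health");
  if (s.p99LatencySeconds > 1) alerts.push("latency");
  if (s.errorRate > 0.01) alerts.push("errors");
  if (s.hoursSinceLastCleanup > 25) alerts.push("cleanup");
  return alerts;
}
```

The 25 h cleanup threshold leaves one hour of slack over the default 24 h interval, so a slightly late cycle does not page anyone.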
## 10. OpenAPI contract drift

- [ ] CI runs the OpenAPI lint (`bun test packages/shade-server/tests/openapi-lint.test.ts`)
      on every PR. The spec must remain valid OpenAPI 3.1 with no
      dangling `$ref`s.
- [ ] Generated clients (Python, Go, Kotlin) are regenerated from the
      shipped spec on each release; mismatches between server and
      client are caught at integration test time, not production.

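For readers unfamiliar with what a "dangling `$ref`" check involves, here is a minimal sketch. It is not the project's `openapi-lint.test.ts` (whose contents are not shown here); `danglingRefs` and `resolvePointer` are hypothetical names. It only follows local references of the form `#/...` and ignores external files and percent-encoded fragments.

```typescript
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Resolve a local JSON Pointer reference such as "#/components/schemas/Foo".
function resolvePointer(root: Json, ref: string): Json | undefined {
  let node: Json | undefined = root;
  for (const rawToken of ref.slice(2).split("/")) {
    // JSON Pointer escapes: ~1 is "/", ~0 is "~".
    const token = rawToken.replace(/~1/g, "/").replace(/~0/g, "~");
    if (node === null || typeof node !== "object") return undefined;
    node = Array.isArray(node) ? node[Number(token)] : node[token];
    if (node === undefined) return undefined;
  }
  return node;
}

// Walk the whole document and collect local $ref targets that do not resolve.
function danglingRefs(root: Json): string[] {
  const missing: string[] = [];
  const visit = (node: Json): void => {
    if (node === null || typeof node !== "object") return;
    if (Array.isArray(node)) {
      for (const item of node) visit(item);
      return;
    }
    for (const [key, value] of Object.entries(node)) {
      if (key === "$ref" && typeof value === "string" && value.startsWith("#/")) {
        if (resolvePointer(root, value) === undefined) missing.push(value);
      } else {
        visit(value);
      }
    }
  };
  visit(root);
  return missing;
}
```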
---

## Pre-flight summary

If you can answer "yes" to every box above, ship it. If you can't,
write down which box and why before you do. That note belongs in your
runbook so the next operator inherits the gap, not the surprise.