docs/PRODUCTION-CHECKLIST.md

# Shade Production Checklist

A flat punch-list for taking a Shade prekey server from "it boots" to
"production-ready". Every item below is a hard gate — if you can't tick it,
don't ship.

The deeper "why" behind each item lives in `THREAT-MODEL.md`,
`SECURITY.md`, and `docs/DEPLOYMENT.md`. This file is the operator's
checklist.

> Scope: a single Shade prekey container (`@shade/server`) plus any
> consumer apps that talk to it. For E2EE file transfer hardening
> (max-size, retention, quotas), see the **Hardening** and **Retention**
> sections of `docs/streams.md`.

---

## 1. TLS termination

- [ ] Public traffic is **TLS 1.2+ only** — Shade itself speaks plain HTTP
      and assumes a reverse proxy (Caddy, Traefik, nginx, Dokploy's
      built-in proxy) terminates TLS in front of it.
- [ ] HSTS is on (`Strict-Transport-Security: max-age=15552000`).
- [ ] The proxy is configured to pass the original `Host` header through
      so signed payloads bound to the canonical address don't trip the
      replay-window check on a mismatch.
- [ ] Internal traffic between consumer apps and the prekey container
      runs on a private network (Docker bridge / VPC); the prekey port
      is **not** exposed to the public internet without TLS in front.

> **Why:** identity signatures and observer bearer tokens travel in
> request bodies / headers. Without TLS, a network attacker can read
> the observer token and replay it for the full validity window, and
> can read the metadata (who registers, who fetches whose bundle).
> See `THREAT-MODEL.md § 1` (network attacker).
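
The HSTS item can be checked from a smoke test. The following is a minimal sketch, assuming you already have the raw `Strict-Transport-Security` header value from a response; `hstsMaxAge` and `hstsIsSufficient` are hypothetical helper names, not part of Shade.

```typescript
// Hypothetical helpers, not part of Shade. They check an HSTS header value
// against the max-age used in this checklist (15552000 s = 180 days).
const MIN_HSTS_MAX_AGE_SECONDS = 180 * 24 * 60 * 60; // 15552000

function hstsMaxAge(headerValue: string): number | null {
  // Directives are separated by ';' and directive names are case-insensitive.
  for (const directive of headerValue.split(";")) {
    const parts = directive.trim().split("=");
    const name = (parts[0] ?? "").trim().toLowerCase();
    const value = parts[1];
    if (name === "max-age" && value !== undefined) {
      const raw = value.replace(/"/g, "").trim();
      return /^\d+$/.test(raw) ? Number(raw) : null;
    }
  }
  return null; // no max-age directive present
}

function hstsIsSufficient(headerValue: string | null): boolean {
  if (headerValue === null) return false; // header missing entirely
  const maxAge = hstsMaxAge(headerValue);
  return maxAge !== null && maxAge >= MIN_HSTS_MAX_AGE_SECONDS;
}
```

In a deployment smoke test you would pass in the header read from a request to your public hostname and fail the run when `hstsIsSufficient` returns `false`.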

## 2. Backups

- [ ] **SQLite:** scheduled `sqlite3 /data/shade-prekeys.db ".backup ..."`
      at least daily. The `.db` file plus `-wal` and `-shm` together is
      the recovery unit; never copy the bare `.db` while the container
      is running without using the online backup API.
- [ ] **Postgres:** `pg_dump` (or your provider's snapshot) at least
      daily; verify a restore at least once per quarter.
- [ ] Backups are stored on different infrastructure than the primary
      volume (different host / region / provider).
- [ ] Backups are encrypted at rest (your storage provider's
      server-side encryption, age, or restic with a passphrase).
- [ ] **Restore drill:** at least once before going live, restore the
      backup into a fresh volume and confirm `/health` is green and a
      registered identity is still resolvable.

> **Why:** prekey records contain identity public keys and one-time
> prekeys. Losing them means new sessions can't be established to those
> identities until each user re-registers. Existing sessions keep
> ratcheting on the device-side state.

## 3. Observer token rotation

- [ ] `SHADE_OBSERVER_TOKEN` is set to **≥ 16 chars** of high-entropy
      random data (e.g. `openssl rand -hex 32`). The server logs a
      warning and disables the observer if the token is shorter.
- [ ] The token is held in your secret manager (Dokploy secret, GitHub
      Actions secret, Vault, 1Password CLI), **never** committed to a
      compose file or `.env` checked into git.
- [ ] The token is rotated on a schedule (recommended: every 90 days)
      and immediately if it has been shared with anyone who no longer
      needs access.
- [ ] If you expose the dashboard publicly, you also gate it behind
      basic-auth at the proxy layer — bearer tokens are not
      revocation-friendly on their own.

> **Why:** the observer dashboard exposes metadata about every active
> identity, registration timestamp, and recent activity. Anyone with
> the token can scrape the entire prekey directory.
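
The length and rotation items above can be enforced in a pre-deploy script. This is a sketch under stated assumptions: `checkObserverToken` is a hypothetical name, and the last-rotation date is something you track yourself (Shade is not described as recording it). A length check cannot prove entropy, so it complements generating the token with `openssl rand -hex 32` rather than replacing it.

```typescript
// Hypothetical pre-deploy check; names are illustrative, not Shade APIs.
const MIN_TOKEN_CHARS = 16;    // minimum length from the checklist
const MAX_TOKEN_AGE_DAYS = 90; // recommended rotation schedule
const MS_PER_DAY = 24 * 60 * 60 * 1000;

interface TokenCheck {
  ok: boolean;
  problems: string[];
}

function checkObserverToken(
  token: string | undefined,
  lastRotated: Date, // assumption: tracked by the operator, e.g. in the secret manager
  now: Date,
): TokenCheck {
  const problems: string[] = [];
  if (token === undefined || token.length < MIN_TOKEN_CHARS) {
    problems.push(`token must be at least ${MIN_TOKEN_CHARS} characters`);
  }
  const ageDays = (now.getTime() - lastRotated.getTime()) / MS_PER_DAY;
  if (ageDays > MAX_TOKEN_AGE_DAYS) {
    problems.push(`token is ${Math.floor(ageDays)} days old; rotate it`);
  }
  return { ok: problems.length === 0, problems };
}
```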

## 4. SQLite vs PostgreSQL

Pick one and stick to it.

- [ ] **SQLite** is the default. Use it when **one** Shade container is
      enough, you can tolerate downtime during backup snapshots, and
      your write rate is below ~500 req/s. Path: `SHADE_PREKEY_DB_PATH`,
      default `/data/shade-prekeys.db`.
- [ ] **PostgreSQL** is for multi-replica deployments, shared
      infrastructure, or when you already operate a managed Postgres
      and want one fewer thing to back up. Connection string:
      `SHADE_PREKEY_PG_URL`. Tables are auto-created with the
      `shade_server_*` prefix.
- [ ] Whichever you pick, the data sits on storage that is itself
      encrypted (LUKS, EBS encryption, managed-DB encryption), and any
      network connection to the database uses TLS (`sslmode=require`
      for Postgres; SQLite is a local file, so there is no connection
      to protect).
- [ ] You do **not** mix them in the same deployment. Setting
      `SHADE_PREKEY_PG_URL` overrides SQLite silently — pick one in
      `compose.yml` and document which.

> **Why:** Shade does **not** encrypt the database itself (V3.2 will).
> Disk-level / volume-level encryption is the operator's responsibility
> until at-rest encryption ships.
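
Because `SHADE_PREKEY_PG_URL` silently wins over SQLite, a deploy-time guard can make the backend choice explicit. The sketch below is deliberately stricter than the server behaviour described above: it refuses a configuration where both variables are set. `resolveBackend` and `pgUrlRequiresTls` are hypothetical names, and the `sslmode` check assumes the mode is passed as a URL query parameter.

```typescript
// Hypothetical deploy-time guard; not how Shade itself selects a backend.
type StorageEnv = Record<string, string | undefined>;
type Backend =
  | { kind: "sqlite"; path: string }
  | { kind: "postgres"; url: string };

const DEFAULT_SQLITE_PATH = "/data/shade-prekeys.db"; // default from the checklist

function resolveBackend(env: StorageEnv): Backend {
  const pgUrl = env["SHADE_PREKEY_PG_URL"];
  const dbPath = env["SHADE_PREKEY_DB_PATH"];
  if (pgUrl && dbPath) {
    // The server would silently prefer Postgres; fail loudly instead.
    throw new Error("set SHADE_PREKEY_PG_URL or SHADE_PREKEY_DB_PATH, not both");
  }
  if (pgUrl) return { kind: "postgres", url: pgUrl };
  return { kind: "sqlite", path: dbPath || DEFAULT_SQLITE_PATH };
}

// True when the URL asks for TLS via sslmode=require or a stricter mode.
function pgUrlRequiresTls(url: string): boolean {
  return /[?&]sslmode=(require|verify-ca|verify-full)(&|$)/.test(url);
}
```

Run it against the same environment the container will receive, and record the resolved `kind` in your deployment notes so the choice is documented.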

## 5. Log level and structured logs

- [ ] `SHADE_LOG_LEVEL` is set to `info` (production) or `warn`
      (high-traffic). Avoid `debug` in prod — it logs request bodies
      including signed payloads.
- [ ] Logs are shipped to a retention-bounded sink (Loki, CloudWatch,
      Datadog) with **redaction of `Authorization` headers and signed
      bodies** if your sink doesn't already strip them.
- [ ] You alert on `error`-level logs and on the absence of cleanup
      cycles (a stuck cleanup loop = unbounded DB growth).

> **Why:** at `debug` level the server logs signature material. While
> Ed25519 signatures are not secrets per se, leaking them widens the
> replay-window blast radius and reveals timing patterns.
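
If your log sink does not strip sensitive fields, redaction can happen in a small shipper-side transform. This sketch assumes a log record shape (`level`, `msg`, optional `headers` and `body`) that is purely illustrative; the source does not specify Shade's actual log format, so adapt the field names to what your logs contain.

```typescript
// Assumed record shape for illustration only; not Shade's real log schema.
interface LogRecord {
  level: string;
  msg: string;
  headers?: Record<string, string>;
  body?: unknown;
}

const REDACTED = "[REDACTED]";

// Returns a copy with the Authorization header and any request body masked.
function redactLogRecord(record: LogRecord): LogRecord {
  const out: LogRecord = { level: record.level, msg: record.msg };
  if (record.headers !== undefined) {
    const headers: Record<string, string> = {};
    for (const name of Object.keys(record.headers)) {
      // Header names are case-insensitive, so compare in lower case.
      const value = record.headers[name] ?? "";
      headers[name] = name.toLowerCase() === "authorization" ? REDACTED : value;
    }
    out.headers = headers;
  }
  if (record.body !== undefined) {
    out.body = REDACTED; // signed bodies are dropped wholesale
  }
  return out;
}
```

Masking the whole body is the conservative choice; if you need some body fields for debugging, an allow-list of known-safe keys is a safer extension than a deny-list.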

## 6. Stale-identity cleanup parameters

- [ ] `SHADE_STALE_DAYS` is set deliberately for your product. The
      default (30 days) is right for "active chat app"; "occasional
      use" apps should bump to 90+ to avoid surprise re-registration.
- [ ] `SHADE_CLEANUP_INTERVAL_HOURS` is left at 24 unless you have a
      specific reason: running cleanup more often does not free more
      space, and running it less often leaves stale records in place
      longer past the `SHADE_STALE_DAYS` cutoff.
- [ ] You watch the `shade_cleanup_purged_total` metric (Prometheus) and
      alert on a sudden 10× spike — that often signals a bug or a
      deployment that broke client-side activity timestamps.

> **Why:** stale cleanup is the only thing keeping the prekey directory
> from growing forever. A misconfigured `SHADE_STALE_DAYS=0` would
> nuke every identity on every cycle. Bound the value at ≥ 1 in your
> deployment config.
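
One way to bound the value at ≥ 1 is to validate the cleanup settings before the container starts. The sketch below is a hypothetical deploy-time check, not Shade's own parser; the fallback of 30 days comes from the documented default, and 24 hours is assumed to be the default interval because the checklist says to leave it at 24.

```typescript
// Hypothetical deploy-time validation; Shade's own parsing may differ.
type CleanupEnv = Record<string, string | undefined>;

interface CleanupConfig {
  staleDays: number;
  intervalHours: number;
}

function parsePositiveInt(name: string, raw: string | undefined, fallback: number): number {
  if (raw === undefined || raw.trim() === "") return fallback; // unset: use default
  const text = raw.trim();
  if (!/^\d+$/.test(text)) {
    throw new Error(`${name} must be a whole number, got "${raw}"`);
  }
  const value = Number(text);
  if (value < 1) {
    // 0 would purge every identity on every cleanup cycle.
    throw new Error(`${name} must be >= 1, got ${value}`);
  }
  return value;
}

function loadCleanupConfig(env: CleanupEnv): CleanupConfig {
  return {
    staleDays: parsePositiveInt("SHADE_STALE_DAYS", env["SHADE_STALE_DAYS"], 30),
    intervalHours: parsePositiveInt(
      "SHADE_CLEANUP_INTERVAL_HOURS",
      env["SHADE_CLEANUP_INTERVAL_HOURS"],
      24,
    ),
  };
}
```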

## 7. Secret rotation

- [ ] Identity signing keys: each consumer rotates via the documented
      identity-rotation flow (7-day grace period for old sessions).
      Operators do **not** touch identity keys directly.
- [ ] Observer token: see § 3.
- [ ] Database credentials (Postgres only): rotate per your standard
      cadence, with the connection string supplied through the secret
      manager.
- [ ] No long-lived API keys or service tokens are stored in the
      container image or volume.

## 8. Rate-limit and body-size caps

- [ ] You have not lowered the built-in rate limits below the defaults
      (per-IP register/bundle and per-identity replenish/delete).
- [ ] You have not raised the 64 KiB POST body limit. Prekey bundles
      fit comfortably; raising the limit only enables abuse.
- [ ] Your reverse proxy enforces an additional connection / request-
      rate limit at the edge (Caddy `rate_limit`, Cloudflare, etc.)
      so a single noisy IP can't even reach Shade's per-route limits.

## 9. Health checks and metrics scrape

- [ ] Container has a Docker `HEALTHCHECK` (the official image already
      ships one against `/health`).
- [ ] `/metrics` is scraped by Prometheus / OpenTelemetry and
      retained ≥ 30 days.
- [ ] Alerts are wired for: `/health` failing for > 2 min, request
      latency p99 > 1 s, error rate > 1 %, cleanup cycles missing
      for > 25 h.
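
The four alert thresholds above can be written down as one small predicate, which is useful for unit-testing your alerting logic before translating it into your monitoring system's rule format. The snapshot shape and alert names here are hypothetical; the source does not list the metric names Shade exports for these values.

```typescript
// Illustrative snapshot of already-aggregated values; field names are assumptions.
interface HealthSnapshot {
  healthFailingSeconds: number;  // how long /health has been failing
  p99LatencySeconds: number;     // request latency, 99th percentile
  errorRate: number;             // fraction of requests, 0.01 = 1 %
  hoursSinceLastCleanup: number; // time since the last completed cleanup cycle
}

// Returns the names of every alert whose threshold is exceeded.
function firingAlerts(s: HealthSnapshot): string[] {
  const alerts: string[] = [];
  if (s.healthFailingSeconds > 2 * 60) alerts.push("health-failing");
  if (s.p99LatencySeconds > 1) alerts.push("latency-p99");
  if (s.errorRate > 0.01) alerts.push("error-rate");
  if (s.hoursSinceLastCleanup > 25) alerts.push("cleanup-missing");
  return alerts;
}
```

The 25 h cleanup threshold gives a one-hour margin over the 24 h interval, so a cycle that merely runs slightly late does not page anyone.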

## 10. OpenAPI contract drift

- [ ] CI runs the OpenAPI lint (`bun test packages/shade-server/tests/openapi-lint.test.ts`)
      on every PR — the spec must remain valid OpenAPI 3.1 with no
      dangling `$ref`s.
- [ ] Generated clients (Python, Go, Kotlin) are regenerated from the
      shipped spec on each release; mismatches between server and
      client are caught at integration test time, not production.

---

## Pre-flight summary

If you can answer "yes" to every box above, ship it. If you can't,
write down which box and why before you do — that note belongs in your
runbook so the next operator inherits the gap, not the surprise.

release(v4.0.0): Shade GA — V3.x consolidation + audit prep

V3.1 → V3.12 consolidated and tagged for the first GA release. Wire
format unchanged from 0.4.x — 4.0 peers interoperate with 0.4.x peers
byte-for-byte. The version bump is semantic: audit-cycle complete,
opt-in surface fully exposed, threat model refreshed for every new
surface.

Highlights:
- All 24 @shade/* packages bumped to 4.0.0 in lockstep.
- CHANGELOG 4.0.0 section is the canonical manifest of what landed.
- THREAT-MODEL extended (§10 fingerprint gates, §11 WebRTC P2P, §12
  Web-Worker boundary) + residual-risks table refreshed.
- OpenAPI now covers all 27 routes: prekey, transfer, KT, inbox,
  bridge, observer, /metrics, /healthz, /ready.
- MIGRATION 0.3.x → 4.0 documented + smoke-tested against
  shade migrate-storage on a real SQLite DB.
- docs/audit/REVIEW-BUNDLE.md + SCOPE.md ready for external reviewer.
- scripts/soak.ts harness for the GA-stable 2-week soak window.
- All V*.md plans archived under docs/archive/ with Status: Done.
- Voice/Video carved out into V5.0; 4.0 audit focuses on the frozen
  non-realtime stack.

Tests: TS 1000/1000 + Kotlin 11/11 cross-platform vectors green.
Docker: gt.zyon.no/stian/shade-prekey:4.0.0 builds and reports
  version 4.0.0 on /health.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-03 18:35:35 +02:00