# Shade Production Checklist

A flat punch-list for taking a Shade prekey server from "it boots" to
"production-ready". Every item below is a hard gate: if you can't tick it,
don't ship.

The deeper "why" behind each item lives in `THREAT-MODEL.md`,
`SECURITY.md`, and `docs/DEPLOYMENT.md`. This file is the operator's
checklist.

> Scope: a single Shade prekey container (`@shade/server`) plus any
> consumer apps that talk to it. For E2EE file transfer hardening
> (max-size, retention, quotas), see the **Hardening** and **Retention**
> sections of `docs/streams.md`.

---

## 1. TLS termination

- [ ] Public traffic is **TLS 1.2+ only**. Shade itself speaks plain HTTP
      and assumes a reverse proxy (Caddy, Traefik, nginx, Dokploy's
      built-in proxy) terminates TLS in front of it.
- [ ] HSTS is on (`Strict-Transport-Security: max-age=15552000`).
- [ ] The proxy is configured to pass the original `Host` header through
      so signed payloads bound to the canonical address don't trip the
      replay-window check on a mismatch.
- [ ] Internal traffic between consumer apps and the prekey container
      runs on a private network (Docker bridge / VPC); the prekey port
      is **not** exposed to the public internet without TLS in front.

> **Why:** identity signatures and observer bearer tokens travel in
> request bodies / headers. Without TLS, a network attacker can read
> the observer token and replay it for the full validity window, and
> can read the metadata (who registers, who fetches whose bundle).
> See `THREAT-MODEL.md § 1` (network attacker).

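The HSTS item can be checked mechanically. Below is a minimal TypeScript sketch, assuming you already have the raw `Strict-Transport-Security` header value from a request made through your proxy. The helper name `hstsMaxAgeOk` is hypothetical and not part of Shade; the default floor simply reuses the `max-age` value from the checklist.

```typescript
// Hypothetical helper, not part of Shade. Parses a Strict-Transport-Security
// header value and checks that max-age meets a minimum (in seconds).
function hstsMaxAgeOk(header: string | null, minSeconds = 15552000): boolean {
  if (header === null) return false; // header absent: HSTS is off
  for (const directive of header.split(";")) {
    const [name = "", value] = directive.trim().split("=");
    if (name.toLowerCase() === "max-age" && value !== undefined) {
      // The value may be a quoted string; strip quotes before parsing.
      const seconds = Number(value.trim().replace(/^"|"$/g, ""));
      return Number.isInteger(seconds) && seconds >= minSeconds;
    }
  }
  return false; // no max-age directive found
}
```

In a smoke test you would feed it the header from a real response, for example the result of `response.headers.get("strict-transport-security")` after a `fetch` against your public hostname.
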
## 2. Backups

- [ ] **SQLite:** scheduled `sqlite3 /data/shade-prekeys.db ".backup ..."`
      at least daily. The `.db` file plus `-wal` and `-shm` together is
      the recovery unit; never copy the bare `.db` while the container
      is running without using the online backup API.
- [ ] **Postgres:** `pg_dump` (or your provider's snapshot) at least
      daily; verify a restore at least once per quarter.
- [ ] Backups are stored on different infrastructure than the primary
      volume (different host / region / provider).
- [ ] Backups are encrypted at rest (your storage provider's
      server-side encryption, age, or restic with a passphrase).
- [ ] **Restore drill:** at least once before going live, restore the
      backup into a fresh volume and confirm `/health` is green and a
      registered identity is still resolvable.

> **Why:** prekey records contain identity public keys and one-time
> prekeys. Losing them means new sessions can't be established to those
> identities until each user re-registers. Existing sessions keep
> ratcheting on the device-side state.

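The SQLite item above can be turned into a small scheduling helper. The following TypeScript sketch builds the argument list for an online backup and flags a backup that has missed the daily window. The function names, the destination directory, and the file-name pattern are all assumptions for illustration; only the `sqlite3 ... ".backup ..."` form and the daily cadence come from the checklist.

```typescript
// Hypothetical helper: builds the argv for an online SQLite backup.
// Assumes destDir contains no whitespace, since the path is passed unquoted.
function sqliteBackupArgv(dbPath: string, destDir: string, now: Date): string[] {
  // ISO timestamp with ":" and "." replaced so it is safe in file names.
  const stamp = now.toISOString().replace(/[:.]/g, "-");
  const dest = `${destDir}/shade-prekeys-${stamp}.db`;
  // ".backup FILE" is a sqlite3 shell dot-command backed by the online backup API.
  return ["sqlite3", dbPath, `.backup ${dest}`];
}

// True when the newest backup is older than maxAgeHours (default 24, i.e. "at least daily").
function backupIsStale(lastBackup: Date, now: Date, maxAgeHours = 24): boolean {
  return now.getTime() - lastBackup.getTime() > maxAgeHours * 3600 * 1000;
}
```

Passing the argv to a process spawner rather than a shell string avoids quoting problems with the dot-command.
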
## 3. Observer token rotation

- [ ] `SHADE_OBSERVER_TOKEN` is set to **≥ 16 chars** of high-entropy
      random data (e.g. `openssl rand -hex 32`). The server logs a
      warning and disables the observer if the token is shorter.
- [ ] The token is held in your secret manager (Dokploy secret, GitHub
      Actions secret, Vault, 1Password CLI), **never** committed to a
      compose file or `.env` checked into git.
- [ ] The token is rotated on a schedule (recommended: every 90 days)
      and immediately if it has been shared with anyone who no longer
      needs access.
- [ ] If you expose the dashboard publicly, you also gate it behind
      basic-auth at the proxy layer, because bearer tokens are not
      revocation-friendly on their own.

> **Why:** the observer dashboard exposes metadata about every active
> identity, including its registration timestamp and recent activity.
> Anyone with the token can scrape the entire prekey directory.

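Both the length floor and the rotation cadence are easy to assert in a pre-flight script. Here is a minimal TypeScript sketch; the function names are hypothetical, and only the 16-character minimum and the 90-day recommendation come from the checklist. A length check says nothing about entropy, so it complements rather than replaces generating the token with a proper random source.

```typescript
// Hypothetical pre-flight helpers, not part of Shade.
const MIN_TOKEN_CHARS = 16; // below this the server disables the observer
const ROTATION_DAYS = 90;   // recommended rotation cadence

function tokenLongEnough(token: string | undefined): boolean {
  return token !== undefined && token.length >= MIN_TOKEN_CHARS;
}

// True when the token is older than maxDays and should be rotated.
function rotationOverdue(lastRotated: Date, now: Date, maxDays = ROTATION_DAYS): boolean {
  const ageDays = (now.getTime() - lastRotated.getTime()) / (24 * 60 * 60 * 1000);
  return ageDays > maxDays;
}
```
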
## 4. SQLite vs PostgreSQL

Pick one and stick to it.

- [ ] **SQLite** is the default. Use it when **one** Shade container is
      enough, you can tolerate downtime during backup snapshots, and
      your write rate is below ~500 req/s. Path: `SHADE_PREKEY_DB_PATH`,
      default `/data/shade-prekeys.db`.
- [ ] **PostgreSQL** is for multi-replica deployments, shared
      infrastructure, or when you already operate a managed Postgres
      and want one fewer thing to back up. Connection URL:
      `SHADE_PREKEY_PG_URL`. Tables are auto-created with the
      `shade_server_*` prefix.
- [ ] Whichever you pick, connections to the database use TLS
      (`sslmode=require` for Postgres) and the underlying storage is
      itself encrypted (LUKS, EBS encryption, managed-DB encryption).
- [ ] You do **not** mix them in the same deployment. Setting
      `SHADE_PREKEY_PG_URL` silently overrides SQLite, so pick one in
      `compose.yml` and document which.

> **Why:** Shade does **not** encrypt the database itself (V3.2 will).
> Disk-level / volume-level encryption is the operator's responsibility
> until at-rest encryption ships.

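Because `SHADE_PREKEY_PG_URL` wins silently, it can help to make the effective backend explicit before deploy. The TypeScript sketch below inspects an environment map and reports which backend would be used plus any warnings. The function and type names are hypothetical, and treating an empty string as "unset" is an assumption about how Shade reads the variable.

```typescript
type Env = Record<string, string | undefined>;

interface BackendReport {
  backend: "sqlite" | "postgres";
  warnings: string[];
}

// Hypothetical pre-flight check, not part of Shade.
function describeBackend(env: Env): BackendReport {
  const warnings: string[] = [];
  const pgUrl = env["SHADE_PREKEY_PG_URL"];
  if (pgUrl === undefined || pgUrl === "") {
    return { backend: "sqlite", warnings };
  }
  if (env["SHADE_PREKEY_DB_PATH"] !== undefined) {
    warnings.push("both backends configured; SHADE_PREKEY_PG_URL wins silently");
  }
  // sslmode is a standard libpq connection-URI parameter.
  const sslmode = new URL(pgUrl).searchParams.get("sslmode");
  if (sslmode !== "require" && sslmode !== "verify-ca" && sslmode !== "verify-full") {
    warnings.push("Postgres URL does not set sslmode=require (or stricter)");
  }
  return { backend: "postgres", warnings };
}
```

Failing the deploy when `warnings` is non-empty turns the "pick one and document which" rule into an enforced gate.
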
## 5. Log level and structured logs

- [ ] `SHADE_LOG_LEVEL` is set to `info` (production) or `warn`
      (high-traffic). Avoid `debug` in prod; it logs request bodies
      including signed payloads.
- [ ] Logs are shipped to a retention-bounded sink (Loki, CloudWatch,
      Datadog) with **redaction of `Authorization` headers and signed
      bodies** if your sink doesn't already strip them.
- [ ] You alert on `error`-level logs and on the absence of cleanup
      cycles (a stuck cleanup loop = unbounded DB growth).

> **Why:** at `debug` level the server logs signature material. While
> Ed25519 signatures are not secrets per se, leaking them widens the
> replay-window blast radius and reveals timing patterns.

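If your sink does not strip sensitive fields, a redaction pass can run before logs leave the host. This is a minimal TypeScript sketch of such a pass over a parsed JSON log record. The shape of Shade's log records is not documented here, so the field names `body` and `signature` are assumptions; only `Authorization` is named by the checklist.

```typescript
type LogRecord = { [key: string]: unknown };

// Field names treated as sensitive (compared case-insensitively).
// "body" and "signature" are assumed names, adjust to your actual log shape.
const SENSITIVE_KEYS = new Set(["authorization", "body", "signature"]);

// Returns a deep copy of the value with sensitive fields replaced.
function redact(value: unknown): unknown {
  if (Array.isArray(value)) return value.map(redact);
  if (value !== null && typeof value === "object") {
    const out: LogRecord = {};
    for (const [key, inner] of Object.entries(value as LogRecord)) {
      out[key] = SENSITIVE_KEYS.has(key.toLowerCase()) ? "[REDACTED]" : redact(inner);
    }
    return out;
  }
  return value; // primitives pass through unchanged
}
```
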
## 6. Stale-identity cleanup parameters

- [ ] `SHADE_STALE_DAYS` is set deliberately for your product. The
      default (30 days) is right for "active chat app"; "occasional
      use" apps should bump to 90+ to avoid surprise re-registration.
- [ ] `SHADE_CLEANUP_INTERVAL_HOURS` is left at 24 unless you have a
      specific reason: running cleanup more often does not free more
      space, and running it less often risks one cycle missing a day.
- [ ] You watch the `shade_cleanup_purged_total` metric (Prometheus) and
      alert on a sudden 10× spike; that often signals a bug or a
      deployment that broke client-side activity timestamps.

> **Why:** stale cleanup is the only thing keeping the prekey directory
> from growing forever. A misconfigured `SHADE_STALE_DAYS = 0` would
> nuke every identity on every cycle. Bound the value at ≥ 1 in your
> deployment config.

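The lower bound on `SHADE_STALE_DAYS` and the 10× spike rule can both be expressed as small checks. The TypeScript sketch below uses hypothetical function names. It assumes an unset variable falls back to the documented 30-day default, and that you feed the spike check per-cycle deltas of `shade_cleanup_purged_total` rather than the raw counter.

```typescript
// Parses SHADE_STALE_DAYS and rejects anything below 1.
function parseStaleDays(raw: string | undefined): number {
  if (raw === undefined) return 30; // assumed fallback: the documented default
  const days = Number(raw);
  if (!Number.isInteger(days) || days < 1) {
    throw new Error(`SHADE_STALE_DAYS must be an integer >= 1, got "${raw}"`);
  }
  return days;
}

// Flags a cycle whose purge count is at least `factor` times the mean of
// earlier cycles. The baseline is floored at 1 so an all-zero history does
// not turn a single purge into a spike.
function purgeSpike(previous: number[], latest: number, factor = 10): boolean {
  if (previous.length === 0) return false; // no baseline yet
  const mean = previous.reduce((sum, n) => sum + n, 0) / previous.length;
  return latest >= factor * Math.max(mean, 1);
}
```
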
## 7. Secret rotation

- [ ] Identity signing keys: each consumer rotates via the documented
      identity-rotation flow (7-day grace period for old sessions).
      Operators do **not** touch identity keys directly.
- [ ] Observer token: see § 3.
- [ ] Database credentials (Postgres only): rotate per your standard
      cadence, with the connection string supplied through the secret
      manager.
- [ ] No long-lived API keys or service tokens are stored in the
      container image or volume.

## 8. Rate-limit and body-size caps

- [ ] You have not lowered the built-in rate limits below the defaults
      (per-IP register/bundle and per-identity replenish/delete).
- [ ] You have not raised the 64 KiB POST body limit. Prekey bundles
      fit comfortably; raising the limit only enables abuse.
- [ ] Your reverse proxy enforces an additional connection /
      request-rate limit at the edge (Caddy `rate_limit`, Cloudflare,
      etc.) so a single noisy IP can't even reach Shade's per-route
      limits.

## 9. Health checks and metrics scrape

- [ ] Container has a Docker `HEALTHCHECK` (the official image already
      ships one against `/health`).
- [ ] `/metrics` is scraped by Prometheus / OpenTelemetry and
      retained ≥ 30 days.
- [ ] Alerts are wired for: `/health` failing for > 2 min, request
      latency p99 > 1 s, error rate > 1 %, and cleanup cycles missing
      for > 25 h.

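The four alert thresholds above can be captured in one place so they stay consistent across dashboards and scripts. The following TypeScript sketch assumes you already derive a snapshot of these values from your metrics backend. The `Snapshot` shape and the alert names are hypothetical; the thresholds are the ones listed in the checklist.

```typescript
interface Snapshot {
  healthFailingSeconds: number;  // how long /health has been failing
  p99LatencySeconds: number;     // request latency, 99th percentile
  errorRate: number;             // fraction of requests, 0..1
  hoursSinceLastCleanup: number; // time since the last cleanup cycle
}

// Returns the names of all alerts that should fire for this snapshot.
function firingAlerts(s: Snapshot): string[] {
  const alerts: string[] = [];
  if (s.healthFailingSeconds > 120) alerts.push("health"); // > 2 min
  if (s.p99LatencySeconds > 1) alerts.push("latency");     // > 1 s
  if (s.errorRate > 0.01) alerts.push("errors");           // > 1 %
  if (s.hoursSinceLastCleanup > 25) alerts.push("cleanup"); // > 25 h
  return alerts;
}
```

In practice these rules usually live in your alerting system's own rule language; the sketch is mainly useful as an executable statement of the thresholds.
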
## 10. OpenAPI contract drift

- [ ] CI runs the OpenAPI lint (`bun test packages/shade-server/tests/openapi-lint.test.ts`)
      on every PR; the spec must remain valid OpenAPI 3.1 with no
      dangling `$ref`s.
- [ ] Generated clients (Python, Go, Kotlin) are regenerated from the
      shipped spec on each release; mismatches between server and
      client are caught at integration test time, not production.

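To illustrate what "no dangling `$ref`s" means in practice, here is a minimal TypeScript sketch of a local-reference check over a parsed spec. It is not the project's actual lint test; the function names are hypothetical, and it only follows same-document references of the form `#/...`, ignoring external files and percent-encoded fragments.

```typescript
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Resolves a local JSON Pointer reference such as "#/components/schemas/Bundle".
function resolvePointer(root: Json, ref: string): Json | undefined {
  let node: Json | undefined = root;
  for (const rawToken of ref.slice(2).split("/")) {
    // Unescape per RFC 6901: "~1" becomes "/", then "~0" becomes "~".
    const token = rawToken.replace(/~1/g, "/").replace(/~0/g, "~");
    if (node === null || typeof node !== "object") return undefined;
    node = Array.isArray(node) ? node[Number(token)] : node[token];
  }
  return node;
}

// Walks the whole document and returns every local $ref that resolves to nothing.
function danglingRefs(root: Json): string[] {
  const missing: string[] = [];
  const visit = (node: Json): void => {
    if (node === null || typeof node !== "object") return;
    if (Array.isArray(node)) { node.forEach((item) => visit(item)); return; }
    const ref = node["$ref"];
    if (typeof ref === "string" && ref.startsWith("#/") && resolvePointer(root, ref) === undefined) {
      missing.push(ref);
    }
    Object.values(node).forEach((child) => visit(child));
  };
  visit(root);
  return missing;
}
```

A CI step would parse the shipped spec, call `danglingRefs`, and fail the build when the returned list is non-empty.
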
---

## Pre-flight summary

If you can answer "yes" to every box above, ship it. If you can't,
write down which box and why before you do. That note belongs in your
runbook so the next operator inherits the gap, not the surprise.