Shade/docs/observability.md
Sterister e6fdf31b49
release(v4.0.0): Shade GA — V3.x consolidation + audit prep
V3.1 → V3.12 consolidated and tagged for the first GA release. Wire
format unchanged from 0.4.x — 4.0 peers interoperate with 0.4.x peers
byte-for-byte. The version bump is semantic: audit-cycle complete,
opt-in surface fully exposed, threat model refreshed for every new
surface.

Highlights:
- All 24 @shade/* packages bumped to 4.0.0 in lockstep.
- CHANGELOG 4.0.0 section is the canonical manifest of what landed.
- THREAT-MODEL extended (§10 fingerprint gates, §11 WebRTC P2P, §12
  Web-Worker boundary) + residual-risks table refreshed.
- OpenAPI now covers all 27 routes: prekey, transfer, KT, inbox,
  bridge, observer, /metrics, /healthz, /ready.
- MIGRATION 0.3.x → 4.0 documented + smoke-tested against
  shade migrate-storage on a real SQLite DB.
- docs/audit/REVIEW-BUNDLE.md + SCOPE.md ready for external reviewer.
- scripts/soak.ts harness for the GA-stable 2-week soak window.
- All V*.md plans archived under docs/archive/ with Status: Done.
- Voice/Video carved out into V5.0; 4.0 audit focuses on the frozen
  non-realtime stack.

Tests: TS 1000/1000 + Kotlin 11/11 cross-platform vectors green.
Docker: gt.zyon.no/stian/shade-prekey:4.0.0 builds and reports
  version 4.0.0 on /health.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 18:35:35 +02:00

# Observability v2 — OpenTelemetry tracing
Shade ships an opt-in OpenTelemetry layer that wraps `TransferEngine`,
`ShadeSessionManager`, the prekey HTTP routes, and `@shade/files`
op-handlers in distributed spans. The layer is **off by default** and
PII-safe by construction — span attributes never include peer addresses,
plaintext payloads, or exact byte counts.
This complements the always-on Prometheus metrics exposed by
`@shade/server` and the structural events emitted by `@shade/core`. Use
metrics for aggregate counters and histograms, tracing for per-request
causality and tail-latency hunting.
---
## Quick start
```ts
import { trace } from '@opentelemetry/api';
import { withTracer } from '@shade/observability';
import { createShade } from '@shade/sdk';
// Use the OTel SDK of your choice (NodeSDK + OTLP exporter, Honeycomb,
// Sentry's OTel adapter, …) to register a tracer provider on the
// `@opentelemetry/api` global. Then:
const tracer = trace.getTracer('my-app');
const shade = await createShade({
  prekeyServer: 'https://shade.example.com',
  storage: 'sqlite:/data/shade.db',
  observability: withTracer(tracer, { sample: 0.1 }),
});
```
The hook propagates automatically to:
- `ShadeSessionManager.encrypt` / `.decrypt` (per-peer mutex acquisition,
ratchet step).
- `TransferEngine.upload` / accepted incoming downloads (lane count,
retry count, partition mode).
- `@shade/files` op-handlers (per request, with op + result).
For the prekey server pass the hook to `createPrekeyRoutes`:
```ts
import { createPrekeyRoutes } from '@shade/server';
import { withTracer } from '@shade/observability';
const app = createPrekeyRoutes(store, crypto, {
  observability: withTracer(tracer),
});
```
---
## Off-by-default semantics
`withTracer()` returns a no-op hook, so the SDK never starts spans, when
**any** of the following is true:
1. The `tracer` argument is `undefined`/`null`.
2. The `SHADE_OTEL_ENABLED` env-var is not set to `1` or `true`. Override
with `withTracer(tracer, { force: true })`, or override the var name
with `withTracer(tracer, { envVar: 'MY_VAR' })`.
3. The configured `sample` rate is `0`.
Per-span sampling (`sample: 0.1` = 10 %) keeps trace volume bounded in
production. Default is `1` (sample everything when the hook is active).
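
The three conditions and the per-span sampling rule above can be sketched as two small pure functions. This is an illustration only, not the `@shade/observability` source: the names `tracingEnabled`, `shouldSample`, and `GateOptions` are hypothetical, and the exact-match comparison against `1`/`true` is an assumption.

```typescript
// Hypothetical names throughout; not the actual @shade/observability code.
interface GateOptions {
  sample?: number; // per-span sampling rate in [0, 1]; defaults to 1
  force?: boolean; // bypass the env-var check
  envVar?: string; // name of the env-var to consult
}

type Env = Record<string, string | undefined>;

/** True when the hook should be live, false when it should be a no-op. */
function tracingEnabled(
  tracer: unknown,
  opts: GateOptions = {},
  env: Env = {},
): boolean {
  // 1. Without a tracer there is nothing to record into.
  if (tracer === undefined || tracer === null) return false;
  // 2. Env-var gate, skipped only when explicitly forced.
  if (!opts.force) {
    const value = env[opts.envVar ?? 'SHADE_OTEL_ENABLED'];
    if (value !== '1' && value !== 'true') return false;
  }
  // 3. A sample rate of 0 disables the hook entirely.
  return (opts.sample ?? 1) > 0;
}

/** Per-span decision; `rand` is injectable so tests stay deterministic. */
function shouldSample(sample: number, rand: () => number = Math.random): boolean {
  return rand() < sample;
}
```

In a Node process the third argument would typically be `process.env`; taking it as a parameter keeps the sketch testable without touching global state.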
---
## PII policy — what is safe to log, and what isn't
| Category | Status | Why |
|----------|--------|-----|
| **Peer hash** (`shade.peer.hash`) | ✅ allowed | 8-hex-char pseudonym derived via SHA-256. Stable across spans for a given address but does not expose the address itself. |
| **Bytes bin** (`shade.bytes.bin`) | ✅ allowed | One of `≤4KB`, `4–64KB`, `64KB–1MB`, `1–10MB`, `10–100MB`, `100MB–1GB`, `≥1GB`. Coarse enough to mask file-size fingerprinting. |
| **Lane count** (`shade.lane.count`) | ✅ allowed | Snapped to `{1, 4, 16, 64}`. |
| **Retry count** (`shade.retry.count`) | ✅ allowed | Integer. |
| **Error code** (`shade.error.code`) | ✅ allowed | `SHADE_*` stable string code — never the full message, which may interpolate user input. |
| **Op kind** (`shade.op`) | ✅ allowed | `list`, `read`, `write`, `custom:foo`, etc. |
| **Route template** (`shade.route`) | ✅ allowed | `/v1/keys/bundle/:address` — the template, never the resolved path. |
| **HTTP status** (`shade.http.status`) | ✅ allowed | Integer status code. |
| **Partition mode** (`shade.partition`) | ✅ allowed | `range` or `round-robin`. |
| **Direction** (`shade.direction`) | ✅ allowed | `upload` or `download`. |
| Plaintext peer addresses | ❌ forbidden | Use `peerHash()`. |
| Plaintext message/file payloads | ❌ forbidden | Encryption boundary — never log. |
| Exact byte counts | ❌ forbidden | Use `bytesBin()`. |
| User identifiers (email, DID, `device:UUID`) | ❌ forbidden | Treat as PII. |
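
To make the allowed categories concrete, here is a minimal sketch of how such coarsening helpers could look. These are stand-ins, not the library's `peerHash()` / `bytesBin()`: the function names are hypothetical, and several details are assumptions (an unsalted hash truncated to its first 8 hex characters, 1 KB = 1024 bytes, inclusive upper bin bounds, and lane counts rounded up to the next bucket).

```typescript
import { createHash } from 'node:crypto';

const KB = 1024, MB = 1024 * KB, GB = 1024 * MB; // assumption: binary units

/** 8-hex-char pseudonym: first 4 bytes of SHA-256 over the address (unsalted here). */
function peerHashSketch(address: string): string {
  return createHash('sha256').update(address, 'utf8').digest('hex').slice(0, 8);
}

// Inclusive upper bounds paired with labels modeled on the table's bins.
const BINS: Array<[number, string]> = [
  [4 * KB, '≤4KB'],
  [64 * KB, '4–64KB'],
  [MB, '64KB–1MB'],
  [10 * MB, '1–10MB'],
  [100 * MB, '10–100MB'],
];

/** Maps an exact byte count to a coarse bin label. */
function bytesBinSketch(bytes: number): string {
  for (const [limit, label] of BINS) {
    if (bytes <= limit) return label;
  }
  return bytes < GB ? '100MB–1GB' : '≥1GB';
}

const LANE_BUCKETS = [1, 4, 16, 64];

/** Snaps a lane count up to the next bucket, capped at 64 (rounding rule assumed). */
function snapLaneCountSketch(lanes: number): number {
  return LANE_BUCKETS.find((bucket) => lanes <= bucket) ?? 64;
}
```

The point of all three is the same: the span carries a value drawn from a small fixed set, so it cannot be used to recover the address, the file size, or the exact concurrency of a transfer.
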
The full attribute-key allow-list is exported from `@shade/observability`
as `ATTR_*` constants. Plug-in authors who want to attach their own tags
should pass each `(key, value)` through `safeAttribute()`, which throws
`UnsafeAttributeError` for any key/value pair that looks like the
forbidden categories above (heuristics: `@`, `device:`, `did:`, key
fragments such as `peer.address` / `bytes.exact`, oversized strings).
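
The heuristics listed above could be approximated as follows. This is a sketch under stated assumptions, not the real `safeAttribute()`: the function and error class names are hypothetical, the 64-character limit for "oversized" strings is invented for illustration, and the real check may be stricter.

```typescript
// Hypothetical stand-in for UnsafeAttributeError.
class UnsafeAttributeSketchError extends Error {}

type AttrValue = string | number | boolean;

const FORBIDDEN_KEY_FRAGMENTS = ['peer.address', 'bytes.exact'];
const FORBIDDEN_VALUE_MARKERS = ['@', 'device:', 'did:'];
const MAX_STRING_LENGTH = 64; // assumed threshold for "oversized"

/** Returns the pair unchanged when it looks safe, throws otherwise. */
function checkAttributeSketch(key: string, value: AttrValue): [string, AttrValue] {
  if (FORBIDDEN_KEY_FRAGMENTS.some((fragment) => key.includes(fragment))) {
    throw new UnsafeAttributeSketchError(`forbidden key fragment in "${key}"`);
  }
  if (typeof value === 'string') {
    if (value.length > MAX_STRING_LENGTH) {
      throw new UnsafeAttributeSketchError(`oversized string value for "${key}"`);
    }
    if (FORBIDDEN_VALUE_MARKERS.some((marker) => value.includes(marker))) {
      // The message deliberately omits the value, which may itself be PII.
      throw new UnsafeAttributeSketchError(`value for "${key}" looks like an identifier`);
    }
  }
  return [key, value];
}
```

Failing loudly at attach time means a leaky plug-in breaks in development rather than silently exporting identifiers to a tracing backend.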
---
## Span surface
### `shade.session.encrypt` / `shade.session.decrypt`
Wraps each per-peer `encrypt`/`decrypt` call. Includes the time spent
waiting on the per-peer mutex (`shade.lock.wait_ms`) — handy for
diagnosing ratchet contention under load.
### `shade.transfer.upload` / `shade.transfer.upload.resume`
Wraps an outbound stream transfer end-to-end. Attributes: `peer.hash`,
`bytes.bin`, `lane.count`, `partition`, `retry.count`, `result`,
`error.code`.
### `shade.transfer.download`
Started when the consumer calls `incoming.accept(...)`, ended when the
transfer completes, aborts, or fails an integrity check. Same attribute
set as upload.
### `shade.prekey.request`
One span per HTTP request handled by `@shade/server`'s prekey routes.
Attributes: `route` (the template), `http.status`, `error.code` on
failure. The address path-parameter is **never** placed on the span.
### `shade.files.op`
One span per `@shade/files` RPC. Attributes: `peer.hash`, `op` (the
resolved op kind, e.g. `read` or `custom:foo`), `bytes.bin` (estimated
plaintext size, binned), `result`, `error.code`.
---
## Recording & testing
`@shade/observability` ships a deterministic in-memory recorder for
unit tests:
```ts
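import { createRecorder } from '@shade/observability';
import { createShade } from '@shade/sdk';

const rec = createRecorder();
const shade = await createShade({
  // ...the rest of your usual config...
  observability: rec,
});
// … exercise code under test …
// No recorded span may contain any of these known-sensitive strings.
const hits = rec.scanForPII(['alice@example.com', 'plaintext-secret']);
expect(hits).toHaveLength(0); // `expect` comes from your test runner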
```
The Shade test suite runs this recorder over every documented entry
point — see
`packages/shade-observability/tests/integration-pii.test.ts` and
`packages/shade-transfer/tests/observability.test.ts`. Any new
instrumentation must keep the suite green.
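
As a rough mental model of what a PII scan over recorded spans involves, the sketch below searches span names, attribute keys, and attribute values for known-sensitive substrings. It is not the recorder's actual implementation: `RecordedSpan`, `PiiHit`, and `scanSpansForPII` are hypothetical names, and the real `scanForPII` may inspect more (or different) fields.

```typescript
type AttrValue = string | number | boolean;

// Hypothetical shape of a recorded span.
interface RecordedSpan {
  name: string;
  attributes: Record<string, AttrValue>;
}

interface PiiHit {
  span: string;   // name of the offending span
  field: string;  // 'name' or the attribute key
  needle: string; // the sensitive string that was found
}

/** Returns one hit per (span, field, needle) occurrence; empty means clean. */
function scanSpansForPII(spans: RecordedSpan[], needles: string[]): PiiHit[] {
  const hits: PiiHit[] = [];
  for (const span of spans) {
    for (const needle of needles) {
      if (span.name.includes(needle)) {
        hits.push({ span: span.name, field: 'name', needle });
      }
      for (const [key, value] of Object.entries(span.attributes)) {
        // Check both sides: a secret can leak through a key as well as a value.
        if (key.includes(needle) || String(value).includes(needle)) {
          hits.push({ span: span.name, field: key, needle });
        }
      }
    }
  }
  return hits;
}
```

Seeding the code under test with recognizable markers (a fake address, a fake payload) and asserting zero hits turns the PII policy into a regression test.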
---
## Performance characteristics
- With OTel **off** (default): every Shade hook resolves to the shared
`NOOP_HOOK` instance. The cost is one function call + an object
allocation that V8 hoists out in the steady state — measured at
< 1 % overhead vs the pre-V3.4 baseline in the upload roundtrip
benchmark.
- With OTel **on**: cost depends entirely on the configured exporter.
Use `sample: 0.1` (or smaller) on hot paths in production.
---
## Adding new instrumentation
1. Identify a logical operation worth a span — typically anything that
crosses a network/disk boundary or contends on a lock.
2. Add an `observability?: ObservabilityHook` to the relevant config
surface, default to `NOOP_HOOK`.
3. Name the span `shade.<area>.<op>` to keep cardinality bounded.
4. Set attributes via the `ATTR_*` constants from
`@shade/observability`. **Never** introduce a new attribute key
without a PII review — if you must, run the value through
`safeAttribute()`.
5. Add a test that exercises the new instrumentation under the
`createRecorder()` recorder and asserts no PII leaks.
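
The five steps above might look like this for a made-up cache-flush operation. Everything here is illustrative: `HookSketch` and `SpanSketch` are a guessed minimal shape (the real `ObservabilityHook` may differ), the constant names and the `SHADE_CACHE_FLUSH_FAILED` code are hypothetical, and `recordingHook()` only stands in for `createRecorder()`.

```typescript
type AttrValue = string | number | boolean;

// Assumed minimal hook shape; not the actual ObservabilityHook interface.
interface SpanSketch {
  setAttribute(key: string, value: AttrValue): void;
  end(): void;
}
interface HookSketch {
  startSpan(name: string): SpanSketch;
}

// Shared no-op instances, so the disabled path allocates nothing per call.
const NOOP_SPAN_SKETCH: SpanSketch = { setAttribute() {}, end() {} };
const NOOP_HOOK_SKETCH: HookSketch = { startSpan: () => NOOP_SPAN_SKETCH };

// Attribute keys from the PII table; the constant names are assumptions.
const ATTR_OP_SKETCH = 'shade.op';
const ATTR_ERROR_CODE_SKETCH = 'shade.error.code';

interface FlushConfig {
  observability?: HookSketch; // step 2: optional, defaults to the no-op hook
}

/** Step 1: a hypothetical operation that crosses a disk boundary. */
function flushCache(config: FlushConfig, writeToDisk: () => void): void {
  const hook = config.observability ?? NOOP_HOOK_SKETCH;
  const span = hook.startSpan('shade.cache.flush'); // step 3: shade.<area>.<op>
  span.setAttribute(ATTR_OP_SKETCH, 'custom:flush'); // step 4: allow-listed key
  try {
    writeToDisk();
  } catch (err) {
    // Record a stable code, never err.message, which may carry user input.
    span.setAttribute(ATTR_ERROR_CODE_SKETCH, 'SHADE_CACHE_FLUSH_FAILED');
    throw err;
  } finally {
    span.end();
  }
}

// Step 5: a tiny in-memory hook for tests.
interface RecordedSpan {
  name: string;
  attributes: Record<string, AttrValue>;
  ended: boolean;
}
function recordingHook(): { hook: HookSketch; spans: RecordedSpan[] } {
  const spans: RecordedSpan[] = [];
  const hook: HookSketch = {
    startSpan(name) {
      const rec: RecordedSpan = { name, attributes: {}, ended: false };
      spans.push(rec);
      return {
        setAttribute: (key, value) => { rec.attributes[key] = value; },
        end: () => { rec.ended = true; },
      };
    },
  };
  return { hook, spans };
}
```

Ending the span in `finally` keeps spans from leaking when the wrapped operation throws, and the fixed span name keeps cardinality independent of user input.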
---
## Migration
Previous versions had no tracing, only Prometheus metrics. The
`observability` field is optional, so existing configs keep working
unchanged, and adding it is fully backwards-compatible. Because of the
`SHADE_OTEL_ENABLED` gate, a hook that reaches production without the
env-var set stays a no-op and adds no unexpected overhead.