Shade/docs/observability.md
Sterister e6fdf31b49
2026-05-03 18:35:35 +02:00


Observability v2 — OpenTelemetry tracing

Shade ships an opt-in OpenTelemetry layer that wraps TransferEngine, ShadeSessionManager, the prekey HTTP routes, and @shade/files op-handlers in distributed spans. The layer is off by default and PII-safe by construction — span attributes never include peer addresses, plaintext payloads, or exact byte counts.

This complements the always-on Prometheus metrics exposed by @shade/server and the structural events emitted by @shade/core. Use metrics for aggregate counters and histograms, tracing for per-request causality and tail-latency hunting.


Quick start

import { trace } from '@opentelemetry/api';
import { withTracer } from '@shade/observability';
import { createShade } from '@shade/sdk';

// Use the OTel SDK of your choice (NodeSDK + OTLP exporter, Honeycomb,
// Sentry's OTel adapter, …) to register a tracer provider on the
// `@opentelemetry/api` global. Then:
const tracer = trace.getTracer('my-app');

const shade = await createShade({
  prekeyServer: 'https://shade.example.com',
  storage: 'sqlite:/data/shade.db',
  observability: withTracer(tracer, { sample: 0.1 }),
});

The hook propagates automatically to:

  • ShadeSessionManager.encrypt / .decrypt (per-peer mutex acquisition, ratchet step).
  • TransferEngine.upload / accepted incoming downloads (lane count, retry count, partition mode).
  • @shade/files op-handlers (per request, with op + result).

For the prekey server pass the hook to createPrekeyRoutes:

import { createPrekeyRoutes } from '@shade/server';
import { withTracer } from '@shade/observability';

const app = createPrekeyRoutes(store, crypto, {
  observability: withTracer(tracer),
});

Off-by-default semantics

withTracer() returns a no-op hook, so the SDK never starts spans, when any of the following is true:

  1. The tracer argument is undefined/null.
  2. The SHADE_OTEL_ENABLED env-var is not set to 1 or true. Override with withTracer(tracer, { force: true }), or override the var name with withTracer(tracer, { envVar: 'MY_VAR' }).
  3. The configured sample rate is 0.

Per-span sampling (sample: 0.1 = 10 %) keeps trace volume bounded in production. The default is 1, which samples every span while the hook is active.
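
The three off conditions and the sampling rate can be read as one small decision function. The following is a minimal sketch of that logic, not the actual withTracer() internals: isTracingActive, shouldSample, and the GateOptions type are hypothetical names, the environment is passed in as a plain object, and exact-match comparison against 1 or true is an assumption.

```typescript
// Hypothetical option names mirroring the text above; the real withTracer()
// internals are not shown in this document.
interface GateOptions {
  sample?: number; // per-span sampling rate in [0, 1]; defaults to 1
  force?: boolean; // bypass the env-var check
  envVar?: string; // name of the gating variable
}

type Env = Record<string, string | undefined>;

// Returns true only when none of the three "off" conditions applies.
function isTracingActive(tracer: unknown, opts: GateOptions = {}, env: Env = {}): boolean {
  // Condition 1: no tracer supplied.
  if (tracer === undefined || tracer === null) return false;

  // Condition 2: gate variable not set to 1 or true, unless forced.
  const raw = env[opts.envVar ?? 'SHADE_OTEL_ENABLED'];
  const enabled = raw === '1' || raw === 'true'; // exact match is an assumption
  if (!enabled && !opts.force) return false;

  // Condition 3: a sample rate of 0 disables tracing entirely.
  return (opts.sample ?? 1) > 0;
}

// Per-span sampling decision. `rand` is injectable so tests are deterministic.
function shouldSample(rate: number, rand: () => number = Math.random): boolean {
  return rand() < rate;
}
```

Passing the environment as an argument keeps the check pure; in a Node process the caller would typically hand in process.env.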


PII policy — what is safe to log, and what isn't

| Category | Status | Why |
|---|---|---|
| Peer hash (shade.peer.hash) | allowed | 8-hex-char pseudonym derived via SHA-256. Stable across spans for a given address but does not expose the address itself. |
| Bytes bin (shade.bytes.bin) | allowed | One of ≤4KB, 4–64KB, 64KB–1MB, 1–10MB, 10–100MB, 100MB–1GB, ≥1GB. Coarse enough to mask file-size fingerprinting. |
| Lane count (shade.lane.count) | allowed | Snapped to {1, 4, 16, 64}. |
| Retry count (shade.retry.count) | allowed | Integer. |
| Error code (shade.error.code) | allowed | SHADE_* stable string code; never the full message, which may interpolate user input. |
| Op kind (shade.op) | allowed | list, read, write, custom:foo, etc. |
| Route template (shade.route) | allowed | The template (e.g. /v1/keys/bundle/:address), never the resolved path. |
| HTTP status (shade.http.status) | allowed | Integer status code. |
| Partition mode (shade.partition) | allowed | range or round-robin. |
| Direction (shade.direction) | allowed | upload or download. |
| Plaintext peer addresses | forbidden | Use peerHash(). |
| Plaintext message/file payloads | forbidden | Encryption boundary; never log. |
| Exact byte counts | forbidden | Use bytesBin(). |
| User identifiers (email, DID, device:UUID) | forbidden | Treat as PII. |
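
To make the allowed categories concrete, here is a rough sketch of how the pseudonymizing and binning helpers could behave. It is written from the table alone and is not the library's implementation: whether the real peerHash() adds a salt or prefix, whether KB means 1024 bytes, which side each bin boundary falls on, and the round-up rule in the hypothetical snapLaneCount() are all assumptions.

```typescript
import { createHash } from 'node:crypto';

// First 8 hex chars of SHA-256 over the address. Whether the real helper
// salts or prefixes its input is not stated in this document.
function peerHash(address: string): string {
  return createHash('sha256').update(address, 'utf8').digest('hex').slice(0, 8);
}

// Binary units are an assumption; the table does not say 1000 vs 1024.
const KB = 1024;
const MB = 1024 * KB;
const GB = 1024 * MB;

// Maps an exact size to one of the seven coarse labels from the table.
// Boundary inclusivity (other than the explicit ≤ and ≥) is assumed.
function bytesBin(n: number): string {
  if (n <= 4 * KB) return '≤4KB';
  if (n < 64 * KB) return '4–64KB';
  if (n < MB) return '64KB–1MB';
  if (n < 10 * MB) return '1–10MB';
  if (n < 100 * MB) return '10–100MB';
  if (n < GB) return '100MB–1GB';
  return '≥1GB';
}

// Hypothetical helper: snaps a lane count up to the next allowed step and
// caps it at 64. The real snapping rule may differ (e.g. nearest step).
const LANE_STEPS = [1, 4, 16, 64];
function snapLaneCount(lanes: number): number {
  for (const step of LANE_STEPS) {
    if (lanes <= step) return step;
  }
  return 64;
}
```

The point of all three helpers is the same: many distinct inputs collapse onto a small, fixed set of outputs, so a span carries too little information to identify a peer or fingerprint a file.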

The full attribute-key allow-list is exported from @shade/observability as ATTR_* constants. Plug-in authors who want to attach their own tags should pass each (key, value) through safeAttribute(), which throws UnsafeAttributeError for any key/value pair that looks like the forbidden categories above (heuristics: @, device:, did:, key fragments such as peer.address / bytes.exact, oversized strings).
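
The heuristics listed above amount to a handful of substring and length checks. The sketch below illustrates that idea under stated assumptions: checkAttribute is a hypothetical stand-in for safeAttribute(), the local UnsafeAttributeError class only mirrors the name used in the text, and the 128-character limit is invented for the example.

```typescript
// Local stand-in that mirrors the error name from the text.
class UnsafeAttributeError extends Error {
  constructor(message: string) {
    super(message);
    this.name = 'UnsafeAttributeError';
  }
}

type AttrValue = string | number | boolean;

const FORBIDDEN_KEY_FRAGMENTS = ['peer.address', 'bytes.exact'];
const FORBIDDEN_VALUE_MARKERS = ['@', 'device:', 'did:'];
const MAX_VALUE_LENGTH = 128; // assumed threshold for "oversized strings"

// Hypothetical validator: returns the pair unchanged when it looks safe and
// throws otherwise. Error messages name the rule, never the offending value,
// so the rejection itself cannot leak PII into logs.
function checkAttribute(key: string, value: AttrValue): [string, AttrValue] {
  for (const fragment of FORBIDDEN_KEY_FRAGMENTS) {
    if (key.includes(fragment)) {
      throw new UnsafeAttributeError(`key contains forbidden fragment "${fragment}"`);
    }
  }
  if (typeof value === 'string') {
    if (value.length > MAX_VALUE_LENGTH) {
      throw new UnsafeAttributeError('string value is oversized');
    }
    for (const marker of FORBIDDEN_VALUE_MARKERS) {
      if (value.includes(marker)) {
        throw new UnsafeAttributeError(`value contains forbidden marker "${marker}"`);
      }
    }
  }
  return [key, value];
}
```

Checks like these are heuristics and can only catch values that look like PII, which is why the PII review for new attribute keys still matters.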


Span surface

shade.session.encrypt / shade.session.decrypt

Wraps each per-peer encrypt/decrypt call. Includes the time spent waiting on the per-peer mutex (shade.lock.wait_ms) — handy for diagnosing ratchet contention under load.

shade.transfer.upload / shade.transfer.upload.resume

Wraps an outbound stream transfer end-to-end. Attributes: peer.hash, bytes.bin, lane.count, partition, retry.count, result, error.code.

shade.transfer.download

Started when the consumer calls incoming.accept(...), ended when the transfer completes, aborts, or fails an integrity check. Same attribute set as upload.

shade.prekey.request

One span per HTTP request handled by @shade/server's prekey routes. Attributes: route (the template), http.status, error.code on failure. The address path-parameter is never placed on the span.
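
As an illustration of why the template rather than the resolved path goes on the span, the sketch below maps a concrete request path back to its template before building attributes. It is not the server's routing code: matchTemplate and prekeySpanAttributes are hypothetical helpers, the one-entry route table is only an example, and the unmatched placeholder is an assumption.

```typescript
// Example route table; only this template appears in the document.
const ROUTE_TEMPLATES = ['/v1/keys/bundle/:address'];

// Finds the template whose segments match the path, treating ":name"
// segments as wildcards.
function matchTemplate(path: string, templates: string[]): string | undefined {
  const parts = path.split('/');
  return templates.find((template) => {
    const templateParts = template.split('/');
    return (
      templateParts.length === parts.length &&
      templateParts.every((segment, i) => segment.startsWith(':') || segment === parts[i])
    );
  });
}

// Builds span attributes for one request. The resolved path is used only for
// matching and is never copied into the result; unknown paths collapse to a
// fixed placeholder (an assumption) instead of falling through verbatim.
function prekeySpanAttributes(
  path: string,
  status: number,
  errorCode?: string,
): Record<string, string | number> {
  const attrs: Record<string, string | number> = {
    'shade.route': matchTemplate(path, ROUTE_TEMPLATES) ?? 'unmatched',
    'shade.http.status': status,
  };
  if (errorCode !== undefined) attrs['shade.error.code'] = errorCode;
  return attrs;
}
```

Besides keeping the address off the span, using the template bounds the number of distinct shade.route values to the number of routes.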

shade.files.op

One span per @shade/files RPC. Attributes: peer.hash, op (the resolved op kind, e.g. read or custom:foo), bytes.bin (estimated plaintext size, binned), result, error.code.


Recording & testing

@shade/observability ships a deterministic in-memory recorder for unit tests:

import { createRecorder } from '@shade/observability';
import { createShade } from '@shade/sdk';

const rec = createRecorder();
const shade = await createShade({ ..., observability: rec });

// … exercise code under test …

const hits = rec.scanForPII(['alice@example.com', 'plaintext-secret']);
expect(hits).toHaveLength(0);

The Shade test suite runs this recorder over every documented entry point — see packages/shade-observability/tests/integration-pii.test.ts and packages/shade-transfer/tests/observability.test.ts. Any new instrumentation must keep the suite green.
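
For readers who want a feel for what such a scan involves, here is a minimal in-memory recorder sketch. It is not the createRecorder() implementation: MiniRecorder and its record() method are hypothetical, and the shape of the returned hits (one string per match) is an assumption; the document only shows that the real result has a length.

```typescript
type AttrValue = string | number | boolean;

interface RecordedSpan {
  name: string;
  attributes: Record<string, AttrValue>;
}

// Hypothetical recorder: stores finished spans in memory and can search
// everything it captured for known-sensitive strings.
class MiniRecorder {
  readonly spans: RecordedSpan[] = [];

  record(name: string, attributes: Record<string, AttrValue>): void {
    // Copy the attributes so later mutation by the caller cannot hide a leak.
    this.spans.push({ name, attributes: { ...attributes } });
  }

  // Returns one entry per (span, needle) pair where the needle appears in the
  // span name, an attribute key, or a stringified attribute value.
  scanForPII(needles: string[]): string[] {
    const hits: string[] = [];
    for (const span of this.spans) {
      const haystack = [
        span.name,
        ...Object.entries(span.attributes).map(([k, v]) => `${k}=${String(v)}`),
      ].join('\n');
      for (const needle of needles) {
        if (haystack.includes(needle)) hits.push(`${span.name}: ${needle}`);
      }
    }
    return hits;
  }
}
```

A scan of this kind only finds the exact strings the test seeds, so tests are most useful when they feed recognizable sentinel values (addresses, payloads) through every entry point.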


Performance characteristics

  • With OTel off (default): every Shade hook resolves to the shared NOOP_HOOK instance. The cost is one function call plus an object allocation that V8 typically eliminates in steady state; the upload roundtrip benchmark measured < 1 % overhead against the pre-V3.4 baseline.
  • With OTel on: cost depends entirely on the configured exporter. Use sample: 0.1 (or smaller) on hot paths in production.

Adding new instrumentation

  1. Identify a logical operation worth a span — typically anything that crosses a network/disk boundary or contends on a lock.
  2. Add an observability?: ObservabilityHook to the relevant config surface, default to NOOP_HOOK.
  3. Name the span shade.<area>.<op> to keep cardinality bounded.
  4. Set attributes via the ATTR_* constants from @shade/observability. Never introduce a new attribute key without a PII review — if you must, run the value through safeAttribute().
  5. Add a test that exercises the new instrumentation under the createRecorder() recorder and asserts no PII leaks.
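
The steps above can be sketched end to end on a toy component. Everything here is illustrative: the ObservabilityHook and SpanLike shapes are guesses (the real interface is not shown in this document), BlobCache is an invented component, and the shade.result key stands in for the corresponding ATTR_* constant.

```typescript
// Assumed hook shape, for illustration only.
interface SpanLike {
  setAttribute(key: string, value: string | number): void;
  end(): void;
}
interface ObservabilityHook {
  startSpan(name: string): SpanLike;
}

// Shared no-op instances: with tracing off, no per-call state is created.
const NOOP_SPAN: SpanLike = { setAttribute() {}, end() {} };
const NOOP_HOOK: ObservabilityHook = { startSpan: () => NOOP_SPAN };

// Step 2: the config surface takes an optional hook.
interface BlobCacheConfig {
  observability?: ObservabilityHook;
}

class BlobCache {
  private readonly hook: ObservabilityHook;
  private readonly blobs = new Map<string, Uint8Array>();

  constructor(config: BlobCacheConfig = {}) {
    this.hook = config.observability ?? NOOP_HOOK; // default to the no-op hook
  }

  put(id: string, blob: Uint8Array): void {
    this.blobs.set(id, blob);
  }

  read(id: string): Uint8Array | undefined {
    // Step 3: span name follows shade.<area>.<op>; the blob id is never
    // part of the name or the attributes.
    const span = this.hook.startSpan('shade.cache.read');
    try {
      const blob = this.blobs.get(id);
      // Step 4: low-cardinality attribute only ('hit' or 'miss').
      span.setAttribute('shade.result', blob ? 'hit' : 'miss');
      return blob;
    } finally {
      span.end(); // always end the span, even if the lookup throws
    }
  }
}
```

Step 5 then reduces to constructing the component with a recording hook, exercising read(), and asserting that neither the span name nor any attribute contains the blob id.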

Migration

Previous versions had no tracing, only Prometheus metrics. The observability field is optional, and adding it to an existing config is fully backwards-compatible. Because tracing stays off unless SHADE_OTEL_ENABLED is set, a production deployment that never sets the variable pays no unexpected tracing overhead.