# Observability v2 — OpenTelemetry tracing

Shade ships an opt-in OpenTelemetry layer that wraps `TransferEngine`, `ShadeSessionManager`, the prekey HTTP routes, and `@shade/files` op-handlers in distributed spans. The layer is **off by default** and PII-safe by construction: span attributes never include peer addresses, plaintext payloads, or exact byte counts.

This complements the always-on Prometheus metrics exposed by `@shade/server` and the structural events emitted by `@shade/core`. Use metrics for aggregate counters and histograms, and tracing for per-request causality and tail-latency hunting.

---

## Quick start

```ts
import { trace } from '@opentelemetry/api';
import { withTracer } from '@shade/observability';
import { createShade } from '@shade/sdk';

// Use the OTel SDK of your choice (NodeSDK + OTLP exporter, Honeycomb,
// Sentry's OTel adapter, …) to register a tracer provider on the
// `@opentelemetry/api` global. Then:
const tracer = trace.getTracer('my-app');

const shade = await createShade({
  prekeyServer: 'https://shade.example.com',
  storage: 'sqlite:/data/shade.db',
  observability: withTracer(tracer, { sample: 0.1 }),
});
```

The hook propagates automatically to:

- `ShadeSessionManager.encrypt` / `.decrypt` (per-peer mutex acquisition, ratchet step).
- `TransferEngine.upload` / accepted incoming downloads (lane count, retry count, partition mode).
- `@shade/files` op-handlers (per request, with op + result).

For the prekey server, pass the hook to `createPrekeyRoutes`:

```ts
import { createPrekeyRoutes } from '@shade/server';
import { withTracer } from '@shade/observability';

// `store`, `crypto`, and `tracer` are assumed to be set up elsewhere.
const app = createPrekeyRoutes(store, crypto, {
  observability: withTracer(tracer),
});
```

---

## Off-by-default semantics

`withTracer()` returns a no-op hook, so the SDK never starts spans, when **any** of the following is true:

1. The `tracer` argument is `undefined`/`null`.
2. The `SHADE_OTEL_ENABLED` env-var is not set to `1` or `true`.
   Override with `withTracer(tracer, { force: true })`, or change the variable name with `withTracer(tracer, { envVar: 'MY_VAR' })`.
3. The configured `sample` rate is `0`.

Per-span sampling (`sample: 0.1` = 10 %) keeps trace volume bounded in production. The default is `1` (sample everything when the hook is active).

---

## PII policy — what is safe to log, and what isn't

| Category | Status | Why |
|----------|--------|-----|
| **Peer hash** (`shade.peer.hash`) | ✅ allowed | 8-hex-char pseudonym derived via SHA-256. Stable across spans for a given address but does not expose the address itself. |
| **Bytes bin** (`shade.bytes.bin`) | ✅ allowed | One of `≤4KB`, `4–64KB`, `64KB–1MB`, `1–10MB`, `10–100MB`, `100MB–1GB`, `≥1GB`. Coarse enough to mask file-size fingerprinting. |
| **Lane count** (`shade.lane.count`) | ✅ allowed | Snapped to `{1, 4, 16, 64}`. |
| **Retry count** (`shade.retry.count`) | ✅ allowed | Integer. |
| **Error code** (`shade.error.code`) | ✅ allowed | Stable `SHADE_*` string code. Never the full message, which may interpolate user input. |
| **Op kind** (`shade.op`) | ✅ allowed | `list`, `read`, `write`, `custom:foo`, etc. |
| **Route template** (`shade.route`) | ✅ allowed | `/v1/keys/bundle/:address` (the template, never the resolved path). |
| **HTTP status** (`shade.http.status`) | ✅ allowed | Integer status code. |
| **Partition mode** (`shade.partition`) | ✅ allowed | `range` or `round-robin`. |
| **Direction** (`shade.direction`) | ✅ allowed | `upload` or `download`. |
| Plaintext peer addresses | ❌ forbidden | Use `peerHash()`. |
| Plaintext message/file payloads | ❌ forbidden | Encryption boundary; never log. |
| Exact byte counts | ❌ forbidden | Use `bytesBin()`. |
| User identifiers (email, DID, `device:UUID`) | ❌ forbidden | Treat as PII. |

The full attribute-key allow-list is exported from `@shade/observability` as `ATTR_*` constants.
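
To make the binned attributes in the table concrete, here is a minimal sketch of how byte counts and lane counts could be coarsened before they reach a span. It is not the library's `bytesBin()`: the names `bytesBinSketch` and `snapLaneCount` are hypothetical, and the 1024-based units, the exact boundary handling, and the "snap up" rule are assumptions. Only the bin labels and the `{1, 4, 16, 64}` set come from the table.

```typescript
// Hypothetical illustration only; the real `bytesBin()` may draw its
// boundaries differently. Units are assumed to be 1024-based.
const KB = 1024;
const MB = 1024 * KB;
const GB = 1024 * MB;

// [inclusive upper bound, label]. Labels are copied from the PII table.
// The last bound is GB - 1 so that exactly 1 GB falls into `≥1GB`.
const BYTE_BINS: ReadonlyArray<readonly [number, string]> = [
  [4 * KB, '≤4KB'],
  [64 * KB, '4–64KB'],
  [1 * MB, '64KB–1MB'],
  [10 * MB, '1–10MB'],
  [100 * MB, '10–100MB'],
  [GB - 1, '100MB–1GB'],
];

// Maps an exact byte count to a coarse label.
// Input is assumed to be a non-negative finite number.
function bytesBinSketch(bytes: number): string {
  for (const [upper, label] of BYTE_BINS) {
    if (bytes <= upper) return label;
  }
  return '≥1GB';
}

const LANE_BUCKETS: ReadonlyArray<number> = [1, 4, 16, 64];

// Assumption: snap up to the smallest bucket that fits, capped at 64.
function snapLaneCount(lanes: number): number {
  for (const bucket of LANE_BUCKETS) {
    if (lanes <= bucket) return bucket;
  }
  return 64;
}
```

Under these assumptions a 5 MB upload over 3 lanes would be reported as `1–10MB` and `4`, so two transfers of slightly different sizes are indistinguishable on the span.
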
Plug-in authors who want to attach their own tags should pass each `(key, value)` through `safeAttribute()`, which throws `UnsafeAttributeError` for any key/value pair that looks like one of the forbidden categories above. The heuristics flag `@`, `device:`, `did:`, key fragments such as `peer.address` / `bytes.exact`, and oversized strings.

---

## Span surface

### `shade.session.encrypt` / `shade.session.decrypt`

Wraps each per-peer `encrypt`/`decrypt` call. Includes the time spent waiting on the per-peer mutex (`shade.lock.wait_ms`), which is handy for diagnosing ratchet contention under load.

### `shade.transfer.upload` / `shade.transfer.upload.resume`

Wraps an outbound stream transfer end-to-end. Attributes: `peer.hash`, `bytes.bin`, `lane.count`, `partition`, `retry.count`, `result`, `error.code`.

### `shade.transfer.download`

Started when the consumer calls `incoming.accept(...)`, ended when the transfer completes, aborts, or fails an integrity check. Same attribute set as upload.

### `shade.prekey.request`

One span per HTTP request handled by `@shade/server`'s prekey routes. Attributes: `route` (the template), `http.status`, `error.code` on failure. The address path-parameter is **never** placed on the span.

### `shade.files.op`

One span per `@shade/files` RPC. Attributes: `peer.hash`, `op` (the resolved op kind, e.g. `read` or `custom:foo`), `bytes.bin` (estimated plaintext size, binned), `result`, `error.code`.

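
Plug-in code that adds its own attributes to any of the spans above is expected to screen them first. The sketch below approximates the kind of heuristic check described for `safeAttribute()` in the PII policy. It is a stand-in, not the library's implementation: `checkAttribute`, `UnsafeAttributeSketchError`, and the 64-character cutoff are hypothetical, and the real function may apply different or additional rules.

```typescript
// Hypothetical stand-in for the documented heuristics; names and the
// length cutoff are assumptions, not the actual `safeAttribute()` rules.
class UnsafeAttributeSketchError extends Error {}

const FORBIDDEN_KEY_FRAGMENTS = ['peer.address', 'bytes.exact'];
const FORBIDDEN_VALUE_MARKERS = ['@', 'device:', 'did:'];
const MAX_VALUE_LENGTH = 64; // assumed threshold for "oversized strings"

type AttrValue = string | number | boolean;

// Returns the pair unchanged when it passes, throws otherwise.
// Error messages name the key only, so the rejected value is never echoed.
function checkAttribute(key: string, value: AttrValue): [string, AttrValue] {
  if (FORBIDDEN_KEY_FRAGMENTS.some((fragment) => key.includes(fragment))) {
    throw new UnsafeAttributeSketchError(`forbidden attribute key: ${key}`);
  }
  if (typeof value === 'string') {
    if (value.length > MAX_VALUE_LENGTH) {
      throw new UnsafeAttributeSketchError(`oversized value for key: ${key}`);
    }
    if (FORBIDDEN_VALUE_MARKERS.some((marker) => value.includes(marker))) {
      throw new UnsafeAttributeSketchError(`PII-like value for key: ${key}`);
    }
  }
  return [key, value];
}
```

With these rules, `checkAttribute('shade.op', 'custom:foo')` passes, while a value such as `alice@example.com` or a key containing `peer.address` is rejected before it can reach an exporter.
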

---

## Recording & testing

`@shade/observability` ships a deterministic in-memory recorder for unit tests:

```ts
import { createRecorder } from '@shade/observability';
import { createShade } from '@shade/sdk';

const rec = createRecorder();
const shade = await createShade({
  // … other options …
  observability: rec,
});

// … exercise code under test …

// `expect` comes from your test runner.
const hits = rec.scanForPII(['alice@example.com', 'plaintext-secret']);
expect(hits).toHaveLength(0);
```

The Shade test suite runs this recorder over every documented entry point; see `packages/shade-observability/tests/integration-pii.test.ts` and `packages/shade-transfer/tests/observability.test.ts`. Any new instrumentation must keep the suite green.

---

## Performance characteristics

- With OTel **off** (default): every Shade hook resolves to the shared `NOOP_HOOK` instance. The cost is one function call plus an object allocation that V8 hoists out in the steady state, measured at < 1 % overhead vs the pre-V3.4 baseline in the upload roundtrip benchmark.
- With OTel **on**: cost depends entirely on the configured exporter. Use `sample: 0.1` (or smaller) on hot paths in production.

---

## Adding new instrumentation

1. Identify a logical operation worth a span, typically anything that crosses a network/disk boundary or contends on a lock.
2. Add an `observability?: ObservabilityHook` field to the relevant config surface and default it to `NOOP_HOOK`.
3. Name the span `shade.<component>.<operation>` to keep cardinality bounded.
4. Set attributes via the `ATTR_*` constants from `@shade/observability`. **Never** introduce a new attribute key without a PII review. If you must add one, run the value through `safeAttribute()`.
5. Add a test that exercises the new instrumentation under the `createRecorder()` recorder and asserts that no PII leaks.

---

## Migration

Previous versions had no tracing, only Prometheus metrics. The `observability` field is optional, and adding it to an existing config is fully backwards-compatible.
Because of the `SHADE_OTEL_ENABLED` gate, a deployment that never sets the env-var keeps tracing off and does not pick up unexpected overhead.
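
The gate mentioned above, combined with the three conditions listed under "Off-by-default semantics", can be sketched as a pure function. This is an illustration rather than the code inside `withTracer()`: `tracingActive`, `shouldSample`, and the `GateOptions` shape are hypothetical, and it assumes that `force` bypasses only the env-var check, not a missing tracer or a `sample` of `0`.

```typescript
// Hypothetical option shape mirroring the documented `sample`, `force`,
// and `envVar` settings.
interface GateOptions {
  sample?: number; // documented default: 1
  force?: boolean; // assumed to bypass the env-var check only
  envVar?: string; // documented default name: SHADE_OTEL_ENABLED
}

type Env = Record<string, string | undefined>;

// Decides whether a real hook (rather than a no-op) would be returned.
// The environment is passed in explicitly to keep the function testable.
function tracingActive(tracer: unknown, env: Env, opts: GateOptions = {}): boolean {
  if (tracer === undefined || tracer === null) return false; // condition 1
  if ((opts.sample ?? 1) === 0) return false; // condition 3
  if (opts.force) return true; // explicit override of condition 2
  const raw = env[opts.envVar ?? 'SHADE_OTEL_ENABLED'];
  return raw === '1' || raw === 'true'; // condition 2
}

// Per-span sampling: keep a span when a uniform draw falls below the rate.
function shouldSample(sample: number, random: () => number = Math.random): boolean {
  return random() < sample;
}
```

For example, `tracingActive(tracer, {})` is `false` for any tracer because the env-var is unset, which matches the behavior the migration note relies on.
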