# Observability v2 — OpenTelemetry tracing
Shade ships an opt-in OpenTelemetry layer that wraps `TransferEngine`,
`ShadeSessionManager`, the prekey HTTP routes, and `@shade/files`
op-handlers in distributed spans. The layer is **off by default** and
PII-safe by construction — span attributes never include peer addresses,
plaintext payloads, or exact byte counts.
This complements the always-on Prometheus metrics exposed by
`@shade/server` and the structural events emitted by `@shade/core`. Use
metrics for aggregate counters and histograms, tracing for per-request
causality and tail-latency hunting.
---
## Quick start
```ts
import { trace } from '@opentelemetry/api';
import { withTracer } from '@shade/observability';
import { createShade } from '@shade/sdk';
// Use the OTel SDK of your choice (NodeSDK + OTLP exporter, Honeycomb,
// Sentry's OTel adapter, …) to register a tracer provider on the
// `@opentelemetry/api` global. Then:
const tracer = trace.getTracer('my-app');
const shade = await createShade({
  prekeyServer: 'https://shade.example.com',
  storage: 'sqlite:/data/shade.db',
  observability: withTracer(tracer, { sample: 0.1 }),
});
```
The hook propagates automatically to:
- `ShadeSessionManager.encrypt` / `.decrypt` (per-peer mutex acquisition,
ratchet step).
- `TransferEngine.upload` / accepted incoming downloads (lane count,
retry count, partition mode).
- `@shade/files` op-handlers (per request, with op + result).
For the prekey server pass the hook to `createPrekeyRoutes`:
```ts
import { createPrekeyRoutes } from '@shade/server';
import { withTracer } from '@shade/observability';
const app = createPrekeyRoutes(store, crypto, {
  observability: withTracer(tracer),
});
```
---
## Off-by-default semantics
`withTracer()` returns a no-op hook, so the SDK never starts spans, when
**any** of the following is true:
1. The `tracer` argument is `undefined`/`null`.
2. The `SHADE_OTEL_ENABLED` env-var is not set to `1` or `true`. Override
with `withTracer(tracer, { force: true })`, or override the var name
with `withTracer(tracer, { envVar: 'MY_VAR' })`.
3. The configured `sample` rate is `0`.
Per-span sampling (`sample: 0.1` = 10 %) keeps trace volume bounded in
production. Default is `1` (sample everything when the hook is active).
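The three rules above can be read as a single gate. The following is a minimal sketch of that decision, not the actual `withTracer()` implementation: the names `tracingActive`, `shouldSample`, and `TracerGateOptions` are hypothetical, and the environment is passed in explicitly so the logic stays testable. It assumes `force` bypasses only the env-var check, as the wording of rule 2 suggests.

```typescript
// Hypothetical option shape; only `sample`, `force`, and `envVar` are named in the text.
interface TracerGateOptions {
  sample?: number; // per-span sampling rate in [0, 1]; defaults to 1
  force?: boolean; // bypass the env-var gate
  envVar?: string; // alternate env-var name; defaults to SHADE_OTEL_ENABLED
}

type Env = Record<string, string | undefined>;

// Returns true when spans may be started at all (rules 1-3).
function tracingActive(
  tracer: unknown,
  opts: TracerGateOptions = {},
  env: Env = {},
): boolean {
  if (tracer === undefined || tracer === null) return false; // rule 1
  if ((opts.sample ?? 1) <= 0) return false; // rule 3
  if (opts.force) return true; // assumption: force skips only the env gate
  const raw = env[opts.envVar ?? 'SHADE_OTEL_ENABLED'];
  return raw === '1' || raw === 'true'; // rule 2
}

// Per-span sampling decision; `rand` is injectable for deterministic tests.
function shouldSample(sample: number, rand: () => number = Math.random): boolean {
  return rand() < sample;
}
```

In a real process you would pass `process.env` as the third argument; with `sample: 0.1`, roughly one span in ten passes `shouldSample`.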
---
## PII policy — what is safe to log, and what isn't
| Category | Status | Why |
|----------|--------|-----|
| **Peer hash** (`shade.peer.hash`) | ✅ allowed | 8-hex-char pseudonym derived via SHA-256. Stable across spans for a given address but does not expose the address itself. |
| **Bytes bin** (`shade.bytes.bin`) | ✅ allowed | One of `≤4KB`, `4–64KB`, `64KB–1MB`, `1–10MB`, `10–100MB`, `100MB–1GB`, `≥1GB`. Coarse enough to mask file-size fingerprinting. |
| **Lane count** (`shade.lane.count`) | ✅ allowed | Snapped to `{1, 4, 16, 64}`. |
| **Retry count** (`shade.retry.count`) | ✅ allowed | Integer. |
| **Error code** (`shade.error.code`) | ✅ allowed | `SHADE_*` stable string code — never the full message, which may interpolate user input. |
| **Op kind** (`shade.op`) | ✅ allowed | `list`, `read`, `write`, `custom:foo`, etc. |
| **Route template** (`shade.route`) | ✅ allowed | `/v1/keys/bundle/:address` — the template, never the resolved path. |
| **HTTP status** (`shade.http.status`) | ✅ allowed | Integer status code. |
| **Partition mode** (`shade.partition`) | ✅ allowed | `range` or `round-robin`. |
| **Direction** (`shade.direction`) | ✅ allowed | `upload` or `download`. |
| Plaintext peer addresses | ❌ forbidden | Use `peerHash()`. |
| Plaintext message/file payloads | ❌ forbidden | Encryption boundary — never log. |
| Exact byte counts | ❌ forbidden | Use `bytesBin()`. |
| User identifiers (email, DID, `device:UUID`) | ❌ forbidden | Treat as PII. |
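To make the allowed categories concrete, here is a rough sketch of how the pseudonymizing helpers could work. It is illustrative only and is not the source of `peerHash()` or `bytesBin()`. The `*Sketch` names are hypothetical, and several details are assumptions: the hash is unsalted and truncated to the first 8 hex characters, bins use 1024-based units with inclusive upper bounds, and lane counts round up to the next allowed value.

```typescript
import { createHash } from 'node:crypto';

// Assumption: unsalted SHA-256 over the UTF-8 address, first 8 hex chars.
function peerHashSketch(address: string): string {
  return createHash('sha256').update(address, 'utf8').digest('hex').slice(0, 8);
}

// Assumption: binary units (1 KB = 1024 bytes).
const KB = 1024;
const MB = 1024 * KB;
const GB = 1024 * MB;

// Maps an exact byte count to one of the coarse labels from the table.
function bytesBinSketch(n: number): string {
  if (n <= 4 * KB) return '≤4KB';
  if (n <= 64 * KB) return '4–64KB';
  if (n <= MB) return '64KB–1MB';
  if (n <= 10 * MB) return '1–10MB';
  if (n <= 100 * MB) return '10–100MB';
  if (n < GB) return '100MB–1GB';
  return '≥1GB';
}

// Assumption: snap upward to the next value in {1, 4, 16, 64}, capped at 64.
function snapLaneCount(lanes: number): 1 | 4 | 16 | 64 {
  if (lanes <= 1) return 1;
  if (lanes <= 4) return 4;
  if (lanes <= 16) return 16;
  return 64;
}
```

The point of all three helpers is the same: many distinct raw values collapse onto a handful of labels, so a span cannot be used to fingerprint a specific peer, file, or configuration.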
The full attribute-key allow-list is exported from `@shade/observability`
as `ATTR_*` constants. Plug-in authors who want to attach their own tags
should pass each `(key, value)` through `safeAttribute()`, which throws
`UnsafeAttributeError` for any key/value pair that looks like the
forbidden categories above (heuristics: `@`, `device:`, `did:`, key
fragments such as `peer.address` / `bytes.exact`, oversized strings).
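The heuristics listed above could be approximated as follows. This is a sketch under stated assumptions rather than the real `safeAttribute()`: the `*Sketch` names are hypothetical, the 128-character limit for "oversized" strings is a guess, and the real check may be stricter or use different rules.

```typescript
type AttrValue = string | number | boolean;

// Hypothetical stand-in for UnsafeAttributeError.
class UnsafeAttributeErrorSketch extends Error {
  constructor(message: string) {
    super(message);
    this.name = 'UnsafeAttributeErrorSketch';
  }
}

const FORBIDDEN_KEY_FRAGMENTS = ['peer.address', 'bytes.exact'];
const FORBIDDEN_VALUE_MARKERS = ['@', 'device:', 'did:'];
const MAX_VALUE_LENGTH = 128; // assumption: the real threshold is not documented here

// Returns the pair unchanged when it looks safe, throws otherwise.
function safeAttributeSketch(key: string, value: AttrValue): [string, AttrValue] {
  if (FORBIDDEN_KEY_FRAGMENTS.some((f) => key.includes(f))) {
    throw new UnsafeAttributeErrorSketch(`forbidden attribute key: ${key}`);
  }
  if (typeof value === 'string') {
    if (value.length > MAX_VALUE_LENGTH) {
      throw new UnsafeAttributeErrorSketch(`oversized value for key: ${key}`);
    }
    if (FORBIDDEN_VALUE_MARKERS.some((m) => value.includes(m))) {
      // The error deliberately omits the value so the PII is not echoed.
      throw new UnsafeAttributeErrorSketch(`value for ${key} looks like PII`);
    }
  }
  return [key, value];
}
```

Note that the error messages name only the key, never the offending value, which mirrors the policy that error text must not carry user input.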
---
## Span surface
### `shade.session.encrypt` / `shade.session.decrypt`
Wraps each per-peer `encrypt`/`decrypt` call. Includes the time spent
waiting on the per-peer mutex (`shade.lock.wait_ms`) — handy for
diagnosing ratchet contention under load.
### `shade.transfer.upload` / `shade.transfer.upload.resume`
Wraps an outbound stream transfer end-to-end. Attributes: `peer.hash`,
`bytes.bin`, `lane.count`, `partition`, `retry.count`, `result`,
`error.code`.
### `shade.transfer.download`
Started when the consumer calls `incoming.accept(...)`, ended when the
transfer completes, aborts, or fails an integrity check. Same attribute
set as upload.
### `shade.prekey.request`
One span per HTTP request handled by `@shade/server`'s prekey routes.
Attributes: `route` (the template), `http.status`, `error.code` on
failure. The address path-parameter is **never** placed on the span.
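The distinction between the template and the resolved path can be illustrated with a small matcher. This is a hypothetical helper (`routeTemplateFor` is not part of `@shade/server`), and in practice the HTTP framework usually hands you the matched template directly. The sketch only shows why the span attribute stays free of the address.

```typescript
// Hypothetical helper: returns the first template whose segments match the
// concrete path, treating `:name` segments as wildcards.
function routeTemplateFor(path: string, templates: string[]): string | undefined {
  const segments = path.split('/');
  return templates.find((template) => {
    const parts = template.split('/');
    return (
      parts.length === segments.length &&
      parts.every((part, i) => part.startsWith(':') || part === segments[i])
    );
  });
}

// The span attribute would carry the template, never the resolved path.
const template = routeTemplateFor('/v1/keys/bundle/alice%40example.com', [
  '/v1/keys/bundle/:address',
]);
```

Here `template` is `/v1/keys/bundle/:address`, so every request to that route shares one low-cardinality attribute value regardless of which address was requested.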
### `shade.files.op`
One span per `@shade/files` RPC. Attributes: `peer.hash`, `op` (the
resolved op kind, e.g. `read` or `custom:foo`), `bytes.bin` (estimated
plaintext size, binned), `result`, `error.code`.
---
## Recording & testing
`@shade/observability` ships a deterministic in-memory recorder for
unit tests:
```ts
import { createRecorder } from '@shade/observability';
import { createShade } from '@shade/sdk';
const rec = createRecorder();
const shade = await createShade({ /* …your usual options… */ observability: rec });
// … exercise code under test …
const hits = rec.scanForPII(['alice@example.com', 'plaintext-secret']);
expect(hits).toHaveLength(0);
```
The Shade test suite runs this recorder over every documented entry
point — see
`packages/shade-observability/tests/integration-pii.test.ts` and
`packages/shade-transfer/tests/observability.test.ts`. Any new
instrumentation must keep the suite green.
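Conceptually, a PII scan over recorded spans is a substring search across every attribute. The sketch below shows that idea in isolation. The span shape, `PiiHit`, and `scanForPIISketch` are hypothetical, it inspects attributes only, and the real `scanForPII()` may also cover span names, events, or other fields.

```typescript
type AttrValue = string | number | boolean;

// Hypothetical shape of a recorded span.
interface RecordedSpan {
  name: string;
  attributes: Record<string, AttrValue>;
}

interface PiiHit {
  span: string;
  key: string;
  needle: string;
}

// Reports every attribute whose key or value contains one of the needles.
function scanForPIISketch(spans: RecordedSpan[], needles: string[]): PiiHit[] {
  const hits: PiiHit[] = [];
  for (const span of spans) {
    for (const key of Object.keys(span.attributes)) {
      const text = `${key}=${String(span.attributes[key])}`;
      for (const needle of needles) {
        if (text.includes(needle)) {
          hits.push({ span: span.name, key, needle });
        }
      }
    }
  }
  return hits;
}
```

Seeding a test with known secrets (an address, a payload fragment) and asserting an empty result is a cheap way to catch accidental leaks when new attributes are added.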
---
## Performance characteristics
- With OTel **off** (default): every Shade hook resolves to the shared
`NOOP_HOOK` instance. The cost is one function call + an object
allocation that V8 hoists out in the steady state — measured at
< 1 % overhead vs the pre-V3.4 baseline in the upload roundtrip
benchmark.
- With OTel **on**: cost depends entirely on the configured exporter.
Use `sample: 0.1` (or smaller) on hot paths in production.
---
## Adding new instrumentation
1. Identify a logical operation worth a span — typically anything that
crosses a network/disk boundary or contends on a lock.
2. Add an `observability?: ObservabilityHook` field to the relevant config
surface, defaulting to `NOOP_HOOK`.
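3. Name the span `shade.<component>.<operation>`, following the existing names such as `shade.session.encrypt` and `shade.files.op`, to keep cardinality bounded.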
4. Set attributes via the `ATTR_*` constants from
`@shade/observability`. **Never** introduce a new attribute key
without a PII review — if you must, run the value through
`safeAttribute()`.
5. Add a test that exercises the new instrumentation under the
`createRecorder()` recorder and asserts no PII leaks.
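Put together, the checklist might look like the sketch below. Everything in it is hypothetical: the real `ObservabilityHook` interface, `NOOP_HOOK`, and the `ATTR_*` constant names are not shown in this document, so the sketch defines small stand-ins (suffixed `Sketch`) and an invented `flushIndex` operation with an invented `SHADE_INDEX_FLUSH_FAILED` code purely to show the pattern.

```typescript
type AttrValue = string | number | boolean;

// Stand-ins for the real hook types, which may differ.
interface SpanSketch {
  setAttribute(key: string, value: AttrValue): void;
  end(): void;
}
interface ObservabilityHookSketch {
  startSpan(name: string): SpanSketch;
}

// A single shared no-op instance, so the disabled path does no real work.
const NOOP_SPAN: SpanSketch = { setAttribute() {}, end() {} };
const NOOP_HOOK_SKETCH: ObservabilityHookSketch = { startSpan: () => NOOP_SPAN };

// Hypothetical constant names; the keys themselves come from the PII table.
const ATTR_RETRY_COUNT = 'shade.retry.count';
const ATTR_ERROR_CODE = 'shade.error.code';

interface IndexFlusherConfig {
  observability?: ObservabilityHookSketch; // step 2: optional hook
}

// Invented operation: `work` does the I/O and returns how many retries it needed.
function flushIndex(config: IndexFlusherConfig, work: () => number): number {
  const hook = config.observability ?? NOOP_HOOK_SKETCH; // step 2: default to no-op
  const span = hook.startSpan('shade.index.flush'); // step 3: dotted, fixed name
  try {
    const retries = work();
    span.setAttribute(ATTR_RETRY_COUNT, retries); // step 4: allow-listed key
    return retries;
  } catch (err) {
    // Stable code only; the error message may interpolate user input.
    span.setAttribute(ATTR_ERROR_CODE, 'SHADE_INDEX_FLUSH_FAILED');
    throw err;
  } finally {
    span.end();
  }
}
```

Because the span is ended in `finally`, it closes on both the success and failure paths, and callers that never pass a hook go through the shared no-op objects.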
---
## Migration
Previous versions had no tracing, only Prometheus metrics. The
`observability` field is optional, so existing configs keep working
unchanged. Because tracing stays off unless `SHADE_OTEL_ENABLED` is set
to `1` or `true` (or `force: true` is passed), a production deployment
that never sets the variable incurs no unexpected tracing overhead.