# Observability v2 — OpenTelemetry tracing

Shade ships an opt-in OpenTelemetry layer that wraps `TransferEngine`,
`ShadeSessionManager`, the prekey HTTP routes, and `@shade/files`
op-handlers in distributed spans. The layer is **off by default** and
PII-safe by construction: span attributes never include peer addresses,
plaintext payloads, or exact byte counts.

This complements the always-on Prometheus metrics exposed by
`@shade/server` and the structural events emitted by `@shade/core`. Use
metrics for aggregate counters and histograms, tracing for per-request
causality and tail-latency hunting.

---

## Quick start

```ts
import { trace } from '@opentelemetry/api';
import { withTracer } from '@shade/observability';
import { createShade } from '@shade/sdk';

// Use the OTel SDK of your choice (NodeSDK + OTLP exporter, Honeycomb,
// Sentry's OTel adapter, …) to register a tracer provider on the
// `@opentelemetry/api` global. Then:
const tracer = trace.getTracer('my-app');

const shade = await createShade({
  prekeyServer: 'https://shade.example.com',
  storage: 'sqlite:/data/shade.db',
  observability: withTracer(tracer, { sample: 0.1 }),
});
```

The hook propagates automatically to:

- `ShadeSessionManager.encrypt` / `.decrypt` (per-peer mutex acquisition,
  ratchet step).
- `TransferEngine.upload` / accepted incoming downloads (lane count,
  retry count, partition mode).
- `@shade/files` op-handlers (per request, with op + result).

For the prekey server, pass the hook to `createPrekeyRoutes`:

```ts
import { createPrekeyRoutes } from '@shade/server';
import { withTracer } from '@shade/observability';

// `store`, `crypto`, and `tracer` are assumed to be set up elsewhere.
const app = createPrekeyRoutes(store, crypto, {
  observability: withTracer(tracer),
});
```

---

## Off-by-default semantics

`withTracer()` returns a no-op hook (the SDK never starts spans) when
**any** of the following is true:

1. The `tracer` argument is `undefined`/`null`.
2. The `SHADE_OTEL_ENABLED` env-var is not set to `1` or `true`. Bypass
   this check with `withTracer(tracer, { force: true })`, or change the
   variable name with `withTracer(tracer, { envVar: 'MY_VAR' })`.
3. The configured `sample` rate is `0`.

Per-span sampling (`sample: 0.1` = 10 %) keeps trace volume bounded in
production. Default is `1` (sample everything when the hook is active).

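The three no-op conditions and the sampling rate can be pictured as a small gate. The sketch below is an assumption about how such a gate might be structured, not the actual `withTracer()` implementation. `isTracingActive`, `shouldSample`, and the `TracerGateOptions` shape are hypothetical names. The environment is passed in as a plain object (in Node you would pass `process.env`), and the sketch assumes that `force` bypasses only the env-var check and that the flag comparison is case-sensitive.

```ts
// Hypothetical option shape: only `sample`, `force`, and `envVar` are
// named in this document.
interface TracerGateOptions {
  sample?: number; // per-span sampling rate in [0, 1]; defaults to 1
  force?: boolean; // bypass the environment-variable check
  envVar?: string; // name of the gating variable
}

type EnvSketch = Record<string, string | undefined>;

// Returns true only when none of the three no-op conditions applies.
function isTracingActive(
  tracer: object | null | undefined,
  options: TracerGateOptions = {},
  env: EnvSketch = {},
): boolean {
  // Condition 1: no tracer was supplied.
  if (tracer === undefined || tracer === null) return false;

  // Condition 2: the env-var gate is closed and not forced open.
  const flag = env[options.envVar ?? 'SHADE_OTEL_ENABLED'];
  const enabled = flag === '1' || flag === 'true';
  if (!enabled && options.force !== true) return false;

  // Condition 3: a sample rate of 0 disables tracing entirely.
  if ((options.sample ?? 1) === 0) return false;

  return true;
}

// Per-span decision: keep a span when a uniform draw in [0, 1) falls
// below the configured rate. `random` is injectable for testing.
function shouldSample(rate: number, random: () => number = Math.random): boolean {
  return random() < rate;
}
```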

---

## PII policy — what is safe to log, and what isn't

| Category | Status | Why |
|----------|--------|-----|
| **Peer hash** (`shade.peer.hash`) | ✅ allowed | 8-hex-char pseudonym derived via SHA-256. Stable across spans for a given address but does not expose the address itself. |
| **Bytes bin** (`shade.bytes.bin`) | ✅ allowed | One of `≤4KB`, `4–64KB`, `64KB–1MB`, `1–10MB`, `10–100MB`, `100MB–1GB`, `≥1GB`. Coarse enough to mask file-size fingerprinting. |
| **Lane count** (`shade.lane.count`) | ✅ allowed | Snapped to `{1, 4, 16, 64}`. |
| **Retry count** (`shade.retry.count`) | ✅ allowed | Integer. |
| **Error code** (`shade.error.code`) | ✅ allowed | `SHADE_*` stable string code, never the full message, which may interpolate user input. |
| **Op kind** (`shade.op`) | ✅ allowed | `list`, `read`, `write`, `custom:foo`, etc. |
| **Route template** (`shade.route`) | ✅ allowed | `/v1/keys/bundle/:address`: the template, never the resolved path. |
| **HTTP status** (`shade.http.status`) | ✅ allowed | Integer status code. |
| **Partition mode** (`shade.partition`) | ✅ allowed | `range` or `round-robin`. |
| **Direction** (`shade.direction`) | ✅ allowed | `upload` or `download`. |
| Plaintext peer addresses | ❌ forbidden | Use `peerHash()`. |
| Plaintext message/file payloads | ❌ forbidden | Encryption boundary; never log. |
| Exact byte counts | ❌ forbidden | Use `bytesBin()`. |
| User identifiers (email, DID, `device:UUID`) | ❌ forbidden | Treat as PII. |

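To make the binning rows concrete, here is a minimal sketch of how a byte count and a lane count could be coarsened into the values listed in the table. It is not the library's `bytesBin()` implementation. The function names `bytesBinSketch` and `snapLaneCountSketch` are hypothetical, and the sketch assumes binary units (1 KB = 1024 bytes), that a size exactly on a boundary falls into the higher bin (except 4 KB, which the `≤4KB` label claims), and that lane counts round up to the next bucket with a cap at 64.

```ts
const KB = 1024;
const MB = 1024 * KB;
const GB = 1024 * MB;

// Exclusive upper bounds for every bin above the first, paired with the
// labels from the table. Boundary handling is an assumption.
const UPPER_BOUNDS: ReadonlyArray<readonly [number, string]> = [
  [64 * KB, '4–64KB'],
  [1 * MB, '64KB–1MB'],
  [10 * MB, '1–10MB'],
  [100 * MB, '10–100MB'],
  [1 * GB, '100MB–1GB'],
];

// Maps an exact byte count to a coarse label so the exact size never
// reaches a span attribute.
function bytesBinSketch(bytes: number): string {
  if (bytes <= 4 * KB) return '≤4KB';
  for (const [upper, label] of UPPER_BOUNDS) {
    if (bytes < upper) return label;
  }
  return '≥1GB';
}

const LANE_BUCKETS = [1, 4, 16, 64] as const;

// Snaps a lane count to one of the four allowed values. Assumption:
// round up to the next bucket, capping at 64.
function snapLaneCountSketch(lanes: number): number {
  for (const bucket of LANE_BUCKETS) {
    if (lanes <= bucket) return bucket;
  }
  return 64;
}
```

Under these assumptions a 5 MB file is reported as `1–10MB` and a transfer using 3 lanes is reported with a lane count of 4.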

The full attribute-key allow-list is exported from `@shade/observability`
as `ATTR_*` constants. Plug-in authors who want to attach their own tags
should pass each `(key, value)` through `safeAttribute()`, which throws
`UnsafeAttributeError` for any key/value pair that resembles one of the
forbidden categories above. The heuristics look for `@`, `device:`, and
`did:`, for key fragments such as `peer.address` / `bytes.exact`, and for
oversized strings.

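The heuristics can be illustrated with a small guard function. This is a sketch of the idea only, not the real `safeAttribute()`. The names `safeAttributeSketch` and `UnsafeAttributeSketchError` are hypothetical, the 128-character limit for "oversized" is an assumed threshold, and the sketch assumes that the `@`, `device:`, and `did:` markers are checked against string values while the key fragments are checked against keys.

```ts
class UnsafeAttributeSketchError extends Error {}

type AttributeValue = string | number | boolean;

// Markers and fragments taken from the heuristics described above.
const FORBIDDEN_KEY_FRAGMENTS = ['peer.address', 'bytes.exact'];
const FORBIDDEN_VALUE_MARKERS = ['@', 'device:', 'did:'];
const MAX_STRING_LENGTH = 128; // assumed threshold for "oversized"

// Returns the pair unchanged when it passes every check and throws
// otherwise. Error messages name the key only, never the value.
function safeAttributeSketch(key: string, value: AttributeValue): [string, AttributeValue] {
  for (const fragment of FORBIDDEN_KEY_FRAGMENTS) {
    if (key.indexOf(fragment) !== -1) {
      throw new UnsafeAttributeSketchError(`forbidden key fragment in "${key}"`);
    }
  }
  if (typeof value === 'string') {
    if (value.length > MAX_STRING_LENGTH) {
      throw new UnsafeAttributeSketchError(`oversized value for "${key}"`);
    }
    for (const marker of FORBIDDEN_VALUE_MARKERS) {
      if (value.indexOf(marker) !== -1) {
        throw new UnsafeAttributeSketchError(`value for "${key}" looks like an identifier`);
      }
    }
  }
  return [key, value];
}
```

Keeping the rejected value out of the error message matters here, because the error itself may end up in logs or on a span.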

---

## Span surface

### `shade.session.encrypt` / `shade.session.decrypt`

Wraps each per-peer `encrypt`/`decrypt` call. Includes the time spent
waiting on the per-peer mutex (`shade.lock.wait_ms`), which helps when
diagnosing ratchet contention under load.

### `shade.transfer.upload` / `shade.transfer.upload.resume`

Wraps an outbound stream transfer end-to-end. Attributes: `peer.hash`,
`bytes.bin`, `lane.count`, `partition`, `retry.count`, `result`,
`error.code`.

### `shade.transfer.download`

Started when the consumer calls `incoming.accept(...)`, ended when the
transfer completes, aborts, or fails an integrity check. Same attribute
set as upload.

### `shade.prekey.request`

One span per HTTP request handled by `@shade/server`'s prekey routes.
Attributes: `route` (the template), `http.status`, `error.code` on
failure. The address path-parameter is **never** placed on the span.

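As an illustration of the template-only rule, the sketch below builds the attribute set for one prekey request span. It is an assumption about shape, not the server's actual code: `prekeyRequestAttributes` is a hypothetical helper and `SHADE_NOT_FOUND` is a made-up error code, while the attribute keys are the ones listed in the PII table.

```ts
type PrekeySpanAttributes = Record<string, string | number>;

// Builds the attributes for a prekey request span. The caller passes the
// matched route template, so the resolved path (which contains the peer
// address) never reaches the span.
function prekeyRequestAttributes(
  routeTemplate: string,
  httpStatus: number,
  errorCode?: string,
): PrekeySpanAttributes {
  const attributes: PrekeySpanAttributes = {
    'shade.route': routeTemplate,
    'shade.http.status': httpStatus,
  };
  // The error code is attached on failure only.
  if (errorCode !== undefined) {
    attributes['shade.error.code'] = errorCode;
  }
  return attributes;
}

// A failed lookup: the span carries the template, not the address.
const failedLookup = prekeyRequestAttributes('/v1/keys/bundle/:address', 404, 'SHADE_NOT_FOUND');
```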

### `shade.files.op`

One span per `@shade/files` RPC. Attributes: `peer.hash`, `op` (the
resolved op kind, e.g. `read` or `custom:foo`), `bytes.bin` (estimated
plaintext size, binned), `result`, `error.code`.

---

## Recording & testing

`@shade/observability` ships a deterministic in-memory recorder for
unit tests:

```ts
import { createRecorder } from '@shade/observability';
import { createShade } from '@shade/sdk';

const rec = createRecorder();
const shade = await createShade({ /* …other options… */ observability: rec });

// … exercise code under test …

// `expect` comes from your test runner.
const hits = rec.scanForPII(['alice@example.com', 'plaintext-secret']);
expect(hits).toHaveLength(0);
```

The Shade test suite runs this recorder over every documented entry
point; see
`packages/shade-observability/tests/integration-pii.test.ts` and
`packages/shade-transfer/tests/observability.test.ts`. Any new
instrumentation must keep the suite green.

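To show what a PII scan over recorded spans amounts to, here is a minimal stand-alone sketch. It is not the recorder that `createRecorder()` returns: the `RecordedSpanSketch` and `PIIHitSketch` shapes and the `scanForPIISketch` function are hypothetical, and the sketch assumes a hit is any needle that appears as a substring of a span name, attribute key, or attribute value.

```ts
interface RecordedSpanSketch {
  name: string;
  attributes: Record<string, string | number | boolean>;
}

interface PIIHitSketch {
  span: string;   // name of the offending span
  key: string;    // attribute key where the needle was found
  needle: string; // the sensitive string that leaked
}

// Scans every recorded span for each needle and reports where it appears.
function scanForPIISketch(spans: RecordedSpanSketch[], needles: string[]): PIIHitSketch[] {
  const hits: PIIHitSketch[] = [];
  for (const span of spans) {
    for (const key of Object.keys(span.attributes)) {
      // Search the span name, the key, and the stringified value together.
      const haystack = `${span.name} ${key} ${String(span.attributes[key])}`;
      for (const needle of needles) {
        if (haystack.indexOf(needle) !== -1) {
          hits.push({ span: span.name, key, needle });
        }
      }
    }
  }
  return hits;
}
```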

---

## Performance characteristics

- With OTel **off** (default): every Shade hook resolves to the shared
  `NOOP_HOOK` instance. The cost is one function call plus an object
  allocation that V8 optimizes away in the steady state, measured at
  < 1 % overhead vs the pre-V3.4 baseline in the upload roundtrip
  benchmark.
- With OTel **on**: cost depends entirely on the configured exporter.
  Use `sample: 0.1` (or smaller) on hot paths in production.

---

## Adding new instrumentation

1. Identify a logical operation worth a span, typically anything that
   crosses a network/disk boundary or contends on a lock.
2. Add an `observability?: ObservabilityHook` field to the relevant
   config surface and default it to `NOOP_HOOK`.
3. Name the span `shade.<area>.<op>` to keep cardinality bounded.
4. Set attributes via the `ATTR_*` constants from
   `@shade/observability`. **Never** introduce a new attribute key
   without a PII review. If you must, run the value through
   `safeAttribute()`.
5. Add a test that exercises the new instrumentation under the
   `createRecorder()` recorder and asserts no PII leaks.

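Steps 2 to 4 can be sketched as follows. Everything here is illustrative: the real `ObservabilityHook` interface is not shown in this document, so `HookSketch`, its `withSpan` method, `NOOP_HOOK_SKETCH`, `warmCacheSketch`, and the span name `shade.cache.warm` are all hypothetical. The naming check assumes lower-case, dot-separated segments, which matches the span names listed under "Span surface".

```ts
type SketchAttributes = Record<string, string | number>;

// Hypothetical hook shape: run `fn` inside a span with the given name.
interface HookSketch {
  withSpan<T>(spanName: string, attributes: SketchAttributes, fn: () => T): T;
}

// Shared no-op instance: runs the callback and records nothing.
const NOOP_HOOK_SKETCH: HookSketch = {
  withSpan<T>(_spanName: string, _attributes: SketchAttributes, fn: () => T): T {
    return fn();
  },
};

// Step 3: `shade.<area>.<op>`, where the op may itself be dotted
// (as in `shade.transfer.upload.resume`).
const SPAN_NAME_PATTERN = /^shade\.[a-z]+(\.[a-z]+)+$/;

function isValidSpanName(spanName: string): boolean {
  return SPAN_NAME_PATTERN.test(spanName);
}

// Step 2: the config surface takes an optional hook.
interface CacheWarmConfigSketch {
  observability?: HookSketch;
}

function warmCacheSketch(entries: string[], config: CacheWarmConfigSketch = {}): number {
  // Default to the shared no-op hook when none is configured.
  const hook = config.observability ?? NOOP_HOOK_SKETCH;
  // Step 4: only allow-listed keys are attached as attributes.
  return hook.withSpan('shade.cache.warm', { 'shade.retry.count': 0 }, () => {
    // The disk or network work being traced would go here.
    return entries.length;
  });
}
```

Because the name is a fixed template and the attributes come from the allow-list, the number of distinct span names and keys stays constant no matter how many peers or files are involved.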

---

## Migration

Previous versions had no tracing, only Prometheus metrics. Adding the
`observability` field to existing configs is fully backwards-compatible,
and the field remains optional. Because tracing stays off unless
`SHADE_OTEL_ENABLED` is set, a production deployment that never sets the
env-var incurs no unexpected tracing overhead.