# Observability v2: OpenTelemetry tracing

Shade ships an opt-in OpenTelemetry layer that wraps `TransferEngine`,
`ShadeSessionManager`, the prekey HTTP routes, and `@shade/files`
op-handlers in distributed spans. The layer is **off by default** and
PII-safe by construction: span attributes never include peer addresses,
plaintext payloads, or exact byte counts.

This complements the always-on Prometheus metrics exposed by
`@shade/server` and the structural events emitted by `@shade/core`. Use
metrics for aggregate counters and histograms, tracing for per-request
causality and tail-latency hunting.

---

## Quick start

```ts
import { trace } from '@opentelemetry/api';
import { withTracer } from '@shade/observability';
import { createShade } from '@shade/sdk';

// Use the OTel SDK of your choice (NodeSDK + OTLP exporter, Honeycomb,
// Sentry's OTel adapter, …) to register a tracer provider on the
// `@opentelemetry/api` global. Then:
const tracer = trace.getTracer('my-app');

// Spans are only emitted when SHADE_OTEL_ENABLED is `1`/`true`
// (or when `force: true` is passed); see "Off-by-default semantics".
const shade = await createShade({
  prekeyServer: 'https://shade.example.com',
  storage: 'sqlite:/data/shade.db',
  observability: withTracer(tracer, { sample: 0.1 }),
});
```

The hook propagates automatically to:

- `ShadeSessionManager.encrypt` / `.decrypt` (per-peer mutex acquisition,
  ratchet step).
- `TransferEngine.upload` / accepted incoming downloads (lane count,
  retry count, partition mode).
- `@shade/files` op-handlers (per request, with op + result).

For the prekey server, pass the hook to `createPrekeyRoutes`:

```ts
import { createPrekeyRoutes } from '@shade/server';
import { withTracer } from '@shade/observability';

// `store`, `crypto`, and `tracer` are assumed to exist already
// (`tracer` as obtained in the previous snippet).
const app = createPrekeyRoutes(store, crypto, {
  observability: withTracer(tracer),
});
```

---

## Off-by-default semantics

`withTracer()` returns a no-op hook (the SDK never starts spans) when
**any** of the following is true:

1. The `tracer` argument is `undefined`/`null`.
2. The `SHADE_OTEL_ENABLED` env-var is not set to `1` or `true`. Override
   with `withTracer(tracer, { force: true })`, or override the var name
   with `withTracer(tracer, { envVar: 'MY_VAR' })`.
3. The configured `sample` rate is `0`.

Per-span sampling (`sample: 0.1` = 10 %) keeps trace volume bounded in
production. Default is `1` (sample everything when the hook is active).
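
The three rules and the sampling rate can be read as a small decision function. The sketch below is illustrative only: `isTracingActive`, `shouldSample`, `GateOptions`, and the injected `env` map are hypothetical names, not the `@shade/observability` API. Only `sample`, `force`, `envVar`, and `SHADE_OTEL_ENABLED` come from the text above. It also assumes that `sample: 0` wins over `force: true`, which the text does not state.

```ts
// Hypothetical names throughout; only the option names and the default
// env-var name are taken from the documentation above.
interface GateOptions {
  sample?: number;
  force?: boolean;
  envVar?: string;
}

type Env = Record<string, string | undefined>;

/** True when none of the three off-by-default rules applies. */
function isTracingActive(tracer: unknown, opts: GateOptions = {}, env: Env = {}): boolean {
  if (tracer === undefined || tracer === null) return false; // rule 1
  const sample = opts.sample ?? 1; // default: sample everything
  if (sample === 0) return false; // rule 3 (assumed to beat `force`)
  if (opts.force === true) return true; // explicit override of rule 2
  const flag = env[opts.envVar ?? 'SHADE_OTEL_ENABLED'];
  return flag === '1' || flag === 'true'; // rule 2
}

/** Per-span sampling decision; `rand` is injectable so tests are deterministic. */
function shouldSample(sample: number, rand: () => number = Math.random): boolean {
  return rand() < sample;
}
```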

---

## PII policy: what is safe to log, and what isn't

| Category | Status | Why |
|----------|--------|-----|
| **Peer hash** (`shade.peer.hash`) | ✅ allowed | 8-hex-char pseudonym derived via SHA-256. Stable across spans for a given address but does not expose the address itself. |
| **Bytes bin** (`shade.bytes.bin`) | ✅ allowed | One of `≤4KB`, `4–64KB`, `64KB–1MB`, `1–10MB`, `10–100MB`, `100MB–1GB`, `≥1GB`. Coarse enough to mask file-size fingerprinting. |
| **Lane count** (`shade.lane.count`) | ✅ allowed | Snapped to `{1, 4, 16, 64}`. |
| **Retry count** (`shade.retry.count`) | ✅ allowed | Integer. |
| **Error code** (`shade.error.code`) | ✅ allowed | `SHADE_*` stable string code; never the full message, which may interpolate user input. |
| **Op kind** (`shade.op`) | ✅ allowed | `list`, `read`, `write`, `custom:foo`, etc. |
| **Route template** (`shade.route`) | ✅ allowed | `/v1/keys/bundle/:address`: the template, never the resolved path. |
| **HTTP status** (`shade.http.status`) | ✅ allowed | Integer status code. |
| **Partition mode** (`shade.partition`) | ✅ allowed | `range` or `round-robin`. |
| **Direction** (`shade.direction`) | ✅ allowed | `upload` or `download`. |
| Plaintext peer addresses | ❌ forbidden | Use `peerHash()`. |
| Plaintext message/file payloads | ❌ forbidden | Encryption boundary: never log. |
| Exact byte counts | ❌ forbidden | Use `bytesBin()`. |
| User identifiers (email, DID, `device:UUID`) | ❌ forbidden | Treat as PII. |

The full attribute-key allow-list is exported from `@shade/observability`
as `ATTR_*` constants. Plug-in authors who want to attach their own tags
should pass each `(key, value)` through `safeAttribute()`, which throws
`UnsafeAttributeError` for any key/value pair that looks like the
forbidden categories above (heuristics: `@`, `device:`, `did:`, key
fragments such as `peer.address` / `bytes.exact`, oversized strings).
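
To make the coarsening rules and the `safeAttribute()` heuristics concrete, here is a minimal self-contained sketch. It is not the library's implementation. The `*Sketch` names are hypothetical, and several details are assumptions the text does not settle: 1024-based units, which bin a boundary value falls into, lane counts snapping upward, a 128-character limit for "oversized", and returning the pair on success.

```ts
// Assumptions: 1024-based units; `≤4KB` includes 4 KB, every later upper
// bound is exclusive, so exactly 1 GB lands in `≥1GB`.
const KB = 1024;
const MB = 1024 * KB;
const GB = 1024 * MB;

const UPPER_BOUNDS: ReadonlyArray<readonly [number, string]> = [
  [64 * KB, '4–64KB'],
  [MB, '64KB–1MB'],
  [10 * MB, '1–10MB'],
  [100 * MB, '10–100MB'],
  [GB, '100MB–1GB'],
];

/** Map an exact byte count to one of the seven coarse labels. */
function bytesBinSketch(bytes: number): string {
  if (bytes <= 4 * KB) return '≤4KB';
  for (const [upper, label] of UPPER_BOUNDS) {
    if (bytes < upper) return label;
  }
  return '≥1GB';
}

/** Snap a lane count up to the next bucket in {1, 4, 16, 64} (direction assumed). */
function snapLaneCountSketch(lanes: number): number {
  for (const bucket of [1, 4, 16, 64]) {
    if (lanes <= bucket) return bucket;
  }
  return 64;
}

class UnsafeAttributeErrorSketch extends Error {}

const FORBIDDEN_KEY_FRAGMENTS = ['peer.address', 'bytes.exact'];
const FORBIDDEN_VALUE_MARKERS = ['@', 'device:', 'did:'];
const MAX_VALUE_LENGTH = 128; // assumed threshold for "oversized strings"

type AttributeValue = string | number | boolean;

/** Return the pair unchanged, or throw when it looks like a forbidden category. */
function safeAttributeSketch(key: string, value: AttributeValue): [string, AttributeValue] {
  // Error messages name the key only, so the rejected value is never echoed.
  if (FORBIDDEN_KEY_FRAGMENTS.some((fragment) => key.includes(fragment))) {
    throw new UnsafeAttributeErrorSketch(`forbidden key fragment in "${key}"`);
  }
  if (typeof value === 'string') {
    if (value.length > MAX_VALUE_LENGTH) {
      throw new UnsafeAttributeErrorSketch(`oversized value for "${key}"`);
    }
    if (FORBIDDEN_VALUE_MARKERS.some((marker) => value.includes(marker))) {
      throw new UnsafeAttributeErrorSketch(`value for "${key}" looks like an identifier`);
    }
  }
  return [key, value];
}
```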

---

## Span surface

### `shade.session.encrypt` / `shade.session.decrypt`

Wraps each per-peer `encrypt`/`decrypt` call. Includes the time spent
waiting on the per-peer mutex (`shade.lock.wait_ms`), which is handy for
diagnosing ratchet contention under load.
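
As an illustration of what a lock-wait measurement involves, the sketch below implements a minimal per-key async mutex that reports how long each caller waited before its task ran. `PeerLocks` and its `run` method are hypothetical and say nothing about how `ShadeSessionManager` is actually built. The map of tails is also never pruned here, which a real implementation would want to address.

```ts
/** Minimal promise-chain mutex keyed by peer; a sketch, not Shade's code. */
class PeerLocks {
  private readonly tails = new Map<string, Promise<void>>();

  /** Run `task` exclusively per key, passing in the time spent waiting. */
  async run<T>(key: string, task: (waitMs: number) => Promise<T>): Promise<T> {
    const requestedAt = Date.now();
    const previous = this.tails.get(key) ?? Promise.resolve();

    // `mine` resolves when this caller releases the lock.
    let release!: () => void;
    const mine = new Promise<void>((resolve) => {
      release = resolve;
    });
    // The next caller for this key waits for everything queued so far.
    this.tails.set(key, previous.then(() => mine));

    await previous;
    const waitMs = Date.now() - requestedAt;
    try {
      return await task(waitMs);
    } finally {
      release();
    }
  }
}
```

In a real span, the `waitMs` value handed to the task is the kind of number that would be recorded under `shade.lock.wait_ms`.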

### `shade.transfer.upload` / `shade.transfer.upload.resume`

Wraps an outbound stream transfer end-to-end. Attributes (listed here
without the `shade.` prefix used in the PII table): `peer.hash`,
`bytes.bin`, `lane.count`, `partition`, `retry.count`, `result`,
`error.code`.

### `shade.transfer.download`

Started when the consumer calls `incoming.accept(...)`, ended when the
transfer completes, aborts, or fails an integrity check. Same attribute
set as upload.

### `shade.prekey.request`

One span per HTTP request handled by `@shade/server`'s prekey routes.
Attributes: `route` (the template), `http.status`, `error.code` on
failure. The address path-parameter is **never** placed on the span.
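
The template-versus-resolved-path distinction can be sketched as follows. This is an assumption-laden illustration, not the `@shade/server` code: `matchTemplate`, `requestAttributes`, the `'unmatched'` fallback, and the `SHADE_EXAMPLE` code are hypothetical. The attribute keys and the `/v1/keys/bundle/:address` template are the ones named in this document.

```ts
// The only route template named in this document; a real server has more.
const TEMPLATES: readonly string[] = ['/v1/keys/bundle/:address'];

/** Find the template whose static segments match the resolved path. */
function matchTemplate(path: string, templates: readonly string[]): string | undefined {
  const segments = path.split('/');
  return templates.find((template) => {
    const parts = template.split('/');
    return (
      parts.length === segments.length &&
      parts.every((part, i) => part.startsWith(':') || part === segments[i])
    );
  });
}

/** Build span attributes that carry the template, never the resolved path. */
function requestAttributes(
  path: string,
  status: number,
  templates: readonly string[],
  errorCode?: string,
): Record<string, string | number> {
  const attrs: Record<string, string | number> = {
    // Fall back to a fixed placeholder rather than leaking the raw path.
    'shade.route': matchTemplate(path, templates) ?? 'unmatched',
    'shade.http.status': status,
  };
  if (errorCode !== undefined) attrs['shade.error.code'] = errorCode;
  return attrs;
}
```

With this shape, a request for `/v1/keys/bundle/alice@example.com` yields `shade.route = /v1/keys/bundle/:address`, and the address never reaches the attribute map.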

### `shade.files.op`

One span per `@shade/files` RPC. Attributes: `peer.hash`, `op` (the
resolved op kind, e.g. `read` or `custom:foo`), `bytes.bin` (estimated
plaintext size, binned), `result`, `error.code`.
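
For a sense of what the `peer.hash` attribute looks like, here is one plausible derivation: the first eight hex characters of the SHA-256 of the UTF-8 address. This is an assumption. The document only says the pseudonym is 8 hex characters derived via SHA-256, so the real `peerHash()` may add a salt or pick different bytes. `peerHashSketch` and `filesOpAttributes` are hypothetical names. Note that an unsalted hash like this one can be confirmed by anyone who already holds a candidate address.

```ts
import { createHash } from 'node:crypto';

/** Assumed derivation: first 8 hex chars of SHA-256 over the UTF-8 address. */
function peerHashSketch(address: string): string {
  return createHash('sha256').update(address, 'utf8').digest('hex').slice(0, 8);
}

/** Attribute set for one files op, using keys from the PII table. */
function filesOpAttributes(address: string, op: string, bytesBin: string): Record<string, string> {
  return {
    'shade.peer.hash': peerHashSketch(address), // pseudonym, not the address
    'shade.op': op,
    'shade.bytes.bin': bytesBin, // already binned by the caller
  };
}
```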

---

## Recording & testing

`@shade/observability` ships a deterministic in-memory recorder for
unit tests:

```ts
import { createRecorder } from '@shade/observability';
import { createShade } from '@shade/sdk';

// `expect` comes from your test runner.
const rec = createRecorder();
const shade = await createShade({ /* …your usual config… */ observability: rec });

// … exercise code under test …

const hits = rec.scanForPII(['alice@example.com', 'plaintext-secret']);
expect(hits).toHaveLength(0);
```

The Shade test suite runs this recorder over every documented entry
point; see
`packages/shade-observability/tests/integration-pii.test.ts` and
`packages/shade-transfer/tests/observability.test.ts`. Any new
instrumentation must keep the suite green.
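
The recorder idea itself is simple enough to sketch: keep finished spans in memory and search their attribute keys and values for known sensitive strings. The class below is a hypothetical stand-in, not the object returned by `createRecorder()`. Only the `scanForPII` method name comes from the example above; its return shape here is assumed.

```ts
type AttrValue = string | number | boolean;

interface RecordedSpan {
  name: string;
  attributes: Record<string, AttrValue>;
}

interface PiiHit {
  span: string;
  key: string;
  needle: string;
}

class RecorderSketch {
  readonly spans: RecordedSpan[] = [];

  /** Store a finished span; attributes are copied to freeze their state. */
  record(name: string, attributes: Record<string, AttrValue>): void {
    this.spans.push({ name, attributes: { ...attributes } });
  }

  /** Report every attribute whose key or value contains one of the needles. */
  scanForPII(needles: readonly string[]): PiiHit[] {
    const hits: PiiHit[] = [];
    for (const span of this.spans) {
      for (const key of Object.keys(span.attributes)) {
        const haystack = `${key}=${String(span.attributes[key])}`;
        for (const needle of needles) {
          if (haystack.includes(needle)) {
            hits.push({ span: span.name, key, needle });
          }
        }
      }
    }
    return hits;
  }
}
```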

---

## Performance characteristics

- With OTel **off** (default): every Shade hook resolves to the shared
  `NOOP_HOOK` instance. The cost is one function call + an object
  allocation that V8 hoists out in the steady state, measured at
  < 1 % overhead vs the pre-V3.4 baseline in the upload roundtrip
  benchmark.
- With OTel **on**: cost depends entirely on the configured exporter.
  Use `sample: 0.1` (or smaller) on hot paths in production.

---

## Adding new instrumentation

1. Identify a logical operation worth a span, typically anything that
   crosses a network/disk boundary or contends on a lock.
2. Add an `observability?: ObservabilityHook` to the relevant config
   surface and default it to `NOOP_HOOK`.
3. Name the span `shade.<area>.<op>` to keep cardinality bounded.
4. Set attributes via the `ATTR_*` constants from
   `@shade/observability`. **Never** introduce a new attribute key
   without a PII review; if you must, run the value through
   `safeAttribute()`.
5. Add a test that exercises the new instrumentation under the
   `createRecorder()` recorder and asserts no PII leaks.
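
Steps 2 to 4 might look roughly like the sketch below. The shape of the real `ObservabilityHook` is not described in this document, so the `*Sketch` interfaces, the `IndexLoader` class, the `shade.index.load` span name, and the `SHADE_EXAMPLE_FAILED` code are all hypothetical. The example is synchronous only to stay short.

```ts
// Hypothetical hook shape; the real ObservabilityHook may differ.
interface SpanSketch {
  setAttribute(key: string, value: string | number): void;
  end(): void;
}

interface ObservabilityHookSketch {
  startSpan(name: string): SpanSketch;
}

const NOOP_SPAN_SKETCH: SpanSketch = { setAttribute() {}, end() {} };
const NOOP_HOOK_SKETCH: ObservabilityHookSketch = { startSpan: () => NOOP_SPAN_SKETCH };

interface LoaderConfig {
  observability?: ObservabilityHookSketch; // step 2: optional config field
}

class IndexLoader {
  private readonly hook: ObservabilityHookSketch;

  constructor(config: LoaderConfig = {}) {
    this.hook = config.observability ?? NOOP_HOOK_SKETCH; // step 2: no-op default
  }

  /** `readFromDisk` returns how many retries the read needed. */
  load(readFromDisk: () => number): number {
    const span = this.hook.startSpan('shade.index.load'); // step 3: shade.<area>.<op>
    try {
      const retries = readFromDisk();
      span.setAttribute('shade.retry.count', retries); // step 4: allow-listed key
      return retries;
    } catch (err) {
      // A stable code only, never the error message itself.
      span.setAttribute('shade.error.code', 'SHADE_EXAMPLE_FAILED');
      throw err;
    } finally {
      span.end();
    }
  }
}
```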

---

## Migration

Previous versions had no tracing, only Prometheus metrics. The
`observability` field is optional, and adding it to an existing config
is fully backwards-compatible. Because of the `SHADE_OTEL_ENABLED` gate,
wiring a tracer into the config does not by itself turn tracing on, so a
deployment where the env-var was never set sees no unexpected overhead.