release(v4.8.5): kill flushOnce 15s success-backoff + per-recipient parallel drain

Prism filed a per-recipient-flush-concurrency FR pointing at
serial-per-flush. Investigation surfaced the actual culprit:
`scheduleFlush` was using a 15 s backoff on **both** the success and
failure paths, so envelopes enqueued *during* an in-flight flush sat
~15 s behind the next drain, visible as "10 s of silence then 25-frame
burst" on the receiving side under sustained sender output.

Two fixes:

1. `scheduleFlush` now uses 0 ms delay when `flushOnce` delivered ≥1
   envelope and more is queued (network healthy → drain remainder
   immediately). 15 s reserved for the actual failure case where every
   attempt this round failed. `flushOnce` returns
   `{ delivered, remaining } | null` so concurrent-flush early returns
   don't double-schedule.

2. `flushOnce` groups the outgoing queue by `recipientAddress` and
   drains buckets via `Promise.all`. Per-peer order preserved
   (sequential within a bucket); a slow POST to recipient A no longer
   head-of-line-blocks frames bound for B.

`Inbox.tick` public shape unchanged. `OutgoingQueueStore`
implementations see the same per-entry
list/remove/bumpAttempts/size contract; only cross-recipient
interleaving changes.

Tests cover (1) 25-envelope burst behind a 100 ms slow PUT drains
within 1 s, and (2) carol's PUT lands within 150 ms even when bob's
PUT stalls 200 ms.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
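
The two fixes described above can be sketched roughly as follows. This is a minimal illustration, not the actual `@shade/inbox` code: `QueuedEnvelope`, `FlushResult`, `nextFlushDelay`, `groupByRecipient`, `drainByRecipient`, and the `send` callback are hypothetical names, and stopping a bucket at its first failure is an assumption made here to keep per-peer order, not something the commit states.

```typescript
// Hypothetical shapes for illustration; the real Inbox internals are not shown here.
interface QueuedEnvelope {
  id: string;
  recipientAddress: string;
}

interface FlushResult {
  delivered: number;
  remaining: number;
}

const FAILURE_BACKOFF_MS = 15000;

// Delay before the next flush, or null when nothing should be scheduled.
// A null `result` models "another flush is already in flight".
function nextFlushDelay(result: FlushResult | null): number | null {
  if (result === null) return null; // concurrent flush: do not double-schedule
  if (result.remaining === 0) return null; // queue is empty: go idle
  // Progress was made, so drain the remainder now; otherwise back off.
  return result.delivered > 0 ? 0 : FAILURE_BACKOFF_MS;
}

// Bucket the queue by recipient, keeping enqueue order inside each bucket.
function groupByRecipient(queue: QueuedEnvelope[]): Map<string, QueuedEnvelope[]> {
  const buckets = new Map<string, QueuedEnvelope[]>();
  for (const env of queue) {
    const bucket = buckets.get(env.recipientAddress);
    if (bucket) bucket.push(env);
    else buckets.set(env.recipientAddress, [env]);
  }
  return buckets;
}

// Buckets run concurrently; envelopes within one bucket are sent one at a time.
async function drainByRecipient(
  queue: QueuedEnvelope[],
  send: (env: QueuedEnvelope) => Promise<void>,
): Promise<FlushResult> {
  let delivered = 0;
  const drainBucket = async (bucket: QueuedEnvelope[]): Promise<void> => {
    for (const env of bucket) {
      try {
        await send(env);
        delivered += 1;
      } catch (_err) {
        // Assumed policy: stop this bucket so later envelopes cannot overtake.
        break;
      }
    }
  };
  await Promise.all(Array.from(groupByRecipient(queue).values()).map(drainBucket));
  return { delivered, remaining: queue.length - delivered };
}
```

With this shape, a stalled `send` for one recipient only delays that recipient's bucket, and a round that delivers something while entries remain yields a 0 ms delay instead of the 15 s backoff.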
2026-05-08 22:56:27 +02:00
parent a98ea8a1bd
commit 3c0db14904
28 changed files with 334 additions and 59 deletions
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -5,6 +5,85 @@ All notable changes to Shade are documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

+## [4.8.5] — 2026-05-08 — `Inbox.flushOnce`: kill the 15 s success-backoff + per-recipient parallel drain
+
+Prism filed a "typing-into-a-chatty-shell" UX FR pointing at
+serial-per-flush behavior. The investigation surfaced a more
+important latent bug: `scheduleFlush` was using a 15 s backoff timer
+on **both** the success and failure paths, so any envelopes enqueued
+*during* an in-flight flush had to wait ~15 s for the next drain to
+fire — visible to Prism's web client as "10 s of silence then a
+25-frame burst" whenever the PC sidecar was emitting steady output.
+
+Two fixes ship together:
+
+**(1) `scheduleFlush` distinguishes healthy-drain from all-failed.**
+After `flushOnce` returns, if the round delivered ≥1 envelope and
+items are still queued, the next flush fires with **0 ms** delay
+(network is fine — drain whatever piled up immediately). The 15 s
+backoff is reserved for the actual failure case (every attempt this
+round threw / was rejected). `flushOnce` now returns
+`{ delivered, remaining } | null` so the scheduler can also tell
+"someone else is flushing, don't double-schedule" apart from
+"queue is empty, idle." Externally-visible API unchanged
+(`Inbox.tick()` still returns `{ flushed, received }`).
+
+**(2) Per-recipient parallel drain inside `flushOnce`.** The queue
+is grouped by `recipientAddress`; each bucket is drained
+sequentially (preserves per-peer enqueue order — the relay assigns
+`receivedAt` on PUT arrival, so concurrent PUTs to the same peer
+would let the second one land first), but distinct buckets run
+concurrently via `Promise.all`. Pre-fix, a slow POST to recipient A
+head-of-line-blocked every other recipient's frames. Future N-peer
+broadcast fan-outs (multiple devices viewing the same Prism PTY)
+benefit immediately; single-recipient deployments are unaffected
+since N=1 is the trivial parallel case.
+
+Reported by Prism (multi-device E2EE terminal). Acceptance: under
+sustained typing, web's `recv` rate is roughly proportional to PC's
+emit rate, with no multi-second silences punctuated by burst catch-ups.
+
+### Fixed
+
+#### `@shade/inbox` — `scheduleFlush` 15 s success-backoff
+- After a round with `delivered > 0` and entries still queued, the
+  next flush is rescheduled with `delayMs=0`. The 15 s timer is
+  reserved for rounds where every attempt failed (no progress, so
+  the delay avoids a tight retry loop).
+- Concurrent `scheduleFlush` calls during an in-flight flush are
+  detected via `flushOnce` returning `null`; the no-op early return
+  no longer double-schedules a 15 s retry for a flush that's
+  already running.
+
+#### `@shade/inbox` — `flushOnce` per-recipient parallelism
+- Outgoing queue is grouped by `recipientAddress`; buckets drain
+  concurrently via `Promise.all`. Per-peer order is preserved
+  (sequential within a bucket); Shade's wire model never guaranteed
+  cross-peer order to begin with.
+- Failure handling unchanged: per-entry `bumpAttempts` /
+  `maxAttempts` semantics are identical to V4.8.4.
+
+### Tests
+- `packages/shade-inbox/tests/client.test.ts`:
+  1. "burst enqueued during a flush drains immediately, not after
+     15 s backoff" — slow first PUT (100 ms), pile 24 more during,
+     assert `pendingCount === 0` within 1 s.
+  2. "per-recipient parallel drain — slow POST to A does not block
+     POSTs to B" — `bob` PUT stalls 200 ms; `carol` envelope queued
+     after; assert `inbox.message_delivered` for carol fires within
+     150 ms (would be ≥200 ms pre-fix).
+
+### Migration
+
+None. `Inbox.flushOnce` is a private method; the
+`{ delivered, remaining } | null` shape is internal. `Inbox.tick`
+public return `{ flushed, received }` is unchanged. Apps that hand
+custom `OutgoingQueueStore` implementations to `Inbox` see no
+contract change — `list()` / `remove()` / `bumpAttempts()` / `size()`
+are called the same way per entry; only the *order* of `remove()`
+calls across distinct recipients changes (interleaved instead of
+strictly sequential).
+
 ## [4.8.4] — 2026-05-08 — Server-side cross-channel dedup via `BridgeDeliveryLog`

 V4.8.3 shipped the *client-side* cross-channel dedup hook