Building a Real-Time Chat State Engine · Part 4 of 6

Streaming and Chaos Injection

2026-07-04 · 9 min read

How token streaming bends the design without breaking it, what actually survives when you break Redis and Postgres on purpose, and the failure test that found a real bug.


The Postgres-first design is clean until two things from the real world hit it: an LLM reply that arrives token by token, and infrastructure that fails at the worst moment. This post covers both. First, how streaming fits without becoming a new source of truth. Then, what happens when you break Redis and Postgres deliberately, including the bug that only showed up under failure.

Streaming: tokens are not events

An LLM reply streams token by token. Committing every token to Postgres, with a disk sync every few characters, would be absurd. So does streaming force Redis to be the source of truth after all?

No. It refines the rule. Tokens are ephemeral deltas, not durable events. The durable thing is the finished message. So streaming gets its own lane: tokens stream straight to Redis, transient and best-effort, and only the completed message commits to Postgres through the normal outbox path.

Figure. Two lanes: streaming tokens go straight to Redis and are transient; only the finished message commits to Postgres through the outbox.

This does not reintroduce the divergence problem, because the "orphan" rule is about durable truth, and tokens are not truth. If the model dies mid-stream, the partial tokens in Redis are simply discarded, and Postgres never got an incomplete message. On the client, the streaming bubble fills in live, then locks to the durable message when it commits.
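A minimal sketch of the two lanes may make the split concrete. It is not the code from the project: `FakeRedis`, `FakePostgres`, `ModelError`, and `stream_reply` are hypothetical in-memory stand-ins, and dropping the partial once the message is durable is an assumption. Error handling for a failed Redis publish is deliberately left out here.

```python
from dataclasses import dataclass, field


class ModelError(Exception):
    """Hypothetical error raised when the model dies mid-stream."""


@dataclass
class FakeRedis:
    """In-memory stand-in for the transient lane (assumption, not a real client)."""
    partials: dict = field(default_factory=dict)

    def append_token(self, message_id, token):
        self.partials[message_id] = self.partials.get(message_id, "") + token

    def discard(self, message_id):
        self.partials.pop(message_id, None)


@dataclass
class FakePostgres:
    """In-memory stand-in for the durable lane (assumption, not a real client)."""
    messages: dict = field(default_factory=dict)

    def commit_message(self, message_id, text):
        self.messages[message_id] = text


def stream_reply(message_id, tokens, redis, postgres):
    """Send tokens down the transient lane; commit only the finished message."""
    parts = []
    try:
        for token in tokens:
            parts.append(token)
            redis.append_token(message_id, token)  # transient lane
    except ModelError:
        # The model died mid-stream: drop the partial, commit nothing.
        redis.discard(message_id)
        return None
    text = "".join(parts)
    postgres.commit_message(message_id, text)  # durable lane
    redis.discard(message_id)  # assumed: the partial is not needed once durable
    return text
```

With a generator that raises `ModelError` after a few tokens, `stream_reply` returns `None` and neither store holds anything for that message, which is the "tokens are not truth" property in miniature.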

There is one wrinkle: a client that reconnects mid-stream. Postgres does not have the message yet, because it commits only on completion, so the in-flight partial has to stay in Redis until then. A reconnect is therefore three parts: the snapshot from Postgres, the in-flight partial from Redis, and the live tail. Streaming is just the extreme case of the "live but disposable" lane, the same category as a typing indicator.
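The reconnect can be sketched as a pure function over the two reads. Everything here is an assumption for illustration: the row shape (`id`, `seq`, `text`), the `ReconnectState` type, and the rule that a durable copy wins over a partial if the message committed between the two reads.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReconnectState:
    messages: list              # durable messages from the Postgres snapshot
    streaming: Optional[dict]   # in-flight partial from Redis, if any
    resume_after: int           # last durable sequence number; the tail starts after it


def build_reconnect_state(snapshot_rows, in_flight_partial):
    """Combine the Postgres snapshot with the Redis partial for a reconnecting client."""
    messages = sorted(snapshot_rows, key=lambda row: row["seq"])
    committed_ids = {row["id"] for row in messages}

    streaming = in_flight_partial
    # If the message committed between the two reads, the durable copy wins.
    if streaming is not None and streaming["id"] in committed_ids:
        streaming = None

    resume_after = messages[-1]["seq"] if messages else 0
    return ReconnectState(messages, streaming, resume_after)
```

The client would render `messages`, show `streaming` as the live bubble, and subscribe to events after `resume_after` for the tail.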

A design is a hypothesis until you attack it

To trust any of this, I built the two systems for real and added a switch to break Redis or Postgres for a chosen number of seconds. This is software chaos injection: while the gate is open, every call to that component raises an error, simulating an outage without touching the container.
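One way such a switch could look, as a sketch under assumptions: `ChaosGate`, `ComponentDown`, and the injectable `clock` parameter are hypothetical names, and the real harness may be wired differently. The idea is that each client wrapper calls `check()` before touching the component.

```python
import time


class ComponentDown(Exception):
    """Raised for every call while the component is deliberately broken."""


class ChaosGate:
    """Software-only outage switch for one component (hypothetical sketch)."""

    def __init__(self, name, clock=time.monotonic):
        self.name = name
        self._clock = clock          # injectable so tests do not need to sleep
        self._broken_until = 0.0

    def break_for(self, seconds):
        """Simulate an outage lasting the given number of seconds."""
        self._broken_until = self._clock() + seconds

    def is_broken(self):
        return self._clock() < self._broken_until

    def check(self):
        """Call at the top of every client method that talks to the component."""
        if self.is_broken():
            raise ComponentDown(f"{self.name} is down (chaos injected)")
```

Because the gate only raises in the calling process, the container keeps running and recovery is just the clock passing the deadline.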

The first thing it found was a bug in my own code.

The bug that only appeared under failure

My token loop published each token to Redis with no error handling. So if Redis broke mid-stream, that publish threw, the streaming task crashed, and the finished message was never committed to Postgres. That is the exact opposite of the design: a Redis blip was silently dropping durable messages.

The fix is two rules:

  • Tokens are best-effort. A failed token publish must never abort the message being generated.
  • The final commit always happens, and a background relay worker retries any pending publish once Redis recovers.
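The two rules above can be sketched as a token loop that swallows publish failures and always reaches the commit. The function name and the `publish_token` and `commit_message` callables are hypothetical; `commit_message` is assumed to write the message and its outbox row to Postgres in one transaction, so the relay has something to retry.

```python
def stream_reply_safely(message_id, tokens, publish_token, commit_message):
    """Best-effort token publishing; the final commit is never skipped.

    Returns the finished text and the number of tokens that failed to publish.
    """
    parts = []
    dropped = 0
    for token in tokens:
        parts.append(token)
        try:
            publish_token(message_id, token)
        except Exception:
            # Rule 1: a failed token publish must not abort the message.
            dropped += 1

    text = "".join(parts)
    # Rule 2: the durable commit always happens, whatever Redis did.
    commit_message(message_id, text)
    return text, dropped
```

The broad `except Exception` is intentional in this sketch: from the loop's point of view any publish failure is equally disposable. An error from `commit_message` is not caught, because a failed durable write should surface to the caller.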

It is a small fix, but I would never have written it from the diagram. The failure test wrote it for me.

What survives when you break Redis

Figure. Break Redis mid-stream: the animation freezes, but the finished message still commits to Postgres, and a reconnecting client recovers it from the snapshot.

Break Redis in the middle of a streaming reply and:

  • The live token animation freezes. Those tokens are gone, because they are disposable.
  • The model keeps generating, and the finished message still commits to Postgres.
  • When Redis recovers, the relay republishes the pending message, and a reconnecting client pulls the complete message from the Postgres snapshot.
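The recovery step can be illustrated with one pass of a relay over pending rows. This is a sketch only: the outbox is modelled as a list of dicts with a `published` flag, and `relay_once` and `publish` are hypothetical names. A real relay would read pending rows from Postgres and mark them inside a transaction.

```python
def relay_once(outbox, publish):
    """Try to publish every pending outbox row; return how many succeeded."""
    sent = 0
    for row in outbox:
        if row["published"]:
            continue
        try:
            publish(row["message_id"], row["text"])
        except Exception:
            # Redis is still down: stop here to preserve order, retry next tick.
            break
        row["published"] = True
        sent += 1
    return sent
```

Run on a timer, this does nothing useful while Redis is broken and drains the backlog on the first tick after it recovers. The row stays pending until a publish succeeds, so the message cannot be lost by the relay.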

The honest nuance: you recover the message, not the animation. That is the whole thesis made concrete. Lose the transient thing, never the durable thing.

What survives when you break Postgres

The mirror image. Break Postgres and writes are rejected, on purpose, because the source of truth cannot accept anything it cannot make durable. A user's send fails and they retry, which is the correct behaviour for an outage of the system of record. Nothing is silently half-written.
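A sketch of what "rejected, on purpose" could look like at the send handler, assuming a hypothetical `StoreUnavailable` error raised by the Postgres wrapper and a hypothetical response shape. The handler does not touch Redis at all, so a failed commit leaves nothing behind.

```python
class StoreUnavailable(Exception):
    """Hypothetical error raised when Postgres cannot accept a write."""


def handle_send(message_id, text, commit_message):
    """Accept a message only once it is durable; otherwise reject it cleanly."""
    try:
        commit_message(message_id, text)
    except StoreUnavailable:
        # Nothing was written anywhere, so the client can simply retry.
        return {"ok": False, "retryable": True, "error": "store_unavailable"}
    return {"ok": True, "message_id": message_id}
```

The explicit `retryable` flag is a design choice for the sketch: it lets the client distinguish an outage of the system of record from a message that was rejected for its content.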

Why this is the part that convinced me

Diagrams argue that a design is correct. Watching it survive an outage you triggered by hand proves it, and occasionally proves it wrong in a way you can then fix. Chaos testing turned "I think it recovers" into "I watched it recover," and turned one confident assumption into a real bug report. If you build one of these, build the break switch too.

What's next

  • The cost breakdown. What this stack costs on managed cloud, and the one decision (Redis persistence) worth the largest line item. (Repo coming soon.)
  • The cloud deployment, where the streaming lane and the relay run over a real network, and the latencies stop being loopback numbers.