Building a Real-Time Chat State Engine

I have been building the state engine behind a real-time chat product: the layer that decides where a conversation lives, how every message becomes durable, and how it shows up on every screen the instant it happens. This post builds that engine from the ground up. We start with the naive design, watch it break, and evolve it step by step into the pattern I actually shipped.

It is the first in a short series. The deep dives and the runnable repos (the benchmark harness, the outbox internals, the chaos tests, the cost breakdown, and a real cloud deployment) come in later parts. This post is the spine that ties them together.

Here is the whole thing in one picture: one conversation state that has to be both durable and live, seen by many viewers at once.

The system we are building: one conversation state, both durable and live, seen by many viewers at once

Why this is worth building carefully

Before any code, it is worth being honest about why this is a real problem and not a solved one you can copy and paste.

A conversation has two jobs that pull in opposite directions:

It has to be durable. Nothing is ever lost, and you can load the full history back later.
It has to be live. Every message appears on the user's screen, and on any human agent's screen, the instant it happens.

Those two jobs want different tools. Durability wants a database. Live fan-out wants something fast and push-based. The entire difficulty of this system is making one event satisfy both without the two halves ever disagreeing. Get it wrong and you get the bugs everyone has seen: a message that shows up and then vanishes on refresh, a duplicate reply, a live view that silently drifts from what is actually stored.

So let us build it, starting with the most obvious design, and let the problems teach us the next step.

What we are building

The requirements, concretely:

Actors: a user, an AI assistant, and sometimes a human live agent, all in the same conversation.
Durable: every message and every state change is persisted and reloadable.
Live: new messages fan out to all connected clients in real time, including a second tab or an agent joining midway.
Honest scale: a few million events a month today, growing. Not web scale. This matters, because it means we should not over-engineer for a load we do not have.
Cheap: the running bill should be small, and it should not balloon just to make the live layer reliable.

Keep that scale line in mind. Half the good decisions here come from refusing to build for a scale we are nowhere near.

Attempt 1: the naive design

The first thing almost everyone reaches for: write each event to both stores directly. Push it to Redis so it fans out live, and write it to Postgres so it is durable. Two writes, one for each job.

Attempt 1, the naive dual write: two independent writes with no shared transaction, so they can diverge

In a demo this works beautifully. You send a message, it appears live, and it is in the database. Ship it.

Then it breaks, and it breaks in the worst way: quietly, under partial failure. The two writes are independent. There is no transaction spanning Redis and Postgres, and there cannot be one. So:

The Redis write succeeds and the Postgres write fails. The message was shown live, and then it is gone on the next reload. Delivered but not durable.
Or Postgres succeeds and Redis fails. The message is safe, but nobody saw it live.
And under concurrency, the two stores can even disagree on ordering.

This is the classic dual-write problem, and it is an anti-pattern, not a design. You will spend the rest of the project writing reconciliation code to paper over it. The lesson: you cannot make two independent writes atomic. One of them has to be the source of truth, and the other has to be derived from it.
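To make the first failure mode concrete, here is a minimal sketch with in-memory stand-ins for the two stores. The names FlakyStore and dual_write are hypothetical, invented for illustration; nothing here talks to a real Redis or Postgres.

```python
class FlakyStore:
    """In-memory stand-in for a store whose next write can be made to fail."""

    def __init__(self, name):
        self.name = name
        self.items = []
        self.fail_next = False

    def write(self, event):
        if self.fail_next:
            self.fail_next = False
            raise ConnectionError(f"{self.name} write failed")
        self.items.append(event)


def dual_write(live, durable, event):
    """Naive dual write: two independent writes with no shared transaction."""
    live.write(event)     # viewers see the message immediately
    durable.write(event)  # can still fail after the live write has happened


live = FlakyStore("redis")        # stand-in for the live transport
durable = FlakyStore("postgres")  # stand-in for the durable store

dual_write(live, durable, "hello")

durable.fail_next = True  # simulate a partial failure on the second write
try:
    dual_write(live, durable, "are you there?")
except ConnectionError:
    pass  # the caller gets an error, but viewers already saw the message

# Messages that were shown live but will be gone on the next reload.
shown_but_not_stored = [e for e in live.items if e not in durable.items]
print(shown_but_not_stored)  # ['are you there?']
```

Swapping the order of the two writes does not fix this. It only trades "delivered but not durable" for "durable but never delivered".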

Attempt 2: make Redis the source of truth

Fine, one source of truth. Which one? For a live system the instinct is Redis: the conversation lives in Redis while it is active, and a background process drains it to Postgres later. This is the write-behind pattern, and it does fix the divergence, because now there is only one authoritative write.

Attempt 2, write-behind: Redis is the source of truth and drains to Postgres asynchronously, but it must persist and has a flush window

But making the cache authoritative drags three new problems in with it:

Redis now has to be durable. If it is the truth, a crash cannot lose it. That means turning on persistence, which on a managed cloud usually means jumping to a premium tier at several times the price.
There is a flush window. Between "written to Redis" and "drained to Postgres," the data lives in exactly one place. A crash in that window loses it.
There are now two sources of truth over the conversation's life (hot in Redis, cold in Postgres) with a handoff to get right on every read.

It works, but look at what it is carrying: a more expensive Redis, a data loss window, and a lifecycle handoff. That is a lot of weight. Before accepting it, I wanted to challenge the assumption underneath the whole thing.
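The flush window is easy to see in a toy model. This is a sketch only: WriteBehindCache is a hypothetical in-memory class, and "crash" here means a cache without persistence losing whatever it has not yet drained.

```python
class WriteBehindCache:
    """Toy write-behind model: the cache is authoritative until it is flushed."""

    def __init__(self):
        self.unflushed = []  # exists only in the cache (stand-in for Redis)
        self.database = []   # stand-in for Postgres, filled by the drain

    def write(self, event):
        self.unflushed.append(event)

    def flush(self):
        """Background drain: copy everything pending into the database."""
        self.database.extend(self.unflushed)
        self.unflushed.clear()

    def crash(self):
        """A non-persistent cache crash: whatever was not flushed is gone."""
        lost = list(self.unflushed)
        self.unflushed.clear()
        return lost


cache = WriteBehindCache()
cache.write("m1")
cache.write("m2")
cache.flush()       # m1 and m2 are now durable
cache.write("m3")   # m3 sits in the flush window
lost = cache.crash()

print(lost)            # ['m3']
print(cache.database)  # ['m1', 'm2']
```

Shrinking the flush interval narrows the window but never closes it, which is why this design pushes you toward a persistent cache.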

The question that reframes everything

The only reason we made Redis the source of truth is the belief that Postgres is too slow to be the live read path. Loading a conversation on every turn would hammer the database. Right?

I did not want to argue about it, so I measured it. I seeded a Postgres table to ten million events, shaped like real conversation data, and benchmarked the one operation the whole design leans on: loading a single user's conversation, cold and warm, at that depth.

Operation	Cold	Warm
Load one conversation (at 10M rows)	~2.6 ms	~0.5 ms
Load a full user timeline + metadata	~4.6 ms	~1 ms

The number is not even the interesting part. The interesting part is why it stays fast: a conversation load is a bounded index scan of that user's own rows, not a scan of the ten million. The table could hold a hundred million and this read would look roughly the same, because the work depends on the size of one conversation, not the size of the table. Depth does not hurt it.
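The "bounded index scan" claim can be checked on a small scale. The sketch below uses SQLite from the Python standard library purely as a stand-in so it runs anywhere; the real measurements above were on Postgres. The schema, the column names, and the index name idx_events_conv_seq are all assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events ("
    " id INTEGER PRIMARY KEY,"
    " conversation_id INTEGER NOT NULL,"
    " seq INTEGER NOT NULL,"
    " body TEXT NOT NULL)"
)
# Composite index: find one conversation's rows, already in sequence order.
conn.execute("CREATE INDEX idx_events_conv_seq ON events (conversation_id, seq)")

# 2,000 conversations of 25 messages each (50,000 rows in total).
rows = [(conv, seq, f"msg {seq}") for conv in range(2000) for seq in range(25)]
conn.executemany(
    "INSERT INTO events (conversation_id, seq, body) VALUES (?, ?, ?)", rows
)

query = "SELECT seq, body FROM events WHERE conversation_id = ? ORDER BY seq"

# Ask the planner how it will run the load. The last column of each plan row
# is the human-readable detail; its exact wording varies by SQLite version.
plan = conn.execute("EXPLAIN QUERY PLAN " + query, (42,)).fetchall()
plan_text = " ".join(row[3] for row in plan)
print(plan_text)

conversation = conn.execute(query, (42,)).fetchall()
print(len(conversation))  # 25
```

The plan should name the composite index rather than a full table scan. On Postgres the equivalent check is running the same query under EXPLAIN ANALYZE.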

That one measurement pulled the rug out from under Attempt 2. Postgres is not the slow store we were routing around. It is fast enough to just be the truth. So instead of paying to make Redis durable, we can flip the whole thing.

Attempt 3: Postgres-first with an outbox

Here is the design the benchmark unlocks:

Postgres is the source of truth. Redis is a transient transport for the live view, and it owns nothing durable.

The only clever piece is getting an event into both places without falling back into the dual-write trap. You use a transactional outbox. In a single Postgres transaction, you write two rows: the event itself, and a small "publish me" note into an outbox table.

BEGIN;
  INSERT INTO events (...);              -- the durable event
  INSERT INTO outbox (event_id, ...);    -- "publish this to the live view"
COMMIT;                                  -- both land, or neither does

Attempt 3, Postgres-first with an outbox: one atomic write, a relay drains it to Redis, and clients snapshot from Postgres

Because both writes are in the same database, they are atomic for free, with no cross-system coordination needed. Then a tiny background worker, the relay, reads unpublished outbox rows and pushes them to Redis, deleting each once it is published.
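Here is a runnable sketch of the write path and the relay together. It is not the shipped code: SQLite stands in for Postgres, a plain Python list stands in for the Redis channel, and the table columns and the names append_event and relay_once are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (id INTEGER PRIMARY KEY, conversation_id INTEGER, body TEXT);
    CREATE TABLE outbox (id INTEGER PRIMARY KEY, event_id INTEGER NOT NULL);
""")


def append_event(conn, conversation_id, body, fail_before_commit=False):
    """Write the event and its outbox note in one transaction."""
    with conn:  # commits on success, rolls back if an exception escapes
        cur = conn.execute(
            "INSERT INTO events (conversation_id, body) VALUES (?, ?)",
            (conversation_id, body),
        )
        conn.execute("INSERT INTO outbox (event_id) VALUES (?)", (cur.lastrowid,))
        if fail_before_commit:
            raise RuntimeError("simulated crash before COMMIT")
    return cur.lastrowid


def relay_once(conn, publish):
    """Publish every pending outbox row, deleting each one after it is sent."""
    pending = conn.execute(
        "SELECT o.id, e.id, e.conversation_id, e.body"
        " FROM outbox o JOIN events e ON e.id = o.event_id ORDER BY o.id"
    ).fetchall()
    for outbox_id, event_id, conversation_id, body in pending:
        # If publish raises, the row stays in the outbox and is retried later.
        publish({"event_id": event_id, "conversation_id": conversation_id, "body": body})
        with conn:
            conn.execute("DELETE FROM outbox WHERE id = ?", (outbox_id,))
    return len(pending)


channel = []  # stand-in for the live transport
append_event(conn, 1, "hello")
try:
    append_event(conn, 1, "never committed", fail_before_commit=True)
except RuntimeError:
    pass  # both inserts rolled back together

relay_once(conn, channel.append)
print([m["body"] for m in channel])  # ['hello']
```

The rolled-back message never reaches the channel, because the relay can only see what committed. Publishing before deleting is what makes the relay at-least-once rather than at-most-once.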

Now walk back through everything that hurt in Attempts 1 and 2:

No divergence. There is only one write (to Postgres). The publish to Redis is derived from what committed, so it can never contradict the truth.
No flush window. Postgres is written first, always.
Redis is disposable. It holds nothing that is not already durable, so it needs no persistence and runs on the cheap tier.
A Redis outage never blocks writes. The commit does not wait on Redis; the relay just retries later.
Reconnects are trivial. A client reads a snapshot from Postgres, then tails Redis. Refresh, a second tab, or an agent joining all use the same path.

The naming is worth burning in, because I explained it to teammates a dozen times: the outbox is not a store of events (Postgres is that). It is a reliable handoff queue from the source of truth to the transport, so the transport can be down without losing anything.

There is a duplicate story too (the relay is at-least-once, so a crash between publish and delete can republish), but it costs nothing, because clients already dedupe by a sequence ID they track for the live view. At-least-once delivery plus an idempotent consumer equals effectively-once. Old, well-trodden ground, which is exactly what you want load-bearing.
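A sketch of what that client-side dedupe can look like, combined with the snapshot-then-tail reconnect path. It assumes each event carries a sequence number that only increases within a conversation and that events arrive in order; the ConversationView class and its method names are hypothetical.

```python
class ConversationView:
    """Client-side view that ignores anything it has already applied."""

    def __init__(self):
        self.messages = []
        self.last_seq = 0  # highest sequence number applied so far

    def apply(self, seq, body):
        """Apply one event; return False if it was a duplicate or stale."""
        if seq <= self.last_seq:
            return False
        self.messages.append(body)
        self.last_seq = seq
        return True

    def load_snapshot(self, rows):
        """Replay rows read from the durable store, oldest first."""
        for seq, body in rows:
            self.apply(seq, body)


view = ConversationView()

# 1. Snapshot from the source of truth.
view.load_snapshot([(1, "hi"), (2, "hello")])

# 2. Tail the live transport. It overlaps the snapshot (seq 2) and the
#    relay re-published seq 3 after a crash between publish and delete.
live_stream = [(2, "hello"), (3, "how can I help?"), (3, "how can I help?")]
applied = [view.apply(seq, body) for seq, body in live_stream]

print(applied)        # [False, True, False]
print(view.messages)  # ['hi', 'hello', 'how can I help?']
```

The same check covers both sources of duplicates: the overlap between snapshot and tail, and relay retries.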

Attempt 3 is the spine. Three real-world details bend it without breaking it, and each is its own post in this series:

Streaming. An LLM reply arrives token by token, and committing every token to Postgres would be absurd. The fix: tokens are ephemeral deltas, not durable events. They stream straight to Redis on their own lane, and only the finished message commits through the outbox. Streaming is just the extreme case of the "live but disposable" idea.
Failure. A design is a hypothesis until you attack it. I built the two systems for real and added a switch to break Redis or Postgres on command. Break Redis mid-stream and the animation freezes, but the finished message still commits to Postgres and the client recovers from the snapshot. Break Postgres and writes are correctly rejected, because the source of truth cannot accept what it cannot make durable.
Cost. Because Redis holds no durable truth, it never needs the premium persistence tier. A decision made purely for correctness turned out to remove the single largest line item from the budget. Good architecture and a cheaper bill were the same choice.
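The two-lane idea from the streaming and failure points can be sketched in a few lines. This is an illustration under assumptions, not the production code: stream_reply, publish_delta, and commit_message are hypothetical names, and silently swallowing a failed delta is my simplification of "the animation freezes".

```python
def stream_reply(tokens, publish_delta, commit_message):
    """Send tokens on an ephemeral lane, then commit the finished message."""
    parts = []
    for token in tokens:
        parts.append(token)
        try:
            publish_delta(token)  # ephemeral lane: best effort only
        except ConnectionError:
            pass  # the live animation freezes; nothing durable is lost
    final = "".join(parts)
    commit_message(final)  # durable lane: in the real design, via the outbox
    return final


tokens = ["Hello", ", ", "world", "!"]
seen_live = []  # what a connected client saw arrive token by token
durable = []    # stand-in for messages committed to the source of truth


def flaky_publish(token):
    """Live transport that goes down after delivering two tokens."""
    if len(seen_live) >= 2:
        raise ConnectionError("live transport down")
    seen_live.append(token)


stream_reply(tokens, flaky_publish, durable.append)

print(seen_live)  # ['Hello', ', ']
print(durable)    # ['Hello, world!']
```

The client that saw only two tokens recovers the full message the same way any reconnecting client does, by reading the snapshot.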

What building it taught me

Stripped of the specifics:

Let the naive design fail on purpose, then follow the failure. Each attempt's problem pointed directly at the next attempt. That is a better way to arrive at a pattern than reciting it from a book.
Measure before you add a moving part. One benchmark reversed the central decision. Opinions are cheap; a number ends the argument.
Every new part has to pay its rent, and every new part has second-order effects. A durable, highly available cache is not one decision, it is a sync problem, a failure mode, and a permanent tax on everyone who touches the system.
Separate the durable from the ephemeral, and let each be simple. Streaming, outages, and reconnects all got easy the moment I stopped treating "the message" and "the live push" as one thing.
Sometimes the cheapest, the most robust, and the simplest choices are the same choice. When that happens, take it and do not get clever.

None of this is novel computer science. It is the transactional outbox, event sourcing, and information hiding, patterns with names and history. The value was not inventing anything. It was refusing to add a moving part until it earned its place, and being willing to measure instead of assume.

What's next in this series

This post was the design story. The parts I will publish next go deep, each with a Git repo you can run yourself:

The benchmark harness. How I seeded ten million realistic events and measured read latency at depth. (Repo coming soon.)
The outbox and relay, in detail. Delete versus mark and purge, duplicate handling, how the relay wakes up without hammering Postgres. (Repo coming soon.)
Streaming and chaos injection. The two-lane token design and the failure tests, with the software switch to break Redis on demand. (Repo coming soon.)
The full cost breakdown. The per-service pricing and the candidate stacks, so you can plug in your own numbers.
Deploying to the cloud. Standing the whole thing up on managed Postgres and Redis, and measuring how the numbers shift once real network hops are in the path (warm reads that were sub-millisecond locally will not be, and that is the point). (Repo coming soon.)

If you are staring at this same problem in your own product, I hope the short version saves you a week: build the naive thing in your head, notice exactly how it breaks, measure your database before you route around it, and give Redis only the job it is genuinely the best tool for.