Building a Real-Time Chat State Engine · Part 3 of 6

The Outbox and Relay, in Detail

2026-07-04 · 9 min read

The internals of the pattern that carries the whole design: two tables, a tiny relay worker, why duplicates are harmless, and how the relay wakes up without hammering Postgres.

In part one the design landed on Postgres-first with an outbox: Postgres owns the truth, Redis is a disposable transport, and a small worker called the relay carries events from one to the other. That post said what the outbox is. This one is the mechanics: the two tables, the relay loop, how you clean it up, why duplicates do not matter, and how the relay stays cheap.

Figure: The producer commits the event and an outbox row in one transaction; the relay drains the outbox to Redis and cleans it up.

Why the outbox exists at all

The problem it solves is small and sharp: you cannot make one write to Postgres and one write to Redis atomic. There is no transaction spanning two systems. So if you write both directly, they can disagree, and you are back to the dual-write bug.

The outbox sidesteps this with a trick: the only thing that has to be atomic is two writes to the same database, which Postgres gives you for free. So instead of "write to Postgres and Redis," you "write to Postgres twice, in one transaction": the event, and a note saying publish it. Then a separate worker does the actual publishing later, retrying until it succeeds.

The two tables, and the atomic write

Two tables with very different lifecycles:

The event log is the durable source of truth. It grows forever.
The outbox is a small queue of events still waiting to be published. In steady state it is nearly empty.

The write is one transaction:

BEGIN;
  INSERT INTO user_events (...);              -- the durable event
  INSERT INTO outbox (event_id, payload);     -- "publish this to the live view"
COMMIT;                                        -- both land, or neither

That single commit is the whole guarantee. A durable event always has a matching "publish me" note, and a note always has its event. You can never have one without the other, because they commit together.
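A minimal sketch of that guarantee in Python, using the standard library's sqlite3 as a stand-in for Postgres so it runs anywhere. The column names, the JSON payload shape, the record_event helper, and the fail_before_commit switch are all assumptions for illustration, not the schema from the series.

```python
import json
import sqlite3

# Assumption: sqlite3 stands in for Postgres; columns are illustrative only.
SCHEMA = """
CREATE TABLE user_events (
    event_id        INTEGER PRIMARY KEY,
    conversation_id TEXT NOT NULL,
    body            TEXT NOT NULL
);
CREATE TABLE outbox (
    event_id INTEGER PRIMARY KEY REFERENCES user_events(event_id),
    payload  TEXT NOT NULL
);
"""


def open_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def record_event(conn: sqlite3.Connection, conversation_id: str, body: str,
                 fail_before_commit: bool = False) -> int:
    """Insert the durable event and its outbox note in one transaction."""
    with conn:  # commits on success, rolls back if the block raises
        cur = conn.execute(
            "INSERT INTO user_events (conversation_id, body) VALUES (?, ?)",
            (conversation_id, body),
        )
        event_id = cur.lastrowid
        # Assumption: the event id doubles as the sequence id in the payload.
        payload = json.dumps(
            {"seq": event_id, "conversation_id": conversation_id, "body": body}
        )
        conn.execute(
            "INSERT INTO outbox (event_id, payload) VALUES (?, ?)",
            (event_id, payload),
        )
        if fail_before_commit:
            raise RuntimeError("simulated crash before COMMIT")
    return event_id


def counts(conn: sqlite3.Connection) -> tuple:
    """Return (number of events, number of outbox notes)."""
    events = conn.execute("SELECT COUNT(*) FROM user_events").fetchone()[0]
    notes = conn.execute("SELECT COUNT(*) FROM outbox").fetchone()[0]
    return events, notes
```

A failed call leaves both tables untouched, so the two counts can only move together.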

The relay loop

The relay is not a framework or a product. It is a tiny always-on worker holding one connection, doing four things forever:

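Read pending. Select unpublished outbox rows, oldest first. With delete-inline the table holds only pending rows; with a published flag, a partial index over the unpublished rows does the same job. Either way this is a cheap index probe that returns nothing in steady state.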
Publish. XADD each row's payload to the conversation's Redis stream, in order.
Clean. Remove the row now that it is published.
Wait. Sleep briefly, or block on a notification, then loop.

The important property: the relay can crash, restart, or fall behind at any point and lose nothing, because its to-do list is durable in Postgres. That is the entire reason the outbox is worth having.
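The four steps can be sketched in a few lines of Python. This is an illustration under assumptions, not the relay from the series: sqlite3 stands in for Postgres, and publish is a hypothetical callback where a real relay would call XADD on the conversation's stream.

```python
import sqlite3
import time
from typing import Callable


def open_outbox() -> sqlite3.Connection:
    # Assumption: a two-column outbox, with sqlite3 standing in for Postgres.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE outbox (event_id INTEGER PRIMARY KEY, payload TEXT NOT NULL)"
    )
    return conn


def relay_once(conn: sqlite3.Connection,
               publish: Callable[[int, str], None],
               batch_size: int = 100) -> int:
    """One pass of read, publish, clean. Returns the number of rows published."""
    rows = conn.execute(
        "SELECT event_id, payload FROM outbox ORDER BY event_id LIMIT ?",
        (batch_size,),
    ).fetchall()
    for event_id, payload in rows:
        publish(event_id, payload)  # a real relay would XADD here
        with conn:  # clean only after the publish returned without error
            conn.execute("DELETE FROM outbox WHERE event_id = ?", (event_id,))
    return len(rows)


def run_relay(conn: sqlite3.Connection,
              publish: Callable[[int, str], None],
              poll_interval_s: float = 0.2,
              should_stop: Callable[[], bool] = lambda: False) -> None:
    """Loop forever: drain the outbox, then wait briefly when it is empty."""
    while not should_stop():
        if relay_once(conn, publish) == 0:
            time.sleep(poll_interval_s)
```

If publish raises partway through a batch, the unpublished rows stay in the outbox and the next pass picks them up. A row that was published but not yet deleted is published again, which is the at-least-once behavior.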

Cleaning the outbox: delete vs purger

"Clean the row" has two implementations, and the choice is a real one.

Delete inline (the default). The relay deletes each row the moment it publishes it. The outbox stays tiny on its own, the query stays simple, and there is no extra process. The durable history already lives in the event log, so there is nothing to keep.

Mark published, then purge (for scale or audit). The relay only flips a published flag (fast), and a separate purger deletes published rows on a schedule. You get a short-lived audit trail of what was published when, at the cost of a second worker and a partial index. Batched deletes also amortize the per-statement and per-transaction overhead that many single-row deletes pay individually, which is why this variant shows up at high throughput. It does not reduce vacuum work, though: in Postgres the flag update leaves a dead row version of its own.

Start with delete inline. Reach for mark-and-purge only when you want the audit trail or you are cleaning at a rate where bulk deletes matter.
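For comparison, here is a rough sketch of the mark-and-purge variant. It assumes a nullable published_at timestamp in place of a boolean flag (so the audit trail records when), an illustrative retention window, and sqlite3 as a stand-in for Postgres. The partial index syntax shown happens to be valid in both.

```python
import sqlite3

# Assumption: published_at IS NULL means "still pending".
SCHEMA = """
CREATE TABLE outbox (
    event_id     INTEGER PRIMARY KEY,
    payload      TEXT NOT NULL,
    published_at REAL
);
-- Partial index: covers only pending rows, so it stays tiny.
CREATE INDEX outbox_pending ON outbox (event_id) WHERE published_at IS NULL;
"""


def mark_published(conn: sqlite3.Connection, event_id: int, now: float) -> None:
    """Relay side: record that the row went out, but keep it for the audit trail."""
    with conn:
        conn.execute(
            "UPDATE outbox SET published_at = ? WHERE event_id = ?",
            (now, event_id),
        )


def pending(conn: sqlite3.Connection) -> list:
    """Event ids still waiting to be published, oldest first."""
    rows = conn.execute(
        "SELECT event_id FROM outbox WHERE published_at IS NULL ORDER BY event_id"
    )
    return [row[0] for row in rows]


def purge(conn: sqlite3.Connection, now: float, retention_s: float) -> int:
    """Purger side: bulk-delete rows published more than retention_s ago."""
    with conn:
        cur = conn.execute(
            "DELETE FROM outbox "
            "WHERE published_at IS NOT NULL AND published_at < ?",
            (now - retention_s,),
        )
    return cur.rowcount
```

The relay never deletes in this variant; only the scheduled purger does, in one statement per run.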

Duplicates are harmless

The relay is at-least-once: if it crashes between publishing and cleaning, it re-publishes the same event on restart. That is not a bug to eliminate but a property to design around, because exactly-once delivery is essentially impossible in distributed systems.

Figure: The relay crashes between publish and clean, so it re-publishes; the client dedupes by sequence id, so the duplicate is a no-op.

It costs nothing because the consumer already dedupes by a stable id. Every event carries a sequence id the client tracks for the live view, so a re-delivered event arrives with a seq it has already applied, and it is dropped. At-least-once delivery plus an idempotent consumer equals effectively once. You can push it one layer earlier too, by using the event's seq as the Redis stream entry id: XADD rejects an id that is not greater than the stream's last entry, so a re-publish fails with an error that the relay can treat as already published.
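The client-side dedupe is small enough to show whole. This sketch assumes events for a conversation arrive in seq order with only duplicates to filter; LiveView and its fields are hypothetical names, and gap handling is out of scope.

```python
from dataclasses import dataclass, field


@dataclass
class LiveView:
    """Hypothetical client state for one conversation."""
    last_seq: int = 0
    messages: list = field(default_factory=list)

    def apply(self, seq: int, body: str) -> bool:
        """Apply an event once. Returns False when it was already applied."""
        if seq <= self.last_seq:
            return False  # duplicate from a relay re-publish: drop it
        self.messages.append(body)
        self.last_seq = seq
        return True
```

Feeding it seqs 1, 2, 2, 3 leaves three messages in the view, because the second delivery of seq 2 is dropped.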

Waking the relay without hammering Postgres

"A worker polling the database in a loop" sounds expensive. It is not, and you can avoid polling entirely.

Poll. A partial index makes "is there work?" a tiny probe returning zero rows in steady state, on a single held connection with no per-poll handshake. Polling every 100 to 500 ms is a handful of trivial queries a second, which Postgres does not notice.
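LISTEN / NOTIFY. Flip from pull to push: the relay sleeps on LISTEN, and a NOTIFY issued in the producer's transaction (or by a trigger on the outbox) wakes it as soon as that transaction commits. Zero idle polls, near-instant fan-out. Add a slow safety poll as a backstop, because notifications are not queued for a relay that was disconnected when they fired.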
CDC. For very high scale, tail the write-ahead log directly (change data capture) and skip the table entirely. More infrastructure, rarely needed.

At a normal request rate, plain polling is fine. LISTEN/NOTIFY is the clean default when you want it snappier.
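The "push with a slow safety poll" shape of the wait step can be sketched without a database. Here a threading.Event stands in for the Postgres notification; in a real relay, whatever reads the LISTEN connection would set it. The function name and return values are assumptions for illustration.

```python
import threading


def wait_for_work(wakeup: threading.Event, safety_poll_s: float) -> str:
    """Block until notified, or until the safety poll interval elapses.

    The caller should read the outbox after this returns in either case,
    so a notification that was missed still gets picked up by the backstop.
    """
    notified = wakeup.wait(timeout=safety_poll_s)
    # Clear before the caller reads the outbox, so a notification that
    # arrives during the read triggers another pass instead of being lost.
    wakeup.clear()
    return "notified" if notified else "safety_poll"
```

With a safety interval of a few seconds, an idle relay issues almost no queries, while a notified one reacts immediately.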

What's next

The outbox handles the durable-to-live handoff. The next parts push on the edges of it:

Streaming and chaos injection. What happens when an LLM reply streams token by token (tokens skip the outbox on purpose), and what survives when you break Redis or Postgres deliberately. (Repo coming soon.)
The cost breakdown, then the cloud deployment where the relay runs over a real network.
