The Outbox Is Not Optional: Message Reliability from First Principles, Through the Crash Window in "Write the DB, Then Send the Message"

Author: Jack Qin

"Write the database, then send a message to notify someone else": this is the most common two-step operation in backend systems, so common that almost nobody stops to look at it. But between those two steps lies an invisible crack: the process can crash exactly when step one has succeeded and step two hasn't run yet. The data lands; the notification is never sent. A tank hits a critical level, the database records it, but the alert email never goes out. This kind of bug is catastrophic in production, and it isn't something that happens only when the code is written wrong: the two-step structure itself guarantees that there is a moment at which it can fail.

This post uses an environmental-monitoring platform as a worked example, covering how the venerable Web-Queue-Worker pattern lands in a .NET 10 backend. But I don't want to merely describe "we used RabbitMQ + MassTransit"; I want to unpack a few key decisions down to first principles: why the Web tier must "publish, never wait," why the Outbox is not a skippable optimization but a precondition for correctness, why scheduled jobs use Quartz instead of a bare Timer, and where failed messages should ultimately go. At the core of each is the same question — why this choice, and not the simpler-looking one.


First, Be Clear About What Problem This Pattern Solves

This backend handles two kinds of work with completely different natures:

  1. Synchronous HTTP requests: a user opens the dashboard, checks dust readings, configures alert thresholds — these must be fast, and their response time can't be dragged down by background work;
  2. Slow background work: triggering external data scraping, sending scheduled emails, calling AI to generate chart descriptions, recomputing heatmaps — these are slow and can be async, and nobody is staring at a browser waiting for them to finish.

Cram both into the same request thread and the result is a disaster you can sum up in one sentence: "One user clicks export, and HTTP slows down site-wide." The entire point of Web-Queue-Worker is to physically isolate these two kinds of work — fast stays with fast, slow with slow, neither dragging the other.

The constraints are equally pragmatic: a 1–2 person team, no heavy infrastructure that needs a dedicated operator; a single PostgreSQL, already in place; the message middleware has to run in one lightweight Docker container. Three roles, each minding its own business:

| Role | Responsibility | Implementation |
| --- | --- | --- |
| Web | HTTP requests, auth, publishing commands/events to the queue | Api (ASP.NET Core) |
| Queue | Decoupling Web from Worker, buffering async work | RabbitMQ (via MassTransit) |
| Worker | Consuming messages, running scheduled jobs, calling external APIs | Worker (.NET background service) |

flowchart LR
    Client[Client] --> Web[Web API]
    Web -->|Publish command| Queue[Queue / RabbitMQ]
    Queue -->|Consume| Worker[Worker]
    Web -.->|200 / 202 returns immediately, non-blocking| Client
    Worker -.->|Email / alert / scrape, long-running async| Done[Done]

"Publish, Never Wait": Eliminating an Entire Class of Performance Problems at the Root

This is the soul of the whole pattern, and its most counterintuitive point. The Web tier is publish-only — it drops a command into the queue and returns immediately, never waiting for the background work to finish. Whether the background job runs for 1 second or 60, the HTTP response time is unaffected.

Example: the frontend requests "generate AI chart descriptions"; the Web tier publishes a GenerateAiDescriptions command, returns a jobId, and it's done. The thing that actually calls the AI service is the Worker, which might run for tens of seconds. The user gets a 202 Accepted rather than a wait.
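A minimal sketch of that publish-only handler, written as plain C# so it runs without a broker. Only the names GenerateAiDescriptions and ICommandDispatcher come from this post; the command's fields, InMemoryDispatcher, AcceptedResponse, and AiDescriptionEndpoint are hypothetical stand-ins.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical command shape; the post only names the command itself.
public sealed record GenerateAiDescriptions(Guid JobId, int ChartId);

// Transport-agnostic dispatcher, shaped after the ICommandDispatcher named in the post.
public interface ICommandDispatcher
{
    Task SendAsync<T>(T command, CancellationToken ct = default) where T : class;
}

// In-memory stand-in for the queue so the sketch runs without RabbitMQ.
public sealed class InMemoryDispatcher : ICommandDispatcher
{
    public List<object> Sent { get; } = new();

    public Task SendAsync<T>(T command, CancellationToken ct = default) where T : class
    {
        Sent.Add(command);
        return Task.CompletedTask;
    }
}

public sealed record AcceptedResponse(int StatusCode, Guid JobId);

public static class AiDescriptionEndpoint
{
    // Publish-only handler: enqueue the command and return 202 with a job id.
    // It never calls the AI service, so its latency is independent of the Worker's run time.
    public static async Task<AcceptedResponse> HandleAsync(
        int chartId, ICommandDispatcher dispatcher, CancellationToken ct = default)
    {
        var jobId = Guid.NewGuid();
        await dispatcher.SendAsync(new GenerateAiDescriptions(jobId, chartId), ct);
        return new AcceptedResponse(202, jobId);
    }
}
```

In an ASP.NET Core endpoint the same shape would presumably end in something like Results.Accepted with the job id in the body; the point is that nothing in the handler awaits the AI call.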

It's worth being clear about the level at which this rule solves the problem. It's not an optimization that "makes slow requests a bit faster"; it's a structural decoupling of two variables: "how long background work takes" and "HTTP response time." Once they're decoupled, the entire class of "background work slows down and chokes the foreground" problems can no longer happen: not mitigated, eliminated. This is the difference between a structural fix and an optimization fix: an optimization tunes a problematic variable a little better; a structural fix removes that variable from the problem.

The Worker can be woken up two ways:

| Trigger | Mechanism | Examples |
| --- | --- | --- |
| Message-driven | MassTransit consuming from RabbitMQ | Send email, AI descriptions, trigger scrape |
| Schedule-driven | Quartz.NET cron jobs | Tank alerts (every 5 min), calibration reminders (8am daily) |

Message-driven is "woken by an external event," schedule-driven is "working on its own clock." Both run in the Worker process.


Why MassTransit, and Not the Bare Client

Choosing MassTransit over calling the RabbitMQ client directly is fundamentally a "build it yourself vs. take a mature component" judgment, but the basis isn't "convenience" — it's that correctly implementing reliability mechanisms is extremely expensive:

| Option | Verdict |
| --- | --- |
| MassTransit | Built-in retry policies, circuit breaking, dead-letter queues; EF Core transactional Outbox; transport-agnostic, swappable middleware |
| Bare RabbitMQ client | Every reliability mechanism, you write yourself |

Retry backoff, dead-lettering, transactional Outbox — each one isn't hard to write alone, but writing all of them correctly, and correctly under concurrency and crash scenarios, is a large and error-prone undertaking. This is exactly where you should use a mature component.

The middleware itself was compared too:

| Option | Verdict |
| --- | --- |
| RabbitMQ | Mature, natively supported by MassTransit, lightweight Docker container, built-in management UI |
| Redis Streams | Worthwhile only if Redis is already in the stack |
| PostgreSQL-based approach | Couples queue infrastructure into the database, harming database portability |
| Cloud-hosted messaging | Introduces vendor lock-in and cost for a self-hosted deployment |

Core idea: use MassTransit to separate "business code" from "the concrete transport." Module code only calls ICommandDispatcher.SendAsync and IEventPublisher.PublishAsync, never importing MassTransit directly. Want to swap RabbitMQ for another middleware? You change only the Infrastructure.Messaging project, with the business code untouched — the blast radius of a middleware swap is locked inside one project.
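One way this boundary could look, as a sketch rather than the project's actual code. The two interface names and method names come from the post; their exact signatures, the adapter class names, and the QueueNameFor helper (a hand-rolled kebab-casing that assumes a command's queue is named after its type) are assumptions.

```csharp
using System;
using System.Text.RegularExpressions;
using System.Threading;
using System.Threading.Tasks;
using MassTransit;

// These two interfaces are all that module code would see.
public interface ICommandDispatcher
{
    Task SendAsync<T>(T command, CancellationToken ct = default) where T : class;
}

public interface IEventPublisher
{
    Task PublishAsync<T>(T integrationEvent, CancellationToken ct = default) where T : class;
}

// Adapters that would live in Infrastructure.Messaging, the only project referencing MassTransit.
public sealed class MassTransitCommandDispatcher : ICommandDispatcher
{
    private readonly ISendEndpointProvider _sendEndpoints;

    public MassTransitCommandDispatcher(ISendEndpointProvider sendEndpoints) => _sendEndpoints = sendEndpoints;

    // Assumption: SendScheduledEmail maps to the queue send-scheduled-email.
    public static string QueueNameFor(Type commandType) =>
        Regex.Replace(commandType.Name, "(?<!^)([A-Z])", "-$1").ToLowerInvariant();

    public async Task SendAsync<T>(T command, CancellationToken ct = default) where T : class
    {
        // Point-to-point: resolve the one queue that owns this command and send to it.
        var endpoint = await _sendEndpoints.GetSendEndpoint(new Uri($"queue:{QueueNameFor(typeof(T))}"));
        await endpoint.Send(command, ct);
    }
}

public sealed class MassTransitEventPublisher : IEventPublisher
{
    private readonly IPublishEndpoint _publishEndpoint;

    public MassTransitEventPublisher(IPublishEndpoint publishEndpoint) => _publishEndpoint = publishEndpoint;

    // Fan-out: the broker routes the event to every subscribed consumer.
    public Task PublishAsync<T>(T integrationEvent, CancellationToken ct = default) where T : class
        => _publishEndpoint.Publish(integrationEvent, ct);
}
```

Swapping the transport would then mean rewriting these two adapters and the bus registration, while every module keeps compiling against the interfaces.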


Outbox: Not an Optimization, but Correctness Itself

This is the most important point in the whole post, and the direct answer to that crack in the opening. Without an Outbox, the moment a crash lands between the two steps of "write the DB, then send the message," the system enters an inconsistent state:

// Without Outbox: the crash window
await dbContext.SaveChangesAsync();          // step 1 succeeds: the reading is committed
// ← process crashes here
await publisher.Publish(tankLevelCritical);  // step 2 never runs: the Worker is never notified

The data is written, but the message notifying the Worker is lost. A tank hits a critical level, the database records it, but the alert email is never sent. What you have to see clearly here is: this crash window is not a "low-probability accident" but an inherent property of the two-step structure. As long as "write the DB" and "send the message" are two separate, non-atomic operations, there is always a moment for the process to die in between. You can't eliminate it by "writing more carefully," because it doesn't live at the level of code correctness — it lives at the structural level.

The transactional Outbox approach folds "send the message" into the same atomic operation as "write the DB": the message is written into an outbox table in the same database transaction as the business data. A relay then asynchronously delivers it to RabbitMQ, retrying on failure. Business data and message either commit together or roll back together, with no intermediate state, so the crash window is eliminated: two steps became one. The relay's guarantee is at-least-once, since a crash after publishing but before the outbox row is cleared causes a re-send, which is why consumers that can't tolerate duplicates deduplicate. There's a bonus too: even if RabbitMQ is unavailable at the time, the message isn't lost; it sits in the outbox table waiting to be delivered.
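To make the mechanics concrete, here is a toy in-memory model of the pattern, not MassTransit's implementation. FakeDatabase, FakeBroker, OutboxRelay, and OutboxMessage are all hypothetical names; the only thing the sketch is meant to show is that the message row commits with the business row and that the relay clears a row only after the broker has accepted it.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public sealed record OutboxMessage(Guid Id, string Type, string Payload);

// Stand-in for PostgreSQL: business rows and outbox rows live in the same store,
// so one commit covers both.
public sealed class FakeDatabase
{
    public List<string> TankReadings { get; } = new();
    public List<OutboxMessage> Outbox { get; } = new();

    // Mimics a single database transaction: both rows are written in one call.
    public void CommitAtomically(string reading, OutboxMessage message)
    {
        TankReadings.Add(reading);
        Outbox.Add(message);
    }
}

// Stand-in for RabbitMQ that can be switched off to simulate an outage.
public sealed class FakeBroker
{
    public bool Available { get; set; } = true;
    public List<OutboxMessage> Delivered { get; } = new();

    public bool TryPublish(OutboxMessage message)
    {
        if (!Available) return false;
        Delivered.Add(message);
        return true;
    }
}

public static class OutboxRelay
{
    // One relay pass: publish pending rows, removing each row only after the broker accepts it.
    // A crash between publish and removal would cause a re-send, hence at-least-once delivery.
    public static int DeliverPending(FakeDatabase db, FakeBroker broker)
    {
        var delivered = 0;
        foreach (var message in db.Outbox.ToList())
        {
            if (!broker.TryPublish(message)) break; // leave the row in place for the next pass
            db.Outbox.Remove(message);
            delivered++;
        }
        return delivered;
    }
}
```

With the broker switched off, a relay pass delivers nothing and the row stays put; once the broker is back, the next pass drains it. There is no moment at which the reading exists without its pending notification.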

So "the Outbox can't be skipped" isn't a maxim — it's a corollary: whenever you need "the data lands" and "the event notification" to stay consistent, and the two are written to two different storage systems, you must have some mechanism to fold the two writes into one atomic operation — and the Outbox is that mechanism. Skip it, and what you skip isn't code, it's correctness. This is the fundamental difference in nature between the Outbox and "reliability enhancements" like retries and dead-lettering: those make the system more robust, while the Outbox makes the system correct.
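In MassTransit terms, enabling the EF Core outbox might look roughly like the following. MonitoringDbContext and AddMessagingWithOutbox are hypothetical names, and the sketch assumes the MassTransit.EntityFrameworkCore and MassTransit.RabbitMQ packages plus a migration that creates the outbox tables.

```csharp
using System;
using MassTransit;
using Microsoft.EntityFrameworkCore;
using Microsoft.Extensions.DependencyInjection;

// Hypothetical DbContext; the three Add*Entity calls map MassTransit's inbox/outbox tables.
public sealed class MonitoringDbContext : DbContext
{
    public MonitoringDbContext(DbContextOptions<MonitoringDbContext> options) : base(options) { }

    protected override void OnModelCreating(ModelBuilder modelBuilder)
    {
        base.OnModelCreating(modelBuilder);
        modelBuilder.AddInboxStateEntity();
        modelBuilder.AddOutboxMessageEntity();
        modelBuilder.AddOutboxStateEntity();
    }
}

public static class MessagingRegistration
{
    public static IServiceCollection AddMessagingWithOutbox(
        this IServiceCollection services, Uri rabbitMqUri)
    {
        services.AddMassTransit(x =>
        {
            x.AddEntityFrameworkOutbox<MonitoringDbContext>(o =>
            {
                o.UsePostgres();  // PostgreSQL-specific locking for the outbox tables
                o.UseBusOutbox(); // Publish/Send are held in the outbox until SaveChangesAsync commits
            });

            x.UsingRabbitMq((ctx, cfg) =>
            {
                cfg.Host(rabbitMqUri);
                cfg.ConfigureEndpoints(ctx);
            });
        });

        return services;
    }
}
```

With the bus outbox on, a publish issued before SaveChangesAsync is stored as a row in the same transaction, and a background delivery service forwards it to RabbitMQ afterwards.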


Scheduled Jobs: Why a Bare Timer Isn't Enough

| Option | Verdict |
| --- | --- |
| Quartz.NET | Job state persisted to PostgreSQL; scheduled jobs don't re-fire after a Worker restart |
| Hangfire | Ties job persistence to the database, plus the burden of an extra dashboard |
| IHostedService + Timer | No persistence, no cron expressions |

The key is "persistence + restart idempotency." A bare Timer keeps the scheduling state in memory — when the Worker restarts, memory is wiped, and it either forgets jobs that haven't fired yet or re-fires jobs that already fired. Quartz stores job state in PostgreSQL, so on restart it knows which jobs have run and which are still pending, and won't re-fire.

This is another case where "where the state lives" decides everything: in-process state is guaranteed to be lost on a process restart, so any state that must survive across restarts has to land in out-of-process durable storage. A scheduled job's "where did I get to last time" is exactly that kind of state. A bare Timer isn't badly implemented — its design premise (the process doesn't restart) simply doesn't hold in real deployments.
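A sketch of what moving that state out of process could look like with Quartz.NET's dependency-injection integration. AddPersistentScheduling is a hypothetical helper, and the code assumes the Quartz.Extensions.Hosting and Quartz.Serialization.Json packages, the Npgsql provider, and that the Quartz tables have already been created in PostgreSQL from the project's SQL script.

```csharp
using Microsoft.Extensions.DependencyInjection;
using Quartz;

public static class SchedulingRegistration
{
    public static IServiceCollection AddPersistentScheduling(
        this IServiceCollection services, string connectionString)
    {
        services.AddQuartz(q =>
        {
            q.UsePersistentStore(store =>
            {
                store.UseProperties = true;          // keep job data as plain string key/value pairs
                store.UsePostgres(connectionString); // job and trigger state lives in PostgreSQL
                store.UseNewtonsoftJsonSerializer(); // serializer for persisted job data
            });
        });

        // Runs the scheduler inside the Worker host and lets in-flight jobs finish on shutdown.
        services.AddQuartzHostedService(options => options.WaitForJobsToComplete = true);
        return services;
    }
}
```

After a restart, the scheduler reads trigger state back from the database instead of starting from an empty in-memory timer.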


Implementation Details: Putting the Mechanisms into Code

Three Kinds of Messages, Three Semantics

| Type | Interface | Direction | Examples |
| --- | --- | --- | --- |
| Command | ICommandDispatcher.SendAsync<T> | Api → Worker (point-to-point) | SendScheduledEmail, ProcessTankAlert |
| Integration Event | IEventPublisher.PublishAsync<T> | Publisher → all subscribers (fan-out) | TankLevelCritical, DeviceCalibrationExpired |
| Scheduled Job | Quartz IJob | Inside the Worker (cron-triggered) | Tank level check, calibration reminder |

The difference between commands and events isn't a technical detail — it's the direction of coupling: a command is "have a specified someone do one thing" (point-to-point, the sender knows who should do it); an event is "something happened, whoever cares can take it" (fan-out, the publisher neither knows nor cares who's listening). Choose wrong and you write loose coupling as tight coupling.

Queue Naming and Retry/Dead-Letter

Use KebabCaseEndpointNameFormatter to turn the type name into a queue name: SendScheduledEmail → send-scheduled-email.

Failed messages have a backoff policy, ending in a dead-letter queue rather than vanishing:

flowchart TD
    Deliver[Message delivered] --> OK{Worker processes}
    OK -->|Success| Done[Remove ✓]
    OK -->|Failure| Retry[Immediate retry × 3]
    Retry --> Backoff[Backoff retry 5s → 15s → 30s]
    Backoff --> Error[Still failing → goes to _error queue<br/>Visible in RabbitMQ management UI :15672]

The _error queue is operations' "scene of the accident": go there to investigate why a consumer keeps failing. The accompanying alert threshold is "page someone when the _error queue depth > 10." Consumers needing deduplication use EventId for idempotency. The design stance here: failures shouldn't vanish silently; they should pile up somewhere visible, queryable, and covered by an alert. Messages evaporating into thin air is the hardest fault to diagnose; the dead-letter queue turns "vanished" into "visible."
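One way the retry ladder above could be expressed in MassTransit, as a sketch and not necessarily how this project configures it. ConsumerRetryPolicy is a hypothetical helper applied to a receive endpoint. As far as I know, delayed redelivery on RabbitMQ relies on the delayed message exchange plugin; where that plugin isn't installed, an interval-based UseMessageRetry would be the in-process alternative.

```csharp
using System;
using MassTransit;

public static class ConsumerRetryPolicy
{
    // Assumed mapping of "immediate x 3, then 5s / 15s / 30s, then _error".
    public static void Apply(IReceiveEndpointConfigurator endpoint)
    {
        // Outer layer: once immediate retries are exhausted, schedule the message
        // for redelivery after each interval in turn.
        endpoint.UseDelayedRedelivery(r => r.Intervals(
            TimeSpan.FromSeconds(5),
            TimeSpan.FromSeconds(15),
            TimeSpan.FromSeconds(30)));

        // Inner layer: three immediate retries for transient faults.
        endpoint.UseMessageRetry(r => r.Immediate(3));
    }
}
```

Redelivery is registered before retry so that it wraps the retry filter. When both layers give up, MassTransit moves the message to the endpoint's _error queue, which is the queue the alert threshold watches.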

Consumer Registration (Explicit, Not Scanning)

services.AddMassTransit(x =>
{
    x.SetKebabCaseEndpointNameFormatter();
    MonitoringModule.AddConsumers(x);
    TankManagementModule.AddConsumers(x);
    EmailModule.AddConsumers(x);
    // ... all modules
    x.UsingRabbitMq((ctx, cfg) => { cfg.Host(uri); cfg.ConfigureEndpoints(ctx); });
});

Register consumers explicitly, module by module, with no assembly scanning — same reason as host endpoint registration: readability and traceability beat "a few fewer lines" in a small team.

A Partial List of Consumers and Scheduled Jobs

Consumers (message-driven), for example:

| Module | Consumer | Handles |
| --- | --- | --- |
| Monitoring | ScrapeSensorDataConsumer | Triggers external scraping |
| TankManagement | ProcessTankAlertConsumer | On TankLevelCritical, sends an alert email |
| Geospatial | ScrapingCompletedConsumer | After scraping completes, triggers heatmap recompute |
| Email | SendEmailConsumer | Sends scheduled/transactional email |
| Reporting | GenerateAiDescriptionsConsumer | Calls AI to generate descriptions |

Scheduled jobs (Quartz cron), for example:

| Module | Job | Cron | Purpose |
| --- | --- | --- | --- |
| TankManagement | CheckTankLevelsJob | Every 5 min | Check levels; publish TankLevelCritical if out of bounds |
| Email | ProcessEmailSchedulesJob | Every minute | Find due schedules, dispatch SendEmail |
| Assets | CalibrationReminderJob | 8am daily | Find assets overdue for calibration, dispatch reminders |
| Monitoring | ScrapeDeviceDataJob | Every 15 min | Trigger external scraping |
| Reporting | WeeklyReportJob | 7am Monday | Generate the weekly report |

All cron jobs are configured with TimeZoneInfo.FindSystemTimeZoneById("Australia/Perth") — a scheduled job's "8am" is site-local time, not UTC. State this explicitly or the cron follows the server's timezone, and migrating machines or changing timezones causes a silent drift; because it throws no error, it's often discovered only much later.
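A sketch of pinning one trigger to site-local time, assuming Quartz.NET 3.x and its dependency-injection configurator. The job body, the job key, and the trigger name are placeholders; the job class name and the time zone id come from the post.

```csharp
using System;
using System.Threading.Tasks;
using Quartz;

// Placeholder body; the real job would find overdue assets and dispatch reminders.
public sealed class CalibrationReminderJob : IJob
{
    public Task Execute(IJobExecutionContext context) => Task.CompletedTask;
}

public static class CalibrationReminderSchedule
{
    public static void Register(IServiceCollectionQuartzConfigurator q)
    {
        var perth = TimeZoneInfo.FindSystemTimeZoneById("Australia/Perth");
        var jobKey = new JobKey("calibration-reminder");

        q.AddJob<CalibrationReminderJob>(job => job.WithIdentity(jobKey));

        q.AddTrigger(trigger => trigger
            .ForJob(jobKey)
            .WithIdentity("calibration-reminder-8am")
            // Quartz cron fields: second, minute, hour, day-of-month, month, day-of-week.
            // The hour is interpreted in the Perth zone, not the server's zone.
            .WithCronSchedule("0 0 8 * * ?", cron => cron.InTimeZone(perth)));
    }
}
```

Register would be called from inside the AddQuartz callback. Because the zone travels with the trigger, moving the Worker to a server in another time zone leaves the 8am firing time unchanged.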

Unified Constraints for External Integrations

All external dependencies (external scraping API, AI service, SMTP, object storage, OIDC provider) go through unified rules:

  • All registered as named clients via IHttpClientFactory, with a typed wrapper;
  • Credentials loaded only from environment variables, never hardcoded;
  • Each provider has an interface in its owning module's Infrastructure/ layer; the Application layer never calls HttpClient directly;
  • Failures are reported via Result<T>, so the Worker doesn't crash on non-critical failures. For example, if the AI description call fails, return an empty description without affecting the main flow.

That last point reflects a judgment: external-dependency failures should be graded. An AI description failure shouldn't drag down the entire Worker — it's "nice to have," so on failure it degrades (empty description) rather than letting an exception on a non-critical path bubble up into a process crash.
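A minimal sketch of that grading for the AI description call. The post names Result<T> but not its shape, so the record below, IAiDescriptionProvider, SafeAiDescriptionProvider, and ChartDescriptions are all hypothetical.

```csharp
#nullable enable
using System;
using System.Threading.Tasks;

// Hypothetical minimal Result<T>: success with a value, or failure with an error message.
public sealed record Result<T>(bool IsSuccess, T? Value, string? Error)
{
    public static Result<T> Ok(T value) => new(true, value, null);
    public static Result<T> Fail(string error) => new(false, default, error);
}

public interface IAiDescriptionProvider
{
    Task<Result<string>> DescribeChartAsync(int chartId);
}

// Wraps the external call so that exceptions become failed results instead of propagating.
public sealed class SafeAiDescriptionProvider : IAiDescriptionProvider
{
    private readonly Func<int, Task<string>> _callAiService;

    public SafeAiDescriptionProvider(Func<int, Task<string>> callAiService) => _callAiService = callAiService;

    public async Task<Result<string>> DescribeChartAsync(int chartId)
    {
        try
        {
            return Result<string>.Ok(await _callAiService(chartId));
        }
        catch (Exception ex)
        {
            return Result<string>.Fail(ex.Message);
        }
    }
}

public static class ChartDescriptions
{
    // Non-critical path: a failed call degrades to an empty description.
    public static async Task<string> GetOrEmptyAsync(IAiDescriptionProvider provider, int chartId)
    {
        var result = await provider.DescribeChartAsync(chartId);
        return result.IsSuccess ? result.Value ?? string.Empty : string.Empty;
    }
}
```

A critical integration would make the opposite choice at the call site: inspect the failed result and let the message fault, so it goes through the retry and dead-letter path instead of being silently degraded.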


The Transferable Layer

Setting aside the concrete RabbitMQ and MassTransit APIs, two genuinely transferable insights come out of this pattern.

The first is about structural fixes vs. optimization fixes: a publish-only Web tier isn't "making slow requests a bit faster"; it removes the variable "background duration" from "HTTP response time" entirely. When you face a recurring class of performance problems, it's worth asking first: am I tuning a problematic variable, or can I remove that variable from the problem?

The second is about cross-store consistency: any time an operation needs to write to two independent storage systems at once (database + message queue, database + cache, database + external API), there is always a crash window between them, and the only way to eliminate it is to fold the two writes into one atomic operation — the Outbox is the standard answer for the database+queue pair. Whenever you see "write A then write B," be alert: what happens if it crashes between these two steps?

Finally, honestly mark the boundary — this pattern doesn't apply everywhere:

  • No genuinely long-running background work: if every operation can complete quickly within a single HTTP request, introducing a queue and Worker is pure overhead;
  • Tiny task volume that can tolerate loss: occasionally losing an inconsequential background task means reliability machinery like Outbox and dead-lettering may be over-engineering.

The sweet spot for Web-Queue-Worker is: the system has both "synchronous requests that must be fast" and "asynchronous work that can be slow," and the reliable delivery of the latter has business meaning — alerts can't be lost, emails can't be dropped. This monitoring platform is exactly that scenario. Once reliable delivery of background work has no business meaning, the Outbox machinery degrades from "precondition for correctness" to "over-engineering" — and that's where the boundary lies.