Deployment Shouldn't Make Judgments: From "Gates" to "Exceptions," the Design Tradeoffs of a Self-Hosted Infrastructure
Author: Jack Qin
One role assignment in a CI/CD pipeline is easy to get wrong: letting the "deploy" step itself decide "should we deploy right now." It sounds natural: add a few ifs to the deploy script, check whether the tests passed and the build succeeded, and ship if so. But written this way, the deploy step takes on two responsibilities that should be separate: judging whether quality is up to par, and executing the release. Once the two are merged, "tests failed, but the deploy script's judgment logic has a bug, so it ships a broken build" becomes a real failure mode.
The right split is: deployment makes no judgment at all; it's a pure action permitted to execute only after all quality gates pass. The judgment responsibility is lifted up to an orchestrator, and deployment only "does the work after being approved." This section is the post's lead-in, but it represents a recurring line of thought across the whole infrastructure — separate "judgment" from "execution," separate "the ideal" from "the real-world exception," separate "unused day to day" from "lifesaving when it matters."
This post uses an environmental-monitoring platform as a worked example and unpacks several key tradeoffs of its self-hosted deployment system: how the topology partitions the network, how a multi-stage Dockerfile shrinks the image, why a unified Alpine base must leave an escape hatch, how CI/CD gating is orchestrated, and how backup, rollback, and alerting are designed, the pieces that are inconspicuous day to day and decisive in a crisis.
How Constraints Shape the Topology
This is a small-team self-hosted system, and the constraints directly shape the choices:
- 1–2 person team, hosting cost must be controllable, no piling on cloud-hosted services that need a dedicated operator;
- The database is an existing single PostgreSQL 15.8, self-hosted, not migrating;
- Images must be small — running on cloud servers, Alpine shrinks attack surface and size;
- Deployment must not "ship even if tests didn't pass" — there has to be a gate.
The deployment topology is split in two: the static frontend goes through a CDN, the dynamic backend runs on Docker Compose on cloud server A.
flowchart TD
Net[Internet]
Net --> CDN[CDN static hosting<br/>React SPA static files<br/>Auto-deployed after CI build]
Net --> SrvA[Cloud server A self-hosted]
SrvA --> Nginx[Nginx reverse proxy :443]
Nginx --> ApiC[Docker API container :8080]
SrvA --> Compose[Docker Compose]
Compose --> ApiC
Compose --> WorkerC[worker container]
Compose --> RMQ[rabbitmq :5672 / :15672]
Compose -.->|PostgreSQL 15 self-hosted, not in compose| PG[(PostgreSQL)]
Note that PostgreSQL is not in Compose — it's a standalone, existing self-hosted instance, and Compose manages only the three containers api / worker / rabbitmq.
The first tradeoff is plain but correct: frontend on CDN, backend in containers — each on its optimal carrier. A static SPA needn't occupy server resources; throw it on a CDN and enjoy global edge caching and automatic HTTPS. The backend has stateful sessions and background processes, so it runs in containers. The heuristic behind this: static assets and dynamic services have fundamentally different characteristics (one immutable, infinitely edge-replicable; one stateful, needs a resident process), so they shouldn't be crammed into the same carrier.
Multi-Stage Builds: Cleanly Separating "Needed at Build Time" from "Needed at Runtime"
The API image is a 4-stage build, with one core idea: the heap of tools needed at build time must never enter the final runtime image.
flowchart TD
S1[Stage 1: docs-build<br/>Node Alpine → docs static output]
S2[Stage 2: csproj<br/>dotnet SDK Alpine<br/>extract only .csproj/.sln/Directory.*props]
S3[Stage 3: build<br/>dotnet SDK Alpine<br/>restore + publish → /app/publish]
S4[Stage 4: final<br/>dotnet ASP.NET Alpine<br/>copy only publish output + docs<br/>non-root user appuser, ENTRYPOINT]
S1 --> S4
S2 --> S3 --> S4
A few details worth learning:
- Stage 2 extracts the csproj files separately to exploit Docker layer caching: as long as project dependencies don't change, the dotnet restore layer hits the cache and needn't rerun every time (restore is slow). This puts the rarely-changing part (the dependency manifest) and the often-changing part (business code) in different layers so the cache hits precisely, another instance of "separate things that change at different frequencies";
- The final image is Alpine-based (the Alpine base layer itself is only about 5 MB), which keeps the attack surface minimal;
- It runs as the non-root user appuser, a security best practice;
- tzdata is installed: this system relies heavily on timezone computation (data stored in UTC, displayed in the site-local timezone), so the IANA timezone database must be present at runtime;
- Docs static files are baked into the API image and served by the backend at /docs/ after auth.
The Escape Hatch: the Ideal Yields to Reality
There's one pragmatic exception: the Worker image's runtime stage uses a glibc-based ASP.NET base image instead of Alpine. The reason is that a certain browser-automation driver (a Node implementation) the Worker uses won't run on musl libc.
This exception deserves its own mention, because it represents a mature engineering attitude: "unified Alpine" is the ideal, "the dependency isn't musl-compatible" is reality, and on conflict, let the ideal yield to reality rather than forcing "uniformity." The cost of forcing it (fixing a third-party driver's musl compatibility yourself) far exceeds the cost of the exception (one image on glibc). Recognizing "this is where to make an exception" is itself judgment — consistency is a means, not an end; making an exception is right only when the cost of maintaining consistency exceeds its benefit. Record the exception explicitly (rather than quietly swapping the base image) so the next person knows it's intentional, not an oversight.
Dual-Network Isolation: Making Internal Components Unreachable at the Network Layer
networks:
  edge:     # external-facing (API container only)
  backend:  # internal (API + Worker + RabbitMQ)
Only the API container is on the edge network, externally reachable; the Worker and RabbitMQ are on the internal backend network only, unreachable from outside. All port mappings bind to 127.0.0.1, and external TLS is terminated by Nginx.
This is defense in depth, but the key is the layer at which the defense happens: internal components (queue, Worker) aren't "kept out by a firewall rule" but unreachable by the network topology itself. There's a fundamental difference between the two — a firewall rule is "an added gate that may be misconfigured or bypassed," while network isolation is "the road simply doesn't exist." The most reliable security boundary is making the attack surface structurally non-existent, rather than plugging it after it exists. This is the same line of thought as "an HttpOnly Cookie makes the token mechanically invisible to JS" in the auth post: eliminating the attack surface beats guarding it.
CI/CD Gating: Separating Judgment from Execution
The platform is GitHub Actions + self-hosted runners (running on cloud server B, 4 runners). Workflows are split by responsibility, with path filters triggering only the relevant ones:
| Workflow | Trigger | Responsibility |
|---|---|---|
| web-lint | apps/web/** | ESLint + type check |
| web-test | apps/web/** | Unit tests + coverage |
| web-build | apps/web/** | Type check + build + bundle-size tracking |
| web-contract | push | OpenAPI schema drift detection |
| secret-scan | push | Repo-wide secret scan |
| backend-test | apps/api/** | 4-tier .NET test suite (architecture/contract/module/host) |
| backend-build | apps/api/** | .NET build + Docker image verification |
| main-deploy | push to main | Orchestrator: gates all tests + builds |
| semgrep | push | Security scanning (SAST) |
The deploy flow is an explicit "gate" structure:
flowchart TD
Push[push to main]
Push --> WL[web-lint + web-test + web-build in parallel]
Push --> BT[backend-test + backend-build in parallel]
WL --> Gate[main-deploy gate: all of the above must pass]
BT --> Gate
Gate --> BD[backend-deploy reusable workflow]
BD --> SSH[SSH to cloud server A → docker compose pull + up -d]
Gate --> CDNDeploy[CDN auto-deploys from main]
Back to the opening: main-deploy is a pure orchestrator. It gates on test and build success, and deployment itself makes no judgment; it is only permitted to execute once all quality gates pass. "Tests failed but it shipped anyway" is ruled out by the structure of the process, because the one making the judgment is the orchestrator (whose sole job is to gate), and the one executing the deploy is a separate reusable workflow invoked by the orchestrator (which assumes everything is ready when it's called). Separate the responsibilities, and the broken-ship failure mode has nowhere to hide.
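The split between "the one who judges" and "the one who executes" can be sketched in a few lines of Python. This is only an illustration of the idea, not the project's actual GitHub Actions configuration: the gate names are borrowed from the workflow table, while `decide`, `orchestrate`, and the result strings are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Mapping

# Gate names borrowed from the workflow table; the result strings are assumed.
REQUIRED_GATES = ("web-lint", "web-test", "web-build", "backend-test", "backend-build")


@dataclass(frozen=True)
class GateDecision:
    approved: bool
    blocking: tuple[str, ...]  # gates that failed or never reported


def decide(results: Mapping[str, str]) -> GateDecision:
    """Orchestrator side: the only place where a judgment is made."""
    # A missing result blocks too, so the gate fails closed.
    blocking = tuple(g for g in REQUIRED_GATES if results.get(g) != "success")
    return GateDecision(approved=not blocking, blocking=blocking)


def orchestrate(results: Mapping[str, str], deploy: Callable[[], None]) -> GateDecision:
    """Invoke the deploy action only when every required gate succeeded."""
    decision = decide(results)
    if decision.approved:
        deploy()  # the action itself contains no conditions
    return decision
```

The `deploy` callable never inspects test results, so a bug in it cannot turn a red build into a release; the only code that can is `decide`, which is small enough to review and test on its own.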
Implementation Details: the "Unused Day to Day, Lifesaving in a Crisis" Designs
Two Kinds of Health Check
| Endpoint | Purpose | What it checks |
|---|---|---|
| GET /health/live | Liveness probe: the process is still alive | Nothing (return 200 if running) |
| GET /health/ready | Readiness probe: can take traffic | PostgreSQL + RabbitMQ |
Example /health/ready response:
{
  "status": "Healthy",
  "checks": { "database": "Healthy", "masstransit-bus": "Healthy" }
}
Distinguishing liveness from readiness matters: a process being alive doesn't mean it can do work (it might not connect to the DB). Docker Compose's healthcheck uses /health/ready to decide whether the container is marked healthy — don't admit traffic until dependencies are ready. Treating "alive" and "able to work" as two independent signals avoids routing traffic to a container that's up but still can't reach the database.
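To make the two signals concrete, here is a minimal Python sketch of both probes. It is not the platform's ASP.NET implementation: the function names are hypothetical, the 503 status for an unready service is an assumption, and the dependency checks are passed in as plain callables.

```python
from typing import Callable, Mapping


def liveness() -> tuple[int, dict]:
    """Liveness: answering at all is the proof, so nothing else is checked."""
    return 200, {"status": "Healthy"}


def readiness(checks: Mapping[str, Callable[[], bool]]) -> tuple[int, dict]:
    """Readiness: every dependency check must pass before traffic is admitted."""
    results = {}
    for name, check in checks.items():
        try:
            ok = check()
        except Exception:  # a check that raises counts as unhealthy
            ok = False
        results[name] = "Healthy" if ok else "Unhealthy"
    healthy = all(state == "Healthy" for state in results.values())
    status = "Healthy" if healthy else "Unhealthy"
    # 503 for "not ready" is an assumption of this sketch.
    return (200 if healthy else 503), {"status": status, "checks": results}
```

A process whose database check raises still answers the liveness probe with 200, but reports 503 on readiness, which is exactly the "up but cannot work yet" state the two probes are meant to tell apart.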
Configuration All via Environment Variables, Secrets Never Committed
Config is layered: appsettings.json → appsettings.{Env}.json → environment variables (nested keys use double underscores), with later layers overriding earlier ones. All secrets are injected only from environment variables and never committed to source; Docker Compose injects them via env_file plus an environment block. The repo also has a secret-scan workflow that scans the whole repo for leaked secrets. Config discipline (don't commit) and an automated check (scan) form a double safety net: discipline is occasionally violated, and the automated check is the backstop.
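The layering rule is easy to model. The Python sketch below mimics it for illustration only; it is not the .NET configuration provider, the key names in the usage example are hypothetical, and it assumes the caller passes in only the environment variables that belong to the application.

```python
from typing import Any, Mapping


def deep_merge(base: dict, override: Mapping[str, Any]) -> dict:
    """Return base with override applied; nested dicts merge key by key."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, Mapping) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def env_to_nested(env: Mapping[str, str]) -> dict:
    """Turn A__B__C=value into {"A": {"B": {"C": value}}}."""
    nested: dict = {}
    for name, value in env.items():
        node = nested
        *parents, leaf = name.split("__")
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return nested


def load_config(base: dict, per_env: dict, env: Mapping[str, str]) -> dict:
    """Later layers win: base file, then environment file, then env vars."""
    return deep_merge(deep_merge(base, per_env), env_to_nested(env))
```

With a base file containing `{"ConnectionStrings": {"Default": "Host=localhost"}}` and an environment variable `ConnectionStrings__Default`, the variable's value replaces the file's value, so the committed files can hold harmless defaults while the real secret exists only in the process environment.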
Backup and Recovery: a Backup Not Rehearsed Is No Backup
PostgreSQL backups:
- A daily cron pg_dump on cloud server A;
- Retain 7 daily + 4 weekly snapshots;
- Store to object storage separate from the application data (avoiding "data and backup perish together");
- A monthly recovery rehearsal to a staging instance;
- RTO < 4 hours, RPO < 24 hours.
Two points are worth unpacking. First, the backup must be physically separated from the application data: if the backup sits on the same disk as the data, one disk failure takes both, and the entire point of the backup (coping with data loss) is gone on the spot. Second, and most easily overlooked: dumping without a recovery rehearsal equals no backup. If a backup has never been restored, you simply don't know whether it can be; it may have been silently corrupted long ago, and you'd find out only when disaster strikes and you're scrambling to recover. A monthly recovery rehearsal turns the assumption "the backup works" into the fact "the backup is verified." This is the chronic flaw of "lifesaving in a crisis" infrastructure: it isn't exercised in normal times, so its defects never surface. You must rehearse it periodically to know it actually works.
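The "7 daily + 4 weekly" retention rule from the list above can be expressed as a small pruning function. This is a sketch under stated assumptions rather than the actual cron job: it treats a "weekly" snapshot as the dump taken on a Sunday, identifies dumps by calendar date, and the function names are hypothetical.

```python
from datetime import date, timedelta


def snapshots_to_keep(dates: list[date], daily: int = 7, weekly: int = 4) -> set[date]:
    """Keep the newest `daily` dumps plus the newest `weekly` Sunday dumps."""
    newest_first = sorted(set(dates), reverse=True)
    keep = set(newest_first[:daily])
    sundays = [d for d in newest_first if d.weekday() == 6]  # Monday is 0
    keep.update(sundays[:weekly])
    return keep


def snapshots_to_delete(dates: list[date], daily: int = 7, weekly: int = 4) -> list[date]:
    """Everything not protected by the retention rule, oldest first."""
    keep = snapshots_to_keep(dates, daily, weekly)
    return sorted(d for d in set(dates) if d not in keep)
```

Because the newest Sunday usually falls inside the daily window, the number of distinct snapshots kept is often 10 rather than 11. Computing the deletion list separately from deleting anything also makes the rule easy to dry-run before it touches object storage.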
RabbitMQ is stateless from the business-data standpoint: in-flight messages may be lost on a crash, but the MassTransit transactional Outbox has already persisted messages to PostgreSQL, so undelivered ones can be replayed from the outbox table. Queue/exchange definitions are re-declared by MassTransit at startup, requiring no manual backup.
Rollback Graded by Impact Scope; Destructive Migrations Keep a Manual Gate
| Scope | Action | Data loss? |
|---|---|---|
| Single container | docker compose up -d --no-deps api with the previous image SHA | None |
| App rollback | Change the image tag back to the previous git SHA, up -d | None |
| Migration rollback | Restore the PostgreSQL backup; EF Core migrations default to forward-only | Possible — destructive migrations need pre-review |
One hard rule: destructive migrations (DROP COLUMN, table rename) require a manual approval in CI before merge.
This rule seems to contradict the earlier "deployment shouldn't judge, it should be fully automated," but they're two faces of the same judgment framework: the boundary of automation is drawn between "reversible" and "irreversible." Code deployment is reversible (roll back by swapping images, zero data loss), so it's fully automated; a destructive migration is irreversible (rollback relies on restoring a backup and may lose data), so it's the only place in the automated pipeline that should keep a manual checkpoint. Not all manual gates are inefficient — the one before an irreversible operation trades a little manual delay for "irreversible errors don't happen automatically."
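One way to enforce the reversible/irreversible boundary mechanically is to scan a migration's SQL and require approval only when a destructive statement appears. The Python sketch below is an assumption about how such a check could look, not the project's CI step: it works on plain SQL text, the pattern list covers only the two cases named above, and the function names are hypothetical.

```python
import re

# Illustrative patterns for the two cases named in the rule; not exhaustive.
DESTRUCTIVE_PATTERNS = {
    "drop column": re.compile(r"\bDROP\s+COLUMN\b", re.IGNORECASE),
    "rename": re.compile(r"\bRENAME\s+(?:TO|COLUMN)\b", re.IGNORECASE),
}


def destructive_statements(sql: str) -> list[str]:
    """Return the labels of destructive patterns found in a migration script."""
    return [label for label, pattern in DESTRUCTIVE_PATTERNS.items() if pattern.search(sql)]


def needs_manual_approval(sql: str) -> bool:
    """Reversible changes flow through; irreversible ones stop for a human."""
    return bool(destructive_statements(sql))
```

A text scan like this can produce false positives (for example a pattern inside an SQL comment), but it errs toward asking a human, which is the right direction for an irreversible operation.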
Alert Thresholds
| Signal | Threshold | Action |
|---|---|---|
| /health/ready failing | 3 consecutive times | Page on-call; note that restart: unless-stopped restarts the container only when its process exits, not when it is merely unhealthy |
| RabbitMQ _error queue depth | > 10 | Page on-call, investigate consumer failures |
| Disk usage | > 80% | Alert, archive old reports |
| PostgreSQL active connections | > 80 | Alert, investigate connection leaks |
| Worker last heartbeat | > 10 min without update | Alert, the container may have silently crashed |
That last one specifically targets the Worker's silent crash. It exposes a monitoring blind spot: an HTTP service crashing is noticed immediately (requests start failing); a background process crashing isn't — it doesn't serve requests, so there's no "request failed" signal. A background Worker's death is silent. So you give it a proactive "I'm still alive" heartbeat, and use "the heartbeat stopped" to detect a fault that would otherwise make no noise at all. A component with no natural failure signal must have a heartbeat signal manufactured for monitoring.
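Two of the thresholds in the table reduce to very small pieces of logic: "no heartbeat for more than 10 minutes" and "3 consecutive failures". The Python sketch below shows both under stated assumptions; where the heartbeat timestamp is stored and how the alert is delivered are outside its scope, and the names are hypothetical.

```python
from datetime import datetime, timedelta

HEARTBEAT_LIMIT = timedelta(minutes=10)  # threshold from the table above


def heartbeat_is_stale(last_beat: datetime, now: datetime,
                       limit: timedelta = HEARTBEAT_LIMIT) -> bool:
    """True when the worker has not reported within the allowed window."""
    return now - last_beat > limit


class ConsecutiveFailures:
    """Fire only after `threshold` failures in a row; any success resets."""

    def __init__(self, threshold: int = 3) -> None:
        self.threshold = threshold
        self.count = 0

    def record(self, ok: bool) -> bool:
        """Record one probe result and report whether the alert should fire."""
        self.count = 0 if ok else self.count + 1
        return self.count >= self.threshold
```

The heartbeat check has to run in a process other than the Worker itself, since a crashed Worker cannot report its own silence. The consecutive-failure counter keeps a single slow probe from paging anyone.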
Where It Applies: the Sweet Spot and Ceiling of Self-Hosted Single-Machine
What it buys is clear: each carrier in its proper place, lean and secure images, gated deployment, and lifesaving infrastructure in place. But this self-hosted + CDN + Compose setup has a clear ceiling:
- You need high availability / multiple replicas: currently it's a single server, single instance; HA requires introducing orchestration (Kubernetes) and replicas, with complexity jumping an order of magnitude;
- Traffic outgrows a single machine: the self-hosted single-machine ceiling is clear, and beyond it you need a managed database + horizontal scaling;
- Compliance requires geo-redundant active-active: a single cloud server's disaster recovery can only rely on backup restore (RTO < 4h), and strong-compliance scenarios need genuine multi-region redundancy.
The sweet spot is: a small team, cost-sensitive, controllable traffic, able to accept an RTO of a few hours. Just right for this monitoring platform — no shouldering K8s operational debt upfront for "scale that might come someday."
The transferable layer: nearly every decision in this post lands on some axis of the same judgment framework — separate judgment from execution (the gate), allow exceptions of ideal vs. reality (Alpine/glibc), let reversible vs. irreversible decide the automation boundary (manual gate for destructive migrations), let "has a natural signal vs. none" decide the monitoring approach (Worker heartbeat), let assumption vs. verification decide whether lifesaving infrastructure can be trusted (backup rehearsal). A great many infrastructure questions of "should this be automated, should this be unified, should this be trusted" become clear once dropped onto these axes.