How to read this sheet
Every entry is one of two archetypes. Both end in an Anti-pattern, so you always know what "wrong" looks like.
Decision — a fork in the road
Context → Options → Trade-offs → Default → Signals you chose wrong → Anti-pattern
Pattern — a named solution you implement
Problem → Mechanism → When it earns its cost → Anti-pattern
01 Should You? The Foundational Call
Microservices are an organizational technology before they are a technical one. The first three cards decide whether you should be on this page at all.
The modular monolith is the honest middle
A modular monolith enforces the same bounded contexts as microservices — via project references, module visibility, and dependency rules — but keeps a single process and a single deploy. You get testability and clear ownership without correlation IDs, sagas, or a mesh. This is the same dependency discipline described in Clean Architecture; microservices simply promote a module boundary into a network boundary — and every network boundary you add is a distributed-systems tax you now pay forever.
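The "dependency rules" idea can be made mechanical. The following is a minimal sketch, assuming a hand-maintained map of allowed dependencies and an `actual_imports` map produced by some other tool; the module names and the `find_violations` helper are hypothetical, not part of any framework. In a real codebase the same rule is usually enforced through project references or an architecture-test library.

```python
# Hypothetical module names; each module lists the modules it may depend on.
ALLOWED_DEPENDENCIES = {
    "orders": {"shared_kernel"},
    "billing": {"shared_kernel", "orders_contracts"},
    "orders_contracts": set(),
    "shared_kernel": set(),
}


def find_violations(actual_imports):
    """Return sorted (module, dependency) pairs that break the declared rules."""
    violations = []
    for module, deps in actual_imports.items():
        allowed = ALLOWED_DEPENDENCIES.get(module, set())
        for dep in deps:
            # A module may always reference itself; anything else must be declared.
            if dep != module and dep not in allowed:
                violations.append((module, dep))
    return sorted(violations)
```

Run as a build-time check, a non-empty result fails the build, which keeps module boundaries honest without introducing a network hop.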
The table below is the honest counterweight to every "microservices scale better" pitch. Before you split, price the bill you're signing up for: it is paid per service, forever.
| Cost line | What the monolith gave you free | What you now build/operate |
|---|---|---|
| Calls | In-process method call, ~ns, can't fail | Network hop: latency budget, timeouts, retries, partial failure |
| Data | ACID transaction across all tables | Eventual consistency, sagas, outbox, no cross-service JOIN (§8) |
| Debugging | One stack trace | Distributed tracing across N hops (§13) |
| Deploy | One artifact | N pipelines, versioned contracts, canary infra (see DevOps Pipelines) |
| On-call | One thing that's up or down | N services × dependency graph = combinatorial failure modes |
Rule of thumb: if you can't articulate which specific line above you're buying relief from, you're paying the tax for nothing.
02 Getting There: The Strangler Fig
"Should I?" and "how do I start?" are the same reader's back-to-back questions. You almost never greenfield microservices — you extract them from a monolith incrementally.
Sequence
1. Seam it. Find a bounded context with few inbound dependencies. Wrap its data access behind an interface inside the monolith first.
2. Front it. Add the gateway/façade so routing is a config change, not a code change.
3. Split the data. The service gets its own store; migrate/replicate data, then cut writes over. This is the hard part, not the code.
4. Cut over & delete. Route 100%, monitor, then delete the monolith's copy. An extraction that never deletes the old path is just a distributed monolith in progress.
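The "routing is a config change" step can be illustrated with a small façade sketch. Everything here is an assumption for illustration: the `Route` shape, the `orders-service` upstream name, and the percentage-based rollout are hypothetical, and a real gateway would express the same idea in its own configuration format.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    prefix: str           # path prefix owned by the extracted context
    new_service: str      # hypothetical upstream name for the extracted service
    rollout_percent: int  # 0 = all traffic to the monolith, 100 = fully cut over


MONOLITH = "monolith"

# Routing lives in data, so moving traffic is a config change rather than a code change.
ROUTES = [Route(prefix="/orders", new_service="orders-service", rollout_percent=25)]


def bucket(key: str) -> int:
    """Map a stable key (for example a user ID) to a bucket in [0, 100)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def pick_upstream(path: str, user_id: str, routes=ROUTES) -> str:
    """Send a stable share of users to the new service; everyone else stays on the monolith."""
    for route in routes:
        if path.startswith(route.prefix):
            if bucket(user_id) < route.rollout_percent:
                return route.new_service
            return MONOLITH
    return MONOLITH
```

Hashing the user ID keeps each user on the same side of the split between requests, and raising `rollout_percent` to 100 is the "route 100%" moment that precedes deleting the old path.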
03 Service Boundaries
The single most important — and most often botched — decision. Draw boundaries around business capabilities, not database tables.
Where the seams actually are
- Ubiquitous language shifts. When the same word means different things to different people, you've found a context boundary.
- Aggregate roots are the unit of consistency; a transaction should never span two aggregates in different services. See the aggregate/entity modelling in Clean Architecture.
- Rate of change. Things that change together belong together; things that change for different reasons and cadences want to be apart.
04 Communication: Sync vs Async
Decide this per interaction, not once per system. The wrong default here is what multiplies your outages.
Why "just call the other service" is a trap. Availability in a synchronous chain multiplies: if each hop is independently up 99.9% of the time, four of them in series is:
0.999^4 ≈ 0.9960 → 99.6%
| Topology | Combined uptime | Downtime / month (30d) |
|---|---|---|
| 1 service @ 99.9% | 99.9% | ~43 min |
| 4 in a sync chain | 99.6% | ~173 min (≈ 2.9 hrs) |
| 10 in a sync chain | 99.0% | ~430 min (≈ 7.2 hrs) |
Fix: break the chain with async messaging (the caller's uptime stops depending on the callee's), cache reads, or collapse the chain by fixing the boundary. Every sync hop you remove from the critical path is downtime you delete.
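The availability arithmetic in this section is easy to reproduce. This sketch assumes independent failures per hop and a 30-day month, as the text does; the function names are illustrative.

```python
def chain_availability(per_hop: float, hops: int) -> float:
    """Availability of `hops` services called in series, assuming independent failures."""
    return per_hop ** hops


def downtime_minutes(availability: float, days: int = 30) -> float:
    """Expected downtime in minutes over a period of `days` days."""
    return (1.0 - availability) * days * 24 * 60


if __name__ == "__main__":
    for hops in (1, 4, 10):
        a = chain_availability(0.999, hops)
        print(f"{hops:>2} hop(s): {a:.4%} up, ~{downtime_minutes(a):.0f} min down per 30 days")
```

Plugging in your own per-hop figure and chain length is a quick way to check whether a proposed synchronous call path fits the uptime you have promised.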
05 The Edge: Gateway, Discovery & Mesh
Cross-cutting concerns — auth, routing, retries, mTLS — belong at the edge and in the platform, not copy-pasted into every service.
06 Event-Driven Core Patterns
Three "event" things people constantly conflate, plus how work coordinates across services.
All three are called "events." They are different tools with different coupling and storage implications.
| Pattern | What the event carries | Consumer does | Use when |
|---|---|---|---|
| Event Notification | "OrderPlaced #123" — a bare fact + ID | Calls back to fetch details if needed | Low coupling; consumers rarely need the full payload |
| Event-Carried State Transfer | The full order snapshot | Keeps a local read replica; no callback | Consumers need the data and you want to kill sync read chatter |
| Event Sourcing | Every state change, as the system of record | Rebuilds state by replaying the log | You need a full audit log / temporal queries — rarely; high complexity |
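The three rows differ mainly in what the event carries and what the consumer does with it. The sketch below is illustrative only: the class names, fields, and the `replay_total` fold are hypothetical examples built around the "OrderPlaced" event used in the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OrderPlacedNotification:
    """Event notification: a bare fact plus an ID; consumers call back for details."""
    order_id: int


@dataclass(frozen=True)
class OrderPlacedSnapshot:
    """Event-carried state transfer: the full snapshot travels with the event."""
    order_id: int
    customer_id: int
    total_cents: int


# Event sourcing: the ordered list of changes is itself the system of record.
@dataclass(frozen=True)
class ItemAdded:
    price_cents: int


@dataclass(frozen=True)
class ItemRemoved:
    price_cents: int


def replay_total(events) -> int:
    """Rebuild current state (here, an order total) by folding over the log."""
    total = 0
    for event in events:
        if isinstance(event, ItemAdded):
            total += event.price_cents
        elif isinstance(event, ItemRemoved):
            total -= event.price_cents
    return total


# A state-transfer consumer keeps a local read copy and never calls back.
local_orders: dict[int, OrderPlacedSnapshot] = {}


def on_snapshot(event: OrderPlacedSnapshot) -> None:
    local_orders[event.order_id] = event
```

Note that only the event-sourced model needs the full history to answer "what is the state now?"; the other two are integration messages layered on ordinary storage.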
07 Messaging Infrastructure
Log, queue, and push are three different jobs. Pick by delivery/ordering/replay requirements, not by popularity.
| Tool | Model | Superpower | Reach for it when | Not for |
|---|---|---|---|---|
| Kafka (log) | Append-only partitioned log; consumers track offsets | Replay + high throughput + ordered per partition | Event streaming, event sourcing, replay, analytics, >100k msg/s | Simple task queues; you'll drown in operational complexity |
| Azure Service Bus / SQS+SNS (queue/broker) | Broker-managed queues & topics; broker tracks acks | Rich queue semantics: DLQ, dedup, sessions, delayed delivery | Commands, work queues, sagas, most business messaging | Replaying months of history; ultra-high streaming volume |
| SignalR / WebSockets (push) | Server→client real-time push over persistent connections | Low-latency push to end users' browsers/apps | Live dashboards, notifications, chat, presence | Service-to-service integration; durability/replay |
08 Data Patterns
This is where microservices are won or lost. See also the Databases cheatsheet for storage-engine choices.
You don't JOIN — you replicate or compose
- API composition: the caller queries each service and joins in memory (fine for small result sets).
- Read replica via events: event-carried state transfer (§6) builds a local denormalized copy of what you need — this is CQRS on the read side.
- Reporting/analytics: stream all services' events into a warehouse/lake; do cross-domain JOINs there, never in the operational path.
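API composition, the first option above, amounts to an in-memory join in the caller. In this sketch the two `fetch_*` functions are stand-ins for network calls to hypothetical orders and customers services, and the data is made up.

```python
def fetch_orders():
    """Stand-in for a call to a hypothetical orders service."""
    return [
        {"order_id": 1, "customer_id": 10, "total_cents": 4999},
        {"order_id": 2, "customer_id": 11, "total_cents": 1500},
    ]


def fetch_customers(customer_ids):
    """Stand-in for one batched call to a hypothetical customers service."""
    directory = {10: "Ada", 11: "Grace"}
    return {cid: {"customer_id": cid, "name": directory[cid]}
            for cid in customer_ids if cid in directory}


def orders_with_customer_names():
    orders = fetch_orders()
    # One batched lookup instead of one call per order (avoids an N+1 fan-out).
    customers = fetch_customers({o["customer_id"] for o in orders})
    return [
        {**o, "customer_name": customers.get(o["customer_id"], {}).get("name")}
        for o in orders
    ]
```

This stays reasonable while result sets are small; once the join needs pagination or sorting across services, a replicated read model is usually the better fit.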
Outbox pattern: write the event to an outbox table in the same DB transaction as the business change. A separate relay (CDC or a poller) reads the outbox and publishes to the broker, marking rows sent.
-- one transaction, two writes, zero dual-write risk
BEGIN;
UPDATE orders SET status='PLACED' WHERE id = 123;
INSERT INTO outbox (id, aggregate_id, type, payload, created_at, sent_at)
VALUES (gen_random_uuid(), 123, 'OrderPlaced',
'{"orderId":123,"total":4999}', now(), NULL);
COMMIT;
-- relay: SELECT * FROM outbox WHERE sent_at IS NULL ORDER BY created_at;
-- publish → mark sent_at = now() (retries are safe: consumer dedups)
09 Consistency & Idempotency
Distributed systems trade instant consistency for availability. Design for it explicitly — and communicate it to non-engineers.
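One common way to design for it is an idempotent consumer: delivery is at-least-once, and the handler remembers which message IDs it has already applied. The sketch below is a simplification with hypothetical names; it keeps the processed IDs in memory, whereas a real consumer would persist them in the same transaction as the side effect.

```python
class PaymentConsumer:
    """Applies each message ID at most once, even if the broker redelivers it."""

    def __init__(self) -> None:
        # In-memory for illustration only; a real service needs a durable store.
        self.processed_ids: set[str] = set()
        self.charges: list[tuple[str, int]] = []

    def handle(self, message_id: str, account: str, amount_cents: int) -> bool:
        """Return True if the message was applied, False if it was a duplicate."""
        if message_id in self.processed_ids:
            return False  # duplicate delivery: acknowledge and do nothing
        self.charges.append((account, amount_cents))
        self.processed_ids.add(message_id)
        return True
```

The message ID plays the role of the idempotency key, so a retried delivery produces one charge rather than two.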
10 Failure Handling
In a monolith a dependency call can't half-fail. Across the network, partial failure is the normal case — design for it.
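A typical first defence against partial failure is a bounded retry with exponential backoff and jitter. This is a minimal sketch, not a full resilience policy: the function names and default values are assumptions, it retries only `ConnectionError` and `TimeoutError`, and it omits the circuit breaker and bulkhead that usually accompany it.

```python
import random
import time


def backoff_delays(count: int, base: float = 0.1, cap: float = 5.0, rng=random.random):
    """Yield `count` 'full jitter' delays: rng() scaled by min(cap, base * 2**n)."""
    for n in range(count):
        yield rng() * min(cap, base * (2 ** n))


def call_with_retry(operation, attempts: int = 5, sleep=time.sleep, **backoff):
    """Call `operation`, making at most `attempts` tries in total."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    delays = list(backoff_delays(attempts - 1, **backoff))
    for attempt in range(attempts):
        try:
            return operation()
        except (ConnectionError, TimeoutError):
            if attempt == attempts - 1:
                raise  # budget exhausted: surface the failure to the caller
            sleep(delays[attempt])
```

The random factor spreads retries from many callers over time, and the cap on attempts keeps a struggling dependency from being hammered indefinitely.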
11 Contracts & Versioning
Independent deployability is a promise you can only keep if changing your API/events can't silently break a consumer.
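On the consumer side, the usual way to keep that promise is a tolerant reader: parse only the fields you need, ignore unknown ones, and give later-added fields a default. The sketch below assumes the `orderId`/`total` payload shape used elsewhere on this sheet; the `currency` field and its default are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OrderPlaced:
    order_id: int
    total_cents: int
    currency: str = "USD"  # hypothetical later addition; the default keeps old producers valid


def parse_order_placed(payload: dict) -> OrderPlaced:
    """Read only the fields this consumer needs and ignore everything else."""
    return OrderPlaced(
        order_id=int(payload["orderId"]),
        total_cents=int(payload["total"]),
        currency=str(payload.get("currency", "USD")),
    )
```

With this shape, a producer can add fields without coordinating a release with every consumer; removing or renaming `orderId` or `total` is still a breaking change.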
12 Service-to-Service Security
The network between services is not trusted. "Inside the firewall" is not an authorization model.
- Secrets: per-service secrets from a vault (Key Vault / Secrets Manager), rotated; never baked into images or shared across services.
- Confused deputy: propagate the user's token so downstream services enforce the user's permissions — don't let a service act with its own god-mode identity on a user's behalf.
- See the broader posture in Linux Server Hardening.
13 Observability
You cannot attach a debugger to a distributed system. Observability is how you replace the single stack trace you gave up.
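The simplest building block is a correlation ID that enters at the edge and rides along on every downstream call and log line. The sketch below shows only that idea, using the standard library; the `X-Correlation-ID` header name is a common convention assumed here, and a production system would more likely rely on a tracing library than on hand-rolled propagation.

```python
import contextvars
import logging
import uuid

# Holds the ID for the request currently being handled in this context.
correlation_id = contextvars.ContextVar("correlation_id", default="-")
HEADER = "X-Correlation-ID"  # assumed header name, a convention rather than a standard


class CorrelationFilter(logging.Filter):
    """Stamp every log record with the current correlation ID."""

    def filter(self, record):
        record.correlation_id = correlation_id.get()
        return True


def begin_request(headers: dict) -> str:
    """Reuse the caller's ID if present, otherwise start a new one at the edge."""
    cid = headers.get(HEADER) or uuid.uuid4().hex
    correlation_id.set(cid)
    return cid


def outbound_headers() -> dict:
    """Headers to attach to every downstream call so the ID crosses each hop."""
    return {HEADER: correlation_id.get()}
```

Attaching `CorrelationFilter` to a handler and adding `%(correlation_id)s` to its format string lets logs from different services be joined on one value.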
14 Anti-Pattern Catalog
The canonical, full treatment of every anti-pattern flagged above — this is the single source of truth; the inline mentions are teasers that link here. Each row: the smell that precedes it → why it happens → the fix.
| Anti-pattern | Smell (leading indicator) | Why it happens | Fix |
|---|---|---|---|
| Premature / résumé-driven microservices | Splitting before there's more than one deploy cadence; installing a mesh for 3 services. | "Microservices are modern"; tech-résumé incentives. | Start with a modular monolith (§1); split only against a concrete force. |
| Ignoring Conway's Law | Cross-team approvals on every release; a boundary that two teams both own. | Architecture drawn without regard to team topology. | One service = one stream-aligned team; use the Inverse Conway Maneuver (§1). |
| Big-bang rewrite | A months-long "v2" branch; feature freeze on the monolith. | Belief that a clean rewrite is faster than incremental extraction. | Strangler fig — extract one context at a time behind a façade (§2). |
| Nano-services (one per table) | Every use case fans out to 3+ services; services that can't do anything alone. | Boundaries drawn by data schema, not business capability. | Boundary by bounded context / aggregate (§3). |
| Distributed monolith | Services deploy in lockstep; a shared "Common"/"Entities" package. | A split that still shares a DB, model, or release. | Kill shared DB & model libs; enforce "deployable alone" (§3). |
| Synchronous call chains | One user request blocks on A→B→C→D; thread pools exhausting. | Reaching for a sync call because it's the obvious tool. | Async messaging, caching, or fix the boundary — do the availability math (§4). |
| Fat gateway | Every team must edit the gateway to ship; business rules in routing config. | The gateway is the easy place to "just add this." | Keep edge concerns only; push logic into services (§5). |
| Event sourcing everywhere | "We're event-sourced" as a decree; CRUD entities modelled as event streams. | Conflating event sourcing with pub/sub; conference-driven design. | Event sourcing is per-aggregate persistence, used rarely; integrate with notification/state-transfer (§6). |
| Choreography sprawl | No one can answer "what happens after OrderPlaced?"; logic spread across 8 handlers. | Choreography chosen for decoupling, past its complexity limit. | Orchestrate workflows with 3+ steps; make the saga explicit (§6). |
| Infra-by-popularity | Kafka adopted, but what you use is a DLQ and delayed retry. | Choosing the trendy tool over the requirement. | Pick by delivery/ordering/replay needs (§7). |
| Shared database | Two services querying the same tables; migrations need a change freeze. | "Just this one JOIN" convenience. | Database-per-service; compose or replicate via events (§8). |
| Dual write | "Save to DB, then publish event" in two steps; occasional missing events. | No atomic transaction spans DB + broker. | Outbox pattern — event and state in one transaction (§8). |
| 2PC across services | Distributed locks held across network calls; throughput collapses under load. | Wanting ACID semantics across service boundaries. | Saga with compensating transactions (§8). |
| CQRS by default | Separate read/write models + event store on a simple CRUD form. | Cargo-culting a pattern beyond its niche. | One model first; CQRS only where read/write genuinely diverge (§8). |
| Assuming strong consistency | UX that shows stale data as if live; "why isn't it updated yet?" bug reports. | Carrying monolith mental models into a distributed system. | Design for eventual consistency; state the lag as a contract (§9). |
| Believing in exactly-once delivery | No dedup logic; duplicate charges/emails under retry. | Transport marketing ("exactly-once") taken literally. | At-least-once + idempotency keys = exactly-once processing (§9). |
| Retry storm | A recovering service gets knocked back down; synchronized retry spikes. | Fixed-interval, uncapped retries with no jitter or breaker. | Exponential backoff + jitter + circuit breaker + bulkhead (§10). |
| No DLQ / silent failure | A poison message loops forever; failures vanish with no trace. | Happy-path-only consumer design. | Dead-letter after N tries; alert on DLQ depth > 0 (§10). |
| Giant shared E2E gate | Nobody can deploy until one flaky suite with everything running goes green. | Testing a distributed system like a monolith. | Consumer-driven contract tests as the gate; thin E2E (§11). |
| Breaking schema change | A field renamed/removed; lagging consumers break on deploy. | Treating events/APIs as internal, mutable structures. | Additive-only + tolerant reader + versioned parallel run (§11). |
| Trusted internal network | No service-to-service auth; flat, any-to-any connectivity. | "It's behind the firewall" as an authz model. | Zero-trust: mTLS + propagated user tokens + network policy (§12). |
| Grep-the-logs debugging | Incidents diagnosed by hand-correlating five log files. | Tracing treated as optional infra. | Correlation IDs + OpenTelemetry distributed tracing from ~3 services on (§13). |
If you read only one section, read this one — it's the most-linked page for a reason.