How to read this sheet

Every entry is one of two archetypes. Both end in an Anti-pattern, so you always know what "wrong" looks like.

Decision — a fork in the road

Context · Options · Trade-offs · Default · Signals you chose wrong · Anti-pattern

Pattern — a named solution you implement

Problem · Mechanism · When it earns its cost · Anti-pattern

01 Should You? The Foundational Call

Microservices are an organizational technology before they are a technical one. The first three cards decide whether you should be on this page at all.

Monolith → Modular Monolith → Microservices Decision
Context You have a system and a team, and someone said "microservices." What granularity actually fits?
Options Monolith (one deployable) · Modular monolith (one deployable, enforced internal module boundaries) · Microservices (many independently deployable services).
Trade-offs Microservices buy independent deployability and fault isolation and pay for them in network calls, distributed data, and operational surface. A modular monolith gives you ~80% of the boundary discipline at ~10% of the operational cost.
Default Start with a modular monolith. Split a module into a service only when a concrete force demands it: a team needs an independent deploy cadence, a component needs to scale or fail independently, or a bounded context has genuinely diverged. Threshold heuristics: more than ~2 teams stepping on one deploy pipeline, or a single team above the ~8-person "two-pizza" line.
Signals you chose wrong You split too early if every feature touches 3+ services and no team can ship alone. You split too late if merge queues and release trains are the bottleneck.
Anti-pattern Adopting microservices for résumé/"modern" reasons before the org has more than one deploy cadence. → catalog
The modular monolith is the honest middle

A modular monolith enforces the same bounded contexts as microservices — via project references, module visibility, and dependency rules — but keeps a single process and a single deploy. You get testability and clear ownership without correlation IDs, sagas, or a mesh. This is the same dependency discipline described in Clean Architecture; microservices simply promote a module boundary into a network boundary — and every network boundary you add is a distributed-systems tax you now pay forever.

Conway's Law as a Forcing Function Decision
Context Your architecture will mirror your communication structure whether or not you designed it to.
Options Let the org chart shape the architecture accidentally, or apply the Inverse Conway Maneuver — design the team topology you want the architecture to reflect, then let the services follow.
Trade-offs Service boundaries that cut across team boundaries generate constant cross-team coordination — the exact cost microservices were supposed to remove.
Default One service (or a small cluster of services) owned end-to-end by one stream-aligned team. If two teams must coordinate to ship one service, the boundary is wrong.
Signals you chose wrong Cross-team PR approvals on every release; a "shared services" team that becomes everyone's bottleneck.
Anti-pattern Drawing service lines on a whiteboard while ignoring who owns what. → catalog
The Microservices Tax Reality Check

The honest counterweight to every "microservices scale better" pitch. Before you split, price the bill you're signing up for — it is paid per service, forever.

Cost line | What the monolith gave you free | What you now build/operate
Calls | In-process method call, ~ns, can't fail | Network hop: latency budget, timeouts, retries, partial failure
Data | ACID transaction across all tables | Eventual consistency, sagas, outbox, no cross-service JOIN (§8)
Debugging | One stack trace | Distributed tracing across N hops (§13)
Deploy | One artifact | N pipelines, versioned contracts, canary infra (see DevOps Pipelines)
On-call | One thing that's up or down | N services × dependency graph = combinatorial failure modes

Rule of thumb: if you can't articulate which specific line above you're buying relief from, you're paying the tax for nothing.

02 Getting There: The Strangler Fig

"Should I?" and "how do I start?" are the same reader's back-to-back questions. You almost never greenfield microservices — you extract them from a monolith incrementally.

Strangler Fig Migration Pattern
Problem A big-bang rewrite of a monolith into services fails: it freezes feature work for months and lands untested.
Mechanism Put a routing façade (a gateway or reverse proxy) in front of the monolith. Extract one bounded context at a time into a new service, route just that path to it, and delete the old code once traffic is fully cut over. The monolith shrinks as the "fig" grows around it.
When it earns its cost Any migration of a system that must keep shipping. Extract by seam: start with a low-risk, read-mostly context (e.g. notifications, reporting) to build the deploy/observability muscle before touching the order pipeline.
Anti-pattern The big-bang rewrite; or extracting a service that still reaches back into the monolith's database. → catalog
Sequence
  1. Seam it. Find a bounded context with few inbound dependencies. Wrap its data access behind an interface inside the monolith first.
  2. Front it. Add the gateway/façade so routing is a config change, not a code change.
  3. Split the data. The service gets its own store; migrate/replicate data, then cut writes over. This is the hard part, not the code.
  4. Cut over & delete. Route 100%, monitor, then delete the monolith's copy. An extraction that never deletes the old path is just a distributed monolith in progress.
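The "front it" and "cut over" steps can be sketched as a routing table in the façade. This is a minimal illustration, not a real gateway config: the `Route` type, the upstream names, and the percentage-based canary are all assumptions made for the example.

```python
from dataclasses import dataclass

MONOLITH = "monolith"  # hypothetical name of the default upstream


@dataclass(frozen=True)
class Route:
    prefix: str         # URL path prefix owned by the extracted context
    upstream: str       # hypothetical name of the new service
    percent: int = 100  # share of matching traffic sent to the new service


def pick_upstream(path: str, routes: list[Route], bucket: int) -> str:
    """Return the upstream for a request path.

    `bucket` is a stable number in 0..99 derived from the caller (for example
    a hash of the user ID), so one user stays on one side during cut-over.
    """
    # Longest prefix wins, so a sub-path can move before its parent does.
    for route in sorted(routes, key=lambda r: len(r.prefix), reverse=True):
        if path.startswith(route.prefix):
            return route.upstream if bucket < route.percent else MONOLITH
    return MONOLITH  # anything not extracted yet still goes to the monolith


routes = [
    Route("/notifications", "notifications-svc", percent=100),  # fully cut over
    Route("/reports", "reporting-svc", percent=10),             # still a canary
]
```

Raising `percent` to 100 and then deleting the monolith's handler is the "cut over & delete" step; the routing change itself stays a data change.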
Anti-Corruption Layer Pattern
Problem A new service must talk to the legacy monolith without inheriting its data model and vocabulary.
Mechanism A translation layer (the ACL) at the service edge maps the legacy model to the new domain model in both directions, so the legacy schema never leaks inward.
When it earns its cost Whenever you integrate with a legacy system or third-party API whose model you don't control. It's the same dependency-inversion instinct as Clean Architecture's ports & adapters, applied at the service boundary.
Anti-pattern Letting the legacy DTOs become your new service's domain model — you've moved the coupling, not removed it.
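A minimal sketch of an anti-corruption layer, assuming a legacy customer record with cryptic column names. Every field name here (`CUST_NO`, `CR_LIM_CENTS`, `STATUS_CD`, and so on) is hypothetical; the point is only that translation happens in one place, in both directions.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Customer:
    """The new service's own domain model; it knows nothing about the legacy schema."""
    customer_id: str
    full_name: str
    credit_limit: Decimal
    active: bool


def from_legacy(row: dict) -> Customer:
    """Translate a legacy record (hypothetical column names) into the domain model."""
    return Customer(
        customer_id=str(row["CUST_NO"]),
        full_name=f'{row["FNAME"].strip()} {row["LNAME"].strip()}',
        credit_limit=Decimal(row["CR_LIM_CENTS"]) / 100,
        active=row["STATUS_CD"] == "A",
    )


def to_legacy(customer: Customer) -> dict:
    """Translate back when the new service has to call into the legacy system."""
    first, _, last = customer.full_name.partition(" ")
    return {
        "CUST_NO": int(customer.customer_id),
        "FNAME": first,
        "LNAME": last,
        "CR_LIM_CENTS": int(customer.credit_limit * 100),
        "STATUS_CD": "A" if customer.active else "I",
    }
```

Nothing outside these two functions should ever see a legacy key; if a legacy column is renamed, only the ACL changes.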

03 Service Boundaries

The single most important — and most often botched — decision. Draw boundaries around business capabilities, not database tables.

DDD Bounded Contexts Decision
Context You need a rule for where one service ends and the next begins.
Options Boundary by bounded context (a capability with its own model & language) vs by technical layer (an "auth service", a "database service") vs by entity/table (one service per noun).
Trade-offs Context boundaries minimize the chatter that crosses the network because a business operation completes mostly inside one service. Table/layer boundaries maximize it — every use case fans out.
Default One service per bounded context, owning its data and its model. A "Customer" means different things to Billing and to Support — that's two contexts, not one shared Customer service.
Signals you chose wrong Most features require synchronized changes to 3+ services; you keep adding fields to a shared model to satisfy one consumer.
Anti-pattern "One service per database table" — nano-services that can't do anything alone. → catalog
Where the seams actually are
  • Ubiquitous language shifts. When the same word means different things to different people, you've found a context boundary.
  • Aggregate roots are the unit of consistency; a transaction should never span two aggregates in different services. See the aggregate/entity modelling in Clean Architecture.
  • Rate of change. Things that change together belong together; things that change for different reasons and cadences want to be apart.
The Distributed Monolith Anti-Decision
Context The worst of both worlds: microservices' operational cost with the monolith's coupling. It's the default failure mode of a bad split.
Tells Services that (a) must be deployed together in lockstep, (b) share a database or schema, or (c) share a domain model / DTO library. Any one of these means it's one service wearing a costume.
Test Can this service be deployed, on its own, without coordinating a release with any other? If no, it is not a microservice — merge it back or fix the boundary.
Anti-pattern A shared "Common" or "Entities" NuGet/npm package that every service depends on — it recreates lockstep deploys through the back door. → catalog

04 Communication: Sync vs Async

Decide this per interaction, not once per system. The wrong default here is what multiplies your outages.

Sync (REST/gRPC) vs Async (Messaging) Decision
Context Service A needs something from Service B. Blocking call or message?
Options Sync request-response (REST, gRPC) · Async fire-and-forget (event/command on a broker) · Async request-reply (send command, correlate a reply message).
Trade-offs Sync is simple and immediate but temporally couples A to B's availability. Async decouples availability and smooths load, at the cost of eventual consistency and harder debugging.
Default Sync for queries the caller must have answered now (read a price); async for commands/notifications that can complete later (place order → fulfil). Prefer gRPC over REST for internal high-volume sync (binary encoding, streaming, contract-first schemas).
Signals you chose wrong A user request blocks on a chain of 4 sync calls; a "quick" sync call to a slow dependency is eating your thread pool.
Anti-pattern Synchronous call chains A→B→C→D — availability multiplies against you (see next card). → catalog
The Availability Math Do the Numbers

Why "just call the other service" is a trap. Availability in a synchronous chain multiplies: if each hop is independently up 99.9% of the time, four of them in series is:

0.999^4 ≈ 0.9960, i.e. 99.6%

Topology | Combined uptime | Downtime / month (30d)
1 service @ 99.9% | 99.9% | ~43 min
4 in a sync chain | 99.6% | ~173 min (≈ 2.9 hrs)
10 in a sync chain | 99.0% | ~432 min (≈ 7.2 hrs)

Fix: break the chain with async messaging (the caller's uptime stops depending on the callee's), cache reads, or collapse the chain by fixing the boundary. Every sync hop you remove from the critical path is downtime you delete.
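The arithmetic is easy to rerun for your own topology. The sketch below assumes every hop fails independently, which is the same simplification the table makes; the function names are illustrative. Computed without intermediate rounding, the 10-hop figure comes out near 430 minutes; the table's ~432 is derived from the rounded 99.0%.

```python
def chain_availability(per_hop: float, hops: int) -> float:
    """Availability of `hops` independent services called in series."""
    return per_hop ** hops


def downtime_minutes(availability: float, days: int = 30) -> float:
    """Expected downtime over `days` at the given availability."""
    return (1 - availability) * days * 24 * 60


for hops in (1, 4, 10):
    a = chain_availability(0.999, hops)
    print(f"{hops:>2} hop(s): {a:.2%} up, ~{downtime_minutes(a):.0f} min down / 30d")
```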

05 The Edge: Gateway, Discovery & Mesh

Cross-cutting concerns — auth, routing, retries, mTLS — belong at the edge and in the platform, not copy-pasted into every service.

API Gateway / BFF Decision
Context Where do edge concerns (authn, rate-limit, TLS termination, request aggregation) live?
Options Single API gateway · a Backend-for-Frontend per client type (web/mobile) · direct-to-service (no edge).
Default A gateway for shared edge concerns; add a BFF when web and mobile need genuinely different aggregation/shaping.
Anti-pattern Business logic creeping into the gateway — it becomes a new monolith every team must change. → catalog
Service Discovery Decision
Context How does a sync caller find a healthy instance of the callee when IPs are ephemeral?
Options Client-side (registry lookup + client load-balances) · server-side (LB/DNS in front) · platform-native (Kubernetes Service + DNS).
Default Let the platform do it — Kubernetes DNS / cloud service discovery. Don't hand-roll a registry unless you're off-platform.
Anti-pattern Hard-coded hostnames/IPs and config files you redeploy to change a route.
Service Mesh Decision
Context Retries, mTLS, timeouts, and traffic-shaping are needed by every service. Library or infra?
Options Sidecar mesh (Istio, Linkerd) moves it to the platform · in-process libraries (Polly, resilience4j) · nothing (each team reinvents it).
Default Libraries until ~10+ services; adopt a mesh when uniform mTLS, retries, and traffic policy across many services outweigh the sidecar's latency/complexity cost. Linkerd if you want light; Istio if you want every knob.
Anti-pattern Installing Istio for 3 services because it's on the CNCF landscape. → catalog

06 Event-Driven Core Patterns

Three "event" things people constantly conflate, plus how work coordinates across services.

Notification vs State Transfer vs Sourcing Decision

All three are called "events." They are different tools with different coupling and storage implications.

Pattern | What the event carries | Consumer does | Use when
Event Notification | "OrderPlaced #123": a bare fact + ID | Calls back to fetch details if needed | Low coupling; consumers rarely need the full payload
Event-Carried State Transfer | The full order snapshot | Keeps a local read replica; no callback | Consumers need the data and you want to kill sync read chatter
Event Sourcing | Every state change, as the system of record | Rebuilds state by replaying the log | You need a full audit log / temporal queries (rarely; high complexity)
Default Event notification or state transfer for integration between services. Reserve event sourcing for the few aggregates that genuinely need an audit trail — it is a persistence strategy for one service, not an architecture mandate for all of them.
Anti-pattern "We're event-sourced" as a company-wide decree; conflating event sourcing with pub/sub. → catalog
Pub/Sub vs Point-to-Point Queue Decision
Context Should a message go to everyone interested or to exactly one worker?
Options Pub/sub (topic): one publish, many independent subscribers each get a copy. Queue (point-to-point): many workers compete, each message handled once.
Default Pub/sub for events ("something happened", N consumers care); queue for commands/work ("do this once", scale workers). Add competing consumers to a queue to scale throughput horizontally.
Anti-pattern Fanning a command out to a topic so two workers both do it (double-charge); or a queue where you actually needed everyone to react.
Choreography vs Orchestration (Sagas) Decision
Context A multi-step workflow spans services (order → payment → inventory → shipping). Who drives it?
Options Choreography: each service reacts to events, no central brain. Orchestration: a saga orchestrator issues commands and tracks the workflow explicitly.
Trade-offs Choreography is loosely coupled but the workflow is implicit — no one place tells you what happens next. Orchestration is explicit and traceable but the orchestrator is a coupling point.
Default Orchestration once a workflow has 3+ steps or needs compensation/visibility; choreography for simple 1–2-hop reactions.
Anti-pattern Choreography sprawl — the workflow is implicit across 8 event handlers and no one can trace it. → catalog

07 Messaging Infrastructure

Log, queue, and push are three different jobs. Pick by delivery/ordering/replay requirements, not by popularity.

Kafka vs Service Bus/SQS-SNS vs SignalR Decision
Tool | Model | Superpower | Reach for it when | Not for
Kafka (log) | Append-only partitioned log; consumers track offsets | Replay + high throughput + ordered per partition | Event streaming, event sourcing, replay, analytics, >100k msg/s | Simple task queues; you'll drown in operational complexity
Azure Service Bus / SQS+SNS (queue/broker) | Broker-managed queues & topics; broker tracks acks | Rich queue semantics: DLQ, dedup, sessions, delayed delivery | Commands, work queues, sagas, most business messaging | Replaying months of history; ultra-high streaming volume
SignalR / WebSockets (push) | Server→client real-time push over persistent connections | Low-latency push to end users' browsers/apps | Live dashboards, notifications, chat, presence | Service-to-service integration; durability/replay
Default A managed broker (Service Bus / SQS+SNS) covers ~90% of service-to-service messaging with the least ops burden. Add Kafka only when you specifically need replay, log retention, or streaming throughput. SignalR is a client-facing tool, orthogonal to the other two.
Design Topic/queue design decides your ordering & scaling: Kafka orders within a partition (key by aggregate ID to keep an entity's events ordered); queues scale via competing consumers but then lose global ordering.
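Why keying by aggregate ID preserves per-entity order can be shown with a toy partitioner. This is only an illustration: real Kafka clients use their own hash function and partition assignment, and the `partition_for` helper and the six-partition topic are assumptions.

```python
import hashlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a message key to a partition with a stable hash (illustrative only)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


events = [
    ("order-123", "OrderPlaced"),
    ("order-456", "OrderPlaced"),
    ("order-123", "OrderPaid"),
    ("order-123", "OrderShipped"),
]

# Each partition is an append-only list; the same key always lands in the same one.
partitions: dict[int, list[tuple[str, str]]] = {}
for aggregate_id, event_type in events:
    partitions.setdefault(partition_for(aggregate_id, 6), []).append(
        (aggregate_id, event_type)
    )
```

All `order-123` events sit in one partition in publish order, so a single consumer of that partition sees them in sequence. Ordering between `order-123` and `order-456` is not guaranteed, and usually does not need to be.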
Anti-pattern Picking Kafka because it's trendy when you needed a DLQ and delayed retry; picking by popularity instead of delivery/ordering/replay needs. → catalog

08 Data Patterns

This is where microservices are won or lost. See also the Databases cheatsheet for storage-engine choices.

Database-per-Service Decision
Context Can services share a database?
Options Database-per-service (private store, accessed only via the owning service's API) vs shared database (multiple services read/write the same schema).
Default Database-per-service, non-negotiable. A shared database is the fastest route to a distributed monolith: it couples deploys through the schema and lets one service's migration break three others.
Signals you chose wrong You're writing cross-service JOINs in a reporting shim; a schema migration requires a cross-team change freeze.
Anti-pattern Two services reaching into the same tables "just for this one query." → catalog
You don't JOIN — you replicate or compose
  • API composition: the caller queries each service and joins in memory (fine for small result sets).
  • Read replica via events: event-carried state transfer (§6) builds a local denormalized copy of what you need — this is CQRS on the read side.
  • Reporting/analytics: stream all services' events into a warehouse/lake; do cross-domain JOINs there, never in the operational path.
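The first option, API composition, might look like the sketch below. The two `fetch_*` functions stand in for HTTP/gRPC calls to hypothetical Orders and Customers services; the in-memory data exists only to keep the example self-contained.

```python
ORDERS = [
    {"order_id": 1, "customer_id": "c1", "total": 4999},
    {"order_id": 2, "customer_id": "c2", "total": 1200},
    {"order_id": 3, "customer_id": "c1", "total": 300},
]
CUSTOMERS = {"c1": {"name": "Ada"}, "c2": {"name": "Grace"}}


def fetch_recent_orders() -> list[dict]:
    """Stand-in for a call to the Orders service API."""
    return list(ORDERS)


def fetch_customers(ids: set[str]) -> dict[str, dict]:
    """Stand-in for one batched call to the Customers service API."""
    return {cid: CUSTOMERS[cid] for cid in ids if cid in CUSTOMERS}


def recent_orders_with_names() -> list[dict]:
    """Join the two result sets in memory instead of in a shared database."""
    orders = fetch_recent_orders()
    # One batched lookup rather than one call per order (avoids N+1 chatter).
    customers = fetch_customers({o["customer_id"] for o in orders})
    return [
        {**o, "customer_name": customers.get(o["customer_id"], {}).get("name")}
        for o in orders
    ]
```

This stays cheap only while the result set is small; once the join grows, the event-fed read replica is the better fit.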
Outbox Pattern Pattern
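Problem The dual-write problem: a command must both commit to the DB and publish an event, and the two writes are not atomic. A crash between them either loses the event (DB committed, nothing published) or publishes an event for a change that never committed.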
Mechanism Write the event into an outbox table in the same DB transaction as the business change. A separate relay (CDC or a poller) reads the outbox and publishes to the broker, marking rows sent.
When it earns its cost Any time a state change must reliably produce an event — i.e. almost every command handler in an event-driven system. Delivery is at-least-once, so consumers must be idempotent (§9).
Anti-pattern "Commit DB, then publish" in two steps and hoping — the classic dual-write bug. → catalog
-- one transaction, two writes, zero dual-write risk
BEGIN;
  UPDATE orders SET status='PLACED' WHERE id = 123;
  INSERT INTO outbox (id, aggregate_id, type, payload, created_at, sent_at)
  VALUES (gen_random_uuid(), 123, 'OrderPlaced',
          '{"orderId":123,"total":4999}', now(), NULL);
COMMIT;
-- relay: SELECT * FROM outbox WHERE sent_at IS NULL ORDER BY created_at;
--        publish → mark sent_at = now()  (retries are safe: consumer dedups)
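The relay half of the pattern, sketched with SQLite from the standard library so it runs anywhere. The table layout mirrors the SQL above; the `publish` callback is a placeholder for a real broker client, and a production relay would also need batching, locking between relay instances, and cleanup of sent rows.

```python
import json
import sqlite3
import uuid
from datetime import datetime, timezone


def place_order(conn: sqlite3.Connection, order_id: int, total: int) -> None:
    """Business change and outbox row commit or roll back together."""
    with conn:  # one transaction: commits on success, rolls back on exception
        conn.execute("UPDATE orders SET status = 'PLACED' WHERE id = ?", (order_id,))
        conn.execute(
            "INSERT INTO outbox (id, aggregate_id, type, payload, created_at, sent_at) "
            "VALUES (?, ?, ?, ?, ?, NULL)",
            (str(uuid.uuid4()), order_id, "OrderPlaced",
             json.dumps({"orderId": order_id, "total": total}),
             datetime.now(timezone.utc).isoformat()),
        )


def relay_once(conn: sqlite3.Connection, publish) -> int:
    """Publish unsent rows in order; mark each sent only after publish succeeds."""
    rows = conn.execute(
        "SELECT id, type, payload FROM outbox WHERE sent_at IS NULL ORDER BY created_at"
    ).fetchall()
    for row_id, event_type, payload in rows:
        # If publish raises, the row stays unsent and is retried on the next pass.
        publish(row_id, event_type, json.loads(payload))
        with conn:
            conn.execute("UPDATE outbox SET sent_at = ? WHERE id = ?",
                         (datetime.now(timezone.utc).isoformat(), row_id))
    return len(rows)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")
conn.execute("CREATE TABLE outbox (id TEXT PRIMARY KEY, aggregate_id INTEGER, "
             "type TEXT, payload TEXT, created_at TEXT, sent_at TEXT)")
conn.execute("INSERT INTO orders (id, status) VALUES (123, 'NEW')")
conn.commit()
```

A crash after `publish` but before the `UPDATE` republishes the row on the next pass, which is exactly why delivery is at-least-once and the outbox row ID makes a natural idempotency key.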
Saga (Compensating Transactions) Pattern
Problem A business transaction spans services, but there's no distributed ACID transaction to roll them all back.
Mechanism Model the workflow as a sequence of local transactions; if step N fails, run compensating actions to semantically undo steps N-1…1 (refund the payment, release the inventory). Driven by an orchestrator or by choreography (§6).
When it earns its cost Any multi-service workflow needing all-or-nothing business outcome — checkout, booking, onboarding. Compensations are business logic ("refund"), not technical rollbacks.
Anti-pattern Reaching for two-phase commit (2PC) across service boundaries — it locks resources across the network and doesn't scale. → catalog
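An orchestrated saga reduces to "run steps in order, undo the completed ones in reverse on failure." The sketch below shows only that control flow, in process and in memory; a real orchestrator persists its progress and sends commands over a broker. The step names and the `run_saga` helper are illustrative.

```python
from typing import Callable

# (name, action, compensation); each callable receives the shared saga context.
Step = tuple[str, Callable[[dict], None], Callable[[dict], None]]


def run_saga(steps: list[Step], ctx: dict) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    done: list[Step] = []
    for step in steps:
        name, action, _ = step
        try:
            action(ctx)
            done.append(step)
        except Exception as exc:
            ctx["failed_step"] = name
            ctx["error"] = str(exc)
            for _, _, compensate in reversed(done):
                compensate(ctx)  # in practice: idempotent and retried until it succeeds
            return False
    return True


log: list[str] = []


def out_of_stock(ctx: dict) -> None:
    raise RuntimeError("out of stock")


checkout: list[Step] = [
    ("payment", lambda c: log.append("charge"), lambda c: log.append("refund")),
    ("inventory", out_of_stock, lambda c: log.append("release")),
    ("shipping", lambda c: log.append("ship"), lambda c: log.append("cancel shipment")),
]
ok = run_saga(checkout, {})
```

Here the inventory step fails, so only the payment is compensated: the failed step itself never completed and the shipping step never ran.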
CQRS Decision
Context Should reads and writes use the same model?
Options Single model for both vs CQRS — separate write model (commands, invariants) and read model(s) (denormalized, query-shaped), often kept in sync by events.
Default Don't — start with one model. Adopt CQRS only for the specific aggregates where read and write shapes have genuinely diverged (complex queries, very different read/write scale). This is usually later than people think.
Anti-pattern CQRS + event sourcing on every entity by default because a conference talk said so — you've tripled the code for a CRUD form. → catalog

09 Consistency & Idempotency

Distributed systems trade instant consistency for availability. Design for it explicitly — and communicate it to non-engineers.

Eventual Consistency as the Contract Decision
Context Once data lives in many services, "the instant everyone agrees" no longer exists for free.
Options Strong consistency (coordinate on every read/write — expensive, limits availability) vs eventual consistency (replicas converge after a short lag).
Default Eventual consistency between services; reserve strong consistency for within a single aggregate/service. Make the lag a stated product contract ("inventory reflects orders within seconds"), not a silent surprise.
Communicate it Give product/support the language: "the dashboard is a few seconds behind, and that's by design." Show pending/optimistic UI states rather than pretending it's instant.
Anti-pattern Assuming reads are instantly consistent across services and building UX that lies about it. → catalog
Idempotency Keys Pattern
Problem At-least-once delivery means every message can arrive twice. Without protection, "charge card" runs twice.
Mechanism The producer stamps each message/request with a unique idempotency key. The consumer records processed keys and short-circuits duplicates, so re-processing is a no-op. Keep a dedupe window (e.g. 24–72h) sized to your max retry horizon.
When it earns its cost Everything at-least-once — i.e. everything. Any message consumer, any retried API call (payments, especially). Mandatory, not optional.
Anti-pattern Assuming "exactly-once delivery" exists at the transport layer and skipping dedup. It doesn't; you get exactly-once processing only by making handlers idempotent. → catalog
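A minimal consumer-side sketch. The in-memory dictionary is an assumption made to keep the example runnable; a real consumer would record the key in a durable store, ideally in the same transaction as the side effect. The class and key format are illustrative.

```python
import time


class IdempotentConsumer:
    """Wraps a handler so a repeated idempotency key is a no-op within the window."""

    def __init__(self, handler, window_seconds: float = 72 * 3600, clock=time.monotonic):
        self._handler = handler
        self._window = window_seconds
        self._clock = clock
        self._seen: dict[str, tuple[float, object]] = {}  # key -> (processed_at, result)

    def handle(self, key: str, message: dict):
        now = self._clock()
        # Forget keys older than the dedupe window.
        self._seen = {k: v for k, v in self._seen.items() if now - v[0] < self._window}
        if key in self._seen:
            return self._seen[key][1]  # duplicate: return the recorded result
        result = self._handler(message)
        self._seen[key] = (now, result)
        return result


charges: list[int] = []


def charge_card(message: dict) -> str:
    charges.append(message["amount"])
    return f"charged {message['amount']}"


consumer = IdempotentConsumer(charge_card)
consumer.handle("order-123-charge", {"amount": 4999})
consumer.handle("order-123-charge", {"amount": 4999})  # redelivery: no second charge
```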

10 Failure Handling

In a monolith a dependency call can't half-fail. Across the network, partial failure is the normal case — design for it.

Retry, Backoff & Circuit Breakers Pattern
Problem Transient failures (blips, timeouts) should be retried; a genuinely down dependency should not be hammered.
Mechanism Retry with exponential backoff + jitter (e.g. 3 attempts, capped ~30s) for transient errors. Wrap the call in a circuit breaker: after a failure threshold (e.g. 50% of calls in a 10s window) it opens and fails fast; after a cool-down (~30s) it half-opens to probe recovery.
When it earns its cost Every synchronous cross-service call. Add a bulkhead (isolated connection/thread pool per dependency) so one slow callee can't exhaust your whole thread pool.
Anti-pattern Retrying non-idempotent calls without an idempotency key; infinite retries with no cap — you turn a blip into a retry storm. → catalog
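A hand-rolled sketch of both mechanisms, to show the moving parts. It deliberately simplifies: the breaker trips on consecutive failures rather than a failure rate in a time window, it is not thread-safe, and all names are illustrative. In practice a library such as Polly or resilience4j (or the mesh) does this for you.

```python
import random
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is not attempted."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, cooldown_seconds=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, fn):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.cooldown_seconds:
                raise CircuitOpenError("failing fast")
            # Cool-down elapsed: half-open, let this call probe the dependency.
        try:
            result = fn()
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed probe
            raise
        self._failures = 0
        self._opened_at = None  # success closes the circuit
        return result


def backoff_delays(attempts=3, base=0.5, cap=30.0, rng=random.random):
    """Full jitter: delay n is uniform in [0, min(cap, base * 2**n))."""
    return [rng() * min(cap, base * 2 ** n) for n in range(attempts - 1)]


def call_with_retry(fn, breaker, attempts=3, sleep=time.sleep):
    delays = backoff_delays(attempts)
    for attempt in range(attempts):
        try:
            return breaker.call(fn)
        except CircuitOpenError:
            raise  # never retry into an open breaker
        except Exception:
            if attempt == attempts - 1:
                raise  # retries exhausted: surface the last error
            sleep(delays[attempt])
```

The jitter matters as much as the exponent: without it, every client that failed at the same moment retries at the same moment.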
Dead-Letter & Poison Messages Pattern
Problem A message that can never be processed (bad schema, referenced entity gone) will retry forever and block the queue behind it.
Mechanism After N failed deliveries (e.g. 5), the broker moves the poison message to a dead-letter queue (DLQ) — off the hot path, retained for inspection and manual/automated replay.
When it earns its cost Every durable queue/topic. Alert on DLQ depth > 0 — a filling DLQ is a bug, not a shrug. Redrive after fixing the cause.
Anti-pattern Infinite retry with no DLQ, silently eating failures; or a DLQ nobody ever looks at. → catalog
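Managed brokers implement dead-lettering natively, so you normally configure it rather than write it. The in-memory sketch below only illustrates the rule "count deliveries, then park the message"; the `drain` function and message shape are assumptions.

```python
from collections import deque


def drain(queue: deque, handler, dead_letters: list, max_deliveries: int = 5) -> None:
    """Process messages; after `max_deliveries` failures, move the message to the DLQ."""
    while queue:
        message = queue.popleft()
        message["deliveries"] = message.get("deliveries", 0) + 1
        try:
            handler(message["body"])
        except Exception as exc:
            if message["deliveries"] >= max_deliveries:
                message["last_error"] = repr(exc)  # keep context for inspection
                dead_letters.append(message)       # off the hot path
            else:
                queue.append(message)              # redeliver later, behind other work


processed: list[dict] = []


def handler(body: dict) -> None:
    if "order_id" not in body:
        raise ValueError("bad schema")
    processed.append(body)


queue = deque([{"body": {"oops": True}}, {"body": {"order_id": 1}}])
dlq: list[dict] = []
drain(queue, handler, dlq)
```

The poison message fails five times and lands in `dlq` with its last error attached, while the valid message behind it is processed instead of being blocked.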

11 Contracts & Versioning

Independent deployability is a promise you can only keep if changing your API/events can't silently break a consumer.

Consumer-Driven Contract Testing Decision
Context How do you know a change to your service won't break its consumers before you deploy?
Options Full end-to-end integration tests (slow, flaky, need everything running) vs consumer-driven contracts (Pact): each consumer publishes the shape it depends on; the provider's CI verifies it still satisfies every consumer's contract.
Default Contract tests as the deployability gate; keep a thin layer of true end-to-end tests for critical journeys only. The test pyramid stays bottom-heavy in microservices: lots of unit + contract tests, few E2E.
Anti-pattern A giant shared E2E suite that must pass before anyone ships — you've re-coupled all your deploys. → catalog
Schema Evolution Pattern
Problem Events and APIs must evolve without a synchronized "everyone upgrade at once" deploy.
Mechanism Change additively: new fields are optional with defaults; never remove or repurpose a field in place. Consumers apply tolerant reading (ignore unknown fields). For breaking changes, version the message type and run old + new in parallel until consumers migrate.
When it earns its cost Every published event and public API. A schema registry (e.g. for Kafka/Avro) can enforce backward compatibility in CI.
Anti-pattern Renaming/removing a field and deploying — every lagging consumer breaks at once. → catalog
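Tolerant reading plus additive change, sketched for a JSON-style payload. The `OrderPlaced` fields and the `"USD"` default are hypothetical; the technique is simply "take the fields you know, default the ones that are missing, ignore the rest."

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class OrderPlaced:
    order_id: int
    total: int
    currency: str = "USD"  # added in a later version: optional, with a default


def read_order_placed(payload: dict) -> OrderPlaced:
    """Tolerant reader: keep the fields we know, ignore everything else."""
    known = {f.name for f in fields(OrderPlaced)}
    return OrderPlaced(**{k: v for k, v in payload.items() if k in known})


# An old producer that predates `currency` still parses ...
old = read_order_placed({"order_id": 1, "total": 4999})
# ... and so does a newer producer that sends a field this consumer has never seen.
new = read_order_placed({"order_id": 1, "total": 4999, "currency": "EUR", "coupon": "X"})
```

Because neither side rejects the other's payload, producer and consumer can deploy in either order.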

12 Service-to-Service Security

The network between services is not trusted. "Inside the firewall" is not an authorization model.

Zero-Trust: mTLS + Token Propagation Decision
Context How does service B know the caller is really service A, and on whose behalf it's acting?
Options Network trust ("it's in the VPC") vs zero-trust: mTLS for service identity + a propagated, signed user token (JWT) carrying the end-user's authorization.
Default Zero-trust. mTLS between every service (a mesh, §5, automates this); the gateway validates the user token at the edge and each service re-validates and propagates it — never re-issues trust from thin air.
Anti-pattern "The network is internal so no auth needed" — one compromised pod then owns everything. → catalog
  • Secrets: per-service secrets from a vault (Key Vault / Secrets Manager), rotated; never baked into images or shared across services.
  • Confused deputy: propagate the user's token so downstream services enforce the user's permissions — don't let a service act with its own god-mode identity on a user's behalf.
  • See the broader posture in Linux Server Hardening.
Blast-Radius Containment Pattern
Problem A compromised or misbehaving service shouldn't be able to reach everything.
Mechanism Least-privilege network policies (default-deny east-west traffic; allow only declared dependencies), scoped tokens with narrow audiences, and per-service data access. The mesh or K8s NetworkPolicy enforces the graph.
When it earns its cost Any multi-tenant or regulated system; really, any production system past a handful of services.
Anti-pattern Flat network where every service can call every other and read every DB.

13 Observability

You cannot attach a debugger to a distributed system. Observability is how you replace the single stack trace you gave up.

Correlation IDs & Distributed Tracing Pattern
Problem A single user action touches N services. When it's slow or broken, which hop?
Mechanism Propagate a correlation ID (the whole request's trace) and causation ID (which message caused this one) across every hop and log line — via W3C Trace Context / OpenTelemetry. A tracing backend (Jaeger, Tempo, Datadog) stitches the spans into one waterfall.
When it earns its cost Mandatory once you're past ~3 services — not a nice-to-have. Emit the three pillars: traces (where), metrics (how much), logs (what), all tagged with the correlation ID.
Anti-pattern Debugging a saga by grepping five log files by hand and eyeballing timestamps. → catalog
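What "propagate across every hop" means on the wire, sketched with the W3C Trace Context `traceparent` header (`version-traceid-spanid-flags`, with a 32-hex-character trace ID and a 16-hex-character span ID). This is for illustration only; an OpenTelemetry SDK would normally generate and propagate the header for you, and the helper names here are assumptions.

```python
import secrets


def new_traceparent() -> str:
    """Start a trace: version 00, random trace ID and span ID, sampled flag 01."""
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-01"


def child_traceparent(incoming: str) -> str:
    """Keep the trace ID, mint a new span ID for the outgoing hop."""
    version, trace_id, _parent_span_id, flags = incoming.split("-")
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"


def outgoing_headers(incoming_headers: dict) -> dict:
    """Headers to attach to a downstream call made while serving this request."""
    parent = incoming_headers.get("traceparent") or new_traceparent()
    return {"traceparent": child_traceparent(parent)}


def trace_id_of(traceparent: str) -> str:
    """The value to stamp on every log line for this request."""
    return traceparent.split("-")[1]
```

The trace ID is the correlation ID: it is identical on every hop, which is what lets a tracing backend stitch the spans into one waterfall.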
Health, SLOs & the Golden Signals Pattern
Problem With many services, "is it healthy?" needs a machine-checkable answer for routing and alerting.
Mechanism Each service exposes liveness (am I running?) and readiness (can I take traffic?) probes. Alert on the four golden signals — latency, traffic, errors, saturation — against SLOs, not on raw CPU.
When it earns its cost Every service. Readiness gates rollouts and load-balancer membership; SLO burn-rate alerts page humans only when user impact is real.
Anti-pattern Alerting on every CPU spike (alert fatigue) while no alert fires when the user-facing error rate triples.

14 Anti-Pattern Catalog

The canonical, full treatment of every anti-pattern flagged above — this is the single source of truth; the inline mentions are teasers that link here. Each row: the smell that precedes it → why it happens → the fix.

Anti-pattern | Smell (leading indicator) | Why it happens | Fix
Premature / résumé-driven microservices | Splitting before there's more than one deploy cadence; installing a mesh for 3 services. | "Microservices are modern"; tech-résumé incentives. | Start with a modular monolith (§1); split only against a concrete force.
Ignoring Conway's Law | Cross-team approvals on every release; a boundary that two teams both own. | Architecture drawn without regard to team topology. | One service = one stream-aligned team; use the Inverse Conway Maneuver (§1).
Big-bang rewrite | A months-long "v2" branch; feature freeze on the monolith. | Belief that a clean rewrite is faster than incremental extraction. | Strangler fig: extract one context at a time behind a façade (§2).
Nano-services (one per table) | Every use case fans out to 3+ services; services that can't do anything alone. | Boundaries drawn by data schema, not business capability. | Boundary by bounded context / aggregate (§3).
Distributed monolith | Services deploy in lockstep; a shared "Common"/"Entities" package. | A split that shared DB, model, or release. | Kill shared DB & model libs; enforce "deployable alone" (§3).
Synchronous call chains | One user request blocks on A→B→C→D; thread pools exhausting. | Reaching for a sync call because it's the obvious tool. | Async messaging, caching, or fix the boundary; do the availability math (§4).
Fat gateway | Every team must edit the gateway to ship; business rules in routing config. | The gateway is the easy place to "just add this." | Keep edge concerns only; push logic into services (§5).
Event sourcing everywhere | "We're event-sourced" as a decree; CRUD entities modelled as event streams. | Conflating event sourcing with pub/sub; conference-driven design. | Event sourcing is per-aggregate persistence, used rarely; integrate with notification/state-transfer (§6).
Choreography sprawl | No one can answer "what happens after OrderPlaced?"; logic spread across 8 handlers. | Choreography chosen for decoupling, past its complexity limit. | Orchestrate workflows with 3+ steps; make the saga explicit (§6).
Infra-by-popularity | Kafka adopted, but what you use is a DLQ and delayed retry. | Choosing the trendy tool over the requirement. | Pick by delivery/ordering/replay needs (§7).
Shared database | Two services querying the same tables; migrations need a change freeze. | "Just this one JOIN" convenience. | Database-per-service; compose or replicate via events (§8).
Dual write | "Save to DB, then publish event" in two steps; occasional missing events. | No atomic transaction spans DB + broker. | Outbox pattern: event and state in one transaction (§8).
2PC across services | Distributed locks held across network calls; throughput collapses under load. | Wanting ACID semantics across service boundaries. | Saga with compensating transactions (§8).
CQRS by default | Separate read/write models + event store on a simple CRUD form. | Cargo-culting a pattern beyond its niche. | One model first; CQRS only where read/write genuinely diverge (§8).
Assuming strong consistency | UX that shows stale data as if live; "why isn't it updated yet?" bug reports. | Carrying monolith mental models into a distributed system. | Design for eventual consistency; state the lag as a contract (§9).
Believing in exactly-once delivery | No dedup logic; duplicate charges/emails under retry. | Transport marketing ("exactly-once") taken literally. | At-least-once + idempotency keys = exactly-once processing (§9).
Retry storm | A recovering service gets knocked back down; synchronized retry spikes. | Fixed-interval, uncapped retries with no jitter or breaker. | Exponential backoff + jitter + circuit breaker + bulkhead (§10).
No DLQ / silent failure | A poison message loops forever; failures vanish with no trace. | Happy-path-only consumer design. | Dead-letter after N tries; alert on DLQ depth > 0 (§10).
Giant shared E2E gate | Nobody can deploy until one flaky suite with everything running goes green. | Testing a distributed system like a monolith. | Consumer-driven contract tests as the gate; thin E2E (§11).
Breaking schema change | A field renamed/removed; lagging consumers break on deploy. | Treating events/APIs as internal, mutable structures. | Additive-only + tolerant reader + versioned parallel run (§11).
Trusted internal network | No service-to-service auth; flat, any-to-any connectivity. | "It's behind the firewall" as an authz model. | Zero-trust: mTLS + propagated user tokens + network policy (§12).
Grep-the-logs debugging | Incidents diagnosed by hand-correlating five log files. | Tracing treated as optional infra. | Correlation IDs + OpenTelemetry distributed tracing from ~3 services on (§13).

If you read only one section, read this one — it's the most-linked page for a reason.