Hitting five-nines on a payment settlement service
2023-09-04 / 3 min / payments / reliability / postgres / production
What 99.999% actually means at 20M transactions a month. The Postgres patterns, the idempotency surface, and the operational tax that nobody talks about until they have already missed an SLA.
What five-nines costs you
99.999% availability allows about 5.26 minutes of downtime per year. At 20M transactions per month, every minute of full outage hits roughly 460 transactions. Every minute of a degraded write path can be much worse, because retries stack up and queues fill behind the failure.
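The arithmetic behind those two numbers, as a quick check. This is a sketch that assumes a 365.25-day year and a 30-day month; the function names are illustrative.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # assumption: average year length


def downtime_budget_minutes(availability: float) -> float:
    """Allowed downtime per year, in minutes, for an availability target."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def transactions_per_minute(monthly_volume: int, days_in_month: int = 30) -> float:
    """Average transaction rate, assuming traffic is spread evenly over the month."""
    return monthly_volume / (days_in_month * 24 * 60)


budget = downtime_budget_minutes(0.99999)
rate = transactions_per_minute(20_000_000)

print(f"downtime budget: {budget:.2f} minutes per year")
print(f"average rate:    {rate:.0f} transactions per minute")
```

Real traffic is not flat, so an outage at peak hits more transactions than the average rate suggests.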
The bar is unreachable by adding redundancy alone. You hit it by making the system safe to retry, safe to degrade, and safe to deploy on a Tuesday afternoon.
Where 99.99% breaks down at this volume
The biggest failure class was not raw outage. It was correctness under retry. Settlement is a write path with money attached. A naive retry on a 5xx can double-charge, double-credit, or settle the same batch twice if the upstream did not give you a clean error contract.
The second was tail latency on the database. Postgres p99 latency on the settlement tables crept up steadily as the dataset grew. A 100ms p99 at year one became 800ms at year three. Once p99 crosses your retry timeout, every minor blip cascades.
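A rough way to see the cascade is to model retry amplification. The sketch below assumes each attempt times out independently with the same probability and that clients retry a fixed number of times. Both are simplifications of mine, not measurements from this system, and real timeouts are correlated, which makes things worse.

```python
def expected_attempts(timeout_fraction: float, max_retries: int) -> float:
    """Expected attempts per request under the independence assumption.

    Attempt k+1 only happens if the first k attempts all timed out,
    so the expectation is 1 + p + p^2 + ... + p^max_retries.
    """
    return sum(timeout_fraction ** k for k in range(max_retries + 1))


# Timeout sitting exactly at p99: about 1% of attempts time out.
steady_state = expected_attempts(0.01, max_retries=3)

# A blip that pushes half of all attempts past the timeout.
during_blip = expected_attempts(0.5, max_retries=3)

print(f"steady state: {steady_state:.4f}x load")
print(f"during blip:  {during_blip:.4f}x load")
```

In this model a blip that pushes half the attempts past the timeout nearly doubles the load on the database, and that extra load is itself what pushes latency further past the timeout.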
What we shipped
Idempotency tokens on every state-changing API, scoped per merchant per operation, with a 24-hour window. The token plus a content hash decided whether the request was a replay. Replays returned the original response, byte-for-byte, without touching state.
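A minimal sketch of that replay check, assuming an in-memory store. The class and method names are hypothetical, and a production version would keep this state in the database behind a unique constraint rather than in a dict. Rejecting a reused token with a different body is also my assumption; the behaviour on a hash mismatch could reasonably differ.

```python
import hashlib
import json
import time
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

WINDOW_SECONDS = 24 * 60 * 60  # the 24-hour replay window


class IdempotencyConflict(Exception):
    """The same token was reused with a different request body."""


@dataclass(frozen=True)
class StoredResponse:
    content_hash: str
    body: bytes
    stored_at: float


class IdempotencyStore:
    """In-memory stand-in; not safe for concurrent use."""

    def __init__(self, clock: Callable[[], float] = time.time) -> None:
        self._clock = clock
        self._entries: Dict[Tuple[str, str, str], StoredResponse] = {}

    def execute(
        self,
        merchant_id: str,
        operation: str,
        token: str,
        payload: dict,
        handler: Callable[[dict], bytes],
    ) -> bytes:
        # Tokens are scoped per merchant per operation.
        key = (merchant_id, operation, token)
        content_hash = hashlib.sha256(
            json.dumps(payload, sort_keys=True).encode("utf-8")
        ).hexdigest()
        now = self._clock()

        entry = self._entries.get(key)
        if entry is not None and now - entry.stored_at < WINDOW_SECONDS:
            if entry.content_hash != content_hash:
                raise IdempotencyConflict("token reused with a different body")
            # Replay: return the original bytes without running the handler.
            return entry.body

        body = handler(payload)
        self._entries[key] = StoredResponse(content_hash, body, now)
        return body


if __name__ == "__main__":
    store = IdempotencyStore()
    settle = lambda p: json.dumps({"status": "settled", **p}).encode("utf-8")
    first = store.execute("merchant-1", "settle", "tok-1", {"amount_cents": 1250}, settle)
    replay = store.execute("merchant-1", "settle", "tok-1", {"amount_cents": 1250}, settle)
    print(first == replay)
```

The handler runs once; the second call returns the stored bytes.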
Outbox pattern on settlement events. Writes to the settlement table and writes to the downstream-event topic happened in the same Postgres transaction via an outbox table. A separate worker shipped outbox rows to Kafka with at-least-once semantics. The consumers were idempotent against the event id.
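The shape of that flow, as a sketch. It uses SQLite from the Python standard library as a stand-in for Postgres and a plain callable as a stand-in for the Kafka producer, so it runs without any infrastructure. Table names, columns, and function names are hypothetical.

```python
import json
import sqlite3
import uuid
from typing import Callable, List, Set


def init_schema(conn: sqlite3.Connection) -> None:
    # Hypothetical minimal tables; real settlement rows carry far more.
    conn.executescript(
        """
        CREATE TABLE settlement (
            id TEXT PRIMARY KEY,
            merchant_id TEXT NOT NULL,
            amount_cents INTEGER NOT NULL
        );
        CREATE TABLE outbox (
            event_id TEXT PRIMARY KEY,
            payload TEXT NOT NULL,
            shipped INTEGER NOT NULL DEFAULT 0
        );
        """
    )


def record_settlement(conn: sqlite3.Connection, merchant_id: str, amount_cents: int) -> str:
    """Write the settlement row and its event in one transaction."""
    settlement_id = str(uuid.uuid4())
    event_id = str(uuid.uuid4())
    payload = json.dumps(
        {"settlement_id": settlement_id, "merchant_id": merchant_id, "amount_cents": amount_cents}
    )
    with conn:  # commits both inserts together, or rolls both back on error
        conn.execute(
            "INSERT INTO settlement (id, merchant_id, amount_cents) VALUES (?, ?, ?)",
            (settlement_id, merchant_id, amount_cents),
        )
        conn.execute(
            "INSERT INTO outbox (event_id, payload) VALUES (?, ?)",
            (event_id, payload),
        )
    return event_id


def ship_outbox(conn: sqlite3.Connection, publish: Callable[[str, str], None]) -> int:
    """Publish unshipped rows, marking each shipped only after publish returns.

    A crash between publish and the UPDATE leaves the row unshipped, so it
    is published again on the next run: at-least-once delivery.
    """
    rows = conn.execute(
        "SELECT event_id, payload FROM outbox WHERE shipped = 0 ORDER BY rowid"
    ).fetchall()
    shipped = 0
    for event_id, payload in rows:
        publish(event_id, payload)
        with conn:
            conn.execute("UPDATE outbox SET shipped = 1 WHERE event_id = ?", (event_id,))
        shipped += 1
    return shipped


class IdempotentConsumer:
    """Applies each event id once, so duplicate deliveries are harmless."""

    def __init__(self) -> None:
        self.seen: Set[str] = set()
        self.applied: List[dict] = []

    def handle(self, event_id: str, payload: str) -> None:
        if event_id in self.seen:
            return
        self.seen.add(event_id)
        self.applied.append(json.loads(payload))


if __name__ == "__main__":
    db = sqlite3.connect(":memory:")
    init_schema(db)
    consumer = IdempotentConsumer()
    record_settlement(db, "merchant-1", 1250)
    print(ship_outbox(db, consumer.handle), len(consumer.applied))
```

If the downstream is unavailable, `publish` raises, the row stays unshipped, and the next worker run picks it up. That is the replay path.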
Postgres tuning that actually mattered: aggressive autovacuum on settlement tables, BRIN indexes on the time-partitioned audit tables, and a hard rule against unbounded queries from any service path. We added a query review step to PRs that touched settlement.
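For illustration, this is roughly what those three rules can look like as statements a migration script would apply. The storage parameters, the BRIN access method, and `statement_timeout` are standard Postgres features. The table, index, and role names and every threshold value are hypothetical, not the ones we ran. The bounded-query check is a deliberately naive sketch of something a PR review step might automate.

```python
import re

# Hypothetical names and illustrative values throughout.
SETTLEMENT_TUNING_SQL = [
    # Vacuum after ~1% of rows change instead of the 20% default scale factor.
    "ALTER TABLE settlement SET ("
    "autovacuum_vacuum_scale_factor = 0.01, "
    "autovacuum_analyze_scale_factor = 0.005)",
    # BRIN suits append-mostly audit data that is naturally ordered by time.
    "CREATE INDEX IF NOT EXISTS settlement_audit_2023_09_created_brin "
    "ON settlement_audit_2023_09 USING brin (created_at)",
    # Backstop for unbounded queries: cap runtime for the service role.
    "ALTER ROLE settlement_service SET statement_timeout = '2s'",
]


def is_bounded(select_sql: str) -> bool:
    """Naive review-time check: the query must carry a LIMIT clause.

    This is a text match, not a SQL parser, so treat a failure as a
    prompt for human review rather than a verdict.
    """
    return re.search(r"\blimit\s+\S+", select_sql, re.IGNORECASE) is not None


if __name__ == "__main__":
    for statement in SETTLEMENT_TUNING_SQL:
        print(statement + ";")
    print(is_bounded("SELECT id FROM settlement WHERE merchant_id = %s LIMIT 100"))
    print(is_bounded("SELECT id FROM settlement WHERE merchant_id = %s"))
```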
The operational tax
Five-nines is not a code property. It is a deployment-and-on-call property. We banned breaking schema changes during business hours and used expand-and-contract migrations for settlement-touching changes: add the new shape, dual-write while still reading the old path, backfill and verify, then switch reads to the new path before removing the old one.
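The expand-and-contract sequence above, sketched phase by phase. This uses SQLite from the standard library as a stand-in for Postgres, and the migration itself (a text `amount` column moving to an integer `amount_cents` column) is a made-up example. A real backfill would run in batches rather than one transaction.

```python
import sqlite3
from decimal import Decimal
from typing import Optional


def to_cents(amount: str) -> int:
    """Convert a decimal string such as '12.50' to integer cents."""
    return int(Decimal(amount) * 100)


def expand(conn: sqlite3.Connection) -> None:
    # Phase 1: add the new shape. Nullable, so existing writers keep working.
    with conn:
        conn.execute("ALTER TABLE settlement ADD COLUMN amount_cents INTEGER")


def dual_write(conn: sqlite3.Connection, row_id: str, amount: str) -> None:
    # Phase 2: new writes populate both shapes; reads still use the old one.
    with conn:
        conn.execute(
            "INSERT INTO settlement (id, amount, amount_cents) VALUES (?, ?, ?)",
            (row_id, amount, to_cents(amount)),
        )


def backfill(conn: sqlite3.Connection) -> int:
    # Phase 3a: fill the new column for rows written before the dual-write.
    rows = conn.execute(
        "SELECT id, amount FROM settlement WHERE amount_cents IS NULL"
    ).fetchall()
    with conn:
        for row_id, amount in rows:
            conn.execute(
                "UPDATE settlement SET amount_cents = ? WHERE id = ?",
                (to_cents(amount), row_id),
            )
    return len(rows)


def verify(conn: sqlite3.Connection) -> bool:
    # Phase 3b: every row must agree between old and new shape.
    rows = conn.execute("SELECT amount, amount_cents FROM settlement").fetchall()
    return all(cents is not None and cents == to_cents(amount) for amount, cents in rows)


def read_amount_cents(conn: sqlite3.Connection, row_id: str) -> Optional[int]:
    # Phase 4: reads switch to the new column. Dropping the old column
    # (the contract step) comes last, in a separate deploy.
    row = conn.execute(
        "SELECT amount_cents FROM settlement WHERE id = ?", (row_id,)
    ).fetchone()
    return None if row is None else row[0]


if __name__ == "__main__":
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE settlement (id TEXT PRIMARY KEY, amount TEXT NOT NULL)")
    db.execute("INSERT INTO settlement (id, amount) VALUES ('s1', '12.50')")
    db.commit()
    expand(db)
    dual_write(db, "s2", "3.05")
    print(backfill(db), verify(db), read_amount_cents(db, "s1"))
```

Each phase is a separate deploy, and each one can be rolled back without touching the others. That is what makes it safe during business hours.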
The 40% figure is the year-over-year drop in settlement-related Sev-1 incidents, comparing the year before the operational changes with the year after. Most of that drop came from the retro discipline, not from the code changes. Patterns repeat. If you do not write the postmortem, you will hit the same incident next quarter.
What I would protect at all costs next time
The outbox. It is one of those patterns that looks like extra work until the day a downstream consumer is down for three hours and your settlement write went through but the event did not. With the outbox, you replay. Without it, you reconcile by hand.
The idempotency surface. Make idempotency a first-class contract in every write endpoint from day one, even before you think you need it. Bolting it on later requires migrations and client changes that nobody wants to do at 2am.
If you run settlement or any high-value write path and your reliability is being measured in nines, this is the kind of work I take on. Send a brief.
