Hitting five-nines on a payment settlement service

2023-09-04 / 3 min / payments / reliability / postgres / production

What 99.999% actually means at 20M transactions a month. The Postgres patterns, the idempotency surface, and the operational tax that nobody talks about until they have already missed an SLA.

What five-nines costs you

99.999% availability is roughly five minutes of downtime per year. At 20M transactions per month, every minute of full outage is roughly 460 transactions. Every minute of degraded write path can be much worse, because retries stack up and queues fill behind the failure.

The bar is unreachable by adding redundancy alone. You hit it by making the system safe to retry, safe to degrade, and safe to deploy on a Tuesday afternoon.

Where 99.99% breaks down at this volume

The biggest failure class was not raw outage. It was correctness under retry. Settlement is a write path with money attached. A naive retry on a 5xx can double-charge, double-credit, or settle the same batch twice if the upstream did not give you a clean error contract.

The second was tail latency on the database. Postgres p99 latency on the settlement tables crept up steadily as the dataset grew. A 100ms p99 at year one became 800ms at year three. Once p99 crosses your retry timeout, every minor blip cascades.

What we shipped

Idempotency tokens on every state-changing API, scoped per merchant per operation, with a 24-hour window. The token plus a content hash decided whether the request was a replay. Replays returned the original response, byte-for-byte, without touching state.

Outbox pattern on settlement events. Writes to the settlement table and writes to the downstream-event topic happened in the same Postgres transaction via an outbox table. A separate worker shipped outbox rows to Kafka with at-least-once semantics. The consumers were idempotent against the event id.

Postgres tuning that actually mattered: aggressive autovacuum on settlement tables, BRIN indexes on the time-partitioned audit tables, and a hard rule against unbounded queries from any service path. We added a query review step to PRs that touched settlement.

The operational tax

Five-nines is not a code property. It is a deployment-and-on-call property. We banned breaking schema changes during business hours and used expand-and-contract migrations for settlement-touching changes: add the new shape, dual-write while still reading the old path, backfill and verify, then switch reads to the new path before removing the old one.

The 40% figure is a year-over-year count of settlement-related Sev-1 incidents, comparing the year before and after the operational changes. Most of the drop came from the retro discipline, not from the code changes. Patterns repeat. If you do not write the postmortem, you will hit the same incident next quarter.

What I would protect at all costs next time

The outbox. It is one of those patterns that looks like extra work until the day a downstream consumer is down for three hours and your settlement write went through but the event did not. With the outbox, you replay. Without it, you reconcile by hand.

The idempotency surface. Make idempotency a first-class contract in every write endpoint from day one, even before you think you need it. Bolting it on later requires migrations and client changes that nobody wants to do at 2am.

If you run settlement or any high-value write path and your reliability is being measured in nines, this is the kind of work I take on. Send a brief.