Idempotency Keys: Why Retrying Without Them Is Financial Suicide
You're standing at an ATM at 11:47 PM and you press "Withdraw $100." The screen freezes where the loading spinner spins... and spins... and spins. A minute later, the machine is silent. Did the transaction go through? Do you press the button again? Will you ever see your $100!? This is not a hypothetical. This is the Two Generals Problem, dressed in a bank lobby. And if you're building a payment API without understanding this then you're not building a financial system but you're building a slot machine where every retry pulls the lever on your customers' accounts.
Let me show you why idempotency keys are the only thing standing between "reliable payments" and "oops... we accidentally charged you three times."
The Network Is Your Enemy (And It Always Lies)
Here's the problem: when you make an HTTP request to charge a credit card, one of the three things can happen.
- The request never reaches the server. Network jitter, DNS failure or a router hiccupped in Frankfurt. The payment never happened.
- The server processes the request, but it's thumps-up aka 200 back never reaches you. The money changed db-entries, the charge did go through. But your client thinks it failed.
- The server crashes mid-processing. The debit happened but the credit didn't. (Or vice versa, pick your nightmare.)
From your client's perspective, all three scenarios end up the same i.e. with timeout. The loading spinner never gets a signal to stop and you're left with a choice to either retry the request and risk a double-charge, or abandon it and risk the money vanishing into the ether.
This is the Two Generals Problem, a classic distributed computing paradox. Two generals need to coordinate an attack between them but they can only communicate via messengers who might get captured. General A sends a messenger: "Attack at noon." Did it reach? General B sends back: "Acknowledged." Did that arrive? No amount of back-and-forth acknowledgments can provide 100% certainty. We see this everywhere from HTTP protocol to DB commits. In our payment system, the client and server are the Generals. The network doesn't owe them reliability of any sort.
At-Least-Once Delivery
Modern systems don't just give up when a request times out. They retry, and if built rightly with some sort of backoff. Everything from Apache Kafka, the HTTP client library you use and even your mobile app retries. This is called at-least-once delivery, and it's the industry standard for one simple reason: losing a message is worse than duplicating it. But in some cases, especially in financial operations, duplication is not better than loss. It's categorically worse.
Let's say a customer clicks "Pay $100" on your checkout page. The request hits your server, charges their card, and updates the ledger, but before the server can respond with "200: Success," a network glitch eats up the acknowledgment. The client thinks the request failed. So it retries, maybe multiple times. Without idempotency, your backend sees three separate requests and rightfully so it charges the card three times. The customer's bank account: $300 lighter and all of a sudden the discount coupon makes no sense.
The Holy Grail of Exactly-Once Processing
You might be thinking: "Can't we just build exactly once delivery and avoid this whole mess?" No. It's mathematically impossible. The Two Generals Problem proves that you cannot guarantee delivery over an unreliable network. What is achievable, and what you must build is exactly-once processing. Here's the distinction:
- Exactly-once delivery would mean the message arrives at the recipient exactly one time. In theory the network guarantees this, in reality it can't.
- Exactly-once processing means the side effects of a message occur exactly once, even if the message itself is delivered multiple times.
This is the entire game. You accept the fact that the network will deliver duplicates at-least-once, but you make your server smart enough to recognize: "Looks like I've already processed this request. Here's the result I cached last time."
That intelligence is called idempotency. And the key to unlocking it is... well.. the idempotency key.
The Idempotency Key: A Deterministic Anchor in a Chaotic Sea
An idempotency key is a unique identifier the client generates before making a request. It's typically a UUID v4, attached as an HTTP header:
POST /api/payments
Idempotency-Key: 7f3e5c1a-9b2d-4f6e-8ae4-3f4e5f6a7b8c
Content-Type: application/json
{
"amount": 10000,
"currency": "USD",
"user_id": "cust_12345"
}
When the server receives this request, it follows a specific protocol:
- Check for existence. Query the idempotency store (database or cache): "Have I seen this key before?"
- If yes, return the cached result. We don't re-execute the charge or touch the database.
- If no, process the request. Execute the business logic, save the response in the idempotency store, and then return it to the client.
Now, when the client retries because it didn't receive the response (scenario #2 from earlier), the server recognizes the key. It skips the charge logic entirely. It returns the cached success response. The customer is charged once. The ledger is consistent. The idempotency key converts an unreliable at-least-once delivery into deterministic exactly-once processing.
Why the Client Generates the Key (Not the Server)
You might wonder: "Why doesn't the server just generate the key and return it to the client?"
Because that would reintroduce the problem at the key-exchange layer. If the server generates the key and sends it to the client, and that response gets lost, the client is blind to the key. It's as if it never existed. We've moved the problem around and not solved it. By having the client generate the key before the first attempt, the key becomes a deterministic anchor that survives network failures, app crashes, and even client-side restarts. The client can persist the key to local storage or a cookie. So if the app or webpage crashes mid-request, the next session can reload the key and retry safely.
Stripe does this. Airbnb does this. Every competent payment API does this. (And if your API doesn't, please stop reading and go implement it. I'll wait.)
Request Fingerprinting: Preventing Key Reuse Exploits
Here's a nasty edge case: what if a malicious (or buggy, but mostly malicious) client reuses an idempotency key from a $10 payment for a $1,000 payment? If your server just blindly returns the cached $10 success response for the $1,000 request, you just handed a "$990 off" coupon, and without any coupon code. The fix is to store a cryptographic hash of the request payload alongside the idempotency key.
import hashlib
import json
def compute_request_hash(method, uri, body):
# Combine everything that makes this request unique
payload = f"{method}:{uri}:{json.dumps(body, sort_keys=True)}"
return hashlib.sha256(payload.encode()).hexdigest() # Deterministic fingerprint
When a request with a known idempotency key arrives, compare the stored hash to the incoming request's hash:
- If they match: Return the cached response.
- If they don't match: Return
HTTP 409 Conflictand force the client to use a new key.
This ensures that an idempotency key is uniquely and irrevocably bound to a specific set of parameters. You can retry the same charge safely but you cannot reuse the key for a different charge. This also means anything that can change with time, say current_timestamp cannot be part of this hash.
HTTP Status Codes: When to Cache, When to Retry
Not all errors should be cached under an idempotency key. Here's a common policy:
| Status Code | Should Cache? | Client Action |
|---|---|---|
| 2xx Success | ✅ Yes | Done. Transaction succeeded. |
| 400 Bad Request | ✅ Yes | Invalid parameters. Fix data, use new key. |
| 409 Conflict | ❌ No | Idempotency key reused incorrectly. Use new key. |
| 429 Rate Limit | ❌ No | Retry with same key after backoff. |
| 5xx Server Error | Usually ❌ No | Transient failure. Retry with same key unless your API explicitly persists failures. |
Why cache 400s? Because if the client sent invalid data and retrying with the same key won't magically fix it. Caching the error prevents pointless retries and clarifies: "This request is fundamentally broken." So why not cache 5xx by default? Because the server might have crashed mid-transaction or a microservice was unavailable. A retry might succeed once the database recovers. That said, some APIs intentionally persist the first 5xx result to preserve strict idempotency semantics, so this part is a policy choice, not a law of physics.
Key Expiration
Storing idempotency keys forever is impractical and by design not required in the majority of cases. Your database would grow unbounded, and query performance would degrade. Most systems implement a time-to-live (TTL) for such keys. Say the key expires after 24 hours. This window is sufficient for any reasonable retry scenario. If a client retries after 24 hours then the key is gone, and the request is treated as new. Which is probably fine and if your retry loop is still running a day later, you have bigger problems to solve.
For high-stakes operations (like inter-bank transfers), you might extend the TTL to 7 days or store keys indefinitely in cold storage. The trade-off is between storage cost and the risk of re-executing a very old request.
Case Study: Airbnb's Orpheus Framework
Airbnb's transition to microservices required a centralized idempotency solution. They built Orpheus, a library that models every API request in three phases:
- Pre-RPC Phase: Record the intent to perform an action. This happens in a database transaction alongside local state changes.
- RPC Phase: Make the external call (e.g., to Stripe or Braintree). This is inherently non-atomic because the external service is a separate system.
- Post-RPC Phase: Record the result of the external call. Update the idempotency record with the final response.
If a crash occurs during the Pre-RPC phase then the transaction rolls back and no side effects occurred. The client can retry safely. If a crash occurs during or after the RPC phase, the idempotency key allows recovery. On retry, the library checks the database and if it finds a recorded RPC result, it skips the external call and proceeds directly to the Post-RPC phase. This three-phase model ensures that guests are never double charged and hosts are always paid correctly, even when the network is actively conspiring against such holidays.
The Secret: Idempotency Is a Pinky Promise, Not a Feature
Idempotency keys aren't just a technical safeguard but they're a promise to your users that you understand the network is unreliable, and you've designed your system to be resilient anyway. When a customer clicks "Pay," they're trusting you with their money. They're trusting that your system is deterministic, even when the network is chaotic. They're trusting that you've done your homework. That trust is the foundation of every financial platform - Stripe, PayPal, Airbnb, Shopify, Amazon... you name it. These all use idempotency keys because they understand that in a distributed system, retrying without idempotency is not a technical risk but it's a betrayal of user trust.
That's not just good engineering. That's the only engineering that matters.
