The Stateless Lie: How AWS Lambda Finally Learned to Remember

Remember making long-distance phone calls in the 1990s? You'd plan the conversation like a military operation, rehearse your key points, and watch the clock with existential dread. Every minute cost real money. Getting disconnected midway meant calling back, re-explaining the context, and hoping you remembered where you left off. The anxiety wasn't about the technology failing; it was about the architectural constraint that made natural conversations feel like a series of frantic sprints against time.

I remember discussing this exact constraint with a Principal Engineer from AWS during one of my interviews. We were talking about Lambda's 15-minute limit, and they were very clear: "We're smart enough to remove that limit, but there has to be a good reason behind it." Then they paused and added, "I'm sure needs will evolve ...and so will Lambda." At the time, I nodded like I understood the depth of what they meant. (Well, I didn't.)

For well over a decade, AWS Lambda has lived in that same 1990s phone call universe. Every function had a hard limit of 15 minutes to say what it needed to say, and if your business logic took longer ...well, you'd split it. You built external orchestrators and wrote "Lambda calling Lambda" chains. Then the whole of AWS Step Functions came in, and we learned to express application logic in Amazon States Language. Because nothing says "developer productivity" like maintaining separate infrastructure-as-code files for what should've been a simple while loop. We called this "stateless computing" and treated it like some immutable law of physics. In reality, it was an architectural guardrail we'd outgrown.

At AWS re:Invent 2025, that guardrail came down.


The 15-Minute Wall Wasn't Technical ...It Was Philosophical

When Lambda launched in 2014, the hard execution limit (initially even shorter than today's 15 minutes) was presented as a feature, not a bug. We all bought it. Short-lived, atomic, idempotent functions were the doctrine. The reasoning made sense at the time: prevent runaway processes, recycle compute resources efficiently, and force developers to think in stateless, event-driven terms.

This was the era of microservices ascendant, when decoupling and independence were the answers to every architectural question. I caved in as well. The term "Serverless" was in my LinkedIn profile for years.

But real-world business processes are inconveniently continuous. Order fulfillment rarely fits in 15 minutes, and neither does that CSV processing job. Multi-stage approval workflows laugh at arbitrary time limits. Chaining multiple Large Language Model calls together, where each model might take 2-3 minutes and fail intermittently due to rate limiting, doesn't fit either. You'd checkpoint progress in DynamoDB, overengineer SQS queues for continuation signals, and write enough error-handling logic to make a Haskell programmer nod approvingly. (Okay, I take that back.)

The cognitive overhead was real. We called it the "context switching tax". Your business logic may have lived in code, but the flow of that logic lived in Step Functions configuration. Want to add a simple retry with exponential backoff? That's not a try/catch block anymore; that's a state machine definition update. The code and the coordination were architecturally divorced, and developers paid alimony in YAML ...or JSON.
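
For contrast, here's the kind of retry we always wanted to write: a plain loop with exponential backoff in ordinary TypeScript. The callWithBackoff() helper is hypothetical and purely illustrative; the point is that this logic sits next to the business code instead of in a separate state machine definition.

// Hypothetical helper, for illustration only. In the Step Functions era,
// this retry policy would live in the state machine definition instead.
async function callWithBackoff<T>(fn: () => Promise<T>, maxAttempts = 5): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= maxAttempts) throw err;
      // Exponential backoff: 1s, 2s, 4s, 8s...
      const delayMs = 1000 * 2 ** (attempt - 1);
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}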

AWS Lambda Durable Functions change the game from the ground up. A single function can now run for up to one year! All of this while preserving state across interruptions, with "pause" and "resume" handled directly in the runtime. This isn't just a longer timeout; it's a fundamentally different programming model.


How Do You Pause a Function for Six Months Without Burning Money?

AWS came through with an elegant solution here: the checkpoint-and-replay mechanism. Unlike a traditional Lambda, which executes linearly and discards its in-memory state when done, a durable function represents a complete lifecycle that can survive multiple physical invocations.

Here's the trick: when your function hits a "durable operation", say a context.wait(Duration.days(180)), Lambda doesn't keep a warm compute instance idling. Instead, it:

  1. Captures a checkpoint of the execution state (which variables hold what values, where in the code you are)
  2. Terminates the compute environment (you stop paying for idle time)
  3. Schedules a wake-up event for 180 days later
  4. Re-invokes the function when the wait expires

Now here's where it gets super interesting. When the function wakes back up, Lambda doesn't magically restore memory. Instead, it replays the entire function from the beginning! The SDK intercepts calls to durable operations like step() or wait() and lies to them: instead of re-executing, it returns the cached results from the checkpoint store. The code speeds through its own history like a student cramming before an exam, catches up to where it was suspended, and then continues with fresh compute.

Let's see what this looks like in practice:

import { DurableClient } from "@aws/durable-execution-sdk";

export const handler = DurableClient.handle(async (context) => {
  // Gets replayed every time, so better be quick about it
  const orderId = context.input.orderId;

  // Step 1: Call payment service (the SDK remembers this if we crash)
  const payment = await context.step("process-payment", async () => {
    // Only THIS step retries on failure, not the whole saga
    return await chargeCustomer(orderId);
  });

  // Step 2: Now we hibernate. No compute running. Just... waiting.
  await context.waitForCallback("warehouse-check");

  // Step 3: We wake up here when the warehouse finally responds
  await context.step("ship-order", async () => {
    return await scheduleShipment(orderId);
  });

  return { status: "completed", orderId };
});

That waitForCallback() call is doing something magical. It generates a unique callback token, hibernates the function, and waits, potentially for hours or days, until an external system invokes a specific API with that token. When the warehouse system finally confirms availability, Lambda wakes the function up, replays it past the first two operations, and continues. No polling. No SQS queue management. No separate state machine tracking which step you're on.
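
For completeness, here's a minimal sketch of what the warehouse side might look like. It assumes the callback token was handed to the warehouse as a URL (the same idea as getCallbackUrl() in a later example); the exact API for completing a callback may differ in the real SDK, so treat the details as an assumption.

// Hypothetical warehouse-side handler, for illustration only.
// Assumes the durable function shared its callback token as a URL;
// the real completion API may look different.
export async function onStockConfirmed(callbackUrl: string, sku: string): Promise<void> {
  // Completing the callback wakes the hibernating durable function.
  await fetch(callbackUrl, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ sku, inStock: true }),
  });
}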


The Determinism Tax: Your Code Runs Multiple Times, So It Better Get The Same Answer

This replay architecture introduces a constraint that will make some developers uncomfortable: your code must be deterministic. Because the function is re-executed multiple times for the same logical workflow, any logic outside of a durable step() must produce exactly the same results on every replay.

Consider the failure mode: you use Math.random() outside a step to generate a token. On the first execution it produces "abc123". The function completes the step and checkpoints. Later, when it replays, Math.random() produces "xyz789" instead. The SDK is now panicking, trying to reconcile the checkpointed history with different data. The execution fails with a "replay divergence" error, and you're debugging distributed-systems behavior at 2 AM. (Your on-call rotation thanks you.)

The fix: move non-deterministic operations inside durable steps. Wrap Math.random(), Date.now(), external API calls, or anything else that might change in a context.step(). The result is checkpointed once, then cached for all subsequent replays.
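
As a minimal sketch, using the same context.step() pattern from the earlier example (and assuming the same DurableClient API):

import { DurableClient } from "@aws/durable-execution-sdk";

export const handler = DurableClient.handle(async (context) => {
  // BAD: a bare Math.random() here would produce a different value on
  // every replay and trigger a replay divergence error.

  // GOOD: the value is computed once, checkpointed, and served from the
  // checkpoint store on every subsequent replay.
  const requestId = await context.step("generate-request-id", async () => {
    return Math.random().toString(36).slice(2);
  });

  const startedAt = await context.step("capture-start-time", async () => {
    return Date.now();
  });

  return { requestId, startedAt };
});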

Determinism Hazard | Why It Breaks | Solution
Math.random() | Different value on every replay | Wrap in context.step()
Date.now() | Time keeps moving forward | Use context.timestamp or wrap in a step
Global variables | Might change between replays | Pass state through function arguments
External API calls | The network is a lie | Always wrap in context.step()
Iterating over a Map or Set | Iteration order can vary by runtime | Use arrays or ensure stable ordering

This determinism requirement is the new price of admission. You get a function that can pause for months; in exchange, you write code with re-entrancy discipline. For teams used to stateless Lambdas, where every invocation is a fresh slate, this is a mindset shift.


The Economic Pivot: When Durable Functions Are Cheaper Than Step Functions

Let's talk money, because architecture decisions that ignore cost are just expensive blog posts.

AWS Step Functions bills per state transition, at $25 per million transitions. Lambda Durable Functions use a different model: $8 per million durable operations plus standard Lambda compute costs. While your function is waiting, you pay nothing for compute.

For a concrete example: processing 10,000 items in a loop costs ~$0.75 with Step Functions (3 state transitions per item) versus ~$0.10 with Durable Functions (compute + durable operations). For high-frequency loops with minimal CPU overhead, Durable Functions can be ~8x cheaper. Step Functions charges for every state transition, while Durable Functions charge only for durable checkpoints and the compute actually consumed.
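
The back-of-the-envelope math, using the round numbers above (and assuming roughly one durable operation per item plus a couple of cents of Lambda compute for a lightweight loop):

// Rough cost comparison using the pricing figures quoted above.
const items = 10_000;

// Step Functions: ~3 state transitions per item at $25 per million transitions
const stepFunctionsCost = (items * 3 / 1_000_000) * 25;   // $0.75

// Durable Functions: ~1 durable operation per item at $8 per million,
// plus a small amount of Lambda compute (assumed ~$0.02 here)
const durableOpsCost = (items * 1 / 1_000_000) * 8;       // $0.08
const durableFunctionsCost = durableOpsCost + 0.02;       // ~$0.10

console.log({ stepFunctionsCost, durableFunctionsCost }); // ~$0.75 vs ~$0.10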

(Of course, if your workflow is 3 steps with a 6-month wait between each, both cost almost nothing. At that "scale", you're optimizing for developer sanity and not pennies.)


The AI Catalyst: Why This Matters Now

The timing of Durable Functions isn't coincidental. The primary driver for this "stateless-to-stateful" shift is Generative AI and agentic workflows. AI-powered applications fundamentally break the 15-minute model in ways that traditional business logic never did.

LLM Chaining and Resilience

Chaining multiple Large Language Model calls together often takes several minutes per call and is subject to intermittent API failures or rate limiting. In a stateless Lambda, a failure at the third step of a five-step chain means restarting the entire process, wasting both time and tokens (whose cost, for frontier models, is not trivial).

With Durable Functions, we checkpoint after each model call:

async def my_billion_dollar_funded_ai_research_agent(context: DurableContext):
    # Step 1: Generate questions (2-3 min, $0.15 in tokens we'd rather not waste)
    questions = await context.step("generate-questions",
        lambda: call_claude_opus(prompt="Generate 5 research questions"))

    # Step 2: Fan out to search APIs (they WILL rate limit us)
    results = await context.parallel([
        context.step(f"search-{i}", lambda q=q: search_api(q))
        for i, q in enumerate(questions)
    ])

    # Step 3: Synthesize (another 3-5 min, another $0.20 we don't want to repeat)
    synthesis = await context.step("synthesize",
        lambda: call_claude_opus(prompt=f"Synthesize: {results}"))

    return synthesis

If the synthesis step fails due to rate limiting, only that step retries. We don't re-run the expensive question generation or the search API calls. The previous results are cached in the checkpoint. For AI workflows where individual LLM calls cost dollars in API fees, this resilience model is a direct cost saver.

Human-in-the-Loop AI

Agentic AI often requires a "human-in-the-loop" to verify a high-stakes decision or provide domain expertise. Historically, this required complex external state machines to pause the workflow, send a notification, and resume hours or days later when the human responded.

With Durable Functions, the AI agent simply waits:

export const handler = DurableClient.handle(async (context) => {
  const analysis = await context.step("analyze-contract", async () => {
    // AI analyzes a legal contract, takes 5 minutes
    return await llm.analyze(context.input.contractText);
  });

  if (analysis.confidence < 0.85) {
    // Send notification to lawyer, wait for their review
    const reviewUrl = context.getCallbackUrl("legal-review");
    // Wrapped in a step so the email isn't re-sent on every replay
    await context.step("notify-lawyer", async () => {
      return await sendEmail(lawyer, `Review needed: ${reviewUrl}`);
    });

    // Function hibernates here (could be hours or days)
    const humanReview = await context.waitForCallback("legal-review");

    // Resume with human feedback
    return await context.step("final-decision", async () => {
      return await llm.finalizeDecision(analysis, humanReview);
    });
  }

  return analysis;
});

The lawyer clicks the link, provides feedback via a web form, and the function resumes exactly where it left off. No Step Functions state machine tracking "which branch are we in." No DynamoDB table storing intermediate state. Just plain, simple code that works.


The Coin Flip: When Durable Functions Are The Wrong Choice

This wouldn't be a balanced architectural analysis without acknowledging where Durable Functions absolutely shouldn't be used.

Don't over-engineer simple infrastructure orchestration with them. If your workflow is "write to DynamoDB, then invoke Lambda, then send an SNS notification," Step Functions' direct AWS service integrations are cleaner. Step Functions can do this entirely in a state machine definition and is built for exactly this, now more so than ever.

Don't use them when non-technical stakeholders need to debug failures. The graphical Step Functions execution view is irreplaceable for on-call engineers or support teams who didn't write the code. Debugging Durable Functions means reading logs and understanding code structure. (Good luck explaining CloudWatch Insights queries to your VP of Operations.) If your workflows span multiple business units and require "management-friendly" observability, Step Functions, with its visual language, justifies the cost premium.

Don't use them if your team isn't ready for the determinism discipline. The replay mechanism is elegant but unforgiving. If your developers aren't comfortable with re-entrancy, checkpointing semantics, and the constraints of deterministic code, the onboarding curve is steep. Organizations should pilot Durable Functions on small tasks before migrating critical workflows.

Finally, don't use them for workflows that are already well served by Step Functions. If you have mature, battle-tested state machines that work, let them work! The migration cost (rewriting ASL in code, testing replay behavior, training teams) might not justify the benefits. Durable Functions shine for new workflows, or for workflows where Step Functions was always an awkward fit.


Well ...Serverless Has Grown Up

The "stateless serverless" paradigm was never about technical necessity. It was an architectural training wheel. For a decade, we've been forced to fragment naturally continuous logic across multiple services all while accepting the context-switching tax. AWS Lambda Durable Functions represent the maturity of serverless computing. The programming model is now closer to how we think about business processes: as continuous flows with pauses, retries, and branches, not as disconnected atomic steps.

The 15-minute limit was never just a constraint; it was a philosophy. And philosophies change.

The path forward is tactical:

  1. Consolidate micro-orchestration. Migrate code-heavy workflows from Step Functions to Durable Functions. If your state machine is 80% ASL and 20% Lambda, flip that ratio.
  2. Embrace AI agents as first-class workloads. Durable execution is the substrate for long-running AI workflows. Checkpoint after expensive LLM calls and use waitForCallback() for human-in-the-loop scenarios.
  3. Mandate determinism audits in code review. Teams moving to Durable Functions need new habits. Are non-deterministic operations wrapped in steps? Is global state avoided? This intuition needs to be built.
  4. Keep Step Functions for macro-orchestration. Don't throw out the visual state machine baby with the ASL bathwater. Use Step Functions for high-level infrastructure coordination and workflows that require non-technical observability. Use Durable Functions for application logic.

The future of serverless is durable. Not because AWS declared it, but because we finally have the tools to write code the way we actually think: continuously, resiliently, and without watching the clock. Let's use them...

AI Disclosure: Rumor has it that this article was crafted by a real human named Mayank (mayankraj.com fame) — but who's to say? The artwork, in some cases, took quite a few virtual brushstrokes from Bing Image Generator. Claude and Gemini kindly helped in hunting down typos and grammatical oopsies. But all the sentences (including this very one), the bad puns, quirky ideas, and alleged human charm? That's (probably) all Mayank... if he even exists!