The Serverless Tax: Why Your LLM Wrapper is Bleeding You Dry
You shipped the shiny new "AI Powered" feature. Your users are happy. ChatGPT is answering questions, Claude is summarizing documents, and your product feels like magic - you can finally add a new subscription tier. Then the AWS bill arrives and it's triple what you expected. You check the Lambda metrics and nothing looks unusual: the functions are fast, the memory usage is reasonable. But the charges keep climbing, and you can't figure out why.
Now imagine this - you hired a plumber to fix your clogged sink. They show up, run their checks, and diagnose the issue in 90 seconds. Then they sit in your kitchen for the next 45 minutes waiting for a specialty part to arrive from the warehouse. The whole time, their $200/hour meter is running. You're paying them to make small talk, sip water, and scroll. This is what happens when you wrap an LLM in a Lambda function. The function does maybe 5 milliseconds of actual work: parsing the request, calling the API, formatting the response. Then it sits there, idle, while GPT-12.9-mini-high thinks for 20 seconds and streams back tokens at a leisurely 100 per second. AWS doesn't care that your function is waiting. The meter is running.
This is the Serverless Tax.
The GB-Second Reckoning
Let's talk about how Lambda actually bills you, because the billing model reveals the entire problem. AWS charges you for two things: the number of requests (a flat $0.20 per million) and the duration of those requests, measured in gigabyte-seconds.
A gigabyte-second (GB-s) is the product of the function's allocated memory (in gigabytes) and how long it runs (in seconds). A function with 1 GB of memory running for 10 seconds consumes 10 GB-s. Simple enough. The current rate is about $0.0000166667 per GB-second on x86 (Arm Graviton2 is about 20% cheaper, but let's keep the math simple).
Nowhere in there did we talk about utilization. That's left to you: whether you use 1% of the CPU or 90%, you're charged the same. CPU power is not a separate billing dimension. When you allocate memory to a Lambda function, AWS allocates vCPU proportionally. Want a full vCPU? Allocate 1,769 MB of memory. Want more compute? Crank the memory higher. This design made perfect sense in the era of compute-heavy workloads like image processing, data transformation, and video transcoding. More memory meant more CPU, and you paid for what you used.
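To make that concrete, here's the whole billing model in a few lines of Python - a sketch using the published x86 rates, with the 1,769 MB per-vCPU figure from AWS's documentation; the function and constant names are mine, not anything AWS ships:

REQUEST_RATE = 0.20 / 1_000_000    # flat charge per request
GB_SECOND_RATE = 0.0000166667      # duration charge per GB-second (x86)

def lambda_bill(memory_gb, billed_seconds, requests):
    # Duration is billed on wall-clock time x allocated memory; utilization never appears
    return requests * (REQUEST_RATE + memory_gb * billed_seconds * GB_SECOND_RATE)

def approx_vcpus(memory_mb):
    # CPU scales roughly linearly with memory: ~1 full vCPU at 1,769 MB, up to ~6 vCPUs at 10,240 MB
    return min(memory_mb / 1769, 6)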
But LLMs broke the model.
The 88x Problem
Let me show you a typical Retrieval-Augmented Generation (RAG) request. Your user asks a question. Your Lambda function:
- Receives the HTTP request (2ms of CPU time)
- Queries your vector database for relevant context (50ms)
- Constructs the prompt with retrieved chunks (3ms)
- Calls the OpenAI API with stream: true (0.5ms to fire the request)
- Waits 4,800ms while GPT-4 generates 500 tokens and streams them back
- Forwards the stream to the user (negligible CPU, it's just proxying bytes)
Total actual compute: ~55 milliseconds. Total billed duration: 4,855 milliseconds. You just paid for 88 times more time than you actually "used" the CPU.
(And this assumes a fast model. If you're using a slower model for complex reasoning, or Claude in extended "thinking" mode, you might be waiting 30 seconds or more. The ratio gets worse.)
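For reference, here's roughly what that handler looks like. This is a sketch, not a reference implementation: query_vector_db, build_prompt, and forward_to_client are hypothetical stand-ins for your vector store, prompt template, and response plumbing, and the timings in the comments are the ones from the breakdown above.

import json
from openai import OpenAI  # assumes the official OpenAI Python SDK

client = OpenAI()

def handler(event, context):
    question = json.loads(event["body"])["question"]   # ~2ms of CPU to parse the request

    chunks = query_vector_db(question, top_k=5)        # ~50ms, mostly waiting on the network
    prompt = build_prompt(question, chunks)            # ~3ms of CPU

    # ~0.5ms to fire the request...
    stream = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )

    # ...then ~4,800ms of forwarding bytes while tokens trickle in.
    # The function is alive (and billed) for every one of those milliseconds.
    for chunk in stream:
        forward_to_client(chunk.choices[0].delta.content or "")

    return {"statusCode": 200}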
Here's the math for a single request with a 1 GB function:
- Billed duration: 4.855 seconds
- Memory allocation: 1 GB
- Cost: 4.855 GB-s × $0.0000166667 = $0.0000809 per request
That doesn't sound like much until you're handling 10 million requests per month. Then it's $809 in Lambda costs alone. The funny part is that the actual LLM tokens might cost you $300. Your wrapper function is more expensive than the AI.
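Plugging those numbers into a quick script (the 55ms of CPU time comes from the step-by-step breakdown above; the rest are the published rates):

GB_SECOND_RATE = 0.0000166667      # x86 rate per GB-second
MEMORY_GB = 1.0
BILLED_SECONDS = 4.855             # wall-clock time the function stays alive
CPU_SECONDS = 0.055                # time it actually spends computing
REQUESTS_PER_MONTH = 10_000_000

per_request = MEMORY_GB * BILLED_SECONDS * GB_SECOND_RATE
print(f"billed per request:    ${per_request:.7f}")                        # ~$0.0000809
print(f"monthly duration bill: ${per_request * REQUESTS_PER_MONTH:,.0f}")  # ~$809
print(f"if billed on CPU time: ${MEMORY_GB * CPU_SECONDS * GB_SECOND_RATE * REQUESTS_PER_MONTH:,.0f}")  # ~$9

If Lambda billed on CPU time instead of wall-clock time, the same workload would cost about nine dollars a month.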
When your infrastructure bill exceeds your API bill, the architecture is upside-down.
Response Streaming: A Beautiful Lie
AWS Lambda's response streaming (the streamifyResponse handler wrapper) was positioned to solve this problem. In theory, instead of buffering the entire LLM response in memory (and bumping into the 6 MB payload limit for traditional Lambda responses), you stream the response back directly to the client. The user sees tokens appear in near real time.
It turns out streaming doesn't reduce billed duration. The function is still alive and well, connected for the entire duration of the stream. If anything, streaming makes the problem worse, because now you're optimizing for user experience at the expense of infrastructure efficiency. The function can't terminate early; it has to sit there, forwarding bytes, until the last token arrives. Streaming removed the friction, which gave developers a cheat code to ship the feature and move on. And we all know what happens to those "TODO: Replace with optimized..." comments.
Plus, if your response exceeds 6 MB (maybe you're returning a long-form reasoning trace or a large document), you start paying for the streamed bytes at $0.008 per GB beyond the first 6 MB of each response. For most chat applications this is irrelevant, but if you're shipping images or files, it's a new line item.
(To be clear: streaming is the right architectural choice for UX in my opinion. But it doesn't fix the economic problem. It just makes it more palatable to the end user while making your CFO nervous.)
The Solution: Good Old WebSockets
Alright, enough doom. Let's talk about how to fix this. If your application is conversational (a chatbot, a coding assistant, a real-time agent), WebSockets flip the model entirely. Instead of the Lambda staying alive for the duration of the LLM call, you keep a persistent connection between the client and API Gateway. Lambda functions are invoked only when messages are sent or received.
The economics:
- WebSocket connection time: $0.25 per million connection minutes
- Lambda invocation: Only when there's actual work to do
A user might stay connected for 20 minutes but only send 5 messages. You're paying connection fees (pennies) and 5 Lambda invocations (also pennies). You're not paying for the 19 minutes of idle time between messages.
import json, os, boto3

# Client for the API Gateway Management API, used to push tokens back over the open WebSocket
apigw = boto3.client('apigatewaymanagementapi',
                     endpoint_url=os.environ['WEBSOCKET_API_ENDPOINT'])  # your WebSocket stage callback URL

def send_to_websocket(connection_id, token):
    apigw.post_to_connection(ConnectionId=connection_id, Data=token.encode('utf-8'))

# Lambda is invoked ONLY on message receipt
def websocket_handler(event, context):
    connection_id = event['requestContext']['connectionId']
    user_message = json.loads(event['body'])['message']
    # Call LLM; llm_client stands in for your streaming client of choice
    llm_client.generate_streaming(
        prompt=user_message,
        callback=lambda token: send_to_websocket(connection_id, token)
    )
    return {"statusCode": 200}  # Lambda terminates; connection stays open with API Gateway
The streaming tokens get pushed to the client over the WebSocket, but the Lambda isn't sitting there playing relay. It fires off the request and gets out of the way. The connection is maintained by API Gateway, which costs almost nothing.
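How close to nothing? Back-of-the-envelope for the session described above - 20 connected minutes, five messages - assuming each invocation runs for about 200 milliseconds at 1 GB because it only fires the request and returns:

CONNECTION_MINUTE_RATE = 0.25 / 1_000_000   # API Gateway WebSocket connection pricing
REQUEST_RATE = 0.20 / 1_000_000             # Lambda per-request charge
GB_SECOND_RATE = 0.0000166667               # Lambda duration rate (x86)

minutes_connected = 20
messages = 5
seconds_per_invocation = 0.2   # assumption: the handler hands off and exits quickly
memory_gb = 1.0

connection_cost = minutes_connected * CONNECTION_MINUTE_RATE
lambda_cost = messages * (REQUEST_RATE + memory_gb * seconds_per_invocation * GB_SECOND_RATE)

print(f"connection: ${connection_cost:.8f}")   # $0.00000500 for the whole 20-minute session
print(f"lambda:     ${lambda_cost:.8f}")       # ~$0.00001767 for all five messages

Compare that with roughly $0.0004 for the same five messages under the always-waiting model from earlier.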
The Async-Native Future
The fundamental mismatch is that serverless was designed for compute-heavy workloads, whereas LLM workloads are I/O-heavy. It's the wrong tool for the job, and that shows up everywhere: pricing, observability gaps, scaling behavior.
We'll need serious retooling for these use cases. We've never had demand for I/O-heavy jobs at the scale we're seeing now. Cloud platforms will need to bill for active CPU cycles, not wall-clock time, and we're already seeing early experiments in that direction.
The secret: Serverless works beautifully when you're computing. It bleeds you dry when you're waiting. LLMs force you to wait.
