Whether you are new to the Serverless world or an expert in it, the AWS Lambda retry mechanism can cause a headache.
A distributed system usually has different nodes triggered by asynchronous actions. Each node must be designed as a single black box unit. When you design a distributed system, you have to consider a fallback system with a robust error-handling mechanism which may include automatic retries.
In this post, we’ll analyze the AWS Lambda Retry Policy, and the different techniques and best practices to handle errors.
AWS Lambda Retry policy
- Synchronous events (such as API Gateway): will not trigger any auto-retry policy. It’s the application’s responsibility to implement the fallback system.
- Asynchronous events (such as SNS and S3): will trigger two retries (by default). If all retries fail, it’s important to save the event somewhere for later processing, for example in a Dead Letter Queue (DLQ). Note that SQS is polled by Lambda rather than invoked asynchronously, so a failed message returns to the queue and follows the queue’s own redrive policy.
- Stream-based events (such as DynamoDB Streams): will be retried until the event is processed successfully or the data expires after a specified amount of time.
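For the synchronous case, where the fallback is the application’s job, a caller-side retry could look roughly like the sketch below. The helper name `withRetry` and its defaults (`maxAttempts`, `baseDelayMs`) are made up for illustration and are not AWS settings.

```typescript
// Hypothetical helper: retry an async operation with exponential backoff.
// maxAttempts and baseDelayMs are illustrative defaults, not AWS settings.
async function withRetry<T>(
  operation: () => Promise<T>,
  maxAttempts = 3,
  baseDelayMs = 100,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error;
      if (attempt < maxAttempts) {
        // Double the wait after every failed attempt: 100 ms, 200 ms, ...
        const delayMs = baseDelayMs * Math.pow(2, attempt - 1);
        await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
      }
    }
  }
  // Every attempt failed: surface the last error to the caller.
  throw lastError;
}
```

The operation passed in should itself be safe to repeat, otherwise the retries can duplicate side effects.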
Is it even worth retrying?
Retrying does not make sense for every type of request. In some cases, a request that fails once has no chance of succeeding in subsequent attempts, so retrying is only a waste of time and money. So, how do you stop your Lambda from retrying?
The quick and dirty approach is to set the Maximum Retry Attempts value to 0. This setting is a recent addition to the AWS Lambda platform (Nov 2019) and applies to asynchronous invocations.
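As a rough sketch, the request for Lambda’s PutFunctionEventInvokeConfig API can be built as a plain object. The function name `my-function` and the helper `buildEventInvokeConfig` are placeholders; the SDK call is left as a comment so the snippet stays dependency-free.

```typescript
// Allowed values for MaximumRetryAttempts on asynchronous invocations.
type RetryAttempts = 0 | 1 | 2;

// Subset of the PutFunctionEventInvokeConfig request used in this sketch.
interface EventInvokeConfig {
  FunctionName: string;
  MaximumRetryAttempts: RetryAttempts;
}

// Hypothetical helper that builds the request payload.
function buildEventInvokeConfig(
  functionName: string,
  retries: RetryAttempts,
): EventInvokeConfig {
  return { FunctionName: functionName, MaximumRetryAttempts: retries };
}

// "my-function" is a placeholder name.
const noRetryConfig = buildEventInvokeConfig("my-function", 0);

// With the AWS SDK for JavaScript v3 (@aws-sdk/client-lambda) this payload
// would be sent roughly like this:
//   await new LambdaClient({}).send(
//     new PutFunctionEventInvokeConfigCommand(noRetryConfig));
```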
A more elegant way is to implement a Global Error Handler in your function.
The idea is that, since all possible exceptions are handled successfully, the Lambda will respond with a valid JSON object and will not retry any request.
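A minimal sketch of such a global error handler, assuming an API-style response with `statusCode` and `body`. The wrapper name `withGlobalErrorHandler` and the `orderId` business logic are invented for the example.

```typescript
// Minimal local types so the sketch is self-contained.
interface ApiResponse {
  statusCode: number;
  body: string;
}
type AsyncHandler<E> = (event: E) => Promise<ApiResponse>;

// Wraps a handler so that any thrown error becomes a valid JSON response.
// Lambda then sees a successful invocation and does not retry.
function withGlobalErrorHandler<E>(handler: AsyncHandler<E>): AsyncHandler<E> {
  return async (event: E): Promise<ApiResponse> => {
    try {
      return await handler(event);
    } catch (error) {
      const message = error instanceof Error ? error.message : String(error);
      console.error("Unhandled error, responding without retry:", message);
      return { statusCode: 500, body: JSON.stringify({ error: message }) };
    }
  };
}

// Hypothetical business logic that fails on invalid input.
const safeHandler = withGlobalErrorHandler(
  async (event: { orderId?: string }) => {
    if (!event.orderId) throw new Error("orderId is required");
    return { statusCode: 200, body: JSON.stringify({ orderId: event.orderId }) };
  },
);
```

Keep in mind that a swallowed error also never reaches a DLQ, so log or store anything you may need later.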
Yes, it’s worth retrying
To get the most out of the retry logic, we have to understand the idempotency concept:
An action which, when performed multiple times, has no further effect on its subject after the first time it is performed.
Gotcha! Hold on, how can I tell whether the current execution is a retry or a new request?
Each Lambda invocation has a unique request ID, and you will see the same ID again only when Lambda retries that invocation. In NodeJS, the value is in the context.awsRequestId property. There is a downside, though: you need to store this data somewhere. The natural solution is DynamoDB. For every incoming request, the Lambda function checks whether a record with that ID already exists and decides whether to add a new one.
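One possible shape for this check is sketched below. An in-memory store stands in for the DynamoDB table, and every name here (`CompletionStore`, `makeIdempotent`) is hypothetical. The request is marked as completed only after the work succeeds, so a retry of a failed attempt still runs.

```typescript
// Stand-in for a DynamoDB table keyed by request ID (assumption: a real
// implementation would read and write items in a table instead).
interface CompletionStore {
  isCompleted(requestId: string): Promise<boolean>;
  markCompleted(requestId: string): Promise<void>;
}

class InMemoryCompletionStore implements CompletionStore {
  private completed: { [requestId: string]: boolean } = {};
  async isCompleted(requestId: string): Promise<boolean> {
    return this.completed[requestId] === true;
  }
  async markCompleted(requestId: string): Promise<void> {
    this.completed[requestId] = true;
  }
}

// Only the part of the Lambda context this sketch needs.
interface RequestContext {
  awsRequestId: string;
}

type Outcome = "processed" | "skipped";

function makeIdempotent<E>(
  store: CompletionStore,
  work: (event: E) => Promise<void>,
) {
  return async (event: E, context: RequestContext): Promise<Outcome> => {
    // Same request ID already completed: this is a retry, skip the work.
    if (await store.isCompleted(context.awsRequestId)) return "skipped";
    await work(event);
    // Mark as done only after success, so failed attempts are retried.
    await store.markCompleted(context.awsRequestId);
    return "processed";
  };
}
```

The check and the write are two separate steps here, so this sketch does not protect against two concurrent executions of the same request.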
Dead Letter Queue (DLQ)
The dead letter queue lets you redirect failed events to an SQS queue or SNS topic. From there, you can decide to add another Lambda function that will process the failed events and send them to a notification system. For example, send a message on a Slack channel.
You can configure the DLQ directly on the Lambda interface or using CloudFormation.
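The Lambda that processes the failed events could be sketched like this, assuming the DLQ is an SQS queue. The record types are trimmed down to the fields used, and the notifier is injected so the Slack call itself stays out of the example; `makeDlqHandler` and `formatFailure` are hypothetical names.

```typescript
// Only the SQS fields this sketch reads.
interface DlqRecord {
  messageId: string;
  body: string;
}
interface DlqEvent {
  Records: DlqRecord[];
}

// Anything that can deliver a text message, e.g. a Slack webhook call.
type Notifier = (text: string) => Promise<void>;

function formatFailure(record: DlqRecord): string {
  return (
    "Lambda event failed after all retries (message " +
    record.messageId + "): " + record.body
  );
}

// Builds the handler for the Lambda subscribed to the DLQ.
function makeDlqHandler(notify: Notifier) {
  return async (event: DlqEvent): Promise<number> => {
    for (const record of event.Records) {
      await notify(formatFailure(record));
    }
    // Number of failed events that were reported.
    return event.Records.length;
  };
}

// A Slack notifier would POST JSON such as { "text": "..." } to an
// incoming-webhook URL; the URL and HTTP client are left to the reader.
```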
Step Functions
Step Functions is an orchestration service that allows you to model workflows as state machines. One can argue that this solution is cumbersome and too verbose but it comes with a series of benefits.
Firstly, Step Functions helps you build a microservice-oriented architecture. One of the first things I did when I started developing with AWS Lambda was to execute several actions inside a single Lambda. As a result, I created a perfect example of a Serverless Monolith 😓
On the other hand, following the state-machine approach, it comes more naturally to run each operation in a different state (which are, indeed, different Lambdas).
Using Step Functions, the developer can decide the transitions between states and the retry behaviour (number of retries and delay duration). Each task can have its own timeout value (by default there is effectively no limit). If the task is not completed in time, a States.Timeout error is generated. Make sure to configure the Task timeout to be equal to the Lambda’s timeout.
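To make this concrete, here is a sketch of a state machine definition in Amazon States Language, written as a TypeScript object. The state names, the ARN, and the numbers are placeholders; `TimeoutSeconds` is meant to match a Lambda timeout of 30 seconds. The small helper shows how `IntervalSeconds` and `BackoffRate` translate into waits between retries.

```typescript
// Retry policy for the task: up to 3 retries, waiting 2 s, 4 s, then 8 s.
const processOrderRetry = {
  ErrorEquals: ["States.Timeout", "States.TaskFailed"],
  IntervalSeconds: 2,
  MaxAttempts: 3,
  BackoffRate: 2,
};

const stateMachineDefinition = {
  StartAt: "ProcessOrder",
  States: {
    ProcessOrder: {
      Type: "Task",
      // Placeholder ARN, replace with your own function.
      Resource: "arn:aws:lambda:us-east-1:123456789012:function:process-order",
      TimeoutSeconds: 30, // assumed to equal the Lambda's timeout
      Retry: [processOrderRetry],
      // When retries are exhausted, move to a failure state.
      Catch: [{ ErrorEquals: ["States.ALL"], Next: "ProcessOrderFailed" }],
      Next: "Done",
    },
    ProcessOrderFailed: {
      Type: "Fail",
      Error: "ProcessOrderFailed",
      Cause: "Task failed after all retries",
    },
    Done: { Type: "Succeed" },
  },
};

// Wait before each retry: IntervalSeconds * BackoffRate^(retry index).
function retryDelays(
  intervalSeconds: number,
  maxAttempts: number,
  backoffRate: number,
): number[] {
  const delays: number[] = [];
  for (let i = 0; i < maxAttempts; i++) {
    delays.push(intervalSeconds * Math.pow(backoffRate, i));
  }
  return delays;
}

// The JSON string is what you would hand to Step Functions as the definition.
const definitionJson = JSON.stringify(stateMachineDefinition, null, 2);
```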
Currently, Step Functions can only be triggered by a limited number of event sources (API Gateway or the SDK). The most common approach is to create a Lambda proxy function that acts as a trigger. For example, if you want to trigger your state machine from SQS, your proxy Lambda will be triggered by the SQS queue. Then, the proxy Lambda has to parse the SQS message and make the appropriate call to the Step Functions StartExecution API.
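A possible shape for that proxy Lambda is sketched below. The StartExecution call is injected as a function so the snippet has no SDK dependency; `makeSqsProxyHandler` is a hypothetical name, and reusing the SQS message ID as the execution name is an assumption of this sketch, not a requirement.

```typescript
// Only the SQS fields this sketch reads.
interface QueueRecord {
  messageId: string;
  body: string;
}
interface QueueEvent {
  Records: QueueRecord[];
}

// Subset of the StartExecution request parameters.
interface StartExecutionParams {
  stateMachineArn: string;
  name?: string;
  input?: string;
}
type StartExecution = (params: StartExecutionParams) => Promise<void>;

function makeSqsProxyHandler(
  stateMachineArn: string,
  startExecution: StartExecution,
) {
  return async (event: QueueEvent): Promise<void> => {
    for (const record of event.Records) {
      // StartExecution expects a JSON string as input, so fail early
      // (and let the message go back to the queue) if the body is not JSON.
      JSON.parse(record.body);
      await startExecution({
        stateMachineArn,
        name: record.messageId, // assumption: message ID as execution name
        input: record.body,
      });
    }
  };
}

// With the AWS SDK for JavaScript v3 (@aws-sdk/client-sfn), startExecution
// could wrap: new SFNClient({}).send(new StartExecutionCommand(params)).
```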
Conclusion
To be honest, I think that error handling in AWS Lambda can be confusing at first glance. Personally, I prefer to use the DLQ method for easy tasks and small projects.
In more complex scenarios, when I need granular control over the entire workflow and its retry behaviour, I go with Step Functions. This introduces additional cost for state transitions, but it gives you more flexibility in return (control over the number of retries and the timeout).
There are also other techniques, such as using a middleware (Middy for example), that can help to handle errors.
Enjoy Serverless 🚀