by Yan Cui

How to choose the best event source for pub/sub messaging with AWS Lambda

AWS offers a wealth of options for implementing messaging patterns such as Publish/Subscribe (often shortened to pub/sub) with AWS Lambda. In this article, we’ll compare and contrast some of these options.

The pub/sub pattern

Pub/Sub is a messaging pattern where publishers and subscribers are decoupled through an intermediary message broker (ZeroMQ, RabbitMQ, SNS, and so on).

Source: Publish Subscribe Pattern (Wikipedia)

In the AWS ecosystem, the obvious candidate for the broker role is Simple Notification Service (SNS).
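To make the setup concrete, here’s a minimal sketch using boto3. The topic name, function name, and ARNs are hypothetical placeholders, not resources from the original article:

    # Minimal sketch: SNS as the broker. All resource names/ARNs below are
    # illustrative placeholders.
    import json
    import boto3

    sns = boto3.client("sns")
    lambda_client = boto3.client("lambda")

    topic_arn = sns.create_topic(Name="order-events")["TopicArn"]
    function_arn = "arn:aws:lambda:us-east-1:123456789012:function:process-order"

    # Subscribe the function to the topic...
    sns.subscribe(TopicArn=topic_arn, Protocol="lambda", Endpoint=function_arn)

    # ...and allow SNS to invoke it.
    lambda_client.add_permission(
        FunctionName="process-order",
        StatementId="sns-invoke",
        Action="lambda:InvokeFunction",
        Principal="sns.amazonaws.com",
        SourceArn=topic_arn,
    )

    # Publishers just publish to the topic; SNS invokes every subscriber.
    sns.publish(TopicArn=topic_arn, Message=json.dumps({"orderId": "1234"}))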

SNS will attempt to invoke your Lambda function with a message three times before sending it to a Dead Letter Queue (DLQ), if a DLQ is specified for the function. However, according to an analysis by the folks at OpsGenie, the number of retries can be as many as six.
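If you want that safety net, the DLQ is configured on the function itself for asynchronous invocations. A rough sketch with placeholder names (the function’s execution role would also need permission to send messages to the queue):

    # Sketch: attach an SQS queue as the function's DLQ so messages that
    # exhaust the retries can be inspected and replayed later.
    import boto3

    sqs = boto3.client("sqs")
    lambda_client = boto3.client("lambda")

    queue_url = sqs.create_queue(QueueName="process-order-dlq")["QueueUrl"]
    queue_arn = sqs.get_queue_attributes(
        QueueUrl=queue_url, AttributeNames=["QueueArn"]
    )["Attributes"]["QueueArn"]

    lambda_client.update_function_configuration(
        FunctionName="process-order",            # hypothetical function
        DeadLetterConfig={"TargetArn": queue_arn},
    )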

Another thing to consider is the degree of parallelism this setup offers. For each message, SNS will create a new invocation of your function. So if you publish 100 messages to SNS, then you can have 100 concurrent executions of the subscribed Lambda function.

This is great if you’re optimising for throughput.


However, we’re often constrained by the max throughput our downstream dependencies can handle: databases, S3, internal/external services, and so on.

If the burst in throughput is short, then there’s a good chance the retries would be sufficient (there’s a randomized exponential backoff between retries too) and you won’t miss any messages.

Failed messages are retried two times with exponential backoff. If the burst is short-lived, then the retries are likely to succeed, resulting in no message loss.

If the burst in throughput is sustained over a long period of time, then you can exhaust the max number of retries. At this point you’ll have to rely on the DLQ and possibly human intervention in order to recover the messages that couldn’t be processed the first time around.

Failed messages are retried two times with exponential backoff. But the burst in message rate overlaps with the retries, exacerbating the problem; eventually the max number of retries is exhausted and failed messages have to be delivered to the DLQ instead (if one is specified).

Similarly, if the downstream dependency experiences an outage, then all messages received and retried during the outage are bound to fail.

Any message received or retried during the downstream outage will fail and be sent to the DLQ.

You can also run into the Lambda limit on the number of concurrent executions in a region. Since this is an account-wide limit, it will also impact your other systems within the account that rely on AWS Lambda: APIs, event processing, cron jobs, and so on.
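One way to contain the blast radius is to cap the subscriber function’s reserved concurrency, so a burst on one topic can’t starve every other Lambda-backed system in the account. A minimal sketch, assuming a hypothetical function name and an illustrative cap of 50 (throttled SNS deliveries will simply be retried):

    # Sketch: reserve (and cap) concurrency for the subscriber function so
    # it can't consume the whole account-wide concurrency limit.
    import boto3

    lambda_client = boto3.client("lambda")

    lambda_client.put_function_concurrency(
        FunctionName="process-order",        # hypothetical function
        ReservedConcurrentExecutions=50,     # illustrative cap
    )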


SNS is also prone to suffer from temporal issues, such as bursts in traffic, downstream outages, and so on. Kinesis, on the other hand, deals with these issues much better, as described below:

  • The degree of parallelism is constrained by the number of shards, which can be used to amortize bursts in the message rate (see the sketch after this list)
Bursts in message rate are amortized, as the max throughput is determined by no. of shards * max batch size * 5 reads per second. This gives you two levers to adjust the max throughput with.
  • Records are retried until success is achieved, unless the outage lasts longer than the retention policy you have on the stream (the default is 24 hours). You will eventually be able to process the records
The impact of a downstream outage is absorbed by the retry-until-success invocation policy.
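Here is a minimal sketch of wiring a Kinesis stream up to a Lambda function with boto3. The stream name, function name, shard count, and batch size are all illustrative placeholders:

    # Sketch: Kinesis as the broker. With 2 shards and a batch size of 100,
    # the max read throughput is roughly 2 * 100 * 5 = 1,000 records/s.
    import boto3

    kinesis = boto3.client("kinesis")
    lambda_client = boto3.client("lambda")

    kinesis.create_stream(StreamName="order-events", ShardCount=2)

    lambda_client.create_event_source_mapping(
        EventSourceArn="arn:aws:kinesis:us-east-1:123456789012:stream/order-events",
        FunctionName="process-order",
        StartingPosition="TRIM_HORIZON",
        BatchSize=100,   # one lever; the shard count is the other
    )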

But Kinesis Streams is not without its own problems. In fact, from my experience using Kinesis Streams with Lambda, I have found a number of caveats that need to be understood in order to make effective use of the service.

You can read about these caveats in another article I wrote here.

Interestingly, Kinesis Streams is not the only streaming option available on AWS. There is also DynamoDB Streams.

DynamoDB Streams can be used as a like-for-like replacement for Kinesis Streams.
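A sketch of what that replacement looks like in practice, assuming a hypothetical orders table and function name: enable the stream on the table, then create an event source mapping just as you would for a Kinesis stream.

    # Sketch: DynamoDB Streams as the broker.
    import boto3

    dynamodb = boto3.client("dynamodb")
    lambda_client = boto3.client("lambda")

    # Enable the stream on an existing table; NEW_AND_OLD_IMAGES captures
    # the item as it looked before and after each change.
    resp = dynamodb.update_table(
        TableName="orders",
        StreamSpecification={
            "StreamEnabled": True,
            "StreamViewType": "NEW_AND_OLD_IMAGES",
        },
    )
    stream_arn = resp["TableDescription"]["LatestStreamArn"]

    # Wire the stream to the function, just like a Kinesis stream.
    lambda_client.create_event_source_mapping(
        EventSourceArn=stream_arn,
        FunctionName="process-order",
        StartingPosition="TRIM_HORIZON",
        BatchSize=100,
    )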

By and large, DynamoDB Streams + Lambda works the same way as Kinesis Streams + Lambda. Operationally, it does have some interesting twists:

  • DynamoDB Streams auto scales the number of shards
  • If you’re processing DynamoDB Streams with AWS Lambda, then you don’t pay for the reads from DynamoDB Streams (but you still pay for the read and write capacity units for the DynamoDB table itself)
Source: DynamoDB Pricing
  • Kinesis Streams offers the option to extend data retention to 7 days, but DynamoDB Streams doesn’t offer such an option
Source: Working with DynamoDB Streams

The fact that DynamoDB Streams auto scales the number of shards can be a double-edged sword. On one hand, it eliminates the need for you to manage and scale the stream (or come up with home-baked auto scaling solutions). But on the other hand, it can also diminish the ability to amortize spikes in the load you pass on to downstream systems.

As far as I know, there is no way to limit the number of shards a DynamoDB stream can scale up to, which is something you’d surely consider when implementing your own auto scaling solution.

I think the most pertinent question is, “what is your source of truth?”

Does a row being written in DynamoDB make it canon to the state of your system? This is certainly the case in most N-tier systems that are built around a database, regardless of whether it’s an RDBMS or NoSQL database.

In an event-sourced system where state is modeled as a sequence of events (as opposed to a snapshot), the source of truth might well be the Kinesis stream. For example, as soon as an event is written to the stream, it’s considered canon to the state of the system.
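In that model, the write to the stream is effectively the commit. A minimal sketch, with a hypothetical stream name and event shape:

    # Sketch: the event is canon once PutRecord succeeds.
    import json
    import boto3

    kinesis = boto3.client("kinesis")

    event = {"type": "OrderPlaced", "orderId": "1234", "total": 42.0}

    kinesis.put_record(
        StreamName="order-events",              # hypothetical stream
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=event["orderId"],          # keeps one order's events in sequence
    )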

Then there are other considerations around cost, auto-scaling, and so on.

From a development point of view, DynamoDB streams also have some limitations and shortcomings:

  • Each stream is limited to events from one table
  • The records describe DynamoDB events and not events from your domain, which I’ve always felt creates a sense of dissonance when working with these events

Excluding the cost of Lambda invocations for processing the messages, here are some cost projections for using SNS vs Kinesis Streams vs DynamoDB Streams as the broker. I’m making the assumption that throughput is consistent, and that each message is 1KB in size.

Monthly cost at 1 msg/s


Monthly cost at 1,000 msg/s


These projections should not be taken at face value. For starters, the assumption about a perfectly consistent throughput and message size is unrealistic, and you’ll need some headroom with Kinesis and DynamoDB streams even if you’re not hitting the throttling limits.

That said, what these projections do tell me is that:

  1. You get an awful lot with each shard in Kinesis streams
  2. While there’s a baseline cost for using Kinesis streams, the cost goes down when usage scales up as compared to SNS and DynamoDB streams, thanks to the significantly lower cost per million requests
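If you want to reproduce this kind of projection for your own workload, the arithmetic is simple enough to script. The sketch below only covers SNS and Kinesis Streams, and the per-request and per-shard-hour prices are illustrative placeholders; check the current AWS pricing pages for your region before drawing any conclusions:

    # Sketch: back-of-the-envelope monthly cost projection.
    # Prices below are placeholder assumptions, NOT quoted from AWS pricing.
    SECONDS_PER_MONTH = 30 * 24 * 3600

    SNS_PRICE_PER_MILLION_PUBLISHES = 0.50       # assumed
    KINESIS_PRICE_PER_SHARD_HOUR = 0.015         # assumed
    KINESIS_PRICE_PER_MILLION_PUT_UNITS = 0.014  # assumed; 1 unit = 25KB chunk

    def sns_monthly_cost(msgs_per_sec):
        publishes = msgs_per_sec * SECONDS_PER_MONTH
        return publishes / 1e6 * SNS_PRICE_PER_MILLION_PUBLISHES

    def kinesis_monthly_cost(msgs_per_sec, shards):
        # with 1KB messages, each message is a single PUT payload unit
        put_units = msgs_per_sec * SECONDS_PER_MONTH
        shard_cost = shards * 30 * 24 * KINESIS_PRICE_PER_SHARD_HOUR
        return shard_cost + put_units / 1e6 * KINESIS_PRICE_PER_MILLION_PUT_UNITS

    print(sns_monthly_cost(1))                   # 1 msg/s
    print(kinesis_monthly_cost(1000, shards=2))  # 1,000 msg/s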

Whilst SNS, Kinesis, and DynamoDB streams are your basic choices for the broker, Lambda functions can also act as brokers in their own right and propagate events to other services.

This is the approach used by the aws-lambda-fanout project from awslabs. It allows you to propagate events from Kinesis and DynamoDB streams to other services that cannot directly subscribe to the three basic choices of broker (either because of account/region limitations, or because they’re simply not supported).

The aws-lambda-fanout project from awslabs propagates events from Kinesis and DynamoDB Streams to other services across multiple accounts and regions.
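At its core, such a fan-out function is just a handler that reads a batch of records and republishes them elsewhere. The sketch below is not the actual aws-lambda-fanout code; it simply forwards Kinesis records to a hypothetical SNS topic (which could live in another account or region):

    # Sketch: a Lambda function acting as the broker, fanning Kinesis
    # records out to an SNS topic. The topic ARN is a placeholder.
    import base64
    import boto3

    sns = boto3.client("sns")
    TARGET_TOPIC_ARN = "arn:aws:sns:eu-west-1:123456789012:fanout-target"

    def handler(event, context):
        for record in event["Records"]:
            payload = base64.b64decode(record["kinesis"]["data"]).decode("utf-8")
            # NB: an exception here fails the whole batch and Lambda retries
            # every record in it, which is one of the partial-failure
            # complexities mentioned below.
            sns.publish(TopicArn=TARGET_TOPIC_ARN, Message=payload)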

While it’s a nice idea and definitely meets some specific needs, it’s worth bearing in mind the extra complexities it introduces, such as handling partial failures, dealing with downstream outages, misconfigurations, and so on.

Conclusion

So what is the best event source for doing pub-sub with AWS Lambda? Like most tech decisions, it depends on the problem you’re trying to solve, and the constraints you’re working with.

In this post, we looked at SNS, Kinesis Streams, and DynamoDB Streams as candidates for the broker role. We walked through a number of scenarios to see how the choice of event source affects scalability, parallelism, resilience against temporal issues, and cost.

You should now have a much better understanding of the tradeoffs between various event sources when working with Lambda.

Until next time!