Lambda to EC2 Migration #781

DavidLawes · 2022-10-07T13:42:01Z

Summary

The mobile server side team are proposing to test whether we can improve the performance of our notification delivery system by migrating our services from lambda to EC2.

This RFC sets out the need and rationale for this experiment, the proposed end-state architecture and ideas about how we can evaluate our experiment with the least dev effort.

EDIT: we discussed this RFC with a number of stakeholders and some really interesting suggestions and insights were provided. The actions we're taking away are:

we will continue to talk with AWS in more detail about our infrastructure to ensure we have explored all options to optimise lambda
we brought forward some work to optimise the code running on the worker lambdas
we developed a roll out plan for the EC2 implementation that would allow us to incrementally increase the number of people in production that will be served notifications via the EC2 stack (see end of discussion for further information).

Need

During Q2 of 2022 the mobile server side team ran a number of experiments to improve our high-level SLO: deliver our notifications to 90% of the intended audience within 2 minutes.

A few of these experiments yielded a performance improvement, and our "90in2" percentage increased significantly for notifications delivered to fewer than 1 million people. However, we struggled to reach our SLO for larger notifications and for notifications sent in parallel.

We can see how each system involved in the send of a large breaking news notification performs, for example notification 1a04d0da-c48c-43f2-a8c3-15b81bc74342 which was sent to 1.88M people on the 29th September:

System	Type	Total Processing Time (s)	Lambda Invocations
Notification	EC2	0.3	N/A
Harvester	Lambda	36	330
Workers	Lambda	181	3590

For this specific example, the total duration (from when the notification app received the request to when the last worker lambda finished sending to APNs/Firebase) was 183s, giving a 90in2 percentage of 60.97%.
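To make the 90in2 figure concrete, here is a minimal sketch of how such a percentage could be computed from per-device delivery latencies. The function names, the input shape and the sample latencies are assumptions for illustration; the RFC does not describe how the real metric is calculated.

```python
from typing import Iterable

SLO_WINDOW_SECONDS = 120.0   # "within 2 minutes"
SLO_TARGET_PERCENT = 90.0    # "90% of the intended audience"


def percent_within_window(latencies_s: Iterable[float],
                          window_s: float = SLO_WINDOW_SECONDS) -> float:
    """Return the percentage of deliveries that completed within window_s seconds."""
    latencies = list(latencies_s)
    if not latencies:
        return 0.0
    on_time = sum(1 for latency in latencies if latency <= window_s)
    return 100.0 * on_time / len(latencies)


def meets_slo(latencies_s: Iterable[float],
              target_percent: float = SLO_TARGET_PERCENT) -> bool:
    """True if at least target_percent of deliveries landed inside the window."""
    return percent_within_window(latencies_s) >= target_percent


# Hypothetical latencies (seconds) for five devices; three are inside the window.
sample = [10.0, 60.0, 119.0, 121.0, 183.0]
print(percent_within_window(sample), meets_slo(sample))
```

With these made-up numbers the result is 60% and the SLO check fails, which is the same shape of outcome as the 60.97% example above.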

During Q2 we retained some successful tests that improved the performance and concurrency of our lambdas: increasing lambda memory and increasing the thread pool size. We attempted to increase performance further by minimising lambda duration and by defining provisioned concurrency, but neither option significantly improved performance.

AWS controls how quickly our lambdas scale out, and the maximum rate is 60/min. Often, we did not see even this maximum rate. Additionally, with our current architecture we would ideally like as many as 150-200 sender lambdas immediately available to process large breaking news notifications. CloudWatch metrics showed that the number of concurrent executions, e.g. for our worker lambdas, never exceeded ~75, even after defining provisioned concurrency of 150.
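A rough back-of-envelope for why this scaling rate matters against a 2 minute SLO. This sketch assumes concurrency ramps linearly at the quoted 60/min maximum; the linear model and the function name are assumptions, not a description of how AWS actually scales.

```python
def seconds_to_reach(target: int, rate_per_min: float, start: int = 0) -> float:
    """Seconds needed to grow from `start` to `target` concurrent lambdas,
    assuming a constant (linear) ramp of `rate_per_min` new lambdas per minute."""
    if rate_per_min <= 0:
        raise ValueError("rate_per_min must be positive")
    if target <= start:
        return 0.0
    return (target - start) * 60.0 / rate_per_min


# Figures from the paragraph above: 60/min maximum, 150-200 lambdas desired.
for target in (150, 200):
    print(target, seconds_to_reach(target, rate_per_min=60))
```

Under these assumptions, reaching 150 concurrent lambdas from a standing start takes 150 seconds, which is already longer than the 120 second SLO window, and 200 takes longer still.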

To increase the concurrency of our functions we would like to test using EC2 instead of lambda. The benefits we anticipate:

1 out of 10 worker lambda invocations includes a cold start of 4.5s. Once EC2 instances have started, we will experience no equivalent cold-start delays.
We can control the CPU and memory assigned to our EC2 instances, allowing us to control the concurrency within each individual instance.
We can increase the size of the ASG to help us increase concurrency.
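To put the cold-start figure in context, here is a small worked calculation. It assumes the 1-in-10 rate and the 4.5s cost apply uniformly to the 3590 worker invocations from the example notification above; that uniformity is an assumption, and the total is summed across invocations that run concurrently, so it is not wall-clock time.

```python
def cold_start_overhead(invocations: int, one_in: int, cold_start_s: float) -> dict:
    """Estimate cold-start cost when one in every `one_in` invocations is cold."""
    expected_cold_starts = invocations / one_in
    return {
        "expected_cold_starts": expected_cold_starts,
        # Summed across all invocations, not elapsed time for the send.
        "total_cold_start_seconds": expected_cold_starts * cold_start_s,
        "mean_overhead_per_invocation_s": cold_start_s / one_in,
    }


# 3590 worker invocations, 1 in 10 cold, 4.5s per cold start.
print(cold_start_overhead(invocations=3590, one_in=10, cold_start_s=4.5))
```

On those figures roughly 359 invocations would pay the cold-start cost, adding about 0.45s per invocation on average.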

Approach

Current architecture:

Proposed end-state architecture:

Before attempting to implement the end-state architecture we'd like to run an experiment to validate our hypothesis that migrating to EC2 will allow us to meet our 90in2 objective for larger, and parallel, notification sends.

How can we evaluate our experiment with the least dev effort?

To begin with, migrate either the harvester or the worker lambdas, not both. It would be easier to test the impact on the harvester (our test setup in CODE matches PROD more closely for the harvester than for the workers), but we may then be unable to evaluate the effect of the migration on our most significant bottleneck, the workers.
Carry out our testing exclusively in the CODE environment, treating the code as a spike (to be thrown away later) rather than production-ready.
Ensure our CODE environment replicates what we see in production. We have the test rig database so we can create dryRun notification send requests to a large audience. However, we may want to explore how to configure our workers to match production behaviour when sending dryRun requests (at the moment no HTTP requests are made, so timings and concurrency levels are significantly lower in CODE).

DavidLawes · 2022-10-11T16:01:25Z

Notes I picked up from Alina:

we could use this site to estimate costs of EC2 instances: https://ec2instances.github.io/
have we considered a roll-out plan? e.g. running lambda and EC2 in parallel
if we consider using t4 instances, be aware that we can run out of CPU credits

1 reply

DavidLawes Oct 13, 2022
Author

We have considered a roll-out plan (see details in a different comment below). In summary:

we will duplicate the worker stacks, running both lambda and EC2 at the same time
we will divert traffic to EC2 based on topic. This will allow us to incrementally increase the traffic flowing via EC2, roll back easily (the topic whitelist is stored in config) and track like-for-like performance of EC2 compared to lambda
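A minimal sketch of the topic-based routing described in this plan. The function name, the stack labels and the topic strings are hypothetical; the only things taken from the plan are that routing is per topic and that the list of EC2 topics lives in config.

```python
from typing import FrozenSet


def choose_stack(topic: str, ec2_topics: FrozenSet[str]) -> str:
    """Return which worker stack should handle a notification for `topic`.

    `ec2_topics` stands in for the topic list held in config: adding a topic
    moves its traffic to EC2, and emptying the list rolls everything back.
    """
    return "ec2" if topic in ec2_topics else "lambda"


# Hypothetical config value: start with a single low-risk topic on EC2.
EC2_TOPICS = frozenset({"breaking/uk"})

print(choose_stack("breaking/uk", EC2_TOPICS))  # routed to the EC2 stack
print(choose_stack("breaking/us", EC2_TOPICS))  # stays on the lambda stack
```

Because the same topic is always served by exactly one stack, timings for that topic before and after the switch can be compared like-for-like.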

DavidLawes · 2022-10-12T10:33:21Z

Some questions from @jamesgorrie:

Have we spoken to our Amazon person about this from 2 perspectives:

Have we exhausted the lambda scaling issue, is it that they just don't scale for our needs?
Is the proposed architecture a sensible one?

It feels like we've explored those questions thoroughly, but AWS account people often have secrets up their sleeves.

Another Q would be similar to what I've asked before: what are the metrics we hope this will affect, e.g. pageviews, or is notification time our bottom line on this one?

3 replies

frankie297 Oct 12, 2022
Collaborator

We have our AWS hour at work. Would it be beneficial to attend next time?

waisingyiu Oct 12, 2022
Maintainer

I am not sure if pageviews would be a good metric. They are affected by too many factors apart from our notification performance.

DavidLawes Oct 13, 2022
Author

We have added an agenda item to the next AWS office hours to discuss our architecture, its performance and the initiatives currently being explored.

We have been looking at how we could measure the user benefit of increased notification delivery speed. At the moment it appears we may not be able to make strong correlations between app referral page views and notification delivery speed. If we can't draw conclusions from page view data, we discussed whether we could involve UX to help assess more subjective benefits, e.g. by carrying out user interviews to understand whether delivery speed impacts credibility.

DavidLawes · 2022-10-12T10:34:24Z

From @georgeblahblah and @rtyley:

Have we investigated improvements we could make to the lambda code?
If there are improvements to the lambda code we could make, should these be done before the EC2 migration?

1 reply

DavidLawes Oct 12, 2022
Author

We have adjusted our plan for the quarter and will test the impact of batching our API calls at the beginning of the quarter (this development will take place in parallel with the dev work for the lambda to EC2 migration).

DavidLawes · 2022-10-12T15:57:36Z

We had a think about some options for roll out:

cc @frankie297 @waisingyiu @groakland

2 replies

DavidLawes Oct 13, 2022
Author

What about a DLQ retry? e.g. if a message fails to be processed by the EC2 instances, it goes straight to the lambda SQS queue

DavidLawes Oct 13, 2022
Author

We also discussed:

long polling
getting batches of messages from the queue
setting an expiry (if messages sit on the EC2 queue for a long time, send them to the DLQ, and the DLQ sends them to the lambda queue)
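The three ideas above could be combined roughly as follows. This is a sketch under stated assumptions: `sqs` is any object exposing the boto3 SQS client methods `receive_message`, `send_message` and `delete_message`; the queue URLs, `MAX_AGE_SECONDS` and `handle` are hypothetical; and stale or failed messages are forwarded to the lambda queue in application code, which is a simplification of the DLQ hop discussed above rather than an SQS redrive policy.

```python
import time
from typing import Any, Callable, Dict

MAX_AGE_SECONDS = 60.0  # hypothetical expiry threshold for the EC2 queue


def poll_once(sqs: Any, ec2_queue_url: str, lambda_queue_url: str,
              handle: Callable[[str], None],
              now: Callable[[], float] = time.time) -> Dict[str, int]:
    """Receive one batch from the EC2 queue; divert stale or failed messages."""
    response = sqs.receive_message(
        QueueUrl=ec2_queue_url,
        MaxNumberOfMessages=10,            # largest batch SQS returns per call
        WaitTimeSeconds=20,                # long polling, the SQS maximum wait
        AttributeNames=["SentTimestamp"],  # needed to compute message age
    )
    counts = {"processed": 0, "diverted": 0}
    for message in response.get("Messages", []):
        # SentTimestamp is epoch milliseconds, returned as a string.
        sent_s = int(message["Attributes"]["SentTimestamp"]) / 1000.0
        stale = now() - sent_s > MAX_AGE_SECONDS
        failed = False
        if not stale:
            try:
                handle(message["Body"])
            except Exception:
                failed = True
        if stale or failed:
            # Fall back to the lambda stack instead of retrying on EC2.
            sqs.send_message(QueueUrl=lambda_queue_url,
                             MessageBody=message["Body"])
            counts["diverted"] += 1
        else:
            counts["processed"] += 1
        # Remove from the EC2 queue once handled or handed over.
        sqs.delete_message(QueueUrl=ec2_queue_url,
                           ReceiptHandle=message["ReceiptHandle"])
    return counts
```

In a real worker this would be called in a loop with a `boto3.client("sqs")` instance. Passing the client in as a parameter keeps the sketch testable with a fake.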
