Lambda to EC2 Migration #781
Notes I picked up from Alina:
Some questions from @jamesgorrie: Have we spoken to our Amazon person about this from 2 perspectives:
I feel like we've explored those questions thoroughly, but AWS account people often have secrets up their sleeves. Another question, similar to what I've asked before: what are the metrics we hope this will affect, e.g. pageviews, or is notification time our bottom line on this one?
From @georgeblahblah and @rtyley:
We had a think about some options for roll out:
Summary
The mobile server side team are proposing to test whether we can improve the performance of our notification delivery system by migrating our services from lambda to EC2.
This RFC sets out the need and rationale for this experiment, the proposed end-state architecture and ideas about how we can evaluate our experiment with the least dev effort.
EDIT: we discussed this RFC with a number of stakeholders and some really interesting suggestions and insights were provided. The actions we're taking away are:
Need
During Q2 of 2022 the mobile server side team ran a number of experiments to improve our high-level SLO: deliver our notifications to 90% of the intended audience within 2 minutes.
A few of these experiments yielded a performance improvement, and our "90in2" percentage increased significantly for notifications delivered to fewer than 1 million people. However, we struggled to reach our SLO for larger notifications, or for notifications sent in parallel.
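As a rough illustration of the "90in2" percentage, here is a minimal sketch. It assumes the figure is the share of the intended audience whose delivery completed within 120 seconds of the notification request being received; the function names and the input shape (a list of per-delivery latencies in seconds) are hypothetical and not taken from our codebase.

```python
from typing import Sequence

SLO_WINDOW_SECONDS = 120   # "within 2 minutes"
SLO_TARGET_PERCENT = 90.0  # "90% of the intended audience"


def ninety_in_two_percentage(delivery_seconds: Sequence[float], audience_size: int) -> float:
    """Percentage of the intended audience reached within the SLO window.

    delivery_seconds holds one latency per completed delivery, measured from
    when the notification request was received (an assumption). Recipients
    that were never reached still count in audience_size, so they lower the
    percentage.
    """
    if audience_size <= 0:
        raise ValueError("audience_size must be positive")
    on_time = sum(1 for s in delivery_seconds if s <= SLO_WINDOW_SECONDS)
    return 100.0 * on_time / audience_size


def meets_slo(percentage: float) -> bool:
    """True when the 90in2 percentage reaches the 90% target."""
    return percentage >= SLO_TARGET_PERCENT


# Toy data, not real measurements: 3 of 5 deliveries land inside 120s.
toy_percentage = ninety_in_two_percentage([30, 90, 119, 121, 183], audience_size=5)
print(toy_percentage, meets_slo(toy_percentage))
```

With the toy data, three of the five deliveries fall inside the window, so the send would miss the objective.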
We can see how each system involved in the send of a large breaking news notification performs, for example notification 1a04d0da-c48c-43f2-a8c3-15b81bc74342, which was sent to 1.88M people on the 29th September.
For this specific example the total duration (time from when the notification app received the request to when the last worker lambda finished sending to apns/firebase) was 183s, delivering a 90in2 percentage of 60.97.
During Q2 we kept the changes from two successful tests that improved the performance and concurrency of our lambdas: increasing lambda memory and increasing the thread pool size. We attempted to increase performance further by minimising lambda duration and by defining provisioned concurrency, but neither of these options significantly improved performance.
AWS controls the scale-out of our lambdas, and the maximum rate at which it adds lambdas is 60/min. Often, we did not even see this maximum scale-out rate. Additionally, when thinking about our current architecture, we would ideally like as many as 150-200 sender lambdas immediately available and processing large breaking news notifications. From CloudWatch metrics we saw that the number of concurrent executions, e.g. for our worker lambdas, never exceeded ~75, even after defining provisioned concurrency of 150.
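To make the arithmetic behind this concrete, here is a small sketch. It assumes lambdas are added linearly at the 60/min maximum quoted above, which is the best case given that we often saw slower rates; the function name and the `already_warm` parameter are hypothetical.

```python
SLO_WINDOW_SECONDS = 120  # the 2 minute delivery window from our SLO


def seconds_to_reach(target_instances: int, rate_per_minute: float, already_warm: int = 0) -> float:
    """Seconds needed to reach target_instances concurrent lambdas.

    Assumes instances are added linearly at rate_per_minute, starting from
    already_warm instances. This is a best-case simplification.
    """
    if rate_per_minute <= 0:
        raise ValueError("rate_per_minute must be positive")
    missing = max(0, target_instances - already_warm)
    return 60.0 * missing / rate_per_minute


for target in (150, 200):
    needed = seconds_to_reach(target, rate_per_minute=60)
    print(f"{target} lambdas: {needed:.0f}s (SLO window is {SLO_WINDOW_SECONDS}s)")
```

Under these assumptions, reaching 150 lambdas from a cold start takes 150 seconds and reaching 200 takes 200 seconds, so the desired concurrency would only arrive after the 2 minute window has already closed.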
To increase the concurrency of our functions we would like to test using EC2 instead of lambda. The benefits we anticipate:
Approach
Current architecture:
Proposed end-state architecture:
Before attempting to implement the end-state architecture we'd like to run an experiment to validate our hypothesis that migrating to EC2 will allow us to meet our 90in2 objective for larger, and parallel, notification sends.
How can we evaluate our experiment with the least dev effort?