[WIP] Scaling on AWS #117

suhlrich · 2023-10-11T05:39:30Z

This closes #113 .

TODO AWS CLI

Get machine ID
send shutdown message

Turn off the backend machine after 5 minutes without a trial, provided nothing is being recorded

utilsAPI.py

comments + addressing comment alex

antoinefalisse · 2024-05-13T18:37:08Z

@sashasimkin and @suhlrich could you take a look at this? I added the boto3 scripts and adjusted a couple of things.

I am not sure this is what we want here. As I understand it, if the number of trials is under a certain threshold (like the number of on-prem workers times a number of trials per worker), then we want to remove the scale-in protection. I think that is what this will do. Is that what you had in mind?
I don't think this will ever be triggered unless we run without on-prem workers (a case we still want to work). The idea here is to remove the scale-in protection if the instance hasn't processed a trial for more than x minutes and no trials are being recorded.
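
A minimal sketch of how the scale-in protection could be removed with boto3, assuming the instance belongs to an Auto Scaling group. The function name, its parameters, and the injectable `client` argument are hypothetical and not taken from this PR; `set_instance_protection` is the boto3 Auto Scaling client call used for this purpose.

```python
def remove_scale_in_protection(instance_id, group_name, client=None):
    """Ask the Auto Scaling group to stop protecting this instance from scale-in.

    `client` is injectable (hypothetical design choice) so the logic can be
    exercised without AWS credentials.
    """
    if client is None:
        # Imported lazily: boto3 is only required when talking to AWS for real.
        import boto3
        client = boto3.client("autoscaling")
    return client.set_instance_protection(
        InstanceIds=[instance_id],
        AutoScalingGroupName=group_name,
        ProtectedFromScaleIn=False,
    )
```

Passing `ProtectedFromScaleIn=True` instead would restore the protection, for example when the instance picks up a new trial.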

suhlrich

looks good!

antoinefalisse · 2024-05-13T20:14:48Z

app.py

+                t_lastTrial = time.localtime()
+
+        if autoScalingInstance and justProcessed:
+            justProcessed = False


Hi @sashasimkin, as I mentioned in the comment, I don't think that part should ever be triggered when we have on-prem workers, since there should be no case where that status is 404. However, I would like that part to work, and for us to test it, when there are no on-prem workers. The intended behavior is NOT for each worker machine to process only one trial. The intention is to remove the scale-in protection after a certain time (minutesBeforeRemoveScaleInProtection) and only if no trials are being recorded. If there is a queue, the justProcessed variable keeps getting set to True and the scale-in protection is not removed. Does that make sense?
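
The intended behavior described above can be sketched as a small decision function. This is an illustration under assumptions, not the code in app.py: timestamps are plain seconds, and `should_remove_protection` and its parameter names are hypothetical.

```python
def should_remove_protection(last_trial_s, now_s, minutes_before_remove, trials_recording):
    """Return True when the instance has been idle long enough and nothing is recording."""
    idle_minutes = (now_s - last_trial_s) / 60.0
    return idle_minutes > minutes_before_remove and not trials_recording


# With a queue, every processed trial resets the timer, so protection stays on.
last_trial_s = 0.0
kept_protected = []
for finished_at_s in (120.0, 240.0, 360.0):  # a trial finishes every 2 minutes
    kept_protected.append(not should_remove_protection(last_trial_s, finished_at_s, 5, False))
    last_trial_s = finished_at_s
```

In this sketch the idle time never exceeds the threshold while trials keep arriving, which mirrors how justProcessed keeps being set to True when there is a queue.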

I don't think I understand why we would need to move this there:

only to move if r.status_code == 404: block before if with_on_prem:

can you explain what would be the reason for r.status_code == 404? I assumed that this is returned when there's no work to do.

From what I see, it seems that setting "max_on_prem_pending_trials" to 0 when running without on prem workers will do exactly what you want, if you lift the "with_on_prem" block and run the check in the main loop?

re. "The intended behavior is NOT to process only 1 trial by every single worker machine." - my bad, I didn't read the code fully, sorry.

> can you explain what would be the reason for r.status_code == 404? I assumed that this is returned when there's no work to do.

Yes, exactly. If there is no job, it returns 404 and the loop starts over until there is a job to process.
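
A minimal sketch of the polling behavior just described: a 404 means there is no job, so the worker waits and asks again. The `fetch` callable stands in for the HTTP request made in app.py and is an assumption, as are the function name and the `max_attempts` guard, which is added only so the loop can terminate in a test.

```python
import time


def poll_for_job(fetch, delay_s=1.0, max_attempts=None, sleep=time.sleep):
    """Call `fetch` until it returns a status other than 404, then return its payload.

    `fetch` must return a (status_code, payload) tuple. Returns None if
    `max_attempts` is reached while the queue is still empty.
    """
    attempts = 0
    while max_attempts is None or attempts < max_attempts:
        status_code, payload = fetch()
        attempts += 1
        if status_code == 404:
            # No job available: wait, then start over.
            sleep(delay_s)
            continue
        return payload
    return None
```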

> From what I see, it seems that setting "max_on_prem_pending_trials" to 0 when running without on prem workers will do exactly what you want, if you lift the "with_on_prem" block and run the check in the main loop?

I'm not sure I follow this. Are you suggesting to move this somewhere else?

My suggestion is to drop it entirely, along with the "if with_on_prem:" condition for unprotecting, to simplify the code.

But I'm now not sure that I understand the whole logic, so I'll resolve this comment now and will try to explain the point on the call today.


suhlrich added 2 commits October 10, 2023 16:31

first step of shutting down thru app.py

df96d13

complete app.py AWS shutdown

af00a87

sashasimkin mentioned this pull request May 3, 2024

Backend scaling -- stop instance #113

Open

5 tasks

sashasimkin reviewed May 3, 2024


antoinefalisse added 2 commits May 7, 2024 12:00

comments + addressing comment alex

e9f70e9

Merge pull request #156 from stanfordnmbl/scaling_pr

1ff678e

antoinefalisse changed the title from "[WIP] Shut down AWS machine from app.py" to "[WIP] Scaling on AWS" May 13, 2024

antoinefalisse changed the base branch from main to dev May 13, 2024 17:03

antoinefalisse added 4 commits May 13, 2024 10:05

fix conflicts

150312e

3600 instead of 360

15cec94

scaling logic

4fdf1da

attempt on prem workers

93af3ad

suhlrich commented May 13, 2024


sashasimkin reviewed May 13, 2024


antoinefalisse added 2 commits May 13, 2024 13:15

minor

f5e42ba

add workflow for dev

44042a0

antoinefalisse merged commit 44042a0 into dev May 15, 2024
