-
Notifications
You must be signed in to change notification settings - Fork 143
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix unreliable job scheduling by making it look-behind instead of look-ahead #1689
Conversation
Gonna do a couple CI runs in the meantime, but also test locally and think more about the problem |
Removed empty line (was there to test)
@@ -604,7 +604,7 @@ async def heartbeat(self, interval): | |||
self.log_debug("Sending heartbeat") | |||
await self.index.heartbeat(doc_id=self.id) | |||
|
|||
def next_sync(self, job_type): | |||
def next_sync(self, job_type, now): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a major change - before "now" was arbitrary - connector wakes up at 11:59:59, but when checks the schedule it's already 12:00:01 and the moment of scheduling is lost
TODO: still add tests, but wanted to open this PR for discussion |
@@ -39,6 +39,7 @@ def __init__(self, config): | |||
self.source_list = config["sources"] | |||
self.connector_index = None | |||
self.sync_job_index = None | |||
self.last_wake_up_time = datetime.utcnow() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What this means, if that if server starts 1 second after something should be scheduled, it won't be scheduled. It's probably fine, but we can also do datetime.utcnow() - timedelta(seconds=IDLING)
to still execute jobs that should have started IDLING seconds ago
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good from a functional perspective. Left some comments around two print
statements and a variable assignment.
print( | ||
f"Last time woke up at {last_wake_up_time}, woke up at {this_wake_up_time}" | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
print( | |
f"Last time woke up at {last_wake_up_time}, woke up at {this_wake_up_time}" | |
) |
I guess this was used for local debugging?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, need to change it to logger.debug!
connector.log_debug( | ||
f"A scheduled '{job_type_value}' sync is created by another connector instance, skipping..." | ||
) | ||
return False | ||
|
||
try: | ||
next_sync = connector.next_sync(job_type) | ||
next_sync = connector.next_sync(job_type, last_wake_up_time) | ||
print(f"Next sync is at {next_sync}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
print(f"Next sync is at {next_sync}") |
Probably also a debug statement? Or should we change this to use logger
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, good call
@@ -167,29 +170,40 @@ async def _run(self): | |||
await self.sync_job_index.close() | |||
return 0 | |||
|
|||
async def _scheduled_sync(self, connector, job_type): | |||
async def _try_schedule_sync(self, connector, job_type): | |||
last_wake_up_time = self.last_wake_up_time |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: Do we need this assignment? Is the rationale to shorten the name throughout the method by removing the self
prefix?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's just personal taste - since it's used several times, I just put it to variable and use down the code. So purely personal style, not opposed to inlining it.
Agree, and I like this approach/idea |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, nice change 👏
💔 Failed to create backport PR(s)
Successful backport PRs will be merged automatically after passing CI. To backport manually run: |
…k-ahead (#1689) (#1740) Co-authored-by: Artem Shelkovnikov <[email protected]>
…k-ahead (#1689) (#1739) Co-authored-by: Artem Shelkovnikov <[email protected]>
…k-ahead (#1689) (#1738) Co-authored-by: Artem Shelkovnikov <[email protected]>
Closes #1465
Problem: we inconsistently define what is "now" when doing scheduling for jobs + we use look-ahead approach (schedule jobs NO LATER THAN THEY SHOULD RUN). Example:
Let's say, that now it's 11:59:29 PM and jobs trigger every 24 hours exactly at 12:00:00 PM. Our server wakes up every 30 seconds.
So at the moment of 11:59:29 PM, the next job is in 31 seconds and should not be scheduled yet. However, we don't wake up in exactly 30 seconds. We wake up in 30 seconds + overhead (event loop lag, we execute code before sleeping too), so in reality between the scheduling checks there's always an interval of 30+X seconds, where X can be between 1 and 2 seconds in best cases. So if we wait for 31 seconds and wake up, the job will be skipped and never scheduled.
This PR changes the approach to be look-behind (schedule the job NO EARLIER THAN THEY SHOULD RUN). Example:
Let's say, that now it's 11:59:29 PM and jobs trigger every 24 hours exactly at 12:00:00 PM. Our server wakes up every 30 seconds.
With look-behind approach, when the service wakes up it checks if any job should have been scheduled after it woke up last time. So if server wakes up in 30 seconds, it checks if any jobs should have been scheduled in exact last 30 seconds. Same if it wakes up with a lag - in 35, 40, 77 or any other number of seconds. It eliminates all potential problems of not scheduling the task because of setting the expectations incorrectly.
Additionally, IMO it makes sense to always use look-behind approach: it's always better that the sync starts a bit later than a bit earlier when doing capacity planning and scheduling for the syncs, especially if the sync process is multistep (e.g. before sync happens, operators of 3rd-party systems do some step, for example turn off the replication and wait).
Checklists
Pre-Review Checklist
v7.13.2
,v7.14.0
,v8.0.0
)Release Note
Fixed a problem with scheduling when jobs sometimes were not correctly scheduled.
Changed scheduling mechanism to be look-behind and schedule jobs no earlier than expected time as opposed to no later than expected time before.