Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix unreliable job scheduling by making it look-behind instead of look-ahead #1689

Merged
merged 20 commits into from
Oct 6, 2023

Conversation

artem-shelkovnikov
Copy link
Member

@artem-shelkovnikov artem-shelkovnikov commented Sep 27, 2023

Closes #1465

Problem: we inconsistently define what is "now" when doing scheduling for jobs + we use look-ahead approach (schedule jobs NO LATER THAN THEY SHOULD RUN). Example:

Let's say, that now it's 11:59:29 PM and jobs trigger every 24 hours exactly at 12:00:00 PM. Our server wakes up every 30 seconds.

So at the moment of 11:59:29 PM, the next job is in 31 seconds and should not be scheduled yet. However, we don't wake up in exactly 30 seconds. We wake up in 30 seconds + overhead (event loop lag, we execute code before sleeping too), so in reality between the scheduling checks there's always an interval of 30+X seconds, where X can be between 1 and 2 seconds in best cases. So if we wait for 31 seconds and wake up, the job will be skipped and never scheduled.

This PR changes the approach to be look-behind (schedule the job NO EARLIER THAN THEY SHOULD RUN). Example:

Let's say, that now it's 11:59:29 PM and jobs trigger every 24 hours exactly at 12:00:00 PM. Our server wakes up every 30 seconds.

With look-behind approach, when the service wakes up it checks if any job should have been scheduled after it woke up last time. So if server wakes up in 30 seconds, it checks if any jobs should have been scheduled in exact last 30 seconds. Same if it wakes up with a lag - in 35, 40, 77 or any other number of seconds. It eliminates all potential problems of not scheduling the task because of setting the expectations incorrectly.

Additionally, IMO it makes sense to always use look-behind approach: it's always better that the sync starts a bit later than a bit earlier when doing capacity planning and scheduling for the syncs, especially if the sync process is multistep (e.g. before sync happens, operators of 3rd-party systems do some step, for example turn off the replication and wait).

Checklists

Pre-Review Checklist

  • this PR has a meaningful title
  • this PR links to all relevant github issues that it fixes or partially addresses
  • if there is no GH issue, please create it. Each PR should have a link to an issue
  • this PR has a thorough description
  • Covered the changes with automated tests
  • Tested the changes locally
  • Added a label for each target release version (example: v7.13.2, v7.14.0, v8.0.0)
  • Considered corresponding documentation changes

Release Note

Fixed a problem with scheduling when jobs sometimes were not correctly scheduled.

Changed scheduling mechanism to be look-behind and schedule jobs no earlier than expected time as opposed to no later than expected time before.

@artem-shelkovnikov
Copy link
Member Author

Gonna do a couple CI runs in the meantime, but also test locally and think more about the problem

@@ -604,7 +604,7 @@ async def heartbeat(self, interval):
self.log_debug("Sending heartbeat")
await self.index.heartbeat(doc_id=self.id)

def next_sync(self, job_type):
def next_sync(self, job_type, now):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a major change - before "now" was arbitrary - connector wakes up at 11:59:59, but when checks the schedule it's already 12:00:01 and the moment of scheduling is lost

@artem-shelkovnikov
Copy link
Member Author

TODO: still add tests, but wanted to open this PR for discussion

@artem-shelkovnikov artem-shelkovnikov marked this pull request as ready for review October 4, 2023 10:41
@artem-shelkovnikov artem-shelkovnikov requested review from a team and removed request for a team October 4, 2023 10:41
@artem-shelkovnikov artem-shelkovnikov changed the title WIP: Artem/fix unreliable scheduling Fix unreliable job scheduling by making it look-behind instead of look-ahead Oct 4, 2023
@@ -39,6 +39,7 @@ def __init__(self, config):
self.source_list = config["sources"]
self.connector_index = None
self.sync_job_index = None
self.last_wake_up_time = datetime.utcnow()
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What this means, if that if server starts 1 second after something should be scheduled, it won't be scheduled. It's probably fine, but we can also do datetime.utcnow() - timedelta(seconds=IDLING) to still execute jobs that should have started IDLING seconds ago

Copy link
Contributor

@timgrein timgrein left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good from a functional perspective. Left some comments around two print statements and a variable assignment.

Comment on lines 177 to 179
print(
f"Last time woke up at {last_wake_up_time}, woke up at {this_wake_up_time}"
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
print(
f"Last time woke up at {last_wake_up_time}, woke up at {this_wake_up_time}"
)

I guess this was used for local debugging?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, need to change it to logger.debug!

connector.log_debug(
f"A scheduled '{job_type_value}' sync is created by another connector instance, skipping..."
)
return False

try:
next_sync = connector.next_sync(job_type)
next_sync = connector.next_sync(job_type, last_wake_up_time)
print(f"Next sync is at {next_sync}")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
print(f"Next sync is at {next_sync}")

Probably also a debug statement? Or should we change this to use logger?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, good call

@@ -167,29 +170,40 @@ async def _run(self):
await self.sync_job_index.close()
return 0

async def _scheduled_sync(self, connector, job_type):
async def _try_schedule_sync(self, connector, job_type):
last_wake_up_time = self.last_wake_up_time
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: Do we need this assignment? Is the rationale to shorten the name throughout the method by removing the self prefix?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's just personal taste - since it's used several times, I just put it to variable and use down the code. So purely personal style, not opposed to inlining it.

@seanstory
Copy link
Member

IMO it makes sense to always use look-behind approach: it's always better that the sync starts a bit later than a bit earlier when doing capacity planning and scheduling for the syncs

Agree, and I like this approach/idea

@artem-shelkovnikov artem-shelkovnikov requested a review from a team October 5, 2023 10:12
Copy link
Contributor

@timgrein timgrein left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, nice change 👏

@github-actions
Copy link

github-actions bot commented Oct 6, 2023

💔 Failed to create backport PR(s)

Status Branch Result
8.9 #1738
8.8 Commit could not be cherrypicked due to conflicts
8.10 #1739
8.11 #1740

Successful backport PRs will be merged automatically after passing CI.

To backport manually run:
backport --pr 1689 --autoMerge --autoMergeMethod squash

artem-shelkovnikov added a commit that referenced this pull request Oct 6, 2023
artem-shelkovnikov added a commit that referenced this pull request Oct 6, 2023
artem-shelkovnikov added a commit that referenced this pull request Oct 6, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Scheduling is failing when poll interval = time to schedule a sync for new job
3 participants