Fix unreliable job scheduling by making it look-behind instead of look-ahead #1689

artem-shelkovnikov · 2023-09-27T11:25:28Z

Closes #1465

Problem: we define "now" inconsistently when scheduling jobs, and we use a look-ahead approach (schedule jobs NO LATER THAN THEY SHOULD RUN). Example:

Let's say it's now 11:59:29 PM and jobs trigger every 24 hours at exactly 12:00:00 AM. Our server wakes up every 30 seconds.

At 11:59:29 PM the next job is 31 seconds away, so it should not be scheduled yet. However, we don't wake up in exactly 30 seconds. We wake up in 30 seconds + overhead (event loop lag, plus the code we execute before sleeping), so in reality the interval between scheduling checks is always 30+X seconds, where X is between 1 and 2 seconds in the best case. So if we wake up 31 or more seconds later, the trigger time has already passed, the next run is 24 hours away, and the job is skipped and never scheduled.
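To make the failure concrete, here is a minimal sketch of the look-ahead check. It is a simplified model, not the actual connector code: `look_ahead_should_schedule` is a hypothetical helper, and the 2-second overhead is just one value from the range mentioned above.

```python
from datetime import datetime, timedelta

IDLING = timedelta(seconds=30)  # nominal sleep between scheduling checks


def look_ahead_should_schedule(now, trigger_time, window=IDLING):
    """Hypothetical look-ahead check: schedule only if the trigger
    falls within the next `window` counted from `now`."""
    return now <= trigger_time <= now + window


trigger = datetime(2023, 9, 28, 0, 0, 0)                   # job fires at midnight
first_wake = datetime(2023, 9, 27, 23, 59, 29)             # 31 s before the trigger
second_wake = first_wake + IDLING + timedelta(seconds=2)   # 30 s sleep + 2 s overhead

# Neither check sees the trigger: the first is too early, the second too late.
missed = not any(
    look_ahead_should_schedule(t, trigger) for t in (first_wake, second_wake)
)
```

In this model `missed` ends up `True`: the trigger falls into the gap between the two look-ahead windows.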

This PR changes the approach to look-behind (schedule jobs NO EARLIER THAN THEY SHOULD RUN). Example:

Let's say it's now 11:59:29 PM and jobs trigger every 24 hours at exactly 12:00:00 AM. Our server wakes up every 30 seconds.

With the look-behind approach, when the service wakes up it checks whether any job should have been scheduled since the last time it woke up. So if the server wakes up after 30 seconds, it checks whether any jobs were due in exactly the last 30 seconds. The same holds if it wakes up with a lag: after 35, 40, 77 or any other number of seconds. This eliminates the missed-schedule problem caused by assuming a fixed wake-up interval.
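A rough sketch of the look-behind check, under the assumption that a job is due when its trigger time falls in the half-open interval (last wake-up, this wake-up]. The function names are hypothetical and the real code works from a cron schedule rather than a single fixed trigger.

```python
from datetime import datetime, timedelta


def look_behind_should_schedule(last_wake_up, this_wake_up, trigger_time):
    """Hypothetical look-behind check: the trigger is due if it fell
    after the previous wake-up and no later than the current one."""
    return last_wake_up < trigger_time <= this_wake_up


def count_schedules(first_wake, lags, trigger_time):
    """Simulate successive wake-ups separated by `lags` seconds and count
    how many times the trigger gets scheduled."""
    last = first_wake
    scheduled = 0
    for lag in lags:
        this = last + timedelta(seconds=lag)
        if look_behind_should_schedule(last, this, trigger_time):
            scheduled += 1
        last = this
    return scheduled


trigger = datetime(2023, 9, 28, 0, 0, 0)
first_wake = datetime(2023, 9, 27, 23, 59, 29)

on_time = count_schedules(first_wake, [30, 32, 31], trigger)
lagging = count_schedules(first_wake, [77, 30], trigger)
```

Because consecutive intervals share their boundaries, the trigger lands in exactly one of them regardless of how long each sleep actually took, so both `on_time` and `lagging` come out as 1.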

Additionally, IMO it makes sense to always use the look-behind approach: for capacity planning and sync scheduling it's always better that a sync starts a bit late than a bit early, especially if the sync process is multistep (e.g. before the sync happens, operators of 3rd-party systems perform some step, for example turning off replication and waiting).

Checklists

Pre-Review Checklist

this PR has a meaningful title
this PR links to all relevant github issues that it fixes or partially addresses
if there is no GH issue, please create it. Each PR should have a link to an issue
this PR has a thorough description
Covered the changes with automated tests
Tested the changes locally
Added a label for each target release version (example: v7.13.2, v7.14.0, v8.0.0)
Considered corresponding documentation changes

Release Note

Fixed a problem where jobs were sometimes not scheduled correctly.

Changed the scheduling mechanism to be look-behind: jobs are now scheduled no earlier than their expected time, as opposed to no later than their expected time before.

artem-shelkovnikov · 2023-09-27T11:25:54Z

Gonna do a couple CI runs in the meantime, but also test locally and think more about the problem

Removed empty line (was there to test)

artem-shelkovnikov · 2023-10-04T10:40:35Z

connectors/protocol/connectors.py

@@ -604,7 +604,7 @@ async def heartbeat(self, interval):
            self.log_debug("Sending heartbeat")
            await self.index.heartbeat(doc_id=self.id)

-    def next_sync(self, job_type):
+    def next_sync(self, job_type, now):


This is a major change. Before, "now" was arbitrary: the connector wakes up at 11:59:59, but by the time it checks the schedule it's already 12:00:01 and the moment of scheduling is lost.
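A small illustration of why the reference time matters. `next_daily_run` is a hypothetical stand-in for the cron-based calculation (it only handles one fixed time of day); the point is that the same schedule gives different answers depending on which "now" is passed in.

```python
from datetime import datetime, time, timedelta


def next_daily_run(at, now):
    """Hypothetical helper: first daily occurrence of `at` strictly after `now`."""
    candidate = datetime.combine(now.date(), at)
    if candidate <= now:
        candidate += timedelta(days=1)
    return candidate


midnight = time(0, 0, 0)
woke_up = datetime(2023, 9, 27, 23, 59, 59)   # moment the connector woke up
checked = datetime(2023, 9, 28, 0, 0, 1)      # moment the schedule is evaluated

from_wake = next_daily_run(midnight, woke_up)    # 2023-09-28 00:00:00
from_check = next_daily_run(midnight, checked)   # 2023-09-29 00:00:00
```

Evaluating against the wake-up time still finds tonight's run, while evaluating two seconds later already jumps to the following day. Passing `now` explicitly pins the calculation to one agreed moment.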

artem-shelkovnikov · 2023-10-04T10:41:43Z

TODO: still add tests, but wanted to open this PR for discussion

artem-shelkovnikov · 2023-10-04T10:45:28Z

connectors/services/job_scheduling.py

@@ -39,6 +39,7 @@ def __init__(self, config):
        self.source_list = config["sources"]
        self.connector_index = None
        self.sync_job_index = None
+        self.last_wake_up_time = datetime.utcnow()


What this means is that if the server starts 1 second after something should have been scheduled, it won't be scheduled. It's probably fine, but we could also use datetime.utcnow() - timedelta(seconds=IDLING) to still execute jobs that should have started up to IDLING seconds ago.
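The two start-up options can be compared with a sketch like the one below. The helper names and the `catch_up` flag are hypothetical, and the 30-second `IDLING` value is assumed for illustration.

```python
from datetime import datetime, timedelta

IDLING = 30  # seconds between scheduling checks (value assumed for illustration)


def initial_last_wake_up(start_time, catch_up, idling=IDLING):
    """Hypothetical helper: pick the starting value of last_wake_up_time."""
    if catch_up:
        return start_time - timedelta(seconds=idling)
    return start_time


def is_due(last_wake_up, this_wake_up, trigger):
    """Look-behind check over the interval (last_wake_up, this_wake_up]."""
    return last_wake_up < trigger <= this_wake_up


trigger = datetime(2023, 9, 28, 0, 0, 0)
start = trigger + timedelta(seconds=1)            # server starts 1 s too late
first_check = start + timedelta(seconds=IDLING)

strict = is_due(initial_last_wake_up(start, catch_up=False), first_check, trigger)
lenient = is_due(initial_last_wake_up(start, catch_up=True), first_check, trigger)
```

With the start time used as-is (`strict`), the job that was due one second before start-up is not picked up. Backdating by IDLING seconds (`lenient`) does pick it up, at the cost of possibly running a job that was due just before a restart.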

timgrein

Looks good from a functional perspective. Left some comments around two print statements and a variable assignment.

timgrein · 2023-10-04T11:06:48Z

connectors/services/job_scheduling.py

+        print(
+            f"Last time woke up at {last_wake_up_time}, woke up at {this_wake_up_time}"
+        )


Suggested change

print(
    f"Last time woke up at {last_wake_up_time}, woke up at {this_wake_up_time}"
)

I guess this was used for local debugging?

Yup, need to change it to logger.debug!

timgrein · 2023-10-04T11:07:37Z

connectors/services/job_scheduling.py

                connector.log_debug(
                    f"A scheduled '{job_type_value}' sync is created by another connector instance, skipping..."
                )
                return False

            try:
-                next_sync = connector.next_sync(job_type)
+                next_sync = connector.next_sync(job_type, last_wake_up_time)
+                print(f"Next sync is at {next_sync}")


Suggested change

print(f"Next sync is at {next_sync}")

Probably also a debug statement? Or should we change this to use logger?

Yup, good call

timgrein · 2023-10-04T11:11:05Z

connectors/services/job_scheduling.py

@@ -167,29 +170,40 @@ async def _run(self):
                await self.sync_job_index.close()
        return 0

-    async def _scheduled_sync(self, connector, job_type):
+    async def _try_schedule_sync(self, connector, job_type):
+        last_wake_up_time = self.last_wake_up_time


Question: Do we need this assignment? Is the rationale to shorten the name throughout the method by removing the self prefix?

It's just personal taste: since it's used several times, I put it in a local variable and use it further down. Purely personal style, I'm not opposed to inlining it.

seanstory · 2023-10-04T16:37:53Z

IMO it makes sense to always use look-behind approach: it's always better that the sync starts a bit later than a bit earlier when doing capacity planning and scheduling for the syncs

Agree, and I like this approach/idea

timgrein

LGTM, nice change 👏

github-actions · 2023-10-06T08:54:11Z

💔 Failed to create backport PR(s)

Status	Branch	Result
✅	8.9	#1738
❌	8.8	Commit could not be cherrypicked due to conflicts
✅	8.10	#1739
✅	8.11	#1740

Successful backport PRs will be merged automatically after passing CI.

To backport manually run:
backport --pr 1689 --autoMerge --autoMergeMethod squash

artem-shelkovnikov added 2 commits September 27, 2023 11:50

Minor renames

3c72d1f

Make it tidy

169477d

github-actions bot added auto-backport v8.11.0.0 labels Sep 27, 2023

artem-shelkovnikov and others added 14 commits September 28, 2023 11:36

let's see in CI

7046f39

Make autoformat

592b1e5

Add some more prints

5069c61

Make autoformat

1cc2919

Build this plz

8f3fd49

Build this plz

ae86dd8

Build this plz

6ad6b16

Build this plz

71d0fc0

Build this plz

7bb3495

Move things around

d0e4d06

Fixit

3fd7343

Autoformat

6c2943e

Flip the logic around - run job AFTER the cron trigger time

609fd39

Update Dockerfile.ftest

41c652d

Removed empty line (was there to test)

artem-shelkovnikov added the v8.10.0 label Oct 4, 2023

artem-shelkovnikov marked this pull request as ready for review October 4, 2023 10:41

artem-shelkovnikov requested review from a team and removed request for a team October 4, 2023 10:41

artem-shelkovnikov changed the title ~~WIP: Artem/fix unreliable scheduling~~ Fix unreliable job scheduling by making it look-behind instead of look-ahead Oct 4, 2023

Merge branch 'main' into artem/fix-unreliable-scheduling

a86fd0d

timgrein reviewed Oct 4, 2023

artem-shelkovnikov and others added 3 commits October 5, 2023 11:33

Make prints into log debug statements instead

6e7b8e1

Swap two statements

0038964

Merge branch 'main' into artem/fix-unreliable-scheduling

bd9d843

artem-shelkovnikov requested a review from a team October 5, 2023 10:12

timgrein approved these changes Oct 5, 2023

artem-shelkovnikov added v8.9.0 v8.8.0 labels Oct 6, 2023

artem-shelkovnikov merged commit 8ff4c18 into main Oct 6, 2023
2 checks passed

artem-shelkovnikov deleted the artem/fix-unreliable-scheduling branch October 6, 2023 08:53

github-actions bot mentioned this pull request Oct 6, 2023

[8.9] Fix unreliable job scheduling by making it look-behind instead of look-ahead (#1689) #1738

Merged

github-actions bot pushed a commit that referenced this pull request Oct 6, 2023

Fix unreliable job scheduling by making it look-behind instead of loo…

01cd92c

…k-ahead (#1689)

github-actions bot mentioned this pull request Oct 6, 2023

[8.10] Fix unreliable job scheduling by making it look-behind instead of look-ahead (#1689) #1739

Merged

github-actions bot pushed a commit that referenced this pull request Oct 6, 2023

Fix unreliable job scheduling by making it look-behind instead of loo…

b120fbb

…k-ahead (#1689)

github-actions bot mentioned this pull request Oct 6, 2023

[8.11] Fix unreliable job scheduling by making it look-behind instead of look-ahead (#1689) #1740

Merged

github-actions bot pushed a commit that referenced this pull request Oct 6, 2023

Fix unreliable job scheduling by making it look-behind instead of loo…

76bc31a

…k-ahead (#1689)

artem-shelkovnikov added a commit that referenced this pull request Oct 6, 2023

Fix unreliable job scheduling by making it look-behind instead of loo…

a57ee0e

…k-ahead (#1689) (#1740) Co-authored-by: Artem Shelkovnikov <[email protected]>

artem-shelkovnikov added a commit that referenced this pull request Oct 6, 2023

Fix unreliable job scheduling by making it look-behind instead of loo…

21fe434

…k-ahead (#1689) (#1739) Co-authored-by: Artem Shelkovnikov <[email protected]>

artem-shelkovnikov added a commit that referenced this pull request Oct 6, 2023

Fix unreliable job scheduling by making it look-behind instead of loo…

d5e9eb7

…k-ahead (#1689) (#1738) Co-authored-by: Artem Shelkovnikov <[email protected]>

artem-shelkovnikov mentioned this pull request Apr 17, 2024

Scheduled sync jobs trigger once per availability zone #2351

Closed
