@sdodson sdodson commented Sep 18, 2025

The goal being to ensure build-mechanical and the downstream build-node-image jobs finish before we typically see content changed, which tends to be EMEA to US East AM hours. In situations where we're hoping to pick up late changes this may cause delays. If that becomes an issue I'd propose we consider adding a second run at 17:00 UTC (12:00 EST + 5hr, or 13:00 EDT), but ideally we'd limit that to only versions where we hope to have rapid turnaround. Those versions would be 9.6 today, then 9.6, 9.8, and 10.2 starting approximately May 2026.
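
The UTC offset for US Eastern time depends on daylight saving, so a fixed-time run such as 17:00 UTC lands at a different local hour across the year. A minimal sketch of the conversion, assuming fixed offsets of UTC-4 for EDT and UTC-5 for EST; `to_local` is a hypothetical helper, not part of the pipeline:

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets for US Eastern time; which one applies depends on the date.
EDT = timezone(timedelta(hours=-4), "EDT")
EST = timezone(timedelta(hours=-5), "EST")


def to_local(hour_utc: int, tz: timezone) -> str:
    """Return the local HH:MM for a whole hour given in UTC."""
    # The calendar date is arbitrary because the offsets above are fixed.
    moment = datetime(2025, 9, 18, hour_utc, 0, tzinfo=timezone.utc)
    return moment.astimezone(tz).strftime("%H:%M")


print(to_local(17, EDT))  # 13:00
print(to_local(17, EST))  # 12:00
```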

@dustymabe
Member

This time was originally chosen (1da5e54) based on when Fedora composes would finish for rawhide.

@dustymabe
Member

> ensure build-mechanical and the downstream build-node-image jobs finish before we typically see content changed

why? wouldn't you want to wait until after new content was available?

@sdodson
Author

sdodson commented Sep 18, 2025

/hold
no idea if bots exist here to enforce that or not but trying to avoid merging while discussion is ongoing

@jlebon
Member

jlebon commented Sep 18, 2025

We could make the cron spec a global pipecfg knob if it comes to it.

But that said, if we want perfect timing, obviously it's possible to trigger a job whenever content changes. I think ART today watches RHEL repos already, and also knows how to trigger jobs, so they're well-positioned for this.

@sdodson
Author

sdodson commented Sep 18, 2025

All of the last 10 rhel-9.6 jobs showing up in Jenkins were triggered by timer or me. I can check to see why we're not getting builds triggered on change by ART for at least rhel-9.6.

> This time was originally chosen (1da5e54) based on when Fedora composes would finish for rawhide.

The RHCOS jenkins doesn't seem to have any Fedora related jobs, I assume some other pipeline instance is running the same jobs against Fedora?

> why? wouldn't you want to wait until after new content was available?

I'm OK with any timing that we think yields fewer failed builds and content that's no more than 24 hours old as of Wednesday at 16:00 UTC, with RHEL builds associated with GA versions of OCP being the highest priority within the set, both in terms of success and freshness.

@dustymabe
Member

> The RHCOS jenkins doesn't seem to have any Fedora related jobs, I assume some other pipeline instance is running the same jobs against Fedora?

Yeah. We have an upstream pipeline, but share the same code that powers the pipeline.

I'm not opposed to setting the times based on the pipeline, just trying to understand the requirements.

@jlebon
Member

jlebon commented Sep 18, 2025

> All of the last 10 rhel-9.6 jobs showing up in Jenkins were triggered by timer or me. I can check to see why we're not getting builds triggered on change by ART for at least rhel-9.6.

To be clear I wasn't implying that ART is doing this today. But that it could.

@dustymabe
Member

> > All of the last 10 rhel-9.6 jobs showing up in Jenkins were triggered by timer or me. I can check to see why we're not getting builds triggered on change by ART for at least rhel-9.6.
>
> To be clear I wasn't implying that ART is doing this today. But that it could.

I'd actually prefer ART didn't. Since RHEL content shouldn't really change more than once a day, is it really needed?

@jlebon
Member

jlebon commented Sep 18, 2025

> > > All of the last 10 rhel-9.6 jobs showing up in Jenkins were triggered by timer or me. I can check to see why we're not getting builds triggered on change by ART for at least rhel-9.6.
> >
> > To be clear I wasn't implying that ART is doing this today. But that it could.
>
> I'd actually prefer ART didn't. Since RHEL content shouldn't really change more than once a day, is it really needed?

Don't we today already trigger once a day? ISTM the ask here is to have better timing wrt RHEL, not necessarily have more builds. (This would actually be a reduction in builds if RHEL content changes less often than that... though I'm not sure in practice how often that happens.) But I guess if we go this way, we should probably also ask ART to cap the rate as well in case it sometimes churns faster for some reason.

@sdodson
Author

sdodson commented Sep 30, 2025

Just summarizing.

Right now build-mechanical is only triggered by timer and it initiates builds for c10s, c9s, rhel-10.1 and rhel-9.6, in that order. It's never initiated by ART. Builds of c10s, c9s, rhel-10.1 and rhel-9.6 are pretty much exclusively triggered by build-mechanical or a human replaying a failure. I guess the Fedora pipelines run as well, but I don't know about those.

build-mechanical has a median runtime of six hours and a maximum runtime of 13 hours. We should expect that runtime to grow whenever we add 9.8 and 10.2, though we'll remove 10.1 at roughly the same time. There are almost always changes to c10s and c9s Tuesday through Saturday, which means that the rhel-10.1 or rhel-9.6 builds typically start at 13:00-14:00 UTC.

build-node-image IS triggered by ART whenever OCP content changes, potentially multiple times a day.

I'm not sure how we can please everyone with just one knob to set the start time across two pipelines and three OSes' publishing schedules. Do we know if moving four hours earlier would be too early for Fedora? It seems like C9S composes are complete by 04:00, C10S composes by 04:30, Rawhide by 5:30. RHEL is a bit of a wildcard unfortunately, but let's assume it's generally done by 16:00 UTC.

With just one knob and no further changes I'd probably propose moving things two hours later in hopes of RHEL pushes having completed by the time we start RHEL builds but still with enough time to complete the process before the daily cut-off. Any appetite for that change? An alternative would be to try to go two hours earlier but I'd like to try later first.

A single knob controls both FCOS and RHCOS pipelines here. C9S, C10S,
and Rawhide composes seem to run between 04:00 and 05:00 UTC. Meanwhile
RHEL content is generally pushed 09:00 to 16:00 UTC. If there are
changes, each build takes approximately 2.5 hours, meaning that with a
12:00 UTC start RHEL builds would start at 17:00 UTC and should have
enough time to complete by end of day cut-offs, though it'll be tight.

This will likely require additional tweaking in the future when we add
10.2 and 9.8.
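
The timing above can be checked with a small worked example. This is only a sketch under stated assumptions: builds run strictly one after another in the order c10s, c9s, rhel-10.1, rhel-9.6, every stream has changes, and each build takes exactly 2.5 hours. The names `BUILD_ORDER` and `schedule` are hypothetical and do not come from the pipeline code.

```python
from datetime import datetime, timedelta, timezone

# Assumed trigger order and per-build duration (hours), taken from the discussion.
BUILD_ORDER = ["c10s", "c9s", "rhel-10.1", "rhel-9.6"]
BUILD_HOURS = 2.5


def schedule(start_hour_utc, order=BUILD_ORDER, hours=BUILD_HOURS):
    """Map each stream to its (start, end) times as 'HH:MM' strings in UTC."""
    # The calendar date is arbitrary; only the time of day matters here.
    t = datetime(2025, 10, 1, start_hour_utc, tzinfo=timezone.utc)
    result = {}
    for stream in order:
        end = t + timedelta(hours=hours)
        result[stream] = (t.strftime("%H:%M"), end.strftime("%H:%M"))
        t = end  # the next build starts when this one finishes
    return result


for stream, (start, end) in schedule(12).items():
    print(f"{stream}: {start} -> {end}")
```

With a 12:00 UTC start, this puts the first RHEL build at 17:00 UTC and the last build finishing at 22:00 UTC, which is the "tight" margin described above.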
@sdodson force-pushed the build-mechcanical-0400UTC branch from 682dd21 to a39368a on September 30, 2025 19:51
@sdodson changed the title from "Move build-mechnical to 04:00 UTC" to "Move build-mechnical to 12:00 UTC" on Oct 1, 2025
@dustymabe
Member

> Rawhide by 5:30.

Typically the time when the compose completes isn't the time that directory was created. For today's rawhide build I'd use the timestamp on this file which says:

STATUS                  2025-10-01 07:01   20

> Do we know if moving four hours earlier would be too early for Fedora?

Another option here (I think) would be to change the order the c10s, c9s, rhel-10.1 and rhel-9.6 builds are triggered. If the rhel-* builds were to go first would that help achieve the goal here? I assume the punctuality here is more focused on the rhel-* builds and not so much the CentOS ones?
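
To illustrate what reordering would buy, here is a rough sketch that computes how long after the pipeline start each stream would begin. It assumes sequential builds at roughly 2.5 hours each (the figure from the commit message) and that every stream has changes; `start_offsets` is a hypothetical helper for illustration only.

```python
def start_offsets(order, hours_per_build=2.5):
    """Hours after pipeline start at which each stream's build begins."""
    return {stream: i * hours_per_build for i, stream in enumerate(order)}


current = start_offsets(["c10s", "c9s", "rhel-10.1", "rhel-9.6"])
rhel_first = start_offsets(["rhel-10.1", "rhel-9.6", "c10s", "c9s"])

for stream in ("rhel-10.1", "rhel-9.6"):
    print(f"{stream}: +{current[stream]}h today, +{rhel_first[stream]}h if RHEL goes first")
```

Under these assumptions the RHEL builds would start up to five hours sooner relative to the trigger time, at the cost of pushing the CentOS Stream builds later by the same amount.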

> With just one knob and no further changes I'd probably propose moving things two hours later in hopes of RHEL pushes having completed by the time we start RHEL builds but still with enough time to complete the process before the daily cut-off.

Yeah. Later by two hours seems fine. The only problem I see here is that it leans more heavily on more western timezone folks to investigate failures because it will be later in the day before failures trickle in (more towards the end of eastern TZ folks' day).

If you don't mind, can we clear up what the actual goal is? In your original description you mention:

> The goal being to ensure build-mechanical and the downstream build-node-image jobs finish before we typically see content changed, which tends to be EMEA to US East AM hours.

But I don't really understand why we wouldn't want to pick up the newest content sooner after it's available rather than later. I would think the goal would be "to have the newest content available by XYZ cutoff time on W day of the week". Can you put it in those terms?

FTR I'm not opposed to having multiple knobs here. It just makes things slightly more complicated.

4 participants