Integrated code lifecycle: Retry missing build jobs #11330
Merged: krusche merged 23 commits into develop from feature/integrated-code-lifecycle/retry-missing-jobs on Sep 19, 2025
Changes from 20 commits
Commits
fcbad94
retry missing jobs first
jfr2102 e4e858a
info log
jfr2102 5abadad
add extra schedule for retry of missing jobs
jfr2102 a76ea4c
delay first schedule
jfr2102 3ef9b35
retry only batch of jobs
jfr2102 3f58711
move missing job schedules to own service
jfr2102 d9e1d2a
try isolated ci tests
jfr2102 a19e894
resolve conflicts
jfr2102 1d305be
improve test
jfr2102 707b3eb
guard query
jfr2102 c49c3e5
null check submissionDate
jfr2102 dcde1f1
fix log order
jfr2102 743392a
improve retry logic
jfr2102 0e3206f
improve test
jfr2102 3ec393b
LocalCIException
jfr2102 071ebc0
config default max-missing-job-retries
jfr2102 9d47fd9
add composite indizes to buildJob table
jfr2102 44d64d2
Merge branch 'develop' into feature/integrated-code-lifecycle/retry-m…
jfr2102 1f039f2
Revert "add composite indizes to buildJob table"
jfr2102 68fa614
temp debug log
jfr2102 1e636cd
Revert "temp debug log"
jfr2102 31ea5d8
Merge branch 'develop' into feature/integrated-code-lifecycle/retry-m…
jfr2102 87b7537
Merge branch 'develop' into feature/integrated-code-lifecycle/retry-m…
jfr2102
158 changes: 158 additions & 0 deletions
...ain/java/de/tum/cit/aet/artemis/programming/service/localci/LocalCIMissingJobService.java
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,158 @@ | ||
package de.tum.cit.aet.artemis.programming.service.localci; | ||
|
||
import java.time.ZonedDateTime; | ||
import java.util.List; | ||
import java.util.concurrent.TimeUnit; | ||
|
||
import org.slf4j.Logger; | ||
import org.slf4j.LoggerFactory; | ||
import org.springframework.beans.factory.annotation.Value; | ||
import org.springframework.context.annotation.Lazy; | ||
import org.springframework.context.annotation.Profile; | ||
import org.springframework.data.domain.PageRequest; | ||
import org.springframework.data.domain.Pageable; | ||
import org.springframework.data.domain.Slice; | ||
import org.springframework.scheduling.annotation.Scheduled; | ||
import org.springframework.stereotype.Service; | ||
|
||
import de.tum.cit.aet.artemis.buildagent.dto.BuildJobQueueItem; | ||
import de.tum.cit.aet.artemis.exercise.repository.ParticipationRepository; | ||
import de.tum.cit.aet.artemis.programming.domain.ProgrammingExerciseParticipation; | ||
import de.tum.cit.aet.artemis.programming.domain.build.BuildJob; | ||
import de.tum.cit.aet.artemis.programming.domain.build.BuildStatus; | ||
import de.tum.cit.aet.artemis.programming.repository.BuildJobRepository; | ||
|
||
/** | ||
* Schedule service for detecting and retrying missing build jobs in the LocalCI system | ||
*/ | ||
@Lazy | ||
@Service | ||
@Profile("localci & scheduling") | ||
public class LocalCIMissingJobService { | ||
|
||
private static final Logger log = LoggerFactory.getLogger(LocalCIMissingJobService.class); | ||
|
||
private final BuildJobRepository buildJobRepository; | ||
|
||
private final LocalCITriggerService localCITriggerService; | ||
|
||
private final ParticipationRepository participationRepository; | ||
|
||
private final DistributedDataAccessService distributedDataAccessService; | ||
|
||
@Value("${artemis.continuous-integration.max-missing-job-retries:3}") | ||
private int maxMissingJobRetries; | ||
|
||
public LocalCIMissingJobService(BuildJobRepository buildJobRepository, LocalCITriggerService localCITriggerService, ParticipationRepository participationRepository, | ||
DistributedDataAccessService distributedDataAccessService) { | ||
this.buildJobRepository = buildJobRepository; | ||
this.localCITriggerService = localCITriggerService; | ||
this.participationRepository = participationRepository; | ||
this.distributedDataAccessService = distributedDataAccessService; | ||
} | ||
|
||
/** | ||
* Periodically checks the status of pending build jobs and updates their status if they are missing. | ||
* <p> | ||
* This scheduled task ensures that build jobs which are stuck in the QUEUED or BUILDING state for too long | ||
* are detected and marked as MISSING if their status cannot be verified. This helps prevent indefinite | ||
* waiting states due to external failures or inconsistencies in the CI system. | ||
* </p> | ||
* <p> | ||
* This mechanism is necessary because build jobs are managed externally, and various failure scenarios | ||
* can lead to jobs being lost without Artemis being notified: | ||
* </p> | ||
* <ul> | ||
* <li>The application crashes or restarts while a build job is queued</li> | ||
* <li>Network issues lead to Hazelcast data loss</li> | ||
* <li>A build agent crashes or is disconnected</li> | ||
* </ul> | ||
*/ | ||
@Scheduled(fixedRateString = "${artemis.continuous-integration.check-job-status-interval-seconds:300}", initialDelayString = "${artemis.continuous-integration.check-job-status-delay-seconds:60}", timeUnit = TimeUnit.SECONDS) | ||
public void checkPendingBuildJobsStatus() { | ||
log.debug("Checking pending build jobs status"); | ||
List<BuildJob> pendingBuildJobs = buildJobRepository.findAllByBuildStatusIn(List.of(BuildStatus.QUEUED, BuildStatus.BUILDING)); | ||
ZonedDateTime now = ZonedDateTime.now(); | ||
final int buildJobExpirationInMinutes = 5; // If a build job is older than 5 minutes and its status can't be determined, set it to missing | ||
|
||
var queuedJobs = distributedDataAccessService.getQueuedJobs(); | ||
var processingJobs = distributedDataAccessService.getProcessingJobIds(); | ||
|
||
for (BuildJob buildJob : pendingBuildJobs) { | ||
var submissionDate = buildJob.getBuildSubmissionDate(); | ||
if (submissionDate == null || submissionDate.isAfter(now.minusMinutes(buildJobExpirationInMinutes))) { | ||
log.debug("Build job with id {} is too recent to check", buildJob.getBuildJobId()); | ||
continue; | ||
} | ||
if (buildJob.getBuildStatus() == BuildStatus.QUEUED && checkIfBuildJobIsStillQueued(queuedJobs, buildJob.getBuildJobId())) { | ||
log.debug("Build job with id {} is still queued", buildJob.getBuildJobId()); | ||
continue; | ||
} | ||
if (checkIfBuildJobIsStillBuilding(processingJobs, buildJob.getBuildJobId())) { | ||
log.debug("Build job with id {} is still building", buildJob.getBuildJobId()); | ||
continue; | ||
} | ||
log.error("Build job with id {} is in an unknown state", buildJob.getBuildJobId()); | ||
// If the build job is in an unknown state, set it to missing and update the build start date | ||
buildJobRepository.updateBuildJobStatus(buildJob.getBuildJobId(), BuildStatus.MISSING); | ||
} | ||
} | ||
|
||
/** | ||
* Periodically retries missing build jobs. | ||
* Retrieves a slice of missing build jobs from the last hour and attempts to retry them. | ||
* If a build job has reached the maximum number of retries, it will not be retried again. | ||
*/ | ||
@Scheduled(fixedRateString = "${artemis.continuous-integration.retry-missing-jobs-interval-seconds:300}", initialDelayString = "${artemis.continuous-integration.retry-missing-jobs-delay-seconds:120}", timeUnit = TimeUnit.SECONDS) | ||
public void retryMissingJobs() { | ||
log.debug("Checking for missing build jobs to retry"); | ||
|
||
var start = System.currentTimeMillis(); | ||
Slice<BuildJob> missingJobsSlice = getMissingJobsToRetrySliceOfLastHour(50); | ||
log.debug("Retrieving missing jobs took {} ms", System.currentTimeMillis() - start); | ||
|
||
List<BuildJob> missingJobs = missingJobsSlice.getContent(); | ||
log.debug("Processing {} missing build jobs to retry", missingJobs.size()); | ||
|
||
for (BuildJob buildJob : missingJobs) { | ||
if (buildJob.getRetryCount() >= maxMissingJobRetries) { | ||
log.warn("Build job with id {} for participation {} has reached the maximum number of {} retries and will not be retried.", buildJob.getBuildJobId(), | ||
buildJob.getParticipationId(), maxMissingJobRetries); | ||
continue; | ||
} | ||
try { | ||
localCITriggerService.retryBuildJob(buildJob, (ProgrammingExerciseParticipation) participationRepository.findByIdElseThrow(buildJob.getParticipationId())); | ||
buildJobRepository.incrementRetryCount(buildJob.getBuildJobId()); | ||
} | ||
catch (Exception e) { | ||
log.error("Failed to retry build job with id {} for participation {}", buildJob.getBuildJobId(), buildJob.getParticipationId(), e); | ||
} | ||
} | ||
|
||
if (missingJobsSlice.hasNext()) { | ||
log.debug("There are more missing jobs to process in the next scheduled run."); | ||
} | ||
} | ||
|
||
private boolean checkIfBuildJobIsStillBuilding(List<String> processingJobIds, String buildJobId) { | ||
return processingJobIds.contains(buildJobId); | ||
} | ||
|
||
private boolean checkIfBuildJobIsStillQueued(List<BuildJobQueueItem> queuedJobs, String buildJobId) { | ||
return queuedJobs.stream().anyMatch(job -> job.id().equals(buildJobId)); | ||
} | ||
|
||
/** | ||
* Retrieves a slice of missing build jobs submitted within the last hour that do not have a newer job for the same participation. | ||
* | ||
* @param maxResults the maximum number of results to retrieve | ||
* @return a slice of missing build jobs | ||
*/ | ||
private Slice<BuildJob> getMissingJobsToRetrySliceOfLastHour(int maxResults) { | ||
Pageable pageable = PageRequest.of(0, maxResults); | ||
ZonedDateTime now = ZonedDateTime.now(); | ||
ZonedDateTime oneHourAgo = now.minusHours(1); | ||
return buildJobRepository.findMissingJobsToRetryInTimeRange(oneHourAgo, now, pageable); | ||
} | ||
} |
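
The decision logic in `checkPendingBuildJobsStatus` and the retry cap in `retryMissingJobs` can be tried out in isolation. The following is a minimal sketch, not the Artemis implementation: `MissingJobSketch`, its `Job` record, the `Status` enum, and the methods `isMissing` and `mayRetry` are hypothetical stand-ins, and the queue and processing lookups are reduced to plain sets of job ids. It assumes Java 16 or newer (records).

```java
import java.time.Duration;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.util.Set;

// Hypothetical, simplified model; these are NOT the Artemis classes from the diff.
final class MissingJobSketch {

    enum Status { QUEUED, BUILDING, MISSING }

    // Stand-in for the persisted BuildJob entity (assumed fields only).
    record Job(String id, Status status, ZonedDateTime submittedAt, int retryCount) {}

    /** Returns true if a pending job should be marked MISSING. */
    static boolean isMissing(Job job, Set<String> queuedIds, Set<String> processingIds,
                             ZonedDateTime now, Duration gracePeriod) {
        ZonedDateTime submittedAt = job.submittedAt();
        // Jobs without a submission date, or younger than the grace period, are skipped.
        if (submittedAt == null || submittedAt.isAfter(now.minus(gracePeriod))) {
            return false;
        }
        // A QUEUED job that is still present in the distributed queue is healthy.
        if (job.status() == Status.QUEUED && queuedIds.contains(job.id())) {
            return false;
        }
        // Otherwise the job is only healthy if a build agent reports it as processing.
        return !processingIds.contains(job.id());
    }

    /** Returns true if a MISSING job has not yet used up its retries. */
    static boolean mayRetry(Job job, int maxRetries) {
        return job.status() == Status.MISSING && job.retryCount() < maxRetries;
    }

    public static void main(String[] args) {
        ZonedDateTime now = ZonedDateTime.of(2025, 9, 19, 12, 0, 0, 0, ZoneOffset.UTC);
        Duration grace = Duration.ofMinutes(5);
        Job stale = new Job("a", Status.QUEUED, now.minusMinutes(10), 0);
        Job fresh = new Job("b", Status.QUEUED, now.minusMinutes(1), 0);
        Job exhausted = new Job("c", Status.MISSING, now.minusMinutes(10), 3);

        System.out.println(isMissing(stale, Set.of(), Set.of(), now, grace));    // true: old and unknown everywhere
        System.out.println(isMissing(stale, Set.of("a"), Set.of(), now, grace)); // false: still in the queue
        System.out.println(isMissing(fresh, Set.of(), Set.of(), now, grace));    // false: within the grace period
        System.out.println(mayRetry(exhausted, 3));                              // false: retry cap reached
    }
}
```

As in the diff, a job recorded as QUEUED that no longer appears in the queue but does appear among the processing jobs is treated as still building rather than missing.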