Add custom CI timeout to detect stuck jobs earlier #608
Change Summary
Added a custom timeout for the `Build` job of the `CI` workflow, based on historical data, which could save resources and speed up CI for the project.
More details
Over the last 568 successful runs, the `Build` job had a maximum runtime of 37 minutes (mean = 2 minutes, std = 2 minutes) across all matrix combinations.
However, some runs fail only after reaching the 6-hour limit that GitHub imposes. In other words, these jobs seem to get stuck, possibly for external or random reasons.
One such example is this job run, which failed after 6 hours; the full list of timed-out jobs is available below. With the proposed changes, a total of 11 hours would have been saved over the last six months, clearing the queue for other workflows, speeding up the project's CI, and saving resources in general 🌱.
The idea is to set a timeout to stop jobs that run much longer than their historical maximum, because such jobs are probably stuck and will simply fail after GitHub's timeout of 6 hours.
Our PR proposes to set the timeout to `max + 3*std = 43 minutes`, where `max` and `std` (standard deviation) are derived from the history of 568 successful runs. This provides sufficient margin if the workflow gets naturally slower in the future, but if you would prefer a lower or higher threshold, we would be happy to change it.
We propose the same for the `Update port sources` job of the `Update lockfiles` workflow, which has also experienced timeouts.
Note that the timeout applies to each matrix job individually, not to their sum, and overrides GitHub's default 6-hour timeout.
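The threshold arithmetic described above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical and not part of this PR, and because the description does not say whether the population or sample standard deviation was used, the sketch assumes the population form.

```python
import math
import statistics


def timeout_from_stats(max_minutes: float, std_minutes: float, k: float = 3.0) -> int:
    """Return max + k*std, rounded up to whole minutes."""
    return math.ceil(max_minutes + k * std_minutes)


def timeout_from_durations(durations_minutes: list[float], k: float = 3.0) -> int:
    """Derive the same threshold from raw runtimes of successful runs."""
    # Assumption: population standard deviation; the PR does not state which one was used.
    std = statistics.pstdev(durations_minutes)
    return timeout_from_stats(max(durations_minutes), std, k)


# Summary statistics quoted in this PR for the Build job: max = 37, std = 2.
print(timeout_from_stats(37, 2))  # 43
```

The resulting value is what would go into the job's `timeout-minutes` key in the workflow file; raising or lowering `k` is one way to pick a looser or tighter threshold.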
Recently timed-out runs:
05-Jan-2025 => timed-out run
05-Jan-2025 => timed-out run
18-Apr-2025 => timed-out run
Context
Hi,
We are a team of researchers from the University of Zurich, and we are currently working on energy optimizations in GitHub Actions workflows.
Thanks for your time on this and for your contribution to open-source software in general.
Feel free to let us know (here or at the email below) if you have any questions.
Best regards,
Konstantinos Kitsios
[email protected]