[ci] clean source directory at the beginning of every Azure DevOps build #6416
Conversation
I think that... maybe there's been a breaking change in Azure DevOps YAML syntax? It seems like nothing in the `bash:` steps is being executed, and none of the log statements I tried to write showed up in the logs.
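For reference, here's roughly what the two styles of script step look like in Azure DevOps YAML. This is an illustrative sketch, not copied from this repo's config; step names and scripts are made up:

```yaml
steps:
  # generic step using a `bash:` key -- the form that appeared
  # to silently stop executing
  - bash: |
      echo "hello from a generic bash step"
    displayName: 'generic bash step'

  # explicit Bash@3 task -- the form this PR switches to
  - task: Bash@3
    inputs:
      targetType: inline
      script: |
        echo "hello from a Bash@3 task"
    displayName: 'explicit Bash@3 task'
```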
Thanks @shiyu1994! I think maybe this is not related to the new agents at all, but to some other recent change in Azure DevOps (the service itself, not our configuration for this specific project)? I'll continue investigating.
😱 it looks like a CMake build directory is being left behind between builds! I added `ls -alF ./build` at the top of `.ci/test.sh` and was surprised to see output like this in the logs.
Looking at the "Initialize Containers" stage, I see Azure is doing this:

/usr/bin/docker create \
--name ubuntu-latest_ubuntu2204_07fc25 \
--label b0e97f \
--network vsts_network_277dfc2d2e79409fab81db297f5d5ed5 \
--name ci-container \
-v /usr/bin/docker:/tmp/docker:ro \
-v "/var/run/docker.sock":"/var/run/docker.sock" \
-v "/agent/_work/1":"/__w/1" \
-v "/agent/_work/_temp":"/__w/_temp" \
-v "/agent/_work/_tasks":"/__w/_tasks" \
-v "/agent/_work/_tool":"/__t" \
-v "/agent/externals":"/__a/externals":ro \
-v "/agent/_work/.taskkey":"/__w/.taskkey" \
ubuntu:22.04 \
"/__a/externals/node/bin/node" -e "setInterval(function(){}, 24 * 60 * 60 * 1000);" So I think this volume mount is the problem:
Azure DevOps is cloning the repo to
Here's an example of what that looked like in a previous, successful run from #6407 on April 9. Looks like all the same volume mounts, so I don't think the root cause is something like "Azure changed what is mounted by default on these runs".

/usr/bin/docker create \
--name linux-artifact-builder_lightgbmvstsagentmanylinux_2_28_x86_64_d292d1 \
--label b0e97f \
--network vsts_network_fe607b351cfb4d48ab85a0f9ba8dd62e \
-v "/var/run/docker.sock":"/var/run/docker.sock" \
-v "/agent/_work/1":"/__w/1" \
-v "/agent/_work/_temp":"/__w/_temp" \
-v "/agent/_work/_tasks":"/__w/_tasks" \
-v "/agent/_work/_tool":"/__t" \
-v "/agent/externals":"/__a/externals":ro \
-v "/agent/_work/.taskkey":"/__w/.taskkey" \
lightgbm/vsts-agent:manylinux_2_28_x86_64 \
"/__a/externals/node/bin/node" -e "setInterval(function(){}, 24 * 60 * 60 * 1000);" |
Alright @shiyu1994 I think I found the issue and have an approach that'll fix it 😁
@@ -65,13 +86,22 @@ jobs:
echo "##vso[task.prependpath]/usr/lib64/openmpi/bin"
echo "##vso[task.prependpath]$CONDA/bin"
displayName: 'Set variables'
- script: |
    git clean -d -f -x
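`git clean -d -f -x` removes untracked files (`-f`), untracked directories (`-d`), and even ignored files (`-x`), which makes it suitable for wiping leftover build artifacts while leaving tracked sources alone. A quick demonstration in a throwaway repository (paths and file names are illustrative):

```shell
#!/usr/bin/env sh
set -e
REPO="$(mktemp -d)"
cd "$REPO"
git init -q .
echo 'build/' > .gitignore
echo 'int main(void) { return 0; }' > main.c
git add .gitignore main.c
git -c user.email=ci@example.com -c user.name=ci commit -q -m 'initial'

# Simulate artifacts left behind by a previous CI run
# (build/ is ignored, so plain `git clean -d -f` would skip it):
mkdir -p build
touch build/CMakeCache.txt

git clean -d -f -x

[ ! -d build ] && echo "build directory removed"
[ -f main.c ] && echo "tracked file kept"
```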
@shiyu1994 we are only running one agent per machine in the Linux pool for Azure DevOps, right?

I saw that it's possible to run multiple (Azure Pipelines Agents docs). If we were running multiple agents on the same virtual machine, then destructive actions like this `git clean` might cause conflicts between different jobs.

@shiyu1994 could you please review this so we can unblock CI?
@jmoralez or @borchero could you review this? I would still like to get @shiyu1994 's opinion (especially on #6416 (comment)), but we don't have to wait on that to merge this. I really want to get CI working again here so we can merge already-approved stuff and so we don't lose momentum with other contributors. @mayer79 in particular has been stuck for weeks on #6397 and related work on the R package because of the long periods of CI unavailability here. I think these changes would be easy to revert and the risk of being wrong about them is very low.
Approving to unblock, but I don't have any experience with Azure DevOps. Hopefully @shiyu1994 can review as well.
Thanks @jmoralez! I'm very confident in this change, and even more confident that it'd be easy to revert if @shiyu1994 comes back to this conversation and sees some issue with the changes. I'm going to merge this and start getting the other already-approved PRs merged.
This pull request has been automatically locked since there has not been any recent activity since it was closed.
Related to #6316.

#6407 appeared to fix the CI issues with the self-hosted agents that are used for all of the `Linux *` CI jobs running on Azure DevOps. But I've been seeing significant failures on those runners over the last few days. Not just the "docker not found" issue detailed at the top of #6316, but some others as well.

From investigation in this PR, it seems to me that there are 2 issues:

1. Using a `bash:` key in a generic task in Azure DevOps YAML results in the script not being executed (#6416 (comment))
2. Build artifacts (like `CMakeCache.txt`) are present in the working directory of subsequent runs (#6416 (comment))

This fixes those things in the following ways:

1. switching to the `Bash@3` task from Azure DevOps (docs)
2. running `git clean` in the source directory of every run

It also adds a `set -e -E -o pipefail` at the top of some of the CI scripts, to make debugging of such issues a bit easier in the future. More complete make-the-scripts-stricter work is in progress over in #6266.

Notes for Reviewers
What exactly caused this?
I don't know.
I don't see anything that seems relevant in the release notes from the April 10, 2024 Azure DevOps release (the most recent one as of this writing): https://learn.microsoft.com/en-us/azure/devops/release-notes/2024/sprint-237-update.
I saw several fully-successful runs after we switched to the new pool of runners, e.g. on #6407.
I found the root cause of "there are build artifacts being left behind" by adding a `ls -alF ./build` at the top of `.ci/test.sh` here. You can see that files are being left behind by looking at one of the "Clean source directory" tasks in Azure DevOps for Linux jobs. (build link)
LightGBM's CI jobs write files into the same directory the source code is checked out to, and that directory's contents are preserved on the agent across runs via a Docker volume mount. I'm only observing that on Linux jobs that run on self-hosted runners, so I think the root cause must be specific to those runners.
The timing makes me think this is somehow related to the introduction of the new set of runners in #6407. For example, maybe the previous pool of runners had some cleanup logic that ran between jobs that isn't there on the new runners.