
Conversation

Contributor

Copilot AI commented Dec 10, 2025

  • Understand the problem: Need to create a tool to run DistributedApplicationTests in a loop to reproduce CI hangs
  • Explore existing test infrastructure and tools
  • Create RunTestsInLoop tool with configurable iterations, timeout, and filtering
  • Add comprehensive README documentation
  • Build and verify the tool works correctly
  • Run code review and address feedback
  • Run CodeQL security check
  • Add trx file and log analysis when tests fail or timeout
  • Add sample test output showing the hanging test issue for analysis
  • Add Docker-in-Docker (DinD) support to copilot-setup-steps workflow

The tool successfully reproduces the CI hang. The tests hang during DCP operations with Polly.Timeout.TimeoutRejectedException because the DCP operations don't complete within the 20-second timeout. The root cause was Docker not being available in the Copilot agent container environment; a DinD service was added to enable container operations.
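For context, a minimal sketch of what a loop runner like this does, assuming the real RunTestsInLoop tool works along these lines (the iteration count, timeout, test filter, and `dotnet test` arguments below are illustrative, not the tool's actual interface):

```csharp
// Minimal sketch of the loop-runner core; option values and arguments are
// illustrative, not the actual RunTestsInLoop interface.
using System;
using System.Diagnostics;

int iterations = 10;
TimeSpan perRunTimeout = TimeSpan.FromMinutes(5);
string filter = "DistributedApplicationTests";

for (int i = 1; i <= iterations; i++)
{
    Console.WriteLine($"=== Iteration {i}/{iterations} ===");

    using var process = Process.Start(new ProcessStartInfo
    {
        FileName = "dotnet",
        Arguments = $"test tests/Aspire.Hosting.Tests --filter {filter} --report-trx",
        RedirectStandardOutput = true,
    })!;

    // Drain output asynchronously so the child cannot block on a full pipe.
    var stdout = process.StandardOutput.ReadToEndAsync();

    if (!process.WaitForExit((int)perRunTimeout.TotalMilliseconds))
    {
        Console.WriteLine($"Timeout after {perRunTimeout}; killing hung test process.");
        process.Kill(entireProcessTree: true); // also kills any child processes
        break;
    }

    Console.WriteLine(await stdout);
}
```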

Original prompt

Run DistributedApplicationTests in a loop and try to reproduce the hangs we often see in CI runs



Copilot AI changed the title from "[WIP] Run DistributedApplicationTests in a loop to identify CI hangs" to "Add RunTestsInLoop tool to reproduce CI test hangs" Dec 10, 2025
Copilot AI requested a review from davidfowl December 10, 2025 11:16
@davidfowl
Member

@copilot now run DistributedApplicationTests in a loop

Contributor Author

Copilot AI commented Dec 10, 2025

@copilot now run DistributedApplicationTests in a loop

Ran DistributedApplicationTests in a loop. Found a hang on the first iteration:

Test: Aspire.Hosting.Tests.DistributedApplicationTests.AfterResourcesCreatedLifecycleHookWorks
Status: TIMEOUT after 5 minutes

The test hung during DCP object creation, with the stack trace showing it was stuck in DcpExecutor.CreateAllDcpObjectsAsync. The tool detected and killed the hung process after the configured 5-minute timeout.

Output showed the test was stuck at [+1/x9/?0] progress (1 passed, 9 remaining) when the timeout triggered.
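A sketch of how the hanging test can be inferred from console output when no trx file is written. The "Running <test>" / "<outcome> <test>" patterns and the reading of the [+N/xN/?N] marker are assumptions based on the run above, not the exact Microsoft.Testing.Platform output format:

```csharp
// Sketch: guess the hanging test from captured console output.
// Log patterns below are assumptions, not the real output format.
using System.Linq;
using System.Text.RegularExpressions;

static (string? Marker, string? LikelyHangingTest) AnalyzeConsoleOutput(string output)
{
    // Last progress marker seen, e.g. "[+1/x9/?0]".
    string? marker = Regex.Matches(output, @"\[\+\d+/x\d+/\?\d+\]")
        .Select(m => m.Value)
        .LastOrDefault();

    // Tests reported as started, minus tests reported as finished.
    var started = Regex.Matches(output, @"Running\s+(\S+)")
        .Select(m => m.Groups[1].Value);
    var finished = Regex.Matches(output, @"(?:passed|failed|skipped)\s+(\S+)")
        .Select(m => m.Groups[1].Value)
        .ToHashSet();

    string? hanging = started.LastOrDefault(t => !finished.Contains(t));
    return (marker, hanging);
}
```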

@davidfowl
Member

@copilot When a test fails because of a timeout, look at the trx file produced and analyze the logs.

Contributor Author

Copilot AI commented Dec 10, 2025

@copilot When a test fails because of a timeout, look at the trx file produced and analyze the logs.

Added trx file and log analysis in commit a25ab67. The tool now:

  1. Enables --report-trx to generate trx files during test runs
  2. Identifies the likely hanging test from console output progress markers
  3. Parses trx files (when available) to show passed/failed/in-progress test counts and details
  4. Shows paths to log files in artifacts/testresults/loop-runner/

Note: When tests time out, the trx file may not be written because the process is killed. In those cases, the tool still identifies the hanging test from the console progress output.
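A sketch of the trx-parsing step, assuming a standard VSTest trx layout (TestRun / Results / UnitTestResult with testName and outcome attributes); error handling and the extra detail the real tool prints are omitted:

```csharp
// Sketch: summarize a trx file by outcome and list failed tests.
using System;
using System.Linq;
using System.Xml.Linq;

static void SummarizeTrx(string trxPath)
{
    XNamespace ns = "http://microsoft.com/schemas/VisualStudio/TeamTest/2010";
    var results = XDocument.Load(trxPath)
        .Descendants(ns + "UnitTestResult")
        .Select(r => (Name: (string?)r.Attribute("testName"),
                      Outcome: (string?)r.Attribute("outcome") ?? "Unknown"))
        .ToList();

    foreach (var group in results.GroupBy(r => r.Outcome))
        Console.WriteLine($"{group.Key}: {group.Count()}");

    foreach (var failed in results.Where(r => r.Outcome == "Failed"))
        Console.WriteLine($"  failed: {failed.Name}");
}
```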

@davidfowl
Member

@copilot Now re-run the tests

Contributor Author

Copilot AI commented Dec 10, 2025

@copilot Now re-run the tests

Re-ran DistributedApplicationTests with the updated tool. Hang reproduced on iteration 1/10:

⏳ Likely hanging test: Aspire.Hosting.Tests.DistributedApplicationTests.AfterResourcesCreatedLifecycleHookWorks
   Progress: [+1/x9/?0]

📁 Log files (1):
    artifacts/testresults/loop-runner/Aspire.Hosting.Tests_net8.0_x64.log

The test hung in DcpExecutor.CreateAllDcpObjectsAsync → KubernetesService.ExecuteWithRetry during DCP object creation. The stack trace shows it was stuck in Polly retry logic.

The hang is highly reproducible: it occurred on the first iteration both times the tests were run.

@github-actions
Contributor

🚀 Dogfood this PR with:

⚠️ WARNING: Do not do this without first carefully reviewing the code of this PR to satisfy yourself it is safe.

curl -fsSL https://raw.githubusercontent.com/dotnet/aspire/main/eng/scripts/get-aspire-cli-pr.sh | bash -s -- 13462

Or

  • Run remotely in PowerShell:
iex "& { $(irm https://raw.githubusercontent.com/dotnet/aspire/main/eng/scripts/get-aspire-cli-pr.ps1) } 13462"

@davidfowl
Member

@copilot Re-run the test and capture all of the logs and analyze them

Contributor Author

Copilot AI commented Dec 10, 2025

@copilot Re-run the test and capture all of the logs and analyze them

Ran test with --verbose to capture all logs. Key findings:

Hanging test: Aspire.Hosting.Tests.DistributedApplicationTests.AfterResourcesCreatedLifecycleHookWorks
Progress: [+1/x9/?0] (1 passed, 9 remaining)

Root cause: Polly.Timeout.TimeoutRejectedException - DCP operation timed out after 20 seconds

Polly.Timeout.TimeoutRejectedException: The operation didn't complete within the allowed timeout of '00:00:20'.

Call stack:

  1. OrchestratorHostService.StartAsync
  2. ApplicationOrchestrator.RunApplicationAsync
  3. DcpExecutor.RunApplicationAsync
  4. DcpExecutor.CreateAllDcpObjectsAsync ← hangs here
  5. KubernetesService.ExecuteWithRetry (Polly retry loop)

Log sequence before failure:

  • DcpExecutor Debug: Stopped watching Container resources
  • DcpExecutor Debug: Stopped watching ContainerExec resources
  • Host Error: Hosting failed to start

The test is hanging during DCP resource creation in CreateAllDcpObjectsAsync. The Polly retry policy keeps retrying until the 5-minute test timeout triggers. This happens consistently on iteration 1.
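The failure mode can be illustrated with a small Polly sketch. This is not Aspire's actual DCP client code; it only shows how a 20-second per-attempt timeout around an operation that never completes surfaces as TimeoutRejectedException, and how an outer retry loop keeps the test alive until the harness timeout kills it:

```csharp
// Sketch only: reproduces the shape of the failure, not the real DCP client.
// Requires the Polly NuGet package (v8 ResiliencePipeline API).
using System;
using System.Threading.Tasks;
using Polly;
using Polly.Retry;
using Polly.Timeout;

var pipeline = new ResiliencePipelineBuilder()
    .AddRetry(new RetryStrategyOptions
    {
        // Retry on the timeout, mimicking KubernetesService.ExecuteWithRetry;
        // with effectively unlimited attempts, this only stops when the outer
        // 5-minute harness timeout kills the process.
        ShouldHandle = new PredicateBuilder().Handle<TimeoutRejectedException>(),
        MaxRetryAttempts = int.MaxValue,
        Delay = TimeSpan.FromSeconds(1),
    })
    .AddTimeout(TimeSpan.FromSeconds(20)) // inner per-attempt timeout, as seen in the logs
    .Build();

await pipeline.ExecuteAsync(async ct =>
{
    // Stand-in for a DCP object-creation call that never completes
    // when Docker is unavailable (-1 = infinite delay).
    await Task.Delay(-1, ct);
});
```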

@davidfowl
Member

@copilot Run the tests and commit the trx to source control for analysis

Contributor Author

Copilot AI commented Dec 11, 2025

@copilot Run the tests and commit the trx to source control for analysis

Committed sample test output in d298189.

Note: No trx file is generated because the tests hang indefinitely on DCP operations and never complete. The test framework can only write trx files when tests finish running.

The committed file tools/RunTestsInLoop/sample-output/test-run-output.txt (3793 lines) contains:

  • Full verbose test output
  • DCP initialization logs
  • The Polly.Timeout.TimeoutRejectedException stack traces
  • Timestamps showing the 20-second DCP timeout triggering repeatedly

Key finding: The first test, VerifyContainerArgs, fails immediately with a DCP timeout. Then AfterResourcesCreatedLifecycleHookWorks hangs in an infinite retry loop until our 5-minute timeout kills the process.

@davidfowl
Member

@copilot We need to update the github action to make sure docker is running

Contributor Author

Copilot AI commented Dec 11, 2025

@copilot We need to update the github action to make sure docker is running

Added a Docker verification step to the copilot-setup-steps workflow in 74fd096. The workflow now runs docker info after checkout to ensure Docker is running before the agent starts, following the same pattern used in run-tests.yml.
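The same kind of pre-flight check could also run locally before the test loop; the helper below is hypothetical and not part of this PR, it simply runs `docker info` and fails fast with a clear message instead of letting the tests hang in DCP retries:

```csharp
// Hypothetical pre-flight check (not in this PR): verify the Docker daemon is
// reachable before starting the test loop.
using System;
using System.Diagnostics;

static bool IsDockerAvailable()
{
    try
    {
        using var docker = Process.Start(new ProcessStartInfo
        {
            FileName = "docker",
            Arguments = "info",
            RedirectStandardOutput = true,
            RedirectStandardError = true,
        })!;
        docker.WaitForExit(10_000);
        return docker.HasExited && docker.ExitCode == 0; // non-zero => daemon unreachable
    }
    catch (Exception)
    {
        return false; // docker CLI not installed
    }
}

if (!IsDockerAvailable())
{
    Console.Error.WriteLine("Docker is not available; container-based tests will hang in DCP retries.");
    Environment.Exit(1);
}
```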

@davidfowl
Member

@copilot we need docker in docker

Contributor Author

Copilot AI commented Dec 11, 2025

@copilot we need docker in docker

Added Docker-in-Docker support in 481b6ba. The workflow now:

  • Runs a docker:dind service container with privileged mode
  • Sets DOCKER_HOST=tcp://localhost:2375 to connect to the DinD service
  • Includes a wait loop (up to 60s) to ensure Docker is ready before proceeding

@davidfowl davidfowl closed this Dec 11, 2025
@dotnet-policy-service dotnet-policy-service bot added this to the 13.2 milestone Dec 11, 2025