Report crashing OTEL process cleanly with proper status reporting #11448
Conversation
Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)
swiatekm
left a comment
The logic looks good to me, but I have some concerns about deadlocks in the otel manager caused by trying to report status from the main loop.
I ran the collector with the following configuration, which should not run:
outputs:
default:
type: elasticsearch
hosts: [127.0.0.1:9200]
api_key: "example-key"
#username: "elastic"
#password: "changeme"
preset: balanced
otel:
exporter:
      not_a_setting: true
It does not run, as expected, and the collector exits with the following output. This looks a lot like a multi-line error, but we manage to grab the last line, which at least contains the configuration key name:
{"log.level":"info","@timestamp":"2025-11-27T19:32:59.575Z","message":"failed to get config: cannot unmarshal the configuration: decoding failed due to the following error(s):","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2025-11-27T19:32:59.575Z","message":"'exporters' error reading configuration for \"elasticsearch/_agent-component/monitoring\": decoding failed due to the following error(s):","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2025-11-27T19:32:59.575Z","message":"'' decoding failed due to the following error(s):","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2025-11-27T19:32:59.575Z","message":"'' has invalid keys: not_a_setting","ecs.version":"1.6.0"}
❯ sudo elastic-development-agent status
┌─ fleet
│ └─ status: (STOPPED) Not enrolled into Fleet
└─ elastic-agent
├─ status: (DEGRADED) 1 or more components/units in a failed state
├─ beat/metrics-monitoring
│ ├─ status: (FAILED) FAILED
│ ├─ beat/metrics-monitoring
│ │ └─ status: (FAILED) '' has invalid keys: not_a_setting
│ └─ beat/metrics-monitoring-metrics-monitoring-beats
│ └─ status: (FAILED) '' has invalid keys: not_a_setting
├─ filestream-monitoring
│ ├─ status: (FAILED) FAILED
│ ├─ filestream-monitoring
│ │ └─ status: (FAILED) '' has invalid keys: not_a_setting
│ └─ filestream-monitoring-filestream-monitoring-agent
│ └─ status: (FAILED) '' has invalid keys: not_a_setting
├─ http/metrics-monitoring
│ ├─ status: (FAILED) FAILED
│ ├─ http/metrics-monitoring
│ │ └─ status: (FAILED) '' has invalid keys: not_a_setting
│ └─ http/metrics-monitoring-metrics-monitoring-agent
│ └─ status: (FAILED) '' has invalid keys: not_a_setting
├─ prometheus/metrics-monitoring
│ ├─ status: (FAILED) FAILED
│ ├─ prometheus/metrics-monitoring
│ │ └─ status: (FAILED) '' has invalid keys: not_a_setting
│ └─ prometheus/metrics-monitoring-metrics-monitoring-collector
│ └─ status: (FAILED) '' has invalid keys: not_a_setting
└─ extensions
├─ status: StatusFatalError ['' has invalid keys: not_a_setting]
└─ extension:healthcheckv2/bcc9882f-6e23-4ec6-b6a7-0abecb4c2ded
└─ status: StatusFatalError ['' has invalid keys: not_a_setting]
That looks correct enough, except that we leak the healthcheck extension status verbatim (but not beatsauth or the diagnostics extension), which we probably shouldn't do (or at least it is inconsistent).
💛 Build succeeded, but was flaky
cc @blakerouse
swiatekm
left a comment
👍
Still see the healthcheck extension get output when there is a failure, which I don't think we should do:
Let's file that as a separate issue, because it exists even before this PR.
@Mergifyio backport 8.19 9.1 9.2
✅ Backports have been created
…1448)

* Work on better error handling on failure of otel component.
* Add skeleton for handling this.
* Work on the otel config to status translation.
* implement that mapping
* Finish implementation.
* Add changelog.
* Fix race condition.
* Cleanups from code review.
* Fix formatting.
* Duh.

(cherry picked from commit 3182df5)

# Conflicts:
# internal/pkg/otel/manager/manager.go
Filed an issue for hiding the extension: #11714
…1448)

* Work on better error handling on failure of otel component.
* Add skeleton for handling this.
* Work on the otel config to status translation.
* implement that mapping
* Finish implementation.
* Add changelog.
* Fix race condition.
* Cleanups from code review.
* Fix formatting.
* Duh.

(cherry picked from commit 3182df5)

# Conflicts:
# internal/pkg/otel/manager/common.go
# internal/pkg/otel/manager/common_test.go
# internal/pkg/otel/manager/execution_subprocess.go
# internal/pkg/otel/manager/manager.go
# internal/pkg/otel/manager/manager_test.go
# internal/pkg/otel/manager/testing/testing.go
…1448)

* Work on better error handling on failure of otel component.
* Add skeleton for handling this.
* Work on the otel config to status translation.
* implement that mapping
* Finish implementation.
* Add changelog.
* Fix race condition.
* Cleanups from code review.
* Fix formatting.
* Duh.

(cherry picked from commit 3182df5)
…1448) (#11716)

* Work on better error handling on failure of otel component.
* Add skeleton for handling this.
* Work on the otel config to status translation.
* implement that mapping
* Finish implementation.
* Add changelog.
* Fix race condition.
* Cleanups from code review.
* Fix formatting.
* Duh.

(cherry picked from commit 3182df5)

Co-authored-by: Blake Rouse <[email protected]>
…oper status reporting (#11713)

* Report crashing OTEL process cleanly with proper status reporting (#11448)
* Work on better error handling on failure of otel component.
* Add skeleton for handling this.
* Work on the otel config to status translation.
* implement that mapping
* Finish implementation.
* Add changelog.
* Fix race condition.
* Cleanups from code review.
* Fix formatting.
* Duh.

(cherry picked from commit 3182df5)

# Conflicts:
# internal/pkg/otel/manager/manager.go

* Fix merge.
* go mod tidy.
* Fix notice.

---------

Co-authored-by: Blake Rouse <[email protected]>
What does this PR do?
At the moment, when a spawned OTEL subprocess fails, it is just reported as exit code 1. This provides no information about what failed and marks the entire Elastic Agent as failed.
This changes that behavior by looking at the actual output to determine the issue and reporting a proper component status for the entire configuration. It does this by parsing the configuration and building its own aggregated status for the component graph, matching what the healthcheckv2 extension would have returned had it run successfully. It inspects the error message to determine whether the error can be correlated to a specific component in the graph; if it cannot, it falls back to reporting the error on all components.
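The correlation idea can be sketched as follows. This is a minimal illustration, not the PR's actual implementation: the `Status` type, `correlateError` function, and component IDs are all hypothetical names, and the real code works on the collector's component graph rather than a flat list.

```go
package main

import (
	"fmt"
	"strings"
)

// Status is a hypothetical per-component status in the aggregated graph.
type Status struct {
	ComponentID string
	State       string // e.g. "FAILED" or "HEALTHY"
	Message     string
}

// correlateError builds a status for every component. If the collector's
// error output mentions a specific component ID, only that component is
// marked failed; otherwise the error is reported on all components.
func correlateError(componentIDs []string, errOutput string) []Status {
	// Grab the last non-empty line: multi-line decode errors tend to end
	// with the most specific message (e.g. the offending config key).
	lines := strings.Split(strings.TrimSpace(errOutput), "\n")
	last := lines[len(lines)-1]

	var matched []string
	for _, id := range componentIDs {
		if strings.Contains(errOutput, id) {
			matched = append(matched, id)
		}
	}
	if len(matched) == 0 {
		// Cannot correlate: fall back to failing every component.
		matched = componentIDs
	}

	statuses := make([]Status, 0, len(componentIDs))
	for _, id := range componentIDs {
		st := Status{ComponentID: id, State: "HEALTHY"}
		for _, m := range matched {
			if m == id {
				st.State = "FAILED"
				st.Message = last
			}
		}
		statuses = append(statuses, st)
	}
	return statuses
}

func main() {
	ids := []string{"elasticsearch/_agent-component/monitoring", "filestream-monitoring"}
	out := "cannot unmarshal the configuration:\n'exporters' error reading configuration for \"elasticsearch/_agent-component/monitoring\":\n'' has invalid keys: not_a_setting"
	for _, s := range correlateError(ids, out) {
		fmt.Printf("%s: %s %s\n", s.ComponentID, s.State, s.Message)
	}
}
```

Run against the error output shown earlier in this thread, only the Elasticsearch exporter component is marked failed with the `has invalid keys` line, while unrelated components stay healthy; an opaque error fails everything.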
Why is it important?
The Elastic Agent needs to provide clean status reporting even when the subprocess fails to run. It also must not mark the entire Elastic Agent as failed when that happens.
Checklist
[ ] I have made corresponding changes to the documentation
[ ] I have made corresponding changes to the default configuration files
[ ] I have added an entry in ./changelog/fragments using the changelog tool
[ ] I have added an integration test or an E2E test (covered by unit tests)
Disruptive User Impact
None
How to test this PR locally
Use either an invalid OTEL configuration or one that will fail to start. Observe that when running `elastic-agent run` with that OTEL configuration in the `elastic-agent.yml` file (aka Hybrid Mode), `elastic-agent status --output=full` provides correct state information.
Related issues