Conversation

@blakerouse
Contributor

@blakerouse blakerouse commented Nov 26, 2025

What does this PR do?

At the moment, when a spawned OTEL subprocess fails, it is reported only as exit code 1. That provides no information about what failed and marks the entire Elastic Agent as failed.

This changes that behavior by inspecting the actual output to determine the issue and report a proper component status for the entire configuration. It does this by parsing the configuration and building its own aggregated status for the component graph, matching what the healthcheckv2 extension would have returned if it could run successfully. It then inspects the error message to determine whether the error can be correlated to a specific component in the graph; if it cannot, it falls back to reporting the error on all components (sketched below).
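
A minimal sketch of that correlation idea, with illustrative names only (this is not the PR's actual API): components whose IDs appear in the collector's error output are marked failed individually; otherwise every component in the graph is marked failed.

package sketch

import "strings"

// componentStatus is an illustrative stand-in for one entry in the
// aggregated status described above.
type componentStatus struct {
	ID      string // e.g. "elasticsearch/_agent-component/monitoring"
	Failed  bool
	Message string
}

// correlateError marks only the components whose IDs appear in the
// collector's error output; if none match, it falls back to reporting
// the error on every component in the graph.
func correlateError(componentIDs []string, errOutput string) []componentStatus {
	var matched []componentStatus
	for _, id := range componentIDs {
		if strings.Contains(errOutput, id) {
			matched = append(matched, componentStatus{ID: id, Failed: true, Message: errOutput})
		}
	}
	if len(matched) > 0 {
		return matched
	}
	// Fallback: the error could not be tied to a specific component.
	all := make([]componentStatus, 0, len(componentIDs))
	for _, id := range componentIDs {
		all = append(all, componentStatus{ID: id, Failed: true, Message: errOutput})
	}
	return all
}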

Why is it important?

The Elastic Agent needs to provide clean status reporting even when the subprocess fails to run, and it must not mark the entire Elastic Agent as failed when that happens.

Checklist

  • I have read and understood the pull request guidelines of this project.
  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • [ ] I have made corresponding changes to the documentation
  • [ ] I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool
  • [ ] I have added an integration test or an E2E test (covered by unit tests)

Disruptive User Impact

None

How to test this PR locally

Use either an invalid OTEL configuration or one that will fail to start, such as the example below. Observe that when running elastic-agent run with that OTEL configuration in the elastic-agent.yml file (aka Hybrid Mode), elastic-agent status --output=full provides correct state information.
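
For example, an elastic-agent.yml with an unknown otel exporter key (adapted from the configuration used during review, quoted further down) will fail to start:

outputs:
  default:
    type: elasticsearch
    hosts: [127.0.0.1:9200]
    api_key: "example-key"
    preset: balanced
    otel:
      exporter:
        not_a_setting: true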

Related issues

@blakerouse blakerouse self-assigned this Nov 26, 2025
@blakerouse blakerouse requested a review from a team as a code owner November 26, 2025 16:41
@blakerouse blakerouse added the Team:Elastic-Agent-Control-Plane and backport-active-all labels Nov 26, 2025
@elasticmachine
Contributor

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@cmacknz cmacknz requested a review from swiatekm November 26, 2025 18:57
Contributor

@swiatekm swiatekm left a comment


The logic looks good to me, but I have some concerns about deadlocks in the otel manager caused by trying to report status from the main loop.
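
For illustration, the general deadlock shape being flagged looks roughly like this (hypothetical types, not the manager's actual code):

package sketch

type statusUpdate struct{ msg string }

// run both produces status updates and is the only goroutine that drains
// them. The blocking send inside the first case can never complete,
// because the drain case lives in the same (now blocked) loop.
func run(work <-chan string, updates chan statusUpdate) {
	for {
		select {
		case ev := <-work:
			updates <- statusUpdate{msg: ev} // blocks the loop itself
		case <-updates:
			// drain path; unreachable while the send above is blocked
		}
	}
}

The usual ways out are a non-blocking send, a buffered channel with drop semantics, or handing status reporting to a dedicated goroutine.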

Member

@cmacknz cmacknz left a comment


I ran the collector with the following configuration which should not run:

outputs:
  default:
    type: elasticsearch
    hosts: [127.0.0.1:9200]
    api_key: "example-key"
    #username: "elastic"
    #password: "changeme"
    preset: balanced
    otel:
      exporter:
        not_a_setting: true

It does not run, as expected, and the collector exits with the following output. It looks a lot like a multi-line error, but we manage to grab the last line, which at least contains the configuration key name:

{"log.level":"info","@timestamp":"2025-11-27T19:32:59.575Z","message":"failed to get config: cannot unmarshal the configuration: decoding failed due to the following error(s):","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2025-11-27T19:32:59.575Z","message":"'exporters' error reading configuration for \"elasticsearch/_agent-component/monitoring\": decoding failed due to the following error(s):","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2025-11-27T19:32:59.575Z","message":"'' decoding failed due to the following error(s):","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2025-11-27T19:32:59.575Z","message":"'' has invalid keys: not_a_setting","ecs.version":"1.6.0"}
❯ sudo elastic-development-agent status
┌─ fleet
│  └─ status: (STOPPED) Not enrolled into Fleet
└─ elastic-agent
   ├─ status: (DEGRADED) 1 or more components/units in a failed state
   ├─ beat/metrics-monitoring
   │  ├─ status: (FAILED) FAILED
   │  ├─ beat/metrics-monitoring
   │  │  └─ status: (FAILED) '' has invalid keys: not_a_setting
   │  └─ beat/metrics-monitoring-metrics-monitoring-beats
   │     └─ status: (FAILED) '' has invalid keys: not_a_setting
   ├─ filestream-monitoring
   │  ├─ status: (FAILED) FAILED
   │  ├─ filestream-monitoring
   │  │  └─ status: (FAILED) '' has invalid keys: not_a_setting
   │  └─ filestream-monitoring-filestream-monitoring-agent
   │     └─ status: (FAILED) '' has invalid keys: not_a_setting
   ├─ http/metrics-monitoring
   │  ├─ status: (FAILED) FAILED
   │  ├─ http/metrics-monitoring
   │  │  └─ status: (FAILED) '' has invalid keys: not_a_setting
   │  └─ http/metrics-monitoring-metrics-monitoring-agent
   │     └─ status: (FAILED) '' has invalid keys: not_a_setting
   ├─ prometheus/metrics-monitoring
   │  ├─ status: (FAILED) FAILED
   │  ├─ prometheus/metrics-monitoring
   │  │  └─ status: (FAILED) '' has invalid keys: not_a_setting
   │  └─ prometheus/metrics-monitoring-metrics-monitoring-collector
   │     └─ status: (FAILED) '' has invalid keys: not_a_setting
   └─ extensions
      ├─ status: StatusFatalError ['' has invalid keys: not_a_setting]
      └─ extension:healthcheckv2/bcc9882f-6e23-4ec6-b6a7-0abecb4c2ded
         └─ status: StatusFatalError ['' has invalid keys: not_a_setting]

That looks correct enough, except that we leak the healthcheck extension status verbatim (but not beatsauth or the diagnostics extension), which we probably shouldn't do (or at least it is inconsistent).
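
For reference, the last-line extraction described above might look roughly like this (an assumed helper, not necessarily the PR's actual implementation):

package sketch

import "strings"

// lastLine returns the final line of the collector's error output; in the
// example above the configuration key name ("not_a_setting") survives
// there even though the full error spans several log lines.
func lastLine(output string) string {
	lines := strings.Split(strings.TrimSpace(output), "\n")
	return strings.TrimSpace(lines[len(lines)-1])
}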

@pchila pchila removed their request for review December 5, 2025 07:16
@blakerouse
Contributor Author

@cmacknz @swiatekm Thanks for the reviews. I have updated this PR to resolve all your comments.

@elasticmachine
Contributor

elasticmachine commented Dec 8, 2025

💛 Build succeeded, but was flaky

Failed CI Steps

History

cc @blakerouse

swiatekm previously approved these changes Dec 9, 2025
Contributor

@swiatekm swiatekm left a comment


👍

@cmacknz
Member

cmacknz commented Dec 9, 2025

I still see the healthcheck extension get output when there is a failure, which I don't think we should do:

❯ sudo elastic-development-agent status
┌─ fleet
│  └─ status: (STOPPED) Not enrolled into Fleet
└─ elastic-agent
   ├─ status: (DEGRADED) 1 or more components/units in a failed state
   ├─ beat/metrics-monitoring
   │  ├─ status: (FAILED) Fatal: '' has invalid keys: not_a_setting
   │  ├─ beat/metrics-monitoring
   │  │  └─ status: (FAILED) Fatal: '' has invalid keys: not_a_setting
   │  └─ beat/metrics-monitoring-metrics-monitoring-beats
   │     └─ status: (FAILED) Fatal: '' has invalid keys: not_a_setting
   ├─ filestream-monitoring
   │  ├─ status: (FAILED) Fatal: '' has invalid keys: not_a_setting
   │  ├─ filestream-monitoring
   │  │  └─ status: (FAILED) Fatal: '' has invalid keys: not_a_setting
   │  └─ filestream-monitoring-filestream-monitoring-agent
   │     └─ status: (FAILED) Fatal: '' has invalid keys: not_a_setting
   ├─ http/metrics-monitoring
   │  ├─ status: (FAILED) Fatal: '' has invalid keys: not_a_setting
   │  ├─ http/metrics-monitoring
   │  │  └─ status: (FAILED) Fatal: '' has invalid keys: not_a_setting
   │  └─ http/metrics-monitoring-metrics-monitoring-agent
   │     └─ status: (FAILED) Fatal: '' has invalid keys: not_a_setting
   ├─ prometheus/metrics-monitoring
   │  ├─ status: (FAILED) Fatal: '' has invalid keys: not_a_setting
   │  ├─ prometheus/metrics-monitoring
   │  │  └─ status: (FAILED) Fatal: '' has invalid keys: not_a_setting
   │  └─ prometheus/metrics-monitoring-metrics-monitoring-collector
   │     └─ status: (FAILED) Fatal: '' has invalid keys: not_a_setting
   └─ extensions
      ├─ status: StatusFatalError ['' has invalid keys: not_a_setting]
      └─ extension:healthcheckv2/04d1d7a2-5a25-4834-bf8d-cd715d1d80a5
         └─ status: StatusFatalError ['' has invalid keys: not_a_setting]
~/Downloads/builds/elastic-agent-9.3.0-SNAPSHOT-darwin-aarch64 ······································ 02:01:06 PM
❯ sudo elastic-development-agent version
Binary: 9.3.0-SNAPSHOT (build: af48211496d454f5f68c57e5ff8b7228056128e2 at 2025-12-09 18:58:25 +0000 UTC)
Daemon: 9.3.0-SNAPSHOT (build: af48211496d454f5f68c57e5ff8b7228056128e2 at 2025-12-09 18:58:25 +0000 UTC)

@blakerouse
Contributor Author

Still see the healthcheck extension get output when there is a failure which I don't think we should do:

[…]

Let's file that as a separate issue, because that behavior predates this PR.
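
The follow-up fix might be as simple as filtering internal extensions out of the user-facing status tree before rendering it, something like this (illustrative names only, not the agent's actual status types):

package sketch

import "strings"

// statusNode is an illustrative stand-in for an entry in the status tree.
type statusNode struct {
	ID       string
	Children []statusNode
}

// filterInternal recursively drops internal plumbing such as the
// healthcheckv2 extension from the tree shown by elastic-agent status.
func filterInternal(nodes []statusNode) []statusNode {
	out := make([]statusNode, 0, len(nodes))
	for _, n := range nodes {
		if strings.HasPrefix(n.ID, "extension:healthcheckv2/") {
			continue
		}
		n.Children = filterInternal(n.Children)
		out = append(out, n)
	}
	return out
}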

@blakerouse blakerouse merged commit 3182df5 into elastic:main Dec 10, 2025
22 checks passed
@github-actions
Contributor

@Mergifyio backport 8.19 9.1 9.2

@mergify
Contributor

mergify bot commented Dec 10, 2025

backport 8.19 9.1 9.2

✅ Backports have been created

mergify bot pushed a commit that referenced this pull request Dec 10, 2025
…1448)

* Work on better error handling on failure of otel component.

* Add skeleton for handling this.

* Work on the otel config to status translation.

* implement that mapping

* Finish implementation.

* Add changelog.

* Fix race condition.

* Cleanups from code review.

* Fix formatting.

* Duh.

(cherry picked from commit 3182df5)

# Conflicts:
#	internal/pkg/otel/manager/manager.go
@blakerouse
Copy link
Contributor Author

Filed an issue for hiding the extension #11714

mergify bot pushed a commit that referenced this pull request Dec 10, 2025
…1448)

* Work on better error handling on failure of otel component.

* Add skeleton for handling this.

* Work on the otel config to status translation.

* implement that mapping

* Finish implementation.

* Add changelog.

* Fix race condition.

* Cleanups from code review.

* Fix formatting.

* Duh.

(cherry picked from commit 3182df5)

# Conflicts:
#	internal/pkg/otel/manager/common.go
#	internal/pkg/otel/manager/common_test.go
#	internal/pkg/otel/manager/execution_subprocess.go
#	internal/pkg/otel/manager/manager.go
#	internal/pkg/otel/manager/manager_test.go
#	internal/pkg/otel/manager/testing/testing.go
mergify bot pushed a commit that referenced this pull request Dec 10, 2025
…1448)

* Work on better error handling on failure of otel component.

* Add skeleton for handling this.

* Work on the otel config to status translation.

* implement that mapping

* Finish implementation.

* Add changelog.

* Fix race condition.

* Cleanups from code review.

* Fix formatting.

* Duh.

(cherry picked from commit 3182df5)
@blakerouse blakerouse deleted the fix-11173 branch December 10, 2025 14:27
blakerouse added a commit that referenced this pull request Dec 11, 2025
…1448) (#11716)

* Work on better error handling on failure of otel component.

* Add skeleton for handling this.

* Work on the otel config to status translation.

* implement that mapping

* Finish implementation.

* Add changelog.

* Fix race condition.

* Cleanups from code review.

* Fix formatting.

* Duh.

(cherry picked from commit 3182df5)

Co-authored-by: Blake Rouse <[email protected]>
blakerouse added a commit that referenced this pull request Dec 12, 2025
…oper status reporting (#11713)

* Report crashing OTEL process cleanly with proper status reporting (#11448)

* Work on better error handling on failure of otel component.

* Add skeleton for handling this.

* Work on the otel config to status translation.

* implement that mapping

* Finish implementation.

* Add changelog.

* Fix race condition.

* Cleanups from code review.

* Fix formatting.

* Duh.

(cherry picked from commit 3182df5)

# Conflicts:
#	internal/pkg/otel/manager/manager.go

* Fix merge.

* go mod tidy.

* Fix notice.

---------

Co-authored-by: Blake Rouse <[email protected]>