Report crashing OTEL process cleanly with proper status reporting #11448
Conversation
Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)
swiatekm
left a comment
The logic looks good to me, but I have some concerns about deadlocks in the otel manager caused by trying to report status from the main loop.
I ran the collector with the following configuration, which should not run:
outputs:
default:
type: elasticsearch
hosts: [127.0.0.1:9200]
api_key: "example-key"
#username: "elastic"
#password: "changeme"
preset: balanced
otel:
exporter:
      not_a_setting: true
It does not run, as expected, and the collector exits with the following output. This looks a lot like a multi-line error, but we manage to grab the last line, which at least contains the configuration key name:
{"log.level":"info","@timestamp":"2025-11-27T19:32:59.575Z","message":"failed to get config: cannot unmarshal the configuration: decoding failed due to the following error(s):","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2025-11-27T19:32:59.575Z","message":"'exporters' error reading configuration for \"elasticsearch/_agent-component/monitoring\": decoding failed due to the following error(s):","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2025-11-27T19:32:59.575Z","message":"'' decoding failed due to the following error(s):","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2025-11-27T19:32:59.575Z","message":"'' has invalid keys: not_a_setting","ecs.version":"1.6.0"}
❯ sudo elastic-development-agent status
┌─ fleet
│ └─ status: (STOPPED) Not enrolled into Fleet
└─ elastic-agent
├─ status: (DEGRADED) 1 or more components/units in a failed state
├─ beat/metrics-monitoring
│ ├─ status: (FAILED) FAILED
│ ├─ beat/metrics-monitoring
│ │ └─ status: (FAILED) '' has invalid keys: not_a_setting
│ └─ beat/metrics-monitoring-metrics-monitoring-beats
│ └─ status: (FAILED) '' has invalid keys: not_a_setting
├─ filestream-monitoring
│ ├─ status: (FAILED) FAILED
│ ├─ filestream-monitoring
│ │ └─ status: (FAILED) '' has invalid keys: not_a_setting
│ └─ filestream-monitoring-filestream-monitoring-agent
│ └─ status: (FAILED) '' has invalid keys: not_a_setting
├─ http/metrics-monitoring
│ ├─ status: (FAILED) FAILED
│ ├─ http/metrics-monitoring
│ │ └─ status: (FAILED) '' has invalid keys: not_a_setting
│ └─ http/metrics-monitoring-metrics-monitoring-agent
│ └─ status: (FAILED) '' has invalid keys: not_a_setting
├─ prometheus/metrics-monitoring
│ ├─ status: (FAILED) FAILED
│ ├─ prometheus/metrics-monitoring
│ │ └─ status: (FAILED) '' has invalid keys: not_a_setting
│ └─ prometheus/metrics-monitoring-metrics-monitoring-collector
│ └─ status: (FAILED) '' has invalid keys: not_a_setting
└─ extensions
├─ status: StatusFatalError ['' has invalid keys: not_a_setting]
└─ extension:healthcheckv2/bcc9882f-6e23-4ec6-b6a7-0abecb4c2ded
└─ status: StatusFatalError ['' has invalid keys: not_a_setting]
That looks correct enough, except that we leak the healthcheck extension status verbatim (but not beatsauth or the diagnostics extension), which we probably shouldn't do (or at least it is inconsistent).
💛 Build succeeded, but was flaky
cc @blakerouse
swiatekm
left a comment
👍
Still see the healthcheck extension get output when there is a failure, which I don't think we should do:
Let's file that as a separate issue, because it exists even before this PR.
@Mergifyio backport 8.19 9.1 9.2
✅ Backports have been created
…1448)

* Work on better error handling on failure of otel component.
* Add skeleton for handling this.
* Work on the otel config to status translation.
* implement that mapping
* Finish implementation.
* Add changelog.
* Fix race condition.
* Cleanups from code review.
* Fix formatting.
* Duh.

(cherry picked from commit 3182df5)

# Conflicts:
# internal/pkg/otel/manager/manager.go
Filed an issue for hiding the extension: #11714
…1448)

* Work on better error handling on failure of otel component.
* Add skeleton for handling this.
* Work on the otel config to status translation.
* implement that mapping
* Finish implementation.
* Add changelog.
* Fix race condition.
* Cleanups from code review.
* Fix formatting.
* Duh.

(cherry picked from commit 3182df5)

# Conflicts:
# internal/pkg/otel/manager/common.go
# internal/pkg/otel/manager/common_test.go
# internal/pkg/otel/manager/execution_subprocess.go
# internal/pkg/otel/manager/manager.go
# internal/pkg/otel/manager/manager_test.go
# internal/pkg/otel/manager/testing/testing.go
…1448)

* Work on better error handling on failure of otel component.
* Add skeleton for handling this.
* Work on the otel config to status translation.
* implement that mapping
* Finish implementation.
* Add changelog.
* Fix race condition.
* Cleanups from code review.
* Fix formatting.
* Duh.

(cherry picked from commit 3182df5)
…1448) (#11716)

* Work on better error handling on failure of otel component.
* Add skeleton for handling this.
* Work on the otel config to status translation.
* implement that mapping
* Finish implementation.
* Add changelog.
* Fix race condition.
* Cleanups from code review.
* Fix formatting.
* Duh.

(cherry picked from commit 3182df5)

Co-authored-by: Blake Rouse <[email protected]>
…oper status reporting (#11713)

* Report crashing OTEL process cleanly with proper status reporting (#11448)
* Work on better error handling on failure of otel component.
* Add skeleton for handling this.
* Work on the otel config to status translation.
* implement that mapping
* Finish implementation.
* Add changelog.
* Fix race condition.
* Cleanups from code review.
* Fix formatting.
* Duh.

(cherry picked from commit 3182df5)

# Conflicts:
# internal/pkg/otel/manager/manager.go

* Fix merge.
* go mod tidy.
* Fix notice.

---------

Co-authored-by: Blake Rouse <[email protected]>
What does this PR do?
At the moment, when a spawned OTEL subprocess fails, it is just reported as exit code 1. This provides no information about what failed and marks the entire Elastic Agent as failed.
This changes that behavior by looking at the actual output to determine the issue and reporting a proper component status for the entire configuration. It does this by parsing the configuration and building its own aggregated status for the component graph, matching what the healthcheckv2 extension would have returned had it run successfully. It inspects the error message to determine whether the error can be correlated to a specific component in the graph; if it cannot, it falls back to reporting the error on all components.
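The correlation idea can be sketched as follows. This is a minimal illustration, not the PR's actual implementation: the `Status` type, `correlateError` function, and component IDs are all hypothetical names, and the real code works on the collector's component graph rather than a flat list.

```go
package main

import (
	"fmt"
	"strings"
)

// Status is a hypothetical per-component status in the aggregated graph.
type Status struct {
	ComponentID string
	State       string // e.g. "FAILED" or "HEALTHY"
	Message     string
}

// correlateError builds a status for every component. If the collector's
// error output mentions a specific component ID, only that component is
// marked failed; otherwise the error is reported on all components.
func correlateError(componentIDs []string, errOutput string) []Status {
	// Grab the last non-empty line: multi-line decode errors tend to end
	// with the most specific message (e.g. the offending config key).
	lines := strings.Split(strings.TrimSpace(errOutput), "\n")
	last := lines[len(lines)-1]

	var matched []string
	for _, id := range componentIDs {
		if strings.Contains(errOutput, id) {
			matched = append(matched, id)
		}
	}
	if len(matched) == 0 {
		// Cannot correlate: fall back to failing every component.
		matched = componentIDs
	}

	statuses := make([]Status, 0, len(componentIDs))
	for _, id := range componentIDs {
		st := Status{ComponentID: id, State: "HEALTHY"}
		for _, m := range matched {
			if m == id {
				st.State = "FAILED"
				st.Message = last
			}
		}
		statuses = append(statuses, st)
	}
	return statuses
}

func main() {
	ids := []string{"elasticsearch/_agent-component/monitoring", "filestream-monitoring"}
	out := "cannot unmarshal the configuration:\n'exporters' error reading configuration for \"elasticsearch/_agent-component/monitoring\":\n'' has invalid keys: not_a_setting"
	for _, s := range correlateError(ids, out) {
		fmt.Printf("%s: %s %s\n", s.ComponentID, s.State, s.Message)
	}
}
```

Run against the error output shown earlier in this thread, only the Elasticsearch exporter component is marked failed with the `has invalid keys` line, while unrelated components stay healthy; an opaque error fails everything.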
Why is it important?
The Elastic Agent needs to provide clean status reporting even when the subprocess fails to run. It also must not mark the entire Elastic Agent as failed when that happens.
Checklist
[ ] I have made corresponding changes to the documentation
[ ] I have made corresponding changes to the default configuration files
[ ] I have added an entry in ./changelog/fragments using the changelog tool
[ ] I have added an integration test or an E2E test (covered by unit tests)
Disruptive User Impact
None
How to test this PR locally
Use either an invalid OTEL configuration or one that will fail to start. Observe that when running `elastic-agent run` with that OTEL configuration in the `elastic-agent.yml` file (aka Hybrid Mode), `elastic-agent status --output=full` provides correct state information.
Related issues