-
Notifications
You must be signed in to change notification settings - Fork 2.4k
fix(bug): memory leak in tracing client #4828
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: Cody Kaczynski <[email protected]>
4742b52 to
84f13fa
Compare
|
|
||
| // WrapWithTracing wraps an HTTP client's transport with tracing instrumentation. | ||
| // This should be called once when creating the client, not on every request. | ||
| func WrapWithTracing(client *http.Client) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could use sync.once here maybe... But also I don't fully love the need to call notify.WrapWithTracing from all notifiers. Is there a way to replace the injection with an injection that reuses a client, avoiding the memory over-use? I am OOO until next week so I can't try out options until then, and also @siavashs should be back then and we can look at options. Or we can submit this, and then look into improvements later, given the issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's fair. I also won't be able to try until next week but I'll mess around on Monday and see if I can avoid having to call it repeatedly 🙂
|
|
||
| // Inject trancing transport | ||
| client.Transport = tracing.Transport(client.Transport) | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you consider trying wrapping this call with a sync.Once (we can have one at the notify/utils level?) and see if tracing still works, without the leak, and we also avoid having to make the call in each notifier? Or alternatively should we have the wrapping happen in httpclient, err := commoncfg.NewClientFromConfig(*conf.HTTPConfig, "telegram", httpOpts...) or in a function wrapping that one, so we avoid forgetting to call it on a new notifier or something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the sync.Once approach in this case I think you would need a sync.Map variable on the package scope to track the sync.Once instances.
But wouldn't it be much simpler to add the tracing round tripper in NewClientFromConfig() in the first place?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only location outside of notifier implementations that would be impacted by changing NewClientFromConfig() would be the alertmanager client in amtool and I guess there it would basically be a no-op since no tracing provider is initialized.
Description
We (Akamai/Linode) upgraded our Alertmanager machines to v0.30.0 after some testing on December 18th, 2025 and started seeing the process being OOMKilled on December 23rd, 2025.
We investigated a bit and realized that there was a memory leak relating to the new distributed tracing feature, where each request instantiated a new Transport client.
Before
This is the heap from one of the machines we saw this on, running v0.30.0:
As you can see,
go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp.(*Transport).RoundTripis using ~50% of the heap.We found feat: add distributed tracing support was included in this release, and then noticed in
notify/util.gothat in therequestmethod, a new session was created for every request:To get around this, in this PR we're initializing a single client instead and just re-using it.
After
I'm testing it by spamming tens thousands of alerts, so memory allocation isn't necessarily what it is in a normal production workload, but anyways this is what the heap looks like after my change:
As you can see, much less! For what it's worth, we don't currently get traces from Alertmanager (yet), but it looks like my change didn't break anything:

Summary
We are now re-using a single Transport client for tracing instead of creating a new one for each request, resulting in a significant decrease in memory usage:
For what it's worth, the machine we tested on did not have any OOMKills within the last ~2-3 weeks that we've been testing v0.30.0. We only saw this in production, which are under a considerable amount more load.
I'm not sure what the urgency is here for other folks or if anyone else is seeing similar behavior, but we've rolled back to v0.28.1 for now so not too big of a deal for us (although we would like to get back up to v0.30.x soon!)
Signed-off-by: Cody Kaczynski [email protected]