[elastic-agent] initial instrumentation #29031

Merged
merged 38 commits on Feb 1, 2022


stuartnelson3 (Contributor)

What does this PR do?

Adds initial tracing to elastic-agent. I looked through the code for low-hanging fruit, mainly HTTP handlers and clients that could be wrapped with tracing. I'd love for some of the EA devs to point out places I've missed where more tracing could be added, and ways to trigger these traces so they can be viewed in the UI.

todo:

  • more instrumenting
  • tests
  • how to configure the APM output? Currently apm-agent-go just uses its default values and sends data to the apm-server integration.
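On the configuration question, apm-agent-go reads its settings from ELASTIC_APM_* environment variables. When ELASTIC_APM_SERVER_URL is unset it sends to http://localhost:8200, which would explain why the defaults reach an apm-server integration running on the same host. Below is a minimal sketch that assumes the agent resolved these values itself, for example so they could later be merged with settings from its own policy. apmConfig, loadAPMConfig, and the "elastic-agent" service-name default are hypothetical.

```go
package main

import (
	"fmt"
	"os"
)

// apmConfig is a hypothetical struct, not an existing elastic-agent
// configuration schema.
type apmConfig struct {
	ServerURL   string
	ServiceName string
	SecretToken string
	Environment string
}

// loadAPMConfig resolves each setting from an environment variable that
// apm-agent-go also reads, falling back to a default when the variable is
// unset or empty. lookup has the same signature as os.LookupEnv, so tests can
// pass a fake environment.
func loadAPMConfig(lookup func(string) (string, bool)) apmConfig {
	get := func(key, def string) string {
		if v, ok := lookup(key); ok && v != "" {
			return v
		}
		return def
	}
	return apmConfig{
		ServerURL:   get("ELASTIC_APM_SERVER_URL", "http://localhost:8200"),
		ServiceName: get("ELASTIC_APM_SERVICE_NAME", "elastic-agent"), // assumed default
		SecretToken: get("ELASTIC_APM_SECRET_TOKEN", ""),
		Environment: get("ELASTIC_APM_ENVIRONMENT", ""),
	}
}

func main() {
	cfg := loadAPMConfig(os.LookupEnv)
	fmt.Printf("sending traces to %s as service %q\n", cfg.ServerURL, cfg.ServiceName)
}
```

Passing the lookup function in, rather than calling os.Getenv directly, keeps the defaulting logic testable without mutating the process environment.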

Why is it important?

This will help in troubleshooting issues as more users start running EA.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

How to test this PR locally

Build elastic-agent from this branch, run it under Fleet, and add the APM integration. The agent should then appear as a service in APM.

Related issues

closes #28895

Use cases

Troubleshooting misbehaving Elastic Agents.

Screenshots

[screenshot]
(the traces come from a Metricbeat instance fetching data proxied from apm-server)

@stuartnelson3 stuartnelson3 requested review from a team November 18, 2021 10:58
@botelastic botelastic bot added the needs_team label (Indicates that the issue/PR needs a Team:* label) Nov 18, 2021
@mergify (Contributor)

mergify bot commented Nov 18, 2021

This pull request does not have a backport label. Could you fix it @stuartnelson3? 🙏
To fix up this pull request, you need to add the backport labels for the needed branches, such as:

  • backport-v\d.\d.\d is the label to automatically backport to the 7.\d branch, where \d is a digit

NOTE: backport-skip has been added to this pull request.

@stuartnelson3 stuartnelson3 added the Team:Elastic-Agent label (Label for the Agent team) Nov 18, 2021
@mergify mergify bot added the backport-skip label (Skip notification from the automated backport with mergify) Nov 18, 2021
@botelastic botelastic bot removed the needs_team label (Indicates that the issue/PR needs a Team:* label) Nov 18, 2021
@stuartnelson3 stuartnelson3 removed the backport-skip label (Skip notification from the automated backport with mergify) Nov 18, 2021
@mergify mergify bot added the backport-skip label (Skip notification from the automated backport with mergify) Nov 18, 2021
@elasticmachine (Collaborator)

elasticmachine commented Nov 18, 2021

💚 Build Succeeded


Build stats

  • Reason: null

  • Start Time: 2022-01-31T17:07:59.836+0000

  • Duration: 272 min 52 sec

  • Commit: c0b4b9e

Test stats 🧪

Test Results
  • Failed: 0
  • Passed: 48027
  • Skipped: 4292
  • Total: 52319

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages and run the E2E tests.

  • /beats-tester : Run the installation tests with beats-tester.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@michel-laterman (Contributor)

Haven't reviewed anything yet, but this PR is exciting! It also covers some diagnostics work that's on my backlog.
I can make time to help out if needed.

@axw (Member) left a comment

Looking good so far!

Not too sure about the span names. Typically I'd name spans after higher-level operations rather than method names. Maybe that's just my bias, though.

@stuartnelson3 (Contributor, Author)

That would be much appreciated! I'm sort of flying blind at the moment as I'm not super familiar with the codebase, so any help would be great.

@ruflin (Contributor)

ruflin commented Nov 19, 2021

This is really exciting! Thanks for taking this on.

One thing I'm curious about in the context of diagnostics: assuming someone is not shipping the APM data to ES, would it still be possible to fetch some of it as part of the diagnostics call? Does that even make sense? @axw @stuartnelson3

@michel-laterman (Contributor)

@ruflin, I'm not entirely sure how the APM client works, but I think integration with the diagnostics command should be done in a separate PR.
I don't think the APM client has an option to write to a local file, an in-memory buffer, etc., so we'll likely have to figure out later how we want to collect this information for diagnostics.

@axw (Member)

axw commented Nov 22, 2021

One thing I'm curious about in the context of diagnostics: assuming someone is not shipping the APM data to ES, would it still be possible to fetch some of it as part of the diagnostics call? Does that even make sense?

@ruflin Not out of the box, but it would be possible to do something like this by implementing a custom transport. For example, we have a transport that decodes events and stores them in memory for testing purposes: https://github.com/elastic/apm-agent-go/blob/master/transport/transporttest/recorder.go

@ruflin (Contributor)

ruflin commented Nov 23, 2021

Very interesting, @axw. Is this something that could be enabled "on demand"? What I have in mind is something like:

  • User triggers elastic-agent diagnostic command
  • Recorder is enabled for x seconds
  • Recording is read out and stored on disk
  • Recorder is disabled again
  • Diagnostics finished

@michel-laterman Agreed, in any case, that this should be split out into future PRs.

@simitt (Contributor)

simitt commented Nov 23, 2021

@ruflin Would you mind creating a dedicated issue for this kind of conversation? I believe we can isolate it, and I don't think it should block progress on the initial instrumentation.

@axw (Member)

axw commented Nov 24, 2021

@ruflin please ping me on said issue and I'll reply there so the answers don't get lost.

@ruflin (Contributor)

ruflin commented Nov 24, 2021

@stuartnelson3 (Contributor, Author)

@ph I've addressed some comments and written some follow-up questions. Can you take a look when you have a chance?

Also, I'm not sure what the e2e test failure is; it's apparently something with Kubernetes (kubernetes_autodiscover_elastic-agent). I merged in the latest master and will see what happens.

@ph (Contributor) left a comment

This looks good to me. @stuartnelson3 @blakerouse, can you take a look at it?

@stuartnelson3 (Contributor, Author)

Any idea what's up with the packaging failures?

[2022-01-25T19:23:45.582Z] ERRO[2022-01-25T19:23:45Z] Could not download the binary for the agent   arch=x86_64 artifact=elastic-agent error="GET request failed with 404" extension=tar.gz os=linux version=8.1.0-dbc834fd-SNAPSHOT
[2022-01-25T19:23:45.582Z] ERRO[2022-01-25T19:23:45Z] Could not download the binary for the agent   arch=x86_64 artifact=elastic-agent error="GET request failed with 404" extension=tar.gz os=linux version=8.1.0-dbc834fd-SNAPSHOT

@blakerouse (Contributor) left a comment

Overall, the change looks very mechanical, which is good to see; that made such a large branch easier to review.

I reviewed initTracer, as that is probably the most important part. It looks good, and it has unit test coverage.

@cmacknz (Member)

cmacknz commented Jan 27, 2022

Any idea what's up with the packaging failures?

[2022-01-25T19:23:45.582Z] ERRO[2022-01-25T19:23:45Z] Could not download the binary for the agent   arch=x86_64 artifact=elastic-agent error="GET request failed with 404" extension=tar.gz os=linux version=8.1.0-dbc834fd-SNAPSHOT
[2022-01-25T19:23:45.582Z] ERRO[2022-01-25T19:23:45Z] Could not download the binary for the agent   arch=x86_64 artifact=elastic-agent error="GET request failed with 404" extension=tar.gz os=linux version=8.1.0-dbc834fd-SNAPSHOT

I think this might have been broken by an unrelated change on master; see #30007. If it's the same issue I was seeing on other PRs, I think there was a follow-up fix.

@stuartnelson3 (Contributor, Author)

/test

@stuartnelson3 (Contributor, Author)

/test

@stuartnelson3 (Contributor, Author)

/test

@ruflin (Contributor)

ruflin commented Jan 31, 2022

/test

@stuartnelson3 (Contributor, Author)

/test

@mdelapenya (Contributor)

The e2e tests are failing, but not because of this PR. We (@elastic/observablt-robots) are working on the permanent fix (most likely in elastic/e2e-testing#2064), so I'd say this PR is fine.

👍

@stuartnelson3 stuartnelson3 merged commit 395ee91 into elastic:master Feb 1, 2022
@stuartnelson3 stuartnelson3 added the backport-v8.1.0 (Automated backport with mergify) and v8.1.0 labels and removed the backport-v8.0.0 label (Automated backport with mergify) Feb 1, 2022
ph added a commit to ph/beats that referenced this pull request Feb 3, 2022
This reverts the APM instrumentation code of the Elastic Agent
to unblock the build and the CI for other teams. More investigation
is needed to really understand the problem.

Fixes elastic/fleet-server#1129
@ph ph mentioned this pull request Feb 3, 2022
ph added a commit that referenced this pull request Feb 4, 2022
* Revert #29031

This reverts the APM instrumentation code of the Elastic Agent
to unblock the build and the CI for other teams. More investigation
is needed to really understand the problem.

Fixes elastic/fleet-server#1129

* fix make update

* fix linter
mergify bot pushed a commit that referenced this pull request Feb 4, 2022
* Revert #29031

This reverts the APM instrumentation code of the Elastic Agent
to unblock the build and the CI for other teams. More investigation
is needed to really understand the problem.

Fixes elastic/fleet-server#1129

* fix make update

* fix linter

(cherry picked from commit 718c923)
ph added a commit that referenced this pull request Feb 9, 2022
Co-authored-by: Pier-Hugues Pellerin <[email protected]>
stuartnelson3 added a commit to stuartnelson3/beats that referenced this pull request Feb 14, 2022
stuartnelson3 added a commit to stuartnelson3/beats that referenced this pull request Feb 14, 2022
stuartnelson3 added a commit to stuartnelson3/beats that referenced this pull request Feb 14, 2022
leweafan pushed a commit to leweafan/beats that referenced this pull request Apr 28, 2023
This reverts the APM instrumentation code of the Elastic Agent
to unblock the build and the CI for other teams. More investigation
is needed to really understand the problem.

Fixes elastic/fleet-server#1129
Labels
  • backport-v8.1.0 (Automated backport with mergify)
  • enhancement
  • Team:Elastic-Agent (Label for the Agent team)
  • v8.1.0
Development

Successfully merging this pull request may close these issues.

Add APM instrumentation to Elastic Agent