[8.1.0-SNAPSHOT] Fleet Server can't enroll: FAILED: Missed two check-ins #1129
Comments
Curious that there are no logs from Fleet Server related to the policy it selected. One possible workaround could be to add But we should figure out what the root issue is, regardless of whether the workaround works. |
Hey @joshdover, we tried it with Julia while working on migration to hosted policies. With |
We did some investigation and it seems that the root cause is in the Elastic Agent Docker image. The last stable one we managed to build is this one: elastic/elastic-package#683 ( It maps onto:
|
We suspect that the problem might have been introduced with this PR: elastic/beats#29031 cc @ph @blakerouse |
I will take a look, but I think the PR you linked is the only major change I know of that could have impacted the agent. |
@criamico @mtojek this change might be related as well: elastic/kibana#108252 when starting elastic-package locally, I see this in fleet-server logs: what is more, |
I don't think it's related to elastic/kibana#108252. I've been able to successfully run this on
# Create new elastic/fleet-server token
curl --request POST \
  --url http://localhost:9200/_security/service/elastic/fleet-server/credential/token \
  -u elastic:changeme
# Copy token response into authz header below
curl --request POST \
  --url http://localhost:5601/api/fleet/setup \
  --header 'authorization: Bearer <token>' \
  --header 'content-type: application/json' \
  --header 'kbn-xsrf: x'
Do we need to update the token that we're using? Maybe our manual hardcoded token isn't working anymore due to a change in ES? |
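The two curl calls above could be scripted roughly as follows. This is only a minimal sketch using the Python standard library: the URLs, credentials, and headers are copied from the curl commands, while the function names are hypothetical and the assumption that the token is returned under `token.value` in the JSON response should be checked against the actual response.

```python
import base64
import json
import urllib.request

ES_URL = "http://localhost:9200"      # taken from the curl example
KIBANA_URL = "http://localhost:5601"  # taken from the curl example


def build_token_request(user: str = "elastic", password: str = "changeme") -> urllib.request.Request:
    """Build the POST that creates a new elastic/fleet-server service token."""
    basic = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        f"{ES_URL}/_security/service/elastic/fleet-server/credential/token",
        method="POST",
        headers={"Authorization": f"Basic {basic}"},
    )


def build_setup_request(token: str) -> urllib.request.Request:
    """Build the POST to the Fleet setup endpoint, using the token as bearer."""
    return urllib.request.Request(
        f"{KIBANA_URL}/api/fleet/setup",
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
            "kbn-xsrf": "x",
        },
    )


def run_fleet_setup() -> dict:
    """Create a token, then call Fleet setup with it (needs a running stack)."""
    with urllib.request.urlopen(build_token_request()) as resp:
        # Assumption: the token value sits under token.value in the response.
        token = json.load(resp)["token"]["value"]
    with urllib.request.urlopen(build_setup_request(token)) as resp:
        return json.load(resp)
```

Building the requests separately from sending them makes it possible to inspect the headers without a running Elasticsearch or Kibana; `run_fleet_setup()` only works against a local stack.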
@joshdover In this PR I forced the specific Docker image for Elastic Agent and it passed. Elasticsearch, Kibana images were the same. |
OK, I went through all the commits in fleet-server; the latest commit that adds actual code to the server was 4 days ago, https://github.com/elastic/fleet-server/pulls?q=is%3Apr+is%3Amerged I am going to concentrate on the Agent side of things. |
This reverts the APM instrumentation code of the Elastic Agent, to unblock the build and the CI for other teams. This will require more investigation to really understand the problem. Fixes elastic/fleet-server#1129
This was a really deep rabbit hole. I took some time to get a running test setup and a fast environment to debug using the Because of this, I initially thought the problem was in fleet-server. I bisected back to the last good commit of fleet-server and the bug was still present.
At that point everything pointed to a problem on the Elastic Agent side, so I also bisected the Agent, starting from the last good build. I was able to narrow it down to the APM instrumentation. Looking at the implementation, tracing should have been disabled by default and should not affect any behavior of the agent. When I removed the whole PR, the Elastic Agent was able to do the initial enrollment into Fleet without any problems.
Looking more closely at the code, I tried removing only the gRPC interceptor, but that did not fix the situation. I decided to revert the whole implementation of the APM instrumentation, and we will need to look into it more. I've detected that importing
Reverting the PR was not a simple revert: another pull request applied afterwards had a conflicting change. Looking at that PR, it was green except for the e2e CI; if the latter had been working, I am confident it would have caught this issue.
Action items:
|
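The bisecting described above is a binary search over commit history. A minimal sketch of the idea, assuming the commits are ordered oldest to newest and that every commit after the first bad one is also bad (the same assumption `git bisect` makes). The commit names and the `is_bad` check are hypothetical stand-ins for "build the image and try to enroll":

```python
from typing import Callable, Optional, Sequence


def first_bad_commit(commits: Sequence[str], is_bad: Callable[[str], bool]) -> Optional[str]:
    """Return the first commit for which is_bad() is true, or None if all are good.

    Assumes commits are ordered oldest to newest and that once a commit is
    bad, every later commit is bad too.
    """
    lo, hi = 0, len(commits)
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid        # the first bad commit is at mid or earlier
        else:
            lo = mid + 1    # the first bad commit is after mid
    return commits[lo] if lo < len(commits) else None


# Hypothetical history: enrollment breaks from the instrumentation commit on.
history = ["c1", "c2", "c3-apm-instrumentation", "c4", "c5"]
broken = {"c3-apm-instrumentation", "c4", "c5"}
culprit = first_bad_commit(history, lambda commit: commit in broken)
```

Each check halves the remaining range, which matters when a single check means building an image and waiting for enrollment to succeed or time out.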
* Revert #29031: This reverts the APM instrumentation code of the Elastic Agent, to unblock the build and the CI for other teams. This will require more investigation to really understand the problem. Fixes elastic/fleet-server#1129
* fix make update
* fix linter
(cherry picked from commit 718c923)
Let's keep it open until we confirm that it is fixed. |
interesting that the issue was closed from a forked repository? I don't remember seeing that ever. |
I'm seeing what appears to be the same issue with To reproduce, clone elastic/apm-server#7227 and run Logs:
|
@ph Could you please check what is the status of the elastic-agent Docker image? The issue still persists in Integrations. |
@mtojek I am taking another look. |
@mtojek Looking at the CI failure, this concerns the 8.1 snapshots, and I hadn't merged elastic/beats#30209 yet. I will double-check the failures and merge it. Is there a job that tests on master? |
@ph if you don't care about running the specific steps that @mtojek mentioned: the steps I listed in #1129 (comment) are for main (8.2.0-SNAPSHOT). |
|
For example, it fails for the Integrations master, |
Build is
❯ git show 5529c31cf1bd68bf2ad089ef747186f9510ff3f1 [11:48:31]
commit 5529c31cf1bd68bf2ad089ef747186f9510ff3f1 (HEAD)
Author: Elastic Machine <[email protected]>
Date: Mon Feb 7 10:17:17 2022 -0600
[Release] update version to next minor 8.2.0 (#30160)
diff --git a/libbeat/version/version.go b/libbeat/version/version.go
index 873ae40db0..38249106a4 100644
--- a/libbeat/version/version.go
+++ b/libbeat/version/version.go
@@ -18,4 +18,4 @@
// Code generated by dev-tools/set_version
package version
-const defaultBeatVersion = "8.1.0"
+const defaultBeatVersion = "8.2.0"
This is indeed strange behavior, because I was able to reproduce the bug every time with the instrumentation commit and not without it. |
I retriggered the main job. Let's see what's the current status: link |
It should fail. @mtojek I can reproduce the bug with the Docker images, but the debug statements are lacking, so I am shooting a bit in the dark at this point. |
Interesting logs in the Kibana side, not sure why we have multiple
|
OK, I think we might have two different problems. Let's start with the APM Server: recently we removed the autogeneration of the fleet-server configuration that happened without human intervention (elastic/kibana#108456). Looking at the APM docker-compose file at https://github.com/elastic/apm-server/blob/main/docker-compose.yml#L41-L63, we never configure the default Fleet Server configuration. This aligns with what we see in the log: Fleet Server is waiting on a configuration that will never exist. Elastic Package has created a PR for this, elastic/elastic-package#676. Now I will check with elastic-package. |
When I've tested #1129 (comment) I didn't use the container subcommand and did use the link from the kibana UI, so in that case Kibana generates the appropriate server configuration. |
Added notes here. Using this configuration yields a few deprecation warnings, from elastic/kibana#108456 (comment):
xpack.fleet.agentPolicies:
  - name: Agent policy 1
    description: Agent policy 1
    is_managed: false
    namespace: default
    monitoring_enabled:
      - logs
      - metrics
    package_policies:
      - name: system-1
        id: default-system
        package:
          name: system
  - name: Fleet Server policy preconfigured
    id: fleet-server-policy
    namespace: default
    package_policies:
      - name: Fleet Server
        package:
          name: fleet_server
[2022-02-08T20:34:28.207+00:00][WARN ][config.deprecation] Config key [xpack.fleet.agentPolicies.is_default] is deprecated. |
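Since Fleet Server waits indefinitely when no policy configures it, a quick pre-flight check on the preconfigured policies can catch the gap before the stack starts. The sketch below assumes the `xpack.fleet.agentPolicies` list has already been loaded into plain Python data (the literal mirrors the configuration above); the function name is hypothetical, and the rule that a `fleet_server` package policy is what Fleet Server needs is an assumption drawn from this thread, not a documented contract.

```python
from typing import Any, Dict, List


def has_fleet_server_policy(agent_policies: List[Dict[str, Any]]) -> bool:
    """Return True if any agent policy contains a fleet_server package policy."""
    for policy in agent_policies:
        for package_policy in policy.get("package_policies", []):
            if package_policy.get("package", {}).get("name") == "fleet_server":
                return True
    return False


# The xpack.fleet.agentPolicies list from the configuration, as Python data.
AGENT_POLICIES = [
    {
        "name": "Agent policy 1",
        "description": "Agent policy 1",
        "is_managed": False,
        "namespace": "default",
        "monitoring_enabled": ["logs", "metrics"],
        "package_policies": [
            {"name": "system-1", "id": "default-system", "package": {"name": "system"}},
        ],
    },
    {
        "name": "Fleet Server policy preconfigured",
        "id": "fleet-server-policy",
        "namespace": "default",
        "package_policies": [
            {"name": "Fleet Server", "package": {"name": "fleet_server"}},
        ],
    },
]

fleet_server_configured = has_fleet_server_policy(AGENT_POLICIES)
```

A docker-compose setup that defines only the first policy would fail this check, which matches the "waiting on a configuration that will never exist" symptom described earlier in the thread.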
I still think it only happens when using an automation workflow; when I go through the user journey it seems to work, at least outside of containers. |
OK, 8.1.0 is stuck in a failure loop on Fleet Server; the server is not even started. This is exactly what Marcin had. |
OK, the 8.2.0 elastic-package stack works for me.
Logging into Kibana shows both Elastic Agents connecting to it; everything seems to be enrolled fine. @jlind23 @axw The main difference between 8.2.0 and 8.1.0 is really the instrumentation; fleet-server is identical. |
Thanks, @ph, for working on this to reduce the blast radius. I opened a similar PR to verify the 8.2.0 stack: elastic/elastic-package#692 Hey @simitt @axw @stuartnelson3, I suppose you've already been researching the APM instrumentation issue. Could you please share more details or link the issue, so we can learn what went wrong here? My bet is an undetected library conflict somewhere around gRPC. |
The elastic-package and apm-server problems are fixed, so I am going to close this issue. If there is a problem, we can reopen it. |
@stuartnelson3 is looking into this. |
Hi,
we adopted elastic-package to use predefined agent policies and confirmed with @juliaElastic that we're ready for the switch (main branch is green). Since yesterday we've been facing problems with enrollment:
More logs: https://beats-ci.elastic.co/job/Ingest-manager/job/integrations/job/main/98/artifact/build/elastic-stack-dump/synthetics/logs/
It affects the Integrations main branch, incl. synthetics, containerd, etc.
Steps to reproduce:
Thanks for any help with investigating this problem.
cc @jlind23 @joshdover