APM UI doesn't show connected traces sent from AWS Distro for OTel (Lambda) #458

michaelhyatt opened this issue Jun 24, 2021 · 11 comments


@michaelhyatt

APM Server version (apm-server version): 7.13.1

Description of the problem including expected versus actual behavior:
I managed to get the distributed tracing going with AWS X-Ray and the OTel collector to pick it up. The flow is curl => API-GW => producer-lambda => API-GW => consumer-lambda. The code (serverless): https://github.com/michaelhyatt/serverless-aws-python-api-worker-otel
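
For context, the producer handler boils down to something like this (a sketch, not the repo's exact code; CONSUMER_URL is a hypothetical stand-in for the consumer's API Gateway endpoint):

import os

import requests

# hypothetical stand-in for the consumer's API Gateway endpoint
CONSUMER_URL = os.environ["CONSUMER_URL"]

def producer(event, context):
    # opentelemetry.instrumentation.requests records this call as the
    # "HTTP GET" span seen in the documents below
    resp = requests.get(CONSUMER_URL)
    return {"statusCode": 200, "body": resp.text}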

The interesting thing is that the two transactions and one span are recorded correctly with the same trace.id; however, the APM UI doesn't show a proper trace.

AWS X-Ray showing the trace with 2 components:

Discover shows 3 records (2 transactions and 1 span) with the same trace.id:

The APM UI is showing 2 services, but no traces:

Steps to reproduce:
As per the README; the Serverless Framework is required.
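
Roughly (a sketch; the real endpoint URL and path come from the serverless deploy output):

serverless deploy
curl https://<api-id>.execute-api.ap-southeast-2.amazonaws.com/dev/<producer-path>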

JSON documents

Consumer transaction

{
	"_index": "apm-7.13.1-transaction-000001",
	"_type": "_doc",
	"_id": "TdX7OHoBmWidblu2OYlN",
	"_version": 1,
	"_score": null,
	"fields": {
		"transaction.name.text": ["handler.consumer"],
		"service.framework.version": ["0.16.dev0"],
		"observer.name": ["instance-0000000008"],
		"labels.faas_name": ["aws-python-api-worker-project-dev-consumer"],
		"service.language.name": ["python"],
		"labels.faas_id": ["arn:aws:lambda:ap-southeast-2:401722391821:function:aws-python-api-worker-project-dev-consumer"],
		"transaction.sampled": [true],
		"transaction.id": ["1b5693f6e07de9c7"],
		"trace.id": ["60d331f41b8d383766d158e55359aaca"],
		"processor.event": ["transaction"],
		"agent.name": ["opentelemetry/python"],
		"labels.faas_execution": ["c7ab9723-9e43-4488-89d0-e1759190b984"],
		"event.outcome": ["unknown"],
		"cloud.region": ["ap-southeast-2"],
		"service.name": ["aws-python-api-worker-project-dev-consumer"],
		"service.framework.name": ["opentelemetry.instrumentation.aws_lambda"],
		"processor.name": ["transaction"],
		"transaction.duration.us": [698],
		"labels.faas_version": ["$LATEST"],
		"observer.version_major": [7],
		"observer.hostname": ["60e5fd4e9651"],
		"transaction.type": ["custom"],
		"cloud.provider": ["aws"],
		"event.ingested": ["2021-06-23T13:07:06.188Z"],
		"timestamp.us": [1624453624947661],
		"@timestamp": ["2021-06-23T13:07:04.947Z"],
		"ecs.version": ["1.8.0"],
		"observer.type": ["apm-server"],
		"observer.version": ["7.13.1"],
		"agent.version": ["1.2.0"],
		"parent.id": ["8ae9f83b97976607"],
		"transaction.name": ["handler.consumer"]
	},
	"highlight": {
		"trace.id": ["@kibana-highlighted-field@60d331f41b8d383766d158e55359aaca@/kibana-highlighted-field@"]
	},
	"sort": [1624453624947]
}

Producer HTTP span

{
	"_index": "apm-7.13.1-span-000001",
	"_type": "_doc",
	"_id": "T9X7OHoBmWidblu2OYlN",
	"_version": 1,
	"_score": null,
	"fields": {
		"span.destination.service.type": ["external"],
		"span.name": ["HTTP GET"],
		"destination.port": [443],
		"observer.name": ["instance-0000000008"],
		"labels.faas_name": ["aws-python-api-worker-project-dev-producer"],
		"trace.id": ["60d331f41b8d383766d158e55359aaca"],
		"span.duration.us": [2256175],
		"processor.event": ["span"],
		"agent.name": ["opentelemetry/python"],
		"destination.address": ["3y00jobl0c.execute-api.ap-southeast-2.amazonaws.com"],
		"event.outcome": ["success"],
		"cloud.region": ["ap-southeast-2"],
		"service.name": ["aws-python-api-worker-project-dev-producer"],
		"processor.name": ["transaction"],
		"labels.faas_version": ["$LATEST"],
		"span.id": ["898690921573f428"],
		"span.subtype": ["http"],
		"observer.version_major": [7],
		"observer.hostname": ["60e5fd4e9651"],
		"span.type": ["external"],
		"cloud.provider": ["aws"],
		"timestamp.us": [1624453622779541],
		"@timestamp": ["2021-06-23T13:07:02.779Z"],
		"ecs.version": ["1.8.0"],
		"observer.type": ["apm-server"],
		"observer.version": ["7.13.1"],
		"agent.version": ["1.2.0"],
		"parent.id": ["8b8e0c642d96e5f2"],
		"span.destination.service.name": ["https://3y00jobl0c.execute-api.ap-southeast-2.amazonaws.com"],
		"span.destination.service.resource": ["3y00jobl0c.execute-api.ap-southeast-2.amazonaws.com:443"]
	},
	"highlight": {
		"trace.id": ["@kibana-highlighted-field@60d331f41b8d383766d158e55359aaca@/kibana-highlighted-field@"]
	},
	"sort": [1624453622779]
}

Producer transaction

{
	"_index": "apm-7.13.1-transaction-000001",
	"_type": "_doc",
	"_id": "TtX7OHoBmWidblu2OYlN",
	"_version": 1,
	"_score": null,
	"fields": {
		"transaction.name.text": ["handler.producer"],
		"service.framework.version": ["0.21b0"],
		"observer.name": ["instance-0000000008"],
		"labels.faas_name": ["aws-python-api-worker-project-dev-producer"],
		"service.language.name": ["python"],
		"labels.faas_id": ["arn:aws:lambda:ap-southeast-2:401722391821:function:aws-python-api-worker-project-dev-producer"],
		"transaction.sampled": [true],
		"transaction.id": ["8b8e0c642d96e5f2"],
		"trace.id": ["60d331f41b8d383766d158e55359aaca"],
		"processor.event": ["transaction"],
		"agent.name": ["opentelemetry/python"],
		"labels.faas_execution": ["fdcc0a5f-ce57-4201-a631-6097957f4a78"],
		"event.outcome": ["unknown"],
		"cloud.region": ["ap-southeast-2"],
		"service.name": ["aws-python-api-worker-project-dev-producer"],
		"service.framework.name": ["opentelemetry.instrumentation.requests"],
		"processor.name": ["transaction"],
		"transaction.duration.us": [2259030],
		"labels.faas_version": ["$LATEST"],
		"observer.version_major": [7],
		"observer.hostname": ["60e5fd4e9651"],
		"transaction.type": ["custom"],
		"cloud.provider": ["aws"],
		"event.ingested": ["2021-06-23T13:07:06.189Z"],
		"timestamp.us": [1624453622778742],
		"@timestamp": ["2021-06-23T13:07:02.778Z"],
		"ecs.version": ["1.8.0"],
		"observer.type": ["apm-server"],
		"observer.version": ["7.13.1"],
		"agent.version": ["1.2.0"],
		"parent.id": ["918243f3ba6e3f71"],
		"transaction.name": ["handler.producer"]
	},
	"highlight": {
		"trace.id": ["@kibana-highlighted-field@60d331f41b8d383766d158e55359aaca@/kibana-highlighted-field@"]
	},
	"sort": [1624453622778]
}
@axw
Member

axw commented Jun 24, 2021

@michaelhyatt would you be able to run the opentelemetry-collector with the logging exporter and capture the logged spans? Then we can compare them to the docs we're recording.
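
Something like this in the collector config should do it (a sketch; the receivers need to match your existing pipeline):

exporters:
  logging:
    loglevel: debug

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [logging]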

One thing I can see is that the consumer transaction's parent.id does not refer to either of the other events. Also, the producer transaction has a parent.id, so neither of the transactions is considered the root, which is why they don't show up under Traces. I'm guessing that the API Gateway is updating traceparent, but there don't appear to be any spans for API Gateway. My wild guess is that API Gateway is sending X-Ray events but not OTel spans.
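
For reference, a W3C traceparent header carries version-traceid-parentid-flags, so an injected header for the consumer would look something like this (illustrative, built from the IDs in the documents above):

traceparent: 00-60d331f41b8d383766d158e55359aaca-8ae9f83b97976607-01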

@michaelhyatt
Author

@axw I have the debug logs of the embedded OTel collector, which recorded the messages it sends. Two logs are attached. Here is an example of the debug output for the span that the consumer reports:



2021-06-23T23:00:34.750Z	DEBUG	loggingexporter/logging_exporter.go:48	ResourceSpans #0
Resource labels:
-> telemetry.sdk.language: STRING(python)
-> telemetry.sdk.name: STRING(opentelemetry)
-> telemetry.sdk.version: STRING(1.2.0)
-> cloud.region: STRING(ap-southeast-2)
-> cloud.provider: STRING(aws)
-> faas.name: STRING(aws-python-api-worker-project-dev-consumer)
-> faas.version: STRING($LATEST)
-> service.name: STRING(aws-python-api-worker-project-dev-consumer)
InstrumentationLibrarySpans #0
InstrumentationLibrary opentelemetry.instrumentation.aws_lambda 0.16.dev0
Span #0
Trace ID       : 60d3bd0d1c71895e0786483748de7acf
Parent ID      : 84ecb3d0f1da98b3
ID             : ebd0715a456a282b
Name           : handler.consumer
Kind           : SPAN_KIND_SERVER
Start time     : 2021-06-23 23:00:34.733421785 +0000 UTC
End time       : 2021-06-23 23:00:34.734093944 +0000 UTC
Status code    : STATUS_CODE_UNSET
Status message :
Attributes:
-> faas.execution: STRING(6c00c38a-b1fe-4aac-955e-345a91831150)
-> faas.id: STRING(arn:aws:lambda:ap-southeast-2:401722391821:function:aws-python-api-worker-project-dev-consumer)
-> faas.name: STRING(aws-python-api-worker-project-dev-consumer)
-> faas.version: STRING($LATEST)
 
END RequestId: 6c00c38a-b1fe-4aac-955e-345a91831150

consumer.txt
producer.txt

@AlexanderWert
Member

AlexanderWert commented Jun 24, 2021

@michaelhyatt Can you try disabling X-Ray tracing in the API Gateway (https://docs.aws.amazon.com/xray/latest/devguide/xray-services-apigateway.html) and see whether that solves the observed problem?
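
If the stack is deployed with the Serverless Framework, this should be a one-line change in serverless.yml (a sketch, assuming tracing was enabled via the provider block):

provider:
  tracing:
    apiGateway: false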

@michaelhyatt
Author

@AlexanderWert I disabled the API Gateway tracing, and now it no longer appears on the X-Ray service map or waterfall diagram:

Kibana APM UI still doesn't show the trace.

These are the documents recorded in Elasticsearch.

Consumer transaction

{
	"_index": "apm-7.13.2-transaction-000001",
	"_type": "_doc",
	"_id": "D-ToRXoB6AbCXZTTT6Dt",
	"_version": 1,
	"_score": null,
	"fields": {
		"transaction.name.text": ["handler.consumer"],
		"service.framework.version": ["0.16.dev0"],
		"observer.name": ["instance-0000000009"],
		"labels.faas_name": ["aws-python-api-worker-project-dev-consumer"],
		"service.language.name": ["python"],
		"labels.faas_id": ["arn:aws:lambda:ap-southeast-2:401722391821:function:aws-python-api-worker-project-dev-consumer"],
		"transaction.sampled": [true],
		"transaction.id": ["8d38b2d9bee1c20f"],
		"trace.id": ["60d681143464a3ff1c61880768907b0a"],
		"processor.event": ["transaction"],
		"agent.name": ["opentelemetry/python"],
		"labels.faas_execution": ["4bc4f6f3-71e3-4380-91d7-5277a79be05e"],
		"event.outcome": ["unknown"],
		"cloud.region": ["ap-southeast-2"],
		"service.name": ["aws-python-api-worker-project-dev-consumer"],
		"service.framework.name": ["opentelemetry.instrumentation.aws_lambda"],
		"processor.name": ["transaction"],
		"transaction.duration.us": [605],
		"labels.faas_version": ["$LATEST"],
		"observer.version_major": [7],
		"observer.hostname": ["1e0706f92f11"],
		"transaction.type": ["custom"],
		"cloud.provider": ["aws"],
		"event.ingested": ["2021-06-26T01:21:30.604Z"],
		"timestamp.us": [1624670489577889],
		"@timestamp": ["2021-06-26T01:21:29.577Z"],
		"ecs.version": ["1.8.0"],
		"observer.type": ["apm-server"],
		"observer.version": ["7.13.2"],
		"agent.version": ["1.2.0"],
		"parent.id": ["e53f6b191e1d0281"],
		"transaction.name": ["handler.consumer"]
	},
	"highlight": {
		"trace.id": ["@kibana-highlighted-field@60d681143464a3ff1c61880768907b0a@/kibana-highlighted-field@"]
	},
	"sort": [1624670489577]
}

Producer span

{
	"_index": "apm-7.13.2-span-000001",
	"_type": "_doc",
	"_id": "EeToRXoB6AbCXZTTT6Dt",
	"_version": 1,
	"_score": null,
	"fields": {
		"span.destination.service.type": ["external"],
		"span.name": ["HTTP GET"],
		"destination.port": [443],
		"observer.name": ["instance-0000000009"],
		"labels.faas_name": ["aws-python-api-worker-project-dev-producer"],
		"trace.id": ["60d681143464a3ff1c61880768907b0a"],
		"span.duration.us": [2386427],
		"processor.event": ["span"],
		"agent.name": ["opentelemetry/python"],
		"destination.address": ["bydwx5ss24.execute-api.ap-southeast-2.amazonaws.com"],
		"event.outcome": ["success"],
		"cloud.region": ["ap-southeast-2"],
		"service.name": ["aws-python-api-worker-project-dev-producer"],
		"processor.name": ["transaction"],
		"labels.faas_version": ["$LATEST"],
		"span.id": ["e5edc4fb3045615a"],
		"span.subtype": ["http"],
		"observer.version_major": [7],
		"observer.hostname": ["1e0706f92f11"],
		"span.type": ["external"],
		"cloud.provider": ["aws"],
		"timestamp.us": [1624670487255663],
		"@timestamp": ["2021-06-26T01:21:27.255Z"],
		"ecs.version": ["1.8.0"],
		"observer.type": ["apm-server"],
		"observer.version": ["7.13.2"],
		"agent.version": ["1.2.0"],
		"parent.id": ["a64ec14bc1536181"],
		"span.destination.service.name": ["https://bydwx5ss24.execute-api.ap-southeast-2.amazonaws.com"],
		"span.destination.service.resource": ["bydwx5ss24.execute-api.ap-southeast-2.amazonaws.com:443"]
	},
	"highlight": {
		"trace.id": ["@kibana-highlighted-field@60d681143464a3ff1c61880768907b0a@/kibana-highlighted-field@"]
	},
	"sort": [1624670487255]
}

Producer transaction

{
	"_index": "apm-7.13.2-transaction-000001",
	"_type": "_doc",
	"_id": "EOToRXoB6AbCXZTTT6Dt",
	"_version": 1,
	"_score": null,
	"fields": {
		"transaction.name.text": ["handler.producer"],
		"service.framework.version": ["0.21b0"],
		"observer.name": ["instance-0000000009"],
		"labels.faas_name": ["aws-python-api-worker-project-dev-producer"],
		"service.language.name": ["python"],
		"labels.faas_id": ["arn:aws:lambda:ap-southeast-2:401722391821:function:aws-python-api-worker-project-dev-producer"],
		"transaction.sampled": [true],
		"transaction.id": ["a64ec14bc1536181"],
		"trace.id": ["60d681143464a3ff1c61880768907b0a"],
		"processor.event": ["transaction"],
		"agent.name": ["opentelemetry/python"],
		"labels.faas_execution": ["4580f63e-2761-408e-99f3-046cb774d7e3"],
		"event.outcome": ["unknown"],
		"cloud.region": ["ap-southeast-2"],
		"service.name": ["aws-python-api-worker-project-dev-producer"],
		"service.framework.name": ["opentelemetry.instrumentation.requests"],
		"processor.name": ["transaction"],
		"transaction.duration.us": [2389360],
		"labels.faas_version": ["$LATEST"],
		"observer.version_major": [7],
		"observer.hostname": ["1e0706f92f11"],
		"transaction.type": ["custom"],
		"cloud.provider": ["aws"],
		"event.ingested": ["2021-06-26T01:21:30.605Z"],
		"timestamp.us": [1624670487254602],
		"@timestamp": ["2021-06-26T01:21:27.254Z"],
		"ecs.version": ["1.8.0"],
		"observer.type": ["apm-server"],
		"observer.version": ["7.13.2"],
		"agent.version": ["1.2.0"],
		"parent.id": ["7ade6ef5e6220237"],
		"transaction.name": ["handler.producer"]
	},
	"highlight": {
		"trace.id": ["@kibana-highlighted-field@60d681143464a3ff1c61880768907b0a@/kibana-highlighted-field@"]
	},
	"sort": [1624670487254]
}

axw transferred this issue from elastic/apm-server Jun 28, 2021
@axw
Member

axw commented Jun 28, 2021

@michaelhyatt I've transferred this to elastic/apm as (AFAICS) there's no bug with the translation of OpenTelemetry spans -- the server is translating them faithfully. It appears that API GW is injecting/mutating traceparent, but is not sending any OTel spans. When we receive an OTel span for Lambda, it has a parent span ID which points to nowhere.

That missing API GW span would theoretically be the trace root, and we base certain features on trace roots, such as the list of traces. If API GW spans were sent to APM Server, I'd expect it would all just work. This seems like missing/broken functionality in API GW to me.
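
For reference, a trace root here is simply a transaction without a parent.id; a query sketch (for Kibana Dev Tools) that lists the roots the UI would consider:

GET apm-*/_search
{
  "query": {
    "bool": {
      "filter": [{ "term": { "processor.event": "transaction" } }],
      "must_not": [{ "exists": { "field": "parent.id" } }]
    }
  }
}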

@michaelhyatt
Author

@axw The latest test was done with API-GW tracing turned off. Are you saying the API-GW is still changing the trace header even though it is no longer part of the X-Ray traces?

@axw
Member

axw commented Jun 28, 2021

Yes, it looks like API GW is still adding a traceparent header before both Lambdas. The producer transaction has a field parent.id:7ade6ef5e6220237, and the consumer transaction has parent.id:e53f6b191e1d0281. There are no transactions or spans with these IDs.
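
That's easy to verify with a query like this (a sketch for Kibana Dev Tools against the indices above; it comes back with no hits):

GET apm-7.13.2-*/_search
{
  "query": {
    "bool": {
      "should": [
        { "terms": { "transaction.id": ["7ade6ef5e6220237", "e53f6b191e1d0281"] } },
        { "terms": { "span.id": ["7ade6ef5e6220237", "e53f6b191e1d0281"] } }
      ],
      "minimum_should_match": 1
    }
  }
}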

@AlexanderWert
Member

AlexanderWert commented Jun 28, 2021

> Yes, it looks like API GW is still adding a traceparent header before both Lambdas.

Could it be the Lambda extension that is adding the traceparent (collecting X-Ray-specific spans rather than sending OTel spans)? For instance, for the Lambda initialization phase, as shown in the screenshot above.

@axw
Member

axw commented Jun 28, 2021

Possibly; I don't know. I should have said X-Ray header anyway, since there are no W3C traceparent headers involved here. The bottom line is that we're being told the producer span has a parent which does not exist, and so none of the spans is considered the trace root.
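
For reference, the X-Ray header in question maps onto this trace roughly like so (illustrative, reconstructed from the recorded trace.id and the producer's parent.id):

X-Amzn-Trace-Id: Root=1-60d68114-3464a3ff1c61880768907b0a;Parent=7ade6ef5e6220237;Sampled=1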

@cyrille-leclerc

cyrille-leclerc commented Jun 28, 2021

I came to the same conclusion:

| Type (Transaction / Span) | id | parent.id | timestamp.us | name | service.name |
| --- | --- | --- | --- | --- | --- |
| T | a64ec14bc1536181 | 7ade6ef5e6220237 | 1624670487254602 | handler.producer | aws-python-api-worker-project-dev-producer |
| S | e5edc4fb3045615a | a64ec14bc1536181 | 1624670487255663 | HTTP GET | aws-python-api-worker-project-dev-producer |
| T | 8d38b2d9bee1c20f | e53f6b191e1d0281 | 1624670489577889 | handler.consumer | aws-python-api-worker-project-dev-consumer |

The parent.id of the consumer/downstream transaction is not the span.id of the parent HTTP GET span, as it should be: it's e53f6b191e1d0281 instead of e5edc4fb3045615a.

@michaelhyatt
Author

I tested this setup with Jaeger, and it handled the case quite well, although it displayed warnings about the missing parents.

