
Prometheus exporter - Histograms are disappearing. #3089

Closed
mshebeko-twist opened this issue Dec 14, 2022 · 11 comments · Fixed by #3429
Assignees: ocelotl
Labels: bug (Something isn't working), exporters, prometheus



mshebeko-twist commented Dec 14, 2022

Describe your environment
python 3.7.10
opentelemetry_api-1.14.0
opentelemetry_sdk-1.14.0
opentelemetry_exporter_prometheus-1.12.0rc1

Steps to reproduce
Create a Histogram and expose it with the Prometheus exporter, as sketched below.
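For reference, a minimal sketch of that setup, assuming the PrometheusMetricReader API from opentelemetry-exporter-prometheus and prometheus_client's built-in HTTP server (names may vary slightly between versions):

from prometheus_client import start_http_server

from opentelemetry.exporter.prometheus import PrometheusMetricReader
from opentelemetry.metrics import get_meter_provider, set_meter_provider
from opentelemetry.sdk.metrics import MeterProvider

# Serve the Prometheus /metrics endpoint on port 8000.
start_http_server(port=8000)

# Wire the Prometheus reader into the SDK meter provider.
set_meter_provider(MeterProvider(metric_readers=[PrometheusMetricReader()]))

meter = get_meter_provider().get_meter("example")
histogram = meter.create_histogram("request.duration", unit="ms")

# After a value is recorded, the histogram is expected to keep showing up
# on /metrics across consecutive scrapes.
histogram.record(42.0)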

What is the expected behavior?
Histograms should consistently appear on the /metrics endpoint.

What is the actual behavior?
Histograms disappear from the /metrics endpoint after a request/scrape. A change to the metrics (in my case, a query to the application itself) makes the histogram appear again until the next request/scrape. This does not happen with prometheus_client or with the JavaScript OpenTelemetry Prometheus exporter.

Additional context
This behaviour is problematic when multiple Prometheus servers are scraping metrics.
I have both auto-instrumented and manually instrumented histograms, and both of them are disappearing.

Edit:
I don't know if this is actually related, but I hope it may be helpful.
An additional observation: when pushing histograms to the OTel Collector, I sporadically encounter the following error:

[email protected]/accumulator.go:105	failed to translate metric	{"kind": "exporter", "data_type": "metrics", "name": "prometheus", "data_type": "\u0000", "metric_name": "http.server.duration"}

This is not the case with the JS implementation; it just works without any errors.

mshebeko-twist added the bug label Dec 14, 2022
@robotadam

I was able to duplicate this, and I also found this related discussion: open-telemetry/opentelemetry-collector-contrib#13443

It appears to happen when there is no data collected for the histogram during an exporter interval.

Here's a minimal app.py that exhibits this, using Python 3.10.9 and OTel 1.15.0/0.36b0:

from fastapi import FastAPI
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.metrics import set_meter_provider
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import (
    ConsoleMetricExporter,
    PeriodicExportingMetricReader,
)
from opentelemetry.sdk.resources import SERVICE_NAME, Resource

app = FastAPI()


@app.get("/")
def read_root():
    return {"Hello": "World"}


exporter = OTLPMetricExporter(endpoint="http://otel-collector:4317")
console = ConsoleMetricExporter()
reader = PeriodicExportingMetricReader(exporter, export_interval_millis=5_000)
provider = MeterProvider(
    resource=Resource.create({SERVICE_NAME: "fastapi-test-app"}),
    metric_readers=[reader],
)
set_meter_provider(provider)

FastAPIInstrumentor.instrument_app(app)

and an otel-collector-config.yaml, running on the latest otel/opentelemetry-collector-contrib Docker image, which is 0.67.0:

receivers:
  otlp:
    protocols:
      grpc:

exporters:
  prometheus:
    endpoint: 0.0.0.0:8889

extensions:
  health_check:

service:
  extensions: [health_check]
  pipelines:
    metrics:
      receivers: [otlp]
      processors: []
      exporters: [prometheus]

Running this app, the first metric is exported successfully after the first web request. After that, every export emits the following error message and traceback every 5 seconds:

2022-12-20T00:12:05.663Z	error	[email protected]/accumulator.go:105	failed to translate metric	{"kind": "exporter", "data_type": "metrics", "name": "prometheus", "data_type": "\u0000", "metric_name": "http.server.duration"}
github.com/open-telemetry/opentelemetry-collector-contrib/exporter/prometheusexporter.(*lastValueAccumulator).addMetric
	github.com/open-telemetry/opentelemetry-collector-contrib/exporter/[email protected]/accumulator.go:105
github.com/open-telemetry/opentelemetry-collector-contrib/exporter/prometheusexporter.(*lastValueAccumulator).Accumulate
	github.com/open-telemetry/opentelemetry-collector-contrib/exporter/[email protected]/accumulator.go:82
github.com/open-telemetry/opentelemetry-collector-contrib/exporter/prometheusexporter.(*collector).processMetrics
	github.com/open-telemetry/opentelemetry-collector-contrib/exporter/[email protected]/collector.go:67
github.com/open-telemetry/opentelemetry-collector-contrib/exporter/prometheusexporter.(*prometheusExporter).ConsumeMetrics
	github.com/open-telemetry/opentelemetry-collector-contrib/exporter/[email protected]/prometheus.go:96
go.opentelemetry.io/collector/exporter/exporterhelper.(*metricsRequest).Export
	go.opentelemetry.io/[email protected]/exporter/exporterhelper/metrics.go:65
go.opentelemetry.io/collector/exporter/exporterhelper.(*timeoutSender).send
	go.opentelemetry.io/[email protected]/exporter/exporterhelper/common.go:208
go.opentelemetry.io/collector/exporter/exporterhelper.(*retrySender).send
	go.opentelemetry.io/[email protected]/exporter/exporterhelper/queued_retry.go:358
go.opentelemetry.io/collector/exporter/exporterhelper.(*metricsSenderWithObservability).send
	go.opentelemetry.io/[email protected]/exporter/exporterhelper/metrics.go:136
go.opentelemetry.io/collector/exporter/exporterhelper.(*queuedRetrySender).send
	go.opentelemetry.io/[email protected]/exporter/exporterhelper/queued_retry.go:294
go.opentelemetry.io/collector/exporter/exporterhelper.NewMetricsExporter.func2
	go.opentelemetry.io/[email protected]/exporter/exporterhelper/metrics.go:116
go.opentelemetry.io/collector/consumer.ConsumeMetricsFunc.ConsumeMetrics
	go.opentelemetry.io/collector/[email protected]/metrics.go:36
go.opentelemetry.io/collector/receiver/otlpreceiver/internal/metrics.(*Receiver).Export
	go.opentelemetry.io/collector/receiver/[email protected]/internal/metrics/otlp.go:63
go.opentelemetry.io/collector/pdata/pmetric/pmetricotlp.rawMetricsServer.Export
	go.opentelemetry.io/collector/[email protected]/pmetric/pmetricotlp/grpc.go:71
go.opentelemetry.io/collector/pdata/internal/data/protogen/collector/metrics/v1._MetricsService_Export_Handler.func1
	go.opentelemetry.io/collector/[email protected]/internal/data/protogen/collector/metrics/v1/metrics_service.pb.go:311
go.opentelemetry.io/collector/config/configgrpc.enhanceWithClientInformation.func1
	go.opentelemetry.io/[email protected]/config/configgrpc/configgrpc.go:412
google.golang.org/grpc.chainUnaryInterceptors.func1.1
	google.golang.org/[email protected]/server.go:1162
go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc.UnaryServerInterceptor.func1
	go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/[email protected]/interceptor.go:341
google.golang.org/grpc.chainUnaryInterceptors.func1.1
	google.golang.org/[email protected]/server.go:1165
google.golang.org/grpc.chainUnaryInterceptors.func1
	google.golang.org/[email protected]/server.go:1167
go.opentelemetry.io/collector/pdata/internal/data/protogen/collector/metrics/v1._MetricsService_Export_Handler
	go.opentelemetry.io/collector/[email protected]/internal/data/protogen/collector/metrics/v1/metrics_service.pb.go:313
google.golang.org/grpc.(*Server).processUnaryRPC
	google.golang.org/[email protected]/server.go:1340
google.golang.org/grpc.(*Server).handleStream
	google.golang.org/[email protected]/server.go:1713
google.golang.org/grpc.(*Server).serveStreams.func1.2
	google.golang.org/[email protected]/server.go:965


aabmass commented Dec 20, 2022

Hey folks, thanks for raising this issue and debugging. Would one of you be willing to send a PR to fix this?

Longer term, I'm wondering if we should move away from using the Prometheus client library altogether and just generate prometheus text format on our own. This is what the JS SIG went with.
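For context, the Prometheus text exposition format that would need to be generated for a histogram looks roughly like this (illustrative metric name and values only):

# HELP http_server_duration_milliseconds Duration of inbound HTTP requests
# TYPE http_server_duration_milliseconds histogram
http_server_duration_milliseconds_bucket{le="5"} 0
http_server_duration_milliseconds_bucket{le="10"} 1
http_server_duration_milliseconds_bucket{le="+Inf"} 3
http_server_duration_milliseconds_sum 57
http_server_duration_milliseconds_count 3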

@robotadam

My test showed this failing with OTLP/gRPC export to an OpenTelemetry Collector exporting Prometheus, not using the built-in Prometheus exporter. So it may be that there are two separate issues, or that the underlying cause is deeper in the metrics generation rather than inside the exporter.

In the related discussion they called out the duplicate data_type fields in the error message:
failed to translate metric {"kind": "exporter", "data_type": "metrics", "name": "prometheus", "data_type": "\u0000", "metric_name": "http.server.duration"}
Perhaps that can point the way? If you have a suggestion of where to look, I can see if I spot something.

@robotadam

I think I isolated it a little further. In my testing this only occurs if there is a period where a previously-active histogram has no data points and there is a different metric with data. I dumped some data from the OTLP exporter before it went over the wire. When there is data in the histogram, the request looks like this:

...
scope_metrics {
    scope {
      name: "__main__"
    }
    metrics {
      name: "http.server.duration"
      description: "A histogram."
      unit: "ms"
      histogram {
        data_points {
          start_time_unix_nano: 1671666510029448000
          time_unix_nano: 1671666515034673000
          count: 1
          sum: 1
          bucket_counts: 0
...

When there is no data, the output looks like this:

...
  scope_metrics {
    scope {
      name: "__main__"
    }
    metrics {
      name: "http.server.duration"
      description: "A histogram."
      unit: "ms"
    }
...

So it's not just that data_points is empty; the histogram field doesn't exist as a key at all.

I don't know why this only causes an error when the collector receives another metric, but it does seem like the Histogram metric should not be included if there are no data points. Adding a continue in that case avoids sending it along (the shape of that guard is sketched after this comment).

I have no idea if that's a valid solution, but I wanted to share what I learned.
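For illustration only, the shape of the guard being described, with hypothetical names (the real change would live in the SDK's metrics export path, not in user code):

# Hypothetical sketch: filter out metrics that carry no data points for the
# current collection interval before they are handed to the exporter.
def drop_empty_metrics(metrics):
    for metric in metrics:
        data_points = getattr(getattr(metric, "data", None), "data_points", None)
        if data_points is not None and len(data_points) == 0:
            continue  # nothing recorded this interval; skip exporting it
        yield metric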

DraggonFantasy added a commit to DraggonFantasy/opentelemetry-python that referenced this issue Feb 22, 2023
@mplachter

Any update on this?


ocelotl commented Jun 7, 2023

@robotadam how are you running that script to produce histogram data points?

@robotadam

@ocelotl I ran it with uvicorn app:app --host 0.0.0.0 --port $PORT --workers 2 in a docker container. Details:

pyproject.toml:

[tool.poetry]
name = "oteltestapp"
version = "0.1.0"
description = ""
authors = ["Your Name <[email protected]>"]
readme = "README.md"

[tool.poetry.dependencies]
python = "^3.10"
fastapi = "^0.88.0"
opentelemetry-exporter-otlp = "^1.15.0"
uvicorn = "^0.20.0"
opentelemetry-instrumentation-fastapi = "^0.36b0"

[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"

docker-compose.yml:

version: "3.8"
services:
  app:
    build:
      context: .
    volumes:
      - ./app.py:/app/app.py
    ports:
      - "4000:4000"
  otel-collector:
    image: otel/opentelemetry-collector-contrib
    command: ["--config=/etc/otel-collector-config.yaml"]
    volumes:
      - ./otel-collector-config.yaml:/etc/otel-collector-config.yaml
    ports:
      - "8888:8888"   # Prometheus metrics exposed by the collector
      - "8889:8889"   # Prometheus exporter metrics
      - "13133:13133" # health_check extension
      - "4317:4317"   # OTLP gRPC receiver
      - "4318:4318"   # OTLP http receiver
      - "55679:55679" # zpages

Dockerfile:

FROM python:3.10-slim-bullseye

ENV PYTHONUNBUFFERED=1 \
    # prevents python creating .pyc files
    PYTHONDONTWRITEBYTECODE=1 \
    \
    # pip
    PIP_NO_CACHE_DIR=off \
    PIP_DISABLE_PIP_VERSION_CHECK=on \
    PIP_DEFAULT_TIMEOUT=100 \
    \
    # poetry
    POETRY_VERSION=1.2.2 \
    # make poetry install to this location
    POETRY_HOME="/opt/poetry" \
    # We will use our own created virtualenv to make it easier to cache
    POETRY_VIRTUALENVS_CREATE=false \
    # do not ask any interactive question
    POETRY_NO_INTERACTION=1 \
    # the path with our venv
    VENV_PATH=/opt/venv \
    APP_PATH=/app \
    PORT=4000

# prepend poetry and our venv to path
ENV PATH="$POETRY_HOME/bin:$VENV_PATH/bin:$PATH"

# Install builder dependencies
RUN apt-get update && \
    apt-get install -y curl
RUN curl -sSL https://install.python-poetry.org | python3 -

# copy project requirement files here to ensure they will be cached.
WORKDIR $APP_PATH
COPY . $APP_PATH
WORKDIR $APP_PATH

# Install Deps
RUN python -m venv $VENV_PATH
ENV VIRTUAL_ENV=$VENV_PATH
RUN poetry install --no-ansi --only main

CMD uvicorn app:app --host 0.0.0.0 --port $PORT --workers 2

The poetry.lock file is noisy, so I don't want to post it here, but you should be able to swap basically all of this out for a simple requirements file (a rough equivalent is sketched below). I don't have time to do that today, but if it will help I can see when I can free up time.
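A rough requirements.txt equivalent of the pyproject.toml dependencies above (same pins, untested):

fastapi>=0.88.0,<0.89.0
opentelemetry-exporter-otlp>=1.15.0,<2.0.0
opentelemetry-instrumentation-fastapi==0.36b0
uvicorn>=0.20.0,<0.21.0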

ocelotl self-assigned this Jun 9, 2023

crisog commented Dec 5, 2023

@ocelotl I have some spare time and a high interest in getting this fixed. If I come up with a fix, would you be able to review it, or will it have to go through a different process?


bastbu commented Dec 22, 2023

@ocelotl to clarify - will this be addressed via #3407 (based on the discussions around #3277 (comment))?


ocelotl commented Jan 11, 2024

@crisog I surely would, but there may already be a fix in #3429. @robotadam @mshebeko-twist @bastbu could you try again using the fix in #3429?


bastbu commented Jan 12, 2024

@ocelotl I can confirm that the fix solves the issue for me.
