Trace-Log-Metric Correlation Scheme #320

Open
zzhutianyu opened this issue Oct 13, 2021 · 0 comments
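Correlating Metrics with Traces

The exemplar mechanism

Prometheus

Prometheus mainly relies on exemplars to carry extra information alongside metrics. Exemplars are exposed together with the metrics through the same metrics endpoint; see the OpenMetrics specification: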
https://github.com/OpenObservability/OpenMetrics/blob/main/specification/OpenMetrics.md#exemplars-1

# Everything after the second '#' on a sample line is the exemplar
# Format: labels, sampled value, sample timestamp (optional)
foo_bucket{le="0.1"} 8 # {} 0.054
foo_bucket{le="1"} 11 # {trace_id="KOO5S4vxi0o"} 0.67
foo_bucket{le="10"} 17 # {trace_id="oHg5SJYRHA0"} 9.8 1520879607.789
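
To make the layout of an exemplar concrete, here is a minimal Go sketch that splits one exposition line into its exemplar labels, value, and optional timestamp. The names `Exemplar` and `parseExemplar` are made up for this illustration, and the parser is deliberately simplified: it assumes label values contain no `,`, `}`, or ` # `, which a real OpenMetrics parser would have to handle.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Exemplar holds the parts that follow " # " on an OpenMetrics sample line.
// Hypothetical type for illustration only.
type Exemplar struct {
	Labels    map[string]string
	Value     float64
	Timestamp float64 // seconds since the Unix epoch
	HasTime   bool    // false when the optional timestamp is absent
}

// parseExemplar extracts the exemplar from one exposition line.
// ok is false when the line carries no exemplar at all.
func parseExemplar(line string) (ex Exemplar, ok bool, err error) {
	i := strings.Index(line, " # ")
	if i < 0 {
		return ex, false, nil
	}
	rest := strings.TrimSpace(line[i+3:])
	end := strings.Index(rest, "}")
	if !strings.HasPrefix(rest, "{") || end < 0 {
		return ex, false, fmt.Errorf("malformed exemplar labels in %q", line)
	}
	ex.Labels = map[string]string{}
	if body := rest[1:end]; body != "" {
		for _, pair := range strings.Split(body, ",") {
			eq := strings.Index(pair, "=")
			if eq < 0 {
				return ex, false, fmt.Errorf("malformed label %q", pair)
			}
			ex.Labels[strings.TrimSpace(pair[:eq])] = strings.Trim(pair[eq+1:], `"`)
		}
	}
	// After the label set come the sampled value and, optionally, a timestamp.
	fields := strings.Fields(rest[end+1:])
	if len(fields) < 1 || len(fields) > 2 {
		return ex, false, fmt.Errorf("expected value and optional timestamp in %q", line)
	}
	if ex.Value, err = strconv.ParseFloat(fields[0], 64); err != nil {
		return ex, false, err
	}
	if len(fields) == 2 {
		if ex.Timestamp, err = strconv.ParseFloat(fields[1], 64); err != nil {
			return ex, false, err
		}
		ex.HasTime = true
	}
	return ex, true, nil
}

func main() {
	line := `foo_bucket{le="10"} 17 # {trace_id="oHg5SJYRHA0"} 9.8 1520879607.789`
	ex, ok, err := parseExemplar(line)
	fmt.Println(ex.Labels["trace_id"], ex.Value, ex.HasTime, ok, err)
}
```

Note that the first sample above (`le="0.1"`) has an empty label set and no timestamp; the sketch accepts both cases.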

Injection approach

c := GetPlayURLTotal.WithLabelValues(
    strconv.FormatInt(int64(callerType), 10),
    strconv.FormatInt(int64(device.GetOs()), 10),
    strconv.FormatInt(int64(device.GetNetwork()), 10),
    videoFormat,
)
sp := trace.SpanFromContext(ctx).SpanContext()
// Attach an exemplar only for sampled spans. Further conditions can be added
// here so that the selected exemplars are more representative.
if adder, ok := c.(prometheus.ExemplarAdder); ok && sp.IsSampled() {
    // For a histogram, assert to prometheus.ExemplarObserver and call
    // ObserveWithExemplar instead.
    adder.AddWithExemplar(1, prometheus.Labels{
        "traceID": sp.TraceID().String(),
    })
} else {
    c.Inc()
}

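OTLP

The OTLP protocol has an Exemplar field, so a sampled span can be associated with a metric when the metric is reported. The OTLP SDK injects this automatically: because trace, log, and metric share the same OTLP context, no manual correlation is needed.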

// A representation of an exemplar, which is a sample input measurement.
// Exemplars also hold information about the environment when the measurement
// was recorded, for example the span and trace ID of the active span when the
// exemplar was recorded.
message Exemplar {
  // The set of key/value pairs that were filtered out by the aggregator, but
  // recorded alongside the original measurement. Only key/value pairs that were
  // filtered out by the aggregator should be included
  repeated opentelemetry.proto.common.v1.KeyValue filtered_attributes = 7;

  // Labels is deprecated and will be removed soon.
  // 1. Old senders and receivers that are not aware of this change will
  // continue using the `filtered_labels` field.
  // 2. New senders, which are aware of this change MUST send only
  // `filtered_attributes`.
  // 3. New receivers, which are aware of this change MUST convert this into
  // `filtered_labels` by simply converting all int64 values into float.
  //
  // This field will be removed in ~3 months, on July 1, 2021.
  repeated opentelemetry.proto.common.v1.StringKeyValue filtered_labels = 1 [deprecated = true];

  // time_unix_nano is the exact time when this exemplar was recorded
  //
  // Value is UNIX Epoch time in nanoseconds since 00:00:00 UTC on 1 January
  // 1970.
  fixed64 time_unix_nano = 2;

  // The value of the measurement that was recorded. An exemplar is
  // considered invalid when one of the recognized value fields is not present
  // inside this oneof.
  oneof value {
    double as_double = 3;
    sfixed64 as_int = 6;
  }

  // (Optional) Span ID of the exemplar trace.
  // span_id may be missing if the measurement is not recorded inside a trace
  // or if the trace is not sampled.
  bytes span_id = 4;

  // (Optional) Trace ID of the exemplar trace.
  // trace_id may be missing if the measurement is not recorded inside a trace
  // or if the trace is not sampled.
  bytes trace_id = 5;
}
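
As a rough illustration of how a receiver might use these fields, the Go sketch below turns the raw `trace_id` and `span_id` bytes into hex strings that can be used to look up the trace. The `Exemplar` struct here is a hand-written stand-in for a subset of the message, not the generated protobuf type, and `traceLink` is a hypothetical helper. It relies on OTLP trace IDs being 16 bytes and span IDs 8 bytes.

```go
package main

import (
	"encoding/hex"
	"fmt"
)

// Exemplar mirrors a subset of the OTLP Exemplar message.
// Hand-written stand-in for illustration, not the generated type.
type Exemplar struct {
	TimeUnixNano uint64
	AsDouble     float64
	SpanID       []byte // 8 bytes when present
	TraceID      []byte // 16 bytes when present
}

// traceLink returns the hex-encoded trace and span IDs of an exemplar.
// ok is false when the IDs are missing, which happens when the measurement
// was not recorded inside a trace or the trace was not sampled.
func traceLink(e Exemplar) (traceID, spanID string, ok bool) {
	if len(e.TraceID) != 16 || len(e.SpanID) != 8 {
		return "", "", false
	}
	return hex.EncodeToString(e.TraceID), hex.EncodeToString(e.SpanID), true
}

func main() {
	e := Exemplar{
		TimeUnixNano: 1634112000000000000,
		AsDouble:     0.67,
		SpanID:       []byte{0x00, 0xf0, 0x67, 0xaa, 0x0b, 0xa9, 0x02, 0xb7},
		TraceID: []byte{0x4b, 0xf9, 0x2f, 0x35, 0x77, 0xb3, 0x4d, 0xa6,
			0xa3, 0xce, 0x92, 0x9d, 0x0e, 0x0e, 0x47, 0x36},
	}
	traceID, spanID, ok := traceLink(e)
	fmt.Println(traceID, spanID, ok)
}
```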

Prometheus storage approach (used by tjg)

https://github.com/prometheus/prometheus/pull/6635/files
Prometheus implements a circular buffer in contiguous memory to store exemplars, along with a corresponding query API:

$ curl -g 'http://localhost:9090/api/v1/query_exemplars?query=test_exemplar_metric_total&start=2020-09-14T15:22:25.479Z&end=2020-09-14T15:23:25.479Z'
{
    "status": "success",
    "data": [
        {
            "seriesLabels": {
                "__name__": "test_exemplar_metric_total",
                "instance": "localhost:8090",
                "job": "prometheus",
                "service": "bar"
            },
            "exemplars": [
                {
                    "labels": {
                        "traceID": "EpTxMJ40fUus7aGY"
                    },
                    "value": "6",
                    "timestamp": 1600096945.479
                }
            ]
        },
        {
            "seriesLabels": {
                "__name__": "test_exemplar_metric_total",
                "instance": "localhost:8090",
                "job": "prometheus",
                "service": "foo"
            },
            "exemplars": [
                {
                    "labels": {
                        "traceID": "Olp9XHlq763ccsfa"
                    },
                    "value": "19",
                    "timestamp": 1600096955.479
                },
                {
                    "labels": {
                        "traceID": "hCtjygkIHwAN9vs4"
                    },
                    "value": "20",
                    "timestamp": 1600096965.489
                }
            ]
        }
    ]
}
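
The circular-buffer idea behind this storage can be sketched in a few lines of Go. This is only a conceptual illustration under simplifying assumptions (a single buffer, no per-series index, no locking, capacity greater than zero); it is not how Prometheus actually lays out its exemplar storage, and the names `ring`, `Add`, and `Query` are made up here.

```go
package main

import "fmt"

// Exemplar is a simplified record: one trace ID plus the sampled value and time.
type Exemplar struct {
	TraceID   string
	Value     float64
	Timestamp float64 // seconds since the Unix epoch
}

// ring is a fixed-capacity circular buffer. Once it is full, each new
// exemplar overwrites the oldest one, so memory use stays bounded.
type ring struct {
	buf  []Exemplar
	next int  // index of the slot the next Add writes to
	full bool // true once the buffer has wrapped around at least once
}

// newRing allocates the backing slice once; capacity must be positive.
func newRing(capacity int) *ring {
	return &ring{buf: make([]Exemplar, capacity)}
}

func (r *ring) Add(e Exemplar) {
	r.buf[r.next] = e
	r.next = (r.next + 1) % len(r.buf)
	if r.next == 0 {
		r.full = true
	}
}

// Query returns the stored exemplars with start <= Timestamp <= end, oldest first.
func (r *ring) Query(start, end float64) []Exemplar {
	n, first := r.next, 0
	if r.full {
		n, first = len(r.buf), r.next
	}
	var out []Exemplar
	for i := 0; i < n; i++ {
		e := r.buf[(first+i)%len(r.buf)]
		if e.Timestamp >= start && e.Timestamp <= end {
			out = append(out, e)
		}
	}
	return out
}

func main() {
	r := newRing(2)
	r.Add(Exemplar{TraceID: "EpTxMJ40fUus7aGY", Value: 6, Timestamp: 1600096945.479})
	r.Add(Exemplar{TraceID: "Olp9XHlq763ccsfa", Value: 19, Timestamp: 1600096955.479})
	r.Add(Exemplar{TraceID: "hCtjygkIHwAN9vs4", Value: 20, Timestamp: 1600096965.489})
	// The first exemplar has been overwritten because the capacity is 2.
	for _, e := range r.Query(1600096900, 1600097000) {
		fmt.Println(e.TraceID, e.Value)
	}
}
```

The trade-off is that old exemplars are silently dropped once the buffer wraps, which is acceptable because exemplars are samples for drill-down, not a complete record.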


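Correlating Logs with Traces

Correlating logs with traces is fairly simple: as long as the TraceId and SpanId of the current trace are obtained when the log is printed, each individual log line can be associated with its trace.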

Plain-text log:
timestamp= TraceId=xxxx SpanId=xxxxx
JSON log:
{"trace_id": "xxx", "span_id": "xxx", "log": "xxxx"}

Finally, the logs are cleaned and ingested with trace_id and span_id tagged, which makes the log-trace linkage possible.
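
A minimal Go sketch of the logging side, producing the JSON form described above. To stay self-contained it does not use a real tracing SDK: `traceInfo`, `withTrace`, `logLine`, and `demo` are hypothetical names, and the IDs are carried on the `context.Context` by hand, standing in for whatever span context the tracing library provides.

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"time"
)

type ctxKey struct{}

// traceInfo is a hypothetical stand-in for the span context that a tracing
// SDK would normally attach to the request context.
type traceInfo struct {
	TraceID string
	SpanID  string
}

func withTrace(ctx context.Context, t traceInfo) context.Context {
	return context.WithValue(ctx, ctxKey{}, t)
}

// logLine renders one JSON log record and tags it with trace_id and span_id
// when the context carries trace information.
func logLine(ctx context.Context, now time.Time, msg string) (string, error) {
	rec := map[string]string{
		"timestamp": now.UTC().Format(time.RFC3339),
		"log":       msg,
	}
	if t, ok := ctx.Value(ctxKey{}).(traceInfo); ok {
		rec["trace_id"] = t.TraceID
		rec["span_id"] = t.SpanID
	}
	b, err := json.Marshal(rec)
	return string(b), err
}

// demo builds a traced context and logs one message with a fixed timestamp.
func demo() (string, error) {
	ctx := withTrace(context.Background(), traceInfo{
		TraceID: "4bf92f3577b34da6a3ce929d0e0e4736",
		SpanID:  "00f067aa0ba902b7",
	})
	now := time.Date(2021, time.October, 13, 8, 0, 0, 0, time.UTC)
	return logLine(ctx, now, "play URL resolved")
}

func main() {
	line, err := demo()
	fmt.Println(line, err)
}
```

Because the IDs are read from the context at log time, the ingestion pipeline only needs to index the two fields; no join against trace data is required.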

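With the OTLP SDK, this correlation can eventually happen by default, because the Context is shared.

Storing exemplars in the monitoring system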

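InfluxDB does not currently support ingesting exemplars, so within the existing storage architecture the monitoring system can store exemplars in ES, which also avoids the high-cardinality problem.
The required changes are:

  • The Prometheus data parsing must support parsing exemplars and reporting them
  • transfer must support writing exemplar data to ES
  • saas must support querying exemplar data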
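
One possible shape for an exemplar document in ES is sketched below in Go. The issue does not define a schema, so every field name here (`metric`, `series_labels`, `trace_id`, `value`, `timestamp`) and the helper `newDoc` are assumptions made purely for illustration. The point it tries to show is that the trace ID lives as an ordinary document field rather than as a time-series tag, which is why it does not add to series cardinality.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// exemplarDoc is a hypothetical ES document layout; the issue does not
// specify one. trace_id is a plain field, not a series tag.
type exemplarDoc struct {
	Metric       string            `json:"metric"`
	SeriesLabels map[string]string `json:"series_labels"`
	TraceID      string            `json:"trace_id"`
	Value        float64           `json:"value"`
	Timestamp    float64           `json:"timestamp"` // seconds since the Unix epoch
}

// newDoc flattens one exemplar and the labels of its series into a document.
func newDoc(metric string, labels map[string]string, traceID string, value, ts float64) exemplarDoc {
	return exemplarDoc{
		Metric:       metric,
		SeriesLabels: labels,
		TraceID:      traceID,
		Value:        value,
		Timestamp:    ts,
	}
}

func main() {
	doc := newDoc(
		"test_exemplar_metric_total",
		map[string]string{"instance": "localhost:8090", "service": "bar"},
		"EpTxMJ40fUus7aGY", 6, 1600096945.479,
	)
	b, err := json.Marshal(doc)
	fmt.Println(string(b), err)
}
```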