
OpenTelemetry Learning Roadmap

OpenTelemetry is the leading, vendor-neutral standard for observability in modern software systems, enabling unified collection of logs, metrics, and traces. By providing standardized APIs, SDKs, and integration libraries, it allows developers and DevOps teams to instrument their codebases easily, gaining deep visibility into distributed systems, performance bottlenecks, and reliability issues. Learning OpenTelemetry is essential because it empowers professionals to monitor, analyze, and improve system health proactively in increasingly complex cloud-native environments, using open standards that avoid vendor lock-in and ensure compatibility across monitoring tools.

We are hiring

We are hiring an OpenTelemetry and Observability enthusiast who believes in the power of open-source observability. Please write to us at [email protected] with a note on why you are a good fit.


✅ Week 1-2: Fundamentals of Observability

📚 Topics

  1. Telemetry Data (Traces, Metrics, Logs)

    • Traces: Follow a request’s journey through microservices. Provides visibility into how services interact.
    • Metrics: Numerical measurements (e.g., CPU usage, request counts) for performance monitoring.
    • Logs: Time-stamped, structured text information for debugging.
  2. Semantic Conventions

    • Standardized naming and data structure conventions to make telemetry consistent across applications.
    • E.g., standardized attribute names like http.method, db.system.
  3. Instrumentation (Manual & Auto)

    • Manual Instrumentation: Explicit code changes to inject telemetry (e.g., creating spans manually; see the sketch after this list).
    • Auto Instrumentation: Using OpenTelemetry auto-instrumentation libraries to automatically capture telemetry without modifying application code.
  4. Analysis and Outcomes

    • How to analyze collected telemetry to identify bottlenecks, errors, and resource inefficiencies.
    • Define actionable outcomes like scaling services or fixing faulty dependencies.
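
A minimal manual-instrumentation sketch in Python for item 3 above (assuming the opentelemetry-sdk package is installed; the console exporter is a stand-in for a real backend such as Tempo):

  from opentelemetry import trace
  from opentelemetry.sdk.trace import TracerProvider
  from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter

  # Set up a tracer provider that prints finished spans to stdout.
  provider = TracerProvider()
  provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
  trace.set_tracer_provider(provider)

  tracer = trace.get_tracer("example.manual")

  # Manual instrumentation: explicitly create a span around a unit of work.
  with tracer.start_as_current_span("process-order") as span:
      span.set_attribute("order.id", "12345")  # illustrative attribute
      ...  # business logic here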

🛠️ Hands-On Labs

  • Instrument a sample Python/Java app with OpenTelemetry SDK, configuring a trace and metric exporter to Grafana Tempo and Prometheus.
  • Send logs to Grafana Loki.
  • Explore semantic conventions by labeling HTTP requests with standard attributes (see the sketch below).
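
A short sketch of the semantic-conventions lab: label an HTTP request span with standardized attribute names (the route and status code are illustrative; this assumes a tracer provider is already configured as in the earlier sketch):

  from opentelemetry import trace

  tracer = trace.get_tracer("example.semconv")

  # Standardized attribute names let any backend interpret the span the same way.
  with tracer.start_as_current_span("GET /users/{id}") as span:
      span.set_attribute("http.method", "GET")         # the HTTP verb
      span.set_attribute("http.route", "/users/{id}")  # low-cardinality route template
      span.set_attribute("http.status_code", 200)      # response status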

📝 Assignment

  1. Example: Instrument a sample Python Flask application with OpenTelemetry to collect HTTP request traces and export them to Grafana Tempo.
  2. Questions:
    • How do you decide between using manual vs auto instrumentation?
    • Explain the difference between the span kind “SERVER” and “CLIENT”.
    • Implement a custom span that measures the time taken by a database query (a sketch follows the task below).
  3. Task:
    • Instrument the app to collect:
      • Trace: HTTP request → DB query → External API call.
      • Metric: Request latency histograms
      • Log: Log contextual information for every HTTP request (include trace ID).
    • Visualize traces in Grafana Tempo and metrics in Grafana.
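
For the database-query question above, one possible sketch; the span's own start and end timestamps give the query duration, and execute() is a placeholder for your real database driver call:

  import time
  from opentelemetry import trace

  tracer = trace.get_tracer("example.db")

  def execute(sql: str):
      # Stand-in for a real database call; replace with your driver's query method.
      time.sleep(0.05)
      return []

  def run_query(sql: str):
      # Wrap the query in its own span; the span duration is the time taken by the query.
      with tracer.start_as_current_span("db.query") as span:
          span.set_attribute("db.system", "postgresql")  # illustrative value
          span.set_attribute("db.statement", sql)
          return execute(sql)

  run_query("SELECT * FROM users WHERE id = 1")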

✅ Week 3-4: OpenTelemetry API & SDK

📚 Topics

  1. Data Model

    • Defines how telemetry data is structured: Resources, InstrumentationScope (formerly InstrumentationLibrary), and the actual signals (spans, metrics, logs).
    • Understand how metadata (resource attributes like service name and environment) enriches the data.
  2. Composability & Extension

    • SDK allows chaining of processors and exporters to extend functionality.
    • Example: Add a custom processor to sanitize sensitive data before export.
  3. Configuration Approaches

    • Environment Variables: Simple overrides for configuration.
    • YAML Files: Declarative configuration of pipelines.
    • Programmatic API: Configure SDK in code for dynamic control.
  4. Signals (Tracing, Metrics, Logs)

    • Trace: Sequence of spans representing request path.
    • Metric: Time-series data.
    • Log: Structured textual events.
  5. SDK Pipelines

    • How signals flow through SDK: Instrumentation → Processors (batching, sampling) → Exporters.
    • Learn how to set up pipelines to optimize performance and reliability (see the sketch after this list).
  6. Agents

    • Lightweight components that run on the same host or in sidecar containers, collect local telemetry, and forward to the Collector.
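
A sketch of wiring an SDK pipeline programmatically (Instrumentation → Processor → Exporter); the service name and the Collector endpoint are assumptions, and the OTLP exporter requires the opentelemetry-exporter-otlp package:

  from opentelemetry import trace
  from opentelemetry.sdk.resources import Resource
  from opentelemetry.sdk.trace import TracerProvider
  from opentelemetry.sdk.trace.export import BatchSpanProcessor
  from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

  # Resource attributes (service name, environment) enrich every span from this service.
  resource = Resource.create({"service.name": "checkout", "deployment.environment": "dev"})
  provider = TracerProvider(resource=resource)

  # Pipeline: instrumentation -> batching processor -> OTLP exporter.
  exporter = OTLPSpanExporter(endpoint="localhost:4317", insecure=True)  # assumed Collector endpoint
  provider.add_span_processor(BatchSpanProcessor(exporter))

  trace.set_tracer_provider(provider)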

🛠️ Hands-On Labs

  • Write a custom metric exporter and a custom span exporter (e.g., exporting to a file or an HTTP endpoint).
  • Configure SDK using environment variables vs YAML vs programmatic configuration.
  • Build a context propagation demo across two microservices.

📝 Assignment

  • Implement an application that configures a custom sampling strategy in code.
  • Build a custom exporter that writes traces to a local file in JSON format (see the sketch below).
  • Demonstrate end-to-end context propagation, including trace IDs in logs and spans.
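
For the file-exporter assignment, one possible sketch (the output path and the choice of span fields are assumptions):

  import json
  from opentelemetry.sdk.trace.export import SpanExporter, SpanExportResult

  class JsonFileSpanExporter(SpanExporter):
      """Appends each batch of finished spans to a local JSON-lines file."""

      def __init__(self, path="spans.jsonl"):
          self._path = path

      def export(self, spans):
          with open(self._path, "a") as f:
              for span in spans:
                  ctx = span.get_span_context()
                  f.write(json.dumps({
                      "name": span.name,
                      "trace_id": format(ctx.trace_id, "032x"),
                      "span_id": format(ctx.span_id, "016x"),
                      "start_time": span.start_time,  # epoch nanoseconds
                      "end_time": span.end_time,
                      "attributes": dict(span.attributes or {}),
                  }) + "\n")
          return SpanExportResult.SUCCESS

      def shutdown(self):
          pass

Attach it with provider.add_span_processor(BatchSpanProcessor(JsonFileSpanExporter())), just like a built-in exporter.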

📝 Assignment – Practical Examples & Questions

  1. Example: Build a custom HTTP exporter that sends telemetry data to a dummy HTTP server you implement in Flask or Node.js.
  2. Questions:
    • How does context propagation work between microservices? Implement a proof-of-concept example where a trace context is passed via HTTP headers between two services.
    • Write a code snippet that programmatically configures an OpenTelemetry pipeline with custom sampling (e.g., sample only 10% of requests); a sketch follows the task below.
    • What are the advantages and disadvantages of environment variable configuration vs programmatic configuration?
  3. Task:
    • Build a microservices setup with:
      • Service A → Service B → Service C
      • Ensure trace context flows correctly between services.
    • Implement a custom processor in the SDK that drops any trace where http.status_code < 400.
    • Create a custom metric to count how many error traces occurred in the last 5 minutes.
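
For the custom-sampling question above, a minimal programmatic sketch using the SDK's built-in ratio sampler (ParentBased keeps child spans consistent with their parent's decision; 0.1 samples roughly 10% of new traces):

  from opentelemetry import trace
  from opentelemetry.sdk.trace import TracerProvider
  from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter
  from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

  # Sample ~10% of root traces; child spans follow the parent's sampling decision.
  sampler = ParentBased(root=TraceIdRatioBased(0.1))

  provider = TracerProvider(sampler=sampler)
  provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
  trace.set_tracer_provider(provider)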

✅ Week 5-6: OpenTelemetry Collector

📚 Topics

  1. Configuration

    • Learn declarative pipeline configuration (receiver → processor → exporter).
    • Explore YAML config examples and customize.
  2. Deployment Strategies

    • Standalone: Runs as a separate service.
    • Agent: Runs per host (collect local telemetry).
    • Sidecar: Deployed alongside the app in containers.
    • DaemonSet: Kubernetes native deployment per node.
  3. Scaling Strategies

    • Horizontal scaling by running multiple collector instances.
    • Load balancing telemetry data using the OTLP protocol.
  4. Pipelines

    • How multiple pipelines handle different types of data.
    • Example: One pipeline for traces → Grafana Tempo, another for metrics → Prometheus (a sample configuration follows this list).
  5. Transforming Data

    • Apply processors to modify, filter, or aggregate telemetry data.
    • E.g., transform logs into metrics for alerting.
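
A minimal Collector configuration sketch with two pipelines (the Tempo and Prometheus endpoints are assumptions for a local setup, and the prometheus exporter requires the contrib distribution):

  receivers:
    otlp:
      protocols:
        grpc:
        http:

  processors:
    batch:
      timeout: 5s
      send_batch_size: 1000

  exporters:
    otlp/tempo:              # traces to Grafana Tempo (assumed endpoint)
      endpoint: tempo:4317
      tls:
        insecure: true
    prometheus:              # metrics exposed for Prometheus to scrape
      endpoint: 0.0.0.0:8889

  service:
    pipelines:
      traces:
        receivers: [otlp]
        processors: [batch]
        exporters: [otlp/tempo]
      metrics:
        receivers: [otlp]
        processors: [batch]
        exporters: [prometheus]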

🛠️ Hands-On Labs

  • Deploy OpenTelemetry Collector on Kubernetes as a DaemonSet.
  • Configure two pipelines: traces to Grafana Tempo, metrics to Prometheus.
  • Apply a processor to filter out sensitive information.

📝 Assignment

  • Deploy a production-grade Collector pipeline for a sample microservices app.
  • Demonstrate high-volume data ingestion using Locust.
  • Tune resource limits and batching settings for optimal performance.

📝 Assignment – Practical Examples & Questions

  1. Example: Deploy an OpenTelemetry Collector in Kubernetes as a DaemonSet with two pipelines:

    • Pipeline 1: Traces → Grafana Tempo
    • Pipeline 2: Metrics → Prometheus
  2. Questions:

    • How would you design a scalable observability pipeline to handle 10,000 events per second?
    • Write a YAML configuration for the OpenTelemetry Collector that applies a batch processor with a timeout of 5 seconds and a max batch size of 1000 spans.
  3. Task:

    • Apply a processor to drop traces containing the attribute debug=true (see the sketch below).
    • Simulate 1000 concurrent requests to the instrumented application using Locust and monitor performance.
    • Measure CPU and memory usage of the Collector during load.
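
For the debug=true task, one option is the Collector's filter processor (contrib) with an OTTL condition; a sketch, assuming spans carry a boolean debug attribute:

  processors:
    filter/drop_debug:
      error_mode: ignore
      traces:
        span:
          - 'attributes["debug"] == true'

Add filter/drop_debug to the traces pipeline's processors list (before batch) to apply it.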

✅ Week 7: Maintaining & Debugging Observability Pipelines

📚 Topics

  1. Context Propagation

    • Critical to maintain correlation between distributed traces.
    • Understand how trace context is passed in HTTP headers (the W3C traceparent header) or gRPC metadata (see the sketch after this list).
  2. Debugging Pipelines

    • Techniques to troubleshoot missing data, delays, or misconfigurations.
    • Explore Collector logs and increase log verbosity (e.g., set service::telemetry::logs::level to debug in the Collector config).
  3. Error Handling

    • How the Collector handles backpressure, bad data, or network errors.
    • Implement retry strategies and dead-letter queues.
  4. Schema Management

    • Validate schema of incoming data.
    • Ensure consistency using OpenTelemetry schema files and schema URLs.
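
A small sketch of W3C trace-context propagation between two Python services (the requests library and the downstream URL are assumptions; in practice, framework auto-instrumentation usually performs the extract step on the server side):

  import requests
  from opentelemetry import trace
  from opentelemetry.propagate import inject, extract

  tracer = trace.get_tracer("example.propagation")

  # Client side: inject the current trace context into outgoing HTTP headers
  # (this adds the W3C traceparent header).
  def call_downstream():
      with tracer.start_as_current_span("call-service-b"):
          headers = {}
          inject(headers)
          return requests.get("http://service-b:8080/work", headers=headers)

  # Server side: extract the incoming context and start the server span under it,
  # so both services appear in the same trace.
  def handle_request(incoming_headers: dict):
      ctx = extract(incoming_headers)
      with tracer.start_as_current_span("handle-work", context=ctx):
          ...  # do the work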

🛠️ Hands-On Labs

  • Intentionally break context propagation and trace the flow to identify gaps.
  • Apply schema validation and introduce a malformed span to observe behavior.
  • Simulate network errors and monitor retry behavior.

📝 Assignment

  • Write a troubleshooting report documenting how you diagnosed and fixed a broken observability pipeline.
  • Demonstrate how to apply schema validation to prevent bad data ingestion.

📝 Assignment – Practical Examples & Questions

  1. Example: Introduce a misconfigured collector receiver (wrong port) and debug why no data is collected.

  2. Questions:

    • Describe how to debug missing trace data in Grafana Tempo.
    • How do you implement schema validation in the OpenTelemetry pipeline?
    • Implement a rule that drops any span whose http.method is neither GET nor POST.
  3. Task:

    • Intentionally misconfigure context propagation and debug how to fix the broken trace.
    • Configure the Collector to send telemetry errors to a dead-letter queue (file or HTTP endpoint).
    • Document the debugging steps taken, tools used, and lessons learned.

✅ Week 8: Capstone Project & Expert Use Cases

🚀 Final Project

Design and deploy a full observability solution:

  • Instrumented microservices in Kubernetes.
  • OpenTelemetry Collector deployed as a DaemonSet.
  • Traces exported to Tempo, metrics to Prometheus + Grafana.
  • Custom processors to mask sensitive data.
  • Custom exporter to an HTTP service.

📚 Deliverables

  1. Architecture diagram.
  2. Configuration files (Helm, YAML).
  3. Documentation explaining design choices.
  4. Evidence of load testing and scaling (performance benchmarks).
  5. Debugging report solving real-world issues (e.g., missing traces).

Additional Learning

  • Learn about the ecosystem and explore products and their differences, such as Prometheus, Grafana, and Loki (the PLG stack), VictoriaMetrics, Thanos, Cortex, ClickHouse (ClickStack), SigNoz, Datadog, and so on. You will find good blog posts on Observability on our blog.

✅ Recommended Tools

  • OpenTelemetry Collector & Contrib
  • Grafana Tempo + Loki + Prometheus
  • Prometheus + Grafana
  • OpenTelemetry SDKs (Python, Java, Go)
  • Kubernetes + Helm
  • Locust (for load testing)
  • Postman or curl (for API calls)
