feat(db-builder): integrate Weaver for semconv compliance checking (#97) by SurbhiAgarwal1 · Pull Request #382 · open-telemetry/opentelemetry-ecosystem-explorer

SurbhiAgarwal1 · 2026-05-07T03:31:28Z

Technical Detail: Semantic Convention Integration (Issue #97)

This document provides a technical deep-dive into the implementation of the Semantic Convention compliance pipeline in the explorer-db-builder.

1. Architectural Overview

The integration follows a "sidecar" enrichment pattern. Instead of modifying the core data structures, we introduce a SemconvEnricher that evaluates telemetry metadata against standard OTel registries using the OpenTelemetry Weaver engine.

Data Flow

- Extraction: Retrieve metrics and spans from the normalized InstrumentationData.
- Translation: Map OTel signals to a Weaver-compatible "Application Registry".
- Evaluation: Execute weaver registry check against a specific semconv version.
- Annotation: Persist compliance status back to the telemetry metadata.
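
The four stages can be sketched as a small pipeline. This is a minimal illustration, not the PR's implementation: the `metrics` and `spans` keys, the `check` callable standing in for Weaver, and all helper names are assumptions made for the example.

```python
from typing import Callable, Dict, List

# Stand-in for the Weaver evaluation step: maps signal IDs to pass/fail.
Checker = Callable[[List[str]], Dict[str, bool]]


def extract(instrumentation: dict) -> List[dict]:
    """Extraction: gather metrics and spans (key names are assumed)."""
    metrics = list(instrumentation.get("metrics", []))
    spans = list(instrumentation.get("spans", []))
    return metrics + spans


def signal_id(instrumentation_name: str, signal: dict) -> str:
    """Translation: metrics keep their name; spans get a synthetic ID."""
    if "name" in signal:
        return signal["name"]
    return f"{instrumentation_name}.{signal['span_kind']}"


def enrich(instrumentation: dict, semconv_version: str, check: Checker) -> dict:
    """Run all four stages and annotate compliant signals in place."""
    signals = extract(instrumentation)
    ids = [signal_id(instrumentation["name"], s) for s in signals]
    results = check(ids)  # Evaluation (Weaver in the real pipeline)
    for sid, signal in zip(ids, signals):  # Annotation
        if results.get(sid, False):
            signal.setdefault("semconv_compliance", []).append(semconv_version)
    return instrumentation
```

Passing the checker as a parameter keeps the sketch runnable without Weaver installed, for example `enrich(data, "1.37.0", lambda ids: {i: True for i in ids})`.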

2. Component: `SemconvEnricher`

Location: explorer_db_builder/semconv_enricher.py

This is the primary orchestrator for compliance checking.

Transformation Logic

The enricher generates a temporary directory containing:

- manifest.yaml: Defines the instrumentation name and the dependency on the official OTel semantic convention registry (e.g., github.com/open-telemetry/semantic-conventions@v1.37.0).
- telemetry.yaml: Translates internal metadata into Weaver's definition format.
  - Metrics: Defined with type: metric and attributes using the ref keyword to ensure Weaver validates them against the registry's definitions.
  - Spans: Defined with type: span, using synthetic IDs based on the instrumentation name and span kind (e.g., activej-http.SERVER).
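
As a rough sketch of the translation step, the two files could be built as plain dictionaries before serialization. The `groups`, `type`, and `ref` keys and the synthetic span ID follow the description above; the remaining key names (`dependencies`, `registry_path`, `metric_name`, `span_kind`), the use of the metric name as its group ID, and the input shape are assumptions for illustration and may not match Weaver's actual schema or the PR's code.

```python
from typing import List


def build_manifest(instrumentation_name: str, semconv_version: str) -> dict:
    """Manifest declaring a dependency on the official semconv registry.

    Key names here are illustrative assumptions, not a confirmed schema.
    """
    return {
        "name": instrumentation_name,
        "dependencies": [
            {
                "name": "otel",
                "registry_path": (
                    "github.com/open-telemetry/semantic-conventions"
                    f"@v{semconv_version}"
                ),
            }
        ],
    }


def build_telemetry(
    instrumentation_name: str, metrics: List[dict], spans: List[dict]
) -> dict:
    """Translate signals into registry groups whose attributes use `ref`."""
    groups = []
    for metric in metrics:
        groups.append({
            "id": metric["name"],  # assumed ID scheme for metrics
            "type": "metric",
            "metric_name": metric["name"],
            "unit": metric.get("unit", ""),
            "attributes": [{"ref": a} for a in metric.get("attributes", [])],
        })
    for span in spans:
        kind = span["span_kind"]
        groups.append({
            "id": f"{instrumentation_name}.{kind}",  # e.g. activej-http.SERVER
            "type": "span",
            "span_kind": kind.lower(),
            "attributes": [{"ref": a} for a in span.get("attributes", [])],
        })
    return {"groups": groups}
```

Either dictionary could then be written to the temporary directory with a YAML serializer such as PyYAML's `yaml.safe_dump`.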

Weaver Invocation

The enricher calls the weaver CLI via a subprocess.

Success Condition: If weaver registry check exits with code 0, all signals defined in the generated registry are considered compliant.
Error Handling: If Weaver exits with a non-zero code, the enricher scans the stderr output for signal IDs and marks any signal whose ID appears there as non-compliant.
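
The exit-code contract can be illustrated with a small wrapper. This is a sketch under assumptions: the function name, the injectable `run` parameter, and the choice to return `None` when the binary cannot be started are made up for the example (the PR lets that exception propagate to the caller instead). The command line itself is the one used in the PR.

```python
import subprocess
from typing import Callable, Dict, Iterable, Optional


def check_registry(
    registry_dir: str,
    signal_ids: Iterable[str],
    weaver_path: str = "weaver",
    run: Callable = subprocess.run,
) -> Optional[Dict[str, bool]]:
    """Return per-signal compliance, or None when Weaver could not be run."""
    cmd = [weaver_path, "registry", "check", "-r", registry_dir]
    try:
        result = run(cmd, capture_output=True, text=True)
    except OSError:
        # Covers a missing or non-executable weaver binary.
        return None

    # Exit code 0: every signal in the generated registry is compliant.
    compliance = {sid: True for sid in signal_ids}
    if result.returncode != 0:
        # Non-zero exit: flag each signal whose ID appears in stderr.
        for sid in compliance:
            if sid in result.stderr:
                compliance[sid] = False
    return compliance
```

Because `run` is a parameter, the three outcomes (success, validation errors, missing binary) can be exercised with a fake runner instead of a real Weaver installation.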

3. Pipeline Integration

Location: explorer_db_builder/main.py

The enrichment stage is integrated into process_version immediately after the transform_instrumentation_format call.

transformed_inventory = transform_instrumentation_format(inventory)

# Enrich with semantic convention compliance
try:
    enricher = SemconvEnricher()
    enricher.enrich_inventory(transformed_inventory)
except Exception as e:
    logger.warning(f"Semantic convention enrichment failed: {e}")

This placement ensures that:

- Enrichment works on normalized, clean data.
- The pipeline remains resilient (a Weaver failure does not crash the build).

4. Frontend & Metadata Schema

Location: ecosystem-explorer/src/types/javaagent.ts

The compliance status is persisted as a semconv_compliance array on individual telemetry signals:

{
  "name": "http.server.request.duration",
  "unit": "s",
  "semconv_compliance": ["1.37.0"]
}

This structure is extensible: each signal can be marked as compliant with multiple semantic convention versions over time.
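
A minimal sketch of how a version might be appended to the array without duplicates is shown below. The helper name `mark_compliant` is hypothetical, and the numeric sort assumes plain `MAJOR.MINOR.PATCH` version strings.

```python
def mark_compliant(signal: dict, semconv_version: str) -> dict:
    """Record a semconv version on a signal, keeping the list unique and sorted."""
    versions = signal.setdefault("semconv_compliance", [])
    if semconv_version not in versions:
        versions.append(semconv_version)
        # Sort numerically so "1.9.0" precedes "1.37.0".
        versions.sort(key=lambda v: tuple(int(part) for part in v.split(".")))
    return signal
```

Keeping the update idempotent means re-running the builder against the same semconv version leaves the stored metadata unchanged.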

5. Verification & Testing

Location: tests/test_semconv_enricher.py

A dedicated test suite validates the following:

- YAML Generation: Ensures the generated manifest.yaml and telemetry.yaml are valid and follow Weaver's specification.
- Version Extraction: Tests the regex-based extraction of versions from OTel schema URLs.
- Mocked CLI Interactions: Simulates various Weaver output scenarios (total success, partial failure, and system errors) to verify that the metadata is updated correctly.
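
For the version extraction case, a regex along the following lines would cover schema URLs of the form `https://opentelemetry.io/schemas/1.37.0`. The function name and the exact pattern are assumptions for illustration; the PR's own regex is not shown on this page.

```python
import re
from typing import Optional

# Matches a trailing "/schemas/<major>.<minor>.<patch>" path segment.
_SCHEMA_VERSION = re.compile(r"/schemas/(\d+\.\d+\.\d+)/?$")


def extract_semconv_version(schema_url: Optional[str]) -> Optional[str]:
    """Return the semconv version embedded in a schema URL, if any."""
    if not schema_url:
        return None
    match = _SCHEMA_VERSION.search(schema_url)
    return match.group(1) if match else None
```

Returning `None` for missing or unrecognized URLs lets the caller decide whether to skip the check or fall back to a default version.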

…pen-telemetry#242)
- Extend InventoryManager to discover and load library READMEs from registry
- Augment instrumentation metadata with markdown_hash and enable backfilling
- Implement markdown publishing to public data directory in DatabaseWriter
- Add frontend types and API support for README lazy loading

…pen-telemetry#97)
- Implement SemconvEnricher to validate telemetry via OTel Weaver
- Insert enrichment stage into the javaagent builder pipeline
- Add semconv_compliance field to Metric and Span models
- Support dynamic versioning based on instrumentation schema_url

netlify · 2026-05-07T03:31:33Z

✅ Deploy Preview for otel-ecosystem-explorer ready!

Name	Link
🔨 Latest commit	`36e72b0`
🔍 Latest deploy log	https://app.netlify.com/projects/otel-ecosystem-explorer/deploys/69fc0793a83ede0008a9dda3
😎 Deploy Preview	https://deploy-preview-382--otel-ecosystem-explorer.netlify.app

Copilot

Pull request overview

Integrates semantic-convention (semconv) compliance checking into the Explorer DB build pipeline using the OpenTelemetry Weaver CLI, and adds support for publishing and serving per-library README markdown content (content-addressed via a hash) to the frontend.

Changes:

Add SemconvEnricher to generate a Weaver registry from instrumentation telemetry and annotate metrics/spans with semconv_compliance.
Publish library README markdown files to the generated database and backfill/augment instrumentations with markdown_hash.
Extend frontend types and API helpers to support semconv_compliance and README loading.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 7 comments.


File	Description
`SEMCONV_INTEGRATION_DETAIL.md`	Adds a technical deep-dive doc describing the semconv enrichment pipeline and schema updates.
`ecosystem-explorer/src/types/javaagent.ts`	Extends TS types to include `markdown_hash` and `semconv_compliance` on signals.
`ecosystem-explorer/src/lib/api/javaagent-data.ts`	Adds an API helper to fetch published README markdown files.
`ecosystem-automation/java-instrumentation-watcher/src/java_instrumentation_watcher/inventory_manager.py`	Adds helpers to scan/read content-addressed README markdown files per version.
`ecosystem-automation/java-instrumentation-watcher/src/java_instrumentation_watcher/__init__.py`	Adds a dev fallback for `__version__` when package metadata isn’t present.
`ecosystem-automation/explorer-db-builder/tests/test_semconv_enricher.py`	Adds unit tests for version extraction, YAML generation, and mocked Weaver interactions.
`ecosystem-automation/explorer-db-builder/src/explorer_db_builder/semconv_enricher.py`	Introduces the Weaver-based semconv compliance enricher.
`ecosystem-automation/explorer-db-builder/src/explorer_db_builder/metadata_backfiller.py`	Allows `markdown_hash` to be backfilled across versions.
`ecosystem-automation/explorer-db-builder/src/explorer_db_builder/main.py`	Wires semconv enrichment into `process_version` and publishes/augments README markdown during the build.
`ecosystem-automation/explorer-db-builder/src/explorer_db_builder/database_writer.py`	Adds `write_markdown` to publish README markdown into the DB output.
`ecosystem-automation/explorer-db-builder/src/explorer_db_builder/__init__.py`	Adds a dev fallback for `__version__` when package metadata isn’t present.
`ecosystem-automation/collector-watcher/src/collector_watcher/__init__.py`	Adds a dev fallback for `__version__` when package metadata isn’t present.

+        # If weaver is not found, it will raise an exception which is caught in enrich_instrumentation.
+
+        cmd = [self.weaver_path, "registry", "check", "-r", registry_dir]
+        result = subprocess.run(cmd, capture_output=True, text=True)


+        # Initially assume all are compliant if Weaver succeeded
+        # We need to know which signals we defined to populate the map.
+        # We'll read them back from the generated yaml.
+        with open(os.path.join(registry_dir, "telemetry.yaml")) as f:
+            telemetry_data = yaml.safe_load(f)
+            for group in telemetry_data.get("groups", []):
+                compliance_map[group["id"]] = True
+
+        if result.returncode != 0:
+            # Parse errors to mark specific signals as non-compliant
+            # Example error line: [Error] groups[0].attributes[1]: attribute 'foo' not found in registry
+            # This is complex to parse robustly without a stable Weaver output format.
+            # For the POC, if Weaver fails, we mark everything as non-compliant or log it.
+            logger.debug(f"Weaver reported errors:\n{result.stderr}")
+
+            # Simple heuristic: if an ID appears in an error line, mark it as non-compliant
+            for signal_id in compliance_map.keys():
+                if signal_id in result.stderr:
+                    compliance_map[signal_id] = False
+
+        return compliance_map
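
One limitation of plain substring matching is that a signal ID that is a prefix of another (for example `http.server` and `http.server.request.duration`) would be flagged by an error that only mentions the longer ID. A possible refinement is sketched below. It assumes that IDs in Weaver's output are delimited by characters other than letters, digits, `_`, `.`, or `-`, which is an assumption about the output format rather than documented behavior, and the function name is hypothetical.

```python
import re
from typing import Dict, Iterable


def flag_failed_signals(signal_ids: Iterable[str], stderr: str) -> Dict[str, bool]:
    """Map each signal ID to True (compliant) unless it appears as a whole token."""
    compliance = {}
    for sid in signal_ids:
        # Reject matches embedded in a longer dotted or hyphenated identifier,
        # while still allowing a sentence-ending period after the ID.
        pattern = re.compile(
            rf"(?<![\w.\-]){re.escape(sid)}(?![\w\-]|\.\w)"
        )
        compliance[sid] = pattern.search(stderr) is None
    return compliance
```

This keeps the heuristic's simplicity while avoiding false positives between IDs that share a prefix.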


+        file_path = self.get_version_dir(version) / self.README_DIR / f"{library_name}-{markdown_hash}.md"
+        if not file_path.exists():
+            return None


+  const response = await fetch(`${BASE_PATH}/markdown/${libraryName}-${markdownHash}.md`);
+  if (!response.ok) {
+    throw new Error(`Failed to load README for ${libraryName}`);
+  }
+  return response.text();


+    def write_markdown(self, library_name: str, markdown_hash: str, content: str) -> None:
+        """Write markdown file to the database.
+
+        Args:
+            library_name: Name of the library
+            markdown_hash: Hash of the markdown content
+            content: Markdown content string
+        """
+        markdown_dir = self.database_dir / "markdown"
+        markdown_dir.mkdir(parents=True, exist_ok=True)
+        file_path = markdown_dir / f"{library_name}-{markdown_hash}.md"
+
+        if file_path.exists():
+            logger.debug(f"Markdown for '{library_name}' with hash {markdown_hash} already exists, skipping write")
+            return
+
+        try:
+            with open(file_path, "w", encoding="utf-8") as f:
+                f.write(content)
+            file_size = len(content.encode("utf-8"))
+            self.files_written += 1
+            self.total_bytes += file_size
+            logger.debug(f"Wrote markdown for '{library_name}' with hash {markdown_hash}")
+        except OSError as e:
+            logger.error(f"Failed to write markdown for '{library_name}': {e}")
+            # README publishing failures must never fail DB generation as per requirements


+        # Pre-load README maps for all versions to enable augmentation and backfilling
+        readme_maps = {v: inventory_manager.load_library_readme_map(v) for v in versions}
+
+        # Publish all READMEs to the database
+        for version, readme_map in readme_maps.items():
+            for library_name, markdown_hash in readme_map.items():
+                content = inventory_manager.load_library_readme_content(version, library_name, markdown_hash)
+                if content:
+                    db_writer.write_markdown(library_name, markdown_hash, content)
+
+        def load_and_augment_inventory(version: Version) -> dict:
+            inventory = inventory_manager.load_versioned_inventory(version)
+            readme_map = readme_maps.get(version, {})
+
+            # Augment libraries and custom instrumentations with markdown_hash
+            for key in ["libraries", "custom"]:
+                if key in inventory:
+                    for item in inventory[key]:
+                        name = item.get("name")
+                        if name and name in readme_map:
+                            item["markdown_hash"] = readme_map[name]
+
+            return inventory


lucacavenaghi97 · 2026-05-07T21:16:28Z

Hi @SurbhiAgarwal1, I just noticed this branch includes the commit from #380, which is still under review. You can see the effect on Copilot's review: several of its comments here are actually about the README code from #380, not the semconv work. I'd rather wait for #380 to be merged and this branch rebased on main before doing a full review, so the diff reflects only the Weaver integration. Thanks for the work!

SurbhiAgarwal1 added 2 commits May 7, 2026 08:02

SurbhiAgarwal1 requested review from a team as code owners May 7, 2026 03:31

lucacavenaghi97 requested a review from Copilot May 7, 2026 20:09

Copilot started reviewing on behalf of lucacavenaghi97 May 7, 2026 20:10 View session

Copilot AI reviewed May 7, 2026

SurbhiAgarwal1 wants to merge 2 commits into open-telemetry:main from SurbhiAgarwal1:feat/97-semconv-integration


3 participants
