open-telemetry · SurbhiAgarwal1 · May 7, 2026 · May 7, 2026
@@ -0,0 +1,76 @@
+# Technical Detail: Semantic Convention Integration (Issue #97)
+
+This document provides a technical deep-dive into the implementation of the Semantic Convention compliance pipeline in the `explorer-db-builder`.
+
+## 1. Architectural Overview
+The integration follows a "sidecar" enrichment pattern. Instead of modifying the core data structures, we introduce a `SemconvEnricher` that evaluates telemetry metadata against standard OTel registries using the **OpenTelemetry Weaver** engine.
+
+### Data Flow
+1. **Extraction**: Retrieve metrics and spans from the normalized `InstrumentationData`.
+2. **Translation**: Map OTel signals to a Weaver-compatible "Application Registry".
+3. **Evaluation**: Execute `weaver registry check` against a specific semconv version.
+4. **Annotation**: Persist compliance status back to the telemetry metadata.
+
+## 2. Component: `SemconvEnricher`
+**Location**: `explorer_db_builder/semconv_enricher.py`
+
+This is the primary orchestrator for compliance checking.
+
+### Transformation Logic
+The enricher generates a temporary directory containing:
+- **`manifest.yaml`**: Defines the instrumentation name and the dependency on the official OTel semantic convention registry (e.g., `github.com/open-telemetry/semantic-conventions@v1.37.0`).
+- **`telemetry.yaml`**: Translates internal metadata into Weaver's definition format.
+  - **Metrics**: Defined with `type: metric` and attributes using the `ref` keyword to ensure Weaver validates them against the registry's definitions.
+  - **Spans**: Defined with `type: span`, using synthetic IDs based on the instrumentation name and span kind (e.g., `activej-http.SERVER`).
+
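The translation described above can be sketched as plain dictionaries that would later be serialized to `telemetry.yaml`. This is a minimal sketch, not the enricher's actual code: the helper names, the `metric.<name>` ID scheme, and the `metric_name` and `unit` keys are assumptions made for illustration. Only `type: metric`, `type: span`, the `ref` keyword, and the `<instrumentation>.<span kind>` ID come from the description.

```python
from typing import Any


def span_group_id(instrumentation_name: str, span_kind: str) -> str:
    """Build the synthetic span ID from the instrumentation name and span kind."""
    return f"{instrumentation_name}.{span_kind}"


def metric_group(name: str, unit: str, attribute_names: list[str]) -> dict[str, Any]:
    """Describe one metric; attributes use `ref` so the registry definition is checked."""
    return {
        "id": f"metric.{name}",  # hypothetical ID scheme
        "type": "metric",
        "metric_name": name,  # assumed key name
        "unit": unit,  # assumed key name
        "attributes": [{"ref": attr} for attr in attribute_names],
    }


def span_group(
    instrumentation_name: str, span_kind: str, attribute_names: list[str]
) -> dict[str, Any]:
    """Describe one span under its synthetic ID."""
    return {
        "id": span_group_id(instrumentation_name, span_kind),
        "type": "span",
        "attributes": [{"ref": attr} for attr in attribute_names],
    }


groups = [
    metric_group("http.server.request.duration", "s", ["http.request.method"]),
    span_group("activej-http", "SERVER", ["http.request.method"]),
]
```

Keeping the groups as dictionaries until the final write makes the translation step testable without touching the file system.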
+### Weaver Invocation
+The enricher calls the `weaver` CLI via a subprocess.
+- **Success Condition**: If `weaver registry check` exits with code 0, all signals defined in the registry are considered compliant.
+- **Error Handling**: If errors are reported (return code 1), the enricher parses the `stderr` output to identify specific signals that failed validation and marks them accordingly.
+
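The exit-code handling above could look roughly like the following. It is a sketch under stated assumptions: `check_registry` and `failed_signal_ids` are hypothetical names, and because the format of Weaver's `stderr` is not specified here, the parser simply treats a signal as failed when its ID appears anywhere in the output. The command is passed in as a parameter so the logic can be exercised without Weaver installed.

```python
import subprocess
from collections.abc import Sequence


def failed_signal_ids(stderr: str, signal_ids: Sequence[str]) -> set[str]:
    """Hypothetical parser: a signal counts as failed if its ID appears in stderr.

    Plain substring matching can over-match when one ID is a prefix of another;
    a real parser would depend on Weaver's actual diagnostic format.
    """
    return {sid for sid in signal_ids if sid in stderr}


def check_registry(command: Sequence[str], signal_ids: Sequence[str]) -> dict[str, bool]:
    """Run the check command and return a compliance flag per signal ID."""
    result = subprocess.run(list(command), capture_output=True, text=True, check=False)
    if result.returncode == 0:
        # Exit code 0: every signal in the registry is considered compliant.
        return {sid: True for sid in signal_ids}
    if result.returncode == 1:
        # Exit code 1: validation errors were reported; find the affected signals.
        failed = failed_signal_ids(result.stderr, signal_ids)
        return {sid: sid not in failed for sid in signal_ids}
    # Any other exit code is treated as a system error (an assumption of this sketch).
    raise RuntimeError(f"check command failed with exit code {result.returncode}")
```

In the pipeline the command would begin with `["weaver", "registry", "check", ...]` followed by whatever arguments point at the generated temporary directory. The exact flags are not shown in this document, so confirm them with `weaver registry check --help`.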
+## 3. Pipeline Integration
+**Location**: `explorer_db_builder/main.py`
+
+The enrichment stage is integrated into `process_version` immediately after the `transform_instrumentation_format` call.
+
+```python
+transformed_inventory = transform_instrumentation_format(inventory)
+
+# Enrich with semantic convention compliance
+try:
+    enricher = SemconvEnricher()
+    enricher.enrich_inventory(transformed_inventory)
+except Exception as e:
+    logger.warning(f"Semantic convention enrichment failed for version {version}: {e}")
+```
+
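+This placement ensures that:
+- Enrichment operates on data that `transform_instrumentation_format` has already normalized.
+- The pipeline remains resilient: any exception raised during enrichment is caught and logged as a warning, so a Weaver failure does not crash the build.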
+
+## 4. Frontend & Metadata Schema
+**Location**: `ecosystem-explorer/src/types/javaagent.ts`
+
+The compliance status is persisted as a `semconv_compliance` array on individual telemetry signals:
+
+```json
+{
+  "name": "http.server.request.duration",
+  "unit": "s",
+  "semconv_compliance": ["1.37.0"]
+}
+```
+
+This structure is extensible: because the field is an array, a signal can be marked as compliant with multiple semantic convention versions over time.
+
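One way to keep the `semconv_compliance` array well-formed as versions accumulate is an idempotent append. A minimal sketch, assuming a hypothetical helper named `mark_compliant`; the version `1.38.0` is used purely for illustration.

```python
from typing import Any


def mark_compliant(signal: dict[str, Any], semconv_version: str) -> None:
    """Record a compliant semconv version on a signal without creating duplicates."""
    versions = signal.setdefault("semconv_compliance", [])
    if semconv_version not in versions:
        versions.append(semconv_version)


signal = {"name": "http.server.request.duration", "unit": "s"}
mark_compliant(signal, "1.37.0")
mark_compliant(signal, "1.37.0")  # a repeated check does not add a second entry
mark_compliant(signal, "1.38.0")  # illustrative later version
```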
+## 5. Verification & Testing
+**Location**: `tests/test_semconv_enricher.py`
+
+A dedicated test suite validates the following:
+- **YAML Generation**: Ensures the generated `manifest.yaml` and `telemetry.yaml` are valid and follow Weaver's specification.
+- **Version Extraction**: Tests the regex-based extraction of versions from OTel schema URLs.
+- **Mocked CLI Interactions**: Simulates various Weaver output scenarios (total success, partial failure, and system errors) to verify that the metadata is updated correctly.
+
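The version-extraction step under test might resemble the sketch below. The function name and the exact pattern are assumptions. It relies on OTel schema URLs ending in a `/schemas/<major>.<minor>.<patch>` segment, as in `https://opentelemetry.io/schemas/1.37.0`.

```python
import re
from typing import Optional

# Matches a trailing "/schemas/<major>.<minor>.<patch>" segment (pattern is an assumption).
_SCHEMA_VERSION = re.compile(r"/schemas/(\d+\.\d+\.\d+)/?$")


def extract_semconv_version(schema_url: str) -> Optional[str]:
    """Return the semconv version embedded in a schema URL, or None if absent."""
    match = _SCHEMA_VERSION.search(schema_url)
    return match.group(1) if match else None
```

Returning `None` instead of raising lets the caller skip signals whose schema URL carries no recognizable version.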
+---
+**Branch**: `feat/97-semconv-integration`  
+**PR Title**: `feat(db-builder): integrate Weaver for semconv compliance checking (#97)`
@@ -16,4 +16,7 @@
 
 import importlib.metadata
 
-__version__ = importlib.metadata.version("collector-watcher")
+try:
+    __version__ = importlib.metadata.version("collector-watcher")
+except importlib.metadata.PackageNotFoundError:
+    __version__ = "0.0.0-dev"
@@ -16,4 +16,7 @@
 
 import importlib.metadata
 
-__version__ = importlib.metadata.version("explorer-db-builder")
+try:
+    __version__ = importlib.metadata.version("explorer-db-builder")
+except importlib.metadata.PackageNotFoundError:
+    __version__ = "0.0.0-dev"
@@ -203,6 +203,33 @@ def write_version_list(self, versions: list[Version]) -> None:
             logger.error(f"Failed to write version list: {e}")
             raise
 
+    def write_markdown(self, library_name: str, markdown_hash: str, content: str) -> None:
+        """Write markdown file to the database.
+
+        Args:
+            library_name: Name of the library
+            markdown_hash: Hash of the markdown content
+            content: Markdown content string
+        """
+        markdown_dir = self.database_dir / "markdown"
+        markdown_dir.mkdir(parents=True, exist_ok=True)
+        file_path = markdown_dir / f"{library_name}-{markdown_hash}.md"
+
+        if file_path.exists():
+            logger.debug(f"Markdown for '{library_name}' with hash {markdown_hash} already exists, skipping write")
+            return
+
+        try:
+            with open(file_path, "w", encoding="utf-8") as f:
+                f.write(content)
+            file_size = len(content.encode("utf-8"))
+            self.files_written += 1
+            self.total_bytes += file_size
+            logger.debug(f"Wrote markdown for '{library_name}' with hash {markdown_hash}")
+        except OSError as e:
+            logger.error(f"Failed to write markdown for '{library_name}': {e}")
+            # Deliberately not re-raised: README publishing failures must never fail DB generation
+
     def get_stats(self) -> dict[str, Any]:
         """Get statistics about files written during this session.
 

@@ -27,6 +27,7 @@
 from explorer_db_builder.database_writer import DatabaseWriter
 from explorer_db_builder.instrumentation_transformer import transform_instrumentation_format
 from explorer_db_builder.metadata_backfiller import backfill_metadata
+from explorer_db_builder.semconv_enricher import SemconvEnricher
 
 logger = logging.getLogger(__name__)
 
@@ -97,6 +98,13 @@ def process_version(
 
     transformed_inventory = transform_instrumentation_format(inventory)
 
+    # Enrich with semantic convention compliance
+    try:
+        enricher = SemconvEnricher()
+        enricher.enrich_inventory(transformed_inventory)
+    except Exception as e:
+        logger.warning(f"Semantic convention enrichment failed for version {version}: {e}")
+
     if "libraries" not in transformed_inventory and "custom" not in transformed_inventory:
         raise KeyError(f"Inventory for version {version} missing 'libraries' and 'custom' keys")
 
@@ -139,9 +147,33 @@ def run_javaagent_builder(
         versions = get_release_versions(inventory_manager)
         logger.info(f"Processing {len(versions)} release versions")
 
+        # Pre-load README maps for all versions to enable augmentation and backfilling
+        readme_maps = {v: inventory_manager.load_library_readme_map(v) for v in versions}
+
+        # Publish all READMEs to the database
+        for version, readme_map in readme_maps.items():
+            for library_name, markdown_hash in readme_map.items():
+                content = inventory_manager.load_library_readme_content(version, library_name, markdown_hash)
+                if content:
+                    db_writer.write_markdown(library_name, markdown_hash, content)
+
+        def load_and_augment_inventory(version: Version) -> dict:
+            inventory = inventory_manager.load_versioned_inventory(version)
+            readme_map = readme_maps.get(version, {})
+
+            # Augment libraries and custom instrumentations with markdown_hash
+            for key in ["libraries", "custom"]:
+                if key in inventory:
+                    for item in inventory[key]:
+                        name = item.get("name")
+                        if name and name in readme_map:
+                            item["markdown_hash"] = readme_map[name]
+
+            return inventory
+
         backfilled_libraries = backfill_metadata(
             versions,
-            inventory_manager.load_versioned_inventory,
+            load_and_augment_inventory,
             item_key="libraries",
         )
         backfilled_inventories = backfill_metadata(

@@ -22,7 +22,7 @@
 
 logger = logging.getLogger(__name__)
 
-BACKFILLABLE_FIELDS = ["display_name", "description", "library_link", "has_javaagent"]
+BACKFILLABLE_FIELDS = ["display_name", "description", "library_link", "has_javaagent", "markdown_hash"]
 NESTED_BACKFILLABLE_FIELDS: dict[str, list[str]] = {
     "configurations": ["declarative_name", "examples"],
 }