76 changes: 76 additions & 0 deletions SEMCONV_INTEGRATION_DETAIL.md
@@ -0,0 +1,76 @@
# Technical Detail: Semantic Convention Integration (Issue #97)

This document describes how the Semantic Convention compliance pipeline is implemented in `explorer-db-builder`.

## 1. Architectural Overview
The integration follows a "sidecar" enrichment pattern. Instead of modifying the core data structures, we introduce a `SemconvEnricher` that evaluates telemetry metadata against standard OTel registries using the **OpenTelemetry Weaver** engine.

### Data Flow
1. **Extraction**: Retrieve metrics and spans from the normalized `InstrumentationData`.
2. **Translation**: Map OTel signals to a Weaver-compatible "Application Registry".
3. **Evaluation**: Execute `weaver registry check` against a specific semconv version.
4. **Annotation**: Persist compliance status back to the telemetry metadata.

## 2. Component: `SemconvEnricher`
**Location**: `explorer_db_builder/semconv_enricher.py`

This is the primary orchestrator for compliance checking.

### Transformation Logic
The enricher generates a temporary directory containing:
- **`manifest.yaml`**: Defines the instrumentation name and the dependency on the official OTel semantic convention registry (e.g., `github.com/open-telemetry/semantic-conventions@v1.37.0`).
- **`telemetry.yaml`**: Translates internal metadata into Weaver's definition format.
  - **Metrics**: Defined with `type: metric` and attributes using the `ref` keyword to ensure Weaver validates them against the registry's definitions.
  - **Spans**: Defined with `type: span`, using synthetic IDs based on the instrumentation name and span kind (e.g., `activej-http.SERVER`).
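
As a rough illustration of the translation step, the sketch below builds the group definitions for `telemetry.yaml` as plain Python dictionaries. The helper names (`synthetic_span_id`, `metric_group`, `span_group`), the `metric.` ID prefix, and the `metric_name` and `span_kind` fields are assumptions made for illustration, not the enricher's actual code. Only `type: metric`, `type: span`, the `ref` keyword, and the `<instrumentation>.<KIND>` ID scheme come from the description above.

```python
from typing import Any


def synthetic_span_id(instrumentation: str, span_kind: str) -> str:
    """Build a span group ID such as 'activej-http.SERVER'."""
    return f"{instrumentation}.{span_kind.upper()}"


def metric_group(name: str, unit: str, attributes: list[str]) -> dict[str, Any]:
    """Describe a metric; each attribute is a `ref` into the upstream registry."""
    return {
        "id": f"metric.{name}",  # hypothetical ID scheme
        "type": "metric",
        "metric_name": name,  # assumed field name
        "unit": unit,
        "attributes": [{"ref": attr} for attr in attributes],
    }


def span_group(instrumentation: str, span_kind: str, attributes: list[str]) -> dict[str, Any]:
    """Describe a span under a synthetic ID built from name and kind."""
    return {
        "id": synthetic_span_id(instrumentation, span_kind),
        "type": "span",
        "span_kind": span_kind.lower(),  # assumed field name
        "attributes": [{"ref": attr} for attr in attributes],
    }


groups = [
    metric_group("http.server.request.duration", "s", ["http.request.method"]),
    span_group("activej-http", "SERVER", ["http.request.method"]),
]
```

Serializing `{"groups": groups}` to YAML would then give Weaver something to resolve each `ref` against; the exact top-level layout Weaver expects should be checked against its documentation.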

### Weaver Invocation
The enricher calls the `weaver` CLI via a subprocess.
- **Success Condition**: If `weaver registry check` exits with code 0, every signal defined in the generated application registry is marked compliant.
- **Error Handling**: If errors are reported (return code 1), the enricher parses the `stderr` output to identify specific signals that failed validation and marks them accordingly.
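
The invocation and result handling could look roughly like the sketch below. It is not the enricher's real code: `check_registry`, `run_cli`, and the injectable `runner` parameter are hypothetical names, the `--registry` flag is assumed from Weaver's CLI help and should be verified against the installed version, and matching signal IDs as substrings of `stderr` is a stand-in heuristic because the actual parsing rules are not specified here.

```python
import subprocess
from collections.abc import Callable, Iterable

# A runner takes the command line and returns the finished process.
Runner = Callable[[list[str]], subprocess.CompletedProcess]


def run_cli(cmd: list[str]) -> subprocess.CompletedProcess:
    """Run a command, capturing text output without raising on failure."""
    return subprocess.run(cmd, capture_output=True, text=True, check=False)


def check_registry(
    registry_dir: str,
    signal_ids: Iterable[str],
    runner: Runner = run_cli,
) -> dict[str, bool]:
    """Map each signal ID to True (compliant) or False (failed validation)."""
    ids = list(signal_ids)
    result = runner(["weaver", "registry", "check", "--registry", registry_dir])
    if result.returncode == 0:
        # Exit code 0: every signal in the generated registry is compliant.
        return {signal_id: True for signal_id in ids}
    # Assumption: a failing signal's ID appears verbatim in stderr.
    return {signal_id: signal_id not in result.stderr for signal_id in ids}
```

Passing the runner as a parameter keeps the function testable without Weaver installed. A missing `weaver` binary would still raise `FileNotFoundError` from `subprocess.run`, which the caller is expected to handle.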

## 3. Pipeline Integration
**Location**: `explorer_db_builder/main.py`

The enrichment stage is integrated into `process_version` immediately after the `transform_instrumentation_format` call.

```python
transformed_inventory = transform_instrumentation_format(inventory)

# Enrich with semantic convention compliance
try:
    enricher = SemconvEnricher()
    enricher.enrich_inventory(transformed_inventory)
except Exception as e:
    logger.warning(f"Semantic convention enrichment failed: {e}")
```

This placement ensures that:
- Enrichment works on normalized, clean data.
- The pipeline remains resilient (a Weaver failure does not crash the build).

## 4. Frontend & Metadata Schema
**Location**: `ecosystem-explorer/src/types/javaagent.ts`

The compliance status is persisted as a `semconv_compliance` array on individual telemetry signals:

```json
{
  "name": "http.server.request.duration",
  "unit": "s",
  "semconv_compliance": ["1.37.0"]
}
```

This structure is extensible: because the field is an array, a signal can be marked as compliant with multiple semantic convention versions over time.
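
A minimal sketch of how a version might be recorded on a signal without creating duplicates is shown below. The function name `mark_compliant` is hypothetical; only the `semconv_compliance` field name comes from the schema above.

```python
from typing import Any


def mark_compliant(signal: dict[str, Any], version: str) -> None:
    """Record a semconv version on a signal, creating the list if needed."""
    versions = signal.setdefault("semconv_compliance", [])
    if version not in versions:
        versions.append(version)


signal = {"name": "http.server.request.duration", "unit": "s"}
mark_compliant(signal, "1.37.0")
mark_compliant(signal, "1.37.0")  # repeated calls do not duplicate the entry
mark_compliant(signal, "1.38.0")
```

Signals that never pass a check simply lack the field, so consumers would need to treat a missing `semconv_compliance` the same as an empty list.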

## 5. Verification & Testing
**Location**: `tests/test_semconv_enricher.py`

A dedicated test suite validates the following:
- **YAML Generation**: Ensures the generated `manifest.yaml` and `telemetry.yaml` are valid and follow Weaver's specification.
- **Version Extraction**: Tests the regex-based extraction of versions from OTel schema URLs.
- **Mocked CLI Interactions**: Simulates various Weaver output scenarios (total success, partial failure, and system errors) to verify that the metadata is updated correctly.
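
For the version extraction case, a sketch along these lines may help make the expected behavior concrete. The function name and the exact pattern are assumptions; the test suite's real regex is not shown in this document. The sketch assumes schema URLs of the form `https://opentelemetry.io/schemas/<version>`.

```python
import re
from typing import Optional

# Matches a trailing "/schemas/<major>.<minor>.<patch>" path segment.
_SCHEMA_VERSION = re.compile(r"/schemas/(\d+\.\d+\.\d+)/?$")


def extract_semconv_version(schema_url: str) -> Optional[str]:
    """Return the version embedded in an OTel schema URL, or None."""
    match = _SCHEMA_VERSION.search(schema_url)
    return match.group(1) if match else None
```

Returning `None` for URLs that do not match lets the caller skip the compliance check for that signal instead of raising.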

---
**Branch**: `feat/97-semconv-integration`
**PR Title**: `feat(db-builder): integrate Weaver for semconv compliance checking (#97)`
@@ -16,4 +16,7 @@

 import importlib.metadata

-__version__ = importlib.metadata.version("collector-watcher")
+try:
+    __version__ = importlib.metadata.version("collector-watcher")
+except importlib.metadata.PackageNotFoundError:
+    __version__ = "0.0.0-dev"
@@ -16,4 +16,7 @@

 import importlib.metadata

-__version__ = importlib.metadata.version("explorer-db-builder")
+try:
+    __version__ = importlib.metadata.version("explorer-db-builder")
+except importlib.metadata.PackageNotFoundError:
+    __version__ = "0.0.0-dev"
@@ -203,6 +203,33 @@ def write_version_list(self, versions: list[Version]) -> None:
             logger.error(f"Failed to write version list: {e}")
             raise

+    def write_markdown(self, library_name: str, markdown_hash: str, content: str) -> None:
+        """Write markdown file to the database.
+
+        Args:
+            library_name: Name of the library
+            markdown_hash: Hash of the markdown content
+            content: Markdown content string
+        """
+        markdown_dir = self.database_dir / "markdown"
+        markdown_dir.mkdir(parents=True, exist_ok=True)
+        file_path = markdown_dir / f"{library_name}-{markdown_hash}.md"
+
+        if file_path.exists():
+            logger.debug(f"Markdown for '{library_name}' with hash {markdown_hash} already exists, skipping write")
+            return
+
+        try:
+            with open(file_path, "w", encoding="utf-8") as f:
+                f.write(content)
+            file_size = len(content.encode("utf-8"))
+            self.files_written += 1
+            self.total_bytes += file_size
+            logger.debug(f"Wrote markdown for '{library_name}' with hash {markdown_hash}")
+        except OSError as e:
+            logger.error(f"Failed to write markdown for '{library_name}': {e}")
+            # README publishing failures must never fail DB generation as per requirements
+
     def get_stats(self) -> dict[str, Any]:
         """Get statistics about files written during this session.

@@ -27,6 +27,7 @@
 from explorer_db_builder.database_writer import DatabaseWriter
 from explorer_db_builder.instrumentation_transformer import transform_instrumentation_format
 from explorer_db_builder.metadata_backfiller import backfill_metadata
+from explorer_db_builder.semconv_enricher import SemconvEnricher

 logger = logging.getLogger(__name__)

@@ -97,6 +98,13 @@ def process_version(

     transformed_inventory = transform_instrumentation_format(inventory)

+    # Enrich with semantic convention compliance
+    try:
+        enricher = SemconvEnricher()
+        enricher.enrich_inventory(transformed_inventory)
+    except Exception as e:
+        logger.warning(f"Semantic convention enrichment failed for version {version}: {e}")
+
     if "libraries" not in transformed_inventory and "custom" not in transformed_inventory:
         raise KeyError(f"Inventory for version {version} missing 'libraries' and 'custom' keys")

@@ -139,9 +147,33 @@ def run_javaagent_builder(
     versions = get_release_versions(inventory_manager)
     logger.info(f"Processing {len(versions)} release versions")

+    # Pre-load README maps for all versions to enable augmentation and backfilling
+    readme_maps = {v: inventory_manager.load_library_readme_map(v) for v in versions}
+
+    # Publish all READMEs to the database
+    for version, readme_map in readme_maps.items():
+        for library_name, markdown_hash in readme_map.items():
+            content = inventory_manager.load_library_readme_content(version, library_name, markdown_hash)
+            if content:
+                db_writer.write_markdown(library_name, markdown_hash, content)
+
+    def load_and_augment_inventory(version: Version) -> dict:
+        inventory = inventory_manager.load_versioned_inventory(version)
+        readme_map = readme_maps.get(version, {})
+
+        # Augment libraries and custom instrumentations with markdown_hash
+        for key in ["libraries", "custom"]:
+            if key in inventory:
+                for item in inventory[key]:
+                    name = item.get("name")
+                    if name and name in readme_map:
+                        item["markdown_hash"] = readme_map[name]
+
+        return inventory
+
     backfilled_libraries = backfill_metadata(
         versions,
-        inventory_manager.load_versioned_inventory,
+        load_and_augment_inventory,
         item_key="libraries",
     )
     backfilled_inventories = backfill_metadata(
@@ -22,7 +22,7 @@

 logger = logging.getLogger(__name__)

-BACKFILLABLE_FIELDS = ["display_name", "description", "library_link", "has_javaagent"]
+BACKFILLABLE_FIELDS = ["display_name", "description", "library_link", "has_javaagent", "markdown_hash"]
 NESTED_BACKFILLABLE_FIELDS: dict[str, list[str]] = {
     "configurations": ["declarative_name", "examples"],
 }