
feat(db-builder): integrate Weaver for semconv compliance checking (#97)#382

Open
SurbhiAgarwal1 wants to merge 2 commits into open-telemetry:main from SurbhiAgarwal1:feat/97-semconv-integration

Conversation

@SurbhiAgarwal1

Technical Detail: Semantic Convention Integration (Issue #97)

This document provides a technical deep-dive into the implementation of the Semantic Convention compliance pipeline in the explorer-db-builder.

1. Architectural Overview

The integration follows a "sidecar" enrichment pattern. Instead of modifying the core data structures, we introduce a SemconvEnricher that evaluates telemetry metadata against standard OTel registries using the OpenTelemetry Weaver engine.

Data Flow

  1. Extraction: Retrieve metrics and spans from the normalized InstrumentationData.
  2. Translation: Map OTel signals to a Weaver-compatible "Application Registry".
  3. Evaluation: Execute weaver registry check against a specific semconv version.
  4. Annotation: Persist compliance status back to the telemetry metadata.
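The extraction and annotation stages (1 and 4) can be sketched as follows. This is a minimal, self-contained illustration, not the actual enricher code: the function name, the inventory dict layout, and the hard-coded version are all assumptions, and stages 2–3 (Weaver translation and evaluation) are represented by a precomputed compliance map.

```python
def annotate_compliance(inventory, compliance_map, semconv_version):
    """Stage 4 (annotation): persist compliance status back onto each signal.

    Illustrative sketch only; the real enricher's data model may differ.
    """
    for signal in inventory.get("metrics", []) + inventory.get("spans", []):
        if compliance_map.get(signal["name"]):
            # Record the semconv version this signal was validated against.
            signal.setdefault("semconv_compliance", []).append(semconv_version)
    return inventory

inventory = {
    "metrics": [{"name": "http.server.request.duration", "unit": "s"}],
    "spans": [{"name": "activej-http.SERVER"}],
}
annotate_compliance(
    inventory,
    {"http.server.request.duration": True, "activej-http.SERVER": False},
    "1.37.0",
)
```

Only the compliant metric gains a `semconv_compliance` entry; the failing span is left unannotated.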

2. Component: SemconvEnricher

Location: explorer_db_builder/semconv_enricher.py

This is the primary orchestrator for compliance checking.

Transformation Logic

The enricher generates a temporary directory containing:

  • manifest.yaml: Defines the instrumentation name and the dependency on the official OTel semantic convention registry (e.g., github.com/open-telemetry/semantic-conventions@v1.37.0).
  • telemetry.yaml: Translates internal metadata into Weaver's definition format.
    • Metrics: Defined with type: metric and attributes using the ref keyword to ensure Weaver validates them against the registry's definitions.
    • Spans: Defined with type: span, using synthetic IDs based on the instrumentation name and span kind (e.g., activej-http.SERVER).
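The translation step can be sketched like this. The group layout below only loosely mirrors Weaver's group format using the keys named above (`type`, `ref`, synthetic span IDs); the helper name and any other keys are illustrative assumptions, and in the real enricher the resulting structure is serialized to `telemetry.yaml` with PyYAML.

```python
def build_telemetry_groups(instr_name, metrics, span_kinds):
    """Map internal metadata to Weaver-style groups (illustrative sketch)."""
    groups = []
    for m in metrics:
        groups.append({
            "id": f"metric.{m['name']}",
            "type": "metric",
            # `ref` asks Weaver to validate the attribute against the
            # upstream registry definition instead of redefining it.
            "attributes": [{"ref": attr} for attr in m.get("attributes", [])],
        })
    for kind in span_kinds:
        # Synthetic span ID built from instrumentation name + span kind,
        # e.g. "activej-http.SERVER".
        groups.append({"id": f"{instr_name}.{kind}", "type": "span"})
    return {"groups": groups}

doc = build_telemetry_groups(
    "activej-http",
    [{"name": "http.server.request.duration",
      "attributes": ["http.request.method"]}],
    ["SERVER"],
)
```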

Weaver Invocation

The enricher calls the weaver CLI via a subprocess.

  • Success Condition: If weaver registry check exits with code 0, all signals defined in the registry are considered compliant.
  • Error Handling: If errors are reported (return code 1), the enricher parses the stderr output to identify specific signals that failed validation and marks them accordingly.
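The success/error branches above reduce to a small pure function, sketched here with illustrative names: start from an all-compliant map, then flip any signal whose ID appears in Weaver's stderr.

```python
def apply_weaver_result(signal_ids, returncode, stderr):
    """Turn a Weaver exit code + stderr into a per-signal compliance map.

    Illustrative sketch of the heuristic described above.
    """
    # Exit code 0: every signal defined in the registry is compliant.
    compliance = {sid: True for sid in signal_ids}
    if returncode != 0:
        # Heuristic: a signal ID that shows up in stderr failed validation.
        for sid in signal_ids:
            if sid in stderr:
                compliance[sid] = False
    return compliance

compliance = apply_weaver_result(
    ["activej-http.SERVER", "metric.http.server.request.duration"],
    1,
    "[Error] activej-http.SERVER: attribute 'foo' not found in registry",
)
```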

3. Pipeline Integration

Location: explorer_db_builder/main.py

The enrichment stage is integrated into process_version immediately after the transform_instrumentation_format call.

```python
transformed_inventory = transform_instrumentation_format(inventory)

# Enrich with semantic convention compliance
try:
    enricher = SemconvEnricher()
    enricher.enrich_inventory(transformed_inventory)
except Exception as e:
    logger.warning(f"Semantic convention enrichment failed: {e}")
```

This placement ensures that:

  • Enrichment works on normalized, clean data.
  • The pipeline remains resilient (a Weaver failure does not crash the build).

4. Frontend & Metadata Schema

Location: ecosystem-explorer/src/types/javaagent.ts

The compliance status is persisted as a semconv_compliance array on individual telemetry signals:

```json
{
  "name": "http.server.request.duration",
  "unit": "s",
  "semconv_compliance": ["1.37.0"]
}
```

This structure is extensible, allowing an instrumentation to be marked as compliant with multiple semantic convention versions over time.
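On the builder side, the field maps naturally onto the Metric/Span models. The dataclass below is a hypothetical sketch of that shape (the real models in explorer-db-builder may be structured differently); the idempotent append keeps repeated builds against the same semconv version from duplicating entries.

```python
from dataclasses import dataclass, field

@dataclass
class Metric:
    """Illustrative model sketch; not the actual explorer-db-builder class."""
    name: str
    unit: str
    semconv_compliance: list = field(default_factory=list)

    def mark_compliant(self, version):
        # Append idempotently so re-running the build for the same
        # semconv version doesn't duplicate entries.
        if version not in self.semconv_compliance:
            self.semconv_compliance.append(version)

m = Metric("http.server.request.duration", "s")
m.mark_compliant("1.37.0")
m.mark_compliant("1.37.0")  # no-op on repeat
```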

5. Verification & Testing

Location: tests/test_semconv_enricher.py

A dedicated test suite validates the following:

  • YAML Generation: Ensures the generated manifest.yaml and telemetry.yaml are valid and follow Weaver's specification.
  • Version Extraction: Tests the regex-based extraction of versions from OTel schema URLs.
  • Mocked CLI Interactions: Simulates various Weaver output scenarios (total success, partial failure, and system errors) to verify that the metadata is updated correctly.
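The version-extraction case above can be pictured with a small sketch. The exact regex and helper name in SemconvEnricher are assumptions here; the pattern just captures the trailing version segment of an OTel schema URL.

```python
import re

def extract_semconv_version(schema_url):
    """Pull the semconv version out of an OTel schema URL.

    Illustrative sketch; OTel schema URLs look like
    https://opentelemetry.io/schemas/1.37.0
    """
    match = re.search(r"/schemas/(\d+\.\d+\.\d+)$", schema_url)
    return match.group(1) if match else None
```

A URL without a trailing version yields `None`, which the enricher can treat as "skip compliance checking for this instrumentation".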

…pen-telemetry#242)

- Extend InventoryManager to discover and load library READMEs from registry
- Augment instrumentation metadata with markdown_hash and enable backfilling
- Implement markdown publishing to public data directory in DatabaseWriter
- Add frontend types and API support for README lazy loading
…pen-telemetry#97)

- Implement SemconvEnricher to validate telemetry via OTel Weaver
- Insert enrichment stage into the javaagent builder pipeline
- Add semconv_compliance field to Metric and Span models
- Support dynamic versioning based on instrumentation schema_url
@SurbhiAgarwal1 SurbhiAgarwal1 requested review from a team as code owners May 7, 2026 03:31
@netlify

netlify Bot commented May 7, 2026

Deploy Preview for otel-ecosystem-explorer ready!

| Name | Link |
| --- | --- |
| 🔨 Latest commit | 36e72b0 |
| 🔍 Latest deploy log | https://app.netlify.com/projects/otel-ecosystem-explorer/deploys/69fc0793a83ede0008a9dda3 |
| 😎 Deploy Preview | https://deploy-preview-382--otel-ecosystem-explorer.netlify.app |

Contributor

Copilot AI left a comment


Pull request overview

Integrates semantic-convention (semconv) compliance checking into the Explorer DB build pipeline using the OpenTelemetry Weaver CLI, and adds support for publishing and serving per-library README markdown content (content-addressed via a hash) to the frontend.

Changes:

  • Add SemconvEnricher to generate a Weaver registry from instrumentation telemetry and annotate metrics/spans with semconv_compliance.
  • Publish library README markdown files to the generated database and backfill/augment instrumentations with markdown_hash.
  • Extend frontend types and API helpers to support semconv_compliance and README loading.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 7 comments.

Summary per file:

| File | Description |
| --- | --- |
| SEMCONV_INTEGRATION_DETAIL.md | Adds a technical deep-dive doc describing the semconv enrichment pipeline and schema updates. |
| ecosystem-explorer/src/types/javaagent.ts | Extends TS types to include markdown_hash and semconv_compliance on signals. |
| ecosystem-explorer/src/lib/api/javaagent-data.ts | Adds an API helper to fetch published README markdown files. |
| ecosystem-automation/java-instrumentation-watcher/src/java_instrumentation_watcher/inventory_manager.py | Adds helpers to scan/read content-addressed README markdown files per version. |
| ecosystem-automation/java-instrumentation-watcher/src/java_instrumentation_watcher/\_\_init\_\_.py | Adds a dev fallback for `__version__` when package metadata isn't present. |
| ecosystem-automation/explorer-db-builder/tests/test_semconv_enricher.py | Adds unit tests for version extraction, YAML generation, and mocked Weaver interactions. |
| ecosystem-automation/explorer-db-builder/src/explorer_db_builder/semconv_enricher.py | Introduces the Weaver-based semconv compliance enricher. |
| ecosystem-automation/explorer-db-builder/src/explorer_db_builder/metadata_backfiller.py | Allows markdown_hash to be backfilled across versions. |
| ecosystem-automation/explorer-db-builder/src/explorer_db_builder/main.py | Wires semconv enrichment into process_version and publishes/augments README markdown during the build. |
| ecosystem-automation/explorer-db-builder/src/explorer_db_builder/database_writer.py | Adds write_markdown to publish README markdown into the DB output. |
| ecosystem-automation/explorer-db-builder/src/explorer_db_builder/\_\_init\_\_.py | Adds a dev fallback for `__version__` when package metadata isn't present. |
| ecosystem-automation/collector-watcher/src/collector_watcher/\_\_init\_\_.py | Adds a dev fallback for `__version__` when package metadata isn't present. |

```python
# If weaver is not found, it will raise an exception which is caught in enrich_instrumentation.
cmd = [self.weaver_path, "registry", "check", "-r", registry_dir]
result = subprocess.run(cmd, capture_output=True, text=True)
```
Comment on lines +165 to +185
```python
# Initially assume all are compliant if Weaver succeeded.
# We need to know which signals we defined to populate the map.
# We'll read them back from the generated yaml.
with open(os.path.join(registry_dir, "telemetry.yaml")) as f:
    telemetry_data = yaml.safe_load(f)
for group in telemetry_data.get("groups", []):
    compliance_map[group["id"]] = True

if result.returncode != 0:
    # Parse errors to mark specific signals as non-compliant.
    # Example error line: [Error] groups[0].attributes[1]: attribute 'foo' not found in registry
    # This is complex to parse robustly without a stable Weaver output format.
    # For the POC, if Weaver fails, we mark everything as non-compliant or log it.
    logger.debug(f"Weaver reported errors:\n{result.stderr}")

    # Simple heuristic: if an ID appears in an error line, mark it as non-compliant.
    for signal_id in compliance_map.keys():
        if signal_id in result.stderr:
            compliance_map[signal_id] = False

return compliance_map
```
Comment on lines +214 to +217
```python
markdown_dir = self.database_dir / "markdown"
markdown_dir.mkdir(parents=True, exist_ok=True)
file_path = markdown_dir / f"{library_name}-{markdown_hash}.md"
```
Comment on lines +158 to +160
```python
file_path = self.get_version_dir(version) / self.README_DIR / f"{library_name}-{markdown_hash}.md"
if not file_path.exists():
    return None
```
Comment on lines +84 to +88
```ts
const response = await fetch(`${BASE_PATH}/markdown/${libraryName}-${markdownHash}.md`);
if (!response.ok) {
  throw new Error(`Failed to load README for ${libraryName}`);
}
return response.text();
```
Comment on lines +206 to +231
```python
def write_markdown(self, library_name: str, markdown_hash: str, content: str) -> None:
    """Write markdown file to the database.

    Args:
        library_name: Name of the library
        markdown_hash: Hash of the markdown content
        content: Markdown content string
    """
    markdown_dir = self.database_dir / "markdown"
    markdown_dir.mkdir(parents=True, exist_ok=True)
    file_path = markdown_dir / f"{library_name}-{markdown_hash}.md"

    if file_path.exists():
        logger.debug(f"Markdown for '{library_name}' with hash {markdown_hash} already exists, skipping write")
        return

    try:
        with open(file_path, "w", encoding="utf-8") as f:
            f.write(content)
        file_size = len(content.encode("utf-8"))
        self.files_written += 1
        self.total_bytes += file_size
        logger.debug(f"Wrote markdown for '{library_name}' with hash {markdown_hash}")
    except OSError as e:
        logger.error(f"Failed to write markdown for '{library_name}': {e}")
        # README publishing failures must never fail DB generation as per requirements
```
Comment on lines +150 to +172
```python
# Pre-load README maps for all versions to enable augmentation and backfilling
readme_maps = {v: inventory_manager.load_library_readme_map(v) for v in versions}

# Publish all READMEs to the database
for version, readme_map in readme_maps.items():
    for library_name, markdown_hash in readme_map.items():
        content = inventory_manager.load_library_readme_content(version, library_name, markdown_hash)
        if content:
            db_writer.write_markdown(library_name, markdown_hash, content)

def load_and_augment_inventory(version: Version) -> dict:
    inventory = inventory_manager.load_versioned_inventory(version)
    readme_map = readme_maps.get(version, {})

    # Augment libraries and custom instrumentations with markdown_hash
    for key in ["libraries", "custom"]:
        if key in inventory:
            for item in inventory[key]:
                name = item.get("name")
                if name and name in readme_map:
                    item["markdown_hash"] = readme_map[name]

    return inventory
```
@lucacavenaghi97
Member

Hi @SurbhiAgarwal1, I just noticed this branch includes the commit from #380, which is still under review. You can see the effect on Copilot's review: several of its comments here are actually about the README code from #380, not the semconv work. I'd rather wait for #380 to be merged and this branch rebased on main before doing a full review, so the diff reflects only the Weaver integration. Thanks for the work!
