feat(db-builder): integrate Weaver for semconv compliance checking (#97) #382
SurbhiAgarwal1 wants to merge 2 commits into open-telemetry:main
…pen-telemetry#242)
- Extend InventoryManager to discover and load library READMEs from registry
- Augment instrumentation metadata with markdown_hash and enable backfilling
- Implement markdown publishing to public data directory in DatabaseWriter
- Add frontend types and API support for README lazy loading

…pen-telemetry#97)
- Implement SemconvEnricher to validate telemetry via OTel Weaver
- Insert enrichment stage into the javaagent builder pipeline
- Add semconv_compliance field to Metric and Span models
- Support dynamic versioning based on instrumentation schema_url
✅ Deploy Preview for otel-ecosystem-explorer ready!
Pull request overview
Integrates semantic-convention (semconv) compliance checking into the Explorer DB build pipeline using the OpenTelemetry Weaver CLI, and adds support for publishing and serving per-library README markdown content (content-addressed via a hash) to the frontend.
Changes:
- Add `SemconvEnricher` to generate a Weaver registry from instrumentation telemetry and annotate metrics/spans with `semconv_compliance`.
- Publish library README markdown files to the generated database and backfill/augment instrumentations with `markdown_hash`.
- Extend frontend types and API helpers to support `semconv_compliance` and README loading.
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| `SEMCONV_INTEGRATION_DETAIL.md` | Adds a technical deep-dive doc describing the semconv enrichment pipeline and schema updates. |
| `ecosystem-explorer/src/types/javaagent.ts` | Extends TS types to include `markdown_hash` and `semconv_compliance` on signals. |
| `ecosystem-explorer/src/lib/api/javaagent-data.ts` | Adds an API helper to fetch published README markdown files. |
| `ecosystem-automation/java-instrumentation-watcher/src/java_instrumentation_watcher/inventory_manager.py` | Adds helpers to scan/read content-addressed README markdown files per version. |
| `ecosystem-automation/java-instrumentation-watcher/src/java_instrumentation_watcher/__init__.py` | Adds a dev fallback for `__version__` when package metadata isn't present. |
| `ecosystem-automation/explorer-db-builder/tests/test_semconv_enricher.py` | Adds unit tests for version extraction, YAML generation, and mocked Weaver interactions. |
| `ecosystem-automation/explorer-db-builder/src/explorer_db_builder/semconv_enricher.py` | Introduces the Weaver-based semconv compliance enricher. |
| `ecosystem-automation/explorer-db-builder/src/explorer_db_builder/metadata_backfiller.py` | Allows `markdown_hash` to be backfilled across versions. |
| `ecosystem-automation/explorer-db-builder/src/explorer_db_builder/main.py` | Wires semconv enrichment into `process_version` and publishes/augments README markdown during the build. |
| `ecosystem-automation/explorer-db-builder/src/explorer_db_builder/database_writer.py` | Adds `write_markdown` to publish README markdown into the DB output. |
| `ecosystem-automation/explorer-db-builder/src/explorer_db_builder/__init__.py` | Adds a dev fallback for `__version__` when package metadata isn't present. |
| `ecosystem-automation/collector-watcher/src/collector_watcher/__init__.py` | Adds a dev fallback for `__version__` when package metadata isn't present. |

ecosystem-automation/explorer-db-builder/src/explorer_db_builder/semconv_enricher.py

    # If weaver is not found, it will raise an exception which is caught in enrich_instrumentation.

    cmd = [self.weaver_path, "registry", "check", "-r", registry_dir]
    result = subprocess.run(cmd, capture_output=True, text=True)
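
The comment in this excerpt relies on `subprocess.run` raising when the binary is missing. A minimal sketch of making that case explicit, assuming a hypothetical helper `run_weaver_check` (not part of this PR) that returns `None` when Weaver cannot be started so the caller can skip enrichment:

```python
import shutil
import subprocess
from typing import Optional


def run_weaver_check(weaver_path: str, registry_dir: str) -> Optional[subprocess.CompletedProcess]:
    """Run `weaver registry check`; return None if the binary cannot be started.

    Hypothetical helper for illustration; the command line mirrors the excerpt above.
    """
    # shutil.which resolves bare names via PATH and also validates explicit paths.
    if shutil.which(weaver_path) is None:
        return None
    cmd = [weaver_path, "registry", "check", "-r", registry_dir]
    try:
        return subprocess.run(cmd, capture_output=True, text=True)
    except OSError:
        # Covers FileNotFoundError and PermissionError if the binary
        # disappeared or is not executable between the check and the call.
        return None
```

Returning `None` instead of raising keeps the "Weaver unavailable" case distinct from "Weaver ran and reported errors", which a non-zero `returncode` already signals.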

ecosystem-automation/explorer-db-builder/src/explorer_db_builder/semconv_enricher.py

    # Initially assume all are compliant if Weaver succeeded
    # We need to know which signals we defined to populate the map.
    # We'll read them back from the generated yaml.
    with open(os.path.join(registry_dir, "telemetry.yaml")) as f:
        telemetry_data = yaml.safe_load(f)
        for group in telemetry_data.get("groups", []):
            compliance_map[group["id"]] = True

    if result.returncode != 0:
        # Parse errors to mark specific signals as non-compliant
        # Example error line: [Error] groups[0].attributes[1]: attribute 'foo' not found in registry
        # This is complex to parse robustly without a stable Weaver output format.
        # For the POC, if Weaver fails, we mark everything as non-compliant or log it.
        logger.debug(f"Weaver reported errors:\n{result.stderr}")

        # Simple heuristic: if an ID appears in an error line, mark it as non-compliant
        for signal_id in compliance_map.keys():
            if signal_id in result.stderr:
                compliance_map[signal_id] = False

    return compliance_map
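
One limitation of the substring heuristic in this excerpt is that an ID such as `http.server` also matches inside `http.server.request.duration`. A minimal sketch of a stricter variant, assuming a hypothetical helper `mark_failed_signals` and assuming only that IDs appear verbatim somewhere in Weaver's error text (the excerpt itself notes that the output format is not stable):

```python
import re
from typing import Dict, Iterable


def mark_failed_signals(signal_ids: Iterable[str], stderr: str) -> Dict[str, bool]:
    """Map each signal ID to True (compliant) or False (mentioned in stderr).

    Hypothetical helper: an ID only counts as mentioned when it is not part
    of a longer dotted or hyphenated identifier.
    """
    compliance = {}
    for signal_id in signal_ids:
        pattern = (
            r"(?<![\w.-])"          # not preceded by an identifier character
            + re.escape(signal_id)
            + r"(?![\w-]|\.\w)"     # not continued by more identifier text
        )
        compliance[signal_id] = re.search(pattern, stderr) is None
    return compliance
```

The trailing lookahead still accepts an ID followed by sentence punctuation (for example a full stop at the end of a line), so only genuine continuations of the identifier are rejected.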

ecosystem-automation/explorer-db-builder/src/explorer_db_builder/database_writer.py

    markdown_dir = self.database_dir / "markdown"
    markdown_dir.mkdir(parents=True, exist_ok=True)
    file_path = markdown_dir / f"{library_name}-{markdown_hash}.md"

ecosystem-automation/java-instrumentation-watcher/src/java_instrumentation_watcher/inventory_manager.py

    file_path = self.get_version_dir(version) / self.README_DIR / f"{library_name}-{markdown_hash}.md"
    if not file_path.exists():
        return None

ecosystem-explorer/src/lib/api/javaagent-data.ts

    const response = await fetch(`${BASE_PATH}/markdown/${libraryName}-${markdownHash}.md`);
    if (!response.ok) {
      throw new Error(`Failed to load README for ${libraryName}`);
    }
    return response.text();

ecosystem-automation/explorer-db-builder/src/explorer_db_builder/database_writer.py

    def write_markdown(self, library_name: str, markdown_hash: str, content: str) -> None:
        """Write markdown file to the database.

        Args:
            library_name: Name of the library
            markdown_hash: Hash of the markdown content
            content: Markdown content string
        """
        markdown_dir = self.database_dir / "markdown"
        markdown_dir.mkdir(parents=True, exist_ok=True)
        file_path = markdown_dir / f"{library_name}-{markdown_hash}.md"

        if file_path.exists():
            logger.debug(f"Markdown for '{library_name}' with hash {markdown_hash} already exists, skipping write")
            return

        try:
            with open(file_path, "w", encoding="utf-8") as f:
                f.write(content)
            file_size = len(content.encode("utf-8"))
            self.files_written += 1
            self.total_bytes += file_size
            logger.debug(f"Wrote markdown for '{library_name}' with hash {markdown_hash}")
        except OSError as e:
            logger.error(f"Failed to write markdown for '{library_name}': {e}")
            # README publishing failures must never fail DB generation as per requirements

ecosystem-automation/explorer-db-builder/src/explorer_db_builder/main.py

    # Pre-load README maps for all versions to enable augmentation and backfilling
    readme_maps = {v: inventory_manager.load_library_readme_map(v) for v in versions}

    # Publish all READMEs to the database
    for version, readme_map in readme_maps.items():
        for library_name, markdown_hash in readme_map.items():
            content = inventory_manager.load_library_readme_content(version, library_name, markdown_hash)
            if content:
                db_writer.write_markdown(library_name, markdown_hash, content)

    def load_and_augment_inventory(version: Version) -> dict:
        inventory = inventory_manager.load_versioned_inventory(version)
        readme_map = readme_maps.get(version, {})

        # Augment libraries and custom instrumentations with markdown_hash
        for key in ["libraries", "custom"]:
            if key in inventory:
                for item in inventory[key]:
                    name = item.get("name")
                    if name and name in readme_map:
                        item["markdown_hash"] = readme_map[name]

        return inventory
|
Hi @SurbhiAgarwal1, I just noticed this branch includes the commit from #380, which is still under review. You can see the effect on Copilot's review: several of its comments here are actually about the README code from #380, not the semconv work. I'd rather wait for #380 to be merged and this branch rebased on main before doing a full review, so the diff reflects only the Weaver integration. Thanks for the work!

Technical Detail: Semantic Convention Integration (Issue #97)

This document provides a technical deep-dive into the implementation of the Semantic Convention compliance pipeline in the `explorer-db-builder`.

1. Architectural Overview

The integration follows a "sidecar" enrichment pattern. Instead of modifying the core data structures, we introduce a `SemconvEnricher` that evaluates telemetry metadata against standard OTel registries using the OpenTelemetry Weaver engine.

Data Flow
- […] `InstrumentationData`.
- […] `weaver registry check` against a specific semconv version.

2. Component: `SemconvEnricher`

Location: `explorer_db_builder/semconv_enricher.py`

This is the primary orchestrator for compliance checking.

Transformation Logic

The enricher generates a temporary directory containing:
- `manifest.yaml`: Defines the instrumentation name and the dependency on the official OTel semantic convention registry (e.g., `github.com/open-telemetry/semantic-conventions@v1.37.0`).
- `telemetry.yaml`: Translates internal metadata into Weaver's definition format.
  - […] `type: metric` and attributes using the `ref` keyword to ensure Weaver validates them against the registry's definitions.
  - […] `type: span`, using synthetic IDs based on the instrumentation name and span kind (e.g., `activej-http.SERVER`).

Weaver Invocation
The enricher calls the `weaver` CLI via a subprocess.
- […] `weaver registry check` exits with code 0, all signals defined in the registry are considered compliant.
- […] `stderr` output to identify specific signals that failed validation and marks them accordingly.

3. Pipeline Integration
Location: `explorer_db_builder/main.py`

The enrichment stage is integrated into `process_version` immediately after the `transform_instrumentation_format` call.

This placement ensures that: […]
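
As a rough illustration of the stage ordering described in this section, the sketch below runs a transform step first and then a best-effort enrichment step that builds registry-style groups and annotates each signal. Everything here is a stand-in under stated assumptions: the function names, the input shape (`metrics` and `spans` lists with `name`, `span_kind`, and `attributes` keys), the use of the metric name as group ID, and the empty list for non-compliant signals are hypothetical, and the Weaver call is replaced by an injected `check` callable.

```python
import copy
from typing import Callable, Dict, List

# A checker receives the generated groups and returns {group_id: is_compliant}.
Checker = Callable[[List[dict]], Dict[str, bool]]


def span_group_id(instrumentation_name: str, span_kind: str) -> str:
    """Synthetic span ID from instrumentation name and kind, e.g. 'activej-http.SERVER'."""
    return f"{instrumentation_name}.{span_kind}"


def build_groups(instrumentation: dict) -> List[dict]:
    """Translate metrics and spans into registry-style groups (assumed shape)."""
    groups = []
    for metric in instrumentation.get("metrics", []):
        groups.append({
            "id": metric["name"],  # assumption: the metric name doubles as group ID
            "type": "metric",
            "attributes": [{"ref": name} for name in metric.get("attributes", [])],
        })
    for span in instrumentation.get("spans", []):
        groups.append({
            "id": span_group_id(instrumentation["name"], span["span_kind"]),
            "type": "span",
            "attributes": [{"ref": name} for name in span.get("attributes", [])],
        })
    return groups


def enrich(instrumentation: dict, version: str, check: Checker) -> dict:
    """Return an annotated copy; if the check fails, return the copy unannotated."""
    enriched = copy.deepcopy(instrumentation)
    try:
        compliance = check(build_groups(enriched))
    except Exception:
        return enriched  # enrichment is best-effort and must not fail the build
    for metric in enriched.get("metrics", []):
        metric["semconv_compliance"] = [version] if compliance.get(metric["name"]) else []
    for span in enriched.get("spans", []):
        group_id = span_group_id(enriched["name"], span["span_kind"])
        span["semconv_compliance"] = [version] if compliance.get(group_id) else []
    return enriched


def process_version_sketch(raw: dict, version: str,
                           transform: Callable[[dict], dict], check: Checker) -> dict:
    """Transform first, then enrich, mirroring the ordering described above."""
    return enrich(transform(raw), version, check)
```

Injecting `check` keeps the sketch runnable without the Weaver binary and is also what makes this kind of stage straightforward to unit test.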
4. Frontend & Metadata Schema
Location: `ecosystem-explorer/src/types/javaagent.ts`

The compliance status is persisted as a `semconv_compliance` array on individual telemetry signals:

    { "name": "http.server.request.duration", "unit": "s", "semconv_compliance": ["1.37.0"] }

This structure is extensible, allowing an instrumentation to be marked as compliant with multiple semantic convention versions over time.
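
To make the "multiple versions over time" point concrete, here is a minimal sketch of deriving a semconv version from an instrumentation's `schema_url` and merging it into the array. Both helpers are hypothetical, and the URL shape (`https://opentelemetry.io/schemas/<version>`) is an assumption about the input rather than something this document specifies.

```python
import re
from typing import Optional

# Assumes schema URLs end in /schemas/<major>.<minor>.<patch>.
_SCHEMA_VERSION = re.compile(r"/schemas/(\d+\.\d+\.\d+)/?$")


def semconv_version_from_schema_url(schema_url: Optional[str]) -> Optional[str]:
    """Extract the version from a schema URL, or None if it does not match."""
    if not schema_url:
        return None
    match = _SCHEMA_VERSION.search(schema_url)
    return match.group(1) if match else None


def add_compliant_version(signal: dict, version: str) -> dict:
    """Add a version to the signal's semconv_compliance array without duplicates."""
    versions = set(signal.get("semconv_compliance", []))
    versions.add(version)
    # Sort numerically so that 1.9.0 is ordered before 1.37.0.
    signal["semconv_compliance"] = sorted(
        versions, key=lambda v: tuple(int(part) for part in v.split("."))
    )
    return signal
```

Keeping the array sorted and duplicate-free means repeated builds against the same version leave the published JSON unchanged.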
5. Verification & Testing
Location: `tests/test_semconv_enricher.py`

A dedicated test suite validates the following:
- […] `manifest.yaml` and `telemetry.yaml` are valid and follow Weaver's specification.
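
For the mocked Weaver interactions, a test along the following lines is one plausible shape. It is a sketch only: `weaver_succeeded` is a hypothetical stand-in for the code under test, not the PR's actual function, and the command line is copied from the reviewed excerpt.

```python
import subprocess
import unittest
from unittest import mock


def weaver_succeeded(weaver_path: str, registry_dir: str) -> bool:
    """Hypothetical stand-in for the code under test: True when Weaver exits with 0."""
    cmd = [weaver_path, "registry", "check", "-r", registry_dir]
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.returncode == 0


class WeaverInvocationTest(unittest.TestCase):
    def test_exit_code_zero_means_success(self):
        fake = subprocess.CompletedProcess(args=[], returncode=0, stdout="", stderr="")
        # Patch subprocess.run so no real binary is needed.
        with mock.patch("subprocess.run", return_value=fake) as run:
            self.assertTrue(weaver_succeeded("weaver", "/tmp/registry"))
        run.assert_called_once_with(
            ["weaver", "registry", "check", "-r", "/tmp/registry"],
            capture_output=True,
            text=True,
        )

    def test_nonzero_exit_means_failure(self):
        fake = subprocess.CompletedProcess(args=[], returncode=1, stdout="", stderr="[Error] ...")
        with mock.patch("subprocess.run", return_value=fake):
            self.assertFalse(weaver_succeeded("weaver", "/tmp/registry"))
```

Saved as a module, this can be run with `python -m unittest`; asserting on the exact argument list guards against accidental changes to the CLI invocation.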