Restructure backend manifest JSON with hierarchical keys and unified parameters #24

@PaulHax

Problem

The current backend manifest JSON structure has several issues:

  1. Concatenated string keys: Keys like "phase2_pipeline_zeroshot_comparative_regression_mistralai/Mistral-7B-Instruct-v0.3_merit-0.0" are hard to parse and filter
  2. Config duplication: Full experiment config is duplicated across every scenario
  3. Poor structure: Difficult to query by specific ADM, LLM, or KDMA parameters
  4. Artificial scenario indexing: Currently appends indices to scenario IDs instead of using actual scene structure from input.full_state.meta_info.scene_id
  5. No clear separation: Scenarios and scenes are conflated
  6. File duplication: Creates filtered copies of input_output.json files unnecessarily
  7. Limited extensibility: Hard to add new parameter dimensions

Enhanced Solution

After further analysis, we've refined the approach to use a flexible parameter-based structure with integrity validation and fast lookup indices.

Key Design Decisions

  1. Flexible parameters: No rigid hierarchy - can handle ADMs with/without LLMs and future extensions
  2. No file duplication: Reference original files with source indices
  3. Hash-based experiment keys: Deterministic keys generated from parameter hash
  4. Fast lookups: Reverse indices for efficient filtering
  5. Integrity validation: File checksums prevent stale data issues
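
The integrity check in decision 5 could look roughly like the sketch below. The helper names file_checksum and is_stale are hypothetical and not taken from the codebase; the only thing assumed from the proposal is the "sha256:<hex>" checksum format.

```python
import hashlib
from pathlib import Path


def file_checksum(path: Path, chunk_size: int = 65536) -> str:
    """Return a 'sha256:<hex>' checksum, reading the file in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large input_output.json files
        # are never loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return f"sha256:{digest.hexdigest()}"


def is_stale(path: Path, recorded_checksum: str) -> bool:
    """True when the file is missing or no longer matches the manifest entry."""
    if not path.exists():
        return True
    return file_checksum(path) != recorded_checksum
```

A consumer would call is_stale with the checksum stored under "files" before trusting any source_index into that file, and regenerate the manifest when it returns True.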

New Structure

{
  "manifest_version": "2.0",
  "generated_at": "2025-07-18T15:30:00Z",
  "experiments": {
    "exp_a1b2c3d4": {
      "parameters": {
        "adm": {
          "name": "phase2_pipeline_zeroshot_comparative_regression",
          "instance": { /* full ADM config */ }
        },
        "llm": {
          "model_name": "mistralai/Mistral-7B-Instruct-v0.3",
          "precision": "half"
        },  // use null for non-LLM ADMs
        "kdma_values": [{"kdma": "merit", "value": 0.5}],  // [] for unaligned
        "alignment_target_id": "ADEPT-June2025-merit-0.5",
        "run_variant": "default"
      },
      "scenarios": {
        "June2025-MF1-eval": {
          "input_output": {
            "file": "data/2025-06-23__12-28-29/input_output.json",
            "checksum": "sha256:a1b2c3d4e5f6...",
            "alignment_target_filter": "ADEPT-June2025-merit-0.5"
          },
          "scores": null,
          "timing": "data/2025-06-23__12-28-29/timing.json",
          "scenes": {
            "Probe 1": { "source_index": 5, "scene_id": "Probe 1" },
            "Probe 5": { "source_index": 12, "scene_id": "Probe 5" }
          }
        }
      }
    }
  },
  "indices": {
    "by_adm": {
      "phase2_pipeline_zeroshot_comparative_regression": ["exp_a1b2c3d4"],
      "rule_based_baseline": ["exp_e5f6g7h8"]
    },
    "by_llm": {
      "mistralai/Mistral-7B-Instruct-v0.3": ["exp_a1b2c3d4"],
      "no-llm": ["exp_e5f6g7h8"]
    },
    "by_kdma": {
      "merit-0.5": ["exp_a1b2c3d4"],
      "unaligned": ["exp_e5f6g7h8"]
    },
    "by_scenario": {
      "June2025-MF1-eval": ["exp_a1b2c3d4", "exp_e5f6g7h8"]
    }
  },
  "files": {
    "data/2025-06-23__12-28-29/input_output.json": {
      "checksum": "sha256:a1b2c3d4e5f6...",
      "size": 2048576,
      "experiments": ["exp_a1b2c3d4", "exp_e5f6g7h8"]
    }
  }
}
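
As an illustration of how the "indices" section might be consumed, the following sketch intersects the reverse indices to answer a combined filter. The function find_experiments and the trimmed example_manifest are hypothetical; only the index names and sample keys come from the structure above.

```python
from typing import Optional

# Trimmed, illustrative manifest that mirrors the "indices" section above.
example_manifest = {
    "experiments": {"exp_a1b2c3d4": {}, "exp_e5f6g7h8": {}},
    "indices": {
        "by_adm": {
            "phase2_pipeline_zeroshot_comparative_regression": ["exp_a1b2c3d4"],
            "rule_based_baseline": ["exp_e5f6g7h8"],
        },
        "by_llm": {
            "mistralai/Mistral-7B-Instruct-v0.3": ["exp_a1b2c3d4"],
            "no-llm": ["exp_e5f6g7h8"],
        },
        "by_kdma": {"merit-0.5": ["exp_a1b2c3d4"], "unaligned": ["exp_e5f6g7h8"]},
        "by_scenario": {"June2025-MF1-eval": ["exp_a1b2c3d4", "exp_e5f6g7h8"]},
    },
}


def find_experiments(manifest: dict, adm: Optional[str] = None,
                     llm: Optional[str] = None, kdma: Optional[str] = None,
                     scenario: Optional[str] = None) -> list:
    """Return sorted experiment keys matching every filter that was given."""
    indices = manifest["indices"]
    filters = [("by_adm", adm), ("by_llm", llm),
               ("by_kdma", kdma), ("by_scenario", scenario)]
    result = None
    for index_name, value in filters:
        if value is None:
            continue  # this dimension is not being filtered on
        matches = set(indices.get(index_name, {}).get(value, []))
        result = matches if result is None else result & matches
    if result is None:
        # No filter supplied: every experiment matches.
        result = set(manifest["experiments"])
    return sorted(result)


non_llm_runs = find_experiments(example_manifest,
                                scenario="June2025-MF1-eval", llm="no-llm")
```

Each filter costs one dictionary lookup plus a set intersection, so the query never has to scan the "experiments" object.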

Experiment Key Generation

Keys are generated deterministically from a hash of the identifying parameters:

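const { createHash } = require('crypto');  // Node.js built-in module

function generateExperimentKey(parameters) {
  // Only the fields that distinguish one experiment from another feed the hash.
  const keyData = {
    adm: parameters.adm.name,
    llm: parameters.llm?.model_name || "no-llm",
    // Sort so the key does not depend on the order of kdma_values.
    kdma: parameters.kdma_values.map(kv => `${kv.kdma}-${kv.value}`).sort().join('_') || "unaligned",
    run_variant: parameters.run_variant || "default"
  };

  const hash = createHash('sha256').update(JSON.stringify(keyData)).digest('hex');
  return `exp_${hash.substring(0, 8)}`;
}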
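
Since the manifest is built by the Python backend, the same idea might be sketched in Python as follows. This is an assumption about how it could be ported, not the project's implementation: generate_experiment_key is a hypothetical name, and the serialization (sorted keys, compact separators, Python number formatting) differs from JSON.stringify, so the resulting keys should not be expected to match those from the JavaScript version.

```python
import hashlib
import json


def generate_experiment_key(parameters: dict) -> str:
    """Derive a short deterministic key from the identifying parameters."""
    llm = parameters.get("llm")  # None for non-LLM ADMs
    # Sort so the key does not depend on the order of kdma_values.
    kdma_parts = sorted(
        f"{kv['kdma']}-{kv['value']}" for kv in parameters.get("kdma_values", [])
    )
    key_data = {
        "adm": parameters["adm"]["name"],
        "llm": llm["model_name"] if llm else "no-llm",
        "kdma": "_".join(kdma_parts) or "unaligned",
        "run_variant": parameters.get("run_variant") or "default",
    }
    # sort_keys and fixed separators make the serialized form reproducible.
    payload = json.dumps(key_data, separators=(",", ":"), sort_keys=True)
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return f"exp_{digest[:8]}"


aligned_key = generate_experiment_key({
    "adm": {"name": "phase2_pipeline_zeroshot_comparative_regression"},
    "llm": {"model_name": "mistralai/Mistral-7B-Instruct-v0.3", "precision": "half"},
    "kdma_values": [{"kdma": "merit", "value": 0.5}],
    "run_variant": "default",
})
baseline_key = generate_experiment_key({
    "adm": {"name": "rule_based_baseline"},
    "llm": None,
    "kdma_values": [],
})
```

Truncating to 8 hex characters keeps keys readable but leaves only 32 bits, so a generator along these lines would probably want to detect collisions when inserting into "experiments".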

Handling Complex Cases

  1. Multiple experiments per file: Each experiment gets a separate entry with its own source_index values
  2. ADMs without LLMs: Use "llm": null
  3. Unaligned experiments: Use "kdma_values": []
  4. Run variants: Included in the parameter hash, so each variant gets a unique key
  5. Future parameters: Easy to add to the parameters object
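
For case 1, the scene mapping for one experiment could be built along the lines of the sketch below. It assumes input_output.json holds a list of records, that the scene ID lives at input.full_state.meta_info.scene_id as noted in the Problem section, and that each record carries its alignment target under input.alignment_target_id. That last field name and the function build_scene_index are assumptions made for illustration.

```python
from typing import Optional


def build_scene_index(records: list, alignment_target_id: Optional[str] = None) -> dict:
    """Map scene_id -> {"source_index", "scene_id"} for one experiment's records."""
    scenes = {}
    for source_index, record in enumerate(records):
        record_input = record.get("input", {})
        # Assumed field: skip records that belong to another experiment
        # sharing the same file.
        if (alignment_target_id is not None
                and record_input.get("alignment_target_id") != alignment_target_id):
            continue
        scene_id = (record_input.get("full_state", {})
                    .get("meta_info", {})
                    .get("scene_id"))
        if scene_id is None:
            continue  # record has no scene information
        # Keep the first occurrence if a scene appears more than once.
        if scene_id not in scenes:
            scenes[scene_id] = {"source_index": source_index, "scene_id": scene_id}
    return scenes


sample_records = [
    {"input": {"alignment_target_id": "ADEPT-June2025-merit-0.0",
               "full_state": {"meta_info": {"scene_id": "Probe 1"}}}},
    {"input": {"alignment_target_id": "ADEPT-June2025-merit-0.5",
               "full_state": {"meta_info": {"scene_id": "Probe 1"}}}},
    {"input": {"alignment_target_id": "ADEPT-June2025-merit-0.5",
               "full_state": {"meta_info": {"scene_id": "Probe 5"}}}},
]
merit_scenes = build_scene_index(sample_records, "ADEPT-June2025-merit-0.5")
```

Because source_index is the position in the original file, no filtered copy of input_output.json is needed; the checksum guards against the positions shifting underneath the manifest.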

Benefits

  • No file duplication: Keep original files, use indices
  • Fast queries: Pre-built indices for common filters
  • Extensible: Easy to add new parameter types
  • Integrity: Checksums prevent stale data issues
  • Flexible: Handles all current and future ADM/LLM combinations
  • Efficient: Smaller manifest and less disk use, since configs and input_output.json files are no longer duplicated
  • Maintainable: Clear separation of concerns

Implementation Tasks

  • Update ExperimentConfig to generate flexible parameter-based keys
  • Modify GlobalManifest class to build enhanced structure with indices
  • Update experiment_parser.py to extract scene_id and build source_index mappings
  • Add file checksum calculation and integrity validation
  • Implement reverse mapping indices (by_adm, by_llm, by_kdma, by_scenario)
  • Handle ADMs without LLMs (use null approach)
  • Update frontend to consume new structure and use indices for fast queries
  • Add comprehensive tests for enhanced manifest structure
  • Implement backward compatibility during transition
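
The reverse-index task above could be approached with a single pass over the "experiments" object, as in this sketch. build_indices is a hypothetical helper; the "no-llm" and "unaligned" buckets follow the sample structure, while indexing a multi-KDMA experiment under each of its KDMA values is an assumption the proposal does not settle.

```python
from collections import defaultdict


def build_indices(experiments: dict) -> dict:
    """Build by_adm / by_llm / by_kdma / by_scenario from the experiments map."""
    by_adm, by_llm, by_kdma, by_scenario = (defaultdict(list) for _ in range(4))
    for key in sorted(experiments):  # sorted for reproducible output
        entry = experiments[key]
        params = entry["parameters"]
        by_adm[params["adm"]["name"]].append(key)
        llm = params.get("llm")
        by_llm[llm["model_name"] if llm else "no-llm"].append(key)
        kdma_values = params.get("kdma_values") or []
        if kdma_values:
            # Assumption: one index entry per KDMA value.
            for kv in kdma_values:
                by_kdma[f"{kv['kdma']}-{kv['value']}"].append(key)
        else:
            by_kdma["unaligned"].append(key)
        for scenario_id in entry.get("scenarios", {}):
            by_scenario[scenario_id].append(key)
    return {
        "by_adm": dict(by_adm),
        "by_llm": dict(by_llm),
        "by_kdma": dict(by_kdma),
        "by_scenario": dict(by_scenario),
    }


sample_experiments = {
    "exp_a1b2c3d4": {
        "parameters": {
            "adm": {"name": "phase2_pipeline_zeroshot_comparative_regression"},
            "llm": {"model_name": "mistralai/Mistral-7B-Instruct-v0.3"},
            "kdma_values": [{"kdma": "merit", "value": 0.5}],
        },
        "scenarios": {"June2025-MF1-eval": {}},
    },
    "exp_e5f6g7h8": {
        "parameters": {"adm": {"name": "rule_based_baseline"},
                       "llm": None, "kdma_values": []},
        "scenarios": {"June2025-MF1-eval": {}},
    },
}
sample_indices = build_indices(sample_experiments)
```

Deriving the indices from "experiments" at generation time means they can always be rebuilt, so a test could assert that the stored indices equal a fresh build_indices result.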

This approach provides a robust, extensible foundation that can grow with the system's needs while solving all current structural issues.
