This repository stores benchmark results and datasets collected with metriq-gym.
The data here is consumed by metriq-web for presentation and analysis.
Part of the Metriq project.
- Run `python scripts/aggregate.py` to generate aggregated results.
metriq-score is computed per metric relative to a baseline device, honoring directionality:

- higher-is-better: `score = (value / baseline) * 100`
- lower-is-better: `score = (baseline / value) * 100`
Example: say X is the baseline device for series v0.4. For a metric where higher is better (e.g. "fidelity"), the value that X scored on that metric is assigned a metriq-score of 100. If X's raw value on that benchmark was 0.5 and another device Y reports 0.9, then Y's metriq-score is 0.9 / 0.5 * 100 = 180.
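A minimal sketch of this scoring rule in Python; the function name and the `higher_is_better` flag are illustrative, not part of the repository's code:

```python
def metriq_score(value: float, baseline: float, higher_is_better: bool) -> float:
    """Score a metric value against the baseline device's value.

    The baseline device always scores 100; other devices scale
    proportionally, with direction set by the metric.
    """
    if higher_is_better:
        return value / baseline * 100
    return baseline / value * 100


# Worked example from above: fidelity (higher is better),
# baseline device X scored 0.5, device Y scored 0.9.
print(metriq_score(0.9, 0.5, higher_is_better=True))  # -> 180.0
```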
Edit `scripts/baselines.json` to set the baseline per minor series (e.g. v0.4). Example:

```json
{
  "series": {
    "v0.4": { "provider": "origin", "device": "origin_wukong" }
  },
  "default": { "provider": "ibm", "device": "ibm_torino" }
}
```
For now, the baseline value for each metric is computed by averaging all rows reported by the baseline device within that series.
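A minimal sketch of how the baseline could be resolved and averaged. The row schema (`provider`, `device`, and metric keys) and the function names are assumptions for illustration, not the actual `aggregate.py` implementation:

```python
import json
import statistics


def resolve_baseline(series: str, path: str = "scripts/baselines.json") -> dict:
    """Return the baseline provider/device for a minor series,
    falling back to the "default" entry when the series is absent."""
    with open(path) as f:
        config = json.load(f)
    return config["series"].get(series, config["default"])


def baseline_value(rows: list[dict], baseline: dict, metric: str) -> float:
    """Average all rows from the baseline device for one metric."""
    values = [
        row[metric]
        for row in rows
        if row["provider"] == baseline["provider"]
        and row["device"] == baseline["device"]
        and metric in row
    ]
    return statistics.mean(values)
```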