Design a scalar proxy metric
\[ M(\text{model}, \text{proxy\_data}) \in \mathbb{R} \]
that is computable only from:
- a model checkpoint, and
- unlabeled long-context proxy text,
and that tracks downstream long-context benchmark performance as closely as possible.
Your optimization target is held-out correlation between your proxy metric values and a locked downstream target score across a fixed model set.
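For concreteness, one common family of candidate metrics is mean per-token negative log-likelihood over sliding windows of the proxy documents. Below is a minimal sketch of the window/aggregation side only; the model forward pass that produces `per_token_nll` is omitted, and every function name here is illustrative rather than part of the locked contract.

```python
# Sketch of a candidate proxy metric M(model, proxy_data): mean token-level
# negative log-likelihood (NLL) over sliding windows of unlabeled proxy text.
# Plugging in a real model's per-token NLLs (e.g. from a causal-LM forward
# pass) is left out; only the pure aggregation logic is shown.

def sliding_windows(n_tokens, window, stride):
    """Yield (start, end) spans covering a token sequence."""
    start = 0
    while start < n_tokens:
        end = min(start + window, n_tokens)
        yield start, end
        if end == n_tokens:
            break
        start += stride

def aggregate_nll(per_token_nll, window=4096, stride=2048):
    """Average per-window mean NLL -> one scalar per document."""
    assert per_token_nll, "empty document"
    spans = list(sliding_windows(len(per_token_nll), window, stride))
    window_means = [sum(per_token_nll[s:e]) / (e - s) for s, e in spans]
    return sum(window_means) / len(window_means)

def proxy_score(doc_nlls):
    """One scalar per model: mean over the proxy documents."""
    return sum(doc_nlls) / len(doc_nlls)
```

Window size, stride, and the two averaging steps are exactly the kind of hyperparameters the rules below leave open to change.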
The following are fixed for official comparison and must not be changed.
- 01-ai/Yi-6B-200K
- In2Training/FILM-7B
- Leooyii/Longlora_32k_Slimpajama_1B
- Leooyii/NTK_32k_Slimpajama_1B
- Leooyii/NTK_64k_Slimpajama_1B
- Leooyii/NTK_64k_Slimpajama_2B
- Leooyii/PI_32k_Slimpajama_1B
- Qwen/Qwen2-0.5B-Instruct
- Qwen/Qwen2-1.5B-Instruct
- Qwen/Qwen2-7B-Instruct
- Qwen/Qwen2.5-0.5B-Instruct
- Qwen/Qwen2.5-1.5B-Instruct
- Qwen/Qwen2.5-7B-Instruct
- Qwen/Qwen2.5-14B-Instruct
- Qwen/Qwen2.5-Coder-1.5B-Instruct
- Qwen/Qwen2.5-Coder-7B-Instruct
- mistralai/Mistral-7B-Instruct-v0.2
- Source: GovReport sampled subset (unlabeled)
- File used in this repo: `data/proxy/govreport_50_32k_tokens.json`
- Size: 50 documents
- Max length per document: 32k tokens (pre-truncated during preparation)
The verifier uses this precomputed downstream target file:
results/downstream/downstream_avg.ruler32_longbench_lt32k.no4k.json
Each model has one scalar overall score in that file, representing the locked downstream target used for correlation evaluation.
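A minimal loader sketch for development tooling, assuming (per the description above) that the target file is a flat JSON object mapping each model name to one scalar; `load_targets` is an illustrative helper, not part of the repo:

```python
import json

def load_targets(path):
    """Load the locked downstream target file: {model_name: scalar_score}.

    Assumes a flat JSON object; the shape is inferred from the spec text,
    not from the file itself.
    """
    with open(path) as f:
        targets = json.load(f)
    assert all(isinstance(v, (int, float)) for v in targets.values())
    return targets
```

Remember that these target scores may be used only for verification and method development (e.g. cross-validation), never as inputs to the proxy metric itself.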
scripts/verify_locked_no4k.py
This script is the official scorer for this problem setting.
Let:
- \(\mathcal{M}\) be the fixed set of 17 models above,
- \(D\) be the fixed unlabeled GovReport proxy corpus,
- \(Y_m\) be the locked downstream overall score for model \(m \in \mathcal{M}\),
- \(S_m = M(m, D)\) be your submitted proxy score.
The verifier computes:
- Spearman correlation: \(\rho_s(S, Y)\)
- Pearson correlation: \(\rho_p(S, Y)\)
- Skipped Spearman (outlier-robust Spearman using MCD-based point filtering)
and bootstrap 95% confidence intervals for Spearman and skipped Spearman.
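For local iteration before invoking the official scorer, the plain (non-skipped) statistics and a bootstrap CI can be approximated with NumPy alone. This is a sketch, not the verifier's exact logic: it omits the MCD-based point filtering behind skipped Spearman, and the bootstrap details (resample count, degenerate-resample handling) are assumptions.

```python
import numpy as np

def _rank(a):
    """Average ranks with tie handling (as in Spearman's rho)."""
    order = np.argsort(a, kind="mergesort")
    ranks = np.empty(len(a), float)
    sa = a[order]
    i = 0
    while i < len(a):
        j = i
        while j + 1 < len(a) and sa[j + 1] == sa[i]:
            j += 1  # extend the tie group
        ranks[order[i:j + 1]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def pearson(x, y):
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))

def spearman(x, y):
    """Pearson correlation of the rank-transformed values."""
    return pearson(_rank(np.asarray(x, float)), _rank(np.asarray(y, float)))

def bootstrap_ci(x, y, stat=spearman, n_boot=2000, seed=0, alpha=0.05):
    """Percentile bootstrap CI for a correlation statistic."""
    rng = np.random.default_rng(seed)
    x, y = np.asarray(x, float), np.asarray(y, float)
    vals = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(x), len(x))
        # Skip resamples where either side is constant (correlation undefined).
        if np.ptp(x[idx]) > 0 and np.ptp(y[idx]) > 0:
            vals.append(stat(x[idx], y[idx]))
    return (float(np.percentile(vals, 100 * alpha / 2)),
            float(np.percentile(vals, 100 * (1 - alpha / 2))))
```

Only the official script's output counts for verification; this helper is for fast inner-loop feedback.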
Your goal is to maximize correlation quality, with Spearman typically serving as the primary rank statistic, while keeping metric computation practical.
You submit one scalar per model for all locked models.
Two accepted formats:
- JSON object mapping model name to score
- JSON list of rows, each row containing a `model` key and a score key (default key name: `score`, configurable via `--metric-key`)

Example (JSON object format):

```json
{
  "01-ai/Yi-6B-200K": 1.234,
  "In2Training/FILM-7B": 0.918,
  "Leooyii/Longlora_32k_Slimpajama_1B": 1.102
}
```

A real submission must include all 17 models exactly.
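Exporting the object-format submission can be as simple as the sketch below; the finiteness guard mirrors the verifier's rejection of non-finite scores. `write_submission` is an illustrative name, not part of the repo.

```python
import json
import math

def write_submission(scores, path):
    """Write a {model_name: scalar} submission JSON.

    Fails fast on NaN/inf, which the official verifier rejects.
    """
    bad = [m for m, v in scores.items() if not math.isfinite(v)]
    assert not bad, f"non-finite scores for: {bad}"
    with open(path, "w") as f:
        json.dump(scores, f, indent=2, sort_keys=True)
```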
You may change anything in your method design, including:
- token-level statistic definitions,
- sliding-window strategy,
- aggregation functions,
- hyperparameters,
- model-side inference implementation details.
You may run cross-validation or train/test splitting across the fixed model set during method development.
You may negate or rescale your metric before submission.
You must not change:
- the locked model set,
- the proxy dataset identity/content for official comparison,
- the locked downstream target file,
- the official verification script logic,
- the requirement that metric computation uses only \((\text{model}, \text{proxy\_data})\).
You must not use downstream benchmark labels/scores to compute per-model proxy values.
You must not use extra labeled data or downstream test content when computing the proxy metric.
No model fine-tuning/retraining is part of this task; evaluation is inference-time metric computation only.
Run from repository root:
```bash
python scripts/verify_locked_no4k.py \
  --submission path/to/your_submission.json \
  --metric-key score \
  --out path/to/verify_result.json
```

The verifier will:
- enforce exact model-set match (missing/extra model => error),
- reject non-finite submitted scores,
- print and optionally save correlation outputs and confidence intervals.
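A hypothetical pre-flight check that mirrors the gating rules just listed (exact model-set match, finite scores) can save a failed official run; the authoritative logic remains `scripts/verify_locked_no4k.py` and must not be changed.

```python
import json
import math

def precheck(submission_path, locked_models):
    """Mimic the verifier's gating: exact model set, all scores finite."""
    with open(submission_path) as f:
        sub = json.load(f)
    missing = set(locked_models) - set(sub)
    extra = set(sub) - set(locked_models)
    if missing or extra:
        raise ValueError(
            f"model-set mismatch: missing={sorted(missing)}, extra={sorted(extra)}"
        )
    bad = [m for m, v in sub.items() if not math.isfinite(v)]
    if bad:
        raise ValueError(f"non-finite scores: {bad}")
    return sub
```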
For a valid research claim in this setting, report at minimum:
- exact code revision/commit,
- exact submission file used for verification,
- random seeds used by your method,
- compute budget (tokens processed and wall-clock runtime),
- verifier output JSON from `scripts/verify_locked_no4k.py`.
- Implement candidate metric \(M\) using only \((\text{model}, \text{proxy\_data})\).
- Compute one score per locked model on `data/proxy/govreport_50_32k_tokens.json`.
- Export the 17-model submission JSON.
- Run `scripts/verify_locked_no4k.py`.
- Iterate on metric design and re-verify.
This defines the benchmark contract for fair, reproducible method comparison.