Design a scalar proxy metric
\[ M(\text{model}, \text{proxy\_data}) \in \mathbb{R} \]
that is computable only from:
- a model checkpoint, and
- unlabeled long-context proxy text,
and that tracks downstream long-context benchmark performance as closely as possible.
Your optimization target is held-out correlation between your proxy metric values and a locked downstream target score across a fixed model set.
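For concreteness, one common family of candidate metrics is mean per-token negative log-likelihood over sliding windows of the proxy documents. Below is a minimal sketch of the window/aggregation side only; the model forward pass that produces `per_token_nll` is omitted, and every function name here is illustrative rather than part of the locked contract.

```python
# Sketch of a candidate proxy metric M(model, proxy_data): mean token-level
# negative log-likelihood (NLL) over sliding windows of unlabeled proxy text.
# Plugging in a real model's per-token NLLs (e.g. from a causal-LM forward
# pass) is left out; only the pure aggregation logic is shown.

def sliding_windows(n_tokens, window, stride):
    """Yield (start, end) spans covering a token sequence."""
    start = 0
    while start < n_tokens:
        end = min(start + window, n_tokens)
        yield start, end
        if end == n_tokens:
            break
        start += stride

def aggregate_nll(per_token_nll, window=4096, stride=2048):
    """Average per-window mean NLL -> one scalar per document."""
    assert per_token_nll, "empty document"
    spans = list(sliding_windows(len(per_token_nll), window, stride))
    window_means = [sum(per_token_nll[s:e]) / (e - s) for s, e in spans]
    return sum(window_means) / len(window_means)

def proxy_score(doc_nlls):
    """One scalar per model: mean over the proxy documents."""
    return sum(doc_nlls) / len(doc_nlls)
```

Window size, stride, and the two averaging steps are exactly the kind of hyperparameters the rules below leave open to change.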
The following are fixed for official comparison and must not be changed.
- 01-ai/Yi-6B-200K
- In2Training/FILM-7B
- Leooyii/Longlora_32k_Slimpajama_1B
- Leooyii/NTK_32k_Slimpajama_1B
- Leooyii/NTK_64k_Slimpajama_1B
- Leooyii/NTK_64k_Slimpajama_2B
- Leooyii/PI_32k_Slimpajama_1B
- Qwen/Qwen2-0.5B-Instruct
- Qwen/Qwen2-1.5B-Instruct
- Qwen/Qwen2-7B-Instruct
- Qwen/Qwen2.5-0.5B-Instruct
- Qwen/Qwen2.5-1.5B-Instruct
- Qwen/Qwen2.5-7B-Instruct
- Qwen/Qwen2.5-14B-Instruct
- Qwen/Qwen2.5-Coder-1.5B-Instruct
- Qwen/Qwen2.5-Coder-7B-Instruct
- mistralai/Mistral-7B-Instruct-v0.2
- Source: GovReport sampled subset (unlabeled)
- File used in this repo: `data/proxy/govreport_50_32k_tokens.json`
- Size: 50 documents
- Max length per document: 32k tokens (pre-truncated during preparation)
The verifier uses this precomputed downstream target file:
results/downstream/downstream_avg.ruler32_longbench_lt32k.no4k.json
Each model has one scalar overall score in that file, representing the locked downstream target used for correlation evaluation.
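A minimal loader sketch for development tooling, assuming (per the description above) that the target file is a flat JSON object mapping each model name to one scalar; `load_targets` is an illustrative helper, not part of the repo:

```python
import json

def load_targets(path):
    """Load the locked downstream target file: {model_name: scalar_score}.

    Assumes a flat JSON object; the shape is inferred from the spec text,
    not from the file itself.
    """
    with open(path) as f:
        targets = json.load(f)
    assert all(isinstance(v, (int, float)) for v in targets.values())
    return targets
```

Remember that these target scores may be used only for verification and method development (e.g. cross-validation), never as inputs to the proxy metric itself.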
scripts/verify_locked_no4k.py
This script is the official scorer for this problem setting.
Let:
- \(\mathcal{M}\) be the fixed set of 17 models above,
- \(D\) be the fixed unlabeled GovReport proxy corpus,
- \(Y_m\) be the locked downstream overall score for model \(m \in \mathcal{M}\),
- \(S_m = M(m, D)\) be your submitted proxy score.
The verifier computes:
- Spearman correlation: \(\rho_s(S, Y)\)
- Pearson correlation: \(\rho_p(S, Y)\)
- Skipped Spearman (outlier-robust Spearman using MCD-based point filtering)
and bootstrap 95% confidence intervals for Spearman and skipped Spearman.
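For local iteration before invoking the official scorer, the plain (non-skipped) statistics and a bootstrap CI can be approximated with NumPy alone. This is a sketch, not the verifier's exact logic: it omits the MCD-based point filtering behind skipped Spearman, and the bootstrap details (resample count, degenerate-resample handling) are assumptions.

```python
import numpy as np

def _rank(a):
    """Average ranks with tie handling (as in Spearman's rho)."""
    order = np.argsort(a, kind="mergesort")
    ranks = np.empty(len(a), float)
    sa = a[order]
    i = 0
    while i < len(a):
        j = i
        while j + 1 < len(a) and sa[j + 1] == sa[i]:
            j += 1  # extend the tie group
        ranks[order[i:j + 1]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def pearson(x, y):
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))

def spearman(x, y):
    """Pearson correlation of the rank-transformed values."""
    return pearson(_rank(np.asarray(x, float)), _rank(np.asarray(y, float)))

def bootstrap_ci(x, y, stat=spearman, n_boot=2000, seed=0, alpha=0.05):
    """Percentile bootstrap CI for a correlation statistic."""
    rng = np.random.default_rng(seed)
    x, y = np.asarray(x, float), np.asarray(y, float)
    vals = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(x), len(x))
        # Skip resamples where either side is constant (correlation undefined).
        if np.ptp(x[idx]) > 0 and np.ptp(y[idx]) > 0:
            vals.append(stat(x[idx], y[idx]))
    return (float(np.percentile(vals, 100 * alpha / 2)),
            float(np.percentile(vals, 100 * (1 - alpha / 2))))
```

Only the official script's output counts for verification; this helper is for fast inner-loop feedback.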
Your goal is to maximize correlation quality, with Spearman typically serving as the primary rank statistic, while keeping metric computation practical.
You submit one scalar per model for all locked models.
Two accepted formats:
- JSON object mapping model name to score
- JSON list of rows, each row containing a `model` key and a score key (default key name: `score`, configurable via `--metric-key`)

Example (JSON object format):

```json
{
  "01-ai/Yi-6B-200K": 1.234,
  "In2Training/FILM-7B": 0.918,
  "Leooyii/Longlora_32k_Slimpajama_1B": 1.102
}
```

A real submission must include all 17 models exactly.
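Exporting the object-format submission can be as simple as the sketch below; the finiteness guard mirrors the verifier's rejection of non-finite scores. `write_submission` is an illustrative name, not part of the repo.

```python
import json
import math

def write_submission(scores, path):
    """Write a {model_name: scalar} submission JSON.

    Fails fast on NaN/inf, which the official verifier rejects.
    """
    bad = [m for m, v in scores.items() if not math.isfinite(v)]
    assert not bad, f"non-finite scores for: {bad}"
    with open(path, "w") as f:
        json.dump(scores, f, indent=2, sort_keys=True)
```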
You may change anything in your method design, including:
- token-level statistic definitions,
- sliding-window strategy,
- aggregation functions,
- hyperparameters,
- model-side inference implementation details.
You may run cross-validation or train/test splitting across the fixed model set during method development.
You may negate or rescale your metric before submission.
You must not change:
- the locked model set,
- the proxy dataset identity/content for official comparison,
- the locked downstream target file,
- the official verification script logic,
- the requirement that metric computation uses only \((\text{model}, \text{proxy\_data})\).
You must not use downstream benchmark labels/scores to compute per-model proxy values.
You must not use extra labeled data or downstream test content when computing the proxy metric.
No model fine-tuning/retraining is part of this task; evaluation is inference-time metric computation only.
Run from repository root:
```bash
python scripts/verify_locked_no4k.py \
  --submission path/to/your_submission.json \
  --metric-key score \
  --out path/to/verify_result.json
```

The verifier will:
- enforce exact model-set match (missing/extra model => error),
- reject non-finite submitted scores,
- print and optionally save correlation outputs and confidence intervals.
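A hypothetical pre-flight check that mirrors the gating rules just listed (exact model-set match, finite scores) can save a failed official run; the authoritative logic remains `scripts/verify_locked_no4k.py` and must not be changed.

```python
import json
import math

def precheck(submission_path, locked_models):
    """Mimic the verifier's gating: exact model set, all scores finite."""
    with open(submission_path) as f:
        sub = json.load(f)
    missing = set(locked_models) - set(sub)
    extra = set(sub) - set(locked_models)
    if missing or extra:
        raise ValueError(
            f"model-set mismatch: missing={sorted(missing)}, extra={sorted(extra)}"
        )
    bad = [m for m, v in sub.items() if not math.isfinite(v)]
    if bad:
        raise ValueError(f"non-finite scores: {bad}")
    return sub
```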
For a valid research claim in this setting, report at minimum:
- exact code revision/commit,
- exact submission file used for verification,
- random seeds used by your method,
- compute budget (tokens processed and wall-clock runtime),
- verifier output JSON from `scripts/verify_locked_no4k.py`.
- Implement candidate metric \(M\) using only \((\text{model}, \text{proxy\_data})\).
- Compute one score per locked model on `data/proxy/govreport_50_32k_tokens.json`.
- Export the 17-model submission JSON.
- Run `scripts/verify_locked_no4k.py`.
- Iterate on metric design and re-verify.
This defines the benchmark contract for fair, reproducible method comparison.