Locked Problem Statement: Proxy Metric for Long-Context Capability

1. Objective

Design a scalar proxy metric

$$ M(\text{model}, \text{proxy\_data}) \in \mathbb{R} $$

that is computable only from:

  • a model checkpoint, and
  • unlabeled long-context proxy text,

and that tracks downstream long-context benchmark performance as closely as possible.

Your optimization target is held-out correlation between your proxy metric values and a locked downstream target score across a fixed model set.

2. What Is Fixed (Locked)

The following are fixed for official comparison and must not be changed.

2.1 Model universe (17 models)

  • 01-ai/Yi-6B-200K
  • In2Training/FILM-7B
  • Leooyii/Longlora_32k_Slimpajama_1B
  • Leooyii/NTK_32k_Slimpajama_1B
  • Leooyii/NTK_64k_Slimpajama_1B
  • Leooyii/NTK_64k_Slimpajama_2B
  • Leooyii/PI_32k_Slimpajama_1B
  • Qwen/Qwen2-0.5B-Instruct
  • Qwen/Qwen2-1.5B-Instruct
  • Qwen/Qwen2-7B-Instruct
  • Qwen/Qwen2.5-0.5B-Instruct
  • Qwen/Qwen2.5-1.5B-Instruct
  • Qwen/Qwen2.5-7B-Instruct
  • Qwen/Qwen2.5-14B-Instruct
  • Qwen/Qwen2.5-Coder-1.5B-Instruct
  • Qwen/Qwen2.5-Coder-7B-Instruct
  • mistralai/Mistral-7B-Instruct-v0.2

2.2 Proxy data

  • Source: GovReport sampled subset (unlabeled)
  • File used in this repo: data/proxy/govreport_50_32k_tokens.json
  • Size: 50 documents
  • Max length per document: 32k tokens (pre-truncated during preparation)
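A loader for this file can be sketched as follows. The exact JSON schema of govreport_50_32k_tokens.json is an assumption here (a flat array of documents); adapt the check to the real layout:

```python
import json

def load_proxy_docs(path):
    """Load the proxy corpus. Assumes the file is a JSON array of
    documents; the actual schema of govreport_50_32k_tokens.json
    may differ, so verify against the repo file."""
    with open(path, encoding="utf-8") as f:
        docs = json.load(f)
    if not isinstance(docs, list):
        raise ValueError(f"expected a JSON array, got {type(docs).__name__}")
    return docs
```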

2.3 Downstream target (locked labels)

The verifier uses this precomputed downstream target file:

  • results/downstream/downstream_avg.ruler32_longbench_lt32k.no4k.json

Each model has one scalar overall score in that file, representing the locked downstream target used for correlation evaluation.

2.4 Verification script

  • scripts/verify_locked_no4k.py

This script is the official scorer for this problem setting.

3. Formal Task Definition

Let:

  • $\mathcal{M}$ be the fixed set of 17 models above,
  • $D$ be the fixed unlabeled GovReport proxy corpus,
  • $Y_m$ be the locked downstream overall score for model $m \in \mathcal{M}$,
  • $S_m = M(m, D)$ be your submitted proxy score.

The verifier computes:

  • Spearman correlation: $\rho_s(S, Y)$
  • Pearson correlation: $\rho_p(S, Y)$
  • Skipped Spearman: an outlier-robust Spearman that filters points via minimum covariance determinant (MCD) estimation

and bootstrap 95% confidence intervals for Spearman and skipped Spearman.

Your goal is to maximize correlation quality, with Spearman as the primary rank statistic, while keeping metric computation practical.
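The two standard statistics can be reproduced with SciPy as below; this sketch deliberately omits the skipped-Spearman/MCD filtering, which is defined only by the official script:

```python
from scipy.stats import pearsonr, spearmanr

def correlation_report(proxy_scores, downstream_scores):
    """Spearman (rank) and Pearson (linear) correlation between
    submitted proxy scores S and locked downstream scores Y.
    Illustrative only; the official numbers come from the verifier."""
    rho_s, _ = spearmanr(proxy_scores, downstream_scores)
    rho_p, _ = pearsonr(proxy_scores, downstream_scores)
    return {"spearman": rho_s, "pearson": rho_p}
```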

4. Submission Interface

You submit one scalar score for each of the 17 locked models.

Two accepted formats:

  1. JSON object mapping model to score
  2. JSON list of rows, each row containing:
    • model
    • score key (default key name: score, configurable via --metric-key)

4.1 Minimal example submission

```json
{
  "01-ai/Yi-6B-200K": 1.234,
  "In2Training/FILM-7B": 0.918,
  "Leooyii/Longlora_32k_Slimpajama_1B": 1.102
}
```

(A real submission must include exactly the 17 locked models.)
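Either accepted format can be generated from a `{model: score}` mapping. `write_submission` below is an illustrative helper, not part of the repo:

```python
import json
import math

def write_submission(scores, path, fmt="object", metric_key="score"):
    """Serialize per-model proxy scores in either accepted format.
    `scores` maps model name -> scalar; all values must be finite,
    since the verifier rejects non-finite scores."""
    for model, value in scores.items():
        if not math.isfinite(value):
            raise ValueError(f"non-finite score for {model}: {value!r}")
    if fmt == "object":
        payload = scores  # format 1: JSON object {model: score}
    else:
        # format 2: list of rows with "model" and the score key
        payload = [{"model": m, metric_key: s} for m, s in scores.items()]
    with open(path, "w", encoding="utf-8") as f:
        json.dump(payload, f, indent=2)
```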

5. What You May Change

You may change anything in your method design, including:

  • token-level statistic definitions,
  • sliding-window strategy,
  • aggregation functions,
  • hyperparameters,
  • model-side inference implementation details.

You may run cross-validation or train/test splitting across the fixed model set during method development.

You may negate or rescale your metric before submission.
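Because Spearman depends only on ranks, any strictly increasing rescaling leaves it unchanged, and negation flips its sign. A small orientation helper (illustrative, not part of the repo):

```python
def orient(scores, higher_is_better=True, scale=1.0):
    """Negate and/or rescale a metric before submission. Any positive
    affine map preserves Spearman exactly; negation reverses ranks,
    so use it when lower raw values indicate better models."""
    if scale <= 0:
        raise ValueError("scale must be positive to preserve rank order")
    sign = 1.0 if higher_is_better else -1.0
    return {m: sign * scale * s for m, s in scores.items()}
```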

6. What You Must Not Change

You must not change:

  • the locked model set,
  • the proxy dataset identity/content for official comparison,
  • the locked downstream target file,
  • the official verification script logic,
  • the requirement that metric computation uses only (model, proxy_data).

You must not use downstream benchmark labels/scores to compute per-model proxy values.

You must not use extra labeled data or downstream test content when computing the proxy metric.

No model fine-tuning/retraining is part of this task; evaluation is inference-time metric computation only.

7. Verification Command (Official)

Run from repository root:

```bash
python scripts/verify_locked_no4k.py \
  --submission path/to/your_submission.json \
  --metric-key score \
  --out path/to/verify_result.json
```

The verifier will:

  • enforce exact model-set match (missing/extra model => error),
  • reject non-finite submitted scores,
  • print and optionally save correlation outputs and confidence intervals.
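The gate checks can be mirrored locally before invoking the official script. `preflight` is a hypothetical helper that approximates the verifier's input validation, not the verifier itself:

```python
import math

def preflight(submission, locked_models):
    """Approximate the verifier's gate checks: exact model-set match
    (missing/extra model => error) and finite scores. A sketch only;
    scripts/verify_locked_no4k.py remains the official scorer."""
    missing = set(locked_models) - set(submission)
    extra = set(submission) - set(locked_models)
    if missing or extra:
        raise ValueError(
            f"model-set mismatch: missing={sorted(missing)}, extra={sorted(extra)}"
        )
    bad = [m for m, s in submission.items() if not math.isfinite(s)]
    if bad:
        raise ValueError(f"non-finite scores for: {sorted(bad)}")
```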

8. Reproducibility Requirements

For a valid research claim in this setting, report at minimum:

  • exact code revision/commit,
  • exact submission file used for verification,
  • random seeds used by your method,
  • compute budget (tokens processed and wall-clock runtime),
  • verifier output JSON from scripts/verify_locked_no4k.py.

9. Recommended Research Workflow

  1. Implement a candidate metric $M$ using only (model, proxy_data).
  2. Compute one score per locked model on data/proxy/govreport_50_32k_tokens.json.
  3. Export the 17-model submission JSON.
  4. Run scripts/verify_locked_no4k.py.
  5. Iterate on metric design and re-verify.
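Step 2 reduces per-document statistics (whatever statistic your metric defines, e.g. a per-document loss) to one scalar per model. A minimal aggregation sketch with hypothetical helper names:

```python
import statistics

def aggregate_doc_stats(doc_stats, how="mean"):
    """Collapse one model's per-document statistics into a scalar.
    The choice of aggregator (mean, median, trimmed mean, ...) is
    part of your method design and free to change."""
    if how == "mean":
        return statistics.fmean(doc_stats)
    if how == "median":
        return statistics.median(doc_stats)
    raise ValueError(f"unknown aggregator: {how}")

def score_models(per_model_doc_stats, how="mean"):
    """per_model_doc_stats: {model: [stat per proxy document]} ->
    the submission mapping {model: scalar} to export as JSON."""
    return {m: aggregate_doc_stats(v, how) for m, v in per_model_doc_stats.items()}
```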

This defines the benchmark contract for fair, reproducible method comparison.