This repository contains a clean, inference-only proxy metric for the 17-model no4k long-context benchmark in problem.md. The core idea is simple: instead of asking whether a model assigns low perplexity to long documents in general, ask whether it can use a long document to recover the text that comes immediately after an earlier excerpt from that same document.
That shift matters. Raw language modeling loss mixes many effects together: base model quality, pretraining domain match, tokenizer quirks, and general fluency. Long-context evaluation is narrower. What we really want to know is whether the model can carry information forward across a long prefix and retrieve it when needed. The method in this repo tries to measure exactly that.
For each proxy document, I take a few excerpts from early or middle positions. Then I score the true continuation of each excerpt twice:
- once with the full document available in the prefix
- once without the document
The proxy score is the average improvement from having the document:
[ \text{retrieval\_gain} = \text{NLL}(\text{answer} \mid \text{excerpt only}) - \text{NLL}(\text{answer} \mid \text{full document + excerpt}) ]
Higher is better. A strong long-context model should get a real benefit from seeing the whole document. A weak one should gain less, or even get distracted.
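In code, the score is just a difference of two mean answer-token NLLs. A minimal sketch (the function names are illustrative, not taken from the repo):

```python
def mean_nll(token_logprobs):
    """Average negative log-likelihood over the answer tokens."""
    return -sum(token_logprobs) / len(token_logprobs)

def retrieval_gain(logprobs_without_doc, logprobs_with_doc):
    """NLL improvement from having the full document in the prefix."""
    return mean_nll(logprobs_without_doc) - mean_nll(logprobs_with_doc)

# Toy numbers: the document makes the answer tokens far more predictable.
gain = retrieval_gain(
    logprobs_without_doc=[-3.0, -3.4, -3.2],  # mean NLL = 3.2
    logprobs_with_doc=[-1.0, -1.2, -1.1],     # mean NLL = 1.1
)
```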
This is a single metric. It is not fit to the downstream labels, and it does not combine multiple hand-tuned metrics after looking at the target scores.
Here is a made-up example that captures the mechanics.
Suppose the document contains this passage:
The committee met on Tuesday to review the flood response plan. After two hours of discussion, it approved a temporary shelter expansion in the northern district.
Take the excerpt:
After two hours of discussion,
and let the target continuation be:
it approved a temporary shelter expansion
Now score that continuation in two settings.
Without the document, the model only sees the excerpt. Many continuations are plausible:
- the meeting ended
- officials requested more funding
- it approved a temporary shelter expansion
If the model assigns an average answer-token NLL of 3.2, then:
[ \ell_{\text{without}} = 3.2 ]
With the full document prepended, the continuation is much less ambiguous because the supporting context is already present. If the average answer-token NLL drops to 1.1, then:
[ \ell_{\text{with}} = 1.1 ]
So the task-level retrieval gain is:
[ g = 3.2 - 1.1 = 2.1 ]
That is a strong positive result. It says the document made the continuation much easier to predict. If another model only dropped from 3.0 to 2.6, its gain would be 0.4, suggesting that it extracted much less value from the long context.
Long-context benchmarks often reduce to a model finding and using information that appears far away in the prefix. GovReport is a good proxy corpus for this because it contains long, coherent documents with repeated entities, local structure, and enough redundancy that an earlier excerpt often strongly constrains what comes next.
The retrieval setup forces the model into a controlled version of that problem:
- the query is an exact excerpt from the document
- the target is the exact text that follows that excerpt
- the only difference between the two scoring conditions is whether the full long document is present
That last part is important. It means the score is closer to a causal estimate of "how much the long context helped" than raw perplexity is. Using a gain instead of plain NLL also partially factors out base model strength. A generally strong model with weak long-context use should not win by default; it has to actually benefit from the long prefix.
Take a document (d = (x_1, \dots, x_T)), truncated to doc_tokens = 8192. For each query anchor (a), define:
- query length (q = 24)
- answer length (r = 24)
- query tokens (Q = (x_a, \dots, x_{a+q-1}))
- answer tokens (A = (x_{a+q}, \dots, x_{a+q+r-1}))
Anchors are chosen deterministically from the interval ([0.1U, 0.6U]), where (U = T - q - r). In the final run, I use n_queries = 2 anchors per document.
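One deterministic scheme consistent with this description is to space the anchors evenly across the allowed interval. The even spacing is my assumption; the repo may use a different deterministic rule:

```python
def choose_anchors(T, q=24, r=24, n_queries=2, lo=0.1, hi=0.6):
    """Deterministic anchor positions in [lo*U, hi*U], where U = T - q - r."""
    U = T - q - r
    if n_queries == 1:
        return [int(lo * U)]
    # Evenly spaced fractions of U between lo and hi (inclusive).
    step = (hi - lo) / (n_queries - 1)
    return [int((lo + i * step) * U) for i in range(n_queries)]

anchors = choose_anchors(T=8192)  # two anchors, at 10% and 60% of U
```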
For model (m), score the answer under two prompts:
- With document:
[ \ell_{\text{with}}(m, d, a) = \frac{1}{r}\sum_{t=1}^{r} -\log p_m(A_t \mid d, Q, A_{<t}) ]
- Without document:
[ \ell_{\text{without}}(m, d, a) = \frac{1}{r}\sum_{t=1}^{r} -\log p_m(A_t \mid Q, A_{<t}) ]
The task score is:
[ g(m, d, a) = \ell_{\text{without}}(m, d, a) - \ell_{\text{with}}(m, d, a) ]
and the model-level submission score is:
[ M(m) = \frac{1}{|D|\,|A_d|}\sum_{d \in D}\sum_{a \in A_d} g(m, d, a) ]
where (D) is the 50-document GovReport proxy set and (A_d) is the set of chosen anchors in document (d).
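The model-level aggregation is a plain mean over all (document, anchor) task gains. A sketch assuming per-anchor gains stored in a nested dict (the data layout here is hypothetical):

```python
def model_score(gains):
    """Mean retrieval gain over all (document, anchor) tasks.

    gains: {doc_id: {anchor_position: g(m, d, a)}}
    """
    all_gains = [g for per_doc in gains.values() for g in per_doc.values()]
    return sum(all_gains) / len(all_gains)

# Two documents, two anchors each.
score = model_score({
    "doc0": {814: 2.1, 4886: 1.5},
    "doc1": {900: 0.4, 5000: 0.8},
})
```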
Implementation details:
- exact answer-token NLL is computed with KV-cache chunking, so long prefixes fit comfortably on one L40S
- chunk_size = 1024
- dtype = bf16
- scoring is deterministic
- no downstream labels are used anywhere in metric computation
The implementation lives in compute_retrieval_submission.py.
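The chunked scoring loop can be sketched as below. This is my own sketch, not the repo's code: it assumes a Hugging Face-style causal LM interface (`use_cache=True`, `past_key_values`), feeds the long prefix in chunks so the KV cache grows incrementally, then scores the answer tokens one at a time. The toy uniform model at the bottom exists only so the sketch is runnable without a real checkpoint:

```python
import math
from types import SimpleNamespace
import torch

@torch.no_grad()
def answer_nll_chunked(model, prefix_ids, answer_ids, chunk_size=1024):
    """Mean answer-token NLL with the prefix fed through the KV cache in chunks."""
    past, logits = None, None
    for start in range(0, len(prefix_ids), chunk_size):
        chunk = torch.tensor([prefix_ids[start:start + chunk_size]])
        out = model(input_ids=chunk, past_key_values=past, use_cache=True)
        past, logits = out.past_key_values, out.logits[0, -1]

    nll = 0.0
    for t, tok in enumerate(answer_ids):
        # Score the current answer token under the model's next-token distribution.
        nll -= torch.log_softmax(logits, dim=-1)[tok].item()
        if t < len(answer_ids) - 1:
            out = model(input_ids=torch.tensor([[tok]]),
                        past_key_values=past, use_cache=True)
            past, logits = out.past_key_values, out.logits[0, -1]
    return nll / len(answer_ids)

class UniformLM:
    """Toy stand-in: uniform next-token distribution over a 10-token vocab."""
    def __call__(self, input_ids, past_key_values=None, use_cache=True):
        b, s = input_ids.shape
        return SimpleNamespace(logits=torch.zeros(b, s, 10), past_key_values=None)

# Uniform over 10 tokens => per-token NLL is log(10) regardless of chunking.
nll = answer_nll_chunked(UniformLM(), list(range(50)), [1, 2, 3], chunk_size=16)
```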
I started with straightforward sliding-window loss proxies such as mean_nll, tail loss, and tail-vs-head degradation. They were reasonable diagnostics, but they were not focused enough on the thing the benchmark seems to care about: not just surviving a long prefix, but using it.
The retrieval framing worked better because it asks a narrower question. If the document really helps, the model's uncertainty about the continuation should drop. If the model cannot maintain useful state over the long prefix, the improvement should be small.
I also compared two versions of the retrieval metric:
- retrieval_nll: just the negative NLL with the document present
- retrieval_gain: the NLL improvement from adding the document
retrieval_gain was clearly better, which matches the intuition above. The gain isolates context use more cleanly than raw accuracy-with-context does.
The committed final submission is:
The corresponding verifier output is:
On the full 50-document setting, the verifier reports:
- spearman = 0.7230
- pearson = 0.6001
- skipped_spearman = 0.5780
- spearman_ci95 = [0.3499, 0.8854]
For development, the same metric on just 4 documents reached a slightly higher Spearman (0.7426), which was encouraging but not enough to trust on its own. Running the full 50 documents was the important check. The score held up.
The repository includes a small wrapper that launches the exact four-GPU run used for the final result:
Environment:
- 4x NVIDIA L40S
- local Hugging Face cache already populated at /workspace/.hf_cache
- Python environment created with bash scripts/setup_env.sh
Setup:
bash scripts/setup_env.sh
export HF_HOME=/workspace/.hf_cache

Reproduce the exact final run:

bash scripts/run_retrieval_full50.sh

That script:
- shards the 17 models across 4 GPUs
- runs retrieval_gain on all 50 proxy docs
- merges the shard outputs
- writes the final submission JSON
- runs the verifier with --bootstrap-resamples 5000
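The sharding step amounts to splitting 17 model names across 4 GPU indices. A round-robin sketch (the actual assignment in the script may differ):

```python
def shard_models(models, n_gpus=4):
    """Round-robin assignment of model names to GPU indices."""
    return {gpu: models[gpu::n_gpus] for gpu in range(n_gpus)}

# 17 models across 4 GPUs -> shards of size 5, 4, 4, 4.
shards = shard_models([f"model_{i}" for i in range(17)])
```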
If you want a smaller smoke test first, run one model on a few docs:
./.venv/bin/python scripts/compute_retrieval_submission.py \
--models configs/locked_models_working17.json \
--include-model Qwen/Qwen2.5-0.5B-Instruct \
--proxy data/proxy/govreport_50_docs.json \
--metric retrieval_gain \
--max-docs 2 \
--doc-tokens 8192 \
--n-queries 2 \
--query-tokens 24 \
--answer-tokens 24 \
--chunk-size 1024 \
--dtype bf16 \
--out submissions/smoke_retrieval_gain.json

The main lesson is that a good proxy should try to preserve the mechanism of the target task, not just its surface format. Long-context ability is not the same thing as low perplexity on long text. A retrieval-style score is a better match because it asks whether the model can actually cash in the information stored in a long prefix.
The second lesson is that subtraction helps. Measuring the gain from context was more informative than measuring absolute performance with context. That is a useful pattern beyond this benchmark: if you can compare "with the information" versus "without the information," you often get a cleaner estimate of whether the model is using the signal you care about.
Finally, the result is good but not magic. The benchmark only has 17 models, the skipped-Spearman interval is still wide, and there is room to test richer retrieval task designs. But as a simple, label-free, inference-only method, retrieval_gain is a strong baseline and a sensible place to build from.