hhexiy/long-context-eval

# Long-Context Models Should Be Good at Looking Things Up in Their Own Context

This repository contains a clean, inference-only proxy metric for the 17-model no4k long-context benchmark in problem.md. The core idea is simple: instead of asking whether a model assigns low perplexity to long documents in general, ask whether it can use a long document to recover the text that comes immediately after an earlier excerpt from that same document.

That shift matters. Raw language modeling loss mixes many effects together: base model quality, pretraining domain match, tokenizer quirks, and general fluency. Long-context evaluation is narrower. What we really want to know is whether the model can carry information forward across a long prefix and retrieve it when needed. The method in this repo tries to measure exactly that.

## The Short Version

For each proxy document, I sample a few excerpts from early or middle positions. Then I score the true continuation of each excerpt twice:

- once with the full document available in the prefix
- once without the document

The proxy score is the average improvement from having the document:

$$\text{retrieval\_gain} = \mathrm{NLL}(\text{answer} \mid \text{excerpt only}) - \mathrm{NLL}(\text{answer} \mid \text{full document} + \text{excerpt})$$

Higher is better. A strong long-context model should get a real benefit from seeing the whole document. A weak one should gain less, or even get distracted.

This is a single metric. It is not fitted to the downstream labels, and it does not combine multiple hand-tuned metrics after looking at the target scores.

## A Small Worked Example

Here is a made-up example that captures the mechanics.

Suppose the document contains this passage:

The committee met on Tuesday to review the flood response plan. After two hours of discussion, it approved a temporary shelter expansion in the northern district.

Take the excerpt:

After two hours of discussion,

and let the target continuation be:

it approved a temporary shelter expansion

Now score that continuation in two settings.

Without the document, the model only sees the excerpt. Many continuations are plausible:

- the meeting ended
- officials requested more funding
- it approved a temporary shelter expansion

If the model assigns an average answer-token NLL of 3.2, then:

$$\ell_{\text{without}} = 3.2$$

With the full document prepended, the continuation is much less ambiguous because the supporting context is already present. If the average answer-token NLL drops to 1.1, then:

$$\ell_{\text{with}} = 1.1$$

So the task-level retrieval gain is:

$$g = 3.2 - 1.1 = 2.1$$

That is a strong positive result. It says the document made the continuation much easier to predict. If another model only dropped from 3.0 to 2.6, its gain would be 0.4, suggesting that it extracted much less value from the long context.
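The arithmetic above can be written down directly. The per-token NLLs below are hypothetical, chosen only to mirror the example's averages:

```python
def mean_nll(token_nlls):
    """Average negative log-likelihood over the answer tokens."""
    return sum(token_nlls) / len(token_nlls)

def retrieval_gain(nll_without, nll_with):
    """How much easier the answer became once the document was in the prefix."""
    return nll_without - nll_with

# Hypothetical per-token NLLs matching the worked example's averages.
without_doc = [3.2] * 4   # ambiguous: many continuations are plausible
with_doc = [1.1] * 4      # the document pins the continuation down

g = retrieval_gain(mean_nll(without_doc), mean_nll(with_doc))
print(round(g, 1))  # 2.1
```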

## Why This Proxy Is Plausible

Long-context benchmarks often reduce to a model finding and using information that appears far away in the prefix. GovReport is a good proxy corpus for this because it contains long, coherent documents with repeated entities, local structure, and enough redundancy that an earlier excerpt often strongly constrains what comes next.

The retrieval setup forces the model into a controlled version of that problem:

- the query is an exact excerpt from the document
- the target is the exact text that follows that excerpt
- the only difference between the two scoring conditions is whether the full long document is present

That last part is important. It means the score is closer to a causal estimate of "how much the long context helped" than raw perplexity is. Using a gain instead of plain NLL also partially factors out base model strength. A generally strong model with weak long-context use should not win by default; it has to actually benefit from the long prefix.

## The Metric, Precisely

Take a document $d = (x_1, \dots, x_T)$, truncated to `doc_tokens = 8192`. For each query anchor $a$, define:

- query length $q = 24$
- answer length $r = 24$
- query tokens $Q = (x_a, \dots, x_{a+q-1})$
- answer tokens $A = (x_{a+q}, \dots, x_{a+q+r-1})$

Anchors are chosen deterministically from the interval $[0.1U, 0.6U]$, where $U = T - q - r$. In the final run, I use `n_queries = 2` anchors per document.
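One way to realize a deterministic anchor rule is even spacing across the allowed interval. This sketch assumes that placement; the repo's exact rule may differ:

```python
def choose_anchors(doc_len, n_queries=2, q=24, r=24, lo=0.1, hi=0.6):
    """Pick n_queries anchors, evenly spaced in [lo*U, hi*U], where
    U = doc_len - q - r is the last position that still leaves room
    for a full query + answer. No RNG: the output is deterministic."""
    u = doc_len - q - r
    if n_queries == 1:
        return [int(lo * u)]
    step = (hi - lo) / (n_queries - 1)
    return [int((lo + i * step) * u) for i in range(n_queries)]
```

With `doc_len = 8192` and the defaults above, this yields anchors at roughly 10% and 60% of the usable range.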

For model $m$, score the answer under two prompts:

1. With document:

$$\ell_{\text{with}}(m, d, a) = \frac{1}{r}\sum_{t=1}^{r} -\log p_m(A_t \mid d, Q, A_{<t})$$

2. Without document:

$$\ell_{\text{without}}(m, d, a) = \frac{1}{r}\sum_{t=1}^{r} -\log p_m(A_t \mid Q, A_{<t})$$

The task score is:

$$g(m, d, a) = \ell_{\text{without}}(m, d, a) - \ell_{\text{with}}(m, d, a)$$

and the model-level submission score is:

$$M(m) = \frac{1}{|D|\,|A_d|}\sum_{d \in D}\sum_{a \in A_d} g(m, d, a)$$

where $D$ is the 50-document GovReport proxy set and $A_d$ is the set of chosen anchors in document $d$.
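Since every document contributes the same number of anchors, the model-level score reduces to a flat mean over all (document, anchor) pairs. A minimal sketch:

```python
def model_score(gains_by_doc):
    """gains_by_doc: {doc_id: [g(m, d, a) for each anchor a in doc d]}.
    Returns the flat mean over all (document, anchor) pairs, which
    equals M(m) when every document has the same number of anchors."""
    all_gains = [g for gains in gains_by_doc.values() for g in gains]
    return sum(all_gains) / len(all_gains)
```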

Implementation details:

- exact answer-token NLL is computed with KV-cache chunking, so long prefixes fit comfortably on one L40S
- `chunk_size = 1024`
- `dtype = bf16`
- scoring is deterministic
- no downstream labels are used anywhere in metric computation

The implementation lives in `compute_retrieval_submission.py`.
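Whatever the prefix handling looks like, the scoring step itself reduces to a log-softmax over the model's logits at each answer position. A minimal numpy sketch of just that step (the KV-cache chunking over the long prefix is omitted here):

```python
import numpy as np

def answer_nll(logits, answer_ids):
    """Mean NLL of answer_ids under logits of shape (r, vocab_size),
    where logits[t] is the model's prediction for answer token t.
    Uses a numerically stable log-softmax."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    token_nlls = -log_probs[np.arange(len(answer_ids)), answer_ids]
    return float(token_nlls.mean())
```

As a sanity check, uniform logits over a vocabulary of size $V$ give a mean NLL of exactly $\log V$.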

## What Worked and What Did Not

I started with straightforward sliding-window loss proxies such as mean_nll, tail loss, and tail-vs-head degradation. They were reasonable diagnostics, but they were not focused enough on the thing the benchmark seems to care about: not just surviving a long prefix, but using it.

The retrieval framing worked better because it asks a narrower question. If the document really helps, the model's uncertainty about the continuation should drop. If the model cannot maintain useful state over the long prefix, the improvement should be small.

I also compared two versions of the retrieval metric:

- `retrieval_nll`: the negated answer NLL with the document present (higher is better)
- `retrieval_gain`: the NLL improvement from adding the document

`retrieval_gain` was clearly better, which matches the intuition above. The gain isolates context use more cleanly than a raw with-context score does.

## Final Result

The committed final submission is:

The corresponding verifier output is:

On the full 50-document setting, the verifier reports:

- `spearman = 0.7230`
- `pearson = 0.6001`
- `skipped_spearman = 0.5780`
- `spearman_ci95 = [0.3499, 0.8854]`

For development, the same metric on just 4 documents reached a slightly higher Spearman (0.7426), which was encouraging but not enough to trust on its own. Running the full 50 documents was the important check. The score held up.
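A bootstrap interval like the one the verifier reports can be approximated by resampling the 17 (proxy score, benchmark score) pairs with replacement and recomputing Spearman each time. A self-contained sketch of that idea, not the verifier's actual code:

```python
import random

def _ranks(xs):
    """Rank values 1..n; ties receive the average of their ranks."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    rx, ry = _ranks(x), _ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

def bootstrap_ci(x, y, n_resamples=5000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for Spearman over paired model scores."""
    rng = random.Random(seed)
    n, stats = len(x), []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        xs, ys = [x[i] for i in idx], [y[i] for i in idx]
        if len(set(xs)) > 1 and len(set(ys)) > 1:  # skip degenerate resamples
            stats.append(spearman(xs, ys))
    stats.sort()
    return stats[int(alpha / 2 * len(stats))], stats[int((1 - alpha / 2) * len(stats))]
```

With only 17 models, resamples often drop or repeat influential models, which is exactly why the reported interval is as wide as it is.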

## Reproducing The Exact Run

The repository includes a small wrapper that launches the exact four-GPU run used for the final result:

Environment:

- 4x NVIDIA L40S
- local Hugging Face cache already populated at `/workspace/.hf_cache`
- Python environment created with `bash scripts/setup_env.sh`

Setup:

```shell
bash scripts/setup_env.sh
export HF_HOME=/workspace/.hf_cache
```

Reproduce the exact final run:

```shell
bash scripts/run_retrieval_full50.sh
```

That script:

- shards the 17 models across 4 GPUs
- runs `retrieval_gain` on all 50 proxy docs
- merges the shard outputs
- writes the final submission JSON
- runs the verifier with `--bootstrap-resamples 5000`
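The sharding step can be as simple as round-robin assignment of model names to GPU indices. A sketch of that idea (the script's actual logic may differ):

```python
def shard_models(models, n_gpus=4):
    """Round-robin the model list across GPUs; with 17 models and
    4 GPUs, one shard gets 5 models and the rest get 4 each."""
    shards = [[] for _ in range(n_gpus)]
    for i, model in enumerate(models):
        shards[i % n_gpus].append(model)
    return shards
```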

If you want a smaller smoke test first, run one model on a few docs:

```shell
./.venv/bin/python scripts/compute_retrieval_submission.py \
  --models configs/locked_models_working17.json \
  --include-model Qwen/Qwen2.5-0.5B-Instruct \
  --proxy data/proxy/govreport_50_docs.json \
  --metric retrieval_gain \
  --max-docs 2 \
  --doc-tokens 8192 \
  --n-queries 2 \
  --query-tokens 24 \
  --answer-tokens 24 \
  --chunk-size 1024 \
  --dtype bf16 \
  --out submissions/smoke_retrieval_gain.json
```

## A Few Takeaways

The main lesson is that a good proxy should try to preserve the mechanism of the target task, not just its surface format. Long-context ability is not the same thing as low perplexity on long text. A retrieval-style score is a better match because it asks whether the model can actually cash in the information stored in a long prefix.

The second lesson is that subtraction helps. Measuring the gain from context was more informative than measuring absolute performance with context. That is a useful pattern beyond this benchmark: if you can compare "with the information" versus "without the information," you often get a cleaner estimate of whether the model is using the signal you care about.

Finally, the result is good but not magic. The benchmark only has 17 models, the skipped-Spearman interval is still wide, and there is room to test richer retrieval task designs. But as a simple, label-free, inference-only method, retrieval_gain is a strong baseline and a sensible place to build from.
