
Replace eval() in LooGLE evaluation parsing#248

Open
fallintoplace wants to merge 1 commit into NVIDIA:main from fallintoplace:fix/loogle-safe-parsing


@fallintoplace

Contributor

Summary

  • replace eval() in the LooGLE cloze metrics with safe JSON/literal parsing
  • reuse the same safe parser for qa_pairs when building the LooGLE Hugging Face dataset
  • add regression tests for valid Python-style literals and non-literal payloads
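A minimal sketch of such a parser, assuming a JSON-first strategy with an `ast.literal_eval` fallback for Python-style literals. The name `safe_parse_literal` and the choice to return a caller-supplied `default` on failure are hypothetical; the PR's actual `parsing.py` may be organized differently.

```python
import ast
import json
from typing import Any


def safe_parse_literal(text: str, default: Any = None) -> Any:
    """Parse a JSON or Python-literal string without executing code.

    Hypothetical helper for illustration; not necessarily the PR's API.
    """
    stripped = text.strip()
    try:
        # JSON first: double-quoted strings, true/false/null.
        return json.loads(stripped)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        pass
    try:
        # Fallback for Python-style literals such as "['a', 'b']".
        # literal_eval only accepts literal syntax, so calls and names are rejected.
        return ast.literal_eval(stripped)
    except (ValueError, SyntaxError, TypeError, MemoryError, RecursionError):
        # Non-literal or malformed payload: return the default instead of executing it.
        return default
```

For example, `safe_parse_literal("['Paris', 'London']")` yields a list, while `safe_parse_literal("__import__('os').system('echo pwned')")` returns `None` without running anything. Trying JSON first keeps the common case strict; the literal fallback covers the single-quoted Python reprs that `eval()` previously accepted.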

Root cause

The LooGLE benchmark parsed both dataset fields and generated answers with eval(). For shortdep_cloze, this meant a crafted prediction string could execute arbitrary Python code during metric calculation.
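A small illustration of the difference, assuming a crafted prediction of the kind described above. Both the payload and the dict-shaped valid prediction are hypothetical, chosen only for the example. `ast.literal_eval` accepts literal syntax only, so the call expression is rejected instead of executed:

```python
import ast

# Hypothetical crafted prediction: passed to eval(), this would run a shell command.
CRAFTED = "__import__('os').system('echo pwned')"
# Hypothetical well-formed prediction (dict format assumed for illustration).
VALID = "{'mask-0': 'Paris'}"


def is_rejected(payload: str) -> bool:
    """Return True if ast.literal_eval refuses the payload without running it."""
    try:
        ast.literal_eval(payload)
    except (ValueError, SyntaxError, TypeError, MemoryError, RecursionError):
        return True
    return False


print(is_rejected(CRAFTED))  # True: the call node is not a literal
print(is_rejected(VALID))    # False: parses to {'mask-0': 'Paris'}
```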

Validation

  • uv run --python 3.11 pytest tests/test_loogle_parsing.py
  • uv run --python 3.11 flake8 evaluation/benchmarks/loogle/calculate_metrics.py evaluation/benchmarks/loogle/create_huggingface_dataset.py evaluation/benchmarks/loogle/parsing.py tests/test_loogle_parsing.py
  • uv run --python 3.11 mypy evaluation/benchmarks/loogle/calculate_metrics.py evaluation/benchmarks/loogle/create_huggingface_dataset.py evaluation/benchmarks/loogle/parsing.py tests/test_loogle_parsing.py --check-untyped-defs
  • uv run --python 3.11 python -c "import evaluation.benchmarks.loogle.create_huggingface_dataset as mod; print('import-ok', hasattr(mod, 'build_task_dataframe'), hasattr(mod, 'main'))"

Signed-off-by: Minh Vu <vuhoangminh97@gmail.com>
@copy-pr-bot

copy-pr-bot Bot commented Jul 2, 2026


This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.
