This repository is the official implementation of *Decrypting Cryptic Crosswords: Semantically Complex Wordplay Puzzles as a Target for NLP*. If you use this work, please cite the arXiv or NeurIPS 2021 version.
The dataset is also available at https://doi.org/10.5061/dryad.n02v6wwzp. That archive lets you download and replicate the data splits, but it has not been updated to include everything required to run the baselines and experiments notebooks.
```shell
git clone <anonymized>   # if using the code supplement, just unzip it instead
cd decrypt
pip install -r requirements.txt
pushd ./data && unzip "*.json.zip" && popd
```
If you want to download the data yourself from the web (you probably don't want to):

```shell
git clone <anonymized>   # if using the code supplement, just unzip it instead
cd decrypt
mkdir -p './data/puzzles'
python decrypt/scrape_parse/guardian_scrape.py --save_directory="./data/puzzles"
```
Then, when loading the splits, call `load_guardian_splits` as:

```python
load_guardian_splits("./data/puzzles", load_from_files=True, use_premade_json=False)
```
```python
from decrypt.scrape_parse import (
    load_guardian_splits,               # naive random split
    load_guardian_splits_disjoint,      # answer-disjoint split
    load_guardian_splits_disjoint_hash  # word-initial disjoint split
)
from decrypt.scrape_parse.guardian_load import SplitReturn

"""
Each of these methods returns a `SplitReturn` tuple:
- soln-to-clue map (Dict[str, List[BaseClue]]): maps each solution string to the
  list of clues having that solution, which enables seeing all clues associated
  with a given answer word
- list of all clues (List[BaseClue])
- tuple of three lists (the train, val, and test splits), each a List[BaseClue]

Note that load_guardian_splits() will verify that
- the total glob length matches the one in the paper (i.e. the number of puzzles
  downloaded matches)
- the total clue-set length matches the one in the paper (i.e. the filtering is
  the same)
- a single clue in the train set matches the expected one (a single-clue spot
  check for randomness)

If you get an assertion error or an exception during load, please file an issue,
since the splits should be identical. Alternatively, if you don't care, you can
pass `verify=False` to `load_guardian_splits`.
"""
soln_to_clue_map, all_clues_list, (train, val, test) = load_guardian_splits()
```
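To make the difference between the split variants concrete, here is a minimal, self-contained sketch of an answer-disjoint split. This is illustrative only, not the repository's implementation: the function name, the `(clue_text, solution)` pair representation, and the use of an MD5 hash are all my assumptions. The key property it demonstrates is that every clue sharing a solution lands in the same split, so no answer word leaks across train/val/test.

```python
import hashlib
from collections import defaultdict

def answer_disjoint_split(clues, train_frac=0.8, val_frac=0.1):
    """Assign every clue with the same solution to the same split,
    so no answer appears in more than one of train/val/test.
    `clues` is an iterable of (clue_text, solution) pairs.
    (Hypothetical helper for illustration, not part of the repo.)"""
    by_soln = defaultdict(list)
    for clue, soln in clues:
        by_soln[soln].append((clue, soln))

    train, val, test = [], [], []
    for soln, group in by_soln.items():
        # Deterministic bucket in [0, 1) derived from a hash of the solution,
        # so the assignment is stable across runs
        h = int(hashlib.md5(soln.encode()).hexdigest(), 16) / 16**32
        if h < train_frac:
            train.extend(group)
        elif h < train_frac + val_frac:
            val.extend(group)
        else:
            test.extend(group)
    return train, val, test
```

A word-initial disjoint split follows the same pattern but hashes only a prefix of the solution (e.g. its first letters), so that near-identical answers also stay in the same split.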
We make code available to replicate the entire paper. Note that the directory structure is specified in decrypt/config.py; you can change it if you would like. Most references use this file, but the run commands (i.e. `python ...`) assume that the directories are unchanged from the original config.py.
- The splits are replicated as above, using the load methods
- The task is replicated in the following sections
- We provide code to replicate the metric analysis; see the implementation in the Jupyter notebooks below
To run the notebooks, start your Jupyter server from the top-level decrypt directory. The notebooks were run using PyCharm opened from the top-level decrypt directory. If you experience import errors, it is likely because you are not running from the top level.
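If starting from the top level is not convenient, one common workaround (my suggestion, not something documented by this repo) is to put the repository root on `sys.path` at the top of the notebook before any `decrypt` imports:

```python
import sys
from pathlib import Path

# Hypothetical: point this at your decrypt checkout;
# Path.cwd() works if the notebook already runs from the top-level directory
repo_root = Path.cwd()
if str(repo_root) not in sys.path:
    sys.path.insert(0, str(repo_root))
```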
Notebooks to replicate the four baselines are in the baselines directory.
Note that a patch will need to be applied to work with the deits solver.
- See `experiments/curricular.ipynb`
- See `experiments/model_analysis`
Note that details of training and evaluating the models are available in the relevant Jupyter notebooks.