A machine learning project aiming to turn handwritten equations into LaTeX.
To set up the repository:
- Run: pip install -U -r requirements.txt
- If you are on Windows and Perl is not installed, install it from https://www.perl.org/
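
A quick way to see whether Perl is already available is to look for it on PATH. The snippet below is a minimal sketch, not part of the repository; the helper name find_perl is hypothetical, and it only checks that an executable called perl can be found, not which version it is.

```python
import shutil


def find_perl():
    """Return the full path of the perl executable on PATH, or None if absent."""
    return shutil.which("perl")


if __name__ == "__main__":
    perl_path = find_perl()
    if perl_path is None:
        print("perl was not found on PATH; install it from https://www.perl.org/")
    else:
        print("perl found at", perl_path)
```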
To download the dataset:
- Navigate to http://lstm.seas.harvard.edu/latex/data/
- Click the link labelled IM2LATEX-100K-HANDWRITTEN.tgz (processed images, unprocessed formulas, training, validation and test set)
- Download and extract the archive so that all files and folders sit directly in /data
- There should be 5 files and 1 folder directly in /data: images, formulas.lst, test.lst, train.lst, val.lst
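
After extracting, the layout can be checked with a few lines of Python. This is a rough sketch rather than a script shipped with the repository: the names EXPECTED_ENTRIES and missing_entries are hypothetical, the entry list is copied from the list above, and the relative path data is assumed to be the dataset directory as seen from the repository root.

```python
from pathlib import Path

# Entries named in the list above; adjust if your copy of the archive differs.
EXPECTED_ENTRIES = ["images", "formulas.lst", "test.lst", "train.lst", "val.lst"]


def missing_entries(data_dir, expected=EXPECTED_ENTRIES):
    """Return the expected names that are not present directly inside data_dir."""
    root = Path(data_dir)
    return [name for name in expected if not (root / name).exists()]


if __name__ == "__main__":
    # "data" is an assumed location relative to the repository root.
    missing = missing_entries("data")
    if missing:
        print("Missing from data/:", ", ".join(missing))
    else:
        print("All listed entries are present in data/.")
```

An empty result means every listed entry exists; it does not verify file contents.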
To preprocess the data:
- Images have already been preprocessed
- Preprocess the formulas:
  python scripts/preprocessing/preprocess_formulas.py --mode normalize --input-file data/formulas.lst --output-file data/formulas.norm.lst
- Prepare train, validation and test files:
  - python scripts/preprocessing/preprocess_filter.py --filter --image-dir data/images --label-path data/formulas.norm.lst --data-path data/train.lst --output-path data/train_filter.lst
  - python scripts/preprocessing/preprocess_filter.py --filter --image-dir data/images --label-path data/formulas.norm.lst --data-path data/validate.lst --output-path data/validate_filter.lst
  - python scripts/preprocessing/preprocess_filter.py --no-filter --image-dir data/images --label-path data/formulas.norm.lst --data-path data/test.lst --output-path data/test_filter.lst
- Credit to the im2markup repository for the source code in scripts
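
The three filter commands differ only in the split name and the filter flag, so they can be generated in a loop. The following is a sketch under stated assumptions, not part of the repository: build_filter_command, run_all, and SPLITS are hypothetical names, the flags and paths are copied from the commands above, and it is assumed to run from the repository root. It defaults to a dry run that only prints the commands.

```python
import subprocess
import sys

SCRIPT = "scripts/preprocessing/preprocess_filter.py"

# (split name, filter flag) pairs mirroring the three commands listed above.
SPLITS = [("train", "--filter"), ("validate", "--filter"), ("test", "--no-filter")]


def build_filter_command(split, filter_flag, data_dir="data"):
    """Build the argument list for one preprocess_filter.py invocation."""
    return [
        sys.executable, SCRIPT, filter_flag,
        "--image-dir", f"{data_dir}/images",
        "--label-path", f"{data_dir}/formulas.norm.lst",
        "--data-path", f"{data_dir}/{split}.lst",
        "--output-path", f"{data_dir}/{split}_filter.lst",
    ]


def run_all(dry_run=True):
    """Print each command, or execute it and stop on the first failure."""
    for split, flag in SPLITS:
        cmd = build_filter_command(split, flag)
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_all(dry_run=True)
```

Calling run_all(dry_run=False) would execute the commands in order; check=True makes subprocess.run raise if a step exits with a non-zero status.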