Documentation for training an L1T Jet Tagging model for CMS Phase-2 L1 upgrades.
Training the jet tagger involves multiple steps: creating the raw datasets, preprocessing them, training the model, synthesizing it, and making validation plots for different physics seeds. This README describes all the steps in sequential order.
The CI in this repository aims at building a pipeline that enables running all of these steps automatically.
A summary menu of all the steps is listed below:
1. Produce Raw Training Datasets
2. Prepare the data and train the model
3. Make validation plots
4. Synthesize the model (with wrapper and CMSSW)
5. Implement the model in FPGA Firmware
Note that the instructions assume that you have access to the appropriate EOS data spaces.
If you are not interested in reading lengthy documentation (like me), here is an ultra-short version to get started on running the code. More details on each command are provided in the corresponding sections of this README, and further help can be found by looking into each script:
#Activate the environment
conda activate tagger
#Run this to add the scripts in this directory to your python path
export CI_COMMIT_REF_NAME=local
#Prepare the data
python tagger/train/train.py --make-data
#Train the model
python tagger/train/train.py
#Make some basic validation plots
python tagger/train/train.py --plot-basic
#Make other plots for bbbb/bbtautau final state for example:
python tagger/plot/bbbb.py
python tagger/plot/bbtautau.py
#OR vbf tautau
python tagger/plot/vbf_tautau.py
#Synthesize the model (with wrapper and CMSSW)
python tagger/firmware/hls4ml_convert.py
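The quick-start commands above can also be chained from a small driver script. The following is a minimal sketch and not part of the repository: `build_steps` and `run_steps` are hypothetical names, and the sketch only wraps commands already listed above. It assumes the `tagger` conda environment is already active in the calling shell, and by default it prints the commands instead of executing them.

```python
import os
import shlex
import subprocess

# Commands copied from the quick-start list above, in the order they are run.
QUICK_START_STEPS = [
    ("prepare data", "python tagger/train/train.py --make-data"),
    ("train model", "python tagger/train/train.py"),
    ("basic validation plots", "python tagger/train/train.py --plot-basic"),
    ("synthesize model", "python tagger/firmware/hls4ml_convert.py"),
]


def build_steps(skip=()):
    """Return (name, argv) pairs for every step whose name is not in `skip`."""
    return [
        (name, shlex.split(cmd))
        for name, cmd in QUICK_START_STEPS
        if name not in skip
    ]


def run_steps(steps, dry_run=True):
    """Run the steps in order and stop at the first failure.

    With dry_run=True the commands are only printed.
    """
    # Same environment variable as exported in the quick start.
    env = dict(os.environ, CI_COMMIT_REF_NAME="local")
    for name, argv in steps:
        print(f"[{name}] {' '.join(argv)}")
        if not dry_run:
            # check=True raises CalledProcessError if the command fails.
            subprocess.run(argv, check=True, env=env)
```

Calling `run_steps(build_steps())` prints the four commands; `run_steps(build_steps(skip=("train model",)), dry_run=False)` would execute everything except the training step.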
Creating the training datasets involves several steps:
- Taking the RAW samples and pruning/slimming them. This can be done by running the scripts in FastPUPPI, which also uses the submission repo. This has currently been done for all samples, and the output is stored here:
- These samples are then processed by the nTuplizer, which is part of the FastPUPPI repo (in particular, the part that calls jetNTuplizer.cc). Note that to submit jobs as part of this setup, you also need the submission repo.
After creating the training ntuples, in our setup they are then shuffled and concatenated (hadd) into one big file, such as this one:
We use one of the scripts in this repository to prepare the training data. First, you have to set up the conda environment and the appropriate paths for the scripts:
#Create the environment from the yaml file
conda env create -f environment.yml
#Activate the environment
conda activate tagger
#Run this to add the scripts in this directory to your python path
Then, to prepare the data for training:
python tagger/train/train.py --make-data
This prepares the data using the default options (look into the script to see what the options are). If you want to customize the input data path or the data step size for uproot.iterate, you can use the full options:
python tagger/train/train.py --make-data -i <your-rootfile> -s <custom-step-size>
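To illustrate what the step size controls: `uproot.iterate` reads the input in chunks rather than all at once, and an integer step size sets the maximum number of entries in each chunk. The pure-Python sketch below only mimics that chunking logic; `entry_ranges` is a hypothetical helper, not an uproot or repository function.

```python
def entry_ranges(num_entries, step_size):
    """Yield (start, stop) entry ranges covering num_entries in steps of step_size."""
    if step_size <= 0:
        raise ValueError("step_size must be a positive integer")
    for start in range(0, num_entries, step_size):
        # The last chunk is shorter when num_entries is not a multiple of step_size.
        yield start, min(start + step_size, num_entries)


# Ten entries read four at a time give three chunks.
chunks = list(entry_ranges(10, 4))
print(chunks)  # [(0, 4), (4, 8), (8, 10)]
```

A smaller step size lowers the peak memory needed per chunk at the cost of more iterations.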
This automatically creates a new directory, training_data (it will ask before removing the existing one), and writes the data into it. Then, to train the model:
python tagger/train/train.py
The models are defined in tagger/train/models.py; the baseline model is provided as the default.
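The directory handling described above for the data preparation step (create `training_data`, but ask before removing an existing copy) can be sketched as follows. This is an illustration under assumptions, not the code in `train.py`: `prepare_output_dir` and its `confirm` parameter are hypothetical names.

```python
import os
import shutil


def prepare_output_dir(path="training_data", confirm=input):
    """Create an empty directory at `path`, asking before deleting an existing one.

    Returns True if the directory is ready, False if the user declined.
    `confirm` is any callable that takes a prompt and returns the answer.
    """
    if os.path.isdir(path):
        answer = confirm(f"{path} already exists. Remove it? [y/N] ")
        if answer.strip().lower() not in ("y", "yes"):
            return False  # keep the existing data untouched
        shutil.rmtree(path)
    os.makedirs(path)
    return True
```

Passing `confirm` as a parameter keeps the prompt replaceable, for example by `lambda _: "y"` in a non-interactive CI job.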
Various physics validation plots can be made using the tagger/plot modules. The plots are divided by final state, such as bbbb.py. To use a script, you first need to derive the working points before evaluating the background rate and efficiency:
python tagger/plot/bbbb.py --deriveWPs -n <number of samples to use, usually ~1M>
Then, evaluate the efficiency using:
python tagger/plot/bbbb.py --eff -n <number of samples to use, usually ~500k>
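As an illustration of the two steps above: a working point is commonly defined as the tagger-score threshold that keeps a chosen fraction of background events, and the efficiency is then the fraction of signal events passing that threshold. The sketch below shows this idea on toy numbers. It is an assumption about the general approach, not the implementation in `bbbb.py`, and `derive_working_point` and `efficiency` are hypothetical names.

```python
import math


def derive_working_point(background_scores, target_fraction):
    """Return the score of the n-th highest background event, n = floor(fraction * N).

    With no tied scores, exactly n background events have score >= the result.
    """
    if not 0 < target_fraction <= 1:
        raise ValueError("target_fraction must be in (0, 1]")
    ranked = sorted(background_scores, reverse=True)
    n_pass = math.floor(target_fraction * len(ranked))
    if n_pass == 0:
        return math.inf  # too few events to resolve such a small fraction
    return ranked[n_pass - 1]


def efficiency(scores, threshold):
    """Fraction of events with score greater than or equal to the threshold."""
    return sum(s >= threshold for s in scores) / len(scores)


# Toy scores: keep 25% of the background, then measure the signal efficiency.
background = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
signal = [0.9, 0.75, 0.5, 0.95]
threshold = derive_working_point(background, 0.25)
print(threshold, efficiency(signal, threshold))  # 0.7 0.75
```

The smaller the target background fraction, the more background events are needed to resolve the threshold, which is one reason to derive working points on a large sample.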
To synthesize the model into HDL code, we first need to use hls4ml:
python tagger/firmware/hls4ml_convert.py
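For orientation, a typical hls4ml conversion of a Keras model looks roughly like the sketch below. It is not the content of `hls4ml_convert.py`: the function name `convert_keras_model`, the output directory, and the FPGA part are placeholder assumptions, and the repository's script may set precisions, reuse factors, and other options differently.

```python
def convert_keras_model(model, output_dir="hls_project", part="xcvu13p-flga2577-2-e"):
    """Convert a trained Keras model to an hls4ml project and compile its emulation.

    `output_dir` and `part` are placeholder values, not the repository's settings.
    """
    # Imported here so this module can be loaded without hls4ml installed.
    import hls4ml

    # Per-layer configuration (precision, reuse factor) derived from the model.
    config = hls4ml.utils.config_from_keras_model(model, granularity="name")

    hls_model = hls4ml.converters.convert_from_keras_model(
        model,
        hls_config=config,
        output_dir=output_dir,
        part=part,
    )
    # Builds the C++ emulation so that hls_model.predict() can be compared
    # against the original Keras model.
    hls_model.compile()
    return hls_model
```

Calling `build(csim=False)` on the returned object would then run HLS synthesis, which requires the vendor tools to be installed.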
Then, this code is synthesized again with an HLS wrapper and CMSSW:
To deactivate the environment:
conda deactivate
If you update the environment, please edit the environment.yml file and run:
conda env update --file environment.yml --prune
Reference on conda environments: https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html
Talks and materials related to the project can be found here; they are ordered chronologically.