Introduction

This is a TensorFlow framework for identifying ATLAS electrons with neural networks.

Training at LPS

  1. log in to atlas16, where GPUs are available
    ssh -Y atlas16
    
  2. change to user directory
    cd /opt/tmp/$USER  
    
  3. link data file to user directory
    ln -s /opt/tmp/godin/el_data/2020-05-28/el_data.h5 .
    
  4. clone framework from GitHub
    git clone https://github.com/dominiquegodin/el_classifier.git  
    
  5. enter framework directory
    cd el_classifier
    
  6. activate the virtual environment of the TensorFlow 2.1.0 + Python 3.6.8 Singularity image
    singularity shell --nv --bind /opt /opt/tmp/godin/sing_images/tf-2.1.0-gpu-py3_sing-2.6.sif
    
    use the --nv flag to run on GPUs; omit it to run on CPUs
  7. start training; see the options below and the example invocation after this list
    python classifier.py [OPTIONS]
    
  8. monitor NVIDIA GPU devices (memory and power usage, temperature, fan speed, etc.)
    nvidia-smi
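
For example, a first training run inside the image could look like the following sketch (the option values are illustrative, not recommended settings; the options are documented below):

    python classifier.py --n_train=1e6 --n_epochs=50 --batch_size=5000 --NN_type=CNN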
    

Training on Beluga Cluster

  1. log in to the Beluga cluster
  2. change to user directory
    cd /home/$USER
    
  3. link data file to user directory
    ln -s /project/def-arguinj/dgodin/el_data/2020-05-28/el_data.h5 .
    
  4. clone framework from GitHub
    git clone https://github.com/dominiquegodin/el_classifier.git
    
  5. enter framework directory
    cd el_classifier
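
On Beluga, jobs are then submitted through Slurm (next section). For an interactive test, a session could look like this sketch, assuming the Python environment described at the end of this README has already been created:

    module load python/3.6 scipy-stack
    source ~/ENV/bin/activate
    python classifier.py --n_train=1e5 --n_gpus=1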
    

Using the Slurm job manager (LPS or Beluga)

  1. submit the sbatch.sh script (which runs classifier.sh) to the Slurm batch system
    sbatch sbatch.sh
    
  2. send array jobs with IDs 1 to 10 to the Slurm batch system
    sbatch --array=1-10 sbatch.sh
    
  3. report the status of a job
    squeue
    
    or
    sview
    
  4. cancel job
    scancel $job_id
    
  5. monitor a job's GPU usage at a 2 s interval
    srun --jobid $job_id --pty watch -n 2 nvidia-smi
    
  6. use Slurm interactively and request appropriate resources on Beluga
    salloc --time=00:30:00 --cpus-per-task=4 --gres=gpu:1 --mem=128G --x11 --account=def-arguinj
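
The repository provides its own sbatch.sh; purely as an illustration of the structure such a script takes, a minimal sketch could be (the directives and values below are assumptions, not the actual file):

    #!/bin/bash
    #SBATCH --time=06:00:00
    #SBATCH --account=def-arguinj
    #SBATCH --cpus-per-task=4
    #SBATCH --gres=gpu:1
    #SBATCH --mem=128G
    # give each array job its own output directory to avoid clashes
    python classifier.py --n_gpus=1 --output_dir=outputs/job_${SLURM_ARRAY_TASK_ID}

Note that ${SLURM_ARRAY_TASK_ID} is only set for jobs submitted with --array.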
    

classifier.py Options

--n_train : number of training electrons (default=1e5)

--n_valid : number of validation electrons (default=1e5)

--batch_size : size of training batches (default=5000)

--n_epochs : number of training epochs (default=100)

--n_classes : number of classes (default=2)

--n_tracks : number of tracks (default=10)

--n_folds : number of folds for k-fold cross-validation

--n_gpus : number of gpus for distributed training (default=4)

--weight_type : name of the weighting method: one of 'none' (default), 'match2b', 'match2s', or 'flattening'

--train_cuts : cuts applied to training samples

--valid_cuts : cuts applied to validation samples

--NN_type : type of neural network, either CNN or FCN (default=CNN)

--scaling : applies a quantile transform to scalar variables when ON (the fit is performed on the training sample and applied to the whole sample)

--cross_valid : performs k-fold cross-validation

--plotting : when ON, plots accuracy history, distribution separations, and ROC curves

--output_dir : name of the output directory (useful for running jobs in parallel)

--model_in : hdf5 model file from a previous training checkpoint (requires .h5 extension)

--model_out : name of the hdf5 checkpoint file used for saving and updating the model's best weights

--scaler_in : name of the pickle file (.pkl) containing the scaling transform (quantile) for scalar variables

--results_in : name of the pickle file (.pkl) containing validation results
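
Putting several options together, a typical invocation might look like this sketch (the values are illustrative only):

    python classifier.py --n_train=1e6 --n_valid=1e6 --n_epochs=100 \
                         --weight_type=flattening --NN_type=CNN --output_dir=outputs/test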

Explanations

  1. The model and weights are automatically saved to an hdf5 checkpoint for each epoch where the performance (either accuracy or loss) has improved.
  2. An early-stopping callback stops the training automatically when the validation performance has stopped improving for a pre-determined number of epochs (default=10).
  3. Finished or aborted trainings can be resumed from where they stopped by loading previously trained weights from a same-model hdf5 checkpoint (see the --model_in option).
  4. All plots, weights and models are saved by default in the "outputs" directory.
  5. To use pre-trained weights and generate plots without re-training, --n_epochs=0 must be specified; see the sketch after this list.
  6. To optimize the data transfer rate, the data file should be located on the same server as the GPUs.
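
For instance, resuming a training (point 3) or generating plots from pre-trained weights without re-training (point 5) could look like this sketch, where checkpoint.h5 and results.pkl are hypothetical file names:

    # resume a previous training from its checkpoint
    python classifier.py --model_in=checkpoint.h5 --n_epochs=50

    # generate plots only, without re-training
    python classifier.py --model_in=checkpoint.h5 --results_in=results.pkl --n_epochs=0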

Setting up a suitable Python environment on Compute Canada nodes

See https://docs.computecanada.ca/wiki/Python.

First time

Create the environment:

module load python/3.6
module load scipy-stack
virtualenv --no-download ~/ENV

### install needed packages...
pip install --no-index --upgrade pip
pip install h5py
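# presumably pick either the CPU or the GPU TensorFlow wheel, to match the node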
pip install --no-index tensorflow_cpu
pip install --no-index tensorflow_gpu
pip install -U scikit-learn
pip install tabulate
pip install scikit-image

Subsequent times

Just load the environment:

module load python/3.6
module load scipy-stack
source ~/ENV/bin/activate

Voilà! To get out of the environment, run deactivate.
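
To check that the environment actually sees a GPU (assuming a GPU node and the tensorflow_gpu wheel), a quick sanity test is:

python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"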
