DeepProfiler

This page gathers practical advice and some tricks to be aware of when using DeepProfiler.

Content

Installation
Development
Examples and Guidance

Installation

First, check out the installation guide on DeepProfiler. Following it should be sufficient, but in case you run into roadblocks, here is a more detailed guide with some troubleshooting help. Please note that simply using the Docker images removes the need for any installation.

Detailed installation guide and tips

Read the DeepProfiler Wiki
Do not use conda, use virtual env and pip instead
Make sure your pip is up to date
When running the old (pre-TensorFlow 2.x) version, Python 3.6 is required; 3.7 and 3.8 will likely give errors
For the old DP version, install these versions: efficientnet == 1.1.0, keras == 2.2.5, tensorflow-gpu == 1.15.2, and h5py == 2.10.0
Manually run pip install tqdm (or, better, add tqdm to requirements.txt)
Change your config.json to { "train": { "sampling": { "cache_size": 10 } } }
Run the demo to see if you have succeeded

When running the demo data from the Wiki, you need to add the cache size to the config: { "train": { "sampling": { "cache_size": 10 } } }. You will also need to set checkpoint: None in the demo data for it to run successfully.
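The cache size change can also be scripted. The following is a minimal sketch that only uses the Python standard library; the helper name set_cache_size is hypothetical, and the checkpoint setting is deliberately left out because its exact spelling in the demo config is not shown here.

```python
import json
import tempfile
from pathlib import Path


def set_cache_size(config_path, cache_size=10):
    """Add train.sampling.cache_size to a config file, keeping all other keys."""
    path = Path(config_path)
    config = json.loads(path.read_text())
    # setdefault creates the nested sections only if they are missing.
    sampling = config.setdefault("train", {}).setdefault("sampling", {})
    sampling["cache_size"] = cache_size
    path.write_text(json.dumps(config, indent=4))
    return config


# Demo on a throwaway file so no real config is touched.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "config.json"
    demo.write_text('{"train": {"model": {"epochs": 3}}}')
    print(set_cache_size(demo)["train"])
# {'model': {'epochs': 3}, 'sampling': {'cache_size': 10}}
```

Writing the nested object this way also avoids the easy mistake of leaving out the braces around "cache_size".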

CUDA and cuDNN
Make sure the correct CUDA and cuDNN versions are installed and the paths are correct. Most machines you will be working on run either CUDA 10 or CUDA 11. Check out the NVIDIA installation guides for details and compatibility with TF.

For the G2 and P2 GPU servers on EC2, the requirements are CUDA 10.0.0 and cuDNN 7.4.x; for the Ampere 100 GPUs used on the CHTC server, I needed to install CUDA 11.

Check out these pages to see what versions you need for what TF version:

Checklist before your first DP run

The size of the images aligns with the location files and the size in the config.json
The file names of the images align with the index.csv
The file names of the locations follow the right convention
You have at least two wells per plate
Can you run a dummy command on TensorFlow, and does TF have access to your GPU?
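For the last item on the checklist, a quick GPU visibility check could look like the sketch below. It is not part of DeepProfiler; the function name gpu_visible is hypothetical, and the code falls back to the older tf.test.is_gpu_available helper on the assumption that a TF 1.15 install does not offer tf.config.list_physical_devices.

```python
def gpu_visible():
    """Return True/False for GPU visibility, or None if TensorFlow is not installed."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    # Newer TF 2.x releases expose list_physical_devices; older ones use the test helper.
    if hasattr(tf.config, "list_physical_devices"):
        return len(tf.config.list_physical_devices("GPU")) > 0
    return bool(tf.test.is_gpu_available())


if __name__ == "__main__":
    print("GPU visible to TensorFlow:", gpu_visible())
```

If this prints False on a GPU machine, the CUDA and cuDNN versions or paths are the first thing to check.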

Development

At the time of writing (October 2021), the DeepProfiler repo is shifting from being built on top of TensorFlow 1.15 to TensorFlow 2.5. The new development can be followed on the tf2 branch. Furthermore, [Juan](https://github.com/jccaicedo) and Nikita have been adding functionality, for example real-time augmentation of images during training, which is currently part of the tf2 branch.

Examples and Guidance

Generally speaking, one should try and seek guidance from Juan, Nikita, Niranj, and myself. These are currently the people who have worked with DP the most. Additionally, there is a DeepProfilerExperiments repository which holds examples of config files, indexes, and scripts for post-processing.

Example commands and files.

Example config file

Beware that some config variables need to be adapted to the machine and GPU RAM size.

{
    "dataset": {
        "metadata": {
            "label_field": "Compound",
            "control_value": "DMSO"
        },
        "images": {
            "channels": [
                "DNA",
                "RNA",
                "ER",
                "AGP",
                "Mito"
              ],
            "file_format": "png",
            "bits": 8,
            "width": 1080,
            "height": 1080
        },
        "locations":{
            "mode": "single_cells",
            "box_size": 128,
            "area_coverage": 0.75,
            "mask_objects": false
        }
    },
    "prepare": {
        "illumination_correction": {
            "down_scale_factor": 4,
            "median_filter_size": 24
        },
        "compression": {
            "implement": false,
            "scaling_factor": 1.0
        }
    },
    "train": {
        "partition": {
            "targets": [
                "Compound"
            ],
            "split_field": "Split",
            "training_values": ["Training"],
            "validation_values": ["Test"]
        },
        "model": {
            "name": "efficientnet",
            "crop_generator": "sampled_crop_generator",
            "metrics": ["accuracy", "top_k"],
            "epochs": 20,
            "initialization":"ImageNet",
            "params": {
                "learning_rate": 0.005,
                "batch_size": 128,
                "conv_blocks": 0,
                "feature_dim": 256,
                "pooling": "avg"
            },
            "lr_schedule": "cosine"
        },
        "sampling": {
            "factor": 0.5,
            "workers": 8,
            "cache_size": 20000
        },
        "validation": {
            "frequency": 1,
            "top_k": 5,
            "batch_size": 32,
            "frame": "val",
            "sample_first_crops": true
        }
    },
    "profile": {
      "feature_layer": "pool5",
      "checkpoint": "checkpoint_0015.hdf5",
      "batch_size": 512
    }
}

The profile section of the config depends on whether you use a 'manually' trained model or a downloaded one; the download happens if no model file is found. In the latter case, the config reads:

    "profile": {
      "use_pretrained_input_size": 224,
      "feature_layer": "avg_pool",
      "checkpoint": "efficientnetb0_notop.h5",
      "batch_size": 512
    }
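Switching between the two profile sections can be sketched in a few lines of Python. The values are copied from the two config snippets above; the helper with_profile and the two constant names are hypothetical and not part of DeepProfiler.

```python
import copy
import json

# Profile block for a model you trained yourself (values from the example config).
TRAINED_PROFILE = {
    "feature_layer": "pool5",
    "checkpoint": "checkpoint_0015.hdf5",
    "batch_size": 512,
}

# Profile block for the downloaded pre-trained weights.
PRETRAINED_PROFILE = {
    "use_pretrained_input_size": 224,
    "feature_layer": "avg_pool",
    "checkpoint": "efficientnetb0_notop.h5",
    "batch_size": 512,
}


def with_profile(config, pretrained):
    """Return a copy of config whose "profile" section matches the model type."""
    new_config = copy.deepcopy(config)
    new_config["profile"] = dict(PRETRAINED_PROFILE if pretrained else TRAINED_PROFILE)
    return new_config


config = {"dataset": {}, "train": {}, "profile": dict(TRAINED_PROFILE)}
print(json.dumps(with_profile(config, pretrained=True)["profile"], indent=2))
```

Working on a copy keeps the training config untouched, so the same base file can serve both cases.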

Execution commands

A typical execution of DP will look like this.

Sampling:

python3 deepprofiler --root=/local_group_storage/broad_data/michael/training/ --config=config_sample.json --metadata=XX_index.csv --sample=XX_sample sample-sc

Training:

python3 deepprofiler --gpu=0 --metadata=XX_index.csv --exp=YY_train --sample=XX_sample --root=/home/user/project_dir/ --config=config_train.json train 2>&1 | tee log/log_train_YY.txt

Profile:

python3 deepprofiler --gpu=0 --root=/home/user/project_dir/ --config=config_profile.json --metadata=XX_index.csv --exp=YY_profile profile

Running your first pre-trained net

Most problems you will encounter stem from a wrong config, so make sure you have checked everything:

The size of the images aligns with the location files and the size in the config
The file names of the images align with the index.csv
The file names of the locations follow the right convention
You have at least two wells per plate
Make sure the correct CUDA and cuDNN are installed and the paths are correct
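The "at least two wells per plate" item from the list above can be checked automatically. This is a minimal sketch, assuming the index uses columns named Metadata_Plate and Metadata_Well; adjust the column names if your index differs. The function name is hypothetical.

```python
import csv
import io
from collections import defaultdict


def plates_with_too_few_wells(index_file, plate_col="Metadata_Plate",
                              well_col="Metadata_Well", minimum=2):
    """Return {plate: number_of_wells} for plates with fewer than `minimum` wells."""
    wells = defaultdict(set)
    for row in csv.DictReader(index_file):
        wells[row[plate_col]].add(row[well_col])
    return {plate: len(w) for plate, w in wells.items() if len(w) < minimum}


# Tiny in-memory index for illustration; a real index has more columns.
example = io.StringIO(
    "Metadata_Plate,Metadata_Well,Metadata_Site\n"
    "P1,A01,1\n"
    "P1,A02,1\n"
    "P2,B01,1\n"
    "P2,B01,2\n"
)
print(plates_with_too_few_wells(example))  # {'P2': 1}
```

For a real run, pass an open file handle, for example open("index.csv", newline="").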

python3 deepprofiler --root=/home/ubuntu/project_dir/ --config=config.json --metadata=index_60.csv --logging=mykeyfile.key profile &> logs/log.txt

Experimental results

Speeds

Here I log the speeds at which the models run on different instances. Check the instances for their details: https://aws.amazon.com/ec2/instance-types/

I have decided to give the speeds in seconds per site; the time per well, plate, and dataset follows from that. Example with 2 seconds per site:

1 well = 9 sites ~ 20 seconds
1 plate = 384 wells ~ 7700 seconds ~ 2.15 hours
LINCS dataset = 136 plates ~ 290 hours ~ 12 days
LINCS dataset with only DMSO & 1 concentration ~ 53 hours
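The same estimate can be reproduced for any per-site speed with a small helper. This is a sketch, assuming 9 sites per well, 384-well plates, and 136 plates; the function name is hypothetical. It does not round 18 seconds per well up to 20, so its numbers come out slightly lower than the rounded ones above.

```python
def runtime_estimate(seconds_per_site, sites_per_well=9, wells_per_plate=384, plates=136):
    """Return (seconds per well, hours per plate, hours for all plates)."""
    seconds_per_well = seconds_per_site * sites_per_well
    hours_per_plate = seconds_per_well * wells_per_plate / 3600
    hours_total = hours_per_plate * plates
    return seconds_per_well, hours_per_plate, hours_total


per_well, per_plate, total = runtime_estimate(2)
print(f"{per_well} s per well, {per_plate:.2f} h per plate, "
      f"{total:.0f} h ({total / 24:.1f} days) for 136 plates")
# 18 s per well, 1.92 h per plate, 261 h (10.9 days) for 136 plates
```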

P2

P2 instances are intended for general-purpose GPU compute applications. They use high-performance NVIDIA K80 GPUs, each with 2,496 parallel processing cores and 12 GiB of GPU memory.

CPU on P2 (4 cores) ~ 200 seconds per site

GPU P2 Tesla K80 ~ 10 seconds per site

P3

P3 is the next generation after P2 and runs an NVIDIA Tesla V100 ~ 2.2 seconds per site

The overall time was about a week, although the actual runtime was probably closer to 5 days.

Profiling on CHTC ~ 0.2 seconds per site, about 10 times faster than the P3. I also increased the batch size here.

Speed of training

P3: Epoch takes around 6 hours

Outputs

The outputs can be found under s3://imaging-platform/projects/2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad/workspace/deep_learning/outputs/efficientnet_B0/

and s3://imaging-platform/projects/2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad/workspace/deep_learning/outputs/ResNet50V2/

Post Processing

The DeepProfiler_processing module from pycytominer can be used to aggregate the output of DP: https://github.com/cytomining/pycytominer/blob/master/pycytominer/cyto_utils/DeepProfiler_processing.py

With this, you can aggregate to the site, well, or plate level (well is standard). The result is level 3 data and can be further processed. Example run files for this operation can be found in the pre-trained/post-processing part of this repo.

python aggregate.py
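To illustrate what well-level aggregation means, here is a minimal standard-library sketch. It is not the pycytominer API; the function name, the input layout, and the choice of the median as the aggregation operation are assumptions made for the example.

```python
from collections import defaultdict
from statistics import median


def aggregate_by_well(cells):
    """Collapse single-cell feature vectors to one median vector per (plate, well).

    `cells` is an iterable of (plate, well, features) tuples, where `features`
    is a list of floats with the same length for every cell.
    """
    grouped = defaultdict(list)
    for plate, well, features in cells:
        grouped[(plate, well)].append(features)
    # zip(*vectors) regroups the values feature by feature.
    return {key: [median(column) for column in zip(*vectors)]
            for key, vectors in grouped.items()}


cells = [
    ("P1", "A01", [1.0, 10.0]),
    ("P1", "A01", [3.0, 30.0]),
    ("P1", "A01", [2.0, 50.0]),
    ("P1", "A02", [4.0, 40.0]),
]
print(aggregate_by_well(cells))
# {('P1', 'A01'): [2.0, 30.0], ('P1', 'A02'): [4.0, 40.0]}
```

Aggregating to the site or plate level works the same way with a different grouping key.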

Single cell samples

TODO Explain why this is needed.

s3://imaging-platform/projects/2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad/workspace/deep_learning/single-cell-samples/

Speed on P3: about a million samples a day, so it is best to run 8 processes in parallel. Afterwards you need to move the results into one file.

Training

Run sample-sc first, then train:

python3 deepprofiler --root=/home/ubuntu/dp/features1 --config=config.json --metadata=train_sub_index.csv --logging=log.key --exp=efficientnet_train train &> logs/train_eff2.txt

Training speeds

"model": {
    "name": "efficientnet",
    "crop_generator": "sampled_crop_generator",
    "metrics": ["accuracy", "top_k"],
    "epochs": 10,
    "initialization": "ImageNet",
    "params": {
        "learning_rate": 0.005,
        "batch_size": 32,
        "conv_blocks": 0,
        "feature_dim": 256,
        "pooling": "avg"
    }
},
"validation": {
    "frequency": 2,
    "top_k": 5,
    "batch_size": 32,
    "frame": "val",
    "sample_first_crops": true
}

Ran for 7 hours and reached 98,000 of 171,000 steps.
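From these numbers the duration of the full run can be extrapolated. A small sketch, assuming the step rate stays constant over the whole run (the function name is hypothetical):

```python
def hours_for_all_steps(hours_elapsed, steps_done, steps_total):
    """Extrapolate total run time, assuming a constant number of steps per hour."""
    return hours_elapsed * steps_total / steps_done


total = hours_for_all_steps(7, 98_000, 171_000)
print(f"{total:.1f} h in total, {total - 7:.1f} h remaining")
# 12.2 h in total, 5.2 h remaining
```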

Troubleshooting

This will detail the most common error messages and how to fix them. First some general things:

You need to add the cache size to the config! { "train": { "sampling": { "cache_size": 10 } } }
If images are listed in the index file but missing on disk, DP will throw an error and stop the calculation!
Beware of the image sizes and the locations in the index. For my dataset, I had to divide the location values by 2.
In general, you must have several wells per plate in your index. If you use only one well, you get a split error.
Generally, you cannot run two DP commands at the same time, unless you restrict the memory usage of each process.
Some crop generators do not work or only work if you sample the data beforehand. I have been mainly using the 'normal' crop generator: crop_generator
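The location rescaling mentioned in the list above can be sketched as follows. The column names Nuclei_Location_Center_X and Nuclei_Location_Center_Y are an assumption about the location file layout, the function name is hypothetical, and the factor of 2 is specific to the dataset described there.

```python
import csv
import io


def rescale_locations(in_file, out_file, factor=2.0,
                      columns=("Nuclei_Location_Center_X", "Nuclei_Location_Center_Y")):
    """Divide the coordinate columns of one location file by `factor`."""
    reader = csv.DictReader(in_file)
    writer = csv.DictWriter(out_file, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        for column in columns:
            row[column] = float(row[column]) / factor
        writer.writerow(row)


# In-memory example; real location files would be opened from disk instead.
source = io.StringIO("Nuclei_Location_Center_X,Nuclei_Location_Center_Y\n200,340\n")
target = io.StringIO()
rescale_locations(source, target)
print(target.getvalue())
```

The coordinates are written as floats here; round them first if your pipeline expects integers.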

Typical errors:

    dset = deepprofiler.dataset.image_dataset.read_dataset(context.obj["config"], mode='train')
  File "/DeepProfiler/deepprofiler/dataset/image_dataset.py", line 243, in read_dataset
    dset.prepare_training_locations()
  File "/DeepProfiler/deepprofiler/dataset/image_dataset.py", line 74, in prepare_training_locations
    locations = pd.concat(locations)
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/reshape/concat.py", line 284, in concat
    sort=sort,
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/reshape/concat.py", line 331, in __init__
    raise ValueError("No objects to concatenate")
ValueError: No objects to concatenate

This is an unhelpful error that occurs during training or sampling, either because your location files are empty or (more sneakily) because there is a mismatch between the config and the index. For example, the config can be:

"split_field": "Split",
            "training_values": ["Train"],
            "validation_values": ["Test"]

while the Split column in your index has the values 'Testing' and 'Training'.
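This kind of mismatch can be caught before starting a run. The sketch below compares the split values in the config with the values that actually occur in the index; the function name is hypothetical, and only the config keys shown above are used.

```python
import csv
import io


def unmatched_split_values(config, index_file):
    """Return config split values that never appear in the index's split column."""
    partition = config["train"]["partition"]
    expected = set(partition["training_values"]) | set(partition["validation_values"])
    present = {row[partition["split_field"]] for row in csv.DictReader(index_file)}
    return sorted(expected - present)


config = {"train": {"partition": {
    "split_field": "Split",
    "training_values": ["Train"],
    "validation_values": ["Test"],
}}}
index = io.StringIO("Metadata_Plate,Split\nP1,Training\nP1,Testing\n")
print(unmatched_split_values(config, index))  # ['Test', 'Train']
```

An empty list means every value named in the config occurs at least once in the index.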

If your training run abruptly stops as in the log below, the problem is with your single cells and the crop generators.

2021-06-15 13:38:19,497 - WARNING - From /home/ubuntu/dp/DeepProfiler/deepprofiler/imaging/cropping.py:18: calling crop_and_resize_v1 (from tensorflow.python.ops.image_ops_impl) with box_ind is deprecated and will be removed in a future version.
Instructions for updating:
box_ind is deprecated, use box_indices instead
2021-06-15 13:38:19,739 - INFO - NumExpr defaulting to 4 threads.
Validation data loaded : 195 records of 195
Waiting for data (15000, 128, 128, 5) [(15000, 3)]
2021-06-15 13:39:34,215 - WARNING - From /home/ubuntu/dp/DeepProfiler/deepprofiler/imaging/cropping.py:218: The name tf.summary.merge_all is deprecated. Please use tf.compat.v1.summary.merge_all instead.
2021-06-15 13:39:34,215 - WARNING - From /home/ubuntu/dp/DeepProfiler/deepprofiler/imaging/cropping.py:219: The name tf.summary.FileWriter is deprecated. Please use tf.compat.v1.summary.FileWriter instead.
2021-06-15 13:39:35,650 - WARNING - From /home/ubuntu/dp/DeepProfiler/deepprofiler/imaging/cropping.py:223: start_queue_runners (from tensorflow.python.training.queue_runner_impl) is deprecated and will be removed in a future version.
Instructions for updating:
To construct input pipelines, use the `tf.data` module.
2021-06-15 13:39:35,651 - WARNING - `tf.train.start_queue_runners()` was called when no queue runners were defined. You can safely remove the call to this deprecated function.
Killed

To solve this, first run the sample-sc command and then train afterward.

This can also happen when you have too little memory allocated! It can happen for all commands, be it train, profile, etc.

/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/optimizer_v2/optimizer_v2.py:375: UserWarning: The `lr` argument is deprecated, use `l$
  "The `lr` argument is deprecated, use `learning_rate` instead.")
/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/utils/generic_utils.py:497: CustomMaskWarning: Custom mask layers require a config and$
  category=CustomMaskWarning)
Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "deepprofiler/__main__.py", line 197, in <module>
    cli(obj={})
  File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 1137, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 1062, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 1668, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 763, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/click/decorators.py", line 26, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "deepprofiler/__main__.py", line 163, in train
    deepprofiler.learning.training.learn_model(context.obj["config"], dset, epoch, seed)
  File "/DeepProfiler/deepprofiler/learning/training.py", line 45, in learn_model
    model.train(epoch, metrics, verbose=verbose)
  File "/DeepProfiler/deepprofiler/learning/model.py", line 67, in train
    self.load_weights(epoch)
  File "/DeepProfiler/deepprofiler/learning/model.py", line 106, in load_weights
    self.copy_pretrained_weights()
  File "/DeepProfiler/plugins/models/efficientnet.py", line 106, in copy_pretrained_weights
    self.feature_model.layers[i + lshift].set_weights(base_model.layers[i].get_weights())
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/base_layer_v1.py", line 1298, in set_weights
    'shape %s' % (ref_shape, weight.shape))
ValueError: Layer weight shape (3, 3, 5, 32) not compatible with provided weight shape (3, 3, 3, 32)
Setting pre-trained weights: 1.69%^Mfinished training 14255088

Another one related to memory:

Loading validation data: 1 records of 1114^MLoading validation data: 2 records of 1114^MLoading validation data: 3 records of 1114^MLoading validatio$
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1375, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1360, in _run_fn
    target_list, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1453, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.OutOfRangeError: 2 root error(s) found.
  (0) Out of range: box_index has values outside [0, batch_size)
         [[{{node train_inputs/cropping/CropAndResize}}]]
         [[train_inputs/tuple/control_dependency_1/_15]]
  (1) Out of range: box_index has values outside [0, batch_size)
         [[{{node train_inputs/cropping/CropAndResize}}]]
0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "deepprofiler/__main__.py", line 197, in <module>
    cli(obj={})
  File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 1137, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 1062, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 1668, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 763, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/click/decorators.py", line 26, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "deepprofiler/__main__.py", line 163, in train
    deepprofiler.learning.training.learn_model(context.obj["config"], dset, epoch, seed)
  File "/DeepProfiler/deepprofiler/learning/training.py", line 45, in learn_model
    model.train(epoch, metrics, verbose=verbose)
  File "/DeepProfiler/deepprofiler/learning/model.py", line 58, in train
    x_validation, y_validation = load_validation_data(self, main_session)
  File "/DeepProfiler/deepprofiler/learning/model.py", line 143, in load_validation_data
    session)
  File "/DeepProfiler/deepprofiler/learning/validation.py", line 35, in load_validation_data
    dset.scan(validation.process_batches, frame="val")
  File "/DeepProfiler/deepprofiler/dataset/image_dataset.py", line 182, in scan
    f(index, image, meta)
  File "/DeepProfiler/deepprofiler/learning/validation.py", line 21, in process_batches
    self.config["train"]["validation"]["sample_first_crops"]
  File "/DeepProfiler/deepprofiler/imaging/cropping.py", line 317, in prepare_image
    output = session.run(self.input_variables["labeled_crops"], feed_dict)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 968, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1191, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1369, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1394, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.OutOfRangeError: 2 root error(s) found.
  (0) Out of range: box_index has values outside [0, batch_size)
         [[node train_inputs/cropping/CropAndResize (defined at /DeepProfiler/deepprofiler/imaging/cropping.py:19) ]]
         [[train_inputs/tuple/control_dependency_1/_15]]
  (1) Out of range: box_index has values outside [0, batch_size)
         [[node train_inputs/cropping/CropAndResize (defined at /DeepProfiler/deepprofiler/imaging/cropping.py:19) ]]
0 successful operations.
0 derived errors ignored.

Errors may have originated from an input operation.
Input Source operations connected to node train_inputs/cropping/CropAndResize:
 train_inputs/cell_boxes (defined at /DeepProfiler/deepprofiler/imaging/cropping.py:86)
 train_inputs/box_indicators (defined at /DeepProfiler/deepprofiler/imaging/cropping.py:87)
 train_inputs/raw_images (defined at /DeepProfiler/deepprofiler/imaging/cropping.py:85)
 train_inputs/cropping/crop_size (defined at /DeepProfiler/deepprofiler/imaging/cropping.py:18)

Input Source operations connected to node train_inputs/cropping/CropAndResize:
 train_inputs/cell_boxes (defined at /DeepProfiler/deepprofiler/imaging/cropping.py:86)
 train_inputs/box_indicators (defined at /DeepProfiler/deepprofiler/imaging/cropping.py:87)
 train_inputs/raw_images (defined at /DeepProfiler/deepprofiler/imaging/cropping.py:85)
 train_inputs/cropping/crop_size (defined at /DeepProfiler/deepprofiler/imaging/cropping.py:18)

/DeepProfiler/deepprofiler/imaging/cropping.py:49: RuntimeWarning: invalid value encountered in true_divide
  output[:, :, i] = (output[:, :, i] - mean) / std

Trainings

7/27: First successful run. Two prior runs failed with a division by zero. Trained on top20_moa with 3 epochs and a batch size of 64.
