DeepProfiler
This page gathers practical advice and some tricks to be aware of when using DeepProfiler.
- Installation
- Development
- Examples and Guidance
First, check out the installation guide on DeepProfiler. Following it should be sufficient, but in case you run into roadblocks, here is a more detailed guide with some troubleshooting help. Note that using the Docker images removes the need for any local installation.
- Read the DeepProfiler Wiki
- Do not use conda, use virtual env and pip instead
- Make sure your pip is up to date
- When running on the old (pre 2.X Tensorflow) version, Python 3.6 is required and 3.7 / 3.8 will likely give errors
- For the old DP version, these are the versions you should have installed: efficientnet == 1.1.0, keras == 2.2.5, tensorflow-gpu == 1.15.2, and h5py == 2.10.0
- Manually run pip install tqdm (or, better, add tqdm to the requirements.txt)
- Change your config.json to include { "train": { "sampling": { "cache_size": 10 } } }
- Run the demo to see if you have succeeded
When running the demo data from the Wiki, you need to add the cache size to the config: { "train": { "sampling": { "cache_size": 10 } } }
You will also need to set checkpoint: None for the demo data to run successfully.
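The cache size setting above can also be added programmatically instead of editing the file by hand. The following is a minimal sketch, assuming the config is a plain JSON object with the nesting train > sampling > cache_size; the helper name set_cache_size and the demo_config dictionary are hypothetical and not part of DeepProfiler.

```python
import json


def set_cache_size(config, cache_size):
    """Return a copy of config with train.sampling.cache_size set."""
    # A JSON round trip gives a deep copy, so the input stays untouched.
    updated = json.loads(json.dumps(config))
    # Create the "train" and "sampling" sections if they do not exist yet.
    sampling = updated.setdefault("train", {}).setdefault("sampling", {})
    sampling["cache_size"] = cache_size
    return updated


# Hypothetical stand-in for the contents of a config.json file.
demo_config = {"train": {"sampling": {"factor": 0.5}}}
patched = set_cache_size(demo_config, 10)
print(json.dumps(patched, indent=2))
```

To apply this to a real file, load it with json.load, pass the result through the helper, and write it back with json.dump.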
CUDA and cuDNN
Make sure the correct CUDA and cuDNN versions are installed and the paths are correct. Most machines you will be working on run either CUDA 10 or CUDA 11. Check out the NVIDIA installation guides for details and compatibility with TF.
For the G2 and P2 GPU servers on EC2, the requirements are CUDA 10.0.0 and cuDNN 7.4.x; for the A100 (Ampere) GPUs used on the CHTC server, I needed to install CUDA 11.
Check out these pages to see what versions you need for what TF version:
- https://docs.nvidia.com/deeplearning/cudnn/install-guide/index.html
- https://developer.nvidia.com/rdp/cudnn-archive
- https://stackoverflow.com/questions/50622525/which-tensorflow-and-cuda-version-combinations-are-compatible
- The size of the images aligns with the location files and the size in the
config.json
- The file names of the images align with the
index.csv
- The file names of the locations follow the right convention
- You have at least two wells per plate
- Can you run a dummy command in TensorFlow, and does TF have access to your GPU?
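For the last item in the checklist, a quick way to see whether TensorFlow can reach a GPU is to ask it for the visible devices. This is a hedged sketch rather than an official DeepProfiler check: the function name list_tf_gpus is made up for illustration, and it assumes tf.config.list_physical_devices (TF 2.1 and later) or tf.config.experimental.list_physical_devices (older releases such as 1.15) is available.

```python
def list_tf_gpus():
    """Return the names of GPUs visible to TensorFlow, or None if TF is not installed."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    # Newer releases expose the function directly; older ones keep it under
    # tf.config.experimental (assumption: one of the two exists).
    lister = getattr(tf.config, "list_physical_devices", None)
    if lister is None:
        lister = tf.config.experimental.list_physical_devices
    return [device.name for device in lister("GPU")]


gpus = list_tf_gpus()
if gpus is None:
    print("TensorFlow is not installed in this environment")
elif not gpus:
    print("TensorFlow is installed but sees no GPU; check the CUDA and cuDNN paths")
else:
    print("TensorFlow sees:", gpus)
```

An empty list usually points at a CUDA or cuDNN version mismatch rather than at DeepProfiler itself.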
At the time of writing (October 2021), the DeepProfiler repo is shifting from being built on top of TensorFlow 1.15 to TensorFlow 2.5. The new development can be followed on the tf2 branch. Furthermore, [Juan](https://github.com/jccaicedo) and Nikita have been adding functionality, for example real-time augmentation of images during training, which is currently part of the tf2 branch.
Generally speaking, seek guidance from Juan, Nikita, Niranj, and myself; we are currently the people who have worked with DP the most. Additionally, there is a DeepProfilerExperiments repository that holds examples of config files, indexes, and scripts for post-processing.
Example config file
Beware that some config variables need to be adapted to the machine and GPU RAM size.
{
"dataset": {
"metadata": {
"label_field": "Compound",
"control_value": "DMSO"
},
"images": {
"channels": [
"DNA",
"RNA",
"ER",
"AGP",
"Mito"
],
"file_format": "png",
"bits": 8,
"width": 1080,
"height": 1080
},
"locations":{
"mode": "single_cells",
"box_size": 128,
"area_coverage": 0.75,
"mask_objects": false
}
},
"prepare": {
"illumination_correction": {
"down_scale_factor": 4,
"median_filter_size": 24
},
"compression": {
"implement": false,
"scaling_factor": 1.0
}
},
"train": {
"partition": {
"targets": [
"Compound"
],
"split_field": "Split",
"training_values": ["Training"],
"validation_values": ["Test"]
},
"model": {
"name": "efficientnet",
"crop_generator": "sampled_crop_generator",
"metrics": ["accuracy", "top_k"],
"epochs": 20,
"initialization":"ImageNet",
"params": {
"learning_rate": 0.005,
"batch_size": 128,
"conv_blocks": 0,
"feature_dim": 256,
"pooling": "avg"
},
"lr_schedule": "cosine"
},
"sampling": {
"factor": 0.5,
"workers": 8,
"cache_size": 20000
},
"validation": {
"frequency": 1,
"top_k": 5,
"batch_size": 32,
"frame": "val",
"sample_first_crops": true
}
},
"profile": {
"feature_layer": "pool5",
"checkpoint": "checkpoint_0015.hdf5",
"batch_size": 512
}
}
The profile section of the config depends on whether you use a 'manually' trained model or download a pretrained model, which happens if no model file is found. In the latter case, the config reads:
"profile": {
"use_pretrained_input_size": 224,
"feature_layer": "avg_pool",
"checkpoint": "efficientnetb0_notop.h5",
"batch_size": 512
}
Execution commands
A typical execution of DP looks like this.
Sampling:
python3 deepprofiler --root=/local_group_storage/broad_data/michael/training/ --config=config_sample.json --metadata=XX_index.csv --sample=XX_sample sample-sc
Training:
python3 deepprofiler --gpu=0 --metadata=XX_index.csv --exp=YY_train --sample=XX_sample --root=/home/user/project_dir/ --config=config_train.json train 2>&1 | tee log/log_train_YY.txt
Profile:
python3 deepprofiler --gpu=0 --root=/home/user/project_dir/ --config=config_profile.json --metadata=XX_index.csv --exp=YY_profile profile
Most problems you will encounter stem from a wrong config, so make sure you have checked everything:
- The size of the images aligns with the location files and the size in the config
- The file names of the images align with the index.csv
- The file names of the locations follow the right convention
- You have at least two wells per plate
- Make sure the correct CUDA and cuDNN are installed and the paths are correct
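The "at least two wells per plate" item in this checklist can be verified before launching a run. Below is a small sketch under the assumption that the index has plate and well columns named Metadata_Plate and Metadata_Well; adjust the names if your index.csv differs. The helper plates_with_too_few_wells and the sample index are hypothetical.

```python
import csv
import io
from collections import defaultdict


def plates_with_too_few_wells(index_rows, minimum=2,
                              plate_col="Metadata_Plate", well_col="Metadata_Well"):
    """Return the plates that have fewer than `minimum` distinct wells."""
    wells = defaultdict(set)
    for row in index_rows:
        wells[row[plate_col]].add(row[well_col])
    return sorted(plate for plate, seen in wells.items() if len(seen) < minimum)


# Tiny in-memory index for illustration; a real one would be read from index.csv.
sample_index = """Metadata_Plate,Metadata_Well,Metadata_Site
P1,A01,1
P1,A01,2
P1,A02,1
P2,B05,1
"""
rows = list(csv.DictReader(io.StringIO(sample_index)))
print(plates_with_too_few_wells(rows))
```

Here plate P2 is reported because it only contains well B05.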
python3 deepprofiler --root=/home/ubuntu/project_dir/ --config=config.json --metadata=index_60.csv --logging=mykeyfile.key profile &> logs/log.txt
Here I log the speeds at which the models run on different instances. Check the instances for their details: https://aws.amazon.com/ec2/instance-types/
I have decided to give the speeds in seconds per site, from which the time per well, plate, and dataset follows. Example for 2 seconds per site:
1 well = 9 sites ~ 20 seconds
1 plate = 386 wells ~ 7700 seconds ~ 2.15 hours
LINCS dataset = 136 plates ~ 290 hours ~ 12 days
LINCS dataset with only DMSO & 1 concentration ~ 53 hours
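The same scaling can be reproduced for any per-site speed with a few lines of Python. This is only a sketch of the arithmetic: the function name estimate_runtime is hypothetical, and the defaults (9 sites per well, 386 wells per plate, 136 plates) are taken from the example above. Because the example rounds 18 seconds per well up to 20, the exact values come out somewhat lower than the rounded figures in the text.

```python
def estimate_runtime(seconds_per_site, sites_per_well=9, wells_per_plate=386, plates=136):
    """Scale a per-site speed up to well, plate, and dataset level."""
    well_seconds = seconds_per_site * sites_per_well
    plate_seconds = well_seconds * wells_per_plate
    dataset_seconds = plate_seconds * plates
    return {
        "well_seconds": well_seconds,
        "plate_hours": plate_seconds / 3600,
        "dataset_days": dataset_seconds / 86400,
    }


# Speeds in seconds per site, as measured in the notes on this page.
for label, speed in [("P2 K80", 10), ("P3 V100", 2.2), ("CHTC", 0.2)]:
    estimate = estimate_runtime(speed)
    print(f"{label}: {estimate['plate_hours']:.2f} h per plate, "
          f"{estimate['dataset_days']:.1f} days for the dataset")
```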
P2 instances are intended for general-purpose GPU compute applications. High-performance NVIDIA K80 GPUs, each with 2,496 parallel processing cores and 12GiB of GPU memory
CPU on P2 (4 cores) ~ 200 seconds per Site
GPU P2 Tesla K80 ~ 10 seconds per site
P3 is the next generation after P2. P3 runs an NVIDIA Tesla V100 ~ 2.2 seconds per site
The whole run took about a week, although the actual runtime was probably closer to 5 days.
Profiling on CHTC: 0.2 seconds per site, about 10 times faster than the others. I also increased the batch size here.
P3: Epoch takes around 6 hours
The outputs can be found under
s3://imaging-platform/projects/2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad/workspace/deep_learning/outputs/efficientnet_B0/
and
s3://imaging-platform/projects/2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad/workspace/deep_learning/outputs/ResNet50V2/
The DeepProfiler.processing function from pycytominer can be used to aggregate the output of DP: https://github.com/cytomining/pycytominer/blob/master/pycytominer/cyto_utils/DeepProfiler_processing.py
With this, you can aggregate onto a site, well, or plate level (well is standard). The result is level3 data and can be further processed. Example run files for this operation can be found in the pre-trained/post-processing part of this repo.
python aggregate.py
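To make clear what well-level aggregation means, here is a standard-library sketch that collapses single-cell feature vectors into one profile per well. It is not pycytominer's implementation: the median as aggregation operation, the column names Metadata_Plate and Metadata_Well, the feature names, and the helper aggregate_to_wells are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import median


def aggregate_to_wells(cells, feature_names):
    """Collapse single-cell rows into one median profile per (plate, well)."""
    groups = defaultdict(list)
    for cell in cells:
        groups[(cell["Metadata_Plate"], cell["Metadata_Well"])].append(cell)
    profiles = []
    for (plate, well), members in sorted(groups.items()):
        profile = {"Metadata_Plate": plate, "Metadata_Well": well}
        for name in feature_names:
            # Median of this feature over all cells in the well.
            profile[name] = median(member[name] for member in members)
        profiles.append(profile)
    return profiles


# Made-up single-cell features for two wells.
cells = [
    {"Metadata_Plate": "P1", "Metadata_Well": "A01", "f0": 1.0, "f1": 10.0},
    {"Metadata_Plate": "P1", "Metadata_Well": "A01", "f0": 3.0, "f1": 30.0},
    {"Metadata_Plate": "P1", "Metadata_Well": "A01", "f0": 5.0, "f1": 20.0},
    {"Metadata_Plate": "P1", "Metadata_Well": "A02", "f0": 2.0, "f1": 0.0},
    {"Metadata_Plate": "P1", "Metadata_Well": "A02", "f0": 4.0, "f1": 8.0},
]
well_profiles = aggregate_to_wells(cells, ["f0", "f1"])
for profile in well_profiles:
    print(profile)
```

Grouping by site or by plate works the same way with a different grouping key.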
TODO Explain why this is needed.
s3://imaging-platform/projects/2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad/workspace/deep_learning/single-cell-samples/
Speed on P3: A million samples a day. So best run 8 in parallel. Then you need to move them to one file.
sample-sc
python3 deepprofiler --root=/home/ubuntu/dp/features1 --config=config.json --metadata=train_sub_index.csv --logging=log.key --exp=efficientnet_train train &> logs/train_eff2.txt
"model": {
"name": "efficientnet",
"crop_generator": "sampled_crop_generator",
"metrics": ["accuracy", "top_k"],
"epochs": 10,
"initialization":"ImageNet",
"params": {
"learning_rate": 0.005,
"batch_size": 32,
"conv_blocks": 0,
"feature_dim": 256,
"pooling": "avg"
}
},
"validation": {
"frequency": 2,
"top_k": 5,
"batch_size": 32,
"frame": "val",
"sample_first_crops": true
}
Ran for 7 hours and reached 98 000 of 171 000 steps.
This section details the most common error messages and how to fix them. First, some general things:
- You need to add the cache size to the config: { "train": { "sampling": { "cache_size": 10 } } }
- If images listed in the index file are missing, DP will throw an error and stop the calculation!
- Beware of the image sizes and the locations in the index. For my dataset, I had to divide the location values by 2.
- In general, you must have several wells per plate in your index. If you use only one well, you get a split error.
- Generally, you cannot run two DP commands at the same time, unless you restrict the memory usage for a certain process.
- Some crop generators do not work, or only work if you sample the data beforehand. I have been mainly using the 'normal' crop generator: crop_generator
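When the images were resized relative to the coordinates in the location files, as in the "divide the location values by 2" case above, the coordinates can be rescaled in one pass. The following is a minimal sketch: the column names Nuclei_Location_Center_X and Nuclei_Location_Center_Y and the helper scale_locations are assumptions, so check them against your own location files first.

```python
import csv
import io


def scale_locations(rows, columns, factor):
    """Return copies of the rows with the given coordinate columns multiplied by factor."""
    scaled = []
    for row in rows:
        new_row = dict(row)
        for column in columns:
            new_row[column] = float(row[column]) * factor
        scaled.append(new_row)
    return scaled


# In-memory stand-in for one location file (coordinates in original image pixels).
raw_locations = "Nuclei_Location_Center_X,Nuclei_Location_Center_Y\n200,400\n1000,50\n"
rows = list(csv.DictReader(io.StringIO(raw_locations)))
halved = scale_locations(
    rows, ["Nuclei_Location_Center_X", "Nuclei_Location_Center_Y"], 0.5
)
for row in halved:
    print(row)
```

The scaled rows can be written back with csv.DictWriter, ideally to a new folder so the original location files are kept.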
dset = deepprofiler.dataset.image_dataset.read_dataset(context.obj["config"], mode='train')
File "/DeepProfiler/deepprofiler/dataset/image_dataset.py", line 243, in read_dataset
dset.prepare_training_locations()
File "/DeepProfiler/deepprofiler/dataset/image_dataset.py", line 74, in prepare_training_locations
locations = pd.concat(locations)
File "/usr/local/lib/python3.6/dist-packages/pandas/core/reshape/concat.py", line 284, in concat
sort=sort,
File "/usr/local/lib/python3.6/dist-packages/pandas/core/reshape/concat.py", line 331, in __init__
raise ValueError("No objects to concatenate")
ValueError: No objects to concatenate
This unhelpful error occurs during training or sampling either because your location files are empty or (more sneakily) because there is a mismatch between the config and the index. For example, the config can be:
"split_field": "Split",
"training_values": ["Train"],
"validation_values": ["Test"]
while the Split column in your index has the values 'Testing' and 'Training'.
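This kind of mismatch is easy to detect before starting a run by comparing the split values in the config with the values that actually occur in the index. The sketch below assumes the config layout shown in the example config on this page (train > partition with split_field, training_values, and validation_values); the helper find_split_mismatches and the sample data are hypothetical.

```python
import csv
import io


def find_split_mismatches(config, index_rows):
    """Compare the split values named in the config with those present in the index."""
    partition = config["train"]["partition"]
    field = partition["split_field"]
    present = {row[field] for row in index_rows}
    expected = set(partition["training_values"]) | set(partition["validation_values"])
    return {
        "missing_from_index": sorted(expected - present),
        "unused_in_config": sorted(present - expected),
    }


# The mismatch described above: config says Train/Test, index says Training/Testing.
config = {
    "train": {
        "partition": {
            "split_field": "Split",
            "training_values": ["Train"],
            "validation_values": ["Test"],
        }
    }
}
sample_index = "Metadata_Plate,Metadata_Well,Split\nP1,A01,Training\nP1,A02,Testing\n"
index_rows = list(csv.DictReader(io.StringIO(sample_index)))
report = find_split_mismatches(config, index_rows)
print(report)
```

Any entry under missing_from_index means no rows will be selected for that value, which is one way to end up with nothing to concatenate.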
If your training run abruptly stops here then it is a problem with your single cells and with the crop generators.
2021-06-15 13:38:19,497 - WARNING - From /home/ubuntu/dp/DeepProfiler/deepprofiler/imaging/cropping.py:18: calling crop_and_resize_v1 (from tensorflow.python.ops.image_ops_impl) with box_ind is deprecated and will be removed in a future version.
Instructions for updating:
box_ind is deprecated, use box_indices instead
2021-06-15 13:38:19,739 - INFO - NumExpr defaulting to 4 threads.
Validation data loaded : 195 records of 195
Waiting for data (15000, 128, 128, 5) [(15000, 3)]
2021-06-15 13:39:34,215 - WARNING - From /home/ubuntu/dp/DeepProfiler/deepprofiler/imaging/cropping.py:218: The name tf.summary.merge_all is deprecated. Please use tf.compat.v1.summary.merge_all instead.
2021-06-15 13:39:34,215 - WARNING - From /home/ubuntu/dp/DeepProfiler/deepprofiler/imaging/cropping.py:219: The name tf.summary.FileWriter is deprecated. Please use tf.compat.v1.summary.FileWriter instead.
2021-06-15 13:39:35,650 - WARNING - From /home/ubuntu/dp/DeepProfiler/deepprofiler/imaging/cropping.py:223: start_queue_runners (from tensorflow.python.training.queue_runner_impl) is deprecated and will be removed in a future version.
Instructions for updating:
To construct input pipelines, use the `tf.data` module.
2021-06-15 13:39:35,651 - WARNING - `tf.train.start_queue_runners()` was called when no queue runners were defined. You can safely remove the call to this deprecated function.
Killed
To solve this, first run the sample-sc command and then train afterward.
This can also happen when you have too little memory allocated! It can happen for all commands, be it train, profile, etc.
/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/optimizer_v2/optimizer_v2.py:375: UserWarning: The `lr` argument is deprecated, use `l$
"The `lr` argument is deprecated, use `learning_rate` instead.")
/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/utils/generic_utils.py:497: CustomMaskWarning: Custom mask layers require a config and$
category=CustomMaskWarning)
Traceback (most recent call last):
File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "deepprofiler/__main__.py", line 197, in <module>
cli(obj={})
File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 1137, in __call__
return self.main(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 1062, in main
rv = self.invoke(ctx)
File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 1668, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 763, in invoke
return __callback(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/click/decorators.py", line 26, in new_func
return f(get_current_context(), *args, **kwargs)
File "deepprofiler/__main__.py", line 163, in train
deepprofiler.learning.training.learn_model(context.obj["config"], dset, epoch, seed)
File "/DeepProfiler/deepprofiler/learning/training.py", line 45, in learn_model
model.train(epoch, metrics, verbose=verbose)
File "/DeepProfiler/deepprofiler/learning/model.py", line 67, in train
self.load_weights(epoch)
File "/DeepProfiler/deepprofiler/learning/model.py", line 106, in load_weights
self.copy_pretrained_weights()
File "/DeepProfiler/plugins/models/efficientnet.py", line 106, in copy_pretrained_weights
self.feature_model.layers[i + lshift].set_weights(base_model.layers[i].get_weights())
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/base_layer_v1.py", line 1298, in set_weights
'shape %s' % (ref_shape, weight.shape))
ValueError: Layer weight shape (3, 3, 5, 32) not compatible with provided weight shape (3, 3, 3, 32)
Setting pre-trained weights: 1.69%^Mfinished training 14255088
Another error, related to memory:
Loading validation data: 1 records of 1114^MLoading validation data: 2 records of 1114^MLoading validation data: 3 records of 1114^MLoading validatio$
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1375, in _do_call
return fn(*args)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1360, in _run_fn
target_list, run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1453, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.OutOfRangeError: 2 root error(s) found.
(0) Out of range: box_index has values outside [0, batch_size)
[[{{node train_inputs/cropping/CropAndResize}}]]
[[train_inputs/tuple/control_dependency_1/_15]]
(1) Out of range: box_index has values outside [0, batch_size)
[[{{node train_inputs/cropping/CropAndResize}}]]
0 successful operations.
0 derived errors ignored.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "deepprofiler/__main__.py", line 197, in <module>
cli(obj={})
File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 1137, in __call__
return self.main(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 1062, in main
rv = self.invoke(ctx)
File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 1668, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 763, in invoke
return __callback(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/click/decorators.py", line 26, in new_func
return f(get_current_context(), *args, **kwargs)
File "deepprofiler/__main__.py", line 163, in train
deepprofiler.learning.training.learn_model(context.obj["config"], dset, epoch, seed)
File "/DeepProfiler/deepprofiler/learning/training.py", line 45, in learn_model
model.train(epoch, metrics, verbose=verbose)
File "/DeepProfiler/deepprofiler/learning/model.py", line 58, in train
x_validation, y_validation = load_validation_data(self, main_session)
File "/DeepProfiler/deepprofiler/learning/model.py", line 143, in load_validation_data
session)
File "/DeepProfiler/deepprofiler/learning/validation.py", line 35, in load_validation_data
dset.scan(validation.process_batches, frame="val")
File "/DeepProfiler/deepprofiler/dataset/image_dataset.py", line 182, in scan
f(index, image, meta)
File "/DeepProfiler/deepprofiler/learning/validation.py", line 21, in process_batches
self.config["train"]["validation"]["sample_first_crops"]
File "/DeepProfiler/deepprofiler/imaging/cropping.py", line 317, in prepare_image
output = session.run(self.input_variables["labeled_crops"], feed_dict)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 968, in run
run_metadata_ptr)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1191, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1369, in _do_run
run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1394, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.OutOfRangeError: 2 root error(s) found.
(0) Out of range: box_index has values outside [0, batch_size)
[[node train_inputs/cropping/CropAndResize (defined at /DeepProfiler/deepprofiler/imaging/cropping.py:19) ]]
[[train_inputs/tuple/control_dependency_1/_15]]
(1) Out of range: box_index has values outside [0, batch_size)
[[node train_inputs/cropping/CropAndResize (defined at /DeepProfiler/deepprofiler/imaging/cropping.py:19) ]]
0 successful operations.
0 derived errors ignored.
Errors may have originated from an input operation.
Input Source operations connected to node train_inputs/cropping/CropAndResize:
train_inputs/cell_boxes (defined at /DeepProfiler/deepprofiler/imaging/cropping.py:86)
train_inputs/box_indicators (defined at /DeepProfiler/deepprofiler/imaging/cropping.py:87)
train_inputs/raw_images (defined at /DeepProfiler/deepprofiler/imaging/cropping.py:85)
train_inputs/cropping/crop_size (defined at /DeepProfiler/deepprofiler/imaging/cropping.py:18)
Input Source operations connected to node train_inputs/cropping/CropAndResize:
train_inputs/cell_boxes (defined at /DeepProfiler/deepprofiler/imaging/cropping.py:86)
train_inputs/box_indicators (defined at /DeepProfiler/deepprofiler/imaging/cropping.py:87)
train_inputs/raw_images (defined at /DeepProfiler/deepprofiler/imaging/cropping.py:85)
train_inputs/cropping/crop_size (defined at /DeepProfiler/deepprofiler/imaging/cropping.py:18)
/DeepProfiler/deepprofiler/imaging/cropping.py:49: RuntimeWarning: invalid value encountered in true_divide
output[:, :, i] = (output[:, :, i] - mean) / std
7/27: First successful run. Two prior runs ran into division by zero. Trained on top20_moa with 3 epochs and a batch size of 64.