Unable to build the container (GPU) #1
Have you seen that I provide pre-built images? Try to use those if possible. If that's not possible I'll look into this.
Thank you for your reply. I want to try the pre-built images. The path where the demo files are located is: First of all, I know I should write: But after that, what should I write to predict cell divisions from the demo images? I'm not familiar with Docker. The script in the paper was:
Please take a close look at the readme! It covers this topic in detail. As it notes, data is not bundled with the pre-built images; you need to obtain the data separately, which it sounds like you have done. Other than the first line, which you can ignore, this is just a matter of plumbing the data so that it is available within the container's file system. Very concretely, if your data lives at /home/ubuntu3/Software_Guide_Full/Images/DataSetB:

export DATA_VOL=/home/ubuntu3/Software_Guide_Full/Images/DataSetB
docker run --runtime=nvidia --name div_det -it --mount type=bind,source=$DATA_VOL,target=/data rueberger/division_detection:latest_gpu python division_detection/scripts/predict.py /data --chunk_size 50 100 100

It should work as written - I made some slight modifications to the script you shared to be consistent with the current readme. Finally, a word of caution: be very careful doing inference with the model as trained. I would not expect it to generalize to other indicators or sample-preparation protocols. If you are interested in classifying cell divisions for another dataset, I would recommend retraining the model. That isn't really supported by this package right now, but it is feasible and something I would be happy to talk about.
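To make the bind-mount plumbing concrete, here is a small illustrative sketch (not part of division_detection) of how host paths map into the container under the mount above; the file name t000.h5 is a made-up example:

```python
from pathlib import Path

def container_path(host_file,
                   host_mount="/home/ubuntu3/Software_Guide_Full/Images/DataSetB",
                   container_mount="/data"):
    """Translate a host path under the bind-mounted directory to the
    path the prediction script sees inside the container."""
    rel = Path(host_file).relative_to(host_mount)
    return str(Path(container_mount) / rel)

print(container_path(
    "/home/ubuntu3/Software_Guide_Full/Images/DataSetB/t000.h5"))
# -> /data/t000.h5
```

Anything outside the mounted directory is simply not visible inside the container, which is why the script is pointed at /data rather than the host path.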
Thank you for your reply. When I run the script you gave me, it looks like the prediction process is running.
Is it okay if I get this message? I want to check whether the prediction worked, but the process of copying the prediction results out does not work. In the paper, the script was as follows:
Can you modify this script appropriately? After the prediction, new folders named bboxes and h5 were created in the folder where the paper's demo files are located. Are these bboxes and h5 folders the prediction results? I really want to be able to use this software.
I'm happy to keep troubleshooting until you get the results you need. If memory serves, yes, those h5 files are the predictions. I'm also concerned about those tracebacks. Can you run the predictions again and take a close look at main and GPU memory usage? My primary concern would be problems caused by running out of memory. I'll try to look into the Fiji h5 issue tomorrow. I don't think there is any need to run the script to copy the results out of the container, and it sounds like you already know where to find them.
Thank you for your reply. I'm relieved to see that the prediction results have already been output. Regarding the traceback, as you pointed out, I think it is due to a lack of GPU memory. When I monitored GPU usage, it was almost 100%. Since my PC has two GPUs (8GB x 2), I added the flag --allowed_gpus 0 1 to run the processing on multiple GPUs. However, only one of the GPUs was being used. I'll be using a Quadro RTX 8000 (48GB) in production, so I don't think that's a problem, but I'm a little concerned because I don't know whether I'm the reason multi-GPU processing isn't happening. I would also appreciate your advice on importing .h5 files into ImageJ/Fiji.
This sounds reasonable, given that the readme notes that a chunk size of 50, 100, 100 used 8.5GB of GPU memory. You'll need to use a smaller chunk size if you want to do inference on the 8GB GPUs. What about the host memory? Some background on chunk sizes: as the input image is typically enormous (about 1024^3 for this work), we divide the image into chunks that can fit into GPU memory and run inference simultaneously across each entire chunk. Since overlapping activations are reused, inference is most efficient when chunk sizes are as large as possible. In short: use the largest chunk size you can fit into GPU memory. Since your Quadro has about 4x the memory of what I used for this work, you should use much larger chunk sizes than I did. Can you tell me more about what you're planning to do with the model? To reiterate: I would be extremely cautious about putting this model as trained into "production". If you're planning to do anything other than reproducing our results, we should talk.
Fix the GPU memory issue first (by using a smaller chunk size), and if this is still a problem I'll look into it. Can you send me one of the generated .h5s so I can look into the Fiji issue?
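The chunk-size guidance above can be turned into a back-of-the-envelope sketch. This is a heuristic, not a measured model: it scales the one data point from the thread (a 50x100x100 chunk using about 8.5GB of GPU memory) under the assumption that memory grows roughly linearly with voxel count.

```python
REF_CHUNK = (50, 100, 100)   # known chunk size from the readme
REF_GB = 8.5                 # GPU memory it reportedly used

def voxels(shape):
    z, y, x = shape
    return z * y * x

def estimated_gb(chunk):
    # linear-in-voxels estimate anchored at the reference point
    return REF_GB * voxels(chunk) / voxels(REF_CHUNK)

def fits(chunk, gpu_gb, headroom=0.9):
    # leave ~10% headroom for framework overhead
    return estimated_gb(chunk) <= gpu_gb * headroom

print(round(estimated_gb((100, 150, 150)), 1))  # -> 38.2
print(fits((100, 150, 150), 48))                # -> True
```

Under this rough assumption, a 48GB Quadro could take a chunk roughly 4x the voxel count of the 8GB reference; the real ceiling should still be found by trial, since activation reuse and overhead are not exactly linear.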
Thank you for your reply. I figured that the errors and the inability to open the .h5 file were due to insufficient GPU memory. So I used a Quadro RTX 8000 with 48GB of memory to perform the cell-division detection task. While it ran, I monitored the GPU's memory and found that only 8.8GB out of 48GB was being used. However, I still got the same error as before.
Furthermore, after the task completed, the following error occurred when trying to open the .h5 file with Fiji/ImageJ:
I am uploading some of the .h5 files that were created. There are two types of files, but I can only upload one type due to space limitations.
I'm not going to use the model trained on your data to make predictions on our data. I've used deep learning in my research, so I understand that much. Right now, I'm focusing on whether TGMM will work on our own PC. In the future, when I detect cell divisions in my own data, please tell me how to train the model. Of course, I expect we would arrange that as a joint research project.
Update for you: I've reproduced the string formatting traceback and am working on the fix. I've also traced the root cause of the GPU parallelism problem, and uncovered a number of other problems. I sincerely apologize, this package is in much worse shape than I thought it was. I am committed to fixing that, but it may take some time. Please bear with me.
Great, just wanted to make sure I'm being clear about the state of the model.
Thank you for your response. I don't need the cell-detection module right away, so please take your time with the fix. However, many researchers in Japan want to use TGMM. I am a member of ABiS (Advanced Bioimaging Support) in Japan. Many Japanese researchers want to do full-scale tracking analysis using TGMM, but they are having trouble installing the TGMM software. I want to spread this wonderful software, and your cooperation is essential. I would greatly appreciate your help.
Are you in touch with my coauthors? I think Leo Guignard is probably who you want to talk to about installing TGMM; I can't help you with that. Happy to put you in touch with them - send me a note at [email protected]
Thanks for letting me know. I've already been in contact with Leo, and I've talked to him about all the problems in TGMM except for the cell-detection module. Those problems are already solved. The last remaining problem is this cell-detection module. If we can solve it, we can make full use of the TGMM software.
Not done yet, but I've fixed most things, including GPU parallelism. If you want to try it out: You'll probably get an
Thank you for the fix. I tried the method you gave me.
However, I get the following error message:
I'm not sure whether it's my own fault or whether a fix is still in the works. Please comment. Best regards.
Whoops, my mistake. I changed the path semantics in a bid to improve the user interface. Previously you had to pass a directory. The prediction script now accepts input in either hdf5 or klb file formats, and requires the full path to the file. Change the run command to
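To illustrate the new path semantics, here is a hypothetical pre-flight check (not actual division_detection code): the script now wants the full path to a single .h5 or .klb file, not the directory containing it. The function name and the example path /data/embryo.klb are made up.

```python
from pathlib import Path

def check_input_path(path_str):
    """Reject directory-style arguments; require a .h5 or .klb file path."""
    p = Path(path_str)
    if p.suffix.lower() not in (".h5", ".klb"):
        raise ValueError(
            f"expected the full path to a .h5 or .klb file, got: {path_str}")
    return p

print(check_input_path("/data/embryo.klb"))  # -> /data/embryo.klb
```

Passing the bare mount point (e.g. /data) would fail this check, which matches the kind of error reported later in the thread when the old directory-style argument was used.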
Thanks for the fix. But I'm getting an error again. I typed the following script into the terminal:
Then I get the following error:
There doesn't seem to be a klb folder under the data folder. Thank you for your advice.
You're missing a space between the script invocation and the directory path, as the error reports. The path in the container is
I'm sorry for changing the script on my own. However, I get the following message:
and
I still can't open the created .h5 file in Fiji/ImageJ. I use a plugin called HDF5 to open .h5 files, and with that plugin I can open the sample .h5 files attached to the paper. If you know of any other way to open .h5 files in Fiji/ImageJ, please let me know.
I am uploading some of the created .h5 files.
I believe that error indicates that the prediction files already exist, which is puzzling because the prediction directory is ephemeral - it does not persist beyond the lifetime of the container. Did you run the prediction script more than once in that container? Sorry about the obtuse error - improving the error message in this scenario is one of the enhancements I'm working on. Let's make sure that you're running the script in a new container and try this again. You need to stop the container if it's running (docker stop div_det), and remove it once it's stopped (docker rm div_det).
As far as I know, Fiji has no native support for HDF5. We were viewing the predictions in Fiji, but unfortunately I don't recall how. I'm happy to add an option to output a different file format if we can't get h5s working in Fiji, but first let's make sure you can generate predictions without any errors. FYI, the .h5 you've attached is a 2D projection of the full input volume. Actual predictions are output by default to
To copy the predictions from the container filesystem to the host, run
Thank you for fixing the bug. I have been removing the container every time.
Still, I got the following error, the same as before:
As you said, let's focus on getting the predictions generated without errors first. Thank you for your help.
I haven't been able to replicate this.
That shouldn't be a problem. You can't remove a container that hasn't stopped, if
Can you please double-check what arguments you used when you got the error? The arguments in what you shared (
You would have gotten the following error for those arguments:
I'm concerned that you might be using a slightly different image than me. Can you please share the output of
Let's try running the script in a slightly different way. The current
which will drop you into a shell in the container. Then verify that
Note that you will have to delete the h5 directory first.
Thank you for your kind reply. The Docker image information is as follows:
The Docker version information is as follows:
As you suggested, I tried running the script in the slightly different way:
Then I get the following error message:
I got this message and couldn't continue. Please tell me what to do. Thank you.
I believe that was caused by the space between
Thanks for the fix. As you said, I ran the script as follows:
Then I ran the following script:
I got the following message:
Finally, I ran the following script:
I still got the same error message as before.
Of course, I deleted the h5 directory first. I hope this helps in fixing the bug.
Just wanted to let you know I haven't forgotten about this, just been very busy...
It's been a while, Rueberger. Thank you for getting in touch. I know you have already left Janelia. It would be great if you could work on fixing the code when you have time, even a little at a time. Thank you.
Hi, Rueberger. Thank you
Yup, still on my todo list, but no guarantees I'll be able to get to it anytime soon... @SakuSakuLab any updates from your end? Last time I checked I still couldn't reproduce the problem you're having |
Long time no see, Rueberger. |
Dear rueberger,

With great interest, I have read your paper entitled "In Toto Imaging and Reconstruction of Post-Implantation Mouse Development at the Single-Cell Level (2018)". I tried to use the TGMM software, but module 8 (Automated cell division detection) doesn't work.

I installed docker and nvidia-docker as instructed and executed the following command. Then, at Step 10/18 : RUN conda install -y -c ilastik pyklb, the following message was displayed. After this, a message like "Examining conflict for ..." continued for a long time, and the following message was displayed. Finally, I got the following error message:

See attached file A for the full text of the error.
A.txt

I can't understand why this happens. I think the installation of docker and nvidia-docker was successful, because when I executed docker run hello-world, the following message was displayed:

This message shows that your installation appears to be working correctly.

I would like to solve this problem by any means. Would you tell me the solution?