Unable to build the container (GPU) #1

Open
SakuSakuLab opened this issue Jul 26, 2020 · 30 comments

@SakuSakuLab

Dear rueberger,

I read with great interest your paper entitled
“In Toto Imaging and Reconstruction of Post-Implantation Mouse Development at the Single-Cell Level (2018)”.

I tried to use the TGMM software,
but module 8 (automated cell division detection) doesn’t work.

I installed docker and nvidia-docker as instructed
and executed the following command.

cd /home/ubuntu3/Software_Guide_Full/Division-Detection
make gpu_image

Then,
at Step 10/18 : RUN conda install -y -c ilastik pyklb,
the following message was displayed.

Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... failed with initial frozen solve. Retrying with flexible solve.
Solving environment: ...working... failed with repodata from current_repodata.json, will retry with next repodata source.
Collecting package metadata (repodata.json): ...working... done
Solving environment: ...working... failed with initial frozen solve. Retrying with flexible solve.

After this,

a message like Examining conflict for …... continued for a long time,
and the following message was displayed.

Found conflicts! Looking for incompatible packages. 
UnsatisfiableError: The following specifications were found
to be incompatible with the existing python installation in your environment:
[…]
Your python: python=2.7

Finally,

I got the following error message:

The command '/bin/sh -c conda install -y -c ilastik pyklb' returned a non-zero code: 1
Makefile:14: recipe for target 'gpu_image' failed
make: *** [gpu_image] Error 1

See attached file A for full text of error.
A.txt

I can't understand why this happens.
I think the installation of docker and nvidia-docker was successful,
because when I executed docker run hello-world,
the following message was displayed:

This message shows that your installation appears to be working correctly.

I would like to solve this problem by any means.

Would you tell me the solution?

@rueberger
Owner

Have you seen that I provide pre-built images? Try to use those if possible. If that's not possible, I'll look into this.

@SakuSakuLab
Author

Thank you for your reply.

I want to try pre-built images.
However, I don't know which commands to run to process the demo files attached to the paper using these pre-built images.

The path where the demo files are located is:
/home/ubutu3/Software_Guide_Full/Images/DataSetB

First of all, I know I should write:
docker pull rueberger/division_detection

But after this, what would I write to predict cell divisions from the demo file images?

I'm not familiar with docker.
Please tell me the exact commands, line by line.

The script in the paper was:

(in the folder Division-Detection)
make gpu_image

cd ../Images/DataSetB

export DATA_VOL=$(pwd)

cd -

docker run --runtime=nvidia --name div_det -it --mount type=bind,source=$DATA_VOL,destination=$DATA_VOL division_detection:latest_gpu python division_detection/scripts/predict.py $DATA_VOL --chunk_size 50 100 100

@rueberger
Owner

Please take a close look at the readme! It covers this topic in detail. As it notes, data is not bundled with the pre-built images, you need to obtain the data separately, which it sounds like you have done.

Other than the first line which you can ignore (make gpu_image), the script you shared should work.

This is just a matter of plumbing the data so that it is available within the container's file system.

Very concretely, if your data lives at /home/ubuntu3/Software_Guide_Full/Images/DataSetB then this is the incantation you want

export DATA_VOL=/home/ubuntu3/Software_Guide_Full/Images/DataSetB
docker run --runtime=nvidia --name div_det -it --mount type=bind,source=$DATA_VOL,destination=/data rueberger/division_detection:latest_gpu python division_detection/scripts/predict.py /data --chunk_size 50 100 100

It should work as written in the script you shared - I made some slight modifications to be consistent with the current readme.

Finally, a word of caution: be very careful in doing inference with the model as trained. I would not expect it to generalize to other indicators/sample preparation protocols. If you are interested in classifying cell divisions for another dataset, I would recommend retraining the model. That isn't really supported by this package right now, but it is feasible and something I would be happy to talk about.

@SakuSakuLab
Author

Thank you for your reply.
I have no knowledge of docker at all and I apologize for bothering you.

When I run the script you told me about, it looks like the prediction process is running.
However, I'm concerned about the following messages, which appear repeatedly.

fetching chunk
fetching chunk
fetching chunk
fetching chunk
fetching chunk
fetching chunk
fetching chunk
fetching chunk
fetching chunk
Traceback (most recent call last):
  File "/opt/conda/lib/python2.7/logging/__init__.py", line 861, in emit
    msg = self.format(record)
  File "/opt/conda/lib/python2.7/logging/__init__.py", line 734, in format
    return fmt.format(record)
  File "/opt/conda/lib/python2.7/logging/__init__.py", line 465, in format
    record.message = record.getMessage()
  File "/opt/conda/lib/python2.7/logging/__init__.py", line 329, in getMessage
    msg = msg % self.args
TypeError: not all arguments converted during string formatting
Logged from file predict.py, line 578
Traceback (most recent call last):
  File "/opt/conda/lib/python2.7/logging/__init__.py", line 861, in emit
    msg = self.format(record)
  File "/opt/conda/lib/python2.7/logging/__init__.py", line 734, in format
    return fmt.format(record)
  File "/opt/conda/lib/python2.7/logging/__init__.py", line 465, in format
    record.message = record.getMessage()
  File "/opt/conda/lib/python2.7/logging/__init__.py", line 329, in getMessage
    msg = msg % self.args
TypeError: not all arguments converted during string formatting
Logged from file predict.py, line 578
fetching chunk
Yielding chunks
Traceback (most recent call last):
  File "/opt/conda/lib/python2.7/multiprocessing/queues.py", line 268, in _feed
    send(obj)
IOError: [Errno 32] Broken pipe
fetching chunk
fetching chunk
fetching chunk
fetching chunk
fetching chunk
fetching chunk
fetching chunk
fetching chunk
fetching chunk
fetching chunk

Is it okay if I get this message?

I want to see if the prediction process worked, but the process of copying the prediction results does not work.

In the paper, the script was as follows

mkdir ../Outputs/Divisions
docker cp div_det:/results/pretrained/DataSetB/dense ../Outputs/Divisions

Can you modify this script appropriately?
I'm not familiar with docker and I don't know where the prediction results folder is located.

After the prediction, new folders named bboxes and h5 were created in the folder where the paper's demo files are located
(/home/ubutu3/Software_Guide_Full/Images/DataSetB).

Are these folders called bboxes and h5 the prediction results?
They differ from the expected results attached to the paper.
Furthermore, the .h5 files in the h5 folder cannot be opened in ImageJ/Fiji, while the .h5 files in the expected results attached to the paper can.

I want to be able to use this software at all costs.
Please be patient with me a little longer.

@rueberger
Owner

I'm happy to keep troubleshooting until you get the results you need.

If memory serves, yes those h5 files are the predictions.

I'm also concerned about those tracebacks. Can you run the predictions again and take a close look at host (main) and GPU memory usage? My primary concern would be problems caused by running out of memory. I'll try to look into the Fiji h5 issue tomorrow.

I don't think there is any need to run the script to copy the results out of the container, and it sounds like you already know where to find them.

@SakuSakuLab
Author

Thank you for your reply.

I'm relieved to see that the prediction results have already been output.

Regarding traceback, as you pointed out, I think it is due to lack of GPU memory. When we monitored the GPU usage, it was almost 100%.

Since my PC has two GPUs (8GB x 2), I added the option "--allowed_gpus 0 1" to run the processing on multiple GPUs. However, only one of the GPUs was being used. I'll be using a Quadro RTX 8000 (48GB) in production, so I don't think that's a problem, but I'm a little concerned because I don't know if I'm the reason why it's not doing the multi-GPU processing.

I would appreciate your advice on the issue of importing .h5 files into ImageJ/Fiji.

@rueberger
Owner

rueberger commented Jul 30, 2020

> When we monitored the GPU usage, it was almost 100%.

This sounds reasonable, given that the readme notes that a chunksize of 50, 100, 100 used 8.5GB of GPU memory. You'll need to use a smaller chunk size if you want to do inference with the 8GB GPUs.

What about the host memory?

Some background on chunk sizes: as the input image is typically enormous (about 1024^3 for this work), we divide the image into chunks that can fit into GPU memory and run inference simultaneously across the entire chunk. Since overlapping activations are reused, inference is most efficient when chunk sizes are as large as possible. In short: use the largest chunk size you can fit into GPU memory.
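
To make the chunking idea concrete, here is a minimal Python sketch. It is not the package's actual code: it assumes non-overlapping output tiles, with the last tile in each dimension shifted back so that it ends at the volume edge, and `chunk_starts` and `iter_chunk_origins` are hypothetical names.

```python
import itertools


def chunk_starts(dim_size, chunk_size):
    """Start offsets of the tiles covering one dimension (hypothetical helper).

    Assumption: tiles do not overlap, except that the last tile is shifted
    back so that it ends exactly at the edge of the volume.
    """
    if chunk_size >= dim_size:
        return [0]
    starts = list(range(0, dim_size - chunk_size, chunk_size))
    starts.append(dim_size - chunk_size)  # final tile is clamped to the edge
    return starts


def iter_chunk_origins(vol_shape, chunk_shape):
    """Iterate over the (z, y, x) origin of every tile in a 3D volume."""
    per_dim = [chunk_starts(d, c) for d, c in zip(vol_shape, chunk_shape)]
    return itertools.product(*per_dim)
```

With a larger chunk shape the same volume is covered by fewer tiles, which is the reason for using the largest chunk size that fits into GPU memory.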

Since your Quadro has about 4x the memory of what I used for this work, you should use much larger chunk sizes than I did.

Can you tell me more about what you're planning to do with the model? To reiterate - I would be extremely cautious about putting this model as trained into "production". If you're planning to do anything other than reproducing our results, we should talk.

> but I'm a little concerned because I don't know if I'm the reason why it's not doing the multi-GPU processing

Fix the GPU memory issue first (by using a smaller chunk size), and if this is still a problem I'll look into it.

Can you send me one of the generated .h5s so I can look into the Fiji issue?

@SakuSakuLab
Author

Thank you for your reply.

I figured that the problem of getting errors and not being able to open the .h5 file was due to insufficient memory on the GPU. So I used a Quadro RTX 8000 with 48GB of memory to perform the cell-division detection task. While working, I monitored the GPU's memory and found that only 8.8GB of memory was being used out of 48GB. However, I still got the same error as before.

Traceback (most recent call last):
  File "/opt/conda/lib/python2.7/logging/__init__.py", line 861, in emit
    msg = self.format(record)
  File "/opt/conda/lib/python2.7/logging/__init__.py", line 734, in format
    return fmt.format(record)
  File "/opt/conda/lib/python2.7/logging/__init__.py", line 465, in format
    record.message = record.getMessage()
  File "/opt/conda/lib/python2.7/logging/__init__.py", line 329, in getMessage
    msg = msg % self.args
TypeError: not all arguments converted during string formatting
Logged from file predict.py, line 578
Traceback (most recent call last):
  File "/opt/conda/lib/python2.7/logging/__init__.py", line 861, in emit
    msg = self.format(record)
  File "/opt/conda/lib/python2.7/logging/__init__.py", line 734, in format
    return fmt.format(record)
  File "/opt/conda/lib/python2.7/logging/__init__.py", line 465, in format
    record.message = record.getMessage()
  File "/opt/conda/lib/python2.7/logging/__init__.py", line 329, in getMessage
    msg = msg % self.args
TypeError: not all arguments converted during string formatting
Logged from file predict.py, line 578
Traceback (most recent call last):
  File "/opt/conda/lib/python2.7/multiprocessing/queues.py", line 268, in _feed
    send(obj)
IOError: [Errno 32] Broken pipe

Furthermore, after completing the task, the following error occurred when trying to open the .h5 file with Fiji/ImageJ

Error while opening ‘C:\................’:
ncsa.hdf.hdf5lib.exceptions.HDF5LibraryException: Plugin for dynamically loaded library: Can’t open directory or file [“..\..\src\H5PL.c.line 526 in H5PL_find(): can’t open directory]

I am uploading some of the .h5 files that were created. There are two types of files, but I can only upload one type due to space limitations.
Volume_02.projection.zip
I would appreciate it if you could verify it.

> Can you tell me more about what you're planning to do with the model? To reiterate - I would be extremely cautious about putting this model as trained into "production". If you're planning to do anything other than reproducing our results, we should talk.

I'm not going to use the training model trained on your data to guess our data. I've used Deep learning in my research, so I know that much. Right now, I'm focusing on whether TGMM will work on our own PC.

In the future, when I detect cell division with my own data, please tell me how to train. Of course, I think that we will be requesting it in the form of joint research.

@rueberger
Owner

Update for you:

I've reproduced the string formatting traceback and am working on the fix. I've also traced the root cause of the GPU parallelism problem, and uncovered a number of other problems.
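
For context, this kind of TypeError typically arises when a logging call passes an extra argument but the message has no matching % placeholder. The Python sketch below reproduces the mechanism with plain string formatting. The message text is made up for illustration and is not the actual line 578 of predict.py, and `render` is a hypothetical stand-in for what the logging module does internally.

```python
def render(msg, args):
    """Mimic how the logging module builds the final message: msg % args."""
    return msg % args if args else msg


coord = (0, 0, 0)

# Correct: one %s placeholder for the one argument.
ok = render("Writing chunk at %s", (coord,))

# Broken: an argument is supplied but the message has no placeholder.
try:
    render("Writing chunk at", (coord,))
    error = None
except TypeError as exc:
    error = str(exc)

print(ok)     # Writing chunk at (0, 0, 0)
print(error)  # not all arguments converted during string formatting
```

The logging module catches this error inside the handler and prints the traceback instead of raising, which is why the prediction run continued despite the messages.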

I sincerely apologize, this package is in much worse shape than I thought it was. I am committed to fixing that, but it may take some time. Please bear with me.

> I'm not going to use the training model trained on your data to guess our data. I've used Deep learning in my research, so I know that much. Right now, I'm focusing on whether TGMM will work on our own PC.

Great, just wanted to make sure I'm being clear about the state of the model.

@SakuSakuLab
Author

Thank you for your response.

I won't be using the cell-detection module right away, so please take your time with the fix.

However, many researchers in Japan want to use TGMM.

I am a member of ABiS (Advanced Bioimaging Support) in Japan.
ABiS is a framework that provides cutting-edge imaging and image analysis technologies to Japanese researchers, and has a track record of supporting many researchers so far.

Many Japanese researchers have asked to do full-scale tracking analysis using TGMM, but they are having trouble installing the TGMM software.

I want to spread this wonderful software.

Your cooperation is essential. Please cooperate.

@rueberger
Owner

Are you in touch with my coauthors? I think Leo Guignard is probably who you want to talk to about installing TGMM; I can't help you with that. Happy to put you in touch with them, send me a note at [email protected]

@SakuSakuLab
Author

Thanks for letting me know.

I've already been in contact with Leo and I've talked to him about all the problems in TGMM except for the cell-detection module.

I have already solved these problems.

The last problem is about this cell-detection module. If we can solve this problem, we can make full use of the TGMM software.

@rueberger
Owner

rueberger commented Aug 4, 2020

Not done yet, but I've fixed most things, including GPU parallelism. If you want to try it out: docker pull rueberger/division_detection:patched_gpu then run the same incantation as before but with 'latest_gpu' replaced with 'patched_gpu'.

You'll probably get an IOError, which you can ignore.

@SakuSakuLab
Author

SakuSakuLab commented Aug 7, 2020

Thank you for the correction.

I tried the method you gave me.
I typed the following script into the terminal

$ docker pull rueberger/division_detection:patched_gpu

$ export DATA_VOL=/home/ubutu3/Software_Guide_Full/Images/DataSetB

$ docker run --runtime=nvidia --name div_det -it --mount type=bind,source=$DATA_VOL,destination=/data rueberger/division_detection:patched_gpu python division_detection/scripts/predict.py /data --chunk_size 50 100 100 --allowed_gpus 0 1

However, I get the following error message

Traceback (most recent call last):
  File "division_detection/scripts/predict.py", line 61, in <module>
    main()
  File "division_detection/scripts/predict.py", line 57, in main
    predict_from_inbox(args.model_name, args.data_dir, args.chunk_size, args.allowed_gpus)
  File "/research/division_detection/division_detection/predict.py", line 115, in predict_from_inbox
    raise NotImplementedError("process_dir must be named either 'h5' or 'klb'. Passed value: {}".format(process_dir))
NotImplementedError: process_dir must be named either 'h5' or 'klb'. Passed value: /data

I'm not sure if it's my own fault or if a fix is still in the works.

Please comment. Best regards.

@rueberger
Owner

Whoops, my mistake. I changed the path semantics in a bid to improve the user interface. Previously you had to pass a directory (in this case /data) containing a directory named klb, containing .klb files, which were rewritten as hdf5 files to /data/h5, which was required to not already exist. This means that you couldn't run the prediction script twice without deleting /data/h5, which is annoying.

The prediction script now accepts input in either hdf5 or klb file formats, and requires the full path.

Change the run command to the following and things should work:

docker run --runtime=nvidia --name div_det -it --mount type=bind,source=$DATA_VOL,destination=/data rueberger/division_detection:patched_gpu python division_detection/scripts/predict.py /data/klb --chunk_size 50 100 100 --allowed_gpus 0 1
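
The new path rule can be pictured with the following Python sketch. It only illustrates the check implied by the NotImplementedError message quoted earlier in this thread. `input_format` is a hypothetical helper, not a function from the package, and the real check in division_detection/predict.py may differ in detail.

```python
import posixpath


def input_format(process_dir):
    """Return 'h5' or 'klb' based on the last path component.

    Hypothetical helper that mirrors the wording of the error message;
    it is not the package's implementation.
    """
    name = posixpath.basename(process_dir.rstrip("/"))
    if name not in ("h5", "klb"):
        raise NotImplementedError(
            "process_dir must be named either 'h5' or 'klb'. "
            "Passed value: {}".format(process_dir)
        )
    return name
```

Under this reading, /data/klb and /data/h5 are accepted, while the bare mount point /data is rejected.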

@SakuSakuLab
Author

Thanks for the fix.

But I'm getting the error again.

I typed the following script into the terminal


$ docker pull rueberger/division_detection:patched_gpu

$ export DATA_VOL=/home/ubutu3/Software_Guide_Full/Images/DataSetB

$ docker run --runtime=nvidia --name div_det -it --mount type=bind,source=$DATA_VOL,destination=/data rueberger/division_detection:patched_gpu python division_detection/scripts/predict.py/data/klb --chunk_size 50 100 100 --allowed_gpus 0 1

Then I get the following error

python: can't open file 'division_detection/scripts/predict.py/data/klb': [Errno 20] Not a directory

There doesn't seem to be a klb folder under the data folder.

Thank you for your advice.

@rueberger
Owner

You're missing a space between the script invocation and the directory path, as the error reports. The path in the container is /data/klb, not division_detection/scripts/predict.py/data/klb.
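
The effect of the missing space is easy to see by splitting both command lines the way a POSIX shell would. This is only an illustration using Python's standard shlex module, with the docker part of the command left out.

```python
import shlex

broken = "python division_detection/scripts/predict.py/data/klb --chunk_size 50 100 100"
fixed = "python division_detection/scripts/predict.py /data/klb --chunk_size 50 100 100"

broken_argv = shlex.split(broken)
fixed_argv = shlex.split(fixed)

# Without the space, the script path and the data path fuse into one token,
# so Python is asked to open a file that does not exist.
print(broken_argv[1])   # division_detection/scripts/predict.py/data/klb
print(fixed_argv[1:3])  # ['division_detection/scripts/predict.py', '/data/klb']
```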

@SakuSakuLab
Author

I'm sorry for changing the script without permission.
I fixed it as you pointed out and the code worked fine.
The previous code only had one GPU in use, but this time both GPUs are in use.

However, I get the following message.

predicting chunk at (0, 1170, 672)
predicting chunk at (0, 1170, 728)
predicting chunk at (0, 1170, 740)
Caught stop iteration. 
Stopping writer process
Starting predictions for t = 5
Starting writer process
Caught exception in writer: Unable to create link (name already exists) <type 'exceptions.RuntimeError'> 5
Process Process-15:
Traceback (most recent call last):
  File "/opt/conda/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/opt/conda/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "/research/division_detection/division_detection/predict.py", line 573, in writer
    predictions = prediction_file.create_dataset('predictions', shape=vol_shape, dtype='f')
  File "/opt/conda/lib/python2.7/site-packages/h5py/_hl/group.py", line 109, in create_dataset
    self[name] = dset
  File "/opt/conda/lib/python2.7/site-packages/h5py/_hl/group.py", line 277, in __setitem__
    h5o.link(obj.id, self.id, name, lcpl=lcpl, lapl=self._lapl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5o.pyx", line 202, in h5py.h5o.link
RuntimeError: Unable to create link (name already exists)
Starting prediction worker
Output chunk size: [42, 56, 56]
Number of chunks per dimension: [2, 23, 15]
Yielding chunks
predicting chunk at (0, 0, 0)
Traceback (most recent call last):
  File "/opt/conda/lib/python2.7/multiprocessing/queues.py", line 268, in _feed
    send(obj)
IOError: [Errno 32] Broken pipe
predicting chunk at (0, 0, 56)
predicting chunk at (0, 0, 112)
predicting chunk at (0, 0, 168)

and

predicting chunk at (0, 840, 56)
predicting chunk at (0, 1170, 740)
predicting chunk at (0, 840, 112)
Caught stop iteration. 
Stopping writer process
Starting predictions for t = 14
predicting chunk at (0, 840, 168)
Starting writer process
Starting prediction worker
Caught exception in writer: Unable to create link (name already exists) <type 'exceptions.RuntimeError'> 14
Process Process-13:
Traceback (most recent call last):
  File "/opt/conda/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
Output chunk size: [42, 56, 56]
  File "/opt/conda/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "/research/division_detection/division_detection/predict.py", line 573, in writer
    predictions = prediction_file.create_dataset('predictions', shape=vol_shape, dtype='f')
  File "/opt/conda/lib/python2.7/site-packages/h5py/_hl/group.py", line 109, in create_dataset
    self[name] = dset
  File "/opt/conda/lib/python2.7/site-packages/h5py/_hl/group.py", line 277, in __setitem__
    h5o.link(obj.id, self.id, name, lcpl=lcpl, lapl=self._lapl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5o.pyx", line 202, in h5py.h5o.link
RuntimeError: Unable to create link (name already exists)
Number of chunks per dimension: [2, 23, 15]
Yielding chunks
predicting chunk at (0, 0, 0)
predicting chunk at (0, 840, 224)
Traceback (most recent call last):
  File "/opt/conda/lib/python2.7/multiprocessing/queues.py", line 268, in _feed
    send(obj)
IOError: [Errno 32] Broken pipe
predicting chunk at (0, 840, 280)
predicting chunk at (0, 0, 56)
predicting chunk at (0, 840, 336)

And I still can't open the generated .h5 files in Fiji/ImageJ.

I use a plugin called HDF5 to open the .h5 file. With that plugin, I can open the sample (.h5) files attached to the paper. If you know of any other way to open .h5 files in Fiji/ImageJ, please let me know.

@SakuSakuLab
Author

I'm uploading some of the generated .h5 files.
I would appreciate it if you could check them.

Volume_01.projection.h5.zip

@rueberger
Owner

I believe that error indicates that the prediction files already exist, which is puzzling because the prediction directory is ephemeral - it does not persist beyond the lifetime of the container. Did you run the prediction script more than once in that container? Sorry about the obtuse error - improving the error in this scenario is one of the enhancements I'm working on.

Let's make sure that you're running the script in a new container and try this again.

You need to stop the container if it's running, and remove it once it's stopped.

  1. Run docker stop div_det - this will fail if the container is already stopped, but that's okay.
  2. Run docker rm div_det
  3. Now you're ready to run the script again. Export the data vol export DATA_VOL=/home/ubutu3/Software_Guide_Full/Images/DataSetB then run docker run --runtime=nvidia --name div_det -it --mount type=bind,source=$DATA_VOL,destination=/data rueberger/division_detection:patched_gpu python division_detection/scripts/predict.py /data/klb --chunk_size 50 100 100 --allowed_gpus 0 1 again
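
The three steps above can also be scripted, which avoids retyping the long command. The following is a sketch in Python using only the standard subprocess module. `build_run_command` and `rerun` are hypothetical helper names, and the docker flags are the ones from the command above.

```python
import subprocess


def build_run_command(data_vol, tag="patched_gpu",
                      chunk_size=(50, 100, 100), gpus=(0, 1)):
    """Assemble the docker run argv used in this thread (hypothetical helper)."""
    mount = "type=bind,source={},destination=/data".format(data_vol)
    return (
        ["docker", "run", "--runtime=nvidia", "--name", "div_det", "-it",
         "--mount", mount,
         "rueberger/division_detection:{}".format(tag),
         "python", "division_detection/scripts/predict.py", "/data/klb",
         "--chunk_size"] + [str(c) for c in chunk_size]
        + ["--allowed_gpus"] + [str(g) for g in gpus]
    )


def rerun(data_vol):
    """Stop and remove any old div_det container, then start a fresh one."""
    # check=False: these fail harmlessly if the container is already
    # stopped or does not exist.
    subprocess.run(["docker", "stop", "div_det"], check=False)
    subprocess.run(["docker", "rm", "div_det"], check=False)
    # -it needs a terminal, so call this from an interactive shell.
    subprocess.run(build_run_command(data_vol), check=True)
```

Because the command is built as a list of arguments, a stray or missing space cannot merge or split tokens by accident.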

As far as I know FIJI has no native support for HDF5. We were viewing the predictions in FIJI, but unfortunately I don't recall how. I'm happy to add an option to output a different file format if we can't get h5s working in FIJI, but first let's make sure you can generate predictions without any errors.

FYI the .h5 you've attached is a 2D projection of the full input volume. Actual predictions are output by default to /results/pretrained/data/dense/*.h5 (in the container!).

To copy the predictions from the container filesystem to the host, run docker cp div_det:/results/pretrained/data/dense /your/desired/results/path while the container is running.
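
Why /data/h5 shows up on the host while /results does not follows from the bind mount: only paths under the mount destination are backed by the host directory. A small Python sketch of that mapping, where `host_path` is a hypothetical helper and the host directory in the usage note is a placeholder:

```python
from pathlib import PurePosixPath


def host_path(container_path, source, destination="/data"):
    """Map a container path to its host path under a bind mount.

    Returns None when the path lies outside the mount, meaning it exists only
    in the container's own filesystem and must be copied out with docker cp.
    """
    path = PurePosixPath(container_path)
    try:
        relative = path.relative_to(destination)
    except ValueError:
        return None
    return str(PurePosixPath(source) / relative)
```

For example, with a placeholder source of /host/DataSetB, `host_path("/data/h5", "/host/DataSetB")` gives a host location, while `host_path("/results/pretrained/data/dense", "/host/DataSetB")` returns None.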

@SakuSakuLab
Author

Thank you for fixing the bug.

I have been removing containers every time.
However, I had never tried stopping containers, so I did it again this time to stop and remove them. I then ran the script. Specifically, here's what I did

$Run docker stop div_det
$docker rm div_det
$Export $DATA_VOL=/home/ubutu3/Software_Guide_Full/Images/DataSetB 
$docker run --runtime=nvidia --name div_det -it --mount type=bind, source=$DATA_VOL,destination=/data rueberger/division_detection:patched_gpu python division_detection/scripts/predict.py /data/klb -- chunk_size 50 100 100 100 --allowed_gpus 0 1 again. py /data/klb --chunk_size 50 100 100 --allowed_gpus 0 1

As a result, I still got the following error, which is the same as before.

predicting chunk at (0, 840, 0)
predicting chunk at (0, 1170, 672)
predicting chunk at (0, 840, 56)
predicting chunk at (0, 1170, 728)
predicting chunk at (0, 840, 112)
predicting chunk at (0, 1170, 740)
predicting chunk at (0, 840, 168)
Caught stop iteration.
Stopping writer process
Starting predictions for t = 9
Starting writer process
Starting prediction worker
Output chunk size: [42, 56, 56]
Number of chunks per dimension: [2, 23, 15]
Yielding chunks
predicting chunk at (0, 840, 224)
predicting chunk at (0, 0, 0)
Traceback (most recent call last):
  File "/opt/conda/lib/python2.7/multiprocessing/queues.py", line 268, in _feed
    send(obj)
IOError: [Errno 32] Broken pipe
predicting chunk at (0, 840, 280)
Writing chunk at (0, 0, 0)
Caught exception in writer: Argument sequence too long <type 'exceptions.TypeError'> 9
Process Process-9:
Traceback (most recent call last):
  File "/opt/conda/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/opt/conda/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "/research/division_detection/division_detection/predict.py", line 588, in writer
    pred_coord[2]: pred_coord[2] + pred_size[2]] = prediction[0]
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "/opt/conda/lib/python2.7/site-packages/h5py/_hl/dataset.py", line 609, in __setitem__
    selection = sel.select(self.shape, args, dsid=self.id)
  File "/opt/conda/lib/python2.7/site-packages/h5py/_hl/selections.py", line 94, in select
    sel[args]
  File "/opt/conda/lib/python2.7/site-packages/h5py/_hl/selections.py", line 261, in __getitem__
    start, count, step, scalar = _handle_simple(self.shape,args)
  File "/opt/conda/lib/python2.7/site-packages/h5py/_hl/selections.py", line 438, in _handle_simple
    args = _expand_ellipsis(args, len(shape))
  File "/opt/conda/lib/python2.7/site-packages/h5py/_hl/selections.py", line 425, in _expand_ellipsis
    raise TypeError("Argument sequence too long")
TypeError: Argument sequence too long
predicting chunk at (0, 0, 56)
predicting chunk at (0, 840, 336)
predicting chunk at (0, 0, 112)
predicting chunk at (0, 840, 392)
predicting chunk at (0, 0, 168)
predicting chunk at (0, 840, 448)




predicting chunk at (0, 1170, 728)
predicting chunk at (0, 840, 168)
predicting chunk at (0, 1170, 740)
Caught stop iteration.
Stopping writer process
Starting predictions for t = 12
Starting writer process
Starting prediction worker
Caught exception in writer: Unable to create link (name already exists) <type 'exceptions.RuntimeError'> 12
Process Process-15:
Traceback (most recent call last):
  File "/opt/conda/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/opt/conda/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "/research/division_detection/division_detection/predict.py", line 573, in writer
    predictions = prediction_file.create_dataset('predictions', shape=vol_shape, dtype='f')
  File "/opt/conda/lib/python2.7/site-packages/h5py/_hl/group.py", line 109, in create_dataset
    self[name] = dset
  File "/opt/conda/lib/python2.7/site-packages/h5py/_hl/group.py", line 277, in __setitem__
    h5o.link(obj.id, self.id, name, lcpl=lcpl, lapl=self._lapl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
Output chunk size: [42, 56, 56]
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5o.pyx", line 202, in h5py.h5o.link
RuntimeError: Unable to create link (name already exists)
Number of chunks per dimension: [2, 23, 15]
Yielding chunks
predicting chunk at (0, 840, 224)
predicting chunk at (0, 0, 0)
Traceback (most recent call last):
  File "/opt/conda/lib/python2.7/multiprocessing/queues.py", line 268, in _feed
    send(obj)
IOError: [Errno 32] Broken pipe
predicting chunk at (0, 840, 280)
predicting chunk at (0, 0, 56)
predicting chunk at (0, 840, 336)
predicting chunk at (0, 0, 112)
predicting chunk at (0, 840, 392)
predicting chunk at (0, 0, 168)
predicting chunk at (0, 840, 448)



predicting chunk at (0, 1170, 728)
predicting chunk at (0, 504, 224)
predicting chunk at (0, 1170, 740)
Caught stop iteration.
Stopping writer process
Starting predictions for t = 5
Starting writer process
predicting chunk at (0, 504, 280)
Starting prediction worker
Caught exception in writer: Unable to create link (name already exists) <type 'exceptions.RuntimeError'> 5
Process Process-15:
Traceback (most recent call last):
  File "/opt/conda/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/opt/conda/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "/research/division_detection/division_detection/predict.py", line 573, in writer
    predictions = prediction_file.create_dataset('predictions', shape=vol_shape, dtype='f')
  File "/opt/conda/lib/python2.7/site-packages/h5py/_hl/group.py", line 109, in create_dataset
    self[name] = dset
  File "/opt/conda/lib/python2.7/site-packages/h5py/_hl/group.py", line 277, in __setitem__
    h5o.link(obj.id, self.id, name, lcpl=lcpl, lapl=self._lapl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5o.pyx", line 202, in h5py.h5o.link
RuntimeError: Unable to create link (name already exists)
Output chunk size: [42, 56, 56]
Number of chunks per dimension: [2, 23, 15]
Yielding chunks
predicting chunk at (0, 0, 0)
predicting chunk at (0, 504, 336)
Traceback (most recent call last):
  File "/opt/conda/lib/python2.7/multiprocessing/queues.py", line 268, in _feed
    send(obj)
IOError: [Errno 32] Broken pipe
predicting chunk at (0, 0, 56)
predicting chunk at (0, 504, 392)
predicting chunk at (0, 0, 112)
predicting chunk at (0, 504, 448)
predicting chunk at (0, 0, 168)





predicting chunk at (0, 672, 560)
predicting chunk at (0, 1170, 740)
Caught stop iteration. 
Stopping writer process
predicting chunk at (0, 672, 616)
Traceback (most recent call last):
  File "/opt/conda/lib/python2.7/multiprocessing/queues.py", line 268, in _feed
    send(obj)
IOError: [Errno 32] Broken pipe
predicting chunk at (0, 672, 672)
predicting chunk at (0, 672, 728)
predicting chunk at (0, 672, 740)





predicting chunk at (0, 1170, 672)
predicting chunk at (0, 1170, 728)
predicting chunk at (0, 1170, 740)
Caught stop iteration.
Stopping writer process

As you said, let's focus on getting the predictions generated without errors first.

Thank you for your help.

@rueberger
Owner

rueberger commented Aug 13, 2020

I haven't been able to replicate this.

You wrote: "However, I had never tried stopping containers, so I did it again this time to stop and remove them."

That shouldn't be a problem. You can't remove a container that hasn't stopped, so if docker rm div_det worked, the container was already stopped.

Can you please double-check which arguments you used when you got the error? The arguments in what you shared (docker run --runtime=nvidia --name div_det -it --mount type=bind, source=$DATA_VOL,destination=/data rueberger/division_detection:patched_gpu python division_detection/scripts/predict.py /data/klb -- chunk_size 50 100 100 100 --allowed_gpus 0 1 again. py /data/klb --chunk_size 50 100 100 --allowed_gpus 0 1) are badly malformed and could not possibly have produced the output you shared.

You would have gotten the following error for those arguments:

root@4fe5abbdf57a:/research# python division_detection/scripts/predict.py /data/klb -- chunk_size 50 100 100 100 --allowed_gpus 0 1 again. py /data/klb --chunk_size 50 100 100 --allowed_gpus 0 1
Using TensorFlow backend.
usage: predict.py [-h] [--model_name MODEL_NAME]
                  [--chunk_size CHUNK_SIZE CHUNK_SIZE CHUNK_SIZE]
                  [--allowed_gpus ALLOWED_GPUS [ALLOWED_GPUS ...]]
                  data_dir
predict.py: error: unrecognized arguments: chunk_size 50 100 100 100 --allowed_gpus 0 1 again. py /data/klb --chunk_size 50 100 100 --allowed_gpus 0 1
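
To see why argparse rejects that command line, here is a minimal sketch of a parser consistent with the usage message above. It is reconstructed from the usage text only and is not the actual predict.py source; the int argument types and the try_parse helper are assumptions made for illustration.

```python
import argparse


def build_parser():
    # Hypothetical reconstruction from the usage text; not the real predict.py code.
    parser = argparse.ArgumentParser(prog="predict.py")
    parser.add_argument("data_dir")
    parser.add_argument("--model_name")
    parser.add_argument("--chunk_size", nargs=3, type=int)
    parser.add_argument("--allowed_gpus", nargs="+", type=int)
    return parser


def try_parse(argv):
    """Return the parsed namespace, or None if argparse rejects argv."""
    try:
        return build_parser().parse_args(argv)
    except SystemExit:
        # argparse prints the usage message to stderr and exits with status 2.
        return None


# Well-formed: option names are written as a single token, e.g. --chunk_size.
good = try_parse("/data/klb --chunk_size 50 100 100 --allowed_gpus 0 1".split())

# Malformed: a bare "--" tells argparse that everything after it is positional,
# so "chunk_size", the numbers and even "--allowed_gpus" become unrecognized extras.
bad = try_parse("/data/klb -- chunk_size 50 100 100 100 --allowed_gpus 0 1".split())
```

The key point is the space in "-- chunk_size": a bare "--" ends option parsing, so nothing after it can be read as a flag.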

I'm concerned that you might be using a slightly different image than me. Can you please share the output of docker images --digests rueberger/division_detection? Can you also tell me what docker version says?

Let's try running the script in a slightly different way. The current docker run incantation is long and confusing. Instead, try

docker run --runtime=nvidia --name div_det -it --mount type=bind, source=$DATA_VOL,destination=/data rueberger/division_detection:patched_gpu bash

which will drop you into a shell in the container. Then verify that /results is empty (please share the output of ls -lha /results) and run the script with

python /research/division_detection/scripts/predict.py /data/klb --chunk_size 50 100 100 --allowed_gpus 0 1

Note that you will have to delete the h5 directory first.
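
If it is easier to script the cleanup than to do it by hand, the following is a minimal sketch that empties a directory while keeping the directory itself. The helper name clear_directory is hypothetical, and the location of the h5 directory is not stated in this thread, so the path you pass in is an assumption about your own setup.

```python
import os
import shutil


def clear_directory(path):
    """Remove everything inside `path` but keep the directory itself.

    Returns the sorted names of the entries that were removed.
    """
    removed = []
    for name in sorted(os.listdir(path)):
        full = os.path.join(path, name)
        if os.path.isdir(full) and not os.path.islink(full):
            shutil.rmtree(full)  # real sub-directory: remove recursively
        else:
            os.remove(full)  # regular file or symlink
        removed.append(name)
    return removed


# Example (not executed here): clear_directory("/results")
```

This deletes data irreversibly, so check the path before running it.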

@SakuSakuLab
Author

Thank you for your kind reply.

The Docker image information is as follows.

$ docker images --digests rueberger/division_detection
REPOSITORY                     TAG                 DIGEST                                                                    IMAGE ID            CREATED             SIZE
rueberger/division_detection   patched_gpu         sha256:7ea3f372b59a5df10b1c53bae402f71d4108cac479c59032ef89c35408108ec6   065c79bff0e2        9 days ago          7.62GB
rueberger/division_detection   latest              sha256:66402d0f53e6fb7ffd6392b7ebd5c3d4db476ed225247ac903b48d7256e6decd   fb90707a01f6        2 years ago         3.25GB
rueberger/division_detection   latest_gpu          sha256:c1ccfdfe3bd56639c2911d6d9b56c4259941bfaa2c6366d528d773cd28350e0a   69015ef73364        2 years ago         7.6GB

The docker version information is as follows.

$ docker version
Client: Docker Engine - Community
 Version:           19.03.12
 API version:       1.40
 Go version:        go1.13.10
 Git commit:        48a66213fe
 Built:             Mon Jun 22 15:45:36 2020
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          19.03.12
  API version:      1.40 (minimum version 1.12)
  Go version:       go1.13.10
  Git commit:       48a66213fe
  Built:            Mon Jun 22 15:44:07 2020
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.2.13
  GitCommit:        7ad184331fa3e55e52b890ea95e65ba581ae3429
 runc:
  Version:          1.0.0-rc10
  GitCommit:        dc9208a3303feef5b3839f4323d9beb36df0a9dd
 docker-init:
  Version:          0.18.0
  GitCommit:        fec3683

As you suggested, I tried running the script in a slightly different way.

$ docker rm div_det
$ export DATA_VOL=/home/ubutu3/Software_Guide_Full/Images/DataSetB
$ docker run --runtime=nvidia --name div_det -it --mount type=bind, source=$DATA_VOL,destination=/data rueberger/division_detection:patched_gpu bash

Then I got the following error message.

invalid argument "type=bind," for "--mount" flag: invalid field '' must be a key=value pair
See 'docker run --help'.

I could not get past this message, and I don't know what it means.

Please tell me what to do.

Thank you.

@rueberger
Owner

I believe that was caused by the space between bind, and source in --mount type=bind, source=$DATA_VOL,destination=/data. The whole --mount value has to be a single argument with no spaces in it.
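
As an illustration of why the space matters, here is a small sketch that splits both command lines the way a POSIX shell would, using Python's shlex module. The path /tmp/data and the image name image are placeholders, and the final comma split only mimics the idea of comma-separated key=value fields; it is not Docker's actual parser.

```python
import shlex

# Placeholder path and image name; only the --mount value matters here.
broken = "docker run --mount type=bind, source=/tmp/data,destination=/data image bash"
fixed = "docker run --mount type=bind,source=/tmp/data,destination=/data image bash"


def mount_value(command):
    """Return the single token that follows --mount after shell-style splitting."""
    tokens = shlex.split(command)
    return tokens[tokens.index("--mount") + 1]


# With the space, --mount only receives "type=bind," and the trailing comma
# leaves an empty field behind, which matches the "invalid field ''" message.
broken_fields = mount_value(broken).split(",")

# Without the space, all three key=value pairs arrive as one argument.
fixed_fields = mount_value(fixed).split(",")
```

In the broken form, the source=... part becomes a separate argument that --mount never sees.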

@SakuSakuLab
Author

Thanks for the fix.

As you suggested, I started the container as follows:

$ docker rm div_det
$ export DATA_VOL=/home/ubutu3/Software_Guide_Full/Images/DataSetB
$ docker run --runtime=nvidia --name div_det -it --mount type=bind,source=$DATA_VOL,destination=/data rueberger/division_detection:patched_gpu bash

Then I ran the following command:

root@5c37f504dd79:/research# ls -lha /results

I got the following output.

total 8.0K
drwxr-xr-x 1 root root 4.0K Aug  4 13:43 .
drwxr-xr-x 1 root root 4.0K Aug 15 08:04 ..

Finally, I ran the following script:

python /research/division_detection/scripts/predict.py /data/klb --chunk_size 50 100 100 --allowed_gpus 0 1

I still got the same error message as before.

predicting chunk at (0, 896, 0)
predicting chunk at (0, 1170, 740)
predicting chunk at (0, 896, 56)
Caught stop iteration. 
Stopping writer process
Starting predictions for t = 9
Starting writer process
Starting prediction worker
predicting chunk at (0, 896, 112)
Output chunk size: [42, 56, 56]
Traceback (most recent call last):
  File "/opt/conda/lib/python2.7/multiprocessing/queues.py", line 268, in _feed
Number of chunks per dimension: [2, 23, 15]
    send(obj)
Yielding chunks
IOError: [Errno 32] Broken pipe
predicting chunk at (0, 0, 0)
predicting chunk at (0, 896, 168)
Writing chunk at (0, 0, 0)
Caught exception in writer: Argument sequence too long <type 'exceptions.TypeError'> 9
Process Process-9:
Traceback (most recent call last):
  File "/opt/conda/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/opt/conda/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "/research/division_detection/division_detection/predict.py", line 588, in writer
    pred_coord[2]: pred_coord[2] + pred_size[2]] = prediction[0]
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "/opt/conda/lib/python2.7/site-packages/h5py/_hl/dataset.py", line 609, in __setitem__
    selection = sel.select(self.shape, args, dsid=self.id)
  File "/opt/conda/lib/python2.7/site-packages/h5py/_hl/selections.py", line 94, in select
    sel[args]
  File "/opt/conda/lib/python2.7/site-packages/h5py/_hl/selections.py", line 261, in __getitem__
    start, count, step, scalar = _handle_simple(self.shape,args)
  File "/opt/conda/lib/python2.7/site-packages/h5py/_hl/selections.py", line 438, in _handle_simple
    args = _expand_ellipsis(args, len(shape))
  File "/opt/conda/lib/python2.7/site-packages/h5py/_hl/selections.py", line 425, in _expand_ellipsis
    raise TypeError("Argument sequence too long")
TypeError: Argument sequence too long
predicting chunk at (0, 0, 56)
predicting chunk at (0, 896, 224)
predicting chunk at (0, 0, 112)
predicting chunk at (0, 896, 280)
.
.
.
predicting chunk at (0, 1170, 740)
predicting chunk at (0, 224, 740)
Caught stop iteration. 
Stopping writer process
Starting predictions for t = 16
Starting writer process
Starting prediction worker
Caught exception in writer: Unable to create link (name already exists) <type 'exceptions.RuntimeError'> 16
Process Process-10:
Traceback (most recent call last):
  File "/opt/conda/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/opt/conda/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "/research/division_detection/division_detection/predict.py", line 573, in writer
    predictions = prediction_file.create_dataset('predictions', shape=vol_shape, dtype='f')
  File "/opt/conda/lib/python2.7/site-packages/h5py/_hl/group.py", line 109, in create_dataset
    self[name] = dset
  File "/opt/conda/lib/python2.7/site-packages/h5py/_hl/group.py", line 277, in __setitem__
    h5o.link(obj.id, self.id, name, lcpl=lcpl, lapl=self._lapl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5o.pyx", line 202, in h5py.h5o.link
RuntimeError: Unable to create link (name already exists)
Output chunk size: [42, 56, 56]
Number of chunks per dimension: [2, 23, 15]
Yielding chunks
predicting chunk at (0, 280, 0)
predicting chunk at (0, 0, 0)
Traceback (most recent call last):
  File "/opt/conda/lib/python2.7/multiprocessing/queues.py", line 268, in _feed
    send(obj)
IOError: [Errno 32] Broken pipe
predicting chunk at (0, 280, 56)
predicting chunk at (0, 0, 56)
.
.
.
predicting chunk at (0, 1170, 740)
predicting chunk at (0, 616, 672)
Caught stop iteration. 
Stopping writer process
predicting chunk at (0, 616, 728)
Traceback (most recent call last):
  File "/opt/conda/lib/python2.7/multiprocessing/queues.py", line 268, in _feed
    send(obj)
IOError: [Errno 32] Broken pipe
predicting chunk at (0, 616, 740)
predicting chunk at (0, 672, 0)
predicting chunk at (0, 672, 56)
predicting chunk at (0, 672, 112)
.
.
.

predicting chunk at (0, 1170, 728)
predicting chunk at (0, 560, 112)
predicting chunk at (0, 1170, 740)
Caught stop iteration. 
Stopping writer process
Starting predictions for t = 5
Starting writer process
predicting chunk at (0, 560, 168)
Starting prediction worker
Caught exception in writer: Unable to create link (name already exists) <type 'exceptions.RuntimeError'> 5
Process Process-15:
Traceback (most recent call last):
  File "/opt/conda/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/opt/conda/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "/research/division_detection/division_detection/predict.py", line 573, in writer
    predictions = prediction_file.create_dataset('predictions', shape=vol_shape, dtype='f')
  File "/opt/conda/lib/python2.7/site-packages/h5py/_hl/group.py", line 109, in create_dataset
    self[name] = dset
  File "/opt/conda/lib/python2.7/site-packages/h5py/_hl/group.py", line 277, in __setitem__
    h5o.link(obj.id, self.id, name, lcpl=lcpl, lapl=self._lapl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5o.pyx", line 202, in h5py.h5o.link
RuntimeError: Unable to create link (name already exists)
Output chunk size: [42, 56, 56]
Number of chunks per dimension: [2, 23, 15]
Yielding chunks
predicting chunk at (0, 0, 0)
predicting chunk at (0, 560, 224)
Traceback (most recent call last):
  File "/opt/conda/lib/python2.7/multiprocessing/queues.py", line 268, in _feed
    send(obj)
IOError: [Errno 32] Broken pipe
predicting chunk at (0, 0, 56)
predicting chunk at (0, 560, 280)
predicting chunk at (0, 0, 112)
.
.
.
predicting chunk at (0, 1170, 728)
predicting chunk at (0, 1170, 740)
Caught stop iteration. 
Stopping writer process

Of course, I deleted the h5 directory first.

I hope this helps in fixing the bug.

@rueberger
Owner

Just wanted to let you know I haven't forgotten about this, just been very busy...

@SakuSakuLab
Author

It's been a while, Rueberger.

Thank you for contacting me.

I know you have already left Janelia.
I'm sure you have other work to do now.
I am in no hurry as I have other things to do here as well.

It would be nice if you could work on fixing the code when you have time, even if it's just a little bit at a time.

Thank you.

@Urheen

Urheen commented Apr 7, 2021

Hi, Rueberger.
Could you work on fixing the code?

Thank you

Urheen mentioned this issue Apr 8, 2021
@rueberger
Owner

Yup, still on my todo list, but no guarantees I'll be able to get to it anytime soon...

@SakuSakuLab any updates from your end? Last time I checked I still couldn't reproduce the problem you're having

@SakuSakuLab
Author

Long time no see, Rueberger.
You look as busy as ever.
I don't have anything to update here.
I'm still getting the error I described before.
