-
I am implementing knowledge-distillation-based DNN training, as illustrated in the figure below, and want to run the teacher and student models (the blue and green blocks) in parallel on the same data batch. I've checked some popular repos such as NervanaSystems/distiller and peterliht/knowledge-distillation-pytorch: they execute the forward passes of the student and teacher models in sequence (line by line), not in parallel on different devices (GPU or CPU). I am trying to speed up training by running the two models at the same time on multiple devices, e.g., loading the small, inference-only model on the CPU so it does not interrupt the GPU training of the heavy model. What is the proper way to run two models in parallel with the Module() API of MXNet 1.x? Should I use Python threading or multiprocessing for this?
-
Note that the Python line-by-line "execution" just refers to telling the multi-threaded backend that it shall execute an operation. If there are no dependencies between the two operations, the backend will execute them in parallel automatically.
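A minimal sketch of what this means with the MXNet 1.x Module API (symbols, shapes, and context placement are hypothetical, not the original training code): the two `forward()` calls below are issued one after the other in Python, but they only enqueue work on the asynchronous engine, so the backend can overlap the CPU and GPU graphs.

```python
# Minimal sketch: two independent Modules bound to different devices. The
# Python calls only enqueue operations on MXNet's async engine, which can
# then execute the two graphs concurrently.
import mxnet as mx

data = mx.sym.Variable('data')
teacher_sym = mx.sym.FullyConnected(data, num_hidden=1024, name='teacher_fc')
student_sym = mx.sym.FullyConnected(data, num_hidden=128, name='student_fc')

# Bind the two graphs to different devices (e.g. the frozen teacher on CPU,
# the student on GPU); both are forward-only here for brevity.
teacher = mx.mod.Module(teacher_sym, data_names=['data'], label_names=None,
                        context=mx.cpu())
student = mx.mod.Module(student_sym, data_names=['data'], label_names=None,
                        context=mx.gpu(0))

batch_shape = (32, 512)
for m in (teacher, student):
    m.bind(data_shapes=[('data', batch_shape)], for_training=False)
    m.init_params()

batch = mx.io.DataBatch(data=[mx.nd.random.uniform(shape=batch_shape)])

# Both calls return immediately: they only submit work to the engine. Since
# neither graph reads the other's outputs, the engine is free to run them
# in parallel on their respective devices.
teacher.forward(batch, is_train=False)
student.forward(batch, is_train=False)

# Synchronization happens only here, when the results are actually read.
teacher_out = teacher.get_outputs()[0].asnumpy()
student_out = student.get_outputs()[0].asnumpy()
```

If the two forward passes still serialize in practice, the usual culprit is a shared resource such as the temporary workspace discussed in the next reply.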
-
One very common reason for serialization between two models is workspace usage, which acts as a hidden dependency. There is an environment variable that controls this.
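Assuming the variable in question is MXNET_EXEC_NUM_TEMP (which caps the number of temporary workspaces allocated per device, default 1), a minimal way to raise the limit so the two models do not contend for the same workspace might look like this:

```python
# Assumption: the relevant variable is MXNET_EXEC_NUM_TEMP, which limits how
# many temporary workspaces each device allocates (default 1). With a single
# workspace, two executors that both request temp space end up sharing it and
# are forced to run one after the other.
import os
os.environ['MXNET_EXEC_NUM_TEMP'] = '2'  # set before mxnet is imported so it
                                         # is visible when executors are built

import mxnet as mx
```

Raising the limit trades extra memory for more parallelism, so it is worth checking device memory headroom before increasing it.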