
Improve performance of experimental.resize #5662

Open

banasraf wants to merge 3 commits into main from improve-experimental-resize-performance
Conversation

@banasraf (Collaborator) commented Oct 7, 2024

Category:

Refactoring

Description:

This PR improves the experimental.resize operator to reduce its CPU overhead. It contains the following improvements (from most to least significant):

  • using two workspace objects for the CV-CUDA op to get rid of synchronising between minibatches (see the sketch after this list)
  • using a custom allocator (taking memory from the scratchpad) for the input/output TensorBatches
  • improved method for creating the input/output TensorBatch objects: moving all common parts out of the loop
  • removed the auxiliary TensorLists in favour of using the original inputs/outputs directly with the PushFramesToBatch method
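The double-buffering in the first point can be sketched roughly as follows. This is a minimal sketch, assuming a hypothetical Workspace struct and RunMinibatch() helper rather than the actual DALI/CV-CUDA API: with two workspaces, the CPU only waits on the workspace it is about to reuse instead of synchronising the stream after every minibatch.

#include <array>
#include <cuda_runtime.h>

struct Workspace {    // hypothetical bundle of per-minibatch resources handed to the CV-CUDA op
  cudaEvent_t ready;  // created with cudaEventCreate elsewhere; recorded when the GPU is done with this workspace
};

void RunMinibatch(int mb, Workspace &ws, cudaStream_t stream);  // hypothetical: enqueues the CV-CUDA resize for minibatch mb

void ProcessMinibatches(int num_minibatches, cudaStream_t stream,
                        std::array<Workspace, 2> &workspaces) {
  for (int mb = 0; mb < num_minibatches; ++mb) {
    Workspace &ws = workspaces[mb % 2];
    // Wait only until *this* workspace is free again; the other one may still be
    // in flight, so the CPU can stay one minibatch ahead of the GPU.
    // (Synchronising on a never-recorded event returns immediately.)
    cudaEventSynchronize(ws.ready);
    RunMinibatch(mb, ws, stream);
    cudaEventRecord(ws.ready, stream);  // point after which ws becomes reusable
  }
}

With a single workspace the wait would effectively become a per-minibatch stream synchronisation; alternating two of them removes that stall.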

Additional information:

Affected modules and functionalities:

experimental.resize, nvcvop.h/cc

Key points relevant for the review:

Tests:

  • Existing tests apply
  • New tests added
    • Python tests
    • GTests
    • Benchmark
    • Other
  • N/A

Checklist

Documentation

  • Existing documentation applies
  • Documentation updated
    • Docstring
    • Doxygen
    • RST
    • Jupyter
    • Other
  • N/A

DALI team only

Requirements

  • Implements new requirements
  • Affects existing requirements
  • N/A

REQ IDs: N/A

JIRA TASK: N/A

@banasraf (Collaborator, Author) commented Oct 7, 2024

!build

@dali-automaton commented:

CI MESSAGE: [19099938]: BUILD STARTED

Comment on lines 227 to 228
TensorList<GPUBackend> in_frames_;
TensorList<GPUBackend> out_frames_;
Contributor commented:

Does this actually help? How much? It's against the trend of more aggressive dynamic allocation.

@banasraf (Collaborator, Author) replied Oct 7, 2024:

It helped A LOT. For some reason the destructors of these were taking a lot of time.
Anyway, I'm working on removing those auxiliary TensorLists completely, because the .ShareData and .Resize calls also take a significant amount of time.

@dali-automaton commented:

CI MESSAGE: [19099938]: BUILD FAILED

@banasraf force-pushed the improve-experimental-resize-performance branch 3 times, most recently from 168b119 to 971ba9a on October 29, 2024 at 15:29
@banasraf force-pushed the improve-experimental-resize-performance branch from 971ba9a to 44b9ca2 on October 29, 2024 at 15:35
@banasraf (Collaborator, Author) commented:

!build

@dali-automaton commented:

CI MESSAGE: [19870509]: BUILD FAILED

@banasraf (Collaborator, Author) commented:

!build

@dali-automaton commented:

CI MESSAGE: [19871581]: BUILD STARTED

@dali-automaton commented:

CI MESSAGE: [19871581]: BUILD PASSED

@@ -23,6 +23,7 @@
#include "dali/kernels/imgproc/resample/params.h"
#include "dali/operators/image/resize/resize_op_impl.h"
#include "dali/operators/nvcvop/nvcvop.h"
#include "dali/core/nvtx.h"
Contributor commented:

Unused?

Comment on lines +267 to +284
for (int64_t i = 0; i < num_frames; ++i) {
  if (frame_offset == sample_nframes) {
    frame_offset = 0;
    do {
      ++sample_id;
      auto sample_shape = input_shape[sample_id];
      DALI_ENFORCE(sample_id < t_list.num_samples());
      std::copy(&sample_shape[first_spatial_dim], &sample_shape[input_shape.sample_dim()],
                frame_shape.begin());
      frame_stride = volume(frame_shape) * type_size;
      sample_nframes = calc_num_frames(sample_shape, first_spatial_dim);
    } while (sample_nframes * frame_stride == 0);  // we skip empty samples
    data =
        static_cast<const uint8_t *>(t_list.raw_tensor(sample_id)) + frame_stride * frame_offset;
  }
  tensors.push_back(AsTensor(data, make_span(frame_shape), dtype, nvcv_layout));
  data += frame_stride;
  frame_offset++;
@mzient (Contributor) commented Oct 30, 2024:

I think that combining the two loops and changing the outer loop condition to "while there are frames left to insert" makes it more readable:

Suggested change (replacing the snippet above):

int frames_left = num_frames;
while (frames_left) {
  if (frame_offset >= sample_nframes) {
    ++sample_id;
    assert(sample_id < t_list.num_samples());
    auto sample_shape = input_shape[sample_id];
    std::copy(&sample_shape[first_spatial_dim], &sample_shape[input_shape.sample_dim()],
              frame_shape.begin());
    frame_stride = volume(frame_shape) * type_size;
    if (frame_stride == 0) {  // this sample is (effectively) empty - skip
      sample_nframes = 0;
      continue;
    }
    sample_nframes = calc_num_frames(sample_shape, first_spatial_dim);
    data = static_cast<const uint8_t *>(t_list.raw_tensor(sample_id)) +
           frame_stride * frame_offset;
  }
  tensors.push_back(AsTensor(data, make_span(frame_shape), dtype, nvcv_layout));
  data += frame_stride;
  frame_offset++;
  frames_left--;
}

    do {
      ++sample_id;
      auto sample_shape = input_shape[sample_id];
      DALI_ENFORCE(sample_id < t_list.num_samples());
Contributor commented:

I don't think that either user or faulty data could trigger this - it would be an internal error, so I'd recommend using an assert or throwing logic_error at worst.
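As a small illustration of the alternative suggested here (assuming <cassert> and <stdexcept> are available; not the actual patch), the bounds check could become either an internal-invariant assertion or a thrown logic_error:

assert(sample_id < t_list.num_samples());  // internal invariant, not triggerable by user data

// or, at worst:
if (sample_id >= t_list.num_samples())
  throw std::logic_error("PushFramesToBatch: sample index out of range");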

Comment on lines +169 to +172
nvcvop::PushFramesToBatch(mb_input, input, first_spatial_dim_, mb.sample_offset,
                          mb.frame_offset, mb.count, sample_layout_);
nvcvop::PushFramesToBatch(mb_output, output, first_spatial_dim_, mb.sample_offset,
                          mb.frame_offset, mb.count, sample_layout_);
@mzient (Contributor) commented Oct 30, 2024:

This is a bug. Both inputs and outputs should be inserted in one go, and skipping empty samples should be based solely on the output size. The user may request resizing a non-empty tensor to (0, 0), which is not an error AFAIR. Resizing an empty input to a non-empty shape is an error and should be reported at some point.
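A minimal self-contained sketch of the rule described in this comment, using hypothetical shape vectors and a Volume() helper (not the operator's actual code):

#include <cstdint>
#include <functional>
#include <numeric>
#include <stdexcept>
#include <vector>

// Hypothetical helper: number of elements described by a shape.
inline int64_t Volume(const std::vector<int64_t> &shape) {
  return std::accumulate(shape.begin(), shape.end(), int64_t{1},
                         std::multiplies<int64_t>());
}

// A sample is skipped (in both the input and the output batch) only when the
// requested *output* is empty; an empty input with a non-empty output is an error.
inline bool ShouldProcessSample(const std::vector<int64_t> &in_shape,
                                const std::vector<int64_t> &out_shape) {
  if (Volume(out_shape) == 0)
    return false;  // e.g. resize to (0, 0) - legal, just skip the sample everywhere
  if (Volume(in_shape) == 0)
    throw std::logic_error("Cannot resize an empty input to a non-empty output shape.");
  return true;
}

Deciding per sample (rather than separately for inputs and outputs) keeps the two batches in lockstep, which is the point of pushing them in one go.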

Comment on lines +66 to +67
if (volume(in_sample_shape) > 0)
  total_frames += volume(&in_sample_shape[0], &in_sample_shape[first_spatial_dim]);
Contributor commented:

This is a bug - the emptiness of a frame depends on the output shape, not the input. At least in the old DALI resize, you can resize a non-empty frame to size 0. I understand that such samples should be skipped (both at input and output).
Resizing an empty frame to a non-zero shape is impossible and should throw.
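A sketch of how the quoted condition might change under this rule, assuming a hypothetical out_sample_shape for the same sample (illustrative only, not the actual patch):

if (volume(out_sample_shape) > 0) {  // a sample counts as empty iff its *output* is empty
  DALI_ENFORCE(volume(in_sample_shape) > 0,
               "Cannot resize an empty frame to a non-empty shape.");
  total_frames += volume(&in_sample_shape[0], &in_sample_shape[first_spatial_dim]);
}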
