
Add support for CWT operator #4860

Draft · wants to merge 25 commits into base: main

Conversation

mwdowski

Category: New feature

Description:

TODO

Additional information:

Affected modules and functionalities:

Key points relevant for the review:

Tests:

  • Existing tests apply
  • New tests added
    • Python tests
    • GTests
    • Benchmark
    • Other
  • N/A

Checklist

Documentation

  • Existing documentation applies
  • Documentation updated
    • Docstring
    • Doxygen
    • RST
    • Jupyter
    • Other
  • N/A

DALI team only

Requirements

  • Implements new requirements
  • Affects existing requirements
  • N/A

REQ IDs: N/A

JIRA TASK: N/A

@mwdowski mwdowski marked this pull request as draft May 18, 2023 20:20
@mwdowski
Author

@awolant

@awolant awolant self-assigned this May 19, 2023
Comment on lines 74 to 77
auto* sample_data_gpu = context.scratchpad->AllocateGPU<SampleDesc<T>>(1);
CUDA_CALL(
cudaMemcpyAsync(sample_data_gpu, sample_data, sizeof(SampleDesc<T>),
cudaMemcpyHostToDevice, context.gpu.stream));
Contributor

I think you can do this now (and remove L66); just create sample_data on the stack:

Suggested change
auto* sample_data_gpu = context.scratchpad->AllocateGPU<SampleDesc<T>>(1);
CUDA_CALL(
cudaMemcpyAsync(sample_data_gpu, sample_data, sizeof(SampleDesc<T>),
cudaMemcpyHostToDevice, context.gpu.stream));
auto sample_data_gpu = context.scratchpad->ToContiguousGPU(context.gpu.stream, sample_data);

Contributor

The idea is that you don't need to calculate the required memory ahead of time, as we have a pooled memory allocator that can handle on-demand GPU memory allocations quite fast.
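To illustrate the point about pooled allocation being cheap, here is a minimal, purely hypothetical sketch of the pattern (this is not DALI's allocator; the `SimplePool` class and its methods are invented for illustration): freed blocks are recycled on the next request, so repeated on-demand allocations rarely touch the underlying allocator.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch of a pooled allocator: freed blocks are kept and
// handed back on the next suitable request, so on-demand allocation is
// cheap after warm-up. Not the DALI implementation.
class SimplePool {
 public:
  void* Allocate(std::size_t size) {
    // First try to recycle a previously freed block that is large enough.
    for (std::size_t i = 0; i < free_.size(); ++i) {
      if (free_[i].size >= size) {
        void* p = free_[i].ptr;
        free_.erase(free_.begin() + static_cast<std::ptrdiff_t>(i));
        return p;  // reuse: no new allocation needed
      }
    }
    // Fall back to a fresh allocation, owned by the pool.
    owned_.emplace_back(size);
    return owned_.back().data();
  }

  void Free(void* ptr, std::size_t size) { free_.push_back({ptr, size}); }

 private:
  struct Block { void* ptr; std::size_t size; };
  std::vector<Block> free_;                 // recyclable blocks
  std::vector<std::vector<char>> owned_;    // backing storage
};
```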

Done

sample_data[0].a = a.tensor_data(0);
sample_data[0].size_a = volume(a.tensor_shape(0));
auto in_size = (args.end - args.begin) * args.sampling_rate;
sample_data[0].size_out = in_size * sample_data[0].size_a;
Contributor

If I understand correctly, the sample description describes only one sample at a time?

At that time I didn't fully understand how batching works. So yes, back then the sample description did describe only one sample at a time, but this has changed: batching is now supported, and this array describes multiple samples.
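A minimal sketch of what "one descriptor per sample in the batch" might look like, loosely mirroring the `SampleDesc<T>` fields visible in the diff (the struct layout, `MakeDescriptors` helper, and field names here are illustrative assumptions, not the PR's actual code):

```cpp
#include <cstdint>
#include <vector>

// Illustrative per-sample descriptor, loosely modeled on the SampleDesc<T>
// fields seen in the diff (a, size_a, size_out). Names are assumptions.
struct SampleDesc {
  const float* a;         // per-sample scale coefficients
  std::int64_t size_a;    // number of coefficients for this sample
  std::int64_t size_out;  // number of output elements for this sample
};

// Fill one descriptor per sample in the batch, instead of describing a
// single sample as in the earlier revision of the kernel.
std::vector<SampleDesc> MakeDescriptors(
    const std::vector<std::vector<float>>& a_batch, std::int64_t in_size) {
  std::vector<SampleDesc> descs(a_batch.size());
  for (std::size_t i = 0; i < a_batch.size(); ++i) {
    descs[i].a = a_batch[i].data();
    descs[i].size_a = static_cast<std::int64_t>(a_batch[i].size());
    descs[i].size_out = in_size * descs[i].size_a;
  }
  return descs;
}
```

The resulting array can then be copied to the GPU in one shot (e.g. via the scratchpad), so the kernel sees all samples of the batch at once.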

Comment on lines 53 to 58
ScratchpadEstimator se;
se.add<mm::memory_kind::host, SampleDesc<T>>(1);
se.add<mm::memory_kind::device, SampleDesc<T>>(1);
KernelRequirements req;
req.scratch_sizes = se.sizes;
return req;
Contributor

With the dynamic scratchpad, this is not needed anymore.

Done

auto out_view = view<T>(output);

kernels::KernelContext ctx;
ctx.gpu.stream = ws.stream();
Contributor

To use DynamicScratchpad please:

    kernels::DynamicScratchpad scratchpad({}, AccessOrder(ws.stream()));
    kernels::KernelContext ctx;
    ctx.gpu.stream = ws.stream();
    ctx.scratchpad = &scratchpad;

void CwtGpu<T>::Run(KernelContext &context, const OutListGPU<T, DynamicDimensions> &out,
const InListGPU<T, DynamicDimensions> &in, const CwtArgs<T> &args) {
auto num_samples = in.size();
auto *sample_data = context.scratchpad->AllocateHost<SampleDesc<T>>(num_samples);
Contributor

Same comment as dali/kernels/signal/wavelet/wavelet_gpu.cu

using CwtArgs = kernels::signal::wavelets::CwtArgs<T>;
using CwtKernel = kernels::signal::wavelets::CwtGpu<T>;

explicit CwtImplGPU(CwtArgs args) : args_(std::move(args)) {
Contributor

I wonder whether this is a parameter you want to set once for all samples, or whether it could differ from sample to sample.


namespace dali {

DALI_SCHEMA(Cwt).DocStr("by MW").NumInput(1).NumOutput(1).AddArg("a", "costam",
Contributor

Could you please extend the operator description and add more information about the argument?
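For illustration, a more descriptive schema might look something like the following. This is a sketch only: the docstring wording and the `DALI_FLOAT_VEC` argument type are assumptions, not the final definition.

```cpp
DALI_SCHEMA(Cwt)
    .DocStr(R"(Computes the continuous wavelet transform (CWT) of the input signal.)")
    .NumInput(1)
    .NumOutput(1)
    .AddArg("a",
            R"(List of wavelet scale coefficients at which the transform is evaluated.)",
            DALI_FLOAT_VEC);
```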

@JanuszL
Contributor

JanuszL commented May 19, 2023

Great PR. Thank you for your contribution. I have left some food for thought. @awolant will probably add more related to the implementation itself.
Again, great work!

#include "dali/core/format.h"
#include "dali/core/util.h"
#include "dali/kernels/kernel.h"
#include "dali/kernels/signal/wavelet/wavelet_args.h"
Contributor

Looks like this file is missing from the PR.

My bad. I've committed the missing file.

@awolant
Contributor

awolant commented May 26, 2023

Very nice draft. Thanks for the contribution.

kubo11 and others added 4 commits May 29, 2023 21:03
This change was mainly about moving from storing wavelets as functions to functors, so that wavelets can now have extra parameters. This introduced the challenge of making the CUDA kernel accept these functors, so templates were used.
A helper utility was also introduced on the operator side: the RunForName function translates wavelet names and runs the right DALI kernel.
Discrete wavelets have been discarded, since we're currently focusing on the continuous wavelet transform.

Computation of wavelet input samples has been moved to a separate CUDA kernel, which should give a speedup when computing wavelets for multiple a and b parameters.

Input wavelet samples, their scaled values, and the b coefficient are stored in shared memory instead of global memory, which should speed up computation.
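The shared-memory staging described above could be sketched roughly as follows. This is an illustrative CUDA fragment only, not the PR's actual kernel: the kernel name, the wavelet functor interface, and the indexing scheme are all assumptions.

```cuda
// Hypothetical sketch: stage the wavelet input samples in shared memory so
// each thread block reads them from fast on-chip storage instead of global
// memory. Names and indexing are illustrative, not the PR's kernel.
template <typename T, typename Wavelet>
__global__ void CwtKernelSketch(T *out, const T *in, const T *b,
                                int size_in, Wavelet wavelet) {
  extern __shared__ char smem_raw[];
  T *smem_in = reinterpret_cast<T *>(smem_raw);

  // Cooperative load: each thread copies a strided slice of the input.
  for (int i = threadIdx.x; i < size_in; i += blockDim.x) {
    smem_in[i] = in[i];
  }
  __syncthreads();

  // Each thread block handles one b coefficient; every read of the input
  // now hits shared memory rather than global memory.
  for (int i = threadIdx.x; i < size_in; i += blockDim.x) {
    out[blockIdx.x * size_in + i] = wavelet(smem_in[i] - b[blockIdx.x]);
  }
}
```

The launch would need to pass the dynamic shared-memory size (`size_in * sizeof(T)`) as the third launch parameter.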
sample.size_b = b.shape.tensor_size(i);
sample.span = span;
sample.size_in = std::ceil((sample.span.end - sample.span.begin) * sample.span.sampling_rate);
CUDA_CALL(cudaMalloc(&(sample.in), sizeof(T) * sample.size_in));
Contributor

Not needed I guess.

Contributor

I mean it could either be host memory copied to the GPU later, or the scratchpad should be used for this allocation instead of the slow cudaMalloc.

Changed cudaMalloc to scratchpad.AllocateGPU.

TensorListView<StorageGPU, const T> &b,
const kernels::signal::WaveletSpan<T> &span,
const std::vector<T> &args) {
if (name == "HAAR") {
Contributor

I wonder if you shouldn't use an enum for that instead of a string, like DALIInterpType (see backend_impl.cc, resampling_attr.h and resampling_attr.cc).

Great idea. I've added enum DALIWaveletName.
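A minimal sketch of the string-to-enum mapping in the spirit of DALIInterpType (the enumerator set and the `ParseWaveletName` helper below are illustrative assumptions, not the actual DALIWaveletName definition from the PR):

```cpp
#include <stdexcept>
#include <string>

// Illustrative enum in the spirit of DALIInterpType; the enumerator set
// is an assumption, not the PR's actual DALIWaveletName.
enum class DALIWaveletName {
  DALI_HAAR,
  DALI_MORLET,
  DALI_MEXICAN_HAT,
};

// Map the operator's string argument to the enum once, at the boundary,
// so the rest of the code dispatches on a cheap, typo-proof enum value.
DALIWaveletName ParseWaveletName(const std::string &name) {
  if (name == "HAAR") return DALIWaveletName::DALI_HAAR;
  if (name == "MORL") return DALIWaveletName::DALI_MORLET;
  if (name == "MEXH") return DALIWaveletName::DALI_MEXICAN_HAT;
  throw std::invalid_argument("Unknown wavelet name: " + name);
}
```

Unknown names fail loudly at parse time instead of silently falling through string comparisons deeper in the kernel dispatch.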

JakubO and others added 9 commits July 5, 2023 01:04
Wavelet constructor exceptions are now being handled correctly.
Morlet wavelet C argument has been removed.
Fix wavelet exceptions and expand cwt operator docstr
Work on implementing operator
4 participants