Add support for CWT operator #4860
base: main
Conversation
```cpp
auto* sample_data_gpu = context.scratchpad->AllocateGPU<SampleDesc<T>>(1);
CUDA_CALL(
    cudaMemcpyAsync(sample_data_gpu, sample_data, sizeof(SampleDesc<T>),
                    cudaMemcpyHostToDevice, context.gpu.stream));
```
I think you can now just create sample_data on the stack (and remove L66):
```diff
-auto* sample_data_gpu = context.scratchpad->AllocateGPU<SampleDesc<T>>(1);
-CUDA_CALL(
-    cudaMemcpyAsync(sample_data_gpu, sample_data, sizeof(SampleDesc<T>),
-                    cudaMemcpyHostToDevice, context.gpu.stream));
+auto sample_data_gpu = context.scratchpad->ToContiguousGPU(ctx.gpu.stream, sample_data);
```
The idea is that you don't need to calculate the required memory ahead of time, since we have a pooled memory allocator that can handle on-demand GPU memory allocations quite fast.
Done
```cpp
sample_data[0].a = a.tensor_data(0);
sample_data[0].size_a = volume(a.tensor_shape(0));
auto in_size = (args.end - args.begin) * args.sampling_rate;
sample_data[0].size_out = in_size * sample_data[0].size_a;
```
If I understand correctly, the sample description describes only one sample at a time?
At that time I didn't fully understand how batching works. So yes, back then the sample description did describe only one sample at a time, but this has changed. Batching is now supported and this array describes multiple samples.
```cpp
ScratchpadEstimator se;
se.add<mm::memory_kind::host, SampleDesc<T>>(1);
se.add<mm::memory_kind::device, SampleDesc<T>>(1);
KernelRequirements req;
req.scratch_sizes = se.sizes;
return req;
```
With the dynamic scratchpad, this is not needed anymore.
Done
```cpp
auto out_view = view<T>(output);

kernels::KernelContext ctx;
ctx.gpu.stream = ws.stream();
```
To use DynamicScratchpad, please:
```cpp
kernels::DynamicScratchpad scratchpad({}, AccessOrder(ws.stream()));
kernels::KernelContext ctx;
ctx.gpu.stream = ws.stream();
ctx.scratchpad = &scratchpad;
```
```cpp
void CwtGpu<T>::Run(KernelContext &context, const OutListGPU<T, DynamicDimensions> &out,
                    const InListGPU<T, DynamicDimensions> &in, const CwtArgs<T> &args) {
  auto num_samples = in.size();
  auto *sample_data = context.scratchpad->AllocateHost<SampleDesc<T>>(num_samples);
```
Same comment as dali/kernels/signal/wavelet/wavelet_gpu.cu
```cpp
using CwtArgs = kernels::signal::wavelets::CwtArgs<T>;
using CwtKernel = kernels::signal::wavelets::CwtGpu<T>;

explicit CwtImplGPU(CwtArgs args) : args_(std::move(args)) {
```
I wonder if this is a parameter that you want to set once for all samples, or whether it could differ from sample to sample.
```cpp
namespace dali {

DALI_SCHEMA(Cwt).DocStr("by MW").NumInput(1).NumOutput(1).AddArg("a", "costam",
```
Can you please extend the operator description and add more info about the argument?
Great PR. Thank you for your contribution. I have left some food for thought. @awolant will probably add more related to the implementation itself.
```cpp
#include "dali/core/format.h"
#include "dali/core/util.h"
#include "dali/kernels/kernel.h"
#include "dali/kernels/signal/wavelet/wavelet_args.h"
```
Looks like this file is missing from the PR.
My bad. I've committed the missing file.
Very nice draft. Thanks for the contribution.
add WaveletArgs class
This change was mainly about moving from storing wavelets as functions to functors, so wavelets can now have extra parameters. This introduced the challenge of making the CUDA kernel accept these functors, so templates were used. A helper utility was also introduced on the operator side: the RunForName function translates wavelet names and runs the right DALI kernel.
Discrete wavelets have been discarded, since we're currently focusing on the continuous wavelet transform. Computation of wavelet input samples has been moved to a separate CUDA kernel, which should give a speedup when computing wavelets for multiple a and b parameters. Input wavelet samples, their scaled values, and the b coefficient are stored in shared memory instead of global memory, which should speed up computation.
Wavelet computing improvements
```cpp
sample.size_b = b.shape.tensor_size(i);
sample.span = span;
sample.size_in = std::ceil((sample.span.end - sample.span.begin) * sample.span.sampling_rate);
CUDA_CALL(cudaMalloc(&(sample.in), sizeof(T) * sample.size_in));
```
Not needed I guess.
I mean, this could either be host memory copied to the GPU later, or the scratchpad should be used for this allocation, not the slow cudaMalloc.
Changed cudaMalloc to scratchpad.AllocateGPU.
```cpp
TensorListView<StorageGPU, const T> &b,
const kernels::signal::WaveletSpan<T> &span,
const std::vector<T> &args) {
  if (name == "HAAR") {
```
I wonder if you shouldn't use an enum for that instead of a string, like DALIInterpType (see backend_impl.cc, resampling_attr.h, and resampling_attr.cc).
Great idea. I've added enum DALIWaveletName.
Wavelet computing improvements
Wavelet constructor exceptions are now handled correctly. The Morlet wavelet's C argument has been removed.
Fix wavelet exceptions and expand cwt operator docstr
Work on implementing operator
Category: New feature
Description:
TODO
Additional information:
Affected modules and functionalities:
Key points relevant for the review:
Tests:
Checklist
Documentation
DALI team only
Requirements
REQ IDs: N/A
JIRA TASK: N/A