OpenCL/clfft Integration including CI #26

Open
wants to merge 12 commits into
base: master

tdd11235813

Hi,

this PR extends the liFFT interface with context support, including queue management and an option for asynchronous execution.
This is implemented by the clfft client, and there are several ways to use liFFT in conjunction with clfft. The following examples show the API changes and the usage of clfft.

  • use-case 1: the default liFFT interface for clfft, where a global OpenCL context is generated in the backend
using TestLibrary = LiFFT::libraries::clFFT::ClFFTNoContextAPI;
using FFT_TYPE = LiFFT::FFT_2D_R2C<TestPrecision>;
auto inWrapped = FFT_TYPE::wrapInput(                                     
                    LiFFT::mem::wrapPtr<false>(input.get(),          
                                               TestExtents(testSize, testSize)));
auto outWrapped = FFT_TYPE::wrapOutput(                                      
                    LiFFT::mem::wrapPtr<true>(output.get(),            
                                              TestExtents(testSize, testSize / 2 + 1)));
auto fft = LiFFT::makeFFT<TestLibrary>(inWrapped, outWrapped);
fft(inWrapped, outWrapped);
  • use-case 2: the user provides a context/queue object to the liFFT API. The ClFFT client offers 3 classes which encapsulate both the context/device and the queue.
    • context classes are: ContextLocal (RAII), ContextGlobal (Singleton) and ContextWrapper (wrap raw OpenCL context, device and queue).
    • other clients like CUDA could provide similar types for CUDA streams
    • makeFFTInQueue is added to the API, otherwise there would be ambiguous overloads
using TestLibrary = LiFFT::libraries::clFFT::ClFFTContextAPI;
using Context = LiFFT::libraries::clFFT::policies::ContextLocal<>;
Context context; // RAII context object, default-constructed as in use-case 4
auto fft = LiFFT::makeFFTInQueue<TestLibrary>(inWrapped,
                                              outWrapped,
                                              context);
fft(inWrapped, outWrapped, context);
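The three ownership models named in use-case 2 (RAII, singleton, non-owning wrapper) can be sketched without OpenCL. This is only an illustration of the idea, not the PR's implementation: `FakeQueue`, `LocalContext`, `GlobalContext`, `WrappedContext` and `liveQueuesInsideScope` are hypothetical names, and `FakeQueue` merely counts live instances in place of a real context/device/queue.

```cpp
#include <cassert>

// Hypothetical stand-in for an OpenCL context/device/queue bundle.
// It only counts how many instances are alive.
struct FakeQueue {
    static int& liveCount() { static int n = 0; return n; }
    FakeQueue() { ++liveCount(); }
    ~FakeQueue() { --liveCount(); }
    FakeQueue(const FakeQueue&) = delete;
    FakeQueue& operator=(const FakeQueue&) = delete;
};

// RAII: owns its queue, which is released when the object leaves scope.
class LocalContext {
    FakeQueue m_queue;
public:
    FakeQueue& queue() { return m_queue; }
};

// Singleton: one shared queue, created on first use.
class GlobalContext {
    FakeQueue m_queue;
    GlobalContext() = default;
public:
    static GlobalContext& instance() { static GlobalContext ctx; return ctx; }
    FakeQueue& queue() { return m_queue; }
};

// Non-owning wrapper around a queue the user created elsewhere.
class WrappedContext {
    FakeQueue* m_queue;
public:
    explicit WrappedContext(FakeQueue& q) : m_queue(&q) {}
    FakeQueue& queue() { return *m_queue; }
};

// Returns the number of live queues while a local context is in scope.
inline int liveQueuesInsideScope() {
    LocalContext local;
    WrappedContext view(local.queue()); // adds no queue of its own
    (void)view;
    return FakeQueue::liveCount();
}
```

The wrapper variant leaves lifetime entirely to the caller, which is the trade-off when handing an existing raw context and queue to a library.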
  • use-case 3: the user also wants to pass OpenCL memory objects to liFFT
    • FFT_LibPtrWrapper is added to liFFT to handle non-accessible device/library pointers; it is an FFT_DataWrapperBase
    • such a lib pointer is flagged as device memory, so the user is responsible for memory allocation
    • because a lib pointer is not directly accessible, you cannot use generators to fill data, nor the liFFT copy policy
      • the API could be extended again to support lib pointers along with copy functors (host2device, ...)
cl_mem dat1 = clCreateBuffer(...);
cl_mem dat2 = clCreateBuffer(...);
// ... data sent to dat1 ...
// wrap OpenCL device pointer 
auto inWrapped = FFT::wrapInputLibPtr(dat1, TestExtents(testSize, testSize));
auto outWrapped = FFT::wrapOutputLibPtr(dat2, TestExtents(testSize, testSize));
auto fft = LiFFT::makeFFTInQueue<ClFFTContextAPI>(inWrapped, outWrapped, context);
fft(inWrapped, outWrapped, context);
  • use-case 4: asynchronous clfft/liFFT (also see testOpenCL.cpp)
using Context = LiFFT::libraries::clFFT::policies::ContextLocal<true>; // enable async context
// ...
{
  Context context;
  using FFT_TYPE = LiFFT::FFT_2D_R2C<TestPrecision>;
  auto inWrapped = FFT_TYPE::wrapInput(
                    LiFFT::mem::wrapPtr<false>(input.get(),
                                               TestExtents(testSize, testSize)));
  auto outWrapped = FFT_TYPE::wrapOutput(
                    LiFFT::mem::wrapPtr<true>(output.get(),
                                              TestExtents(testSize, testSize / 2 + 1)));
  LiFFT::policies::copy(inWrapped, baseR2CInput);

  auto fft = LiFFT::makeFFTInQueue<ClFFTContextAPI>(inWrapped,
                                                     outWrapped,
                                                     context);
  fft(inWrapped, outWrapped, context);

  context.sync_queue(); // to wait until host data with result is present
}
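Why `sync_queue()` matters in the asynchronous case can be shown with a small stand-in. This is a sketch under stated assumptions, not the clfft client: `SketchContext` and `firstValueBeforeAndAfterSync` are hypothetical, and the "asynchronous" mode simply defers the work until `sync_queue()` instead of running it on a device, so the behaviour is deterministic.

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical queue. With Async = true, enqueue() returns before the work
// has run; results are only guaranteed after sync_queue().
template <bool Async = false>
class SketchContext {
    std::vector<std::function<void()>> m_pending;
public:
    void enqueue(std::function<void()> work) {
        if (Async)
            m_pending.push_back(std::move(work)); // deferred
        else
            work();                               // blocking: done on return
    }
    void sync_queue() {
        for (auto& task : m_pending) task();
        m_pending.clear();
    }
};

// Doubles the data through the queue and reports the first element as seen
// before and after synchronizing.
template <bool Async>
std::pair<int, int> firstValueBeforeAndAfterSync() {
    std::vector<int> data{1, 2, 3};
    SketchContext<Async> context;
    context.enqueue([&data] { for (auto& x : data) x *= 2; });
    int before = data[0];
    context.sync_queue();
    return {before, data[0]};
}
```

In blocking mode both values are already the result; in the deferred mode the first value is still the input, which mirrors why the host must wait before reading the output buffer.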
  • cmake example for building clfft
export CMAKE_PREFIX_PATH=$HOME/software/clFFT-cuda8.0-gcc5.4/:/opt/cuda/include/:$CMAKE_PREFIX_PATH
cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo \
      -DLiFFT_ENABLE_CUDA=0 -DLiFFT_ENABLE_OPENCL=1 \
      -DCMAKE_C_COMPILER=gcc-5 -DCMAKE_CXX_COMPILER=g++-5 ..
  • the .travis.yml is updated. It now uses the trusty distribution and cuda8, and includes OpenCL testing (CPU, AMD OpenCL). There are still CUDA+gcc+boost issues, so only a few version combinations seem to work.

I hope it now provides a usable design that we can build on.

When you have some time, please review and play around with the code :)

@ax3l
Member

ax3l commented Aug 14, 2017

pretty awesome, thank you!

@Flamefire whenever you have the time, feel free to have a look! :)

@tdd11235813
Author

tdd11235813 commented Aug 24, 2017

The travis files and testOpenCL.cpp have been updated. If you want to have this PR as a single commit, let me know and I will squash the commits.

There is another use case which has not been shown here yet. When you want to execute the FFT on a CPU or GPU, you can specify the context target by an enum:

enum class ContextDevice {                
    GPU=CL_DEVICE_TYPE_GPU,               
    CPU=CL_DEVICE_TYPE_CPU,               
    ACCELERATOR=CL_DEVICE_TYPE_ACCELERATOR
};                                        

So you could request one OpenCL context for the CPU and one for the GPU:

using Context = LiFFT::libraries::clFFT::policies::ContextLocal<true>;
Context context_cpu(ContextDevice::CPU);                          
Context context_gpu(ContextDevice::GPU);                          
// ...

If there is no GPU, OpenCL uses CL_DEVICE_TYPE_DEFAULT, whose meaning depends on the OpenCL implementation.
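A fallback of this kind might look like the following sketch. It does not use the OpenCL headers: the constants are local copies of the `cl_device_type` bit values from `CL/cl.h`, and `chooseDeviceType` is a hypothetical helper. Applying the fallback to every requested device type, not only the GPU, is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Local copies of the OpenCL device-type bits (cl_device_type is a 64-bit
// bitfield), so the sketch compiles without CL/cl.h.
using DeviceTypeBits = std::uint64_t;
constexpr DeviceTypeBits kDefault     = 1u << 0; // CL_DEVICE_TYPE_DEFAULT
constexpr DeviceTypeBits kCpu         = 1u << 1; // CL_DEVICE_TYPE_CPU
constexpr DeviceTypeBits kGpu         = 1u << 2; // CL_DEVICE_TYPE_GPU
constexpr DeviceTypeBits kAccelerator = 1u << 3; // CL_DEVICE_TYPE_ACCELERATOR

enum class ContextDevice : DeviceTypeBits {
    GPU = kGpu,
    CPU = kCpu,
    ACCELERATOR = kAccelerator
};

// Hypothetical rule: keep the requested type if any available device
// offers it, otherwise fall back to the implementation-defined default.
inline DeviceTypeBits chooseDeviceType(ContextDevice wanted,
                                       const std::vector<DeviceTypeBits>& available) {
    const DeviceTypeBits bits = static_cast<DeviceTypeBits>(wanted);
    for (DeviceTypeBits dev : available)
        if (dev & bits) return bits;
    return kDefault;
}
```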
There are two test cases in testOpenCL.cpp called TestClFFTR2CInplaceTwoArch[Async] to show the difference (one warm-up run precedes the timed FFT calls, and the measured time includes synchronization).

$ ./test/Test --run_test=OpenCL/TestClFFTR2CInplaceTwoArch*
Running 2 test cases...
1: "ClFFT Informations","Device","Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz", <snip>
2: "ClFFT Informations","Device","Tesla P100-PCIE-16GB", <snip>
TwoArch Sync: Time = 101.061
TwoArch ASync: Time = 88.4254

*** No errors detected

I know this is not proof that both FFTs were running concurrently, but it shows the workflow of sync and async architecture-specific contexts. Playing around with liFFT in threaded environments is on the ToDo list.

@ax3l
Member

ax3l commented Oct 20, 2017

@Flamefire if you are interested in taking a look at the clfft implementation or want to merge it, feel free to jump in :)

Contributor

@Flamefire Flamefire left a comment

Just got around to reviewing this. Looks great to me except for one place with a trait. Please provide a short explanation and at least rename the trait has_type to something meaningful. Maybe something like shown here would be more readable? Note that there is a void_t already defined in the code, so C++11 compatibility is ok.


// SFINAE test if T has type member
template <typename T>
class has_type
Contributor

I have trouble understanding the usage of this. I assume this trait is supposed to return true iff a member isComplex exists that is constructible from an int?
Then what is the reasoning for using it the way IntegralTypeImpl does below? If I read that correctly, IntegralTypeImpl<Foo> returns Foo for every type Foo that is either an int or float or simply does not have an isComplex member, which is true for pretty much any class. Wouldn't that make it pointless?
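For reference, a void_t-based member-type detection in C++11 could look like the sketch below. The names `make_void`, `has_nested_type`, `WithType` and `Unrelated` are chosen for this illustration only; the project already has its own void_t, which is redefined here just to keep the example self-contained.

```cpp
#include <cassert>
#include <type_traits>

// C++11-compatible void_t (std::void_t only exists from C++17 on).
template <typename... Ts> struct make_void { using type = void; };
template <typename... Ts> using void_t = typename make_void<Ts...>::type;

// Primary template: assume there is no nested `type`.
template <typename T, typename = void>
struct has_nested_type : std::false_type {};

// Selected only when `typename T::type` is well-formed.
template <typename T>
struct has_nested_type<T, void_t<typename T::type>> : std::true_type {};

struct WithType { using type = float; };
struct Unrelated {}; // like most classes, it has no nested `type`
```

`Unrelated` shows the concern raised above: "has no such member" holds for almost every class, so it is a weak criterion for treating a type as a plain value type.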

@tdd11235813
Author

tdd11235813 commented Feb 14, 2018

@Flamefire you are absolutely right, this code is too messy and void_t helps here.
Now to the motivation for a more generic version of IntegralType.
For OpenCL cl_mem pointer support I wanted to implement a LibPtrWrapper that integrates with liFFT and accepts non-integral types which must nevertheless be treated like integral types.
(Note that for float/double and the like we already have PlainPtrWrapper.)
cl_mem is such a non-integral, non-liFFT-compatible type and is library specific.
liFFT only knows floating-point and fundamental integral types; for anything else it calls ::isComplex and ::type at compile time on the type that was wrapped by IntegralType.
I simplified the check and now use only ::type as the indicator of whether a type is liFFT compatible.

edit: CI failed, I'll look into this (uh, but it did work on my system :P)
edit2: there was a SFINAE type mismatch, and now it does not work on my system either lol .. god bless travis. I am running some tests now and will push an update after that.

@Flamefire
Contributor

IntegralType is used to determine the actual datatype used. E.g. we can have Complex types which use float, double etc. as their integral type. Based on that the backend can choose the library implementation (fftwf, fftwd for example). clMem is a problem, because we cannot get the integral type from the handle alone. This was not considered when designing this library. In the current implementation any type which does not have a nested type member is an integral type, which IMO is wrong. Maybe we'd be better off leaving IntegralType empty or unspecialised for clMem, which seems like a good way of saying "I don't know". One could then specialize over some kind of wrapper around clMem which just enhances clMem by its type and is implicitly convertible to clMem (but not from!).
You circumvented the problem by using FFT::wrapInput, which basically propagates all properties from the FFT to the pointer. The initial idea of PointerWrappers (which yours belongs to) was to enhance raw pointers by required properties so the library can use them to select codepaths and/or check conditions. So your approach is the other way round. While this shortcut might be ok, I think the LibPtrWrapper itself should not be based on the FFT but rather get all the information passed in, so one could write e.g. wrapLibHandle(myClMem, Complex<float>, myExtends) or so. FFT::wrapInput could still fill these params with the information it has, but just as a "lazy shortcut". "Lazy" because it is shorter, but as mentioned it circumvents all validity checks done later.
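The "convertible to clMem, but not from" idea can be expressed with an explicit constructor and an implicit conversion operator. The sketch below is not liFFT code: `OpaqueBuffer`, `RawHandle`, `TypedHandle` and `isNull` are hypothetical, and `RawHandle` only stands in for `cl_mem` so the example compiles without OpenCL.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical opaque library handle, standing in for cl_mem.
struct OpaqueBuffer;
using RawHandle = OpaqueBuffer*;

// Attaches the element type to a raw handle.
template <typename TValue>
class TypedHandle {
    RawHandle m_handle;
public:
    using value_type = TValue;
    // explicit: a raw handle never silently becomes a typed one
    explicit TypedHandle(RawHandle h) : m_handle(h) {}
    // implicit: a typed handle can be passed wherever the raw one is expected
    operator RawHandle() const { return m_handle; }
};

// Stand-in for a library call that takes the raw handle.
inline bool isNull(RawHandle h) { return h == nullptr; }
```

With this shape the element type has to be stated once, at the point of wrapping, while existing calls that expect the raw handle keep working.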

From the comment there seems to be a misunderstanding of what IntegralType is. It is not a "whether", as it's not a bool, but a "what". And for the decision whether a type is complex or not there exists IsComplex or something like that.

Oh, and while I'm on that comment: you don't need the ::type and ::isComplex members; you only need a specialization of the traits. That was one of the things René or Axel strongly suggested back then, to allow non-intrusive extensions.
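A non-intrusive extension via trait specialization might look as follows. The trait names mirror the ones discussed here, but their definitions are assumptions for this sketch and not the library's actual code; `VendorComplex` is a hypothetical third-party type that cannot be edited.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical third-party type: no ::type, no ::isComplex members.
struct VendorComplex { double re, im; };

// Primary template is empty: for unknown types the answer is "I don't know".
template <typename T, typename = void>
struct IntegralType {};

// Arithmetic types are their own integral type.
template <typename T>
struct IntegralType<T, typename std::enable_if<std::is_arithmetic<T>::value>::type> {
    using type = T;
};

template <typename T>
struct IsComplex : std::false_type {};

// The extension lives next to the traits, not inside VendorComplex.
template <> struct IntegralType<VendorComplex> { using type = double; };
template <> struct IsComplex<VendorComplex> : std::true_type {};
```

Using `IntegralType<X>::type` for a type without a specialization then fails to compile instead of silently treating `X` as a plain value type.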

@tdd11235813
Author

tdd11235813 commented Feb 27, 2018

Thanks for your detailed feedback. Let me try to summarize the next steps: the goal is to decouple the FFT data from the FFT executor. FFT properties exist on both sides and are validated at compile time, and that is of course what we also want for the clFFT backend.
Thus, a type-agnostic wrapper is required. I would call it LibHandle now.
(It is probably better to use composition instead of inheritance.)

template<typename T, typename TValueType, unsigned T_numDims>
struct LibHandle : public T {
  using type = TValueType;
  using IdxType = types::Vec< T_numDims, size_t >;
protected:  
  IdxType m_extents;
};

I cannot derive from DataContainer the way PlainPtrWrapper does, because cl_mem is not directly accessible the way raw pointers are. Hence the name LibHandle instead of LibPointer, to emphasize the difference.
The data side would be:

cl_mem cldata;
// .. this cldata will contain 2D floats ..
LibHandle<cl_mem,float,2> handle = LiFFT::mem::wrapLibHandle<float>(cldata, extents);
// or just: auto handle = LiFFT::mem::wrapLibHandle<float>(cldata, extents);
// .. implicit conversion to base type cl_mem is also possible
// things like copy do not work, as the handle defines no accessor
// LiFFT::policies::copy(handle, baseR2CInput);
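A composition-based variant of the handle could be sketched like this. Since `cl_mem` is a pointer typedef in the OpenCL headers, it cannot serve as a base class, so the handle is stored as a member instead. Everything here is an assumption for illustration: `OpaqueBuffer`/`RawHandle` stand in for `cl_mem`, `std::array` stands in for `types::Vec`, and the `wrapLibHandle` signature is only one possible shape.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <type_traits>

struct OpaqueBuffer;            // hypothetical object behind the handle
using RawHandle = OpaqueBuffer*;

template <typename THandle, typename TValueType, std::size_t T_numDims>
class LibHandle {
public:
    using type = TValueType;
    using IdxType = std::array<std::size_t, T_numDims>; // stand-in for types::Vec

    LibHandle(THandle handle, const IdxType& extents)
        : m_handle(handle), m_extents(extents) {}

    operator THandle() const { return m_handle; }       // usable as raw handle
    const IdxType& getExtents() const { return m_extents; }
    std::size_t numElements() const {
        std::size_t n = 1;
        for (std::size_t e : m_extents) n *= e;
        return n;
    }
private:
    THandle m_handle;  // member, not base class
    IdxType m_extents;
};

// Value type is given explicitly; handle type and dimensionality are deduced.
template <typename TValueType, typename THandle, std::size_t N>
LibHandle<THandle, TValueType, N>
wrapLibHandle(THandle handle, const std::array<std::size_t, N>& extents) {
    return LibHandle<THandle, TValueType, N>(handle, extents);
}

static_assert(std::is_same<LibHandle<RawHandle, float, 2>::type, float>::value,
              "the value type is carried by the handle type");
```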

Now the FFT part:

auto in_handle = LiFFT::mem::wrapLibHandle<float>(in_cldata, extents);
auto out_handle = LiFFT::mem::wrapLibHandle<Complex<float>>(out_cldata, extents);
using FFT_TYPE = LiFFT::FFT_2D_R2C<float>;
// add the FFT properties to the handle in a wrapper
auto in_wrapped = FFT_TYPE::wrapInputLibHandle(in_handle);
auto out_wrapped = FFT_TYPE::wrapOutputLibHandle(out_handle);
// make FFT based on wrappers' FFT properties
auto fft = LiFFT::makeFFTInQueue<ClFFTContextAPI>(in_wrapped, out_wrapped,
                                                  context);
// execute
fft(in_wrapped, out_wrapped, context);
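The compile-time validation that this decoupling enables could be sketched as below: the FFT description checks the properties the handle carries, rather than stamping its own properties onto it. `HandleSketch` and `R2CSketch` are hypothetical, and `std::complex` stands in for liFFT's `Complex`.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <type_traits>

// Minimal hypothetical handle: carries only compile-time properties.
template <typename TValueType, std::size_t T_numDims>
struct HandleSketch {
    using type = TValueType;
    static constexpr std::size_t numDims = T_numDims;
};

// Hypothetical R2C description: real input, complex output, fixed rank.
template <typename TPrecision, std::size_t T_numDims>
struct R2CSketch {
    template <typename THandle>
    static constexpr bool acceptsInput() {
        return std::is_same<typename THandle::type, TPrecision>::value
            && THandle::numDims == T_numDims;
    }
    template <typename THandle>
    static constexpr bool acceptsOutput() {
        return std::is_same<typename THandle::type, std::complex<TPrecision>>::value
            && THandle::numDims == T_numDims;
    }
};
```

A wrap function could then `static_assert` on `acceptsInput` or `acceptsOutput`, so a handle declared as complex cannot be passed as the real input of an R2C transform.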

I have not checked all the possible conflicts under the hood, but what do you think?

@Flamefire
Contributor

Yes, sounds great. Maybe pack in strides too, but have them available as default params? Not sure if this is required/useful; it is just that plain pointers may have strides, which are checked (I think) by the accessors.
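Optional strides could be handled along these lines. This is a sketch, assuming row-major storage and strides counted in elements, neither of which is confirmed here for liFFT; `contiguousStrides`, `Layout` and `makeLayout` are hypothetical names. An overload is used instead of a default argument because the default depends on the extents.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Row-major strides (in elements) for densely packed data:
// the last dimension is contiguous.
template <std::size_t N>
std::array<std::size_t, N> contiguousStrides(const std::array<std::size_t, N>& extents) {
    std::array<std::size_t, N> strides{};
    std::size_t step = 1;
    for (std::size_t i = N; i-- > 0;) {
        strides[i] = step;
        step *= extents[i];
    }
    return strides;
}

template <std::size_t N>
struct Layout {
    std::array<std::size_t, N> extents;
    std::array<std::size_t, N> strides;
};

// Strides omitted: fall back to the contiguous default.
template <std::size_t N>
Layout<N> makeLayout(const std::array<std::size_t, N>& extents) {
    return Layout<N>{extents, contiguousStrides(extents)};
}

// Strides given: use them as-is, e.g. for padded rows.
template <std::size_t N>
Layout<N> makeLayout(const std::array<std::size_t, N>& extents,
                     const std::array<std::size_t, N>& strides) {
    return Layout<N>{extents, strides};
}
```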


3 participants