Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement vectorisation to make this lib blazingly fast #585

Open
chinwobble opened this issue Nov 26, 2024 · 11 comments
Open

Implement vectorisation to make this lib blazingly fast #585

chinwobble opened this issue Nov 26, 2024 · 11 comments

Comments

@chinwobble
Copy link
Contributor

chinwobble commented Nov 26, 2024

Vectorisation using SIMD should provide big performance improvements on large datasets.
There are a few candidates for vectorisation such as in the merge function.
https://github.com/OpenGene/fastp/blob/master/src/stats.cpp#L881-L939

I'm happy to have a go at implementing this if you can provide a sufficient fastq file for me to test with

References:
https://chryswoods.com/vector_c++/portable.html
https://github.com/google/highway/

@sfchen
Copy link
Member

sfchen commented Nov 26, 2024

Thanks, but currently the merge function is not the bottleneck of performance.

@sfchen
Copy link
Member

sfchen commented Nov 26, 2024

And I completely agree with you that SSE/AVX can make this tool even faster. You can take a look at fastplong, which is more time consuming and is being intensively developed.

@teepean
Copy link

teepean commented Nov 26, 2024

@chinwobble AdapterRemoval got a nice speed boost when SIMD instructions were introduced. It is a similar program to fastp so you might get the idea where the speed boost is needed.

https://github.com/MikkelSchubert/adapterremoval

@chinwobble
Copy link
Contributor Author

And I completely agree with you that SSE/AVX can make this tool even faster. You can take a look at fastplong, which is more time consuming and is being intensively developed.

Thanks I will have a go at implementing this!
Can you provide some large test files for me to work with?

@chinwobble
Copy link
Contributor Author

I've had a quick attempt implementing SIMD on the Sequence::reverseComplement and its almost 10x faster on my machine.

(I understand this might not be the bottleneck).

The implementation looks like this:

string Sequence::reverseComplementHwy(string *origin)
{
    auto length = origin->length();
    const auto sequence = reinterpret_cast<const uint8_t*>(&origin[0]);
    uint8_t output[length];
    const auto transform = [](const auto d, auto output, const auto sequence) HWY_ATTR
    {
        const auto a = hn::Set(d, 65UL);
        const auto t = hn::Set(d, 84UL);
        const auto c = hn::Set(d, 67UL);
        const auto g = hn::Set(d, 71UL);
        output = hn::IfThenElse(hn::Eq(sequence, a), t, output);
        output = hn::IfThenElse(hn::Eq(sequence, t), a, output);
        output = hn::IfThenElse(hn::Eq(sequence, c), g, output);
        output = hn::IfThenElse(hn::Eq(sequence, g), c, output);
        return output;
    };

    const hn::ScalableTag<uint8_t> d;
    hn::Transform1(d, output, length, sequence, transform);

    auto retVal = reinterpret_cast<char *>(output);
    std::string reversed(retVal, length);
    return reversed;
}

Comparisons:

Run on (16 X 3693.06 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x8)
  L1 Instruction 64 KiB (x8)
  L2 Unified 512 KiB (x8)
  L3 Unified 8192 KiB (x2)
Load Average: 0.22, 0.34, 0.25
-------------------------------------------------------------------
Benchmark                         Time             CPU   Iterations
-------------------------------------------------------------------
BM_SequenceReverseSerial        296 ns          304 ns      2141852
BM_SequenceReverseHwy          37.5 ns         38.6 ns     18777781

In order to implement this into the repo we would need to a few things:

  • change build system to cmake. This has a number of benefits including making cross platforms build easier. Currently the CI doesn't build properly on ARM64 properly.
  • Add vcpkg which depends on cmake. This make it much easier to install third party libraries. I used https://github.com/google/highway so that the code is portable (works on ARM64 and AMD64, etc).

@sfchen
Copy link
Member

sfchen commented Dec 2, 2024

This seems very promising! Have you ever tried the arm-arch ?

@chinwobble
Copy link
Contributor Author

chinwobble commented Dec 2, 2024

This seems very promising! Have you ever tried the arm-arch ?

Yes I have tested this in CI with mac-os ARM64.
The output looks like this. (Not sure why the clock speed is so slow).

Running ./builds/ninja-multi-vcpkg/Release/bench
Run on (3 X 24 MHz CPU s)
CPU Caches:
  L1 Data 128 KiB
  L1 Instruction 192 KiB
  L2 Unified 12288 KiB (x3)
Load Average: 2.49, 5.27, 5.95
-------------------------------------------------------------------
Benchmark                         Time             CPU   Iterations
-------------------------------------------------------------------
BM_SequenceReverseSerial        222 ns          220 ns      3102479
BM_SequenceReverseHwy          32.5 ns         32.1 ns     21874727

@chinwobble
Copy link
Contributor Author

chinwobble commented Dec 3, 2024

@sfchen would you be open me changing this repo or fastplong use CMake so that we can manage external dependencies and make linux / mac builds easier?

I pushed a branch and the build looks like this.
You can see samples of adding external dependencies and compiler flags.
https://github.com/chinwobble/CppCMakeVcpkgTemplate/blob/fastplong/CMakeLists.txt#L19-L26

@sfchen
Copy link
Member

sfchen commented Dec 3, 2024

@sfchen would you be open me changing this repo or fastplong use CMake so that we can manage external dependencies and make linux / mac builds easier?

I pushed a branch and the build looks like this. You can see samples of adding external dependencies and compiler flags. https://github.com/chinwobble/CppCMakeVcpkgTemplate/blob/fastplong/CMakeLists.txt#L19-L26

Hi, thanks for your suggestion. What kind of external dependencies you want to use? I was always trying to keep the simplicity of building such tools, so I didn't use CMake. You know, most of the users are with biological backgrounds, so simplicity is very important.

@chinwobble
Copy link
Contributor Author

chinwobble commented Dec 4, 2024

@sfchen would you be open me changing this repo or fastplong use CMake so that we can manage external dependencies and make linux / mac builds easier?
I pushed a branch and the build looks like this. You can see samples of adding external dependencies and compiler flags. https://github.com/chinwobble/CppCMakeVcpkgTemplate/blob/fastplong/CMakeLists.txt#L19-L26

Hi, thanks for your suggestion. What kind of external dependencies you want to use? I was always trying to keep the simplicity of building such tools, so I didn't use CMake. You know, most of the users are with biological backgrounds, so simplicity is very important.

Hey thanks for your response. I totally understand your emphasis on simplicity and potentially removing external dependencies for contributors.

Here are some problems and solutions to address your concerns:

  1. Problem: Existing projects that use SIMD use architecture specific (x86) intrinsics that are difficult to write and don't work on ARM64. For example
    https://github.com/MikkelSchubert/adapterremoval
    Solution: https://github.com/google/highway solves this problem by providing higher level algorithms that are easy to write and work across many architectures.
  2. Problem: Adding more tools such as cmake complicates the build process and makes it more difficult to contribute to this project.
    When I started looking into this project I couldn't get it to compile on ubuntu after cloning.
    The project took some time for me to setup, there was no debugging / IDE support.
    Solution: I can help you make this project easier to use by adding a CD pipeline to publish the binaries so that "users" can use this tool without compiling themselves. A more ambitious solution for making this tool user friendly is to build a GUI around this tool - that would be quite difficult to setup without cmake managing external dependencies such as imGUI.
    For developers, its much easier to debug issues if you can use the debugging, run run specific tests (see screenshot below).
  3. Does cmake provide any benefits that can't be achieved by just using the Makefile?
    Installing third party dependencies directly is difficult for some libraries that are more than header only libraries.
    The libraries that I think will be useful:
make

I think this can be achieved by putting all commands needed to setup cmake in the Makefile.
This would be much easier to achieve if this repo contained everything we need to build including external dependencies rather than requiring the user to install it separately.
This is how https://github.com/neovim/neovim is setup.

image

@chinwobble
Copy link
Contributor Author

@sfchen I've started adding SIMD and fixed the windows build on this PR.
OpenGene/fastplong#6
Can you have a look?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants