You need some help? #3
Hey! Thank you for your interest in contributing to this project, I really appreciate it. The biggest missing piece in DL4S is GPU support. The experimental ArrayFire implementation that I built is far from ideal: initial performance benchmarks are not great, and ArrayFire uses OpenCL, which doesn't run on iOS and isn't well maintained on macOS. Besides that, there are a few missing features, and the convolution and pooling operations would need some performance tuning. Missing features besides GPU support that come to mind right now are:
There's really a lot of stuff that can be added to DL4S. Let me know if you want to work on any one of these or if you find anything else missing that you want to add. |
I think I'll first start small with 1D convolutions and dilated convolutions. Also, in the future, should I open an issue with a proposal of the changes I plan to make, make the changes, and then open a pull request, or should I just open a pull request with the changes? |
Great! You can directly create a PR, no need to create an issue. You can use the existing code that I wrote for 2D convolutions (im2col/col2im) and introduce the dilation factor. The THNN library behind PyTorch has an implementation with dilation here. |
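For illustration, here is a minimal sketch of how a dilation factor enters a 1D im2col lowering. This is a hypothetical, self-contained helper, not DL4S's actual code; spacing the kernel taps by `k * dilation` is the only change relative to an ordinary convolution.

```swift
// Hypothetical 1D im2col with dilation (not DL4S's real implementation).
// `input` is laid out as [channels][width]; each output column gathers
// the dilated receptive field of one output position.
func im2col1d(input: [Float], channels: Int, width: Int,
              kernelSize: Int, stride: Int, dilation: Int, padding: Int) -> [Float] {
    // The effective kernel extent grows with the dilation factor.
    let effectiveKernel = (kernelSize - 1) * dilation + 1
    let outputWidth = (width + 2 * padding - effectiveKernel) / stride + 1
    var columns = [Float](repeating: 0, count: channels * kernelSize * outputWidth)

    for c in 0..<channels {
        for k in 0..<kernelSize {
            for out in 0..<outputWidth {
                // Dilation spaces out the kernel taps.
                let inIdx = out * stride + k * dilation - padding
                if inIdx >= 0 && inIdx < width {
                    columns[(c * kernelSize + k) * outputWidth + out]
                        = input[c * width + inIdx]
                } // out-of-bounds taps stay zero (implicit zero padding)
            }
        }
    }
    return columns
}
```

Once the columns are materialized, the convolution itself reduces to a matrix multiplication against the flattened kernel, which is the same trick the existing 2D path uses.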
Awesome! Thanks for the tip, I'll get started working on it! |
I'm very happy to see that people still want to push Swift forward in writing deep learning algorithms (even after Google stopped S4TF project). 🙂 |
I definitely want to help! Check out ARHeadsetKit - it makes heavy use of GPGPU and is entirely in Swift. I have an aspiration to revive Swift for TensorFlow, and I had no idea this repo existed! I'm currently working on tutorials and publicity for ARHeadsetKit, but several months down the road, I think this would be an epic way to delve into machine learning! |
Lack of (real) enthusiasm and inability to use S4TF for everyday tasks were the reason why it failed. I'm sure we can change its fate and get something like Tensor<Half, GPU> running on an iPhone! |
MTLLayeredBuffer will be the key to making this possible! |
Thanks for the interest. As my time is pretty limited right now, I am actually quite happy to accept additional maintainers for this project. Let me know if you are interested. Metal support was always on my list, though my experimental Metal implementation (feature/metal) isn't up to the task, and I unfortunately had to shift my focus to other projects. |
There is definitely room to improve on your Metal support. Just scrolling through the only shader file, I can see a lot of areas where the code can be cleaned up. I have a thousand hours of experience using Metal for GPGPU, and I may be the only person who can pull this off. Again, this is months in the future. However, I might just need a break from AR some day and start exploring your repo. Also, do you have Google Cardboard? |
If you need anything, let me know. And yes, I do own a few Cardboards. |
You could be the first person outside my home to ever experience ARHeadsetKit on Google Cardboard! Would you mind downloading AR MultiPendulum? App Link: https://apps.apple.com/app/ar-multipendulum/id1583322801 |
Unless I’m mistaken, this would make DL4S the first ML scripting API besides Python TensorFlow 2.5 to have hardware acceleration on a Mac, and the first ever for iOS. |
I've been following this project too. This project has already made good progress in terms of the constructs of a DL algorithm. Yeah, in the open source world, it would be the first DL library with Metal support on macOS and iOS. But Apple has a few Metal-accelerated DL frameworks like Metal Performance Shaders, Metal Performance Shaders Graph, and ML Compute. And I've worked heavily with the Metal & MPS Graph frameworks. Lol. By the way, about me, I authored the Deep Learning with Swift for TensorFlow book this year. I wanted S4TF to succeed (especially Swift in the DL world). Although I don't have much time to contribute, I'd love to make contributions whenever possible! One suggestion about API design: unlike S4TF, where you have to specify the generic type during initialization, you can follow a much more elegant NumPy-style data type specification (which provides a dtype argument). The same approach is used in MPS, where the argument name is dataType (an enumeration type). Regards |
There have been other Metal accelerated libraries like Serrano. Though none of the libraries I found have a particularly Swifty API and support the dynamic behaviors that DL4S does. Also, there are neural network compilers that target Metal, like TVM. |
Having a generic type solves multiple pain points:
It eliminates some types of errors at compile time. The type system works to our advantage. Erasing the type wouldn't be Swifty.
It allows for different overloads to be chosen at compile time. For example, subscripting with a vector of booleans (UInt8) results in a different method call than subscripting with an index vector (Int32).
Having the data type available at compile time allows for generic specialization, which reduces overhead (one of the key reasons why DL4S outperforms other libraries on the CPU in certain models).
The data type of a tensor conveys information about its role to the developer. A float tensor is used in other ways than an integer tensor. |
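A minimal sketch of the overload-resolution point, using invented names rather than DL4S's actual API: because the element type is generic, the compiler statically picks a different subscript for a boolean mask than for an index vector, and can specialize each path.

```swift
// Invented names, not DL4S's API: subscript overloads chosen at compile time.
struct FakeTensor<Element: Numeric> {
    var elements: [Element]

    // Chosen when subscripting with a boolean (UInt8) mask.
    subscript(mask: [UInt8]) -> [Element] {
        zip(elements, mask).compactMap { $1 != 0 ? $0 : nil }
    }

    // Chosen when subscripting with an index vector.
    subscript(indices: [Int32]) -> [Element] {
        indices.map { elements[Int($0)] }
    }
}

let t = FakeTensor<Float>(elements: [1, 2, 3, 4])
let masked   = t[[1, 0, 1, 0] as [UInt8]]  // [1.0, 3.0]
let gathered = t[[3, 0] as [Int32]]        // [4.0, 1.0]
```

With a runtime dtype enumeration, both calls would funnel through one type-erased entry point and dispatch would happen per call at runtime instead.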
I’ve rarely ever used MPS - just for some rare debugging cases when authoring a shader would take more time than calling an MPSKernel. Depending on the nature of how tensors are allocated, I might end up making DL4S a wrapper over MPS. This is just speculation, though, because I haven’t taken a serious look at the source code yet.
I would also add support for Float16 and UInt64/Int64 generic types on the GPU, even though they might never be used in practice. I also lack knowledge here, but there might not be MPS support for half-precision NN operations. @RahulBhalley, could you give me some insight into this, since you have worked heavily with MPS before?
|
Also, I snagged the “s4tf” organization name, if you want to join it just for kicks.
|
If there is a massive overhead to communicating with a discrete GPU on Intel Macs, would it be acceptable to have something like “iGPU” and “eGPU” device names (in addition to CPU and GPU)? The reason is that integrated GPUs might be faster than either a discrete GPU or a CPU on some models, and developers would have an easy way to choose either one if available. The main difference would simply be that with iGPU, certain buffers use system shared memory, while with eGPU, they are GPU-private. You might want to mix and match CPU and GPU tensors as well in certain models, which further adds to the need for flexibility in workflows. Alternatively, this might be resolved by setting a static type property on GPU (whether integrated/external is preferred), although that isn't thread safe and takes away from the flexibility I mentioned above. I just want to throw around ideas and see what options I have to explore a few months down the road. |
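On macOS, the integrated/discrete/external distinction is already queryable from Metal, so an iGPU/eGPU device split could be grounded in existing API. A sketch, with the preference order being purely hypothetical:

```swift
import Metal

// macOS-only: enumerate GPUs and classify them.
// `isLowPower` is true for integrated GPUs; `isRemovable` marks eGPUs.
let devices = MTLCopyAllDevices()
let integrated = devices.first { $0.isLowPower }
let external   = devices.first { $0.isRemovable }
let discrete   = devices.first { !$0.isLowPower && !$0.isRemovable }

// Hypothetical preference order, as discussed above:
let preferred = integrated ?? discrete ?? MTLCreateSystemDefaultDevice()!
print(preferred.name, "unified memory:", preferred.hasUnifiedMemory)
```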
That'll be amazing.
@philipturner you should be able to add FP16 precision via MPS (see MPSDataType). I haven't used low-precision computation in MPS but you should be able to provide UInt8-based quantized computation like PyTorch.
Sure, I would love to join it. |
Sure @palle-k. I was just concerned about how much I had to write. But never mind, it's fine if the generic type specification gives an advantage in compiler-level optimization. |
I sent you an invitation to the organization! |
Thanks, joined. |
@RahulBhalley do you have enough time to try learning my framework ARHeadsetKit? It requires Xcode, and I have 3 hours' worth of tutorials so far. The people who have already tried the tutorials said they were really fun, and I think you might enjoy them as well. |
Before you read this comment, note that it's monstrously large and maybe a bit overboard on detail. I have some experience emulating double precision on GPUs, from a past attempt to accelerate MultiPendulum. The attempt failed because dynamic range was the real problem, not precision (in the end, I couldn't even use two parallel CPU cores to accelerate the simulation because of cache incoherence). However, I gained a lot of experience with emulating higher-precision arithmetic. Metal does not allow double-precision arithmetic, even on GPUs that support it natively. Under emulation, a single addition or multiplication takes dozens of GPU clock cycles. However, devices with native double precision already have a fraction of their single-precision throughput, so emulation might actually have comparable performance. On devices with a weak GPU, or workloads that underutilize the GPU, it will be faster to run double-precision arithmetic on the CPU. This would most likely be the case on the Intel Mac mini and iPhone 8. However, when the GPU is massive, like the M1 Max or AMD 6000 series, running large-scale double-precision computations could be faster than the CPU, even under emulation. There might be some special use cases for DL4S where Tensor<Double, GPU> could benefit performance. For example, it could offload work from the CPU in a mixed CPU-GPU tensor workflow. This means there is a reason to bring Tensor<Double, GPU> to this framework.
A few months down the road, I might make a Metal library for emulating double-precision or quad-precision arithmetic on iOS and macOS GPUs. The setup for compiling the GPU shaders is very complicated (see the "Xcode Framework" option for ARHeadsetKit's configuration). So, I would post the open-source MSL code in this repository, but compile it locally and upload the compiled binary. The double-precision shaders would have to be compiled separately anyway, because Metal's fast math option has to stay disabled for the emulation to work, while the rest of the shaders benefit from it. Having support for 64-bit integers and 64-bit floating point numbers on the GPU could be a unique advantage of DL4S, since not many ML frameworks have it outside of the CPU (if any). Just having that feature might motivate people to try DL4S when they wouldn't otherwise, or make DL4S vital to some rare ML use cases. With bfloat16, it is proven that the bottleneck for ML is dynamic range, not precision. Emulated double precision will not let you expand past 8 bits of exponent. If developers expect otherwise, they may meet some unfortunate surprises (it broke MultiPendulum). We would need to at least briefly document this behavior if I add support for Tensor<Double, GPU>. My personal repository for double-precision arithmetic would go into greater detail on edge cases and performance limitations. @palle-k, would this scenario be acceptable for DL4S? |
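For context, double-precision emulation of this kind is usually built from error-free transformations such as Knuth's two-sum, storing each value as an unevaluated sum of two Floats. Below is a minimal Swift sketch of the addition path; the Metal version would be analogous, which is also why fast math must stay off: the compiler would simplify the `(s - a)` style expressions and erase the error terms.

```swift
// "Double-float" emulation sketch: a value is (hi + lo), two Floats.
struct DoubleFloat {
    var hi: Float
    var lo: Float
}

// Knuth's two-sum: returns the rounded sum and its exact rounding error.
func twoSum(_ a: Float, _ b: Float) -> (sum: Float, err: Float) {
    let s = a + b
    let bVirtual = s - a
    let aVirtual = s - bVirtual
    return (s, (a - aVirtual) + (b - bVirtual))
}

func add(_ x: DoubleFloat, _ y: DoubleFloat) -> DoubleFloat {
    let (s, err) = twoSum(x.hi, y.hi)
    let e = err + x.lo + y.lo
    // Renormalize so hi carries the leading bits and lo the residue.
    let (hi, lo) = twoSum(s, e)
    return DoubleFloat(hi: hi, lo: lo)
}
```

Note how one emulated add costs roughly a dozen hardware operations, and the exponent field is still Float's 8 bits, which is exactly the dynamic-range caveat above.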
@RahulBhalley could you make your s4tf membership public? I can't change your visibility myself. Here's a picture in case you aren't familiar with the process: |
Done. |
I may also add DocC documentation to enhance the developer experience in Xcode and help with the documentation hosted on Jazzy. I would also make some super awesome Xcode tutorials for this framework, like the ones for ARHeadsetKit. What do both of you think about this? I wish that @palle-k would at least respond to say he read my comments. I have thrown several major ideas there and asked for his feedback, but all I have got is silence. I really don’t mean to be rude, but could you at least add a comment to this thread if you have the time? |
@philipturner Sorry for not replying that often, I really wish I had more time to invest in this project. All these ideas sound great. Though I guess the most important part would be getting the basic GPU computation pipeline running to the point where we can get performance comparable to other GPU accelerated frameworks. I've also listed a bunch of things that need to be done further up this thread. These especially include improved subscripting. Double precision is rarely used in ML. I don't have any specific use cases that would benefit from this amount of precision. However, if it is something that is important to a specific use case of yours, then feel free to implement it. As you are familiar with GPU programming I would assume your time brings the most value there. |
Thank you so much!
|
I expect to begin work on this sooner than I anticipated - November or December 2021. @RahulBhalley, generally how busy are you compared to @palle-k? I might benefit from having you routinely validate my commits and contribute ideas. I'll prioritize getting it up and running first, postponing double and half precision until a time when I have absolutely nothing else to do. I looked over the introduction a second time, and there is a lot of work we need to do to get the GPU up and running. Everything from logging on DL4S-Tensorboard to the training algorithm needs to be heavily modified for the GPU, although the public API will not change. From Apple's documentation, it isn't clear whether MPS has good acceleration for RNNs. @RahulBhalley, what is your experience with this? I will still research its performance, but your feedback will help me prioritize the tasks of creating a custom graph compiler and RNN shaders. RNNs will be especially difficult, but GPGPU is all about thinking outside of the box to maximize ALU utilization. I may look into performing multiple batches simultaneously, pipelining, etc. We may also need to predict the time span of command buffers to split up work correctly. Too short may waste time on kernel calls, while too long could cause Metal to abort a command buffer. I'm not 100% sure, but that might be an iOS-specific problem. @RahulBhalley, what do you think about these ideas for implementing Metal in DL4S? |
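One way to ground that splitting heuristic: Metal reports per-command-buffer GPU timestamps, so a scheduler could measure each submission and adjust how much work it batches next time. A sketch, with the target duration being an invented number:

```swift
import Metal

let device = MTLCreateSystemDefaultDevice()!
let queue = device.makeCommandQueue()!

let commandBuffer = queue.makeCommandBuffer()!
// ... encode a slice of the graph's compute work here ...
commandBuffer.addCompletedHandler { cb in
    // Actual GPU-side execution time of this command buffer.
    let seconds = cb.gpuEndTime - cb.gpuStartTime
    // Hypothetical feedback: aim for a few milliseconds per buffer,
    // short enough to avoid aborts, long enough to amortize submission.
    print("GPU time: \(seconds * 1000) ms")
}
commandBuffer.commit()
```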
I will be busy, but I can help with commit validation.
Absolutely, I also believe getting float 32-bit precision tensor ops to run on GPU/NPU should be a priority for now.
I have experience with convolutional networks using the Metal Performance Shaders Graph framework. I've never used RNNs with it. But I just wanna point out: I wrote a UNet in MPS Graph, and the inference speed difference between a CoreML model (ofc we can't use this format, it's a different world) and MPS Graph was huge, even though CoreML uses MPS in the backend. |
Thanks for the information. That model operates on audio, which is a textbook case of an RNN. From the image you showed, Apple didn't put in any real effort into RNNs with MPS. Python TensorFlow actually called MPS under the hood, so they have similar performance. However, Core ML has an entirely different runtime, which clearly optimizes RNNs for the GPU. CoreML was also made by Apple, which shows that MPS is only slow because they don't care to accelerate it (typical of how Apple treats Metal developer tools). You need to pipeline the data to fully utilize the GPU, which might be easier in inference mode—what CoreML is designed for. It also isn't the fault of the GPU itself. As demonstrated by the 5th and 3rd logs, CPU slows down the model and NPU has no effect. That means GPU is faster than CPU, and CoreML didn't even touch the NPU. I just got an M1 Max MBP, which has basically infinite computing power for some applications. It runs 8192 32-bit operations per clock cycle, and we need pipelining and concurrent execution of training data within a batch to even make a dent in that number. We will definitely be making a graph compiler from scratch. |
I think you misunderstood (or maybe not). This model (UNet, a convolutional neural network) does operate on audio in this case, but on a spectrogram representation, i.e. an image, not a sequence. See the input tensor dimensions.
So this experiment was not linked to RNNs in any way, only to convolution ops. |
I did misunderstand. I just saw that it was running on audio. I don't really know my terminology when it comes to ML. Even though it didn't run RNNs, that data highlights inefficiencies in Apple's implementation of CNNs. The reason I was concerned about RNNs is that they are small, which poses problems with underutilization of GPU ALUs. Other than that, the only real concern is that CoreML prefers to run them on the CPU, which is a bad sign for GPGPU. That may only be a problem for inference mode, where you aren't running samples in a batch simultaneously to maximize ALU utilization. I assume you gave me the closest log to RNNs that you could find. Is there any other knowledge you have about the topic that might help me? |
Hm, I think it could be interesting to have a dynamic device, at least on Apple silicon with its shared memory, where we would select the best device (CPU, GPU, NPU) automatically so that the best performance is achieved for a given model. Though I guess that could be difficult. |
The dynamic device is good, and I was thinking along those lines previously. However, the NPU is a big problem, and it won't work well. You can't access it quickly through any API—the best you can get is to kindly ask CoreML to use it, but you have to compile an MLModel first, which has a massive overhead. Also, the GPU outperforms the NPU on 16-bit (but not 8-bit) math on the M1 Pro/Max. And the NPU doesn't do 32-bit math. |
I could hack into CoreML and bypass protections on MLModels, letting DL4S update weights without recompiling it. I would have to be careful about cache synchronization, and it is not a feature Apple would allow in an app that goes on the App Store. Although I've always wanted to try doing that. |
In the ML Compute framework, we can access the Apple Neural Engine (ANE) device (if available) and run the neural network on it directly. See MLCDevice. That's how Apple's TensorFlow fork runs on the GPU/NPU of M-series Macs: it's built on top of ML Compute in the backend. Maybe we can use ML Compute in the backend (instead of Metal Performance Shaders Graph) for GPU/NPU acceleration, but I think this would probably restrict DL4S to ML Compute's pre-defined neural layers and tensor transformation ops (I couldn't create new layers by inheriting from the MLCLayer class). Whereas MPS Graph would allow extending MPSGraph to add new ops built on top of the pre-defined MPSGraph ops. |
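A sketch of the MLCDevice route, assuming the .ane device type from Apple's ML Compute docs; the graph contents here are placeholders:

```swift
import MLCompute

// MLCDevice(type:) returns nil when the requested device is unavailable.
guard let ane = MLCDevice(type: .ane) else {
    fatalError("No ANE available; fall back to .gpu or .cpu")
}

let graph = MLCGraph()
// ... add pre-defined MLCLayer-based layers to `graph` here ...
let inference = MLCInferenceGraph(graphObjects: [graph])
let compiled = inference.compile(options: [], device: ane)
print("Compiled for ANE:", compiled)
```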
No way! They just added that in iOS 15! That’s awesome! However, the ANE’s inability to do 32-bit math means it can’t be used for general-purpose learning. It’s in MLCompute so the framework can be used for inference, as an easier alternative to CoreML. |
How does Metal Performance Shaders use ANE? I don’t see that anywhere in the documentation. |
No, ML Compute uses ANE. |
Being able to retrieve data from a shared MTLBuffer (compared to private) has multiple benefits:
However, I also have counter-arguments to those reasons:
Shared memory is insanely slow on discrete GPUs, which poses a problem. I might optimize everything for Apple silicon, which means the code also runs well on integrated Intel GPUs. Since I use an Apple silicon Mac, it would be much easier not to worry about supporting discrete GPUs when creating the initial Metal support. After Metal works for the first time, I would then add support for using private Metal buffers, with a separate execution path optimized for discrete GPUs. Once supported by DL4S, internal code would switch to using the discrete GPU by default. However, there would be a way to use the integrated one instead (e.g. by modifying a static property on GPUDevice). On Intel Macs, the integrated Intel GPU has its own private memory despite being on the same chip as the CPU. So, there should be an option to choose the shared/private MTLBuffer execution path there, with private being the default (once supported). That path should be available on Intel iGPUs, but not on Apple GPUs (DL4S will just ignore it if you select private). In summary, the following will likely take place in the implementation of Metal into DL4S:
What do you all think about this? |
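The shared-versus-private split described above maps directly onto Metal's storage modes. A sketch of how allocation and upload could branch, with the helper names invented:

```swift
import Metal

// Unified-memory GPUs (Apple silicon) get shared buffers the CPU can
// touch directly; discrete GPUs get private buffers for bandwidth.
func makeTensorBuffer(device: MTLDevice, byteCount: Int) -> MTLBuffer {
    let options: MTLResourceOptions =
        device.hasUnifiedMemory ? .storageModeShared : .storageModePrivate
    return device.makeBuffer(length: byteCount, options: options)!
}

// Uploading to a private buffer goes through a shared staging buffer
// and an explicit blit; on shared buffers this whole step disappears.
func upload(_ data: [Float], to dst: MTLBuffer,
            device: MTLDevice, queue: MTLCommandQueue) {
    let byteCount = data.count * MemoryLayout<Float>.stride
    let staging = device.makeBuffer(bytes: data, length: byteCount,
                                    options: .storageModeShared)!
    let cb = queue.makeCommandBuffer()!
    let blit = cb.makeBlitCommandEncoder()!
    blit.copy(from: staging, sourceOffset: 0,
              to: dst, destinationOffset: 0, size: byteCount)
    blit.endEncoding()
    cb.commit()
}
```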
Small change to my above plans: you can select the dGPU upon release. You might want to compare iGPU vs dGPU performance, or some ops might be mostly ALU-intensive or have weights that fall into the L1/L2 cache on the dGPU. However, the iGPU will be selected by default, since the private MTLBuffer execution path doesn't work yet. Another change: I saw a potential use for having a buffer's contents as an UnsafeMutableRawPointer: scanning a tensor to see whether any of its elements are NaN. Unless a buffer is extremely large, the cost of sending commands to the GPU outweighs the benefit of GPU acceleration. The GPU type may be semi-dynamic in some situations, where basic operations like the one described above clearly run faster on the CPU. The need to offload these kinds of operations from the GPU back to the CPU depends on where they are located in a graph. If they're sandwiched between two other GPU operations, keep them on the GPU. |
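The NaN scan mentioned above only works for shared buffers, where contents() exposes the memory to the CPU. A sketch:

```swift
import Metal

// Valid only for .storageModeShared buffers; contents() on a private
// buffer is not CPU-accessible.
func containsNaN(_ buffer: MTLBuffer, count: Int) -> Bool {
    let ptr = buffer.contents().bindMemory(to: Float.self, capacity: count)
    return UnsafeBufferPointer(start: ptr, count: count).contains { $0.isNaN }
}
```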
In my personal fork of DL4S, I'm going to rely on ARHeadsetKit's utility functions for a while. I need to sort out some solution for importing them, whether that's copying and pasting or separating them into another framework. Relying on it as-is would risk a bad commit to ARHeadsetKit ruining DL4S, and would force upgrades to the following:
I have plenty of time to figure this out. Just don't be alarmed when you see that in the Swift package. @palle-k if none of what I described sounds like an issue to you, then that may save me time when implementing Metal support. |
@palle-k, you always put a space around the range operator in your source code. May I change all instances of this to the un-spaced style? I'm going to start work on implementing Metal support now. It will take days to fully absorb your code, so you have plenty of time to respond. I expect my first task to be splitting up the Metal shaders into multiple files. I will reformat the kernel parameter lists to match the style in ARHeadsetKit. The style is much cleaner and groups related parameters near each other. However, I may make the modification of putting parameter lists on the second line of the function declaration, reducing the horizontal size of source files. Since I'm no longer working on ARHeadsetKit, I can modify my convention a little bit. |
@philipturner I am not very attached to either formatting option. Also, I started this project, but with more and more community contributions, I am very much open to decisions made by others. I'll add SwiftLint to this project, so we'll have a consistent code style across all contributions. |
@palle-k What is your opinion on permanently importing ARHeadsetKit? |
I just modified the range literal style on my branch.
Hopefully, we can get @ryendu contributing again. I tried reaching out to him, but I haven't gotten a response yet. |
@philipturner what would your rationale be for importing ARHeadsetKit? I feel like it should be the other way around, because ARHeadsetKit focuses on a particular type of application, whereas DL4S should be as general-purpose as possible. I saw that ARHeadsetKit has some generally applicable extensions to Metal, which can be useful for DL4S. Wouldn't the best way be to factor this core out of ARHeadsetKit into a separate library, and then make both ARHeadsetKit and DL4S depend on that library? Also, any opinions on creating a DL4S organization here on GitHub? |
That's exactly what I was thinking. I figured you wouldn't agree with that, but I wanted to ask just in case. I originally thought of separating the utilities from ARHeadsetKit. I decided not to, since the utility functions might be a reason for people to try ARHeadsetKit. I even submitted ARHeadsetKit to iCHAIT/awesome-macOS on the strength of the utilities. The best course of action is to break the dependency on ARHeadsetKit sometime before merging my fork. There are three ways this could happen:
Perhaps I can name it ARHeadsetUtil.
That sounds great! However, I think DL4S should continue to be under your username, otherwise several projects might break. I just emailed the person who suggested Mish activation, so hopefully, they'll respond. |
@palle-k, @RahulBhalley, and @ryendu would you mind starring ARHeadsetKit? |
@palle-k you have Google Cardboard. This isn't really related to DL4S, but is there any time in the coming months when you think you can go through ARHeadsetKit's tutorials? In an ideal situation, they will take 6 hours to complete. |
I think I might actually refactor ARHeadsetKit tomorrow. That would get it more quickly reviewed on awesome-macOS. |
I'm debating whether to restrict ARHeadsetUtil (the framework that includes just the bare minimum of utility functions from ARHeadsetKit) to Swift tools 5.5. The current Swift tools version on DL4S is 5.2, so I suggest that we bump it up to 5.5. DL4S needs to be at least as high as ARHeadsetUtil; otherwise it will fail to compile in older versions of Xcode. The reason is that DocC documentation fails to build if my package manifest says anything lower than 5.5. Nothing shows up in the Xcode documentation viewer, and Quick Help is also slightly different for MTLLayeredBuffer (although I can fix that easily). Not bumping it up to 5.5 would also prevent us (DL4S) from making our own DocC documentation and possibly Xcode tutorials in the future (like SwiftUI's and ARHeadsetKit's). @palle-k, how attached are you to keeping support for older versions of Xcode? If you have any doubts, I'll bump ARHeadsetUtil's version down to 5.0. |
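The bump itself is a one-line change at the top of the package manifest. A minimal sketch, not DL4S's actual manifest:

```swift
// swift-tools-version:5.5
// The tools-version comment must be the first line of Package.swift;
// 5.5 is what DocC and Xcode tutorial support require here.
import PackageDescription

let package = Package(
    name: "DL4S",
    products: [
        .library(name: "DL4S", targets: ["DL4S"]),
    ],
    targets: [
        .target(name: "DL4S"),
    ]
)
```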
I don't care that much about the minimum supported Swift version. Feel free to bump it to 5.5 |
Two more questions: I saw a dictionary literal with the convention from RayWenderlich (a space before the colon). Should I keep that style? I'm thinking about merging my current changes to DL4S today or tomorrow; I'd rather validate changes bit by bit for as long as possible. Soon, I'll have to merge commits into the feature/metal branch on DL4S, but what I have now is harmless. Also, the DL4S logo looks off in dark mode: the black "4" doesn't blend well with the dark background on GitHub. I'm going to make a second version of the logo, which GitHub will select to render in dark mode. I'll change the colors while preserving everything else, and let you know as soon as I'm done. I'll experiment with getting the light vs. dark mode picture selection set up in my fork. |
I have no hard feelings about code formatting, you can use the space before the colon in dictionary literals. Thank you very much for your contributions. |
Hey there @palle-k, I really love this project and I think it is really awesome! I'd really like to help out and contribute, but DL4S seems pretty complete to me. Is there anything that I could help with? I'm really familiar with Swift, and I'm a bit new to open source.