[TASK] Add A New Quant Ball for FP32-MXFP Conversion #26
Open
Description
Deliverables
- Add an MXFP ball RTL implementation in the prototype lib (under the arch path).
- A Pull Request (PR) containing a test written in C for this operation and a README to introduce your design.
- Report the performance results in this issue.
Task Description
- MXFP is a lower-precision floating-point representation designed to reduce data size and simplify downstream computation. Using MXFP can improve throughput and hardware efficiency in bandwidth-sensitive workloads while maintaining acceptable numerical quality for many ML scenarios.
- To learn about this format and its variants, start with the paper "With Shared Microexponents, A Little Shifting Goes a Long Way".
- As we envisage it, an FP32 matrix will be loaded into the banks. Your custom MXFP instruction will then read the data from one bank into the ball you are to implement, and the ball will write its output to another bank.
- You can refer to the previous Pull Request (Completed the development of ReluBall and further improved the operation manual #6) for the detailed implementation.