add float8 and bfloat8 conversion functions #6495
chloeeyi seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account. You have signed the CLA already but the status is still pending? Let us recheck it.
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6495      +/-   ##
==========================================
- Coverage   93.27%   92.77%   -0.50%
==========================================
  Files         845      807      -38
  Lines      266119   255130   -10989
==========================================
- Hits       248222   236702   -11520
- Misses      17897    18428     +531
Pull request overview
This PR adds float8 and bfloat8 conversion functions to support 8-bit floating-point formats. The implementation provides bidirectional conversions between float16 and two 8-bit formats: E4M3 (1-bit sign, 4-bit exponent, 3-bit mantissa) and E5M2 (1-bit sign, 5-bit exponent, 2-bit mantissa).
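To make the two bit layouts concrete, here is a minimal sketch that splits an 8-bit pattern into its sign, exponent, and mantissa fields and decodes finite values. The names `Fp8Fields`, `split_fp8`, and `fp8_finite_value` are hypothetical and are not the functions added by this PR. The exponent biases (7 for E4M3, 15 for E5M2) are an assumption based on common convention, and the sketch deliberately does not interpret infinity or NaN encodings, since those differ between E4M3 variants.

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical helper types and names, for illustration only.
struct Fp8Fields
{
    unsigned sign;     // 1 bit
    unsigned exponent; // 4 bits (E4M3) or 5 bits (E5M2)
    unsigned mantissa; // 3 bits (E4M3) or 2 bits (E5M2)
};

// Split an 8-bit pattern given the mantissa width (3 for E4M3, 2 for E5M2).
inline Fp8Fields split_fp8(uint8_t bits, int mantissa_bits)
{
    Fp8Fields f;
    f.sign = bits >> 7;
    f.exponent = (bits & 0x7Fu) >> mantissa_bits;
    f.mantissa = bits & ((1u << mantissa_bits) - 1u);
    return f;
}

// Decode a pattern as a finite value. Assumed biases: 7 (E4M3), 15 (E5M2).
// Special encodings (infinity, NaN) are NOT handled here.
inline float fp8_finite_value(uint8_t bits, int mantissa_bits, int bias)
{
    const Fp8Fields f = split_fp8(bits, mantissa_bits);
    const float frac = static_cast<float>(f.mantissa) / static_cast<float>(1 << mantissa_bits);
    const float mag = (f.exponent == 0)
                          ? std::ldexp(frac, 1 - bias) // subnormal: no implicit leading 1
                          : std::ldexp(1.0f + frac, static_cast<int>(f.exponent) - bias);
    return f.sign ? -mag : mag;
}
```

Under these assumptions, the pattern `0x38` decodes to 1.0 as E4M3 (`fp8_finite_value(0x38, 3, 7)`), while `0x3C` decodes to 1.0 as E5M2 (`fp8_finite_value(0x3C, 2, 15)`). E4M3 trades range for one extra bit of precision.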
Key changes:
- Added float16 ↔ float8 E4M3 conversion functions with proper handling of special values
- Added float16 ↔ bfloat8 E5M2 inline conversion functions using direct bit truncation/extension
- Implemented proper handling for edge cases including zero, infinity, NaN, overflow, and underflow
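The "direct bit truncation/extension" approach for E5M2 works because float16 (1 sign, 5 exponent, 10 mantissa bits) and E5M2 share the same sign and exponent fields, so an E5M2 value is simply the top byte of a float16 bit pattern. The following is a minimal sketch of that idea operating on raw `uint16_t` bits. The function names are hypothetical and may not match the inline functions in `src/mat.h`, and the PR's actual rounding behavior is not shown in this summary, so plain truncation is an assumption here.

```cpp
#include <cstdint>

// Hypothetical names, for illustration only; not the PR's API.

// Truncate: keep sign, exponent, and the top 2 mantissa bits.
// Dropping the low 8 mantissa bits rounds toward zero.
inline uint8_t fp16_bits_to_e5m2_trunc(uint16_t h)
{
    return static_cast<uint8_t>(h >> 8);
}

// Extend: put the 8 bits back in the top byte and zero-fill
// the low 8 mantissa bits. This direction is exact.
inline uint16_t e5m2_to_fp16_bits(uint8_t b)
{
    return static_cast<uint16_t>(static_cast<uint16_t>(b) << 8);
}
```

For example, float16 1.0 (`0x3C00`) truncates to `0x3C` and extends back to `0x3C00`. One caveat of plain truncation: a float16 NaN whose payload lives only in the low 8 mantissa bits (such as `0x7C01`) becomes `0x7C`, which is the infinity pattern, so a careful implementation would need to check for NaN before shifting.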
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| src/mat.h | Added function declarations and inline implementations for float8 and bfloat8 conversion functions |
| src/mat.cpp | Implemented float16_to_float8 and float8_to_float16 with comprehensive special value handling |
Thanks for your contribution!
No description provided.