Replies: 16 comments 51 replies
-
https://github.com/IntelLabs/FP8-Emulation-Toolkit looks like a good starting point I found a few days ago... I'll have a look.
-
https://arxiv.org/pdf/2209.05433v2 I'll have to read it more carefully, but it looks like we need to apply a scale factor too. Possibly it is common to all weights, or per weight.
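To make the scale idea concrete (my reading of the scheme, not a quote from the paper): with a single per-tensor scale you store s = max_i |w_i| / FP8_MAX (448 for E4M3), encode q_i = fp8(w_i / s), and at compute time use w_i ≈ s · q_i. A per-channel or per-block variant just keeps one such s per row or per group of weights instead of one for the whole tensor.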
-
static float from_fp8(uint8_t fp8) {
    union {
        float f;
        uint32_t i;
    } u;
    const uint32_t t = fp8;
    if ((fp8 & 0x7F) == 0x7F) return (fp8==0xFF)?(-NAN):(+NAN); // E4M3: 0x7F / 0xFF encode NaN (no infinities)
    int exp = ((fp8 >> 3) & 0x0F) - 7;
    if (fp8 & 0x78) { // normal: exponent field non-zero
        u.i = (t & 7) << 20; // mantissa: bit 2-0 -> 22-20
    } else if (fp8 & 0x04) { // subnormal: renormalise, leading 1 at mantissa bit 2
        u.i = (t & 3) << 21;
        exp = -7;
    } else if (fp8 & 0x02) { // leading 1 at mantissa bit 1
        u.i = (t & 1) << 22;
        exp = -8;
    } else if (fp8 & 0x01) { // leading 1 at mantissa bit 0
        u.i = 0;
        exp = -9;
    } else { // zero
        u.i = 0;
        exp = -127;
    }
    u.i |= (exp + 127) << 23; // exponent
    u.i |= (t & 128) << 24; // sign
    return u.f;
}

It could also be written with more arithmetic and fewer branches; I am not sure that would be better. At least the calculation is correct (or at least the same as the current one).
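For what it is worth, here is a quick sanity check against a few hand-computed E4M3 values (just a sketch; it assumes the from_fp8 above is in scope):

#include <cassert>
#include <cstdint>
#include <cstdio>

int main() {
    // E4M3 layout: sign 1 bit | exponent 4 bits (bias 7) | mantissa 3 bits
    assert(from_fp8(0x00) ==  0.0f);    // +0
    assert(from_fp8(0x38) ==  1.0f);    // e=7-7=0, m=0
    assert(from_fp8(0x3C) ==  1.5f);    // e=0, m=4/8
    assert(from_fp8(0xC0) == -2.0f);    // sign set, e=1, m=0
    assert(from_fp8(0x7E) ==  448.0f);  // largest finite E4M3: 1.75 * 2^8
    assert(from_fp8(0x01) ==  0x1p-9f); // smallest subnormal: 2^-9
    printf("from_fp8: ok\n");
    return 0;
}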
-
I do a few statistical calculations on weights (mean, standard deviation, min, max)...
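Something along these lines, for example (a rough sketch of the idea, assuming a non-empty weight vector, not the actual code used):

#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// per-tensor statistics, to see where the FP8 range should be placed
static void weight_stats(const std::vector<float> & w) {
    double sum = 0.0, sum2 = 0.0;
    float mn = w[0], mx = w[0];
    for (float x : w) {
        sum  += x;
        sum2 += (double) x * x;
        mn = std::min(mn, x);
        mx = std::max(mx, x);
    }
    const double mean = sum / w.size();
    const double var  = sum2 / w.size() - mean * mean;
    printf("mean=%g std=%g min=%g max=%g\n", mean, std::sqrt(std::max(var, 0.0)), mn, mx);
}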
-
I wrote an avx512 vectorized fp8 loader. On my znver4 Threadripper I managed to get prompt processing up from 10 tok/sec to 56 tok/sec on my workstation for Mistral Nemo. See the commit to my fp8 branch here: c10a65c

On the other hand, using BF16 gives me 385.92 tok/sec for prompt processing. The K quants probably go just as fast, and token generation goes even faster. So unless someone can come up with a significantly faster way of dequantizing FP8, I really don't think it has anything to offer us until native CPU hardware support becomes available. Except it is worth supporting, because NVIDIA has hardware support for it. However, I don't own an NVIDIA graphics card that has FP8 yet. If someone is willing to devote the eng resources to implementing CUDA support, then I will merge that and ensure it at least works on CPU too. Although realistically, anyone who chooses FP8 will only be interested in doing it on GPU. There's also the question of what you'd do for AMD GPUs.

As I mentioned, as long as it can be made to work I'm happy, even if it's primarily only good for NVIDIA GPU owners. That's my judgement for now. If anyone wants to volunteer, send me pull requests to my FP8 branch and we'll keep working on it.
-
OK, some benchmarks with this fp8 branch.
I am now using this form for the fp8 -> float conversion:

static float from_fp8_8(uint8_t fp8) {
    union {
        float f;
        uint32_t i;
    } u;
    const uint32_t t = fp8;
    // if ((fp8 & 0x7F) == 0x7F) return (fp8==0xFF)?(-NAN):(+NAN); // not needed here: there is no NaN after quantisation
    auto exp_8 = t & 0x78;
    auto exp_32 = (exp_8+(120<<3))<<20;
    u.i = (t & 7) << 20; // mantissa: bit 2-0 -> 22-20
    u.i |= exp_8 ? exp_32 : (-6 + 127) << 23; // exponent
    if (!exp_8) { u.f -= 1.0/64; } // subnormal correction: subtract the implicit 1.0 * 2^-6
    u.i |= (t & 0x80) << 24; // sign
    return u.f;
}

With it I get this vectorised implementation:

#if defined(__AVX512F__) && defined(__AVX512VL__) && defined(__AVX512BW__)
#include <immintrin.h>
static __m512 llamafile_from_fp8_e4m3_avx512(__m128i fp8_vec) {
    // extract components:
    __m128i expo_8 = _mm_and_si128(fp8_vec, _mm_set1_epi8(0x78));
    __m128i mant_8 = _mm_and_si128(fp8_vec, _mm_set1_epi8(0x07));
    __m128i sign_8 = _mm_and_si128(fp8_vec, _mm_set1_epi8(0x80));
    // denorm mask
    __mmask16 is_denorm = _mm_cmpeq_epi8_mask(expo_8, _mm_set1_epi8(0));
    // convert to 32 bits
    __m512i expo_32 = _mm512_cvtepu8_epi32(expo_8);
    __m512i mant_32 = _mm512_cvtepu8_epi32(mant_8);
    __m512i sign_32 = _mm512_cvtepu8_epi32(sign_8);
    // shift into fp32 bit positions
    expo_32 = _mm512_slli_epi32(_mm512_add_epi32(expo_32, _mm512_set1_epi32(120<<3)), 20);
    mant_32 = _mm512_slli_epi32(mant_32, 20);
    sign_32 = _mm512_slli_epi32(sign_32, 24);
    // correction of the denorm exponent:
    expo_32 = _mm512_mask_blend_epi32(is_denorm, expo_32, _mm512_set1_epi32((-6 + 127) << 23));
    // merge mantissa + exponent
    __m512 result = _mm512_castsi512_ps(_mm512_or_si512(expo_32, mant_32));
    // correction of the denorm mantissa:
    result = _mm512_mask_add_ps(result, is_denorm, result, _mm512_set1_ps(-1.0/64));
    // add sign
    return _mm512_castsi512_ps(_mm512_or_si512(sign_32, _mm512_castps_si512(result)));
}
#endif
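For context, this is roughly how such a kernel could be used to dequantize a row (a sketch only; the row layout and the assumption that n is a multiple of 16 are mine):

// must live under the same AVX512 guard as the kernel above
static void dequant_row_fp8_e4m3_avx512(const uint8_t * x, float * y, int n) {
    for (int i = 0; i < n; i += 16) {
        __m128i v8  = _mm_loadu_si128((const __m128i *) (x + i)); // 16 fp8 bytes
        __m512  v32 = llamafile_from_fp8_e4m3_avx512(v8);         // 16 floats
        _mm512_storeu_ps(y + i, v32);
    }
}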
Now, some benchmarks:
Token generation is now good, and prompt processing is not that bad considering we do not use bf16. I am pretty sure we can get a 2x if we convert fp8 to bf16 and use it for the compute, as is done with the bf16 quant. I may study this branch closely and see if I can add fp8/bf16 in tinyBLAS; it would be a good exercise. Or update my test on blas_bf16 (https://github.com/Djip007/llama.cpp/tree/poc/bf16), but that may take some time with all the refactoring that has been done on llama.cpp 😉. It may be faster to rewrite it on top of the fp8 branch...
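For the fp8 -> bf16 idea, here is a minimal sketch of the narrowing step (my own illustration; the decode itself would still use from_fp8_8 or the AVX512 kernel above):

#include <cstdint>
#include <cstring>

// fp32 -> bf16 by keeping the top 16 bits, with round-to-nearest-even.
// Values decoded from E4M3 have at most 3 mantissa bits set, so for them the
// low 16 bits are already zero and this reduces to a plain truncation.
static uint16_t fp32_to_bf16(float f) {
    uint32_t u;
    std::memcpy(&u, &f, sizeof(u));
    u += 0x7FFF + ((u >> 16) & 1); // round to nearest, ties to even (finite inputs)
    return (uint16_t) (u >> 16);
}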
-
We may want to take some time to read that: it is how FP8 quantisation is performed in vllm (I think...). It looks like they have a per-tensor scale factor.
-
19/10/2024: obsolete!!! It can be "much" better! OK, some more work: I created my own backend to get more control over the compute... (more on that later)
Note: the 3 V2 FP8 decodings have the same perplexity, so we can go with the fastest (V2.2 ...)
-
Some benchmarks.
The llamafile column is the current 0.8.13 release with BF16/Q8/Q6/Q5 compute, for reference.
-
@Djip007 Am I understanding correctly that you got FP8 E4M3 to:
If so, that's outstanding; I am interested in including your work in llamafile. What especially interests me is that you figured out how to make it work with flush to zero. Were those perplexity scores measured while flushing subnormals to zero? What about the benchmarks?

For your kernel shape, what you want to do is pick whatever shape uses the most vector registers without spilling to the stack. For BF16 on AVX512 and NEON, which have 32 registers, that shape was either 5x5 or 8x3. Also, can you share your code?
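(As a rough sketch of that register counting, assuming one accumulator per output element of the tile plus the vectors being streamed in: an RM x RN micro-kernel needs about RM*RN accumulators + RN loads from B + 1 broadcast from A. So 5x5 -> 25 + 5 + 1 = 31 <= 32 registers and 8x3 -> 24 + 3 + 1 = 28 <= 32, while something like 6x5 -> 30 + 5 + 1 = 36 would spill.)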
-
OK, I have published it; it is not finished and is more of a POC.
The load for fp8 can be found here: load_fp8
But it is really experimental, and I want to do much more work on it...
-
Some more news: I have added more FP8 formats to my backend. FP8: (19/10/2024: obsolete!!)
For reference:
G => global scale; A[k,m] => scale (float)
-
I was so focused on speed that I forgot to look at the quality of the quantization. With a single scale per block of 256 weights:
FP8_E3M4_K2 : PPL = 6.346961 ± 0.039010
In this case I have even added fp8 quantisation of "output.weight". Now I have to re-compute all the perplexities 😎
For comparison, Q8_0 : PPL = 6.3445 +/- 0.03898
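For illustration, a minimal sketch of what a per-block scale can look like (my own sketch for E4M3 using the from_fp8 decoder from earlier in the thread, not the actual FP8_E3M4_K2 layout; the brute-force encoder is just for clarity):

#include <cmath>
#include <cstdint>

#define QK_FP8 256
#define FP8_E4M3_MAX 448.0f

struct block_fp8 {
    float   scale;      // one scale per block of QK_FP8 weights
    uint8_t q[QK_FP8];  // E4M3 codes
};

// nearest E4M3 code by exhaustive search over the 256 possible bytes
static uint8_t to_fp8_nearest(float x) {
    uint8_t best = 0;
    float best_err = INFINITY;
    for (int c = 0; c < 256; ++c) {
        const float v = from_fp8((uint8_t) c);
        if (std::isnan(v)) continue;
        const float err = std::fabs(v - x);
        if (err < best_err) { best_err = err; best = (uint8_t) c; }
    }
    return best;
}

static void quantize_block_fp8(const float * x, block_fp8 * b) {
    float amax = 0.0f;
    for (int i = 0; i < QK_FP8; ++i) amax = std::fmax(amax, std::fabs(x[i]));
    const float scale = amax > 0.0f ? amax / FP8_E4M3_MAX : 1.0f;
    b->scale = scale;
    for (int i = 0; i < QK_FP8; ++i) b->q[i] = to_fp8_nearest(x[i] / scale);
}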
-
OK, this is better with "correct" rounding: (zen4 / BF16 / wiki.test.raw / rounding / Mistral-Nemo-Instruct)
Most results are now not far from Q6_K and Q8_0 (some more results in progress). But as you can see, I get the best results with the E3M4 format... which may not be a good point for AMD/Nvidia, which implement E4M3 in hardware (though to be fair, only the weights are quantized; the compute is done after converting to BF16).

The list of tensors quantized to FP8 is:

static constexpr std::list<std::string> LIST_WEIGHT_CONVERT() { return {
    "ffn_down.weight",
    "ffn_gate.weight",
    "ffn_up.weight",
    "attn_k.weight",
    "attn_q.weight",
    "attn_v.weight",
    "attn_output.weight",
    "output.weight",
};}

(Before, I did not quantize "output.weight".)
-
Some more... Meta-Llama-3.1-8B-Instruct (zen4 / BF16 / wiki.test.raw / rounding / output-weight)
Is this what we get when a model is not trained to be FP8-quantization aware?
Mistral-7B-Instruct-v0.3 (zen4 / BF16 / wiki.test.raw / rounding / output-weight)
This one is better!
-
OK, for now I have created a branch with base FP8 support. It is not for speed; I only added the base parts (quantize / convert / dot): Djip007@cbd3abd
I added 4 FP8 formats:
Now I need to make it fast...
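For reference, the dot part in its simplest form could look like this (a scalar reference sketch using the hypothetical block_fp8 layout from my earlier sketch, dequantizing on the fly; a real implementation would of course be vectorized):

static float vec_dot_fp8_f32(const block_fp8 * bx, const float * y, int n) {
    float sum = 0.0f;
    for (int i = 0; i < n; ++i) {
        const block_fp8 * b = bx + i / QK_FP8;
        sum += b->scale * from_fp8(b->q[i % QK_FP8]) * y[i]; // dequantize, then accumulate
    }
    return sum;
}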
-
As the discussion started here: #543 (reply in thread):
There are 2 formats (or 3; E3M4 is given in some papers...).
I would really like to help. Let's have a look at what we can get.
I think the E5M2 format can be converted to FP16 the same way we do BF16 <-> FP32. I'll have a look and get back 😎
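Concretely, E5M2 has the same sign/exponent layout as an IEEE half (only 2 of the 10 mantissa bits are kept), so the widening is just a byte shift, analogous to the BF16 <-> FP32 trick (a sketch):

#include <cstdint>

static uint16_t fp8_e5m2_to_fp16_bits(uint8_t fp8) {
    return (uint16_t) fp8 << 8;   // exact: E5M2 is the top byte of a half
}

static uint8_t fp16_bits_to_fp8_e5m2(uint16_t h) {
    return (uint8_t) (h >> 8);    // truncation; proper rounding would add 0x7F + lsb first
}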