AI Model Formats & Types

Model Formats

Model formats determine how a model is stored, optimized, and loaded. The most common ones fall into three groups:

Standard (unquantized) formats hold the original weights of a model, typically in FP32, FP16, or BF16 precision:

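.safetensors – A safe alternative to pickle-based .bin files. It stores only tensor data, so loading it cannot execute arbitrary code. Used widely in the Hugging Face ecosystem.
.pt / .pth – PyTorch's native model format, based on Python pickle.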
.ckpt – Checkpoint format, used in TensorFlow and some older PyTorch models.
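
To make the "stored as plain data" point concrete, here is a minimal sketch of the safetensors layout: an 8-byte little-endian header length, a JSON header describing each tensor, then the raw tensor bytes. It uses only the standard library; the helper names (build_safetensors_bytes, read_safetensors_header) and the tensor name "layer.weight" are made up for illustration and are not part of the safetensors library.

```python
import json
import struct


def build_safetensors_bytes(tensors):
    """Serialize {name: (dtype, shape, raw_bytes)} into the safetensors layout."""
    header = {}
    payload = b""
    offset = 0
    for name, (dtype, shape, raw) in tensors.items():
        # Offsets are relative to the start of the data section.
        header[name] = {
            "dtype": dtype,
            "shape": shape,
            "data_offsets": [offset, offset + len(raw)],
        }
        offset += len(raw)
        payload += raw
    header_bytes = json.dumps(header, separators=(",", ":")).encode("utf-8")
    # 8-byte little-endian length, then the JSON header, then the tensor data.
    return struct.pack("<Q", len(header_bytes)) + header_bytes + payload


def read_safetensors_header(blob):
    """Read the JSON header without touching (or executing) anything else."""
    (header_len,) = struct.unpack("<Q", blob[:8])
    return json.loads(blob[8:8 + header_len].decode("utf-8"))


# Hypothetical 2x2 FP32 tensor, packed as 16 raw bytes.
weights = struct.pack("<4f", 0.1, 0.2, 0.3, 0.4)
blob = build_safetensors_bytes({"layer.weight": ("F32", [2, 2], weights)})
header = read_safetensors_header(blob)
print(header)
```

Because the header is plain JSON and the rest is raw bytes, a loader can inspect tensor names and shapes without unpickling anything, which is the main safety difference from pickle-based .bin, .pt, and .pth files.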

Quantized formats reduce memory usage by storing weights at lower precision while maintaining reasonable accuracy:

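.gguf – The current file format for llama.cpp, supporting various quantization levels (2-bit, 3-bit, 4-bit, etc.).
GGML – The predecessor of GGUF in llama.cpp, now mostly replaced by it (older files often carried a .bin extension).
GPTQ – A post-training quantization method for faster inference on GPUs. It is a method rather than a file extension; GPTQ weights are usually distributed as .safetensors files.
AWQ – Activation-aware Weight Quantization for improved efficiency on GPUs. Like GPTQ, it is a method, and its weights are usually shipped as .safetensors files.
.safetensors (quantized) – The same container format holding quantized weights, as used with quantized Hugging Face models.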
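
The trade-off behind these formats can be sketched in a few lines. The example below uses a simple symmetric absmax scheme on one block of weights; it is an illustrative simplification, not the exact scheme used by GGUF, GPTQ, or AWQ. The 7-billion-parameter model size and the function names are assumptions chosen for the example.

```python
def approx_weight_bytes(n_params, bits_per_weight):
    """Rough size of the weights alone; ignores scales, metadata, and activations."""
    return n_params * bits_per_weight / 8


def quantize_block(values, bits=4):
    """Symmetric absmax quantization of one block to signed integer codes."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit
    peak = max(abs(v) for v in values)
    scale = peak / qmax if peak > 0 else 1.0
    # Round to the nearest code and clamp to the representable range.
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return scale, codes


def dequantize_block(scale, codes):
    """Recover approximate floating-point values from the integer codes."""
    return [c * scale for c in codes]


block = [0.12, -0.5, 0.33, 0.9]
scale, codes = quantize_block(block)
restored = dequantize_block(scale, codes)
max_error = max(abs(a - b) for a, b in zip(block, restored))

print(codes)
print(f"max rounding error: {max_error:.4f} (scale {scale:.4f})")
print(approx_weight_bytes(7_000_000_000, 16) / 1e9, "GB at 16-bit")
print(approx_weight_bytes(7_000_000_000, 4) / 1e9, "GB at 4-bit")
```

Each weight is replaced by a small integer plus a shared per-block scale, so the rounding error per weight is at most half the scale. That bounded error is what "reasonable accuracy" means in practice, and the size estimate shows why a 4-bit model needs roughly a quarter of the memory of a 16-bit one.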

Hardware- and framework-specific formats are optimized for a particular runtime or device:

.tflite – TensorFlow Lite format for mobile and edge devices.
.onnx – Open Neural Network Exchange, used for cross-platform compatibility.
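.trt – TensorRT engine format optimized for NVIDIA GPUs (engine files are also commonly named .engine or .plan).
.mlmodel – Core ML format for Apple devices (newer models use the .mlpackage format).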
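
As a rough illustration of how the groups above map onto file names, the sketch below looks up a file's extension in a small table. The table, the group labels, and the function name describe_model_file are hypothetical and written for this example. An extension is only a hint: a .safetensors file, for instance, may hold either full-precision or quantized weights.

```python
from pathlib import Path

# Hypothetical lookup table built from the formats listed above.
FORMAT_GROUPS = {
    ".safetensors": "standard or quantized (Hugging Face ecosystem)",
    ".pt": "standard (PyTorch)",
    ".pth": "standard (PyTorch)",
    ".ckpt": "standard (checkpoint)",
    ".gguf": "quantized (llama.cpp)",
    ".tflite": "hardware-specific (TensorFlow Lite, mobile and edge)",
    ".onnx": "hardware-specific (cross-platform exchange)",
    ".trt": "hardware-specific (TensorRT, NVIDIA GPUs)",
    ".mlmodel": "hardware-specific (Core ML, Apple devices)",
}


def describe_model_file(filename):
    """Return the format group for a file name, based on its last extension."""
    suffix = Path(filename).suffix.lower()
    return FORMAT_GROUPS.get(suffix, "unknown")


# File names below are made up for the example.
for name in ["llama-7b.Q4_K_M.gguf", "model.safetensors", "classifier.onnx", "notes.txt"]:
    print(f"{name}: {describe_model_file(name)}")
```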

Model Types

Model types refer to the architectures and capabilities of different LLMs. The architecture defines how a model is built and how it processes text:

Decoder-only (Autoregressive models) – Used for text generation (e.g., GPT-4, LLaMA, Mistral).
Encoder-only (Bidirectional models) – Used for understanding and classification (e.g., BERT, RoBERTa).
Encoder-Decoder (Seq2Seq models) – Used for translation, summarization, etc. (e.g., T5, FLAN-T5).
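
One way to picture the difference between these architectures is the attention mask, which controls which tokens each position may look at. The sketch below builds the two basic masks as nested lists (1 = may attend, 0 = blocked). The function names are illustrative and do not come from any particular library.

```python
def causal_mask(n):
    """Decoder-only: position i may attend only to positions 0..i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Encoder-only: every position may attend to every other position."""
    return [[1] * n for _ in range(n)]


print("causal (decoder-only):")
for row in causal_mask(4):
    print(row)

print("bidirectional (encoder-only):")
for row in bidirectional_mask(4):
    print(row)
```

The causal mask is what makes autoregressive generation possible: each token is predicted from earlier tokens only. The full mask lets an encoder use context on both sides, which suits understanding and classification. An encoder-decoder model combines the two: a bidirectional encoder over the input and a causal decoder that also attends to the encoder's output.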

In short, formats (such as .gguf or .onnx) define how a model is stored and optimized, while model types describe its architecture and purpose.