


Simply make AI models faster, cheaper, smaller, greener!


Documentation





Introduction

Pruna is a model optimization framework built for developers, enabling you to deliver faster, more efficient models with minimal overhead. It provides a comprehensive suite of compression algorithms, including caching, quantization, pruning, distillation, and compilation, to make your models:

  • Faster: Accelerate inference times through advanced optimization techniques
  • Smaller: Reduce model size while maintaining quality
  • Cheaper: Lower computational costs and resource requirements
  • Greener: Decrease energy consumption and environmental impact

The toolkit is designed with simplicity in mind, requiring just a few lines of code to optimize your models. It supports various model types, including LLMs, diffusion and flow matching models, vision transformers, speech recognition models, and more.

Pruna Pro

To move at top speed, we offer Pruna Pro, our enterprise solution that unlocks advanced optimization features, our OptimizationAgent, priority support, and much more.

Installation

Pruna is currently available for installation on Linux, macOS, and Windows. However, some algorithms impose restrictions on the operating system and might not be available on all platforms.

Before installing, ensure you have:

  • Python 3.9 or higher
  • Optional: CUDA toolkit for GPU support
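
If you want to check these prerequisites programmatically, a quick sanity check along these lines works; the torch import is only needed for the optional CUDA check and assumes PyTorch is already installed:

import sys

assert sys.version_info >= (3, 9), "Pruna requires Python 3.9 or higher"

# Optional: verify that a CUDA-capable GPU is visible (assumes PyTorch is installed)
import torch
print("CUDA available:", torch.cuda.is_available())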

Option 1: Install Pruna using pip

Pruna is available on PyPI, so you can install it using pip:

pip install pruna

Option 2: Install Pruna from source

You can also install Pruna directly from source by cloning the repository and installing the package in editable mode:

git clone https://github.com/pruna-ai/pruna.git
cd pruna
pip install -e .
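
Either way, you can verify that the package is importable:

python -c "import pruna; print('Pruna imported successfully')"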

Quick Start

Before we start: Pruna collects a minimal set of aggregated, non-personal telemetry data to help us identify popular algorithms and improve the product. Telemetry is enabled by default because your participation helps us make Pruna better. However, if you'd prefer not to share this, you can always disable telemetry with:

from pruna.telemetry import set_telemetry_metrics

set_telemetry_metrics(False)  # disable telemetry for current session
set_telemetry_metrics(False, set_as_default=True)  # disable telemetry globally

Getting started with Pruna is easy-peasy pruna-squeezy!

First, load any pre-trained model. Here's an example using Stable Diffusion:

from diffusers import StableDiffusionPipeline
base_model = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")

Then, use Pruna's smash function to optimize your model. You can customize the optimization process using SmashConfig:

from pruna import smash, SmashConfig

# Create and smash your model
smash_config = SmashConfig()
smash_config["cacher"] = "deepcache"
smashed_model = smash(model=base_model, smash_config=smash_config)

Your model is now optimized and you can use it as you would use the original model:

smashed_model("An image of a cute prune.").images[0]

Pruna provides a variety of different compression and optimization algorithms, allowing you to combine different algorithms to get the best possible results:

from pruna import smash, SmashConfig

# Create and smash your model
smash_config = SmashConfig()
smash_config["cacher"] = "deepcache"
smash_config["compiler"] = "stable_fast"
smashed_model = smash(model=base_model, smash_config=smash_config)

You can then use our evaluation interface to measure the performance of your model:

from pruna.evaluation.task import Task
from pruna.evaluation.evaluation_agent import EvaluationAgent
from pruna.data.pruna_datamodule import PrunaDataModule

task = Task("image_generation_quality", datamodule=PrunaDataModule.from_string("LAION256")) 
eval_agent = EvaluationAgent(task) 
eval_agent.evaluate(smashed_model)
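
To put the numbers in context, you can run the same task on the original model and compare the two reports. This sketch simply reuses the interface above on the unoptimized pipeline:

# Evaluate the unoptimized base model with the same task for comparison
base_task = Task("image_generation_quality", datamodule=PrunaDataModule.from_string("LAION256"))
EvaluationAgent(base_task).evaluate(base_model)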

That was the minimal example. Looking for the maximal one? Check out our documentation for an overview of all supported algorithms, as well as our tutorials for more use cases and examples.

Pruna Pro

Pruna has everything you need to get started on optimizing your own models. To push the efficiency of your models even further, we offer Pruna Pro. To give you a glimpse of what is possible with Pruna Pro, let us consider three of the most widely used diffusers pipelines and see how much smaller and faster we can make them. In addition to popular open-source algorithms, we use our proprietary Auto Caching algorithm. We compare the fidelity of the compressed models, where fidelity measures the similarity between images generated by the compressed model and by the original model.

Stable Diffusion XL

For Stable Diffusion XL, we compare Auto Caching with DeepCache (available with Pruna). We combine these caching algorithms with torch.compile to get an additional 9% reduction in inference latency, and we use HQQ 8-bit quantization to reduce the size of the model from 8.8GB to 6.7GB.
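
Auto Caching itself is a Pruna Pro algorithm, but a rough open-source approximation of this stack can be sketched with Pruna's own building blocks. The "torch_compile" and "hqq_diffusers" identifiers below are our assumptions, so check the algorithm overview for the exact names:

from diffusers import StableDiffusionXLPipeline
from pruna import smash, SmashConfig

base_model = StableDiffusionXLPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0")

# Caching + compilation + 8-bit quantization, mirroring the stack described above
smash_config = SmashConfig()
smash_config["cacher"] = "deepcache"
smash_config["compiler"] = "torch_compile"   # assumed identifier
smash_config["quantizer"] = "hqq_diffusers"  # assumed identifier
smashed_model = smash(model=base_model, smash_config=smash_config)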

SDXL Benchmark

FLUX [dev]

For FLUX [dev], we compare Auto Caching with the popular TeaCache algorithm. In this case, we used Stable Fast to reduce the latency of Auto Caching by an additional 13%, and HQQ 8-bit quantization to reduce the size of FLUX from 33GB to 23GB.

FLUX [dev] Benchmark

HunyuanVideo

For HunyuanVideo, we compare Auto Caching with TeaCache. Applying HQQ 8-bit quantization to the model reduced the size from 41GB to 29GB.

HunyuanVideo Benchmark

Algorithm Overview

Since Pruna offers a broad range of compression algorithms, the following table provides an overview of all methods available in Pruna and those exclusive to Pruna Pro. For a detailed description of each algorithm, have a look at our documentation.

| Algorithm            | Pruna Pro | Type      | CPU | GPU | 🤗 Transformers CausalLM | 🤗 Diffusers Pipeline | 🤗 Transformers Whisper | torch Module |
|----------------------|-----------|-----------|-----|-----|--------------------------|-----------------------|-------------------------|--------------|
| AWQ                  |           | quantizer |     | ✓   | ✓                        |                       |                         |              |
| GPTQ                 |           | quantizer |     | ✓   | ✓                        |                       |                         |              |
| HQQ                  |           | quantizer |     | ✓   | ✓                        | ✓                     |                         |              |
| Int8                 |           | quantizer |     | ✓   | ✓                        | ✓                     |                         |              |
| QUANTO               |           | quantizer | ✓   | ✓   | ✓                        | ✓                     |                         |              |
| Torch Dynamic        |           | quantizer | ✓   |     | ✓                        |                       | ✓                       | ✓            |
| HIGGS                | ✓         | quantizer |     | ✓   | ✓                        |                       |                         |              |
| torchao              | ✓         | quantizer | ✓   | ✓   | ✓                        | ✓                     | ✓                       | ✓            |
| PERP                 | ✓         | recoverer | ✓   | ✓   | ✓                        | ✓                     |                         |              |
| c_translate          |           | compiler  |     | ✓   |                          |                       | ✓                       |              |
| IPEX                 | ✓         | compiler  | ✓   |     |                          | ✓                     |                         |              |
| Stable Fast          |           | compiler  |     | ✓   |                          | ✓                     |                         |              |
| torch.compile        |           | compiler  | ✓   | ✓   | ✓                        | ✓                     | ✓                       | ✓            |
| x-fast               | ✓         | compiler  | ✓   | ✓   | ✓                        |                       | ✓                       | ✓            |
| DeepCache¹           |           | cacher    | ✓   | ✓   |                          | ✓                     |                         |              |
| Adaptive Caching     | ✓         | cacher    | ✓   | ✓   |                          | ✓                     |                         |              |
| Auto Caching         | ✓         | cacher    | ✓   | ✓   |                          | ✓                     |                         |              |
| FLUX Caching²        | ✓         | cacher    | ✓   | ✓   |                          | ✓                     |                         |              |
| Periodic Caching     | ✓         | cacher    | ✓   | ✓   |                          | ✓                     |                         |              |
| HYPER³               | ✓         | distiller | ✓   | ✓   |                          | ✓                     |                         |              |
| Structured Pruning   |           | pruner    | ✓   | ✓   |                          |                       |                         | ✓            |
| Unstructured Pruning |           | pruner    | ✓   | ✓   |                          | ✓                     |                         | ✓            |
| ifw                  |           | batcher   |     | ✓   |                          |                       | ✓                       |              |
| ws2t                 |           | batcher   |     | ✓   |                          |                       | ✓                       |              |

¹ Only available for UNet-based diffusers pipelines.
² Only available for FLUX models.
³ Only available for FLUX, SD-XL, SD-v1-4, SD-v1-5, SD-3.5.





FAQ and Troubleshooting

If you cannot find an answer to your question or problem in our documentation, our FAQs, or an existing issue, we are happy to help! You can get help from the Pruna community on Discord, join our Office Hours, or open an issue on GitHub.

Contributors

The Pruna package was made with 💜 by the Pruna AI team. Contribute to the repository to become part of the Pruna family!

Contributors are displayed in a random order to avoid any perceived ranking.

Citation

If you use Pruna in your research, feel free to cite the project! 💜

    @misc{pruna,
      title = {Efficient Machine Learning with Pruna},
      year = {2023},
      note = {Software available from pruna.ai},
      url = {https://www.pruna.ai/}
    }
