SmashConfig User Manual

SmashConfig is the configuration object in Pruna that specifies how your models should be optimized. This manual explains how to define and use a SmashConfig.

Defining a SmashConfig

Define a SmashConfig using the following code:

from pruna.algorithms.SmashConfig import SmashConfig
smash_config = SmashConfig()

After creating a SmashConfig, you can set the parameters for optimization:

smash_config['task'] = 'text_image_generation'
smash_config['compilers'] = ['diffusers2']

Passing a SmashConfig to the Smash Function

Pass a SmashConfig to the smash function as follows:

from pruna.smash import smash

smashed_model = smash(
    model=pipe,
    api_key='<your-api-key>',  # Replace <your-api-key> with your actual API key
    smash_config=smash_config,
)

SmashConfig Parameters

Task

The task parameter specifies the type of model you want to optimize. Supported tasks include:

  • image_classification: Optimize image classification models.

  • image_instance_segmentation: Optimize instance segmentation models.

  • image_keypoint_detection: Optimize keypoint detection models.

  • image_object_detection: Optimize object detection models.

  • image_semantic_segmentation: Optimize semantic segmentation models.

  • image_image_generation: Optimize image generation models.

  • image_image_inpainting: Optimize image inpainting models.

  • image_image_control: Optimize image control models.

  • image_video_generation: Optimize video generation models.

  • text_image_generation: Optimize text-to-image generation models.

  • text_video_generation: Optimize text-to-video generation models.

  • text_text_generation: Optimize text generation models.

  • text_text_translation: Optimize text translation models.

  • text+image_image_generation: Optimize image generation models conditioned on both text and image inputs.

  • audio_text_transcription: Optimize audio-to-text transcription models.
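
For example, to prepare a configuration for an audio transcription model, set the task key first. This fragment uses only the keys described in this manual:

```python
from pruna.algorithms.SmashConfig import SmashConfig

# The task tells Pruna what kind of model it is optimizing
smash_config = SmashConfig()
smash_config['task'] = 'audio_text_transcription'
```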

Optimization Methods

There are four types of optimization methods:

  1. Compilation - Use smash_config['compilers']

  2. Quantization - Use smash_config['quantizers']

  3. Pruning - Use smash_config['pruners']

  4. Factorization - Use smash_config['factorizers']
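
A single SmashConfig can hold keys for more than one method type; whether a particular combination is supported depends on the methods involved. As an illustrative sketch using only keys from this manual:

```python
from pruna.algorithms.SmashConfig import SmashConfig

smash_config = SmashConfig()
smash_config['task'] = 'text_text_generation'
smash_config['compilers'] = ['c_generation']   # compilation
smash_config['quantizers'] = ['llm-int8']      # quantization
smash_config['n_quantization_bits'] = 8        # used by both methods here
```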

Compilation Methods

Compilation methods optimize the model for specific hardware. Supported methods include:

  • all:
    • Optimize computer vision models for any hardware.

    • Required Argument: - device: ‘cpu’ or ‘cuda’. e.g. smash_config['device'] = 'cuda'

    • Time: Approximately 15-20 minutes.

    • Quality: Similar to the original model.
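
Putting these arguments together, a minimal configuration for the all compiler might look as follows (the task is illustrative):

```python
smash_config = SmashConfig()
smash_config['task'] = 'image_classification'
smash_config['compilers'] = ['all']
smash_config['device'] = 'cuda'  # required: 'cpu' or 'cuda'
```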

  • diffusers:
    • Optimize diffusers models for NVIDIA GPUs.

    • Required Argument: - None.

    • Time: Approximately 15-20 minutes.

    • Quality: Same as the original model.

  • diffusers2:
    • Optimize diffusers models for NVIDIA GPUs (a much faster compilation path than diffusers).

    • Optional Argument: - save_dir: Working directory during compilation (temporary directory created if not specified). e.g. smash_config['save_dir'] = '/tmp/optimized_model.pkl'

    • Time: About 10 seconds.

    • Quality: Same as the original model.
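
For instance, a text-to-image configuration using diffusers2 with an explicit working directory (the path is only an example):

```python
smash_config = SmashConfig()
smash_config['task'] = 'text_image_generation'
smash_config['compilers'] = ['diffusers2']
smash_config['save_dir'] = '/tmp/optimized_model.pkl'  # optional
```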

  • c_translation:
    • Transforms Huggingface transformers translation models into C++ code.

    • Required Argument: - tokenizer: Associated tokenizer. e.g. smash_config['tokenizer'] = AutoTokenizer.from_pretrained('facebook/opt-125m')

    • Optional Argument: - n_quantization_bits: 8 or 16 bits (default 16). e.g. smash_config['n_quantization_bits'] = 8

    • Time: A few minutes.

    • Quality: Same as the original model.

  • c_generation:
    • Compiles generation models from Huggingface’s transformers library into C++ code.

    • Required Argument: - tokenizer: The tokenizer associated with your generation model.

    • Optional Argument: - n_quantization_bits: Specify 8 or 16 bits (16 by default).

    • Time: A few minutes.

    • Quality: Equivalent to the original model.
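
A c_generation configuration might look as follows ('facebook/opt-125m' is only an example checkpoint):

```python
from transformers import AutoTokenizer

smash_config = SmashConfig()
smash_config['task'] = 'text_text_generation'
smash_config['compilers'] = ['c_generation']
smash_config['tokenizer'] = AutoTokenizer.from_pretrained('facebook/opt-125m')
smash_config['n_quantization_bits'] = 8  # optional; defaults to 16
```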

  • c_whisper:
    • Converts whisper models from Huggingface’s transformers library to C++ code.

    • Required Argument: - processor: The processor for your whisper model.

    • Optional Argument: - n_quantization_bits: Choose between 8 or 16 bits (16 if unspecified). e.g. smash_config['n_quantization_bits'] = 8

    • Time: A few minutes.

    • Quality: Same as the original model.

  • ifw:
    • Optimizes whisper models from Huggingface’s transformers library using advanced batching and chunking techniques.

    • Required Arguments:
      • processor: Processor for your whisper model. e.g. smash_config['processor'] = AutoProcessor.from_pretrained('openai/whisper-large-v3')

      • device: Target hardware (‘cpu’ or ‘cuda’). e.g. smash_config['device'] = 'cuda'

    • Time: Seconds.

    • Quality: Comparable to the original model.
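
Combining the required arguments, an ifw configuration could be sketched as:

```python
from transformers import AutoProcessor

smash_config = SmashConfig()
smash_config['task'] = 'audio_text_transcription'
smash_config['compilers'] = ['ifw']
smash_config['processor'] = AutoProcessor.from_pretrained('openai/whisper-large-v3')
smash_config['device'] = 'cuda'  # or 'cpu'
```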

  • s2t:
    • Enhances whisper or c_whisper models from Huggingface’s transformers library with advanced techniques that reduce hallucination issues.

    • Required Argument: - processor: Processor for your whisper model. e.g. smash_config['processor'] = AutoProcessor.from_pretrained('openai/whisper-large-v3')

    • Time: Seconds.

    • Quality: Maintains original model performance.

  • hypertiles:
    • Compiles diffusers models for optimal inference speed on target GPUs.

    • Time: About 15-20 minutes.

    • Quality: Similar to the original model.

  • step_caching:
    • Optimizes diffusers models by intelligently selecting diffusion steps.

    • Time: Seconds.

    • Quality: Very close to the original model.

  • cv_fast:
    • Rapidly compiles computer vision models for NVIDIA GPUs.

    • Time: Approximately 10 seconds.

    • Quality: Unchanged from the original model.

Quantization

Quantization methods reduce the precision of the model’s weights and activations, greatly reducing the memory they require at the cost of some quality loss. Supported methods include:

  • llm-int8:
    • Quantizes the model to either 8-bit or 4-bit integers.

    • Required Argument:
      • n_quantization_bits: 4 or 8 bits. e.g. smash_config['n_quantization_bits'] = 8

    • Time: A few minutes.

    • Quality: Lower than the original model; 4-bit quantization loses more quality than 8-bit.

  • gptq:
    • Quantizes the model to 8-, 4-, 3-, or 2-bit integers.

    • Required Argument:
      • n_quantization_bits: 2, 3, 4, or 8 bits. e.g. smash_config['n_quantization_bits'] = 4

    • Time: A few minutes to an hour, depending on the model size.

    • Quality: Lower than the original model; quality degrades as the bit width decreases (2-bit worst, 8-bit best).
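
For example, a 4-bit gptq configuration (the task is illustrative):

```python
smash_config = SmashConfig()
smash_config['task'] = 'text_text_generation'
smash_config['quantizers'] = ['gptq']
smash_config['n_quantization_bits'] = 4  # trades quality for memory
```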

Pruning

Coming Soon!

Factorization

Coming Soon!