SmashConfig User Manual

SmashConfig is the configuration object in Pruna that specifies how your models should be optimized. This manual explains how to define and use a SmashConfig.

Defining a SmashConfig

Define a SmashConfig using the following code:

from pruna.algorithms.SmashConfig import SmashConfig
smash_config = SmashConfig()

After creating a SmashConfig, you can set the parameters for optimization:

smash_config['task'] = 'text_image_generation'
smash_config['compilers'] = ['diffusers2']

Passing a SmashConfig to the Smash Function

Pass a SmashConfig to the smash function as follows:

from pruna.smash import smash

smashed_model = smash(
    model=pipe,
    api_key='<your-api-key>',  # Replace <your-api-key> with your actual API key
    smash_config=smash_config,
)

SmashConfig Parameters

Task

The task parameter specifies the type of model you want to optimize. Supported tasks include:

  • image_classification: Optimize image classification models.

  • image_instance_segmentation: Optimize instance segmentation models.

  • image_keypoint_detection: Optimize keypoint detection models.

  • image_object_detection: Optimize object detection models.

  • image_semantic_segmentation: Optimize semantic segmentation models.

  • image_image_generation: Optimize image generation models.

  • image_image_inpainting: Optimize image inpainting models.

  • image_image_control: Optimize image control models.

  • image_video_generation: Optimize video generation models.

  • text_image_generation: Optimize text-to-image generation models.

  • text_video_generation: Optimize text-to-video generation models.

  • text_text_generation: Optimize text generation models.

  • text_text_translation: Optimize text translation models.

  • text+image_image_generation: Optimize image generation models conditioned on both text and image inputs.

  • audio_text_transcription: Optimize audio-to-text transcription models.
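
For example, to prepare a configuration for an audio transcription model, set the task key first. This fragment uses only the keys described in this manual:

```python
from pruna.algorithms.SmashConfig import SmashConfig

# The task tells Pruna what kind of model it is optimizing
smash_config = SmashConfig()
smash_config['task'] = 'audio_text_transcription'
```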

Optimization Methods

There are four types of optimization methods:

  1. Compilation - Use smash_config['compilers']

  2. Quantization - Use smash_config['quantizers']

  3. Pruning - Use smash_config['pruners']

  4. Factorization - Use smash_config['factorizers']
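
A single SmashConfig can hold keys for more than one method type; whether a particular combination is supported depends on the methods involved. As an illustrative sketch using only keys from this manual:

```python
from pruna.algorithms.SmashConfig import SmashConfig

smash_config = SmashConfig()
smash_config['task'] = 'text_text_generation'
smash_config['compilers'] = ['c_generation']   # compilation
smash_config['quantizers'] = ['llm-int8']      # quantization
smash_config['n_quantization_bits'] = 8        # used by both methods here
```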

Compilation Methods

Compilation methods optimize the model for specific hardware. Supported methods include:

  • all:
    • Optimize computer vision models for any hardware.

    • Required Argument: - device: ‘cpu’ or ‘cuda’. e.g. smash_config['device'] = 'cuda'

    • Time: Approximately 15-20 minutes.

    • Quality: Similar to the original model.
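
Putting these arguments together, a minimal configuration for the all compiler might look as follows (the task is illustrative):

```python
smash_config = SmashConfig()
smash_config['task'] = 'image_classification'
smash_config['compilers'] = ['all']
smash_config['device'] = 'cuda'  # required: 'cpu' or 'cuda'
```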

  • diffusers:
    • Optimize diffusers models for NVIDIA GPUs.

    • Required Argument: - None.

    • Time: Approximately 15-20 minutes.

    • Quality: Same as the original model.

  • diffusers2:
    • Optimize diffusers models for NVIDIA GPUs (a much faster compilation path than diffusers).

    • Optional Argument: - save_dir: Working directory during compilation (temporary directory created if not specified). e.g. smash_config['save_dir'] = '/tmp/optimized_model.pkl'

    • Time: About 10 seconds.

    • Quality: Same as the original model.
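
For instance, a text-to-image configuration using diffusers2 with an explicit working directory (the path is only an example):

```python
smash_config = SmashConfig()
smash_config['task'] = 'text_image_generation'
smash_config['compilers'] = ['diffusers2']
smash_config['save_dir'] = '/tmp/optimized_model.pkl'  # optional
```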

  • c_translation:
    • Transforms Huggingface transformers translation models into C++ code.

    • Required Argument: - tokenizer: Associated tokenizer. e.g. smash_config['tokenizer'] = AutoTokenizer.from_pretrained('facebook/opt-125m')

    • Optional Argument: - n_quantization_bits: 8 or 16 bits (default 16). e.g. smash_config['n_quantization_bits'] = 8

    • Time: A few minutes.

    • Quality: Same as the original model.

  • c_generation:
    • Compiles generation models from Huggingface’s transformers library into C++ code.

    • Required Argument: - tokenizer: The tokenizer associated with your generation model.

    • Optional Argument: - n_quantization_bits: Specify 8 or 16 bits (16 by default).

    • Time: A few minutes.

    • Quality: Equivalent to the original model.
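
A c_generation configuration might look as follows ('facebook/opt-125m' is only an example checkpoint):

```python
from transformers import AutoTokenizer

smash_config = SmashConfig()
smash_config['task'] = 'text_text_generation'
smash_config['compilers'] = ['c_generation']
smash_config['tokenizer'] = AutoTokenizer.from_pretrained('facebook/opt-125m')
smash_config['n_quantization_bits'] = 8  # optional; defaults to 16
```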

  • c_whisper:
    • Converts whisper models from Huggingface’s transformers library to C++ code.

    • Required Argument: - processor: The processor for your whisper model.

    • Optional Argument: - n_quantization_bits: Choose between 8 or 16 bits (16 if unspecified). e.g. smash_config['n_quantization_bits'] = 8

    • Time: A few minutes.

    • Quality: Same as the original model.

  • ifw:
    • Optimizes whisper models from Huggingface’s transformers library using advanced batching and chunking techniques.

    • Required Arguments:
      • processor: Processor for your whisper model. e.g. smash_config['processor'] = AutoProcessor.from_pretrained('openai/whisper-large-v3')

      • device: Target hardware (‘cpu’ or ‘cuda’). e.g. smash_config['device'] = 'cuda'

    • Time: Seconds.

    • Quality: Comparable to the original model.
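
Combining the required arguments, an ifw configuration could be sketched as:

```python
from transformers import AutoProcessor

smash_config = SmashConfig()
smash_config['task'] = 'audio_text_transcription'
smash_config['compilers'] = ['ifw']
smash_config['processor'] = AutoProcessor.from_pretrained('openai/whisper-large-v3')
smash_config['device'] = 'cuda'  # or 'cpu'
```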

  • s2t:
    • Enhances whisper or c_whisper models from Huggingface’s transformers library with advanced techniques that reduce hallucination issues.

    • Required Argument: - processor: Processor for your whisper model. e.g. smash_config['processor'] = AutoProcessor.from_pretrained('openai/whisper-large-v3')

    • Time: Seconds.

    • Quality: Maintains original model performance.

  • hypertiles:
    • Compiles diffusers models for optimal inference speed on target GPUs.

    • Time: About 15-20 minutes.

    • Quality: Similar to the original model.

  • step_caching:
    • Optimizes diffusers models by intelligently selecting diffusion steps.

    • Time: Seconds.

    • Quality: Very close to the original model.

  • cv_fast:
    • Rapidly compiles computer vision models for NVIDIA GPUs.

    • Time: Approximately 10 seconds.

    • Quality: Unchanged from the original model.

Quantization

Quantization methods reduce the precision of the model’s weights and activations, greatly reducing the memory they require at the cost of some quality loss. Supported methods include:

  • llm-int8:
    • Quantizes the model to either 8-bit or 4-bit integers.

    • Required Argument:
      • n_quantization_bits: 4 or 8 bits. e.g. smash_config['n_quantization_bits'] = 8

    • Time: A few minutes.

    • Quality: Lower than the original model; 4-bit quantization loses more quality than 8-bit.

  • gptq:
    • Quantizes the model to 8-, 4-, 3-, or 2-bit integers.

    • Required Argument:
      • n_quantization_bits: 2, 3, 4, or 8 bits. e.g. smash_config['n_quantization_bits'] = 4

    • Time: A few minutes to an hour, depending on the model size.

    • Quality: Lower than the original model; quality degrades as the bit width decreases (2-bit worst, 8-bit best).
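
For example, a 4-bit gptq configuration (the task is illustrative):

```python
smash_config = SmashConfig()
smash_config['task'] = 'text_text_generation'
smash_config['quantizers'] = ['gptq']
smash_config['n_quantization_bits'] = 4  # trades quality for memory
```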

Pruning

Coming Soon!

Factorization

Coming Soon!