SmashConfig User Manual
=======================

SmashConfig is an essential tool in Pruna for configuring parameters to optimize your models. This manual explains how to define and use a SmashConfig.

Defining a SmashConfig
----------------------

Define a SmashConfig using the following code:

.. code-block:: python

    from pruna.algorithms.SmashConfig import SmashConfig

    smash_config = SmashConfig()

After creating a SmashConfig, you can set the parameters for optimization:

.. code-block:: python

    smash_config['compilers'] = ['diffusers2']

Passing a SmashConfig to the Smash Function
-------------------------------------------

Pass a SmashConfig to the ``smash`` function as follows:

.. code-block:: python

    from pruna.smash import smash

    smashed_model = smash(
        model=pipe,
        api_key='',  # Replace with your actual API key
        smash_config=smash_config,
    )

SmashConfig Parameters
----------------------

Optimization Methods
^^^^^^^^^^^^^^^^^^^^

There are four types of optimization methods:

1. Compilation - Use ``smash_config['compilers']``
2. Quantization - Use ``smash_config['quantizers']``
3. Pruning - Use ``smash_config['pruners']``
4. Factorization - Use ``smash_config['factorizers']``

Compilation Methods
^^^^^^^^^^^^^^^^^^^

Compilation methods optimize the model for specific hardware. Supported methods include:

- **all**:

  - Optimizes computer vision models for any hardware.
  - Required argument:

    - ``device``: ``'cpu'`` or ``'cuda'``, e.g. ``smash_config['device'] = 'cuda'``

  - Time: Approximately 15-20 minutes.
  - Quality: Similar to the original model.

- **diffusers**:

  - Optimizes ``diffusers`` models for NVIDIA GPUs.
  - Required arguments: None.
  - Time: Approximately 15-20 minutes.
  - Quality: Same as the original model.

- **diffusers2**:

  - Optimizes ``diffusers`` models for NVIDIA GPUs.
  - Optional argument:

    - ``save_dir``: Working directory during compilation (a temporary directory is created if not specified), e.g. ``smash_config['save_dir'] = '/tmp/optimized_model.pkl'``

  - Time: About 10 seconds.
  - Quality: Same as the original model.

- **c_translation**:

  - Transforms translation models from Hugging Face's ``transformers`` library into C++ code.
  - Required argument:

    - ``tokenizer``: The tokenizer associated with your model, e.g. ``smash_config['tokenizer'] = AutoTokenizer.from_pretrained('facebook/opt-125m')``

  - Optional argument:

    - ``n_quantization_bits``: 8 or 16 bits (default 16), e.g. ``smash_config['n_quantization_bits'] = 8``

  - Time: A few minutes.
  - Quality: Same as the original model.

- **c_generation**:

  - Compiles generation models from Hugging Face's ``transformers`` library into C++ code.
  - Required argument:

    - ``tokenizer``: The tokenizer associated with your generation model.

  - Optional argument:

    - ``n_quantization_bits``: 8 or 16 bits (default 16).

  - Time: A few minutes.
  - Quality: Same as the original model.

- **c_whisper**:

  - Converts Whisper models from Hugging Face's ``transformers`` library to C++ code.
  - Required argument:

    - ``processor``: The processor for your Whisper model.

  - Optional argument:

    - ``n_quantization_bits``: 8 or 16 bits (default 16), e.g. ``smash_config['n_quantization_bits'] = 8``

  - Time: A few minutes.
  - Quality: Same as the original model.

- **ifw**:

  - Optimizes Whisper models from Hugging Face's ``transformers`` library using advanced batching and chunking techniques.
  - Required arguments:

    - ``processor``: The processor for your Whisper model, e.g. ``smash_config['processor'] = AutoProcessor.from_pretrained('openai/whisper-large-v3')``
    - ``device``: Target hardware (``'cpu'`` or ``'cuda'``), e.g. ``smash_config['device'] = 'cuda'``

  - Time: Seconds.
  - Quality: Comparable to the original model.

- **s2t**:

  - Enhances ``c_whisper`` or Whisper models from Hugging Face's ``transformers`` library with advanced techniques that reduce hallucination issues.
  - Required argument:

    - ``processor``: The processor for your Whisper model, e.g. ``smash_config['processor'] = AutoProcessor.from_pretrained('openai/whisper-large-v3')``

  - Time: Seconds.
  - Quality: Maintains original model performance.

- **hypertiles**:

  - Compiles ``diffusers`` models for optimal inference speed on target GPUs.
  - Time: About 15-20 minutes.
  - Quality: Similar to the original model.

- **step_caching**:

  - Optimizes ``diffusers`` models by intelligently selecting diffusion steps.
  - Time: Seconds.
  - Quality: Very close to the original model.

- **cv_fast**:

  - Rapidly compiles computer vision models for NVIDIA GPUs.
  - Time: Approximately 10 seconds.
  - Quality: Unchanged from the original model.

Quantization
^^^^^^^^^^^^

Quantization methods reduce the precision of the model's weights and activations, making the model require much less memory at the cost of some quality loss. Supported methods are listed below; a configuration sketch follows the list.

- **llm-int**:

  - Quantizes the model to 8-bit or 4-bit integers.
  - Required argument:

    - ``n_quantization_bits``: 4 or 8 bits, e.g. ``smash_config['n_quantization_bits'] = 8``

  - Time: A few minutes.
  - Quality: Lower than the original model; 4-bit is worse than 8-bit.

- **gptq**:

  - Quantizes the model to 8-bit, 4-bit, 3-bit, or 2-bit integers.
  - Required argument:

    - ``n_quantization_bits``: 2, 3, 4, or 8 bits, e.g. ``smash_config['n_quantization_bits'] = 4``

  - Time: A few minutes to an hour, depending on the model size.
  - Quality: Lower than the original model; quality drops as the bit width decreases (2 < 3 < 4 < 8).
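For instance, here is a minimal sketch selecting the ``gptq`` quantizer at 4 bits. The keys follow the list above; how well a given model tolerates 4-bit weights is model-dependent:

.. code-block:: python

    from pruna.algorithms.SmashConfig import SmashConfig

    # Select the gptq quantizer and set its required bit width.
    smash_config = SmashConfig()
    smash_config['quantizers'] = ['gptq']
    smash_config['n_quantization_bits'] = 4  # 2, 3, 4, or 8; fewer bits save memory but cost quality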
Pruning
^^^^^^^

Coming Soon!

Factorization
^^^^^^^^^^^^^

Coming Soon!
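Putting It All Together
-----------------------

As a closing illustration, here is a minimal end-to-end sketch that applies the ``ifw`` compiler to a Whisper model. This is a sketch, assuming the model is a standard ``transformers`` Whisper checkpoint loaded as shown; the checkpoint name is a placeholder, and your loading code may differ:

.. code-block:: python

    from transformers import AutoProcessor, WhisperForConditionalGeneration

    from pruna.algorithms.SmashConfig import SmashConfig
    from pruna.smash import smash

    # Load a Whisper model and its processor (placeholder checkpoint).
    model = WhisperForConditionalGeneration.from_pretrained('openai/whisper-large-v3')
    processor = AutoProcessor.from_pretrained('openai/whisper-large-v3')

    # Configure the ifw compiler with both of its required arguments.
    smash_config = SmashConfig()
    smash_config['compilers'] = ['ifw']
    smash_config['processor'] = processor
    smash_config['device'] = 'cuda'

    # Optimize the model.
    smashed_model = smash(
        model=model,
        api_key='',  # Replace with your actual API key
        smash_config=smash_config,
    )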