Highlights
H100 Support for LLM.int8()
PR #1401 brings full LLM.int8() support for NVIDIA Hopper GPUs such as the H100, H200, and H800!
As part of these compatibility enhancements, we've rebuilt much of the LLM.int8() code to simplify future maintenance and compatibility work. We no longer use the `col32` or other architecture-specific tensor layout formats, while maintaining backwards compatibility. We additionally bring performance improvements targeted at inference scenarios.
Performance Improvements
This release includes broad performance improvements for a wide variety of inference scenarios. See this X thread for a detailed explanation.
The improvements were measured using the 🤗optimum-benchmark tool.
For more benchmark results, see benchmarking/README.md.
LLM.int8()
- Turing/Ampere/Ada: The observed per-token throughput is improved by 60-85%, while per-token latency is decreased by 40-45%.
- H100: In our benchmarks of Llama 3.1 70B, the new LLM.int8() consistently outperforms NF4 at batch sizes >= 8.
Example throughput improvement for Qwen 2.5 14B Instruct on RTX 4090:
- Batch size = 1: 9.05 tokens/s => 15.44 tokens/s
- Batch size = 8: 66.62 tokens/s => 110.95 tokens/s
Example throughput improvement for Qwen 2.5 3B Instruct on T4:
- Batch size = 1: 3.34 tokens/s => 5.98 tokens/s
- Batch size = 8: 24.28 tokens/s => 44.15 tokens/s
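For context, LLM.int8() inference is typically enabled through 🤗transformers. A minimal sketch (the model id is illustrative and not the exact benchmark configuration):

```python
# Minimal LLM.int8() inference sketch via 🤗transformers.
# The model id below is illustrative; any causal LM on the Hub works the same way.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "Qwen/Qwen2.5-3B-Instruct"
bnb_config = BitsAndBytesConfig(load_in_8bit=True)  # enables LLM.int8() linear layers

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

inputs = tokenizer("The quick brown fox", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```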
NF4/FP4
- Turing/Ampere/Ada: With batch size of 1, per-token throughput is improved by 10-25% and per-token latency is decreased by 10-20%.
- H100: Across all batch sizes, per-token throughput is improved by up to 28% and per-token latency is decreased by up to 22%.
Example throughput improvement for Qwen 2.5 14B Instruct on RTX 4090:
- Batch size = 1: 31.46 tokens/s => 39.03 tokens/s
- Batch size = 8: 110.70 tokens/s => 111.29 tokens/s
Example throughput improvement for Qwen 2.5 3B Instruct on T4:
- Batch size = 1: 11.05 tokens/s => 13.58 tokens/s
- Batch size = 8: 69.8 tokens/s => 76.80 tokens/s
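NF4 is enabled through the same 🤗transformers interface via the 4-bit path; a minimal sketch, again with an illustrative model id rather than the exact benchmark setup:

```python
# Minimal NF4 loading sketch via 🤗transformers.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NF4 data type (FP4 is also available)
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute dtype for the dequantized matmuls
)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-3B-Instruct",  # illustrative model id
    quantization_config=bnb_config,
    device_map="auto",
)
```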
Changes
Packaging Changes
The size of our wheel has been reduced by ~43.5%, from 122.4 MB to 69.1 MB! This results in an on-disk size decrease from ~396 MB to ~224 MB.
CUDA Toolkit Versions
- Binaries built with CUDA Toolkit 12.6.2 are now included in the PyPI distribution.
- The CUDA 12.5.0 build has been updated to CUDA Toolkit 12.5.1.
Breaking
🤗PEFT users wishing to merge adapters with 8-bit weights will need to upgrade to `peft>=0.14.0`.
New
- A new public API for int8 dequantization has been added: `bitsandbytes.functional.int8_vectorwise_dequant()`. This functionality is being integrated into 🤗PEFT and 🤗transformers; a usage sketch follows this list.
- We've continued to make documentation updates. The `bitsandbytes.functional` module now has an API documentation page.
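A minimal sketch of the new dequantization API, under the assumption that it pairs with `int8_vectorwise_quant()` and takes the int8 tensor together with its row-wise absmax statistics:

```python
import torch
import bitsandbytes.functional as F

A = torch.randn(8, 32, dtype=torch.float16, device="cuda")

# Assumed to return (int8 tensor, row-wise absmax stats, outlier columns or None).
A_i8, row_stats, _ = F.int8_vectorwise_quant(A)

# Recover a floating-point approximation of A (within int8 rounding error).
A_dq = F.int8_vectorwise_dequant(A_i8, row_stats)
```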
Deprecations
A number of public API functions have been marked for deprecation and will emit a `FutureWarning` when used. These functions will become unavailable in future releases. This should have minimal impact on most end users.
k-bit Quantization
The k-bit quantization features are deprecated in favor of blockwise quantization. For all optimizers, using `block_wise=False` is not recommended, and support will be removed in a future release.
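For reference, blockwise quantization of the optimizer state is already the default; a minimal sketch, assuming the `block_wise` keyword as exposed by the current 8-bit optimizers:

```python
import torch
import bitsandbytes as bnb

model = torch.nn.Linear(128, 128).cuda()

# block_wise=True is the default; block_wise=False is deprecated and will be removed.
optimizer = bnb.optim.Adam8bit(model.parameters(), lr=1e-4, block_wise=True)

loss = model(torch.randn(4, 128, device="cuda")).sum()
loss.backward()
optimizer.step()
```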
LLM.int8() Deprecations
As part of the refactoring process, we've implemented many new 8-bit operations that no longer rely on specialized data layouts; a sketch of the resulting matmul path follows the list below. The following functions from `bitsandbytes.functional` are now deprecated:
- `dequant_min_max`
- `dequantize_no_absmax`
- `extract_outliers`
- `get_special_format_str`
- `get_transform_buffer`
- `get_transform_func`
- `mm_dequant` (replacement: `int8_mm_dequant`)
- `igemmlt` (replacement: `int8_linear_matmul`)
- `nvidia_transform`
- `transform`
- `quantize_no_absmax`
- `vectorwise_dequant`
- `vectorwise_quant` (~replacement: `int8_vectorwise_quant`)
- `vectorwise_mm_dequant` (~replacement: `int8_mm_dequant`)
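A minimal sketch of the new layout-free int8 matmul path assembled from the replacement functions above (argument order and return values are my reading of the `bitsandbytes.functional` API docs, so treat them as assumptions):

```python
import torch
import bitsandbytes.functional as F

x = torch.randn(4, 64, dtype=torch.float16, device="cuda")    # activations
w = torch.randn(128, 64, dtype=torch.float16, device="cuda")  # weight (out_features x in_features)

# Row-wise int8 quantization of both operands; assumed to return
# (int8 tensor, row-wise absmax stats, outlier columns or None).
x_i8, x_stats, _ = F.int8_vectorwise_quant(x)
w_i8, w_stats, _ = F.int8_vectorwise_quant(w)

# int8 GEMM with int32 accumulation, computing x_i8 @ w_i8.T.
out_i32 = F.int8_linear_matmul(x_i8, w_i8)

# Scale the int32 result back to fp16 using both sets of statistics.
out_fp16 = F.int8_mm_dequant(out_i32, x_stats, w_stats)
```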
General Deprecations
Additionally, the following functions from `bitsandbytes.functional` are deprecated:
- _mul
- arange
- post_call
- pre_call
What's Changed
- refine docs for multi-backend alpha release by @Titus-von-Koeller in #1380
- README: Replace special Unicode text symbols with regular characters by @akx in #1385
- Update CI tools & fix typos by @akx in #1386
- Fix invalid escape sequence warning in Python 3.12 by @oshiteku in #1420
- [Build] Add CUDA 12.6.2 build; update 12.5.0 to 12.5.1 by @matthewdouglas in #1431
- LLM.int8() Refactoring: Part 1 by @matthewdouglas in #1401
New Contributors
Full Changelog: 0.44.1...0.45.0