
0.45.0: LLM.int8() support for H100; faster 4-bit/8-bit inference

Highlights

H100 Support for LLM.int8()

PR #1401 brings full LLM.int8() support for NVIDIA Hopper GPUs such as the H100, H200, and H800!

As part of the compatibility enhancements, we've rebuilt much of the LLM.int8() code to simplify future maintenance and compatibility work. We no longer use the col32 or other architecture-specific tensor layout formats, while maintaining backwards compatibility. We have also made performance improvements targeted at inference scenarios.
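
As a quick reference, here is a minimal sketch of loading a model with LLM.int8() through 🤗 Transformers on an H100 (or any supported GPU); the model name below is only an example and is not taken from these notes.

```python
# Minimal LLM.int8() loading sketch via 🤗 Transformers (example model name;
# any causal LM on a supported GPU such as an H100 should work the same way).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "Qwen/Qwen2.5-3B-Instruct"  # placeholder/example model
quant_config = BitsAndBytesConfig(load_in_8bit=True)  # enables LLM.int8() weights

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    torch_dtype=torch.float16,
    device_map="auto",
)
```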

Performance Improvements

This release includes broad performance improvements for a wide variety of inference scenarios. See this X thread for a detailed explanation.

The improvements were measured using the 🤗optimum-benchmark tool.

For more benchmark results, see benchmarking/README.md.
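
As a rough, hand-rolled sanity check (not the optimum-benchmark methodology used for the numbers below), decode throughput can be estimated by timing generate and dividing the number of new tokens by the elapsed time:

```python
# Rough tokens/s estimate for an already-loaded model + tokenizer (a simple
# stand-in sketch, not the optimum-benchmark harness used for the official numbers).
import time
import torch

def tokens_per_second(model, tokenizer, prompt, max_new_tokens=128):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    model.generate(**inputs, max_new_tokens=8)  # warm-up pass
    torch.cuda.synchronize()
    start = time.perf_counter()
    out = model.generate(**inputs, max_new_tokens=max_new_tokens)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    new_tokens = out.shape[-1] - inputs["input_ids"].shape[-1]
    return new_tokens / elapsed
```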

LLM.int8()

  • Turing/Ampere/Ada: The observed per-token throughput is improved by 60-85%, while latency is decreased by 40-45%.
  • H100: With our benchmarking of Llama 3.1 70B, we observed the new LLM.int8() to consistently outperform NF4 at batch size >= 8.

Example throughput improvement for Qwen 2.5 14B Instruct on RTX 4090:

  • Batch size = 1: 9.05 tokens/s => 15.44 tokens/s
  • Batch size = 8: 66.62 tokens/s => 110.95 tokens/s

Example throughput improvement for Qwen 2.5 3B Instruct on T4:

  • Batch size = 1: 3.34 tokens/s => 5.98 tokens/s
  • Batch size = 8: 24.28 tokens/s => 44.15 tokens/s

NF4/FP4

  • Turing/Ampere/Ada: With batch size of 1, per-token throughput is improved by 10-25% and per-token latency is decreased by 10-20%.
  • H100: Across all batch sizes, per-token throughput is improved by up to 28% and per-token latency is decreased by up to 22%.

Example throughput improvement for Qwen 2.5 14B Instruct on RTX 4090:

  • Batch size = 1: 31.46 tokens/s => 39.03 tokens/s
  • Batch size = 8: 110.70 tokens/s => 111.29 tokens/s

Example throughput improvement for Qwen 2.5 3B Instruct on T4:

  • Batch size = 1: 11.05 tokens/s => 13.58 tokens/s
  • Batch size = 8: 69.8 tokens/s => 76.80 tokens/s
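
For comparison with the LLM.int8() path above, a minimal NF4 loading sketch; the model name and compute dtype are illustrative and not taken from the benchmarks.

```python
# Minimal NF4 (4-bit) loading sketch via 🤗 Transformers; FP4 can be selected
# instead with bnb_4bit_quant_type="fp4". Model name is a placeholder.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-3B-Instruct",  # placeholder/example model
    quantization_config=nf4_config,
    device_map="auto",
)
```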

Changes

Packaging Changes

The size of our wheel has been reduced by ~43.5%, from 122.4 MB to 69.1 MB! This results in an on-disk size decrease from ~396 MB to ~224 MB.

CUDA Toolkit Versions

  • Binaries built with CUDA Toolkit 12.6.2 are now included in the PyPI distribution.
  • The CUDA 12.5.0 build has been updated to CUDA Toolkit 12.5.1.

Breaking

🤗PEFT users wishing to merge adapters with 8-bit weights will need to upgrade to peft>=0.14.0.
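
For context, a sketch of the affected workflow (the model and adapter IDs are placeholders):

```python
# Merging a LoRA adapter into an 8-bit (LLM.int8()) base model; with this
# release the merge path requires peft>=0.14.0. IDs below are placeholders.
from peft import PeftModel
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

base = AutoModelForCausalLM.from_pretrained(
    "base-model-id",  # placeholder
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)
model = PeftModel.from_pretrained(base, "adapter-id")  # placeholder adapter
merged = model.merge_and_unload()  # the step that needs peft>=0.14.0 with 8-bit weights
```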

New

Deprecations

A number of public API functions have been marked for deprecation and will emit a FutureWarning when used. These functions will become unavailable in future releases. This should have minimal impact on most end users.
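
One optional way to catch these call sites early is to escalate FutureWarning to an error in your test suite:

```python
# Optional: fail loudly on any deprecated call path. Note this escalates
# every FutureWarning, not just those emitted by bitsandbytes.
import warnings

warnings.simplefilter("error", FutureWarning)
```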

k-bit quantization

The k-bit (non-blockwise) quantization features are deprecated in favor of blockwise quantization. For all optimizers, using block_wise=False is not recommended, and support for it will be removed in a future release.
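
Blockwise quantization is already the default for the 8-bit optimizers; a minimal sketch (block_wise is the optimizer keyword referenced above):

```python
# 8-bit Adam with blockwise optimizer-state quantization (the default setting).
# Avoid block_wise=False, which is deprecated per the note above.
import torch
import bitsandbytes as bnb

model = torch.nn.Linear(1024, 1024).cuda()
optimizer = bnb.optim.Adam8bit(model.parameters(), lr=1e-3, block_wise=True)
```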

LLM.int8() deprecations:

As part of the refactoring process, we've implemented many new 8-bit operations. These operations no longer use specialized data layouts.

The following functions from bitsandbytes.functional are now deprecated:

  • dequant_min_max
  • dequantize_no_absmax
  • extract_outliers
  • get_special_format_str
  • get_transform_buffer
  • get_transform_func
  • mm_dequant (replacement: int8_mm_dequant)
  • igemmlt (replacement: int8_linear_matmul)
  • nvidia_transform
  • transform
  • quantize_no_absmax
  • vectorwise_dequant
  • vectorwise_quant (approximate replacement: int8_vectorwise_quant)
  • vectorwise_mm_dequant (approximate replacement: int8_mm_dequant)
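
A rough migration sketch for the int8 matmul path, using the replacements named above; the exact argument order and return values are assumptions based on this release and may differ slightly (requires a CUDA GPU):

```python
# Sketch: quantize, multiply, and dequantize with the new int8 API
# (replacements for the deprecated vectorwise_quant / igemmlt / mm_dequant).
import torch
import bitsandbytes.functional as F

x = torch.randn(4, 64, dtype=torch.float16, device="cuda")    # activations [m, k]
w = torch.randn(128, 64, dtype=torch.float16, device="cuda")  # weights [n, k]

# Assumed return values: (int8 tensor, per-row absmax stats, outlier columns).
x_i8, x_stats, _ = F.int8_vectorwise_quant(x)
w_i8, w_stats, _ = F.int8_vectorwise_quant(w)

acc_i32 = F.int8_linear_matmul(x_i8, w_i8)        # int32 result of x @ w.T
y = F.int8_mm_dequant(acc_i32, x_stats, w_stats)  # scale back to floating point

print(y.shape)  # expected: torch.Size([4, 128])
```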

General Deprecations

Additionally, the following functions from bitsandbytes.functional are deprecated:

  • _mul
  • arange
  • post_call
  • pre_call

Full Changelog: 0.44.1...0.45.0