
0.45.0: LLM.int8() support for H100; faster 4-bit/8-bit inference

Highlights

H100 Support for LLM.int8()

PR #1401 brings full LLM.int8() support for NVIDIA Hopper GPUs such as the H100, H200, and H800!

As part of the compatibility enhancements, we've rebuilt much of the LLM.int8() code to simplify future maintenance and compatibility work. We no longer use the col32 or other architecture-specific tensor layout formats, while maintaining backwards compatibility. We have also made performance improvements targeted at inference scenarios.
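
As a quick reference, here is a minimal sketch of loading a model with LLM.int8() through 🤗 Transformers on an H100 (or any supported GPU); the model name below is only an example and is not taken from these notes.

```python
# Minimal LLM.int8() loading sketch via 🤗 Transformers (example model name;
# any causal LM on a supported GPU such as an H100 should work the same way).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "Qwen/Qwen2.5-3B-Instruct"  # placeholder/example model
quant_config = BitsAndBytesConfig(load_in_8bit=True)  # enables LLM.int8() weights

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    torch_dtype=torch.float16,
    device_map="auto",
)
```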

Performance Improvements

This release includes broad performance improvements for a wide variety of inference scenarios. See this X thread for a detailed explanation.

The improvements were measured using the 🤗optimum-benchmark tool.

For more benchmark results, see benchmarking/README.md.
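
As a rough, hand-rolled sanity check (not the optimum-benchmark methodology used for the numbers below), decode throughput can be estimated by timing generate and dividing the number of new tokens by the elapsed time:

```python
# Rough tokens/s estimate for an already-loaded model + tokenizer (a simple
# stand-in sketch, not the optimum-benchmark harness used for the official numbers).
import time
import torch

def tokens_per_second(model, tokenizer, prompt, max_new_tokens=128):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    model.generate(**inputs, max_new_tokens=8)  # warm-up pass
    torch.cuda.synchronize()
    start = time.perf_counter()
    out = model.generate(**inputs, max_new_tokens=max_new_tokens)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    new_tokens = out.shape[-1] - inputs["input_ids"].shape[-1]
    return new_tokens / elapsed
```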

LLM.int8()

  • Turing/Ampere/Ada: The observed per-token throughput is improved by 60-85%, while latency is decreased by 40-45%.
  • H100: With our benchmarking of Llama 3.1 70B, we observed the new LLM.int8() to consistently outperform NF4 at batch size >= 8.

Example throughput improvement for Qwen 2.5 14B Instruct on RTX 4090:

  • Batch size = 1: 9.05 tokens/s => 15.44 tokens/s
  • Batch size = 8: 66.62 tokens/s => 110.95 tokens/s

Example throughput improvement for Qwen 2.5 3B Instruct on T4:

  • Batch size = 1: 3.34 tokens/s => 5.98 tokens/s
  • Batch size = 8: 24.28 tokens/s => 44.15 tokens/s

NF4/FP4

  • Turing/Ampere/Ada: With batch size of 1, per-token throughput is improved by 10-25% and per-token latency is decreased by 10-20%.
  • H100: Across all batch sizes, per-token throughput is improved by up to 28% and per-token latency is decreased by up to 22%.

Example throughput improvement for Qwen 2.5 14B Instruct on RTX 4090:

  • Batch size = 1: 31.46 tokens/s => 39.03 tokens/s
  • Batch size = 8: 110.70 tokens/s => 111.29 tokens/s

Example throughput improvement for Qwen 2.5 3B Instruct on T4:

  • Batch size = 1: 11.05 tokens/s => 13.58 tokens/s
  • Batch size = 8: 69.8 tokens/s => 76.80 tokens/s
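
For comparison with the LLM.int8() path above, a minimal NF4 loading sketch; the model name and compute dtype are illustrative and not taken from the benchmarks.

```python
# Minimal NF4 (4-bit) loading sketch via 🤗 Transformers; FP4 can be selected
# instead with bnb_4bit_quant_type="fp4". Model name is a placeholder.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-3B-Instruct",  # placeholder/example model
    quantization_config=nf4_config,
    device_map="auto",
)
```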

Changes

Packaging Changes

The size of our wheel has been reduced by ~43.5%, from 122.4 MB to 69.1 MB! This results in an on-disk size decrease from ~396 MB to ~224 MB.

CUDA Toolkit Versions

  • Binaries built with CUDA Toolkit 12.6.2 are now included in the PyPI distribution.
  • The CUDA 12.5.0 build has been updated to CUDA Toolkit 12.5.1.

Breaking

🤗PEFT users wishing to merge adapters with 8-bit weights will need to upgrade to peft>=0.14.0.
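
For context, a sketch of the affected workflow (the model and adapter IDs are placeholders):

```python
# Merging a LoRA adapter into an 8-bit (LLM.int8()) base model; with this
# release the merge path requires peft>=0.14.0. IDs below are placeholders.
from peft import PeftModel
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

base = AutoModelForCausalLM.from_pretrained(
    "base-model-id",  # placeholder
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)
model = PeftModel.from_pretrained(base, "adapter-id")  # placeholder adapter
merged = model.merge_and_unload()  # the step that needs peft>=0.14.0 with 8-bit weights
```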

New

Deprecations

A number of public API functions have been marked for deprecation and will emit a FutureWarning when used. These functions will become unavailable in future releases. This should have minimal impact on most end users.
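
One optional way to catch these call sites early is to escalate FutureWarning to an error in your test suite:

```python
# Optional: fail loudly on any deprecated call path. Note this escalates
# every FutureWarning, not just those emitted by bitsandbytes.
import warnings

warnings.simplefilter("error", FutureWarning)
```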

k-bit quantization

The k-bit (non-blockwise) quantization features are deprecated in favor of blockwise quantization. For all optimizers, using block_wise=False is not recommended, and support for it will be removed in a future release.
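
Blockwise quantization is already the default for the 8-bit optimizers; a minimal sketch (block_wise is the optimizer keyword referenced above):

```python
# 8-bit Adam with blockwise optimizer-state quantization (the default setting).
# Avoid block_wise=False, which is deprecated per the note above.
import torch
import bitsandbytes as bnb

model = torch.nn.Linear(1024, 1024).cuda()
optimizer = bnb.optim.Adam8bit(model.parameters(), lr=1e-3, block_wise=True)
```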

LLM.int8() deprecations:

As part of the refactoring process, we've implemented many new 8-bit operations. These operations no longer use specialized data layouts.

The following functions from bitsandbytes.functional are now deprecated:

  • dequant_min_max
  • dequantize_no_absmax
  • extract_outliers
  • get_special_format_str
  • get_transform_buffer
  • get_transform_func
  • mm_dequant (replacement: int8_mm_dequant)
  • igemmlt (replacement: int8_linear_matmul)
  • nvidia_transform
  • transform
  • quantize_no_absmax
  • vectorwise_dequant
  • vectorwise_quant (approximate replacement: int8_vectorwise_quant)
  • vectorwise_mm_dequant (approximate replacement: int8_mm_dequant)
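
A rough migration sketch for the int8 matmul path, using the replacements named above; the exact argument order and return values are assumptions based on this release and may differ slightly (requires a CUDA GPU):

```python
# Sketch: quantize, multiply, and dequantize with the new int8 API
# (replacements for the deprecated vectorwise_quant / igemmlt / mm_dequant).
import torch
import bitsandbytes.functional as F

x = torch.randn(4, 64, dtype=torch.float16, device="cuda")    # activations [m, k]
w = torch.randn(128, 64, dtype=torch.float16, device="cuda")  # weights [n, k]

# Assumed return values: (int8 tensor, per-row absmax stats, outlier columns).
x_i8, x_stats, _ = F.int8_vectorwise_quant(x)
w_i8, w_stats, _ = F.int8_vectorwise_quant(w)

acc_i32 = F.int8_linear_matmul(x_i8, w_i8)        # int32 result of x @ w.T
y = F.int8_mm_dequant(acc_i32, x_stats, w_stats)  # scale back to floating point

print(y.shape)  # expected: torch.Size([4, 128])
```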

General Deprecations

Additionally, the following functions from bitsandbytes.functional are deprecated:

  • _mul
  • arange
  • post_call
  • pre_call

Full Changelog: 0.44.1...0.45.0