Add Support for mixed quantized BitNet Architecture Inference #2683

Open · wants to merge 11 commits into main
Conversation

@JoseCarlosGarcia95 commented Dec 27, 2024

Introduction

Hello again! My name is José Carlos, and I am fully focused on advancing the capabilities of BitNet within the Candle project. BitNet remains a personal passion of mine due to its unique ability to balance performance and efficiency in language models.

This PR builds on my previous work by introducing advanced quantization support tailored specifically for BitNet models. My primary goal is to make BitNet as efficient and accessible as possible for research and real-world applications.


Models supported

  • Falcon3-1.58
  • Llama3-8B-1.58

Changes Made

  1. Added Support for New Quantization Method q2_b0:

    • Implemented a mixed quantization strategy specifically for BitNet models:
      • BitLinear Layers: Quantized with the new q2_b0 method.
      • Non-BitLinear Layers: Quantized independently with a conventional method (q4_0 in the example below) to maximize overall model efficiency.
    • The q2_b0 method works by splitting the ternary weight matrix into two smaller matrices containing only binary values (0 and 1), which substantially reduces storage and simplifies computation (see the sketch after this list).
  2. Extended Quantization CLI:

    • Enhanced the CLI to support BitNet-specific quantization workflows:

    cargo run quantize ~/Downloads/Falcon3-1B-Instruct-1.58bit/model*.safetensors ~/Downloads/Falcon3-1B-Instruct-1.58bit/config.json --out-file ggml-model.gguf --quantization q4_0 -b --bitnet-quantization q2b0

  3. Support for Quantizing Models Directly in Candle:

    • Models can now be quantized directly within Candle, and the required metadata is embedded into the output file during the quantization process, which was not possible before. This makes the workflow more seamless and eliminates the need for external tools to manage metadata (see the sketch below).
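
To make the decomposition described in point 1 concrete, here is a minimal, self-contained Rust sketch of the idea: a ternary BitNet weight matrix W ∈ {-1, 0, +1} is split into two binary matrices W_pos and W_neg with W = W_pos - W_neg, so a matrix-vector product becomes two binary accumulations. This only illustrates the principle; the actual bit-packing layout and kernels used by q2_b0 in this PR may differ.

```rust
// A ternary BitNet weight matrix W in {-1, 0, +1} split into two binary
// matrices W_pos, W_neg in {0, 1} such that W = W_pos - W_neg. Each half can
// then be bit-packed, and a matmul against W becomes two binary accumulations.
fn split_ternary(weights: &[i8]) -> (Vec<u8>, Vec<u8>) {
    let w_pos: Vec<u8> = weights.iter().map(|&w| (w == 1) as u8).collect();
    let w_neg: Vec<u8> = weights.iter().map(|&w| (w == -1) as u8).collect();
    (w_pos, w_neg)
}

// y = x · W computed as x · W_pos - x · W_neg (row-major `rows x cols` weights).
fn ternary_matvec(x: &[f32], w_pos: &[u8], w_neg: &[u8], rows: usize, cols: usize) -> Vec<f32> {
    (0..cols)
        .map(|c| {
            (0..rows)
                .map(|r| x[r] * (w_pos[r * cols + c] as f32 - w_neg[r * cols + c] as f32))
                .sum::<f32>()
        })
        .collect()
}

fn main() {
    // 2x3 ternary weight matrix: [[1, 0, -1], [-1, 1, 0]].
    let w: Vec<i8> = vec![1, 0, -1, -1, 1, 0];
    let (w_pos, w_neg) = split_ternary(&w);
    let y = ternary_matvec(&[2.0, 3.0], &w_pos, &w_neg, 2, 3);
    println!("{y:?}"); // [-1.0, 3.0, -2.0]
}
```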
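
For point 3, here is a rough sketch (not the exact code in this PR) of what quantizing a checkpoint directly inside Candle and embedding metadata at write time can look like, using candle_core::quantized's existing safetensors::load, QTensor::quantize, and gguf_file::write. The metadata keys, the blanket q4_0 choice, and the note about a BitNet-aware pass selecting the new q2_b0 dtype for BitLinear weights are illustrative assumptions.

```rust
use candle_core::quantized::{gguf_file, GgmlDType, QTensor};
use candle_core::{Device, Result};

/// Quantize a safetensors checkpoint and write a GGUF file with metadata.
/// A BitNet-aware pass would pick the q2_b0 dtype for BitLinear weights
/// and a conventional dtype (q4_0 here) for everything else.
fn quantize_to_gguf(model_path: &str, out_path: &str) -> Result<()> {
    let tensors = candle_core::safetensors::load(model_path, &Device::Cpu)?;

    // Quantize every tensor (illustrative: everything goes to q4_0 here).
    let owned = tensors
        .iter()
        .map(|(name, t)| Ok((name.as_str(), QTensor::quantize(t, GgmlDType::Q4_0)?)))
        .collect::<Result<Vec<_>>>()?;
    let qtensors: Vec<(&str, &QTensor)> = owned.iter().map(|(n, q)| (*n, q)).collect();

    // Metadata (e.g. values read from config.json) is embedded at write time,
    // which is what removes the need for external GGUF-editing tools.
    // The keys below are placeholders, not necessarily the ones written by this PR.
    let owned_meta = [
        ("general.architecture", gguf_file::Value::String("llama".to_string())),
        ("llama.context_length", gguf_file::Value::U32(4096)),
    ];
    let metadata: Vec<(&str, &gguf_file::Value)> =
        owned_meta.iter().map(|(k, v)| (*k, v)).collect();

    let mut out = std::fs::File::create(out_path)?;
    gguf_file::write(&mut out, &metadata, &qtensors)?;
    Ok(())
}
```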

Testing

To test the quantized BitNet model, you can use the following command:

cargo run --example quantized-bitnet --release --features metal


Feedback and Collaboration

This PR is currently a draft, as I continue to develop and refine GPU support for BitNet. If you are interested in contributing or have feedback on the current implementation, I would love to hear from you.

Thank you for your time and support as I continue to focus solely on advancing BitNet in Candle! 😊

@JoseCarlosGarcia95 JoseCarlosGarcia95 marked this pull request as ready for review December 31, 2024 09:37
@JoseCarlosGarcia95 (Author) commented:

Q2B1 Quantization Results on a MacBook M2 Pro (16 GB RAM)

| Model | Tensors Loaded | Size (GB) | Loading Time (s) | Prompt Tokens Processed | Processing Speed (Tokens/s) | Tokens Generated | Generation Speed (Tokens/s) |
|-------|----------------|-----------|------------------|-------------------------|-----------------------------|------------------|-----------------------------|
| 1B    | 291            | 0.65      | 0.02             | 5                       | 66.54                       | 82               | 70.97                       |
| 3B    | 355            | 1.16      | 0.02             | 5                       | 42.94                       | 99               | 44.22                       |
| 7B    | 451            | 2.22      | 0.01             | 5                       | 27.66                       | 27               | 25.76                       |
| 8B    | 515            | 2.47      | 0.01             | 6                       | 18.69                       | 99               | 24.89                       |
| 10B   | 643            | 2.93      | 0.02             | 5                       | 21.43                       | 99               | 18.37                       |
