Add Support for mixed quantized BitNet Architecture Inference #2683

Open · wants to merge 11 commits into main
Conversation

@JoseCarlosGarcia95 commented Dec 27, 2024

Introduction

Hello again! My name is José Carlos, and I am fully focused on advancing the capabilities of BitNet within the Candle project. BitNet remains a personal passion of mine due to its unique ability to balance performance and efficiency in language models.

This PR builds on my previous work by introducing advanced quantization support tailored specifically for BitNet models. My primary goal is to make BitNet as efficient and accessible as possible for research and real-world applications.


Models supported

  • Falcon3-1.58
  • Llama3-8B-1.58

Changes Made

  1. Added Support for New Quantization Method q2_b0:

    • Implemented a mixed quantization strategy specifically for BitNet models:
      • BitLinear Layers: Quantized with the new q2_b0 method.
      • Non-BitLinear Layers: Quantized independently with a conventional method (q4_0 in the example below) to maximize overall model efficiency.
    • The q2_b0 method works by splitting the ternary weight matrix into two smaller matrices containing only binary values (0 and 1), which substantially reduces storage and simplifies computation (see the sketch after this list).
  2. Extended Quantization CLI:

    • Enhanced the CLI to support BitNet-specific quantization workflows:

    cargo run quantize ~/Downloads/Falcon3-1B-Instruct-1.58bit/model*.safetensors ~/Downloads/Falcon3-1B-Instruct-1.58bit/config.json --out-file ggml-model.gguf --quantization q4_0 -b --bitnet-quantization q2b0

  3. Support for Quantizing Models Directly in Candle:

    • Models can now be quantized directly within Candle, and the required metadata is embedded into the output file during the quantization process, which was not possible before. This makes the workflow more seamless and eliminates the need for external tools to manage metadata (see the sketch below).
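
To make the decomposition described in point 1 concrete, here is a minimal, self-contained Rust sketch of the idea: a ternary BitNet weight matrix W ∈ {-1, 0, +1} is split into two binary matrices W_pos and W_neg with W = W_pos - W_neg, so a matrix-vector product becomes two binary accumulations. This only illustrates the principle; the actual bit-packing layout and kernels used by q2_b0 in this PR may differ.

```rust
// A ternary BitNet weight matrix W in {-1, 0, +1} split into two binary
// matrices W_pos, W_neg in {0, 1} such that W = W_pos - W_neg. Each half can
// then be bit-packed, and a matmul against W becomes two binary accumulations.
fn split_ternary(weights: &[i8]) -> (Vec<u8>, Vec<u8>) {
    let w_pos: Vec<u8> = weights.iter().map(|&w| (w == 1) as u8).collect();
    let w_neg: Vec<u8> = weights.iter().map(|&w| (w == -1) as u8).collect();
    (w_pos, w_neg)
}

// y = x · W computed as x · W_pos - x · W_neg (row-major `rows x cols` weights).
fn ternary_matvec(x: &[f32], w_pos: &[u8], w_neg: &[u8], rows: usize, cols: usize) -> Vec<f32> {
    (0..cols)
        .map(|c| {
            (0..rows)
                .map(|r| x[r] * (w_pos[r * cols + c] as f32 - w_neg[r * cols + c] as f32))
                .sum::<f32>()
        })
        .collect()
}

fn main() {
    // 2x3 ternary weight matrix: [[1, 0, -1], [-1, 1, 0]].
    let w: Vec<i8> = vec![1, 0, -1, -1, 1, 0];
    let (w_pos, w_neg) = split_ternary(&w);
    let y = ternary_matvec(&[2.0, 3.0], &w_pos, &w_neg, 2, 3);
    println!("{y:?}"); // [-1.0, 3.0, -2.0]
}
```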
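
For point 3, here is a rough sketch (not the exact code in this PR) of what quantizing a checkpoint directly inside Candle and embedding metadata at write time can look like, using candle_core::quantized's existing safetensors::load, QTensor::quantize, and gguf_file::write. The metadata keys, the blanket q4_0 choice, and the note about a BitNet-aware pass selecting the new q2_b0 dtype for BitLinear weights are illustrative assumptions.

```rust
use candle_core::quantized::{gguf_file, GgmlDType, QTensor};
use candle_core::{Device, Result};

/// Quantize a safetensors checkpoint and write a GGUF file with metadata.
/// A BitNet-aware pass would pick the q2_b0 dtype for BitLinear weights
/// and a conventional dtype (q4_0 here) for everything else.
fn quantize_to_gguf(model_path: &str, out_path: &str) -> Result<()> {
    let tensors = candle_core::safetensors::load(model_path, &Device::Cpu)?;

    // Quantize every tensor (illustrative: everything goes to q4_0 here).
    let owned = tensors
        .iter()
        .map(|(name, t)| Ok((name.as_str(), QTensor::quantize(t, GgmlDType::Q4_0)?)))
        .collect::<Result<Vec<_>>>()?;
    let qtensors: Vec<(&str, &QTensor)> = owned.iter().map(|(n, q)| (*n, q)).collect();

    // Metadata (e.g. values read from config.json) is embedded at write time,
    // which is what removes the need for external GGUF-editing tools.
    // The keys below are placeholders, not necessarily the ones written by this PR.
    let owned_meta = [
        ("general.architecture", gguf_file::Value::String("llama".to_string())),
        ("llama.context_length", gguf_file::Value::U32(4096)),
    ];
    let metadata: Vec<(&str, &gguf_file::Value)> =
        owned_meta.iter().map(|(k, v)| (*k, v)).collect();

    let mut out = std::fs::File::create(out_path)?;
    gguf_file::write(&mut out, &metadata, &qtensors)?;
    Ok(())
}
```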

Testing

To test the quantized BitNet model, you can use the following command:

cargo run --example quantized-bitnet --release --features metal


Feedback and Collaboration

This PR is currently a draft, as I continue to develop and refine GPU support for BitNet. If you are interested in contributing or have feedback on the current implementation, I would love to hear from you.

Thank you for your time and support as I continue to focus solely on advancing BitNet in Candle! 😊

@JoseCarlosGarcia95 JoseCarlosGarcia95 marked this pull request as ready for review December 31, 2024 09:37
@JoseCarlosGarcia95 (Author) commented:

Q2B1 Quantization Results on a MacBook M2 Pro (16 GB RAM)

| Model | Tensors Loaded | Size (GB) | Loading Time (s) | Prompt Tokens Processed | Processing Speed (Tokens/s) | Tokens Generated | Generation Speed (Tokens/s) |
|-------|----------------|-----------|------------------|-------------------------|-----------------------------|------------------|-----------------------------|
| 1B    | 291            | 0.65      | 0.02             | 5                       | 66.54                       | 82               | 70.97                       |
| 3B    | 355            | 1.16      | 0.02             | 5                       | 42.94                       | 99               | 44.22                       |
| 7B    | 451            | 2.22      | 0.01             | 5                       | 27.66                       | 27               | 25.76                       |
| 8B    | 515            | 2.47      | 0.01             | 6                       | 18.69                       | 99               | 24.89                       |
| 10B   | 643            | 2.93      | 0.02             | 5                       | 21.43                       | 99               | 18.37                       |
