Quantized metal. #1594
- Add a device param, wherever needed.
- Create new QMetal storage thing that implements QuantizedType.
- Update everywhere needed.

Fix Python.
Fixing examples.
Fix: fmt + clippy + stub.
Moving everything around.
Only missing the actual implems.
Fixing everything + adding dequantized kernels.
More work.
Fixing matmul.
Fmt + Clippy
Some clippy fixes.
Working state.

Q2K Metal -> Bugged (also present in GGML).
Q4K CPU -> Bugged (present previously, new test catch it).
Q5K CPU -> Bugged (present previously).
Q8_1 Both -> Never really implemented it seems
Q8K metal -> Never implemented in metal

Fixing Q2K bug (present in ggml).
All the ggmldtype bits seem like a refactoring that is probably orthogonal to metal. Could this be split into a separate PR? Also, all the fences bits seem orthogonal too and could be extracted.
Also
No, it's not. It's core to it. The reason is that a lot of the previous code was using the GgmlType (the block type, not the dtype) as a generic. This doesn't work for the metal bit, since the buffers that actually store the data are untyped (unlike Vec), so we need to change that around. (I know we could use PhantomData, but that seems like an anti-pattern here, and overall the code seems much simpler like this.) For
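To illustrate the point above, here is a minimal sketch of the two designs. Every name in it (`DType`, `Block`, `BlockQ8_0`, `CpuStorage`, `UntypedStorage`) is hypothetical and chosen for illustration; it is not candle's actual API, and a plain `Vec<u8>` stands in for an untyped device buffer. The block sizes assume the ggml layouts for Q4_0 and Q8_0 (32 weights per block, with an f16 scale followed by the packed quants).

```rust
// Hypothetical names throughout; this is NOT candle's actual API.

/// Runtime tag identifying the quantization block layout.
#[allow(non_camel_case_types)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DType {
    Q4_0,
    Q8_0,
}

impl DType {
    /// Number of weights stored in one block.
    fn block_elems(self) -> usize {
        match self {
            DType::Q4_0 | DType::Q8_0 => 32,
        }
    }

    /// Size of one block in bytes: f16 scale plus packed quants.
    fn block_bytes(self) -> usize {
        match self {
            DType::Q4_0 => 2 + 16,
            DType::Q8_0 => 2 + 32,
        }
    }
}

/// CPU-style design: the block type is a generic parameter, so the layout
/// is known at compile time through `Vec<T>`.
trait Block: Copy {
    const DTYPE: DType;
}

#[allow(non_camel_case_types, dead_code)]
#[derive(Clone, Copy)]
#[repr(C)]
struct BlockQ8_0 {
    d: u16,       // f16 scale kept as raw bits
    qs: [i8; 32], // quantized weights
}

impl Block for BlockQ8_0 {
    const DTYPE: DType = DType::Q8_0;
}

struct CpuStorage<T: Block> {
    blocks: Vec<T>,
}

impl<T: Block> CpuStorage<T> {
    fn size_in_bytes(&self) -> usize {
        self.blocks.len() * std::mem::size_of::<T>()
    }
}

/// Device-style design: the buffer is only bytes, so the layout has to
/// travel next to it as a runtime value instead of a type parameter.
struct UntypedStorage {
    dtype: DType,
    bytes: Vec<u8>, // stand-in for an untyped GPU buffer
}

impl UntypedStorage {
    /// Rejects buffers whose length is not a whole number of blocks.
    fn new(dtype: DType, bytes: Vec<u8>) -> Result<Self, String> {
        if bytes.len() % dtype.block_bytes() != 0 {
            return Err(format!(
                "{} bytes is not a multiple of the {:?} block size",
                bytes.len(),
                dtype
            ));
        }
        Ok(Self { dtype, bytes })
    }

    /// Number of weights represented by the buffer.
    fn elem_count(&self) -> usize {
        self.bytes.len() / self.dtype.block_bytes() * self.dtype.block_elems()
    }
}
```

With the runtime tag, the same storage type can describe any block layout without a type parameter or `PhantomData`, at the cost of moving the layout check from compile time to construction time.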
Merged #1523 directly.
Working quantized state for candle.

High level overview:

- I think we should keep the surface for on-metal quantize/dequantize so we can easily implement them later. They are part of the ggml API imho.
- Uses `test_device!` in order to get similar testing behavior as regular `Tensor`.
- `quantized.metal` is a direct copy of ggml's `ggml-metal.metal`. This choice was made so that further development is faster and fixes for the bugs mentioned below can be imported more easily. All the glue logic is in `candle_metal_kernels`.
- `candle_metal_kernels`: ggml uses different kernels based on the size of the matmul and the hardware capability. This wasn't implemented here, but could be with the current API.

Notable bugs already discovered (not fixed in this PR since they do not belong here):