[BUG]: Size Mismatch Issue When Loading Model Checkpoints Trained with Tensor Parallel if vocab_size % tp_size != 0
#6167
Labels
bug
Is there an existing issue for this bug?
🐛 Describe the bug
Describe the bug
A size mismatch error occurs when loading model checkpoints trained with tensor parallel enabled, if the `vocab_size` is not divisible by `tp_size`.

To Reproduce
Let's modify the official Llama benchmark to reproduce this with minimal changes.
`benchmark.py` (modify the Llama model `vocab_size`):
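The original snippet is not preserved above; the following is a minimal sketch of the kind of change meant, assuming the benchmark constructs a `transformers.LlamaConfig` for the model. The exact value is illustrative: any `vocab_size` with `vocab_size % tp_size != 0` (e.g. 65535 with `--tp 2`) should hit the issue, while 65536 does not.

```python
from transformers import LlamaConfig

# Illustrative config tweak: pick a vocab_size that is NOT divisible by the
# tensor-parallel degree. 65535 % 2 != 0, so it triggers the mismatch with
# --tp 2, whereas 65536 does not.
config = LlamaConfig(
    vocab_size=65535,  # not divisible by tp_size=2
)
```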
`benchmark.py` (add to the end of the `main` function):
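The appended code is likewise not shown above; here is a hedged sketch of the reload step, assuming the benchmark's `main()` already has a `model` and a ColossalAI `booster` in scope and that the checkpoint is exported via `booster.save_model`. The checkpoint path and `shard=True` are assumptions.

```python
import torch.distributed as dist
from transformers import AutoModelForCausalLM

# `booster` and `model` come from the surrounding main() in the benchmark.
# Save the tensor-parallel-trained model, then try to reload the exported
# checkpoint with plain HuggingFace transformers.
ckpt_dir = "./ckpt"  # hypothetical path
booster.save_model(model, ckpt_dir, shard=True)
dist.barrier()

# This is where the RuntimeError (size mismatch) is raised when
# vocab_size % tp_size != 0.
model = AutoModelForCausalLM.from_pretrained(ckpt_dir)
```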
entrypoint:
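The actual launch command is not preserved; a plausible entrypoint, assuming the standard ColossalAI launcher and the benchmark's `--tp`/`--pp` flags mentioned below, would look something like:

```bash
# Hypothetical launch: 8 GPUs on one node, tensor parallel size 2.
colossalai run --nproc_per_node 8 benchmark.py --tp 2
```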
The script fails with a RuntimeError after executing `model = AutoModelForCausalLM.from_pretrained()`.

Others
- No error reported if we set `vocab_size=65536`.
- No error reported if we set `--tp 1 --pp 2`.
- Similar error reported if we set `--tp 2 --pp 2`.
.Environment
colossalai: latest (8b0ed61)
cluster: single node with 8 × H20 GPUs
Feel free to ask for further environment information (but I think it's probably not crucial to this issue ^_^).