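When fine-tuning with LoRA, if the embedding layer is slightly larger than the tokenizer vocabulary, does setting "resize_vocab=True" use the previously unused part of the embedding layer, or does it enlarge the embedding layer? #4807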

CloudyDory · 2024-07-13T13:56:20Z

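I have recently been using LoRA to fine-tune a pretrained model, with the same vocabulary that was used during pretraining. However, the LoRA run prints the following warnings: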

[WARNING|logging.py:313] 2024-07-13 21:23:37,098 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
07/13/2024 21:23:37 - INFO - llamafactory.data.template - Replace eos token: <|eot_id|>
07/13/2024 21:23:37 - WARNING - llamafactory.data.template - New tokens have been added, make sure `resize_vocab` is True.
07/13/2024 21:23:37 - INFO - llamafactory.data.loader - Loading dataset Alpaca-En/data.json...
07/13/2024 21:23:37 - INFO - llamafactory.hparams.parser - Process rank: 1, device: cuda:1, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
07/13/2024 21:23:37 - INFO - llamafactory.data.template - Replace eos token: <|eot_id|>
07/13/2024 21:23:37 - WARNING - llamafactory.data.template - New tokens have been added, make sure `resize_vocab` is True.

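The vocabulary I used for pretraining contains 50280 tokens, but the network's embedding_size is 51200, so the embedding layer still has a small amount of spare room. If I set "resize_vocab=True", will it use the previously unused part of the embedding layer, or will it enlarge the embedding layer?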
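
To make the numbers in the question concrete, here is a minimal sketch of the size comparison involved. It does not show what LLaMA-Factory actually does when `resize_vocab` is set; it only works out how many embedding rows are spare and whether a slightly larger tokenizer would still fit. The helper names `embedding_headroom` and `must_grow` are hypothetical, `added_tokens = 1` is an assumption (for example, only `<|eot_id|>` being new), and the sketch assumes token ids are contiguous from 0.

```python
def embedding_headroom(tokenizer_len: int, embedding_rows: int) -> int:
    """Number of embedding rows that no token id currently addresses."""
    return embedding_rows - tokenizer_len


def must_grow(tokenizer_len: int, embedding_rows: int) -> bool:
    """True only if some token id would fall outside the embedding matrix."""
    return tokenizer_len > embedding_rows


vocab_size = 50280      # from the question
embedding_rows = 51200  # from the question
added_tokens = 1        # assumption: a single new special token

new_len = vocab_size + added_tokens
print("spare rows before adding tokens:", embedding_headroom(vocab_size, embedding_rows))
print("embedding must grow to fit new tokens:", must_grow(new_len, embedding_rows))
```

In a real run, the two sizes could be read from `len(tokenizer)` and `model.get_input_embeddings().weight.shape[0]` in Hugging Face Transformers. Under the assumptions above, the new token id would still index an existing row, so enlarging the matrix would not be strictly required for the ids to be valid.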
