Vision Transformer cifar10 bug #134
Comments
I got the same problem. If I change the config file to vit_pipeline.py, the error becomes: TypeError: layer_norm(): argument 'input' (position 1) must be Tensor, not list
hpcaitech/ColossalAI#1100
Thanks, Liu. I pulled the latest code of ColossalAI and ColossalAI-Examples, then I got another error:
Traceback (most recent call last):
File "train_with_cifar10.py", line 13, in <module>
from titans.model.vit.vit import _create_vit_model
File "/root/conda/envs/colossalai/lib/python3.8/site-packages/titans/__init__.py", line 3, in <module>
from . import model
File "/root/conda/envs/colossalai/lib/python3.8/site-packages/titans/model/__init__.py", line 2, in <module>
from . import gpt
File "/root/conda/envs/colossalai/lib/python3.8/site-packages/titans/model/gpt/__init__.py", line 1, in <module>
from .gpt import *
File "/root/conda/envs/colossalai/lib/python3.8/site-packages/titans/model/gpt/gpt.py", line 6, in <module>
from colossalai.builder.pipeline import partition_uniform
ModuleNotFoundError: No module named 'colossalai.builder.pipeline'
Even after I solved this problem, I got another error:
Traceback (most recent call last):
File "train_with_cifar10.py", line 119, in <module>
main()
File "train_with_cifar10.py", line 54, in main
model = _create_vit_model(**model_kwargs)
File "/root/conda/envs/colossalai/lib/python3.8/site-packages/titans/model/vit/vit.py", line 103, in _create_vit_model
model = VisionTransformer(**model_kwargs)
File "/root/conda/envs/colossalai/lib/python3.8/site-packages/colossalai/utils/model/utils.py", line 52, in wrapper
f(module, *args, **kwargs)
File "/root/conda/envs/colossalai/lib/python3.8/site-packages/titans/decorator/no_support.py", line 57, in new_init
origin_init(*args, **kwargs)
TypeError: __init__() got an unexpected keyword argument 'hidden_size'
I think your problem will be resolved by pulling the latest code of Titans as well. Sorry about the unstable APIs; we will improve this in a future release.
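A quick way to spot this kind of version mismatch is to print the installed versions of both packages before training. The sketch below uses only the standard library. The distribution names `colossalai` and `titans` are assumed to match the names used at install time, and the helper `installed_versions` is hypothetical, not part of either library.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package: version string, or None if it is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Distribution names are assumptions; adjust them to your environment.
    for name, version in installed_versions(["colossalai", "titans"]).items():
        print(f"{name}: {version or 'not installed'}")
```

If one package is much older than the other, import errors and unexpected-keyword errors like the ones above are a plausible symptom.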
Thanks, Liu. The problem was solved by reinstalling I used 4 A6000 GPUs with
Hi @edwardhorp, thank you for your feedback. We have located the cause and are working on it. We will let you know once it is fixed!
The training process got stuck because different pipeline stages ended up with different overflow statuses. If a rank that overflowed does not join the gradient norm clipping, the all-reduce waits forever for it. This bug has been fixed in PR hpcaitech/ColossalAI#1175.
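The failure mode described above can be illustrated without any GPUs. The following is a toy simulation, not the actual change made in the PR: each pipeline stage holds a local overflow flag, and the hypothetical `sync_overflow` stands in for an all-reduce with a logical OR, so that every stage reaches the same decision and either all of them or none of them enter the clipping collective.

```python
from typing import List


def sync_overflow(local_flags: List[bool]) -> List[bool]:
    """Emulate an all-reduce (logical OR) of per-stage overflow flags."""
    any_overflow = any(local_flags)
    return [any_overflow] * len(local_flags)


def stages_entering_clip(flags: List[bool]) -> List[int]:
    """Indices of stages that would call the gradient-clipping collective."""
    return [rank for rank, overflow in enumerate(flags) if not overflow]


def would_hang(flags: List[bool]) -> bool:
    """A collective hangs when only some of the stages take part in it."""
    participants = len(stages_entering_clip(flags))
    return 0 < participants < len(flags)


# Stage 1 overflowed, the others did not.
local = [False, True, False, False]
print(would_hang(local))                 # True: stage 1 skips the collective
print(would_hang(sync_overflow(local)))  # False: every stage skips together
```

The design point is that the skip-or-clip decision has to be agreed on by all ranks before any collective call that depends on it.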
🐛 Describe the bug
When I run a ViT experiment with the following command
I got
Environment
I installed ColossalAI via
Other environment information is collected via this