This repository has been archived by the owner on Mar 23, 2023. It is now read-only.

Vision Transformer cifar10 bug #134

Open
gaow0007 opened this issue Jun 8, 2022 · 7 comments · Fixed by #149

Comments

@gaow0007

gaow0007 commented Jun 8, 2022

🐛 Describe the bug

When I run a ViT experiment with the following command:

node=76
prefix="srun --nodes=1 --gres=gpu:4 --cpus-per-task=4 --ntasks=1 -w SG-IDC1-10-51-2-$node"
$prefix colossalai run --nproc_per_node 4  train_with_cifar10.py --config configs/vit_1d_tp2_pp2.py --host=10.51.2.$node

I get the following error:

tensor shape 128
Traceback (most recent call last):
  File "train_with_cifar10.py", line 122, in <module>
tensor shape 128
Traceback (most recent call last):
  File "train_with_cifar10.py", line 122, in <module>
    main()
  File "train_with_cifar10.py", line 116, in main
    main()
  File "train_with_cifar10.py", line 116, in main
    engine.execute_schedule(data_iter, return_output_label=False)
  File "/mnt/lustre/wgao/miniconda3/envs/ColossalAI/lib/python3.8/site-packages/colossalai/engine/_base_engine.py", line 198, in execute_schedule
    output, label, loss = self._schedule.forward_backward_step(self, data_iter, **kwargs)
  File "/mnt/lustre/wgao/miniconda3/envs/ColossalAI/lib/python3.8/site-packages/colossalai/engine/schedule/_pipeline_schedule.py", line 303, in forward_backward_step
    engine.execute_schedule(data_iter, return_output_label=False)
  File "/mnt/lustre/wgao/miniconda3/envs/ColossalAI/lib/python3.8/site-packages/colossalai/engine/_base_engine.py", line 198, in execute_schedule
    output, label, loss = self._schedule.forward_backward_step(self, data_iter, **kwargs)
  File "/mnt/lustre/wgao/miniconda3/envs/ColossalAI/lib/python3.8/site-packages/colossalai/engine/schedule/_pipeline_schedule.py", line 303, in forward_backward_step
    input_tensor = comm.recv_forward(ft_shape,
  File "/mnt/lustre/wgao/miniconda3/envs/ColossalAI/lib/python3.8/site-packages/colossalai/communication/p2p.py", line 194, in recv_forward
    input_tensor = comm.recv_forward(ft_shape,
  File "/mnt/lustre/wgao/miniconda3/envs/ColossalAI/lib/python3.8/site-packages/colossalai/communication/p2p.py", line 194, in recv_forward
    input_tensor, _ = _communicate(recv_prev=True,
  File "/mnt/lustre/wgao/miniconda3/envs/ColossalAI/lib/python3.8/site-packages/colossalai/communication/p2p.py", line 119, in _communicate
    input_tensor, _ = _communicate(recv_prev=True,
  File "/mnt/lustre/wgao/miniconda3/envs/ColossalAI/lib/python3.8/site-packages/colossalai/communication/p2p.py", line 119, in _communicate
    tensor_recv_prev, recv_prev_split = create_recv_buffer_with_shapes(recv_prev_shape, dtype,
  File "/mnt/lustre/wgao/miniconda3/envs/ColossalAI/lib/python3.8/site-packages/colossalai/communication/p2p.py", line 49, in create_recv_buffer_with_shapes
    tensor_recv_prev, recv_prev_split = create_recv_buffer_with_shapes(recv_prev_shape, dtype,
  File "/mnt/lustre/wgao/miniconda3/envs/ColossalAI/lib/python3.8/site-packages/colossalai/communication/p2p.py", line 49, in create_recv_buffer_with_shapes
    recv_chunk_shape, recv_split = _get_tensor_shape(recv_shape, scatter_gather_tensors)
  File "/mnt/lustre/wgao/miniconda3/envs/ColossalAI/lib/python3.8/site-packages/colossalai/communication/p2p.py", line 30, in _get_tensor_shape
    recv_chunk_shape, recv_split = _get_tensor_shape(recv_shape, scatter_gather_tensors)
  File "/mnt/lustre/wgao/miniconda3/envs/ColossalAI/lib/python3.8/site-packages/colossalai/communication/p2p.py", line 30, in _get_tensor_shape
    tensor_chunk_shape = reduce(operator.mul, tensor_shape, 1)
TypeError: reduce() arg 2 must support iteration
    tensor_chunk_shape = reduce(operator.mul, tensor_shape, 1)
TypeError: reduce() arg 2 must support iteration
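
The "tensor shape 128" line printed before the traceback suggests that the shape reaches `_get_tensor_shape` as a bare integer rather than a tuple, which would explain why `reduce` cannot iterate over it. The sketch below reproduces that error in isolation and shows one defensive way to normalize the shape. The helper name `numel` is hypothetical and used only for illustration; it is not the fix applied in ColossalAI.

```python
import operator
from functools import reduce


def numel(tensor_shape):
    """Return the element count for a shape given as an int or a sequence of ints.

    Hypothetical helper for illustration only, not part of ColossalAI.
    """
    if isinstance(tensor_shape, int):
        # Wrap a bare int such as 128 so that reduce() can iterate over it.
        tensor_shape = (tensor_shape,)
    return reduce(operator.mul, tensor_shape, 1)


# Reproduce the failure from the traceback: an int is not iterable.
try:
    reduce(operator.mul, 128, 1)
except TypeError as exc:
    print("TypeError:", exc)

# The normalized version accepts both forms.
print(numel(128))
print(numel((2, 3, 4)))
```

If the shape really is passed as an int, writing it as a one-element tuple such as `(128,)` at the call site would avoid the error as well.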

Environment

I installed ColossalAI via:

pip install colossalai==0.1.6+torch1.10cu10.2 -f https://release.colossalai.org

Other environment information:

PyTorch version: 1.11.0+cu102
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A

OS: CentOS Linux 7 (Core) (x86_64)
GCC version: (GCC) 5.3.0
Clang version: Could not collect
CMake version: version 3.19.3
Libc version: glibc-2.17

Python version: 3.8.13 (default, Mar 28 2022, 11:38:47)  [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-3.10.0-693.el7.x86_64-x86_64-with-glibc2.17
Is CUDA available: True
CUDA runtime version: 10.1.243
GPU models and configuration: 
GPU 0: Tesla V100-PCIE-32GB
GPU 1: Tesla V100-PCIE-32GB
GPU 2: Tesla V100-PCIE-32GB
GPU 3: Tesla V100-PCIE-32GB
GPU 4: Tesla V100-PCIE-32GB
GPU 5: Tesla V100-PCIE-32GB
GPU 6: Tesla V100-PCIE-32GB
GPU 7: Tesla V100-PCIE-32GB

Nvidia driver version: 470.63.01
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] colossalai==0.1.6+torch1.10cu10.2
[pip3] numpy==1.22.4
[pip3] torch==1.11.0
[pip3] torchvision==0.12.0
[conda] colossalai                0.1.6+torch1.10cu10.2          pypi_0    pypi
[conda] numpy                     1.22.4                   pypi_0    pypi
[conda] torch                     1.11.0                   pypi_0    pypi
[conda] torchvision               0.12.0                   pypi_0    pypi
@edwardhorp

I got the same problem. If I change the config file to vit_pipeline.py, the error becomes:

TypeError: layer_norm(): argument 'input' (position 1) must be Tensor, not list

@YuliangLiu0306
Contributor

YuliangLiu0306 commented Jun 13, 2022

hpcaitech/ColossalAI#1100
This PR resolved the related bugs. You can try again with the latest code from the main branch.

@edwardhorp

Thanks, Liu. I pulled the latest code of ColossalAI and ColossalAI-Examples, and then got another error, this time from titans:

Traceback (most recent call last):
  File "train_with_cifar10.py", line 13, in <module>
    from titans.model.vit.vit import _create_vit_model
  File "/root/conda/envs/colossalai/lib/python3.8/site-packages/titans/__init__.py", line 3, in <module>
    from . import model
  File "/root/conda/envs/colossalai/lib/python3.8/site-packages/titans/model/__init__.py", line 2, in <module>
    from . import gpt
  File "/root/conda/envs/colossalai/lib/python3.8/site-packages/titans/model/gpt/__init__.py", line 1, in <module>
    from .gpt import *
  File "/root/conda/envs/colossalai/lib/python3.8/site-packages/titans/model/gpt/gpt.py", line 6, in <module>
    from colossalai.builder.pipeline import partition_uniform
ModuleNotFoundError: No module named 'colossalai.builder.pipeline'

Even after I worked around this problem, I got another error from titans:

Traceback (most recent call last):
  File "train_with_cifar10.py", line 119, in <module>
    main()
  File "train_with_cifar10.py", line 54, in main
    model = _create_vit_model(**model_kwargs)
  File "/root/conda/envs/colossalai/lib/python3.8/site-packages/titans/model/vit/vit.py", line 103, in _create_vit_model
    model = VisionTransformer(**model_kwargs)
  File "/root/conda/envs/colossalai/lib/python3.8/site-packages/colossalai/utils/model/utils.py", line 52, in wrapper
    f(module, *args, **kwargs)
  File "/root/conda/envs/colossalai/lib/python3.8/site-packages/titans/decorator/no_support.py", line 57, in new_init
    origin_init(*args, **kwargs)
TypeError: __init__() got an unexpected keyword argument 'hidden_size'

@YuliangLiu0306
Contributor

I think your problem will be resolved by pulling the latest Titans code as well. Sorry about the unstable APIs; we will improve this in a future release.

@edwardhorp

Thanks, Liu. The problem was solved by reinstalling titans. However, training now gets stuck at step 86/196.

I used 4 A6000 GPUs with colossalai run --nproc_per_node 4 train_with_cifar10.py --config configs/vit_1d_tp2_pp2.py

@binmakeswell
Member

Hi @edwardhorp, thank you for your feedback. We have located the cause and are working on it. We will let you know once it is fixed!

@YuliangLiu0306
Contributor

YuliangLiu0306 commented Jun 28, 2022

Training gets stuck because different pipeline stages end up with different overflow statuses: if a rank that detected overflow does not take part in gradient-norm clipping, the all-reduce in that step never completes and the other ranks wait forever. This bug has been fixed in PR hpcaitech/ColossalAI#1175.
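
The hang described above can be illustrated with a toy model that uses only the Python standard library. This is a sketch under stated assumptions, not ColossalAI's code: each thread stands in for a pipeline rank, a `threading.Barrier` stands in for the all-reduce inside gradient-norm clipping, and the function name `run_step` and its parameters are hypothetical. The `sync_overflow` flag models agreeing on the overflow status across ranks before anyone decides to skip the step, which is one common way to avoid this kind of mismatch.

```python
import threading


def run_step(overflow_flags, sync_overflow, timeout=0.2):
    """Simulate one optimizer step; return 'stepped', 'skipped' or 'hung' per rank."""
    world_size = len(overflow_flags)
    # Stand-in for the collective: it completes only if every rank joins.
    clip_barrier = threading.Barrier(world_size)
    results = [None] * world_size
    # Stand-in for an all-reduce (logical OR) of the overflow flag across ranks.
    any_overflow = any(overflow_flags)

    def rank_fn(rank):
        overflow = any_overflow if sync_overflow else overflow_flags[rank]
        if overflow:
            # This rank skips the step and never joins the collective.
            results[rank] = "skipped"
            return
        try:
            clip_barrier.wait(timeout=timeout)
            results[rank] = "stepped"
        except threading.BrokenBarrierError:
            # A real all-reduce has no timeout, so this rank would wait forever.
            results[rank] = "hung"

    threads = [threading.Thread(target=rank_fn, args=(r,)) for r in range(world_size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


# Only rank 1 sees overflow; the status is not shared, so the others wait on it.
print(run_step([False, True, False, False], sync_overflow=False))
# With a shared overflow status, every rank skips the step together.
print(run_step([False, True, False, False], sync_overflow=True))
```

In the first call the ranks without overflow wait on a collective that the overflowing rank never joins, which matches the symptom of training stalling partway through an epoch. In the second call all ranks take the same branch, so nobody is left waiting.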
