
[Compatibility] Running OPT using PyTorch 1.12 and Gemini placement_policy = 'cuda' failed #166

Open
feifeibear opened this issue Jul 28, 2022 · 3 comments

Comments

@feifeibear
Contributor

feifeibear commented Jul 28, 2022

🐛 Describe the bug

Just running examples/language/opt/run_clm.py will reproduce the error.
The program crashed with no error information.
After I replaced placement_policy with 'cuda', it works:

    placement_policy = 'cuda'  # switching the placement policy to 'cuda' avoids the crash
    chunk_manager = ChunkManager(chunk_size, process_group=pg,
                                 enable_distributed_storage=True,
                                 init_device=GeminiManager.get_default_device(placement_policy))
    gemini_manager = GeminiManager(placement_policy, chunk_manager)
    model = ZeroDDP(model, gemini_manager)
    logger.info(f'{model.__class__.__name__} has been created', ranks=[0])
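
Side note: to flip between policies while debugging, here is a minimal sketch that reads the policy from the environment. It reuses the variables and imports from the snippet above (chunk_size, pg, model, logger, ChunkManager, GeminiManager, ZeroDDP as in run_clm.py); the PLACEMENT_POLICY environment variable name is purely an assumption for illustration, not part of the example script.

    import os

    # Hypothetical knob: choose the Gemini placement policy from the environment
    # so 'cuda', 'cpu', or 'auto' can be tried without editing run_clm.py.
    placement_policy = os.environ.get('PLACEMENT_POLICY', 'cuda')

    chunk_manager = ChunkManager(chunk_size, process_group=pg,
                                 enable_distributed_storage=True,
                                 init_device=GeminiManager.get_default_device(placement_policy))
    gemini_manager = GeminiManager(placement_policy, chunk_manager)
    model = ZeroDDP(model, gemini_manager)
    logger.info(f'Using Gemini placement policy: {placement_policy}', ranks=[0])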

Environment

colossalai 0.1.8+torch1.12cu11.3

@feifeibear
Contributor Author

I also tried placement_policy = 'cpu', and it also crashed.
The error stack is listed below:

0%| | 0/444 [00:00<?, ?it/s]use_cache=True is incompatible with gradient checkpointing. Setting use_cache=False...
Traceback (most recent call last):
File "run_clm.py", line 575, in
main()
File "run_clm.py", line 528, in main
optimizer.backward(loss)
File "/home/lcfjr/codes/ColossalAI/colossalai/zero/zero_optimizer.py", line 151, in backward
self.module.backward(loss)
File "/home/lcfjr/codes/ColossalAI/colossalai/nn/parallel/data_parallel.py", line 246, in backward
loss.backward()
File "/home/lcfjr/miniconda3/envs/dev/lib/python3.8/site-packages/torch/_tensor.py", line 388, in backward
return handle_torch_function(
File "/home/lcfjr/miniconda3/envs/dev/lib/python3.8/site-packages/torch/overrides.py", line 1498, in handle_torch_function
result = torch_func_method(public_api, types, args, kwargs)
File "/home/lcfjr/codes/ColossalAI/colossalai/tensor/colo_tensor.py", line 171, in torch_function
ret = func(*args, **kwargs)
File "/home/lcfjr/miniconda3/envs/dev/lib/python3.8/site-packages/torch/_tensor.py", line 396, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/home/lcfjr/miniconda3/envs/dev/lib/python3.8/site-packages/torch/autograd/init.py", line 173, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "/home/lcfjr/miniconda3/envs/dev/lib/python3.8/site-packages/torch/autograd/function.py", line 253, in apply
return user_fn(self, *args)
File "/home/lcfjr/miniconda3/envs/dev/lib/python3.8/site-packages/torch/utils/checkpoint.py", line 130, in backward
outputs = ctx.run_function(*detached_inputs)
File "/home/lcfjr/miniconda3/envs/dev/lib/python3.8/site-packages/transformers/models/opt/modeling_opt.py", line 674, in custom_forward
return module(*inputs, output_attentions, None)
File "/home/lcfjr/miniconda3/envs/dev/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/lcfjr/miniconda3/envs/dev/lib/python3.8/site-packages/transformers/models/opt/modeling_opt.py", line 315, in forward
hidden_states = self.self_attn_layer_norm(hidden_states)
File "/home/lcfjr/miniconda3/envs/dev/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/lcfjr/miniconda3/envs/dev/lib/python3.8/site-packages/torch/nn/modules/normalization.py", line 189, in forward
return F.layer_norm(
File "/home/lcfjr/miniconda3/envs/dev/lib/python3.8/site-packages/torch/nn/functional.py", line 2503, in layer_norm
return torch.layer_norm(input, normalized_shape, weight, bias, eps, torch.backends.cudnn.enabled)
RuntimeError: The tensor has a non-zero number of elements, but its data is not allocated yet. Caffe2 uses a lazy allocation, so you will need to call mutable_data() or raw_mutable_data() to actually allocate memory.
0%| | 0/444 [00:06<?, ?it/s]
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2895986) of binary: /home/lcfjr/miniconda3/envs/dev/bin/python3
Traceback (most recent call last):
File "/home/lcfjr/miniconda3/envs/dev/bin/torchrun", line 33, in
sys.exit(load_entry_point('torch', 'console_scripts', 'torchrun')())
File "/home/lcfjr/miniconda3/envs/dev/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 345, in wrapper
return f(*args, **kwargs)
File "/home/lcfjr/miniconda3/envs/dev/lib/python3.8/site-packages/torch/distributed/run.py", line 761, in main
run(args)
File "/home/lcfjr/miniconda3/envs/dev/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
elastic_launch(
File "/home/lcfjr/miniconda3/envs/dev/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/lcfjr/miniconda3/envs/dev/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
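
For what it's worth, the last frames suggest F.layer_norm is touching a LayerNorm weight whose storage has not been materialized at that point in the checkpoint recomputation. Below is a standalone sketch that reproduces the same RuntimeError by freeing a tensor's storage while keeping its shape, which is the general mechanism chunk/offload schemes use to release parameter memory. It is only an illustration of the symptom, not ColossalAI code, and my reading of the stack may be wrong.

    import torch
    import torch.nn.functional as F

    weight = torch.ones(16)
    bias = torch.zeros(16)
    x = torch.randn(4, 16)

    # Free the weight's storage while keeping its shape metadata, similar to
    # how chunk/offload schemes release parameter memory between uses.
    weight.storage().resize_(0)

    # Touching the tensor now raises the same error as in the traceback:
    # "The tensor has a non-zero number of elements, but its data is not
    #  allocated yet. ..."
    F.layer_norm(x, (16,), weight, bias)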

@wgimperial

Encountered the same problem; is there a solution?

@virgulvirgul

"After I replaced placement_policy with 'cuda', it works."

Got the same error; it was fixed after making these changes.
