Reminder
System Info
llamafactory version: 0.9.2.dev0
Reproduction
0%| | 0/1000 [00:00<?, ?it/s]
/usr/local/miniconda3/envs/llama_factory/lib/python3.10/site-packages/torch/utils/checkpoint.py:87: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
warnings.warn(
/usr/local/miniconda3/envs/llama_factory/lib/python3.10/site-packages/torch/utils/checkpoint.py:87: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
warnings.warn(
/usr/local/miniconda3/envs/llama_factory/lib/python3.10/site-packages/torch/utils/checkpoint.py:87: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
warnings.warn(
/usr/local/miniconda3/envs/llama_factory/lib/python3.10/site-packages/torch/utils/checkpoint.py:87: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
warnings.warn(
[WARNING|logging.py:168] 2024-12-20 18:40:09,732 >> use_cache=True is incompatible with gradient checkpointing. Setting use_cache=False.
[rank1]: Traceback (most recent call last):
[rank1]: File "/home/lzx/LLaMA-Factory/src/llamafactory/launcher.py", line 23, in
[rank1]: launch()
[rank1]: File "/home/lzx/LLaMA-Factory/src/llamafactory/launcher.py", line 19, in launch
[rank1]: run_exp()
[rank1]: File "/home/lzx/LLaMA-Factory/src/llamafactory/train/tuner.py", line 50, in run_exp
[rank1]: run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
[rank1]: File "/home/lzx/LLaMA-Factory/src/llamafactory/train/sft/workflow.py", line 163, in run_sft
[rank1]: train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
[rank1]: File "/usr/local/miniconda3/envs/llama_factory/lib/python3.10/site-packages/transformers/trainer.py", line 2122, in train
[rank1]: return inner_training_loop(
[rank1]: File "/usr/local/miniconda3/envs/llama_factory/lib/python3.10/site-packages/transformers/trainer.py", line 2474, in _inner_training_loop
[rank1]: tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
[rank1]: File "/usr/local/miniconda3/envs/llama_factory/lib/python3.10/site-packages/transformers/trainer.py", line 3606, in training_step
[rank1]: self.accelerator.backward(loss, **kwargs)
[rank1]: File "/usr/local/miniconda3/envs/llama_factory/lib/python3.10/site-packages/accelerate/accelerator.py", line 2238, in backward
[rank1]: self.deepspeed_engine_wrapped.backward(loss, **kwargs)
[rank1]: File "/usr/local/miniconda3/envs/llama_factory/lib/python3.10/site-packages/accelerate/utils/deepspeed.py", line 186, in backward
[rank1]: self.engine.backward(loss, **kwargs)
[rank1]: File "/usr/local/miniconda3/envs/llama_factory/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
[rank1]: ret_val = func(*args, **kwargs)
[rank1]: File "/usr/local/miniconda3/envs/llama_factory/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2020, in backward
[rank1]: self.optimizer.backward(loss, retain_graph=retain_graph)
[rank1]: File "/usr/local/miniconda3/envs/llama_factory/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
[rank1]: ret_val = func(*args, **kwargs)
[rank1]: File "/usr/local/miniconda3/envs/llama_factory/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 2259, in backward
[rank1]: self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
[rank1]: File "/usr/local/miniconda3/envs/llama_factory/lib/python3.10/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
[rank1]: scaled_loss.backward(retain_graph=retain_graph)
[rank1]: File "/usr/local/miniconda3/envs/llama_factory/lib/python3.10/site-packages/torch/_tensor.py", line 581, in backward
[rank1]: torch.autograd.backward(
[rank1]: File "/usr/local/miniconda3/envs/llama_factory/lib/python3.10/site-packages/torch/autograd/init.py", line 347, in backward
[rank1]: _engine_run_backward(
[rank1]: File "/usr/local/miniconda3/envs/llama_factory/lib/python3.10/site-packages/torch/autograd/graph.py", line 825, in _engine_run_backward
[rank1]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
[rank1]: File "/usr/local/miniconda3/envs/llama_factory/lib/python3.10/site-packages/torch/autograd/function.py", line 307, in apply
[rank1]: return user_fn(self, *args)
[rank1]: File "/usr/local/miniconda3/envs/llama_factory/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 317, in backward
[rank1]: raise RuntimeError(
[rank1]: RuntimeError: none of output has requires_grad=True, this checkpoint() is not necessary
[rank2]: Traceback (most recent call last):
[rank2]: File "/home/lzx/LLaMA-Factory/src/llamafactory/launcher.py", line 23, in
[rank2]: launch()
[rank2]: File "/home/lzx/LLaMA-Factory/src/llamafactory/launcher.py", line 19, in launch
[rank2]: run_exp()
[rank2]: File "/home/lzx/LLaMA-Factory/src/llamafactory/train/tuner.py", line 50, in run_exp
[rank2]: run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
[rank2]: File "/home/lzx/LLaMA-Factory/src/llamafactory/train/sft/workflow.py", line 163, in run_sft
[rank2]: train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
[rank2]: File "/usr/local/miniconda3/envs/llama_factory/lib/python3.10/site-packages/transformers/trainer.py", line 2122, in train
[rank2]: return inner_training_loop(
[rank2]: File "/usr/local/miniconda3/envs/llama_factory/lib/python3.10/site-packages/transformers/trainer.py", line 2474, in _inner_training_loop
[rank2]: tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
[rank2]: File "/usr/local/miniconda3/envs/llama_factory/lib/python3.10/site-packages/transformers/trainer.py", line 3606, in training_step
[rank2]: self.accelerator.backward(loss, **kwargs)
[rank2]: File "/usr/local/miniconda3/envs/llama_factory/lib/python3.10/site-packages/accelerate/accelerator.py", line 2238, in backward
[rank2]: self.deepspeed_engine_wrapped.backward(loss, **kwargs)
[rank2]: File "/usr/local/miniconda3/envs/llama_factory/lib/python3.10/site-packages/accelerate/utils/deepspeed.py", line 186, in backward
[rank2]: self.engine.backward(loss, **kwargs)
[rank2]: File "/usr/local/miniconda3/envs/llama_factory/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
[rank2]: ret_val = func(*args, **kwargs)
[rank2]: File "/usr/local/miniconda3/envs/llama_factory/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2020, in backward
[rank2]: self.optimizer.backward(loss, retain_graph=retain_graph)
[rank2]: File "/usr/local/miniconda3/envs/llama_factory/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
[rank2]: ret_val = func(*args, **kwargs)
[rank2]: File "/usr/local/miniconda3/envs/llama_factory/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 2259, in backward
[rank2]: self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
[rank2]: File "/usr/local/miniconda3/envs/llama_factory/lib/python3.10/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
[rank2]: scaled_loss.backward(retain_graph=retain_graph)
[rank2]: File "/usr/local/miniconda3/envs/llama_factory/lib/python3.10/site-packages/torch/_tensor.py", line 581, in backward
[rank2]: torch.autograd.backward(
[rank2]: File "/usr/local/miniconda3/envs/llama_factory/lib/python3.10/site-packages/torch/autograd/init.py", line 347, in backward
[rank2]: _engine_run_backward(
[rank2]: File "/usr/local/miniconda3/envs/llama_factory/lib/python3.10/site-packages/torch/autograd/graph.py", line 825, in _engine_run_backward
[rank2]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
[rank2]: File "/usr/local/miniconda3/envs/llama_factory/lib/python3.10/site-packages/torch/autograd/function.py", line 307, in apply
[rank2]: return user_fn(self, *args)
[rank2]: File "/usr/local/miniconda3/envs/llama_factory/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 317, in backward
[rank2]: raise RuntimeError(
[rank2]: RuntimeError: none of output has requires_grad=True, this checkpoint() is not necessary
[rank3]: Traceback (most recent call last):
[rank3]: File "/home/lzx/LLaMA-Factory/src/llamafactory/launcher.py", line 23, in
[rank3]: launch()
[rank3]: File "/home/lzx/LLaMA-Factory/src/llamafactory/launcher.py", line 19, in launch
[rank3]: run_exp()
[rank3]: File "/home/lzx/LLaMA-Factory/src/llamafactory/train/tuner.py", line 50, in run_exp
[rank3]: run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
[rank3]: File "/home/lzx/LLaMA-Factory/src/llamafactory/train/sft/workflow.py", line 163, in run_sft
[rank3]: train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
[rank3]: File "/usr/local/miniconda3/envs/llama_factory/lib/python3.10/site-packages/transformers/trainer.py", line 2122, in train
[rank3]: return inner_training_loop(
[rank3]: File "/usr/local/miniconda3/envs/llama_factory/lib/python3.10/site-packages/transformers/trainer.py", line 2474, in _inner_training_loop
[rank3]: tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
[rank3]: File "/usr/local/miniconda3/envs/llama_factory/lib/python3.10/site-packages/transformers/trainer.py", line 3606, in training_step
[rank3]: self.accelerator.backward(loss, **kwargs)
[rank3]: File "/usr/local/miniconda3/envs/llama_factory/lib/python3.10/site-packages/accelerate/accelerator.py", line 2238, in backward
[rank3]: self.deepspeed_engine_wrapped.backward(loss, **kwargs)
[rank3]: File "/usr/local/miniconda3/envs/llama_factory/lib/python3.10/site-packages/accelerate/utils/deepspeed.py", line 186, in backward
[rank3]: self.engine.backward(loss, **kwargs)
[rank3]: File "/usr/local/miniconda3/envs/llama_factory/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
[rank3]: ret_val = func(*args, **kwargs)
[rank3]: File "/usr/local/miniconda3/envs/llama_factory/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2020, in backward
[rank3]: self.optimizer.backward(loss, retain_graph=retain_graph)
[rank3]: File "/usr/local/miniconda3/envs/llama_factory/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
[rank3]: ret_val = func(*args, **kwargs)
[rank3]: File "/usr/local/miniconda3/envs/llama_factory/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 2259, in backward
[rank3]: self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
[rank3]: File "/usr/local/miniconda3/envs/llama_factory/lib/python3.10/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
[rank3]: scaled_loss.backward(retain_graph=retain_graph)
[rank3]: File "/usr/local/miniconda3/envs/llama_factory/lib/python3.10/site-packages/torch/_tensor.py", line 581, in backward
[rank3]: torch.autograd.backward(
[rank3]: File "/usr/local/miniconda3/envs/llama_factory/lib/python3.10/site-packages/torch/autograd/init.py", line 347, in backward
[rank3]: _engine_run_backward(
[rank3]: File "/usr/local/miniconda3/envs/llama_factory/lib/python3.10/site-packages/torch/autograd/graph.py", line 825, in _engine_run_backward
[rank3]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
[rank3]: File "/usr/local/miniconda3/envs/llama_factory/lib/python3.10/site-packages/torch/autograd/function.py", line 307, in apply
[rank3]: return user_fn(self, *args)
[rank3]: File "/usr/local/miniconda3/envs/llama_factory/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 317, in backward
[rank3]: raise RuntimeError(
[rank3]: RuntimeError: none of output has requires_grad=True, this checkpoint() is not necessary
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/lzx/LLaMA-Factory/src/llamafactory/launcher.py", line 23, in
[rank0]: launch()
[rank0]: File "/home/lzx/LLaMA-Factory/src/llamafactory/launcher.py", line 19, in launch
[rank0]: run_exp()
[rank0]: File "/home/lzx/LLaMA-Factory/src/llamafactory/train/tuner.py", line 50, in run_exp
[rank0]: run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
[rank0]: File "/home/lzx/LLaMA-Factory/src/llamafactory/train/sft/workflow.py", line 163, in run_sft
[rank0]: train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
[rank0]: File "/usr/local/miniconda3/envs/llama_factory/lib/python3.10/site-packages/transformers/trainer.py", line 2122, in train
[rank0]: return inner_training_loop(
[rank0]: File "/usr/local/miniconda3/envs/llama_factory/lib/python3.10/site-packages/transformers/trainer.py", line 2474, in _inner_training_loop
[rank0]: tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
[rank0]: File "/usr/local/miniconda3/envs/llama_factory/lib/python3.10/site-packages/transformers/trainer.py", line 3606, in training_step
[rank0]: self.accelerator.backward(loss, **kwargs)
[rank0]: File "/usr/local/miniconda3/envs/llama_factory/lib/python3.10/site-packages/accelerate/accelerator.py", line 2238, in backward
[rank0]: self.deepspeed_engine_wrapped.backward(loss, **kwargs)
[rank0]: File "/usr/local/miniconda3/envs/llama_factory/lib/python3.10/site-packages/accelerate/utils/deepspeed.py", line 186, in backward
[rank0]: self.engine.backward(loss, **kwargs)
[rank0]: File "/usr/local/miniconda3/envs/llama_factory/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
[rank0]: ret_val = func(*args, **kwargs)
[rank0]: File "/usr/local/miniconda3/envs/llama_factory/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2020, in backward
[rank0]: self.optimizer.backward(loss, retain_graph=retain_graph)
[rank0]: File "/usr/local/miniconda3/envs/llama_factory/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
[rank0]: ret_val = func(*args, **kwargs)
[rank0]: File "/usr/local/miniconda3/envs/llama_factory/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 2259, in backward
[rank0]: self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
[rank0]: File "/usr/local/miniconda3/envs/llama_factory/lib/python3.10/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
[rank0]: scaled_loss.backward(retain_graph=retain_graph)
[rank0]: File "/usr/local/miniconda3/envs/llama_factory/lib/python3.10/site-packages/torch/_tensor.py", line 581, in backward
[rank0]: torch.autograd.backward(
[rank0]: File "/usr/local/miniconda3/envs/llama_factory/lib/python3.10/site-packages/torch/autograd/init.py", line 347, in backward
[rank0]: _engine_run_backward(
[rank0]: File "/usr/local/miniconda3/envs/llama_factory/lib/python3.10/site-packages/torch/autograd/graph.py", line 825, in _engine_run_backward
[rank0]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
[rank0]: File "/usr/local/miniconda3/envs/llama_factory/lib/python3.10/site-packages/torch/autograd/function.py", line 307, in apply
[rank0]: return user_fn(self, *args)
[rank0]: File "/usr/local/miniconda3/envs/llama_factory/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 317, in backward
[rank0]: raise RuntimeError(
[rank0]: RuntimeError: none of output has requires_grad=True, this checkpoint() is not necessary
0%| | 0/1000 [00:11<?, ?it/s]
W1220 18:40:21.147000 2883502 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2883573 closing signal SIGTERM
W1220 18:40:21.147000 2883502 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2883574 closing signal SIGTERM
W1220 18:40:21.148000 2883502 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2883576 closing signal SIGTERM
E1220 18:40:21.376000 2883502 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 2 (pid: 2883575) of binary: /usr/local/miniconda3/envs/llama_factory/bin/python
Traceback (most recent call last):
File "/usr/local/miniconda3/envs/llama_factory/bin/torchrun", line 8, in
sys.exit(main())
File "/usr/local/miniconda3/envs/llama_factory/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 355, in wrapper
return f(*args, **kwargs)
File "/usr/local/miniconda3/envs/llama_factory/lib/python3.10/site-packages/torch/distributed/run.py", line 919, in main
run(args)
File "/usr/local/miniconda3/envs/llama_factory/lib/python3.10/site-packages/torch/distributed/run.py", line 910, in run
elastic_launch(
File "/usr/local/miniconda3/envs/llama_factory/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/miniconda3/envs/llama_factory/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
/home/lzx/LLaMA-Factory/src/llamafactory/launcher.py FAILED
Failures:
<NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
time : 2024-12-20_18:40:21
host : gp-SYS-4029GP-TRT2-EC028B
rank : 2 (local_rank: 2)
exitcode : 1 (pid: 2883575)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Expected behavior
I froze most of the model's parameters inside the workflow using the code here.
Training works on a single 3090, and LoRA fine-tuning works on a server with six 4090D cards, but running SFT directly with frozen parameters fails with this error.
I have already set batch_size=1 and cutoff_len=256, yet GPU memory still blows up, so the FAQ you posted cannot solve my problem.
I would like to know what is causing this and how to fix it. Could it be that, when using DeepSpeed, my frozen parameters are not being evenly partitioned across the cards? Many thanks!
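A minimal sketch of a common workaround for this RuntimeError: the reentrant checkpoint implementation (PyTorch's default) raises when every input to a checkpointed block is frozen, so either force the embedding outputs to require gradients or switch to the non-reentrant variant. The freezing rule and model path below are illustrative assumptions, not the actual workflow code:

```python
from transformers import AutoModelForVision2Seq

# Model path taken from the YAML config below.
model = AutoModelForVision2Seq.from_pretrained(
    "/home/Llama-3.2-11B-Vision-Instruct", trust_remote_code=True
)

# Illustrative freezing rule: keep only vision-related parameters trainable.
for name, param in model.named_parameters():
    param.requires_grad = "vision" in name  # replace with your own criterion

# Make the embedding output require grad so the autograd graph stays alive
# through the frozen layers that gradient checkpointing recomputes.
model.enable_input_require_grads()

# The non-reentrant checkpoint variant tolerates blocks whose outputs carry
# no gradient, instead of raising the RuntimeError above.
model.gradient_checkpointing_enable(
    gradient_checkpointing_kwargs={"use_reentrant": False}
)
```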
Others
My YAML parameters are as follows:
### model
model_name_or_path: /home/Llama-3.2-11B-Vision-Instruct
trust_remote_code: true

### method
stage: sft
do_train: true
finetuning_type: full
deepspeed: examples/deepspeed/ds_z3_config.json # choices: [ds_z0_config.json, ds_z2_config.json, ds_z3_config.json]

### dataset
dataset: tenk
# dataset: identity,alpaca_en_demo  # duplicate key commented out; leftover from the example config
template: mllama
cutoff_len: 256
max_samples: 800
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: saves/llama3-11B-Vision-Instruct/tenk/DCT_4_on_2k_5/vision_32/sft
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 1
learning_rate: 1.0e-5
num_train_epochs: 5
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000
optim: paged_adamw_8bit

### eval
val_size: 0.001
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 500
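For comparison, LLaMA-Factory also ships a built-in freeze-tuning mode that freezes parameters through the standard code path instead of a custom workflow patch; a sketch of the method section under the assumption that only a couple of trailing layers should stay trainable (the layer count and module list are placeholders, not tested values):

```yaml
### method (built-in freeze tuning; values are illustrative)
stage: sft
do_train: true
finetuning_type: freeze
freeze_trainable_layers: 2      # placeholder: number of trailing layers kept trainable
freeze_trainable_modules: all   # placeholder: or a comma-separated list of module names
deepspeed: examples/deepspeed/ds_z3_config.json
```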