
Too large training loss #155

Open

qyc-98 opened this issue Jul 13, 2022 · 1 comment

Comments

qyc-98 commented Jul 13, 2022

🐛 Describe the bug

Hi,

I'm training BERT with sequence parallelism in Colossal-AI, following this link. My training loss is too large, and it seems to grow linearly with the sequence parallel size.

when my setting is:
parallel = dict(pipeline=1, tensor=dict(size=8, mode='sequence'))
the training loss at the beginning was … and after 2330 steps the training loss is 13.044

when my setting is:
parallel = dict(pipeline=1, tensor=dict(size=2, mode='sequence'))
after 2330 steps the training loss is 13.044

when my setting is:
parallel = dict(pipeline=1, tensor=dict(size=1, mode='sequence'))
after 2330 steps the training loss is 6.5549
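
For context, the parallel dict above lives in a ColossalAI-style Python config file. Below is a minimal sketch of such a file; only the parallel dict is taken from the runs above, and the other constants (SEQ_LENGTH, BATCH_SIZE) are illustrative placeholders, not my actual values:

```python
# config.py -- minimal sketch of a ColossalAI config file for these runs.
# Only the `parallel` dict is taken from the settings quoted above; the
# other constants are illustrative placeholders, not my actual values.

SEQ_LENGTH = 512   # placeholder: the sequence dimension is split across the 'sequence' group
BATCH_SIZE = 128   # placeholder: matches the batch size reported in the Environment section

# sequence parallelism is selected through the tensor-parallel mode
parallel = dict(
    pipeline=1,
    tensor=dict(size=8, mode='sequence'),  # size was 8 / 2 / 1 in the three runs above
)
```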

Environment

After running colossalai check -i, I got the environment information shown in the attached screenshot.

My devices are 8× RTX 3090, and the training batch size is 128 across all three sequence parallel settings.
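
For reference, assuming the data parallel size is world_size / sequence_parallel_size and that the 128 is the global batch size (both assumptions on my part, not verified values), the three runs would decompose as follows:

```python
# Back-of-envelope sketch: how 8 GPUs split between sequence parallel and
# data parallel groups under the three settings.
# Assumes dp size = world size // sequence parallel size and a global batch of 128.
WORLD_SIZE = 8      # 8x RTX 3090
GLOBAL_BATCH = 128

for sp in (8, 2, 1):
    dp = WORLD_SIZE // sp              # implied data parallel group size
    per_dp_batch = GLOBAL_BATCH // dp  # samples seen by each data parallel rank per step
    print(f"sp={sp}: dp={dp}, batch per dp rank={per_dp_batch}")

# sp=8: dp=1, batch per dp rank=128
# sp=2: dp=4, batch per dp rank=32
# sp=1: dp=8, batch per dp rank=16
```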

My training config is shown in the attached screenshot.

Thanks!

@binmakeswell (Member) commented
Hi @qyc-98, thank you for your feedback. We will try to reproduce your issue.

By the way, we are restructuring the documentation and examples; the new examples will be provided at the following link:
https://github.com/hpcaitech/ColossalAI/tree/main/examples
