
Too large training loss #155

Open

qyc-98 opened this issue Jul 13, 2022 · 1 comment

Comments

qyc-98 commented Jul 13, 2022

🐛 Describe the bug

Hi,

I'm training BERT with sequence parallelism in Colossal-AI, following this link. My training loss is too large, and it seems to grow linearly with the sequence parallel size.

when my setting is:
parallel = dict(pipeline=1, tensor=dict(size=8, mode='sequence'))
the training loss at the beginning was … and after 2330 steps the training loss is 13.044

when my setting is:
parallel = dict(pipeline=1, tensor=dict(size=2, mode='sequence'))
after 2330 steps the training loss is 13.044

when my setting is:
parallel = dict(pipeline=1, tensor=dict(size=1, mode='sequence'))
after 2330 steps the training loss is 6.5549
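
For context, the parallel dict above lives in a ColossalAI-style Python config file. Below is a minimal sketch of such a file; only the parallel dict is taken from the runs above, and the other constants (SEQ_LENGTH, BATCH_SIZE) are illustrative placeholders, not my actual values:

```python
# config.py -- minimal sketch of a ColossalAI config file for these runs.
# Only the `parallel` dict is taken from the settings quoted above; the
# other constants are illustrative placeholders, not my actual values.

SEQ_LENGTH = 512   # placeholder: the sequence dimension is split across the 'sequence' group
BATCH_SIZE = 128   # placeholder: matches the batch size reported in the Environment section

# sequence parallelism is selected through the tensor-parallel mode
parallel = dict(
    pipeline=1,
    tensor=dict(size=8, mode='sequence'),  # size was 8 / 2 / 1 in the three runs above
)
```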

Environment

After running colossalai check -i, I got the environment information shown in the attached screenshot.

My devices are 8× RTX 3090, and the training batch size is 128 across all three sequence parallel settings.
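
For reference, assuming the data parallel size is world_size / sequence_parallel_size and that the 128 is the global batch size (both assumptions on my part, not verified values), the three runs would decompose as follows:

```python
# Back-of-envelope sketch: how 8 GPUs split between sequence parallel and
# data parallel groups under the three settings.
# Assumes dp size = world size // sequence parallel size and a global batch of 128.
WORLD_SIZE = 8      # 8x RTX 3090
GLOBAL_BATCH = 128

for sp in (8, 2, 1):
    dp = WORLD_SIZE // sp              # implied data parallel group size
    per_dp_batch = GLOBAL_BATCH // dp  # samples seen by each data parallel rank per step
    print(f"sp={sp}: dp={dp}, batch per dp rank={per_dp_batch}")

# sp=8: dp=1, batch per dp rank=128
# sp=2: dp=4, batch per dp rank=32
# sp=1: dp=8, batch per dp rank=16
```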

My training config is shown in the attached screenshot.

Thanks!

@binmakeswell (Member) commented
Hi @qyc-98, thank you for your feedback. We will try to reproduce your issue.

By the way, we are restructuring the documentation and examples; the new examples will be provided at the following link:
https://github.com/hpcaitech/ColossalAI/tree/main/examples
