🐛 Describe the bug
Hi
I'm training BERT with sequence parallelism in Colossal-AI following this link, but my training loss is much too large, and it seems to grow with the sequence parallel size.
When my setting is `parallel = dict(pipeline=1, tensor=dict(size=8, mode='sequence'))`, the training loss after 2330 steps is 13.044.
When my setting is `parallel = dict(pipeline=1, tensor=dict(size=2, mode='sequence'))`, the training loss after 2330 steps is 13.044.
When my setting is `parallel = dict(pipeline=1, tensor=dict(size=1, mode='sequence'))`, the training loss after 2330 steps is 6.5549.
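For reference, here is a minimal sketch (my own assumption, not taken from the sequence parallel example) of how one could check whether the logged loss is already averaged across the sequence-parallel ranks; a reported value that scales with the parallel size could simply mean the per-rank losses are being summed when logged. The `group` argument is a placeholder for the sequence-parallel process group, which depends on the Colossal-AI version.

```python
import torch
import torch.distributed as dist

def averaged_loss(local_loss: torch.Tensor, group=None) -> torch.Tensor:
    # Sum the scalar loss over the (hypothetical) sequence-parallel group,
    # then divide by the group size so every rank logs the same mean value.
    loss = local_loss.detach().clone()
    dist.all_reduce(loss, op=dist.ReduceOp.SUM, group=group)
    return loss / dist.get_world_size(group=group)
```

If the value averaged this way matches the size=1 run, the discrepancy would be in how the loss is reported rather than in the optimization itself.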
Environment
After running `colossalai check -i`, I got:
My devices are 8x RTX 3090.
The training batch size is 128 across all three sequence parallel settings.
My training config is:
Thanks!