Wrote simple streaming dataloader #10
Conversation
Does this still work with distributed data parallel? I'm a bit concerned that this setup is more complex than needed. In general I think we'll need to support:
I think we can just get away with using Hugging Face's load_dataset with streaming=True. An example is here, which supports loading tokenized or untokenized datasets. Then we would just need to set it up to work for DDP. I'm not sure of the easiest way; there are probably standard setups here, maybe using a distributed sampler. I've added the above to #8 (sorry, I should have done this earlier).
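For reference, a minimal sketch of the streaming approach, assuming Hugging Face datasets; the dataset name is an illustrative placeholder, not necessarily what this repo uses:

```python
from datasets import load_dataset

# Stream examples instead of downloading the full dataset to disk.
# "roneneldan/TinyStories" is an illustrative placeholder dataset.
ds = load_dataset("roneneldan/TinyStories", split="train", streaming=True)

# Peek at the first couple of streamed examples.
for example in ds.take(2):
    print(example["text"][:80])
```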
When writing it I was thinking about distributed data parallel, and it could work, but I have not been able to test that. Currently the code assumes that the tokenizer comes from tiktoken and that the streamed dataset is not tokenized. I'm open to closing the PR if there is an easier solution; I'd leave that decision to you.
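A rough sketch of what that assumption looks like in practice, tokenizing an untokenized stream with tiktoken; the encoder choice and field names are assumptions on my part:

```python
import tiktoken
from datasets import load_dataset

enc = tiktoken.get_encoding("gpt2")  # encoder choice is an assumption

ds = load_dataset("roneneldan/TinyStories", split="train", streaming=True)

def tokenize(example):
    # Append the end-of-text token so document boundaries survive.
    return {"ids": enc.encode(example["text"]) + [enc.eot_token]}

# map() on a streaming dataset is lazy; tokenization happens on iteration.
tokenized = ds.map(tokenize)
```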
After looking into this a bit more, there is also a built-in in Hugging Face for this, the function
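The function isn't named above, but Hugging Face datasets does ship split_dataset_by_node for sharding a stream across DDP ranks, which may be what's meant; treat that identification as an assumption:

```python
import os
from datasets import load_dataset
from datasets.distributed import split_dataset_by_node

ds = load_dataset("roneneldan/TinyStories", split="train", streaming=True)

# RANK / WORLD_SIZE are the usual torchrun environment variables.
rank = int(os.environ.get("RANK", "0"))
world_size = int(os.environ.get("WORLD_SIZE", "1"))

# Each rank sees a disjoint slice of the streamed dataset.
ds_rank = split_dataset_by_node(ds, rank=rank, world_size=world_size)
```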
Looks good! Noting that I didn't run anything. There are no tests checking things like the dataloader distributing data correctly or the tokenizer working properly. Adding such tests would be valuable, but could also be time-consuming.
At a minimum, you should check that the dataloader gives you different data on each process (once you start doing multiprocessing) and that the tokenizer works as you'd expect.
Good idea! I added a minimal test for the dataloader, in particular that it gives back the right shape and that different DDP ranks give different data.
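For illustration, such a test might look roughly like the sketch below; get_dataloader is a hypothetical stand-in for the PR's actual constructor, so the names and arguments are assumptions:

```python
import torch

def test_ranks_get_different_data():
    # get_dataloader(...) is hypothetical; substitute the PR's real API.
    batches = []
    for rank in range(2):
        loader = get_dataloader(rank=rank, world_size=2, batch_size=4, seq_len=128)
        x = next(iter(loader))
        assert x.shape == (4, 128)  # (batch, sequence) shape check
        batches.append(x)
    # Different ranks should not see identical batches.
    assert not torch.equal(batches[0], batches[1])
```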
Right, the tokenizer is still a bit of a mystery to me, and the current behaviour is likely not what we want. For instance, encoding and decoding a story
gives
Now that you point it out like that, I should at least understand what's happening there before merging...
Now I understand what was going on: the dataloader needs to convert to lowercase for this tokenizer. It works as expected now! By the way, any thoughts on whether to convert to lowercase and remove whitespace, as is currently done? Apart from that, I believe this is ready for merging.
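For concreteness, the kind of normalization being described might look like this; the exact rules in the PR may differ:

```python
def normalize(text: str) -> str:
    # Lowercase and collapse whitespace runs to single spaces,
    # roughly what the dataloader is described as doing.
    return " ".join(text.lower().split())

assert normalize("Once  upon\na Time") == "once upon a time"
```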
No opinions myself; I haven't been following the tokenizer stuff. @nix-apollo?
My opinion is that the tokenizer lowercasing and removing whitespace is worth it for most research use cases. But it also adds complexity and confusion for people, as evidenced here. Tokenizers not preserving all information through a tokenize/detokenize round trip could certainly lead to bugs in the user's code (although at least in this case it's pretty obvious). On balance I'm currently in favor of lowercasing and some form of whitespace normalization, but that's a weakly held opinion.
I see, thanks! Then I propose we stick with the current tokenizer setup for the beta training run and maybe adjust for the final run if we encounter issues with it.
Description
Added a streaming dataloader to avoid storing large datasets.
Related Issue
Progress towards #8