
Wrote simple streaming dataloader #10

Merged
merged 14 commits into main on Oct 17, 2024

Conversation

lennart-finke
Collaborator

Description

Added a streaming dataloader to avoid storing large datasets.

Related Issue

Progress towards #8

@danbraunai-apollo
Collaborator

Does this still work with distributed data parallel? I'm a bit concerned that this setup is more complex than needed. In general I think we'll need to support:

  1. The dataset and tokenizer will be hosted on Hugging Face.
  2. The pre-tokenized dataset will be hosted on Hugging Face (so we don't have to tokenize it on the fly every time we train).

I think we can just get away with using Hugging Face's load_dataset with streaming=True. An example is here, which supports loading tokenized or untokenized datasets. Then we would just need to set it up to work with DDP. I'm not sure of the easiest way; there are probably standard setups for this, maybe using a distributed sampler.

I've added the above to #8 (sorry I should have done this earlier).
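
A minimal sketch of that streaming approach, assuming a placeholder dataset (roneneldan/TinyStories) and the gpt2 tokenizer rather than whatever this repo actually uses:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Placeholder dataset and tokenizer; the real project may use different ones.
dataset = load_dataset("roneneldan/TinyStories", split="train", streaming=True)
tokenizer = AutoTokenizer.from_pretrained("gpt2")


def tokenize(example):
    # With streaming=True, map() is applied lazily while iterating.
    return tokenizer(example["text"])


tokenized = dataset.map(tokenize, remove_columns=["text"])

# Iterate without ever materializing the full dataset on disk.
for i, example in enumerate(tokenized):
    print(example["input_ids"][:10])
    if i == 2:
        break
```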

@lennart-finke
Collaborator Author

When writing it I was thinking about distributed data parallel, and it could work, but I have not been able to test that.

Currently the code assumes that the tokenizer comes from tiktoken and that the streamed dataset is not tokenized.

I'm open to closing the PR if there is an easier solution; I'd leave that decision to you.

@lennart-finke
Collaborator Author

lennart-finke commented Aug 15, 2024

After looking into this a bit more, there is also a built-in in Hugging Face for this, the function split_dataset_by_node. If that makes sense, I'll next adapt the file you mentioned to use it.
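
A minimal sketch of how split_dataset_by_node could shard a streamed dataset across DDP ranks (placeholder dataset name; rank and world size would normally come from the torch.distributed environment):

```python
import os

from datasets import load_dataset
from datasets.distributed import split_dataset_by_node

rank = int(os.environ.get("RANK", 0))
world_size = int(os.environ.get("WORLD_SIZE", 1))

# Placeholder dataset; substitute the project's actual dataset.
dataset = load_dataset("roneneldan/TinyStories", split="train", streaming=True)

# Each rank sees a disjoint part of the stream (split by shard when the shard
# count divides evenly, otherwise by skipping examples), so no distributed
# sampler is needed.
dataset = split_dataset_by_node(dataset, rank=rank, world_size=world_size)

for example in dataset.take(2):
    print(rank, example["text"][:50])
```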

@danbraunai changed the base branch from typing to main on October 15, 2024.
@danbraunai
Owner

Looks good! Noting that I didn't run anything. There are no tests to check for things like the dataloader distributing things correctly and the tokenizer working properly. Adding such tests would be valuable, but also could be time consuming.

At a minimum, you should check that the dataloader does give you different data on each process (when you start doing multiprocessing) and that the tokenizer works as you'd expect.

Resolved review threads (now outdated):
simple_stories_train/tokenizer.py
simple_stories_train/tokenizer/stories-3072.json
simple_stories_train/train_llama.py (two threads)
@lennart-finke
Collaborator Author

Looks good! Noting that I didn't run anything. There are no tests to check for things like the dataloader distributing things correctly and the tokenizer working properly. Adding such tests would be valuable, but also could be time consuming.

Good idea. I added a minimal test for the dataloader, checking in particular that it returns the right shape and that different DDP ranks receive different data.
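
A sketch of roughly what such a test could look like; the dataset and tokenizer here are placeholders, and the batch helper is written inline rather than using the repo's actual dataloader code:

```python
import torch
from datasets import load_dataset
from datasets.distributed import split_dataset_by_node
from transformers import AutoTokenizer


def get_batch(rank: int, world_size: int, batch_size: int, seq_len: int) -> torch.Tensor:
    # Placeholder dataset/tokenizer; the repo's own loader may differ.
    ds = load_dataset("roneneldan/TinyStories", split="train", streaming=True)
    ds = split_dataset_by_node(ds, rank=rank, world_size=world_size)
    tok = AutoTokenizer.from_pretrained("gpt2")
    tok.pad_token = tok.eos_token  # gpt2 has no pad token by default
    rows = [
        tok(x["text"], truncation=True, padding="max_length", max_length=seq_len)["input_ids"]
        for x in ds.take(batch_size)
    ]
    return torch.tensor(rows)


def test_dataloader_shape_and_ddp_split():
    batch_size, seq_len = 4, 128
    batch0 = get_batch(rank=0, world_size=2, batch_size=batch_size, seq_len=seq_len)
    batch1 = get_batch(rank=1, world_size=2, batch_size=batch_size, seq_len=seq_len)

    # Batches have the expected (batch, sequence) shape...
    assert batch0.shape == (batch_size, seq_len)
    # ...and different DDP ranks do not receive identical data.
    assert not torch.equal(batch0, batch1)
```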

At a minimum, you should check that the dataloader does give you different data on each process (when you start doing multiprocessing) and that the tokenizer works as you'd expect.

Right, the tokenizer is still a bit of a mystery to me, and the current behaviour is likely not what we want.

For instance, encoding and decoding a story

In a magical forest, there was a kind bear who loved to help. One day, he found a sparkling gem among the leaves. Curiously, he touched it and turned into a fluffy bunny! The bunny hopped with joy but soon noticed a sad squirrel up in a tree. The bunny hopped closer and asked, "Why are you sad?" The squirrel replied, "I can't find my acorns. I need them for winter!" The bunny wanted to help, so he put on his bunny ears and searched everywhere in the forest. "Don't worry, I'll find your acorns!" he promised.

gives

a magical forest , there was a kind bear who loved to help . day , he found a sparkling gem among the leaves . , he touched it and turned into a fluffy bunny ! bunny hopped with joy but soon noticed a sad squirrel up in a tree . bunny hopped closer and asked , " are you sad ? " squirrel replied , " can ' t find my acorn ##s . need them for winter ! " bunny wanted to help , so he put on his bunny ears and searched everywhere in the forest . " ' t worry , ' l ##l find your acorn ##s ! " he promised .

Now that you point it out like that, I should at least understand what's happening there before merging...
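
For reference, a sketch of how that round trip could be reproduced, assuming the stories-3072.json file added in this PR is a standard tokenizers-library JSON file (the repo's loading code may differ):

```python
from tokenizers import Tokenizer

# Path taken from this PR's file list.
tokenizer = Tokenizer.from_file("simple_stories_train/tokenizer/stories-3072.json")

text = (
    "In a magical forest, there was a kind bear who loved to help. "
    "One day, he found a sparkling gem among the leaves."
)
encoding = tokenizer.encode(text)

# With an uncased, WordPiece-style vocabulary, capitalized words can fall
# outside the vocabulary, so decoding does not reproduce the input exactly.
print(encoding.tokens)
print(tokenizer.decode(encoding.ids))
```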

@lennart-finke
Collaborator Author

lennart-finke commented Oct 15, 2024

Now I understand what was going on: the dataloader needs to convert text to lowercase for this tokenizer. With that change it works as expected!

By the way, any thoughts about whether to convert to lowercase and remove whitespace, as is currently done?

Apart from that, I believe this is ready for merging.
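
A minimal sketch of that lowercasing fix, with an illustrative function name rather than the repo's actual one (and assuming the tokenizers JSON file from this PR):

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("simple_stories_train/tokenizer/stories-3072.json")


def encode_story(text: str) -> list[int]:
    # Lowercase and collapse whitespace before encoding so the text matches
    # the uncased vocabulary the tokenizer was trained with.
    normalized = " ".join(text.lower().split())
    return tokenizer.encode(normalized).ids


ids = encode_story("One day, he found a sparkling gem among the leaves.")
print(tokenizer.decode(ids))
```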

@danbraunai
Owner

danbraunai commented Oct 15, 2024

By the way, any thoughts about whether to convert to lowercase and remove whitespace, as is currently done?

No opinions myself, I haven't been following the tokenizer stuff. @nix-apollo?

@nix-apollo

By the way, any thoughts about whether to convert to lowercase and remove whitespace, as is currently done?

No opinions myself, I haven't been following the tokenizer stuff. @nix-apollo?

My opinion is that the tokenizer lowercasing and removing whitespace is worth it for most research use cases. But it also adds complexity and confusion, as evidenced here. A tokenizer that doesn't preserve all information when tokenizing and detokenizing could certainly lead to bugs in the user's code, for instance (although at least in this case the loss is pretty obvious).

On balance I'm currently in favor of lowercasing and some form of whitespace normalization, but that's a weakly held opinion.

@lennart-finke
Collaborator Author

I see, thanks! Then I propose to keep the current tokenizer setup for the beta training run and maybe adjust it for the final run if we encounter issues.

@lennart-finke merged commit a932f89 into main on Oct 17, 2024.
1 check passed