
Wrote simple streaming dataloader #10

Merged
merged 14 commits into main on Oct 17, 2024

Conversation

lennart-finke
Collaborator

Description

Added a streaming dataloader to avoid storing large datasets.

Related Issue

Progress towards #8

@danbraunai-apollo
Collaborator

Does this still work with distributed data parallel? I'm a bit concerned that this setup is more complex than needed. In general I think we'll need to support:

  1. The dataset and tokenizer will be hosted on Hugging Face.
  2. The pre-tokenized dataset will be hosted on Hugging Face (so we don't have to tokenize it on the fly every time we train).

I think we can just get away with using Hugging Face's load_dataset with streaming=True. An example is here, which supports loading tokenized or untokenized datasets. Then we would just need to set it up to work with DDP. I'm not sure of the easiest way; there are probably standard setups for this, maybe using a distributed sampler.

I've added the above to #8 (sorry I should have done this earlier).
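
A minimal sketch of that streaming approach, assuming a placeholder dataset (roneneldan/TinyStories) and the gpt2 tokenizer rather than whatever this repo actually uses:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Placeholder dataset and tokenizer; the real project may use different ones.
dataset = load_dataset("roneneldan/TinyStories", split="train", streaming=True)
tokenizer = AutoTokenizer.from_pretrained("gpt2")


def tokenize(example):
    # With streaming=True, map() is applied lazily while iterating.
    return tokenizer(example["text"])


tokenized = dataset.map(tokenize, remove_columns=["text"])

# Iterate without ever materializing the full dataset on disk.
for i, example in enumerate(tokenized):
    print(example["input_ids"][:10])
    if i == 2:
        break
```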

@lennart-finke
Collaborator Author

When writing it I was thinking about distributed data parallel, and it could work, but I have not been able to test that.

Currently the code assumes that the tokenizer comes from tiktoken and that the streamed dataset is not tokenized.

I'm open to closing the PR if there is an easier solution; I'd leave that decision to you.

@lennart-finke
Collaborator Author

lennart-finke commented Aug 15, 2024

After looking into this a bit more, there is also a built-in in Hugging Face for this, the function split_dataset_by_node. If that makes sense, I'll next adapt the file you mentioned to use it.
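
A minimal sketch of how split_dataset_by_node could shard a streamed dataset across DDP ranks (placeholder dataset name; rank and world size would normally come from the torch.distributed environment):

```python
import os

from datasets import load_dataset
from datasets.distributed import split_dataset_by_node

rank = int(os.environ.get("RANK", 0))
world_size = int(os.environ.get("WORLD_SIZE", 1))

# Placeholder dataset; substitute the project's actual dataset.
dataset = load_dataset("roneneldan/TinyStories", split="train", streaming=True)

# Each rank sees a disjoint part of the stream (split by shard when the shard
# count divides evenly, otherwise by skipping examples), so no distributed
# sampler is needed.
dataset = split_dataset_by_node(dataset, rank=rank, world_size=world_size)

for example in dataset.take(2):
    print(rank, example["text"][:50])
```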

@danbraunai changed the base branch from typing to main on October 15, 2024.
@danbraunai
Owner

Looks good! Noting that I didn't run anything. There are no tests to check for things like the dataloader distributing things correctly and the tokenizer working properly. Adding such tests would be valuable, but also could be time consuming.

At a minimum, you should check that the dataloader does give you different data on each process (when you start doing multiprocessing) and that the tokenizer works as you'd expect.

Resolved review threads (now outdated):
simple_stories_train/tokenizer.py
simple_stories_train/tokenizer/stories-3072.json
simple_stories_train/train_llama.py (two threads)
@lennart-finke
Collaborator Author

Looks good! Noting that I didn't run anything. There are no tests to check for things like the dataloader distributing things correctly and the tokenizer working properly. Adding such tests would be valuable, but also could be time consuming.

Good idea. I added a minimal test for the dataloader, checking in particular that it returns the right shape and that different DDP ranks receive different data.
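
A sketch of roughly what such a test could look like; the dataset and tokenizer here are placeholders, and the batch helper is written inline rather than using the repo's actual dataloader code:

```python
import torch
from datasets import load_dataset
from datasets.distributed import split_dataset_by_node
from transformers import AutoTokenizer


def get_batch(rank: int, world_size: int, batch_size: int, seq_len: int) -> torch.Tensor:
    # Placeholder dataset/tokenizer; the repo's own loader may differ.
    ds = load_dataset("roneneldan/TinyStories", split="train", streaming=True)
    ds = split_dataset_by_node(ds, rank=rank, world_size=world_size)
    tok = AutoTokenizer.from_pretrained("gpt2")
    tok.pad_token = tok.eos_token  # gpt2 has no pad token by default
    rows = [
        tok(x["text"], truncation=True, padding="max_length", max_length=seq_len)["input_ids"]
        for x in ds.take(batch_size)
    ]
    return torch.tensor(rows)


def test_dataloader_shape_and_ddp_split():
    batch_size, seq_len = 4, 128
    batch0 = get_batch(rank=0, world_size=2, batch_size=batch_size, seq_len=seq_len)
    batch1 = get_batch(rank=1, world_size=2, batch_size=batch_size, seq_len=seq_len)

    # Batches have the expected (batch, sequence) shape...
    assert batch0.shape == (batch_size, seq_len)
    # ...and different DDP ranks do not receive identical data.
    assert not torch.equal(batch0, batch1)
```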

At a minimum, you should check that the dataloader does give you different data on each process (when you start doing multiprocessing) and that the tokenizer works as you'd expect.

Right, the tokenizer is still a bit of a mystery to me, and the current behaviour is likely not what we want.

For instance, encoding and decoding a story

In a magical forest, there was a kind bear who loved to help. One day, he found a sparkling gem among the leaves. Curiously, he touched it and turned into a fluffy bunny! The bunny hopped with joy but soon noticed a sad squirrel up in a tree. The bunny hopped closer and asked, "Why are you sad?" The squirrel replied, "I can't find my acorns. I need them for winter!" The bunny wanted to help, so he put on his bunny ears and searched everywhere in the forest. "Don't worry, I'll find your acorns!" he promised.

gives

a magical forest , there was a kind bear who loved to help . day , he found a sparkling gem among the leaves . , he touched it and turned into a fluffy bunny ! bunny hopped with joy but soon noticed a sad squirrel up in a tree . bunny hopped closer and asked , " are you sad ? " squirrel replied , " can ' t find my acorn ##s . need them for winter ! " bunny wanted to help , so he put on his bunny ears and searched everywhere in the forest . " ' t worry , ' l ##l find your acorn ##s ! " he promised .

Now that you point it out like that, I should at least understand what's happening there before merging...
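
For reference, a sketch of how that round trip could be reproduced, assuming the stories-3072.json file added in this PR is a standard tokenizers-library JSON file (the repo's loading code may differ):

```python
from tokenizers import Tokenizer

# Path taken from this PR's file list.
tokenizer = Tokenizer.from_file("simple_stories_train/tokenizer/stories-3072.json")

text = (
    "In a magical forest, there was a kind bear who loved to help. "
    "One day, he found a sparkling gem among the leaves."
)
encoding = tokenizer.encode(text)

# With an uncased, WordPiece-style vocabulary, capitalized words can fall
# outside the vocabulary, so decoding does not reproduce the input exactly.
print(encoding.tokens)
print(tokenizer.decode(encoding.ids))
```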

@lennart-finke
Collaborator Author

lennart-finke commented Oct 15, 2024

Now I understand what was going on: the dataloader needs to convert text to lowercase for this tokenizer. With that change it works as expected!

By the way, any thoughts about whether to convert to lowercase and remove whitespace, as is currently done?

Apart from that, I believe this is ready for merging.
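
A minimal sketch of that lowercasing fix, with an illustrative function name rather than the repo's actual one (and assuming the tokenizers JSON file from this PR):

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("simple_stories_train/tokenizer/stories-3072.json")


def encode_story(text: str) -> list[int]:
    # Lowercase and collapse whitespace before encoding so the text matches
    # the uncased vocabulary the tokenizer was trained with.
    normalized = " ".join(text.lower().split())
    return tokenizer.encode(normalized).ids


ids = encode_story("One day, he found a sparkling gem among the leaves.")
print(tokenizer.decode(ids))
```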

@danbraunai
Owner

danbraunai commented Oct 15, 2024

By the way, any thoughts about whether to convert to lowercase and remove whitespace, as is currently done?

No opinions myself, I haven't been following the tokenizer stuff. @nix-apollo?

@nix-apollo

By the way, any thoughts about whether to convert to lowercase and remove whitespace, as is currently done?

No opinions myself, I haven't been following the tokenizer stuff. @nix-apollo?

My opinion is that the tokenizer lowercasing and removing whitespace is worth it for most research use cases. But it also adds complexity and confusion, as evidenced here. A tokenizer that doesn't preserve all information when tokenizing and detokenizing could certainly lead to bugs in the user's code, for instance (although at least in this case the loss is pretty obvious).

On balance I'm currently in favor of lowercasing and some form of whitespace normalization, but that's a weakly held opinion.

@lennart-finke
Collaborator Author

I see, thanks! Then I propose to keep the current tokenizer setup for the beta training run and maybe adjust it for the final run if we encounter issues.

@lennart-finke merged commit a932f89 into main on Oct 17, 2024.
1 check passed