[Librispeech] Add 'all' config #4184
Conversation
Fix #4179
"chapter_id": chapter_id, | ||
"file": audio_file, | ||
"text": transcript, | ||
"subset": subset, |
The documentation is not available anymore as the PR was closed or merged.
You could do:

```python
ds = load_dataset("librispeech_asr", "all")  # <- note that we have to pass a config
train_ds = ds["train"]
dev_clean_ds = ds["dev-clean"]
dev_other_ds = ds["dev-other"]
test_clean_ds = ds["test-clean"]
test_other_ds = ds["test-other"]
```
So, why is that? The docs say the first config is taken by default:

datasets/src/datasets/builder.py, line 228 (at cd3ce34)

Or maybe you could just define `DEFAULT_CONFIG_NAME`?
Oh crap, this is outdated documentation. No, it doesn't take the first config by default. EDIT: opened a PR to fix this: #4186
But defining a default config would also work, or not? So should we define `DEFAULT_CONFIG_NAME`? Don't most datasets have some default config?
Yes that would work, and I also find it reasonable to do it :)
Most datasets only have one configuration, so that single configuration is the default one. Other datasets have several configurations, and whether they have a default one is decided case by case.
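For reference, a minimal sketch of how a builder can declare a default config so that `load_dataset("librispeech_asr")` works without an explicit config name; the config names here are illustrative and the rest of the builder (`_info`, `_split_generators`, `_generate_examples`) is omitted:

```python
import datasets


class LibrispeechASR(datasets.GeneratorBasedBuilder):
    """Illustrative skeleton only; the actual data-loading methods are omitted."""

    # One config per subset plus an "all" config that bundles everything.
    BUILDER_CONFIGS = [
        datasets.BuilderConfig(name=name, version=datasets.Version("2.1.0"))
        for name in ("clean", "other", "all")
    ]
    # Picked when load_dataset("librispeech_asr") is called without a config name.
    DEFAULT_CONFIG_NAME = "all"
```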
Thanks a lot for the feedback! Using `"all"` as the default config now. @albertz - you should now be able to do the following:

```python
load_dataset("librispeech_asr")  # <- run this once to download, prepare the dataset and cache everything

# The following operations will be very fast since all the downloading and processing is already cached
train_1 = load_dataset("librispeech_asr", split="train.clean.100")
print(train_1)

train_2 = load_dataset("librispeech_asr", split="train.clean.100+train.clean.360")
print(train_2)

train_full = load_dataset("librispeech_asr", split="train.clean.100+train.clean.360+train.other.500")
print(train_full)

dev_clean_ds = load_dataset("librispeech_asr", split="validation.clean")
print(dev_clean_ds)

dev_other_ds = load_dataset("librispeech_asr", split="validation.other")
print(dev_other_ds)

test_clean_ds = load_dataset("librispeech_asr", split="test.clean")
print(test_clean_ds)

test_other_ds = load_dataset("librispeech_asr", split="test.other")
print(test_other_ds)
```
I think this way we have the best of both worlds. Also @lhoestq, I think we could highlight better in the docs that it's possible to combine different splits. We actually do this quite a lot for speech. For Common Voice, many people include "validation" in the training data if the dataset is too small, e.g.: https://github.com/huggingface/transformers/blob/ff06b177917384137af2d9585697d2d76c40cdfc/examples/pytorch/speech-recognition/run_speech_recognition_ctc.py#L147

Should we maybe add a short section to the loading tutorial here: https://huggingface.co/docs/datasets/v2.1.0/en/loading#hugging-face-hub ? (Happy to do it)
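As a concrete illustration of combining splits with the `"+"` syntax (a sketch; the Common Voice config name `"tr"` is just an example):

```python
from datasets import load_dataset

# Fold the validation data into training when the train split alone is too small,
# as is often done for low-resource Common Voice languages.
train_ds = load_dataset("common_voice", "tr", split="train+validation")
eval_ds = load_dataset("common_voice", "tr", split="test")
print(train_ds)
print(eval_ds)
```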
Is there any advantage or difference in calling `load_dataset` with an explicit `split` argument?

Note in our case, we cannot really use the caching mechanism because we have a recipe pipeline used by multiple users (and I think a common cache dir for all users might end up in problems), so we basically would use `save_to_disk` and `load_from_disk` instead.

So with the `"all"` config, it would look like:

```python
ds = datasets.load_from_disk(...)
train = ds["train"]
```

Or with your latest proposal, it would look like:

```python
ds = datasets.load_from_disk(...)
train_ds = datasets.concatenate_datasets(
    [ds["train.clean.100"], ds["train.clean.360"], ds["train.other.500"]]
)
```

right?
I see the use case! The only advantage of calling `load_dataset` with an explicit `split` argument is that it is a bit more convenient. @lhoestq, what do you think is the best approach with `save_to_disk` here?

@albertz, you could also point the cache at a directory that is easy to access:

```python
load_dataset("librispeech_asr", cache_dir="/easy/to/access/directory")
```
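For clarity, a rough sketch of the workflow being discussed, with hypothetical `/shared/...` paths; note that, as discussed further below, the audio column of the saved dataset still references the flac files under the cache directory:

```python
from datasets import concatenate_datasets, load_dataset, load_from_disk

# Prepare once into an explicitly chosen cache directory.
ds = load_dataset("librispeech_asr", "all", cache_dir="/shared/hf_cache")

# Persist the prepared arrow tables next to the cache.
ds.save_to_disk("/shared/librispeech_arrow")

# Later (possibly by another user or job):
ds = load_from_disk("/shared/librispeech_arrow")
train_ds = concatenate_datasets(
    [ds["train.clean.100"], ds["train.clean.360"], ds["train.other.500"]]
)
```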
```
@@ -159,6 +271,9 @@ def _generate_examples(self, files):
    id_, transcript = line.split(" ", 1)
    audio_file = f"{id_}.flac"
    speaker_id, chapter_id = [int(el) for el in id_.split("-")[:2]]
    audio_file = (
        os.path.join(local_extracted_archive, audio_file) if local_extracted_archive else None
    )
```
@lhoestq is the whole speech data now stored twice in non-streaming mode? Once as the original flac files and once as an array of bytes?

Should we change line 266 from `audio_data[id_] = f.read()` to just `audio_data[id_] = f`, and also change the line `audio = {"path": transcript["file"], "bytes": audio_data[transcript["id"]]}`?

At the moment it looks to me like both the bytes and the file are saved in non-streaming mode, which would create a huge memory overhead, no?
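A small sketch of the suggestion above (the actual change was made separately, see #4187 later in the thread); the helper name is hypothetical:

```python
def encode_audio(local_path, raw_bytes):
    """Return the dict expected by the Audio feature, storing the data only once.

    In non-streaming mode we know the extracted file path, so keep just the path;
    in streaming mode there is no local file, so keep just the raw bytes.
    """
    if local_path is not None:
        return {"path": local_path, "bytes": None}
    return {"path": None, "bytes": raw_bytes}
```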
Opening a new PR for this to change it internally in src/datasets
Will be fixed in #4187 :)
@albertz, I took a read through rwth-i6/i6_core#253. I think the best would be the following:
@lhoestq - I think this one is good to go
Oh, so you say that our current implementation in rwth-i6/i6_core#253 is broken? Because our cache dir is just some temp directory which will be removed afterwards, and we just store what we get out of `save_to_disk`.

So, you say we anyway need to share the cache dir among users? But we would want to make sure that after the initial download and preparation of the data, it is set to read-only, because we want to make sure that other people will not modify the data in any way. Right?

But then, we don't really need the `save_to_disk` / `load_from_disk` step at all, do we?
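For the read-only idea, a minimal sketch (an assumption about how one could freeze a shared cache dir on a POSIX filesystem; `datasets` does not do this for you):

```python
import os
import stat


def make_tree_read_only(root: str) -> None:
    """Recursively drop write permissions so prepared data cannot be modified afterwards."""
    write_bits = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            os.chmod(path, os.stat(path).st_mode & ~write_bits)
    os.chmod(root, os.stat(root).st_mode & ~write_bits)
```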
Oh, I wasn't aware that audio files are handled this way. Then we should have the cache directory as an additional job output, so that we keep the audio files.
No, the cache dir can still be a directory in the job output folder. Then the audio paths in the corresponding dataset column correspond to the flac files in that directory. This way the "output" of the job is contained in the job directory and we don't write files to a global cache directory that is independent of the sisyphus graph. If we want to share the audio data between different users, we can just link to a central instance of the job (similar to how we do it with other jobs).
@dthulke - that's a good point actually! So you can do both things: keep only the paths to the audio files (the default), or read the files into bytes yourself and save a fully self-contained arrow file:

```python
ds = load_dataset("librispeech_asr")

def read_file(batch):
    with open(batch["file"], "rb") as f:
        batch["bytes"] = f.read()
    return batch

ds = ds.map(read_file)
ds.save_to_disk("/path")  # <- the saved arrow object will now contain everything you need
```

However, this is not recommended - it should be much easier to just save the path to the downloaded audio files.
=> Now, what you could do as well would be to simply move all the audio files to the folder you want (e.g. the job output folder) and update the paths accordingly.

We discussed storing audio files quite a bit, e.g. see #3663, and had (too many) changes around this topic recently, but we've come to the conclusion that the best is to leave the audio in the format it was originally (`.flac` for Librispeech).
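For illustration, whichever way the audio ends up stored (path or bytes), the `Audio` feature decodes it lazily on access; a small sketch:

```python
from datasets import load_dataset

ds = load_dataset("librispeech_asr", "all", split="validation.clean")
sample = ds[0]
print(sample["audio"]["path"])           # path of the flac file (when stored as a path)
print(sample["audio"]["sampling_rate"])  # 16000 for Librispeech
print(sample["audio"]["array"].shape)    # decoded waveform as a 1-D numpy array
```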
So what I would suggest here is to do one of the following options:
Also relevant here: #3663
I also added some documentation about how this works.
This PR looks good, thanks :)

@patrickvonplaten can you also update the dataset_infos.json file with this new config?

```
datasets-cli test ./datasets/librispeech_asr --save_infos --name all
```

To fix the CI you can also replace `audio-speaker-identification` with `speaker-identification` in the task tags of the dataset card.
@dthulke But this is what I mean. When we share the job output folder, it means we share the cache dir among users. I wonder whether that is really safe when several users work on the same cache dir. We could enforce it by making the cache dir read-only after the preparation.

@patrickvonplaten @dthulke But in any case, we actually prefer the data content to be inside the dataset (the arrow files). Lots of small files would be very problematic for our cache manager. We have one main copy of the data on NFS, but accessing the NFS directly from all computing nodes is not feasible, so the cache manager keeps copies of the files on the nodes. So whenever we access some file, we query the cache manager DB to see whether the file is already cached somewhere (on some other computing node), and if so, it is copied from that node and not from NFS. This works very well when there are not too many files (but the files can be big). So we want to have only a few, but big, files. Even for NFS access this is much better.

I also commented in #3663.
Thanks a lot for your input! We've discussed quite a bit with @lhoestq and we think the best approach is the following:

a) For canonical datasets like librispeech and common voice I think we want to keep the dataset filenames because of i) no breaking changes and ii) the reasons explained in #3663. However, it's also trivial to write your own dataset downloading script for librispeech and just not extract the folder, e.g. this line: https://huggingface.co/datasets/common_voice/blob/main/common_voice.py#L671 . Then it's allowed to save the bytes, and the dataset will be self-contained out-of-the-box when using `save_to_disk`.

b) Now, one major problem that you guys uncovered is that the dataset is not self-contained when it only stores file paths. The idea would be:

```python
ds = load_dataset("....")  # <- here we have a dependency on the filepaths
ds[0]["audio"]["bytes"]  # <- will not work

ds.save_to_disk("/local/path")  # <- now we want to have a self-contained dataset in arrow format, so we load the files into bytes and save it in arrow format
# now you can delete everything besides "/local/path"

ds = load_from_disk("/local/path")  # <- this will work
```

So either option a), where you define your own librispeech data downloading script (you guys could just sign up here: https://huggingface.co/join) and upload a dataset loading script in private mode so that no one can see it, and you would always store the audio as bytes; or option b), where you first load, then save to disk, then delete the cache, would work.

Hope that fits in your vision :-)
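A rough sketch of what option a) could look like inside a custom (private) loading script, assuming an iterator like the one returned by `dl_manager.iter_archive`; the names here are illustrative:

```python
import os


def generate_examples_from_archive(archive_iterator):
    """Yield audio bytes straight from the un-extracted archive so the dataset is self-contained.

    `archive_iterator` is assumed to yield (path_inside_archive, file_object) pairs,
    like dl_manager.iter_archive(archive_path) does.
    """
    for path, f in archive_iterator:
        if path.endswith(".flac"):
            id_ = os.path.basename(path)[: -len(".flac")]
            # Storing bytes instead of a local file path removes the dependency on the cache dir.
            yield id_, {"id": id_, "audio": {"path": path, "bytes": f.read()}}
```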
@patrickvonplaten sounds like a good approach to me. For b) this could even be configurable with a parameter on `save_to_disk`.
I don't exactly understand. In all cases, we need to extract it to prepare the dataset, or not? No matter if we want to store the raw bytes inside the dataset or leave them as local files. Just in the first case, we can safely delete the extracted files after the dataset preparation.

For us, this sounds exactly like what we want. But regarding not introducing breaking changes, wouldn't this maybe also break some setups for users who don't expect this new behavior?
@albertz I would suggest moving the discussion on implementation details on our side to the following issue: rwth-i6/i6_core/issues/257
I like the idea of adding such an option to `save_to_disk`.

If it sounds good to you, I'll open an issue to discuss this and track the progress.
Awesome, thanks!
```
@@ -159,6 +256,11 @@ def _generate_examples(self, files):
    id_, transcript = line.split(" ", 1)
    audio_file = f"{id_}.flac"
    speaker_id, chapter_id = [int(el) for el in id_.split("-")[:2]]
    audio_file = (
```
Needed to have extracted files in non-streaming mode
Closed #4179.
Hi, when I run `from datasets import load_dataset` and then load the dataset, after downloading those files and the splitting process, I get an error like this:

```
ExpectedMoreSplits                        Traceback (most recent call last)
/disk/scratch2/s1905792/anaconda3/envs/py3.7-gpu/lib/python3.7/site-packages/datasets/load.py in load_dataset(path, name, data_dir, data_files, split, cache_dir, features, download_config, download_mode, verification_mode, ignore_verifications, keep_in_memory, save_infos, revision, use_auth_token, task, streaming, num_proc, storage_options, **config_kwargs)

/disk/scratch2/s1905792/anaconda3/envs/py3.7-gpu/lib/python3.7/site-packages/datasets/builder.py in download_and_prepare(self, output_dir, download_config, download_mode, verification_mode, ignore_verifications, try_from_hf_gcs, dl_manager, base_path, use_auth_token, file_format, max_shard_size, num_proc, storage_options, **download_and_prepare_kwargs)

/disk/scratch2/s1905792/anaconda3/envs/py3.7-gpu/lib/python3.7/site-packages/datasets/builder.py in _download_and_prepare(self, dl_manager, verification_mode, **prepare_split_kwargs)

/disk/scratch2/s1905792/anaconda3/envs/py3.7-gpu/lib/python3.7/site-packages/datasets/utils/info_utils.py in verify_splits(expected_splits, recorded_splits)

ExpectedMoreSplits: {'validation.other', 'test.other', 'test.clean', 'validation.clean', 'train.other.500', 'train.clean.360', 'train.clean.100'}
```

Could you please tell me where I went wrong and whether I missed any files?
Hi @wxlsummer, let's continue the discussion in the Community tab of the dataset: https://huggingface.co/datasets/openslr/librispeech_asr/discussions/11
Add "all" config to Librispeech (closed #4179)