
[Librispeech] Add 'all' config #4184

Merged

Conversation

@patrickvonplaten (Contributor, Author) commented Apr 19, 2022

Add "all" config to Librispeech

Closed #4179

@patrickvonplaten (Contributor, Author)

Fix #4179

"chapter_id": chapter_id,
"file": audio_file,
"text": transcript,
"subset": subset,
@patrickvonplaten (Contributor, Author) commented on this diff:

@albertz @lhoestq - this "subset" arg is one of "clean-100", "clean-360", "other-500" for:

ds = load_dataset("librispeech_asr", "all", split="train")
ds[0]

This should allow the user to easily define which subsets are needed, I think.
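
For illustration (not part of this PR's diff), filtering on that proposed "subset" column could then look roughly like this:

from datasets import load_dataset

# Load the combined "all" config and keep only the clean training subsets.
# Assumes the "subset" column proposed above, with values "clean-100",
# "clean-360" and "other-500".
ds = load_dataset("librispeech_asr", "all", split="train")
clean_only = ds.filter(lambda example: example["subset"] in {"clean-100", "clean-360"})
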

@HuggingFaceDocBuilderDev commented Apr 19, 2022

The documentation is not available anymore as the PR was closed or merged.

@albertz

This comment was marked as resolved.

@patrickvonplaten (Contributor, Author)

Just that I understand: With this change, simply doing load_dataset("librispeech_asr") is possible and returns the whole dataset?

And to get the subsets, I do something like:

ds = load_dataset("librispeech_asr")
train_ds = ds["train"]
dev_clean_ds = ds["dev-clean"]
dev_other_ds = ds["dev-other"]
test_clean_ds = ds["test-clean"]
test_other_ds = ds["test-other"]

?

You could do:

ds = load_dataset("librispeech_asr", "all")  # <- note that we have to pass a config
train_ds = ds["train"]
dev_clean_ds = ds["dev-clean"]
dev_other_ds = ds["dev-other"]
test_clean_ds = ds["test-clean"]
test_other_ds = ds["test-other"]

@albertz commented Apr 20, 2022

So, load_dataset("librispeech_asr") is not possible, it must be load_dataset("librispeech_asr", "all")?

Why is that?

The docs say:

name: `str` name, optional configuration for the dataset that affects the data generated on disk. Different
    `builder_config`s will have their own subdirectories and versions.
    If not provided, uses the first configuration in self.BUILDER_CONFIGS

If not provided, uses the first configuration in self.BUILDER_CONFIGS

Or maybe you could just define DEFAULT_CONFIG_NAME?
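
For reference, a minimal sketch of what that could look like in the loading script (abbreviated and hypothetical; the real builder also defines _info, _split_generators, _generate_examples, etc.):

import datasets

class LibrispeechASR(datasets.GeneratorBasedBuilder):
    # With a default config name set, load_dataset("librispeech_asr") works
    # without passing a config explicitly.
    DEFAULT_CONFIG_NAME = "all"
    BUILDER_CONFIGS = [
        datasets.BuilderConfig(name=name) for name in ("clean", "other", "all")
    ]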

@lhoestq (Member) commented Apr 20, 2022

If not provided, uses the first configuration in self.BUILDER_CONFIGS

Oh crap this is outdated documentation. No it doesn't take the first config by default.

EDIT: opened a PR to fix this: #4186

@albertz commented Apr 20, 2022

No it doesn't take the first config by default.

But defining DEFAULT_CONFIG_NAME would work?

So should we define DEFAULT_CONFIG_NAME = "all" here as well? I think this is a reasonable default config.

Don't most datasets have some default config?

@lhoestq (Member) commented Apr 20, 2022

But defining DEFAULT_CONFIG_NAME would work?

So should we define DEFAULT_CONFIG_NAME = "all" here as well? I think this is a reasonable default config.

Yes that would work, and I also find it reasonable to do it :)

Don't most datasets have some default config?

Most datasets only have one configuration, so that single configuration is the default one. Other datasets have several configurations, and whether they have a default one is decided case by case.

e.g. glue is a benchmark and doesn't have a default task; one must choose explicitly which glue task they want to use.
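
For example (standard datasets usage, shown only to illustrate the point):

from datasets import load_dataset

# glue has no default config, so the task has to be named explicitly:
mrpc = load_dataset("glue", "mrpc")
# load_dataset("glue") alone raises an error asking you to pick a config.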

@patrickvonplaten (Contributor, Author)

Thanks a lot for the feedback!

Using "all" now as the default config. I changed the layout a bit so that there is not a single "train", but instead we have multiple "train.clean.100", "train.clean.360", "train.other.500". This way we don't even need to do filtering and it's also cleaner IMO.

@albertz - you should now be able to do the following:

load_dataset("librispeech_asr")  # <- run this once to download, prepare dataset and cache everything

# The following operations will be very fast since all the downloading and processing is already cached
train_1 = load_dataset("librispeech_asr", split="train.clean.100")
print(train_1)
train_2 = load_dataset("librispeech_asr", split="train.clean.100+train.clean.360")
print(train_2)
train_full = load_dataset("librispeech_asr", split="train.clean.100+train.clean.360+train.other.500")
print(train_full)
dev_clean_ds = load_dataset("librispeech_asr", split="validation.clean")
print(dev_clean_ds)
dev_other_ds = load_dataset("librispeech_asr", split="validation.other")
print(dev_other_ds)
test_clean_ds = load_dataset("librispeech_asr", split="test.clean")
print(test_clean_ds)
test_other_ds = load_dataset("librispeech_asr", split="test.other")
print(test_other_ds)

Patrick von Platen added 2 commits April 20, 2022 11:25
@patrickvonplaten (Contributor, Author)

I think this way we have the best of both worlds. Also @lhoestq, I think we could highlight better in the docs that it's possible to combine different splits. We actually do this quite a lot for speech. For Common Voice, many people include "validation" in the training data if the dataset is too small, e.g.: https://github.com/huggingface/transformers/blob/ff06b177917384137af2d9585697d2d76c40cdfc/examples/pytorch/speech-recognition/run_speech_recognition_ctc.py#L147

Should we maybe add a short section to the loading tutorial here: https://huggingface.co/docs/datasets/v2.1.0/en/loading#hugging-face-hub ? (Happy to do it)
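
For instance, merging splits with the "+" syntax looks like this (a hedged example; "common_voice" and "tr" are just placeholders for any dataset and config):

from datasets import load_dataset

# Fold the validation data into training, as is often done for low-resource speech datasets.
train = load_dataset("common_voice", "tr", split="train+validation")
test = load_dataset("common_voice", "tr", split="test")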

@albertz commented Apr 20, 2022

Is there any advantage or difference in calling load_dataset multiple times for each split? Or why not just call load_dataset once and then access each split?

Note in our case, we cannot really use the caching mechanism because we have a recipe pipeline used by multiple users (and I think a common cache dir for all users might end up in problems) and we basically would use load_dataset("librispeech_asr").save_to_disk(...) and then later load_from_disk(...). (See here: rwth-i6/i6_core#253)

So with load_from_disk, we cannot really provide the split this way, so we would anyway do something like:

ds = datasets.load_from_disk(...)
train = ds["train"]

Or with your latest proposal, it would look like:

ds = datasets.load_from_disk(...)
train_ds = datasets.concatenate_datasets(
  [ds["train.clean.100"], ds["train.clean.360"], ds["train.other.500"]])

right?

@patrickvonplaten (Contributor, Author)

Is there any advantage or difference in calling load_dataset multiple times for each split? Or why not just call load_dataset once and then access each split?


I see the use case! The only advantage of calling load_dataset multiple times is that one can easily "merge" splits with "+", but yeah, you can do exactly the same with concatenate_datasets.

@lhoestq what do you think is the best approach with load_from_disk?

@albertz, you could also define the cache_dir when doing load_dataset(...), which will then put all the relevant arrow files in the cache dir that you defined, e.g.:

load_dataset("librispeech_asr", cache_dir="/easy/to/access/directory")

@@ -159,6 +271,9 @@ def _generate_examples(self, files):
id_, transcript = line.split(" ", 1)
audio_file = f"{id_}.flac"
speaker_id, chapter_id = [int(el) for el in id_.split("-")[:2]]
audio_file = (
os.path.join(local_extracted_archive, audio_file) if local_extracted_archive else None
@patrickvonplaten (Contributor, Author) commented on this diff:

@lhoestq is the whole speech data now stored twice in non-streaming mode? Once as the original flac and once as an array of bytes?

Should we change line 266 from audio_data[id_] = f.read() to just audio_data[id_] = f, and also change the line audio = {"path": transcript["file"], "bytes": audio_data[transcript["id"]]}? At the moment it looks to me like both the bytes and the file are saved in non-streaming mode, which would create a huge memory overhead, no?

@patrickvonplaten (Contributor, Author):

Opening a new PR for this to change it internally in src/datasets ...

@lhoestq (Member):

Will be fixed in #4187 :)

Patrick von Platen added 2 commits April 20, 2022 12:49
@patrickvonplaten (Contributor, Author) commented Apr 20, 2022

@albertz, I took a read through rwth-i6/i6_core#253 .

I think the best would be the following:

  1. Do ds = load_dataset(..., cache_dir="/dir/that/is/easy/to/access") <- once this PR is merged, this will save all the original .flac files in the cache_dir.
  2. Do ds.save_to_disk("local/path"). This should then only save the arrow format with a path string to the audio files, which are located in the cache_dir <- this won't require a lot of memory after [Librispeech] Add 'all' config #4184 (comment) is fixed, and it can be done for each person individually.
  3. ds = datasets.load_from_disk("local/path") can then be used. The ds object will then have a path variable that links to the original audio files in the cache_dir. You can then easily change these audio files to .mp3, e.g. with the .map(...) function: define a function that maps through all audio files, loads them, and saves them to disk afterwards. (A condensed sketch of these three steps follows below.)
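
A condensed sketch of those three steps (paths are placeholders; this assumes the behaviour described in this thread, where save_to_disk keeps paths rather than embedding the audio bytes):

from datasets import load_dataset, load_from_disk

cache = "/dir/that/is/easy/to/access"  # placeholder paths throughout
local_path = "local/path"

# 1. Download and prepare once; the original .flac files end up under `cache`.
ds = load_dataset("librispeech_asr", cache_dir=cache)

# 2. Save the lightweight arrow files; the audio entries keep paths into `cache`.
ds.save_to_disk(local_path)

# 3. Reload from the saved arrow files; the paths still point into `cache`.
ds = load_from_disk(local_path)
print(ds["train.clean.100"][0]["file"])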

@patrickvonplaten (Contributor, Author)

@lhoestq - I think this one is good to go

@albertz commented Apr 20, 2022


Oh, so you say that our current implementation in rwth-i6/i6_core#253 is broken? Our cache dir is just some temp directory which will be removed afterwards, and we only store what we get out of save_to_disk. I think it would be good to clarify in the save_to_disk docs that this is not enough and that the result can depend on files from the cache dir. (@dthulke)

So, you say we anyway need to share the cache dir among users? But we would want to make sure that after the initial download and preparation of the data, this is set to readonly, because we want to make sure that other people will not modify the data in any way. Right?

But then, we don't really need the save_to_disk and load_from_disk at all, right?

@dthulke commented Apr 20, 2022

@albertz


Oh, I wasn't aware that audio files are handled this way. Then we should have the cache directory as an additional job output, so that we keep the audio files.

So, you say we anyway need to share the cache dir among users?

No, the cache dir can still be a directory in the job output folder. Then the audio paths in the corresponding dataset column correspond to the flac files in that directory. This way the "output" of the job is contained in the job directory, and we don't write files to a global cache directory that is independent of the sisyphus graph.

If we want to share the audio data between different users, we can just link to a central instance of the job (similar to how we do it with the DownloadLibriSpeechCorpusJob).

@patrickvonplaten (Contributor, Author)

@dthulke - that's a good point actually! So you can do both things:

  1. Convert all audio files to bytes. Bytes can be saved by arrow, so in this case you can do save_to_disk(...), but then you cannot really inspect the audio files locally, as they'll just be saved within a large arrow file (this actually used to be the default case, but we're changing this now). The problem with this is summarized a bit here: [Audio] Path of Common Voice cannot be used for audio loading anymore #3663. You can still do this if you'd like, e.g. you could do:
ds = load_dataset("librispeech_asr")

def read_file(batch):
    with open(batch["file"], "r") as f:
        batch["bytes"] = f.read()   
    return batch

ds = ds.map(read_file)
ds.save_to_disk("/path") <- the saved arrow object will now contain everything you need

However, this is not recommended - it should be much easier to just save the path to the downloaded audio files.

  2. Not convert audio files to bytes, but just leave them in their original file format. Then only the path to the original files will be saved in arrow. This will be the default case. This means that when you do load_dataset(...), both the original audio data and the arrow file will be saved in the cache_dir (which can be local for every user or a shared cache - we actually use a shared cache quite a bit at Hugging Face). When you then do save_to_disk(...), only the path will be saved in arrow format. After this PR is merged, you'll see that the arrow files are very lightweight, meaning that save_to_disk(...) can be done by every user, but it has a dependency on the cache_dir (because the audio files live there).

=> Now, what you could also do would be to simply move all the audio files to the folder you want (the save_to_disk(...) folder), then change the path of every sample to this folder (maybe with map(...)), and then this folder would be self-contained. I do however think it's better to just specify a cache_dir and re-use load_dataset(...) every time instead of load_from_disk or save_to_disk(...). Note that you can even pass the relevant cache files to load_dataset(...) here: https://huggingface.co/docs/datasets/v2.1.0/en/package_reference/loading_methods#datasets.load_dataset.data_files in which case you can be 100% sure that nothing is re-downloaded.

We discussed storing audio files quite a bit, e.g. see #3663, and had (too many) changes around this topic recently, but we've come to the conclusion that it's best to leave the audio in its original format (.flac for Librispeech) so that the user can easily inspect it / understand the data. Arrow cannot save data as .flac, so we'll just save a path to the original data. Curious to hear your opinions on this as well.
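
For illustration, inspecting a sample with the path-based layout looks roughly like this (a sketch following the datasets Audio feature; decoding requires an audio backend such as soundfile to be installed):

from datasets import load_dataset

ds = load_dataset("librispeech_asr", "clean", split="validation")
sample = ds[0]

# The original .flac path is kept; decoding to an array happens lazily on access.
print(sample["file"])                    # path to the cached .flac file
print(sample["audio"]["path"])           # same underlying file
print(sample["audio"]["array"].shape)    # decoded waveform as a numpy array
print(sample["audio"]["sampling_rate"])  # 16000 for Librispeech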

@patrickvonplaten (Contributor, Author)

So what I would suggest here is to do the following:

  1. Do load_dataset(..., cache_dir=/a/read-only/folder), and then:
  • either just re-use load_dataset(..., cache_dir=...), which should always re-use the data in the cache_dir since the hash of the URL matches - so there should never be any duplicated downloading,

or

  • if you want to store the files as MP3 locally, first convert the files to MP3 in the read-only folder, then do ds.save_to_disk(/some/path), which will save the correct paths to the MP3 files in the read-only folder, and then you can easily re-use the small arrow dataset saved in /some/path (see the sketch below).
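
A rough sketch of that MP3 route (not part of this PR; it assumes pydub with ffmpeg is available, and all paths are placeholders - any flac-to-mp3 converter would do):

import os

from datasets import load_dataset
from pydub import AudioSegment

ds = load_dataset("librispeech_asr", split="train.clean.100", cache_dir="/a/read-only/folder")

mp3_dir = "/some/path/mp3"
os.makedirs(mp3_dir, exist_ok=True)

def to_mp3(example):
    # Convert the cached .flac file and point the "file" column at the new MP3.
    # Re-casting the decoded "audio" column to the new paths is left out to keep this short.
    out = os.path.join(mp3_dir, os.path.basename(example["file"]).replace(".flac", ".mp3"))
    AudioSegment.from_file(example["file"]).export(out, format="mp3")
    example["file"] = out
    return example

ds = ds.map(to_mp3)
ds.save_to_disk("/some/path")  # the small arrow dataset now references the MP3 paths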

@patrickvonplaten (Contributor, Author)

Also relevant here: #3663

@lhoestq (Member) commented Apr 21, 2022

I also added some documentation about how save_to_disk handles audio files here: #4193

@lhoestq (Member) left a comment

This PR looks good thanks :)

@patrickvonplaten can you also update the dataset_infos.json file with this new config?

datasets-cli test ./datasets/librispeech_asr --save_infos --name all

To fix the CI, you can also replace audio-speaker-identification with speaker-identification in the task tags of the dataset card.

@albertz commented Apr 21, 2022

So, you say we anyway need to share the cache dir among users?

No, the cache dir can still be a directory in the job output folder.

@dthulke But this is what I mean. When we share the job output folder, it means we share the cache dir among users.

I wonder if load_dataset(..., cache_dir=job_output_cache_dir) is always safe to do then, i.e. that it really would not modify the job_output_cache_dir.

We could enforce that by making the job_output_cache_dir read-only afterwards. We currently don't do this.

@patrickvonplaten @dthulke But in any case, we actually prefer the data content to be inside the dataset (the arrow files). Lots of small files would be very problematic for our cache manager. We have one main copy of the data on NFS, but accessing the NFS directly by all computing nodes is not feasible, so the cache manager will have copies of the files on the nodes. So it means, whenever we access some file, we query the cache manager DB whether the file is already cached somewhere (some other computing node) and if so, it copies it from the other computing node and not from NFS. This works very well when there are not too many files (but the files can be big). So, we want to have only a few but big files. Even for NFS access this is much better.

I also commented in #3663.

@patrickvonplaten (Contributor, Author)

Hey @albertz @dthulke,

Thanks a lot for your input!

We've discussed quite a bit with @lhoestq and we think the best approach is the following:

a) load_dataset(...) will not store both the bytes and the files, because this would mean that 3x the size of the dataset would often be needed (1. the compressed tar.gz files, 2. the extracted files, 3. the raw bytes in arrow format).

For canonical datasets like librispeech and common voice I think we want to keep the dataset filenames because of i) no breaking changes and ii) reasons explained in #3663

However, it's also trivial to write your own dataset downloading script for librispeech and just not extract the folder, e.g. this line: https://huggingface.co/datasets/common_voice/blob/main/common_voice.py#L671

Then saving the bytes is allowed, and the dataset will be self-contained out-of-the-box when using load_dataset(...).

b) Now, one major problem that you uncovered is that save_to_disk(...) currently does not necessarily save a dataset so that it is self-contained. We will change that asap. This means that once we've corrected this, the following will work when you download the canonical librispeech dataset:

ds = load_dataset("....")  # <- here we have a dependency on the filepathes
ds[0]["audio"]["bytes"] # <- will not work

ds.save_to_disk("/local/path")  # <- now we want to have a self-contained dataset in arrow format, so we load the files into bytes and save it in arrow format

# now you can delete everything besides "/local/path"

ds = load_from_disk("/local/path")  # <- this will work

So either option a), where you define your own librispeech data downloading script (you could just sign up here: https://huggingface.co/join) and upload a dataset loading script in private mode so that no one can see it, and you would always store the audio as bytes, or option b), where you first load, then save to disk, then delete the cache, would work.

Hope that fits in your vision :-)

cc @lhoestq @mariosasko

@dthulke commented Apr 21, 2022

@patrickvonplaten sounds like a good approach to me. For b) this could even be configurable with a parameter like embed_external_files as you have for push_to_hub (if people prefer to keep separate audio files).

@albertz commented Apr 21, 2022

However it's also trivial to write your own datasetset downloading script of librispeech and just not extract the folder

I don't exactly understand. In all cases, we need to extract it to prepare the dataset, don't we? No matter whether we want to store the raw bytes inside the dataset or leave them as local files. Just that in the first case, we can safely delete the extracted files after the dataset preparation.

save_to_disk(...) is currently not necessarily saving a dataset to be self-contained. We will change that asap.

For us, this sounds exactly like what we want.

But regarding not introducing breaking changes, wouldn't this maybe also break some setups for users who don't expect this new behavior?

@dthulke commented Apr 21, 2022

@albertz I would suggest moving the discussion of implementation details on our side to the following issue: rwth-i6/i6_core/issues/257

@lhoestq (Member) commented Apr 21, 2022

I like the idea of adding embed_external_files to save_to_disk and setting it to True by default.
It's indeed a kind of breaking change, since some users will get bigger Arrow files when updating the lib, but the advantages are nice:

  1. I like the idea of having it self contained, in case you want to delete your cache
  2. users also upload these Arrow files to cloud storage via the fs parameter, and in this case they would expect to upload a self-contained dataset
  3. consistency with push_to_hub

If it sounds good to you I'll open an issue to discuss this and track the advancements

Patrick von Platen added 3 commits April 21, 2022 16:56
@lhoestq (Member) left a comment

Awesome, thanks!

@patrickvonplaten patrickvonplaten merged commit 91d7171 into huggingface:master Apr 22, 2022
@patrickvonplaten patrickvonplaten deleted the add_all_to_librispeech branch April 22, 2022 09:45
@@ -159,6 +256,11 @@ def _generate_examples(self, files):
id_, transcript = line.split(" ", 1)
audio_file = f"{id_}.flac"
speaker_id, chapter_id = [int(el) for el in id_.split("-")[:2]]
audio_file = (
@patrickvonplaten (Contributor, Author) commented on this diff:

Needed to have extracted files in non-streaming mode

@albertvillanova (Member)

Closed #4179.

@wxlsummer

load_dataset("librispeech_asr")

Hi, when I run:

from datasets import load_dataset
import pandas as pd
from multiprocessing import Pool
import os
data_dir = my own path
dataset=load_dataset("librispeech_asr", cache_dir=data_dir)

after downloading those files and the splitting process, I get an error like this:

ExpectedMoreSplits Traceback (most recent call last)
/tmp/ipykernel_815982/814767946.py in
4 import os
5 data_dir = '/disk/scratch2/s1905792/librispeech'
----> 6 dataset=load_dataset("librispeech_asr", cache_dir=data_dir)
7 #test = load_dataset("librispeech_asr", "all", "train", cache_dir=data_dir)

/disk/scratch2/s1905792/anaconda3/envs/py3.7-gpu/lib/python3.7/site-packages/datasets/load.py in load_dataset(path, name, data_dir, data_files, split, cache_dir, features, download_config, download_mode, verification_mode, ignore_verifications, keep_in_memory, save_infos, revision, use_auth_token, task, streaming, num_proc, storage_options, **config_kwargs)
1813 try_from_hf_gcs=try_from_hf_gcs,
1814 num_proc=num_proc,
-> 1815 storage_options=storage_options,
1816 )
1817

/disk/scratch2/s1905792/anaconda3/envs/py3.7-gpu/lib/python3.7/site-packages/datasets/builder.py in download_and_prepare(self, output_dir, download_config, download_mode, verification_mode, ignore_verifications, try_from_hf_gcs, dl_manager, base_path, use_auth_token, file_format, max_shard_size, num_proc, storage_options, **download_and_prepare_kwargs)
911 verification_mode=verification_mode,
912 **prepare_split_kwargs,
--> 913 **download_and_prepare_kwargs,
914 )
915 # Sync info

/disk/scratch2/s1905792/anaconda3/envs/py3.7-gpu/lib/python3.7/site-packages/datasets/builder.py in _download_and_prepare(self, dl_manager, verification_mode, **prepare_split_kwargs)
1020
1021 if verification_mode == VerificationMode.BASIC_CHECKS or verification_mode == VerificationMode.ALL_CHECKS:
-> 1022 verify_splits(self.info.splits, split_dict)
1023
1024 # Update the info object with the splits.

/disk/scratch2/s1905792/anaconda3/envs/py3.7-gpu/lib/python3.7/site-packages/datasets/utils/info_utils.py in verify_splits(expected_splits, recorded_splits)
89 return
90 if len(set(expected_splits) - set(recorded_splits)) > 0:
---> 91 raise ExpectedMoreSplits(str(set(expected_splits) - set(recorded_splits)))
92 if len(set(recorded_splits) - set(expected_splits)) > 0:
93 raise UnexpectedSplits(str(set(recorded_splits) - set(expected_splits)))

ExpectedMoreSplits: {'validation.other', 'test.other', 'test.clean', 'validation.clean', 'train.other.500', 'train.clean.360', 'train.clean.100'}

Could you please tell me where I went wrong and whether I missed any files?
Best Regards,
Xiaoliang

@albertvillanova (Member)

Hi @wxlsummer,

Let's continue the discussion in the Community tab of the dataset: https://huggingface.co/datasets/openslr/librispeech_asr/discussions/11
