
[Librispeech] Add 'all' config #4184

Merged

Conversation

@patrickvonplaten (Contributor, Author) commented Apr 19, 2022

Add "all" config to Librispeech

Closed #4179

@patrickvonplaten (Contributor, Author)

Fix #4179

"chapter_id": chapter_id,
"file": audio_file,
"text": transcript,
"subset": subset,
@patrickvonplaten (Contributor, Author) commented on this diff:

@albertz @lhoestq - this "subset" arg is one of "clean-100", "clean-360", "other-500" for:

ds = load_dataset("librispeech_asr", "all", split="train")
ds[0]

This should allow the user to easily define which subsets are needed, I think.
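
For illustration (not part of this PR's diff), filtering on that proposed "subset" column could then look roughly like this:

from datasets import load_dataset

# Load the combined "all" config and keep only the clean training subsets.
# Assumes the "subset" column proposed above, with values "clean-100",
# "clean-360" and "other-500".
ds = load_dataset("librispeech_asr", "all", split="train")
clean_only = ds.filter(lambda example: example["subset"] in {"clean-100", "clean-360"})
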

@HuggingFaceDocBuilderDev commented Apr 19, 2022

The documentation is not available anymore as the PR was closed or merged.

@albertz

This comment was marked as resolved.

@patrickvonplaten (Contributor, Author)

Just that I understand: With this change, simply doing load_dataset("librispeech_asr") is possible and returns the whole dataset?

And to get the subsets, I do something like:

ds = load_dataset("librispeech_asr")
train_ds = ds["train"]
dev_clean_ds = ds["dev-clean"]
dev_other_ds = ds["dev-other"]
test_clean_ds = ds["test-clean"]
test_other_ds = ds["test-other"]

?

You could do:

ds = load_dataset("librispeech_asr", "all")  # <- note that we have to pass a config
train_ds = ds["train"]
dev_clean_ds = ds["dev-clean"]
dev_other_ds = ds["dev-other"]
test_clean_ds = ds["test-clean"]
test_other_ds = ds["test-other"]

@albertz commented Apr 20, 2022

So, load_dataset("librispeech_asr") is not possible, it must be load_dataset("librispeech_asr", "all")?

Why is that?

The docs say:

name: `str` name, optional configuration for the dataset that affects the data generated on disk. Different
    `builder_config`s will have their own subdirectories and versions.
    If not provided, uses the first configuration in self.BUILDER_CONFIGS

If not provided, uses the first configuration in self.BUILDER_CONFIGS

Or maybe you could just define DEFAULT_CONFIG_NAME?
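
For reference, a minimal sketch of what that could look like in the loading script (abbreviated and hypothetical; the real builder also defines _info, _split_generators, _generate_examples, etc.):

import datasets

class LibrispeechASR(datasets.GeneratorBasedBuilder):
    # With a default config name set, load_dataset("librispeech_asr") works
    # without passing a config explicitly.
    DEFAULT_CONFIG_NAME = "all"
    BUILDER_CONFIGS = [
        datasets.BuilderConfig(name=name) for name in ("clean", "other", "all")
    ]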

@lhoestq (Member) commented Apr 20, 2022

If not provided, uses the first configuration in self.BUILDER_CONFIGS

Oh crap this is outdated documentation. No it doesn't take the first config by default.

EDIT: opened a PR to fix this: #4186

@albertz commented Apr 20, 2022

No it doesn't take the first config by default.

But defining DEFAULT_CONFIG_NAME would work?

So should we define DEFAULT_CONFIG_NAME = "all" here as well? I think this is a reasonable default config.

Don't most datasets have some default config?

@lhoestq (Member) commented Apr 20, 2022

But defining DEFAULT_CONFIG_NAME would work?

So should we define DEFAULT_CONFIG_NAME = "all" here as well? I think this is a reasonable default config.

Yes that would work, and I also find it reasonable to do it :)

Don't most datasets have some default config?

Most datasets only have one configuration, so that single configuration is the default one. Other datasets have several configurations, and whether they have a default one is decided case by case.

e.g. glue is a benchmark and doesn't have a default task; one must choose explicitly which glue task they want to use.
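
For example (standard datasets usage, shown only to illustrate the point):

from datasets import load_dataset

# glue has no default config, so the task has to be named explicitly:
mrpc = load_dataset("glue", "mrpc")
# load_dataset("glue") alone raises an error asking you to pick a config.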

@patrickvonplaten (Contributor, Author)

Thanks a lot for the feedback!

Using "all" now as the default config. I changed the layout a bit so that there is not a single "train", but instead we have multiple "train.clean.100", "train.clean.360", "train.other.500". This way we don't even need to do filtering and it's also cleaner IMO.

@albertz - you should now be able to do the following:

load_dataset("librispeech_asr")  # <- run this once to download, prepare dataset and cache everything

# The following operations will be very fast since all the downloading and processing is already cached
train_1 = load_dataset("librispeech_asr", split="train.clean.100")
print(train_1)
train_2 = load_dataset("librispeech_asr", split="train.clean.100+train.clean.360")
print(train_2)
train_full = load_dataset("librispeech_asr", split="train.clean.100+train.clean.360+train.other.500")
print(train_full)
dev_clean_ds = load_dataset("librispeech_asr", split="validation.clean")
print(dev_clean_ds)
dev_other_ds = load_dataset("librispeech_asr", split="validation.other")
print(dev_other_ds)
test_clean_ds = load_dataset("librispeech_asr", split="test.clean")
print(test_clean_ds)
test_other_ds = load_dataset("librispeech_asr", split="test.other")
print(test_other_ds)

Patrick von Platen added 2 commits April 20, 2022 11:25
@patrickvonplaten (Contributor, Author)

I think this way we have the best of both worlds. Also @lhoestq, I think we could highlight better in the docs that it's possible to combine different splits. We actually do this quite a lot for speech. For Common Voice, many people include "validation" in the training data if the dataset is too small, e.g.: https://github.com/huggingface/transformers/blob/ff06b177917384137af2d9585697d2d76c40cdfc/examples/pytorch/speech-recognition/run_speech_recognition_ctc.py#L147

Should we maybe add a short section to the loading tutorial here: https://huggingface.co/docs/datasets/v2.1.0/en/loading#hugging-face-hub ? (Happy to do it)
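
For instance, merging splits with the "+" syntax looks like this (a hedged example; "common_voice" and "tr" are just placeholders for any dataset and config):

from datasets import load_dataset

# Fold the validation data into training, as is often done for low-resource speech datasets.
train = load_dataset("common_voice", "tr", split="train+validation")
test = load_dataset("common_voice", "tr", split="test")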

@albertz commented Apr 20, 2022

Is there any advantage or difference in calling load_dataset multiple times for each split? Or why not just call load_dataset once and then access each split?

Note in our case, we cannot really use the caching mechanism because we have a recipe pipeline used by multiple users (and I think a common cache dir for all users might end up in problems) and we basically would use load_dataset("librispeech_asr").save_to_disk(...) and then later load_from_disk(...). (See here: rwth-i6/i6_core#253)

So with load_from_disk, we cannot really provide the split this way, so we would anyway do something like:

ds = datasets.load_from_disk(...)
train = ds["train"]

Or with your latest proposal, it would look like:

ds = datasets.load_from_disk(...)
train_ds = datasets.concatenate_datasets(
  [ds["train.clean.100"], ds["train.clean.360"], ds["train.other.500"]])

right?

@patrickvonplaten (Contributor, Author)

Is there any advantage or difference in calling load_dataset multiple times for each split? Or why not just call load_dataset once and then access each split?


I see the use case! The only advantage of calling load_dataset multiple times is that one can easily "merge" splits with "+", but yeah, you can do exactly the same with concatenate_datasets.

@lhoestq what do you think is the best approach with load_from_disk?

@albertz, you could also define the cache_dir when doing load_dataset(...), which will then put all the relevant arrow files in the cache dir that you defined, e.g.:

load_dataset("librispeech_asr", cache_dir="/easy/to/access/directory")

@@ -159,6 +271,9 @@ def _generate_examples(self, files):
id_, transcript = line.split(" ", 1)
audio_file = f"{id_}.flac"
speaker_id, chapter_id = [int(el) for el in id_.split("-")[:2]]
audio_file = (
os.path.join(local_extracted_archive, audio_file) if local_extracted_archive else None
@patrickvonplaten (Contributor, Author) commented on this diff:

@lhoestq is the whole speech data now stored twice in non-streaming mode? Once as the original flac and once as an array of bytes?

Should we change line 266 from audio_data[id_] = f.read() to just audio_data[id_] = f, and also change the line audio = {"path": transcript["file"], "bytes": audio_data[transcript["id"]]}? At the moment it looks to me like both the bytes and the file are saved in non-streaming mode, which would create a huge memory overhead, no?

@patrickvonplaten (Contributor, Author):

Opening a new PR for this to change it internally in src/datasets ...

@lhoestq (Member):

Will be fixed in #4187 :)

Patrick von Platen added 2 commits April 20, 2022 12:49
@patrickvonplaten (Contributor, Author) commented Apr 20, 2022

@albertz, I took a read through rwth-i6/i6_core#253 .

I think the best would be the following:

  1. Do ds = load_dataset(..., cache_dir="/dir/that/is/easy/to/access") <- once this PR is merged, this will save all the original .flac files in the cache_dir.
  2. Do ds.save_to_disk("local/path"). This should then only save the arrow format with a path string to the audio files, which are located in the cache_dir <- this won't require a lot of memory after [Librispeech] Add 'all' config #4184 (comment) is fixed, and it can be done for each person individually.
  3. ds = datasets.load_from_disk("local/path") can then be used. The ds object will then have a path variable that links to the original audio files in the cache_dir. You can then easily change these audio files to .mp3, e.g. with the .map(...) function: define a function that maps through all audio files, loads them, and saves them to disk afterwards. (A condensed sketch of these three steps follows below.)
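
A condensed sketch of those three steps (paths are placeholders; this assumes the behaviour described in this thread, where save_to_disk keeps paths rather than embedding the audio bytes):

from datasets import load_dataset, load_from_disk

cache = "/dir/that/is/easy/to/access"  # placeholder paths throughout
local_path = "local/path"

# 1. Download and prepare once; the original .flac files end up under `cache`.
ds = load_dataset("librispeech_asr", cache_dir=cache)

# 2. Save the lightweight arrow files; the audio entries keep paths into `cache`.
ds.save_to_disk(local_path)

# 3. Reload from the saved arrow files; the paths still point into `cache`.
ds = load_from_disk(local_path)
print(ds["train.clean.100"][0]["file"])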

@patrickvonplaten (Contributor, Author)

@lhoestq - I think this one is good to go

@albertz commented Apr 20, 2022


Oh, so you say that our current implementation in rwth-i6/i6_core#253 is broken? Our cache dir is just some temp directory which will be removed afterwards, and we only store what we get out of save_to_disk. I think it would be good to clarify in the save_to_disk docs that this is not enough and that the result can depend on files from the cache dir. (@dthulke)

So, you say we anyway need to share the cache dir among users? But we would want to make sure that after the initial download and preparation of the data, this is set to readonly, because we want to make sure that other people will not modify the data in any way. Right?

But then, we don't really need the save_to_disk and load_from_disk at all, right?

@dthulke commented Apr 20, 2022

@albertz


Oh, I wasn't aware that audio files are handled this way. Then we should have the cache directory as an additional job output, so that we keep the audio files.

So, you say we anyway need to share the cache dir among users?

No, the cache dir can still be a directory in the job output folder. Then the audio paths in the corresponding dataset column correspond to the flac files in that directory. This way the "output" of the job is contained in the job directory, and we don't write files to a global cache directory that is independent of the sisyphus graph.

If we want to share the audio data between different users, we can just link to a central instance of the job (similar to how we do it with the DownloadLibriSpeechCorpusJob).

@patrickvonplaten (Contributor, Author)

@dthulke - that's a good point actually! So you can do both things:

  1. Convert all audio files to bytes. Bytes can be saved by arrow, so in this case you can do save_to_disk(...), but then you cannot really inspect the audio files locally, as they'll just be saved within a large arrow file (this actually used to be the default case, but we're changing this now). The problem with this is summarized a bit here: [Audio] Path of Common Voice cannot be used for audio loading anymore #3663. You can still do this if you'd like, e.g. you could do:
ds = load_dataset("librispeech_asr")

def read_file(batch):
    with open(batch["file"], "r") as f:
        batch["bytes"] = f.read()   
    return batch

ds = ds.map(read_file)
ds.save_to_disk("/path") <- the saved arrow object will now contain everything you need

However, this is not recommended - it should be much easier to just save the path to the downloaded audio files.

  2. Not convert audio files to bytes, but just leave them in their original file format. Then only the path to the original files will be saved in arrow. This will be the default case. This means that when you do load_dataset(...), both the original audio data and the arrow file will be saved in the cache_dir (which can be local for every user or a shared cache - we actually use a shared cache quite a bit at Hugging Face). When you then do save_to_disk(...), only the path will be saved in arrow format. After this PR is merged, you'll see that the arrow files are very lightweight, meaning that save_to_disk(...) can be done by every user, but it has a dependency on the cache_dir (because the audio files live there).

=> Now, what you could also do would be to simply move all the audio files to the folder you want (the save_to_disk(...) folder), then change the path of every sample to this folder (maybe with map(...)), and then this folder would be self-contained. I do however think it's better to just specify a cache_dir and re-use load_dataset(...) every time instead of load_from_disk or save_to_disk(...). Note that you can even pass the relevant cache files to load_dataset(...) here: https://huggingface.co/docs/datasets/v2.1.0/en/package_reference/loading_methods#datasets.load_dataset.data_files in which case you can be 100% sure that nothing is re-downloaded.

We discussed storing audio files quite a bit, e.g. see #3663, and had (too many) changes around this topic recently, but we've come to the conclusion that it's best to leave the audio in its original format (.flac for Librispeech) so that the user can easily inspect it / understand the data. Arrow cannot save data as .flac, so we'll just save a path to the original data. Curious to hear your opinions on this as well.
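
For illustration, inspecting a sample with the path-based layout looks roughly like this (a sketch following the datasets Audio feature; decoding requires an audio backend such as soundfile to be installed):

from datasets import load_dataset

ds = load_dataset("librispeech_asr", "clean", split="validation")
sample = ds[0]

# The original .flac path is kept; decoding to an array happens lazily on access.
print(sample["file"])                    # path to the cached .flac file
print(sample["audio"]["path"])           # same underlying file
print(sample["audio"]["array"].shape)    # decoded waveform as a numpy array
print(sample["audio"]["sampling_rate"])  # 16000 for Librispeech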

@patrickvonplaten (Contributor, Author)

So what I would suggest here is to do the following:

  1. Do load_dataset(..., cache_dir=/a/read-only/folder), and then:
  • either just re-use load_dataset(..., cache_dir=...), which should always re-use the data in the cache_dir since the hash of the URL matches - so there should never be any duplicated downloading,

or

  • if you want to store the files as MP3 locally, first convert the files to MP3 in the read-only folder, then do ds.save_to_disk(/some/path), which will save the correct paths to the MP3 files in the read-only folder, and then you can easily re-use the small arrow dataset saved in /some/path (see the sketch below).
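
A rough sketch of that MP3 route (not part of this PR; it assumes pydub with ffmpeg is available, and all paths are placeholders - any flac-to-mp3 converter would do):

import os

from datasets import load_dataset
from pydub import AudioSegment

ds = load_dataset("librispeech_asr", split="train.clean.100", cache_dir="/a/read-only/folder")

mp3_dir = "/some/path/mp3"
os.makedirs(mp3_dir, exist_ok=True)

def to_mp3(example):
    # Convert the cached .flac file and point the "file" column at the new MP3.
    # Re-casting the decoded "audio" column to the new paths is left out to keep this short.
    out = os.path.join(mp3_dir, os.path.basename(example["file"]).replace(".flac", ".mp3"))
    AudioSegment.from_file(example["file"]).export(out, format="mp3")
    example["file"] = out
    return example

ds = ds.map(to_mp3)
ds.save_to_disk("/some/path")  # the small arrow dataset now references the MP3 paths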

@patrickvonplaten (Contributor, Author)

Also relevant here: #3663

@lhoestq (Member) commented Apr 21, 2022

I also added some documentation about how save_to_disk handles audio files here: #4193

@lhoestq (Member) left a comment

This PR looks good thanks :)

@patrickvonplaten can you also update the dataset_infos.json file with this new config?

datasets-cli test ./datasets/librispeech_asr --save_infos --name all

To fix the CI, you can also replace audio-speaker-identification with speaker-identification in the task tags of the dataset card.

@albertz commented Apr 21, 2022

So, you say we anyway need to share the cache dir among users?

No, the cache dir can still be a directory in the job output folder.

@dthulke But this is what I mean. When we share the job output folder, it means we share the cache dir among users.

I wonder if load_dataset(..., cache_dir=job_output_cache_dir) is always safe to do then, i.e. that it really would not modify the job_output_cache_dir.

We could enforce that by making the job_output_cache_dir read-only afterwards. We currently don't do this.

@patrickvonplaten @dthulke But in any case, we actually prefer the data content to be inside the dataset (the arrow files). Lots of small files would be very problematic for our cache manager. We have one main copy of the data on NFS, but accessing the NFS directly by all computing nodes is not feasible, so the cache manager will have copies of the files on the nodes. So it means, whenever we access some file, we query the cache manager DB whether the file is already cached somewhere (some other computing node) and if so, it copies it from the other computing node and not from NFS. This works very well when there are not too many files (but the files can be big). So, we want to have only a few but big files. Even for NFS access this is much better.

I also commented in #3663.

@patrickvonplaten (Contributor, Author)

Hey @albertz @dthulke,

Thanks a lot for your input!

We've discussed quite a bit with @lhoestq and we think the best approach is the following:

a) load_dataset(...) will not store both the bytes and the files, because this would mean that 3x the size of the dataset would often be needed (1. the compressed tar.gz files, 2. the extracted files, 3. the raw bytes in arrow format).

For canonical datasets like librispeech and common voice I think we want to keep the dataset filenames because of i) no breaking changes and ii) reasons explained in #3663

However, it's also trivial to write your own dataset downloading script for librispeech and just not extract the folder, e.g. this line: https://huggingface.co/datasets/common_voice/blob/main/common_voice.py#L671

Then saving the bytes is allowed, and the dataset will be self-contained out-of-the-box when using load_dataset(...).

b) Now, one major problem that you uncovered is that save_to_disk(...) currently does not necessarily save a dataset so that it is self-contained. We will change that asap. This means that once we've corrected this, the following will work when you download the canonical librispeech dataset:

ds = load_dataset("....")  # <- here we have a dependency on the filepathes
ds[0]["audio"]["bytes"] # <- will not work

ds.save_to_disk("/local/path")  # <- now we want to have a self-contained dataset in arrow format, so we load the files into bytes and save it in arrow format

# now you can delete everything besides "/local/path"

ds = load_from_disk("/local/path")  # <- this will work

So either option a), where you define your own librispeech data downloading script (you could just sign up here: https://huggingface.co/join) and upload a dataset loading script in private mode so that no one can see it, and you would always store the audio as bytes, or option b), where you first load, then save to disk, then delete the cache, would work.

Hope that fits in your vision :-)

cc @lhoestq @mariosasko

@dthulke commented Apr 21, 2022

@patrickvonplaten sounds like a good approach to me. For b) this could even be configurable with a parameter like embed_external_files as you have for push_to_hub (if people prefer to keep separate audio files).

@albertz commented Apr 21, 2022

However it's also trivial to write your own datasetset downloading script of librispeech and just not extract the folder

I don't exactly understand. In all cases, we need to extract it to prepare the dataset, don't we? No matter whether we want to store the raw bytes inside the dataset or leave them as local files. Just that in the first case, we can safely delete the extracted files after the dataset preparation.

save_to_disk(...) is currently not necessarily saving a dataset to be self-contained. We will change that asap.

For us, this sounds exactly like what we want.

But regarding not introducing breaking changes, wouldn't this maybe also break some setups for users who don't expect this new behavior?

@dthulke commented Apr 21, 2022

@albertz I would suggest moving the discussion of implementation details on our side to the following issue: rwth-i6/i6_core/issues/257

@lhoestq (Member) commented Apr 21, 2022

I like the idea of adding embed_external_files to save_to_disk and setting it to True by default.
It's indeed a kind of breaking change, since some users will get bigger Arrow files when updating the lib, but the advantages are nice:

  1. I like the idea of having it self contained, in case you want to delete your cache
  2. users also upload these Arrow files to cloud storage via the fs parameter, and in this case they would expect to upload a self-contained dataset
  3. consistency with push_to_hub

If it sounds good to you I'll open an issue to discuss this and track the advancements

Patrick von Platen added 3 commits April 21, 2022 16:56
@lhoestq (Member) left a comment

Awesome, thanks!

@patrickvonplaten patrickvonplaten merged commit 91d7171 into huggingface:master Apr 22, 2022
@patrickvonplaten patrickvonplaten deleted the add_all_to_librispeech branch April 22, 2022 09:45
@@ -159,6 +256,11 @@ def _generate_examples(self, files):
id_, transcript = line.split(" ", 1)
audio_file = f"{id_}.flac"
speaker_id, chapter_id = [int(el) for el in id_.split("-")[:2]]
audio_file = (
@patrickvonplaten (Contributor, Author) commented on this diff:

Needed to have extracted files in non-streaming mode

@albertvillanova (Member)

Closed #4179.

@wxlsummer

load_dataset("librispeech_asr")

Hi, when I run:

from datasets import load_dataset
import pandas as pd
from multiprocessing import Pool
import os
data_dir = my own path
dataset=load_dataset("librispeech_asr", cache_dir=data_dir)

after downloading those files and the splitting process, I get an error like this:

ExpectedMoreSplits Traceback (most recent call last)
/tmp/ipykernel_815982/814767946.py in
4 import os
5 data_dir = '/disk/scratch2/s1905792/librispeech'
----> 6 dataset=load_dataset("librispeech_asr", cache_dir=data_dir)
7 #test = load_dataset("librispeech_asr", "all", "train", cache_dir=data_dir)

/disk/scratch2/s1905792/anaconda3/envs/py3.7-gpu/lib/python3.7/site-packages/datasets/load.py in load_dataset(path, name, data_dir, data_files, split, cache_dir, features, download_config, download_mode, verification_mode, ignore_verifications, keep_in_memory, save_infos, revision, use_auth_token, task, streaming, num_proc, storage_options, **config_kwargs)
1813 try_from_hf_gcs=try_from_hf_gcs,
1814 num_proc=num_proc,
-> 1815 storage_options=storage_options,
1816 )
1817

/disk/scratch2/s1905792/anaconda3/envs/py3.7-gpu/lib/python3.7/site-packages/datasets/builder.py in download_and_prepare(self, output_dir, download_config, download_mode, verification_mode, ignore_verifications, try_from_hf_gcs, dl_manager, base_path, use_auth_token, file_format, max_shard_size, num_proc, storage_options, **download_and_prepare_kwargs)
911 verification_mode=verification_mode,
912 **prepare_split_kwargs,
--> 913 **download_and_prepare_kwargs,
914 )
915 # Sync info

/disk/scratch2/s1905792/anaconda3/envs/py3.7-gpu/lib/python3.7/site-packages/datasets/builder.py in _download_and_prepare(self, dl_manager, verification_mode, **prepare_split_kwargs)
1020
1021 if verification_mode == VerificationMode.BASIC_CHECKS or verification_mode == VerificationMode.ALL_CHECKS:
-> 1022 verify_splits(self.info.splits, split_dict)
1023
1024 # Update the info object with the splits.

/disk/scratch2/s1905792/anaconda3/envs/py3.7-gpu/lib/python3.7/site-packages/datasets/utils/info_utils.py in verify_splits(expected_splits, recorded_splits)
89 return
90 if len(set(expected_splits) - set(recorded_splits)) > 0:
---> 91 raise ExpectedMoreSplits(str(set(expected_splits) - set(recorded_splits)))
92 if len(set(recorded_splits) - set(expected_splits)) > 0:
93 raise UnexpectedSplits(str(set(recorded_splits) - set(expected_splits)))

ExpectedMoreSplits: {'validation.other', 'test.other', 'test.clean', 'validation.clean', 'train.other.500', 'train.clean.360', 'train.clean.100'}

Could you please tell me where I went wrong and whether I missed any files?
Best Regards,
Xiaoliang

@albertvillanova (Member)

Hi @wxlsummer,

Let's continue the discussion in the Community tab of the dataset: https://huggingface.co/datasets/openslr/librispeech_asr/discussions/11
