diff --git a/README.md b/README.md
index 2e55c9ff..c4a061c8 100644
--- a/README.md
+++ b/README.md
@@ -163,7 +163,7 @@ These arguments give coarse control over input/output "shape" of the dataset. Fo
 ## Downloading YouTube Metadata
-If we want to download a large amount of YouTube videos with video2dataset we can specify some parameters and also extract useful metadata as well. For directions on how to do so please see this [example](https://github.com/iejMac/video2dataset/blob/main/examples/yt_metadata.md).
+If we want to download a large amount of YouTube videos with video2dataset we can specify some parameters - including a proxy to distribute requests - and also extract useful metadata as well. For directions on how to do so please see this [example](https://github.com/iejMac/video2dataset/blob/main/examples/yt_metadata.md).
 ## Incremental mode
diff --git a/examples/yt_metadata.md b/examples/yt_metadata.md
index 0d06faf4..c02023d4 100644
--- a/examples/yt_metadata.md
+++ b/examples/yt_metadata.md
@@ -1,3 +1,16 @@
+### Setting up yt-dlp proxy:
+#### Usage
+
+yt-dlp allows you to set up a proxy to send requests to YouTube. We surface this feature through our config file through the `proxy` and the flag `proxy-check-certificate`. If `proxy-check-certificate` is set to False, it suppresses HTTPS certificate validation.
+
+```yaml
+yt_args:
+    download_size: 360
+    download_audio_rate: 44100
+    proxy: "url:port"
+    proxy-check-certificate: True / False
+```
+
 ### Download YouTube metadata & subtitles:
 #### Usage
diff --git a/video2dataset/dataloader/custom_wds.py b/video2dataset/dataloader/custom_wds.py
index fb8f7578..61c6c73c 100644
--- a/video2dataset/dataloader/custom_wds.py
+++ b/video2dataset/dataloader/custom_wds.py
@@ -507,9 +507,10 @@ def __init__(
             main_datapipe.apply_sharding(world_size, global_rank)
             # synchronize data across processes to prevent hanging if sharding is uneven (which is likely)
             main_datapipe = main_datapipe.fullsync()
-        except ValueError as e:
+        except (RuntimeError, ValueError) as e:
             if str(e) == "Default process group has not been initialized, please make sure to call init_process_group.":
                 print("torch distributed not used, not applying sharding in dataloader")
+                pass
             else:
                 raise  # re-raise if it's a different ValueError
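
Note on the `custom_wds.py` hunk: the widened `except` clause treats the "process group not initialized" message the same way whether it arrives as a `RuntimeError` or a `ValueError`, and still re-raises anything else. A minimal sketch of that pattern follows, written without torch so it runs standalone. The names `shard_or_skip` and `get_rank` are hypothetical stand-ins for illustration (the real code calls into torch distributed); only the message string is taken from the diff.

```python
# Message string copied from the diff; everything else here is illustrative.
NOT_INITIALIZED_MSG = (
    "Default process group has not been initialized, "
    "please make sure to call init_process_group."
)


def shard_or_skip(get_rank):
    """Return the rank from get_rank(), or None when distributed is not set up.

    get_rank is a hypothetical callable standing in for the distributed
    rank lookup performed inside the try block in the diff.
    """
    try:
        return get_rank()
    except (RuntimeError, ValueError) as e:
        # Only swallow the specific "not initialized" error, whichever of
        # the two exception types carries it.
        if str(e) == NOT_INITIALIZED_MSG:
            print("torch distributed not used, not applying sharding in dataloader")
            return None
        raise  # any other RuntimeError or ValueError propagates unchanged
```

Matching on the exact message is brittle if the upstream wording changes; the sketch keeps it only because that is what the diff does.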