Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for subtitles from multiple languages #262

Merged
merged 15 commits into from
Jan 21, 2024

Conversation

sramshetty
Copy link
Contributor

Feature

Aim is to capture all requested subtitles with the same clip segment, i.e. subtitles share the same clip time segment.

  • could do this in either a clip-first or lang-first manner

Notes:

Additionally, noticed that yt-dlp doesn't like playlist links and causes issues since "requested_subtitles" and other keys are not present; however the metadata returned in info_dict can be used to parse all subtitles from full list of videos in playlist. Maybe making that note for video set curation could help others if I didn't miss a preexisting warning.

@sramshetty sramshetty marked this pull request as ready for review October 31, 2023 02:26
@sramshetty
Copy link
Contributor Author

sramshetty commented Oct 31, 2023

Could use a preliminary review since I'm not sure if these changes will negatively effect other parts of the system or future plans, thanks!

video2dataset/data_reader.py Outdated Show resolved Hide resolved
@iejMac
Copy link
Owner

iejMac commented Nov 1, 2023

looks mostly ok to me. I'm assuming you've tried it out with clipping and it gives expected results?

@sramshetty
Copy link
Contributor Author

looks mostly ok to me. I'm assuming you've tried it out with clipping and it gives expected results?

was working for me, will need to verify after making the changes you recommended

@sramshetty
Copy link
Contributor Author

sramshetty commented Nov 4, 2023

  • add an example case with both 'first' and 'all' results

@sramshetty
Copy link
Contributor Author

@iejMac Think the necessary changes are done and was able to test that I'm getting what is expect.

@sramshetty
Copy link
Contributor Author

sramshetty commented Dec 3, 2023

sorry about the lint stuff, but think I finally got what I could resolved.

However, pylint complains about video2dataset\dataloader\video_decode.py:59:35: W1116: Second argument of isinstance is not a type (isinstance-second-argument-not-valid-type) which seems to be fixed here so maybe a version thing?

@rom1504 rom1504 force-pushed the multilingual_transcript branch from 4b09b98 to 834a20e Compare January 21, 2024 19:38
@rom1504 rom1504 merged commit cc40160 into iejMac:main Jan 21, 2024
2 checks passed
@rom1504
Copy link
Collaborator

rom1504 commented Jan 21, 2024

thanks for the PR @sramshetty !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants