This includes (at least) several steps, each of which will be detailed in its own GitHub issue/PR:
Determine the correct format for storing each video modality and implement pseudolabelers, data downloaders, etc. to get the video data stored in parallel per-modality directories in a format usable by 4M and video2dataset. "Definition of done" here means we have the data in the right directories and can load it in the correct format. ([PARENT ISSUE] Data preprocessing and pseudolabeling #3)
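As a sketch of what "parallel directories" could mean in practice (the layout and helper below are assumptions for illustration, not the repo's actual loader): one subdirectory per modality, with shard filenames aligned across modalities so a webdataset-style loader can zip corresponding shards together.

```python
from pathlib import Path

# Hypothetical layout (names are illustrative, not the project's actual paths):
# data/
#   rgb/      00000.tar, 00001.tar, ...
#   depth/    00000.tar, 00001.tar, ...
#   captions/ 00000.tar, 00001.tar, ...

def aligned_shards(root: str, modalities: list[str]) -> list[dict]:
    """Return one record per shard id, mapping modality -> shard path.

    Fails fast if any modality is missing a shard, so a misaligned
    pseudolabeling run surfaces immediately instead of silently
    dropping samples at training time.
    """
    root_path = Path(root)
    shard_ids = sorted(p.stem for p in (root_path / modalities[0]).glob("*.tar"))
    records = []
    for sid in shard_ids:
        record = {}
        for mod in modalities:
            shard = root_path / mod / f"{sid}.tar"
            if not shard.exists():
                raise FileNotFoundError(f"missing {mod} shard for id {sid}")
            record[mod] = shard
        records.append(record)
    return records
```

The per-modality-directory convention keeps each pseudolabeler independent: it only needs to write shards whose ids match the RGB shards it was run on.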
TODO? @garjania what other steps are required here? anything for decoder embeddings?
kdu4108 changed the title from "Implement the temporal changes in 4M to account for video" to "[PARENT ISSUE] Implement the temporal changes in 4M to account for video" on Jun 21, 2024.
Considering the RGB frames: before adding anything to modality_info or modality_transform, we need to tokenize them. So I suggest also including an RGB tokenization step for the video datasets among the first steps.
(Why?) We need to tokenize RGB (and all other vision-like modalities) because they are fed to the model as tokens. (In fact, RGB is the only modality that could instead be input as raw pixel patches, which would not require tokenization.)
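To make the shape bookkeeping of that tokenization step concrete, here is a minimal sketch with a stand-in quantizer. A real pipeline would run a pretrained VQ tokenizer over each frame; the hashing "quantizer" and codebook size below are placeholders for illustration only.

```python
import numpy as np

PATCH = 16
CODEBOOK_SIZE = 8192  # assumed codebook size, for illustration only

def tokenize_clip(frames: np.ndarray) -> np.ndarray:
    """frames: (T, H, W, 3) uint8 RGB -> (T, H//PATCH * W//PATCH) int64 token ids."""
    t, h, w, _ = frames.shape
    assert h % PATCH == 0 and w % PATCH == 0, "frames must be patch-aligned"
    # Split each frame into non-overlapping PATCH x PATCH patches.
    patches = frames.reshape(t, h // PATCH, PATCH, w // PATCH, PATCH, 3)
    patches = patches.transpose(0, 1, 3, 2, 4, 5).reshape(t, -1, PATCH * PATCH * 3)
    # Stand-in quantizer: hash each patch to a pseudo codebook id.
    # A real VQ tokenizer would map each patch/frame to learned codebook indices.
    tokens = patches.astype(np.int64).sum(axis=-1) % CODEBOOK_SIZE
    return tokens

clip = np.zeros((8, 224, 224, 3), dtype=np.uint8)  # 8 frames of 224x224 RGB
tokens = tokenize_clip(clip)
assert tokens.shape == (8, 196)  # 14 x 14 patches per frame
```

Once frames are reduced to per-frame token grids like this, registering the video RGB modality in modality_info/modality_transform becomes a matter of declaring its vocabulary size and sequence length per frame.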
Implement the model according to this design: https://docs.google.com/presentation/d/1AY3QV1N_hoi9aXI1r8QTqrNmDK9LyorgJDQMPWb8hBo/edit#slide=id.g2e696416940_0_144.