This includes (at least) several steps, each of which will be detailed in its own GitHub issue/PR:
Determine the correct format for storing each video modality and implement pseudolabelers, data downloaders, etc. to get the video data stored in parallel per-modality directories in a format usable by 4M and video2dataset. "Definition of done" here means we have the data in the right directories and can load it in the correct format. ([PARENT ISSUE] Data preprocessing and pseudolabeling #3)
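As a sketch of what "parallel directories" could mean in practice (the layout and helper below are assumptions for illustration, not the repo's actual loader): one subdirectory per modality, with shard filenames aligned across modalities so a webdataset-style loader can zip corresponding shards together.

```python
from pathlib import Path

# Hypothetical layout (names are illustrative, not the project's actual paths):
# data/
#   rgb/      00000.tar, 00001.tar, ...
#   depth/    00000.tar, 00001.tar, ...
#   captions/ 00000.tar, 00001.tar, ...

def aligned_shards(root: str, modalities: list[str]) -> list[dict]:
    """Return one record per shard id, mapping modality -> shard path.

    Fails fast if any modality is missing a shard, so a misaligned
    pseudolabeling run surfaces immediately instead of silently
    dropping samples at training time.
    """
    root_path = Path(root)
    shard_ids = sorted(p.stem for p in (root_path / modalities[0]).glob("*.tar"))
    records = []
    for sid in shard_ids:
        record = {}
        for mod in modalities:
            shard = root_path / mod / f"{sid}.tar"
            if not shard.exists():
                raise FileNotFoundError(f"missing {mod} shard for id {sid}")
            record[mod] = shard
        records.append(record)
    return records
```

The per-modality-directory convention keeps each pseudolabeler independent: it only needs to write shards whose ids match the RGB shards it was run on.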
TODO? @garjania what other steps are required here? anything for decoder embeddings?
kdu4108 changed the title from "Implement the temporal changes in 4M to account for video" to "[PARENT ISSUE] Implement the temporal changes in 4M to account for video" on Jun 21, 2024.
Considering the RGB frames: before adding anything to modality_info or modality_transform, we need to tokenize them. So I suggest also including an RGB tokenization step for the video datasets among the first steps.
(Why?) We need to tokenize RGB (and all other vision-like modalities) because they are fed to the model as tokens. (In fact, RGB is the only modality that could instead be input as raw pixel patches, which would not require tokenization.)
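To make the shape bookkeeping of that tokenization step concrete, here is a minimal sketch with a stand-in quantizer. A real pipeline would run a pretrained VQ tokenizer over each frame; the hashing "quantizer" and codebook size below are placeholders for illustration only.

```python
import numpy as np

PATCH = 16
CODEBOOK_SIZE = 8192  # assumed codebook size, for illustration only

def tokenize_clip(frames: np.ndarray) -> np.ndarray:
    """frames: (T, H, W, 3) uint8 RGB -> (T, H//PATCH * W//PATCH) int64 token ids."""
    t, h, w, _ = frames.shape
    assert h % PATCH == 0 and w % PATCH == 0, "frames must be patch-aligned"
    # Split each frame into non-overlapping PATCH x PATCH patches.
    patches = frames.reshape(t, h // PATCH, PATCH, w // PATCH, PATCH, 3)
    patches = patches.transpose(0, 1, 3, 2, 4, 5).reshape(t, -1, PATCH * PATCH * 3)
    # Stand-in quantizer: hash each patch to a pseudo codebook id.
    # A real VQ tokenizer would map each patch/frame to learned codebook indices.
    tokens = patches.astype(np.int64).sum(axis=-1) % CODEBOOK_SIZE
    return tokens

clip = np.zeros((8, 224, 224, 3), dtype=np.uint8)  # 8 frames of 224x224 RGB
tokens = tokenize_clip(clip)
assert tokens.shape == (8, 196)  # 14 x 14 patches per frame
```

Once frames are reduced to per-frame token grids like this, registering the video RGB modality in modality_info/modality_transform becomes a matter of declaring its vocabulary size and sequence length per frame.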
Implement the model according to this design: https://docs.google.com/presentation/d/1AY3QV1N_hoi9aXI1r8QTqrNmDK9LyorgJDQMPWb8hBo/edit#slide=id.g2e696416940_0_144.