Implement the encoder embeddings to encode which frame it is (a temporal embedding, in addition to the patch position and modality embeddings). #4

kdu4108 · 2024-06-21T12:06:19Z

According to https://docs.google.com/presentation/d/1AY3QV1N_hoi9aXI1r8QTqrNmDK9LyorgJDQMPWb8hBo/edit#slide=id.g2e696416940_0_144, we have to add the temporal/frame encoding to IMAGE-based modality embeddings (but not sequence-based ones).

A good starting point is this line:

ml-4m/fourm/models/encoder_embeddings.py

Line 206 in 4c2c9a5

x_emb = repeat(self.pos_emb + self.mod_emb, '() n d -> b n d', b=B)

and do roughly the same, but with an extra temporal embedding added to the sum?
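One possible reading of that suggestion, as a minimal sketch rather than the actual 4M code: keep the existing `pos_emb + mod_emb` sum and add a per-frame vector looked up by frame index. The class name `TemporalImageEmbedding`, the `max_frames` argument, the `frame_idx` input, and the choice of a learned `nn.Embedding` table are all assumptions made for illustration. Plain `expand` stands in for the einops `repeat` call in the line above.

```python
import torch
import torch.nn as nn


class TemporalImageEmbedding(nn.Module):
    """Hypothetical sketch: pos_emb + mod_emb + temp_emb for one image-like modality."""

    def __init__(self, num_patches: int, dim: int, max_frames: int):
        super().__init__()
        # Stand-ins for the existing per-patch position and per-modality embeddings.
        self.pos_emb = nn.Parameter(torch.randn(1, num_patches, dim) * 0.02)
        self.mod_emb = nn.Parameter(torch.randn(1, 1, dim) * 0.02)
        # Assumption: one learned vector per frame index, up to max_frames.
        self.temp_emb = nn.Embedding(max_frames, dim)

    def forward(self, x: torch.Tensor, frame_idx: torch.Tensor) -> torch.Tensor:
        # x: (B, N, D) patch tokens; frame_idx: (B,) integer frame index per sample.
        B = x.shape[0]
        # Same role as repeat(self.pos_emb + self.mod_emb, '() n d -> b n d', b=B).
        x_emb = (self.pos_emb + self.mod_emb).expand(B, -1, -1)
        # (B, D) -> (B, 1, D): every patch of a frame gets the same temporal vector.
        t_emb = self.temp_emb(frame_idx).unsqueeze(1)
        return x + x_emb + t_emb


# Usage: two samples, taken from frame 0 and frame 3 respectively.
emb = TemporalImageEmbedding(num_patches=196, dim=64, max_frames=8)
x = torch.randn(2, 196, 64)
out = emb(x, torch.tensor([0, 3]))
```

Because the temporal vector is broadcast over the patch axis, changing the frame index shifts every patch token of that sample by the same offset.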

Things to consider: how do we make sure the temporal (frame) embedding doesn't interfere with the patch position embedding?
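One concrete way such interference could show up: if the temporal table is built the same way as the position table, the sum is symmetric and two different (frame, patch) pairs can collapse to the same vector. The check below is only a sketch of how one might test for that. The helper `min_pairwise_distance` is hypothetical, and random tensors stand in for the real embeddings.

```python
import torch


def min_pairwise_distance(pos_emb: torch.Tensor, temp_emb: torch.Tensor) -> float:
    """Smallest distance between summed embeddings of two distinct (frame, patch) pairs."""
    # (T, 1, D) + (1, N, D) -> (T, N, D), then flatten to one row per (frame, patch).
    combined = (temp_emb[:, None, :] + pos_emb[None, :, :]).flatten(0, 1)
    # Explicit pairwise differences; fine for the small sizes used here.
    dists = (combined[:, None, :] - combined[None, :, :]).norm(dim=-1)
    # Ignore the zero distance of each row to itself.
    dists.fill_diagonal_(float("inf"))
    return dists.min().item()


torch.manual_seed(0)
num_patches, num_frames, dim = 16, 4, 32
pos_emb = torch.randn(num_patches, dim)

# Degenerate choice: reuse the position table for time, so (frame 1, patch 2)
# and (frame 2, patch 1) sum to the same vector and become indistinguishable.
reused = min_pairwise_distance(pos_emb, pos_emb[:num_frames])

# Separate temporal table: no collision is built into the construction.
separate = min_pairwise_distance(pos_emb, torch.randn(num_frames, dim))
```

A minimum distance of zero means at least two (frame, patch) pairs are indistinguishable after the sum, which is the failure mode to rule out for whichever temporal embedding is chosen.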

Definition of Done: all image-based encoder embeddings are augmented with a temporal embedding.

@vesteinn @garjania

kdu4108 · 2024-07-03T14:05:25Z

Suggestion for aligning temporal embeddings across modalities: when you build the embedding sum for one modality, e.g.,
Frame 0 (RGB): x + pos_emb + temp_emb + mod_emb, and for another one, e.g., <frame_0_…>: x + pos_emb + mod_emb + temp_emb,
make sure temp_emb is the same for those two modalities if the temporal position (frame) is the same.
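A simple way to get that guarantee, sketched under assumptions: create one temporal table and hand the same object to every modality's embedding module. `ModalityEmbedding`, `shared_temp_emb`, and the modality variable `other` are hypothetical names, and the learned `nn.Embedding` table is an assumption, not a decision taken in this issue.

```python
import torch
import torch.nn as nn


class ModalityEmbedding(nn.Module):
    """Hypothetical per-modality embedding that borrows a shared temporal table."""

    def __init__(self, num_patches: int, dim: int, shared_temp_emb: nn.Embedding):
        super().__init__()
        self.pos_emb = nn.Parameter(torch.randn(1, num_patches, dim) * 0.02)
        self.mod_emb = nn.Parameter(torch.randn(1, 1, dim) * 0.02)
        # Same nn.Embedding object in every modality, so frame t maps to one vector.
        self.temp_emb = shared_temp_emb

    def forward(self, x: torch.Tensor, frame_idx: torch.Tensor) -> torch.Tensor:
        t_emb = self.temp_emb(frame_idx).unsqueeze(1)  # (B, 1, D)
        return x + self.pos_emb + self.mod_emb + t_emb


dim, max_frames = 32, 8
shared_temp_emb = nn.Embedding(max_frames, dim)  # assumption: learned table
rgb = ModalityEmbedding(num_patches=16, dim=dim, shared_temp_emb=shared_temp_emb)
other = ModalityEmbedding(num_patches=16, dim=dim, shared_temp_emb=shared_temp_emb)

frame_idx = torch.tensor([0, 5])
# Both modalities look up identical temporal vectors for the same frames.
same_vectors = torch.equal(rgb.temp_emb(frame_idx), other.temp_emb(frame_idx))
same_weights = rgb.temp_emb.weight is other.temp_emb.weight
```

Since both modules hold the same parameter, gradients from every modality update one table, and PyTorch counts it only once when the modules are collected under a common parent.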

kdu4108 mentioned this issue Jun 21, 2024

[PARENT ISSUE] Implement the temporal changes in 4M to account for video #2

Open
