Implement the encoder embeddings to encode which frame it is (a temporal embedding, in addition to the patch position and modality embeddings). #4

Open
kdu4108 opened this issue Jun 21, 2024 · 1 comment

Comments

@kdu4108
Collaborator

kdu4108 commented Jun 21, 2024

According to https://docs.google.com/presentation/d/1AY3QV1N_hoi9aXI1r8QTqrNmDK9LyorgJDQMPWb8hBo/edit#slide=id.g2e696416940_0_144, we have to add the temporal/frame encoding to IMAGE-based modality embeddings (but not sequence-based ones).

A good starting point: check out this line

x_emb = repeat(self.pos_emb + self.mod_emb, '() n d -> b n d', b=B)
and do roughly the same thing, but with an extra temporal embedding added to the sum?
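
To make that idea concrete, here is a minimal sketch of the same sum with a temporal term added. It uses NumPy instead of the PyTorch/einops code so it runs on its own. Everything besides `pos_emb`, `mod_emb`, `x_emb`, and `B` is an assumption for illustration: the lookup table `temp_table`, the constant `MAX_FRAMES`, the helper `image_embedding`, and the choice to look the temporal embedding up by an integer frame index are hypothetical, not the repo's actual design.

```python
import numpy as np

rng = np.random.default_rng(0)

B, N, D = 2, 4, 8      # batch size, patches per frame, embedding dim
MAX_FRAMES = 3         # hypothetical maximum number of frames

pos_emb = rng.normal(size=(1, N, D))            # patch position embedding
mod_emb = rng.normal(size=(1, 1, D))            # modality embedding (assumed to broadcast over patches)
temp_table = rng.normal(size=(MAX_FRAMES, D))   # hypothetical: one vector per frame index


def image_embedding(frame_idx, batch_size):
    """Sum position, modality, and temporal embeddings, then repeat over the batch."""
    temp_emb = temp_table[frame_idx].reshape(1, 1, D)   # broadcasts over all N patches
    combined = pos_emb + mod_emb + temp_emb             # shape (1, N, D)
    # NumPy equivalent of repeat(..., '() n d -> b n d', b=batch_size)
    return np.repeat(combined, batch_size, axis=0)


x_emb_f0 = image_embedding(frame_idx=0, batch_size=B)
x_emb_f1 = image_embedding(frame_idx=1, batch_size=B)
```

In this sketch the temporal vector is constant across the patches of a frame, so two frames differ only by `temp_table[1] - temp_table[0]` at every patch position.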

Things to consider: make sure the temporal (frame) embedding doesn't somehow interfere with the positional patch embedding?

Definition of Done: all image-based encoder embeddings are augmented with a temporal embedding.

@vesteinn @garjania

@kdu4108
Collaborator Author

kdu4108 commented Jul 3, 2024

Suggestion for aligning temporal embeddings across modalities: when you build the embedding sum for one modality, e.g.,
Frame 0 (RGB): x + pos_emb + temp_emb + mod_emb, and for another one, e.g., <frame_0_…>: x + pos_emb + mod_emb + temp_emb,
make sure temp_emb is the same for those two modalities when the position is the same.
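
One way to get that alignment is to have every modality look up its temporal embedding in a single shared table, rather than each modality owning its own. The sketch below assumes that approach and uses NumPy for self-containment. The names `shared_temp_table`, `embed`, and the second modality key `"other"` are hypothetical placeholders (the second modality's real name is elided above), and giving each modality its own `pos_emb` is also an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, MAX_FRAMES = 4, 8, 3   # patches, embedding dim, hypothetical max frames

# A single temporal table shared by all image-based modalities (assumption).
shared_temp_table = rng.normal(size=(MAX_FRAMES, D))

# Per-modality position and modality embeddings; "other" is a placeholder name.
pos_embs = {m: rng.normal(size=(N, D)) for m in ("rgb", "other")}
mod_embs = {m: rng.normal(size=(D,)) for m in ("rgb", "other")}


def embed(x, modality, frame_idx):
    """x + pos_emb + mod_emb + temp_emb, with temp_emb taken from the shared table."""
    return x + pos_embs[modality] + mod_embs[modality] + shared_temp_table[frame_idx]


x_rgb = rng.normal(size=(N, D))
x_other = rng.normal(size=(N, D))

out_rgb = embed(x_rgb, "rgb", frame_idx=0)
out_other = embed(x_other, "other", frame_idx=0)

# Recover the temporal term each modality received for frame 0.
temp_rgb = out_rgb - x_rgb - pos_embs["rgb"] - mod_embs["rgb"]
temp_other = out_other - x_other - pos_embs["other"] - mod_embs["other"]
```

Because both modalities index the same table, the recovered temporal terms match for the same frame index, which is the property the suggestion asks for.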
