[1] and [2] use features from 2D and 3D CNNs + temporal max-pooling.
[3] uses I3D/S3D features + global mean-pooling [+ linear transformation]
-> I would go for something similar to this, at least for a first attempt, as it is easy to implement. Once it works, we can consider more complex approaches.
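A minimal sketch (PyTorch) of that simple visual branch, assuming the clip features are already pre-extracted (the 1024/512 dimensions and the class name are placeholders, not from [3]):

```python
import torch
import torch.nn as nn


class SimpleVisualBranch(nn.Module):
    """Pre-extracted clip features -> global mean-pooling over time -> linear projection."""

    def __init__(self, feat_dim: int = 1024, embed_dim: int = 512):
        super().__init__()
        # Single linear layer mapping the pooled video features into the joint text-video space.
        self.proj = nn.Linear(feat_dim, embed_dim)

    def forward(self, clip_features: torch.Tensor) -> torch.Tensor:
        # clip_features: (batch, num_clips, feat_dim), e.g. I3D/S3D features per clip.
        pooled = clip_features.mean(dim=1)  # global mean-pooling over time
        # Swapping in clip_features.max(dim=1).values would give the temporal
        # max-pooling variant used in [1]/[2].
        return self.proj(pooled)            # (batch, embed_dim)


# Usage: emb = SimpleVisualBranch()(torch.randn(8, 16, 1024))
```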
In both [1] and [2], a non-linear gating mechanism is applied to the vectors obtained from each modality.
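Roughly, this gating re-weights the projected vector element-wise with a learned sigmoid gate (a sketch of the idea, not the exact layer from [1]/[2]):

```python
import torch
import torch.nn as nn


class ContextGating(nn.Module):
    """y = x * sigmoid(Wx + b): each dimension of x is scaled by a learned gate."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, dim); output has the same shape.
        return x * torch.sigmoid(self.gate(x))
```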
[3] uses a dedicated training loss (MIL-NCE) that compensates for misalignments between the video clips and their narrations.
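A hedged, simplified sketch of that idea: each clip is paired with a *bag* of temporally close captions (candidate positives), so exact alignment is not required, and the other clips' captions in the batch act as negatives. The function name and tensor shapes below are assumptions for illustration; the paper's full loss also mixes in negatives from both pairing directions.

```python
import torch


def mil_nce_loss(video_emb: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
    """video_emb: (B, D); text_emb: (B, K, D) with K candidate captions per clip."""
    B, K, D = text_emb.shape
    # Similarity between every video and every caption in the batch: (B, B*K).
    sims = video_emb @ text_emb.reshape(B * K, D).t()
    sims = sims.reshape(B, B, K)  # (video, video-owning-the-caption, candidate)
    # Numerator: the K captions belonging to the same clip are all treated as positives.
    pos = torch.logsumexp(sims[torch.arange(B), torch.arange(B)], dim=-1)  # (B,)
    # Denominator: every caption in the batch (positives + negatives).
    denom = torch.logsumexp(sims.reshape(B, B * K), dim=-1)                # (B,)
    return (denom - pos).mean()  # mean over the batch of -log(pos / denom)
```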
Possible sources of inspiration for the visual part:
[1] https://arxiv.org/abs/2006.09199
[2] https://openaccess.thecvf.com/content_ICCV_2019/html/Miech_HowTo100M_Learning_a_Text-Video_Embedding_by_Watching_Hundred_Million_Narrated_ICCV_2019_paper.html
[3] https://openaccess.thecvf.com/content_CVPR_2020/html/Miech_End-to-End_Learning_of_Visual_Representations_From_Uncurated_Instructional_Videos_CVPR_2020_paper.html