[1] and [2] use features from 2D and 3D CNNs + temporal max-pooling.
[3] uses I3D/S3D features + global mean-pooling [+ linear transformation]
-> I would go for something similar to this, at least for a first attempt, as it is easy to implement. Once it works, we can consider more complex approaches.
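A minimal sketch (PyTorch) of that simple visual branch, assuming the clip features are already pre-extracted (the 1024/512 dimensions and the class name are placeholders, not from [3]):

```python
import torch
import torch.nn as nn


class SimpleVisualBranch(nn.Module):
    """Pre-extracted clip features -> global mean-pooling over time -> linear projection."""

    def __init__(self, feat_dim: int = 1024, embed_dim: int = 512):
        super().__init__()
        # Single linear layer mapping the pooled video features into the joint text-video space.
        self.proj = nn.Linear(feat_dim, embed_dim)

    def forward(self, clip_features: torch.Tensor) -> torch.Tensor:
        # clip_features: (batch, num_clips, feat_dim), e.g. I3D/S3D features per clip.
        pooled = clip_features.mean(dim=1)  # global mean-pooling over time
        # Swapping in clip_features.max(dim=1).values would give the temporal
        # max-pooling variant used in [1]/[2].
        return self.proj(pooled)            # (batch, embed_dim)


# Usage: emb = SimpleVisualBranch()(torch.randn(8, 16, 1024))
```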
In both [1] and [2], a non-linear gating mechanism is applied to the vectors obtained from each modality.
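Roughly, this gating re-weights the projected vector element-wise with a learned sigmoid gate (a sketch of the idea, not the exact layer from [1]/[2]):

```python
import torch
import torch.nn as nn


class ContextGating(nn.Module):
    """y = x * sigmoid(Wx + b): each dimension of x is scaled by a learned gate."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, dim); output has the same shape.
        return x * torch.sigmoid(self.gate(x))
```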
[3] uses a dedicated training loss (MIL-NCE) that compensates for misalignments between the video clips and their narrations.
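A hedged, simplified sketch of that idea: each clip is paired with a *bag* of temporally close captions (candidate positives), so exact alignment is not required, and the other clips' captions in the batch act as negatives. The function name and tensor shapes below are assumptions for illustration; the paper's full loss also mixes in negatives from both pairing directions.

```python
import torch


def mil_nce_loss(video_emb: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
    """video_emb: (B, D); text_emb: (B, K, D) with K candidate captions per clip."""
    B, K, D = text_emb.shape
    # Similarity between every video and every caption in the batch: (B, B*K).
    sims = video_emb @ text_emb.reshape(B * K, D).t()
    sims = sims.reshape(B, B, K)  # (video, video-owning-the-caption, candidate)
    # Numerator: the K captions belonging to the same clip are all treated as positives.
    pos = torch.logsumexp(sims[torch.arange(B), torch.arange(B)], dim=-1)  # (B,)
    # Denominator: every caption in the batch (positives + negatives).
    denom = torch.logsumexp(sims.reshape(B, B * K), dim=-1)                # (B,)
    return (denom - pos).mean()  # mean over the batch of -log(pos / denom)
```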
Possible sources of inspiration for the visual part:
[1] https://arxiv.org/abs/2006.09199
[2] https://openaccess.thecvf.com/content_ICCV_2019/html/Miech_HowTo100M_Learning_a_Text-Video_Embedding_by_Watching_Hundred_Million_Narrated_ICCV_2019_paper.html
[3] https://openaccess.thecvf.com/content_CVPR_2020/html/Miech_End-to-End_Learning_of_Visual_Representations_From_Uncurated_Instructional_Videos_CVPR_2020_paper.html