Clarifying Details to Reproduce on THUMOS14 #23

Open

HYUNJS opened this issue Aug 27, 2023 · 10 comments
Labels: enhancement (New feature or request)

Comments

@HYUNJS commented Aug 27, 2023

While reproducing the accuracy on the THUMOS14 dataset, I found some of your implementation details confusing. I would really appreciate your clarification so that I can reproduce the results.

Q1.
At inference time, segments above the threshold are connected to form one large segment, as shown in the figure below. Although this is an effective post-processing step for the ActivityNet dataset, it is not appropriate for the THUMOS14 dataset, which has many short action instances rather than one or two long ones.

https://github.com/sauradip/STALE/blob/main/stale_inference.py#L156

filt_seg_score_int = ndimage.binary_fill_holes(filt_seg_score_int).astype(int).tolist()

[attached figure: thresholded segments being merged into one large segment by the fill operation]

Q2.
In the dataset builder, why do you add 1 to the start indices and subtract 1 from the end indices?

https://github.com/sauradip/STALE/blob/main/stale_lib/stale_dataloader.py#L188

        for idx in range(len(start_id)):
          lbl_id = label_id[idx]
          start_indexes.append(start_id[idx]+1)
          end_indexes.append(end_id[idx]-1)
          tuple_list.append([start_id[idx]+1, end_id[idx]-1,lbl_id])

@sauradip (Owner)

A1. That is true; this approach will lead to over-generalization into long action instances. You need to change the inference script so that it does not include the filling operation. This is a drawback of our design, since we do not have start/end regressors: the start/end of the action is simply the start/end of the mask.
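
Not including the filling operation could look roughly like the minimal sketch below (illustrative only; the function and variable names are assumptions, not the exact ones in stale_inference.py):

    import numpy as np
    from scipy import ndimage

    def mask_to_segments(scores, thresh=0.5, fill_holes=False):
        # scores: (T,) per-snippet foreground scores.
        # fill_holes=True reproduces the ActivityNet-style merging;
        # keep it False for THUMOS14 so short instances stay separate.
        binary = (scores > thresh).astype(int)
        if fill_holes:
            binary = ndimage.binary_fill_holes(binary).astype(int)
        # Each contiguous run of 1s becomes its own candidate segment.
        labeled, num_segments = ndimage.label(binary)
        segments = []
        for seg_id in range(1, num_segments + 1):
            idx = np.where(labeled == seg_id)[0]
            segments.append((int(idx[0]), int(idx[-1])))
        return segments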

A2. Normally action confidence follows a Gaussian-like curve: the action onset/offset (the boundary regions) has a high chance of being misclassified, whereas the centre of the action segment is classified more reliably. Since our snippet duration is small, we shrink the ground truth slightly towards the centre (e.g., a segment spanning snippets [10, 20] becomes [11, 19]) to avoid boundary misclassification, which would particularly hurt in the zero-shot setting.

@HYUNJS (Author) commented Aug 28, 2023

Thank you for your reply!

Q1. Then, when you obtained the THUMOS14 results in the paper, did you use the filling operation?

I would really appreciate your clarification so that I can reproduce the results on THUMOS14.

Q3. Questions regarding the class-agnostic representation masking step.
Q3-1. Why do you use the mask computed by the 1D-Conv instead of the MaskFormer mask? Is this simply a mix-up between the TAGS and STALE code versions, or is it an important trick for achieving the accuracy reported in the paper?
Q3-2. In the current code, you compute the threshold (\theta_bin) by averaging the foreground probabilities across the temporal axis. Is this the thresholding used in your final version?
Q3-3. Why do you use only one action query in MaskFormer? Perhaps that is an appropriate hyperparameter for the ActivityNet dataset, but what value did you use for the THUMOS14 dataset?

(Code about Q3-2) https://github.com/sauradip/STALE/blob/main/stale_model.py#L213

        ### Action Mask Localizer Branch ###
        bottom_br = self.localizer_mask(features)

        #### Representation Mask ####
        snipmask = self.masktrans(vid_feature.unsqueeze(2),features.unsqueeze(3))
        bot_mask = torch.mean(bottom_br, dim=2)
        soft_mask = torch.sigmoid(snipmask["pred_masks"]).view(-1,self.temporal_scale)
        mask_feat = self.crop_features(features,bot_mask)

(Code about Q3-3) https://github.com/sauradip/STALE/blob/main/stale_model.py#L58

        self.masktrans = TransformerPredictor(
            in_channels=512,
            mask_classification=False,
            num_classes=self.num_classes,
            hidden_dim=512,
            num_queries=1, # Why only one action query?
            nheads=2,
            dropout=0.1,
            dim_feedforward=1,
            enc_layers=2,
            dec_layers=2,
            pre_norm=True,
            deep_supervision=False,
            mask_dim=512,
            enforce_input_project=True
        ).cuda()

In the figure in the paper, several action queries are used for mask decoding.
[attached figure: the paper's architecture diagram showing multiple action queries in the mask decoder]

HYUNJS changed the title from "Inference technique" to "Clarifying Details to Reproduce on THUMOS14" on Aug 28, 2023
@sauradip (Owner)

A1. I did not use filling for THUMOS.

A3-1. Both can be used. Yes, for ActivityNet the 1D-Conv mask comes out more consistent than the MaskFormer mask (when I use 1 query); if I use more queries, then the MaskFormer output can be used there. The reason for using 1 query is GPU memory constraints: I was testing this on a single GPU, and increasing the number of queries is heavy on compute. One important change you need to make if you have more than 1 query is to pass the output through an extra 1-D conv to map from many queries to one, i.e. from N x T x D to 1 x T x D (sketched below). This operation is not done in this code due to memory constraints.
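
For concreteness, a minimal sketch of that many-to-one aggregation, assuming the MaskFormer head returns per-query mask logits of shape (B, N, T); the module and names here are illustrative, not part of the released code:

    import torch
    import torch.nn as nn

    class QueryAggregator(nn.Module):
        # Collapses N per-query mask logits (B, N, T) into one soft
        # foreground mask (B, T) via a 1-D conv over the query axis.
        def __init__(self, num_queries, kernel_size=3):
            super().__init__()
            self.reduce = nn.Conv1d(num_queries, 1, kernel_size, padding=kernel_size // 2)

        def forward(self, mask_logits):
            fused = self.reduce(mask_logits)          # (B, N, T) -> (B, 1, T)
            return torch.sigmoid(fused).squeeze(1)    # (B, T)

    # e.g. agg = QueryAggregator(num_queries=30); soft_mask = agg(pred_masks)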

A3-2. We select the snippets whose probability is greater than the mean of the temporal probabilities. I observed that for ActivityNet, since the videos are short and the foreground is long, this empirically covers the majority of the foreground. However, for THUMOS, where the videos are long and the foreground is short, this may not work; you may need to use a higher fixed threshold to pick the foreground indices. I used 0.55 for THUMOS.
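
A small illustrative sketch of those two thresholding variants (temporal mean vs. a fixed 0.55); the function itself is hypothetical:

    import torch

    def foreground_mask(soft_mask, dataset="activitynet"):
        # soft_mask: (B, T) sigmoid foreground probabilities.
        if dataset == "activitynet":
            # Short videos, long foreground: the temporal mean works as a threshold.
            thresh = soft_mask.mean(dim=1, keepdim=True)
        else:
            # THUMOS14: long videos, short foreground -> fixed higher threshold.
            thresh = 0.55
        return soft_mask > thresh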

A3-3. I answered this partly in A3-1. You can use multiple queries for THUMOS; we used 30 queries in our testing version for THUMOS.

sauradip added the "enhancement" (New feature or request) label on Aug 28, 2023
@HYUNJS (Author) commented Aug 28, 2023

Thank you for your reply!

Q3-1. The output of the MaskFormer code you provided is a foreground logit score of shape B x nq x L [batch_size x num_queries x num_features (= video length)].

  • Do you mean (1) mapping B x nq x L to B x 1 x L via an extra 1D conv, e.g. nn.Conv1d(in_channels=nq, out_channels=1, kernel_size=3, padding=1), that aggregates the neighbouring foreground probabilities of all queries, and then thresholding (0.55) this output to obtain the foreground mask and the foreground-masked features B x L x D? This option matches the explanation in your paper, but I am not sure which configuration (e.g., kernel_size) to use.
  • Or (2) using the threshold (0.55) to obtain a B x nq x L mask and foreground-masked features B x nq x L x D, and then aggregating these features with a 1D conv layer? This is the option I understood from your reply above.
  • (3) Otherwise, please let me know which other implementation should be used. (I sketch options (1) and (2) below for concreteness.)
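
To make the two orderings concrete, a rough hypothetical sketch of what I mean by (1) and (2); the shapes follow the B x nq x L / B x L x D convention above, and nothing here is from the released code:

    import torch
    import torch.nn as nn

    B, nq, L, D = 2, 30, 100, 512
    mask_logits = torch.randn(B, nq, L)   # MaskFormer foreground logits
    features = torch.randn(B, L, D)       # snippet features

    # Option (1): aggregate the queries first, then threshold once.
    agg = nn.Conv1d(nq, 1, kernel_size=3, padding=1)
    fused = torch.sigmoid(agg(mask_logits)).squeeze(1)                # (B, L)
    masked_feat_1 = features * (fused > 0.55).float().unsqueeze(-1)   # (B, L, D)

    # Option (2): threshold every query, mask the features per query,
    # then aggregate the per-query masked features with a 1-D conv.
    per_query = (torch.sigmoid(mask_logits) > 0.55).float()           # (B, nq, L)
    masked_q = features.unsqueeze(1) * per_query.unsqueeze(-1)        # (B, nq, L, D)
    agg_q = nn.Conv1d(nq, 1, kernel_size=1)
    masked_feat_2 = agg_q(masked_q.reshape(B, nq, L * D)).reshape(B, L, D)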

Q4. I would also like to know the details of the "stack of three 1-D dynamic convolution layers H_m" mentioned on page 8, such as the kernel size, the K values, and the other dynamic convolution parameters. Although the paper states that these details are in the supplementary material, the link on the ECCV'22 webpage points to the main paper, not the supplementary material. Sorry for taking up your time.

Q5. For the label assignment, only one class label can be assigned to a snippet and to a video. However, some THUMOS14 videos contain overlapping action instances of different classes. Is it correct that this work does not consider such cases?

@HYUNJS (Author) commented Sep 2, 2023

Based on the available information, I can only achieve 2 mAP on THUMOS14 in the closed setting. The main bottleneck seems to be training the action mask localizer branch, which fails to be supervised properly with the dynamic conv and the losses stated in the paper.

I think predicting the global mask (only the action instance at the current time) is a very hard task using only convolution layers. Also, there are too many background masks in THUMOS14, which further hinders the learning of the model. May I ask how you dealt with this problem, and how you achieved 44.6 mAP on THUMOS14?

Also, why don't you use the MaskFormer foreground mask directly as the output mask? It seems to work very well. I have attached the results of my implementation below.

[attached figures: predicted 2D action mask and ground-truth mask map from the reporter's implementation]

@sauradip (Owner) commented Sep 2, 2023

Hi,

Your 2D action mask (250x250) does not look that bad. I can see the first two masks at the top left, some in the middle, and then some at the bottom right; that is roughly how the 1D GT action mask looks. The 2D mask is not expected to be clean! It does not have zero probability for the background, so noise has to be expected. How to clean the noise is, you could say, a trick. Some tricks: an action segment in THUMOS cannot be shorter than 2 snippets or longer than 50 snippets (a small sketch of this length filtering is below); check the scores of one or two rows of the 2D mask and see what probability they show for the foreground. One thing I remember is that mean() does not work for thresholding on THUMOS because of the background majority. You can also use the soft 1D mask to check the result and let me know how good it is.
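
The length-based cleanup could look like this rough sketch (an illustrative helper, not something from the repository):

    def filter_by_length(segments, min_len=2, max_len=50):
        # segments: list of (start_snippet, end_snippet) tuples.
        # Drop candidates that are implausibly short or long for THUMOS14.
        return [(s, e) for (s, e) in segments if min_len <= e - s + 1 <= max_len]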

  1. THUMOS has a high chance of overfitting, so the transformers should use a high dropout, e.g. 0.4 or more.

  2. Did you use the best scores from UNet (stale_best_score.json) for the fully supervised setting?

  3. Use a low threshold during inference to select the start and end points. I need to check before I can tell you the exact inference hyperparameters, but maybe play a bit with the inference mask thresholds to suppress the background. As I said, this is one drawback, since we don't have start/end regression.

@HYUNJS (Author) commented Sep 3, 2023

Thanks for getting back to me! I'm a bit worried about the big differences between the predicted and GT mask maps. In each column, there seem to be too many points marked as foreground, even in the background columns. Also, the parts you mentioned (e.g., top-left and middle) have lower foreground scores than other points in the same columns, leading to inaccurate localization and to suppression during Soft-NMS. I'm starting to wonder whether we can really get the same results the paper claims... I had planned to experiment with this model on other TAL datasets, but even reproducing the results on THUMOS14 is proving challenging.

  1. Overfitting doesn't seem to be the primary issue here. During training, the weighted BCE and Dice losses haven't shown significant improvement compared to the other branches; in fact, the Dice loss isn't decreasing at all. This suggests there might be problems with the architecture implementation itself. As you mentioned, the predicted mask map is quite noisy, which could indicate that the model isn't learning effectively from the designed supervision and the dynamic/naive convolution layers.
  2. No. Is your THUMOS14 result based on the UNet scores for THUMOS14? (Also, the stale_best_score.json file doesn't contain UNet scores for THUMOS14.)
  3. While I can play with the threshold values (class threshold, mask threshold, and Soft-NMS threshold), my current result of 2 mAP is too far from your reported mAP. I can achieve 12.5 mAP just by using the MaskFormer output with the average score as the threshold plus video-level class labels (with searched threshold values).

[attached figure]

@sauradip (Owner) commented Sep 3, 2023

Hi, I recommend checking with the UNet scores as a class-score refinement, just as is done here for the ActivityNet post-processing, since that is a standard followed for fair comparison. You can find the UNet scores for THUMOS in the GTAD repository and simply paste them into stale_best_score.json to check.
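
As a rough illustration of that kind of class-score refinement (the common two-stage practice of multiplying proposal scores by the top video-level classification scores), here is a hypothetical sketch; the file format and function names are assumptions, not the repository's exact post-processing:

    import json

    def refine_with_video_scores(proposals, video_id, score_file, top_k=2):
        # proposals: list of {"segment": [start, end], "score": float} dicts.
        # score_file: JSON with video-level class scores, e.g. the UNet scores
        # pasted into stale_best_score.json (assumed format: {video_id: {class: score}}).
        with open(score_file) as f:
            video_scores = json.load(f)[video_id]
        top_classes = sorted(video_scores, key=video_scores.get, reverse=True)[:top_k]
        refined = []
        for prop in proposals:
            for cls in top_classes:
                refined.append({
                    "segment": prop["segment"],
                    "label": cls,
                    "score": prop["score"] * video_scores[cls],
                })
        return refined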

HYUNJS mentioned this issue on Sep 3, 2023
@Coder-Liuu commented Oct 23, 2023

@HYUNJS Hi, can we exchange notes? I can achieve 20 mAP on the 50:50 split using a two-stage approach (I3D features + UNet results), but after replacing the UNet with CLIP, the result drops to 10 mAP. I would like to know how to achieve a higher result without the UNet.

@jordisassoon

Hi everyone :)
I'm also struggling to reproduce the THUMOS14 results. Would any of you @Coder-Liuu, @HYUNJS, or @sauradip mind sharing your code with me?
