Are Tem-Con and Fram-Acc in the paper computed on the two generated keyframes? How exactly is Fram-Acc computed? #142

Open
lankeren035 opened this issue Nov 17, 2024 · 4 comments

Comments

@lankeren035

No description provided.

@williamyang1991
Owner

williamyang1991 commented Nov 18, 2024

We follow GEN1

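  • Tem-Con: We compute CLIP image embeddings on all frames of output videos and report the average cosine similarity between all pairs of consecutive frames.
  • For every two consecutive frames of the output video, compute the cosine similarity between their CLIP image embeddings, then average these similarities over all consecutive pairs.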
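A minimal sketch of the Tem-Con computation described above, assuming the CLIP image embedding of each output frame has already been extracted by some CLIP image encoder (not shown). The function names `cosine_similarity` and `temporal_consistency` are hypothetical, chosen for illustration; this is not the repository's evaluation code.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def temporal_consistency(frame_embeddings: Sequence[Sequence[float]]) -> float:
    """Tem-Con: mean cosine similarity over all pairs of consecutive frames.

    frame_embeddings[i] is assumed to be the precomputed CLIP image
    embedding of frame i (hypothetical input format).
    """
    if len(frame_embeddings) < 2:
        raise ValueError("need at least two frames")
    # Pair frame i with frame i+1 and score each pair.
    sims = [
        cosine_similarity(prev, curr)
        for prev, curr in zip(frame_embeddings, frame_embeddings[1:])
    ]
    return sum(sims) / len(sims)


if __name__ == "__main__":
    # Toy 2-D vectors standing in for real CLIP features.
    frames = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
    print(temporal_consistency(frames))  # (1.0 + 0.0) / 2 = 0.5
```

A video with N frames yields N-1 consecutive pairs, so the score is an average over N-1 similarities.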

We follow FateZero

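  • Frame-Acc is the frame-wise editing accuracy, which is the percentage of frames where the edited image has a higher CLIP similarity to the target prompt (e.g., a beautiful woman in CG style) than the source prompt (e.g., a beautiful woman).
  • For each output frame, compute its CLIP similarity to the description of the input frame and its CLIP similarity to the target description; across the whole video, report the proportion of frames for which the latter is higher than the former.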
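The Frame-Acc rule above can be sketched as follows, assuming the frame embeddings come from a CLIP image encoder and the two prompt embeddings from the matching CLIP text encoder, so all vectors live in the same space. Embedding extraction is not shown, and `frame_accuracy` is a hypothetical name used only for illustration.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def frame_accuracy(
    frame_embeddings: Sequence[Sequence[float]],
    source_text_embedding: Sequence[float],
    target_text_embedding: Sequence[float],
) -> float:
    """Frame-Acc: fraction of frames whose CLIP similarity to the target
    prompt is strictly higher than their similarity to the source prompt."""
    if not frame_embeddings:
        raise ValueError("need at least one frame")
    hits = sum(
        1
        for emb in frame_embeddings
        if cosine_similarity(emb, target_text_embedding)
        > cosine_similarity(emb, source_text_embedding)
    )
    return hits / len(frame_embeddings)


if __name__ == "__main__":
    # Toy 2-D vectors: source prompt along x, target prompt along y.
    source, target = [1.0, 0.0], [0.0, 1.0]
    frames = [[0.2, 0.9], [0.9, 0.1], [0.1, 1.0]]
    # 2 of the 3 frames are closer to the target prompt.
    print(frame_accuracy(frames, source, target))
```

The function returns a fraction in [0, 1]; multiply by 100 to report a percentage. Ties count as failures here because the definition asks for a *higher* similarity to the target prompt.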

@lankeren035
Author

Why use all frames of the output video? With K=10 as in the paper, wouldn't a large part of the computed metrics be contributed by the Ebsynth model?

@williamyang1991
Owner

williamyang1991 commented Nov 18, 2024

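For fairness, the quantitative results in our experiments compare keyframes only; Ebsynth is not used.
We use K=5 to select keyframes, and all metrics are computed on the keyframes.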
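A small sketch of restricting evaluation to keyframes, assuming K is the keyframe interval, i.e., every K-th frame starting from the first frame is a keyframe. That sampling scheme is an assumption, and `select_keyframes` is a hypothetical helper, not part of the project's code.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def select_keyframes(frames: Sequence[T], interval: int) -> List[T]:
    """Return every `interval`-th frame, starting from frame 0.

    Assumes keyframes are sampled at a fixed interval K (hypothetical
    sampling scheme used for illustration).
    """
    if interval < 1:
        raise ValueError("interval must be >= 1")
    return list(frames[::interval])


if __name__ == "__main__":
    frame_ids = list(range(12))  # stand-ins for 12 video frames
    print(select_keyframes(frame_ids, 5))  # [0, 5, 10]
```

Under this reading, Tem-Con and Fram-Acc would be computed only on the returned keyframes, so "consecutive frames" would presumably mean consecutive keyframes, and no Ebsynth-propagated frames enter the evaluation.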


@lankeren035
Author

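Sorry, that was my oversight; thank you for your patient answer. I also have a question about the relationship between shape fusion and pixel fusion. Are shape fusion and pixel fusion complementary? I don't quite understand this part of the ablation: the ablation in the paper adds each module step by step, so why not simply use pixel fusion throughout? I don't see why shape fusion is necessary.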
