preprocessed mimic-cxr annotation seems to be different with others #7

Open
zh-Wang-Med opened this issue Feb 19, 2024 · 6 comments

Comments

@zh-Wang-Med

Your preprocessed MIMIC-CXR annotation contains both the impression and findings sections. However, other works, such as the R2Gen model, usually use only the findings section of a report. When I ran your model on annotations containing only findings, the results were lower than those reported by others. Can you explain? I'm not sure whether I made a mistake somewhere.

@wang-zhanyu
Owner

Thanks for your interest. We referred to the official preprocessing code provided for parsing the reports, which contain both impressions and findings. Considering that both sections are crucial for a complete report, we retained them entirely. Regarding the experimental phenomenon you observed, could you please provide the experimental results and specify "others"?
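
To make the two target settings discussed here concrete, the following is a minimal sketch of building a "findings only" versus a "findings + impression" target from a report. It is not the repository's or the official preprocessing code: the `FINDINGS:` / `IMPRESSION:` header pattern, the function names `split_sections` and `build_target`, and the example report are all assumptions for illustration, and real MIMIC-CXR reports vary in formatting.

```python
import re

# Assumed header pattern; real reports use more varied section names.
SECTION_RE = re.compile(r"^\s*(FINDINGS|IMPRESSION)\s*:", re.IGNORECASE | re.MULTILINE)


def split_sections(report: str) -> dict:
    """Map each recognized header (lowercased) to its whitespace-normalized text."""
    sections = {}
    matches = list(SECTION_RE.finditer(report))
    for i, m in enumerate(matches):
        # A section runs until the next recognized header or the end of the report.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(report)
        sections[m.group(1).lower()] = " ".join(report[m.end():end].split())
    return sections


def build_target(report: str, use_impression: bool = True) -> str:
    """Concatenate findings (and optionally impression) into one target string."""
    parts = split_sections(report)
    keys = ["findings", "impression"] if use_impression else ["findings"]
    return " ".join(parts[k] for k in keys if parts.get(k))


# Made-up example report, not taken from the dataset.
example = """FINDINGS: The lungs are clear. No pleural effusion.
IMPRESSION: No acute cardiopulmonary process."""

print(build_target(example, use_impression=False))
print(build_target(example, use_impression=True))
```

Because the reference text differs between the two settings, scores computed against one are not directly comparable with scores computed against the other.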

@zh-Wang-Med
Author

Thanks for your response! For example, R2Gen is a well-known baseline, and I noticed that you directly adopted the results given in their article.
However, they predicted only the findings section rather than impression and findings. Would it be unfair to compare directly like this? Also, when I tried your model R2GenGPT on reports with only the findings section, the results were lower than yours based on findings + impression.

@zh-Wang-Med
Author

This is the result of using your code to predict only the findings section; the CIDEr is a little lower:
Test result of /home/xxx/R2GenGPT-main/save/mimic_cxr/v1_deep_finding/checkpoints/checkpoint_epoch4_step112827_bleu0.132891_cider0.201495.pth: {'Bleu_1': 0.40156787140109074, 'Bleu_2': 0.24777515566553326, 'Bleu_3': 0.16587956792259043, 'Bleu_4': 0.11786933490873241, 'ROUGE_L': 0.27735867532555053, 'METEOR': 0.15643617274462107, 'CIDEr': 0.1991703269894305}
And this is the result of using your code to predict the findings and impression sections, as you did; it is consistent with the results in your paper:
Test result of /home/xxx/R2GenGPT/save/mimic_cxr/v1_test2_deep/checkpoints/checkpoint_epoch8_step135396_bleu0.190921_cider0.370094.pth: {'Bleu_1': 0.4128636987578053, 'Bleu_2': 0.2695699159478738, 'Bleu_3': 0.18779073357219533, 'Bleu_4': 0.13655461790879653, 'ROUGE_L': 0.2974765397369367, 'METEOR': 0.1622455704235942, 'CIDEr': 0.26411428019543615}
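
For a side-by-side reading of the two runs above, a small sketch that prints the per-metric difference may help. The metric values are copied from the two test results in this comment; the helper name `metric_deltas` is only illustrative.

```python
# Values copied from the two test results posted above.
findings_only = {
    'Bleu_1': 0.40156787140109074, 'Bleu_2': 0.24777515566553326,
    'Bleu_3': 0.16587956792259043, 'Bleu_4': 0.11786933490873241,
    'ROUGE_L': 0.27735867532555053, 'METEOR': 0.15643617274462107,
    'CIDEr': 0.1991703269894305,
}
findings_impression = {
    'Bleu_1': 0.4128636987578053, 'Bleu_2': 0.2695699159478738,
    'Bleu_3': 0.18779073357219533, 'Bleu_4': 0.13655461790879653,
    'ROUGE_L': 0.2974765397369367, 'METEOR': 0.1622455704235942,
    'CIDEr': 0.26411428019543615,
}


def metric_deltas(baseline: dict, other: dict) -> dict:
    """Return other - baseline for every metric present in both dicts."""
    return {k: other[k] - baseline[k] for k in baseline if k in other}


deltas = metric_deltas(findings_only, findings_impression)
for name, delta in deltas.items():
    print(f"{name:8s} {delta:+.4f}")
```

Note that the two runs are scored against different reference texts, so the differences describe the two settings rather than a like-for-like model comparison.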

@wang-zhanyu
Owner

Thank you for pointing out this issue. As we have mentioned, our work processes the official dataset, which includes both the impression and findings sections. The impression is a crucial part of a diagnostic report, and we should not omit it just because an earlier work did not use it.

Additionally, we would like to clarify that in the results reported in our paper, any method that was not fairly reproduced is marked with a dagger symbol. Many works do not have open-source code, and in my experience even different data preprocessing can cause variations in outcomes, so it is difficult to ensure absolute fairness when comparing with these methods. You can refer to the methods without the dagger symbol; these are the ones we replicated ourselves under the same experimental setup, and they can be compared more fairly.

Still, it's interesting to note your point. We trained our model using only findings and achieved the following results: {'Bleu_1': 0.404, 'Bleu_2': 0.252, 'Bleu_3': 0.169, 'Bleu_4': 0.121, 'ROUGE_L': 0.277, 'METEOR': 0.155, 'CIDEr': 0.209}. If you prefer to use only findings, you are welcome to use this result.

@zh-Wang-Med
Author

Thank you for your attention and for the effort of retraining the model on the findings section. I am willing to use your result. I agree with you that "different data preprocessing can result in variations in outcomes".
In addition, I have another question about your CE results. I used https://github.com/stanfordmlgroup/chexpert-labeler to label the test references and test results, and https://github.com/zhjohnchan/R2Gen/blob/main/compute_ce.py to calculate the metrics. The result for predicting findings and impression is {'F1_MACRO': 0.23245815170872425, 'F1_MICRO': 0.409047345217558, 'PRECISION_MACRO': 0.33660613529579375, 'PRECISION_MICRO': 0.4855885922330097, 'RECALL_MACRO': 0.22065983503584846, 'RECALL_MICRO': 0.35335025941053094}
The result seems to be wrong. May I ask how you implemented the calculation at that time? I would greatly appreciate your help.
Thank you!
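
For reference when comparing CE numbers, here is a minimal sketch of how micro- and macro-averaged precision, recall, and F1 are commonly computed from binary label matrices. It is not the `compute_ce.py` implementation and not necessarily what the paper used. It assumes the labeler output has already been binarized to 0/1 per label (how uncertain or blank labels are mapped is a separate choice that can change the scores), and it takes macro F1 as the mean of per-label F1. The function names `prf` and `ce_scores` and the toy data are hypothetical.

```python
from typing import Dict, List, Tuple


def prf(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Precision, recall, F1 from counts; undefined ratios are treated as 0."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


def ce_scores(refs: List[List[int]], hyps: List[List[int]]) -> Dict[str, float]:
    """refs/hyps: one row per report, one 0/1 column per label."""
    n_labels = len(refs[0])
    counts = [[0, 0, 0] for _ in range(n_labels)]  # [tp, fp, fn] per label
    for ref_row, hyp_row in zip(refs, hyps):
        for j, (r, h) in enumerate(zip(ref_row, hyp_row)):
            if r == 1 and h == 1:
                counts[j][0] += 1
            elif r == 0 and h == 1:
                counts[j][1] += 1
            elif r == 1 and h == 0:
                counts[j][2] += 1
    per_label = [prf(*c) for c in counts]
    # Micro: pool the counts over all labels, then compute the ratios once.
    micro = prf(*(sum(c[i] for c in counts) for i in range(3)))
    # Macro: compute the ratios per label, then take the unweighted mean.
    macro = [sum(x[i] for x in per_label) / n_labels for i in range(3)]
    return {
        'PRECISION_MICRO': micro[0], 'RECALL_MICRO': micro[1], 'F1_MICRO': micro[2],
        'PRECISION_MACRO': macro[0], 'RECALL_MACRO': macro[1], 'F1_MACRO': macro[2],
    }


# Toy data: 3 reports, 2 labels (not real CheXpert output).
toy_refs = [[1, 0], [1, 1], [0, 0]]
toy_hyps = [[1, 0], [0, 1], [0, 1]]
scores = ce_scores(toy_refs, toy_hyps)
print(scores)
```

Macro averages weight every label equally, so rare labels with few or no correct predictions pull macro scores well below micro scores; differences in label binarization or averaging convention are worth ruling out before concluding that a result is wrong.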

@zh-Wang-Med
Author

Hi, could you share the CE results for your model when it predicts only the findings section? I think they would also differ from the results in your paper, which predict both impression and findings. Thanks.
