Preprocessed MIMIC-CXR annotation seems to be different from others #7
Comments
Thanks for your interest. We referred to the official preprocessing code provided for parsing the reports, which contain both the impression and findings sections. Since both sections are crucial for a complete report, we retained them in full. Regarding the discrepancy you observed, could you please share your experimental results and specify which "others" you are comparing against?
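For readers unfamiliar with the section split being discussed, a minimal sketch of pulling the FINDINGS and IMPRESSION sections out of a raw report might look like the following. This is an illustration, not the official preprocessing script; the header names and report layout are assumptions about the raw report format.

```python
import re

# Hypothetical helper (illustration only, not the repository's preprocessing code):
# capture the body of each FINDINGS/IMPRESSION section up to the next all-caps header.
SECTION_RE = re.compile(
    r"(FINDINGS|IMPRESSION):\s*(.*?)(?=\n[A-Z ]+:|\Z)",
    flags=re.DOTALL,
)

def extract_sections(report_text: str) -> dict:
    """Return {'findings': ..., 'impression': ...}; empty string if a section is missing."""
    sections = {"findings": "", "impression": ""}
    for header, body in SECTION_RE.findall(report_text):
        sections[header.lower()] = " ".join(body.split())
    return sections

def build_target(report_text: str, use_impression: bool = True) -> str:
    """Concatenate findings (and optionally impression) into a single training target."""
    s = extract_sections(report_text)
    parts = [s["findings"]] + ([s["impression"]] if use_impression else [])
    return " ".join(p for p in parts if p)
```

Whether `use_impression` is True or False is exactly the preprocessing choice debated in this thread.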
Thanks for your response! For example, R2Gen is a well-known baseline, and I notice you directly adopted the results reported in their article.
This is the result of using your code to predict only the findings section; the CIDEr is a little lower.
Thank you for pointing out this issue. As we mentioned, our work processes the official dataset, which includes both the impression and findings sections. The impression is a crucial part of a diagnostic report, and we should not omit it simply because an earlier work did not use it. We would also like to clarify that, in the results reported in our paper, any method that was not fairly reproduced is marked with a dagger symbol. Many works do not have open-source code, and in our experience even different data preprocessing can cause variations in outcomes, so comparisons with these methods cannot be guaranteed to be absolutely fair. You can refer to the methods without the dagger symbol; these are the ones we replicated ourselves under the same experimental setup, and they can be compared more fairly.

Still, your point is interesting. We trained our model using only the findings section and obtained the following results: {'Bleu_1': 0.404, 'Bleu_2': 0.252, 'Bleu_3': 0.169, 'Bleu_4': 0.121, 'ROUGE_L': 0.277, 'METEOR': 0.155, 'CIDEr': 0.209}. If you prefer to use only findings, you are welcome to use this result.
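For context, the scores above are the standard corpus-level NLG metrics for report generation. A minimal sketch of computing them with the pycocoevalcap package, assuming tokenized reference/hypothesis strings keyed by report id (this is not necessarily the repository's own evaluation code), could look like this:

```python
# Illustrative sketch only: compute BLEU/METEOR/ROUGE-L/CIDEr with pycocoevalcap.
# Note: the METEOR scorer requires a working Java installation.
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.meteor.meteor import Meteor
from pycocoevalcap.rouge.rouge import Rouge
from pycocoevalcap.cider.cider import Cider

def compute_nlg_metrics(references: dict, hypotheses: dict) -> dict:
    """references/hypotheses: {report_id: [single tokenized string]}."""
    metrics = {}
    bleu, _ = Bleu(4).compute_score(references, hypotheses)
    metrics.update({f"Bleu_{i + 1}": b for i, b in enumerate(bleu)})
    metrics["METEOR"], _ = Meteor().compute_score(references, hypotheses)
    metrics["ROUGE_L"], _ = Rouge().compute_score(references, hypotheses)
    metrics["CIDEr"], _ = Cider().compute_score(references, hypotheses)
    return metrics

# Example usage with toy data:
# refs = {"1": ["no acute cardiopulmonary process ."]}
# hyps = {"1": ["no acute cardiopulmonary abnormality ."]}
# print(compute_nlg_metrics(refs, hyps))
```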
Thank you for your attention and for the effort of retraining the model on the findings section. I am happy to use your result, and I agree with you that "different data preprocessing can result in variations in outcomes".
Hi, could you also share the CE (clinical efficacy) results for your model when it predicts only the findings section? I think this result would also differ from the one in your paper, which predicts both the impression and findings. Thanks.
Your preprocessed MIMIC-CXR annotation contains both the impression and findings sections. However, other settings usually use only the findings section of a report, as in the R2Gen model. I ran your model on annotations containing only the findings, and the result is lower than others'. Can you explain? I'm not sure if I made a mistake somewhere.
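For illustration only, a findings-only annotation file could be derived along these lines. The R2Gen-style layout ('train'/'val'/'test' lists with a 'report' field) and the presence of a "FINDINGS:" header in the report text are assumptions, not confirmed by this repository.

```python
import json
import re

# Hypothetical helper: keep only the FINDINGS section of each report in an
# R2Gen-style annotation file. Key names and file layout are assumptions.
FINDINGS_RE = re.compile(r"FINDINGS:\s*(.*?)(?=\n[A-Z ]+:|\Z)", flags=re.DOTALL)

def to_findings_only(in_path: str, out_path: str) -> None:
    with open(in_path) as f:
        ann = json.load(f)
    for split in ("train", "val", "test"):
        for example in ann.get(split, []):
            match = FINDINGS_RE.search(example["report"])
            if match:
                example["report"] = " ".join(match.group(1).split())
    with open(out_path, "w") as f:
        json.dump(ann, f)
```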