Hi! I'm new to GroundingDINO. While reading the code for the Phase A fusion, I noticed that it first performs the image-text cross-attention, then the text self-attention, and finally the image self-attention. This seems to differ from the structure described in the paper, which first applies the image deformable attention and the text self-attention, and then the cross-attention from text to image and from image to text. Is that correct, or am I overlooking some detail in the code?
![bothtext imageattention_code](https://private-user-images.githubusercontent.com/147381573/340566330-64329dee-f648-43c9-9cbe-2aae35edbebf.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MTk3NDg5MTksIm5iZiI6MTcxOTc0ODYxOSwicGF0aCI6Ii8xNDczODE1NzMvMzQwNTY2MzMwLTY0MzI5ZGVlLWY2NDgtNDNjOS05Y2JlLTJhYWUzNWVkYmViZi5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjQwNjMwJTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI0MDYzMFQxMTU2NTlaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT1kYzk0MTcyNTZmNjUzOTM5Mjg5YThmNjk5YjdmZWU1MTE3ZDdmN2I2ZGRhMDM5MzRlZTIzOGYwZWE3MWZlZGE1JlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCZhY3Rvcl9pZD0wJmtleV9pZD0wJnJlcG9faWQ9MCJ9.60KUhz7gGe9ImA0qCCz8iN6Wb7Q-vT10pQProVFfJZM)
![grounding_dino_layers_fusion_code](https://private-user-images.githubusercontent.com/147381573/340566345-2a632651-c830-4bd2-8b10-50e21cab30f8.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MTk3NDg5MTksIm5iZiI6MTcxOTc0ODYxOSwicGF0aCI6Ii8xNDczODE1NzMvMzQwNTY2MzQ1LTJhNjMyNjUxLWM4MzAtNGJkMi04YjEwLTUwZTIxY2FiMzBmOC5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjQwNjMwJTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI0MDYzMFQxMTU2NTlaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT01YjQwMzcwYjhhOTdmNjYwMDlhMWVlZTc3MjYxYTAzODhiNDNjZmZhNDUxMmYzMWY2NjI4MmU4YTIwOGE2MmEzJlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCZhY3Rvcl9pZD0wJmtleV9pZD0wJnJlcG9faWQ9MCJ9.2rjsThG2_sHQX7VMP9S_IxkRhQpfLwd5ECyb8LB35Q8)
![original model structure](https://private-user-images.githubusercontent.com/147381573/340566353-23f4c061-8120-4550-a5e4-9e137268ff12.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MTk3NDg5MTksIm5iZiI6MTcxOTc0ODYxOSwicGF0aCI6Ii8xNDczODE1NzMvMzQwNTY2MzUzLTIzZjRjMDYxLTgxMjAtNDU1MC1hNWU0LTllMTM3MjY4ZmYxMi5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjQwNjMwJTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI0MDYzMFQxMTU2NTlaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT04ZDdjNzkwYzIzMThmYmZhZjYwZjdkMDc1Yjg2MDFiMWM5YWU5MTY3YmYxNTQ5M2FjMGI4MjQ3NzBjZmUxZGI1JlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCZhY3Rvcl9pZD0wJmtleV9pZD0wJnJlcG9faWQ9MCJ9.pUwZM-jvLU8YA35E9SIrM3ADVBWd4UUYP-gctIu3HwE)
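
To make the question concrete, here is a minimal sketch of the two orderings being compared. It only illustrates the order of the sub-layers: all function names are hypothetical placeholders (not the repository's actual identifiers), and the stubs record the call order instead of computing any attention.

```python
from typing import List, Tuple

CROSS = "image-text cross-attention"
TEXT_SELF = "text self-attention"
IMAGE_SELF = "image deformable self-attention"


# Hypothetical stand-ins: each stub logs its name and returns its inputs unchanged.
def cross_attention(image_feats, text_feats, trace: List[str]) -> Tuple[object, object]:
    trace.append(CROSS)
    return image_feats, text_feats


def text_self_attention(text_feats, trace: List[str]):
    trace.append(TEXT_SELF)
    return text_feats


def image_deformable_self_attention(image_feats, trace: List[str]):
    trace.append(IMAGE_SELF)
    return image_feats


def code_order_layer(image_feats, text_feats, trace: List[str]):
    """Order as I read it in the code: cross-attention first, then the two self-attentions."""
    image_feats, text_feats = cross_attention(image_feats, text_feats, trace)
    text_feats = text_self_attention(text_feats, trace)
    image_feats = image_deformable_self_attention(image_feats, trace)
    return image_feats, text_feats


def paper_order_layer(image_feats, text_feats, trace: List[str]):
    """Order as I read it in the paper figure: the two self-attentions first, then cross-attention."""
    image_feats = image_deformable_self_attention(image_feats, trace)
    text_feats = text_self_attention(text_feats, trace)
    image_feats, text_feats = cross_attention(image_feats, text_feats, trace)
    return image_feats, text_feats


if __name__ == "__main__":
    code_trace: List[str] = []
    paper_trace: List[str] = []
    code_order_layer("img", "txt", code_trace)
    paper_order_layer("img", "txt", paper_trace)
    print("code :", " -> ".join(code_trace))
    print("paper:", " -> ".join(paper_trace))
```

Both variants contain the same three sub-layers, so the question is only whether the order within one encoder layer is meant to differ from the figure.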