Unsatisfactory tracking performance on autonomous driving scenes

Inference on an autonomous driving scene does not work well with `Grounded-SAM-2 Video Object Tracking with Continuous ID (with Grounding DINO)`, nor with reverse tracking. Is using the API the only solution?

The prompt: `car. suv. bus.`
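
For anyone trying to reproduce this, here is a minimal sketch of how a prompt string in this format can be assembled from a list of class names. `build_text_prompt` is a hypothetical helper written for illustration, not something from the Grounded-SAM-2 repo; it only assumes the lowercase, period-terminated format shown above.

```python
def build_text_prompt(class_names):
    """Join class names into a lowercase prompt where each name ends with a period.

    Hypothetical helper for illustration only; not part of Grounded-SAM-2.
    """
    # Normalize each name: trim whitespace, lowercase, drop any trailing period.
    cleaned = [
        name.strip().lower().rstrip(".")
        for name in class_names
        if name.strip()  # skip empty entries
    ]
    # Terminate every class with a period and separate classes with a space.
    return " ".join(f"{name}." for name in cleaned)


prompt = build_text_prompt(["car", "suv", "bus"])
print(prompt)  # car. suv. bus.
```

This makes it easy to swap in other class lists when comparing detections across prompts.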

https://github.com/user-attachments/assets/5426045f-34fe-410a-96be-15a1f0d40c04

As the video shows, the model ignores the black SUV :(