arXiv, 2024
Yue Zhou · Litong Feng · Yiping Ke · Xue Jiang · Junchi Yan · Xue Yang · Wayne Zhang
This repo records, tracks, and benchmarks recent vision-language geo-foundation models (VLGFMs) to supplement our survey. If you find any work missing or have any suggestions (papers, implementations, and other resources), feel free to open a pull request. We will add the missing papers to this repo as soon as possible.
- You are welcome to open an issue or PR for your VLGFM work!
- Note: Given the large number of papers on arXiv, we regret that we cannot cover all of them in our survey. You can submit a PR to this repo directly, and we will record it for the next version of our survey.
- This repo records the papers available as of 2024/6/13.
- The first survey of vision-language geo-foundation models, including contrastive, conversational, and generative geo-foundation models.
- It also covers several related works, including explorations and applications for some downstream tasks.
- We list detailed results for the most representative works and give a fairer and clearer comparison of different approaches.
This is the first detailed survey of remote sensing vision-language foundation models, covering contrastive, conversational, and generative VLGFMs.
- Introduction
- Summary of Contents
- Methods: A Survey
- Application
- Exploration
- Survey
- Citation
- Contact
Keywords
- clip: Use CLIP
- llm: Use LLM (Large Language Model)
- sam: Use SAM (Segment Anything Model)
- i-t: Annotate using image-text tuples
- v-t: Annotate using video-text tuples
- i-t-b: Annotate using image-text-box triplets
- i-t-m: Annotate using image-text-mask triplets (image-caption-mask triplets)
Year | Venue | Keywords | Paper Title | Code/Project |
---|---|---|---|---|
2023 | arXiv | clip | RS5M and GeoRSCLIP: A Large Scale Vision-Language Dataset and A Large Vision-Language Model for Remote Sensing | Code |
2024 | ICLR | clip | GRAFT: Remote Sensing Vision-Language Foundation Models without Annotations via Ground Remote Alignment | Project |
2024 | AAAI | clip | SkyScript: A Large and Semantically Diverse Vision-Language Dataset for Remote Sensing | Code |
2024 | arXiv | clip | Mind the Modality Gap: Towards a Remote Sensing Vision-Language Model via Cross-modal Alignment | N/A |
2024 | TGRS | clip | RemoteCLIP: A Vision Language Foundation Model for Remote Sensing | Code |
Year | Venue | Keywords | Paper Title | Code/Project |
---|---|---|---|---|
2024 | ICLR | clip | DiffusionSat: A Generative Foundation Model for Satellite Imagery | Code |
2024 | arXiv | clip | CRS-Diff: Controllable Generative Remote Sensing Foundation Model | N/A |
Year | Venue | Keywords | Name | Code/Project | Download |
---|---|---|---|---|---|
2016 | CITS | i-t | Sydney-Captions & UCM-Captions | N/A | link, link2 |
2017 | TGRS | i-t | RSICD | Project | link |
2020 | TGRS | i-t | RSVQA-LR & RSVQA-HR | Project | link1, link2 |
2021 | IGARSS | i-t | RSVQAxBEN | Project | link |
2021 | Access | i-t | FloodNet | Project | link |
2021 | TGRS | i-t | RSITMD | Code | link |
2021 | TGRS | i-t | RSIVQA | Code | link |
2022 | TGRS | i-t | NWPU-Captions | Project | link |
2022 | TGRS | i-t | CRSVQA | Project | link |
2022 | TGRS | i-t | LEVIR-CC | Project | link |
2022 | TGRS | i-t | CDVQA | Project | link |
2022 | TGRS | i-t | UAV-Captions | N/A | N/A |
2022 | MM | i-t-b | RSVG | Project | link |
2022 | RS | v-t | CapERA | Project | link |
2023 | TGRS | i-t-b | DIOR-RSVG | Project | link |
2023 | arXiv | i-t | RemoteCount | Code | N/A |
2023 | arXiv | i-t | RS5M | Code | link |
2023 | arXiv | i-t | RSICap & RSIEval | Code | N/A |
2023 | arXiv | i-t | LAION-EO | N/A | link |
2023 | ICCVW | i-t | SATIN | Project | link |
2024 | ICLR | i-t | NAIP-OSM | Project | N/A |
2024 | AAAI | i-t | SkyScript | Code | link |
2024 | AAAI | i-t-m | EarthVQA | Project | N/A |
2024 | TGRS | i-t-m | RRSIS | Code | link |
2024 | CVPR | i-t | GeoChat-Instruct & GeoChat-Bench | Code | link |
2024 | CVPR | i-t-m | RRSIS-D | Code | link |
2024 | arXiv | i-t | SkyEye-968k | Code | N/A |
2024 | arXiv | i-t | MMRS-1M | Project | N/A |
2024 | arXiv | i-t | LHRS-Align & LHRS-Instruct | Code | N/A |
2024 | arXiv | i-t-m | ChatEarthNet | Project | link |
2024 | arXiv | i-t | VLEO-Bench | Code | link |
2024 | arXiv | i-t | LuoJiaHOG | N/A | N/A |
2024 | arXiv | i-t-m | FineGrip | N/A | N/A |
Year | Venue | Keywords | Paper Title | Code/Project |
---|---|---|---|---|
2023 | TGRS | llm | A Decoupling Paradigm With Prompt Learning for Remote Sensing Image Change Captioning | Code |
2023 | JSEE | llm | VLCA: Vision-Language Aligning Model with Cross-Modal Attention for Bilingual Remote Sensing Image Captioning | N/A |
Year | Venue | Keywords | Paper Title | Code/Project |
---|---|---|---|---|
2022 | VT | llm | CLIP-RS: A Cross-modal Remote Sensing Image Retrieval Based on CLIP, a Northern Virginia Case Study | N/A |
2024 | arXiv | llm | Multi-Spectral Remote Sensing Image Retrieval Using Geospatial Foundation Models | Code |
Year | Venue | Keywords | Paper Title | Code/Project |
---|---|---|---|---|
2023 | arXiv | sam | Time Travelling Pixels: Bitemporal Features Integration with Foundation Model for Remote Sensing Image Change Detection | Code |
2024 | JPRS | clip | ChangeCLIP: Remote sensing change detection with multimodal vision-language representation learning | Code |
2024 | TGRS | llm | A New Learning Paradigm for Foundation Model-Based Remote-Sensing Change Detection | Code |
2024 | arXiv | sam | Change Detection Between Optical Remote Sensing Imagery and Map Data via Segment Anything Model (SAM) | N/A |
2024 | arXiv | sam | Segment Any Change | N/A |
Year | Venue | Keywords | Paper Title | Code/Project |
---|---|---|---|---|
2023 | IJAEOG | clip | RS-CLIP: Zero shot remote sensing scene classification via contrastive vision-language supervision | Code |
Year | Venue | Keywords | Paper Title | Code/Project |
---|---|---|---|---|
2023 | arXiv | sam clip | Text2Seg: Remote Sensing Image Semantic Segmentation via Text-Guided Vision Foundation Models | Code |
2024 | TGRS | | RRSIS: Referring Remote Sensing Image Segmentation | Code |
2024 | CVPR | | Rotated Multi-Scale Interaction Network for Referring Remote Sensing Image Segmentation | Code |
2024 | WACV | | CPSeg: Finer-grained Image Semantic Segmentation via Chain-of-Thought Language Prompting | N/A |
Year | Venue | Keywords | Paper Title | Code/Project |
---|---|---|---|---|
2022 | CVPRW | | Prompt-RSVQA: Prompting visual context to a language model for remote sensing visual question answering | N/A |
2024 | AAAI | | EarthVQA: Towards Queryable Earth via Relational Reasoning-Based Remote Sensing Visual Question Answering | Project |
Year | Venue | Keywords | Paper Title | Code/Project |
---|---|---|---|---|
2023 | ICML | clip | CSP: Self-Supervised Contrastive Spatial Pre-Training for Geospatial-Visual Representations | Code |
2023 | NeurIPS | clip | GeoCLIP: Clip-Inspired Alignment between Locations and Images for Effective Worldwide Geo-localization | Code |
2023 | arXiv | clip | SatCLIP: Global, General-Purpose Location Embeddings with Satellite Imagery | Code |
Year | Venue | Keywords | Paper Title | Code/Project |
---|---|---|---|---|
2023 | arXiv | clip | Stable Diffusion For Aerial Object Detection | N/A |
Year | Venue | Keywords | Paper Title | Code/Project |
---|---|---|---|---|
2023 | arXiv | clip | Zooming Out on Zooming In: Advancing Super-Resolution for Remote Sensing | Code |
Year | Venue | Keywords | Paper Title | Code/Project |
---|---|---|---|---|
2022 | TGRS | | An Empirical Study of Remote Sensing Pretraining | Code |
2023 | arXiv | | Autonomous GIS: the next-generation AI-powered GIS | N/A |
2023 | arXiv | | GPT4GEO: How a Language Model Sees the World's Geography | Code |
2023 | arXiv | | Charting New Territories: Exploring the Geographic and Geospatial Capabilities of Multimodal LLMs | Code |
2023 | arXiv | | The Potential of Visual ChatGPT For Remote Sensing | N/A |
Year | Venue | Keywords | Paper Title | Code/Project |
---|---|---|---|---|
2023 | IGARSS | | An Agenda for Multimodal Foundation Models for Earth Observation | N/A |
2023 | TGRS | | Self-Supervised Remote Sensing Feature Learning: Learning Paradigms, Challenges, and Future Works | N/A |
2023 | GISWU | | Large Remote Sensing Model: Progress and Prospects | N/A |
2023 | JSTARS | | Brain-Inspired Remote Sensing Foundation Models and Open Problems: A Comprehensive Survey | N/A |
2023 | arXiv | | On the Promises and Challenges of Multimodal Foundation Models for Geographical, Environmental, Agricultural, and Urban Planning Applications | N/A |
2024 | GRSM | | Vision-Language Models in Remote Sensing: Current Progress and Future Trends | N/A |
If you find our survey and repository useful for your research project, please consider citing our paper:
@article{zhou2024vlgfm,
title={Towards Vision-Language Geo-Foundation Models: A Survey},
author={Yue Zhou and Litong Feng and Yiping Ke and Xue Jiang and Junchi Yan and Xue Yang and Wayne Zhang},
journal={arXiv preprint arXiv:2406.09385},
year={2024}
}