lzd19981105/Awesome-VLGFM
Awesome PRs Welcome

Towards Vision-Language Geo-Foundation Models: A Survey

arXiv, 2024
Yue Zhou · Litong Feng · Yiping Ke · Xue Jiang · Junchi Yan · Xue Yang · Wayne Zhang

arXiv PDF


This repo records, tracks, and benchmarks recent vision-language geo-foundation models (VLGFMs) to supplement our survey. If you find any work missing or have suggestions (papers, implementations, and other resources), feel free to open a pull request. We will add missing papers to this repo as soon as possible.

🙌 Add Your Paper to Our Repo and Survey!

  • You are welcome to open an issue or PR for your VLGFM work!

  • Note: given the huge number of papers on arXiv, we are unable to cover them all in our survey. You can open a PR against this repo, and we will record your work in the next version of the survey.

🥳 New

  • We have updated the repo to record papers available as of 2024/6/13.

✨ Highlight!!

  • The first survey of vision-language geo-foundation models, covering contrastive, conversational, and generative geo-foundation models.

  • It also covers several related works, including explorations and applications on downstream tasks.

  • We list detailed results for the most representative works and give a fairer, clearer comparison of the different approaches.

📖 Introduction

This is the first detailed survey of remote sensing vision-language foundation models, covering Contrastive, Conversational, and Generative VLGFMs.
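To make the "contrastive" category concrete: CLIP-style models align paired image and text embeddings with a symmetric InfoNCE objective. The sketch below is a minimal NumPy illustration of that objective, assuming unit-normalizable embeddings; the function name and structure are ours, not taken from any of the surveyed models.

```python
import numpy as np

def clip_style_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    img_emb, txt_emb: (N, D) arrays where row i of each is a matched pair.
    """
    # L2-normalize so the dot product is cosine similarity
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # (N, N) similarity matrix

    def cross_entropy(l):
        # Softmax over each row; the matched pair sits on the diagonal.
        p = np.exp(l - l.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        return -np.log(np.diag(p)).mean()

    # Average of image-to-text and text-to-image classification losses
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

Perfectly aligned batches drive the loss toward zero, while mismatched pairings drive it up, which is what pushes matched image/text embeddings together during pretraining.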


📗 Summary of Contents

📚 Methods: A Survey

Keywords

  • clip: uses CLIP
  • llm: uses an LLM (Large Language Model)
  • sam: uses SAM (Segment Anything Model)
  • i-t: annotated with image-text pairs
  • v-t: annotated with video-text pairs
  • i-t-b: annotated with image-text-box triplets
  • i-t-m: annotated with image-text-mask (image-caption-mask) triplets
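For concreteness, the annotation formats above might be represented as records like the following. All field names and values here are hypothetical illustrations, not taken from any surveyed dataset.

```python
# Hypothetical annotation records for the tuple formats above.
# Field names and values are illustrative only.

# i-t: an image paired with a caption
i_t = {
    "image": "harbor_001.tif",
    "text": "A harbor with several docked cargo ships.",
}

# i-t-b: the same pair plus bounding boxes (x1, y1, x2, y2) grounding the text
i_t_b = {**i_t, "boxes": [[120, 48, 310, 200]]}

# i-t-m: the same pair plus a pixel-level segmentation mask
i_t_m = {**i_t, "mask": "harbor_001_mask.png"}
```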

Contrastive VLGFMs

| Year | Venue | Keywords | Paper Title | Code/Project |
|------|-------|----------|-------------|--------------|
| 2023 | arXiv | clip | RS5M and GeoRSCLIP: A Large Scale Vision-Language Dataset and A Large Vision-Language Model for Remote Sensing | Code |
| 2024 | ICLR | clip | GRAFT: Remote Sensing Vision-Language Foundation Models without Annotations via Ground Remote Alignment | Project |
| 2024 | AAAI | clip | SkyScript: A Large and Semantically Diverse Vision-Language Dataset for Remote Sensing | Code |
| 2024 | arXiv | clip | Mind the Modality Gap: Towards a Remote Sensing Vision-Language Model via Cross-modal Alignment | N/A |
| 2024 | TGRS | clip | RemoteCLIP: A Vision Language Foundation Model for Remote Sensing | Code |

Conversational VLGFMs

| Year | Venue | Keywords | Paper Title | Code/Project |
|------|-------|----------|-------------|--------------|
| 2023 | arXiv | llm | RSGPT: A Remote Sensing Vision Language Model and Benchmark | Code |
| 2024 | CVPR | llm | GeoChat: Grounded Large Vision-Language Model for Remote Sensing | Code |
| 2024 | arXiv | llm | SkyEyeGPT: Unifying Remote Sensing Vision-Language Tasks via Instruction Tuning with Large Language Model | Code |
| 2024 | arXiv | llm | EarthGPT: A Universal Multi-modal Large Language Model for Multi-sensor Image Comprehension in Remote Sensing Domain | N/A |
| 2024 | arXiv | llm | LHRS-Bot: Empowering Remote Sensing with VGI-Enhanced Large Multimodal Language Model | Code |
| 2024 | arXiv | llm | Large Language Models for Captioning and Retrieving Remote Sensing Images | N/A |
| 2024 | arXiv | llm | H2RSVLM: Towards Helpful and Honest Remote Sensing Large Vision Language Model | N/A |
| 2024 | RS | llm | RS-LLaVA: A Large Vision-Language Model for Joint Captioning and Question Answering in Remote Sensing Imagery | Code |
| 2024 | arXiv | llm | SkySenseGPT: A Fine-Grained Instruction Tuning Dataset and Model for Remote Sensing Vision-Language Understanding | N/A |

Generative VLGFMs

| Year | Venue | Keywords | Paper Title | Code/Project |
|------|-------|----------|-------------|--------------|
| 2024 | ICLR | clip | DiffusionSat: A Generative Foundation Model for Satellite Imagery | Code |
| 2024 | arXiv | clip | CRS-Diff: Controllable Generative Remote Sensing Foundation Model | N/A |

Datasets & Benchmarks

| Year | Venue | Keywords | Name | Code/Project | Download |
|------|-------|----------|------|--------------|----------|
| 2016 | CITS | i-t | Sydney-Captions & UCM-Captions | N/A | link, link2 |
| 2017 | TGRS | i-t | RSICD | Project | link |
| 2020 | TGRS | i-t | RSVQA-LR & RSVQA-HR | Project | link1, link2 |
| 2021 | IGARSS | i-t | RSVQAxBEN | Project | link |
| 2021 | Access | i-t | FloodNet | Project | link |
| 2021 | TGRS | i-t | RSITMD | Code | link |
| 2021 | TGRS | i-t | RSIVQA | Code | link |
| 2022 | TGRS | i-t | NWPU-Captions | Project | link |
| 2022 | TGRS | i-t | CRSVQA | Project | link |
| 2022 | TGRS | i-t | LEVIR-CC | Project | link |
| 2022 | TGRS | i-t | CDVQA | Project | link |
| 2022 | TGRS | i-t | UAV-Captions | N/A | N/A |
| 2022 | MM | i-t-b | RSVG | Project | link |
| 2022 | RS | v-t | CapERA | Project | link |
| 2023 | TGRS | i-t-b | DIOR-RSVG | Project | link |
| 2023 | arXiv | i-t | RemoteCount | Code | N/A |
| 2023 | arXiv | i-t | RS5M | Code | link |
| 2023 | arXiv | i-t | RSICap & RSIEval | Code | N/A |
| 2023 | arXiv | i-t | LAION-EO | N/A | link |
| 2023 | ICCVW | i-t | SATIN | Project | link |
| 2024 | ICLR | i-t | NAIP-OSM | Project | N/A |
| 2024 | AAAI | i-t | SkyScript | Code | link |
| 2024 | AAAI | i-t-m | EarthVQA | Project | N/A |
| 2024 | TGRS | i-t-m | RRSIS | Code | link |
| 2024 | CVPR | i-t | GeoChat-Instruct & GeoChat-Bench | Code | link |
| 2024 | CVPR | i-t-m | RRSIS-D | Code | link |
| 2024 | arXiv | i-t | SkyEye-968k | Code | N/A |
| 2024 | arXiv | i-t | MMRS-1M | Project | N/A |
| 2024 | arXiv | i-t | LHRS-Align & LHRS-Instruct | Code | N/A |
| 2024 | arXiv | i-t-m | ChatEarthNet | Project | link |
| 2024 | arXiv | i-t | VLEO-Bench | Code | link |
| 2024 | arXiv | i-t | LuoJiaHOG | N/A | N/A |
| 2024 | arXiv | i-t-m | FineGrip | N/A | N/A |

🕹️ Application

Captioning

| Year | Venue | Keywords | Paper Title | Code/Project |
|------|-------|----------|-------------|--------------|
| 2023 | TGRS | llm | A Decoupling Paradigm With Prompt Learning for Remote Sensing Image Change Captioning | Code |
| 2023 | JSEE | llm | VLCA: vision-language aligning model with cross-modal attention for bilingual remote sensing image captioning | N/A |

Retrieval

| Year | Venue | Keywords | Paper Title | Code/Project |
|------|-------|----------|-------------|--------------|
| 2022 | VT | llm | CLIP-RS: A Cross-modal Remote Sensing Image Retrieval Based on CLIP, a Northern Virginia Case Study | N/A |
| 2024 | arXiv | llm | Multi-Spectral Remote Sensing Image Retrieval Using Geospatial Foundation Models | Code |

Change Detection

| Year | Venue | Keywords | Paper Title | Code/Project |
|------|-------|----------|-------------|--------------|
| 2023 | arXiv | sam | Time Travelling Pixels: Bitemporal Features Integration with Foundation Model for Remote Sensing Image Change Detection | Code |
| 2024 | JPRS | clip | ChangeCLIP: Remote sensing change detection with multimodal vision-language representation learning | Code |
| 2024 | TGRS | llm | A New Learning Paradigm for Foundation Model-Based Remote-Sensing Change Detection | Code |
| 2024 | arXiv | sam | Change Detection Between Optical Remote Sensing Imagery and Map Data via Segment Anything Model (SAM) | N/A |
| 2024 | arXiv | sam | Segment Any Change | N/A |

Scene Classification

| Year | Venue | Keywords | Paper Title | Code/Project |
|------|-------|----------|-------------|--------------|
| 2023 | IJAEOG | clip | RS-CLIP: Zero shot remote sensing scene classification via contrastive vision-language supervision | Code |

Segmentation

| Year | Venue | Keywords | Paper Title | Code/Project |
|------|-------|----------|-------------|--------------|
| 2023 | arXiv | sam, clip | Text2Seg: Remote Sensing Image Semantic Segmentation via Text-Guided Vision Foundation Models | Code |
| 2024 | TGRS |  | RRSIS: Referring Remote Sensing Image Segmentation | Code |
| 2024 | CVPR |  | Rotated Multi-Scale Interaction Network for Referring Remote Sensing Image Segmentation | Code |
| 2024 | WACV |  | CPSeg: Finer-grained Image Semantic Segmentation via Chain-of-Thought Language Prompting | N/A |

Visual Question Answering

| Year | Venue | Keywords | Paper Title | Code/Project |
|------|-------|----------|-------------|--------------|
| 2022 | CVPRW |  | Prompt-RSVQA: Prompting visual context to a language model for remote sensing visual question answering | N/A |
| 2024 | AAAI |  | EarthVQA: Towards Queryable Earth via Relational Reasoning-Based Remote Sensing Visual Question Answering | Project |

Geospatial Localization

| Year | Venue | Keywords | Paper Title | Code/Project |
|------|-------|----------|-------------|--------------|
| 2023 | ICML | clip | CSP: Self-Supervised Contrastive Spatial Pre-Training for Geospatial-Visual Representations | Code |
| 2023 | NeurIPS | clip | GeoCLIP: Clip-Inspired Alignment between Locations and Images for Effective Worldwide Geo-localization | Code |
| 2023 | arXiv | clip | SatCLIP: Global, General-Purpose Location Embeddings with Satellite Imagery | Code |

Object Detection

| Year | Venue | Keywords | Paper Title | Code/Project |
|------|-------|----------|-------------|--------------|
| 2023 | arXiv | clip | Stable Diffusion For Aerial Object Detection | N/A |

Super-Resolution

| Year | Venue | Keywords | Paper Title | Code/Project |
|------|-------|----------|-------------|--------------|
| 2023 | arXiv | clip | Zooming Out on Zooming In: Advancing Super-Resolution for Remote Sensing | Code |

📊 Exploration

| Year | Venue | Keywords | Paper Title | Code/Project |
|------|-------|----------|-------------|--------------|
| 2022 | TGRS |  | An Empirical Study of Remote Sensing Pretraining | Code |
| 2023 | arXiv |  | Autonomous GIS: the next-generation AI-powered GIS | N/A |
| 2023 | arXiv |  | GPT4GEO: How a Language Model Sees the World's Geography | Code |
| 2023 | arXiv |  | Charting New Territories: Exploring the Geographic and Geospatial Capabilities of Multimodal LLMs | Code |
| 2023 | arXiv |  | The Potential of Visual ChatGPT For Remote Sensing | N/A |

👨‍🏫 Survey

| Year | Venue | Keywords | Paper Title | Code/Project |
|------|-------|----------|-------------|--------------|
| 2023 | IGARSS |  | An Agenda for Multimodal Foundation Models for Earth Observation | N/A |
| 2023 | TGRS |  | Self-Supervised Remote Sensing Feature Learning: Learning Paradigms, Challenges, and Future Works | N/A |
| 2023 | GISWU |  | Large Remote Sensing Model: Progress and Prospects | N/A |
| 2023 | JSTARS |  | Brain-Inspired Remote Sensing Foundation Models and Open Problems: A Comprehensive Survey | N/A |
| 2023 | arXiv |  | On the Promises and Challenges of Multimodal Foundation Models for Geographical, Environmental, Agricultural, and Urban Planning Applications | N/A |
| 2024 | GRSM |  | Vision-Language Models in Remote Sensing: Current Progress and Future Trends | N/A |

🖊️ Citation

If you find our survey and repository useful for your research project, please consider citing our paper:

@article{zhou2024vlgfm,
  title={Towards Vision-Language Geo-Foundation Models: A Survey},
  author={Yue Zhou and Litong Feng and Yiping Ke and Xue Jiang and Junchi Yan and Xue Yang and Wayne Zhang},
  journal={arXiv preprint arXiv:2406.09385},
  year={2024}
}

🐲 Contact
