lzd19981105/Awesome-VLGFM
Awesome PRs Welcome

Towards Vision-Language Geo-Foundation Models: A Survey

arXiv, 2024
Yue Zhou · Litong Feng · Yiping Ke · Xue Jiang · Junchi Yan · Xue Yang · Wayne Zhang

arXiv PDF


This repo records, tracks, and benchmarks recent vision-language geo-foundation models (VLGFMs) to supplement our survey. If you find any work missing or have suggestions (papers, implementations, and other resources), feel free to open a pull request. We will add missing papers to this repo as soon as possible.

🙌 Add Your Paper to Our Repo and Survey!

  • You are welcome to open an issue or PR for your VLGFM work!

  • Note: given the huge number of papers on arXiv, we are unable to cover them all in our survey. You can open a PR against this repo, and we will record your work in the next version of the survey.

🥳 New

  • We have updated the repo to record papers available as of 2024/6/13.

✨ Highlight!!

  • The first survey of vision-language geo-foundation models, covering contrastive, conversational, and generative geo-foundation models.

  • It also covers several related works, including explorations and applications on downstream tasks.

  • We list detailed results for the most representative works and give a fairer, clearer comparison of the different approaches.

📖 Introduction

This is the first detailed survey of remote sensing vision-language foundation models, covering Contrastive, Conversational, and Generative VLGFMs.
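To make the "contrastive" category concrete: CLIP-style models align paired image and text embeddings with a symmetric InfoNCE objective. The sketch below is a minimal NumPy illustration of that objective, assuming unit-normalizable embeddings; the function name and structure are ours, not taken from any of the surveyed models.

```python
import numpy as np

def clip_style_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    img_emb, txt_emb: (N, D) arrays where row i of each is a matched pair.
    """
    # L2-normalize so the dot product is cosine similarity
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # (N, N) similarity matrix

    def cross_entropy(l):
        # Softmax over each row; the matched pair sits on the diagonal.
        p = np.exp(l - l.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        return -np.log(np.diag(p)).mean()

    # Average of image-to-text and text-to-image classification losses
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

Perfectly aligned batches drive the loss toward zero, while mismatched pairings drive it up, which is what pushes matched image/text embeddings together during pretraining.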


📗 Summary of Contents

📚 Methods: A Survey

Keywords

  • clip: uses CLIP
  • llm: uses an LLM (Large Language Model)
  • sam: uses SAM (Segment Anything Model)
  • i-t: annotated with image-text pairs
  • v-t: annotated with video-text pairs
  • i-t-b: annotated with image-text-box triplets
  • i-t-m: annotated with image-text-mask (image-caption-mask) triplets
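For concreteness, the annotation formats above might be represented as records like the following. All field names and values here are hypothetical illustrations, not taken from any surveyed dataset.

```python
# Hypothetical annotation records for the tuple formats above.
# Field names and values are illustrative only.

# i-t: an image paired with a caption
i_t = {
    "image": "harbor_001.tif",
    "text": "A harbor with several docked cargo ships.",
}

# i-t-b: the same pair plus bounding boxes (x1, y1, x2, y2) grounding the text
i_t_b = {**i_t, "boxes": [[120, 48, 310, 200]]}

# i-t-m: the same pair plus a pixel-level segmentation mask
i_t_m = {**i_t, "mask": "harbor_001_mask.png"}
```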

Contrastive VLGFMs

| Year | Venue | Keywords | Paper Title | Code/Project |
|------|-------|----------|-------------|--------------|
| 2023 | arXiv | clip | RS5M and GeoRSCLIP: A Large Scale Vision-Language Dataset and A Large Vision-Language Model for Remote Sensing | Code |
| 2024 | ICLR | clip | GRAFT: Remote Sensing Vision-Language Foundation Models without Annotations via Ground Remote Alignment | Project |
| 2024 | AAAI | clip | SkyScript: A Large and Semantically Diverse Vision-Language Dataset for Remote Sensing | Code |
| 2024 | arXiv | clip | Mind the Modality Gap: Towards a Remote Sensing Vision-Language Model via Cross-modal Alignment | N/A |
| 2024 | TGRS | clip | RemoteCLIP: A Vision Language Foundation Model for Remote Sensing | Code |

Conversational VLGFMs

| Year | Venue | Keywords | Paper Title | Code/Project |
|------|-------|----------|-------------|--------------|
| 2023 | arXiv | llm | RSGPT: A Remote Sensing Vision Language Model and Benchmark | Code |
| 2024 | CVPR | llm | GeoChat: Grounded Large Vision-Language Model for Remote Sensing | Code |
| 2024 | arXiv | llm | SkyEyeGPT: Unifying Remote Sensing Vision-Language Tasks via Instruction Tuning with Large Language Model | Code |
| 2024 | arXiv | llm | EarthGPT: A Universal Multi-modal Large Language Model for Multi-sensor Image Comprehension in Remote Sensing Domain | N/A |
| 2024 | arXiv | llm | LHRS-Bot: Empowering Remote Sensing with VGI-Enhanced Large Multimodal Language Model | Code |
| 2024 | arXiv | llm | Large Language Models for Captioning and Retrieving Remote Sensing Images | N/A |
| 2024 | arXiv | llm | H2RSVLM: Towards Helpful and Honest Remote Sensing Large Vision Language Model | N/A |
| 2024 | RS | llm | RS-LLaVA: A Large Vision-Language Model for Joint Captioning and Question Answering in Remote Sensing Imagery | Code |
| 2024 | arXiv | llm | SkySenseGPT: A Fine-Grained Instruction Tuning Dataset and Model for Remote Sensing Vision-Language Understanding | N/A |

Generative VLGFMs

| Year | Venue | Keywords | Paper Title | Code/Project |
|------|-------|----------|-------------|--------------|
| 2024 | ICLR | clip | DiffusionSat: A Generative Foundation Model for Satellite Imagery | Code |
| 2024 | arXiv | clip | CRS-Diff: Controllable Generative Remote Sensing Foundation Model | N/A |

Datasets & Benchmarks

| Year | Venue | Keywords | Name | Code/Project | Download |
|------|-------|----------|------|--------------|----------|
| 2016 | CITS | i-t | Sydney-Captions & UCM-Captions | N/A | link, link2 |
| 2017 | TGRS | i-t | RSICD | Project | link |
| 2020 | TGRS | i-t | RSVQA-LR & RSVQA-HR | Project | link1, link2 |
| 2021 | IGARSS | i-t | RSVQAxBEN | Project | link |
| 2021 | Access | i-t | FloodNet | Project | link |
| 2021 | TGRS | i-t | RSITMD | Code | link |
| 2021 | TGRS | i-t | RSIVQA | Code | link |
| 2022 | TGRS | i-t | NWPU-Captions | Project | link |
| 2022 | TGRS | i-t | CRSVQA | Project | link |
| 2022 | TGRS | i-t | LEVIR-CC | Project | link |
| 2022 | TGRS | i-t | CDVQA | Project | link |
| 2022 | TGRS | i-t | UAV-Captions | N/A | N/A |
| 2022 | MM | i-t-b | RSVG | Project | link |
| 2022 | RS | v-t | CapERA | Project | link |
| 2023 | TGRS | i-t-b | DIOR-RSVG | Project | link |
| 2023 | arXiv | i-t | RemoteCount | Code | N/A |
| 2023 | arXiv | i-t | RS5M | Code | link |
| 2023 | arXiv | i-t | RSICap & RSIEval | Code | N/A |
| 2023 | arXiv | i-t | LAION-EO | N/A | link |
| 2023 | ICCVW | i-t | SATIN | Project | link |
| 2024 | ICLR | i-t | NAIP-OSM | Project | N/A |
| 2024 | AAAI | i-t | SkyScript | Code | link |
| 2024 | AAAI | i-t-m | EarthVQA | Project | N/A |
| 2024 | TGRS | i-t-m | RRSIS | Code | link |
| 2024 | CVPR | i-t | GeoChat-Instruct & GeoChat-Bench | Code | link |
| 2024 | CVPR | i-t-m | RRSIS-D | Code | link |
| 2024 | arXiv | i-t | SkyEye-968k | Code | N/A |
| 2024 | arXiv | i-t | MMRS-1M | Project | N/A |
| 2024 | arXiv | i-t | LHRS-Align & LHRS-Instruct | Code | N/A |
| 2024 | arXiv | i-t-m | ChatEarthNet | Project | link |
| 2024 | arXiv | i-t | VLEO-Bench | Code | link |
| 2024 | arXiv | i-t | LuoJiaHOG | N/A | N/A |
| 2024 | arXiv | i-t-m | FineGrip | N/A | N/A |

🕹️ Application

Captioning

| Year | Venue | Keywords | Paper Title | Code/Project |
|------|-------|----------|-------------|--------------|
| 2023 | TGRS | llm | A Decoupling Paradigm With Prompt Learning for Remote Sensing Image Change Captioning | Code |
| 2023 | JSEE | llm | VLCA: vision-language aligning model with cross-modal attention for bilingual remote sensing image captioning | N/A |

Retrieval

| Year | Venue | Keywords | Paper Title | Code/Project |
|------|-------|----------|-------------|--------------|
| 2022 | VT | llm | CLIP-RS: A Cross-modal Remote Sensing Image Retrieval Based on CLIP, a Northern Virginia Case Study | N/A |
| 2024 | arXiv | llm | Multi-Spectral Remote Sensing Image Retrieval Using Geospatial Foundation Models | Code |

Change Detection

| Year | Venue | Keywords | Paper Title | Code/Project |
|------|-------|----------|-------------|--------------|
| 2023 | arXiv | sam | Time Travelling Pixels: Bitemporal Features Integration with Foundation Model for Remote Sensing Image Change Detection | Code |
| 2024 | JPRS | clip | ChangeCLIP: Remote sensing change detection with multimodal vision-language representation learning | Code |
| 2024 | TGRS | llm | A New Learning Paradigm for Foundation Model-Based Remote-Sensing Change Detection | Code |
| 2024 | arXiv | sam | Change Detection Between Optical Remote Sensing Imagery and Map Data via Segment Anything Model (SAM) | N/A |
| 2024 | arXiv | sam | Segment Any Change | N/A |

Scene Classification

| Year | Venue | Keywords | Paper Title | Code/Project |
|------|-------|----------|-------------|--------------|
| 2023 | IJAEOG | clip | RS-CLIP: Zero shot remote sensing scene classification via contrastive vision-language supervision | Code |

Segmentation

| Year | Venue | Keywords | Paper Title | Code/Project |
|------|-------|----------|-------------|--------------|
| 2023 | arXiv | sam, clip | Text2Seg: Remote Sensing Image Semantic Segmentation via Text-Guided Vision Foundation Models | Code |
| 2024 | TGRS |  | RRSIS: Referring Remote Sensing Image Segmentation | Code |
| 2024 | CVPR |  | Rotated Multi-Scale Interaction Network for Referring Remote Sensing Image Segmentation | Code |
| 2024 | WACV |  | CPSeg: Finer-grained Image Semantic Segmentation via Chain-of-Thought Language Prompting | N/A |

Visual Question Answering

| Year | Venue | Keywords | Paper Title | Code/Project |
|------|-------|----------|-------------|--------------|
| 2022 | CVPRW |  | Prompt-RSVQA: Prompting visual context to a language model for remote sensing visual question answering | N/A |
| 2024 | AAAI |  | EarthVQA: Towards Queryable Earth via Relational Reasoning-Based Remote Sensing Visual Question Answering | Project |

Geospatial Localization

| Year | Venue | Keywords | Paper Title | Code/Project |
|------|-------|----------|-------------|--------------|
| 2023 | ICML | clip | CSP: Self-Supervised Contrastive Spatial Pre-Training for Geospatial-Visual Representations | Code |
| 2023 | NeurIPS | clip | GeoCLIP: Clip-Inspired Alignment between Locations and Images for Effective Worldwide Geo-localization | Code |
| 2023 | arXiv | clip | SatCLIP: Global, General-Purpose Location Embeddings with Satellite Imagery | Code |

Object Detection

| Year | Venue | Keywords | Paper Title | Code/Project |
|------|-------|----------|-------------|--------------|
| 2023 | arXiv | clip | Stable Diffusion For Aerial Object Detection | N/A |

Super-Resolution

| Year | Venue | Keywords | Paper Title | Code/Project |
|------|-------|----------|-------------|--------------|
| 2023 | arXiv | clip | Zooming Out on Zooming In: Advancing Super-Resolution for Remote Sensing | Code |

📊 Exploration

| Year | Venue | Keywords | Paper Title | Code/Project |
|------|-------|----------|-------------|--------------|
| 2022 | TGRS |  | An Empirical Study of Remote Sensing Pretraining | Code |
| 2023 | arXiv |  | Autonomous GIS: the next-generation AI-powered GIS | N/A |
| 2023 | arXiv |  | GPT4GEO: How a Language Model Sees the World's Geography | Code |
| 2023 | arXiv |  | Charting New Territories: Exploring the Geographic and Geospatial Capabilities of Multimodal LLMs | Code |
| 2023 | arXiv |  | The Potential of Visual ChatGPT For Remote Sensing | N/A |

👨‍🏫 Survey

| Year | Venue | Keywords | Paper Title | Code/Project |
|------|-------|----------|-------------|--------------|
| 2023 | IGARSS |  | An Agenda for Multimodal Foundation Models for Earth Observation | N/A |
| 2023 | TGRS |  | Self-Supervised Remote Sensing Feature Learning: Learning Paradigms, Challenges, and Future Works | N/A |
| 2023 | GISWU |  | Large Remote Sensing Model: Progress and Prospects | N/A |
| 2023 | JSTARS |  | Brain-Inspired Remote Sensing Foundation Models and Open Problems: A Comprehensive Survey | N/A |
| 2023 | arXiv |  | On the Promises and Challenges of Multimodal Foundation Models for Geographical, Environmental, Agricultural, and Urban Planning Applications | N/A |
| 2024 | GRSM |  | Vision-Language Models in Remote Sensing: Current Progress and Future Trends | N/A |

🖊️ Citation

If you find our survey and repository useful for your research project, please consider citing our paper:

@article{zhou2024vlgfm,
  title={Towards Vision-Language Geo-Foundation Models: A Survey},
  author={Yue Zhou and Litong Feng and Yiping Ke and Xue Jiang and Junchi Yan and Xue Yang and Wayne Zhang},
  journal={arXiv preprint arXiv:2406.09385},
  year={2024}
}

🐲 Contact
