JorisCod/auto-datasets

Remote sensing: auto-updating dataset information

Goal: automatically update a list of datasets used for machine learning in remote sensing.

This will be done by combining web scraping with an LLM that processes the scraped information into a structured format. In effect, this creates an "awesome list" that is automatically maintained.
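As a rough illustration of the scraping-plus-LLM idea, the sketch below builds an extraction prompt and parses a structured JSON reply. The field names and function names are assumptions for illustration, not the repository's actual schema.

```python
import json

# Hypothetical target schema: the columns we want the LLM to fill per dataset.
FIELDS = ["name", "application", "license", "link"]

def build_extraction_prompt(page_text: str) -> str:
    """Build a prompt asking the LLM to return one JSON object per dataset."""
    return (
        "Extract every remote sensing dataset mentioned in the text below. "
        f"Return a JSON list of objects with the keys {FIELDS}.\n\n" + page_text
    )

def parse_llm_response(response: str) -> list[dict]:
    """Parse the model's JSON reply into structured records, dropping extra keys."""
    records = json.loads(response)
    return [{key: record.get(key) for key in FIELDS} for record in records]
```

The scraped page text would go into `build_extraction_prompt`, and the model's reply into `parse_llm_response`, yielding rows ready to append to the list.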

Note that this is not possible with GPT-4 out of the box: the link reader/search plugins and Advanced Data Analysis (which allows loading an Excel file) cannot be combined.

Starting point: the list of datasets compiled by Schmitt et al. (2023).

Tasks

  1. Convert the PDF with the list of datasets to a more readable format
  2. Get LangChain's web scraping example working
  3. Fill the extra columns of the list: application, license, and reference/link to GitHub or Zenodo
  4. Add new datasets to the list
  5. Write out the new list
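The tasks above can be sketched as a plain pipeline of functions. All function names and the row schema are assumptions, and the bodies are stubs standing in for the real scraping and LLM calls.

```python
# Hypothetical sketch of the five tasks chained together.

def convert_pdf(path: str) -> list[dict]:
    """Task 1: parse the PDF table into rows (stubbed with one example row)."""
    return [{"name": "ExampleDataset", "application": None, "license": None, "link": None}]

def fill_columns(rows: list[dict]) -> list[dict]:
    """Tasks 2-3: fill the extra columns; the real version would scrape and query an LLM."""
    for row in rows:
        row["application"] = row["application"] or "unknown"
        row["license"] = row["license"] or "unknown"
        row["link"] = row["link"] or ""
    return rows

def add_new_datasets(rows: list[dict]) -> list[dict]:
    """Task 4: append newly discovered datasets (stubbed as a no-op)."""
    return rows

def run_pipeline(path: str) -> list[dict]:
    """Tasks 1-4 chained; task 5 (writing out) would consume the return value."""
    return add_new_datasets(fill_columns(convert_pdf(path)))
```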

Task 1 was done using Adobe's PDF-to-Excel tool. However, it might be more useful to further convert the result to a (SQL) database format. This will be done in task 5 if necessary.
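If the Excel sheet is first exported to CSV, the conversion to a SQL database can be done with the standard library alone. This is a minimal sketch; the table and column names are assumptions.

```python
import csv
import io
import sqlite3

def csv_to_sqlite(csv_text: str, conn: sqlite3.Connection) -> int:
    """Load CSV rows (name, application, license, link) into a 'datasets' table.

    Returns the number of data rows inserted (the header row is skipped).
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS datasets "
        "(name TEXT, application TEXT, license TEXT, link TEXT)"
    )
    rows = list(csv.reader(io.StringIO(csv_text)))
    conn.executemany("INSERT INTO datasets VALUES (?, ?, ?, ?)", rows[1:])
    conn.commit()
    return len(rows) - 1
```

An in-memory connection (`sqlite3.connect(":memory:")`) is enough for experimenting; pointing it at a file path persists the database.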

Possible prompt: "In this Excel, can you fill the last 3 columns for the first 10 items? These are datasets for remote sensing, i.e. for training and validating deep learning methods. You will need to perform a search to fill the last 3 columns, using remote sensing as a search criterion. The application column is the application domain, such as flooding, urban or land cover. The license column is the license of the data, often found on GitHub or Zenodo. The last column is a link to the dataset or dataset paper; just paste any of the links you used to extract the data for the previous two columns here."
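To run that prompt over batches rather than only the first 10 items, it could be turned into a template. The template wording and parameter names below are assumptions, loosely paraphrasing the prompt above.

```python
# Hypothetical template parametrised by batch size and column names.
PROMPT_TEMPLATE = (
    "In this Excel, can you fill the last {n_cols} columns for the first "
    "{n_rows} items? These are datasets for remote sensing, i.e. for training "
    "and validating deep learning methods. Use remote sensing as a search "
    "criterion. The columns to fill are: {columns}."
)

def make_fill_prompt(n_rows: int, columns: list[str]) -> str:
    """Instantiate the fill-columns prompt for a given batch of rows."""
    return PROMPT_TEMPLATE.format(
        n_rows=n_rows, n_cols=len(columns), columns=", ".join(columns)
    )
```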

With some extension, this application could read a tender via https://python.langchain.com/docs/use_cases/question_answering/, generate questions on it, and answer them through web scraping. The references found could then be checked against an existing reference database (either SQL or graph). Test: it should be able to find GEO-Bench through search (e.g. "remote sensing foundation model benchmark") based on ITT documents.
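The reference-checking step could start as simple fuzzy matching against known titles before moving to a SQL or graph database. A minimal sketch using the standard library's `difflib`, with an in-memory list standing in for the database:

```python
import difflib

def match_reference(found_title: str, known_titles: list[str], cutoff: float = 0.8):
    """Check a scraped reference title against an existing reference list.

    Returns the closest known title, or None if nothing is similar enough.
    A database lookup would replace the in-memory list in a real setup.
    """
    lowered = [title.lower() for title in known_titles]
    matches = difflib.get_close_matches(found_title.lower(), lowered, n=1, cutoff=cutoff)
    if not matches:
        return None
    # Map the lowercased match back to the original title.
    return known_titles[lowered.index(matches[0])]
```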

Set-up

Installation:

```shell
conda env create -f environment.yml
playwright install
playwright install-deps
```

Additionally, it is necessary to get the following API keys:

  1. OpenAI API key
  2. Google API key: https://developers.google.com/webmaster-tools/search-console-api/v1/configure
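A common way to wire the keys in is via environment variables. `OPENAI_API_KEY` is the OpenAI SDK's conventional variable name; `GOOGLE_API_KEY` is an assumed name for the Google key, as is the helper function itself.

```python
import os

def load_api_keys() -> dict:
    """Read the required API keys from the environment, failing fast if one is missing."""
    keys = {}
    for name in ("OPENAI_API_KEY", "GOOGLE_API_KEY"):
        value = os.environ.get(name)
        if not value:
            raise RuntimeError(f"Missing required environment variable: {name}")
        keys[name] = value
    return keys
```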

How to run

```shell
python main.py
```
