JorisCod/auto-datasets

Remote sensing: auto-updating dataset information

Goal: automatically update a list of datasets used for machine learning in remote sensing.

This will be done by combining web scraping with an LLM that processes the scraped information into a structured format. In effect, this creates an "awesome list" that is automatically maintained.
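As a rough illustration of the scraping-plus-LLM idea, the sketch below builds an extraction prompt and parses a structured JSON reply. The field names and function names are assumptions for illustration, not the repository's actual schema.

```python
import json

# Hypothetical target schema: the columns we want the LLM to fill per dataset.
FIELDS = ["name", "application", "license", "link"]

def build_extraction_prompt(page_text: str) -> str:
    """Build a prompt asking the LLM to return one JSON object per dataset."""
    return (
        "Extract every remote sensing dataset mentioned in the text below. "
        f"Return a JSON list of objects with the keys {FIELDS}.\n\n" + page_text
    )

def parse_llm_response(response: str) -> list[dict]:
    """Parse the model's JSON reply into structured records, dropping extra keys."""
    records = json.loads(response)
    return [{key: record.get(key) for key in FIELDS} for record in records]
```

The scraped page text would go into `build_extraction_prompt`, and the model's reply into `parse_llm_response`, yielding rows ready to append to the list.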

Note that this is not possible with GPT-4 out of the box: the link reader/search plugins and Advanced Data Analysis (which allows loading an Excel file) cannot be combined.

Starting point: the list of datasets compiled by Schmitt et al. (2023).

Tasks

  1. Convert the PDF with the list of datasets to a more readable format
  2. Get LangChain's web scraping example working
  3. Fill the extra columns of the list: application, license, and reference/link to GitHub or Zenodo
  4. Add new datasets to the list
  5. Write out the new list
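The tasks above can be sketched as a plain pipeline of functions. All function names and the row schema are assumptions, and the bodies are stubs standing in for the real scraping and LLM calls.

```python
# Hypothetical sketch of the five tasks chained together.

def convert_pdf(path: str) -> list[dict]:
    """Task 1: parse the PDF table into rows (stubbed with one example row)."""
    return [{"name": "ExampleDataset", "application": None, "license": None, "link": None}]

def fill_columns(rows: list[dict]) -> list[dict]:
    """Tasks 2-3: fill the extra columns; the real version would scrape and query an LLM."""
    for row in rows:
        row["application"] = row["application"] or "unknown"
        row["license"] = row["license"] or "unknown"
        row["link"] = row["link"] or ""
    return rows

def add_new_datasets(rows: list[dict]) -> list[dict]:
    """Task 4: append newly discovered datasets (stubbed as a no-op)."""
    return rows

def run_pipeline(path: str) -> list[dict]:
    """Tasks 1-4 chained; task 5 (writing out) would consume the return value."""
    return add_new_datasets(fill_columns(convert_pdf(path)))
```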

Task 1 was done using Adobe's PDF-to-Excel tool. However, it might be more useful to further convert the result to a (SQL) database format. This will be done in task 5 if necessary.
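If the Excel sheet is first exported to CSV, the conversion to a SQL database can be done with the standard library alone. This is a minimal sketch; the table and column names are assumptions.

```python
import csv
import io
import sqlite3

def csv_to_sqlite(csv_text: str, conn: sqlite3.Connection) -> int:
    """Load CSV rows (name, application, license, link) into a 'datasets' table.

    Returns the number of data rows inserted (the header row is skipped).
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS datasets "
        "(name TEXT, application TEXT, license TEXT, link TEXT)"
    )
    rows = list(csv.reader(io.StringIO(csv_text)))
    conn.executemany("INSERT INTO datasets VALUES (?, ?, ?, ?)", rows[1:])
    conn.commit()
    return len(rows) - 1
```

An in-memory connection (`sqlite3.connect(":memory:")`) is enough for experimenting; pointing it at a file path persists the database.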

Possible prompt: "In this Excel, can you fill the last 3 columns for the first 10 items? These are datasets for remote sensing, i.e. for training and validating deep learning methods. You will need to perform a search to fill the last 3 columns, using remote sensing as a search criterion. The application column is the application domain, such as flooding, urban or land cover. The license column is the license of the data, often found on GitHub or Zenodo. The last column is a link to the dataset or dataset paper; just paste any of the links you used to extract the data for the previous two columns here."
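To run that prompt over batches rather than only the first 10 items, it could be turned into a template. The template wording and parameter names below are assumptions, loosely paraphrasing the prompt above.

```python
# Hypothetical template parametrised by batch size and column names.
PROMPT_TEMPLATE = (
    "In this Excel, can you fill the last {n_cols} columns for the first "
    "{n_rows} items? These are datasets for remote sensing, i.e. for training "
    "and validating deep learning methods. Use remote sensing as a search "
    "criterion. The columns to fill are: {columns}."
)

def make_fill_prompt(n_rows: int, columns: list[str]) -> str:
    """Instantiate the fill-columns prompt for a given batch of rows."""
    return PROMPT_TEMPLATE.format(
        n_rows=n_rows, n_cols=len(columns), columns=", ".join(columns)
    )
```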

With some extension, this application could read a tender via https://python.langchain.com/docs/use_cases/question_answering/, generate questions on it, and answer them through web scraping. The references found could then be checked against an existing reference database (either SQL or graph). Test: it should be able to find GEO-Bench through search (e.g. "remote sensing foundation model benchmark") based on ITT documents.
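The reference-checking step could start as simple fuzzy matching against known titles before moving to a SQL or graph database. A minimal sketch using the standard library's `difflib`, with an in-memory list standing in for the database:

```python
import difflib

def match_reference(found_title: str, known_titles: list[str], cutoff: float = 0.8):
    """Check a scraped reference title against an existing reference list.

    Returns the closest known title, or None if nothing is similar enough.
    A database lookup would replace the in-memory list in a real setup.
    """
    lowered = [title.lower() for title in known_titles]
    matches = difflib.get_close_matches(found_title.lower(), lowered, n=1, cutoff=cutoff)
    if not matches:
        return None
    # Map the lowercased match back to the original title.
    return known_titles[lowered.index(matches[0])]
```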

Set-up

Installation:

```shell
conda env create -f environment.yml
playwright install
playwright install-deps
```

Additionally, it is necessary to get the following API keys:

  1. OpenAI API key
  2. Google API key: https://developers.google.com/webmaster-tools/search-console-api/v1/configure
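A common way to wire the keys in is via environment variables. `OPENAI_API_KEY` is the OpenAI SDK's conventional variable name; `GOOGLE_API_KEY` is an assumed name for the Google key, as is the helper function itself.

```python
import os

def load_api_keys() -> dict:
    """Read the required API keys from the environment, failing fast if one is missing."""
    keys = {}
    for name in ("OPENAI_API_KEY", "GOOGLE_API_KEY"):
        value = os.environ.get(name)
        if not value:
            raise RuntimeError(f"Missing required environment variable: {name}")
        keys[name] = value
    return keys
```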

How to run

```shell
python main.py
```
