So2Sat-BuildingType

A large-scale dataset for building type classification using social media and aerial data

Dataset

Balancing

This section describes the balancing algorithm from Section 4.1 of our paper in greater detail.

The two figures below show the distribution of the labeled buildings before (left) and after (right) balancing. The balancing algorithm down-samples the residential and commercial classes to match the number of buildings in the "other" class.
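The down-sampling idea can be illustrated with a minimal sketch. It is not the repository's undersample.py: it performs plain random undersampling by class only, and the function name, the `class` key, and the label string `"other"` are assumptions made for illustration.

```python
import random
from collections import defaultdict


def undersample_to_other(buildings, target_class="other", seed=0):
    """Randomly down-sample every class larger than `target_class` to its size.

    `buildings` is a list of dicts with a "class" key (assumed layout).
    This is a one-dimensional illustration, not the paper's algorithm.
    """
    by_class = defaultdict(list)
    for building in buildings:
        by_class[building["class"]].append(building)

    target = len(by_class[target_class])
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible

    balanced = []
    for label, members in by_class.items():
        if len(members) > target:
            members = rng.sample(members, target)
        balanced.extend(members)
    return balanced
```

Classes that are already no larger than the target class are kept unchanged, so only the over-represented classes shrink.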

Download

The dataset is split into two parts with different licenses:

Building data is available at https://mediatum.ub.tum.de/1662350 (ODbL)
Twitter tweet IDs are available at https://mediatum.ub.tum.de/1662351 (CC BY-NC-SA)

Details

All labeled buildings (buildings.csv.bz2) are in part I. The file contains information about the 6,950,182 OSM-labeled buildings that we were able to identify in the 42 cities. For each building, we share: osm_building_id, class, city, and geometry (polygon or multi-polygon coordinates). The geometry column holds WKT strings, which contain commas but are enclosed in double quotes. When reading the CSV with Python libraries such as pandas, specify the quote character so that the file is parsed correctly. Because some urban areas are adjacent, a building may be assigned to multiple places. Please filter according to your task/area specifications.
Twitter dataset for text classification (tweets.csv.bz2) is in part II. It lists the 26,666,198 geo-tagged tweets that were collected in the 42 cities and assigned to a labeled building. For each tweet, we share: tweet_id, osm_building_id, building_class, building_city, tweet_lang, distance_to_building (in meters), tweet_creation_time (in UTC), tweet_longitude, and tweet_latitude.
Google aerial images dataset. We do not provide a data file, but we provide the script that we used to download the aerial images from Google in the code repository.
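As a minimal sketch of reading the building file with quoted WKT geometries, the following uses only the Python standard library. It assumes that buildings.csv.bz2 is comma-separated and starts with a header row naming the columns listed above; the function name `read_buildings` and its `city` filter are hypothetical helpers, not part of the repository.

```python
import bz2
import csv


def read_buildings(path, city=None):
    """Yield one dict per building row, optionally keeping a single city.

    Assumes a header row with osm_building_id, class, city, geometry.
    """
    # newline="" lets the csv module handle line breaks itself.
    with bz2.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        # quotechar='"' keeps the commas inside WKT strings in one field.
        reader = csv.DictReader(handle, delimiter=",", quotechar='"')
        for row in reader:
            if city is None or row["city"] == city:
                yield row
```

With pandas, the same idea would be `pandas.read_csv(path, quotechar='"', compression="bz2")`. Very long geometry strings can exceed the csv module's default field size limit, which `csv.field_size_limit` can raise. Filtering on `city` is one way to handle buildings that are assigned to more than one place.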

Code

download_building_aerial_images.py yields the corresponding aerial images for each building
undersample.py performs two-dimensional undersampling as described in the paper
split_train_test.py splits the imbalanced and balanced buildings.csv.bz2 into a training and test part

Appendix of the paper

In this section we provide additional statistics and baseline results achieved with our proposed dataset.

Additional Dataset Statistics

Twitter

In this subsection we provide additional information about the Twitter modality. The table below shows the number of unique words for each value of α.

$\alpha$	Number of unique words
41	36,058
12	24,151
9	21,519
6	18,006
4	14,647
3	12,741
2	10,429
1	6,978

Number of unique words in the textual corpus for each value of α, where α refers to the maximum number of tweets to consider per building
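The effect of α on the vocabulary size can be sketched as follows. The dataset shares tweet IDs only, so the tweet texts have to be obtained separately; the input format of `(osm_building_id, text)` pairs, the whitespace tokenisation, the lower-casing, and the choice of keeping the first α tweets per building are all assumptions made for illustration.

```python
from collections import defaultdict


def unique_words_with_cap(tweets, alpha):
    """Count unique whitespace tokens using at most `alpha` tweets per building.

    `tweets` is an iterable of (osm_building_id, text) pairs (assumed layout).
    """
    used = defaultdict(int)  # number of tweets already counted per building
    vocabulary = set()
    for building_id, text in tweets:
        if used[building_id] >= alpha:
            continue  # this building has reached its cap of alpha tweets
        used[building_id] += 1
        vocabulary.update(text.lower().split())
    return len(vocabulary)
```

A smaller α limits the contribution of buildings with very many tweets, which is why the vocabulary shrinks as α decreases.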

The next table gives statistics about the distribution of tweets per building.

	min	max	median	mean	variance	sd
Commercial	1	584,296	4	66.99	4,975,237	2230.52
Residential	1	134,995	1	10.57	81,421.64	285.34
Other	1	1,541,532	4	133.49	39,229,660	6263.36
All	1	1,541,532	2	40.69	6,513,908	2552.24

The minimum, maximum, median, mean, variance, and standard deviation of the number of tweets per building. “All” refers to all buildings of all classes.

The following table shows the main statistics on the number of tweets per building for excluding rates of 0, 1, 2, 5, 10, 15, 20, and 25%. Excluding more outliers yields a more homogeneous dataset, as reflected in the lower variance and standard deviation values:

Data share (#buildings)	min	max	median	mean	variance	sd
0% (655,425)	1	1,541,532	2	40.69 (41)	6,513,908	2552.24
1% (642,316)	1	426	2	11.95 (12)	1377.23	37.11
2% (629,208)	1	207	2	9.14 (9)	535.29	23.14
5% (589,882)	1	71	2	5.77 (6)	113.76	10.67
10% (524,340)	1	27	2	3.72 (4)	24.45	4.94
15% (458,798)	1	14	2	2.79 (3)	7.93	2.82
20% (393,255)	1	8	2	2.27 (2)	3.13	1.77
25% (327,712)	1	5	2	1.94 (1)	1.40	1.18

The minimum, maximum, median, mean, variance, and standard deviation of the number of tweets per building for each excluding rate. The value in parentheses after each mean is the corresponding α. In the last row, we set α = 1 instead of 2 to differentiate it from the previous row.
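A minimal sketch of such trimmed statistics is given below. The building counts in the table are consistent with roughly the stated share of buildings being removed at each end of the distribution (for example, 655,425 × 0.98 ≈ 642,316 for the 1% row), but the exact procedure and rounding are not stated here, so the symmetric trim is an assumption. The use of population variance rather than sample variance is likewise an assumption, and `trimmed_stats` is a hypothetical helper.

```python
import statistics


def trimmed_stats(counts, rate):
    """Summarise tweet counts after dropping `rate` of the buildings per tail.

    `counts` holds the number of tweets per building; `rate` must be < 0.5.
    Symmetric trimming and population variance are assumptions.
    """
    ordered = sorted(counts)
    cut = int(len(ordered) * rate)  # buildings dropped at each end
    kept = ordered[cut:len(ordered) - cut]
    return {
        "buildings": len(kept),
        "min": kept[0],
        "max": kept[-1],
        "median": statistics.median(kept),
        "mean": statistics.mean(kept),
        "variance": statistics.pvariance(kept),
        "sd": statistics.pstdev(kept),
    }
```

Because so many buildings have only a handful of tweets, trimming barely moves the minimum and median, while the maximum, variance, and standard deviation drop sharply once the largest counts are removed.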

Citation

Available at TBA

BibTeX:

insert_paper_bibtex_here
