Multi-threaded Spatial Join algorithm for spatially indexed data in MongoDB
Clone this repository and install the dependencies.
git clone https://github.com/UrbanSystemsLab/spatial-join-mongodb.git
cd spatial-join-mongodb
npm install
For ease of understanding, this documentation refers to the layers as follows:
- Inner Layer: buildings
- Outer Layer: lots
- Outer Layer with Buffer: lots_buffer
- Output Layer: buildings_spatialJoin
Note: Creating a buffer for the outer layer is optional.
Once the collections are set up in MongoDB, run the following command to perform the spatial join on the collections.
node init.js --db '[database-url]' --innerLayer [Inner Layer Collection] --outerLayer [Outer Layer Collection] --outputLayer [Output Collection] --outerLayerAttributes 'Attribute 0' 'Attribute 1' ... 'Attribute n'
node init.js --db 'mongodb://localhost:27017/nyc' --innerLayer buildings --outerLayer lots --outputLayer buildings_spatialJoin --outerLayerAttributes 'borough_code'
Recommended: Create a buffer for the outer layer
If the boundaries of the inner layer polygons and the outer layer polygons are very close together, a fixed-distance buffer can be created for the outer layer. For example, the outer layer may be building lots and the inner layer building footprints. The coordinates may be too close for the spatial join to run effectively, especially if the data sets were obtained from different sources. In this case, a buffer distance of 0.00002 worked best.
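To illustrate why a small buffer helps, here is a minimal sketch that uses axis-aligned bounding boxes in place of real polygons. The helpers expandBox and contains and the sample coordinates are hypothetical and are not part of this repository. An actual buffer would be produced with a GIS tool before import.

```javascript
// Simplified illustration: bounding boxes stand in for polygons.
// A box is [minLon, minLat, maxLon, maxLat] in EPSG:4326 degrees.

// Grow a box outward by a fixed distance on every side (hypothetical helper).
function expandBox(box, distance) {
  const [minLon, minLat, maxLon, maxLat] = box;
  return [minLon - distance, minLat - distance, maxLon + distance, maxLat + distance];
}

// True when `inner` lies entirely inside `outer` (edges included).
function contains(outer, inner) {
  return (
    inner[0] >= outer[0] &&
    inner[1] >= outer[1] &&
    inner[2] <= outer[2] &&
    inner[3] <= outer[3]
  );
}

// Made-up lot, and a footprint from another source that overshoots
// the lot's east edge by 0.00001.
const lot = [-73.99000, 40.75000, -73.98900, 40.75100];
const building = [-73.98950, 40.75020, -73.98899, 40.75080];

console.log(contains(lot, building));                     // false
console.log(contains(expandBox(lot, 0.00002), building)); // true
```

Without the buffer the footprint is not counted as contained; with a 0.00002 buffer it is.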
The EPSG:4326 geodetic coordinate system is used for all features throughout the process.
Create empty collections with spatial indexes in MongoDB before importing the data.
db.buildings.createIndex({"geometry":"2dsphere"})
db.lots.createIndex({"geometry":"2dsphere"})
db.lots_buffer.createIndex({"geometry":"2dsphere"})
First, create import-ready JSON files from the GeoJSON files using the jq command-line utility. Then import the data into the respective collections in MongoDB. Bad GeoJSON geometries will be dropped during the import.
jq ".features" --compact-output buildings.geojson > buildings.json
jq ".features" --compact-output lots.geojson > lots.json
jq ".features" --compact-output lots_buffer.geojson > lots_buffer.json # Optional
mongoimport --db nyc -c buildings --file "buildings.json" --jsonArray
mongoimport --db nyc -c lots --file "lots.json" --jsonArray
mongoimport --db nyc -c lots_buffer --file "lots_buffer.json" --jsonArray # Optional
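For readers unfamiliar with jq, the extraction step above can be sketched in Node.js. This is only an illustration of what the jq filter does, assuming the input is a GeoJSON FeatureCollection. The function name featuresToJsonArray and the sample data are hypothetical.

```javascript
// Roughly mirrors `jq ".features" --compact-output`: keep only the features
// array, serialized on one line, which is the shape `mongoimport --jsonArray` reads.
function featuresToJsonArray(geojsonText) {
  const collection = JSON.parse(geojsonText);
  if (!Array.isArray(collection.features)) {
    throw new Error("Expected a GeoJSON FeatureCollection with a features array");
  }
  return JSON.stringify(collection.features);
}

// Tiny made-up FeatureCollection for illustration.
const sample = JSON.stringify({
  type: "FeatureCollection",
  features: [
    {
      type: "Feature",
      properties: { id: 1 },
      geometry: { type: "Point", coordinates: [-73.99, 40.75] },
    },
  ],
});

console.log(featuresToJsonArray(sample));
```

The result is a JSON array with one document per feature, so each feature becomes one MongoDB document on import.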
Ensure that the spatial indexes exist. createIndex is a no-op when the index is already present.
db.buildings.createIndex({"geometry":"2dsphere"})
db.lots.createIndex({"geometry":"2dsphere"})
db.lots_buffer.createIndex({"geometry":"2dsphere"})
This is a multi-threaded process that will consume all available CPU cores.
node init.js --db 'mongodb://localhost:27017/nyc' --innerLayer buildings --outerLayer lots --outputLayer buildings_spatialJoin --outerLayerAttributes 'borough_code'
- --db : MongoDB database URL
- --innerLayer : Database collection name containing inner layer features
- --outerLayer : Database collection name containing outer layer features
- --outputLayer : Empty collection where spatially joined features are to be stored
- --outerLayerAttributes : Outer layer feature attributes to be added to contained inner layer features, separated by spaces
Note: All three layer collections must be different and have spatial indices.
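The join semantics described by these options can be sketched as follows. This is not the repository's implementation. The helpers buildWithinFilter and joinFeature are hypothetical, and the sketch assumes that attributes live under each feature's properties field, as in standard GeoJSON. $geoWithin with $geometry is a standard MongoDB query operator that uses the 2dsphere index.

```javascript
// Hypothetical helpers; the repository's own implementation may differ.

// Filter selecting inner-layer features that lie entirely within one outer feature.
function buildWithinFilter(outerFeature) {
  return { geometry: { $geoWithin: { $geometry: outerFeature.geometry } } };
}

// Return a copy of the inner feature with the requested outer attributes added.
function joinFeature(innerFeature, outerFeature, attributes) {
  const properties = { ...innerFeature.properties };
  for (const name of attributes) {
    if (name in outerFeature.properties) {
      properties[name] = outerFeature.properties[name];
    }
  }
  return { ...innerFeature, properties };
}

// Made-up sample features for illustration.
const sampleLot = {
  type: "Feature",
  properties: { borough_code: "MN", lot_id: 42 },
  geometry: {
    type: "Polygon",
    coordinates: [[
      [-73.990, 40.750], [-73.989, 40.750], [-73.989, 40.751],
      [-73.990, 40.751], [-73.990, 40.750],
    ]],
  },
};
const sampleBuilding = {
  type: "Feature",
  properties: { name: "Building A" },
  geometry: { type: "Point", coordinates: [-73.9895, 40.7505] },
};

const joined = joinFeature(sampleBuilding, sampleLot, ["borough_code"]);
console.log(JSON.stringify(buildWithinFilter(sampleLot)));
console.log(joined.properties);
```

Only the attributes named in --outerLayerAttributes are copied; other outer layer attributes (lot_id in this sample) are left out of the joined feature.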
After the spatial join process has completed, run this command to export the output collection.
mongoexport --db [Database-name] -c [Collection-name] --out "[export-filename].json" --jsonArray
Benchmark:
The process ran on 856,000 lots and 1,095,161 buildings in 17 minutes on an Intel Core i7-5700HQ CPU @ 2.60GHz (4 cores, 8 logical processors).
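Since the process uses every available core, the workload has to be divided among workers in some way. One common approach is sketched below, assuming the features are split into contiguous ranges, one per logical processor. The partition function is hypothetical, and the way this repository actually distributes work may differ.

```javascript
const os = require("os");

// Split `total` items into `workers` contiguous [start, end) ranges of near-equal size.
function partition(total, workers) {
  const ranges = [];
  const base = Math.floor(total / workers);
  const remainder = total % workers;
  let start = 0;
  for (let i = 0; i < workers; i++) {
    // The first `remainder` workers take one extra item each.
    const size = base + (i < remainder ? 1 : 0);
    ranges.push([start, start + size]);
    start += size;
  }
  return ranges;
}

// One worker per logical processor, with a floor of 1 as a safeguard.
const workerCount = Math.max(1, os.cpus().length);
console.log(partition(856000, workerCount));
```

On the benchmark machine with 8 logical processors, 856,000 lots would split into 8 ranges of 107,000 features each.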