This repository contains a FastAPI application that connects to a noSQL Mongo Database and process some metrics for the incidents that happen to Chicago City.
All the data used for the development can be found here. Also I have uploaded some of these data inside this repository, which can be found here.
This section contains the installation instructions in order to set up a local development environment. The instructions have been validated for Ubuntu 20.04.
First, install all required software:
MongoDB detailed instructions here
wget -qO - https://www.mongodb.org/static/pgp/server-4.4.asc | sudo apt-key add -
echo "deb [ arch=amd64,arm64 ] https://repo.mongodb.org/apt/ubuntu focal/mongodb-org/4.4 multiverse" | sudo tee /etc/apt/sources.list.d/mongodb-org-4.4.list
sudo apt-get update
sudo apt-get install -y mongodb-org
Python 3.7+
sudo apt update
sudo apt install git python3 python3-pip python3-dev
The project dependencies are managed with pipenv. You can install it with:
pip install --user pipenv
pipenv
should now be in your PATH
. If not, logout and log in again. Then install all dependencies with:
pipenv install --dev
Then you can enable the python environment with:
pipenv shell
All commands from this point forward require the python environment to be enabled.
The project uses environment variables in order to keep private data like user names and passwords out of source
control. You can either set them at system level, or by creating a file named .env
at the root of the repository.
The required environment variables for development are:
MONGO_USER
: The database userMONGO_PASSWORD
: The database user passwordMONGO_HOST
: The database host. For local development uselocalhost
MONGO_DB
: The database name.MONGO_PORT
: The database port.
In order to run the project on your workstation, you must create a database named according to the value of the
MONGO_DB
environment variable, at the host that is specified by the MONGO_HOST
environment variable. Initially you must a create the database and a user:
mongo
use chicago_incidents
db.createUser({user: "<USER_NAME>", pwd: "<USER_PASSWORD>", roles: [{role: "readWrite", db: "chicago_incidents"}]})
After you create the database, you can populate it with the initial data with:
python data_import.py [csv_files]
where csv_files
is the path of the csv files (one or more) to import to the database
Example
python data_import.py ./assist_material/datasets/csv/311-service-requests-abandoned-vehicles.csv ./assist_material/datasets/csv/311-service-requests-street-lights-all-out.csv ./assist_material/datasets/csv/311-service-requests-street-lights-one-out.csv ./assist_material/datasets/csv/311-service-requests-alley-lights-out.csv ./assist_material/datasets/csv/311-service-requests-garbage-carts.csv ./assist_material/datasets/csv/311-service-requests-pot-holes-reported.csv ./assist_material/datasets/csv/311-service-requests-graffiti-removal.csv ./assist_material/datasets/csv/311-service-requests-rodent-baiting.csv ./assist_material/datasets/csv/311-service-requests-sanitation-code-complaints.csv ./assist_material/datasets/csv/311-service-requests-tree-debris.csv ./assist_material/datasets/csv/311-service-requests-tree-trims.csv
The importing takes approximately 2000 seconds (tested on i7-2600, 16gb RAM).
Now you can run the web server with:
uvicorn app.main:app
The API is available at http://127.0.0.1:8000/api/v1/
The documentation Swagger page of the API is available at http://127.0.0.1:8000/docs or with ReDoc http://127.0.0.1:8000/redoc
I recommend this way of installation in order to keep your machine clean from packages that you may not use ever again.
Initially, install Docker Engine (click the link to see instructions) & Docker Compose in order to build the project.
Set up the .env
at the root of the repository!
MONGO_USER
: The database userMONGO_PASSWORD
: The database user passwordMONGO_HOST
:mongodb
The host name must bemongodb
MONGO_PORT
: 27018 You must use this portMONGO_DB
: The database name.
Then just execute the following:
docker-compose up --build
Then you have the database & the API up and running!
In order to perform the import of the data you can log in to the running docker container and perform the process manually.
docker exec -it api bash
Now you have access to all the files of the API. You can now run the import command as mentioned above in the Local Development section.
The database is exposed at mongodb://localhost:27018/
The API & the documentation pages page are available to the same addresses that referred above.
params
start_date: YYYY-MM-DDTHH:MM:SS
end_date: YYYY-MM-DDTHH:MM:SS
eg.
curl http://0.0.0.0:8000/api/v1/total-requests-per-type?start_date=2010-10-10T00%3A00%3A00&end_date=2011-10-10T00%3A00%3A00
[
{
"_id": "GRAFFITI",
"count": 124191
},
{
"_id": "POTHOLE",
"count": 72652
},
{
"_id": "STREET_ALL_LIGHTS",
"count": 64028
},
{
"_id": "GARBAGE_CART",
"count": 42675
},
{
"_id": "ALLEY_LIGHTS",
"count": 38843
},
{
"_id": "TREE_TRIMS",
"count": 33809
},
{
"_id": "RODENT_BAITING",
"count": 28808
},
{
"_id": "STREET_ONE_LIGHT",
"count": 27851
},
{
"_id": "ABANDONED_VEHICLE",
"count": 19519
},
{
"_id": "TREE_DEBRIS",
"count": 18591
},
{
"_id": "SANITATION_VIOLATION",
"count": 15382
}
]
response time 1.18 s
params
start_date: YYYY-MM-DDTHH:MM:SS
end_date: YYYY-MM-DDTHH:MM:SS
request_type: TYPE_OF_REQUEST
eg.
curl http://0.0.0.0:8000/api/v1/total-requests-per-day?start_date=2010-10-10T00%3A00%3A00&end_date=2017-10-10T00%3A00%3A00&request_type=ABANDONED_VEHICLE
[
{
"_id": "2017-10-10T00:00:00",
"count": 111
},
{
"_id": "2017-10-09T00:00:00",
"count": 84
},
{
"_id": "2017-10-08T00:00:00",
"count": 40
},
{
"_id": "2017-10-07T00:00:00",
"count": 49
},
{
"_id": "2017-10-06T00:00:00",
"count": 76
},
{
"_id": "2017-10-05T00:00:00",
"count": 114
},
{
"_id": "2017-10-04T00:00:00",
"count": 91
},
{
"_id": "2017-10-03T00:00:00",
"count": 113
},
{
"_id": "2017-10-02T00:00:00",
"count": 102
},
...
]
response time 608 ms
params
date: YYYY-MM-DDTHH:MM:SS
eg.
curl http://0.0.0.0:8000/api/v1/three-most-common-requests-per-zipcode?date=2012-10-10T00%3A00%3A00
[
{
"_id": 60630.0,
"top_three": [
{
"type_of_service_request": "RODENT_BAITING",
"count": 4
},
{
"type_of_service_request": "ABANDONED_VEHICLE",
"count": 4
},
{
"type_of_service_request": "GARBAGE_CART",
"count": 3
}
]
},
{
"_id": 60637.0,
"top_three": [
{
"type_of_service_request": "STREET_ALL_LIGHTS",
"count": 11
},
{
"type_of_service_request": "TREE_DEBRIS",
"count": 5
},
{
"type_of_service_request": "TREE_TRIMS",
"count": 5
}
]
},
{
"_id": 60607.0,
"top_three": [
{
"type_of_service_request": "TREE_TRIMS",
"count": 1
},
{
"type_of_service_request": "STREET_ALL_LIGHTS",
"count": 1
},
{
"type_of_service_request": "GARBAGE_CART",
"count": 1
}
]
},
{
"_id": 60631.0,
"top_three": [
{
"type_of_service_request": "TREE_TRIMS",
"count": 8
},
{
"type_of_service_request": "TREE_DEBRIS",
"count": 5
},
{
"type_of_service_request": "STREET_ALL_LIGHTS",
"count": 4
}
]
},
{
"_id": 60620.0,
"top_three": [
{
"type_of_service_request": "GARBAGE_CART",
"count": 16
},
{
"type_of_service_request": "STREET_ONE_LIGHT",
"count": 14
},
{
"type_of_service_request": "TREE_TRIMS",
"count": 10
}
]
},
...
]
response time 49.7 ms
params
request_type: TYPE_OF_REQUEST
eg.
curl http://0.0.0.0:8000/api/v1/three-least-common-wards?request_type=SANITATION_VIOLATION
[
{
"_id": 0.0,
"count": 3
},
{
"_id": 48.0,
"count": 1111
},
{
"_id": 4.0,
"count": 1428
}
]
response time 512ms
params
start_date: YYYY-MM-DDTHH:MM:SS
end_date: YYYY-MM-DDTHH:MM:SS
eg.
curl http://0.0.0.0:8000/api/v1/average-completion-time-per-request?start_date=2010-10-10T00%3A00%3A00&end_date=2011-10-10T00%3A00%3A00
[
{
"_id": "ABANDONED_VEHICLE",
"average_completion_time": "20 days, 11:50:22.347456"
},
{
"_id": "ALLEY_LIGHTS",
"average_completion_time": "53 days, 20:06:26.664264"
},
{
"_id": "GARBAGE_CART",
"average_completion_time": "69 days, 9:13:57.933216"
},
{
"_id": "GRAFFITI",
"average_completion_time": "6 days, 3:23:14.970650"
},
{
"_id": "POTHOLE",
"average_completion_time": "12 days, 12:11:49.970820"
},
{
"_id": "RODENT_BAITING",
"average_completion_time": "21 days, 0:44:56.251041"
},
{
"_id": "SANITATION_VIOLATION",
"average_completion_time": "3 days, 22:46:13.839553"
},
{
"_id": "STREET_ALL_LIGHTS",
"average_completion_time": "5 days, 20:55:07.852814"
},
{
"_id": "STREET_ONE_LIGHT",
"average_completion_time": "34 days, 16:05:46.414850"
},
{
"_id": "TREE_DEBRIS",
"average_completion_time": "6 days, 6:11:19.683718"
},
{
"_id": "TREE_TRIMS",
"average_completion_time": "367 days, 7:19:20.318258"
}
]
response time 1.72 s
params
date: YYYY-MM-DDTHH:MM:SS
point_a_longitude: double
point_a_latitude: double
point_b_longitude: double
point_b_latitude: double
eg.
curl http://0.0.0.0:8000/api/v1/most-common-service-in-bounding-box?date=2010-10-10T00%3A00%3A00&point_a_longitude=-86.64615132728282&point_a_latitude=40.93702589972641&point_b_longitude=-88.64615132728282&point_b_latitude=43.93702589972641
[
{
"_id": "TREE_TRIMS",
"count": 22
}
]
response time 32.2 ms
params
date: YYYY-MM-DDTHH:MM:SS
eg.
curl http://0.0.0.0:8000/api/v1/top-fifty-upvoted-requests?date=2010-12-10T00%3A00%3A00
{
"_id": "6001ebe85c8559dd9d6b2111",
"total_votes": 30
},
{
"_id": "6001ed755c8559dd9d8d2508",
"total_votes": 28
},
{
"_id": "6001ec505c8559dd9d749758",
"total_votes": 25
},
{
"_id": "6001ed755c8559dd9d8d250f",
"total_votes": 25
},
{
"_id": "6001eba85c8559dd9d63abc8",
"total_votes": 24
},
{
"_id": "6001eca55c8559dd9d7d27ae",
"total_votes": 24
},
{
"_id": "6001ebe85c8559dd9d6b2108",
"total_votes": 24
},
{
"_id": "6001ed755c8559dd9d8d2504",
"total_votes": 23
},
{
"_id": "6001ebe85c8559dd9d6b2121",
"total_votes": 23
},
...
]
response time 29.2 ms
eg.
curl http://0.0.0.0:8000/api/v1/top-fifty-active-citizens
[
{
"_id": "5ffdf8750d58d021a3b4f4f7",
"total_votes": 1000
},
{
"_id": "5ffdf8750d58d021a3b45b71",
"total_votes": 1000
},
{
"_id": "5ffdf8750d58d021a3b4c97b",
"total_votes": 1000
},
{
"_id": "5ffdf8750d58d021a3b443b3",
"total_votes": 1000
},
{
"_id": "5ffdf8750d58d021a3b4413c",
"total_votes": 1000
},
{
"_id": "5ffdf8750d58d021a3b4f63b",
"total_votes": 1000
},
...
]
response time 132 ms
eg.
curl http://0.0.0.0:8000/api/v1/top-fifty-wards-citizens
[
{
"_id": "5ffdf8750d58d021a3b4330d",
"total_wards": 50
},
{
"_id": "5ffdf8750d58d021a3b432fd",
"total_wards": 50
},
{
"_id": "5ffdf8750d58d021a3b432eb",
"total_wards": 50
},
{
"_id": "5ffdf8750d58d021a3b43309",
"total_wards": 50
},
{
"_id": "5ffdf8750d58d021a3b43301",
"total_wards": 50
},
{
"_id": "5ffdf8750d58d021a3b432ea",
"total_wards": 50
},
...
]
response time 128 ms
eg.
[
{
"_id": "(961)407-5332x47507",
"incident_ids": [
"6001eb7a5c8559dd9d5c04b4",
"6001eb7a5c8559dd9d5c2012",
"6001eb7a5c8559dd9d5c3a57",
"6001eb7a5c8559dd9d5c57c2",
"6001eb7a5c8559dd9d5c66f2",
"6001eb7a5c8559dd9d5c9e2f",
"6001eb7a5c8559dd9d5cbf3d",
"6001eb7a5c8559dd9d5cf6d7",
"6001eb7a5c8559dd9d5d22b4",
"6001eb7a5c8559dd9d5d38f6",
"6001eb7a5c8559dd9d5d3e46",
...
]
},
{
"_id": "(+30)2105999999",
"incident_ids": [
"6001eb7b5c8559dd9d5e32ee",
"6001eb7b5c8559dd9d5e3529",
"6001eb7b5c8559dd9d5e3b7d",
"6001eb7b5c8559dd9d5e5e97",
"6001eb7b5c8559dd9d5e6cb3",
"6001eb7b5c8559dd9d5e9421",
"6001eb7b5c8559dd9d5eb5fb"
]
}
]
response time 596 ms
params
name: str
eg.
curl http://0.0.0.0:8000/api/v1/citizen-wards?name=Alexander%20Rose
[
{
"_id": "Alexander Rose",
"wards": [
1,
2,
3,
4,
5,
6,
7,
8,
9,
10,
...
]
}
]
response time 32.2
body
{
"creation_date": "YYYY-MM-DDTHH:MM:SS",
"completion_date": "YYYY-MM-DDTHH:MM:SS",
"status": "string",
"service_request_number": "string",
"type_of_service_request": "string",
"current_activity": "string",
"most_recent_action": "string",
"street_address": "string",
"zip_code": 0,
"zip_codes": 0,
"x_coordinate": 0,
"y_coordinate": 0,
"ward": 0,
"wards": 0,
"historical_wards_03_15": 0,
"police_district": 0,
"community_area": 0,
"community_areas": 0,
"ssa": 0,
"census_tracts": 0,
"license_plate": "string",
"vehicle_make_model": "string",
"vehicle_color": "string",
"days_of_report_as_parked": 0,
"number_of_elements": 0,
"surface": "string",
"location": "string",
"number_of_premises_baited": 0,
"number_of_premises_w_garbage": 0,
"number_of_premises_w_rats": 0,
"nature_of_code_violation": "string"
}
params
incident_id: str in ObjectId format
body
{
"name": "string",
"street_address": "string",
"telephone_number": "string",
"incident": "string in ObjectId format>"
}
params
citizen_id: str in ObjectId format
The below json are just a representation of the documents that are stored to the database.
Incidents Collection
{
"_id":{"$oid":"id"},
"creation_date":{"$date":"YYYY-MM-DDTHH:MM:SS.000Z"},
"status":"string",
"completion_date":{"$date":"YYYY-MM-DDTHH:MM:SS.000Z"},
"service_request_number":"string",
"type_of_service_request":"string", // Normalized with ABANDONDED_VEHICLE, GARBAGE_CARTS, LIGHTS_ALL_OUT etc.
"street_address":"string",
"zip_code":"int",
"zip_codes":"int",
"ward":"int",
"wards":"int",
"historical_wards_03_15":"int",
"police_district":"int",
"community_area":"int",
"community_areas":"int",
"ssa":"int",
"census_tracts":"int",
"geo_location":{"type":"Point","coordinates":["longitude/double","latitude/double"]},
"x_coordinate":"double",
"y_coordinate":"double",
// List of ObjectIDs of citizens that voted this incident
"voted_by":[{"$oid":"id"},{"$oid":"id"}],
"total_votes":"int",
// Optional
"current_activity":"string",
"most_recent_action":"string",
// ABANDONED_VEHICLE
"license_plate":"string",
"vehicle_make_model":"string",
"vehicle_color":"string",
"days_of_report_as_parked":"int",
// POTHOLE or GARBAGE_CART
"number_of_elements": "int",
// GRAFFITI
"surface":"string",
//TREE & GRAFFITI
"location":"string",
//RODENT_BAITING
"number_of_premises_baited":"int",
"number_of_premises_w_garbage":"int",
"number_of_premises_w_rats":"int",
// SANITATION_VIOLATION
"nature_of_code_violation":"string"
}
Citizens Collection
{
"_id":{"$oid":"id"},
"name":"string",
"street_address":"string",
"telephone_number":"string",
// List of ObjectIDs of incidents that this user voted
"voted_incidents":[{"$oid":"id"},{"$oid":"id"}],
// The total number of votes of this user
"total_votes":"int",
// List-Set that contains all the wards of the incidents that this user voted
"wards":["int"],
// The total number of wards
"total_wards":"int"
}
Indexes
Incidents Collection
db['incidents'].create_index([('type_of_service_request', pymongo.ASCENDING)])
db['incidents'].create_index([('creation_date', pymongo.ASCENDING)])
db['incidents'].create_index([('type_of_service_request', pymongo.ASCENDING), ('creation_date', pymongo.ASCENDING)])
db['incidents'].create_index([('geo_location', pymongo.GEOSPHERE)])
db['incidents'].create_index([('total_votes', pymongo.ASCENDING)])
Citizens Collection
db['citizens'].create_index([('total_votes', pymongo.ASCENDING)])
db['citizens'].create_index([('total_wards', pymongo.ASCENDING)])
db['citizens'].create_index([('telephone_number', pymongo.ASCENDING)])
db['citizens'].create_index([('name', pymongo.ASCENDING)])
As you can see, the database consists of two main collections, the Incidents collection and the Citizens collection. The Incidents collection stores all the data of the csv files, only the type_of_request is normalized to specific strings because there were found incidents of the same type labeled with different values. The Citizens collection stores data about the people who upvote incidents. Also this table has denormalized data from the incidents collection, such as the wards of the voted incidents, in order to prevent unnesessary joins with the incidents collection. A good example is Query 9, using denormalization the query executes at ~24ms which is much faster in contrast to the equivalent join(lookup) query. This practice though comes with the price of a much more complicated insertion procedure at the Citizens table, but this is not a big deal because this app is intented to have low traffic at insertions and high traffic at quering. Also the Citizens table is populated initially with fake data and distribute random votes to 2M incidents keeping in mind that each citizen should have 1000 votes at the most.
-
Find the total requests per type that were created within a specified time range and sort them in a descending order.
cursor = db['incidents'].aggregate([ { '$match': { 'creation_date': {'$gte': start_date, '$lte': end_date} } }, { '$project': { '_id': 1, 'type_of_service_request': 1 } }, { '$group': { '_id': '$type_of_service_request', 'count': {'$sum': 1} } }, { '$sort': { 'count': -1 } } ])
-
Find the total requests per day for a specific request type and time range.
cursor = db['incidents'].aggregate([ { '$match': { '$and': [ {'type_of_service_request': request_type}, {'creation_date': {'$gte': start_date, '$lte': end_date}} ] } }, { '$project': { '_id': 1, 'creation_date': 1 } }, { '$group': { '_id': '$creation_date', 'count': {'$sum': 1} } }, { '$sort': { '_id': -1 } } ])
-
Find the three most common service requests per zipcode for a specific day.
cursor = db['incidents'].aggregate([ { '$match': {'creation_date': date} }, { '$project': { 'type_of_service_request': 1, 'zip_code': 1 } }, { # Create counts per type and zipcode '$group': { '_id': { 'type_of_service_request': '$type_of_service_request', 'zip_code': '$zip_code' }, 'count': {'$sum': 1} } }, { '$sort': { '_id.zip_code': 1, 'count': -1 } }, { # Group types & counts per zipcode inside an array '$group': { '_id': '$_id.zip_code', 'counts': { '$push': { 'type_of_service_request': '$_id.type_of_service_request', 'count': '$count' } } } }, { # Select first 3 elements of each array '$project': { 'top_three': {'$slice': ['$counts', 3]}} } ])
-
Find the three least common wards with regards to a given service request type.
cursor = db['incidents'].aggregate([ { '$match': { '$and': [ {'type_of_service_request': request_type}, {'ward': {'$exists': 'true'}} # Exclude records with no ward ] } }, { '$project': { 'ward': 1 } }, { '$group': { '_id': '$ward', 'count': {'$sum': 1} } }, { '$sort': { 'count': 1 } }, { '$limit': 3 } ])
-
Find the average completion time per service request for a specific date range.
cursor = db['incidents'].aggregate([ { '$match': { '$and': [ {'creation_date': {'$gte': start_date, '$lte': end_date}}, {'creation_date': {'$exists': 'true'}}, {'completion_date': {'$exists': 'true'}} ] } }, { '$project': { 'creation_date': 1, 'completion_date': 1, 'type_of_service_request': 1 } }, { '$group': { '_id': '$type_of_service_request', 'average_completion_time': { '$avg': { '$subtract': [ '$completion_date', '$creation_date' ] } } } }, { '$sort': { '_id': 1 } } ])
-
Find the most common service request in a specified bounding box for a specific day. You are encouraged to use GeoJSON objects and Geospatial Query Operators.
cursor = db['incidents'].aggregate([ { '$match': { '$and': [ {'creation_date': date}, { 'geo_location.coordinates': { '$geoWithin': { '$box': [ [point_a_longitude, point_a_latitude], [point_b_longitude, point_b_latitude] ] } } } ] } }, { '$project': { 'type_of_service_request': 1 } }, { '$group': { '_id': '$type_of_service_request', 'count': {'$sum': 1} } }, { '$sort': { 'count': -1 } }, { '$limit': 1 } ])
-
Find the fifty most upvoted service requests for a specific day.
cursor = db['incidents'].aggregate([ { '$match': { 'creation_date': date } }, { '$project': { '_id': 1, 'total_votes': 1 } }, { '$sort': { 'total_votes': -1 } }, { '$limit': 50 } ])
-
Find the fifty most active citizens, with regard to the total number of upvotes.
cursor = db['citizens'].aggregate([ { '$project': { '_id': 1, 'total_votes': 1 } }, { '$sort': { 'total_votes': -1 } }, { '$limit': 50 } ])
-
Find the top fifty citizens, with regard to the total number of wards for which they have upvoted an incidents.
cursor = db['citizens'].aggregate([ { '$project': { '_id': 1, 'total_wards': 1 } }, { '$sort': { 'total_wards': -1 } }, { '$limit': 50 } ])
-
Find all incident ids for which the same telephone number has been used for more than one names.
cursor = db['citizens'].aggregate([ { '$group': { '_id': '$telephone_number', 'citizen_ids': {'$push': '$_id'}, 'count': {'$sum': 1} } }, { # If a phone number appears more than once '$match': { '_id': {'$ne': 'null'}, 'count': {'$gt': 1} } }, { # "Self Join' to citizens collection with the IDs we retrieved above '$lookup': { 'from': 'citizens', 'localField': 'citizen_ids', 'foreignField': '_id', 'as': 'same_phone_citizens' } }, { # Expand joined array to documents '$unwind': '$same_phone_citizens' }, { # Group by _id (phone number) and append incident IDs to array '$group': { '_id': '$_id', 'incidents': {'$push': '$same_phone_citizens.voted_incidents'}, } }, { # Deduplicate incident IDs using reduce '$project': { 'incident_ids': { '$reduce': { 'input': '$incidents', "initialValue": [], "in": {"$setUnion": ["$$this", "$$value"]} } } } } ])
-
Find all the wards in which a given name has casted a vote for an incident taking place in it.
cursor = db['citizens'].aggregate([ { '$match': { 'name': name } }, { '$project': { '_id': '$name', 'wards': 1 } } ])