Skip to content

Latest commit

 

History

History
776 lines (622 loc) · 16.4 KB

README.md

File metadata and controls

776 lines (622 loc) · 16.4 KB

JSON

Introduction

In this lesson, you'll continue investigating new formats for data. Specifically, you'll investigate one of the most popular data formats for the web: JSON files.

Objectives

You will be able to:

  • Describe features of the JSON format and the Python json module
  • Use Python to load and parse JSON documents

JSON Format

JSON stands for JavaScript Object Notation. Similar to CSV, JSON is a plain text data format. However the structure of JSON — based on the syntax of JavaScript — is more complex.

Here's a brief preview of a JSON file:

As you can see, JSON is not a tabular format with one set of rows and one set of columns. JSON files are often nested in a hierarchical structure and will have data structures analogous to Python dictionaries and lists. Here's all of the built-in supported data types in JSON and their counterparts in Python:

json Module

In theory we could write our own custom code to split that string on {, ", : etc. and parse the contents of the file into the appropriate Python data structures.

Instead, we'll go ahead and use a pre-built Python module designed for this purpose. It will give us a powerful starting point for accessing and manipulating the data in JSON files. This module is called json.

You can find full documentation for this module here.

To use the json module, start by importing it:

import json

json.load

To load data from a JSON file, you first open the file using Python's built-in open function. Then you pass the file object to the json.load function, which returns a Python object representing the contents of the file.

In the cell below, we open the campaign finance JSON file previewed above:

with open('nyc_2001_campaign_finance.json') as f:
    data = json.load(f)
print(type(data))
<class 'dict'>

As you can see, this loaded the data as a dictionary. You can begin to investigate the contents of a JSON file by using our traditional Python methods.

Parsing a JSON File

Since we have a dictionary, check its keys:

data.keys()
dict_keys(['meta', 'data'])

Investigate what data types are stored within the values associated with those keys:

for v in data.values():
    print(type(v))
<class 'dict'>
<class 'list'>

Parsing Metadata

Then we can dig a level deeper. What are the keys of the nested dictionary?

data['meta'].keys()
dict_keys(['view'])

And what is the type of the value associated with that key?

type(data['meta']['view'])
dict

Again, what are the keys of that twice-nested dictionary?

data['meta']['view'].keys()
dict_keys(['id', 'name', 'attribution', 'averageRating', 'category', 'createdAt', 'description', 'displayType', 'downloadCount', 'hideFromCatalog', 'hideFromDataJson', 'indexUpdatedAt', 'newBackend', 'numberOfComments', 'oid', 'provenance', 'publicationAppendEnabled', 'publicationDate', 'publicationGroup', 'publicationStage', 'rowClass', 'rowsUpdatedAt', 'rowsUpdatedBy', 'tableId', 'totalTimesRated', 'viewCount', 'viewLastModified', 'viewType', 'columns', 'grants', 'metadata', 'owner', 'query', 'rights', 'tableAuthor', 'tags', 'flags'])

That is a lot of keys. One way we might try to view all of that information is using the pandas package to make a table.

import pandas as pd
pd.set_option("max_colwidth", 120)
pd.DataFrame(
    data=data['meta']['view'].values(),
    index=data['meta']['view'].keys(),
    columns=["value"]
)
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
value
id 8dhd-zvi6
name 2001 Campaign Payments
attribution Campaign Finance Board (CFB)
averageRating 0
category City Government
createdAt 1315950830
description A listing of public funds payments for candidates for City office during the 2001 election cycle
displayType table
downloadCount 1470
hideFromCatalog False
hideFromDataJson False
indexUpdatedAt 1536596254
newBackend False
numberOfComments 0
oid 4140996
provenance official
publicationAppendEnabled False
publicationDate 1371845179
publicationGroup 240370
publicationStage published
rowClass
rowsUpdatedAt 1371845177
rowsUpdatedBy 5fuc-pqz2
tableId 932968
totalTimesRated 0
viewCount 233
viewLastModified 1536605717
viewType tabular
columns [{'id': -1, 'name': 'sid', 'dataTypeName': 'meta_data', 'fieldName': ':sid', 'position': 0, 'renderTypeName': 'meta_...
grants [{'inherited': False, 'type': 'viewer', 'flags': ['public']}]
metadata {'rdfSubject': '0', 'rdfClass': '', 'attachments': [{'filename': 'Data_Dictionary_Public_Funds_Payments_FINAL.xlsx',...
owner {'id': '5fuc-pqz2', 'displayName': 'NYC OpenData', 'profileImageUrlLarge': '/api/users/5fuc-pqz2/profile_images/LARG...
query {}
rights [read]
tableAuthor {'id': '5fuc-pqz2', 'displayName': 'NYC OpenData', 'profileImageUrlLarge': '/api/users/5fuc-pqz2/profile_images/LARG...
tags [finance, campaign finance board, cfb, nyccfb, campaign finance, elections, contributions, politics, campaign, funding]
flags [default, restorable, restorePossibleForType]

So, it looks like the information under the meta key is essentially all of the metadata about the dataset, including the category, attribution, tags, etc.

Now let's look at the main data.

Parsing Data

This time, let's look at the value associated with the data key. Recall that we previously identified that this had a list data type, so let's look at the length:

len(data['data'])
285

Now let's look at a couple different values:

data['data'][0]
[1,
 'E3E9CC9F-7443-43F6-94AF-B5A0F802DBA1',
 1,
 1315925633,
 '392904',
 1315925633,
 '392904',
 '{\n  "invalidCells" : {\n    "1519001" : "TOTALPAY",\n    "1518998" : "PRIMARYPAY",\n    "1519000" : "RUNOFFPAY",\n    "1518999" : "GENERALPAY",\n    "1518994" : "OFFICECD",\n    "1518996" : "OFFICEDIST",\n    "1518991" : "ELECTION"\n  }\n}',
 None,
 'CANDID',
 'CANDNAME',
 None,
 'OFFICEBORO',
 None,
 'CANCLASS',
 None,
 None,
 None,
 None]
data['data'][1]
[2,
 '9D257416-581A-4C42-85CC-B6EAD9DED97F',
 2,
 1315925633,
 '392904',
 1315925633,
 '392904',
 '{\n}',
 '2001',
 'B4',
 'Aboulafia, Sandy',
 '5',
 None,
 '44',
 'P',
 '45410.00',
 '0',
 '0',
 '45410.00']
data['data'][2]
[3,
 'B80D7891-93CF-49E8-86E8-182B618E68F2',
 3,
 1315925633,
 '392904',
 1315925633,
 '392904',
 '{\n}',
 '2001',
 '445',
 'Adams, Jackie R',
 '5',
 None,
 '7',
 'P',
 '11073.00',
 '0',
 '0',
 '11073.00']

This looks more like some kind of tabular data, where the first (0-th) row is some kind of header. Again, let's use pandas to make this into a more-readable table format:

pd.DataFrame(data['data'])
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
0 1 E3E9CC9F-7443-43F6-94AF-B5A0F802DBA1 1 1315925633 392904 1315925633 392904 {\n "invalidCells" : {\n "1519001" : "TOTALPAY",\n "1518998" : "PRIMARYPAY",\n "1519000" : "RUNOFFPAY",\n ... None CANDID CANDNAME None OFFICEBORO None CANCLASS None None None None
1 2 9D257416-581A-4C42-85CC-B6EAD9DED97F 2 1315925633 392904 1315925633 392904 {\n} 2001 B4 Aboulafia, Sandy 5 None 44 P 45410.00 0 0 45410.00
2 3 B80D7891-93CF-49E8-86E8-182B618E68F2 3 1315925633 392904 1315925633 392904 {\n} 2001 445 Adams, Jackie R 5 None 7 P 11073.00 0 0 11073.00
3 4 BB012003-78F5-406D-8A87-7FF8A425EE3F 4 1315925633 392904 1315925633 392904 {\n} 2001 HF Addabbo, Joseph P 5 None 32 P 75350.00 73970.00 0 149320.00
4 5 945825F9-2F5D-47C2-A16B-75B93E61E1AD 5 1315925633 392904 1315925633 392904 {\n} 2001 IR Alamo-Estrada, Agustin 5 None 14 P 25000.00 2400.00 0 27400.00
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
280 281 C50E6A4C-BDE9-4F12-97F4-95D467013540 281 1315925633 392904 1315925633 392904 {\n} 2001 537 Wilson, John H 5 None 13 P 0 0 0 0
281 282 04C6D19F-FF63-47B0-B26D-3B8F98B4C16B 282 1315925633 392904 1315925633 392904 {\n} 2001 559 Wooten, Donald T 5 None 42 P 0 0 0 0
282 283 A451E0E9-D382-4A97-AAD8-D7D382055F8D 283 1315925633 392904 1315925633 392904 {\n} 2001 280 Yassky, David 5 None 33 P 75350.00 75350.00 0 150700.00
283 284 E84BCD0C-D6F4-450F-B55B-3199A265C781 284 1315925633 392904 1315925633 392904 {\n} 2001 274 Zapiti, Mike 5 None 22 P 12172.00 0 0 12172.00
284 285 5BBC9676-2119-4FB5-9DAB-DE3F71B7681A 285 1315925633 392904 1315925633 392904 {\n} 2001 442 Zett, Lori M 5 None 24 P 0 0 0 0

285 rows × 19 columns

We still have some work to do to understand what all of the columns are supposed to mean, but now we have a general sense of what the data looks like.

Extracting a Value from a JSON File

Now, let's say that our task is:

Extract the description of the dataset

We know from our initial exploration that this JSON file contains meta and data, and that meta has this kind of high-level information whereas data has the actual records relating to campaign finance.

Let's look at the keys of meta again:

data['meta']['view'].keys()
dict_keys(['id', 'name', 'attribution', 'averageRating', 'category', 'createdAt', 'description', 'displayType', 'downloadCount', 'hideFromCatalog', 'hideFromDataJson', 'indexUpdatedAt', 'newBackend', 'numberOfComments', 'oid', 'provenance', 'publicationAppendEnabled', 'publicationDate', 'publicationGroup', 'publicationStage', 'rowClass', 'rowsUpdatedAt', 'rowsUpdatedBy', 'tableId', 'totalTimesRated', 'viewCount', 'viewLastModified', 'viewType', 'columns', 'grants', 'metadata', 'owner', 'query', 'rights', 'tableAuthor', 'tags', 'flags'])

Ok, description is the 7th one. Let's pull the value associated with the description key:

data['meta']['view']['description']
'A listing of public funds payments for candidates for City office during the 2001 election cycle'

This is the general process you will use when extracting information from a JSON file.

Summary

As you can see, there's a lot going on here with the deeply nested structure of JSON data files.