Streamlined Data Ingestion With Pandas, Chapter 4
Amany Mahfouz
Instructor
JavaScript Object Notation (JSON)
Common web data format
Not tabular
Records don't have to all have the same set of attributes
{
"columns": [
"age_adjusted_death_rate",
"death_rate",
"deaths",
"leading_cause",
"race_ethnicity",
"sex",
"year"
],
"index": [...],
"data": [
[
"7.6",
[5 rows x 7 columns]
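The fragment above has "columns", "index", and "data" keys, which matches what pandas calls the "split" orientation. A minimal sketch of loading such a layout, and of how records with differing attributes are handled, might look like this. The JSON strings and values below are made up for illustration and are not the course dataset.

```python
from io import StringIO

import pandas as pd

# Hypothetical split-oriented JSON: column names, index labels, and rows
# are stored under separate keys.
split_json = """
{
  "columns": ["leading_cause", "sex", "year", "deaths"],
  "index": [0, 1],
  "data": [
    ["Example cause A", "F", 2010, 12],
    ["Example cause B", "M", 2011, 7]
  ]
}
"""
death_causes = pd.read_json(StringIO(split_json), orient="split")
print(death_causes.shape)

# Hypothetical record-oriented JSON: the second record has no "rating"
# attribute, so pandas fills the gap with NaN.
records_json = '[{"name": "A", "rating": 4.5}, {"name": "B"}]'
ragged = pd.read_json(StringIO(records_json), orient="records")
print(ragged)
```

Wrapping the string in StringIO keeps the call compatible with recent pandas versions, which discourage passing literal JSON strings directly.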
Application Programming Interfaces
Defines how an application communicates with other programs
Way to get data from an application without knowing database details
requests.get() keyword arguments
params keyword: takes a dictionary of parameters and values to customize API request
headers keyword: takes a dictionary, can be used to provide user authentication to API
api_url = "https://api.yelp.com/v3/businesses/search"
# Set up parameter dictionary according to documentation
params = {"term": "bookstore",
          "location": "San Francisco"}
# Set up header dictionary w/ API key according to documentation
headers = {"Authorization": "Bearer {}".format(api_key)}
[2 rows x 16 columns]
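To see what the params and headers dictionaries actually do, the sketch below builds the request with the requests library without sending it, so no network access or real key is needed. The use of requests and the placeholder value of api_key are assumptions made for illustration.

```python
import requests

api_url = "https://api.yelp.com/v3/businesses/search"
api_key = "YOUR_API_KEY"  # placeholder, not a real key

params = {"term": "bookstore", "location": "San Francisco"}
headers = {"Authorization": "Bearer {}".format(api_key)}

# Prepare (but do not send) the request to inspect what would go out
prepared = requests.Request(
    "GET", api_url, params=params, headers=headers
).prepare()

print(prepared.url)                       # params become the query string
print(prepared.headers["Authorization"])  # headers carry the API key

# With a valid key, the request could then be sent and parsed:
# response = requests.get(api_url, params=params, headers=headers)
# data = response.json()
```

The parameters end up in the URL's query string, while the key travels in the request headers rather than in the URL.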
Nested JSONs
JSONs contain objects with attribute-value pairs
A JSON is nested when the value itself is an object
categories \
0 [{'alias': 'bookstores', 'title': 'Bookstores'}]
1 [{'alias': 'bookstores', 'title': 'Bookstores'...
2 [{'alias': 'bookstores', 'title': 'Bookstores'}]
coordinates \
0 {'latitude': 37.7975997924805, 'longitude': -1...
1 {'latitude': 37.7885846793652, 'longitude': -1...
2 {'latitude': 37.7589836120605, 'longitude': -1...
location
0 {'address1': '261 Columbus Ave', 'address2': '...
1 {'address1': '50 2nd St', 'address2': '', 'add...
2 {'address1': '866 Valencia St', 'address2': ''...
json_normalize()
Takes a dictionary/list of dictionaries (like pd.DataFrame() does)
['alias',
'categories',
'coordinates_latitude',
'coordinates_longitude',
...
'location_address1',
'location_address2',
'location_address3',
'location_city',
'location_country',
'location_display_address',
'location_state',
'location_zip_code',
...
'url']
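A small sketch of how column names like those above can arise: pd.DataFrame() leaves nested objects as dictionaries in a single column, while pd.json_normalize() with sep="_" expands them into one column per nested attribute. The business records below are invented for illustration.

```python
import pandas as pd

# Hypothetical records shaped like the nested API response
businesses = [
    {
        "alias": "example-books-one",
        "coordinates": {"latitude": 37.79, "longitude": -122.40},
        "location": {"address1": "1 Example St", "city": "San Francisco"},
    },
    {
        "alias": "example-books-two",
        "coordinates": {"latitude": 37.75, "longitude": -122.42},
        "location": {"address1": "2 Example Ave", "city": "San Francisco"},
    },
]

# Nested objects stay as dictionaries inside single columns
nested = pd.DataFrame(businesses)
print(nested.columns.tolist())

# json_normalize flattens them, joining parent and child names with sep
flat = pd.json_normalize(businesses, sep="_")
print(sorted(flat.columns))
```

Choosing sep="_" instead of the default "." keeps the flattened names usable with attribute access, such as flat.location_city.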
biz_coordinates_longitude
0 -122.406578
1 -122.400631
2 -122.400631
3 -122.421638
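The biz_ prefix and the repeated longitude values in the output above are consistent with flattening a nested list using record_path, meta, and meta_prefix, although the slide text does not show the call, so treat that as an assumption. A minimal sketch with invented records:

```python
import pandas as pd

# Hypothetical records: each business has a list of category objects
businesses = [
    {
        "name": "Example Books One",
        "coordinates": {"latitude": 37.79, "longitude": -122.40},
        "categories": [
            {"alias": "bookstores", "title": "Bookstores"},
            {"alias": "cafes", "title": "Cafes"},
        ],
    },
    {
        "name": "Example Books Two",
        "coordinates": {"latitude": 37.75, "longitude": -122.42},
        "categories": [{"alias": "bookstores", "title": "Bookstores"}],
    },
]

# One output row per category; business-level fields are repeated as
# metadata and prefixed so they do not clash with category attributes.
categories = pd.json_normalize(
    businesses,
    sep="_",
    record_path="categories",
    meta=["name", ["coordinates", "longitude"]],
    meta_prefix="biz_",
)
print(categories)
```

A business with two categories contributes two rows, which is why its longitude appears twice in the result.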
Concatenating
Use case: adding rows from one dataframe to another
concat()
pandas function
Syntax: pd.concat([df1, df2])
first_20_bookstores = pd.json_normalize(first_results["businesses"],
                                        sep="_")
print(first_20_bookstores.shape)
(20, 24)
next_20_bookstores = pd.json_normalize(next_results["businesses"],
                                       sep="_")
print(next_20_bookstores.shape)
(20, 24)
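Stacking two result pages like these can be sketched as follows. The small dataframes stand in for the two pages of API results and are invented for illustration; ignore_index=True is an optional pd.concat() argument that renumbers the rows instead of keeping each page's original index.

```python
import pandas as pd

# Hypothetical stand-ins for two pages of results
first_page = pd.DataFrame({"name": ["Shop A", "Shop B"], "rating": [4.5, 4.0]})
next_page = pd.DataFrame({"name": ["Shop C", "Shop D"], "rating": [3.5, 5.0]})

# Default: original row labels are kept, so 0 and 1 appear twice
stacked = pd.concat([first_page, next_page])
print(stacked.index.tolist())

# ignore_index=True gives the combined dataframe a fresh 0..n-1 index
bookstores = pd.concat([first_page, next_page], ignore_index=True)
print(bookstores.shape)
print(bookstores.index.tolist())
```

Renumbering avoids duplicate index labels, which can otherwise make label-based selection return more than one row.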
df.merge() arguments
Second dataframe to merge
Columns to merge on
on, if the column names are the same in both dataframes
created_date call_counts
0 01/01/2018 4597
1 01/02/2018 4362
2 01/03/2018 3045
3 01/04/2018 3374
4 01/05/2018 4333
weather.head()
Default merge() behavior: return only rows whose key values appear in both dataframes (an inner join)
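The default behavior can be sketched with two small dataframes. The call counts reuse a few rows from the output above; the weather dataframe, its column names ("date", "tmax"), and its values are hypothetical. Because the key columns are named differently here, the sketch uses left_on and right_on rather than on.

```python
import pandas as pd

call_counts = pd.DataFrame({
    "created_date": ["01/01/2018", "01/02/2018", "01/03/2018"],
    "call_counts": [4597, 4362, 3045],
})

# Hypothetical weather data; column names and values are made up
weather = pd.DataFrame({
    "date": ["01/02/2018", "01/03/2018", "01/04/2018"],
    "tmax": [26, 30, 29],
})

# Default how="inner": only dates present in both dataframes are kept
merged = call_counts.merge(weather, left_on="created_date", right_on="date")
print(merged)
```

Only 01/02/2018 and 01/03/2018 survive, since 01/01/2018 has no weather row and 01/04/2018 has no call count. Both key columns must hold the same data type (strings here) for values to match.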
Recap
Chapters 1 and 2
read_csv() and read_excel()