pandas: Convert a list of dictionaries to DataFrame with json_normalize
You can convert a list of dictionaries with shared keys to pandas.DataFrame
with pandas.json_normalize()
.
This format is commonly used in JSON obtained from Web API, so converting it to pandas.DataFrame
is very useful.
This article describes the following contents.
- Converting with
pandas.DataFrame()
- Basic usage of
pandas.json_normalize()
- Handling complex cases: :
record_path
,meta
- Reading a part of JSON strings or files
Use pandas.read_json()
to directly read JSON strings or files as pandas.DataFrame
instead of an object consisting of a dictionary or list.
Note that pandas.json_normalize()
is available since pandas 1.0.0
. In previous versions, it was provided as pandas.io.json.json_normalize()
.
- What’s new in 1.0.0 (January 29, 2020) - Deprecations — pandas 1.0.0 documentation
- Move json_normalize to pd namespace · Issue #27586 · pandas-dev/pandas
Converting with pandas.DataFrame()
Use the following list of dictionaries as an example.
import pandas as pd
l_simple = [{'name': 'Alice', 'age': 25},
{'name': 'Bob'}]
Such a simple list of dictionaries can be converted with pandas.DataFrame()
.
The dictionary key is treated as the column label, and the element when the key does not exist is treated as the missing value NaN
.
print(pd.DataFrame(l_simple))
# name age
# 0 Alice 25.0
# 1 Bob NaN
pandas.json_normalize()
returns the same result.
print(pd.json_normalize(l_simple))
# name age
# 0 Alice 25.0
# 1 Bob NaN
Basic usage of pandas.json_normalize()
Use the following list of nested dictionaries as an example.
l_nested = [{'name': 'Alice', 'age': 25, 'id': {'x': 2, 'y': 8}},
{'name': 'Bob', 'id': {'x': 10, 'y': 4}}]
pandas.DataFrame()
converts the nested dictionaries directly as elements.
print(pd.DataFrame(l_nested))
# name age id
# 0 Alice 25.0 {'x': 2, 'y': 8}
# 1 Bob NaN {'x': 10, 'y': 4}
pandas.json_normalize()
converts the nested dictionaries into separate columns for each key.
print(pd.json_normalize(l_nested))
# name age id.x id.y
# 0 Alice 25.0 2 8
# 1 Bob NaN 10 4
By default, the nested parts have column names in the format <parent key>.<child key>
. This separator .
can be changed with the sep
argument.
print(pd.json_normalize(l_nested, sep='_'))
# name age id_x id_y
# 0 Alice 25.0 2 8
# 1 Bob NaN 10 4
Handling complex cases: : record_path
, meta
Consider the following example, where dictionary values are lists of dictionaries.
l_complex = [{'label': 'X',
'info' : {'n': 'nx', 'm': 'mx'},
'data': [{'a': 1, 'b': 2},
{'a': 3, 'b': 4}]},
{'label': 'Y',
'info' : {'n': 'ny', 'm': 'my'},
'data': [{'a': 10, 'b': 20},
{'a': 30, 'b': 40}]}]
By default, the list of dictionaries is treated as an element.
print(pd.json_normalize(l_complex))
# label data info.n info.m
# 0 X [{'a': 1, 'b': 2}, {'a': 3, 'b': 4}] nx mx
# 1 Y [{'a': 10, 'b': 20}, {'a': 30, 'b': 40}] ny my
By specifying a key with the record_path
argument, only the corresponding values are converted. You can add a prefix to the column names with the record_prefix
argument.
print(pd.json_normalize(l_complex, record_path='data'))
# a b
# 0 1 2
# 1 3 4
# 2 10 20
# 3 30 40
print(pd.json_normalize(l_complex, record_path='data', record_prefix='data_'))
# data_a data_b
# 0 1 2
# 1 3 4
# 2 10 20
# 3 30 40
If you want to convert other keys' values, specify them with the meta
argument. You can add a prefix to the column names with the meta_prefix
argument.
print(pd.json_normalize(l_complex, record_path='data',
meta='label'))
# a b label
# 0 1 2 X
# 1 3 4 X
# 2 10 20 Y
# 3 30 40 Y
print(pd.json_normalize(l_complex, record_path='data',
meta='label', meta_prefix='meta_'))
# a b meta_label
# 0 1 2 X
# 1 3 4 X
# 2 10 20 Y
# 3 30 40 Y
If the value of the key specified in meta
is a dictionary, you can specify the child key using the list [[<parent key>, <child key>], ...]
. By default, the column names are <parent key>.<child key>
, but you can change the separator using the sep
argument.
print(pd.json_normalize(l_complex, record_path='data',
meta='info'))
# a b info
# 0 1 2 {'n': 'nx', 'm': 'mx'}
# 1 3 4 {'n': 'nx', 'm': 'mx'}
# 2 10 20 {'n': 'ny', 'm': 'my'}
# 3 30 40 {'n': 'ny', 'm': 'my'}
print(pd.json_normalize(l_complex, record_path='data',
meta=[['info', 'n'], ['info', 'm']]))
# a b info.n info.m
# 0 1 2 nx mx
# 1 3 4 nx mx
# 2 10 20 ny my
# 3 30 40 ny my
print(pd.json_normalize(l_complex, record_path='data',
meta=[['info', 'n'], ['info', 'm']],
sep='_'))
# a b info_n info_m
# 0 1 2 nx mx
# 1 3 4 nx mx
# 2 10 20 ny my
# 3 30 40 ny my
To convert all elements in this example to pandas.DataFrame
, configure it as follows:
print(pd.json_normalize(l_complex, record_path='data',
meta=['label', ['info', 'n'], ['info', 'm']],
sep='_'))
# a b label info_n info_m
# 0 1 2 X nx mx
# 1 3 4 X nx mx
# 2 10 20 Y ny my
# 3 30 40 Y ny my
Be careful to use a list of lists, like meta=[[<parent key>, <child key>]]
, even when specifying a single child key. An error occurs if meta=[<parent key>, <child key>]
.
print(pd.json_normalize(l_complex, record_path='data',
meta=[['info', 'n']]))
# a b info.n
# 0 1 2 nx
# 1 3 4 nx
# 2 10 20 ny
# 3 30 40 ny
# print(pd.json_normalize(l_complex, record_path='data',
# meta=['info', 'n']))
# KeyError: "Try running with errors='ignore' as key 'n' is not always present"
Reading a part of JSON strings or files
As mentioned above, the list of dictionaries is a common format in JSON obtained from Web API. However, it is rarely possible to obtain data in this format directly, and you often need to extract part of the JSON string or file.
Use the following JSON-formatted string as an example:
import pandas as pd
import json
s = '{"OTHER": "x", "DATA": [{"name":"Alice","age":25},{"name":"Bob","age":42}]}'
pandas.read_json()
is a function to convert JSON strings or files to pandas.DataFrame
, but if the dictionaries are nested, the dictionaries are treated as elements.
print(pd.read_json(s))
# OTHER DATA
# 0 x {'name': 'Alice', 'age': 25}
# 1 x {'name': 'Bob', 'age': 42}
To use pandas.json_normalize()
, first convert the JSON string to objects consisting of dictionaries and lists with json.loads()
in the json
module of the standard library. Use json.load()
to read a JSON file.
d = json.loads(s)
print(d)
# {'OTHER': 'x', 'DATA': [{'name': 'Alice', 'age': 25}, {'name': 'Bob', 'age': 42}]}
print(type(d))
# <class 'dict'>
By correctly setting the record_path
and meta
arguments, you can convert the JSON string to pandas.DataFrame
with pandas.json_normalize()
.
print(pd.json_normalize(d))
# OTHER DATA
# 0 x [{'name': 'Alice', 'age': 25}, {'name': 'Bob',...
print(pd.json_normalize(d, record_path='DATA'))
# name age
# 0 Alice 25
# 1 Bob 42
print(pd.json_normalize(d, record_path='DATA', meta='OTHER'))
# name age OTHER
# 0 Alice 25 x
# 1 Bob 42 x
If you only want to convert a part of the JSON string to pandas.DataFrame
, you can extract the desired part from the object. For deeply nested structures, repeat the key names like [key name][key name]
.
As mentioned above, a list of non-nested dictionaries can also be converted with pandas.DataFrame()
.
print(d['DATA'])
# [{'name': 'Alice', 'age': 25}, {'name': 'Bob', 'age': 42}]
print(type(d['DATA']))
# <class 'list'>
print(pd.DataFrame(d['DATA']))
# name age
# 0 Alice 25
# 1 Bob 42
print(pd.json_normalize(d['DATA']))
# name age
# 0 Alice 25
# 1 Bob 42