pandas: Convert a list of dictionaries to DataFrame with json_normalize

Posted: | Tags: Python, pandas, JSON

You can convert a list of dictionaries with shared keys to pandas.DataFrame with pandas.json_normalize().

This format is commonly used in JSON obtained from Web API, so converting it to pandas.DataFrame is very useful.

This article describes the following contents.

  • Converting with pandas.DataFrame()
  • Basic usage of pandas.json_normalize()
  • Handling complex cases: : record_path, meta
  • Reading a part of JSON strings or files

Use pandas.read_json() to directly read JSON strings or files as pandas.DataFrame instead of an object consisting of a dictionary or list.

Note that pandas.json_normalize() is available since pandas 1.0.0. In previous versions, it was provided as pandas.io.json.json_normalize().

Converting with pandas.DataFrame()

Use the following list of dictionaries as an example.

import pandas as pd

l_simple = [{'name': 'Alice', 'age': 25},
            {'name': 'Bob'}]

Such a simple list of dictionaries can be converted with pandas.DataFrame().

The dictionary key is treated as the column label, and the element when the key does not exist is treated as the missing value NaN.

print(pd.DataFrame(l_simple))
#     name   age
# 0  Alice  25.0
# 1    Bob   NaN

pandas.json_normalize() returns the same result.

print(pd.json_normalize(l_simple))
#     name   age
# 0  Alice  25.0
# 1    Bob   NaN

Basic usage of pandas.json_normalize()

Use the following list of nested dictionaries as an example.

l_nested = [{'name': 'Alice', 'age': 25, 'id': {'x': 2, 'y': 8}},
            {'name': 'Bob', 'id': {'x': 10, 'y': 4}}]

pandas.DataFrame() converts the nested dictionaries directly as elements.

print(pd.DataFrame(l_nested))
#     name   age                 id
# 0  Alice  25.0   {'x': 2, 'y': 8}
# 1    Bob   NaN  {'x': 10, 'y': 4}

pandas.json_normalize() converts the nested dictionaries into separate columns for each key.

print(pd.json_normalize(l_nested))
#     name   age  id.x  id.y
# 0  Alice  25.0     2     8
# 1    Bob   NaN    10     4

By default, the nested parts have column names in the format <parent key>.<child key>. This separator . can be changed with the sep argument.

print(pd.json_normalize(l_nested, sep='_'))
#     name   age  id_x  id_y
# 0  Alice  25.0     2     8
# 1    Bob   NaN    10     4

Handling complex cases: : record_path, meta

Consider the following example, where dictionary values are lists of dictionaries.

l_complex = [{'label': 'X',
              'info' : {'n': 'nx', 'm': 'mx'},
              'data': [{'a': 1, 'b': 2},
                       {'a': 3, 'b': 4}]},
             {'label': 'Y',
              'info' : {'n': 'ny', 'm': 'my'},
              'data': [{'a': 10, 'b': 20},
                       {'a': 30, 'b': 40}]}]

By default, the list of dictionaries is treated as an element.

print(pd.json_normalize(l_complex))
#   label                                      data info.n info.m
# 0     X      [{'a': 1, 'b': 2}, {'a': 3, 'b': 4}]     nx     mx
# 1     Y  [{'a': 10, 'b': 20}, {'a': 30, 'b': 40}]     ny     my

By specifying a key with the record_path argument, only the corresponding values are converted. You can add a prefix to the column names with the record_prefix argument.

print(pd.json_normalize(l_complex, record_path='data'))
#     a   b
# 0   1   2
# 1   3   4
# 2  10  20
# 3  30  40

print(pd.json_normalize(l_complex, record_path='data', record_prefix='data_'))
#    data_a  data_b
# 0       1       2
# 1       3       4
# 2      10      20
# 3      30      40

If you want to convert other keys' values, specify them with the meta argument. You can add a prefix to the column names with the meta_prefix argument.

print(pd.json_normalize(l_complex, record_path='data',
                        meta='label'))
#     a   b label
# 0   1   2     X
# 1   3   4     X
# 2  10  20     Y
# 3  30  40     Y

print(pd.json_normalize(l_complex, record_path='data',
                        meta='label', meta_prefix='meta_'))
#     a   b meta_label
# 0   1   2          X
# 1   3   4          X
# 2  10  20          Y
# 3  30  40          Y

If the value of the key specified in meta is a dictionary, you can specify the child key using the list [[<parent key>, <child key>], ...]. By default, the column names are <parent key>.<child key>, but you can change the separator using the sep argument.

print(pd.json_normalize(l_complex, record_path='data',
                        meta='info'))
#     a   b                    info
# 0   1   2  {'n': 'nx', 'm': 'mx'}
# 1   3   4  {'n': 'nx', 'm': 'mx'}
# 2  10  20  {'n': 'ny', 'm': 'my'}
# 3  30  40  {'n': 'ny', 'm': 'my'}

print(pd.json_normalize(l_complex, record_path='data',
                        meta=[['info', 'n'], ['info', 'm']]))
#     a   b info.n info.m
# 0   1   2     nx     mx
# 1   3   4     nx     mx
# 2  10  20     ny     my
# 3  30  40     ny     my

print(pd.json_normalize(l_complex, record_path='data',
                        meta=[['info', 'n'], ['info', 'm']],
                        sep='_'))
#     a   b info_n info_m
# 0   1   2     nx     mx
# 1   3   4     nx     mx
# 2  10  20     ny     my
# 3  30  40     ny     my

To convert all elements in this example to pandas.DataFrame, configure it as follows:

print(pd.json_normalize(l_complex, record_path='data',
                        meta=['label', ['info', 'n'], ['info', 'm']],
                        sep='_'))
#     a   b label info_n info_m
# 0   1   2     X     nx     mx
# 1   3   4     X     nx     mx
# 2  10  20     Y     ny     my
# 3  30  40     Y     ny     my

Be careful to use a list of lists, like meta=[[<parent key>, <child key>]], even when specifying a single child key. An error occurs if meta=[<parent key>, <child key>].

print(pd.json_normalize(l_complex, record_path='data',
                        meta=[['info', 'n']]))
#     a   b info.n
# 0   1   2     nx
# 1   3   4     nx
# 2  10  20     ny
# 3  30  40     ny

# print(pd.json_normalize(l_complex, record_path='data',
#                         meta=['info', 'n']))
# KeyError: "Try running with errors='ignore' as key 'n' is not always present"

Reading a part of JSON strings or files

As mentioned above, the list of dictionaries is a common format in JSON obtained from Web API. However, it is rarely possible to obtain data in this format directly, and you often need to extract part of the JSON string or file.

Use the following JSON-formatted string as an example:

import pandas as pd
import json

s = '{"OTHER": "x", "DATA": [{"name":"Alice","age":25},{"name":"Bob","age":42}]}'

pandas.read_json() is a function to convert JSON strings or files to pandas.DataFrame, but if the dictionaries are nested, the dictionaries are treated as elements.

print(pd.read_json(s))
#   OTHER                          DATA
# 0     x  {'name': 'Alice', 'age': 25}
# 1     x    {'name': 'Bob', 'age': 42}

To use pandas.json_normalize(), first convert the JSON string to objects consisting of dictionaries and lists with json.loads() in the json module of the standard library. Use json.load() to read a JSON file.

d = json.loads(s)
print(d)
# {'OTHER': 'x', 'DATA': [{'name': 'Alice', 'age': 25}, {'name': 'Bob', 'age': 42}]}

print(type(d))
# <class 'dict'>

By correctly setting the record_path and meta arguments, you can convert the JSON string to pandas.DataFrame with pandas.json_normalize().

print(pd.json_normalize(d))
#   OTHER                                               DATA
# 0     x  [{'name': 'Alice', 'age': 25}, {'name': 'Bob',...

print(pd.json_normalize(d, record_path='DATA'))
#     name  age
# 0  Alice   25
# 1    Bob   42

print(pd.json_normalize(d, record_path='DATA', meta='OTHER'))
#     name  age OTHER
# 0  Alice   25     x
# 1    Bob   42     x

If you only want to convert a part of the JSON string to pandas.DataFrame, you can extract the desired part from the object. For deeply nested structures, repeat the key names like [key name][key name].

As mentioned above, a list of non-nested dictionaries can also be converted with pandas.DataFrame().

print(d['DATA'])
# [{'name': 'Alice', 'age': 25}, {'name': 'Bob', 'age': 42}]

print(type(d['DATA']))
# <class 'list'>

print(pd.DataFrame(d['DATA']))
#     name  age
# 0  Alice   25
# 1    Bob   42

print(pd.json_normalize(d['DATA']))
#     name  age
# 0  Alice   25
# 1    Bob   42

Related Categories

Related Articles