How to Normalize Complex JSON Data

PYTED量化交易研究所

Last modified: 2023-06-27 17:25:13

License: CC 4.0 BY-SA

First published: 2021-07-29 17:32:34

Article link: https://blog.csdn.net/kzl_knight/article/details/119208441

How to normalize complex JSON data: a hands-on walkthrough by a data analysis practitioner

Preface: Project Structure and Download

[Figure: project structure (image not available)]

Download:

Link: https://pan.baidu.com/s/1ZRHf969-A1pg59VqPsDWzA

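1 The data conversion requirement

Convert the target data into the format we want.

Target data URL: https://static-data.eol.cn/www/2.0/school/102/dic/specialplan.json

(At the time of writing, this request has no anti-scraping protection.)

Original data:

[Figure: screenshot of the original data (image not available)]

Target format:

[Figure: screenshot of the target format (image not available)]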

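2 The normalization rules

Define the data scope

[Figure: the part of the data in scope (image not available)]

To make the data easier to browse, I saved the JSON data in MongoDB.

[Figure: the JSON data viewed in MongoDB (image not available)]

Required fields:
[Figure: the required fields (image not available)]

The sample below makes the conversion rules easier to follow:

Original data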

原数据 = [{
    'year': 2021,
    'province': [
        {
            'pid': 35,
            'type': [0, 1],
            'batch': [2, 3],
            'batch_group': {
                '14': [
                    {'id': 4},
                    {'id': 5},
                ],
            },
            'first': ''
        }
    ]
},]

Normalized data

结果 = [
    {'year': 2021, 'pid': 35, 'type': 0, 'batch': 2, 'batch_group': 4, 'first': None},
    {'year': 2021, 'pid': 35, 'type': 0, 'batch': 2, 'batch_group': 5, 'first': None},
    {'year': 2021, 'pid': 35, 'type': 0, 'batch': 3, 'batch_group': 4, 'first': None},
    {'year': 2021, 'pid': 35, 'type': 0, 'batch': 3, 'batch_group': 5, 'first': None},
    {'year': 2021, 'pid': 35, 'type': 1, 'batch': 2, 'batch_group': 4, 'first': None},
    {'year': 2021, 'pid': 35, 'type': 1, 'batch': 2, 'batch_group': 5, 'first': None},
    {'year': 2021, 'pid': 35, 'type': 1, 'batch': 3, 'batch_group': 4, 'first': None},
    {'year': 2021, 'pid': 35, 'type': 1, 'batch': 3, 'batch_group': 5, 'first': None},
]

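Notes:

List-valued fields must be crossed with each other: for example, type = [0, 1] and batch = [2, 3] should yield 4 rows, namely (0, 2), (0, 3), (1, 2) and (1, 3).
Only the id is needed from batch_group.
If a field has no data, fill it with None.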
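
The crossing described in the first note is a Cartesian product. As a minimal sketch, the standard library's `itertools.product` produces exactly the four pairs listed there; the variable names `types` and `batches` are chosen here only for illustration:

```python
from itertools import product

types = [0, 1]
batches = [2, 3]

# Every type is paired with every batch: 2 x 2 = 4 combinations.
pairs = list(product(types, batches))
print(pairs)  # [(0, 2), (0, 3), (1, 2), (1, 3)]
```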

3 How do we process the data?

3.1 The data is nested several levels deep, so flatten it first

原数据 = [{
    'year': 2021,
    'province': [
        {
            'pid': 35,
            'type': [0, 1],
            'batch': [2, 3],
            'batch_group': {
                '14': [
                    {'id': 4},
                    {'id': 5},
                ],
            },
            'first': ''
        }
    ]
}, ]

展平的数据 = []
for data in 原数据:
    year = data['year']
    province = data['province']

    for line_data in province:
        pid = line_data['pid']
        type_ = line_data['type']  # trailing underscore avoids shadowing the built-in type()
        batch = line_data['batch']
        first = line_data['first']
        # Collect the ids found inside batch_group
        batch_groups = line_data['batch_group']
        batch_group = []

        if batch_groups:
            for key in batch_groups.keys():
                for bg in batch_groups[key]:
                    batch_group.append(bg['id'])
        展平的数据.append(
            {'year': year,
             'pid': pid,
             'type': type_,
             'batch': batch,
             'batch_group': batch_group,
             'first': first
             }
        )

print(展平的数据)

Flattened result:

[{'year': 2021, 'pid': 35, 'type': [0, 1], 'batch': [2, 3], 'batch_group': [4, 5], 'first': ''}]
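
The id extraction for `batch_group` can also be written as a single comprehension. The helper below is only a sketch: the name `extract_group_ids` is hypothetical, and it assumes that `batch_group` is either a dict of lists (as in the sample) or some empty, non-dict value when there is no data.

```python
def extract_group_ids(batch_groups):
    """Collect every 'id' found under batch_group, ignoring the outer keys."""
    # Assumption: anything that is not a dict means "no groups".
    if not isinstance(batch_groups, dict):
        return []
    return [bg['id'] for groups in batch_groups.values() for bg in groups]


print(extract_group_ids({'14': [{'id': 4}, {'id': 5}]}))  # [4, 5]
print(extract_group_ids(''))                               # []
```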

3.2 Work backwards and build a conversion function

[result data] = func([flattened data])

# Continuing from the code above

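from copy import deepcopy
from collections.abc import Iterable  # `collections.Iterable` was removed in Python 3.10

def jiaocha(**kwargs):
    # Initialise every key with None so that all keys exist in the result
    data = {}
    for key in kwargs.keys():
        data[key] = None

    datas = [data, ]

    def _func(datas, key, values):
        datas2 = []
        for data in datas:
            # The field has data (note: any falsy value, including 0, counts as "no data")
            if values:
                # Iterable but not a string: emit one copy per element
                if isinstance(values, Iterable) and not isinstance(values, (str, bytes)):
                    for v in values:
                        data[key] = v
                        datas2.append(deepcopy(data))
                # Scalar or string: assign directly
                else:
                    data[key] = values
                    datas2.append(deepcopy(data))
            # No data: fill with None
            else:
                data[key] = None
                datas2.append(data)
        return datas2
    for key in kwargs.keys():
        datas = _func(datas, key, kwargs[key])
    return datas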

结果数据 = jiaocha(**展平的数据[0])

for data in 结果数据:
    print(data)

Output:

{'year': 2021, 'pid': 35, 'type': 0, 'batch': 2, 'batch_group': 4, 'first': None}
{'year': 2021, 'pid': 35, 'type': 0, 'batch': 2, 'batch_group': 5, 'first': None}
{'year': 2021, 'pid': 35, 'type': 0, 'batch': 3, 'batch_group': 4, 'first': None}
{'year': 2021, 'pid': 35, 'type': 0, 'batch': 3, 'batch_group': 5, 'first': None}
{'year': 2021, 'pid': 35, 'type': 1, 'batch': 2, 'batch_group': 4, 'first': None}
{'year': 2021, 'pid': 35, 'type': 1, 'batch': 2, 'batch_group': 5, 'first': None}
{'year': 2021, 'pid': 35, 'type': 1, 'batch': 3, 'batch_group': 4, 'first': None}
{'year': 2021, 'pid': 35, 'type': 1, 'batch': 3, 'batch_group': 5, 'first': None}
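
The same expansion can be expressed with `itertools.product` instead of repeated deep copies. This is an alternative sketch, not the article's function: the name `cross_records` is hypothetical, and it assumes that only lists and tuples should be expanded while '' and None mean "no data".

```python
from itertools import product


def cross_records(**fields):
    """Expand list-valued fields into one flat dict per combination."""
    keys = list(fields)
    choices = []
    for key in keys:
        value = fields[key]
        if isinstance(value, (list, tuple)):
            # An empty list means "no data": keep a single None placeholder.
            choices.append(list(value) or [None])
        elif value in ('', None):
            choices.append([None])
        else:
            # Any other scalar (including 0) is kept as a single choice.
            choices.append([value])
    return [dict(zip(keys, combo)) for combo in product(*choices)]


rows = cross_records(year=2021, pid=35, type=[0, 1], batch=[2, 3],
                     batch_group=[4, 5], first='')
for row in rows:
    print(row)
```

For the sample record this yields the same 8 rows in the same order. One deliberate difference: a scalar 0 is kept as a value here, whereas `jiaocha` treats every falsy value as missing.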

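3.3 Wrap it up: combine the steps into one self-contained unit

The previous two steps consist of a loop outside any function plus a function definition, and each depends on the other. It is best to merge them into a single unit so that later calls are easier.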

原数据 = [{
    'year': 2021,
    'province': [
        {
            'pid': 35,
            'type': [0, 1],
            'batch': [2, 3],
            'batch_group': {
                '14': [
                    {'id': 4},
                    {'id': 5},
                ],
            },
            'first': ''
        }
    ]
}, ]

from copy import deepcopy
from collections.abc import Iterable  # `collections.Iterable` was removed in Python 3.10


def convert(datas: list):
    '''
    @param datas: must follow the input format shown above
    return:[
        {'year': xxx, 'pid': xxx, 'type': xxx, 'batch': xxx, 'batch_group': xxx, 'first': xxx},
        ...
    ]
    '''
    # 1. Flatten all the data
    flatten_datas = []  # holds every flattened record
    for data in datas:
        year = data['year']
        province = data['province']

        for line_data in province:
            pid = line_data['pid']
            type_ = line_data['type']  # trailing underscore avoids shadowing the built-in type()
            batch = line_data['batch']
            first = line_data['first']
            # Collect the ids found inside batch_group
            batch_groups = line_data['batch_group']
            batch_group = []

            if batch_groups:
                for key in batch_groups.keys():
                    for bg in batch_groups[key]:
                        batch_group.append(bg['id'])
            flatten_datas.append(
                {'year': year,
                 'pid': pid,
                 'type': type_,
                 'batch': batch,
                 'batch_group': batch_group,
                 'first': first
                 }
            )

    # 2.1 Define the crossing function
    def jiaocha(**kwargs):
        # Initialise every key with None so that all keys exist in the result
        data = {}
        for key in kwargs.keys():
            data[key] = None

        datas = [data, ]

        def _func(datas, key, values):
            datas2 = []
            for data in datas:
                # The field has data (note: any falsy value, including 0, counts as "no data")
                if values:
                    # Iterable but not a string: emit one copy per element
                    if isinstance(values, Iterable) and not isinstance(values, (str, bytes)):
                        for v in values:
                            data[key] = v
                            datas2.append(deepcopy(data))
                    # Scalar or string: assign directly
                    else:
                        data[key] = values
                        datas2.append(deepcopy(data))
                # No data: fill with None
                else:
                    data[key] = None
                    datas2.append(data)
            return datas2

        for key in kwargs.keys():
            datas = _func(datas, key, kwargs[key])
        return datas

    # 2.2 Call the crossing function
    ret_datas = []  # holds the result rows
    for data in flatten_datas:
        ret_datas += jiaocha(**data)
    return ret_datas


ret_datas = convert(原数据)
for data in ret_datas:
    print(data)

Normalized result:

{'year': 2021, 'pid': 35, 'type': 0, 'batch': 2, 'batch_group': 4, 'first': None}
{'year': 2021, 'pid': 35, 'type': 0, 'batch': 2, 'batch_group': 5, 'first': None}
{'year': 2021, 'pid': 35, 'type': 0, 'batch': 3, 'batch_group': 4, 'first': None}
{'year': 2021, 'pid': 35, 'type': 0, 'batch': 3, 'batch_group': 5, 'first': None}
{'year': 2021, 'pid': 35, 'type': 1, 'batch': 2, 'batch_group': 4, 'first': None}
{'year': 2021, 'pid': 35, 'type': 1, 'batch': 2, 'batch_group': 5, 'first': None}
{'year': 2021, 'pid': 35, 'type': 1, 'batch': 3, 'batch_group': 4, 'first': None}
{'year': 2021, 'pid': 35, 'type': 1, 'batch': 3, 'batch_group': 5, 'first': None}
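
A quick way to sanity-check the row count is to multiply the lengths of the list-valued fields in one flattened record. The helper below is a sketch under that assumption; the name `expected_row_count` is hypothetical and not part of the original project.

```python
from math import prod


def expected_row_count(flat_record):
    """Number of rows one flattened record should expand into."""
    # A list contributes len(list) choices; an empty list or a scalar
    # contributes exactly one (the None placeholder or the value itself).
    return prod(
        max(len(value), 1) if isinstance(value, list) else 1
        for value in flat_record.values()
    )


flat = {'year': 2021, 'pid': 35, 'type': [0, 1], 'batch': [2, 3],
        'batch_group': [4, 5], 'first': ''}
print(expected_row_count(flat))  # 8
```

For the sample record this gives 2 x 2 x 2 = 8, which matches the eight rows printed above.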

4 Normalize the JSON returned by the website and save it locally

# The convert function is saved in tool.py in the same directory
from tool import convert
# pip install kuser_agent: a package I wrote myself that returns a random User-Agent via kuser_agent.get()
import kuser_agent
import requests
import pymongo
import json


# 1. Send the request and fetch the JSON data
url = 'https://static-data.eol.cn/www/2.0/school/102/dic/specialplan.json'
json_data = requests.get(url=url, headers={'User-Agent': kuser_agent.get()}).content

# Note: if this endpoint has been shut down, use the commands below instead.
# I saved the JSON data locally, so it can be read directly:
# with open('数据/specialplan.json','rb') as f:
#     json_data = f.read()

# 2. Parse the JSON data into a dict
json_dict = json.loads(json_data)

# 3. Take the records under data -> data
datas = [
    data for data in json_dict['data']['data']
]

# 4. Convert them
ret_datas = convert(datas)

# 5. Save to the local MongoDB

db_name = 'HE'  # database name
c_name = 'specialplan'  # collection name

client = pymongo.MongoClient()  # connect to the local server

collection = client.get_database(db_name).get_collection(c_name)  # select the database and collection

collection.insert_many(ret_datas)  # insert the rows

client.close()  # close the connection
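
If MongoDB is not available, the normalized rows could also be saved locally as a JSON Lines file using only the standard library. This is a sketch outside the original project: the function name `save_json_lines`, the file name and the two sample rows are assumptions made for illustration.

```python
import json
import os
import tempfile


def save_json_lines(rows, path):
    """Write one JSON object per line; returns the number of rows written."""
    with open(path, 'w', encoding='utf-8') as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + '\n')
    return len(rows)


# Two rows in the normalized format, standing in for the output of convert().
sample_rows = [
    {'year': 2021, 'pid': 35, 'type': 0, 'batch': 2, 'batch_group': 4, 'first': None},
    {'year': 2021, 'pid': 35, 'type': 0, 'batch': 2, 'batch_group': 5, 'first': None},
]
out_path = os.path.join(tempfile.gettempdir(), 'specialplan_rows.jsonl')
written = save_json_lines(sample_rows, out_path)
print(written, 'rows written to', out_path)
```

Note that pymongo's `insert_many` adds an `_id` field to each dict in place, and that value is not JSON-serializable by default, so a file like this is best written before the insert.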