数据处理与统计分析 —— 数据合并组合（附代码与数据集）

原创已于 2025-08-20 09:37:14 修改 · 906 阅读

CC 4.0 BY-SA版权

文章标签：

于 2025-08-19 20:58:07 首次发布

目录

（一）concat( )函数

1、DataFrame行连接

2、DataFrame列连接

3、DataFrame对象连接Series对象

4、重置索引

（二）merge( )函数

一对一案例

1、数据导入

2、创建连接

3、分组groupby( )，并取平均值

4、可视化补充：转换时间单位

多对一案例

（三）join( )函数

导入数据

场景1：根据两个df的行索引进行关联

场景2：根据两个df的Symbol设置为行索引，然后关联，无需手动写on字段，默认就是根据行索引关联

场景3:1个df的Symbol设置为行索引，另一个df的Symbol不设置

本文主要讲了关于在数据处理与统计分析过程中连接数据以及合并数据集所要用到的三种方法concat()、merge()、join()，知识量比较多，大家认真学习。文章中涉及的数据集可在网盘中下载自行进行练习

网盘链接：https://pan.quark.cn/s/ca7bbdf05058?pwd=xSjv 提取码：xSjv

连接数据

首先，我们来看看所要连接的原始数据

import pandas as pd
# 加载数据
df1 = pd.read_csv('data/concat_1.csv')
df2 = pd.read_csv('data/concat_2.csv')
df3 = pd.read_csv('data/concat_3.csv')
print(df1)
print("--------------------------")
print(df2)
print("--------------------------")
print(df3)

（一）concat( )函数

1、DataFrame行连接

r1 = pd.concat([df1, df2, df3])
r2 = pd.concat([df1, df2, df3], axis=0)
r3 = pd.concat([df1, df2, df3], axis='rows')

运行结果：（三个运行结果都是一样的）
     A    B    C    D
0   a0   b0   c0   d0
1   a1   b1   c1   d1
2   a2   b2   c2   d2
3   a3   b3   c3   d3
0   a4   b4   c4   d4
1   a5   b5   c5   d5
2   a6   b6   c6   d6
3   a7   b7   c7   d7
0   a8   b8   c8   d8
1   a9   b9   c9   d9
2  a10  b10  c10  d10
3  a11  b11  c11  d11

2、DataFrame列连接

由此可知，axix" "里面，0/rows是指行。1/colums是指列
另外，列连接中可以直接通过表名['新增列名'] = [ 数据或者 series]

# 列连接
r4 = pd.concat([df1, df2, df3], axis='columns')
r5 = pd.concat([df1, df2, df3], axis=1)
print(r4)
print("--------------------------------------------")
print(r5)
print("--------------------------------------------")
r5['new_col'] = ['n1', 'n2', 'n3', 'n4'] # 添加列,自定义列
print(r5)
print("--------------------------------------------")
r5['new_col_series'] = pd.Series(['n1', 'n2', 'n3', 'n4'])
print(r5)

运行结果：（前两行运行结果相同）
    A   B   C   D   A   B   C   D    A    B    C    D
0  a0  b0  c0  d0  a4  b4  c4  d4   a8   b8   c8   d8
1  a1  b1  c1  d1  a5  b5  c5  d5   a9   b9   c9   d9
2  a2  b2  c2  d2  a6  b6  c6  d6  a10  b10  c10  d10
3  a3  b3  c3  d3  a7  b7  c7  d7  a11  b11  c11  d11
--------------------------------------------
    A   B   C   D   A   B   C   D    A    B    C    D new_col
0  a0  b0  c0  d0  a4  b4  c4  d4   a8   b8   c8   d8      n1
1  a1  b1  c1  d1  a5  b5  c5  d5   a9   b9   c9   d9      n2
2  a2  b2  c2  d2  a6  b6  c6  d6  a10  b10  c10  d10      n3
3  a3  b3  c3  d3  a7  b7  c7  d7  a11  b11  c11  d11      n4
--------------------------------------------
    A   B   C   D   A   B   C   D    A    B    C    D new_col new_col_series
0  a0  b0  c0  d0  a4  b4  c4  d4   a8   b8   c8   d8      n1             n1
1  a1  b1  c1  d1  a5  b5  c5  d5   a9   b9   c9   d9      n2             n2
2  a2  b2  c2  d2  a6  b6  c6  d6  a10  b10  c10  d10      n3             n3
3  a3  b3  c3  d3  a7  b7  c7  d7  a11  b11  c11  d11      n4             n4

3、DataFrame对象连接Series对象

s1 = pd.Series(['a', 'b', 'c'])
# 细节：由于series对象是列数据，concat()函数默认是添加行
# 但是series中没有行索引，所以添加了新的列，缺失值用NaN填充
r6 = pd.concat([df1, s1]) 
print(r6)

r7 = pd.concat([df1, s1], axis=1) # 列连接
print(r7)

运行结果：
   A    B    C    D    0
0   a0   b0   c0   d0  NaN
1   a1   b1   c1   d1  NaN
2   a2   b2   c2   d2  NaN
3   a3   b3   c3   d3  NaN
0  NaN  NaN  NaN  NaN    a
1  NaN  NaN  NaN  NaN    b
2  NaN  NaN  NaN  NaN    c

    A   B   C   D    0
0  a0  b0  c0  d0    a
1  a1  b1  c1  d1    b
2  a2  b2  c2  d2    c
3  a3  b3  c3  d3  NaN

4、重置索引

添加行时，重置索引

# 忽略索引, 重新生成索引
row_concat = pd.concat([df1, df2, df3],ignore_index= True)# 升序
print(row_concat)

运行结果：
      A    B    C    D
0    a0   b0   c0   d0
1    a1   b1   c1   d1
2    a2   b2   c2   d2
3    a3   b3   c3   d3
4    a4   b4   c4   d4
5    a5   b5   c5   d5
6    a6   b6   c6   d6
7    a7   b7   c7   d7
8    a8   b8   c8   d8
9    a9   b9   c9   d9
10  a10  b10  c10  d10
11  a11  b11  c11  d11

添加列时，重置索引

r1 = pd.concat([df1, df2, df3], axis='columns')
# 忽略索引, 重新生成索引
r2 = pd.concat([df1, df2, df3], axis='columns', ignore_index= True) 
print(r1)
print("------------------------------------------------")
print(r2)

运行结果：（前后对比）

    A   B   C   D   A   B   C   D    A    B    C    D
0  a0  b0  c0  d0  a4  b4  c4  d4   a8   b8   c8   d8
1  a1  b1  c1  d1  a5  b5  c5  d5   a9   b9   c9   d9
2  a2  b2  c2  d2  a6  b6  c6  d6  a10  b10  c10  d10
3  a3  b3  c3  d3  a7  b7  c7  d7  a11  b11  c11  d11
------------------------------------------------
   0   1   2   3   4   5   6   7    8    9    10   11
0  a0  b0  c0  d0  a4  b4  c4  d4   a8   b8   c8   d8
1  a1  b1  c1  d1  a5  b5  c5  d5   a9   b9   c9   d9
2  a2  b2  c2  d2  a6  b6  c6  d6  a10  b10  c10  d10
3  a3  b3  c3  d3  a7  b7  c7  d7  a11  b11  c11  d11

（二）merge( )函数

一对一案例

使用merge()函数，实现一对一的组合数据

1、数据导入

# 数据加载
import sqlite3
import pandas as pd
con = sqlite3.connect('data/chinook.db')
tracks = pd.read_sql_query('select * from tracks', con)
tracks.head()
gennres = pd.read_sql_query('select * from genres', con)
gennres.head()

how = ’left‘ 对应SQL中的 left outer 保留左侧表中的所有key
how = ’right‘ 对应SQL中的 right outer 保留右侧表中的所有key
how = 'outer' 对应SQL中的 full outer 保留左右两侧侧表中的所有key
how = 'inner' 对应SQL中的 inner 只保留左右两侧都有的key

先从tracks中提取部分数据，使其不含重复的'GenreId’值

tracks_subset = tracks.loc[[0,62,76,98,10,193,204,281,322,359],]
tracks_subset.head(10)

2、创建连接

# df.merge(df,on='关键字段',how='连接方式')
# 连接tracks_subset和gennres, 
df1 = gennres.merge(tracks_subset[['TrackId','GenreId','Milliseconds']],on='GenreId',how='inner')  # 连接方式 inner
df2 = gennres.merge(tracks_subset[['TrackId','GenreId','Milliseconds']],on='GenreId',how='left')  # 连接方式 left
df3 = gennres.merge(tracks_subset[['TrackId','GenreId','Milliseconds']],on='GenreId',how='right') # 连接方式 right
df4 = gennres.merge(tracks_subset[['TrackId','GenreId','Milliseconds']],on='GenreId',how='outer')  # 连接方式 outer

运行结果：

3、分组groupby( )，并取平均值

df4.groupby(['GenreId','Name'])['Milliseconds'].mean()

4、可视化补充：转换时间单位

# 计算每种类型音乐的平均时长
# to_timedelta 将Milliseconds列转变为timedelta数据类型
# 参数 unit='ms' 时间单位
genres_time = df4.groupby(['GenreId','Name'])['Milliseconds'].mean()
# dt.floor()是将毫秒截断成秒 
# sort_values()是排序
pd.to_timedelta(genres_time,unit='ms').dt.floor('s').sort_values()

多对一案例

目的：计算每名用户的平均消费

从三张表中获取数据，用户表获取用户id，姓名
发票表，获取发票id，用户id
发票详情表，获取发票id,单价，数量

# 从三张表中获取数据
user_df = pd.read_sql_query("select CustomerId,FirstName,LastName from customers", conn)
print(user_df.info())
invoice_df = pd.read_sql_query("select InvoiceId,CustomerId from invoices", conn)
print(invoice_df.info())
invoice_items_df = pd.read_sql_query("select InvoiceId,UnitPrice,Quantity from invoice_items", conn)
print(invoice_items_df.info())

# 根据用户Id('CustomerId')合并用户表和发票表
# 根据发票Id ('InvoiceId')合并发票和发票详情表
cust_inv = cust.merge(invoice, on='CustomerId').merge(ii, on='InvoiceId')
print(cust_inv)
# 计算用户每笔消费的总金额
total = cust_inv['Quantity']*cust_inv['UnitPrice']  # 计算每张发票的总金额(虚拟列）
cust_inv['total'] = total


# 按照用户Id，姓名分组，分组后对总金额求和，并排序
result = cust_inv.groupby(['CustomerId', 'FirstName', 'LastName'])['total'].sum()  # 求和
result = result.sort_values(ascending=False)  # 降序排列
print(result.head())

运行结果：

（三）join( )函数

导入数据

stocks_2016 = pd.read_csv('data/stocks_2016.csv')
stocks_2017 = pd.read_csv('data/stocks_2017.csv')
stocks_2018 = pd.read_csv('data/stocks_2018.csv')
print(stocks_2016.head())
print(stocks_2017.head())
print(stocks_2018.head())

场景1：根据两个df的行索引进行关联

# 默认是左外连接，默认的索引是行索引
df1 = stocks_2016.join(stocks_2017, lsuffix="_2016", rsuffix="_2017")  
print(df1.head())
 # 满外连接
df2 = stocks_2016.join(stocks_2017, lsuffix="_2016", rsuffix="_2017", how="outer") 
print(df2.head())

场景2：根据两个df的Symbol设置为行索引，然后关联，无需手动写on字段，默认就是根据行索引关联

df3 = stocks_2016.set_index("Symbol").join(stocks_2017.set_index("Symbol"),
 lsuffix="_2016", rsuffix="_2017")
print(df3.head())

场景3:1个df的Symbol设置为行索引，另一个df的Symbol不设置

df4 = stocks_2016.join(stocks_2017.set_index("Symbol"), 
on="Symbol", lsuffix="_2016", rsuffix="_2017")
print(df4.head())