python 进行各种回归

本文对比了多种回归模型,包括线性回归、决策树、SVM、KNN、随机森林、Adaboost、GradientBoosting、Bagging、ExtraTrees以及神经网络在预测任务中的表现。通过分析模型的平均绝对误差和决定系数,发现随机森林和梯度提升回归在预测准确性方面表现最佳。

摘要生成于 C知道 ,由 DeepSeek-R1 满血版支持, 前往体验 >

基本回归:线性、决策树、SVM、KNN
集成方法:随机森林、Adaboost、GradientBoosting、Bagging、ExtraTrees
##学会了数据分层抽样,以及各种回归的代码书写。可能还需要注意调参等。
继续学习网址:使用sklearn做各种回归

数据准备

from matplotlib import pyplot as plt
%matplotlib inline
plt.style.use('fivethirtyeight')  #设置matplotlib作图风格
import seaborn as sns
import pandas as pd
sns.set()   #设置sns作图默认风格
import warnings
warnings.filterwarnings('ignore')
import numpy as np
data = pd.read_csv("../论文/mianbanshu.csv")
data.head()
cityyearrkmdrjysysrqgddlpswsjzgymjlhldlshljwhhqntzwstzUnnamed: 18
0AH20082043.0174.6195.1187.607.5014.151.7578.8658.649.2935.9732.0394.7553.952722949.025465NaN
1AH20092114.0160.9695.2588.629.6614.918.2383.2462.4010.2337.1633.6395.5160.913762160.027900NaN
2AH20102469.0160.8396.0690.529.8816.018.8188.4671.5810.9537.5033.6795.5964.564756917.030324NaN
3AH20112265.0168.9996.5593.3510.6618.0010.2591.0979.0411.8839.4734.5597.0686.996203403.032884NaN
4AH20122401.0165.4598.0294.6111.1318.4711.7294.5386.3911.9238.8034.7295.2891.146339795.039962NaN

数据探索

f, ax = plt.subplots(figsize = (11, 7))
sns.factorplot(data = data, x = 'year', y = "wstz", 
               palette = 'plasma',
               hue = 'city', ax=ax) 

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-XfK1TDUD-1588691811222)(output_5_1.png)]

f, ax = plt.subplots(figsize = (11, 7))
sns.factorplot(data = data, x = 'year', y = "rkmd", 
               palette = 'plasma',
               hue = 'city', ax=ax) 

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-1HgicUTr-1588691811226)(output_6_1.png)]

f, ax = plt.subplots(figsize = (11, 7))
sns.factorplot(data = data, x = 'year', y = "rjys", 
               palette = 'plasma',
               hue = 'city', ax=ax) 

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-oHiy9d9A-1588691811234)(output_7_1.png)]

data = data.drop(["Unnamed: 18"],axis=1)

#Correlation Matrix(另一种画法)

corr = kiva_loans_data.corr() plt.figure(figsize=(12,12))
sns.heatmap(corr,
xticklabels=corr.columns.values,
yticklabels=corr.columns.values, annot=True, cmap=‘cubehelix’, square=True) plt.title(‘Correlation between
different features’) corr

corr_all = data.drop(["city"],axis = 1).corr()

mask = np.zeros_like(corr_all, dtype = np.bool)
mask[np.triu_indices_from(mask)] = True
f, ax = plt.subplots(figsize = (11, 9))
sns.heatmap(corr_all, mask = mask,
            square = True, linewidths = .5, ax = ax, cmap = "BuPu")      
#plt.savefig('heatmap.png')

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-fksHJGFy-1588691811237)(output_9_0.png)]

a = set(data["city"])
b = [9] * 31

实现分层抽样,抽到的即为训练样本

gbr=data.groupby('city')
# 分层抽样字典定义 组名:数据个数
typicalNDict = dict(zip(a,b))
 
# 函数定义
def typicalsamling(group, typicalNDict):
    name = group.name
    n = typicalNDict[name]
    return (group.sample(n=n))

#返回值:抽样后的数据框
train = data.groupby('city').apply(typicalsamling, typicalNDict)
train.to_csv('TrainData.csv', index=False)
train.head()
cityyearrkmdrjysysrqgddlpswsjzgymjlhldlshljwhhqntzwstz
city
AH4AH20122401.0165.4598.0294.6111.1318.4711.7294.5386.3911.9238.8034.7295.2891.146339795.039962
2AH20102469.0160.8396.0690.529.8816.018.8188.4671.5810.9537.5033.6795.5964.564756917.030324
8AH20162487.0180.1999.2098.0512.7621.8213.1897.3692.0314.0241.7137.6799.9499.947417525.067256
0AH20082043.0174.6195.1187.607.5014.151.7578.8658.649.2935.9732.0394.7553.952722949.025465
3AH20112265.0168.9996.5593.3510.6618.0010.2591.0979.0411.8839.4734.5597.0686.996203403.032884
train = pd.read_csv("../论文/TrainData.csv")
train.head()
cityyearrkmdrjysysrqgddlpswsjzgymjlhldlshljwhhqntzwstz
0AH20122401.0165.4598.0294.6111.1318.4711.7294.5386.3911.9238.8034.7295.2891.146339795.039962
1AH20102469.0160.8396.0690.529.8816.018.8188.4671.5810.9537.5033.6795.5964.564756917.030324
2AH20162487.0180.1999.2098.0512.7621.8213.1897.3692.0314.0241.7137.6799.9499.947417525.067256
3AH20082043.0174.6195.1187.607.5014.151.7578.8658.649.2935.9732.0394.7553.952722949.025465
4AH20112265.0168.9996.5593.3510.6618.0010.2591.0979.0411.8839.4734.5597.0686.996203403.032884
train.shape
(279, 18)

测试样本

test = data[~data["wstz"].isin(list(train["wstz"]))]
test.shape
(31, 18)
test.head()
cityyearrkmdrjysysrqgddlpswsjzgymjlhldlshljwhhqntzwstz
7AH20152458.0168.9098.7997.5512.3820.8212.6796.6891.8013.3741.1637.1699.5599.556324122.0106486
10BJ20081181.0187.22100.00100.006.776.210.7378.9274.528.5637.1535.8597.7197.715189156.098295
29FJ20172854.0203.6999.5697.4513.8517.419.4292.2189.0714.1343.6940.2899.3899.387128613.0260721
32GS20103793.0155.1291.5774.296.8912.204.8962.5958.588.1227.1223.1497.8437.95944242.06289
47GD20153060.0248.9598.4697.6017.7413.609.5193.6593.2517.4041.4337.3297.6391.568496440.0644310

从上面可以看出每个city之间的指标随着年份可能平稳,或趋于增加,但是各折线相交都较少,说明这些指标已足以将样本进行区分。

另外从热力图可以看出,year这个变量与其他的变量有明显相关关系的情况比较少,大部分指标在不同年份相差不大,故考虑也将year这个变量删除。

去除城市和年份这一列
train = train.drop(["city","year"],axis=1)
X_train = train[train.columns[:-1]]
y_train = train[train.columns[-1]]
test = test.drop(["city","year"],axis=1)
X_test = test[test.columns[:-1]]
y_test = test[test.columns[-1]]

1.线性回归

from sklearn import linear_model
model1_linear = linear_model.LinearRegression()
model1_linear = model1_linear.fit(X_train,y_train)
y_pred1 = model1_linear.predict(X_test)
from sklearn.metrics import mean_absolute_error 
mean_absolute_error(y_pred1,y_test)
84180.196193290452
from sklearn.metrics import r2_score
r2_score(y_pred1,y_test)
0.48478789713121917
plt.figure(figsize=(12, 6))
plt.plot(list(y_pred1),label="forcast")
plt.plot(list(y_test),label="test")
plt.ylabel('wstz',fontsize=14,horizontalalignment='center')
plt.legend()
plt.show()

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-yzea3hnT-1588691811240)(output_30_0.png)]

2.决策树回归

from sklearn import tree
model2_tree = tree.DecisionTreeRegressor()
model2_tree = model2_tree.fit(X_train,y_train)
y_pred2 = model2_tree.predict(X_test)
mean_absolute_error(y_pred2,y_test)
52285.806451612902
r2_score(y_pred2,y_test)
0.55980167732053721
plt.figure(figsize=(12, 6))
plt.plot(list(y_pred2),label="forcast")
plt.plot(list(y_test),label="test")
plt.ylabel('wstz',fontsize=14,horizontalalignment='center')
plt.legend()
plt.show()

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-XjiZRmZD-1588691811241)(output_35_0.png)]

3.SVM回归

####3.3SVM回归####
from sklearn import svm
model3_SVR = svm.SVR()
model3_SVR = model3_SVR.fit(X_train,y_train)
y_pred3 = model3_SVR.predict(X_test)
mean_absolute_error(y_pred3,y_test)
90906.8702612044
r2_score(y_pred3,y_test)
-16123841.245722869
plt.figure(figsize=(12, 6))
plt.plot(list(y_pred3),label="forcast")
plt.plot(list(y_test),label="test")
plt.ylabel('wstz',fontsize=14,horizontalalignment='center')
plt.legend()
plt.show()

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-WhmnQ5du-1588691811243)(output_40_0.png)]

4.KNN回归

from sklearn.neighbors import KNeighborsRegressor
model4_KN = KNeighborsRegressor(n_neighbors=3)
model4_KN.fit(X_train,y_train)
y_pred4 = model4_KN.predict(X_test)
mean_absolute_error(y_pred4,y_test)
117266.68817204301
r2_score(y_pred4,y_test)
-1.5288016385577752
plt.figure(figsize=(12, 6))
plt.plot(list(y_pred4),label="forcast")
plt.plot(list(y_test),label="test")
plt.ylabel('wstz',fontsize=14,horizontalalignment='center')
plt.legend()
plt.show()

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-D3ofVHD9-1588691811244)(output_45_0.png)]

5.随机森林回归

from sklearn.ensemble import RandomForestRegressor
model5_RF = RandomForestRegressor()
model5_RF = model5_RF.fit(X_train,y_train)
y_pred5 = model5_RF.predict(X_test)
mean_absolute_error(y_pred5,y_test)
34605.676129032247
r2_score(y_pred5,y_test)
0.74665386911940801
plt.figure(figsize=(12, 6))
plt.plot(list(y_pred5),label="forcast")
plt.plot(list(y_test),label="test")
plt.ylabel('wstz',fontsize=14,horizontalalignment='center')
plt.legend()
plt.show()

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-0et8pPgR-1588691811245)(output_50_0.png)]

6.AdaBoost回归

from sklearn.ensemble import AdaBoostRegressor
model6_AdaBoost = AdaBoostRegressor(n_estimators=100)
model6_AdaBoost = model6_AdaBoost.fit(X_train,y_train)
y_pred6 = model6_AdaBoost.predict(X_test)
mean_absolute_error(y_pred6,y_test)
71458.328974579621
r2_score(y_pred6,y_test)
0.30236366136673865
plt.figure(figsize=(12, 6))
plt.plot(list(y_pred6),label="forcast")
plt.plot(list(y_test),label="test")
plt.ylabel('wstz',fontsize=14,horizontalalignment='center')
plt.legend()
plt.show()

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-FKl9fdBm-1588691811247)(output_54_0.png)]

7.梯度上升回归

from sklearn.ensemble import GradientBoostingRegressor
model7_GBDT = GradientBoostingRegressor()
model7_GBDT = model7_GBDT.fit(X_train,y_train)
y_pred7 = model7_GBDT.predict(X_test)
mean_absolute_error(y_pred7,y_test)
30247.23832854132
r2_score(y_pred7,y_test)
0.79286645995236993
plt.figure(figsize=(12, 6))
plt.plot(list(y_pred7),label="forcast")
plt.plot(list(y_test),label="test")
plt.ylabel('wstz',fontsize=14,horizontalalignment='center')
plt.legend()
plt.show()

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-TKBuXWsk-1588691811249)(output_58_0.png)]

8.Bagging回归

from sklearn.ensemble import BaggingRegressor
model8_Bagging = BaggingRegressor()
model8_Bagging = model8_Bagging.fit(X_train,y_train)
y_pred8 = model8_Bagging.predict(X_test)
mean_absolute_error(y_pred8,y_test)
36788.058064516132
r2_score(y_pred8,y_test)
0.68475165003444116
plt.figure(figsize=(12, 6))
plt.plot(list(y_pred8),label="forcast")
plt.plot(list(y_test),label="test")
plt.ylabel('wstz',fontsize=14,horizontalalignment='center')
plt.legend()
plt.show()

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-LC2KCEHC-1588691811250)(output_62_0.png)]

9.ExtraTree极端随机树回归

####3.9ExtraTree极端随机树回归####
from sklearn.tree import ExtraTreeRegressor
model9_Etra = ExtraTreeRegressor()
model9_Etra = model9_Etra.fit(X_train,y_train)
y_pred9 = model9_Etra.predict(X_test)
mean_absolute_error(y_pred9,y_test)
45674.580645161288
r2_score(y_pred9,y_test)
0.31221911911576294
plt.figure(figsize=(12, 6))
plt.plot(list(y_pred9),label="forcast")
plt.plot(list(y_test),label="test")
plt.ylabel('wstz',fontsize=14,horizontalalignment='center')
plt.legend()
plt.show()

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-LesYMvrX-1588691811251)(output_66_0.png)]

模型比较

modelnames = ['linear_model',
              'DecisionTreeRegressor',
              'RandomForestRegressor', 
              'AdaBoostRegressor', 
              'GradientBoostingRegressor',
              'BaggingRegressor',
              'ExtraTreeRegressor',
              ]
R_square = [r2_score(y_pred1,y_test),r2_score(y_pred2,y_test),
            r2_score(y_pred5,y_test),r2_score(y_pred6,y_test),
            r2_score(y_pred7,y_test),r2_score(y_pred8,y_test),
            r2_score(y_pred9,y_test)]
plt.figure(figsize=(10,9))

fig = plt.figure(figsize=(10,6))
ax = fig.add_subplot(1, 1, 1)
 
ticks = ax.set_xticks(range(0,7))#设定x轴有7个标签
ax.plot(R_square,'ko--')
labels = ax.set_xticklabels(modelnames,fontsize='14',rotation=90)  
plt.title("model comparsion",fontsize='14')
plt.grid(True)
plt.show()

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-AwrziDGh-1588691811253)(output_68_1.png)]误差衡量指标平均绝对误差和决定系数具有一致性,当决定系数系数越大,平均绝对误差就越小,上述试验数据也是这样。
相对较好的两个模型如下:
随机森林回归:平均绝对误差:34605 决定系数:0.747
梯度上升回归:平均绝对误差:30247 决定系数:0.793


前馈型神经网络,也叫多层感知器(Multi-Layer Perceptron,MLP)模型或全连接神经网络。它每一层的神经元从上一层神经元接收数据,经过计算之后产生输出数据,送入下一层神经元继续处理,最后一层神经元的输出是神经网络最终的输出值。

import numpy as np
from sklearn.neural_network import MLPRegressor


MLP = MLPRegressor()
MLP.fit(X_train,y_train)
y_pred10 = MLP.predict(X_test)
score = r2_score(y_test, y_pred10)  

plt.figure(figsize=(12, 6))
plt.plot(np.arange(len(y_pred10)),y_test,'go-',label = 'true value')
plt.plot(np.arange(len(y_pred10)),y_pred10,'ro-',label = 'predict value')
plt.legend()
plt.show()

在这里插入图片描述

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值