Mastering evaluation and hyperparameter tuning for classification problems (Task 6)

Latest recommended article published on 2022-07-20 10:22:32


License: CC 4.0 BY-SA

Tags:

#python #machine-learning

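This article shows how to evaluate model performance and tune hyperparameters with Python's scikit-learn library. It first uses a pipeline to combine data preprocessing and a logistic regression model into a single workflow. It then evaluates model performance with k-fold cross-validation, and uses stratified k-fold cross-validation to preserve the class proportions in every fold. Learning curves and validation curves are used to diagnose overfitting and underfitting. Grid search and randomized search are then applied to hyperparameter tuning, and their running times are compared. Finally, the confusion matrix and the ROC curve are presented as tools for evaluating model performance, with an example on the breast cancer dataset.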

Evaluating model performance and tuning hyperparameters

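1. Simplifying the workflow with pipelines

Standardizing the data, reducing its dimensionality with PCA, and fitting and predicting with a logistic regression model can all be done in one step.
Wrap every operation in a single pipeline to form one workflow: standardization + PCA + logistic regression.
Option 1: make_pipeline
Option 2: Pipeline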
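
The two options can be sketched as follows. This is a minimal illustration, not the original author's code: the bundled load_breast_cancer dataset, n_components=2, and the step names "scaler", "pca" and "clf" are assumptions chosen for the example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled breast cancer data stands in for real data.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1
)

# Option 1: make_pipeline names each step after its lowercased class name.
pipe_lr1 = make_pipeline(
    StandardScaler(), PCA(n_components=2), LogisticRegression(random_state=1)
)

# Option 2: Pipeline takes explicit (name, estimator) pairs.
# The step names here are hypothetical; any unique strings work.
pipe_lr2 = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=2)),
    ("clf", LogisticRegression(random_state=1)),
])

# fit() runs fit_transform on every step but the last, then fits the classifier.
pipe_lr1.fit(X_train, y_train)
pipe_lr2.fit(X_train, y_train)

# score() pushes the test data through the same fitted transforms.
acc1 = pipe_lr1.score(X_test, y_test)
acc2 = pipe_lr2.score(X_test, y_test)
print("make_pipeline test accuracy: %.3f" % acc1)
print("Pipeline test accuracy: %.3f" % acc2)
```

The step names matter later: hyperparameters of a step are addressed as stepname__parameter, which is why a pipeline built with make_pipeline around SVC uses keys such as svc__C.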

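2. Evaluating model performance with k-fold cross-validation

Each test set no longer holds a single sample (as in leave-one-out cross-validation) but several; the exact number depends on the choice of K. For example, with K=5 the steps of 5-fold cross-validation are:

1. Split the whole dataset into 5 parts.

2. Take each part in turn, without repetition, as the test set, train the model on the other four parts, and compute the model's error on that test set, $\text{MSE}_i$.

3. Average the 5 values of $\text{MSE}_i$ to obtain the final MSE.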

$\text{CV}_{(k)} = \frac{1}{k} \sum_{i=1}^{k} \text{MSE}_i$

k-fold cross-validation: use sklearn.model_selection.cross_val_score
Stratified k-fold cross-validation: use sklearn.model_selection.StratifiedKFold
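
A short sketch of both calls follows. It is illustrative only: the dataset (load_breast_cancer), the logistic regression pipeline, and k=10 are assumptions. Since this is a classification task, the per-fold score is accuracy rather than MSE, but the final estimate is still the mean of the k fold scores.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
pipe_lr = make_pipeline(
    StandardScaler(), LogisticRegression(random_state=1, max_iter=1000)
)

# k-fold cross-validation in one call: returns one score per fold.
scores = cross_val_score(pipe_lr, X, y, cv=10, scoring="accuracy")
print("CV accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))

# Stratified k-fold written out by hand: every fold keeps roughly the
# same class proportions as the full dataset.
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
manual_scores = []
for train_idx, test_idx in skf.split(X, y):
    pipe_lr.fit(X[train_idx], y[train_idx])          # train on k-1 folds
    manual_scores.append(pipe_lr.score(X[test_idx], y[test_idx]))  # test on 1 fold
manual_scores = np.array(manual_scores)
print("Stratified CV accuracy: %.3f" % manual_scores.mean())
```

Because the scaler sits inside the pipeline, it is refitted on the training folds only, so no information from the held-out fold leaks into preprocessing.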

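3. Debugging algorithms with learning and validation curves

If a model is too complex, that is, it has too many degrees of freedom or parameters, it risks overfitting (high variance); if it is too simple, it risks underfitting (high bias).

Diagnose bias and variance with a learning curve: sklearn.model_selection.learning_curve
Address underfitting and overfitting with a validation curve: sklearn.model_selection.validation_curve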
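
The sketch below computes both curves numerically without plotting them. The dataset, the logistic regression pipeline, the training-size grid and the range of C are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve, validation_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
pipe_lr = make_pipeline(
    StandardScaler(), LogisticRegression(max_iter=1000, random_state=1)
)

# Learning curve: score as a function of the number of training samples.
# A large, persistent gap between training and validation scores suggests
# high variance; two low scores that are close together suggest high bias.
train_sizes, train_scores, valid_scores = learning_curve(
    estimator=pipe_lr, X=X, y=y,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=5, scoring="accuracy", shuffle=True, random_state=1,
)
for n, tr, va in zip(train_sizes,
                     train_scores.mean(axis=1),
                     valid_scores.mean(axis=1)):
    print("n=%3d  train=%.3f  validation=%.3f" % (n, tr, va))

# Validation curve: score as a function of one hyperparameter, here the
# inverse regularization strength C of the logistic regression step.
param_range = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]
vc_train, vc_valid = validation_curve(
    estimator=pipe_lr, X=X, y=y,
    param_name="logisticregression__C", param_range=param_range,
    cv=5, scoring="accuracy",
)
for c, tr, va in zip(param_range, vc_train.mean(axis=1), vc_valid.mean(axis=1)):
    print("C=%-7g train=%.3f  validation=%.3f" % (c, tr, va))
```

Each returned score array has one row per training size (or parameter value) and one column per fold, so averaging over axis 1 gives the points that would be plotted.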

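4. Hyperparameter tuning with grid search

When only one parameter needs tuning, adjusting it by hand with a validation curve works well. As the number of hyperparameters grows, can the tuning be automated?

(Note the difference between parameters and hyperparameters: parameters, such as the coefficients of a logistic regression, are learned by the optimization algorithm during training; hyperparameters, such as the regularization strength, cannot be learned that way and must be set before training.)

Option 1: grid search with GridSearchCV()
sklearn.model_selection.GridSearchCV
Option 2: randomized search with RandomizedSearchCV()
sklearn.model_selection.RandomizedSearchCV
Option 3: nested cross-validation
sklearn.model_selection.GridSearchCV (inner loop) wrapped in sklearn.model_selection.cross_val_score (outer loop)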
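
Option 3 has no worked example in this article, so here is a minimal sketch of one common way to set it up. The iris data, the reduced parameter range, and the 2-fold inner / 5-fold outer split are assumptions chosen to keep the run short.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
pipe_svc = make_pipeline(StandardScaler(), SVC(random_state=1))

# Assumption: a smaller grid than the one used elsewhere in this article.
param_range = [0.01, 0.1, 1.0, 10.0, 100.0]
param_grid = [
    {"svc__C": param_range, "svc__kernel": ["linear"]},
    {"svc__C": param_range, "svc__gamma": param_range, "svc__kernel": ["rbf"]},
]

# Inner loop: the grid search picks hyperparameters with 2-fold CV.
gs = GridSearchCV(estimator=pipe_svc, param_grid=param_grid,
                  scoring="accuracy", cv=2)

# Outer loop: 5-fold CV treats the whole grid search as one estimator,
# so each outer test fold is never seen during hyperparameter selection.
nested_scores = cross_val_score(gs, X, y, scoring="accuracy", cv=5)
print("Nested CV accuracy: %.3f +/- %.3f"
      % (nested_scores.mean(), nested_scores.std()))
```

The outer score estimates how well the whole tuning procedure generalizes, which is usually a less optimistic figure than best_score_ from a single grid search.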

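5. Comparing different performance metrics

Accuracy is not always the only metric to consider, because different kinds of misclassification can carry different costs. Broader metrics are needed: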

Compute the confusion matrix: sklearn.metrics.confusion_matrix
Compute precision, recall, and F1 score:
sklearn.metrics.precision_score, recall_score, f1_score
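
To make the definitions concrete, the sketch below applies these functions to ten hand-made labels. The label vectors are invented for illustration and are not results from any model in this article.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Hypothetical ground truth and predictions (1 = positive class).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

# Rows are true labels, columns are predicted labels, in sorted label order:
# [[TN, FP],
#  [FN, TP]]
confmat = confusion_matrix(y_true=y_true, y_pred=y_pred)
tn, fp, fn, tp = confmat.ravel()
print(confmat)

accuracy = accuracy_score(y_true, y_pred)    # (TP + TN) / total = 7/10
precision = precision_score(y_true, y_pred)  # TP / (TP + FP)    = 3/5
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)    = 3/4
f1 = f1_score(y_true, y_pred)                # harmonic mean of the two = 2/3

print("accuracy=%.3f precision=%.3f recall=%.3f f1=%.3f"
      % (accuracy, precision, recall, f1))
```

Here accuracy is 0.7, while precision (0.6) and recall (0.75) show separately how the errors split between false positives and false negatives.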

6. Worked example

# Load the basic libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Jupyter notebook magic; remove this line when running as a plain script
%matplotlib inline
plt.style.use("ggplot")
import warnings
warnings.filterwarnings("ignore")

# Load the iris dataset into a DataFrame with a 'target' column
from sklearn import datasets
iris = datasets.load_iris()
X = iris.data
y = iris.target
feature = iris.feature_names
data = pd.DataFrame(X, columns=feature)
data['target'] = y
data.head()


# Hyperparameter tuning with grid search:
# Option 1: grid search with GridSearchCV()
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
import time

start_time = time.time()
pipe_svc = make_pipeline(StandardScaler(), SVC(random_state=1))
param_range = [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]
# Linear kernel: tune C only. RBF kernel: tune C and gamma.
param_grid = [{'svc__C': param_range, 'svc__kernel': ['linear']},
              {'svc__C': param_range, 'svc__gamma': param_range, 'svc__kernel': ['rbf']}]
gs = GridSearchCV(estimator=pipe_svc, param_grid=param_grid, scoring='accuracy', cv=10, n_jobs=-1)
gs = gs.fit(X, y)
end_time = time.time()
print("Grid search elapsed time: %.3f s" % float(end_time - start_time))
print(gs.best_score_)
print(gs.best_params_)

Grid search elapsed time: 0.443 s
0.9800000000000001
{'svc__C': 1.0, 'svc__gamma': 0.1, 'svc__kernel': 'rbf'}

# Option 2: randomized search with RandomizedSearchCV()
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
import time

start_time = time.time()
pipe_svc = make_pipeline(StandardScaler(), SVC(random_state=1))
param_range = [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]
param_grid = [{'svc__C': param_range, 'svc__kernel': ['linear']},
              {'svc__C': param_range, 'svc__gamma': param_range, 'svc__kernel': ['rbf']}]
# param_grid = [{'svc__C':param_range,'svc__kernel':['linear','rbf'],'svc__gamma':param_range}]
# No random_state is set on the search, so the sampled candidates differ between runs
gs = RandomizedSearchCV(estimator=pipe_svc, param_distributions=param_grid, scoring='accuracy', cv=10, n_jobs=-1)
gs = gs.fit(X, y)
end_time = time.time()
print("Randomized search elapsed time: %.3f s" % float(end_time - start_time))
print(gs.best_score_)
print(gs.best_params_)

Randomized search elapsed time: 0.090 s
0.9733333333333334
{'svc__kernel': 'linear', 'svc__C': 0.1}
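
The gap between the two timings is consistent with how many models each search fits. The sketch below counts the candidates, assuming the same grid as above, cv=10, and the default n_iter=10 of RandomizedSearchCV; it does not measure time itself.

```python
from sklearn.model_selection import ParameterGrid

param_range = [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]
param_grid = [
    {"svc__C": param_range, "svc__kernel": ["linear"]},
    {"svc__C": param_range, "svc__gamma": param_range, "svc__kernel": ["rbf"]},
]

cv = 10      # folds used by both searches above
n_iter = 10  # RandomizedSearchCV default when n_iter is not passed

# Grid search tries every combination: 8 linear + 8 * 8 rbf candidates.
n_grid_candidates = len(ParameterGrid(param_grid))
grid_fits = n_grid_candidates * cv

# Randomized search samples only n_iter candidates from the same space.
random_fits = n_iter * cv

print("grid search: %d candidates, %d fits" % (n_grid_candidates, grid_fits))
print("randomized search: %d candidates, %d fits" % (n_iter, random_fits))
```

Both searches add one final refit on the full data. Randomized search trades coverage for speed, which is why its best score here (0.973) can fall slightly below the exhaustive result (0.980).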

Confusion matrix and ROC curve

# Confusion matrix:
# Load the data
df = pd.read_csv("http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data", header=None)
# Breast cancer dataset: 569 samples of malignant and benign tumor cells;
# M = malignant, B = benign
# Basic data preprocessing
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.pipeline import make_pipeline

X = df.iloc[:, 2:].values
y = df.iloc[:, 1].values
le = LabelEncoder()    # encode the string labels M/B as integers
y = le.fit_transform(y)
le.transform(['M', 'B'])    # array([1, 0]): M (malignant) -> 1, B (benign) -> 0
# Split the data 80/20, preserving class proportions
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=1)
from sklearn.svm import SVC
pipe_svc = make_pipeline(StandardScaler(), SVC(random_state=1))
from sklearn.metrics import confusion_matrix

pipe_svc.fit(X_train, y_train)
y_pred = pipe_svc.predict(X_test)
# Rows are true labels, columns are predicted labels
confmat = confusion_matrix(y_true=y_test, y_pred=y_pred)
fig, ax = plt.subplots(figsize=(2.5, 2.5))
ax.matshow(confmat, cmap=plt.cm.Blues, alpha=0.3)
# Write each count into its cell
for i in range(confmat.shape[0]):
    for j in range(confmat.shape[1]):
        ax.text(x=j, y=i, s=str(confmat[i, j]), va='center', ha='center')
plt.xlabel('predicted label')
plt.ylabel('true label')
plt.show()


# Plot the ROC curve:
from sklearn.metrics import roc_curve, auc
from sklearn.metrics import make_scorer, f1_score
# Rank grid-search candidates by F1 score with class 0 as the positive label
scorer = make_scorer(f1_score, pos_label=0)
gs = GridSearchCV(estimator=pipe_svc, param_grid=param_grid, scoring=scorer, cv=10)
# roc_curve needs continuous scores, so use decision_function rather than predict
y_score = gs.fit(X_train, y_train).decision_function(X_test)
fpr, tpr, threshold = roc_curve(y_test, y_score)  # false positive rate and true positive rate
roc_auc = auc(fpr, tpr)  # area under the ROC curve
lw = 2
plt.figure(figsize=(7, 5))
# False positive rate on the x-axis, true positive rate on the y-axis
plt.plot(fpr, tpr, color='darkorange',
         lw=lw, label='ROC curve (area = %0.2f)' % roc_auc)
# Diagonal: the expected curve of a random classifier
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
plt.xlim([-0.05, 1.0])
plt.ylim([-0.05, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.show()
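
The area under the curve can also be obtained directly with sklearn.metrics.roc_auc_score, without building the curve first. The sketch below checks that both routes agree. It assumes scikit-learn's bundled load_breast_cancer data instead of the downloaded CSV (note that the bundled copy encodes malignant as 0 and benign as 1, the reverse of the LabelEncoder result above) and an untuned SVC.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import auc, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1
)

pipe_svc = make_pipeline(StandardScaler(), SVC(random_state=1))
pipe_svc.fit(X_train, y_train)

# Signed distance to the decision boundary; larger means "more like class 1".
y_score = pipe_svc.decision_function(X_test)

# Route 1: build the curve, then integrate it with the trapezoidal rule.
fpr, tpr, thresholds = roc_curve(y_test, y_score)
auc_from_curve = auc(fpr, tpr)

# Route 2: compute the area directly from labels and scores.
auc_direct = roc_auc_score(y_test, y_score)

print("AUC from curve: %.4f" % auc_from_curve)
print("AUC direct:     %.4f" % auc_direct)
```

Passing hard labels from predict() instead of scores would collapse the curve to a single operating point, which is why decision_function (or predict_proba) is the input of choice here.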

Open-source material from: https://github.com/datawhalechina/team-learning-data-mining/tree/master/EnsembleLearning