Machine Learning Study Notes, Final Installment: Bank Customer Churn Prediction

I. Objectives and Background

As competition in the banking industry intensifies, customer churn has become a major threat to bank profitability. Accurately predicting churn not only helps banks take targeted retention measures, it also improves customer relationship management. This study builds churn prediction models on a bank customer dataset using decision tree, XGBoost, logistic regression, and neural network algorithms.

The dataset comes from the open Bank Customer Churn project on Kaggle (Bank Customer Churn Prediction | Kaggle). It contains 12 feature columns: CustomerId, Surname, CreditScore, Geography, Gender, Age, Tenure, Balance, NumOfProducts, HasCrCard, IsActiveMember, and EstimatedSalary; the Exited column records whether the customer churned. The training set contains 13,501 samples.

II. Experimental Procedure

1. Loading the Data and Libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.metrics import f1_score
#1. Read the data
train_df = pd.read_csv('train.csv')
train_df.head()

The output is as follows:

[Output: the first five rows of train_df, spanning the columns id, CustomerId, Surname, CreditScore, Geography, Gender, Age, Tenure, Balance, NumOfProducts, HasCrCard, IsActiveMember, EstimatedSalary, and Exited.]

2. Data Preprocessing and Feature Analysis

Feature handling: analyze the customer features and drop irrelevant variables (e.g., Surname, id).

#2. Initial data processing
from sklearn.preprocessing import LabelEncoder
X = train_df.loc[:, ['CreditScore', 'Geography', 'Gender', 'Age', 'Tenure',
                      'Balance', 'NumOfProducts', 'HasCrCard', 'IsActiveMember']] # keep the useful feature columns
y = train_df['Exited']
#print(X.head())
#print(y.head())

Encoding and standardization: encode the categorical variables (gender, geography); the numeric features are standardized later, just before the neural network is trained (step 7.4).

print(X['Geography'].unique())
encoder = LabelEncoder()
encoded_country = encoder.fit_transform(X['Geography'])  # encode the Geography column
encoded_gender = encoder.fit_transform(X['Gender'])
X['Geography'] = encoded_country
X['Gender'] = encoded_gender
#print(X.head())
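One note on the design choice above: LabelEncoder maps the three Geography values to arbitrary integers, which tree models tolerate but a linear model such as logistic regression may misread as a magnitude. A one-hot alternative, sketched here for illustration only (not part of the original pipeline), could look like this:

# Hypothetical alternative: one-hot encode the categorical columns instead.
# pd.get_dummies adds one 0/1 column per category, avoiding an artificial ordering.
X_onehot = pd.get_dummies(train_df[['CreditScore', 'Geography', 'Gender', 'Age', 'Tenure',
                                    'Balance', 'NumOfProducts', 'HasCrCard', 'IsActiveMember']],
                          columns=['Geography', 'Gender'])
print(X_onehot.head())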

Data exploration: examine the class distribution, the distributions of individual variables, and pairwise correlations.

#3. Exploratory data analysis (visualization)
# Class distribution of the target variable in the training set
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False 
sns.countplot(data = train_df, x='Exited')  
plt.title('Class distribution in the training set')
plt.show()

# Correlation heatmap of the training-set features
plt.figure(figsize=(10, 8))
sns.heatmap(X.corr(), annot=True, fmt=".2f", cmap='coolwarm', linewidths=0.5)
plt.title('Feature correlation heatmap')
plt.show()

print("Descriptive statistics of the features:")
print(X.describe())

The output is as follows (figures omitted: class-distribution bar chart, feature-correlation heatmap, descriptive statistics):

As the class-distribution plot shows, negative samples outnumber positive samples by roughly a factor of four, so the classes are highly imbalanced; SMOTE oversampling will be applied later to address this.

Several of the features show pronounced outliers, which need to be handled.

Correlations between the variables are weak, so no special treatment (such as PCA, principal component analysis) is needed.

Data cleaning: handle missing values and outliers to ensure data quality.

#4. Preprocessing (handle outliers and missing values)
#4.1 Missing values
# Check for missing values in the training data
print("Missing values in train data:\n", train_df.isnull().sum())
#4.2 Outliers
# Outlier detection (z-score method; computed here for inspection,
# the actual removal below uses the IQR rule)
z_scores_X = stats.zscore(X)
z_scores_y = stats.zscore(y)
abs_z_scores_X = np.abs(z_scores_X)
abs_z_scores_y = np.abs(z_scores_y)

for col in X.columns:
    plt.figure(figsize=(6, 4))
    sns.boxplot(x=X[col])
    plt.title(f'Boxplot of {col}')
    plt.show()
    

# Remove outliers using the IQR rule
def remove_outliers_iqr(df, factor=1.5):
    df_cleaned = df.copy()
    for col in df.columns:
        Q1 = df[col].quantile(0.25)
        Q3 = df[col].quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - factor * IQR
        upper_bound = Q3 + factor * IQR
        df_cleaned = df_cleaned[(df_cleaned[col] >= lower_bound) & (df_cleaned[col] <= upper_bound)]
    return df_cleaned

cleaned_df_X = remove_outliers_iqr(X)
cleaned_df_y = y[cleaned_df_X.index]
print("原始样本数量:", len(X))
print("清洗后样本数量:", len(cleaned_df_X))

The output is as follows:

As shown, there are no missing values among the features, so no extra handling is needed.

Outliers were inspected with boxplots and removed using the IQR rule (the z-scores above are computed for inspection only): the original dataset had 13,501 samples, and 10,262 remain after cleaning.
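Incidentally, the z-scores computed in step 4.2 could drive an alternative filter. A minimal sketch, assuming the conventional |z| < 3 cutoff (which the original run does not apply):

# Alternative outlier filter based on the z-scores already computed above;
# the threshold of 3 is a common convention, not taken from the original code.
z_mask = (abs_z_scores_X < 3).all(axis=1)
X_z_cleaned = X[z_mask]
print("Samples kept by the |z| < 3 rule:", len(X_z_cleaned))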

Dataset split: the cleaned data are divided into training and test sets at an 80:20 ratio using scikit-learn's train_test_split.

#5. Split into training and test sets (80% train, 20% test)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(cleaned_df_X, cleaned_df_y, test_size=0.2, random_state=7)


SMOTE oversampling: SMOTE is applied to the imbalanced training set to improve model training.

SMOTE (synthetic minority over-sampling technique) artificially generates new, synthetic minority-class samples in feature space, improving the model's ability to learn the minority class.
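At its core, SMOTE draws each new point on the line segment between a minority sample and one of its k nearest minority-class neighbours. A toy illustration of that single step (illustrative only, not the imblearn internals):

# One SMOTE interpolation step on made-up feature values
rng = np.random.default_rng(0)
x_i = np.array([600.0, 35.0])      # a minority-class sample
x_nn = np.array([620.0, 40.0])     # one of its nearest minority-class neighbours
lam = rng.uniform(0, 1)            # random interpolation weight in [0, 1]
x_new = x_i + lam * (x_nn - x_i)   # synthetic sample on the segment between them
print(x_new)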

#6. Oversample the training set
from imblearn.over_sampling import SMOTE
# Initialize the SMOTE oversampler
smote = SMOTE(random_state=21)

# Apply SMOTE oversampling
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)

# Print the shapes after oversampling
print("X_train shape after oversampling:", X_train_resampled.shape)
print("y_train shape after oversampling:", y_train_resampled.shape)

# Check the target distribution
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False 
plt.figure(figsize=(10, 5))
plt.subplot(1, 2, 1)
sns.countplot(x=y_train_resampled)
plt.title('Target distribution in the training set after SMOTE')

plt.tight_layout()
plt.show()

The output is as follows (distribution plot omitted):

The ratio of negative (retained) to positive (churned) samples in the original dataset was 10810:2691; after SMOTE, the training set is balanced at 6660:6660.

3. Model Construction and Optimization

Model training: four algorithms are trained and compared: logistic regression, decision tree, ensemble learning (XGBoost), and a neural network.

#7. Model training
#7.1 Initialize and train the decision tree model
tr_model = DecisionTreeClassifier(random_state=7)
tr_model.fit(X_train_resampled, y_train_resampled)
# Predicted probabilities and classes
y_prod_tr = tr_model.predict_proba(X_test)[:, 1]
y_pred_tr = tr_model.predict(X_test)
# Print the classification report
#print(classification_report(y_test, y_pred_tr))

#7.2 Initialize and train the logistic regression model
lr_model = LogisticRegression(max_iter=1000, random_state=7)
lr_model.fit(X_train_resampled, y_train_resampled)

# Predicted probabilities and classes
y_prod_lr = lr_model.predict_proba(X_test)[:, 1]
y_pred_lr = lr_model.predict(X_test)

# Print the classification report
#print(classification_report(y_test, y_pred_lr))

#7.3 Initialize and train the XGBoost model
from xgboost import XGBClassifier
# Initialize the XGBoost classifier
xgb_model = XGBClassifier(
    eval_metric='logloss',    # evaluation metric
    random_state=7            # fix the random seed
)
# Train the model
xgb_model.fit(X_train_resampled, y_train_resampled)

# Predicted probabilities and classes
y_prod_xgb = xgb_model.predict_proba(X_test)[:, 1]
y_pred_xgb = xgb_model.predict(X_test)

#7.4 Initialize and train the neural network model
from sklearn.preprocessing import StandardScaler
# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_resampled)
X_test_scaled = scaler.transform(X_test)

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam

# Build the model
model = Sequential([
    Dense(64, activation='relu', input_shape=(X_train_scaled.shape[1],)),
    Dense(32, activation='relu'),
    Dense(1, activation='sigmoid')  # output layer: sigmoid for binary classification
])

# Compile the model
model.compile(optimizer=Adam(learning_rate=0.001),
              loss='binary_crossentropy',
              metrics=['accuracy'])

# Inspect the model architecture
model.summary()

# Train the model
history = model.fit(X_train_scaled, y_train_resampled,
                    epochs=50,
                    batch_size=32,
                    validation_split=0.2,
                    verbose=1)
# Predicted probabilities on the test set
y_prod_nn = model.predict(X_test_scaled).flatten()

# Predicted classes (0.5 threshold)
y_pred_class = (y_prod_nn > 0.5).astype("int32")

# Print the classification report
#print(classification_report(y_test, y_pred_class))

With the initial models built and their predictions on the test set in hand, AUC and F1 are used as the evaluation metrics for all four models.
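Both metrics can also be checked numerically; for example, using the decision tree's predictions from step 7 (roc_auc_score needs the predicted probabilities, f1_score the hard labels):

from sklearn.metrics import roc_auc_score
print("Decision tree AUC:", roc_auc_score(y_test, y_prod_tr))
print("Decision tree F1 :", f1_score(y_test, y_pred_tr))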

#8. Plot ROC curves and confusion matrices
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt

models = [tr_model, lr_model, xgb_model]
model_names = ['tr_Model', 'lr_Model', 'xgb_Model']

plt.figure(figsize=(8,6))

for model, name in zip(models, model_names):
    y_pred_prob = model.predict_proba(X_test)[:, 1]
    fpr, tpr, _ = roc_curve(y_test, y_pred_prob)
    roc_auc = auc(fpr, tpr)
    plt.plot(fpr, tpr, lw=2, label='{} (AUC = {:.2f})'.format(name, roc_auc))
fpr, tpr, _ = roc_curve(y_test, y_prod_nn)  # use the NN's probabilities, not hard labels
roc_auc = auc(fpr, tpr)
plt.plot(fpr, tpr, lw=2, label='{} (AUC = {:.2f})'.format('nn_model', roc_auc))
plt.plot([0, 1], [0, 1], color='navy', linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC curves of the initial models')
plt.legend(loc="lower right")
plt.show()

from sklearn.metrics import confusion_matrix, recall_score
import seaborn as sns
import matplotlib.pyplot as plt

for model, name in zip(models, model_names):
    y_pred = model.predict(X_test)
    cm = confusion_matrix(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    plt.figure(figsize=(5,4))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
    plt.title(f'{name} confusion matrix\nF1 = {f1:.4f}')
    plt.xlabel('Predicted label')
    plt.ylabel('True label')
    plt.show()
cm = confusion_matrix(y_test, y_pred_class)
f1 = f1_score(y_test, y_pred_class)
plt.figure(figsize=(5,4))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.title(f'Neural network confusion matrix\nF1 = {f1:.4f}')
plt.xlabel('Predicted label')
plt.ylabel('True label')
plt.show()

The output is as follows (ROC curves and confusion matrices omitted):

XGBoost (a tree ensemble) and logistic regression (a generalized linear model) performed best, with AUC values of 0.90 and 0.86 respectively, but both still have fairly low F1 scores, leaving room for improvement. The initial decision tree performed only moderately and has substantial room for improvement.

Hyperparameter tuning: optimize the hyperparameters with cross-validated grid search to improve generalization.

#9. Hyperparameter tuning (cross-validated grid search)
from sklearn.model_selection import GridSearchCV, StratifiedKFold
param_grid_tr = {
        'max_depth': [2, 3, 5, 7, None],
        'min_samples_split': [2, 5, 10],
        'criterion': ['gini', 'entropy']
}

grid_search_tr = GridSearchCV(DecisionTreeClassifier(random_state=7), param_grid_tr, cv=5)
grid_search_tr.fit(X_train_resampled, y_train_resampled)
best_tr = grid_search_tr.best_estimator_
print("Best Decision Tree Parameters:", grid_search_tr.best_params_)

best_y_prod_tr = best_tr.predict_proba(X_test)[:, 1]
best_y_pred_tr = best_tr.predict(X_test)

param_grid_lr = {
        'C': [0.1, 1, 10],
        'penalty': ['l1', 'l2'],
        'solver': ['liblinear']
}

grid_search_lr = GridSearchCV(LogisticRegression(max_iter=1000, random_state=7), param_grid_lr, cv=5)
grid_search_lr.fit(X_train_resampled, y_train_resampled)
best_lr = grid_search_lr.best_estimator_
print("Best Logistic Regression Parameters:", grid_search_lr.best_params_)

best_y_prod_lr = best_lr.predict_proba(X_test)[:, 1]
best_y_pred_lr = best_lr.predict(X_test)

param_grid_xgb = {
    'n_estimators': [100, 200],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1]}
grid_search_xgb = GridSearchCV(XGBClassifier(eval_metric='logloss'),
                               param_grid_xgb,
                               scoring='recall',
                               cv=5,
                               verbose=1)
# Note: this search is fit on the un-resampled training split,
# unlike the decision tree and logistic regression searches above
grid_search_xgb.fit(X_train, y_train)
best_xgb = grid_search_xgb.best_estimator_
print("Best XGBoost Parameters:", grid_search_xgb.best_params_)
best_y_prod_xgb = best_xgb.predict_proba(X_test)[:, 1]
best_y_pred_xgb = best_xgb.predict(X_test)
best_xgb.save_model('xgboost_model.bin')  # save the model to disk
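The saved model can later be restored for inference without retraining; a sketch using the file written above:

# Reload the saved model (the path matches the save_model call above)
loaded_xgb = XGBClassifier()
loaded_xgb.load_model('xgboost_model.bin')
print(loaded_xgb.predict(X_test[:5]))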

4. Final Model Evaluation and Comparison

Evaluation metrics: the F1 score and the ROC curve (AUC) are used to assess model performance.

#10. Visualize the tuned models
best_models = [best_tr, best_lr, best_xgb]
for model, name in zip(best_models, model_names):
    y_pred_prob = model.predict_proba(X_test)[:, 1]
    fpr, tpr, _ = roc_curve(y_test, y_pred_prob)
    roc_auc = auc(fpr, tpr)
    plt.plot(fpr, tpr, lw=2, label='{} (AUC = {:.2f})'.format(name, roc_auc))
fpr, tpr, _ = roc_curve(y_test, y_prod_nn)  # use the NN's probabilities, not hard labels
roc_auc = auc(fpr, tpr)
plt.plot(fpr, tpr, lw=2, label='{} (AUC = {:.2f})'.format('nn_model', roc_auc))
plt.plot([0, 1], [0, 1], color='navy', linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic')
plt.legend(loc="lower right")
plt.show()

for model, name in zip(best_models, model_names):
    y_pred = model.predict(X_test)
    cm = confusion_matrix(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    plt.figure(figsize=(5,4))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
    plt.title(f'{name} confusion matrix\nF1 = {f1:.4f}')
    plt.xlabel('Predicted label')
    plt.ylabel('True label')
    plt.show()
cm = confusion_matrix(y_test, y_pred_class)
f1 = f1_score(y_test, y_pred_class)
plt.figure(figsize=(5,4))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.title(f'Neural network confusion matrix\nF1 = {f1:.4f}')
plt.xlabel('Predicted label')
plt.ylabel('True label')
plt.show()

The output is as follows (ROC curves and confusion matrices omitted):

XGBoost performs best of the four models, ranking first on every metric (AUC of 0.92, F1 of 0.67), and all four models improve noticeably over their untuned versions. The decision tree gains the most from tuning, finishing just behind XGBoost. The XGBoost classifier is therefore selected as the final model for this bank-churn classification task.

Model interpretability: SHAP analysis shows which features each model relies on and how much each contributes. Here the best-performing XGBoost model and the well-tuned decision tree are analyzed for SHAP feature importance.

import shap
plt.rcParams['font.sans-serif'] = ['DejaVu Sans']
tree_models = [best_xgb]
kernel_models = [best_tr]
# Build a list of explainer objects
explainers = []
for model in tree_models:
    # TreeExplainer supports tree-based models (decision trees, XGBoost)
    explainer = shap.TreeExplainer(model)
    explainers.append(explainer)

for model in kernel_models:
    # KernelExplainer is model-agnostic (used here for the decision tree; slow)
    explainer = shap.KernelExplainer(model.predict, data=X_train[:100])
    explainers.append(explainer)

# Compute SHAP values (KernelExplainer over the full test set can take a while)
shap_values_list = [explainer.shap_values(X_test) for explainer in explainers]

# Draw the SHAP summary plots; note there is one name per explainer built above
# (indexing by the three-element model_names list would over-run)
explainer_names = ['xgb_Model', 'tr_Model']
for name, shap_values in zip(explainer_names, shap_values_list):
    print(f"Model: {name}")
    shap.summary_plot(shap_values, X_test, plot_type="bar")
    shap.summary_plot(shap_values, X_test)

The output is as follows:

XGBoost model: [SHAP bar and beeswarm summary plots omitted]

Decision tree model: [SHAP bar and beeswarm summary plots omitted]

As the plots show, the four variables Age, NumOfProducts (number of products held at the bank), IsActiveMember (whether the customer is active), and Gender rank highest in feature importance for both of the stronger models and contribute most to the classification.

III. Complete Code

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.metrics import f1_score

#1. Read the data
train_df = pd.read_csv('train.csv')
train_df.head()

#2. Initial data processing
from sklearn.preprocessing import LabelEncoder
X = train_df.loc[:, ['CreditScore', 'Geography', 'Gender', 'Age', 'Tenure',
                      'Balance', 'NumOfProducts', 'HasCrCard', 'IsActiveMember']] # keep the useful feature columns
y = train_df['Exited']
print(X.head())
#print(y.head())

print(X['Geography'].unique())
encoder = LabelEncoder()
encoded_country = encoder.fit_transform(X['Geography'])  # encode the Geography column
encoded_gender = encoder.fit_transform(X['Gender'])
X['Geography'] = encoded_country
X['Gender'] = encoded_gender
print(X.head())

from collections import Counter
counts = Counter(y)
print(f"0 的个数: {counts[0]}")
print(f"1 的个数: {counts[1]}")
#3. Exploratory data analysis (visualization)
# Class distribution of the target variable in the training set
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False 
sns.countplot(data = train_df, x='Exited')  
plt.title('Class distribution in the training set')
plt.show()

# Correlation heatmap of the training-set features
plt.figure(figsize=(10, 8))
sns.heatmap(X.corr(), annot=True, fmt=".2f", cmap='coolwarm', linewidths=0.5)
plt.title('Feature correlation heatmap')
plt.show()
print("Descriptive statistics of the features:")
print(X.describe())

#4. Preprocessing (handle outliers and missing values)
#4.1 Missing values
# Check for missing values in the training data
print("Missing values in train data:\n", train_df.isnull().sum())

#4.2 Outliers
# Outlier detection (z-score method; computed here for inspection,
# the actual removal below uses the IQR rule)
z_scores_X = stats.zscore(X)
z_scores_y = stats.zscore(y)
abs_z_scores_X = np.abs(z_scores_X)
abs_z_scores_y = np.abs(z_scores_y)
for col in X.columns:
    plt.figure(figsize=(6, 4))
    sns.boxplot(x=X[col])
    plt.title(f'Boxplot of {col}')
    plt.show()
# Remove outliers using the IQR rule
def remove_outliers_iqr(df, factor=1.5):
    df_cleaned = df.copy()
    for col in df.columns:
        Q1 = df[col].quantile(0.25)
        Q3 = df[col].quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - factor * IQR
        upper_bound = Q3 + factor * IQR
        df_cleaned = df_cleaned[(df_cleaned[col] >= lower_bound) & (df_cleaned[col] <= upper_bound)]
    return df_cleaned
cleaned_df_X = remove_outliers_iqr(X)
cleaned_df_y = y[cleaned_df_X.index]
print("原始样本数量:", len(X))
print("清洗后样本数量:", len(cleaned_df_X))

#5. Split into training and test sets (80% train, 20% test)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(cleaned_df_X, cleaned_df_y, test_size=0.2, random_state=7)

#6. Oversample the training set
from imblearn.over_sampling import SMOTE
# Initialize the SMOTE oversampler
smote = SMOTE(random_state=21)
# Apply SMOTE oversampling
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)
# Print the shapes after oversampling
print("X_train shape after oversampling:", X_train_resampled.shape)
print("y_train shape after oversampling:", y_train_resampled.shape)

# Check the target distribution
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False 
plt.figure(figsize=(10, 5))
plt.subplot(1, 2, 1)
sns.countplot(x=y_train_resampled)
plt.title('Target distribution in the training set after SMOTE')
plt.tight_layout()
plt.show()

#7. Model training
#7.1 Initialize and train the decision tree model
tr_model = DecisionTreeClassifier(random_state=7)
tr_model.fit(X_train_resampled, y_train_resampled)
# Predicted probabilities and classes
y_prod_tr = tr_model.predict_proba(X_test)[:, 1]
y_pred_tr = tr_model.predict(X_test)
# Print the classification report
#print(classification_report(y_test, y_pred_tr))

#7.2 Initialize and train the logistic regression model
lr_model = LogisticRegression(max_iter=1000, random_state=7)
lr_model.fit(X_train_resampled, y_train_resampled)

# Predicted probabilities and classes
y_prod_lr = lr_model.predict_proba(X_test)[:, 1]
y_pred_lr = lr_model.predict(X_test)

# Print the classification report
#print(classification_report(y_test, y_pred_lr))

#7.3 Initialize and train the XGBoost model
from xgboost import XGBClassifier
# Initialize the XGBoost classifier
xgb_model = XGBClassifier(
    eval_metric='logloss',    # evaluation metric
    random_state=7            # fix the random seed
)
# Train the model
xgb_model.fit(X_train_resampled, y_train_resampled)

# Predicted probabilities and classes
y_prod_xgb = xgb_model.predict_proba(X_test)[:, 1]
y_pred_xgb = xgb_model.predict(X_test)

#7.4 Initialize and train the neural network model
from sklearn.preprocessing import StandardScaler
# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_resampled)
X_test_scaled = scaler.transform(X_test)

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam

# Build the model
model = Sequential([
    Dense(64, activation='relu', input_shape=(X_train_scaled.shape[1],)),
    Dense(32, activation='relu'),
    Dense(1, activation='sigmoid')  # output layer: sigmoid for binary classification
])

# Compile the model
model.compile(optimizer=Adam(learning_rate=0.001),
              loss='binary_crossentropy',
              metrics=['accuracy'])

# Inspect the model architecture
model.summary()

# Train the model
history = model.fit(X_train_scaled, y_train_resampled,
                    epochs=50,
                    batch_size=32,
                    validation_split=0.2,
                    verbose=1)
# Predicted probabilities on the test set
y_prod_nn = model.predict(X_test_scaled).flatten()

# Predicted classes (0.5 threshold)
y_pred_class = (y_prod_nn > 0.5).astype("int32")

# Print the classification report
#print(classification_report(y_test, y_pred_class))

#8. Plot ROC curves and confusion matrices
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt

models = [tr_model, lr_model, xgb_model]
model_names = ['tr_Model', 'lr_Model', 'xgb_Model']

plt.figure(figsize=(8,6))

for model, name in zip(models, model_names):
    y_pred_prob = model.predict_proba(X_test)[:, 1]
    fpr, tpr, _ = roc_curve(y_test, y_pred_prob)
    roc_auc = auc(fpr, tpr)
    plt.plot(fpr, tpr, lw=2, label='{} (AUC = {:.2f})'.format(name, roc_auc))
fpr, tpr, _ = roc_curve(y_test, y_prod_nn)  # use the NN's probabilities, not hard labels
roc_auc = auc(fpr, tpr)
plt.plot(fpr, tpr, lw=2, label='{} (AUC = {:.2f})'.format('nn_model', roc_auc))
plt.plot([0, 1], [0, 1], color='navy', linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC curves of the initial models')
plt.legend(loc="lower right")
plt.show()

from sklearn.metrics import confusion_matrix, recall_score
import seaborn as sns
import matplotlib.pyplot as plt

for model, name in zip(models, model_names):
    y_pred = model.predict(X_test)
    cm = confusion_matrix(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    plt.figure(figsize=(5,4))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
    plt.title(f'{name} confusion matrix\nF1 = {f1:.4f}')
    plt.xlabel('Predicted label')
    plt.ylabel('True label')
    plt.show()
cm = confusion_matrix(y_test, y_pred_class)
f1 = f1_score(y_test, y_pred_class)
plt.figure(figsize=(5,4))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.title(f'Neural network confusion matrix\nF1 = {f1:.4f}')
plt.xlabel('Predicted label')
plt.ylabel('True label')
plt.show()

#9. Hyperparameter tuning (cross-validated grid search)
from sklearn.model_selection import GridSearchCV, StratifiedKFold
param_grid_tr = {
        'max_depth': [2, 3, 5, 7, None],
        'min_samples_split': [2, 5, 10],
        'criterion': ['gini', 'entropy']
}

grid_search_tr = GridSearchCV(DecisionTreeClassifier(random_state=7), param_grid_tr, cv=5)
grid_search_tr.fit(X_train_resampled, y_train_resampled)
best_tr = grid_search_tr.best_estimator_
print("Best Decision Tree Parameters:", grid_search_tr.best_params_)

best_y_prod_tr = best_tr.predict_proba(X_test)[:, 1]
best_y_pred_tr = best_tr.predict(X_test)

param_grid_lr = {
        'C': [0.1, 1, 10],
        'penalty': ['l1', 'l2'],
        'solver': ['liblinear']
}

grid_search_lr = GridSearchCV(LogisticRegression(max_iter=1000, random_state=7), param_grid_lr, cv=5)
grid_search_lr.fit(X_train_resampled, y_train_resampled)
best_lr = grid_search_lr.best_estimator_
print("Best Logistic Regression Parameters:", grid_search_lr.best_params_)

best_y_prod_lr = best_lr.predict_proba(X_test)[:, 1]
best_y_pred_lr = best_lr.predict(X_test)

param_grid_xgb = {
    'n_estimators': [100, 200],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1]}
grid_search_xgb = GridSearchCV(XGBClassifier(eval_metric='logloss'),
                               param_grid_xgb,
                               scoring='recall',
                               cv=5,
                               verbose=1)
# Note: this search is fit on the un-resampled training split,
# unlike the decision tree and logistic regression searches above
grid_search_xgb.fit(X_train, y_train)
best_xgb = grid_search_xgb.best_estimator_
print("Best XGBoost Parameters:", grid_search_xgb.best_params_)
best_y_prod_xgb = best_xgb.predict_proba(X_test)[:, 1]
best_y_pred_xgb = best_xgb.predict(X_test)
best_xgb.save_model('xgboost_model.bin')  # save the model to disk

#10. Visualize the tuned models
best_models = [best_tr, best_lr, best_xgb]
for model, name in zip(best_models, model_names):
    y_pred_prob = model.predict_proba(X_test)[:, 1]
    fpr, tpr, _ = roc_curve(y_test, y_pred_prob)
    roc_auc = auc(fpr, tpr)
    plt.plot(fpr, tpr, lw=2, label='{} (AUC = {:.2f})'.format(name, roc_auc))
fpr, tpr, _ = roc_curve(y_test, y_prod_nn)  # use the NN's probabilities, not hard labels
roc_auc = auc(fpr, tpr)
plt.plot(fpr, tpr, lw=2, label='{} (AUC = {:.2f})'.format('nn_model', roc_auc))
plt.plot([0, 1], [0, 1], color='navy', linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic')
plt.legend(loc="lower right")
plt.show()

for model, name in zip(best_models, model_names):
    y_pred = model.predict(X_test)
    cm = confusion_matrix(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    plt.figure(figsize=(5,4))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
    plt.title(f'{name} confusion matrix\nF1 = {f1:.4f}')
    plt.xlabel('Predicted label')
    plt.ylabel('True label')
    plt.show()
cm = confusion_matrix(y_test, y_pred_class)
f1 = f1_score(y_test, y_pred_class)
plt.figure(figsize=(5,4))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.title(f'Neural network confusion matrix\nF1 = {f1:.4f}')
plt.xlabel('Predicted label')
plt.ylabel('True label')
plt.show()

import shap
plt.rcParams['font.sans-serif'] = ['DejaVu Sans']
tree_models = [best_xgb]
kernel_models = [best_tr]
# Build a list of explainer objects
explainers = []
for model in tree_models:
    # TreeExplainer supports tree-based models (decision trees, XGBoost)
    explainer = shap.TreeExplainer(model)
    explainers.append(explainer)

for model in kernel_models:
    # KernelExplainer is model-agnostic (used here for the decision tree; slow)
    explainer = shap.KernelExplainer(model.predict, data=X_train[:100])
    explainers.append(explainer)

# Compute SHAP values (KernelExplainer over the full test set can take a while)
shap_values_list = [explainer.shap_values(X_test) for explainer in explainers]

# Draw the SHAP summary plots; note there is one name per explainer built above
# (indexing by the three-element model_names list would over-run)
explainer_names = ['xgb_Model', 'tr_Model']
for name, shap_values in zip(explainer_names, shap_values_list):
    print(f"Model: {name}")
    shap.summary_plot(shap_values, X_test, plot_type="bar")
    shap.summary_plot(shap_values, X_test)

IV. Conclusions and Future Directions

Conclusions: this study built a bank customer churn prediction model using machine learning and, through systematic data analysis and model optimization, achieved the following:

A complete churn-prediction framework was constructed, covering data preprocessing, feature engineering, model training, and evaluation. Comparing several machine learning algorithms showed that the ensemble method (XGBoost) performed best in predictive accuracy and stability, reaching an AUC above 0.90, and the XGBoost classifier was therefore selected as the final churn model.

The key drivers of churn were identified, including account activity, the number of products held at the bank, customer age, and gender. SHAP analysis showed that account activity and age contribute most to the churn prediction, providing a concrete basis for targeted customer-retention strategies.

Future directions:

(1) Broader data: bring in more relevant data sources and collect customer behaviour over longer time spans.

(2) Model improvements: try deep learning models (e.g., LSTM, Transformer) for sequential behaviour data, and develop adaptive models that can adjust the prediction strategy dynamically.

(3) Deeper business integration: build an intelligent retention-strategy recommendation system, and explore applying the model to other financial scenarios such as credit risk and product recommendation.

This study provides a feasible technical approach and implementation path for bank customer churn prediction. As data accumulate and algorithms improve, the accuracy and practicality of such models will continue to rise, lending stronger support to the digital transformation of banking.
