[Kaggle Competition] Binary Prediction of Poisonous Mushrooms

时雨h

Licensed under CC 4.0 BY-SA

Categories: Databases, kaggle. Tags: machine learning, big data, artificial intelligence

Article link: https://blog.csdn.net/shaozheng0503/article/details/141471750

Binary Prediction of Poisonous Mushrooms

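The resources and links collected here cover several binary classification tasks; they are reference material and high-scoring solutions from Kaggle competitions. The key points are summarized below: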

Playground Season 4 Episode 8

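Main competition of interest: Binary Classification with a Bank Churn Dataset.
Dataset: reorganized and published for reference.
Popular solutions:

- LightGBM and CatBoost models (score 0.8945).
- XGBoost and random forest models.
- Neural network classification model.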

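Other related competitions and resources

Binary Prediction of Smoker Status using Bio-Signals

- EDA and feature engineering.
- XGBoost model.

Binary Classification with a Software Defects Dataset

- EDA and modeling.

Binary Classification of Machine Failures

- EDA, ensemble learning, ML pipeline, SHAP analysis.

Binary Classification with a Tabular Kidney Stone Prediction Dataset

- Comparison of multiple models.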

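Featured competitions

- American Express - Default Prediction
  - Feature engineering and a LightGBM model.
- Home Credit Default Risk
  - Complete EDA and feature importance analysis.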

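Competition metric - Matthews correlation coefficient

Definition: a measure of the quality of a binary classifier's output.
Resources:

- The Wikipedia page on the Phi coefficient.
- The Voxco blog article on the Matthews correlation coefficient.
- A paper on the use of the Matthews correlation coefficient in biological data mining.
- The Scikit-learn documentation on the Matthews correlation coefficient.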
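
To make the definition concrete, here is a minimal sketch that computes the coefficient from the four confusion-matrix counts using the standard formula MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). The function names and the toy labels are hypothetical and serve only as an illustration; in practice `sklearn.metrics.matthews_corrcoef` would normally be used instead.

```python
import math


def mcc_from_counts(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention used in this sketch: the score is 0 when a marginal total is zero.
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


def mcc(y_true, y_pred):
    """MCC for two equal-length sequences of 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = pairs.count((1, 1))  # true positives
    tn = pairs.count((0, 0))  # true negatives
    fp = pairs.count((0, 1))  # false positives
    fn = pairs.count((1, 0))  # false negatives
    return mcc_from_counts(tp, tn, fp, fn)


# Toy labels (hypothetical): 1 = poisonous, 0 = edible
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
print(round(mcc(y_true, y_pred), 4))  # 7 / 15, i.e. 0.4667
```

The score ranges from -1 (every prediction wrong) through 0 (no better than chance) to +1 (every prediction right), and it uses all four cells of the confusion matrix.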


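Step 1: Load the data

import pandas as pd

# Load the training data
train_data = pd.read_csv('train.csv')

# Show the first few rows to get a feel for the data structure
print(train_data.head())

# Inspect basic information about the data (info() prints its report itself)
train_data.info()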

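Step 2: Data exploration and visualization

In this step we explore the data in more depth and use visualization to better understand the feature distributions and how each feature relates to the target.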

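import matplotlib.pyplot as plt
import seaborn as sns

# Count how many mushrooms fall into each class
print(train_data['class'].value_counts())

# Plot the class counts
plt.figure(figsize=(8, 6))
sns.countplot(x='class', data=train_data)
plt.title('Distribution of Mushroom Classes')
plt.show()

# Plot each categorical feature against the target.
# The target itself is skipped, and count plots only make sense for categorical columns.
categorical_cols = [col for col in train_data.select_dtypes(include=['object']).columns if col != 'class']
fig, axs = plt.subplots(5, 5, figsize=(20, 20))
axs = axs.flatten()
for ax, col in zip(axs, categorical_cols):  # zip stops once the 25 panels are used
    sns.countplot(x=col, hue='class', data=train_data, ax=ax)
    ax.set_title(f'Distribution of {col} by Class')
plt.tight_layout()
plt.show()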

Step 3: Data preprocessing

Next we preprocess the data, which includes encoding the features and any other necessary transformations.

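from sklearn.preprocessing import LabelEncoder

# Encode the categorical features.
# One encoder is fitted per column and kept in a dict, so the same mapping
# can be applied to the test data later.
encoders = {}

# Loop over all non-numeric columns (this includes the target 'class')
for col in train_data.select_dtypes(include=['object']).columns:
    encoders[col] = LabelEncoder()
    # Missing values are treated as a category of their own
    train_data[col] = encoders[col].fit_transform(train_data[col].fillna('missing').astype(str))

# Inspect the encoded data
print(train_data.head())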
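
The following dependency-free sketch illustrates why each categorical column needs its own mapping fitted on the training data, and one possible way to handle categories that appear only in the test data. The helper names, the toy columns, and the fallback code -1 are assumptions made for this illustration and are not part of the original workflow.

```python
def fit_category_codes(values):
    """Map each distinct non-missing category to an integer code, in sorted order."""
    categories = sorted({v for v in values if v is not None})
    return {category: code for code, category in enumerate(categories)}


def encode(values, mapping, unknown=-1):
    """Encode values with a fitted mapping; unseen or missing values get `unknown`."""
    return [mapping.get(v, unknown) for v in values]


# Toy data (hypothetical): two categorical columns with different vocabularies
train = {"cap-shape": ["x", "f", "x", "b"], "season": ["a", "w", "a", "s"]}
test = {"cap-shape": ["f", "z", None], "season": ["w", "a", "u"]}

# Fit one mapping per column on the training data only
mappings = {col: fit_category_codes(vals) for col, vals in train.items()}

# Apply the training mappings to the test data
encoded_test = {col: encode(vals, mappings[col]) for col, vals in test.items()}
print(encoded_test)
```

Because each column has its own mapping, the code for "a" in `season` never collides with a code in `cap-shape`, and the same category gets the same integer in the training and test data.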

Step 4: Build the models

In this step we build and train a LightGBM model and a CatBoost model.

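from sklearn.model_selection import train_test_split

# Separate the features from the target.
# 'id' is dropped too, because it is removed from the test features before prediction
# (train.csv is assumed to carry the same 'id' column as test.csv).
X = train_data.drop(['id', 'class'], axis=1)
y = train_data['class']

# Split into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)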

# Define the LightGBM parameters
lgb_params = {
    'objective': 'binary',
    'metric': 'auc',
    'verbosity': -1,
    'boosting_type': 'gbdt',
    'num_leaves': 31,
    'learning_rate': 0.05,
    'feature_fraction': 0.9,
    'bagging_fraction': 0.8,
    'bagging_freq': 5,
    'lambda_l1': 0.1,
    'lambda_l2': 0.1
}

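import lightgbm as lgb

# Create the LightGBM datasets
lgb_train = lgb.Dataset(X_train, y_train)
lgb_val = lgb.Dataset(X_val, y_val, reference=lgb_train)

# Train the LightGBM model.
# Early stopping is passed as a callback; the early_stopping_rounds keyword
# was removed from lgb.train() in LightGBM 4.0.
lgb_model = lgb.train(
    lgb_params,
    lgb_train,
    num_boost_round=1000,
    valid_sets=[lgb_val],
    callbacks=[lgb.early_stopping(stopping_rounds=100)],
)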

# Define the CatBoost parameters
cb_params = {
    'loss_function': 'Logloss',
    'eval_metric': 'AUC',
    'learning_rate': 0.05,
    'depth': 6,
    'l2_leaf_reg': 10,
    'bootstrap_type': 'Bayesian',
    'bagging_temperature': 0.2,
    'random_seed': 42,
    'allow_writing_files': False
}

from catboost import CatBoostClassifier, Pool

# Create the CatBoost datasets
cb_train = Pool(X_train, y_train)
cb_val = Pool(X_val, y_val)

# Train the CatBoost model on the pools created above
cb_model = CatBoostClassifier(**cb_params)
cb_model.fit(cb_train, eval_set=cb_val, use_best_model=True, verbose=False)
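
Both models are tuned on AUC or log loss, while the competition is scored on the Matthews correlation coefficient, so the default 0.5 cut-off is not necessarily the best one. The sketch below shows one possible way to pick a threshold on validation probabilities. It is plain Python with hypothetical toy data and helper names, not the author's method; with real models the probabilities would come from the validation predictions.

```python
import math


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for 0/1 labels (0 if undefined)."""
    pairs = list(zip(y_true, y_pred))
    tp, tn = pairs.count((1, 1)), pairs.count((0, 0))
    fp, fn = pairs.count((0, 1)), pairs.count((1, 0))
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denominator if denominator else 0.0


def best_threshold(y_true, probabilities, candidates):
    """Return (threshold, score) with the highest MCC; ties keep the earliest candidate."""
    best = (None, -2.0)  # MCC is never below -1, so any candidate beats this
    for threshold in candidates:
        y_pred = [int(p > threshold) for p in probabilities]
        score = mcc(y_true, y_pred)
        if score > best[1]:
            best = (threshold, score)
    return best


# Hypothetical validation labels and predicted probabilities of class 1
y_val = [0, 0, 0, 0, 0, 1, 1, 1, 1]
val_probabilities = [0.05, 0.10, 0.20, 0.32, 0.41, 0.38, 0.70, 0.85, 0.95]

candidates = [i / 20 for i in range(1, 20)]  # 0.05, 0.10, ..., 0.95
threshold, score = best_threshold(y_val, val_probabilities, candidates)
print(threshold, round(score, 4))
```

A threshold chosen this way is fitted to the validation split, so it can overfit; checking it on a second hold-out split or with cross-validation is a sensible precaution.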

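Step 5: Predict on the test data

Finally, we use the trained models to predict on the test data and arrange the predictions in the format required for a Kaggle submission.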

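# Load the test data
test_data = pd.read_csv('test.csv')

# Apply the same feature encoding to the test data.
# Each column reuses the encoder fitted on the matching training column
# (the `encoders` dict built during preprocessing); categories that never
# appeared in training are encoded as -1.
for col in test_data.select_dtypes(include=['object']).columns:
    mapping = {label: code for code, label in enumerate(encoders[col].classes_)}
    test_data[col] = test_data[col].fillna('missing').astype(str).map(mapping).fillna(-1).astype(int)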

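# Keep the test IDs
test_ids = test_data['id']

# Drop the column that is not a feature
test_features = test_data.drop('id', axis=1)

# Predict the probability of class 1 with both models
predictions_lgb = lgb_model.predict(test_features, num_iteration=lgb_model.best_iteration)
# CatBoost's predict() returns class labels by default, so predict_proba() is used here
predictions_cb = cb_model.predict_proba(test_features)[:, 1]

# Convert the probabilities to binary labels
predictions_lgb_binary = (predictions_lgb > 0.5).astype(int)
predictions_cb_binary = (predictions_cb > 0.5).astype(int)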

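# Build the submission (this uses the LightGBM predictions)
submission_df = pd.DataFrame({'id': test_ids, 'class': predictions_lgb_binary})
# Map the encoded labels back to the original classes (1 = 'p', 0 = 'e')
submission_df['class'] = submission_df['class'].map({1: 'p', 0: 'e'})

# Save the predictions
submission_df.to_csv('submission.csv', index=False)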
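
The submission above uses only the LightGBM predictions, although CatBoost predictions are computed as well. One common option, shown here purely as a hedged sketch and not as the author's approach, is to average the two models' probabilities before thresholding. The function names, the equal weighting, and the toy probabilities are assumptions made for this illustration.

```python
def blend_probabilities(prob_a, prob_b, weight_a=0.5):
    """Weighted average of two models' positive-class probabilities."""
    if len(prob_a) != len(prob_b):
        raise ValueError("both models must score the same rows")
    return [weight_a * a + (1.0 - weight_a) * b for a, b in zip(prob_a, prob_b)]


def to_labels(probabilities, threshold=0.5):
    """Map probabilities to the submission labels: 'p' above the threshold, else 'e'."""
    return ['p' if p > threshold else 'e' for p in probabilities]


# Hypothetical probabilities of class 1 ('p') from the two models for four test rows
lgb_probabilities = [0.92, 0.40, 0.10, 0.55]
cb_probabilities = [0.88, 0.70, 0.20, 0.35]

blended = blend_probabilities(lgb_probabilities, cb_probabilities)
labels = to_labels(blended)
print(labels)
```

Whether blending helps depends on how differently the two models err, so it is worth comparing the blended validation score with each single model before submitting.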

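The complete code example below solves the "Binary Prediction of Poisonous Mushrooms" problem. It uses the LightGBM and CatBoost models for prediction and follows the steps discussed above.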

First, the required libraries need to be installed. If they are not installed yet, the following command installs them:

pip install pandas numpy scikit-learn lightgbm catboost matplotlib seaborn plotly

Here is the complete code example:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
import lightgbm as lgb
from catboost import CatBoostClassifier
from sklearn.metrics import matthews_corrcoef

# Ignore warnings
import warnings
warnings.filterwarnings('ignore')

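# Load the data
train_data = pd.read_csv('train.csv')
test_data = pd.read_csv('test.csv')

# Data preprocessing
def preprocess_data(train, test):
    # Fit one LabelEncoder per categorical column on the training data and reuse
    # its mapping for the test data, so both share the same integer codes.
    for col in train.select_dtypes(include=['object']).columns:
        encoder = LabelEncoder()
        # Missing values are treated as a category of their own
        train[col] = encoder.fit_transform(train[col].fillna('missing').astype(str))
        if col in test.columns:
            mapping = {label: code for code, label in enumerate(encoder.classes_)}
            # Categories never seen in training are encoded as -1
            test[col] = test[col].fillna('missing').astype(str).map(mapping).fillna(-1).astype(int)
    return train, test

# Preprocess the training and test data together
train_data, test_data = preprocess_data(train_data, test_data)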

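# Split the data.
# 'id' is dropped as well so the training features match the test features
# (train.csv is assumed to carry the same 'id' column as test.csv).
X = train_data.drop(['id', 'class'], axis=1)
y = train_data['class']
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)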

# Define the LightGBM parameters
lgb_params = {
    'objective': 'binary',
    'metric': 'binary_logloss',
    'verbosity': -1,
    'boosting_type': 'gbdt',
    'num_leaves': 31,
    'learning_rate': 0.05,
    'feature_fraction': 0.9,
    'bagging_fraction': 0.8,
    'bagging_freq': 5,
    'lambda_l1': 0.1,
    'lambda_l2': 0.1
}

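# Create the LightGBM datasets
lgb_train = lgb.Dataset(X_train, y_train)
lgb_val = lgb.Dataset(X_val, y_val, reference=lgb_train)

# Train the LightGBM model.
# Early stopping is passed as a callback; the early_stopping_rounds keyword
# was removed from lgb.train() in LightGBM 4.0.
lgb_model = lgb.train(
    lgb_params,
    lgb_train,
    num_boost_round=1000,
    valid_sets=[lgb_val],
    callbacks=[lgb.early_stopping(stopping_rounds=100)],
)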

# Define the CatBoost parameters
cb_params = {
    'loss_function': 'Logloss',
    'eval_metric': 'AUC',
    'learning_rate': 0.05,
    'depth': 6,
    'l2_leaf_reg': 10,
    'bootstrap_type': 'Bayesian',
    'bagging_temperature': 0.2,
    'random_seed': 42,
    'allow_writing_files': False
}

# Train the CatBoost model
cb_model = CatBoostClassifier(**cb_params)
cb_model.fit(X_train, y_train, eval_set=(X_val, y_val), use_best_model=True, verbose=False)

# Predict on the test data
test_ids = test_data['id']
test_features = test_data.drop('id', axis=1)

# Predict with LightGBM (probability of class 1)
predictions_lgb = lgb_model.predict(test_features, num_iteration=lgb_model.best_iteration)
predictions_lgb_binary = (predictions_lgb > 0.5).astype(int)

# Predict with CatBoost; predict() returns class labels by default,
# so predict_proba() is used to get the probability of class 1
predictions_cb = cb_model.predict_proba(test_features)[:, 1]
predictions_cb_binary = (predictions_cb > 0.5).astype(int)

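# Evaluate both models on the validation set with the competition metric
val_pred_lgb = (lgb_model.predict(X_val, num_iteration=lgb_model.best_iteration) > 0.5).astype(int)
val_pred_cb = (cb_model.predict_proba(X_val)[:, 1] > 0.5).astype(int)
mcc_lgb = matthews_corrcoef(y_val, val_pred_lgb)
mcc_cb = matthews_corrcoef(y_val, val_pred_cb)

print("LightGBM Matthews Correlation Coefficient: ", mcc_lgb)
print("CatBoost Matthews Correlation Coefficient: ", mcc_cb)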

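# Build the submission (this uses the LightGBM predictions)
submission_df = pd.DataFrame({'id': test_ids, 'class': predictions_lgb_binary})
# Map the encoded labels back to the original classes (1 = 'p', 0 = 'e')
submission_df['class'] = submission_df['class'].map({1: 'p', 0: 'e'})

# Save the predictions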
submission_df.to_csv('submission.csv', index=Fal