
In-Depth Analysis of SVM and Decision Trees in Image Processing

This resource is intended for developers and researchers interested in image processing and machine learning, classification techniques in particular.

1. SVM image classification: An SVM is fundamentally a binary classifier, defined as the linear classifier that maximizes the margin in feature space; its learning strategy is margin maximization, and with kernel functions it extends to non-linear problems. In image classification, features are first extracted from each image (e.g. colour histograms, texture, shape descriptors), and the SVM is then trained on those features to recognize and classify the images.

2. Decision-tree classification: A decision tree is a basic method for both classification and regression. It partitions the dataset with a tree structure: each internal node tests one attribute of the data, each branch corresponds to one outcome of that test, and each leaf node represents a classification result. In image classification, a decision tree can likewise be trained on the extracted image features.

3. Image compression: Compression reduces the storage an image requires by removing redundant information from the image data. The two broad families are lossless and lossy compression: lossless compression reconstructs the original image data exactly, while lossy compression trades some image quality for a higher compression ratio. Compression is a fundamental technique in digital image processing, widely used in network transmission and storage optimization.

4. Image resampling: Resampling recomputes an image's pixel values on a new sampling grid, typically to change the image's resolution or to meet a specific output requirement. It relies on interpolation algorithms such as nearest-neighbour, bilinear, and bicubic interpolation. Choosing an appropriate resampling method preserves image quality and reduces aliasing ("jaggies") and blurring.

5. Median filtering: The median filter is a classic non-linear filter for removing image noise, particularly salt-and-pepper noise. It replaces each pixel's value with the median of the pixel values in its neighbourhood. Because the median does not noticeably blur edges, the filter suppresses noise while preserving edge information.

These techniques are widely applied in fields such as medical imaging analysis, satellite remote-sensing image processing, and artificial-intelligence vision systems. The archive's file listing shows a single entry named 'code' with no file extension, which may mean the file has already been unpacked or that users must assign an extension themselves.
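As a rough illustration of points 1 and 2, the sketch below trains an SVM and a decision tree on scikit-learn's built-in digits dataset, which stands in for the feature-extraction step described above; the hyperparameters are illustrative choices, not taken from the resource.

```python
# Sketch: SVM vs. decision tree on image data (digits as a stand-in
# for extracted image features such as colour histograms or texture).
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

digits = load_digits()  # 8x8 grayscale images, already flattened to vectors
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.3, random_state=0)

# RBF kernel lets the SVM handle the non-linear class boundaries
svm = SVC(kernel="rbf", gamma="scale").fit(X_train, y_train)
tree = DecisionTreeClassifier(max_depth=10, random_state=0).fit(X_train, y_train)

svm_acc = accuracy_score(y_test, svm.predict(X_test))
tree_acc = accuracy_score(y_test, tree.predict(X_test))
print(f"SVM accuracy:  {svm_acc:.3f}")
print(f"Tree accuracy: {tree_acc:.3f}")
```

On this dataset the kernel SVM typically outperforms a single tree, matching the margin-maximization argument above.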
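The median filter in point 5 can be sketched with `scipy.ndimage.median_filter`; the flat grey test image and 5% noise level here are invented purely for illustration.

```python
# Sketch: removing salt-and-pepper noise with a 3x3 median filter.
import numpy as np
from scipy.ndimage import median_filter

rng = np.random.default_rng(0)
img = np.full((64, 64), 128, dtype=np.uint8)   # flat grey test image
noisy = img.copy()
mask = rng.random(img.shape) < 0.05            # corrupt ~5% of the pixels
noisy[mask] = rng.choice([0, 255], size=mask.sum())

# Each pixel is replaced by the median of its 3x3 neighbourhood
denoised = median_filter(noisy, size=3)

print("mean abs error, noisy:   ", np.abs(noisy.astype(int) - 128).mean())
print("mean abs error, denoised:", np.abs(denoised.astype(int) - 128).mean())
```

Because isolated extreme values never win the neighbourhood median, nearly all corrupted pixels are restored, while a mean filter would smear them into their surroundings.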

Related resources


import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from xgboost import XGBClassifier
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import TomekLinks
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, mutual_info_classif, VarianceThreshold, RFE
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

df = pd.read_excel(r'C:\Users\14576\Desktop\计算机资料\石波-乳腺癌\Traintest1.xlsx')
data = np.array(df)
X = data[:, 1:]
y = data[:, 0]

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('resample', SMOTE(sampling_strategy=0.8, k_neighbors=3, random_state=42)),  # over-sampling first
    ('clean', TomekLinks(sampling_strategy='majority')),                         # under-sampling after
    ('variance_threshold', VarianceThreshold(threshold=0.15)),
    ('pca', PCA(n_components=0.90)),
    ('rfe', RFE(estimator=RandomForestClassifier(), step=0.2, n_features_to_select=8)),
    ('model', AdaBoostClassifier(
        n_estimators=200,
        learning_rate=0.1,
        estimator=DecisionTreeClassifier(max_depth=1),
        random_state=42
    ))  # model last
])
# alternative: ('resample', SMOTE(sampling_strategy=0.7, k_neighbors=5, random_state=42))

kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
metrics = {'Accuracy': [], 'Precision': [], 'Recall': [], 'F1': [], 'AUC': []}

for train_idx, val_idx in kf.split(X, y):
    X_train, X_val = X[train_idx], X[val_idx]
    y_train, y_val = y[train_idx], y[val_idx]
    # train and predict
    pipeline.fit(X_train, y_train)
    y_pred = pipeline.predict(X_val)
    y_proba = pipeline.predict_proba(X_val)[:, 1]
    # record the metrics
    metrics['Accuracy'].append(accuracy_score(y_val, y_pred))
    metrics['Precision'].append(precision_score(y_val, y_pred))
    metrics['Recall'].append(recall_score(y_val, y_pred))
    metrics['F1'].append(f1_score(y_val, y_pred))
    metrics['AUC'].append(roc_auc_score(y_val, y_proba))

for metric, values in metrics.items():
    print(f"{metric}: {np.mean(values):.4f}")

# Retrain the model on the entire training set (all available data)
pipeline.fit(X, y)

test_df = pd.read_excel(r'C:\Users\14576\Desktop\计算机资料\石波-乳腺癌\Testtest1.xlsx')  # adjust to the actual path
test_data = np.array(test_df)
X_test = test_data[:, 1:]
y_test = test_data[:, 0]

# predict on the test set
y_pred = pipeline.predict(X_test)
y_proba = pipeline.predict_proba(X_test)[:, 1]  # probability of the positive class

# compute the evaluation metrics
test_metrics = {
    'Accuracy': accuracy_score(y_test, y_pred),
    'Precision': precision_score(y_test, y_pred),
    'Recall': recall_score(y_test, y_pred),
    'F1': f1_score(y_test, y_pred),
    'AUC': roc_auc_score(y_test, y_proba)
}
for metric, value in test_metrics.items():
    print(f"{metric}: {value:.4f}")

Please optimize the model's metrics on the test set; the training and test sets together total nearly 200 samples.
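One hedged suggestion for the ~200-sample setting described above: a single 5-fold split yields noisy estimates, so repeated stratified cross-validation may give a more stable picture of the pipeline's performance. A minimal sketch on synthetic data follows; the logistic-regression baseline, the synthetic dataset, and all parameters are illustrative, not taken from the original pipeline.

```python
# Sketch: repeated stratified CV for a small (~200-sample) dataset.
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the Excel data above
X, y = make_classification(n_samples=200, n_features=20,
                           weights=[0.6, 0.4], random_state=42)

pipe = Pipeline([("scaler", StandardScaler()),
                 ("clf", LogisticRegression(max_iter=1000))])

# 5 folds x 10 repeats = 50 scores, averaging out split-to-split noise
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=42)
scores = cross_val_score(pipe, X, y, cv=cv, scoring="roc_auc")
print(f"AUC: {scores.mean():.3f} ± {scores.std():.3f}")
```

Reporting the spread across repeats makes it easier to tell a real improvement from fold-assignment luck.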


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.linear_model import Ridge
from sklearn.utils import resample
import warnings

# Use the SimHei font so CJK plot labels render correctly
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False

# Suppress specific warnings
warnings.filterwarnings("ignore", message="`distplot` is a deprecated function")
warnings.filterwarnings("ignore", category=UserWarning)

# Load the data and add a type label
red_wine = pd.read_csv("winequality-red.csv", delimiter=";")
white_wine = pd.read_csv("winequality-white.csv", delimiter=";")

# Type column (red: 0, white: 1)
red_wine['type'] = 0
white_wine['type'] = 1

# ID columns
red_wine['ID'] = red_wine.index
white_wine['ID'] = white_wine.index

# Merge the datasets
wine_data = pd.concat([red_wine, white_wine], axis=0)

# Use only the white-wine rows
white_data = wine_data[wine_data['type'] == 1].copy()

# ============= Fixed code below =============
# Histograms to inspect the feature distributions
columns = white_data.columns.tolist()[2:]
fig = plt.figure(figsize=(15, 10))
for i in range(12):
    plt.subplot(3, 4, i + 1)
    plt.hist(white_data[columns[i]], bins=20, edgecolor='black', color='orange')
    # Feature name as a title placed below the subplot
    plt.title(columns[i], y=-0.2, fontsize=15)
    # Remove axis labels to save space
    plt.xlabel('')
    plt.ylabel('')
plt.tight_layout(pad=3.0)  # widen the spacing between subplots
plt.suptitle('Feature distribution histograms', fontsize=20)
plt.show()

# Box plots to check for outliers
fig = plt.figure(figsize=(20, 15))
for i in range(12):
    plt.subplot(3, 4, i + 1)
    sns.boxplot(y=white_data[columns[i]], width=0.5, color='forestgreen')
    plt.title(columns[i], y=-0.15, fontsize=12)
    plt.ylabel('')
plt.tight_layout(pad=3.0)
plt.suptitle('Feature box plots (outlier check)', fontsize=15)
plt.show()

# Check how close each feature is to a normal distribution
cols = 6
rows = len(columns)
plt.figure(figsize=(4 * cols, 4 * rows))
plt.suptitle('Normality check', fontsize=30)
i = 0
for col in columns:
    i += 1
    ax = plt.subplot(rows, cols, i)
    # histplot replaces the deprecated distplot
    sns.histplot(white_data[col], kde=True, stat="density", linewidth=0)
    # Overlay a fitted normal curve
    mu, std = stats.norm.fit(white_data[col])
    xmin, xmax = plt.xlim()
    x = np.linspace(xmin, xmax, 100)
    p = stats.norm.pdf(x, mu, std)
    plt.plot(x, p, 'r', linewidth=2)
    plt.title(f"{col} distribution", fontsize=12)

# Heatmap of the feature correlations
pd.set_option('display.max_columns', 4)
pd.set_option('display.max_rows', 4)
data1 = white_data.drop(['ID', 'type'], axis=1)
train_corr = data1.corr()
print("Feature correlation matrix:\n", train_corr)
plt.figure(figsize=(15, 10))
sns.heatmap(train_corr, vmax=.8, square=True, annot=True)
plt.title('Feature correlation heatmap', fontsize=20)
plt.tight_layout()
plt.show()

# Preprocessing: type and ID carry no information, drop them
white_data = white_data.drop(['ID', 'type'], axis=1)
print("\nShape after dropping ID and type:", white_data.shape)

# Handle missing values in the white-wine data
print("\nHandling missing values...")
# citric acid, chlorides, sulphates and pH have small ranges: fill with the median
for col in ['citric acid', 'chlorides', 'pH', 'sulphates']:
    median_val = white_data[col].median()
    white_data[col] = white_data[col].fillna(median_val)
    print(f"  Column '{col}' filled with median: {median_val:.4f}")

# residual sugar has only two gaps: drop those rows
initial_count = len(white_data)
white_data.dropna(subset=['residual sugar'], inplace=True)
final_count = len(white_data)
print(f"  Dropped {initial_count - final_count} rows with missing 'residual sugar'")

# Reset the index
white_data.index = range(white_data.shape[0])
print(f"Samples remaining after missing-value handling: {len(white_data)}")

# Impute fixed acidity and volatile acidity with a random forest
data2 = white_data.copy()
for i in ['fixed acidity', 'volatile acidity']:
    # Build a new feature matrix and target
    y = data2[i]
    x = data2.drop(columns=[i])
    # Fill remaining gaps in the feature matrix with zeros
    x_imputed = SimpleImputer(missing_values=np.nan, strategy='constant',
                              fill_value=0).fit_transform(x)
    # Split into rows with and without the target value
    Ytrain = y[y.notnull()]
    Ytest = y[y.isnull()]
    if len(Ytest) > 0:  # only if there is something to impute
        print(f"\nImputing '{i}' with a random forest...")
        print(f"  Missing values: {len(Ytest)}")
        Xtrain = x_imputed[Ytrain.index, :]
        Xtest = x_imputed[Ytest.index, :]
        # Random-forest regression fills in the missing values
        rfc = RandomForestRegressor(n_estimators=100, random_state=42)
        rfc.fit(Xtrain, Ytrain)
        predict = rfc.predict(Xtest)
        # Write the imputed values back into the original feature matrix
        white_data.loc[white_data[i].isnull(), i] = predict
        print("  Done")

print("\nData after missing-value handling:")
print(white_data.info())

# Outlier handling: drop rows outside the IQR fences
def Drop_outliers(data):
    print("\nHandling outliers...")
    # Columns to check for outliers
    columns = ['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
               'chlorides', 'free sulfur dioxide', 'total sulfur dioxide',
               'density', 'pH', 'sulphates']
    initial_count = len(data)
    for col in columns:
        # Skip zero-variance columns
        if data[col].std() == 0:
            print(f"  Skipping zero-variance column '{col}'")
            continue
        # Compute the IQR fences
        q1 = data[col].quantile(0.25)
        q3 = data[col].quantile(0.75)
        iqr = q3 - q1
        bottom = q1 - 1.5 * iqr
        upper = q3 + 1.5 * iqr
        # Flag the outliers
        outliers = (data[col] < bottom) | (data[col] > upper)
        num_outliers = outliers.sum()
        # Only act when outliers exist
        if num_outliers > 0:
            data.loc[outliers, col] = np.nan   # mark outliers as missing
            data.dropna(subset=[col], inplace=True)  # drop those rows
            data.reset_index(drop=True, inplace=True)
            print(f"  Column '{col}': removed {num_outliers} outliers")
    final_count = len(data)
    print(f"Outlier handling done: removed {initial_count - final_count} rows, "
          f"{final_count} samples left")
    return data

white_data = Drop_outliers(white_data.copy())

# Detect outliers from the residuals of a fitted model
def fine_outliers(model, X, y, sigma=3):
    try:
        y_pred = pd.Series(model.predict(X), index=y.index)
    except Exception:
        model.fit(X, y)
        y_pred = pd.Series(model.predict(X), index=y.index)
    # Residuals, their mean and standard deviation
    resid = y - y_pred
    mean_resid = resid.mean()
    std_resid = resid.std()
    # Guard against zero standard deviation
    if std_resid == 0:
        print("Warning: residual std is zero, cannot compute z-scores")
        return
    # z statistics
    z = (resid - mean_resid) / std_resid
    outliers = z[abs(z) > sigma].index
    num_outliers = len(outliers)
    print(f"Detected {num_outliers} outliers (|z| > {sigma})")
    plt.figure(figsize=(15, 4))
    plt.suptitle('Outlier detection', fontsize=16)
    ax1 = plt.subplot(1, 3, 1)
    plt.plot(y, y_pred, '.')
    plt.plot(y.loc[outliers], y_pred.loc[outliers], 'ro')
    plt.legend(['normal', 'outlier'])
    plt.xlabel('actual')
    plt.ylabel('predicted')
    plt.title('Actual vs. predicted')
    ax2 = plt.subplot(1, 3, 2)
    plt.plot(y, y - y_pred, '.')
    plt.plot(y.loc[outliers], y.loc[outliers] - y_pred.loc[outliers], 'ro')
    plt.legend(['normal', 'outlier'])
    plt.xlabel('actual')
    plt.ylabel('residual')
    plt.title('Residual analysis')
    ax3 = plt.subplot(1, 3, 3)
    plt.hist(z, bins=50, alpha=0.7)
    plt.hist(z.loc[outliers], bins=50, color='r', alpha=0.7)
    plt.legend(['normal', 'outlier'])
    plt.xlabel('z')
    plt.title('Z-score distribution')
    plt.tight_layout()
    plt.show()

# Find outliers with a ridge-regression model
if not white_data.empty:
    X_train = white_data.iloc[:, :-1]
    y_train = white_data.iloc[:, -1]
    # Sanity-check the data first
    if X_train.shape[1] > 0 and not y_train.empty:
        print("\nDetecting outliers with ridge regression...")
        fine_outliers(Ridge(), X_train, y_train)

# Resample the minority classes to balance the data
quality_values = white_data['quality'].unique()
quality_counts = white_data['quality'].value_counts().sort_index()
print("\nQuality distribution before resampling:")
print(quality_counts)

samples_dict = {}
target_count = 1400
print("\nResampling to balance the classes...")
for quality in quality_counts.index:
    df_quality = white_data[white_data['quality'] == quality]
    count = len(df_quality)
    if count < target_count:
        # Up-sample minority classes
        samples_dict[quality] = resample(df_quality, replace=True,
                                         n_samples=target_count, random_state=42)
        print(f"  Quality {quality}: up-sampled {count} → {target_count}")
    elif count > target_count:
        # Down-sample majority classes
        samples_dict[quality] = resample(df_quality, replace=False,
                                         n_samples=target_count, random_state=42)
        print(f"  Quality {quality}: down-sampled {count} → {target_count}")
    else:
        samples_dict[quality] = df_quality
        print(f"  Quality {quality}: unchanged at {count}")

# Merge the resampled data
new_white_data = pd.concat(samples_dict.values()).reset_index(drop=True)

# Plot the new class counts
plt.figure(figsize=(10, 6))
new_counts = new_white_data['quality'].value_counts().sort_index()
plt.bar(new_counts.index, new_counts.values, color='yellowgreen', edgecolor='black')
plt.title('Quality distribution after resampling', fontsize=16)
plt.xlabel('quality score', fontsize=12)
plt.ylabel('sample count', fontsize=12)
plt.xticks(new_counts.index)
plt.grid(axis='y', alpha=0.5)
# Value labels above the bars
for i, v in enumerate(new_counts.values):
    plt.text(new_counts.index[i], v + 20, str(v), ha='center', fontsize=10)
plt.tight_layout()
plt.show()
print("\nQuality distribution after resampling:")
print(new_counts)

# Split features/labels, then training and test sets
X = new_white_data.iloc[:, :-1]
y = new_white_data.iloc[:, -1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
print(f"\nSplit: {X_train.shape[0]} training samples, {X_test.shape[0]} test samples")

# Standardize
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
X_train = pd.DataFrame(X_train_scaled, columns=X_train.columns)
X_test = pd.DataFrame(X_test_scaled, columns=X_test.columns)
print("Standardization done")

# Decision-tree model
print("\nTraining the decision tree...")
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)
train_acc = clf.score(X_train, y_train)
print(f"Training accuracy: {train_acc:.4f}")

# Predict on the test set
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Test accuracy: {accuracy:.4f}")

# 10-fold cross-validation for the decision tree
print("\nRunning 10-fold cross-validation...")
clf_scores = cross_val_score(clf, X, y, cv=10)
print(f"CV accuracies: {clf_scores}")
print(f"Mean CV accuracy: {clf_scores.mean():.4f} ± {clf_scores.std():.4f}")

# ============= Confusion matrix =============
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)
# Unique class labels (quality scores)
classes = sorted(np.unique(np.concatenate((y_test, y_pred))))

# Raw-count heatmap
plt.figure(figsize=(12, 10))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=classes, yticklabels=classes,
            cbar_kws={'label': 'count'})
plt.title('Confusion matrix - wine quality prediction', fontsize=16)
plt.xlabel('predicted quality', fontsize=14)
plt.ylabel('true quality', fontsize=14)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12, rotation=0)
# Annotate with the test accuracy
plt.figtext(0.5, 0.01, f'Test accuracy: {accuracy:.4f}',
            ha="center", fontsize=12,
            bbox={"facecolor": "orange", "alpha": 0.2, "pad": 5})
plt.tight_layout()
plt.show()

# Normalized confusion matrix (percentages per true class)
cm_normalized = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
plt.figure(figsize=(12, 10))
sns.heatmap(cm_normalized, annot=True, fmt='.2%', cmap='Blues',
            xticklabels=classes, yticklabels=classes,
            cbar_kws={'label': 'percentage'})
plt.title('Normalized confusion matrix (per true class)', fontsize=16)
plt.xlabel('predicted quality', fontsize=14)
plt.ylabel('true quality', fontsize=14)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12, rotation=0)
plt.figtext(0.5, 0.01, f'Test accuracy: {accuracy:.4f}',
            ha="center", fontsize=12,
            bbox={"facecolor": "orange", "alpha": 0.2, "pad": 5})
plt.tight_layout()
plt.show()

# ============= Classification report =============
from sklearn.metrics import classification_report

print("\nClassification report:")
print(classification_report(y_test, y_pred, target_names=[str(c) for c in classes]))

# Per-class accuracy
class_accuracy = {}
for i, cls in enumerate(classes):
    true_positives = cm[i, i]
    total = cm[i, :].sum()
    class_accuracy[cls] = true_positives / total if total > 0 else 0
print("\nPer-class accuracy:")
for cls, acc in class_accuracy.items():
    print(f"  Quality {cls}: {acc:.4f}")

# Random forest for comparison
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(n_estimators=100, random_state=90)
rfc.fit(X_train, y_train)
y_pred = rfc.predict(X_test)
train_acc = rfc.score(X_train, y_train)
print(f'Train Accuracy: {train_acc}')
accuracy = accuracy_score(y_test, y_pred)
print(f'Test Accuracy: {accuracy}')

# Cross-validate the random forest
rfc = RandomForestClassifier(n_estimators=100, random_state=90)
rfc_s = cross_val_score(rfc, X, y, cv=10).mean()
print(f'Mean CV accuracy: {rfc_s}')

Please modify this code so that it better fits the assignment requirements: predicting wine quality.
Task description: the "Wine" dataset contains red and white wines, with 4,898 records and 12 attributes; the goal is to predict the wine's quality (quality), an integer from 0 to 10. Data source: https://archive.ics.uci.edu/dataset/186/wine+quality. Hints: this can be treated as a clustering task (unsupervised learning; plot the result and colour-code the clusters), or as a regression/classification task with quality as the prediction target; for regression, split the training and test sets yourself. Assessment points: data splitting, data preprocessing, choice of model and algorithm, model training, evaluation methodology, performance metrics, and overall model performance.
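A minimal sketch of the clustering option the hints mention, assuming KMeans with k=3 and PCA for the coloured 2-D plot; scikit-learn's built-in wine dataset stands in here for the UCI wine-quality CSVs, so the cluster count and all parameters are illustrative.

```python
# Sketch: KMeans clustering on standardized features, projected to 2-D.
import numpy as np
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

X = StandardScaler().fit_transform(load_wine().data)
X2 = PCA(n_components=2, random_state=0).fit_transform(X)  # 2-D for plotting
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
# e.g. plt.scatter(X2[:, 0], X2[:, 1], c=labels) would colour-code the clusters
print("cluster sizes:", np.bincount(labels))
```

Standardizing before KMeans matters because the algorithm is distance-based; unscaled features with large ranges would otherwise dominate the clustering.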


193 samples, with a positive-to-negative ratio of 2:3 and 53 features. The label is whether HER2 is expressed in breast-cancer patients. Most features are heart-rate-variability measures, e.g. S1_Mean RR (ms), S1_SDNN (ms), S1_Mean HR (bpm), S1_SD HR (bpm), S1_Min HR (bpm), S1_Max HR (bpm), S1_RP_Lmean (beats), S1_RP_Lmax (beats), S1_RP_REC (%), S1_RP_DET (%), S1_RP_ShanEn, S1_MSE_1 through S1_MSE_5, etc., plus physiological indicators such as age and BMI. Below are my data-handling and model code. Please write a paper-style account of the model construction (what was used, why it was used, and how it improves on the alternatives), and output your answer as text.

# imports reconstructed from the calls below
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, KFold
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, confusion_matrix)
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE

data = pd.read_excel('C:/lydata/test4.xlsx')
X = data.drop('HER2_G', axis=1)
y = data['HER2_G']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    stratify=y, random_state=60)
kf = KFold(n_splits=5, shuffle=True, random_state=91)
accuracy_scores = []
precision_scores = []
recall_scores = []
f1_scores = []
auc_scores = []
total_confusion_matrix = np.zeros((len(np.unique(y_train)), len(np.unique(y_train))),
                                  dtype=int)
pca = PCA(n_components=14)
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('smote', SMOTE(k_neighbors=4, sampling_strategy=0.94, random_state=42)),
    ('pca', pca),
    ('gb', GradientBoostingClassifier(
        learning_rate=0.02,
        n_estimators=90,
        subsample=0.75,
        min_samples_split=5,
        min_samples_leaf=1,
        max_depth=6,
        random_state=42,
        warm_start=True,
        tol=0.0001,
        ccp_alpha=0,
        max_features=12,
    ))
])
for train_index, val_index in kf.split(X_train):
    X_train_fold, X_val = X_train.iloc[train_index], X_train.iloc[val_index]
    y_train_fold, y_val = y_train.iloc[train_index], y_train.iloc[val_index]
    pipeline.fit(X_train_fold, y_train_fold)
    y_pred = pipeline.predict(X_val)
    y_proba = pipeline.predict_proba(X_val)[:, 1]
    accuracy_scores.append(accuracy_score(y_val, y_pred))
    precision_scores.append(precision_score(y_val, y_pred))
    recall_scores.append(recall_score(y_val, y_pred))
    f1_scores.append(f1_score(y_val, y_pred))
    auc_scores.append(roc_auc_score(y_val, y_proba))
    cm = confusion_matrix(y_val, y_pred)
    total_confusion_matrix += cm
accuracy = np.mean(accuracy_scores)
precision = np.mean(precision_scores)
recall = np.mean(recall_scores)
f1 = np.mean(f1_scores)
auc = np.mean(auc_scores)
print("Gradient Boosting parameters:")
print(pipeline.named_steps['gb'].get_params())
print(f"Gradient Boosting mean accuracy: {accuracy:.2f}")
print(f"Gradient Boosting mean precision: {precision:.2f}")
print(f"Gradient Boosting mean recall: {recall:.2f}")
print(f"Gradient Boosting mean F1 score: {f1:.2f}")
print(f"Gradient Boosting mean AUC score: {auc:.2f}")
print("Aggregated confusion matrix:")
print(total_confusion_matrix)
pipeline.fit(X_train, y_train)
y_test_pred = pipeline.predict(X_test)
y_test_proba = pipeline.predict_proba(X_test)[:, 1]
accuracy_test = accuracy_score(y_test, y_test_pred)
precision_test = precision_score(y_test, y_test_pred)
recall_test = recall_score(y_test, y_test_pred)
f1_test = f1_score(y_test, y_test_pred)
auc_test = roc_auc_score(y_test, y_test_proba)
print(f"Test accuracy: {accuracy_test:.2f}")
print(f"Test precision: {precision_test:.2f}")
print(f"Test recall: {recall_test:.2f}")
print(f"Test F1 score: {f1_test:.2f}")
print(f"Test AUC score: {auc_test:.2f}")


I am using Visual Studio Code. My dataset has 14 feature dimensions; the columns are: year; month; day; order; country; session_ID; page1(main_category); page2(clothing_model); colour; location; model_photography; price; price_2; page. Four example rows (165,475 rows in total): row 1: 2008;4;1;1;29;1;1;A13;1;5;1;28;2;1. Row 2: 2008;4;1;2;29;1;1;A16;1;6;1;33;2;1. Row 3: 2008;4;1;3;29;1;2;B4;10;2;1;52;1;1. Row 4: 2008;4;1;4;29;1;2;B17;6;6;2;38;2;1. The column types are: date, date, date, integer, categorical, integer, categorical, categorical, categorical, categorical, categorical, integer, binary, integer. Building on the code below, please write Python code that applies logistic regression and a support vector machine (SVM) to this dataset (as groundwork for later decision trees and random forests), consider whether feature engineering is needed first, and write a report for me. Current code:

import pandas as pd
import missingno as msno
import matplotlib.pyplot as plt

# (1) Data exploration
# Read the dataset
file_path = r"C:/Users/33584/Desktop/bobo/e-shop-clothing2008.csv"
df = pd.read_csv(file_path, encoding='latin1')
print(df.head(20))
print(df.tail(20))
print(df.info())
print(df.describe())

# (2) Data cleaning
# ---------- Missing-value checks ----------
# Method 1: count missing values per column
missing_count = df.isnull().sum()
print("Missing values per column:\n", missing_count)
# Method 2: overall missing rate
missing_percent = df.isnull().sum().sum() / df.size * 100
print(f"\nOverall missing rate: {missing_percent:.2f}%")

# ---------- Missing-value visualization ----------
# Matrix plot: white indicates missing
msno.matrix(df, figsize=(10, 6), fontsize=12)
plt.title("Missing-value matrix", fontsize=14)
plt.show()
# Bar chart: non-null count per column
msno.bar(df, figsize=(10, 6), color="dodgerblue", fontsize=12)
plt.title("Data completeness", fontsize=14)
plt.show()
# Result: no missing values
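A hedged sketch of the requested next step: one-hot encode the categorical columns, scale the numeric ones, then fit logistic regression and a linear SVM. The column names follow the description above, but the DataFrame here is a small synthetic stand-in built from the four example rows so the sketch runs standalone; replace it with the real `df` loaded from e-shop-clothing2008.csv, and treat the choice of target (`price_2`) and all parameters as assumptions.

```python
# Sketch: feature engineering + logistic regression and linear SVM.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Synthetic stand-in with the same schema (replace with the real df)
df = pd.DataFrame({
    "country": [29, 29, 9, 29] * 25,
    "page1(main_category)": [1, 1, 2, 2] * 25,
    "colour": [1, 1, 10, 6] * 25,
    "location": [5, 6, 2, 6] * 25,
    "model_photography": [1, 1, 1, 2] * 25,
    "price": [28, 33, 52, 38] * 25,
    "page": [1, 1, 1, 2] * 25,
    "price_2": [2, 2, 1, 2] * 25,   # binary: price above/below category average
})
y = (df.pop("price_2") == 1).astype(int)
cat_cols = ["country", "page1(main_category)", "colour", "location", "model_photography"]
num_cols = ["price", "page"]

# One-hot encode categoricals, standardize numerics
pre = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
    ("num", StandardScaler(), num_cols),
])
X = pre.fit_transform(df)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=0, stratify=y)

accs = {}
for name, model in [("LogReg", LogisticRegression(max_iter=1000)),
                    ("LinearSVC", LinearSVC())]:
    model.fit(X_tr, y_tr)
    accs[name] = accuracy_score(y_te, model.predict(X_te))
    print(name, "accuracy:", accs[name])
```

Encoding the categoricals matters here because columns like country and colour are nominal codes, and feeding them to a linear model as raw integers would impose a spurious ordering.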
