[AutoGluon]MAP - Charting Student Math Misunderstandings

技术与健康

Published 2025-07-30 20:30:00

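License: CC 4.0 BY-SA

Tags: machine learning

This is an original article by the author; do not repost it without the author's permission.

Article link: https://blog.csdn.net/Practicer2015/article/details/149778372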

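Install AutoGluon: first make sure the autogluon library is installed by running !pip install autogluon.
Load data: load the training data from the specified CSV file into a pandas DataFrame.
Data preparation:
- Fill missing values in the 'Misconception' column with "NA".
- Concatenate the text from 'QuestionText', 'MC_Answer', and 'StudentExplanation' into a single 'combined_text' column, which serves as the model's feature.
Train the Category model:
- Create a training subset (df_train) with the 'combined_text', 'QuestionId', and 'Category' columns.
- Initialize and train an AutoGluon TabularPredictor for the 'Category' column on the prepared training data, with a time limit and a quality preset.
- Evaluate the trained Category predictor on the training data. This is an in-sample score, so it tends to be optimistic.
Load test data and predict Category:
- Load the test data from the specified CSV file.
- Create the 'combined_text' feature in the test data, in the same way as for the training data.
- Use the trained Category predictor to predict the probability of each category on the test data.
Train the Misconception model:
- Create a training subset (df_train_Misconception) with the 'combined_text' and 'Misconception' columns.
- Initialize and train a second AutoGluon TabularPredictor for the 'Misconception' column.
- Evaluate the trained Misconception predictor on the training data.
Predict Misconception:
- Use the trained Misconception predictor to predict the probability of each misconception on the test data.
Get the top predictions: as an example, extract the top 3 predicted categories and misconceptions (ranked by probability) for the first 5 rows of the test data.
Compute the top combinations: define a function get_top_combinations that takes the category and misconception probability Series, computes the product of the probabilities for every combination of the top N categories and top N misconceptions, and returns the top N combinations as strings in the format 'category:misconception'. The function is then applied to each row of the prediction DataFrames.
Create the results DataFrame: create a new DataFrame, results_df, to hold the final output: the 'row_id' from the test data and the computed top combinations.
Format the combinations: convert the list of top combinations in the 'Category:Misconception' column into a single comma-separated string.
Save the submission file: finally, save results_df to a CSV file named submission.csv.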

# Install AutoGluon
!pip install autogluon

import pandas as pd
from autogluon.tabular import TabularPredictor

# Load training data
file_path = '/content/drive/MyDrive/Colab Notebooks/train.csv'
df = pd.read_csv(file_path)

# Inspect basic information about the data
df.info()  # prints dtypes and non-null counts directly; it returns None
df_head = df.head()
df_shape = df.shape

print(df_shape)
print(df_head)

# Target columns: one classifier is trained for each
label = 'Category'
label2 = 'Misconception'

# Fill missing misconceptions with the placeholder "NA"
df['Misconception'] = df['Misconception'].fillna("NA")
# Combine the text columns into a single feature
df['combined_text'] = df['QuestionText'].fillna('') + ' ' + df['MC_Answer'].fillna('') + ' ' + df['StudentExplanation'].fillna('')

# Create sub-dataset for training Category
df_train = df[['combined_text', 'QuestionId', label]].dropna()

# AutoGluon training for Category
predictor = TabularPredictor(label=label).fit(
    train_data=df_train,
    num_gpus=1,
    time_limit=1800,  # seconds; can be set larger, e.g. 3600
    presets='best_quality'  # or 'medium_quality_faster_train'
)

# Evaluate the Category predictor (on the training data, so the score is optimistic)
predictor.evaluate(df_train)
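
Scoring a predictor on the rows it was trained on tends to overstate its quality. The following is a minimal sketch of a stratified hold-out split, assuming the DataFrame has a unique index. The helper name stratified_holdout and the toy frame are hypothetical and not part of the original notebook; the AutoGluon calls are left as comments.

```python
import pandas as pd


def stratified_holdout(df, label, frac=0.2, seed=0):
    """Split df into (train, valid), sampling `frac` of each label class for validation.

    Assumes df has a unique index. Very rare classes may contribute no
    validation rows, because frac * class_size can round down to zero.
    """
    valid = df.groupby(label, group_keys=False).sample(frac=frac, random_state=seed)
    train = df.drop(index=valid.index)
    return train, valid


# Toy stand-in for df_train (hypothetical data)
toy = pd.DataFrame({
    "combined_text": [f"text {i}" for i in range(20)],
    "Category": ["True_Correct"] * 10 + ["False_Misconception"] * 10,
})

train_part, valid_part = stratified_holdout(toy, "Category")
print(len(train_part), len(valid_part))

# With AutoGluon, one would then fit on train_part and score on valid_part:
# predictor = TabularPredictor(label="Category").fit(train_data=train_part)
# predictor.evaluate(valid_part)
```

The trade-off is that the held-out rows are no longer used for training; for the final submission one may prefer to refit on all rows once the validation score looks reasonable.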

# Load test data
test_path = '/content/drive/MyDrive/Colab Notebooks/test.csv'
test_data = pd.read_csv(test_path)
# Build the same combined_text feature as for the training data
test_data['combined_text'] = test_data['QuestionText'].fillna('') + ' ' + test_data['MC_Answer'].fillna('') + ' ' + test_data['StudentExplanation'].fillna('')
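
The same concatenation is written out twice, once for the training data and once for the test data. One way to keep the two in sync is a small helper, sketched below; add_combined_text and TEXT_COLUMNS are hypothetical names that do not appear in the original notebook, and the toy row is made up.

```python
import pandas as pd

# Column names taken from the notebook; the constant itself is an assumption
TEXT_COLUMNS = ["QuestionText", "MC_Answer", "StudentExplanation"]


def add_combined_text(frame, columns=TEXT_COLUMNS):
    """Return a copy of frame with a 'combined_text' column.

    Missing values become empty strings and the columns are joined with
    single spaces, matching the manual concatenation in the notebook.
    """
    out = frame.copy()
    out["combined_text"] = (
        out[columns].fillna("").astype(str).apply(lambda row: " ".join(row), axis=1)
    )
    return out


# Toy row with a missing explanation (hypothetical data)
toy = pd.DataFrame({
    "QuestionText": ["What is 1/2 + 1/4?"],
    "MC_Answer": ["3/4"],
    "StudentExplanation": [None],
})
combined = add_combined_text(toy)["combined_text"].iloc[0]
print(repr(combined))
```

Calling the helper on both df and test_data would guarantee that the feature is built identically for training and prediction.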

# Predict probabilities for Category on test data
predictors = predictor.predict_proba(test_data)

# Create sub-dataset for training Misconception
df_train_Misconception = df[['combined_text', label2]]

# AutoGluon training for Misconception
predictor_Misconception = TabularPredictor(label=label2).fit(
    train_data=df_train_Misconception,
    num_gpus=1,
    time_limit=1800,  # seconds; can be set larger, e.g. 3600
    presets='best_quality'  # or 'medium_quality_faster_train'
)

# Evaluate the Misconception predictor (on the training data, so the score is optimistic)
predictor_Misconception.evaluate(df_train_Misconception)

# Predict probabilities for Misconception on test data
predictor_Misconceptions = predictor_Misconception.predict_proba(test_data)

# Get top 3 categories and misconceptions for the first 5 rows (example)
top_3_categories = predictors.head().apply(lambda row: row.nlargest(3).index.tolist(), axis=1)
top_3_misconceptions = predictor_Misconceptions.head().apply(lambda row: row.nlargest(3).index.tolist(), axis=1)

print("Top 3 Categories for first 5 rows:")
display(top_3_categories)
print("\nTop 3 Misconceptions for first 5 rows:")
display(top_3_misconceptions)

# Function to get top combinations of category and misconception

def get_top_combinations(category_probs, misconception_probs, n=3):
    """
    Calculates the top n combinations of category and misconception based on the product of their probabilities.

    Args:
        category_probs (pd.Series): Series of probabilities for categories.
        misconception_probs (pd.Series): Series of probabilities for misconceptions.
        n (int): The number of top combinations to return.

    Returns:
        list: A list of strings representing the top n combinations in the format 'category:misconception'.
    """
    combinations = []
    for cat_name, cat_prob in category_probs.nlargest(n).items():
        for mis_name, mis_prob in misconception_probs.nlargest(n).items():
            combinations.append(((cat_name, mis_name), cat_prob * mis_prob))

    # Sort combinations by probability in descending order
    sorted_combinations = sorted(combinations, key=lambda item: item[1], reverse=True)

    # Get the top n combinations and format them
    top_n_combinations = [f"{cat}:{mis}" for (cat, mis), prob in sorted_combinations[:n]]

    return top_n_combinations

# Apply the function to each row of the prediction DataFrames.
# row.name is an index label, so the matching row is looked up with .loc rather than .iloc
top_combinations = predictors.apply(lambda row: get_top_combinations(row, predictor_Misconceptions.loc[row.name]), axis=1)

print("\nTop 3 Category:Misconception combinations:")
display(top_combinations)
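
To make the ranking concrete, here is a small worked sketch of the same product-of-probabilities idea, written with a NumPy outer product instead of nested loops. The function name top_pairs and the toy labels and probabilities are made up for illustration. Note that multiplying the two probabilities treats the category and misconception predictions as independent, which is a simplification.

```python
import numpy as np
import pandas as pd


def top_pairs(category_probs, misconception_probs, n=3):
    """Return the n 'category:misconception' pairs with the largest probability product."""
    # joint[i, j] = P(category i) * P(misconception j)
    joint = np.outer(category_probs.to_numpy(), misconception_probs.to_numpy())
    # Flat positions of the n largest products, best first
    flat_order = np.argsort(joint, axis=None)[::-1][:n]
    rows, cols = np.unravel_index(flat_order, joint.shape)
    return [
        f"{category_probs.index[r]}:{misconception_probs.index[c]}"
        for r, c in zip(rows, cols)
    ]


# Toy probabilities for a single test row (hypothetical values)
cat = pd.Series({"True_Correct": 0.6, "False_Misconception": 0.3, "False_Neither": 0.1})
mis = pd.Series({"NA": 0.7, "Incomplete": 0.2, "Additive": 0.1})

pairs = top_pairs(cat, mis)
print(pairs)
```

Here the three largest products are 0.6 × 0.7 = 0.42, 0.3 × 0.7 = 0.21, and 0.6 × 0.2 = 0.12. Ranking over all pairs gives the same result as restricting to the top n of each side first (barring ties), since any pair outside that grid is beaten by at least n pairs inside it.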

# Create a new DataFrame to store the results
results_df = pd.DataFrame()

# Copy the row ids and attach the top combinations (aligned on the index)
results_df['row_id'] = test_data['row_id']
results_df['Category:Misconception'] = top_combinations

# Convert the list of combinations to a single comma-separated string
# (check the competition's submission format for the delimiter it expects)
results_df['Category:Misconception'] = results_df['Category:Misconception'].apply(lambda x: ', '.join(x))

# Display the resulting DataFrame
print("\nResults DataFrame:")
display(results_df)

# Save the results to a CSV file
results_df.to_csv('submission.csv', index=False)

print("\nSubmission file 'submission.csv' created.")