Naive Bayes

License: CC 4.0 BY-SA


1. Bayes' Formula

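Problem: suppose we know the probability P(A) of event A, the probability P(B) of event B, and the conditional probability P(B|A) that B occurs given that A has occurred.

Then the probability P(A|B) that A occurs given that B has occurred is:
$P(A|B) = \frac{ P(A)P(B|A) } { P(B) }$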
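Example: you plan a trip today, but when you step outside the morning sky is cloudy. Mornings are cloudy 40% of the time, 10% of days are rainy, and 50% of rainy days start with a cloudy morning. What is the probability that today is rainy, given the cloudy morning?
Let A be the event that the day is rainy and B the event that the morning is cloudy.
P(A) = 0.1
P(B) = 0.4
P(B|A) = 0.5
P(A|B) = $\frac{0.1 \times 0.5}{0.4} = 0.125$
So the probability of rain given a cloudy morning is 0.125.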
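The arithmetic of the weather example can be checked with a few lines of Python. This is only a minimal sketch; the function name `bayes_posterior` is our own choice and not part of any library.

```python
def bayes_posterior(p_a, p_b, p_b_given_a):
    """Return P(A|B) = P(A) * P(B|A) / P(B)."""
    if p_b <= 0:
        raise ValueError("P(B) must be positive")
    return p_a * p_b_given_a / p_b


# Values from the weather example: A = rainy day, B = cloudy morning.
p_rain_given_cloudy = bayes_posterior(p_a=0.1, p_b=0.4, p_b_given_a=0.5)
print(round(p_rain_given_cloudy, 3))  # 0.125
```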

Expanding P(B) with the law of total probability, Bayes' formula becomes:
$P(A|B) = \frac{ P(A)P(B|A) } { P(A)P(B|A)\ +\ P(\neg A)P(B|\neg A)}$
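Example: the serum alpha-fetoprotein (AFP) test is used to diagnose liver cancer. A liver cancer patient tests positive with probability 0.95, a healthy person tests positive with probability 0.1, and the prevalence of liver cancer in the population is assumed to be 0.0001. A person has just tested positive; what is the probability that this person has liver cancer?
Let A be the event that a person has liver cancer and B the event that the person tests positive.
P(A) = 0.0001
P(¬A) = 0.9999
P(B|A) = 0.95
P(B|¬A) = 0.1
P(A|B) = $\frac {0.0001 \times 0.95}{0.0001 \times 0.95\ +\ 0.9999 \times 0.1} \approx 0.00095$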
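Illustration with counts: picture a population of 1,000,000 people. About 100 of them have liver cancer, and 95 of those test positive; of the 999,900 healthy people, 99,990 also test positive.
So the probability of having the disease given a positive test is $\frac {95} {95+99990}$, which is the same as $\frac {0.0001 \times 0.95}{0.0001 \times 0.95\ +\ 0.9999 \times 0.1}$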
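The liver cancer calculation can be reproduced in Python as a rough sketch. The function name `posterior_from_total_probability` is hypothetical; it simply expands P(B) with the law of total probability before applying Bayes' formula.

```python
def posterior_from_total_probability(prior, p_pos_given_a, p_pos_given_not_a):
    """Return P(A|B), with P(B) expanded over A and not-A."""
    evidence = prior * p_pos_given_a + (1 - prior) * p_pos_given_not_a
    return prior * p_pos_given_a / evidence


# A = has liver cancer, B = test is positive (numbers from the example).
p_cancer_given_positive = posterior_from_total_probability(
    prior=0.0001, p_pos_given_a=0.95, p_pos_given_not_a=0.1
)
print(f"{p_cancer_given_positive:.6f}")
```

The result is below 0.1%: even with a sensitive test, a positive result says little when the disease is this rare and false positives are this common.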

2. The Naive Bayes Algorithm

What "naive" means

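Probability theory gives P(A&B) = P(A)P(B): the probability that A and B both occur equals the product of their individual probabilities. This formula holds only when A and B are independent.
If we simply assume that A and B are independent, the formula applies. That assumption is the "naive" assumption. In naive Bayes it is applied to the features within each class: given the class, the features are assumed to be independent of one another.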
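To see the product rule at work on events that really are independent, the sketch below enumerates all rolls of two fair dice. The events chosen here (first die even, second die greater than 4) are our own illustration and do not come from the article.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of rolling two fair dice.
outcomes = list(product(range(1, 7), repeat=2))


def prob(event):
    """Exact probability of an event given as a predicate on an outcome."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))


def first_is_even(o):
    return o[0] % 2 == 0


def second_above_four(o):
    return o[1] > 4


p_a = prob(first_is_even)
p_b = prob(second_above_four)
p_ab = prob(lambda o: first_is_even(o) and second_above_four(o))
print(p_a, p_b, p_ab)  # 1/2 1/3 1/6
```

Here P(A&B) equals P(A)P(B) exactly. For words in real emails the equality is only an approximation, which is why the assumption is called naive.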

The Naive Bayes algorithm

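Suppose we want to build a spam detector. From the training samples we know:
the probability that an email is spam, P(spam)
the probability that an email is normal, P(normal) = 1 - P(spam)
the probability that a spam email contains the word "easy", P("easy"|spam)
the probability that a normal email contains the word "easy", P("easy"|normal)
the probability that a spam email contains the word "money", P("money"|spam)
the probability that a normal email contains the word "money", P("money"|normal)
From these we want to infer, for an email that contains both "money" and "easy", the probability that it is spam and the probability that it is normal, that is, P(spam|"easy","money") and P(normal|"easy","money").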

Algorithm steps:

Bayes' formula gives P(A|B)P(B) = P(B|A)P(A), so for a fixed B (where 1/P(B) is a constant) P(A|B) is proportional to P(B|A)P(A).
Taking A to be spam and B to be "easy","money":
P(spam|"easy","money") is proportional to P("easy","money"|spam)P(spam), and likewise
P(normal|"easy","money") is proportional to P("easy","money"|normal)P(normal).
Applying the naive assumption to P("easy","money"|spam) and P("easy","money"|normal) gives:
P("easy","money"|spam) = P("easy"|spam)P("money"|spam)
P("easy","money"|normal) = P("easy"|normal)P("money"|normal)
This yields:
P(spam|"easy","money") is proportional to P("easy"|spam)P("money"|spam)P(spam)
P(normal|"easy","money") is proportional to P("easy"|normal)P("money"|normal)P(normal)
Finally, normalizing P(spam|"easy","money") and P(normal|"easy","money") so that they sum to 1 gives:
$P(spam|"easy","money") = \frac{P("easy"|spam)P("money"|spam)P(spam)} {P("easy"|spam)P("money"|spam)P(spam)\ +\ P("easy"|normal)P("money"|normal)P(normal)}$
$P(normal|"easy","money") = \frac{P("easy"|normal)P("money"|normal)P(normal)} {P("easy"|spam)P("money"|spam)P(spam)\ +\ P("easy"|normal)P("money"|normal)P(normal)}$
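A small numeric sketch may make the normalization step concrete. All probabilities below are made-up values chosen for illustration; the article does not give concrete numbers.

```python
# Hypothetical estimates (assumptions for illustration only).
p_spam = 0.2
p_normal = 1 - p_spam
p_easy_given_spam, p_money_given_spam = 0.3, 0.4
p_easy_given_normal, p_money_given_normal = 0.05, 0.02

# Unnormalized scores: class-conditional word probabilities times the prior.
score_spam = p_easy_given_spam * p_money_given_spam * p_spam
score_normal = p_easy_given_normal * p_money_given_normal * p_normal

# Normalize so that the two posteriors sum to 1.
total = score_spam + score_normal
post_spam = score_spam / total
post_normal = score_normal / total
print(f"P(spam|easy,money)   = {post_spam:.4f}")
print(f"P(normal|easy,money) = {post_normal:.4f}")
```

With these particular numbers the email is classified as spam, because both words are far more likely under the spam class even though spam has the smaller prior.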

Generalizing to more word features:

$P(spam|"easy",\cdots,"money") = \frac{P("easy"|spam) \cdots P("money"|spam)P(spam)} {P("easy"|spam) \cdots P("money"|spam)P(spam)\ +\ P("easy"|normal) \cdots P("money"|normal)P(normal)}$
$P(normal|"easy",\cdots,"money") = \frac{P("easy"|normal) \cdots P("money"|normal)P(normal)} {P("easy"|spam) \cdots P("money"|spam)P(spam)\ +\ P("easy"|normal) \cdots P("money"|normal)P(normal)}$
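With many words the product of small probabilities can underflow, so implementations usually add logarithms instead of multiplying. The sketch below shows that idea for an arbitrary word list; the function name `naive_bayes_posteriors` and every probability in it are hypothetical, and it assumes all probabilities are strictly positive.

```python
import math


def naive_bayes_posteriors(words, priors, likelihoods):
    """Return normalized class posteriors, computed in log space.

    priors: {label: P(label)}
    likelihoods: {label: {word: P(word|label)}}, all values > 0.
    """
    log_scores = {
        label: math.log(prior)
        + sum(math.log(likelihoods[label][w]) for w in words)
        for label, prior in priors.items()
    }
    # Subtract the maximum before exponentiating to keep the values stable.
    max_log = max(log_scores.values())
    unnormalized = {k: math.exp(v - max_log) for k, v in log_scores.items()}
    total = sum(unnormalized.values())
    return {k: v / total for k, v in unnormalized.items()}


# Made-up numbers for illustration only.
priors = {"spam": 0.2, "normal": 0.8}
likelihoods = {
    "spam": {"easy": 0.3, "money": 0.4, "free": 0.25},
    "normal": {"easy": 0.05, "money": 0.02, "free": 0.04},
}
posteriors = naive_bayes_posteriors(["easy", "money", "free"], priors, likelihoods)
print(posteriors)
```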

3. Naive Bayes in sklearn

(1) Read the data

import pandas as pd

# Read the tab-separated file; it has no header row, so name the columns
df = pd.read_csv('SMSSpamCollection',  # data file
                 sep='\t',
                 header=None,
                 names=['label', 'sms_message'])

# Show the first five rows
df.head()
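For readers without the data file at hand, the same `read_csv` arguments can be tried on an in-memory string. The two sample rows below are invented for illustration and only mimic the `label<TAB>message` layout that the code above expects.

```python
import io

import pandas as pd

# Two invented rows in the assumed "label<TAB>message" layout.
sample_text = "ham\tSee you at lunch\nspam\tWin easy money now\n"

df_sample = pd.read_csv(io.StringIO(sample_text),
                        sep='\t',
                        header=None,
                        names=['label', 'sms_message'])
print(df_sample.shape)  # (2, 2)
print(df_sample)
```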

(2) Map the labels to 0 and 1

# Map the labels: ham -> 0, spam -> 1
df['label'] = df['label'].map({'ham':0, 'spam':1})
print(df.shape)
df.head()

(5572, 2)
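One thing worth checking after the mapping: `Series.map` with a dict turns any label missing from the dict into NaN instead of raising an error. The sketch below uses an invented series with a deliberately misspelled label to show how such rows could be detected.

```python
import pandas as pd

# Invented labels; 'Spam' (capitalized) is not a key of the mapping.
raw_labels = pd.Series(['ham', 'spam', 'ham', 'Spam'])
mapped = raw_labels.map({'ham': 0, 'spam': 1})

n_unmapped = int(mapped.isna().sum())
print(n_unmapped)  # 1
```

A count of zero on the real data would confirm that every row carries one of the two expected labels.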

(3) Split into training and test sets

# Split into training and test sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df['sms_message'],
                                                    df['label'],
                                                    random_state=1)  # test_size defaults to 0.25

print('Number of rows in the total set: {}'.format(df.shape[0]))
print('Number of rows in the training set: {}'.format(X_train.shape[0]))
print('Number of rows in the test set: {}'.format(X_test.shape[0]))

Number of rows in the total set: 5572
Number of rows in the training set: 4179
Number of rows in the test set: 1393
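The 4179/1393 split follows from the default `test_size` of 0.25. The sketch below reproduces the sizes on synthetic data and also shows the `stratify` parameter, which keeps the class ratio similar in both parts; the synthetic messages and the roughly 1-in-8 spam ratio are assumptions made for the demo, not properties of the real dataset.

```python
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the 5572 messages and their labels.
messages = [f"message {i}" for i in range(5572)]
labels = [1 if i % 8 == 0 else 0 for i in range(5572)]

# Default split: 25% of the rows go to the test set.
X_tr, X_te, y_tr, y_te = train_test_split(messages, labels, random_state=1)
print(len(X_tr), len(X_te))  # 4179 1393

# Stratified split: the share of label 1 is preserved in both parts.
Xs_tr, Xs_te, ys_tr, ys_te = train_test_split(
    messages, labels, random_state=1, stratify=labels
)
print(sum(ys_te) / len(ys_te))
```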

(4) Preprocess the dataset

Convert the message strings into a term-frequency (word count) matrix.

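# Instantiate a CountVectorizer object
from sklearn.feature_extraction.text import CountVectorizer
count_vector = CountVectorizer()

# Learn the vocabulary from the training set and convert it to a term-frequency matrix
training_data = count_vector.fit_transform(X_train)

# Convert the test set to a term-frequency matrix; note that the call differs from the training set:
# transform(raw_documents) reuses the vocabulary learned by fit (or one passed to the constructor)
testing_data = count_vector.transform(X_test)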
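A toy corpus shows what the term-frequency matrix looks like and why the test set must use `transform` rather than `fit_transform`. The two training sentences and the test sentence below are invented for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["Easy money, easy!", "Call me now"]
vectorizer = CountVectorizer()  # lowercases and drops punctuation by default

train_matrix = vectorizer.fit_transform(train_docs)

# Column order of the matrix: vocabulary words sorted by their column index.
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(vocab)                   # ['call', 'easy', 'me', 'money', 'now']
print(train_matrix.toarray())  # one row per document, one column per word

# "cash" was never seen during fit, so transform silently ignores it.
test_matrix = vectorizer.transform(["easy cash"])
print(test_matrix.toarray())   # [[0 1 0 0 0]]
```

Calling `fit_transform` on the test set would build a different vocabulary, so its columns would no longer line up with what the classifier was trained on.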

(5) Create the Bayes classifier

from sklearn.naive_bayes import MultinomialNB
naive_bayes = MultinomialNB()
naive_bayes.fit(training_data, y_train)
predictions = naive_bayes.predict(testing_data)
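The same fit and predict calls can be tried end to end on a tiny invented corpus. This is only a sketch: the four training messages are made up, and it relies on the default `alpha=1.0` of `MultinomialNB` (Laplace smoothing), which keeps a word that never appeared in a class from forcing that class's probability to zero.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Invented training data: 1 = spam, 0 = ham.
train_msgs = ["win easy money now", "easy money easy cash",
              "see you at lunch", "call me after lunch"]
train_labels = [1, 1, 0, 0]

vec = CountVectorizer()
clf = MultinomialNB()  # alpha=1.0 by default
clf.fit(vec.fit_transform(train_msgs), train_labels)

new_msgs = ["easy money", "lunch"]
new_counts = vec.transform(new_msgs)
preds = clf.predict(new_counts)
probs = clf.predict_proba(new_counts)  # columns follow clf.classes_
print(list(preds))
print(probs)
```

`predict_proba` returns the normalized posteriors from section 2, one row per message and one column per class.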

(6) Evaluate the model

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
# precision, recall and F1 are computed for the positive class, label 1 (spam)
print('Accuracy score: {}'.format(accuracy_score(y_test, predictions)))
print('Precision score: {}'.format(precision_score(y_test, predictions)))
print('Recall score: {}'.format(recall_score(y_test, predictions)))
print('F1 score: {}'.format(f1_score(y_test, predictions)))

Accuracy score: 0.9885139985642498
Precision score: 0.9720670391061452
Recall score: 0.9405405405405406
F1 score: 0.9560439560439562
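The four scores can be tied back to confusion-matrix counts. The counts below are not printed anywhere in the article; they are our reconstruction, chosen because they reproduce the reported scores on the 1393 test rows, so treat them as an assumption.

```python
# Reconstructed counts (an assumption), with spam (label 1) as the positive class.
tp, fp, fn, tn = 174, 5, 11, 1203

accuracy = (tp + tn) / (tp + fp + fn + tn)  # share of all predictions that are correct
precision = tp / (tp + fp)                  # share of predicted spam that is spam
recall = tp / (tp + fn)                     # share of actual spam that was caught
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print('Accuracy score: {}'.format(accuracy))
print('Precision score: {}'.format(precision))
print('Recall score: {}'.format(recall))
print('F1 score: {}'.format(f1))
```

Under this reading, 11 spam messages slipped through and 5 normal messages were flagged as spam. Because spam is the minority class, precision and recall are more informative here than accuracy alone.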

Full code

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# 1. Read the tab-separated file; it has no header row, so name the columns
df = pd.read_csv('SMSSpamCollection',  # data file
                 sep='\t',
                 header=None,
                 names=['label', 'sms_message'])

# 2. Map the labels: ham -> 0, spam -> 1
df['label'] = df['label'].map({'ham':0, 'spam':1})

# 3. Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(df['sms_message'],
                                                    df['label'],
                                                    random_state=1)  # test_size defaults to 0.25

print('Number of rows in the total set: {}'.format(df.shape[0]))
print('Number of rows in the training set: {}'.format(X_train.shape[0]))
print('Number of rows in the test set: {}'.format(X_test.shape[0]))

# 4. Preprocess the dataset
# Instantiate a CountVectorizer object
count_vector = CountVectorizer()
# Learn the vocabulary from the training set and convert it to a term-frequency matrix
training_data = count_vector.fit_transform(X_train)
# Convert the test set to a term-frequency matrix; note that the call differs from the training set:
# transform(raw_documents) reuses the vocabulary learned by fit (or one passed to the constructor)
testing_data = count_vector.transform(X_test)

# 5. Create the classifier
naive_bayes = MultinomialNB()
naive_bayes.fit(training_data, y_train)
predictions = naive_bayes.predict(testing_data)

# 6. Evaluate the model (precision, recall and F1 refer to label 1, spam)
print('Accuracy score: {}'.format(accuracy_score(y_test, predictions)))
print('Precision score: {}'.format(precision_score(y_test, predictions)))
print('Recall score: {}'.format(recall_score(y_test, predictions)))
print('F1 score: {}'.format(f1_score(y_test, predictions)))

Number of rows in the total set: 5572
Number of rows in the training set: 4179
Number of rows in the test set: 1393
Accuracy score: 0.9885139985642498
Precision score: 0.9720670391061452
Recall score: 0.9405405405405406
F1 score: 0.9560439560439562

Download link for the data file SMSSpamCollection: SMSSpamCollection (extraction code: qk8a)