毕业论文-EM算法学习总结

本文介绍了EM算法的基本概念，包括它在处理具有隐变量的参数估计中的作用，以及与极大似然估计和极大后验概率估计的关系。文章详细阐述了EM算法的迭代思想和具体步骤，并提供了推导过程。此外，还展示了使用Python的sklearn库实现EM算法和对比KNN算法的代码示例。

序

这个和我的硕士毕业论文的题目就有一定关系，我的导师让我按时向她汇报学习进度。然而我还在进行实习，还要准备自己明年的秋招，只能想办法游走于三者之间。

EM算法是一个常用的数据挖掘算法，想必从事数据挖掘的相关工作的同学一定比较熟悉，至少是有所耳闻。

1 EM算法的概念和介绍

1.1 一些基本概念

EM算法（Expectation-Maximization Algorithm）是一种通过不断迭代进而使模型的参数收敛，进而得到模型参数的算法。常用于具有隐变量的参数估计（极大似然估计或者极大后验概率估计）。

隐变量是不可观测的随机变量，我们通常通过可观测变量的样本对隐变量作出推断。简而言之，我们手里有一堆数据，我们需要将这对数据分成很多类别，但这些数据并没有任何标签信息。虽然没有标签信息但是这些数据都有很多特征变量，我们可以根据这些特征变量给这些数据进行人为的分类，这便称为”聚类“。而用来标记划分出来的各个类别的那个变量便是”隐变量“。因此，EM算法是一种无监督学习。

极大似然估计（MLE）和极大后验概率估计（MAP）是有一定区别的。简而言之，极大后验概率估计考虑先验知识对统计结果的影响，和极大似然估计只是从统计结果得到的频率出发，并不考虑这个频率是否一定符合所有情况（因为，当样本数量不够多的时候，统计结果得到的频率可能并不准确）。具体的区别可以简单地认为极大后验概率估计MAP在优化时比极大似然估计MLE多了一个先验项。详细内容可以参考这位博主的文章：

极大似然估计与最大后验概率估计的区别https://zhuanlan.zhihu.com/p/40024110

1.2 EM算法的大致思想

首先根据己经给出的观测数据，估计出模型参数的值（初始化）；然后再依据上一步估计出的参数值估计缺失数据的值，再根据估计出的缺失数据加上之前己经观测到的数据重新再对参数值进行估计，然后反复迭代，直至最后收敛，迭代结束。

1.3 EM算法的具体步骤

先得给予需要估计的参数一个值。
进行”E步“的操作：利用现有样本对隐变量进行参数估计，求出隐变量的期望，以初步判断期望样本属于哪一类。
进行”M步“的操作：根据上一步E步求得的隐变量的结果，通过参数估计（一般是MLE或者MAP）得到新的参数值。
将”M步“得到的新的参数数值重新代入步骤二（E步），然后反复迭代，经过多个E步和M步直至收敛。

如果要我举一个例子的的话，我觉得还是网上抛硬币那个更容易理解。对于这个例子，只是有些人写的容易理解，有些人写的让人看不懂，从本身来讲这个例子讲的是很清晰，很不错的。这里我推荐这位博主的文章。他在自己博客第三章：《EM算法》，的第一节：《举例说明EM算法用来干啥》中，解释的还是相当清楚的，具体内容我就不特意转载了，这里给出链接：

EM算法详解（请看第三章第一节：《举例说明EM算法用来干啥》）https://zhuanlan.zhihu.com/p/367714302

2 EM算法的推导和代码实现

2.1 EM算法的推导

下面给出EM算法各个步骤的公式，接着对公式进行推导。假设在第i次迭代后参数的估计值为 $\theta^{(i)}$ ，对于第i+1次迭代，分为两步：

E步，求期望：
- $Q(\theta ,\theta^{(i)})=\sum_{Z}\left ( Z|X,\theta^{(i)} \right )logP(X,Z|\theta)=E_{Z|X,\theta^{(i)}}\left [ logP(X,Z|\theta) \right ]$
M步，最大化：
- $\theta^{(i+1)}=argmaxQ(\theta,\theta^{(i)})$

其中， $Q(\theta ,\theta^{(i)})$ 被称为Q函数，是EM算法的核心。Q函数的推导如下：给定一组观测数据记为 $X=(x_{1}, x_{2}, ..., x_{n})$ ，以及参数 $\theta$ 。假设 $x_{1}, x_{2}, ..., x_{n}$ 是独立同分布，因此有以下的对数似然函数：

$\iota(\theta|X)=logP(X|\theta)=log\left ( \prod _{i=1}^{n}P(x_{i}|\theta) \right )=\sum _{i=1}^{n}logP(x_{i}|\theta)$

可以通过极大似然估计来求解最优参数，即：

$\widehat{\theta}=argmaxlog\iota(\theta|X)=argmax\sum_{i=1}^{n}logP(x_{i}|\theta)$

但是由于隐变量的存在， $logP(X|\theta)$ 变为

$logP(X|\theta)=log\sum_{Z}P(X,Z|\theta)=log(\sum_{Z}P(X|\theta,Z)P(Z|\theta))$

累加使得直接求解式，变得十分困难。一个办法是通过构造一个容易优化的有关对数似然函数的下界函数，通过不断迭代优化下界，从而逼近最优参数。

记隐变量Z的概率分布q(Z)，满足：

$\sum_{Z}q(Z)=1$

有以下等式成立：

$P(X|\theta)=\frac{P(X,Z|\theta)}{P(Z|X,\theta)}=\frac{P(X,Z|\theta)/q(Z)}{P(Z|X,\theta)/q(Z)}$

两边同时取对数

$logP(X|\theta)=log\frac{P(X,Z|\theta)/q(Z)}{P(Z|X,\theta)/q(Z)}$

同时求两边在Z上的期望

$E_{Z}[logP(X|\theta)]=E_{Z}log\frac{P(X,Z|\theta)/q(Z)}{P(Z|X,\theta)/q(Z)}$

因为 $logP(X|\theta)$ 与Z无关，所以求期望仍然不变：

$E_{Z}[logP(X|\theta)]=\sum_{Z}q(Z)logP(X|\theta)=logP(X|\theta)\sum_{Z}q(Z)=logP(X|\theta)$

接着，将右边展开

$E_{Z}log\frac{P(X,Z|\theta)/q(Z)}{P(Z|X,\theta)/q(Z)}=\sum_{Z}q(Z)log\frac{P(X,Z|\theta)/q(Z)}{P(Z|X,\theta)/q(Z)}=\sum_{Z}q(Z)log\frac{P(X,Z|\theta)}{q(Z)}-\sum_{Z}q(Z)log\frac{P(Z|X,\theta)}{q(Z)}=\sum_{Z}q(Z)log\frac{P(X,Z|\theta)}{q(Z)}-\sum_{Z}q(Z)log\frac{q(Z)}{P(Z|X,\theta)}=\sum_{Z}q(Z)log\frac{P(X,Z|\theta)}{q(Z)}+KL(q(Z)||P(Z|X,\theta))\geq \sum_{Z}q(Z)log\frac{P(X,Z|\theta)}{q(Z)}$

其中，KL是相对熵

$KL(p||q)=\sum_{x\epsilon X}p(x)log\frac{p(x)}{q(x)}$

以上，便是E步骤的全部。由 $\hat{\theta}=\theta^{(i+1)}$ ，可以得到M步的公式：

$\theta^{(i+1)}=argmaxE_{Z|X,\theta^{(i)}}[logP(X,Z|\theta)]$

2.1 EM算法的代码实现

代码如下：

import numpy as np
import matplotlib.pyplot as plt
from sklearn.mixture import GaussianMixture
from sklearn.neighbors import KNeighborsClassifier


def knn(data, iter, k):
    data = data.reshape(-1, 3)
    data = np.column_stack((data, np.ones(row * col)))
    # 1.随机产生初始簇心
    cluster_center = data[np.random.choice(row * col, k)]
    # 2.分类
    distance = [[] for i in range(k)]
    for i in range(iter):
        print("迭代次数：{}次".format(i+1))
        # 2.1距离计算
        for j in range(k):
            distance[j] = np.sqrt(np.sum((data - cluster_center[j]) ** 2, axis=1))
        # 2.2归类
        data[:, 3] = np.argmin(distance, axis=0)
        # 3.计算新簇心
        for j in range(k):
            cluster_center[j] = np.mean(data[data[:, 3] == j], axis=0)
    return data[:, 3]


if __name__ == "__main__":
    # 清洗数据
    img = plt.imread('data\\iccv09Data\\images\\3000119.jpg')
    row = img.shape[0]
    col = img.shape[1]
    plt.subplot(321)
    plt.imshow(img)
    plt.subplot(323)
    plt.imshow(img)

    # sklearn的EM
    gm = GaussianMixture(n_components=2, init_params='kmeans', max_iter=1000).fit(img.reshape(-1, 3))
    res = gm.predict(img.reshape(-1, 3))

    gm = GaussianMixture(n_components=2, init_params='kmeans', max_iter=500).fit(img.reshape(-1, 3))
    res2 = gm.predict(img.reshape(-1, 3))

    gm = GaussianMixture(n_components=2, init_params='kmeans', max_iter=1500).fit(img.reshape(-1, 3))
    res3 = gm.predict(img.reshape(-1, 3))

    # KNN
    image_show = knn(img, 1000, 2)
    image_show = image_show.reshape(row, col)

    # 手写的EM
    # prob1 = 0.46
    # prob2 = 0.55
    # prob3 = 0.67
    # epochs = 20
    # clf = EM(prob1, prob2, prob3, epochs)
    # clf.fit(img)

    plt.subplot(322)
    plt.title('KNN 1000')
    plt.imshow(image_show, cmap='gray')

    plt.subplot(324)
    plt.title('EM 1000')
    plt.imshow(res.reshape(row, col), cmap='gray')

    plt.subplot(325)
    plt.title('EM 500')
    plt.imshow(res2.reshape(row, col), cmap='gray')
    plt.subplot(326)
    plt.title('EM 1500')
    plt.imshow(res3.reshape(row, col), cmap='gray')

    plt.show()

这里直接调用了sklearn的现成的EM算法的api，并借用了一位网友写的KNN的代码，做了对比。同时借鉴了另一位网友的博文，并在它的基础上修改了EM算法的代码，用python实现了一下EM算法，结果和sklearn的现成的EM算法的api生成的结果几乎一致。

class EMFactory(AlgorithmFactory):
    # 数据加载与解析
    def get_data(self, path):
        # '.\\data\\iccv09Data\\images\\0100113.jpg'
        print(path)
        src_image = Image.open(path)
        return src_image

    def algorithm(self, gray_status, src_image, time, plot_dir):
        # 假设两类数据初始占比相同，即先验概率相同
        P_pre1 = 0.5
        P_pre2 = 0.5

        # 假设每个数据来自两类的初始概率相同，即软标签相同
        soft_guess1 = 0.5
        soft_guess2 = 0.5
        # gray_status = False  # 灰度图像分割，打开这个开关
        # gray_status = True  # 彩色图像分割，打开这个开关

        # 灰度图像分割
        if not gray_status:
            # 数据初始化
            Gray_img = np.array(src_image.convert('L'))
            sample = np.reshape(Gray_img, (-1, 1)) / 256
            Gray_ROI = Gray_img / 255

            # 观察图像，肉眼估计初始值
            gray1_m = 0.5
            gray1_s = 0.1
            gray2_m = 0.8
            gray2_s = 0.3

            # 绘制假定的PDF
            x = np.arange(0, 1, 1 / 1000)
            gray1_pdf = norm.pdf(x, gray1_m, gray1_s)
            gray2_pdf = norm.pdf(x, gray2_m, gray2_s)
            plt.figure(0)
            ax = plt.subplot(1, 1, 1)
            ax.plot(x, gray1_pdf, 'r', x, gray2_pdf, 'b')
            ax.set_title('supposed PDF')
            plt.figure(1)
            ax1 = plt.subplot(1, 1, 1)
            ax1.imshow(Gray_img, cmap='gray')
            ax1.set_title('gray ROI')
            plt.show()

            gray = np.zeros((len(sample), 5))
            gray_s_old = gray1_s + gray2_s

            # 迭代更新参数
            for epoch in range(time):
                print("第{}次，共{}次".format(epoch + 1, time))
                for i in range(len(sample)):
                    # 贝叶斯计算每个数据的后验，即得到软标签（E过程）
                    soft_guess1 = (P_pre1 * norm.pdf(sample[i], gray1_m, gray1_s)) / (
                            P_pre1 * norm.pdf(sample[i], gray1_m, gray1_s) +
                            P_pre2 * norm.pdf(sample[i], gray2_m, gray2_s))
                    soft_guess2 = 1 - soft_guess1
                    gray[i][0] = sample[i]
                    gray[i][1] = soft_guess1 * 1  # 当前一个数据中类别1占的个数，1*后验，显然是小数
                    gray[i][2] = soft_guess2 * 1
                    gray[i][3] = soft_guess1 * sample[i]  # 对当前数据中属于类别1的部分，当前数据*后验
                    gray[i][4] = soft_guess2 * sample[i]

                    # 根据软标签，再借助最大似然估计出类条件概率PDF参数——均值，标准差（M过程）

                gray1_num = sum(gray)[1]  # 对每一个数据中类别1占的个数求和，就得到数据中类别1的总数
                gray2_num = sum(gray)[2]
                gray1_m = sum(gray)[3] / gray1_num  # 对每一个数据中属于类别1的那部分求和，就得到类别1的x的和，用其除以类别1的个数就得到其均值
                gray2_m = sum(gray)[4] / gray2_num

                sum_s1 = 0.0
                sum_s2 = 0.0

                for i in range(len(gray)):
                    sum_s1 = sum_s1 + gray[i][1] * (gray[i][0] - gray1_m) * (gray[i][0] - gray1_m)  # 每个数据的波动中，属于类别1的部分
                    sum_s2 = sum_s2 + gray[i][2] * (gray[i][0] - gray2_m) * (gray[i][0] - gray2_m)
                gray1_s = pow(sum_s1 / gray1_num, 0.5)  # 标准差
                gray2_s = pow(sum_s2 / gray2_num, 0.5)

                # print(gray1_m, gray2_m, gray1_s, gray2_s)
                P_pre1 = gray1_num / (gray1_num + gray2_num)  # 更新先验概率
                P_pre2 = 1 - P_pre1

                gray1_pdf = norm.pdf(x, gray1_m, gray1_s)
                gray2_pdf = norm.pdf(x, gray2_m, gray2_s)
                gray_s_d = abs(gray_s_old - gray2_s - gray1_s)
                gray_s_old = gray2_s + gray1_s
                # if gray_s_d < 0.0001:                                                               # 迭代停止条件，如果两次方差变化较小则停止迭代
                #     break

                # 绘制更新参数后的pdf
                plt.figure(2)
                ax2 = plt.subplot(1, 1, 1)
                ax2.plot(x, gray1_pdf, 'r', x, gray2_pdf, 'b')
                ax2.set_title('epoch' + str(epoch + 1) + ' PDF')
                plt.savefig(plot_dir + '//' + 'PDF_' + str(epoch + 1) + '.jpg', dpi=100)
                plt.close()

                if epoch % 1 == 0:  # 迭代2次进行一次分割测试

                    gray_out = np.zeros_like(Gray_img)
                    for i in range(len(Gray_ROI)):
                        for j in range(len(Gray_ROI[0])):
                            if Gray_ROI[i][j] == 0:
                                continue
                            # 贝叶斯公式分子比较，等价于最大后验
                            elif P_pre1 * norm.pdf(Gray_ROI[i][j], gray1_m, gray1_s) > P_pre2 * norm.pdf(Gray_ROI[i][j],
                                                                                                         gray2_m,
                                                                                                         gray2_s):
                                gray_out[i][j] = 100
                            else:
                                gray_out[i][j] = 255
                    # 显示分割结果
                    plt.figure(3)
                    ax3 = plt.subplot(1, 1, 1)
                    ax3.imshow(gray_out, cmap='gray')
                    ax3.set_title('epoch' + str(epoch + 1) + 'gray segment')
                    plt.savefig(plot_dir + '\\Gray_segment_' + str(epoch + 1) + '.jpg', dpi=100)
                    plt.close()

                    plt.show()

        # 三维时的EM
        # 彩色图像分割
        else:
            # 初始化数据
            RGB_img = np.array(src_image)
            RGB_sample = np.reshape(RGB_img, (-1, 3)) / 256
            RGB_ROI = RGB_img / 255

            # 观察图像，肉眼估计初始值
            RGB1_m = np.array([0.5, 0.5, 0.5])
            RGB2_m = np.array([0.8, 0.8, 0.8])
            RGB1_cov = np.array([[0.1, 0.05, 0.04],
                                 [0.05, 0.1, 0.02],
                                 [0.04, 0.02, 0.1]])
            RGB2_cov = np.array([[0.1, 0.05, 0.04],
                                 [0.05, 0.1, 0.02],
                                 [0.04, 0.02, 0.1]])

            RGB = np.zeros((len(RGB_sample), 11))

            # 显示彩色ROI
            plt.figure(3)
            cx = plt.subplot(1, 1, 1)
            cx.set_title('RGB ROI')
            cx.imshow(RGB_img)
            plt.show()
            # 迭代更新参数
            for epoch in range(time):
                print("第{}次，共{}次".format(epoch + 1, time))
                for i in range(len(RGB_sample)):
                    # 贝叶斯计算每个数据的后验，即得到软标签
                    soft_guess1 = P_pre1 * multivariate_normal.pdf(RGB_sample[i], RGB1_m, RGB1_cov) / (
                            P_pre1 * multivariate_normal.pdf(RGB_sample[i], RGB1_m,
                                                             RGB1_cov) + P_pre2 * multivariate_normal.pdf(
                        RGB_sample[i], RGB2_m, RGB2_cov))
                    soft_guess2 = 1 - soft_guess1
                    RGB[i][0:3] = RGB_sample[i]
                    RGB[i][3] = soft_guess1 * 1
                    RGB[i][4] = soft_guess2 * 1
                    RGB[i][5:8] = soft_guess1 * RGB_sample[i]
                    RGB[i][8:11] = soft_guess2 * RGB_sample[i]
                # print(RGB[0])

                # 根据软标签，再借助最大似然估计出类条件概率PDF参数——均值，标准差
                RGB1_num = sum(RGB)[3]
                RGB2_num = sum(RGB)[4]
                RGB1_m = sum(RGB)[5:8] / RGB1_num
                RGB2_m = sum(RGB)[8:11] / RGB2_num

                # print(RGB1_num+RGB2_num, RGB1_m, RGB2_m)
                cov_sum1 = np.zeros((3, 3))
                cov_sum2 = np.zeros((3, 3))

                for i in range(len(RGB)):
                    # print(np.dot((RGB[i][0:3]-RGB1_m).reshape(3, 1), (RGB[i][0:3]-RGB1_m).reshape(1, 3)))
                    cov_sum1 = cov_sum1 + RGB[i][3] * np.dot((RGB[i][0:3] - RGB1_m).reshape(3, 1),
                                                             (RGB[i][0:3] - RGB1_m).reshape(1, 3))
                    cov_sum2 = cov_sum2 + RGB[i][4] * np.dot((RGB[i][0:3] - RGB2_m).reshape(3, 1),
                                                             (RGB[i][0:3] - RGB2_m).reshape(1, 3))
                RGB1_cov = cov_sum1 / (RGB1_num - 1)  # 无偏估计除以N-1
                RGB2_cov = cov_sum2 / (RGB2_num - 1)

                P_pre1 = RGB1_num / (RGB1_num + RGB2_num)
                P_pre2 = 1 - P_pre1

                print(RGB1_cov, P_pre1)

                # 用贝叶斯对彩色图像进行分割

                RGB_out = np.zeros_like(RGB_ROI)

                for i in range(len(RGB_ROI)):
                    for j in range(len(RGB_ROI[0])):
                        if np.sum(RGB_ROI[i][j]) == 0:
                            continue
                        # 贝叶斯公式分子比较
                        elif P_pre1 * multivariate_normal.pdf(RGB_ROI[i][j], RGB1_m,
                                                              RGB1_cov) > P_pre2 * multivariate_normal.pdf(
                            RGB_ROI[i][j], RGB2_m, RGB2_cov):
                            RGB_out[i][j] = [255, 0, 0]
                        else:
                            RGB_out[i][j] = [0, 255, 0]
                # print(RGB_ROI.shape)

                # 显示彩色分割结果
                plt.figure(4)
                ax3 = plt.subplot(1, 1, 1)
                ax3.imshow(RGB_out)
                ax3.set_title('epoch' + str(epoch + 1) + ' RGB segment')
                plt.savefig(plot_dir + '\\RGB_segment_' + str(epoch + 1) + '.jpg', dpi=100)
                plt.close()

    def algorithm_creator(self, gray_status, time, plot_dir, path):
        data = self.get_data(path)
        self.algorithm(gray_status, data, time, plot_dir)

    def algorithm_product(self, time, gray_status, paths):
        for path in paths:
            plot_dir = 'EM_out_' + path.replace("data\\iccv09Data\\images\\", '').replace(".jpg", '')
            print(plot_dir)
            if os.path.exists(plot_dir) == 0:
                os.mkdir(plot_dir)
            print(plot_dir, path)
            self.algorithm_creator(gray_status, time, plot_dir, path)

这是结果对比图：