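Mean shift: in essence, the centroid (the center of the circle in the figure below) gradually drifts toward regions where the sample points are increasingly dense. When the algorithm converges, the centroid sits where the sample points are locally densest.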
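1. The basic mean shift algorithm
The basic mean shift algorithm proceeds as follows:
1) Pick a random point x0 and draw a circle/ball around it; x0 is the first mean.
2) With x0 as the start point and every other point inside the ball as an end point, form the mean shift vectors.
3) Sum all of these vectors to obtain a single vector that starts at x0 and ends at x1.
4) Shift the mean to x1.
5) Take x1 as the second mean and repeat the steps above.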
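From this procedure it is easy to see that the mean drifts, overall, toward regions of higher data density: the denser the data on one side, the more vectors point that way, and the more the vector x0x1 is pulled in that direction.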
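The five steps above can be sketched as a single update function. This is a minimal illustration, not the implementation from section 3: the function name `basic_shift_step` and the toy data are made up for the example, and the shift is the plain sum of the in-ball vectors, exactly as step 3 describes.

```python
import numpy as np


def basic_shift_step(points, mean, radius):
    """One step of the basic procedure: sum the vectors from `mean` to every point in the ball."""
    # Distance from every point to the current mean
    distance = np.sqrt(np.sum((points - mean) ** 2, axis=1))
    in_ball = points[distance <= radius]
    # Steps 2-3: one vector per in-ball point, then sum them
    shift = np.sum(in_ball - mean, axis=0)
    # Step 4: move the mean to the end point x1
    return mean + shift


# Hypothetical toy data: three points to the right of the start, one to the left
points = np.array([[0.5, 0.0], [0.6, 0.1], [0.7, -0.1], [-0.5, 0.0]])
x0 = np.array([0.0, 0.0])
x1 = basic_shift_step(points, x0, radius=1.0)
print(x1)
```

In this toy case the mean moves to the right, toward the side of the ball that holds more points.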
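2. The improved mean shift algorithm
The basic mean shift algorithm has a serious flaw. In step 3 above, when one side of the ball contains many points, the resulting vector x0x1 can become very long, so the mean may drift far away, even to a place with no points around it at all. The algorithm then tends to oscillate instead of converging. The improved version therefore applies two normalizations:
1) Normalize each vector inside the ball.
2) Normalize the mean shift vector.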
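The effect of the two normalizations can be sketched as follows. This is an illustration under stated assumptions: the function name `normalized_shift_step` and the random toy data are hypothetical, and both normalizations are read as scaling to unit length. The raw summed vector is computed alongside only to contrast its length with the normalized step.

```python
import numpy as np


def normalized_shift_step(points, mean, radius):
    """One step with both normalizations applied."""
    distance = np.sqrt(np.sum((points - mean) ** 2, axis=1))
    vectors = points[distance <= radius] - mean
    # 1) Normalize each vector in the ball (zero-length vectors are left unchanged)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    vectors = vectors / norms
    # 2) Normalize the summed mean shift vector
    shift = vectors.sum(axis=0)
    length = np.linalg.norm(shift)
    if length == 0:
        return mean  # empty ball or balanced vectors: no direction to move in
    return mean + shift / length


# Hypothetical toy data: many points clustered on one side of the start
rng = np.random.default_rng(0)
points = rng.normal(loc=[0.5, 0.0], scale=0.2, size=(200, 2))
x0 = np.array([0.0, 0.0])

# Length of the plain summed vector from the basic algorithm, for contrast
in_ball = points[np.linalg.norm(points - x0, axis=1) <= 1.0]
raw_length = np.linalg.norm((in_ball - x0).sum(axis=0))

x1 = normalized_shift_step(points, x0, radius=1.0)
step_length = np.linalg.norm(x1 - x0)
print(raw_length, step_length)
```

With many points on one side, the raw sum grows with the number of points, while the normalized step keeps unit length regardless of how crowded the ball is.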
3. Implementing the mean shift algorithm
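# -*- coding: utf-8 -*-
import numpy as np
import utils


class MeanShift:
    def __init__(self, mean, radius):
        """
        mean: center (mean) of the mean shift ball
        radius: radius of the mean shift ball
        """
        self.mean = mean
        self.radius = radius

    def _compute_distance(self, train_x):
        """ Compute the distance from every point to the ball center """
        return np.sqrt(np.sum((train_x - self.mean)**2, axis=1))

    def create_ball(self, train_x):
        """ 1. Build the ball
        train_x: all data
        distance: distance from every point to the ball center (mean)
        inBall_index: indices, within the whole data set, of the points inside the ball
        """
        # 1. Compute the distances
        distance = self._compute_distance(train_x)
        # 2. Find the points inside the ball
        inBall_index = np.argwhere(distance <= self.radius)
        return inBall_index.reshape(len(inBall_index),)

    def compute_meanshiftVector(self, train_x, inBall_index):
        """ 2. Compute the mean shift vector
        train_x: all data
        inBall_index: indices, within the whole data set, of the points inside the ball
        allVector: the vectors formed by all points inside the ball
        return: the mean shift vector
        """
        allVector = train_x[inBall_index] - self.mean
        # Normalize every vector inside the ball to unit length
        # (divide by the norm, not the squared norm; zero-length vectors are left as is)
        norms = np.sqrt(np.sum(allVector**2, axis=1, keepdims=True))
        norms[norms == 0] = 1.0
        allVector = allVector / norms
        # Sum into the mean shift vector and normalize it as well
        meanshiftVector = np.sum(allVector, axis=0)
        norm = np.sqrt(np.sum(meanshiftVector**2))
        if norm == 0:
            # Empty ball or perfectly balanced vectors: do not move
            return np.zeros(allVector.shape[1])
        return meanshiftVector / norm

    def update_mean(self, meanshiftVector):
        """ 3. Update the mean, i.e. the ball center """
        self.mean = self.mean + meanshiftVector

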
def main(max_iter, mean, radius):
    train_x = utils.load_data()
    for i in range(len(mean)):
        iter_times = 0
        obj = MeanShift(mean[i, :], radius)
        while iter_times < max_iter:
            inBall_index = obj.create_ball(train_x)
            meanshiftVector = obj.compute_meanshiftVector(train_x, inBall_index)
            obj.update_mean(meanshiftVector)
            iter_times += 1
        print(obj.mean)


if __name__ == "__main__":
    mean = np.array([[2, 2], [2, -2], [-2, 2], [-2, -2]])
    radius = 1
    main(200, mean, radius)

# -*- coding: utf-8 -*-
"""
file: utils.py
author: UniqueZ_
date: 2017-07-28
"""
import numpy as np
import matplotlib.pyplot as plt
import time


def load_data():
    with open("../mean_shift/data/testSet.txt", "r") as f:
        train_x = []
        for line in f.readlines():
            train_x.append(line.strip().split("\t"))
    # np.float was removed in NumPy 1.24; the builtin float is equivalent
    train_x = np.array(train_x, float)
    return train_x

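As an alternative sketch, `np.loadtxt` can parse the same tab-separated layout directly. The sample contents and the file name `sample.txt` below are made up for illustration; they are not the contents of testSet.txt.

```python
import os
import tempfile

import numpy as np

# Hypothetical sample in the same tab-separated layout that load_data() parses
sample = "1.0\t2.0\n-0.5\t3.25\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")
    with open(path, "w") as f:
        f.write(sample)
    # delimiter="\t" splits each line on tabs; the result is a float array
    # with one row per sample and one column per feature
    train_x = np.loadtxt(path, delimiter="\t")

print(train_x.shape)
```

The manual loop in load_data() and this call are expected to yield the same kind of array for a well-formed file; the loop version makes the parsing steps explicit.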
References
http://blog.csdn.net/google19890102/article/details/51030884
http://blog.csdn.net/jinshengtao/article/details/30258833