Cluster Analysis Explained: From Fixed-Number Clustering to Practical Applications
Published: 2025-08-23 02:00:41

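### Cluster Analysis Explained: From Fixed-Number Clustering to Practical Applications
#### 1. Fixed Number of Clusters
Sometimes a clustering run must be forced to produce a preset number of clusters. The following examples show what such forced clustering yields.
First, we generate two clearly separated input clouds: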
```mathematica
(* Reset user-defined symbols and load the CIP packages *)
Clear["Global`*"];
<<CIP`Cluster`
<<CIP`Graphics`
<<CIP`CalculatedData`
(* Two Gaussian clouds of 500 inputs each, standard deviation 0.05 *)
standardDeviation = 0.05;
numberOfCloudInputs = 500;
centroid1 = {0.3, 0.7};
cloudDefinition1 = {centroid1, numberOfCloudInputs, standardDeviation};
inputs1 = CIP`CalculatedData`GetDefinedGaussianCloud[cloudDefinition1];
centroid2 = {0.7, 0.3};
cloudDefinition2 = {centroid2, numberOfCloudInputs, standardDeviation};
inputs2 = CIP`CalculatedData`GetDefinedGaussianCloud[cloudDefinition2];
(* Merge both clouds into one input set and plot it *)
inputs = Join[inputs1, inputs2];
labels = {"x", "y", "Inputs to be clustered"};
points2DWithPlotStyle = {inputs, {PointSize[0.01], Blue}};
points2DWithPlotStyleList = {points2DWithPlotStyle};
CIP`Graphics`PlotMultiple2dPoints[points2DWithPlotStyleList, labels]
```
Forcing these two optimal (natural) clusters into 3 clusters:
```mathematica
numberOfClusters = 3;
clusterInfo = CIP`Cluster`GetFixedNumberOfClusters[inputs, numberOfClusters];
CIP`Cluster`ShowClusterResult[{"NumberOfClusters", "EuclideanDistanceDiagram", "ClusterStatistics"}, clusterInfo]
```
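Result:

| Cluster | Members | Share | Distance |
| --- | --- | --- | --- |
| 1 | 500 | 50% | 0 |
| 2 | 271 | 27.1% | 0.561643 |
| 3 | 229 | 22.9% | 0.573776 |

The inputs are split into one large cluster and two adjacent small ones: the second natural cluster is simply cut in half. A silhouette-width check shows one good cluster (identical to the first natural cluster) and two poor ones.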
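
For reference, the silhouette width has a standard definition, given here as general background rather than taken from the CIP documentation. For an input $i$, let $a(i)$ be its mean distance to the other members of its own cluster and $b(i)$ the smallest mean distance to the members of any other cluster:

$$
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}, \qquad -1 \le s(i) \le 1
$$

Values near 1 indicate an input that sits well inside its cluster, while values near 0 indicate an input lying between two clusters. When a natural cluster is cut in half, the two halves are direct neighbors, so $b(i)$ is much closer to $a(i)$ than it is for two separated clouds and the silhouette widths drop.
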
Partitioning the inputs into 4 clusters:
```mathematica
numberOfClusters = 4;
clusterInfo = CIP`Cluster`GetFixedNumberOfClusters[inputs, numberOfClusters];
CIP`Cluster`ShowClusterResult[{"NumberOfClusters", "EuclideanDistanceDiagram", "ClusterStatistics"}, clusterInfo]
```
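Result:

| Cluster | Members | Share | Distance |
| --- | --- | --- | --- |
| 1 | 282 | 28.2% | 0 |
| 2 | 218 | 21.8% | 0.0842652 |
| 3 | 265 | 26.5% | 0.568283 |
| 4 | 235 | 23.5% | 0.587472 |

The inputs are split into four small clusters of similar size, each one half of one of the two natural clusters; the silhouette widths show that all 4 clusters are poor.
These examples show that partitioning the inputs into more and more clusters is of little use here: the less natural the clusters become, the lower the clustering quality.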
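
The trend can be cross-checked without CIP. The following is a minimal sketch under stated assumptions: it swaps in the built-in `FindClusters` with `Method -> "KMeans"` for the CIP function GetFixedNumberOfClusters, so the cluster sizes will not match the tables above exactly. The helpers `gaussianCloud` and `meanSilhouetteWidths` are hypothetical names introduced only for this illustration.

```mathematica
(* Sketch with built-in functions only; not the CIP implementation *)
SeedRandom[1];

(* Hypothetical helper: n points with independent Gaussian components around centroid *)
gaussianCloud[centroid_, n_, sd_] :=
  Transpose[Map[RandomVariate[NormalDistribution[#, sd], n] &, centroid]];

inputs = Join[
   gaussianCloud[{0.3, 0.7}, 500, 0.05],
   gaussianCloud[{0.7, 0.3}, 500, 0.05]];

(* Hypothetical helper: mean silhouette width of each cluster.
   Assumes at least two clusters and at least two members per cluster. *)
meanSilhouetteWidths[clusters_] := Table[
   Module[{a, b},
    (* a: mean distance of each member to the other members of its own cluster *)
    a = (Total /@ DistanceMatrix[clusters[[i]]])/(Length[clusters[[i]]] - 1);
    (* b: smallest mean distance of each member to the members of another cluster *)
    b = Min /@ Transpose[
       Table[Mean /@ DistanceMatrix[clusters[[i]], clusters[[j]]],
        {j, Delete[Range[Length[clusters]], i]}]];
    Mean[(b - a)/MapThread[Max, {a, b}]]],
   {i, Length[clusters]}];

(* Mean silhouette width per cluster for 2, 3 and 4 requested clusters *)
Table[
 k -> meanSilhouetteWidths[FindClusters[inputs, k, Method -> "KMeans"]],
 {k, 2, 4}]
```

If the reasoning above holds, the two natural clusters should score close to 1 for `k = 2`, and any cluster that is only half of a natural cluster should score clearly lower.
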
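#### 2. Getting Representatives
An important application of forcing a fixed number of clusters is to generate a small set of representatives for a set of inputs, with a spatial diversity similar to that of the full input set.
##### 2.1 Uniformly Distributed Inputs
First, we use 5000 randomly distributed inputs as an example: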
```mathematica
(* Reset user-defined symbols and load the CIP packages *)
Clear["Global`*"];
<<CIP`Graphics`
<<CIP`Cluster`
<<CIP`CalculatedData`
(* 5000 inputs drawn uniformly from [0.05, 0.95] x [0.05, 0.95] *)
SeedRandom[1];
inputs = Table[{RandomReal[{0.05, 0.95}], RandomReal[{0.05, 0.95}]}, {5000}];
argumentRange = {0.0, 1.0};
functionValueRange = {0.0, 1.0};
labels = {"x", "y", "Inputs"};
allInputVectorsWithPlotStyle = {inputs, {PointSize[0.01], Green}};
points2DWithPlotStyleList = {allInputVectorsWithPlotStyle};
CIP`Graphics`PlotMultiple2dPoints[points2DWithPlotStyleList, labels, GraphicsOptionArgumentRange2D -> argumentRange, GraphicsOptionFunctionValueRange2D -> functionValueRange]
```
Inspect the statistics of each input component:
```mathematica
indexOfComponentList = {1, 2};
numberOfIntervals = 5;
argumentRange = {0.0, 1.0};
functionValueRange = {0.0, 30.0};
CIP`Cluster`ShowComponentStatistics[inputs, indexOfComponentList, ClusterOptionNumberOfIntervals -> numberOfIntervals, GraphicsOptionArgumentRange2D -> argumentRange, GraphicsOptionFunctionValueRange2D -> functionValueRange]
```
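
The same check can be approximated with built-in functions. This sketch assumes that the component statistics amount to the percentage of inputs per interval, which is a guess based on the 0 to 30 function value range above, not on the CIP documentation. It rescales each component from [0.05, 0.95] to [0, 1] and counts the inputs in 5 equal intervals.

```mathematica
(* Sketch with built-in functions only; not the CIP implementation *)
SeedRandom[1];
inputs = Table[{RandomReal[{0.05, 0.95}], RandomReal[{0.05, 0.95}]}, {5000}];
numberOfIntervals = 5;

(* Percentage of inputs per interval, one row per component *)
percentages = Table[
   100. BinCounts[
      Rescale[inputs[[All, c]], {0.05, 0.95}],
      {0, 1, 1/numberOfIntervals}]/Length[inputs],
   {c, {1, 2}}]
```

For uniformly distributed inputs, every entry should be close to 20.
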
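The result shows that the inputs are approximately uniformly distributed. If 20 representatives are needed, they can be chosen at random: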
```mathematica
numberOfRepresentatives = 20;
randomRepresentatives = CIP`Cluster`GetRandomRepresentatives[inputs, numberOfRepresentatives];
labels = {"x", "y", "Random representatives"};
argumentRange = {0.0, 1.0};
functionValueRange = {0.0, 1.0};
randomRepresentativesBackground = {randomRepresentatives, {PointSize[0.025], White}};
randomRepresentativesWithPlotStyle = {randomRepresentatives, {PointSize[0.02], Black}};
points2DWithPlotStyleList = {allInputVectorsWithPlotStyle, randomRepresentativesBackground, randomRepresentativesWithPlotStyle};
CIP`Graphics`PlotMultiple2dPoints[points2DWithPlotStyleList, labels, GraphicsOptionArgumentRange2D -> argumentRange, GraphicsOptionFunctionValueRange2D -> functionValueRange]
```
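In this example the randomly selected representatives describe the input space satisfactorily, although randomly chosen inputs are not strictly evenly spaced.
An alternative is cluster-based selection: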
```mathematica
clusterRepresentatives = CIP`Cluster`GetClusterRepresentatives[inputs, numberOfRepresentatives];
labels = {"x", "y", "Cluster representatives"};
clusterRepresentativesBackground = {clusterRepresentatives, {PointSize[0.025], White}};
clusterRepresentativesWithPlotStyle = {clusterRepresentatives, {PointSize[0.02], Black}};
points2DWithPlotStyleList = {allInputVectorsWithPlotStyle, clusterRepresentativesBackground, clusterRepresentativesWithPlotStyle};
CIP`Graphics`PlotMultiple2dPoints[points2DWithPlotStyleList, labels, GraphicsOptionArgumentRange2D -> argumentRange, GraphicsOptionFunctionValueRange2D -> functionValueRange]
```
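The cluster-based representatives appear more evenly distributed. In this example random and cluster-based selection give comparable results, with a slight advantage for the cluster-based selection.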
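
The visual impression of "more evenly distributed" can be put into numbers. The sketch below is an illustration under assumptions, not the CIP method: it assumes a cluster-based representative is the input closest to its cluster centroid, and it uses the built-in `FindClusters` with `Method -> "KMeans"` to form the clusters. The helper `nearestNeighborDistances` is a hypothetical name; it returns, for each representative, the distance to the closest other representative.

```mathematica
(* Sketch with built-in functions only; not the CIP implementation *)
SeedRandom[1];
inputs = Table[{RandomReal[{0.05, 0.95}], RandomReal[{0.05, 0.95}]}, {5000}];
numberOfRepresentatives = 20;

(* Random selection: draw 20 inputs without replacement *)
randomRepresentatives = RandomSample[inputs, numberOfRepresentatives];

(* Cluster-based selection (assumed rule): one cluster per representative,
   then the input closest to each cluster centroid *)
clusters = FindClusters[inputs, numberOfRepresentatives, Method -> "KMeans"];
clusterRepresentatives = Map[First[Nearest[#, Mean[#]]] &, clusters];

(* Hypothetical helper: distance of each point to its nearest neighbor in the same set;
   the first sorted entry of each row is the zero distance to the point itself *)
nearestNeighborDistances[points_] := Map[Sort[#][[2]] &, DistanceMatrix[points]];

(* Minimum, mean and maximum nearest-neighbor distance: random first, cluster-based second *)
Map[{Min[#], Mean[#], Max[#]} &,
 {nearestNeighborDistances[randomRepresentatives],
  nearestNeighborDistances[clusterRepresentatives]}]
```

A larger minimum and a narrower range between minimum and maximum would indicate more even spacing of the representatives.
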
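##### 2.2 Non-Uniformly Distributed Inputs
The situation changes when the density of the inputs varies across the input space. We generate inputs with different densities: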
```mathematica
centroid1 = {0.3, 0.7
```