Data Clustering: 50 Years Beyond K-Means1_随笔

Data Clustering: 50 Years Beyond K-Means1 论文题目：Data Clustering: 50 Years Beyond K-Means1

这篇paper对聚类的发展作了较为完善的概述，适合研究聚类方向的入门者（特别是我这样的小白哈哈）

一.小知识收藏

1.1955年 K-means算法提出
2.信息爆炸下大数据的特点：高容量，高纬度，种类多。
数据类型分为：结构化数据和非结构化数据。非结构化数据包括：图片，视频，音频，文本等。
3.数据分析主要分为两种：（1）探索描述：数据的特征和结构；（2）推断：基于给定数据确认模型、假设的有效性
4.学习问题（利用训练集预测测试集的行为）分为三种：（1）有监督学习（有标签y，分类）（2）无监督学习（无标签y，聚类）（3）半监督学习（包括半监督分类和半监督聚类）

5.聚类算法可以大致分为两类：层次和分块（hierarchical and partitional）。层次聚类主要分为自顶向下和自底向上。
6.层次聚类的输入是一个 n × n n times n n×n相似矩阵，其中 n n n是对象数。
分块聚类的输入是一个 n × d n times d n×d模式矩阵pattern matrix（k-means）或 n × n n times n n×n相似矩阵（谱聚类）。
7.

二.聚类clustering 1.聚类的定义

聚类的通俗定义：找到一种相似性度量，对全部对象分组，使得簇内的对象尽可能的相似，不同簇的对象差异尽可能的大。（对无标签数据分组）
虽然肉眼可以将图（a）分组为图（b）。如何设计一种聚类算法使得图（a）的聚类结果为（b)。

对于低维数据（二维甚至三维），我们可以人工聚类。关键是设计聚类算法对高维数据能合理有效的分组。

2.聚类的目的

聚类和分类的区别：有无标签y。举一个例子，盘子里有一堆苹果和梨。分类是告诉计算机什么是苹果，什么是梨，再让计算机对盘子里的水果分组。聚类是不告诉计算机苹果和梨的特征，让计算机根据盘子水果的特征对其分组。

对手写数字集分类可以得到手写数字2，对手写数字2聚类可以得到不同写法的2。

三、k-means算法 1.算法原理

X = { x i } , i = 1 , ⋯ , n X={x_i},i=1,cdots,n X={xi},i=1,⋯,n是一个d维数据集，将其划分为k个簇， C = { c k , k = 1 , ⋯ , K } C={c_k,k=1,cdots,K} C={ck,k=1,⋯,K}。
k-means的目标函数如下：
J ( C ) = ∑ k = 1 K ∑ x i ∈ c k ∥ x i − μ k ∥ 2 J(C)=sumlimits_{k=1}^{K} sumlimits_{x_iin c_k}|x_i-mu_k|^2 J(C)=k=1∑Kxi∈ck∑∥xi−μk∥2，其中 μ k mu_k μk是簇 c k c_k ck的均值。
k-means算法是找到一种划分使得目标函数达到最小。
k-means属于贪心算法，可能收敛于局部最小。
J（C）随着K的增加一直减小，当K=n时， J ( n ) = 0 J(n)=0 J(n)=0。所以k-means算法固定K寻找最小的划分。
k-means算法步骤：

2.k-means的参数

k-means算法需要提前指定三个参数：簇的个数K，初始化簇中心，距离度量。其中k值是最关键的。

1.初始化不同可能导致不同的聚类结果，因为k-means可能收敛于局部最优。避免局部最优的方法是对于给定的k值，设置多次不同的初始化，选择目标函数最小的聚类结果。
2.常见的距离度量有：欧式距离，马氏距离等。使用欧式距离，簇的形状是球形的；使用马氏距离，簇的形状是超椭圆形的。

3.k-means算法的拓展

1.Kernel K-means [Scholkopf et al. , 1998]基于核相似函数可检测任意形状的簇。
2.K- medoid [Kaufman & Rousseeuw, 2005]使用中位数代替均值聚类。
3. X- means [Pelleg & Moore, 2000]通过优化Akaike Information Criterion (AIC) or Bayesian Information Criterion (BIC)选择K。
4. Isodata [Ball & Hall, 1965] 和Forgy [Forgy, 1965]——k-means的两种变体。
5. Fuzzy c- means, proposed by Dunn [Dunn, 1973] and later improved by Bezdek [Bezdek, 1981]使得每个点可以同时属于多个簇（模糊集）。
6. bisecting K-means[Steinbach et al. , 2000]
7. 使用kd树加快k-means的计算速度[Pelleg & Moore, 1999]。

四、其他的聚类方法

Clusters can be defined as high density regions in the feature space separated by low density regions.
1.The Jarvis-Patrick 算法 defines the similarity between a pair of points as the number of common neighbors they share, where neighbors are the points present in a region of pre-specified radius around the point [Frank & Todeschini, 1994].

2.Ester et al.[1996] 提出了 DBSCAN 聚类算法，it directly searches for connected dense regions in the feature space by estimating the density using the Parzen window method.

Performance of Jarvis Patrick algorithm and DBSCAN depend on two parameters: neighborhood size in terms of distance, and the minimum number of points in a neighborhood for its inclusion in a cluster.

3.Graph theoretic clustering represents the data points as nodes in a weighted graph.

五、聚类的主要挑战

可参考文献[Jain & Dubes, 1988]。
(a) What is a cluster?
(b) What features should be used?
© Should the data be normalized?
(d) Does the data contain any outliers?
(e) How do we define the pair-wise similarity?
(f) How many clusters are present in the data?

Figueiredo and Jain [Figueiredo & Jain, 2002] used the minimum message length (MML) criteria in conjunction with the Gaussian mixture model (GMM) to estimate K.

(g) Which clustering method should be used?
不存在最好的聚类算法

(h) Does the data have any clustering tendency?
(i) Are the discovered clusters and partition valid?
不管数据中是否存在簇，聚类算法都会对数据j进行划分，得到簇。

均匀分布的数据点没有聚类的意义。

欢迎分享，转载请注明来源：内存溢出

原文地址: https://outofmemory.cn/zaji/5712159.html

Data Clustering: 50 Years Beyond K-Means1

发表评论

评论列表（0条）