As I mentioned in the comments, the key point I see is the `means` initialization. Following the default implementation of sklearn's Gaussian Mixture, I switched from random initialization to KMeans.
```python
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
plt.style.use('seaborn')

eps = 1e-8

def PDF(data, means, variances):
    return 1 / (np.sqrt(2 * np.pi * variances) + eps) * np.exp(-1/2 * (np.square(data - means) / (variances + eps)))

def EM_GMM(data, k=3, iterations=100, init_strategy='kmeans'):
    weights = np.ones((k, 1)) / k                                # shape=(k, 1)
    if init_strategy == 'kmeans':
        from sklearn.cluster import KMeans
        km = KMeans(n_clusters=k).fit(data[:, None])
        means = km.cluster_centers_                              # shape=(k, 1)
    else:  # init_strategy == 'random'
        means = np.random.choice(data, k)[:, np.newaxis]         # shape=(k, 1)
    variances = np.random.random_sample(size=k)[:, np.newaxis]   # shape=(k, 1)

    data = np.repeat(data[np.newaxis, :], k, 0)                  # shape=(k, n)

    for step in range(iterations):
        # Expectation step
        likelihood = PDF(data, means, np.sqrt(variances))        # shape=(k, n)

        # Maximization step
        b = likelihood * weights                                 # shape=(k, n)
        b /= np.sum(b, axis=1)[:, np.newaxis] + eps

        # update means, variances, and weights
        means = np.sum(b * data, axis=1)[:, np.newaxis] / (np.sum(b, axis=1)[:, np.newaxis] + eps)
        variances = np.sum(b * np.square(data - means), axis=1)[:, np.newaxis] / (np.sum(b, axis=1)[:, np.newaxis] + eps)
        weights = np.mean(b, axis=1)[:, np.newaxis]

    return means, variances
```
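For reference, the loop is modeled on the standard EM updates for a one-dimensional Gaussian mixture; writing $b_{ji}$ for the responsibility of component $j$ for point $x_i$, the textbook form (my summary, with the `eps` stabilizers omitted) is:

$$
b_{ji} = \frac{w_j\,\mathcal{N}(x_i \mid \mu_j, \sigma_j^2)}{\sum_{l=1}^{k} w_l\,\mathcal{N}(x_i \mid \mu_l, \sigma_l^2)}, \qquad
\mu_j = \frac{\sum_i b_{ji}\,x_i}{\sum_i b_{ji}}, \qquad
\sigma_j^2 = \frac{\sum_i b_{ji}\,(x_i - \mu_j)^2}{\sum_i b_{ji}}, \qquad
w_j = \frac{1}{n}\sum_{i=1}^{n} b_{ji}
$$

Note that the code above only approximates this (it normalizes `b` along `axis=1` and passes `np.sqrt(variances)` into `PDF`), so treat the equations as the intent rather than a line-for-line match.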
This seems to yield the desired output much more consistently:
```python
s = np.array([25.31, 24.31, 24.12, 43.46, 41.48666667, 41.48666667,
              37.54, 41.175, 44.81, 44.44571429, 44.44571429, 44.44571429,
              44.44571429, 44.44571429, 44.44571429, 44.44571429, 44.44571429,
              44.44571429, 44.44571429, 44.44571429, 44.44571429, 44.44571429,
              39.71, 26.69, 34.15, 24.94, 24.75, 24.56, 24.38, 35.25,
              44.62, 44.94, 44.815, 44.69, 42.31, 40.81, 44.38, 44.56,
              44.44, 44.25, 43.66666667, 43.66666667, 43.66666667, 43.66666667,
              43.66666667, 40.75, 32.31, 36.08, 30.135, 24.19])
k = 3
n_iter = 100

means, variances = EM_GMM(s, k, n_iter)
print(means, variances)
```

```
[[44.42596231]
 [24.509301  ]
 [35.4137508 ]]
[[0.07568723]
 [0.10583743]
 [0.52125856]]
```

```python
# Plotting the results
colors = ['green', 'red', 'blue', 'yellow']
bins = np.linspace(np.min(s) - 2, np.max(s) + 2, 100)

plt.figure(figsize=(10, 7))
plt.xlabel('$x$')
plt.ylabel('pdf')
sns.scatterplot(s, [0.05] * len(s), color='navy', s=40, marker=2, label='Series data')
for i, (m, v) in enumerate(zip(means, variances)):
    sns.lineplot(bins, PDF(bins, m, v), color=colors[i], label=f'Cluster {i+1}')
plt.legend()
plt.plot()
```
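Since the initialization mirrors sklearn's default, a quick cross-check against `sklearn.mixture.GaussianMixture` itself is possible. This is a sketch I'm adding for comparison, not part of the code above, and the exact numbers will differ slightly since the implementations are not identical:

```python
# Cross-check (sketch, not part of the answer's code): GaussianMixture
# also defaults to k-means-based initialization (init_params='kmeans').
from sklearn.mixture import GaussianMixture

gm = GaussianMixture(n_components=k, init_params='kmeans', random_state=0)
gm.fit(s[:, None])               # sklearn expects a 2-D (n, 1) array
print(gm.means_.ravel())         # component means
print(gm.covariances_.ravel())   # variances (1-D data, 'full' covariance)
```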
Finally, we can see that a purely random initialization produces different results; let's look at the resulting `means`:
```python
for _ in range(5):
    print(EM_GMM(s, k, n_iter, init_strategy='random')[0], '\n')
```

```
[[44.42596231]
 [44.42596231]
 [44.42596231]]
[[44.42596231]
 [24.509301  ]
 [30.1349997 ]]
[[44.42596231]
 [35.4137508 ]
 [44.42596231]]
[[44.42596231]
 [30.1349997 ]
 [44.42596231]]
[[44.42596231]
 [44.42596231]
 [44.42596231]]
```
You can see how different these results are: in some cases the resulting means are constant, meaning that the initialization picked three similar values that barely changed during the iterations. Adding some print statements inside EM_GMM will make that clear.
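As a sketch of what those print statements would show (this snippet is mine, not from the original answer), you can rerun the same random initialization with increasing iteration counts and watch whether the means move at all:

```python
# Fix the seed so every run starts from the identical random init;
# only the number of EM iterations changes between runs.
for n in (1, 10, 100):
    np.random.seed(0)
    m, _ = EM_GMM(s, k, n, init_strategy='random')
    print(f'{n:3d} iterations -> means = {m.ravel()}')
```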