Unsupervised Learning: Clustering, Dimensionality Reduction
Clustering: K-means
1. Place the initial cluster centers arbitrarily
2. Assign: give each point to its nearest cluster center
3. Optimize: move each center to the mean of the points assigned to it
The boundary between two clusters is found by drawing the perpendicular bisector of the line connecting the two centers (points on that line are equidistant from both centers).
Optimization: the centers move to the means of their assigned points.
Reassign, re-optimize... repeating until the centers stop moving and settle on the correct centers of the two classes.
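The assign/update loop above can be sketched as a minimal NumPy implementation (this is an illustrative sketch, not the course code; the two-blob toy data and seed are made up):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal k-means: arbitrary initial centers, then assign/update until convergence."""
    rng = np.random.default_rng(seed)
    # 1. place the initial centers arbitrarily (here: k random data points)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # 2. assign each point to its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. move each center to the mean of its assigned points
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):  # centers stopped moving
            break
        centers = new_centers
    return labels, centers

# two well-separated synthetic blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(10, 1, (50, 2))])
labels, centers = kmeans(X, k=2)
```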
Very interesting k-means visualization tool http://www.naftaliharris.com/blog/visualizing-k-means-clustering/
Clustering Algorithms http://scikit-learn.org/stable/modules/clustering.html
k-means http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
When using k-means, you must specify the number of clusters up front; points are then assigned to clusters based on their distance to the centers.
n_clusters: number of clusters to form, default 8
max_iter: maximum number of assign/update iterations for a single run
n_init: number of times the algorithm is run with different initial centers; the run with the lowest inertia (sum of squared distances to the nearest center) is kept
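A short sketch of how these parameters appear in a scikit-learn call (the toy two-blob data and the seed are invented for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

# two well-separated synthetic blobs
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2), rng.randn(20, 2) + 10])

km = KMeans(
    n_clusters=2,    # number of clusters to form (default is 8)
    max_iter=300,    # cap on assign/update iterations per run
    n_init=10,       # run 10 times from different initial centers and
                     # keep the run with the lowest inertia
    random_state=42,
)
labels = km.fit_predict(X)
```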
Limitation of k-means: for a fixed dataset and a fixed number of cluster centers, different runs can produce different results, because the outcome is determined by where the initial cluster centers are placed.
The choice of initial centers in k-means strongly affects the final clustering; a bad initialization can leave the algorithm stuck in a poor local optimum. For example, the clustering in the figure below has a serious problem.
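This sensitivity can be demonstrated by running scikit-learn's KMeans with a single random initialization (n_init=1) several times; the make_blobs data and the range of seeds below are illustrative choices, not from the course:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# one random initialization per run: the final inertia (and hence
# the clustering) can differ from run to run
inertias = [
    KMeans(n_clusters=4, init="random", n_init=1, random_state=seed)
    .fit(X).inertia_
    for seed in range(10)
]
print(min(inertias), max(inertias))  # a spread indicates init-dependent results
```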
K-means clustering mini-project
1. The starter code can be found in k_means/k_means_cluster.py, which reads in the Email + Finance (E+F) dataset and sets it up for clustering. You will start by performing k-means on two financial features; review the code and determine which features it uses for clustering.
Running the code creates a scatter plot of the data.
salary, exercised_stock_options
2. Deploy k-means clustering on the financial_features data, specifying 2 clusters as a parameter. Store the cluster predictions in a list named pred so that the Draw() command at the bottom of the script works correctly.
### cluster here; create predictions of the cluster labels
### for the data and store them to a list called pred
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=2).fit(finance_features)
pred = kmeans.predict(finance_features)
3. Add a third feature, "total_payments", to the feature list (features_list). Now rerun the clustering with 3 input features instead of 2 (obviously we can still plot only the original 2 dimensions). Compare this cluster plot with the one obtained from 2 input features. Do any points switch clusters? How many? This new clustering with 3 features cannot be guessed by eye - it has to be identified by the k-means algorithm.
### the input features we want to use
### can be any key in the person-level dictionary (salary, director_fees, etc.)
feature_1 = "salary"
feature_2 = "exercised_stock_options"
feature_3 = "total_payments"
features_list = ["poi", feature_1, feature_2, feature_3]
data = featureFormat(data_dict, features_list)
poi, finance_features = targetFeatureSplit(data)

### with 3 features, unpack three values per point
### (with the original 2 features this loop was: for f1, f2 in finance_features:)
for f1, f2, _ in finance_features:
    plt.scatter(f1, f2)
plt.show()
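To answer "how many points switch clusters", one approach (not part of the starter code) is to compare the two prediction lists while allowing for the arbitrary 0/1 label swap between separate k-means runs; the sample labelings below are made up for illustration:

```python
import numpy as np

def n_switched(pred_a, pred_b):
    """Count points whose cluster changed between two 2-cluster labelings,
    allowing for the arbitrary 0/1 label flip between runs."""
    pred_a, pred_b = np.asarray(pred_a), np.asarray(pred_b)
    diff = int((pred_a != pred_b).sum())
    # if more than half the labels "differ", the runs just flipped 0 and 1
    return min(diff, len(pred_a) - diff)

# e.g. comparing hypothetical 2-feature vs. 3-feature predictions
print(n_switched([0, 0, 1, 1, 1], [1, 1, 0, 0, 1]))  # -> 1
```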
4. What are the maximum and minimum values of the "exercised_stock_options" feature used in this example? max: 34348384, min: 3285
What are the maximum and minimum values of the "salary" feature used in this example? max: 1111258, min: 477
(Note: if you look at finance_features, you'll see that some "NaN" values have been cleaned away and replaced with zeros - so while those zeros may look like minimums, they're deceptive: they mean there is no information for that entry, and some number had to be filled in. For this question, go back to data_dict and find the maximum and minimum values there, ignoring any "NaN" entries.)
import numpy as np

stocklist = []
salarylist = []
for item in data_dict:
    stock = data_dict[item]['exercised_stock_options']
    salary = data_dict[item]['salary']
    if stock != 'NaN':
        stocklist.append(stock)
    if salary != 'NaN':
        salarylist.append(salary)
print('max stock:', np.max(stocklist))
print('min stock:', np.min(stocklist))
print('max salary:', np.max(salarylist))
print('min salary:', np.min(salarylist))
5. Rescale the features and replot; compare the result with the original plot and note which points have changed clusters.
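A minimal sketch of feature scaling before clustering, using scikit-learn's MinMaxScaler; the four salary/stock value pairs below are hypothetical and not taken from the E+F dataset:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

# features on wildly different scales, as with salary vs. stock options
X = np.array([[500.0,       3_000_000.0],
              [1_000.0,     3_400_000.0],
              [900.0,         100_000.0],
              [1_100_000.0,   200_000.0]])  # hypothetical values

scaler = MinMaxScaler()          # rescales each feature to [0, 1]
X_scaled = scaler.fit_transform(X)

pred_raw = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
pred_scaled = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_scaled)
# after scaling, both features contribute comparably to the distances,
# so some points can land in a different cluster than in the unscaled plot
```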