Getting Started with Udacity Machine Learning - Clustering

Unsupervised Learning: Clustering, Dimensionality Reduction

Clustering: K-means

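1. Place the cluster centers arbitrarily

2. Assign each point to its nearest cluster center

3. Optimize: move each cluster center to the mean of the points assigned to it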

Assign: connect the two cluster centers with a line and draw its perpendicular bisector (the line equidistant from both centers); each point belongs to the center on its side of that bisector


Optimize: move the cluster centers


Reassign, re-optimize... until the centers stop moving, ideally at the centers of the two clusters
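
The assign/optimize loop described above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the scikit-learn implementation: the helper names assign, update and k_means, the toy points, and the starting centers are all made up for the example.

```python
def assign(points, centers):
    """Assignment step: index of the nearest center for every point."""
    labels = []
    for x, y in points:
        dists = [(x - cx) ** 2 + (y - cy) ** 2 for cx, cy in centers]
        labels.append(dists.index(min(dists)))
    return labels


def update(points, labels, centers):
    """Optimization step: move each center to the mean of its points."""
    new_centers = []
    for k, old in enumerate(centers):
        members = [p for p, lab in zip(points, labels) if lab == k]
        if not members:  # an empty cluster keeps its old center
            new_centers.append(old)
            continue
        new_centers.append((sum(x for x, _ in members) / len(members),
                            sum(y for _, y in members) / len(members)))
    return new_centers


def k_means(points, centers, max_iter=100):
    """Alternate the two steps until the centers stop moving."""
    for _ in range(max_iter):
        labels = assign(points, centers)
        new_centers = update(points, labels, centers)
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels


# Toy data (hypothetical): two small groups of 2-D points.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0),
          (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
centers, labels = k_means(points, centers=[(0.0, 0.0), (5.0, 5.0)])
print(centers)
print(labels)
```

The loop stops as soon as an optimization step leaves every center where it was, which is the "until the centers stop moving" condition in the notes.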



A very interesting k-means visualization tool: http://www.naftaliharris.com/blog/visualizing-k-means-clustering/

Clustering Algorithms  http://scikit-learn.org/stable/modules/clustering.html

k-means http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html

When using k-means you have to decide the number of clusters at the beginning; points are then grouped according to their distance from the cluster centers

n_clusters Number of clusters, default 8

max_iter maximum number of iterations

n_init the number of times the algorithm is run with different initial cluster centers; the best run (lowest inertia) is kept as the final result
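
As a rough illustration of the three parameters listed above, the sketch below passes them explicitly to KMeans. The toy data and the random_state value are assumptions made for the example, and the parameter values are just choices for this sketch.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data (made up for this example): two well-separated groups of points.
X = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],
              [8.0, 8.0], [8.5, 9.0], [9.0, 8.0]])

kmeans = KMeans(
    n_clusters=2,     # number of clusters to look for
    max_iter=300,     # upper bound on iterations within a single run
    n_init=10,        # number of runs with different initial centers
    random_state=0,   # assumption: fixed seed so the sketch is repeatable
)
labels = kmeans.fit_predict(X)

print(labels)                   # cluster index for each point
print(kmeans.cluster_centers_)  # final cluster centers
print(kmeans.inertia_)          # sum of squared distances for the best run
```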


Limitations of k-means: for a fixed dataset and a fixed number of cluster centers, the result can still differ from run to run, because it depends on where the initial cluster centers are placed.

The choice of initial center points has a large influence on the final clustering. For example, the clustering in the following figure goes badly wrong.
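
A small, hand-built case can make this dependence on the starting centers concrete. The sketch below is an assumption-laden toy example (the four points and both sets of starting centers are made up): it runs scikit-learn's KMeans once from each set of centers, using n_init=1 so there are no restarts.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data (made up): the four corners of a wide, flat rectangle.
X = np.array([[0.0, 0.0], [0.0, 1.0], [4.0, 0.0], [4.0, 1.0]])

# Two hand-picked sets of starting centers (hypothetical values).
good_init = np.array([[0.0, 0.5], [4.0, 0.5]])  # one center per short side
bad_init = np.array([[2.0, 0.0], [2.0, 1.0]])   # both centers in the middle

# n_init=1: run once from exactly these centers, no restarts.
good = KMeans(n_clusters=2, init=good_init, n_init=1).fit(X)
bad = KMeans(n_clusters=2, init=bad_init, n_init=1).fit(X)

print(good.labels_, good.inertia_)  # left corners vs right corners
print(bad.labels_, bad.inertia_)    # bottom corners vs top corners
```

Both runs end in a state where no center moves any more, but the second one groups the corners along the long sides and has a much larger inertia. Running several initializations (n_init greater than 1) is the usual way to reduce this risk.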



k-means clustering mini-project


1.  The starter code is in k_means/k_means_cluster.py; it reads in the Email + Finance (E+F) dataset and prepares it for clustering. You will start by running k-means on two financial features. Review the code and determine which features it uses for clustering.

Run the code; it creates a scatter plot of the data.

salary, exercised_stock_options


2. Apply k-means clustering to the finance_features data, specifying 2 clusters as a parameter. Store the cluster predictions in a list named pred so that the Draw() command at the bottom of the script works correctly.

### cluster here; create predictions of the cluster labels
### for the data and store them to a list called pred

from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=2).fit(finance_features)
pred = kmeans.predict(finance_features)


3. Add a third feature to the feature list (features_list): "total_payments". Now rerun the clustering with 3 input features instead of 2 (obviously we can still only visualize the original 2 dimensions). Compare the cluster plot with the one obtained using 2 input features. Do any points switch clusters? How many? This new clustering, using 3 features, could not have been guessed by eye - it had to be identified by the k-means algorithm.

### the input features we want to use
### can be any key in the person-level dictionary (salary, director_fees, etc.)
feature_1 = "salary"
feature_2 = "exercised_stock_options"
feature_3 = "total_payments"
poi = "poi"
features_list = [poi, feature_1, feature_2, feature_3]
data = featureFormat(data_dict, features_list )
poi, finance_features = targetFeatureSplit( data )


### in the "clustering with 3 features" part of the mini-project,
### you'll want to change this line to
### for f1, f2, _ in finance_features:
### (as it's currently written, the line below assumes 2 features)
for f1, f2, _ in finance_features:
    plt.scatter( f1, f2)
plt.show()


4.

What are the maximum and minimum values taken by the "exercised_stock_options" feature used in this example? 34348384, 3285

What are the maximum and minimum values taken by the "salary" feature used in this example? 1111258, 477

(Note: if you look at finance_features, you'll see that some "NaN" values have been cleaned away and replaced with zeros. Those zeros may look like the minima, but they are deceptive: they really mean "no information available, so a number had to be filled in". For this question, go back to data_dict and find the maximum and minimum values that actually appear, ignoring all "NaN" entries.)

import numpy as np

stocklist = []
salarylist = []
# collect the values from data_dict, skipping missing ("NaN") entries
for item in data_dict:
    stock = data_dict[item]['exercised_stock_options']
    salary = data_dict[item]['salary']
    if stock != 'NaN':
        stocklist.append(stock)
    if salary != 'NaN':
        salarylist.append(salary)

print('max stock: {}'.format(np.max(stocklist)))
print('min stock: {}'.format(np.min(stocklist)))
print('max salary: {}'.format(np.max(salarylist)))
print('min salary: {}'.format(np.min(salarylist)))

5. The clustering plot after feature scaling, and the points that changed cluster compared with the original plot
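
The notes do not show how the scaling was done. One common option is min-max scaling, sketched here with scikit-learn's MinMaxScaler; treating this as the method behind the plot is an assumption. In the toy array, the first and last rows reuse the minimum and maximum values found in step 4, and the two middle rows are made up.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

# Columns: salary, exercised_stock_options.
# First and last rows reuse the min/max from step 4; the rest are made up.
features = np.array([
    [477.0, 3285.0],
    [200000.0, 1000000.0],
    [250000.0, 30000000.0],
    [1111258.0, 34348384.0],
])

# Rescale each column to the range [0, 1]: (x - min) / (max - min).
scaler = MinMaxScaler()
scaled = scaler.fit_transform(features)

# Cluster the rescaled features; pred holds one cluster label per row.
pred = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(scaled)

print(scaled)
print(pred)
```

Without scaling, exercised_stock_options (values up to tens of millions) dominates the distance computation, so putting both features on the same scale can move some points to a different cluster.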

