A simple, direct guide to understanding and implementing machine learning clustering algorithms (VII): Case study: exploring user preferences across product categories with segmentation and dimensionality reduction

Clustering Algorithms

Learning objectives

  • Master the implementation process of clustering algorithms
  • Know the principle of the K-means algorithm
  • Know how clustering models are evaluated
  • Explain the advantages and disadvantages of K-means
  • Understand how clustering algorithms can be optimized
  • Apply KMeans to carry out a clustering task

6.7 Case study: exploring user preferences across product categories with segmentation and dimensionality reduction

[Image: instacart.png]

The data is as follows:

  • order_products__prior.csv: order and product information
    • Fields: order_id, product_id, add_to_cart_order, reordered
  • products.csv: product information
    • Fields: product_id, product_name, aisle_id, department_id
  • orders.csv: customer order information
    • Fields: order_id, user_id, eval_set, order_number, ….
  • aisles.csv: the specific category (aisle) each product belongs to
    • Fields: aisle_id, aisle
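
To make the relationship between the four tables concrete, here is a minimal sketch that joins them on their shared key columns and then counts purchases per user and aisle. The tiny in-memory tables and all of their values are hypothetical stand-ins for the real CSV files, and the orders table is reduced to the two columns the join needs.

```python
import pandas as pd

# Hypothetical miniature tables that mimic the column layout listed above.
order_product = pd.DataFrame({
    "order_id": [1, 1, 2, 3],
    "product_id": [10, 11, 10, 12],
    "add_to_cart_order": [1, 2, 1, 1],
    "reordered": [0, 1, 1, 0],
})
products = pd.DataFrame({
    "product_id": [10, 11, 12],
    "product_name": ["banana", "milk", "bread"],
    "aisle_id": [100, 101, 102],
    "department_id": [1, 2, 3],
})
orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "user_id": [7, 7, 8],
})
aisles = pd.DataFrame({
    "aisle_id": [100, 101, 102],
    "aisle": ["fresh fruits", "milk", "bread"],
})

# Chain the joins: product_id links to products, order_id links to orders,
# and aisle_id links to aisles.
merged = (
    order_product
    .merge(products, on="product_id")
    .merge(orders, on="order_id")
    .merge(aisles, on="aisle_id")
)

# One row per user, one column per aisle; each cell is a purchase count.
user_aisle = pd.crosstab(merged["user_id"], merged["aisle"])
print(user_aisle)
```

The resulting user-by-aisle count table has the same shape of input that the clustering steps in this case study work on, only much smaller.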

1 Requirements

[Image: instacart效果.png (expected result)]

[Image: instacartPCA结果.png (result after PCA)]

2 Analysis

  • 1. Obtain data
  • 2. Basic data processing
    • 2.1 Merge tables
    • 2.2 Cross-tabulation
    • 2.3 Truncation
  • 3. Feature engineering: PCA
  • 4. Machine learning (k-means)
  • 5. Model evaluation
    • sklearn.metrics.silhouette_score(X, labels)
      • Computes the mean silhouette coefficient over all samples
      • X: the feature values
      • labels: the cluster label assigned to each sample
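
As a small illustration of the evaluation step, the sketch below calls silhouette_score on synthetic data. The two point blobs, the random seeds, and the choice of two clusters are assumptions made only for this example; they are not part of the Instacart case.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Two well-separated hypothetical blobs of 2-D points.
rng = np.random.RandomState(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=10.0, scale=0.5, size=(50, 2)),
])

# Cluster the points, then score the clustering.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
score = silhouette_score(X, labels)

# The score lies in [-1, 1]; values near 1 indicate compact,
# well-separated clusters.
print(score)
```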

3 Complete code

import pandas as pd
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
  • 1. Obtain data
order_product = pd.read_csv("./data/instacart/order_products__prior.csv")
products = pd.read_csv("./data/instacart/products.csv")
orders = pd.read_csv("./data/instacart/orders.csv")
aisles = pd.read_csv("./data/instacart/aisles.csv")
  • 2. Basic data processing

    • 2.1 Merge tables
    # 2.1 Merge the tables on their shared key columns
    table1 = pd.merge(order_product, products, on="product_id")
    table2 = pd.merge(table1, orders, on="order_id")
    table = pd.merge(table2, aisles, on="aisle_id")
    
    • 2.2 Cross-tabulation
    # Count how many items each user bought from each aisle
    table = pd.crosstab(table["user_id"], table["aisle"])
    
    • 2.3 Truncation
    # Keep only the first 1000 rows (users)
    table = table[:1000]
    
  • 3. Feature engineering: PCA

    # Keep enough principal components to explain 90% of the variance
    transfer = PCA(n_components=0.9)
    data = transfer.fit_transform(table)
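
When n_components is a float between 0 and 1, scikit-learn's PCA keeps the smallest number of components whose cumulative explained variance exceeds that fraction. The following sketch shows this on synthetic data; the matrix sizes, the three latent directions, and the noise level are assumptions chosen purely for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)

# Hypothetical data: 200 samples with 10 features, where almost all of the
# variance comes from only 3 underlying directions plus a little noise.
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 10))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 10))

# Ask PCA for as many components as needed to explain 90% of the variance.
pca = PCA(n_components=0.9)
X_reduced = pca.fit_transform(X)

# Number of components actually kept, and the variance they explain.
print(pca.n_components_)
print(pca.explained_variance_ratio_.sum())
```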
    
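  • 4. Machine learning (k-means)

    estimator = KMeans(n_clusters=8, random_state=22)
    # Store the predicted cluster labels so they can be evaluated below
    y_predict = estimator.fit_predict(data)
    
  • 5. Model evaluation

    # Mean silhouette coefficient over all samples (range -1 to 1, higher is better)
    silhouette_score(data, y_predict)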
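
The code above fixes n_clusters at 8. One possible way to compare alternatives, sketched here under assumptions, is to fit KMeans for several candidate values and keep the one with the highest silhouette score. The synthetic three-blob data set, the candidate range 2 to 6, and the names scores and best_k are hypothetical and only serve the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Hypothetical data: three well-separated blobs of 2-D points.
rng = np.random.RandomState(22)
centers = np.array([[0, 0], [10, 0], [0, 10]])
data = np.vstack([c + rng.normal(scale=0.5, size=(40, 2)) for c in centers])

# Fit KMeans for each candidate k and record its silhouette score.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=22).fit_predict(data)
    scores[k] = silhouette_score(data, labels)

# Pick the candidate with the highest mean silhouette coefficient.
best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])
```

On real data the scores are usually much closer together, so the silhouette coefficient is better treated as one piece of evidence than as an automatic selector.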
    

Origin blog.csdn.net/qq_35456045/article/details/104645035