Introductory Research on Machine Learning (17) — Instacart Market User Classification

Table of Contents

1. Get data

2. Consolidate data

3. Data dimensionality reduction

4. Use K-means for classification

    Classification

    Attributes returned by the KMeans estimator

5. Summary


Instacart Market Basket Analysis is a classic customer-behavior prediction problem: by analyzing a large amount of open-source order data, we predict and classify users.

1. Get data

We download the corresponding data from the official website, put it in a local directory, and read it as follows:

(1) order_products__prior.csv: order and product information. The fields are order_id, product_id, add_to_cart_order, and reordered. We read it with:

import pandas as pd

order_products = pd.read_csv("order_products__prior.csv")

The first few rows of the table are shown in the figure (only part of the data is shown):

(2) products.csv: product information. The fields are product_id, product_name, aisle_id, and department_id. We read it with:

products = pd.read_csv("products.csv")

The first few rows of the table are shown in the figure (only part of the data is shown):

(3) orders.csv: the users' order information. The fields include order_id, user_id, eval_set, order_number, ... We read it with:

orders = pd.read_csv("orders.csv")

The first few rows of the table are shown in the figure (only part of the data is shown):

(4) aisles.csv: the product's specific item category. The fields are aisle_id and aisle. We read it with:

aisles = pd.read_csv("aisles.csv")

The first few rows of the table are shown in the figure (only part of the data is shown):
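As a quick check, we can preview each table; a minimal sketch, assuming the four DataFrames above have been loaded:

# Print the shape and first rows of each table to confirm the fields listed above
for name, df in [("order_products", order_products), ("products", products),
                 ("orders", orders), ("aisles", aisles)]:
    print(name, df.shape)
    print(df.head())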

The individual tables above are not enough to predict a user's shopping behavior, but we can use them to predict the categories of products a user buys. To do that we need to relate each user (user_id) to the product categories (aisle) they purchase. However, user_id lives in orders while aisle lives in aisles, so a relationship must be established between user_id and aisle before we can make predictions.

Our goal is therefore a table with user_id as the row index and aisle as the column index; once we have it, we can apply a clustering algorithm.

2. Consolidate data

By observing the tables, we find that aisle and user_id can be linked through aisle_id, product_id, and order_id.

(1) aisles.csv and products.csv both contain aisle_id, so these two tables need to be merged first. pandas' merge joins two tables on key values (row order does not matter), and it defaults to an inner join; we only need to join on aisle_id.

# The goal is to put aisle_id and product_id in the same table
tab1 = pd.merge(aisles, products, on="aisle_id")

The resulting table is as follows:

After this operation we have our first merged table; aisle_id and product_id now sit in the same table.

PS: the pandas.merge function merges two tables on one or more fields:

pandas.merge(left,
    right,
    how="inner",
    on=None,
    left_on=None,
    right_on=None,
    left_index=False,
    right_index=False,
    sort=False,
    suffixes=("_x", "_y"),
    copy=True,
    indicator=False,
    validate=None
)

Several of the main parameters are introduced as follows:

parameter: meaning
left: the left table
right: the right table
how: how to combine the tables. The default is "inner"; "outer" (full join), "left" (left join), and "right" (right join) are also supported
on: the field(s) to merge on; the key must exist in both tables. If the two tables use different key names, set left_on/right_on instead
left_on: the join field(s) in the left table
right_on: the join field(s) in the right table
left_index: use the left table's row index as its join key
right_index: use the right table's row index as its join key
sort: sort the merged result by the join keys; the default is False
suffixes: for duplicate columns that appear in both tables, the suffixes _x and _y are appended in the result to distinguish them
indicator: if True, add a _merge column recording which table each row came from; the default is False
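To make these parameters concrete, here is a tiny illustration on toy data (not the Instacart tables):

import pandas as pd

left = pd.DataFrame({"key": [1, 2, 3], "a": ["x", "y", "z"]})
right = pd.DataFrame({"key": [2, 3, 4], "b": [10, 20, 30]})

# Inner join (the default) keeps only keys present in both tables: 2 and 3
print(pd.merge(left, right, on="key"))

# Outer join keeps every key; missing cells become NaN, and indicator=True
# adds a _merge column recording where each row came from
print(pd.merge(left, right, on="key", how="outer", indicator=True))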

(2) Next we look for a table containing product_id: order_products has it, so we merge it with the tab1 we just obtained:

tab2 = pd.merge(tab1, order_products, on="product_id")

The structure of the resulting tab2 is as follows:

tab2 now contains both aisle_id and order_id.

(3) Finally we look for a table containing order_id: orders has it, and it also contains user_id. After merging orders with tab2, the resulting table (tab3) contains both aisle and user_id, which puts user_id and the product category in the same table.
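A minimal sketch of this merge, joining on order_id as described:

# Join on order_id so that user_id and aisle end up in the same table
tab3 = pd.merge(tab2, orders, on="order_id")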

(4) The tab3 above still does not have the layout we want, namely user_id as the row index and the product category aisle as the column index. We need a cross table to count the frequency relationship between the two, converting tab3 into a table whose rows are user_id and whose columns are counts per product category, as follows:

table = pd.crosstab(tab3["user_id"],tab3["aisle"])

The table structure after conversion is shown in the figure: 

We can see that the resulting table has 206,209 samples (users) and 134 features (aisles).

PS: the pandas.crosstab function computes group frequencies; it is a special case of pivot_table():

pandas.crosstab(
    index,
    columns,
    values=None,
    rownames=None,
    colnames=None,
    aggfunc=None,
    margins=False,
    margins_name="All",
    dropna=True,
    normalize=False,
)

The parameters are described as follows:

parameter: meaning
index: the key(s) used to group the rows. Arrays are supported; a list of arrays produces a hierarchical (multi-level) row index
columns: the key(s) used to group the columns. Arrays are supported; a list of arrays produces a hierarchical column index
values: an array of values to aggregate according to the row/column factors; an aggregation function (aggfunc) must be specified with it
rownames: names for the row index; if given, their number must match the number of row arrays passed
colnames: names for the column index; if given, their number must match the number of column arrays passed
aggfunc: used together with values; a function such as aggfunc=np.mean computes, for each cell, the aggregate of the values falling into that row/column combination
margins: add a "total" row/column; the default is False. It is used together with normalize. With margins=True: if normalize="index", an extra row named margins_name appears at the bottom holding the totals; if normalize="columns", an extra column named margins_name appears at the far right holding the totals; if normalize=True, both a total row and a total column are added
margins_name: the name of the added row/column; the default is "All"
dropna: the default is True; if all values in a column are NaN, the column is dropped
normalize: whether to convert the counts to proportions (floats); the default is False. With normalize="index" or 1, each row is normalized, so each cell becomes its share of the row total; with normalize="columns" or 0, each column is normalized, so each cell becomes its share of the column total; with normalize="all" or True, everything is normalized, so each cell becomes its share of the grand total
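To make crosstab concrete, here is a small illustration on toy data (not the Instacart tables):

import pandas as pd

df = pd.DataFrame({"user": [1, 1, 2, 2, 2],
                   "category": ["fruit", "dairy", "fruit", "fruit", "snacks"]})

# Count how many times each user bought from each category
print(pd.crosstab(df["user"], df["category"]))

# The same counts normalized per row: each cell is a share of the row total
print(pd.crosstab(df["user"], df["category"], normalize="index"))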

3. Data dimensionality reduction

The data set obtained above has 134 features in total, but the features contain a large number of zeros, which indicates considerable redundancy, so we use PCA for dimensionality reduction. The steps are the usual ones: instantiate a transformer, then call its fit_transform to transform the data.

# 1) Instantiate a transformer; a fraction for n_components keeps that share of the variance
# 2) Call fit_transform
from sklearn.decomposition import PCA

transfer = PCA(n_components=0.95)
data_new = transfer.fit_transform(table)

The converted data_new looks like this:

After PCA dimensionality reduction, only 44 features remain.
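To sanity-check the reduction, we can inspect the shape of the result and the variance retained; a quick check, assuming the code above has run:

print(data_new.shape)                             # expected: (206209, 44)
print(transfer.explained_variance_ratio_.sum())   # should be at least 0.95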

4. Use K-means for classification

Since our data set has no labels, we classify it with unsupervised learning: we use K-means to cluster the users. Suppose we divide the users into 3 groups; then n_clusters of sklearn's KMeans is 3.

Classification

Let's look at the steps; they mirror the supervised-learning steps we covered before:

(1) Instantiate the estimator

from sklearn.cluster import KMeans

estimator = KMeans(n_clusters=3)

(2) Fit the estimator and predict; since there are no labels here, we simply pass in the feature values

estimator.fit(data_new)
y_predict = estimator.predict(data_new)
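As an aside, KMeans also provides fit_predict, which combines the two calls above into one:

# Equivalent to calling fit followed by predict on the same data
y_predict = estimator.fit_predict(data_new)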

Printing the first 300 entries of the result shows that each sample has been assigned to one cluster, with 0, 1, and 2 representing the three groups of users:

(3) Evaluate the estimator

Evaluation will be covered separately in a later post, so it is not summarized here; this section will be updated once that summary is finished.

Attributes returned by the KMeans estimator

Let's look at some of the attributes the fitted KMeans estimator exposes:

(1) Return the cluster labels of all samples

estimator.labels_

The output is a one-dimensional array giving the cluster label assigned to each sample.

(2) Return the coordinates of the cluster centers

estimator.cluster_centers_

Since the array is fairly large, its contents are not shown here; only its shape is given: it is a 3 x 44 array, one 44-dimensional center (in the PCA-reduced space) for each of the 3 clusters.
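A quick way to confirm this, given the estimator fitted above:

print(estimator.cluster_centers_.shape)   # expected: (3, 44)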

5. Summary

There are still some things about clustering and dimensionality reduction that I am not entirely clear on, and I need to study them further. The code involved has been uploaded. Since the csv files are fairly large, you can download them from the official website yourself.


Origin blog.csdn.net/nihaomabmt/article/details/104411586