Understanding Data Thoroughly (Part 4): Data Mining

1. Introduction

In the previous article, we learned that "data" is a huge system (as shown in the figure below), and used the analogy of washing and sorting vegetables to explain the meaning of data cleaning. Today I will mainly explain how, once the clean ingredients are ready, we process and cook them into valuable, meaningful dishes; that is, the process of data mining.

[Figure: an overview of the "data" system]

2. Data mining (cooking)

Data mining is the process of processing and using the cleaned "clean data", and we can think of it as cooking.

Data mining also has certain rules and corresponding models, which we can likewise understand through an analogy.

Cleaned, high-quality data is like washed ingredients, and a data mining model is like a cuisine. We know that even if the ingredients are exactly the same, different cuisines (data mining models) produce completely different dishes!

The following are some of the more common "cuisines" (models) in data mining. Below we will walk through the usage scenario for each model one by one.

In general, data mining models can be roughly divided, by whether or not they are supervised, into supervised models and unsupervised models:

  • Supervised model: simply put, the machine learns to draw inferences from known examples. It is like a student who studies worked questions together with their answers, learns how to solve them, and can handle the same or a similar question the next time it appears. In a supervised model, the data is divided into a training set and a test set. Common models include decision trees, logistic regression, linear regression, and so on.
  • Unsupervised model: simply put, it drops the "known answers" part of the supervised setting. The input is just a pile of data with no labels, and there is no split into training and validation sets; the algorithm learns from the characteristics of the data itself. Common models are mainly clustering.
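The contrast between the two modes can be sketched in a few lines. This is an illustrative toy, not code from the article; the use of scikit-learn and the synthetic data are my assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # labels exist -> supervised setting

# Supervised: learn from a labeled training set, evaluate on a held-out test set
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))

# Unsupervised: no labels and no train/test split; the algorithm groups the
# samples purely by the structure of the data itself
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print("cluster sizes:", np.bincount(labels))
```

Note that the supervised model needs `y` to learn, while `KMeans` never sees it.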

Now that we understand the basic categories of data mining, let's move to concrete scenarios and see how these specific algorithm models can help us mine data in the real world.

Cluster analysis: the K-Means algorithm is the most typical example.

Principles and steps:

  1. Select K center points, representing K categories;
  2. Calculate the Euclidean distance between each of the N sample points and the K center points;
  3. Assign each sample point to the category of its nearest center point (the smallest Euclidean distance);
  4. Calculate the mean of the sample points in each category, obtaining K means, and use these K means as the new center points;
  5. Repeat steps 2-4;
  6. Stop when the K center points converge (the center points no longer change).
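The steps above can be sketched directly in NumPy. This is a minimal illustrative implementation for understanding the iteration, not production code (real projects would typically use a library implementation such as scikit-learn's `KMeans`):

```python
import numpy as np

def k_means(points, k, n_iter=100, seed=0):
    """Toy K-Means following the numbered steps above."""
    rng = np.random.default_rng(seed)
    # Step 1: select K initial center points from the samples
    centers = points[rng.choice(len(points), k, replace=False)]
    labels = np.zeros(len(points), dtype=int)
    for _ in range(n_iter):
        # Step 2: Euclidean distance from each of the N samples to the K centers
        dist = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        # Step 3: assign each sample to its nearest center
        labels = dist.argmin(axis=1)
        # Step 4: the mean of each category becomes its new center
        # (keep the old center if a category happens to be empty)
        new_centers = np.array([
            points[labels == i].mean(axis=0) if np.any(labels == i) else centers[i]
            for i in range(k)
        ])
        # Step 6: stop once the centers no longer change (convergence)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers  # Step 5: repeat steps 2-4
    return centers, labels
```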

Usage scenarios: in the commercial field, cluster analysis is often combined with the RFM model for customer segmentation; in biology, cluster analysis is often used to classify animals, plants, and genes, and to conduct population research.

Practical case: using the K-Means algorithm to measure and segment the value of airline customers.

1. Refer to the RFM model and the data set to define the clustering categories


After obtaining the data set, delete irrelevant, weakly related, or redundant attributes, such as membership card number and gender, leaving the five attributes related to the RFM model: C (average discount coefficient; the higher it is, the higher the cabin class), F (number of flights), M (total flight mileage), R (time since the most recent flight), and L (length of membership). We can then group customers according to these attributes and identify, for example, important retained customers, important win-back customers, and low-value customers.

2. With the five clustering categories determined, run the clustering code

[Figure: screenshot of the clustering code]
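The clustering step looks roughly like the following sketch. The original post's code is only a screenshot, so this is my reconstruction under assumptions: scikit-learn for the model, and a synthetic stand-in DataFrame with columns named after the five attributes (L, R, F, M, C); in practice the real cleaned airline data would be loaded, e.g. with `pd.read_csv`.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Stand-in for the cleaned airline data (the real data set would be loaded
# from file); column names L, R, F, M, C follow the attributes in the text.
rng = np.random.default_rng(42)
air = pd.DataFrame(rng.normal(size=(300, 5)), columns=['L', 'R', 'F', 'M', 'C'])

# Put the five attributes on a common scale before computing distances
scaled = StandardScaler().fit_transform(air)

# Cluster into the five customer categories defined above
km = KMeans(n_clusters=5, n_init=10, random_state=42).fit(scaled)
air['cluster'] = km.labels_

# Profile each group by its mean attribute values
print(air.groupby('cluster')[['L', 'R', 'F', 'M', 'C']].mean())
```

Standardizing first matters because K-Means relies on Euclidean distance, and otherwise a large-valued attribute such as total mileage would dominate the clustering.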

3. Visualize the results and characterize each customer group

[Figures: visual analysis of the clustering results]

Regression analysis: divided into two main categories, logistic regression and linear regression.

At this point, some students may ask: what is the difference between logistic regression and linear regression?

In fact, the two belong to the same family (generalized linear models), but they face different types of dependent variable. The dependent variable of logistic regression is categorical (male/female, occupation, ...), while the dependent variable of linear regression is a continuous numeric variable (such as the salaries of 1,000 people, in yuan).
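A tiny sketch makes the distinction concrete. The synthetic data and the use of scikit-learn are my assumptions, not from the article: linear regression predicts a continuous number, logistic regression predicts a category.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=(100, 1))

# Continuous dependent variable (e.g. salary) -> linear regression
salary = 3.0 * x[:, 0] + rng.normal(scale=0.5, size=100)
lin = LinearRegression().fit(x, salary)
print("predicted value:", lin.predict([[5.0]])[0])   # a continuous number

# Categorical dependent variable (two classes) -> logistic regression
label = (x[:, 0] > 5).astype(int)
log = LogisticRegression().fit(x, label)
print("predicted class:", log.predict([[9.0]])[0])   # a class label (0 or 1)
```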

Practical exercise: ordinary least squares (OLS) regression, a type of linear regression. In the example below, we study the relationship between salary and loan balance.

STEP1. After importing the data, draw a scatter plot to observe the general trend of the data, and draw a fitted line:

```python
import numpy as np
import matplotlib.pyplot as plt

x = data['The balance of various loans']
y = data['salary']
z1 = np.polyfit(x, y, 1)   # 1 means fit with a polynomial of degree 1
p1 = np.poly1d(z1)         # fitting equation: f = p1(x)
f = p1(x)                  # fitted values (this line was missing in the original snippet)
plt.scatter(x, y)
plt.plot(x, f, 'r', label='polyfit values')  # draw the fitted line
plt.legend()
plt.show()
```

STEP2. Check the relevant regression reference indicators to test the regression equation, such as the R-squared of the fit (the closer to 1, the better; generally 0.7 or above is considered fairly correlated, with a good fitting effect) and the p-value (generally < 0.05 indicates an acceptable fit), and so on.
[Figure: regression summary output]

In summary, we obtain Y (salary) = 0.0379 * X (the balance of various loans) - 0.8295.


Origin blog.csdn.net/amumuum/article/details/113179812