A Python Random Forest Case: Giving Love a Helping Hand

We all know that love takes more than a loving heart; it also takes firewood, rice, oil and salt. The best love is not wind-and-moon romance but charcoal delivered in the snow. If one day you no longer believe in love and don't know where your next days should go, trust the data analyst friend beside you: he will use data to tell you where your love should go.

1. Case background


One day, Xiao C's cousin, Xiao Q, came to see Xiao C with a small worry. A colleague had introduced her to Mr. Z, who is 37 years old and works as a clerk in a provincial government office. Xiao Q's standard for choosing a partner, however, is that the other person earn more than 50,000 a month. This is not because Xiao Q worships money; it is simply that, as the whole world knows, "soft rice" tastes good. Since they had only just met, she was too embarrassed to ask Mr. Z directly, so she could not make up her mind about whether to get to know him better (to put it bluntly, she had no idea how much money Mr. Z makes), and she wanted Xiao C to help her decide. As an excellent big data analyst, Xiao C thought about it for a long time and realized the matter was not so simple: it would need Python, and it would need a model. The decision tree algorithm came to mind first, which by association led to random forests. Inspiration struck, a good idea appeared, and so he got to work...

2. Data set preparation

Everyone knows the well-known adult data set, which contains tens of thousands of sample records. Each record includes age, work class, a statistical weight, education, years of education, marital status, occupation, family relationship, race, sex, capital gain, capital loss, weekly working hours, native country, income, and so on. This data set should be useful, so download it first.
Download link:
Click to download the source data. After downloading, rename the file to adult.csv: the original extension is .data, so simply delete it and force the file into .csv format.
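If you prefer to fetch the file from a script, the short sketch below does the download and rename in one step. It assumes the classic UCI Machine Learning Repository location for the adult data and the same local path used in the reading step later; adjust either if your setup differs.

from urllib.request import urlretrieve

# Assumed UCI mirror of the adult data set; change if you use a different source
ADULT_URL = 'https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data'
LOCAL_PATH = 'D:\\Python\\adult.csv'   # same path used by pd.read_csv below

# Saves the .data file directly under a .csv name, which is all the "renaming" step needs
urlretrieve(ADULT_URL, LOCAL_PATH)
print('saved to', LOCAL_PATH)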

3. Read the data

import pandas as pd

# The Chinese column names below correspond to the adult data set's original fields:
# age, workclass, fnlwgt, education, education-num, marital-status, occupation,
# relationship, race, sex, capital-gain, capital-loss, hours-per-week, native-country, income
data = pd.read_csv('D:\\Python\\adult.csv',header = None,index_col = False,
                  names = ['年龄','单位性质','权重','学历','受教育时长','婚姻状况','职业',
                           '家庭教育','种族','性别',
                           '资产所得','资产损失','周工作时长','原籍','收入'])

# Keep only the columns used in this case
data_lite = data[['年龄','单位性质','学历','性别','周工作时长','职业','收入']]
data_lite.head()

Running result: the first five rows of the simplified data_lite table are displayed.
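Before encoding anything, it can help to glance at the table's size and the balance of the income label, since that context matters when judging the model score later. A small optional sketch, reusing the data_lite frame defined above:

# Optional sanity checks on the simplified table
print(data_lite.shape)                   # expected: (32561, 7) -- 32,561 rows, 7 columns
print(data_lite['收入'].value_counts())  # how many ' <=50K' vs ' >50K' rows there are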

4. Use get_dummies to process data

In the data set we can see that work class, education, occupation, sex and income are not integer data but strings, so we use get_dummies to add dummy variables to the existing data set and turn it into a usable format. Dummy variables, also called indicator variables or discrete feature encodings, can be used to express the possible influence of categorical, non-quantitative factors.
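As a tiny illustration of what get_dummies does before applying it to the real data, consider a toy column (this toy frame is only an example, not part of the case):

import pandas as pd

# A toy column with string values, like 性别 (sex) in the real data set
toy = pd.DataFrame({'性别': [' Male', ' Female', ' Male']})

# get_dummies replaces the string column with one indicator column per category,
# e.g. '性别_ Female' and '性别_ Male' (printed as 0/1 or True/False depending on the pandas version)
print(pd.get_dummies(toy))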

data_dummies = pd.get_dummies(data_lite)
print('样本原始特征:\n',list(data_lite.columns),'\n')   # original sample features
print('虚拟变量特征:\n',list(data_dummies.columns))     # dummy-variable features

Running result: the list of original feature names is printed, followed by the much longer list of dummy-variable feature names.
Next we can look at the processed data:

data_dummies.head()

Running result: the first five rows of the one-hot encoded data_dummies table are displayed.

5. Divide feature variables

Assign the feature columns to the feature matrix X and the classification label to y. Note that the slice ends at '职业_ Transport-moving', so the two income dummy columns are excluded from the features. The input code is as follows:

features = data_dummies.loc[:,'年龄':'职业_ Transport-moving']   # all feature columns, excluding the two income dummies
X = features.values
y = data_dummies['收入_ >50K'].values   # 将收入大于50K的作为预测目标 (income greater than 50K is the prediction target)
print("代码运行结果:")
print('特征形态:{} 标签形态:{}'.format(X.shape,y.shape))

Running result: the feature matrix has shape (32561, 44) and the label vector has shape (32561,).
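As an extra, optional check that the label has not leaked into the features, you can confirm that the two 收入 (income) dummy columns fall outside the '年龄' : '职业_ Transport-moving' slice. A small sketch reusing the names defined above:

# Columns inside the feature slice
feature_cols = list(features.columns)

# The two income dummy columns produced by get_dummies
income_cols = [c for c in data_dummies.columns if c.startswith('收入')]

# None of the income columns should appear among the features
print([c for c in income_cols if c in feature_cols])   # expected: []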

6. Build a data model

This time we will build the model with a decision tree. We have 32,561 samples and 44 feature columns.
1. Split the data into training set and test set

from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,random_state = 0)

2. Fit the data with a decision tree with a maximum depth of 5

from sklearn import tree
go_dating_tree = tree.DecisionTreeClassifier(max_depth = 5)
go_dating_tree.fit(X_train,y_train)
print('模型得分:{:.2f}'.format(go_dating_tree.score(X_test,y_test)))

Operation result: from the prediction of the decision tree model above, we can see that the model scored about 0.80 on the test set, which is quite good; in other words, the model's prediction accuracy is roughly 80%. I believe this model can give Xiao Q a solid enough reference in matters of love.
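A single accuracy number can hide a lot when the '>50K' class is the minority, so it may be worth (optionally) looking at per-class precision and recall as well. A minimal sketch using scikit-learn's classification_report, reusing the variables defined above:

from sklearn.metrics import classification_report

# Per-class precision / recall / F1 on the test set, to complement the single accuracy score
y_pred = go_dating_tree.predict(X_test)
print(classification_report(y_test, y_pred, target_names=['<=50K', '>50K']))

Also note that the code above fits a single decision tree, even though the article's title speaks of a random forest. If you want an actual random forest, scikit-learn's RandomForestClassifier is a near drop-in replacement; the n_estimators value below is just an illustrative choice:

from sklearn.ensemble import RandomForestClassifier

# An ensemble of decision trees trained on bootstrap samples with random feature selection
go_dating_forest = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=0)
go_dating_forest.fit(X_train, y_train)
print('随机森林模型得分:{:.2f}'.format(go_dating_forest.score(X_test, y_test)))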

7. Love prediction

The prediction step: based on the information Xiao Q provided, we use the model we built to predict Mr. Z's income and see whether he can meet Xiao Q's income requirement, which we map onto the data set's '>50K' income class.
The following is the basic information provided by Xiao Q:
Mr. Z is 37 years old, works in a provincial government office, holds a master's degree, is male, works 40 hours a week, and is a clerk. We encode the corresponding data and make a prediction with the model. Enter the code as follows:

# Mr. Z's information, hand-encoded in the same column order as the feature matrix:
# age 37, 40 hours per week, then 0/1 dummy values for his work class, education, sex and occupation
Mr_z = [[37,40,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0]]
dating_dec = go_dating_tree.predict(Mr_z)
print("代码运行结果:")
print("=================================\n")
if dating_dec[0] == 1:
    print("大胆去追求真爱吧,这哥们牛批!")   # "Go boldly after true love -- this guy is loaded!"
else:
    print("不用去了,不满足你的要求!")       # "Don't bother -- he doesn't meet your requirement!"

Running result: the model prints the message for the "does not meet your requirement" branch (see the analysis below).
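Hand-coding 44 zeros and ones is error-prone. An alternative (a sketch, not the author's original approach) is to build Mr. Z's row as a small DataFrame with the raw column values, run it through get_dummies, and reindex it to the training feature columns so every dummy column that does not apply defaults to 0. The category strings below (' State-gov', ' Masters', ' Adm-clerical', with their leading spaces) are assumptions about how Mr. Z maps onto the adult data set's categories.

# Build Mr. Z's raw record with the same column names as data_lite (minus 收入)
mr_z_raw = pd.DataFrame([{
    '年龄': 37,
    '单位性质': ' State-gov',     # assumed category for "provincial government"
    '学历': ' Masters',           # assumed category for a master's degree
    '性别': ' Male',
    '周工作时长': 40,
    '职业': ' Adm-clerical',      # assumed category for a clerk
}])

# One-hot encode and align with the columns the model was trained on;
# dummy columns that do not apply to Mr. Z are filled with 0
mr_z_encoded = pd.get_dummies(mr_z_raw).reindex(columns=features.columns, fill_value=0)

print(go_dating_tree.predict(mr_z_encoded.values))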

8. Result analysis

The data model delivers its prediction, and the result is shown above. Yes, the machine coldly told Xiao Q the cruel fact that Mr. Z does not meet her requirement. Of course, common sense says the same thing: the income of a clerical civil servant will not exceed 50,000 a month, assuming, naturally, that no corruption or bribery is involved.

9. References

1. 段小手 (Duan Xiaoshou), 《深入浅出Python机器学习》 (an introductory Python machine learning book).


Origin blog.csdn.net/qq_44176343/article/details/109769179