After learning from 50,000 records, can a machine predict your salary three years from now?

Machine learning has made Python hugely popular. This wave has gradually become a hot spot, and Python is the first-choice language for anyone heading in the machine learning direction. Using machine learning on a few fun projects is bound to be interesting. For example, predicting your income from your occupation, marriage, family, education time and so on. Amazing, right? If you don't believe it, read on with me.


1. Data Set

Income has always been a hot topic that people care about. A data set on exactly this subject has appeared in a Kaggle competition, so the data for this small hands-on project comes from Kaggle.


The data set is a dense pile of tens of thousands of income records. Each person's record is made up of feature values such as age, type of work, marital status, education level, education time and occupation.


1) Import data set

We add column labels to the training set and the test set, then view the information about the training set.
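A minimal sketch of this import step, assuming the data is a comma-separated file without a header row. The column names are assumptions (the text only confirms an 'income' label and 14 features), and a three-row inline sample stands in for the real file so the snippet runs on its own.

```python
import io
import pandas as pd

# Assumed column names: 14 features plus the 'income' label.
COLUMNS = [
    "age", "workclass", "fnlwgt", "education", "education_num",
    "marital_status", "occupation", "relationship", "race", "sex",
    "capital_gain", "capital_loss", "hours_per_week", "native_country",
    "income",
]

# Tiny inline stand-in for the real training file (illustrative rows only).
SAMPLE_CSV = """39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K
50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 13, United-States, <=50K
52, Self-emp-not-inc, 209642, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 45, United-States, >50K
"""

# header=None because the file has no header row; names= attaches the labels.
# skipinitialspace=True drops the blank that follows each comma.
train = pd.read_csv(io.StringIO(SAMPLE_CSV), header=None, names=COLUMNS,
                    skipinitialspace=True)

train.info()  # prints column names, non-null counts and dtypes
```

For the real files, the `io.StringIO(...)` argument would be replaced by the file path, and the same call repeated for the test set.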


2. Data cleansing

2) Remove the missing values

The data set contains missing values, represented by '?', so we need to remove those records. After removing them, we rebuild the index of the data set.
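One way this removal could look, as a sketch on a hypothetical four-row frame (the column names and values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical mini data set; '?' marks a missing value.
df = pd.DataFrame({
    "age": [39, 50, 38, 53],
    "workclass": ["State-gov", "?", "Private", "Private"],
    "occupation": ["Adm-clerical", "Exec-managerial", "?", "Handlers-cleaners"],
})

# Turn '?' into a real missing value, drop every row that contains one,
# then rebuild the index so it runs 0..n-1 again.
cleaned = df.replace("?", np.nan).dropna().reset_index(drop=True)

print(cleaned)
```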


3) Numerical Processing

The 'income' labels in the test set and the training set are not uniform: the test set uses the form '<=50K.' while the training set uses '<=50K'. The labels in the test set therefore have to be converted to the training set's format.
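A small sketch of that label fix, assuming the only difference is the trailing period (the three rows are hypothetical):

```python
import pandas as pd

# Hypothetical test-set labels carrying the trailing period.
test = pd.DataFrame({"income": ["<=50K.", ">50K.", "<=50K."]})

# Strip surrounding blanks first, then the trailing period.
test["income"] = test["income"].str.strip().str.rstrip(".")

print(test["income"].unique())
```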


To give the data set a consistent encoding, we convert the string-typed columns to numeric type using the Categorical function of the pandas library. Once the conversion is done, we inspect the values of the 'income' column to check that it was converted.


After the conversion, the strings in the original data set are replaced by numbers. For example, the 'income' values, originally '<=50K' and '>50K', now correspond to 0 and 1 respectively. The other columns change in the same way. So if our final predicted income is 0, it means the income does not exceed 50K.
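The conversion described above can be sketched as follows. The frame and its column names are hypothetical; `pd.Categorical(...).codes` is the pandas call that turns each distinct string into an integer code.

```python
import pandas as pd

# Hypothetical mini data set with one numeric and two string columns.
df = pd.DataFrame({
    "age": [39, 50, 52],
    "sex": ["Male", "Female", "Male"],
    "income": ["<=50K", "<=50K", ">50K"],
})

for col in df.columns:
    # Leave numeric columns alone; encode every string column.
    if not pd.api.types.is_numeric_dtype(df[col]):
        # Categories are sorted, so '<=50K' becomes 0 and '>50K' becomes 1.
        df[col] = pd.Categorical(df[col]).codes

print(df["income"].unique())
```

Because the codes follow the sorted category order of each column, the training set and the test set need to be encoded against the same set of categories, otherwise the same string could end up with different numbers.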

At this point the data cleaning work is basically complete. Next, we look at the relationship between 'education time' and income using the cross-table function (crosstab) of the pandas library.
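A minimal crosstab sketch on made-up numbers. The column name `education_num` is an assumption for the 'education time' feature, and the counts here say nothing about the real data.

```python
import pandas as pd

# Hypothetical encoded data: years of education and income (1 means above 50K).
df = pd.DataFrame({
    "education_num": [5, 5, 9, 9, 13, 13, 13],
    "income":        [0, 0, 0, 1, 0, 1, 1],
})

# Rows: years of education; columns: income class; cells: number of people.
table = pd.crosstab(df["education_num"], df["income"])

print(table)
```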


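Here a 1 under 'income' means an income above 50K. The cross-table shows that among people with fewer than 8 years of education, very few earn more than 50K. So from a statistical point of view, nine years of compulsory education really does help your income.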

3. Building the model with a decision tree

What is a decision tree? The vivid example below makes it crystal clear.


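The neighbourhood auntie is experienced and has her own logic for judging people. Suppose "smoking", "dyed hair" and "swearing" are the three key features she uses to tell "good" students from "bad" ones. This ordered chain of judgements then forms a decision tree model. In a decision tree, the feature that best separates the classes is used as the first condition, and the next-best features are checked in turn further down. The core of a decision tree is how to choose the best condition at each node, that is, the process of feature selection.

At every decision node, the decision tree follows a set of IF-THEN rules: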

IF "smokes" THEN -> "bad student"

ELSE

IF "dyed hair" THEN -> "bad student"

ELSE IF "swears" THEN -> "bad student"

ELSE -> "good student"
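The rules above can be written as a plain Python function. This is only an illustration of the IF-THEN chain; the function name `classify_student` is made up, and a real decision tree learns its conditions from data instead of having them hard-coded.

```python
def classify_student(smokes: bool, dyed_hair: bool, swears: bool) -> str:
    """Apply the auntie's three checks in order, most decisive feature first."""
    if smokes:
        return "bad student"
    if dyed_hair:
        return "bad student"
    if swears:
        return "bad student"
    return "good student"


print(classify_student(smokes=False, dyed_hair=True, swears=False))
print(classify_student(smokes=False, dyed_hair=False, swears=False))
```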

The decision tree algorithm provided by the sklearn library makes classification very convenient.


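First, create a decision tree classifier named clf.

Then pass our training data to the fit function. Here we use all the feature values in the data set: there are only 14 feature dimensions, which is not a high dimensionality, so no dimensionality reduction is needed.

Next the data is fed into the decision tree algorithm for training, and once training is complete we test the model on our test set.

The final result shows an accuracy of around 80%, which is quite good.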
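The steps above can be sketched as follows. Synthetic data from `make_classification` stands in for the encoded income data, and a `train_test_split` stands in for the separate training and test files, so the accuracy this prints is unrelated to the roughly 80% reported here.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in: 14 numeric features and a binary label.
X, y = make_classification(n_samples=1000, n_features=14, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

clf = DecisionTreeClassifier(random_state=0)  # 1. create the classifier
clf.fit(X_train, y_train)                     # 2. train on all 14 features
accuracy = clf.score(X_test, y_test)          # 3. accuracy on the held-out set

print(f"test accuracy: {accuracy:.3f}")
```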

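4. Predicting your income

Beginners may be feeling a little dizzy by now. Is this model reliable or not? Let's try it on a more concrete income example and find out.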


Take one person described by a pile of parameters. If we feed them into the model, what income will it predict? Is it really below 50K?
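A minimal sketch of predicting for a single person. To keep it short it uses only three assumed features and four made-up training rows, whereas the model in the text is trained on all 14 features.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: [age, education_num, hours_per_week];
# label 1 means income above 50K, label 0 means it is not.
X_train = np.array([[25, 9, 40], [30, 9, 40], [45, 13, 50], [50, 14, 60]])
y_train = np.array([0, 0, 1, 1])

clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# predict expects a 2-D array: one row per person, one column per feature.
person = np.array([[28, 9, 40]])
prediction = clf.predict(person)

print(prediction)
```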


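Here the output array([0]) means the predicted income is below 50K, and this record's actual income is indeed so! Knowing a bit of machine learning is really useful, especially once the model has been fed a large amount of data. Of course, we can also use grid search to look for the best parameters, and interested readers can try it themselves!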
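For readers who want to try the grid search, here is one possible starting point using scikit-learn's `GridSearchCV`. The data is synthetic and the parameter values in `param_grid` are arbitrary choices for illustration, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded income data.
X, y = make_classification(n_samples=500, n_features=14, random_state=0)

# Candidate values to try; every combination is evaluated.
param_grid = {
    "max_depth": [3, 5, 10, None],
    "min_samples_leaf": [1, 5, 20],
}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    cv=5,                # 5-fold cross-validation for each combination
    scoring="accuracy",
)
search.fit(X, y)

print(search.best_params_)
print(f"best cross-validated accuracy: {search.best_score_:.3f}")
```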



Origin blog.csdn.net/weixin_33922672/article/details/90984699