[Machine Learning Example]

https://blog.csdn.net/power1_power2/article/details/79664830

Source data: http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data

Download the data

from urllib.request import urlretrieve

def load_data(download=True):
    # Download adult.data from the UCI repository and save it locally as a CSV file.
    if download:
        data_path, _ = urlretrieve("http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data", "D://pic//adult.csv")
        print('Data downloaded')

load_data()
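To avoid downloading the file again on every run, the request can be skipped when the file is already on disk. The following is a minimal sketch under that assumption; `fetch_if_missing` and its `fetch` parameter are hypothetical names introduced for illustration and are not part of the original script.

```python
import os
from urllib.request import urlretrieve

def fetch_if_missing(url, dest, fetch=urlretrieve):
    """Download url to dest unless dest already exists, then return dest."""
    if not os.path.exists(dest):
        # urlretrieve(url, filename) saves the response body to filename.
        fetch(url, dest)
    return dest
```

It could be called with the same arguments as above, for example `fetch_if_missing("http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data", "D://pic//adult.csv")`. The `fetch` argument exists only so the function can be exercised without network access.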

Assign column names and read the data

import pandas as pd
col_names = ["age", "workclass", "fnlwgt", "education", "education-num", "marital-status", "occupation",
         "relationship", "race", "sex", "capital-gain", "capital-loss", "hours-per-week", "native-country", "result"]
data = pd.read_csv("D://pic//adult.csv", names=col_names)
print(data[:10])

 age          workclass  fnlwgt   education  education-num  \
0   39          State-gov   77516   Bachelors             13   
1   50   Self-emp-not-inc   83311   Bachelors             13   
2   38            Private  215646     HS-grad              9   
3   53            Private  234721        11th              7   
4   28            Private  338409   Bachelors             13   
5   37            Private  284582     Masters             14   
6   49            Private  160187         9th              5   
7   52   Self-emp-not-inc  209642     HS-grad              9   
8   31            Private   45781     Masters             14   
9   42            Private  159449   Bachelors             13   

           marital-status          occupation    relationship    race  \
0           Never-married        Adm-clerical   Not-in-family   White   
1      Married-civ-spouse     Exec-managerial         Husband   White   
2                Divorced   Handlers-cleaners   Not-in-family   White   
3      Married-civ-spouse   Handlers-cleaners         Husband   Black   
4      Married-civ-spouse      Prof-specialty            Wife   Black   
5      Married-civ-spouse     Exec-managerial            Wife   White   
6   Married-spouse-absent       Other-service   Not-in-family   Black   
7      Married-civ-spouse     Exec-managerial         Husband   White   
8           Never-married      Prof-specialty   Not-in-family   White   
9      Married-civ-spouse     Exec-managerial         Husband   White   

       sex  capital-gain  capital-loss  hours-per-week  native-country  result  
0     Male          2174             0              40   United-States   <=50K  
1     Male             0             0              13   United-States   <=50K  
2     Male             0             0              40   United-States   <=50K  
3     Male             0             0              40   United-States   <=50K  
4   Female             0             0              40            Cuba   <=50K  
5   Female             0             0              40   United-States   <=50K  
6   Female             0             0              16         Jamaica   <=50K  
7     Male             0             0              45   United-States    >50K  
8   Female         14084             0              50   United-States    >50K  
9     Male          5178             0              40   United-States    >50K
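The fields in adult.data are separated by a comma followed by a space, so with the default settings the text values are read with a leading space (for example ' Private'). A small sketch of how `skipinitialspace=True` in `pd.read_csv` changes this, using two made-up rows rather than the real file:

```python
import io
import pandas as pd

# Two made-up rows in the same "comma + space" layout, for illustration only.
sample = "39, State-gov, <=50K\n50, Private, >50K\n"
cols = ["age", "workclass", "result"]

# Default parsing keeps the space that follows each comma.
raw = pd.read_csv(io.StringIO(sample), names=cols)

# skipinitialspace=True drops whitespace that follows the delimiter.
stripped = pd.read_csv(io.StringIO(sample), names=cols, skipinitialspace=True)

print(repr(raw.loc[0, "workclass"]))
print(repr(stripped.loc[0, "workclass"]))
```

Whether to strip at read time is a design choice; any later comparison such as `data["sex"] == "Male"` only matches if the stored strings have no leading space.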

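Description of each column:

age: age, continuous numeric variable;

workclass: type of employer, multi-category variable;

fnlwgt: final weight, the number of people the census takers believe the observation represents, continuous variable;

education: education level, multi-category variable;

education-num: years of education, continuous variable;

marital-status: marital status, multi-category variable;

occupation: occupation, multi-category variable;

relationship: role within the family (for example Husband, Wife, Not-in-family), multi-category variable;

race: race, multi-category variable;

sex: sex, binary variable;

capital-gain: capital gains, continuous variable;

capital-loss: capital losses, continuous variable;

hours-per-week: hours worked per week, continuous variable;

native-country: country of origin, multi-category variable;

result: the income label (<=50K or >50K), binary variable;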
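The continuous and categorical columns listed above can also be separated programmatically by dtype. This is a minimal sketch on a tiny made-up frame that reuses three of the column names; on the real `data` frame the same `select_dtypes` calls would apply.

```python
import pandas as pd

# Tiny made-up frame using three of the column names described above.
df = pd.DataFrame({
    "age": [39, 50],
    "workclass": ["State-gov", "Private"],
    "hours-per-week": [40, 13],
})

# Numeric dtypes are treated as continuous, everything else as categorical.
continuous_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()

print(continuous_cols)
print(categorical_cols)
```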

Feature processing

Check for missing values:

# Method 1
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
age               32561 non-null int64
workclass         32561 non-null object
fnlwgt            32561 non-null int64
education         32561 non-null object
education-num     32561 non-null int64
marital-status    32561 non-null object
occupation        32561 non-null object
relationship      32561 non-null object
race              32561 non-null object
sex               32561 non-null object
capital-gain      32561 non-null int64
capital-loss      32561 non-null int64
hours-per-week    32561 non-null int64
native-country    32561 non-null object
result            32561 non-null object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB
# Method 2
print(data.isnull().any())

age               False
workclass         False
fnlwgt            False
education         False
education-num     False
marital-status    False
occupation        False
relationship      False
race              False
sex               False
capital-gain      False
capital-loss      False
hours-per-week    False
native-country    False
result            False
dtype: bool

print(data.shape)

(32561, 15)

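Both checks report no missing values, but the raw data actually contains invalid placeholder characters such as ?, . and $. Replace these invalid entries with NaN so that pandas can detect them: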

import numpy as np
data_clean = data.replace(regex=[r'\?|\.|\$'],value=np.nan)
print(data_clean.isnull().any())

age               False
workclass          True
fnlwgt            False
education         False
education-num     False
marital-status    False
occupation         True
relationship      False
race              False
sex               False
capital-gain      False
capital-loss      False
hours-per-week    False
native-country     True
result            False
dtype: bool
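The regex above is unanchored, so it matches those characters anywhere in a cell. A stricter alternative is to treat a cell as missing only when it equals ? after surrounding whitespace is stripped. The sketch below shows this on made-up rows (the leading spaces mimic how the raw file is parsed); it is one possible approach, not the original author's method.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration; not taken from the real file.
df = pd.DataFrame({
    "age": [39, 50, 38],
    "workclass": [" State-gov", " ?", " Private"],
    "occupation": [" ?", " ?", " Sales"],
})

# Look only at the text columns, with surrounding whitespace stripped.
text = df.select_dtypes(exclude="number").apply(lambda s: s.str.strip())

# Number of placeholder cells per text column.
placeholder_counts = (text == "?").sum()
print(placeholder_counts)

# Replace exact "?" cells with NaN so isnull() and dropna() can see them.
df_clean = df.copy()
df_clean[text.columns] = text.where(text != "?", np.nan)
print(df_clean.isnull().sum())
```

Counting the placeholders per column first shows which features are affected before any rows are discarded.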

Drop every row that contains a missing value

adult = data_clean.dropna(how='any')
print(adult.shape)

(30162, 15)
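Comparing the two shapes printed so far gives the cost of dropping incomplete rows. A short worked calculation, using only the row counts reported above:

```python
rows_before = 32561  # data.shape[0] before cleaning
rows_after = 30162   # adult.shape[0] after dropna

dropped = rows_before - rows_after
share = dropped / rows_before

# Number of rows removed and their share of the original data.
print(dropped, f"{share:.1%}")
```

So 2399 rows, roughly 7.4% of the original data, are discarded by this step.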

Remove the feature that is not useful (fnlwgt)

adult = adult.drop(['fnlwgt'],axis=1)
adult.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 30162 entries, 0 to 32560
Data columns (total 14 columns):
age               30162 non-null int64
workclass         30162 non-null object
education         30162 non-null object
education-num     30162 non-null int64
marital-status    30162 non-null object
occupation        30162 non-null object
relationship      30162 non-null object
race              30162 non-null object
sex               30162 non-null object
capital-gain      30162 non-null int64
capital-loss      30162 non-null int64
hours-per-week    30162 non-null int64
native-country    30162 non-null object
result            30162 non-null object
dtypes: int64(5), object(9)
memory usage: 3.5+ MB

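Split into training and test sets

# Supervised machine learning
from sklearn.model_selection import train_test_split
# Separate the features from the label
col_names = ["age", "workclass", "education", "education-num", "marital-status", "occupation",
         "relationship", "race", "sex", "capital-gain", "capital-loss", "hours-per-week", "native-country", "result"]
# Note: col_names[1:13] starts at "workclass", so "age" is not among the 12 feature columns;
# col_names[:13] would include it.
X_train , X_test , y_train , y_test = train_test_split(adult[col_names[1:13]],adult[col_names[13]],test_size=0.25,random_state=33)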
print(X_train.shape)
print(X_test.shape)
print(X_train.head())
print(y_train.head())

(22621, 12)
(7541, 12)
      workclass      education  education-num       marital-status  \
20607   Private   Some-college             10   Married-civ-spouse   
31257   Private        HS-grad              9   Married-civ-spouse   
31892   Private        HS-grad              9        Never-married   
20220   Private        HS-grad              9             Divorced   
24044   Private   Some-college             10             Divorced   

               occupation    relationship    race      sex  capital-gain  \
20607        Craft-repair         Husband   White     Male             0   
31257       Other-service         Husband   Black     Male             0   
31892        Adm-clerical   Not-in-family   White   Female             0   
20220   Machine-op-inspct       Unmarried   Black   Female             0   
24044               Sales   Not-in-family   White   Female             0   

       capital-loss  hours-per-week  native-country  
20607             0              50   United-States  
31257             0              50   United-States  
31892             0              45   United-States  
20220             0              40   United-States  
24044             0              45   United-States  
20607      >50K
31257     <=50K
31892     <=50K
20220     <=50K
24044      >50K
Name: result, dtype: object
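The slice `col_names[1:13]` used in the split starts at workclass, so age is not part of the features. A variant that keeps every column except the label, and also preserves the label ratio in both parts through the `stratify` argument of `train_test_split`, might look like the sketch below. The small frame is made up and only stands in for the cleaned `adult` data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up frame standing in for the cleaned `adult` data.
df = pd.DataFrame({
    "age": [39, 50, 38, 53, 28, 37, 49, 52],
    "hours-per-week": [40, 13, 40, 40, 40, 40, 16, 45],
    "result": ["<=50K", "<=50K", "<=50K", "<=50K",
               ">50K", ">50K", ">50K", ">50K"],
})

X = df.drop(columns="result")  # every column except the label, "age" included
y = df["result"]

# stratify=y keeps the proportion of each label the same in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=33, stratify=y
)

print(X_train.shape, X_test.shape)
print(y_test.value_counts())
```

Stratifying is optional; it mainly matters when one label is much rarer than the other, since a purely random split can then leave the test set with too few examples of the rare label.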
