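Contents

Overview

Machine learning

Why do we need machine learning?
Types of machine learning
The machine learning workflow

Data preprocessing

Mean removal (standardization)
Range scaling
Normalization
Binarization
One-hot encoding
Label encoding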

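Overview

Machine learning

Artificial intelligence: any artificial method that solves, or approximately solves, a problem that would otherwise require human intelligence can be called artificial intelligence.
Machine learning: a computer program gains experience E by carrying out task T, and the effect of that experience is measured by performance P. If the performance measured by P keeps improving as the program carries out more of T, the program is a machine learning system.
In short: self-improving, self-correcting, self-reinforcing.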

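Why do we need machine learning?

It simplifies or replaces manual pattern recognition, which makes systems easier to develop, maintain, and upgrade.
For problems whose algorithms are too complex, or that have no explicit solution, machine learning systems have a natural advantage.
The learning process can be run in reverse to infer the rules hidden behind business data; this is data mining.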

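Types of machine learning

Supervised, unsupervised, semi-supervised, and reinforcement learning
Batch learning and incremental learning
Instance-based learning and model-based learning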

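The machine learning workflow

Data
    Data collection
    Data cleaning
Machine learning
    Data preprocessing
    Model selection
    Model training
    Model validation
Business
    Model use
    Maintenance and upgrades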

Data preprocessing

import sklearn.preprocessing as sp

Sample matrix: each row is a sample and each column is a feature.

            Input data (features)          Output data
            Height  Weight  Age  Sex
Sample 1    1.7     60      25   M     ->  8000
Sample 2    1.5     50      20   F     ->  6000
...

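Mean removal (standardization)

Feature A: 10±5
Feature B: 10000±5000
Feature swamping: when features differ this much in scale, the large-valued feature B drowns out feature A.
Adjust the sample matrix so that every column (feature) has a mean of 0 and a standard deviation of 1. Each feature then contributes on a comparable scale to the model's predictions, and the model is less biased toward any single feature. Standardized value = (original value - column mean) / column standard deviation.
sp.scale(original sample matrix) -> sample matrix after mean removal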

import numpy as np
import sklearn.preprocessing as sp
raw_samples = np.array([
    [3, -1.5,  2,   -5.4],
    [0,  4,   -0.3,  2.1],
    [1,  3.3, -1.9, -4.3]])
print(raw_samples)
print(raw_samples.mean(axis=0))
print(raw_samples.std(axis=0))
# Manual mean removal: standardize each column in place
std_samples = raw_samples.copy()
# Each row of the transpose is a view of one column of std_samples
for col in std_samples.T:
    col_mean = col.mean()
    col_std = col.std()
    col -= col_mean
    col /= col_std
print(std_samples)
# Column means are now (numerically) 0 and column standard deviations 1
print(std_samples.mean(axis=0))
print(std_samples.std(axis=0))
# The same transformation with scikit-learn
std_samples = sp.scale(raw_samples)
print(std_samples)
print(std_samples.mean(axis=0))
print(std_samples.std(axis=0))
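
To make "feature swamping" concrete, the sketch below measures how much each feature contributes to the distance between two samples before and after mean removal. The four samples are made-up values on the scales of feature A (10±5) and feature B (10000±5000), and the `distance` helper is introduced here only for illustration.

```python
import numpy as np

# Hypothetical samples: column 0 is on the scale of feature A (10±5),
# column 1 is on the scale of feature B (10000±5000).
samples = np.array([
    [5.0, 10000.0],
    [15.0, 10100.0],
    [10.0, 15000.0],
    [12.0, 5000.0]])


def distance(u, v):
    """Euclidean distance between two sample rows."""
    return np.sqrt(((u - v) ** 2).sum())


# Samples 0 and 1 differ by the full spread of A but by only 1% in B.
raw_dist = distance(samples[0], samples[1])
# Share of the squared distance that comes from feature A (raw data)
raw_share_a = (samples[0, 0] - samples[1, 0]) ** 2 / raw_dist ** 2

# Mean removal: (value - column mean) / column standard deviation
std_samples = (samples - samples.mean(axis=0)) / samples.std(axis=0)
std_dist = distance(std_samples[0], std_samples[1])
# Share of the squared distance that comes from feature A (standardized)
std_share_a = (std_samples[0, 0] - std_samples[1, 0]) ** 2 / std_dist ** 2

print(raw_share_a, std_share_a)
```

On the raw data, feature A accounts for only about 1% of the squared distance even though the two samples sit at opposite ends of A's range. After mean removal, A accounts for almost all of it.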

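Range scaling

Scores such as 90/150, 80/100, and 5/5 are measured against different maximums, so the raw numbers cannot be compared directly.
Apply a linear transformation to each column of the sample matrix so that the elements of every column fall within the same range.
For the target range [0, 1]: subtract the column minimum from every value, then divide by the difference between the column maximum and the column minimum.

For a general target range [min, max], the linear transformation y = k x + b of a column is fixed by two conditions, which determine k and b:

k * col_min + b = min
k * col_max + b = max

In matrix form this is a x = b, where x holds the unknowns k and b (the right-hand vector b is not the intercept b inside x):

$$
\underbrace{\begin{pmatrix} \mathrm{col\_min} & 1 \\ \mathrm{col\_max} & 1 \end{pmatrix}}_{a}
\underbrace{\begin{pmatrix} k \\ b \end{pmatrix}}_{x}
=
\underbrace{\begin{pmatrix} \mathrm{min} \\ \mathrm{max} \end{pmatrix}}_{b}
$$

x = np.linalg.solve(a, b)
  = np.linalg.lstsq(a, b)[0]

scaler = sp.MinMaxScaler(feature_range=(min, max))
scaler.fit_transform(original sample matrix) -> sample matrix after range scaling
Range scaling with [0, 1] as the target range is sometimes also called "normalization".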

import numpy as np
import sklearn.preprocessing as sp
raw_samples = np.array([
    [3, -1.5,  2,   -5.4],
    [0,  4,   -0.3,  2.1],
    [1,  3.3, -1.9, -4.3]])
print(raw_samples)
# Manual range scaling to [0, 1], one column at a time
mms_samples = raw_samples.copy()
for col in mms_samples.T:
    col_min = col.min()
    col_max = col.max()
    # Solve k*col_min + b = 0 and k*col_max + b = 1 for x = (k, b)
    a = np.array([
        [col_min, 1],
        [col_max, 1]])
    b = np.array([0, 1])
    x = np.linalg.solve(a, b)
    # Apply y = k*x + b to the column in place
    col *= x[0]
    col += x[1]
print(mms_samples)
# The same transformation with scikit-learn
mms = sp.MinMaxScaler(feature_range=(0, 1))
mms_samples = mms.fit_transform(raw_samples)
print(mms_samples)
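
The two-equation system above can also be solved by hand, which gives k = (max - min) / (col_max - col_min) and b = min - k * col_min. The sketch below applies that closed form to all columns at once for an arbitrary target range. The function name `min_max_scale` is made up for this example, and it assumes no column is constant (otherwise the division is by zero).

```python
import numpy as np


def min_max_scale(samples, feature_range=(0.0, 1.0)):
    """Linearly map every column of `samples` onto feature_range.

    Assumes each column has at least two distinct values.
    """
    lo, hi = feature_range
    col_min = samples.min(axis=0)
    col_max = samples.max(axis=0)
    # Closed-form solution of k*col_min + b = lo, k*col_max + b = hi
    k = (hi - lo) / (col_max - col_min)
    b = lo - k * col_min
    return samples * k + b


raw_samples = np.array([
    [3, -1.5,  2,   -5.4],
    [0,  4,   -0.3,  2.1],
    [1,  3.3, -1.9, -4.3]])

# Scale every column onto [-1, 1] instead of the default [0, 1]
scaled = min_max_scale(raw_samples, feature_range=(-1.0, 1.0))
print(scaled)
print(scaled.min(axis=0), scaled.max(axis=0))
```

With the default range (0, 1), k and b reduce to the rule stated earlier: subtract the column minimum and divide by the column's max-min difference.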

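Normalization

Year   Python  C/C++  Java  PHP   Total
2016   20      30     40    10    100
2017   30      20     30    10    90
2018   10      5      1     0     16

Divide each feature value of a sample by the sum of the absolute values of all of that sample's feature values, so that the features are expressed as proportions.
sp.normalize(original sample matrix, norm='l1') -> sample matrix after normalization
l1 - L1 norm: the sum of the absolute values of the vector's elements
l2 - L2 norm: the square root of the sum of the squares of the vector's elements
...
ln - Ln norm: the n-th root of the sum of the n-th powers of the absolute values of the vector's elements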

import numpy as np
import sklearn.preprocessing as sp
raw_samples = np.array([
    [3, -1.5,  2,   -5.4],
    [0,  4,   -0.3,  2.1],
    [1,  3.3, -1.9, -4.3]])
print(raw_samples)
# Manual L1 normalization: divide each row by the sum of its absolute values
nor_samples = raw_samples.copy()
for row in nor_samples:
    row_absum = abs(row).sum()
    row /= row_absum
print(nor_samples)
# The absolute values in every row now sum to 1
print(abs(nor_samples).sum(axis=1))
# The same transformation with scikit-learn
nor_samples = sp.normalize(raw_samples, norm='l1')
print(nor_samples)
print(abs(nor_samples).sum(axis=1))
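
As a small worked example of the l1 and l2 norms, the sketch below normalizes the 2018 row of the language table with each norm, using `np.linalg.norm`. It only illustrates the definitions and is not meant to mirror how `sp.normalize` is implemented.

```python
import numpy as np

# The 2018 row of the table: Python, C/C++, Java, PHP
row = np.array([10.0, 5.0, 1.0, 0.0])

# L1 norm: sum of absolute values (10 + 5 + 1 + 0 = 16)
l1 = np.linalg.norm(row, ord=1)
# L2 norm: square root of the sum of squares (sqrt(100 + 25 + 1 + 0))
l2 = np.linalg.norm(row, ord=2)

l1_normalized = row / l1
l2_normalized = row / l2

# After L1 normalization the absolute values sum to 1 (shares of the total)
print(l1_normalized, np.abs(l1_normalized).sum())
# After L2 normalization the squares sum to 1 (unit-length vector)
print(l2_normalized, (l2_normalized ** 2).sum())
```

L1 normalization gives the shares 0.625, 0.3125, 0.0625, and 0, which is the "proportion" reading described above. L2 normalization rescales the row to unit length instead.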

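Binarization

Given a preset threshold, set every element of the sample matrix that is greater than the threshold to 1 and every other element to 0. The result is a binary matrix made up entirely of 1s and 0s.

binarizer = sp.Binarizer(threshold=threshold)
binarizer.transform(original sample matrix) -> sample matrix after binarization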

Example:

import numpy as np
import sklearn.preprocessing as sp
raw_samples = np.array([
    [3, -1.5,  2,   -5.4],
    [0,  4,   -0.3,  2.1],
    [1,  3.3, -1.9, -4.3]])
print(raw_samples)
# Manual binarization with threshold 1.4
bin_samples = raw_samples.copy()
bin_samples[bin_samples <= 1.4] = 0
bin_samples[bin_samples > 1.4] = 1
print(bin_samples)
# The same transformation with scikit-learn
# (named binarizer to avoid shadowing the built-in bin())
binarizer = sp.Binarizer(threshold=1.4)
bin_samples = binarizer.transform(raw_samples)
print(bin_samples)
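
One detail worth checking is what happens to a value that equals the threshold. The manual code above sets values `<= 1.4` to 0, so the comparison is strict. The sketch below expresses the same rule with `np.where`; the function name `binarize` is made up for this example.

```python
import numpy as np


def binarize(samples, threshold):
    """Return 1.0 where samples > threshold, otherwise 0.0."""
    return np.where(samples > threshold, 1.0, 0.0)


# One value below, one equal to, and one above the threshold
values = np.array([[1.3, 1.4, 1.5]])
result = binarize(values, threshold=1.4)
print(result)
```

Only 1.5 maps to 1; the value equal to the threshold maps to 0.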

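One-hot encoding

Binarization discards detail in the data, so one-hot encoding is worth considering instead.

Encode each feature value as a sequence that contains exactly one 1 and otherwise only 0s. This keeps every detail of the sample matrix while producing a sparse matrix of only 1s and 0s, which can make the model more tolerant of errors and also saves memory.

Original sample matrix:
1        3        2
7        5        4
1        8        6
7        3        9
----------------------
Code table for each column:
1:10  3:100 2:1000
7:01  5:010 4:0100
      8:001 6:0010
            9:0001
----------------------
Encoded samples:
101001000
010100100
100010010
011000001

encoder = sklearn.preprocessing.OneHotEncoder(
    sparse=whether to use the compact format (default True), dtype=element type)
    Compact format: record only the positions that stand out. For example, if a class has only three girls, record just the three girls' positions; every other position is a boy.
    In scikit-learn 1.2 and later this parameter is named sparse_output.
encoder.fit_transform(original sample matrix)
    -> sample matrix after one-hot encoding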

import numpy as np
import sklearn.preprocessing as sp
raw_samples = np.array([
    [1, 3, 2],
    [7, 5, 4],
    [1, 8, 6],
    [7, 3, 9]])
print(raw_samples)
# Build one code table (dict) per column, keyed by the distinct values
code_tables = []
for col in raw_samples.T:
    # Code table for a single column
    code_table = {}
    for val in col:
        code_table[val] = None
    code_tables.append(code_table)
# Fill in each code table: the i-th smallest value gets a 1 at position i
for code_table in code_tables:
    size = len(code_table)
    for one, key in enumerate(sorted(
            code_table.keys())):
        code_table[key] = np.zeros(
            shape=size, dtype=int)
        code_table[key][one] = 1
# One-hot encode the sample matrix by concatenating the codes of each row
ohe_samples = []
for raw_sample in raw_samples:
    ohe_sample = np.array([], dtype=int)
    for i, key in enumerate(raw_sample):
        ohe_sample = np.hstack(
            (ohe_sample, code_tables[i][key]))
    ohe_samples.append(ohe_sample)
ohe_samples = np.array(ohe_samples)
print(ohe_samples)
# The same encoding with scikit-learn. The default output is a SciPy sparse
# matrix; toarray() converts it to a dense array. (The keyword that turns the
# sparse output off is `sparse` in older releases and `sparse_output` in
# scikit-learn 1.2 and later, so it is avoided here.)
ohe = sp.OneHotEncoder(dtype=int)
ohe_samples = ohe.fit_transform(raw_samples).toarray()
print(ohe_samples)

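Result analysis:

sparse=whether to use the compact format (default True)
False: dense format, a regular array that stores every element, zeros included
[[1 0 1 0 0 1 0 0 0]
 [0 1 0 1 0 0 1 0 0]
 [1 0 0 0 1 0 0 1 0]
 [0 1 1 0 0 0 0 0 1]]
True: compact format, a sparse matrix
  (0, 5)        1
  (0, 2)        1
  (0, 0)        1
  (1, 6)        1
  (1, 3)        1
  (1, 1)        1
  (2, 7)        1
  (2, 4)        1
  (2, 0)        1
  (3, 8)        1
  (3, 2)        1
  (3, 1)        1
The compact format records only the (row, column) positions of the non-zero elements of the dense matrix.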
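
The relationship between the two formats can be checked with NumPy alone. The sketch below takes the dense one-hot matrix shown above, lists the (row, column) position of every non-zero element, and then rebuilds the dense matrix from those positions. It illustrates the idea of the compact format only; the actual sparse matrix class returned by scikit-learn is not used here.

```python
import numpy as np

# Dense one-hot matrix from the result above
dense = np.array([
    [1, 0, 1, 0, 0, 1, 0, 0, 0],
    [0, 1, 0, 1, 0, 0, 1, 0, 0],
    [1, 0, 0, 0, 1, 0, 0, 1, 0],
    [0, 1, 1, 0, 0, 0, 0, 0, 1]])

# (row, column) index of every non-zero element, in row-major order
positions = np.argwhere(dense != 0)
for r, c in positions:
    print((int(r), int(c)), dense[r, c])

# The positions alone are enough to rebuild the dense matrix
rebuilt = np.zeros_like(dense)
rebuilt[positions[:, 0], positions[:, 1]] = 1
print(np.array_equal(rebuilt, dense))
```

The dense matrix has 36 elements but only 12 positions need to be stored, one per sample per original feature. The saving grows as features take more distinct values.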

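Label encoding

Convert feature values in text form into numeric form: sort the distinct label strings in dictionary order and replace each label with its position in that order.
The code values come only from the dictionary order of the label strings and have nothing to do with what the labels mean.
Position    Car
Employee    toyota - 0
Team lead   ford - 1
Manager     audi - 2
Boss        bmw - 3
Note that the numbers in this table follow the meaning of the labels. sp.LabelEncoder sorts the strings instead, so the four car labels would be encoded as audi - 0, bmw - 1, ford - 2, toyota - 3.
encoder = sp.LabelEncoder()
encoder.fit_transform(original labels) -> labels after label encoding
encoder.inverse_transform(labels after label encoding) -> original labels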

import numpy as np
import sklearn.preprocessing as sp
raw_samples = np.array([
    'audi', 'ford', 'audi', 'toyota',
    'ford', 'bmw', 'toyota', 'bmw'])
print(raw_samples)
# Encode each label as its index in the sorted list of distinct labels
# (audi=0, bmw=1, ford=2, toyota=3)
lbe = sp.LabelEncoder()
lbe_samples = lbe.fit_transform(raw_samples)
print(lbe_samples)
# Map the integer codes back to the original label strings
raw_samples = lbe.inverse_transform(lbe_samples)
print(raw_samples)
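
The "dictionary order" rule can be reproduced with plain NumPy, which makes it easy to see where each code comes from. The sketch below uses `np.unique` with `return_inverse=True`. It illustrates the idea and is not presented as scikit-learn's own implementation.

```python
import numpy as np

raw_labels = np.array([
    'audi', 'ford', 'audi', 'toyota',
    'ford', 'bmw', 'toyota', 'bmw'])

# classes: the distinct labels in sorted (dictionary) order
# codes:   for each label, its index in classes
classes, codes = np.unique(raw_labels, return_inverse=True)
print(classes)
print(codes)

# Decoding is just indexing the sorted class list with the codes
restored = classes[codes]
print(restored)
```

Because the codes depend only on sort order, adding a new label such as 'acura' to the data would shift the code of every label that sorts after it.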

Machine Learning 1: Overview and Data Preprocessing