Machine Learning Notes (1): Using numpy and basic concepts of machine learning

1. numpy

Why use numpy

Python provides the list container, which can be used as an array. However, the elements of a list can be any object, so what the list actually stores are pointers to objects. To hold a simple list such as [1,2,3], Python therefore needs three pointers and three integer objects. For numerical computation this structure is clearly inefficient. Python also provides the array module, but it only supports one-dimensional arrays, not multi-dimensional ones (in TensorFlow these are usually thought of as matrices), and it has no numerical functions to operate on them, so it is not suitable for numerical computation either. NumPy was created to make up for these shortcomings.

(Excerpt from "Python Scientific Computing" by Zhang Ruoyu)
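The storage difference described in the excerpt can be seen with a small sketch. The variable names are illustrative, and the dtype is fixed to int64 here only so that the byte counts are the same on every platform.

```python
import numpy as np

# A Python list stores pointers to separate int objects.
py_list = [1, 2, 3]

# A numpy array stores the raw values in one contiguous buffer of a single dtype.
arr = np.array([1, 2, 3], dtype=np.int64)

print(arr.dtype)      # int64
print(arr.itemsize)   # 8 (bytes per element)
print(arr.nbytes)     # 24 (bytes for the three values)

# Because every element has the same type, arithmetic is applied to the
# whole buffer at once instead of looping over Python objects.
doubled_list = [2 * v for v in py_list]   # explicit Python-level loop
doubled_arr = arr * 2                     # vectorized operation

print(doubled_arr.tolist() == doubled_list)   # True
```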

1.1. Create numpy.array

1.1.1. Conventional method of creating numpy.array

import numpy as np
# Conventional creation
a = np.array([2,3,4])
b = np.array([2.0,3.0,4.0])
c = np.array([[1.0,2.0],[3.0,4.0]])
d = np.array([[1,2],[3,4]],dtype="float") # explicit dtype; without it np.array infers the dtype from the data (int here)
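As a quick check of how the element type is chosen, the sketch below inspects `dtype` for a few inputs. It assumes nothing beyond numpy itself; the variable names are illustrative.

```python
import numpy as np

ints = np.array([2, 3, 4])                          # all ints -> an integer dtype
floats = np.array([2.0, 3.0, 4.0])                  # all floats -> float64
mixed = np.array([1, 2.5])                          # int mixed with float -> upcast to float64
forced = np.array([[1, 2], [3, 4]], dtype="float")  # an explicit dtype overrides inference

print(ints.dtype.kind)    # i (an integer type; the exact width depends on the platform)
print(floats.dtype)       # float64
print(mixed.dtype)        # float64
print(forced.dtype)       # float64

# Creation helpers such as np.zeros default to float64 when dtype is omitted.
print(np.zeros(3).dtype)  # float64
```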

1.1.2. Other ways to create numpy.array

import numpy as np
# Create arrays of zeros
np.zeros(10)                          # array([ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.])                    
np.zeros(10, dtype=float)             # array([ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.])
np.zeros((3, 5))                      # array([[ 0.,  0.,  0.,  0.,  0.],
                                      #       [ 0.,  0.,  0.,  0.,  0.],
                                      #       [ 0.,  0.,  0.,  0.,  0.]])    
np.zeros(shape=(3, 5), dtype=int)     # array([[0, 0, 0, 0, 0],
                                      #       [0, 0, 0, 0, 0],
                                      #       [0, 0, 0, 0, 0]])
# Create arrays of ones
np.ones(10)                           # array([ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.])
np.ones((3, 5))                       # array([[ 1.,  1.,  1.,  1.,  1.],
                                      #        [ 1.,  1.,  1.,  1.,  1.],
                                      #        [ 1.,  1.,  1.,  1.,  1.]])
# Create arrays filled with a given value
np.full((3, 5), 666)                  # array([[666, 666, 666, 666, 666],
                                      #        [666, 666, 666, 666, 666],
                                      #        [666, 666, 666, 666, 666]])
np.full(fill_value=666, shape=(3, 5)) # array([[666, 666, 666, 666, 666],
                                      #        [666, 666, 666, 666, 666],
                                      #        [666, 666, 666, 666, 666]])
# Create arrays with arange (start, stop, step; stop is excluded)
np.arange(0, 10)                      # array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
np.arange(0, 20, 2)                   # array([ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18])
np.arange(0, 1, 0.2)                  # array([ 0. ,  0.2,  0.4,  0.6,  0.8])
# Create evenly spaced sequences with linspace (start, stop, count; stop is included)
np.linspace(0, 20, 10)                # array([  0.        ,   2.22222222,   4.44444444,   6.66666667,  8.88888889,  11.11111111,  13.33333333,  15.55555556,   17.77777778,  20.        ])
np.linspace(0, 20, 11)                # array([  0.,   2.,   4.,   6.,   8.,  10.,  12.,  14.,  16.,  18.,  20.])
np.linspace(0, 1, 5)                  # array([ 0.  ,  0.25,  0.5 ,  0.75,  1.  ])
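The difference between `np.arange` and `np.linspace` is easy to mix up, so here is a small side-by-side sketch. `endpoint=False` is a standard `linspace` parameter; the variable names are illustrative.

```python
import numpy as np

a = np.arange(0, 1, 0.2)                    # step-based, stop value excluded
b = np.linspace(0, 1, 5)                    # count-based, stop value included
c = np.linspace(0, 1, 5, endpoint=False)    # count-based, stop value excluded

print(len(a), len(b), len(c))   # 5 5 5
print(b[-1])                    # 1.0
print(np.allclose(a, c))        # True
```

With a floating-point step, `linspace` is usually the safer choice because the number of elements is stated explicitly rather than derived from a division.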

1.1.3. Creating random numbers with np.random

import numpy as np
# Create random integers
np.random.randint(0, 10)              # a random integer in [0, 10)
np.random.randint(0, 10, 10)          # array([2, 6, 1, 8, 1, 6, 8, 0, 1, 4])
np.random.randint(0, 1, 10)           # array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
np.random.randint(0, 10, size=10)     # array([3, 4, 9, 9, 5, 2, 3, 3, 2, 1])
np.random.randint(0, 10, size=(3,5))  # array([[1, 5, 3, 8, 5],
                                      #        [2, 7, 9, 6, 0],
                                      #        [0, 9, 9, 9, 7]])
np.random.randint(10, size=(3,5))     # array([[4, 8, 3, 7, 2],
                                      #        [9, 9, 2, 4, 4],
                                      #        [1, 5, 1, 7, 7]])
# Set the random seed (the same seed reproduces the same numbers)
np.random.seed(666)
np.random.randint(0, 10, size=(3, 5)) # array([[2, 6, 9, 4, 3],
                                      #        [1, 0, 8, 7, 5],
                                      #        [2, 5, 5, 4, 8]])
np.random.seed(666)
np.random.randint(0, 10, size=(3,5))  # array([[2, 6, 9, 4, 3],
                                      #        [1, 0, 8, 7, 5],
                                      #        [2, 5, 5, 4, 8]])
# Create random floats
np.random.random()                    # a random float in [0, 1)
np.random.random((3,5))               # array([[ 0.8578588 ,  0.76741234,  0.95323137,  0.29097383,  0.84778197],
                                      #        [ 0.3497619 ,  0.92389692,  0.29489453,  0.52438061,  0.94253896],
                                      #        [ 0.07473949,  0.27646251,  0.4675855 ,  0.31581532,  0.39016259]])
# Normally distributed numbers: normal(mean, standard deviation, size)
np.random.normal()                    # 0.9047266176428719
np.random.normal(10, 100)             # -72.62832650185376
np.random.normal(0, 1, (3, 5))        # array([[ 0.82101369,  0.36712592,  1.65399586,  0.13946473, -1.21715355],
                                      #        [-0.99494737, -1.56448586, -1.62879004,  1.23174866, -0.91360034],
                                      #        [-0.27084407,  1.42024914, -0.98226439,  0.80976498,  1.85205227]])
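The effect of the seed can be checked directly. The sketch below reuses the seed value 666 from the notes; `np.random.default_rng` is the generator interface available in numpy 1.17 and later, shown only as an alternative.

```python
import numpy as np

np.random.seed(666)
first = np.random.randint(0, 10, size=(3, 5))

np.random.seed(666)                       # reset to the same seed
second = np.random.randint(0, 10, size=(3, 5))

print(np.array_equal(first, second))      # True

# Alternative: independent generator objects instead of the global seed.
rng_a = np.random.default_rng(666)
rng_b = np.random.default_rng(666)
draw_a = rng_a.integers(0, 10, size=5)
draw_b = rng_b.integers(0, 10, size=5)
print(np.array_equal(draw_a, draw_b))     # True
```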
 


1.2. Basic operation of numpy.array

1.2.1. Basic properties of numpy.array

import numpy as np
x = np.arange(10)
X = np.arange(15).reshape((3, 5))
# Inspect the number of dimensions, the shape, and the element count
x.ndim                            # 1
X.ndim                            # 2
x.shape                           # (10,)
X.shape                           # (3, 5)
x.size                            # 10
X.size                            # 15

1.2.2. Accessing data in numpy.array

import numpy as np
x = np.arange(10)
X = np.arange(15).reshape((3, 5))
x[0]                              # 0
x[-1]                             # 9
X                                 # array([[ 0,  1,  2,  3,  4],
                                  #        [ 5,  6,  7,  8,  9],
                                  #        [10, 11, 12, 13, 14]])
X[0, 0]                           # 0
X[0, -1]                          # 4
x                                 # array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
x[0:5]                            # array([0, 1, 2, 3, 4])
x[:5]                             # array([0, 1, 2, 3, 4])
x[5:]                             # array([5, 6, 7, 8, 9])
x[::2]                            # array([0, 2, 4, 6, 8])
x[1::2]                           # array([1, 3, 5, 7, 9])
x[::-1]                           # array([9, 8, 7, 6, 5, 4, 3, 2, 1, 0])
X[:2, :3]                         # array([[0, 1, 2],
                                  #        [5, 6, 7]])
X[:2][:3]                         # different result: in numpy, use "," for multi-dimensional indexing
                                  # array([[0, 1, 2, 3, 4],
                                  #        [5, 6, 7, 8, 9]])

X[:2, ::2]                        # array([[0, 2, 4],
                                  #        [5, 7, 9]])
X[::-1, ::-1]                     # array([[14, 13, 12, 11, 10],
                                  #        [ 9,  8,  7,  6,  5],
                                  #        [ 4,  3,  2,  1,  0]])
X[0, :]                           # array([0, 1, 2, 3, 4]) 
X[:, 0]                           # array([ 0,  5, 10])
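Why `X[:2][:3]` differs from `X[:2, :3]` becomes clearer when the chained form is split into two steps. A minimal sketch, with illustrative variable names:

```python
import numpy as np

X = np.arange(15).reshape((3, 5))

step1 = X[:2]        # first two rows, shape (2, 5)
step2 = step1[:3]    # "first three rows" of a 2-row array: still both rows

print(np.array_equal(X[:2][:3], step2))   # True
print(X[:2][:3].shape)                    # (2, 5)
print(X[:2, :3].shape)                    # (2, 3)
```

Each pair of brackets slices along the first axis again, so only the comma form reaches the column axis.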

1.2.3. Merging and splitting numpy.array

import numpy as np
# Merging numpy.array
x = np.array([1, 2, 3])
y = np.array([3, 2, 1])
np.concatenate([x, y])                           # array([1, 2, 3, 3, 2, 1])
A = np.array([[1, 2, 3],
              [4, 5, 6]])
np.concatenate([A, A])                           # array([[1, 2, 3],
                                                 #        [4, 5, 6],
                                                 #        [1, 2, 3],
                                                 #        [4, 5, 6]])
np.concatenate([A, A], axis=1)
                                                 # array([[1, 2, 3, 1, 2, 3],
                                                 #        [4, 5, 6, 4, 5, 6]])
z = np.array([666, 666, 666])                    # z was undefined; this row matches the output shown
np.vstack([A, z])                                # array([[  1,   2,   3],
                                                 #        [  4,   5,   6],
                                                 #        [666, 666, 666]])
B = np.full((2,2), 100)
np.hstack([A, B])                                # array([[  1,   2,   3, 100, 100],
                                                 #        [  4,   5,   6, 100, 100]])
# Splitting numpy.array
x = np.arange(10)                                # array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
x1, x2, x3 = np.split(x, [3, 7])                 # x1 array([0, 1, 2])
                                                 # x2 array([3, 4, 5, 6])
                                                 # x3 array([7, 8, 9])
x1, x2 = np.split(x, [5])                        # x1 array([0, 1, 2, 3, 4])
                                                 # x2 array([5, 6, 7, 8, 9])
    
A = np.arange(16).reshape((4, 4))                # array([[ 0,  1,  2,  3],
                                                 #        [ 4,  5,  6,  7],
                                                 #        [ 8,  9, 10, 11],
                                                 #        [12, 13, 14, 15]])
A1, A2 = np.split(A, [2])                        # A1 array([[0, 1, 2, 3],
                                                 #           [4, 5, 6, 7]])
                                                 # A2 array([[ 8,  9, 10, 11],
                                                 #           [12, 13, 14, 15]])
A1, A2 = np.split(A, [2], axis=1)                # A1 array([[ 0,  1],
                                                 #           [ 4,  5],
                                                 #           [ 8,  9],
                                                 #           [12, 13]])
                                                 # A2 array([[ 2,  3],
                                                 #           [ 6,  7],
                                                 #           [10, 11],
                                                 #           [14, 15]])
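Splitting and merging are inverse operations, which the following sketch checks. `np.hsplit` and `np.vsplit` are numpy shortcuts for `np.split` along axis 1 and axis 0; the variable names are illustrative.

```python
import numpy as np

x = np.arange(10)
parts = np.split(x, [3, 7])              # cut before index 3 and before index 7
print([p.tolist() for p in parts])       # [[0, 1, 2], [3, 4, 5, 6], [7, 8, 9]]

restored = np.concatenate(parts)
print(np.array_equal(restored, x))       # True

A = np.arange(16).reshape((4, 4))
left, right = np.hsplit(A, [2])          # same as np.split(A, [2], axis=1)
upper, lower = np.vsplit(A, [2])         # same as np.split(A, [2], axis=0)

print(np.array_equal(np.hstack([left, right]), A))    # True
print(np.array_equal(np.vstack([upper, lower]), A))   # True
```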

1.3. Operations in numpy.array

import numpy as np
X = np.arange(1, 16).reshape((3, 5))
# Add 1 to every element of the array
X + 1
# Subtract 1 from every element
X - 1
# Multiply every element by 2
X * 2
# Divide every element by 2
X / 2
# Floor-divide every element by 2
X // 2
# Mathematical functions
np.abs(X)
np.sin(X)
np.cos(X)
np.tan(X)
np.arctan(X)
np.exp(X)
np.exp2(X)
np.power(3, X)
np.log(X)
np.log2(X)
np.log10(X)

# Matrix operations
A = np.arange(4).reshape(2, 2)
B = np.full((2, 2), 10)
# Matrix addition
A + B
# Matrix subtraction
A - B
# Matrix multiplication
A.dot(B)
# Matrix transpose
A.T
# Matrix inverse
np.linalg.inv(A)
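A common source of confusion is that `*` on arrays is element-wise, while `dot` is the matrix product. The sketch below contrasts the two on the same `A` and `B` and checks the inverse; the variable names are illustrative.

```python
import numpy as np

A = np.arange(4).reshape(2, 2)      # [[0, 1], [2, 3]]
B = np.full((2, 2), 10)

elementwise = A * B                 # multiplies matching entries
matrix_product = A.dot(B)           # row-by-column product, same as A @ B

print(elementwise.tolist())         # [[0, 10], [20, 30]]
print(matrix_product.tolist())      # [[10, 10], [50, 50]]

# A matrix times its inverse gives the identity (up to floating-point error).
A_inv = np.linalg.inv(A)
print(np.allclose(A.dot(A_inv), np.eye(2)))   # True
```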

1.4. Aggregation operations in numpy

Note: axis names the dimension that is collapsed (compressed) by the aggregation

sum

np.sum

X = np.arange(16).reshape(4,-1)
np.sum(X)                # 120
np.sum(X, axis=0)        # array([24, 28, 32, 36])
np.sum(X, axis=1)        # array([ 6, 22, 38, 54])

minimum value

np.min

maximum value

np.max

product

np.prod

mean

np.mean

median

np.median

percentile

np.percentile

variance

np.var

standard deviation

np.std
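The functions listed above all accept the same `axis` argument as `np.sum`. A short sketch on the same 4x4 matrix, with illustrative variable names:

```python
import numpy as np

X = np.arange(16).reshape(4, -1)     # 4x4 matrix holding 0..15

col_sums = np.sum(X, axis=0)         # axis 0 (rows) collapsed: one value per column
row_max = np.max(X, axis=1)          # axis 1 (columns) collapsed: one value per row
row_prod = np.prod(X + 1, axis=1)    # product of each row of the values 1..16

print(col_sums.shape)                       # (4,)
print(row_max)                              # [ 3  7 11 15]
print(row_prod)                             # [   24  1680 11880 43680]
print(np.min(X), np.mean(X))                # 0 7.5
print(np.median(X), np.percentile(X, 50))   # 7.5 7.5
print(np.isclose(np.std(X) ** 2, np.var(X)))   # True (variance is 21.25)
```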

1.5. arg operations in numpy

import numpy as np
# Indices: argmin returns the position of the minimum (values below vary per run)
x = np.random.normal(0, 1, 1000000)
np.argmin(x)                         # 886266
x[886266]                            # -4.8354963762015108
np.min(x)                            # -4.8354963762015108
# Sorting and using indices
x = np.arange(16)
np.random.shuffle(x)                 # shuffles in place and returns None; x is now, for example,
                                     # array([13,  2,  6,  7, 11, 10,  3,  4,  8,  0,  5,  1,  9, 14, 12, 15])
np.sort(x)                           # array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15])
np.argsort(x)                        # array([ 9, 11,  1,  6,  7, 10,  2,  3,  8, 12,  5,  4, 14,  0, 13, 15])
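The relationship between the `arg` functions and their plain counterparts can be verified with a short sketch. The seed value is arbitrary and only there to make the run repeatable.

```python
import numpy as np

np.random.seed(0)                 # arbitrary seed, for repeatability only
x = np.arange(16)
np.random.shuffle(x)              # shuffles in place, returns None

order = np.argsort(x)             # indices that would sort x
print(np.array_equal(x[order], np.sort(x)))   # True
print(x[np.argmin(x)] == np.min(x))           # True
print(x[np.argmax(x)] == np.max(x))           # True

# The first k entries of the argsort result locate the k smallest values.
smallest_three = x[order[:3]]
print(smallest_three.tolist())                # [0, 1, 2]
```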

1.6. Comparison and Fancy Indexing in numpy

1.6.1. Fancy Indexing

import numpy as np

x = np.arange(16)
# Take the fourth element
x[3]                              # 3
# Take the 4th through 9th elements
x[3:9]                            # array([3, 4, 5, 6, 7, 8])
x[3:9:2]                          # array([3, 5, 7])
# Select elements by a list of indices
ind = [3, 5, 7]
x[ind]                            # array([3, 5, 7])
X = x.reshape(4, -1)              # array([[ 0,  1,  2,  3],
                                  #       [ 4,  5,  6,  7],
                                  #       [ 8,  9, 10, 11],
                                  #       [12, 13, 14, 15]])
row = np.array([0, 1, 2])
col = np.array([1, 2, 3])
X[row, col]                       # array([ 1,  6, 11])
X[0, col]                         # array([1, 2, 3])
X[:2, col]                        # array([[1, 2, 3],
                                  #        [5, 6, 7]])
col = [True, False, True, True]
X[0, col]                         # array([0, 2, 3])
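Note that `X[row, col]` pairs the two index arrays element by element rather than taking every row/column combination. The sketch below shows both behaviours; `np.ix_` is the numpy helper for the full block, and the variable names are illustrative.

```python
import numpy as np

X = np.arange(16).reshape(4, -1)
row = np.array([0, 1, 2])
col = np.array([1, 2, 3])

paired = X[row, col]                      # picks positions (0,1), (1,2), (2,3)
manual = np.array([X[r, c] for r, c in zip(row, col)])
print(np.array_equal(paired, manual))     # True

# For the full 3x3 block of those rows and columns, np.ix_ can be used.
block = X[np.ix_(row, col)]
print(block.tolist())                     # [[1, 2, 3], [5, 6, 7], [9, 10, 11]]
```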

1.6.2. Comparison results using numpy.array

import numpy as np
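# Comparison operations
x = np.arange(16)
# Count the elements with x <= 3: x <= 3 is a boolean array, and count_nonzero counts its True (non-zero) entries
np.count_nonzero( x <= 3)       # 4
np.sum(x <= 3)                  # 4
# Is there any element equal to 0?
np.any(x == 0)                  # True
# Are all elements greater than or equal to 0?
np.all(x >= 0)                  # True
# Count the elements with 3 < x < 10
np.sum((x > 3) & (x < 10))      # 6
# Count the elements that are even or greater than 10
np.sum((x % 2 == 0) | (x > 10)) # 11
# Negation
np.sum(~(x == 0))               # 15

# Comparison results combined with Fancy Indexing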
x < 5                           # array([ True,  True,  True,  True,  True, False, False, False, False,
                                #        False, False, False, False, False, False, False], dtype=bool)
x[x < 5]                        # array([0, 1, 2, 3, 4])
x[x % 2 == 0]                   # array([ 0,  2,  4,  6,  8, 10, 12, 14])
X = x.reshape(4, -1)            # the same 4x4 matrix as in 1.6.1
X[X[:,3] % 3 == 0, :]           # array([[ 0,  1,  2,  3],
                                #        [12, 13, 14, 15]])
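A boolean mask can be stored in a variable and reused for counting, selecting, and assigning. A minimal sketch, with illustrative variable names:

```python
import numpy as np

x = np.arange(16)

# Element-wise logic needs &, | and ~ with parentheses;
# Python's `and` / `or` do not work element by element on arrays.
mask = (x > 3) & (x < 10)

print(mask.dtype)             # bool
print(np.sum(mask))           # 6
print(x[mask].tolist())       # [4, 5, 6, 7, 8, 9]

# The same mask can be used on the left-hand side of an assignment.
clipped = x.copy()
clipped[~mask] = 0
print(clipped.tolist())       # [0, 0, 0, 0, 4, 5, 6, 7, 8, 9, 0, 0, 0, 0, 0, 0]
```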

2. Basic concepts of machine learning

2.1. Data

The well-known iris data set is used as the example here.
[Figure: the iris data set]

  • The data as a whole is called a data set
  • Each row of data is called a sample
  • Except for the last column, each column expresses one feature of the sample
  • The last column is called the label
    The i-th sample (row i) is written $X^{(i)}$, the j-th feature value of the i-th sample is written $X_j^{(i)}$, and the label of the i-th sample is written $y^{(i)}$.

Features can also be very abstract. Every pixel of an image is a feature. A 28×28 image has 28×28=784 features.
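The terminology above maps directly onto array slicing. The sketch below uses a tiny made-up table (the numbers are illustrative, not real iris measurements) with four feature columns followed by a label column, and a blank 28×28 array standing in for an image.

```python
import numpy as np

# Made-up data set: each row is a sample, the last column is the label.
data = np.array([
    [1.0, 2.0, 3.0, 4.0, 0],
    [5.0, 6.0, 7.0, 8.0, 1],
    [9.0, 8.0, 7.0, 6.0, 2],
])

X = data[:, :-1]               # feature matrix: every column except the last
y = data[:, -1].astype(int)    # label vector: the last column

print(X.shape, y.shape)        # (3, 4) (3,)
print(X[1])                    # features of the second sample (0-based index 1)
print(X[1, 2], y[1])           # its third feature value and its label: 7.0 1

# An image becomes a sample by flattening its pixels into one feature vector.
image = np.zeros((28, 28))
features = image.reshape(-1)
print(features.shape)          # (784,)
```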

2.2. Basic tasks of machine learning

2.2.1. Classification tasks

In binary classification problems, the positive class is usually labeled 1 and the negative class 0 or -1.

  • Binary classification tasks
  1. Deciding whether an email is spam or not
  2. Deciding whether issuing a credit card to a customer is risky or not
  3. Deciding whether a patient's tumor is benign or malignant
  4. Deciding whether a stock will rise or fall
  • Multi-class classification tasks
  1. Digit recognition
  2. Image recognition
  3. Rating the risk of issuing a credit card to a customer

2.2.2. Regression tasks

  • The result is a continuous numeric value, not a category
  1. Housing prices
  2. Market analysis
  3. Student grades
  4. Stock prices

2.2.3. The difference between classification and regression

  1. Different outputs
  • The output of a classification task is the category an object belongs to; the output of a regression task is a numeric value for the object

Suppose the weather falls into three categories: sunny, cloudy, and rainy. Knowing the weather today and on previous days, we predict the weather for tomorrow and the following days, for example cloudy tomorrow and sunny next Monday. This is classification.

Now suppose we know the temperature today and on the previous few days and want to use it to predict future temperatures. For any future moment we can predict a temperature value, and the method used to obtain that value is regression.

  • The output of a classification problem is discrete, and the output of a regression problem is continuous

Here "discrete" and "continuous" are used loosely, not in the strict mathematical sense.

  • The output of a classification problem is qualitative, and the output of a regression problem is quantitative

Qualitative means determining what something is or what it is made of. This generally does not require measuring any exact numerical quantity.

Quantitative means determining the exact numerical amount of something. This generally does not require identifying what that thing is.

For example, "this is a glass of water" is a qualitative statement, while "this glass holds 10 milliliters of water" is a quantitative one.

  2. Different purposes

The purpose of classification is to find a decision boundary: the classification algorithm learns a decision surface that separates the classes in the data set.
The purpose of regression is to find the best fit: the regression algorithm learns a line (or curve) that comes as close as possible to every point in the data set.
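To make the contrast concrete, here is a rough sketch on made-up, noise-free data. The line is fitted with `np.polyfit`; the decision boundary at 2.5 is hand-picked for illustration rather than learned, and `classify` is a hypothetical helper.

```python
import numpy as np

# Regression: fit a straight line to made-up points that lie on y = 2x + 1.
x_points = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_points = 2.0 * x_points + 1.0
slope, intercept = np.polyfit(x_points, y_points, deg=1)
predicted_value = slope * 5.0 + intercept          # continuous output, about 11.0

# Classification: everything on one side of the boundary gets class 1,
# everything on the other side gets class 0 (boundary chosen by hand here).
def classify(x, boundary=2.5):
    return np.where(x > boundary, 1, 0)

predicted_class = classify(np.array([1.0, 4.0]))   # discrete output

print(round(float(predicted_value), 6))   # 11.0
print(predicted_class.tolist())           # [0, 1]
```

The regression output can be any number on the fitted line, while the classifier can only answer with one of the predefined categories.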

[Figure: general steps in machine learning]

2.3. Classification of machine learning methods

  • Supervised learning
    The training data given to the machine has "labels" or "answers"
  1. k-nearest neighbors
  2. Linear regression and polynomial regression
  3. Logistic regression
  4. SVM
  5. Decision trees and random forests
  • Unsupervised learning
    Groups data that has no "labels", e.g. cluster analysis
  1. Principal component analysis (PCA)
  • Semi-supervised learning
    Part of the data has "labels" or "answers", and the rest does not
    Usually, unsupervised learning methods are used to process the data first, and supervised learning methods are then used for model training and prediction
  • Reinforcement learning
    The learner takes actions based on its surrounding environment and adjusts its behavior according to the results of those actions

2.4. Other classifications of machine learning

  • Online learning and batch learning (offline learning)

  • Parametric learning and non-parametric learning

  1. Parametric learning
    Once the parameters have been learned, the original data set is no longer needed
  2. Non-parametric learning
    Does not make too many assumptions about the model; this does not mean the model has no parameters
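One way to picture the parametric / non-parametric distinction is the sketch below. It assumes a fitted straight line as the parametric model and a 1-nearest-neighbor lookup as a common example of a non-parametric method; the data is made up and both helper functions are hypothetical.

```python
import numpy as np

x_train = np.array([0.0, 1.0, 2.0, 3.0])
y_train = np.array([1.0, 3.0, 5.0, 7.0])    # made-up points on y = 2x + 1

# Parametric: after fitting, only the two learned parameters are needed.
slope, intercept = np.polyfit(x_train, y_train, deg=1)

def predict_parametric(x):
    return slope * x + intercept

# Non-parametric (1-nearest neighbor): every prediction consults the
# training data itself, so the data set must be kept around.
def predict_nearest(x):
    nearest = np.argmin(np.abs(x_train - x))
    return y_train[nearest]

print(round(float(predict_parametric(1.5)), 6))   # 4.0
print(predict_nearest(1.2))                       # 3.0
```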

Reference: https://blog.csdn.net/shuiyixin/article/details/88816416

Origin blog.csdn.net/qq_45723275/article/details/123653140