Machine Learning (Lecture Notes)

Chapter 1 Overview of Machine Learning

Basic concepts of machine learning

Machine learning: machine learning refers to computer programs that allow a system to automatically learn from experience (data) and improve itself without being explicitly programmed by humans.

ps: Machine learning is a subcategory of artificial intelligence

Related concepts

Supervised learning and unsupervised learning

Supervised learning is a class of machine learning algorithms that build a model from an existing set of labeled training data (hereinafter the training set), and then use the model to classify new data samples or perform regression analysis on them.

Unsupervised learning is a method of analyzing unlabeled data and building an appropriate model to provide solutions to problems, without relying on a labeled training set.

Classification and regression

Classification and regression are both concepts in supervised learning. Classification predicts which category the sample belongs to, while regression predicts the numerical value of the target field of the sample.

Datasets and features

The data set (dataset) is the raw material of a prediction system: the historical data used to train the machine learning model. A dataset consists of a number of records, and each record contains several features. Features describe the attributes of each sample in the dataset and are sometimes also called "fields".

feature engineering

Feature engineering is the process, carried out before building a predictive model, in which we analyze, clean and structure the features of the data.

Overfitting and underfitting

When a learner learns the training samples "too well", it may treat characteristics specific to the training samples themselves as general properties that all potential samples have, which leads to a decline in generalization performance. This phenomenon is called overfitting.

Underfitting means that the general properties of the training samples have not been learned well, so performance is poor on both the training set and the test set.

Chapter 2 Installation and use of machine learning tools

Omitted (see pandas notes for pandas)

Chapter ? (the chapter numbering is confusing, so I'm just taking notes directly)

K nearest neighbor algorithm:

principle:

k-Nearest Neighbor (kNN) learning is a commonly used supervised learning method. The principle is very simple: for a given test sample, find the k closest samples in the training set according to a chosen distance metric, and then make a prediction based on the information from these k "neighbors". In classification tasks the "voting method" is usually used: the class label that appears most often among the k neighbors is taken as the prediction. In regression tasks the "averaging method" is used: the average of the real-valued outputs of the k neighbors is taken as the prediction. Weighted voting or weighted averaging can also be used, where samples that are closer are given greater weight.

Thought:

In the feature space, find the K labeled samples closest to the sample to be classified (its K nearest neighbors), and, using the labels of these samples as a reference, assign the most frequent class label among them to the sample via voting or similar methods.

Easy-to-understand notes:

Given a test sample, find the trained (labeled) points nearest to it (within a certain distance, or a fixed number of them), count their labels, and mark the sample with the label that occurs most frequently.

Example:

K nearest neighbor (classification):
# import the necessary libraries
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
# generate a dataset
data = make_blobs(n_samples=100, centers=2, random_state=9)
# assign the features and labels to X and y
X, y = data
# visualize with a scatter plot
plt.scatter(X[y==1, 0], X[y==1, 1], cmap=plt.cm.spring, edgecolor='k', marker='^')
plt.scatter(X[y==0, 0], X[y==0, 1], cmap=plt.cm.spring, edgecolor='k', marker='o')
# show the figure
plt.show()
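The block above only generates and plots the data; it never fits the KNeighborsClassifier it imports. A minimal sketch of completing the example (continuing from the X and y above; the split and the query point are my assumptions):

# split the blobs into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=9)
# create a KNN classifier (default n_neighbors=5)
clf = KNeighborsClassifier()
# fit on the training data
clf.fit(X_train, y_train)
# accuracy on the held-out data
print('test accuracy: {:.3f}'.format(clf.score(X_test, y_test)))
# predict the class of one new point (coordinates are made up)
print(clf.predict([[0.0, 5.0]]))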
KNN regression:
# import the make_regression dataset generator
from sklearn.datasets import make_regression
import numpy as np
import matplotlib.pyplot as plt
# generate a dataset with 1 feature and a noise level of 30
X, y = make_regression(n_features=1, n_informative=1, noise=30, random_state=5)
# visualize the data points with a scatter plot
# plt.scatter(X, y, c='b', edgecolor='k')
# plt.show()
# import the KNN model used for regression analysis
from sklearn.neighbors import KNeighborsRegressor
reg = KNeighborsRegressor()
# fit the data with the KNN model
reg.fit(X, y)
# visualize the predictions
z = np.linspace(-2.5, 2.5, 200).reshape(-1, 1)
plt.scatter(X, y, c='b', edgecolor='k')
plt.plot(z, reg.predict(z), c='r', linewidth=3)
# add a title to the figure
plt.title('KNN Regressor')
plt.show()

Naive Bayes:

The British scholar Bayes proposed Bayes' theorem (published in 1763), a theorem about the conditional probabilities of random events A and B: P(A|B) is the probability that A happens given that B has happened, and P(B|A) is the probability that B happens given that A has happened.

1. The Naive Bayes algorithm is based on Bayes' theorem: under the assumption that "all features are mutually independent", it estimates the posterior probability from the prior probability and a correction factor (the likelihood).

2. The characteristics of the Naive Bayes algorithm are: the principle is simple, it is easy to implement, the classification process is efficient, and it consumes little time.

3. There are three common Naive Bayes variants:

(1) Bernoulli Naive Bayes, which is suitable for datasets whose features follow a Bernoulli (binary) distribution.

(2) Gaussian Naive Bayes, which is suitable for datasets whose features roughly follow a Gaussian distribution (or can be transformed into one); it performs relatively better on datasets with a larger number of samples.

(3) Multinomial Naive Bayes, which is suitable for datasets whose features follow a multinomial distribution (or can be converted to one); its performance on datasets with few samples is not bad.

Formula:

Given several samples X from a system, estimate the system's parameter θ, namely:

P(θ): the probability of θ before any data is observed: the prior probability. P(θ|X): the probability of θ given the data X: the posterior probability. P(X|θ): the probability distribution of the data given the parameter θ: the likelihood function. These quantities are related by Bayes' theorem: P(θ|X) = P(X|θ)P(θ) / P(X).

PS: Given two events A and B, if P(AB) = P(A)P(B), then A and B are said to be mutually independent. If A and B are independent, then P(A|B) = P(A).
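As a quick numerical check of the theorem (all of the numbers below are made up purely for illustration, they are not from the notes), a few lines of Python:

# toy numbers (assumed for illustration): 1% of fruit are bad, a bad fruit
# looks irregular 80% of the time, a good fruit looks irregular 10% of the time
p_bad = 0.01
p_irregular_given_bad = 0.80
p_irregular_given_good = 0.10
# total probability of observing an irregular shape
p_irregular = p_irregular_given_bad * p_bad + p_irregular_given_good * (1 - p_bad)
# Bayes' theorem: posterior probability that a fruit is bad, given it looks irregular
p_bad_given_irregular = p_irregular_given_bad * p_bad / p_irregular
print('{:.3f}'.format(p_bad_given_irregular))  # roughly 0.075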

Code:

# import pandas
import pandas as pd
# read the saved csv file; replace the path and file name with your own
data = pd.read_csv('水果.csv', encoding='gbk')  # 水果.csv is uploaded separately among my uploaded materials
# check whether it was read successfully
data

tidy:

data['大小'] = data['大小'].replace({'大': 1, '小': 0})
data['颜色'] = data['颜色'].replace({'红色': 1, '青色': 0})
data['形状'] = data['形状'].replace({'圆形': 1, '非规则': 0})
data['好果'] = data['好果'].replace({'是': 1, '否': 0})
data

core:

# import Bernoulli Naive Bayes
from sklearn.naive_bayes import BernoulliNB
# define the sample features X and the class labels y
X = data.drop(['好果'], axis=1)
y = data['好果']
# create a Bernoulli Naive Bayes classifier
clf = BernoulliNB()
# since the number of samples is very small, we do not split into training and validation sets here
# train the classifier with X and y
clf.fit(X, y)
# check the classifier's accuracy
clf.score(X, y)
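A quick usage sketch of the fitted classifier (this assumes the only feature columns left after dropping 好果 are 大小, 颜色 and 形状, and the sample values below are made up):

# a new fruit, encoded the same way as the training data:
# 大小=1 (large), 颜色=1 (red), 形状=0 (irregular)
new_fruit = pd.DataFrame([[1, 1, 0]], columns=X.columns)
print(clf.predict(new_fruit))        # predicted label: 1 = good fruit, 0 = not
print(clf.predict_proba(new_fruit))  # posterior probability of each class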

Model score:

# import the dataset generation tool
from sklearn.datasets import make_blobs
# import the dataset splitting tool
from sklearn.model_selection import train_test_split
# import Bernoulli Naive Bayes (repeated here so this block runs on its own)
from sklearn.naive_bayes import BernoulliNB
# generate a dataset with 400 samples and 4 classes
X, y = make_blobs(n_samples=400, centers=4, random_state=8)
# split the dataset into a training set and a validation set, fixing the random state to 8
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=8)
# fit the data with Bernoulli Naive Bayes
nb = BernoulliNB()
nb.fit(X_train, y_train)
print('Model score: {:.3f}'.format(nb.score(X_test, y_test)))
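Since the blob features here are continuous rather than binary, point (2) above suggests Gaussian Naive Bayes should suit this data better. A small comparison on the same split (my addition, not from the original notes):

from sklearn.naive_bayes import GaussianNB
# fit a Gaussian Naive Bayes model on the same training data
gnb = GaussianNB()
gnb.fit(X_train, y_train)
# compare the validation scores of the two models
print('BernoulliNB score: {:.3f}'.format(nb.score(X_test, y_test)))
print('GaussianNB score: {:.3f}'.format(gnb.score(X_test, y_test)))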

Decision tree and random forest algorithms

Decision tree:

Decisions are made by making a series of "yes" or "no" judgments on sample characteristics.

The concept of entropy: quantifying the degree of uncertainty

Let p_k denote the proportion of samples of the k-th class in data set D.
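With that notation, the information entropy of D is defined in the usual textbook form (the formula itself did not survive in these notes, so this is the standard definition):

$$\mathrm{Ent}(D) = -\sum_{k=1}^{|\mathcal{Y}|} p_k \log_2 p_k$$

The smaller Ent(D) is, the purer D is.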

Information gain:
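The notes omit the formula; the standard definition, for a discrete attribute a with possible values {a^1, ..., a^V}, where D^v is the subset of D taking value a^v on a, is:

$$\mathrm{Gain}(D,a) = \mathrm{Ent}(D) - \sum_{v=1}^{V} \frac{|D^v|}{|D|}\,\mathrm{Ent}(D^v)$$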

ps: The attribute with the largest information gain gives the best split.

Random Forest

Principle:

Build a forest in a random way: the random forest algorithm consists of many decision trees, and there is no correlation between the individual trees. After the forest is built, when a new sample comes in, each decision tree makes its own judgment, and the classification result is then given by majority voting.

Random forest is an extended variant of Bagging. On top of a Bagging ensemble built from decision-tree base learners, it further introduces random feature selection during the training of each decision tree. Random forest can therefore be summarized as four parts: 1. randomly select samples (sampling with replacement); 2. randomly select features; 3. build the decision trees; 4. vote across the forest (or average, for regression).
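A minimal sklearn sketch of both models (the dataset and hyperparameters below are my assumptions, chosen only to show the calls):

from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# a small synthetic classification dataset
X, y = make_blobs(n_samples=400, centers=4, random_state=8)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=8)

# a single decision tree that splits by information gain (criterion='entropy')
tree = DecisionTreeClassifier(criterion='entropy', random_state=8)
tree.fit(X_train, y_train)
print('decision tree score: {:.3f}'.format(tree.score(X_test, y_test)))

# a random forest of 100 trees: bootstrap samples + random features + voting
forest = RandomForestClassifier(n_estimators=100, random_state=8)
forest.fit(X_train, y_train)
print('random forest score: {:.3f}'.format(forest.score(X_test, y_test)))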

Linear regression--gradient descent method (search-based optimization method)

gradient descent method
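The update rule implemented in the code below is the standard gradient descent step (my summary of what the loop does, not a formula copied from the notes):

$$x_{t+1} = x_t - \alpha \, f'(x_t)$$

where α is the learning rate; iteration stops when the change in f(x) falls below a tolerance or a maximum number of iterations is reached.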

import numpy as np
import matplotlib.pyplot as plt

# the function to minimize and its derivative
def f(xx):
    return xx**2 - 5*xx + 6

def df(xx):
    return 2*xx - 5

x0 = -6        # starting point
alpha = 0.1    # learning rate; other values tried: 1.2, 0.01, 0.5, 0.8
iter = 0
x_list = [x0]
while True:
    g = df(x0)            # gradient at the current point
    x = x0 - alpha*g      # gradient descent step
    if abs(f(x) - f(x0)) < 1e-15:
        break
    x_list.append(x)
    x0 = x
    iter += 1
    if iter > 1000:
        break
# plot the function curve and the path of the iterates
# (the range for xx was not defined in the original notes; it is chosen here to cover the iterates)
xx = np.linspace(-6, 11, 200)
yy = f(xx)
t = np.array(x_list)
plt.plot(xx, yy)
plt.plot(t, f(t), marker='+', color='r')
plt.show()
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_regression
from mpl_toolkits.mplot3d import Axes3D

# generate a regression dataset with a single feature
X, y = make_regression(n_samples=300, n_features=1, noise=10, bias=3, random_state=123)


# mean squared error loss of the line y = a*x + b
def L(a, b, X, y):
    x = X.ravel()
    return (1.0/y.size) * np.sum((a*x + b - y) * (a*x + b - y))


# partial derivative of the loss with respect to a
def La(a, b, X, y):
    x = X.ravel()
    return (2.0/y.size) * np.sum((a*x + b - y) * x)


# partial derivative of the loss with respect to b
def Lb(a, b, X, y):
    x = X.ravel()
    return (2.0/y.size) * np.sum(a*x + b - y)


a0 = 50
b0 = 30
alpha = 0.1
iter = 0
ab_list = [[a0, b0]]
while True:
    la = La(a0, b0, X, y)
    lb = Lb(a0, b0, X, y)
    # gradient descent step for both parameters
    a = a0 - alpha*la
    b = b0 - alpha*lb
    if abs(L(a, b, X, y) - L(a0, b0, X, y)) < 1e-15:
        break
    ab_list.append([a, b])
    a0 = a
    b0 = b
    iter += 1
    if iter > 1000:
        break
# evaluate the loss on a grid of (a, b) values to draw the loss surface
A = np.linspace(-12, 52, num=50)
B = np.linspace(-30, 32, num=50)
AA, BB = np.meshgrid(A, B)
ZZ = np.array([L(a, b, X, y) for a, b in zip(AA.ravel(), BB.ravel())]).reshape(AA.shape)
%matplotlib qt
# draw the loss surface and the descent path on it
ax3d = Axes3D(plt.figure())
ax3d.plot_surface(AA, BB, ZZ, cmap='jet')
aa = np.array(ab_list)[:, 0]
bb = np.array(ab_list)[:, 1]
cc = [L(a, b, X, y) for (a, b) in zip(aa, bb)]
ax3d.plot3D(aa, bb, cc, color='r', marker='+')

Normalization:

Normalization converts all data into numbers within [0,1] or [-1,1]. Its purpose is to remove the differences in order of magnitude between the data of different dimensions, avoiding the large network prediction errors that would otherwise be caused by input and output data of very different magnitudes. The main reasons for normalizing are listed below (a minimal sketch of min-max normalization follows the list).

  1. For the convenience of subsequent data processing; normalization can avoid some unnecessary numerical problems.

  2. To make the program converge faster when it runs.

  3. Unified dimensions. The sample data may be measured against different criteria, so it needs to be made dimensionless so that the evaluation criteria are unified. This can be considered a requirement at the application level.

  4. To avoid neuron saturation. When a neuron's activation is close to 0 or 1, the gradient in these regions is almost 0, so during back-propagation the local gradient will be close to 0, which is very unfavorable for network training.

  5. To ensure that small values in the output data are not swallowed.

  PS: The picture the teacher gave still confuses me a little, so I am just putting it here for now.
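A minimal sketch of min-max normalization, the most common way of scaling features to [0, 1]; the data here is made up and sklearn's MinMaxScaler is used only as one convenient implementation:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# two features on very different scales (made-up values)
X = np.array([[1.0, 2000.0],
              [2.0, 3000.0],
              [3.0, 5000.0]])
# applies x' = (x - min) / (max - min) column by column
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
print(X_scaled)  # every column now lies in [0, 1]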

Ridge Regression Algorithm

multiple linear regression

When studying real-world problems, the variation of the dependent variable is often influenced by several important factors. In that case, two or more influencing factors have to be used as independent variables to explain the variation of the dependent variable; this is multiple regression. When the relationship between the several independent variables and the dependent variable is linear, the regression analysis performed is multiple linear regression. The mathematical model of linear regression is:
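The notes break off at this point, so the following is just the standard form of the multiple linear regression model:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \varepsilon$$

where the β_i are the regression coefficients and ε is a random error term.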

Ridge regression
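The notes stop before this part is filled in. As a placeholder under that caveat: ridge regression is ordinary least squares with an added L2 penalty on the coefficients, minimizing $\|Xw - y\|_2^2 + \alpha\|w\|_2^2$, and a minimal sklearn sketch (the dataset and the value of alpha are assumptions for illustration) looks like this:

from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# a synthetic regression dataset
X, y = make_regression(n_samples=200, n_features=10, noise=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# alpha controls the strength of the L2 penalty; larger alpha shrinks the coefficients more
ridge = Ridge(alpha=1.0)
ridge.fit(X_train, y_train)
print('ridge test score: {:.3f}'.format(ridge.score(X_test, y_test)))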

That's all for now; I'll add more when I have time.

Origin: blog.csdn.net/hello__D/article/details/129363685