[PaddlePaddle] [Study Notes] PaddlePaddle Official Documentation: ① Building a Neural Network Model with Python and NumPy; ② Predicting Boston Housing Prices with PaddlePaddle

1. Overview of machine learning and deep learning

1.1 The relationship between artificial intelligence, machine learning, and deep learning

In recent years the concepts of artificial intelligence, machine learning, and deep learning have been extremely popular, yet many practitioners find it hard to explain how they relate to one another, and outsiders are even more confused. Before studying deep learning, let us first set these three concepts straight. In short, the technical scope covered by artificial intelligence, machine learning, and deep learning narrows layer by layer; their relationship is shown in the figure below, namely: artificial intelligence > machine learning > deep learning.

[Figure: the relationship among artificial intelligence, machine learning, and deep learning (AI > ML > DL)]

Artificial Intelligence (AI) is the broadest concept: a new technical science that studies the theories, methods, technologies, and application systems for simulating, extending, and expanding human intelligence. Because this definition only states a goal without restricting the methods, there are many approaches to and branches of realizing AI, making it something of a "catch-all" discipline. Machine Learning (ML) is currently a relatively effective way to realize artificial intelligence. Deep Learning (DL) is the hottest branch of machine learning algorithms; it has made remarkable progress in recent years and has replaced most traditional machine learning algorithms.

1.2 Machine learning

Unlike artificial intelligence, machine learning, and supervised learning in particular, has a much more definite meaning. Machine learning studies how computers can simulate or implement human learning behavior in order to acquire new knowledge or skills and reorganize existing knowledge structures so as to keep improving their own performance. That sentence sounds rather foggy, so below we analyze machine learning from two angles, its implementation and its methodology, to help readers see clearly where machine learning comes from and how it works.

1.2.1 How machine learning is implemented

The implementation of machine learning can be divided into two steps, ① training and ② prediction, which are analogous to induction and deduction:

  • Induction: abstracting general rules from concrete cases; "training" in machine learning works the same way. From a certain number of samples (with known model inputs $X$ and model outputs $Y$), learn the relationship between the output $Y$ and the input $X$ (which you can think of as some kind of expression).

  • Deduction: deriving the result of a concrete case from general rules; "prediction" in machine learning works the same way. Based on the relationship between $Y$ and $X$ obtained from training, when a new input $X$ appears, compute the output $Y$. Usually, if the output computed by the model agrees with the output in the real scenario, the model is considered effective.

1.2.2 The methodology of machine learning

The methodology of machine learning is strikingly similar to the process of human scientific research. Below we use "a machine learning knowledge from the Newton's second law experiment" as an example to help readers understand more deeply the methodological essence of machine learning (supervised learning), namely determining the three key elements of a model during the "machine thinking" process:

  1. Hypothesis
  2. Evaluation
  3. Optimization
1.2.2.1 Case study: a machine learns from the Newton's second law experiment

Newton's second law was proposed by Isaac Newton in 1687 in the Mathematical Principles of Natural Philosophy. Its common statement is: the magnitude of an object's acceleration is directly proportional to the applied force and inversely proportional to the object's mass. Together with the first and third laws, Newton's second law forms Newton's laws of motion and describes the fundamental laws of motion in classical mechanics.

In secondary-school textbooks there are two experimental designs for Newton's second law: ① the inclined-slide method and ② the horizontal-pull method, as shown in the figure.

[Figure: the two experimental designs for Newton's second law (inclined slide and horizontal pull)]

Many readers probably have fond memories of doing physics experiments in their school days with pulleys and small wooden blocks. From repeated experiments, the acceleration of the block under different forces can be measured, as in the table below.

Trial    Force X    Acceleration Y
1        4          2
2        4          2
...      ...        ...
n        6          3

Observing the experimental data, it is not hard to guess that the acceleration $a$ of the object and the force $F$ should be linearly related. We therefore propose the hypothesis $a = w \cdot F$, where $a$ is the acceleration, $F$ is the force, and $w$ is the parameter to be determined.

Through training on a large amount of experimental data, the parameter $w$ is determined to be the reciprocal of the object's mass, $\frac{1}{m}$; that is, we obtain the complete model $a = F \cdot \frac{1}{m}$. Once the force acting on an object is known, its acceleration can be predicted quickly with the model. For example, if the thrust of the fuel on a rocket is $F = 10$ and the mass of the rocket is $m = 2$, the rocket's acceleration is immediately $a = 5$.
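
To make the "training" step concrete, here is a minimal sketch that fits $w$ in $a = w \cdot F$ by least squares; the force/acceleration measurements in it are made up for illustration and are not from the text.

# A minimal sketch of "training" for the hypothesis a = w * F.
# The measurements below are made up for illustration (a block of roughly 2 kg).
import numpy as np

F = np.array([4.0, 4.0, 8.0, 10.0, 6.0])   # applied forces
a = np.array([2.0, 2.1, 3.9, 5.0, 3.0])    # measured accelerations

# Least squares picks the w that best fits the observed (F, a) pairs.
w = np.sum(F * a) / np.sum(F * F)
print(f"estimated w = {w:.3f}, so estimated mass m = 1/w = {1 / w:.3f} kg")

With the fitted $w$ in hand, "prediction" is simply evaluating $a = w \cdot F$ for a new force.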

1.2.2.2 How are the model parameters determined?

This playful case illustrates the basic process of machine learning, but one key point is still unclear: how is the model parameter $w = \frac{1}{m}$ determined?

The process of determining parameters is similar to how scientists propose hypotheses: a reasonable hypothesis should explain all known observations as well as possible. If new data are later observed that contradict the hypothesis, scientists try to propose a new one. In the history of astronomy, for example, computing celestial motion with combinations of large and small circles could still fit the observations of the Middle Ages; but as the European industrial revolution made astronomical instruments ever more powerful, the existing theory could no longer explain the growing body of observations, which spurred the hypothesis of computing celestial motion with ellipses. The basic condition for a model to be valid is therefore that it can fit the known samples, and this gives us a practical scheme for learning an effective model.

[Figure: the model hypothesis H(w, x), its evaluation against the real output Y, and the optimization of the parameter w]

In the figure above, $H$ is the model hypothesis: a function of the parameter $w$ and the input $x$, written $H(w, x)$. The optimization goal of the model is for the output of $H(w, x)$ to be as consistent as possible with the real output $Y$; the difference between the two is the evaluation function of the model (the smaller the difference, the better).

The process of determining the parameters is then the process of continually reducing the evaluation function (the gap between $H$ and $Y$), until the model learns a parameter $w$ that minimizes it. The evaluation function that measures the difference between the model's predictions and the true values is also called the loss function (Loss).

The machine learns knowledge (the model parameter $w$) by trying to answer a large number of exercises (known samples) correctly, i.e., by minimizing the loss, and we expect the model $H(w, x)$ that embodies the learned knowledge to answer exam questions whose answers it has never seen (unknown samples). Minimizing the loss is the optimization objective of the model, and the method used to achieve it is called the optimization algorithm (also called the solver, since it finds the parameter values that minimize the loss function). The parameter $w$ and the input $x$ together form the basic structure of the formula, which is called the hypothesis. In the Newton's second law case, based on observation of the data we proposed a linear hypothesis: force and acceleration are linearly related and can be expressed by a linear equation. From this we can see that ① the model hypothesis, ② the evaluation function (loss / optimization objective), and ③ the optimization algorithm are the three key elements of a model.

1.2.2.3 Model structure

How do the model hypothesis, the evaluation function, and the optimization algorithm support the machine learning process? See the figure below.

[Figure: how the model hypothesis, evaluation function, and optimization algorithm support the machine learning process]

  • Model hypothesis: the world contains countless possible relationships, and probing the relationship between $Y$ and $X$ aimlessly is obviously very inefficient. The hypothesis space therefore first circumscribes the set of relationships the model can express, shown as the blue circle. The machine then searches within that circle for the optimal $Y \leftarrow X$ relationship, i.e., it determines the parameter $w$.
  • Evaluation function: before searching for the optimum, we must first define what "optimal" means, i.e., a metric for how good a given $Y \leftarrow X$ relationship is. The usual measure is how well the relationship fits the existing observed samples, with minimizing the fitting error as the optimization objective.
  • Optimization algorithm: once the evaluation metric is set, the $Y \leftarrow X$ relationship that makes the metric optimal (minimum loss / best fit to the observed samples) can be sought within the space circumscribed by the hypothesis; the method for finding this optimum is the optimization algorithm. The most naive optimization algorithm is exhaustive search: compute the loss function for every possible parameter value and keep the parameters that minimize it.

From the above, the machine learning process is essentially the same as the process of learning Newton's second law; it divides into three stages: hypothesis, evaluation, and optimization:

  • Hypothesis: from the observed data of acceleration $a$ and force $F$, assume that $a$ and $F$ are linearly related, i.e., $a = w \cdot F$.
  • Evaluation: the fit to the known observations should be good, i.e., the result of $w \cdot F$ should be as close as possible to the observed $a$.
  • Optimization: among all possible values of the parameter $w$, find that $w = \frac{1}{m}$ makes the evaluation best (fits the observed samples best).

This framework in which machines perform learning tasks reflects that the essence of learning is parameter estimation (Learning is parameter estimation).

The methodology above can be expressed more formally, as shown in the figure below: given an unknown objective function $f$ and training samples $D = (x_1, y_1), (x_2, y_2), ..., (x_n, y_n)$ as the basis, a learning algorithm $A$ finds a function $g$ from the hypothesis set $H$. If $g$ fits the training samples $D$ well, then $g$ is close to the objective function $f$.

[Figure: a learning algorithm A searches the hypothesis set H for a function g that approximates the unknown objective function f on the training data D]

On this basis, many seemingly completely different problems can be learned with the same framework, such as scientific laws, image recognition, machine translation, and automatic question answering; their learning goal is to fit a "big formula" $f$, as shown below.

[Figure: scientific laws, image recognition, machine translation, and question answering all learned as fitting a "big formula" f]

1.3 Deep learning

The theory of machine learning algorithms matured in the 1990s and achieved success in many fields, but the quiet days lasted only until around 2010. With the emergence of big data and the improvement of computing power, deep learning models appeared and greatly changed the application landscape of machine learning. Today most machine learning tasks can be solved with deep learning models, and in fields such as speech, computer vision, and natural language processing in particular, deep learning models perform significantly better than traditional machine learning algorithms.

Compared with traditional machine learning algorithms, what has deep learning improved? In fact the two are consistent in theoretical structure, namely model hypothesis, evaluation function, and optimization algorithm; the fundamental difference lies in the complexity of the hypothesis. As the second example (image recognition) in the figure above shows, for a photo of a beautiful woman the human brain receives colorful optical signals and can quickly conclude that the picture shows a beautiful woman; a computer, however, receives only a matrix of numbers. For a high-level semantic concept such as beauty, the complexity of the information transformation from pixels to high-level semantics is unimaginable. This transformation process is shown in the figure below.

[Figure: the transformation from raw image pixels to a high-level semantic concept]

This transformation can no longer be expressed using mathematical formulas, so researchers drew on the structure of human brain neurons and designed a neural network model, as shown in the figure below.

[Figure: (a) the perceptron, the basic unit of a neural network; (b) several classic neural network structures]

Figure (a) shows the design of the basic unit of a neural network, the perceptron, whose way of processing information is very similar to that of a single neuron in the human brain. Figure (b) shows several classic neural network structures (explained in detail in later chapters), analogous to the way the human brain forms organs with different functions out of vast numbers of connected neurons.

1.3.1 Basic concepts of neural networks

An artificial neural network consists of multiple layers, such as convolution layers (Convolution Layer), fully connected layers (Fully Connected Layer), and LSTM (Long Short-Term Memory) layers, and each layer contains many neurons. A nonlinear neural network with more than three layers can be called a deep neural network (DNN). Roughly speaking, a deep learning model can be viewed as a mapping function from input to output, such as the mapping from an image to the high-level semantics ("beautiful woman"); a sufficiently deep neural network can in theory fit any complex function. Neural networks are therefore very well suited to learning the intrinsic patterns and representation levels of sample data, and they work well on text, image, and speech tasks. Since these tasks are the basic building blocks of artificial intelligence, it is no surprise that deep learning is called the foundation for realizing AI. The basic structure of a neural network (NN) is shown in the figure below.

[Figure: the basic structure of a neural network]

Where:

  • Neuron: each node in a neural network is called a neuron and consists of two parts:
    1. Weighted sum: a weighted sum of all inputs.
    2. Nonlinear transformation (activation function): the result of the weighted sum is passed through a nonlinear function, giving the neuron nonlinear computational capability.
  • Multi-layer connections: a large number of such nodes arranged in different layers and connected together form a neural network.
  • Forward computation: the process of computing the output from the input, proceeding from the front of the network to the back.
  • Computational graph: a graphical display of the computational logic of a neural network is called a computational graph. The computational graph of a neural network can also be written as a formula: $Y = f_3(f_2(f_1(w_1 \cdot x_1 + w_2 \cdot x_2 + w_3 \cdot x_3 + b) + ...) + ...)$

As you can see, a neural network (NN) is not so mysterious: its essence is a "big formula" containing many parameters.
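
As a small illustration of this "big formula" view, here is a minimal sketch of forward computation through three layers; the weights, layer sizes, and ReLU/identity activations are illustrative assumptions, not taken from the text.

import numpy as np

# Forward computation as nested functions: Y = f3(f2(f1(w1*x1 + w2*x2 + w3*x3 + b) + ...) + ...)
def relu(x):
    return np.maximum(0.0, x)

x = np.array([1.0, 2.0, 3.0])                 # inputs x1, x2, x3
W1, b1 = np.array([[0.1, -0.2, 0.3]]), 0.5    # layer 1: weighted sum + bias
W2, b2 = np.array([[0.7]]), -0.1              # layer 2
W3, b3 = np.array([[1.2]]), 0.0               # output layer

h1 = relu(W1 @ x + b1)    # f1: weighted sum followed by a nonlinear transformation
h2 = relu(W2 @ h1 + b2)   # f2
y = W3 @ h2 + b3          # f3: identity output
print(y)                  # the whole network is just one nested formula with parameters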

1.3.2 The development history of deep learning

The idea of neural networks was proposed more than 70 years ago, and today's design theory of neural networks and deep learning has been perfected step by step. Over these long years of development, some shining moments of key breakthroughs are worth remembering by deep learning enthusiasts, as shown in the figure below.

[Figure: key milestones in the development history of deep learning]

  • 1940s : The structure of a neuron is first proposed, but the weights are not learnable.
  • 1950s-1960s : The weight learning theory was proposed, and the neuron (Neuron) structure tended to be perfect, opening the first golden age of neural networks (NN).
  • 1969 : The XOR problem was raised (people were surprised to find that the neural network model could not solve even simple XOR problems, and their expectations fell from the clouds to the bottom), and the neural network model entered the dark age of being shelved.
  • 1986: The newly proposed multi-layer neural network solved the XOR problem, but with the rise in the 1990s of machine learning models such as SVM, which had more complete theory and better practical results, neural networks received little attention.
  • Around 2010: Deep learning entered its real boom. As improved neural network models shone in speech and computer vision tasks, they were gradually shown to be more effective on more tasks as well, such as natural language processing (NLP) and tasks with massive data. At that point, neural network models came back to life under a more resounding name: deep learning (Deep Learning).

Why did neural networks only come back to life after 2010? This is related to the prerequisites on which the success of deep learning depends: ① the emergence of big data, ② hardware development, and ③ algorithm optimization.

  1. The emergence of big data: Big data is an effective prerequisite for the development of neural networks. Neural networks (NN) and deep learning (DL) are very powerful models that require a sufficient amount of training data. To this day, the reason many traditional machine learning algorithms and hand-crafted features remain sufficiently effective solutions is that in many scenarios there is not enough labeled data to support deep learning. The power of deep learning is much like Archimedes' boast: "Give me a lever long enough and I can move the Earth!" Deep learning can make a similar boast: "Give me enough data and I can learn any complex relationship." In reality, however, enough data, like a long enough lever, was for a long time only a beautiful wish. Only in recent years, as IT adoption across industries increased and accumulated data grew explosively, did applying deep learning models become possible.
  2. Hardware development and algorithm optimization: Deep learning also relies on advances in hardware and optimization of algorithms. At this stage, thanks to more powerful computers, GPUs, autoencoder pre-training, parallel computing, and other techniques, the difficulties of training deep learning models have gradually been overcome. Of these, data volume and hardware are the more important factors; without them, scientists could not even begin to optimize the algorithms.

1.3.3 Research and applications of deep learning are booming

As early as 1998, some scientists had used neural network models to recognize handwritten digit images (MNIST). However, the rise of deep learning in computer vision applications began with the use of AlexNet for image classification in the 2012 ImageNet competition. If you compare the 1998 and 2012 models, you will find that the two are very similar in network structure, with only some optimizations in details. In the past 14 years, the substantial improvement in computing performance and the explosive growth in data volume have prompted the model to complete the leap from "simple number recognition" to "complex image classification".

Although it has a long history, deep learning is still booming today. On the one hand, basic research is developing rapidly, and on the other hand, industrial practices are emerging in endlessly. Based on statistics from ICLR (International Conference on Learning Representations), the top conference on deep learning, the number of papers related to deep learning is increasing year by year, as shown in the figure below. At the same time, not only deep learning conferences, but also a large number of papers from international conferences such as ICML and KDD related to data and model technology, CVPR focusing on vision, and EMNLP focusing on natural language processing, all involve deep learning technology. Research in this field and related fields is in the ascendant, and technology is still undergoing innovation and breakthroughs.

[Figure: the number of deep-learning-related papers at ICLR increasing year by year]

On the other hand, artificial intelligence technology based on deep learning has extremely broad application scenarios in upgrading and transforming many traditional industry fields. The picture below is taken from a research report by iResearch. Artificial intelligence technology can not only be applied in many industries (breadth), but also has achieved market realization in some industries (such as security, remote sensing, Internet, finance, industry, etc.) and rapid growth (depth), contributing huge economic value to society.

[Figure: breadth and depth of AI applications across industries (from an iResearch report)]

As shown in the figure below, taking the industry distribution of computer vision (CV) applications as an example, IDC statistics and forecasts indicate that, as artificial intelligence penetrates more industries, the share of AI output value contributed by the Internet industry, which currently dominates, will gradually shrink.

[Figure: industry distribution of computer vision applications (IDC statistics and forecasts)]

1.3.4 Deep learning has changed the research and development model of AI applications

1.3.4.1 Implemented end-to-end (End2End) learning

Deep learning has changed the way algorithms are implemented in many fields. Before deep learning took off, the modeling approach in many fields was to invest a great deal of effort in feature engineering, distilling experts' "human understanding" of a domain into feature representations, and then using a simple model to complete the task (such as classification or regression). With sufficient data, a deep learning model can instead learn end to end (End2End): no dedicated feature engineering is needed; the raw features are fed into the model, which performs feature extraction and classification at the same time, as shown in the figure below.

[Figure: end-to-end learning versus the traditional feature-engineering pipeline]

Take computer vision tasks as an example. Feature engineering consisted of series of feature-extraction steps designed by image scientists based on the human understanding of vision theory, the SIFT feature being a typical example. Before 2010, the computer vision community generally completed modeling tasks with SIFT-like features plus simple shallow models such as SVM.

Notes

  1. The SIFT feature was proposed by David Lowe in 1999 and refined in 2004. SIFT features are based on interest points in the local appearance of an object and are independent of image size and rotation. They are also quite tolerant of changes in lighting, noise, and small changes in viewpoint. Thanks to these properties, they are highly distinctive and relatively easy to extract, and in feature databases with very large numbers of entries, objects are easy to identify and rarely misidentified. The detection rate with SIFT descriptors is also quite high for partially occluded objects; even three or more SIFT object features are enough to compute position and orientation. With today's computer hardware and a small feature database, recognition speed can approach real time. SIFT features carry a large amount of information and are suitable for fast and accurate matching in massive databases.

  2. The most essential difference between deep learning and traditional machine learning lies in the need for feature engineering.

    • In traditional machine learning, feature engineering is an important step: domain experts must manually design and select relevant features from the raw data as input for the learning algorithm. This process usually requires extensive domain knowledge, data preprocessing, and manual effort to create informative, meaningful data representations.
    • Deep learning, by contrast, can automatically learn relevant features from the raw data during training. Deep learning models, especially neural networks, can progressively learn hierarchical representations of the data through multiple computational layers. This ability to learn features automatically makes deep learning better suited to complex, high-dimensional data.
1.3.4.2 Implemented standardization of deep learning frameworks

Besides being widely applicable, deep learning has pushed artificial intelligence into a stage of industrial mass production. The generality of the algorithms has given rise to standardized, automated, and modular frameworks, as shown in the figure below.

[Figure: deep learning frameworks bring standardization, automation, and modularization to AI development]

Before this, machine learning algorithms of different schools differed in both theory and implementation, so every algorithm had to be implemented independently, such as random forests (RF) and support vector machines (SVM). Under a deep learning framework, however, the algorithmic structure of different models has a great deal in common: the convolutional neural network (CNN) commonly used in computer vision and the long short-term memory network (LSTM) commonly used in natural language processing can both be divided into a network-construction module, a gradient-descent optimization module, a prediction module, and so on. This made it possible to abstract a unified framework and greatly reduced the cost of writing modeling code. Relatively general components, such as the implementation of basic network operators and various optimization algorithms, can all be provided by the framework. Modelers only need to focus on data processing, configuring how the network is assembled, and stringing together the training and prediction workflow with a small amount of code.

Before deep learning frameworks appeared, machine learning engineers lived in an era of "handicraft workshop" production. To build a model, an engineer had to accumulate a large amount of mathematical knowledge and a great deal of industry knowledge for feature engineering. Every model was extremely individualized, and, like an artisan, a modeler's accumulated experience became the model's "personal signature". Today, "deep learning engineers" have entered the era of industrial mass production: with only the necessary but modest amount of deep learning theory and a command of Python programming, one can implement very effective models on a deep learning framework, even rivaling the most advanced models in the field. The technical barriers of the modeling field are being overturned, which is also an opportunity for newcomers.

[Figure: from "handicraft workshop" modeling to industrialized model production]

1.4 Broad career prospects in artificial intelligence

Below we analyze, from the perspective of economic return, whether artificial intelligence is a promising career. Frankly, as Buffett said, a career you truly love is the really good career. But for most ordinary people, economic return is also an important factor in choosing a profession. A highly paid profession must be one where market demand far exceeds market supply, where demand keeps growing over the long term, and where supply is hard to replenish in the short to medium term.

1.4.1 Strong market demand for AI positions

According to industry research reports from the major consulting firms, AI-related industries are expected to grow at 30% to 40% per year over the next decade. On the one hand, AI applications will gradually expand from the Internet industry to broader industries such as finance, manufacturing, agriculture, energy, cities, transportation, healthcare, and education, so the application space and potential are enormous. On the other hand, constrained by the maturity of AI technology itself and by the fact that deploying AI requires scenario-specific data processing, system transformation, and business-process optimization, the value of AI applications will be released relatively slowly. This gives the market for AI positions a steadily and persistently growing demand curve, which is friendlier to most job seekers than the Internet industry, as shown in the figure below.

[Figure: the long, steady growth curve of demand for AI positions compared with the Internet industry]

Because the Internet industry's technology matured quickly and applications were rolled out rapidly, it instead formed a curve with a higher growth rate (over 100% per year) but a shorter growth cycle (about 10 years for the PC Internet era and 10 years for the mobile Internet era). When an industry's growth peaks, demand for positions falls back accordingly, as in the state of the Internet industry at the end of 2021.

1.4.2 Interdisciplinary talent is in urgent demand

As AI lands in thousands of industries, what enterprises need most, and most urgently, are "interdisciplinary talents" who understand the industry and its scenarios, understand AI theory, and also have practical ability and experience. Becoming such a talent requires not only book learning but also a large amount of industrial practice, which gives this kind of talent depth of growth and makes its supply grow slowly. From the analysis above, as long as the AI industry keeps growing steadily over the coming decades while the "interdisciplinary talent" it needs remains hard to supply in quantity, AI application development positions will maintain very good economic returns.

2. Building a neural network model with Python and NumPy

In the previous chapter we took a first look at the basic concepts of neural networks (neurons, multi-layer connections, forward computation, computational graphs) and the three elements of model structure (model hypothesis, evaluation function, and optimization algorithm). This chapter uses the "Boston housing price prediction" task as an example to walk through the thinking and the concrete steps of building a neural network model with Python and NumPy.

Boston housing price prediction is a classic machine learning task, something like the "Hello World" of the programming world. As with housing prices everywhere, prices in the Boston area are affected by many factors. The dataset records 13 factors that may influence the house price together with the average price of that type of house, and the goal is to build a model that predicts the price from those 13 factors, as shown in the figure below.

[Figure: the Boston housing price prediction task with 13 influencing factors and the average house price]

For prediction problems, we distinguish ① regression tasks and ② classification tasks according to whether the predicted output is a continuous real value or a discrete label. Since a house price is a continuous value, price prediction is clearly a regression task. Below we try to solve it with the simplest linear regression model and implement that model with a neural network.

2.1 Linear regression model

Assume that the house price and the influencing factors can be described by a linear relationship:

$$y = \sum_{j=1}^M x_j w_j + b \tag{1}$$

Solving the model means fitting each $w_j$ and $b$ from the data, where $w_j$ and $b$ denote the weight and bias of the linear model. In the one-dimensional case, $w_j$ and $b$ are the slope and intercept of the line.

The linear regression model uses mean squared error (MSE) as the loss function (Loss) to measure the difference between the predicted and the actual house prices:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^n (\hat{Y}_i - Y_i)^2 \tag{2}$$

Thinking

Why use mean squared error (MSE) as the loss function, i.e., sum the model's prediction error over each training sample to measure accuracy on the whole sample set? This is because the design of a loss function must consider not only "reasonableness" but also "solvability"; this question is discussed in detail later.


In the standard structure of a neural network, each neuron consists of a weighted sum and a nonlinear transformation, and multiple neurons are then arranged in layers and connected to form a network. The linear regression model can be regarded as a minimal special case of a neural network: a single neuron with only the weighted sum and no nonlinear transformation (no network needs to be formed), as shown in the figure below.

[Figure: linear regression as a minimal neural network, a single neuron with no nonlinear transformation]

2.2 Implementing the Boston housing price prediction task with Python and NumPy

Deep learning not only realizes end-to-end learning of the model, it also pushes AI into the stage of industrial mass production, producing general-purpose frameworks that are standardized, automated, and modular. Deep learning models for different scenarios share a certain generality, and a model can be built and trained in five steps, as shown in the figure below.

[Figure: the five standard steps of building and training a deep learning model]

It is precisely because the modeling and training process of deep learning is so generic, with only the three elements of the model differing between tasks while the other steps stay basically the same, that deep learning frameworks have a role to play.

2.2.1 Data processing

Data processing includes five parts: ① data loading, ② data shape transformation, ③ dataset splitting, ④ data normalization, and ⑤ encapsulation into a load_data function. In general, the data can only be consumed by the model after this preprocessing.

2.2.1.1 Reading data

Read the data with the following code to get a feel for the structure of the Boston housing price dataset. The data is stored in the housing.data file in a local directory.

Data set download address: http://paddlemodels.bj.bcebos.com/uci_housing/housing.data

import numpy as np
import json


# 读取数据
datafile = "/data/data_01/lijiandong/Datasets/boston_house_price/housing.data"
data = np.fromfile(datafile, sep=" ")
print(data)  # /data/data_01/lijiandong/Datasets/boston_house_price/housing.data
print(data.shape)  # (7084,)

2.2.1.2 Data shape transformation

Since the raw data read in is one-dimensional, with all values strung together, we need to transform its shape into a two-dimensional matrix in which each row is one data sample (14 columns), and each sample contains 13 $X$ values (features that influence the house price) and one $Y$ value (the average price of that type of house).

"""
读入之后的数据被转化成1维array,其中array的第0-13项是第一条数据,第14-27项是第二条数据,以此类推.... 
这里对原始数据做reshape,变成N x 14的形式
"""
feature_names = [ 'CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE','DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV' ]
feature_num = len(feature_names)
data = data.reshape([data.shape[0] // feature_num, feature_num])

# 查看数据
x = data[0]  # 取第一行
print(x.shape)  # (14,)
print(x)
"""
[6.320e-03 1.800e+01 2.310e+00 0.000e+00 5.380e-01 6.575e+00 6.520e+01
 4.090e+00 1.000e+00 2.960e+02 1.530e+01 3.969e+02 4.980e+00 2.400e+01]
"""
2.2.1.3 Data set division

The dataset is split into a training set and a test set: the training set is used to determine the model's parameters, and the test set is used to evaluate the model's effectiveness. Why split the dataset instead of using all of it for training? This is similar to the relationship between teaching and examinations in our student days, as shown in the figure below.

[Figure: dataset splitting compared with the relationship between teaching and examination]

When I was in school, there were always some smart students who didn't study seriously. They crammed before the exam and memorized the exercises, but their grades were often not good. Because schools expect students to master knowledge, not just the exercises themselves. Introducing new test questions can encourage students to work hard to understand the principles behind the exercises. Similarly, we expect the model to learn the essential rules of the task, rather than the training data itself. Only data that is not used in model training can more truly evaluate the effect of the model.

In this case, we use 80% of the data as the training set and 20% as the test set. The implementation code is as follows.

ratio = 0.8
offset = int(data.shape[0] * ratio)
training_data = data[:offset]
test_data = data[offset:]

print(data.shape)  # (506, 14)
print(training_data.shape)  # (404, 14)
print(test_data.shape)  # (102, 14)

By printing the shape of the training set, we can find that there are 404 samples in total, each containing 13 features and 1 predicted value.

2.2.1.4 Data normalization processing

Data normalization is a common preprocessing technique used to scale data into a certain range, usually $[0, 1]$ or $[-1, 1]$ (the former is more widely used, and is what we use here). The normalization formula is as follows:

For each feature (i.e., each column), let the range of the original data be $[\min, \max]$; the normalized value is then computed as:

$$\text{normalized value} = \frac{\text{original value} - \min}{\max - \min}$$

where "original value" is the raw value of that feature, $\min$ is the minimum of that feature over the dataset, and $\max$ is its maximum.

Normalization scales the value of each feature into $[0, 1]$. This has two benefits:

  1. Model training is more efficient, which will be explained in detail in the second half of this section;
  2. The weight before a feature can represent the contribution of the variable to the prediction result (because each feature value itself has the same range).
# 计算train数据集的最大值和最小值
maxinums, minimus = training_data.max(axis=0), training_data.min(axis=0)

# 对数据进行归一化处理
for col_name in range(feature_num):
    # 训练集归一化
    training_data[:, col_name] = (training_data[:, col_name] - minimus[col_name]) / (maxinums[col_name] - minimus[col_name])
    # 测试集归一化(确保了测试集上的数据也使用了与训练集相同的归一化转换,避免了引入测试集信息污染)
    test_data[:, col_name] = (test_data[:, col_name] - minimus[col_name]) / (maxinums[col_name] - minimus[col_name])

# 验证是否有大于1的值
print(np.any(training_data[:, :-1] > 1.0))  # False
print(np.any(test_data[:, :-1] > 1.0))  # True(这是正常的,因为我们使用了训练集的归一化参数)
2.2.1.5 Encapsulating a load_data function

Encapsulate the data-processing steps above into a single function (named data_load in the code below) so that the model can call it in the next step. The implementation is as follows.

import numpy as np
import json


def data_load():
    # 2.2.1.1 读入数据
    datafile = "/data/data_01/lijiandong/Datasets/boston_house_price/housing.data"
    data = np.fromfile(datafile, sep=" ")


    # 2.2.1.2 数据形状变换
    feature_names = [ 'CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE','DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV' ]
    feature_num = len(feature_names)
    data = data.reshape([data.shape[0] // feature_num, feature_num])  # [N, 14]


    # 2.2.1.3 数据集划分
    ratio = 0.8
    offset = int(data.shape[0] * ratio)
    training_data = data[:offset]
    test_data = data[offset:]


    # 2.2.1.4 数据归一化处理
    # 计算train数据集的最大值和最小值
    maxinums, minimus = training_data.max(axis=0), training_data.min(axis=0)

    # 对数据进行归一化处理
    for col_name in range(feature_num):
        # 训练集归一化
        training_data[:, col_name] = (training_data[:, col_name] - minimus[col_name]) / (maxinums[col_name] - minimus[col_name])
        # 测试集归一化(确保了测试集上的数据也使用了与训练集相同的归一化转换,避免了引入测试集信息污染)
        test_data[:, col_name] = (test_data[:, col_name] - minimus[col_name]) / (maxinums[col_name] - minimus[col_name])
        
    return training_data, test_data


# 获取数据
training_data, test_data = data_load()
x = training_data[:, :-1]  # 训练数据
target = training_data[:, -1:]  # 目标值

# 查看数据
print(f"x.shape: {
      
      x.shape}")  # (404, 13)
print(f"target.shape: {
      
      target.shape}")  # (404,)

2.2.2 Model design

Model design is one of the key elements of a deep learning model. It is also called network structure design and corresponds to the model's hypothesis space, i.e., it implements the model's "forward computation" (from input to output).

If both the input features and the output prediction are represented as vectors, the input feature $x$ has 13 components and $y$ has 1 component, so the shape of the weight parameters is $13 \times 1$. Suppose we initialize the parameters with the following arbitrary numbers:

$$w = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, -0.1, -0.2, -0.3, -0.4, 0.0]$$

In code:

w = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, -0.1, -0.2, -0.3, -0.4, 0.0]
w = np.array(w).reshape([13, 1])

Take the first sample and look at the result of multiplying its feature vector by the parameter vector.

x1 = x[0]
t = np.dot(x1, w)
print(t)  # [0.69474855]

For the complete linear regression formula, we also need to initialize the bias $b$, which we likewise arbitrarily set to -0.2. The complete output of the linear regression model is then $z = t + b$; this process of computing the output from the features and parameters is called "forward computation".

b = -0.2  # bias
z = t + b
print(z)  # [0.49474855]

Describing the prediction computation above as a class and objects, the class member variables are the parameters $w$ and $b$, and a forward function (representing "forward computation") carries out the computation from features and parameters to the output prediction, as shown in the code below.

class Network:
    def __init__(self, num_of_weights):
        np.random.seed(0)
        self.w = np.random.randn(num_of_weights, 1)  # 随机产生w的初始值
        self.b = 0.0  # 不使用偏置
        
    def forward(self, x):
        z = np.dot(x, self.w) + self.b
        return z

Based on the definition of the Network class, the model's computation proceeds as follows.

net = Network(13)
x1 = x[0]
y1 = target[0]
z = net.forward(x1)
print(z)  # [2.39362982]

From the forward computation above we can see that linear regression can also be expressed as a simple neural network (a single neuron whose activation function is the identity $y = y$). This is also why traditional machine learning models have largely been replaced by deep learning models: thanks to the strong representational power of deep networks, the learning capacity of many traditional machine learning models is equivalent to that of relatively simple deep learning models.

2.2.3 Training configuration

After the model design is complete, we need the training configuration to find the model's optimal parameter values, i.e., to measure how good the model is via a loss function. The training configuration is another key element of a deep learning model.

The model computes that the house price corresponding to the influencing factors represented by $x_1$ should be $z$, but the actual data tells us the price is $y$. We therefore need some metric that measures the gap between the predicted value $z$ and the true value $y$. For regression problems, the most commonly used metric is mean squared error (MSE), defined as follows:

$$\mathrm{Loss} = (y - z)^2 \tag{3}$$

The Loss above (abbreviated $L$) is usually called the loss function and is a measure of model quality. Here it is worth thinking about a question: to measure the gap between predicted and real house prices, couldn't we simply add up the absolute value of the gap for each sample? Summing absolute differences is the more intuitive and simpler idea, so why the sum of squares instead?

This is because the design of the loss function must not only consider the "reasonableness" of accurately measuring the problem, but also usually consider the "ease of optimization and solution". As for the answer to this question, it will be revealed after the optimization algorithm is introduced.

In regression problems, mean square error (MSE) is a relatively common form. In classification problems, cross entropy (Cross Entropy) is usually used as the loss function, which will be introduced in more detail in subsequent chapters. The implementation of calculating the loss function value for a sample is as follows.

loss = (y1 - z) ** 2
print(loss)  # [3.88644793]

Because the loss of every sample must be taken into account, we sum the loss over individual samples and divide by the total number of samples $N$:

$$L = \frac{1}{N} \sum_{i=1}^N (y_i - z_i)^2 \tag{4}$$

Adding the loss function to the Network class, the computation is implemented as follows.

class Network:
    def __init__(self, num_of_weights):
        np.random.seed(0)
        self.w = np.random.randn(num_of_weights, 1)  # 随机产生w的初始值
        self.b = 0.0  # 不使用偏置
        
    def forward(self, x):
        z = np.dot(x, self.w) + self.b
        return z
    
    def loss(self, pred, gt):
        loss_value_sum = (pred - gt) ** 2
        return np.mean(loss_value_sum)

With the Network class defined, the predictions and the loss function are easy to compute. Note that the variables x, w, b, pred, gt, and loss_value_sum in the class are all vectors. Taking the variable x as an example, it has two dimensions: one is the number of features (13) and the other is the number of samples. The code is as follows.

# 组成向量一次性计算多个值
x1 = x[:3]  # 前三行
y1 = target[:3]  # 前三行

pred = net.forward(x1)
print(f"pred: {pred}")

loss = net.loss(pred, y1)
print(f"loss: {loss}")

result:

pred: [[2.39362982]
 [2.46752393]
 [2.02483479]]
loss: 3.573658599044957

2.2.4 Training process

The computation above describes how to construct the neural network and use it to compute the prediction and the loss function. Next we introduce how to solve for the values of the parameters $w$ and $b$; this is also called the model training process. The training process is one of the key elements of a deep learning model, and its goal is to make the defined loss function $\mathrm{Loss}$ as small as possible, i.e., to find parameter values $w$ and $b$ at which the loss function attains a minimum.

Let us first do a small test, as shown in the figure below. From calculus we know that the slope of a curve at a point equals the derivative of the function at that point. So think about it: at an extreme point of the curve, what is the slope at that point?

[Figure: the slope of a curve at a point equals the derivative of the function at that point]

This question is not hard to answer: the slope at an extreme point of the curve is 0, i.e., the derivative of the function at an extreme point is 0. The values of $w$ and $b$ that make the loss function minimal should therefore be the solution of the following system of equations:

$$\frac{\partial L}{\partial w} = 0 \tag{5}$$

$$\frac{\partial L}{\partial b} = 0 \tag{6}$$

where $L$ is the value of the loss function, $w$ is the model weight, and $b$ is the bias term; both $w$ and $b$ are the model parameters to be learned.

Express the loss function in the form of a matrix, as follows:

$$L = \frac{1}{N} \left\| y - (Xw + b) \right\|^2 \tag{7}$$

where $y$ is the vector formed by the label values of the $N$ samples, with shape $N \times 1$; $X$ is the matrix formed by the $N$ sample feature vectors, with shape $N \times D$, where $D$ is the length of the feature vector; $w$ is the weight vector, with shape $D \times 1$; and $b$ here denotes the vector whose elements are all $b$, with shape $N \times 1$.

Taking the partial derivative of formula 7 with respect to the parameter $b$:

$$\frac{\partial L}{\partial b} = 1^T \left[ y - (Xw + b) \right] \tag{8}$$

Note that the constant factor $-\frac{2}{N}$ is omitted here, which does not affect the final result; $1$ denotes the $N$-dimensional all-ones vector.

Setting formula 8 equal to 0, we get

$$b^* = \overline{y} - \overline{x}^T w \tag{9}$$

where $\overline{y} = \frac{1}{N} 1^T y$ is the mean of all the labels and $\overline{x} = \frac{1}{N}(1^T X)^T$ is the mean of all the feature vectors. Substituting $b^*$ into formula 7 and taking the partial derivative with respect to the parameter $w$, we get

$$\frac{\partial L}{\partial w} = (X - \overline{x}^T)^T \left[ (y - \overline{y}) - (X - \overline{x}^T) w \right] \tag{10}$$

Setting formula 10 equal to 0 gives the optimal parameters

$$w^* = \left[ (X - \overline{x}^T)^T (X - \overline{x}^T) \right]^{-1} (X - \overline{x}^T)^T (y - \overline{y}) \tag{11}$$

$$b^* = \overline{y} - \overline{x}^T w^* \tag{12}$$

Substituting the sample data $(x, y)$ into formulas 11 and 12 yields the values of $w$ and $b$, but this method only works for a task as simple as linear regression. If the model contains nonlinear transformations, or the loss function is not of a simple form such as mean squared error, it is difficult to solve with a closed-form expression like the one above. To address this, we now introduce a more universally applicable numerical method: gradient descent.
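
Before moving on, here is a minimal NumPy sketch of formulas 11 and 12; it assumes the normalized training arrays x and target from the data-processing code above are in scope, and it uses np.linalg.lstsq rather than forming the matrix inverse explicitly (a numerically safer way to apply formula 11).

import numpy as np

x_mean = x.mean(axis=0)      # the mean feature vector, shape (13,)
y_mean = target.mean()       # the mean label, a scalar
Xc = x - x_mean              # centered features (broadcast over rows)
yc = target - y_mean         # centered labels

# Formula 11: w* solves the least-squares problem on the centered data
w_star, *_ = np.linalg.lstsq(Xc, yc, rcond=None)   # shape (13, 1)
# Formula 12: b* is the mean label minus the mean features times w*
b_star = y_mean - x_mean @ w_star

pred = x @ w_star + b_star
print("closed-form MSE on the training set:", np.mean((pred - target) ** 2))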

2.2.4.1 Gradient descent

In reality, there are many functions that are easy to evaluate in the forward direction but hard to invert; these are called one-way functions and have many applications in cryptography. A combination lock is a good example: it can quickly decide whether a key is correct (given $x$, computing $y$ is easy), but even with the lock in hand, one cannot crack the correct key (given $y$, finding $x$ is hard).

This situation is much like a blind person who wants to walk from a mountainside down to the valley. He cannot see where the valley is (he cannot solve in reverse for the parameter values at which the derivative of $\mathrm{Loss}$ is 0), but he can feel the slope around him with his feet (the derivative at the current point, also called the gradient). Minimizing the loss function can therefore be done as follows: start from the current parameter values, keep stepping in the downhill direction, and continue until the lowest point is reached. The author calls this the "blind person walking downhill" method. Oh no, there is a more formal term for it: gradient descent.

The key to training is to find a set of $(w, b)$ that minimizes the loss function $L$. Let us first look at the simple case where $L$ varies with only the two parameters $w_5$ and $w_9$, to develop intuition for how to find the solution.

$$L = L(w_5, w_9) \tag{13}$$

Here all the parameters among $w_0, w_1, ..., w_{12}$ other than $w_5$ and $w_9$, as well as $b$, are held fixed, so $L(w_5, w_9)$ can be drawn as a surface in three-dimensional space showing how the loss varies with these two parameters.

import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D


net = Network(num_of_weights=13)

# 只画出参数w5和w9在区间[-160, 160]的曲线部分,以及包含损失函数的极值
w5 = np.arange(-160.0, 160.0, 1.0)
w9 = np.arange(-160.0, 160.0, 1.0)
losses = np.zeros(shape=[len(w5), len(w9)])

# 计算设定区域内每个参数取值所对应的Loss
for i in range(len(w5)):
    for j in range(len(w9)):
        # 更改模型参数的数值
        net.w[5] = w5[i]
        net.w[9] = w9[j]
        
        # 模型infer并计算、记录loss
        pred = net.forward(x)
        loss = net.loss(pred=pred, gt=target)
        losses[i, j] = loss
        
# 使用matplotlib将两个变量和对应的Loss作3D图
fig = plt.figure(dpi=300)
ax = fig.add_axes(Axes3D(fig))

# 设置坐标轴标签
ax.set_xlabel('w5')
ax.set_ylabel('w9')
ax.set_zlabel('Loss')

w5, w9 = np.meshgrid(w5, w9)

ax.plot_surface(w5, w9, losses, rstride=1,cstride=1, cmap="rainbow")
plt.savefig("gd_sample_demo.png")

[Figure: the loss surface of L(w5, w9) produced by the code above (gd_sample_demo.png)]

From the figure we can clearly see that in some regions the function value is smaller than at the surrounding points. Why were $w_5$ and $w_9$ chosen for the plot? Because with these two parameters, the existence of an extreme point can be seen fairly intuitively on the loss surface; for other parameter combinations, the extreme point is not as easy to observe graphically.

Observe that the surface above presents a "smooth" slope; this is exactly one of the reasons we chose mean squared error as the loss function. The figure below shows the loss curves of mean squared error and absolute error (the per-sample errors simply accumulated, without squaring) when there is only one parameter dimension.

[Figure: loss curves of mean squared error versus absolute error for a single parameter]

From this we can see that the "smooth" slope of the mean squared error has two benefits:

  1. The lowest point of the curve is differentiable.
  2. The closer to the lowest point, the gentler the slope becomes, which helps judge from the current gradient how close we are to the minimum (e.g., whether to gradually shrink the step size so as not to overshoot it).

Absolute error has neither of these properties, which is why the design of a loss function must pursue not only "reasonableness" but also "solvability".
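
To make the comparison concrete, here is a minimal sketch that plots the two loss curves side by side; the one-parameter model $z = w \cdot x$ and its data are made up for illustration.

import numpy as np
import matplotlib.pyplot as plt

# Made-up data for a one-parameter model z = w * x (the true w is 3.0)
rng = np.random.default_rng(0)
x_syn = rng.uniform(0.0, 1.0, size=100)
y_syn = 3.0 * x_syn + rng.normal(0.0, 0.1, size=100)

ws = np.linspace(-2.0, 8.0, 200)
mse = [np.mean((y_syn - w * x_syn) ** 2) for w in ws]    # mean squared error
mae = [np.mean(np.abs(y_syn - w * x_syn)) for w in ws]   # mean absolute error

plt.plot(ws, mse, label="mean squared error")
plt.plot(ws, mae, label="mean absolute error")
plt.xlabel("w")
plt.ylabel("Loss")
plt.legend()
plt.savefig("mse_vs_mae_loss.png")  # the MSE curve flattens near its minimum; the MAE curve is V-shaped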

We now want to find a pair of values $[w_5, w_9]$ that minimizes the loss function. The gradient descent scheme is as follows:

  • Step 1: randomly pick an initial point, for example $[w_5, w_9] = [-100.0, -100.0]$.
  • Step 2: pick the next point $[w'_5, w'_9]$ such that $L(w'_5, w'_9) < L(w_5, w_9)$.
  • Step 3: repeat step 2 until the loss function hardly decreases any more.

How to choose $[w'_5, w'_9]$ is crucial: first, $L$ must decrease; second, it should decrease as fast as possible. Basic calculus tells us that the direction opposite to the gradient is the direction in which the function value decreases fastest, as shown in the figure below. Intuitively, the gradient direction at a point is the direction in which the curve is steepest, but the gradient points uphill, so the fastest descent is in the direction opposite to the gradient.

[Figure: the direction opposite to the gradient is the direction of steepest descent]
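
Before applying these steps to the housing model, here is a minimal one-dimensional sketch of steps 1 to 3, minimizing the made-up loss $L(w) = (w - 3)^2$, whose gradient is $2(w - 3)$; all numbers are illustrative.

w = -100.0    # step 1: pick an initial value
eta = 0.1     # the size of each step
for i in range(50):
    grad = 2.0 * (w - 3.0)
    w = w - eta * grad              # step 2: move against the gradient so the loss decreases
    if i % 10 == 0:
        print(f"iter {i}: w = {w:.4f}, L = {(w - 3.0) ** 2:.6f}")
# step 3: after enough repetitions w approaches 3, where the loss hardly decreases any more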

2.2.4.2 Gradient computation

The computation of the loss function was introduced above; here we rewrite it slightly. To make the gradient computation cleaner, we introduce a factor of $\frac{1}{2}$ and define the loss function as follows:

$$L = \frac{1}{2N} \sum_{i=1}^N (y_i - z_i)^2 \tag{14}$$

where $z_i$ is the network's prediction for the $i$-th sample:

$$z_i = \sum_{j=0}^{12} x_i^j \cdot w_j + b \tag{15}$$

The gradient is defined as:

$$\mathrm{gradient} = \left( \frac{\partial L}{\partial w_0}, \frac{\partial L}{\partial w_1}, ..., \frac{\partial L}{\partial w_{12}}, \frac{\partial L}{\partial b} \right) \tag{16}$$

We can compute the partial derivatives of $L$ with respect to $w$ and $b$:

$$\frac{\partial L}{\partial w_j} = \frac{1}{N} \sum_{i=1}^N (z_i - y_i) \frac{\partial z_i}{\partial w_j} = \frac{1}{N} \sum_{i=1}^N (z_i - y_i) x_i^j \tag{17}$$

$$\frac{\partial L}{\partial b} = \frac{1}{N} \sum_{i=1}^N (z_i - y_i) \frac{\partial z_i}{\partial b} = \frac{1}{N} \sum_{i=1}^N (z_i - y_i) \tag{18}$$

From the derivation we can see that the factor $\frac{1}{2}$ is cancelled, because differentiating the quadratic produces a factor of 2; this is exactly why we rewrote the loss function.

Next we consider the case where there is only one sample and calculate the gradient:

$$L = \frac{1}{2}(y_1 - z_1)^2 \tag{19}$$

$$z_1 = x_1^0 \cdot w_0 + x_1^1 \cdot w_1 + ... + x_1^{12} \cdot w_{12} + b \tag{20}$$

From this we can compute:

$$L = \frac{1}{2}\left(x_1^0 \cdot w_0 + x_1^1 \cdot w_1 + ... + x_1^{12} \cdot w_{12} + b - y_1\right)^2 \tag{21}$$

The partial derivatives of $L$ with respect to $w$ and $b$ can then be computed:

$$\frac{\partial L}{\partial w_0} = \left(x_1^0 \cdot w_0 + x_1^1 \cdot w_1 + ... + x_1^{12} \cdot w_{12} + b - y_1\right) \cdot x_1^0 = (z_1 - y_1) \cdot x_1^0 \tag{22}$$

$$\frac{\partial L}{\partial b} = \left(x_1^0 \cdot w_0 + x_1^1 \cdot w_1 + ... + x_1^{12} \cdot w_{12} + b - y_1\right) \cdot 1 = (z_1 - y_1) \tag{23}$$

The values and dimensions of each variable can be inspected with the following code.

x1 = x[0]
y1 = target[0]
z1 = net.forward(x1)
print(f"x1.shape: {
      
      x1.shape}, The value is: {
      
      x1}")
print(f"y1.shape: {
      
      y1.shape}, The value is: {
      
      y1}")
print(f"z1.shape: {
      
      z1.shape}, The value is: {
      
      z1}")

result:

x1.shape: (13,), The value is: 
[0.         0.18       0.07344184 0.         0.31481481 0.57750527
 0.64160659 0.26920314 0.         0.22755741 0.28723404 1.
 0.08967991]
y1.shape: (1,), The value is: [0.42222222]
z1.shape: (1,), The value is: [130.86954441]

Using the formulas above, when there is only one sample we can compute the gradient of a particular $w_j$, for example $w_0$.

gradient_w0 = (z1 - y1) * x1[0]
print(f"The gradient of w0 is: {gradient_w0}")  # [0.]

Similarly, we can compute the gradient of $w_1$.

gradient_w1 = (z1 - y1) * x1[1]
print(f"The gradient of w1 is: {gradient_w1}")  # [0.35485337]

Computing the gradient of each $w_j$ in turn:

gradients_of_weights = []
for i in range(net.w.shape[0]):
    gradient = (z1 - y1) * x1[i]
    gradients_of_weights.append(gradient)
    
print(gradients_of_weights)

result:

[array([0.]), array([0.35485337]), array([0.14478381]), array([0.]), array([0.62062832]), array([1.13849828]), array([1.26486811]), array([0.53070911]), array([0.]), array([0.44860841]), array([0.56625537]), array([1.9714076]), array([0.17679566])]
2.2.4.3 Using NumPy for gradient calculation

Based on NumPy's broadcasting mechanism (vector and matrix computations are written the same way as computations on a single variable), the gradient can be computed more efficiently. In the code below, $(z_1 - y_1) \cdot x_1$ is computed directly; the result is a 13-dimensional vector, each component of which is the gradient for that dimension.

gradient_w = (z1 - y1) * x1
print(f"[The gradient of w by sampled] gradient_w.shape: {
      
      gradient_w.shape}, gradient_w: \r\n{
      
      gradient_w}")

result:

[The gradient of w by sampled] gradient_w.shape: (13,), gradient_w: 
[0.         0.35485337 0.14478381 0.         0.62062832 1.13849828
 1.26486811 0.53070911 0.         0.44860841 0.56625537 1.9714076
 0.17679566]

There are multiple samples in the input data, each contributing to the gradient. The above code calculates the gradient value when there is only sample 1. The same calculation method can also calculate the contribution of sample 2 and sample 3 to the gradient.

for i in range(x.shape[0]):
    input = x[i]
    gt = target[i]
    pred = net.forward(input)
    gradient = (pred - gt) * input
    if 1 <= i <= 3:
        print(f"[The gradient of w by sampled_{
      
      i}] gradient.shape: {
      
      gradient.shape}, gradient: \r\n{
      
      gradient}")

result:

[The gradient of w by sampled_1] gradient.shape: (13,), gradient: 
[4.95115308e-04 0.00000000e+00 5.50693832e-01 0.00000000e+00
 3.62727044e-01 1.15004718e+00 1.64259797e+00 7.32343840e-01
 9.12450018e-02 2.40970621e-01 1.16094704e+00 2.09863504e+00
 4.29108324e-01]
 
[The gradient of w by sampled_2] gradient.shape: (13,), gradient: 
[3.21688482e-04 0.00000000e+00 3.58140452e-01 0.00000000e+00
 2.35897372e-01 9.47722033e-01 8.18057517e-01 4.76275452e-01
 5.93406432e-02 1.56713807e-01 7.55014992e-01 1.34780052e+00
 8.66203097e-02]
 
[The gradient of w by sampled_3] gradient.shape: (13,), gradient: 
[2.95458209e-04 0.00000000e+00 6.89019665e-02 0.00000000e+00
 1.51571633e-01 6.64543743e-01 4.45830114e-01 4.52623356e-01
 8.77472466e-02 7.37333335e-02 6.54837165e-01 1.00206898e+00
 3.36921340e-02]

Some readers may once again think that you can use a for loop to calculate the contribution of each sample to the gradient, and then average it. But we don't need to do this, we can still use NumPy's matrix operations to simplify the operation, such as the case of 3 samples.

# 注意这里是一次取出3个样本的数据,不是取出第3个样本
x_3_samples = x[:3]
y_3_samples = target[:3]
z_3_samples = net.forward(x_3_samples)

print('x {}, shape {}'.format(x_3_samples, x_3_samples.shape))
print('y {}, shape {}'.format(y_3_samples, y_3_samples.shape))
print('z {}, shape {}'.format(z_3_samples, z_3_samples.shape))
x [[0.00000000e+00 1.80000000e-01 7.34418420e-02 0.00000000e+00
  3.14814815e-01 5.77505269e-01 6.41606591e-01 2.69203139e-01
  0.00000000e+00 2.27557411e-01 2.87234043e-01 1.00000000e+00
  8.96799117e-02]
 [2.35922539e-04 0.00000000e+00 2.62405717e-01 0.00000000e+00
  1.72839506e-01 5.47997701e-01 7.82698249e-01 3.48961980e-01
  4.34782609e-02 1.14822547e-01 5.53191489e-01 1.00000000e+00
  2.04470199e-01]
 [2.35697744e-04 0.00000000e+00 2.62405717e-01 0.00000000e+00
  1.72839506e-01 6.94385898e-01 5.99382080e-01 3.48961980e-01
  4.34782609e-02 1.14822547e-01 5.53191489e-01 9.87519166e-01
  6.34657837e-02]], shape (3, 13)
  
y [[0.42222222]
 [0.36888889]
 [0.66      ]], shape (3, 1)
 
z [[2.39362982]
 [2.46752393]
 [2.02483479]], shape (3, 1)

The first dimension of x_3_samples, y_3_samples, and z_3_samples is 3 in each case, meaning there are 3 samples. The contribution of these 3 samples to the gradient is computed below.

gradient_w = (z_3_samples - y_3_samples) * x_3_samples
print('gradient_w {}, gradient.shape {}'.format(gradient_w, gradient_w.shape))
gradient_w [[0.00000000e+00 3.54853368e-01 1.44783806e-01 0.00000000e+00
  6.20628319e-01 1.13849828e+00 1.26486811e+00 5.30709115e-01
  0.00000000e+00 4.48608410e-01 5.66255375e-01 1.97140760e+00
  1.76795660e-01]
 [4.95115308e-04 0.00000000e+00 5.50693832e-01 0.00000000e+00
  3.62727044e-01 1.15004718e+00 1.64259797e+00 7.32343840e-01
  9.12450018e-02 2.40970621e-01 1.16094704e+00 2.09863504e+00
  4.29108324e-01]
 [3.21688482e-04 0.00000000e+00 3.58140452e-01 0.00000000e+00
  2.35897372e-01 9.47722033e-01 8.18057517e-01 4.76275452e-01
  5.93406432e-02 1.56713807e-01 7.55014992e-01 1.34780052e+00
  8.66203097e-02]], gradient.shape (3, 13)

Here the computed gradient_w has shape $3 \times 13$: the first row matches the gradient computed above from sample 1, the second row matches the gradient computed from sample 2, and the third row matches the gradient computed from sample 3. Using matrix operations makes it much more convenient to compute each sample's contribution to the gradient.

For the case of $N$ samples, we can directly compute all samples' contributions to the gradient in the same way; this is the convenience brought by NumPy's broadcasting. To summarize how broadcasting is used here:

  • On the one hand, it expands the parameter dimension, replacing a for loop to compute one sample's gradients with respect to all parameters from $w_0$ to $w_{12}$.
  • On the other hand, it expands the sample dimension, replacing a for loop to compute the parameter gradients for samples 0 through 403.
z = net.forward(x)
gradient_w = (z - target) * x
print('gradient_w shape {}'.format(gradient_w.shape))
print(gradient_w)
gradient_w shape (404, 13)
[[0.00000000e+00 3.54853368e-01 1.44783806e-01 ... 5.66255375e-01
  1.97140760e+00 1.76795660e-01]
 [4.95115308e-04 0.00000000e+00 5.50693832e-01 ... 1.16094704e+00
  2.09863504e+00 4.29108324e-01]
 [3.21688482e-04 0.00000000e+00 3.58140452e-01 ... 7.55014992e-01
  1.34780052e+00 8.66203097e-02]
 ...
 [7.66711387e-01 0.00000000e+00 3.35694398e+00 ... 3.87578270e+00
  4.79373123e+00 2.45903597e+00]
 [4.83683601e-01 0.00000000e+00 3.14256160e+00 ... 3.62826605e+00
  4.20149273e+00 2.30075782e+00]
 [1.42480820e+00 0.00000000e+00 3.58013213e+00 ... 4.13346610e+00
  5.11244491e+00 2.54493671e+00]]

Each row of gradient_w above represents one sample's contribution to the gradient. According to the gradient formula, the total gradient is the average of the contributions of all samples:

$$\frac{\partial L}{\partial w_j} = \frac{1}{N} \sum_{i=1}^N (z_i - y_i) \frac{\partial z_i}{\partial w_j} = \frac{1}{N} \sum_{i=1}^N (z_i - y_i) x_i^j \tag{17}$$

This process can be accomplished using NumPy's mean function. The code is implemented as follows.

# axis = 0 表示把每一行做相加然后再除以总的行数
gradient_w = np.mean(gradient_w, axis=0)
print('gradient_w ', gradient_w.shape)
print('w ', net.w.shape)
print(gradient_w)
print(net.w)
gradient_w  (13,)
w  (13, 1)
[0.10197566 0.20327718 1.21762392 0.43059902 1.05326594 1.29064465
 1.95461901 0.5342187  0.88702053 1.15069786 1.5790441  2.43714929
 0.87116361]
[[ 1.76405235]
 [ 0.40015721]
 [ 0.97873798]
 [ 2.2408932 ]
 [ 1.86755799]
 [-0.97727788]
 [ 0.95008842]
 [-0.15135721]
 [-0.10321885]
 [ 0.4105985 ]
 [ 0.14404357]
 [ 1.45427351]
 [ 0.76103773]]

NumPy's matrix operations make the gradient computation easy, but they introduce a problem: the shape of gradient_w is $(13,)$, while the dimension of w is $(13, 1)$. The cause is that the np.mean function eliminates the 0th dimension. For convenience in addition, subtraction, multiplication, division, and so on, gradient_w and w must keep consistent shapes, so we also set the dimension of gradient_w to $(13, 1)$, with the following code:

gradient_w = gradient_w[:, np.newaxis]
print('gradient_w shape', gradient_w.shape)  # gradient_w shape (13, 1)

Putting the analysis above together, the code for computing the gradient is as follows.

pred = net.forward(x)
gradient_w = (pred - target) * x
gradient_w = np.mean(gradient_w, axis=0)
gradient_w = gradient_w[:, np.newaxis]
print(gradient_w)
[[0.10197566]
 [0.20327718]
 [1.21762392]
 [0.43059902]
 [1.05326594]
 [1.29064465]
 [1.95461901]
 [0.5342187 ]
 [0.88702053]
 [1.15069786]
 [1.5790441 ]
 [2.43714929]
 [0.87116361]]

The code above completes the gradient computation for $w$ very concisely. The code for computing the gradient of $b$ works on the same principle.

gradient_b = (pred - target)
gradient_b = np.mean(gradient_b)
# 此处b是一个数值,所以可以直接用np.mean得到一个标量
print(gradient_b)  # 2.599327274554706

Writing the above computation of the gradients of $w$ and $b$ as a gradient function of the Network class, the implementation is as follows.

class Network_with_gradient:
    def __init__(self, num_of_weights):
        np.random.seed(0)
        self.w = np.random.randn(num_of_weights, 1)  # 随机产生w的初始值
        self.b = 0.0  # 不使用偏置
        
    def forward(self, x):
        z = np.dot(x, self.w) + self.b
        return z
    
    def loss(self, pred, gt):
        loss_value_sum = (pred - gt) ** 2
        return np.mean(loss_value_sum)
    
    def gradient(self, x, gt):
        pred = self.forward(x)
        gradient_w = (pred - gt) * x
        gradient_w = np.mean(gradient_w, axis=0)
        gradient_w = gradient_w[:, np.newaxis]
        gradient_b = (pred - gt)
        gradient_b = np.mean(gradient_b)
        
        return gradient_w, gradient_b
    
    
# 初始化网络
net = Network_with_gradient(13)
# 设置[w5, w9] = [-100., -100.]
net.w[5] = -100.0
net.w[9] = -100.0

pred = net.forward(x)
loss = net.loss(pred, target)
gradient_w, gradient_b = net.gradient(x, target)
gradient_w5 = gradient_w[5][0]
gradient_w9 = gradient_w[9][0]
print('point {}, loss {}'.format([net.w[5][0], net.w[9][0]], loss))
print('gradient {}'.format([gradient_w5, gradient_w9]))
point [-100.0, -100.0], loss 7873.345739941161
gradient [-45.87968288123223, -35.50236884482904]
2.2.4.4 Gradient update

Next we study how to update the parameters using the gradient so as to find a point where the loss function is smaller. First take a small step in the direction opposite to the gradient to reach the next point $P_1$, and observe the change in the loss function.

# 在[w5, w9]平面上,沿着梯度的反方向移动到下一个点P1
# 定义移动步长 eta
eta = 0.1
# 更新参数w5和w9
net.w[5] = net.w[5] - eta * gradient_w5
net.w[9] = net.w[9] - eta * gradient_w9
# 重新计算z和loss
pred = net.forward(x)
loss = net.loss(pred, target)
gradient_w, gradient_b = net.gradient(x, target)
gradient_w5 = gradient_w[5][0]
gradient_w9 = gradient_w[9][0]
print('point {}, loss {}'.format([net.w[5][0], net.w[9][0]], loss))
print('gradient {}'.format([gradient_w5, gradient_w9]))
point [-95.41203171187678, -96.4497631155171], loss 7214.694816482369
gradient [-43.883932999069096, -34.019273908495926]

Running the code above, we can see that after taking one small step against the gradient, the loss at the next point has indeed decreased. If you are interested, you can keep re-running the code block above and watch whether the loss keeps getting smaller.

In the code above, the statement used to update the parameters each time is: net.w[5] = net.w[5] - eta * gradient_w5

  • Subtraction : The parameters need to be moved in the opposite direction of the gradient.
  • eta : Controls the size of each parameter value change along the opposite direction of the gradient, that is, the step size of each movement, also known as the learning rate.

Think about it: why did we normalize the input features earlier to keep their scales consistent? It is so that a uniform step size is more appropriate, which makes training more efficient.

As shown in the figure below, when the input features are normalized, the loss produced by different parameters forms a fairly regular surface and the learning rate can be set to a single value. When the inputs are not normalized, the step sizes needed by the parameters of different features are inconsistent: parameters with a larger scale need a large step size and parameters with a smaller scale need a small step size, so a single learning rate cannot be set.

[Figure: loss contours and suitable step sizes with normalized versus un-normalized features]
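
As a rough illustration of this point, here is a minimal sketch with made-up two-feature data; it reuses the gradient formula from above to show that, without normalization, the gradient magnitudes (and hence the suitable step sizes) of the two weights differ enormously, while normalization evens them out.

import numpy as np

rng = np.random.default_rng(0)
raw = np.stack([rng.uniform(0.0, 1.0, 100),                # feature 1: scale about 1
                rng.uniform(0.0, 1000.0, 100)], axis=1)    # feature 2: scale about 1000
y = raw @ np.array([[2.0], [0.003]]) + rng.normal(0.0, 0.1, size=(100, 1))   # made-up targets

def grad(x_feat, w):
    z = x_feat @ w
    return np.mean((z - y) * x_feat, axis=0)   # same gradient formula as gradient_w above

w0 = np.zeros((2, 1))
print("gradient on raw features:       ", grad(raw, w0))
scaled = (raw - raw.min(axis=0)) / (raw.max(axis=0) - raw.min(axis=0))
print("gradient on normalized features:", grad(scaled, w0))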

2.2.4.5 Encapsulating the Train function

Encapsulate the loop computation above into train and update functions; the implementation is as follows.

import numpy as np
import json
import matplotlib.pyplot as plt


class Network_with_gradient:
    def __init__(self, num_of_weights):
        np.random.seed(0)
        self.w = np.random.randn(num_of_weights, 1)  # 随机产生w的初始值
        self.b = 0.0  # 不使用偏置
        
    def forward(self, x):
        z = np.dot(x, self.w) + self.b
        return z
    
    def loss(self, pred, gt):
        loss_value_sum = (pred - gt) ** 2
        return np.mean(loss_value_sum)
    
    def gradient(self, x, gt):
        pred = self.forward(x)
        gradient_w = (pred - gt) * x
        gradient_w = np.mean(gradient_w, axis=0)
        gradient_w = gradient_w[:, np.newaxis]
        gradient_b = (pred - gt)
        gradient_b = np.mean(gradient_b)
        
        return gradient_w, gradient_b
    
    def update(self, gradient_w5, gradient_w9, lr=0.01):
        self.w[5] = self.w[5] - lr * gradient_w5
        self.w[9] = self.w[9] - lr * gradient_w9
        
    def train(self, inp, gt, interations=100, lr=0.01):
        pts = []
        losses = []
        
        for i in range(interations):
            pts.append([self.w[5][0], self.w[9][0]])
            pred = self.forward(inp)
            loss = self.loss(pred, gt)
            gradient_w, gradient_b = self.gradient(x=inp, gt=gt)
            gradient_w5 = gradient_w[5][0]
            gradient_w9 = gradient_w[9][0]
            self.update(gradient_w5, gradient_w9, lr=lr)
            losses.append(loss)
            
            if i % 50 == 0:
                print(f"iter: {
      
      i}, point: {
      
      [self.w[5][0], self.w[9][0]]}, Loss: {
      
      loss}")
        return pts, losses


def data_load():
    # 2.2.1.1 读入数据
    datafile = "/data/data_01/lijiandong/Datasets/boston_house_price/housing.data"
    data = np.fromfile(datafile, sep=" ")


    # 2.2.1.2 数据形状变换
    feature_names = [ 'CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE','DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV' ]
    feature_num = len(feature_names)
    data = data.reshape([data.shape[0] // feature_num, feature_num])  # [N, 14]


    # 2.2.1.3 数据集划分
    ratio = 0.8
    offset = int(data.shape[0] * ratio)
    training_data = data[:offset]
    test_data = data[offset:]


    # 2.2.1.4 数据归一化处理
    # 计算train数据集的最大值和最小值
    maxinums, minimus = training_data.max(axis=0), training_data.min(axis=0)

    # 对数据进行归一化处理
    for col_name in range(feature_num):
        # 训练集归一化
        training_data[:, col_name] = (training_data[:, col_name] - minimus[col_name]) / (maxinums[col_name] - minimus[col_name])
        # 测试集归一化(确保了测试集上的数据也使用了与训练集相同的归一化转换,避免了引入测试集信息污染)
        test_data[:, col_name] = (test_data[:, col_name] - minimus[col_name]) / (maxinums[col_name] - minimus[col_name])
        
    return training_data, test_data
    
    
if __name__ == "__main__":
    # 获取数据
    train_data, test_data = data_load()
    inputs = train_data[:, :-1]
    targets = train_data[:, -1:]
    
    # 创建网络
    model = Network_with_gradient(num_of_weights=13)
    num_iteration = 2000
    
    # 启动训练
    points, losses = model.train(inputs, targets, num_iteration, lr=0.01)
    
    # 画出损失函数的变化趋势
    plot_x = np.arange(num_iteration)
    plot_y = np.array(losses)
    plt.plot(plot_x, plot_y)
    plt.xlabel("Iteration")
    plt.ylabel("Loss")
    plt.savefig("test_model_loss.png")
iter: 0, point: [-0.9901843263352099, 0.39909152329488246], Loss: 8.74595446663459
iter: 50, point: [-1.565620876463878, -0.12301073215571104], Loss: 6.306355217447029
iter: 100, point: [-2.022354418155828, -0.5542991021920926], Loss: 4.711795652571785
iter: 150, point: [-2.3835826672990916, -0.9119900464266062], Loss: 3.6675975768722684
...
iter: 1850, point: [-3.2091283870302365, -3.3267726714708044], Loss: 1.4862553256192466
iter: 1900, point: [-3.1961376490600024, -3.344849195951625], Loss: 1.4842710198735334
iter: 1950, point: [-3.183547669731126, -3.3622933329823237], Loss: 1.4824177199043154

[Figure: the training loss decreasing over 2000 iterations (test_model_loss.png)]

2.2.4.6 Extending the training process to all parameters

To give readers an intuitive feel, the gradient descent demonstrated above involved only the two parameters $w_5$ and $w_9$. But the house-price prediction model must solve for all of the parameters $w$ and $b$, which requires modifying the update and train functions in Network. Because the parameters involved in the computation are no longer restricted (all parameters take part), the modified code is actually more concise.

Implementation logic:

  1. Forward computation of the model output
  2. Compute the loss from the output and the ground truth
  3. Compute the gradients from the loss and the input
  4. Update the parameter values based on the gradients

These four steps are executed repeatedly until the loss function is minimized.

import numpy as np
import json
import matplotlib.pyplot as plt


class Network_full_weights:
    def __init__(self, num_of_weights):
        np.random.seed(0)
        self.w = np.random.randn(num_of_weights, 1)  # 随机产生w的初始值
        self.b = 0.0  # 不使用偏置
        
    def forward(self, x):
        z = np.dot(x, self.w) + self.b
        return z
    
    def loss(self, pred, gt):
        loss_value_sum = (pred - gt) ** 2
        return np.mean(loss_value_sum)
    
    def gradient(self, x, gt):
        pred = self.forward(x)
        gradient_w = (pred - gt) * x
        gradient_w = np.mean(gradient_w, axis=0)
        gradient_w = gradient_w[:, np.newaxis]
        gradient_b = (pred - gt)
        gradient_b = np.mean(gradient_b)
        
        return gradient_w, gradient_b
    
    def update(self, gradient_w, gradient_b, lr=0.01):
        self.w = self.w - lr * gradient_w  # 负梯度,所以是减
        self.b = self.b - lr * gradient_b  # 负梯度,所以是减
        
    def train(self, inp, gt, iterations=100, lr=0.01):
        pts = []
        losses = []
        
        for i in range(iterations):
            pts.append([self.w[5][0], self.w[9][0]])
            pred = self.forward(inp)
            loss = self.loss(pred, gt)
            gradient_w, gradient_b = self.gradient(x=inp, gt=gt)
            self.update(gradient_w, gradient_b, lr=lr)
            losses.append(loss)
            
            if i % 50 == 0:
                print(f"iter: {
      
      i}, point: {
      
      [self.w[5][0], self.w[9][0]]}, Loss: {
      
      loss}")
        return pts, losses


def data_load():
    # 2.2.1.1 读入数据
    datafile = "/data/data_01/lijiandong/Datasets/boston_house_price/housing.data"
    data = np.fromfile(datafile, sep=" ")


    # 2.2.1.2 数据形状变换
    feature_names = [ 'CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE','DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV' ]
    feature_num = len(feature_names)
    data = data.reshape([data.shape[0] // feature_num, feature_num])  # [N, 14]


    # 2.2.1.3 数据集划分
    ratio = 0.8
    offset = int(data.shape[0] * ratio)
    training_data = data[:offset]
    test_data = data[offset:]


    # 2.2.1.4 数据归一化处理
    # 计算train数据集的最大值和最小值
    maximums, minimums = training_data.max(axis=0), training_data.min(axis=0)

    # 对数据进行归一化处理
    for col_name in range(feature_num):
        # 训练集归一化
        training_data[:, col_name] = (training_data[:, col_name] - minimums[col_name]) / (maximums[col_name] - minimums[col_name])
        # 测试集归一化(确保了测试集上的数据也使用了与训练集相同的归一化转换,避免了引入测试集信息污染)
        test_data[:, col_name] = (test_data[:, col_name] - minimums[col_name]) / (maximums[col_name] - minimums[col_name])
        
    return training_data, test_data
    
    
if __name__ == "__main__":
    # 获取数据
    train_data, test_data = data_load()
    inputs = train_data[:, :-1]
    targets = train_data[:, -1:]
    
    # 创建网络
    model = Network_full_weights(num_of_weights=13)
    num_iteration = 2000
    
    # 启动训练
    points, losses = model.train(inputs, targets, num_iteration, lr=0.01)
    
    # 画出损失函数的变化趋势
    plot_x = np.arange(num_iteration)
    plot_y = np.array(losses)
    plt.plot(plot_x, plot_y)
    plt.xlabel("Iteration")
    plt.ylabel("Loss")
    plt.savefig("test_model_loss.png")
iter: 0, point: [-0.9901843263352099, 0.39909152329488246], Loss: 8.74595446663459
iter: 50, point: [-1.2510816196755312, 0.10537114344977061], Loss: 1.2774697388163774
iter: 100, point: [-1.248700167154482, 0.013128221346003856], Loss: 0.8996702309578077
iter: 150, point: [-1.207833898147619, -0.03959546161885696], Loss: 0.7517595081577438
iter: 200, point: [-1.1647360167738359, -0.08022928159585666], Loss: 0.639972629138626
...
iter: 1850, point: [-0.593635514684025, -0.23371944218812102], Loss: 0.10208939582379549
iter: 1900, point: [-0.5835935123829064, -0.23148564712851874], Loss: 0.09960847979302702
iter: 1950, point: [-0.5736614290879241, -0.2292815676008641], Loss: 0.09724348540057882

Insert image description here

2.2.4.7 Stochastic Gradient Descent (SGD)

In the program above, every evaluation of the loss function and the gradients uses the entire dataset. For the Boston house price task this is acceptable, since the training set contains only 404 samples. In real problems, however, datasets are often very large, and computing on the full data at every step is highly inefficient, "using a sledgehammer to crack a nut". Because the parameters are moved only a small step in the direction opposite to the gradient each time, the direction does not need to be that precise. A reasonable solution is to randomly draw a small portion of the data to represent the whole at each step, and to compute the gradients and the loss on this subset when updating the parameters. This method is called Stochastic Gradient Descent (SGD), and its core concepts are:

  • mini-batch: the batch of data drawn at each iteration is called a mini-batch.
  • batch_size: the number of samples contained in one mini-batch is called the batch_size.
  • epoch: the program iterates by drawing mini-batches one after another; once the entire dataset has been traversed, one round of training, i.e. one epoch, is complete. When launching training, the number of epochs num_epochs and the batch_size can be passed in as parameters.

The concrete implementation is introduced below together with the program; it involves modifying two parts of the code: the data processing and the training procedure.

2.2.4.7.1 Modifying the data-processing code

The data processing needs to implement two functions: splitting the data into batches and shuffling the samples (to achieve the effect of random sampling).

# 获取数据
train_data, test_data = data_load()
print(train_data.shape)  # (404, 14)

train_data contains 404 samples in total. With batch_size=10, samples 0–9 are taken as the first mini-batch, named train_data1:

train_data1 = train_data[:10]
print(train_data1.shape)  # (10, 14)

Use the data in train_data1 (samples 0–9) to compute the gradients and update the network parameters.

# 获取数据
train_data, test_data = data_load()
print(train_data.shape)  # (404, 14)

train_data1 = train_data[:10]
print(train_data1.shape)  # (10, 14)

model = Network_full_weights(13)
inputs = train_data1[:, :-1]
targets = train_data1[:, -1:]

points, losses = model.train(inputs, targets, 1, lr=0.01)
print(losses[0])  # 4.497480200683046

Then samples 10–19 are taken as the second mini-batch to compute the gradients and update the network parameters. Following this scheme, new mini-batches are drawn one after another and the network parameters are updated step by step.

Next, train_data is divided into mini-batches of size batch_size, as shown in the code below: train_data is split into $\left\lfloor \frac{404}{10} \right\rfloor + 1 = 41$ mini-batches, where the first 40 mini-batches each contain 10 samples and the last one contains only 4 samples.

# 获取数据
train_data, test_data = data_load()
batch_size = 10
n = len(train_data)
mini_batches = [train_data[k: k+batch_size] for k in range(0, n, batch_size)]  # 如果不够k+batch_size了,那么就取完

print(f"第一个mini_batch的shape为: {
      
      mini_batches[0].shape}")  # (10, 14)
print(f"最后一个mini_batch的shape为: {
      
      mini_batches[-1].shape}")  # (4, 14)

In addition, the mini-batches are read sequentially here, whereas SGD randomly selects a subset of samples to represent the population. To achieve the effect of random sampling, we first shuffle the order of the samples in train_data and then extract the mini-batches. Shuffling the samples requires the np.random.shuffle function, so let's first introduce its usage.

Explanation: extensive experiments have shown that a model is influenced more strongly by the data it sees in the later stages of training, much like the human brain remembers recent events more clearly. To prevent the ordering of the sample set from interfering with training, the samples should be shuffled. Of course, if the training samples are ordered by the time they were generated and we want the model to pay more attention to recently generated samples (the samples to be predicted are closer to the distribution of recent training samples), then this shuffling step can be skipped.

a = np.array(list(range(1, 13)))
print(f"Shuffle前的数组为: {
      
      a}")  # [ 1  2  3  4  5  6  7  8  9 10 11 12]
np.random.shuffle(a)  # inplace操作
print(f"Shuffle后的数组为: {
      
      a}")  # [ 6  9  1 11  4 10  7  3  2 12  5  8]

Run the code above several times and you will find that the order of the numbers after calling shuffle is different each time. The example above shuffles a 1-dimensional array; now let's look at the effect of shuffling a 2-dimensional array.

a = np.array(list(range(1, 13))).reshape([6, 2])
print(f"Shuffle前的数组为: \r\n{
      
      a}")
np.random.shuffle(a)  # inplace操作
print(f"Shuffle后的数组为: \r\n{
      
      a}")
Shuffle前的数组为: 
[[ 1  2]
 [ 3  4]
 [ 5  6]
 [ 7  8]
 [ 9 10]
 [11 12]]
 
Shuffle后的数组为: 
[[ 5  6]
 [ 1  2]
 [ 3  4]
 [11 12]
 [ 9 10]
 [ 7  8]]

Looking at the output, the array elements are shuffled along axis 0, while the order within axis 1 is preserved: the number 2 still immediately follows the number 1 and the number 8 still immediately follows the number 7, but whole rows are reordered as units, so the row [3, 4] does not necessarily stay behind [1, 2]. Next, this SGD sampling code is integrated into the train function of the Network class; the complete code is shown below.

# 获取数据
train_data, test_data = data_load()
    
# Shuffle
# 在训练集需要进行shuffle(随机打乱样本顺序)的情况下,测试集不需要shuffle
np.random.shuffle(train_data)
    
# 将 train_data 分成多个 mini_batch
batch_size = 10
n = len(train_data)
mini_batches = [train_data[k: k+batch_size] for k in range(0, n, batch_size)]
    
# 创建网络
model = Network_full_weights(13)
# 依次使用每个mini_batch的数据
for idx, mini_batch in enumerate(mini_batches):
    inputs = mini_batch[:, :-1]
    targets = mini_batch[:, -1:]
    loss = model.train(inputs, targets, iterations=1, lr=0.01)
    print(f"mini_batch[{idx}]'s loss: {loss}")
iter: 0, point: [-0.988214319272273, 0.4040241417140549], Loss: 5.310067748287684
mini_batch[0]'s loss: ([[-0.977277879876411, 0.41059850193837233]], [5.310067748287684])
iter: 0, point: [-1.0000665119658712, 0.3901139032526141], Loss: 9.948974311703784
mini_batch[1]'s loss: ([[-0.988214319272273, 0.4040241417140549]], [9.948974311703784])
iter: 0, point: [-1.0099096964900296, 0.3859828997852013], Loss: 4.582071696627923
...
mini_batch[39]'s loss: ([[-1.2343588100526453, 0.14439735680045832]], [1.0128812556524254])
iter: 0, point: [-1.237167617412837, 0.1414588534395813], Loss: 0.09470445497034455
mini_batch[40]'s loss: ([[-1.236799748470745, 0.14160559639242365]], [0.09470445497034455])
2.2.4.7.2 Modifying the training-process code

Each randomly drawn mini-batch is fed into the model to train the parameters. The core of the training process is two nested loops:

  1. The outer loop defines how many times the sample set is traversed during training, called an "epoch". The code is as follows:
for epoch in range(epochs):
  2. The inner loop iterates over the batches into which the sample set is split on each traversal; every batch must be trained on, and each pass is called an "iter (iteration)". The code is as follows:
for iter_idx, mini_batch in enumerate(mini_batches):

Inside the two loops is the classic four-step training flow: forward computation -> loss computation -> gradient computation -> parameter update, consistent with what was implemented before. The code is as follows:

# 前向推理
pred = model.forward(inputs)

# 计算损失
loss = model.loss(pred, targets)

# 计算梯度
gradient_w, gradient_b = model.gradient(inputs, targets)

# 更新参数
model.update(gradient_w, gradient_b, lr=0.01)

Integrating the two modified parts into the train function of the Network class, the final implementation is as follows.

import numpy as np
import json
import matplotlib.pyplot as plt


class Network_full_weights:
    def __init__(self, num_of_weights):
        np.random.seed(0)
        self.w = np.random.randn(num_of_weights, 1)  # 随机产生w的初始值
        self.b = 0.0  # 不使用偏置
        
    def forward(self, x):
        z = np.dot(x, self.w) + self.b
        return z
    
    def loss(self, pred, gt):
        loss_value_sum = (pred - gt) ** 2
        return np.mean(loss_value_sum)
    
    def gradient(self, x, gt):
        pred = self.forward(x)
        gradient_w = (pred - gt) * x
        gradient_w = np.mean(gradient_w, axis=0)
        gradient_w = gradient_w[:, np.newaxis]
        gradient_b = (pred - gt)
        gradient_b = np.mean(gradient_b)
        
        return gradient_w, gradient_b
    
    def update(self, gradient_w, gradient_b, lr=0.01):
        self.w = self.w - lr * gradient_w  # 负梯度,所以是减
        self.b = self.b - lr * gradient_b  # 负梯度,所以是减
        
    def train(self, inputs, epochs=100, batch_size=1, lr=0.01):
        n = len(inputs)
        losses = []
        
        for epoch in range(epochs):
            # 在每轮迭代开始之前,将训练数据的顺序随机打乱
            # 然后再按每次取batch_size条数据的方式取出
            np.random.shuffle(inputs)
            
            # 将训练数据进行拆分,每个mini_batch包含batch_size条的数据
            mini_batches = [inputs[k: k+batch_size] for k in range(0, n, batch_size)]
            
            # 对每个mini_batch进行训练
            for iter_idx, mini_batch in enumerate(mini_batches):
                x = mini_batch[:, :-1]
                gt = mini_batch[:, -1:]
                
                # infer
                pred = self.forward(x)

                # calc loss
                loss = self.loss(pred, gt)

                # record loss
                losses.append(loss)

                # calc gradients
                gradient_w, gradient_b = self.gradient(x, gt)
                
                # update weights and bias
                self.update(gradient_w, gradient_b, lr)
                
                # print
                print(f"Epoch: {
      
      epoch}\titer: {
      
      iter_idx}\tloss: {
      
      loss:.4f}")
        
        return losses


def data_load():
    # 2.2.1.1 读入数据
    datafile = "/data/data_01/lijiandong/Datasets/boston_house_price/housing.data"
    data = np.fromfile(datafile, sep=" ")


    # 2.2.1.2 数据形状变换
    feature_names = [ 'CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE','DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV' ]
    feature_num = len(feature_names)
    data = data.reshape([data.shape[0] // feature_num, feature_num])  # [N, 14]


    # 2.2.1.3 数据集划分
    ratio = 0.8
    offset = int(data.shape[0] * ratio)
    training_data = data[:offset]
    test_data = data[offset:]


    # 2.2.1.4 数据归一化处理
    # 计算train数据集的最大值和最小值
    maximums, minimums = training_data.max(axis=0), training_data.min(axis=0)

    # 对数据进行归一化处理
    for col_name in range(feature_num):
        # 训练集归一化
        training_data[:, col_name] = (training_data[:, col_name] - minimums[col_name]) / (maximums[col_name] - minimums[col_name])
        # 测试集归一化(确保了测试集上的数据也使用了与训练集相同的归一化转换,避免了引入测试集信息污染)
        test_data[:, col_name] = (test_data[:, col_name] - minimums[col_name]) / (maximums[col_name] - minimums[col_name])
        
    return training_data, test_data
    
    
if __name__ == "__main__":
    # get data
    train_data, test_data = data_load()
    
    # create model
    model = Network_full_weights(13)

    # start training
    losses = model.train(train_data, epochs=100, batch_size=100, lr=0.1)
    
    # 画出损失函数的变化趋势
    plot_x = np.arange(len(losses))
    plot_y = np.array(losses)
    plt.plot(plot_x, plot_y)
    plt.xlabel("Iteration = samples num / batch size * epochs")
    plt.ylabel("Loss")
    plt.savefig("test_model_loss.png")

Result:

Epoch: 0        iter: 0 loss: 10.1354
Epoch: 0        iter: 1 loss: 3.8290
Epoch: 0        iter: 2 loss: 1.9208
Epoch: 0        iter: 3 loss: 1.7740
Epoch: 0        iter: 4 loss: 0.3366
Epoch: 1        iter: 0 loss: 1.6275
Epoch: 1        iter: 1 loss: 0.9842
Epoch: 1        iter: 2 loss: 0.9121
Epoch: 1        iter: 3 loss: 0.9971
Epoch: 1        iter: 4 loss: 0.6071
...
Epoch: 98       iter: 0 loss: 0.0662
Epoch: 98       iter: 1 loss: 0.0468
Epoch: 98       iter: 2 loss: 0.0267
Epoch: 98       iter: 3 loss: 0.0340
Epoch: 98       iter: 4 loss: 0.0203
Epoch: 99       iter: 0 loss: 0.0342
Epoch: 99       iter: 1 loss: 0.0688
Epoch: 99       iter: 2 loss: 0.0377
Epoch: 99       iter: 3 loss: 0.0317
Epoch: 99       iter: 4 loss: 0.0061

Insert image description here

Observing the loss values above, stochastic gradient descent speeds up the training process, but because each update computes the loss and the gradients on only a small number of samples, the loss curve oscillates as it decreases. A small smoothing trick for plotting is sketched below.
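
If a smoother curve is preferred for visualization, the recorded losses can be averaged over a sliding window before plotting. This is only an illustrative sketch, not part of the original tutorial; it assumes the losses list returned by model.train above, and the window size of 25 is an arbitrary choice.

import numpy as np
import matplotlib.pyplot as plt

def moving_average(values, window=25):
    # Simple running mean, used only to smooth the plotted curve.
    kernel = np.ones(window) / window
    return np.convolve(values, kernel, mode="valid")

# `losses` is the list returned by model.train(...) in the code above.
smoothed = moving_average(np.array(losses), window=25)
plt.plot(np.arange(len(smoothed)), smoothed)
plt.xlabel("Iteration")
plt.ylabel("Smoothed loss")
plt.savefig("test_model_loss_smoothed.png")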

Notes:

  1. Because the house price dataset is quite small, it is hard to feel the speed-up brought by stochastic gradient descent.
  2. In machine learning, the number of iterations is the total number of parameter updates performed during training. It is obtained by dividing the number of samples by the batch size and multiplying by the number of epochs: $\text{Iterations} = \frac{\text{Number of samples}}{\text{Batch Size}} \times \text{Epochs}$ (rounded up per epoch when the batch size does not divide the sample count evenly). The training run consists of multiple epochs, each epoch splits the dataset into several batches, and the batch size determines how many batches there are; a quick numeric check for the run above is given below.
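
As a quick numeric check of this formula (a small sketch added for illustration, not from the original tutorial): the run above used 404 training samples, batch_size=100 and 100 epochs, so every epoch produces ceil(404 / 100) = 5 mini-batches and the loop records 5 × 100 = 500 loss values in total, which matches the length of the plotted curve.

import math

num_samples = 404   # samples in the Boston training split used above
batch_size = 100    # batch_size passed to model.train in the example
epochs = 100        # epochs passed to model.train in the example

iters_per_epoch = math.ceil(num_samples / batch_size)  # 5 mini-batches per epoch
total_iterations = iters_per_epoch * epochs            # 500 parameter updates

print(iters_per_epoch, total_iterations)  # 5 500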

2.2.5 Model saving

NumPy provides the np.save interface, which can directly save the model's weight arrays to files in the .npy format.

np.save("w.npy", model.w)  # 保存模型参数
np.save("b.npy", model.b)  # 保存模型偏置

2.2.6 Summary

In this chapter we introduced in detail how to implement the gradient descent algorithm with NumPy, and built and trained a simple linear model to predict Boston housing prices. To summarize, modeling house price prediction with a neural network has three key steps:

  1. Build the network, initialize the parameters $w$ and $b$, and define how the prediction and the loss function are computed.
  2. Choose a random starting point, and establish how the gradients are computed and how the parameters are updated.
  3. Draw part of the data from the full dataset as a mini_batch, compute the gradients and update the parameters, and keep iterating until the loss function almost stops decreasing.

3. Use PaddlePaddle to rewrite the Boston house price prediction task

The cases in this tutorial cover mainstream application scenarios such as computer vision, natural language processing, and recommender systems. The workflow for implementing these cases with PaddlePaddle is essentially the same, as shown in the figure below.

Insert image description here

In Chapter 2 we learned how to implement the Boston house price prediction task with Python and NumPy. In this chapter we rewrite the task with PaddlePaddle and compare the similarities and differences between the two approaches. Before processing the data, the relevant modules of the PaddlePaddle framework need to be imported first.

import paddle
from paddle.nn import Linear
import paddle.nn as nn
import paddle.nn.functional as F
import paddle.optimizer as opt  # 使用别名 opt,避免与后面的优化器实例 optimizer 重名
import numpy as np
import os
import random

The meaning of each imported module is as follows:

  • paddle: the main PaddlePaddle library. Aliases of commonly used APIs are kept in the paddle root namespace, currently including all APIs under:
    • paddle.tensor
    • paddle.framework
    • paddle.device
  • Linear: the fully connected layer of a neural network, which implements the basic neuron structure of a weighted sum over all inputs. In the house price prediction task, a linear regression model is implemented with a neural network that has only one layer (a fully connected layer).
  • paddle.nn: networking-related APIs, including Linear, the convolution Conv2D, the recurrent neural network LSTM, the loss function CrossEntropyLoss, the activation function ReLU, etc.;
  • paddle.nn.functional: like paddle.nn, it also contains networking-related APIs, such as the function linear, the activation function relu, etc. Modules with the same name in the two packages have the same functionality and essentially the same performance. The difference is that everything under paddle.nn is a class that holds its own parameters, while everything under paddle.nn.functional is a function whose inputs must be passed in manually at call time. In practice, layers with learnable parameters such as convolutions and fully connected layers are best used via paddle.nn, while parameter-free operations such as activation functions and pooling can use paddle.nn.functional; the sketch after this list contrasts the two styles.
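
As a small illustration of this difference (a sketch added here, not part of the original tutorial), the same ReLU operation can be written in either style:

import paddle
import paddle.nn as nn
import paddle.nn.functional as F

x = paddle.to_tensor([[-1.0, 2.0, -3.0]])

# Class style (paddle.nn): the layer object is created once; for layers with
# learnable parameters it would also hold those parameters.
relu_layer = nn.ReLU()
y1 = relu_layer(x)

# Function style (paddle.nn.functional): everything is passed in at call time.
y2 = F.relu(x)

print(y1.numpy(), y2.numpy())  # both print [[0. 2. 0.]]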

Note: PaddlePaddle supports two ways of writing deep learning models: the dynamic graph mode, which is easier to debug, and the static graph mode, which offers better performance and easier deployment.

  • Dynamic graph mode (imperative programming paradigm, analogous to Python): executed interpretively. Users do not need to define the complete network structure in advance; every line of network code is executed immediately and its result can be inspected right away;
  • Static graph mode (declarative programming paradigm, analogous to C++): compile first, then execute. Users must define the complete network structure in advance, which is then compiled and optimized before execution produces the results.

PaddlePaddle 2.0 and later use dynamic graph mode by default and also provide full dynamic-to-static support: developers only need to add the to_static decorator, and PaddlePaddle automatically converts the dynamic graph program into a static graph program, which can be used for training and for saving a static model for inference deployment.
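
For reference, a minimal dynamic-to-static sketch might look as follows; TinyNet and the export path ./tiny_net are made-up names for illustration and are not part of this tutorial's house price model.

import paddle
import paddle.nn as nn

class TinyNet(nn.Layer):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(13, 1)

    # to_static converts the dynamic-graph forward into a static-graph program.
    @paddle.jit.to_static
    def forward(self, x):
        return self.fc(x)

net = TinyNet()
out = net(paddle.randn([4, 13]))    # run once in the usual dynamic style
paddle.jit.save(net, "./tiny_net")  # export a static model for inference deployment
print(out.shape)                    # [4, 1]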

3.1 Data processing

The data-processing code does not depend on the framework and is the same as the code used to build the house price prediction task in Python. Refer to Chapter 2 for the detailed explanation; it is not repeated here.

def load_data():
    # 从文件导入数据
    datafile = './work/housing.data'
    data = np.fromfile(datafile, sep=' ', dtype=np.float32)

    # 每条数据包括14项,其中前面13项是影响因素,第14项是相应的房屋价格中位数
    feature_names = [ 'CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', \
                      'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV' ]
    feature_num = len(feature_names)

    # 将原始数据进行Reshape,变成[N, 14]这样的形状
    data = data.reshape([data.shape[0] // feature_num, feature_num])

    # 将原数据集拆分成训练集和测试集
    # 这里使用80%的数据做训练,20%的数据做测试
    # 测试集和训练集必须是没有交集的
    ratio = 0.8
    offset = int(data.shape[0] * ratio)
    training_data = data[:offset]

    # 计算train数据集的最大值,最小值
    maximums, minimums = training_data.max(axis=0), training_data.min(axis=0)
    
    # 记录数据的归一化参数,在预测时对数据做归一化
    global max_values
    global min_values
   
    max_values = maximums
    min_values = minimums
    
    # 对数据进行归一化处理
    for i in range(feature_num):
        data[:, i] = (data[:, i] - min_values[i]) / (maximums[i] - minimums[i])

    # 训练集和测试集的划分比例
    training_data = data[:offset]
    test_data = data[offset:]
    return training_data, test_data
# 验证数据集读取程序的正确性
training_data, test_data = load_data()
print(training_data.shape)
print(training_data[1,:])
(404, 14)

[2.35922547e-04 0.00000000e+00 2.62405723e-01 0.00000000e+00
 1.72839552e-01 5.47997713e-01 7.82698274e-01 3.48961979e-01
 4.34782617e-02 1.14822544e-01 5.53191364e-01 1.00000000e+00
 2.04470202e-01 3.68888885e-01]

3.2 Model design

The essence of model definition is to define the network structure of the linear regression. PaddlePaddle recommends defining the model network by creating a Python class that inherits from paddle.nn.Layer and defines the __init__ and forward functions. forward is the function the framework designates for the forward-computation logic; it is executed automatically whenever the model instance is called, and the network layers it uses must be declared in __init__.

  • Define the __init__ function: declare the implementation of each network layer in the class initializer. For the house price prediction task only one fully connected layer needs to be defined, keeping the model structure consistent with Chapter 2;
  • Define the forward function: build the network structure, implement the forward computation, and return the prediction result. In this task it returns the predicted house price.
class Regression(nn.Layer):
    def __init__(self):
        super(Regression, self).__init__()
        
        # 定义一层全连接层,输入维度是13,输出维度是1
        self.fc = nn.Linear(in_features=13, out_features=1)
        
    
    def forward(self, inputs):
        x = self.fc(inputs)
        return x
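
To confirm that calling the model instance automatically triggers forward, a quick check on a random batch can be used (the random input below is purely illustrative, not from the original tutorial):

import paddle

# Illustrative check: calling the instance runs forward() on a random batch
# of 8 samples, each with 13 features.
model = Regression()
dummy = paddle.randn([8, 13])
out = model(dummy)
print(out.shape)              # [8, 1]
print(model.fc.weight.shape)  # [13, 1] -- the learnable weights of the Linear layer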

3.3 Training configuration

The training configuration process is shown in the figure below:

Insert image description here

if __name__ == "__main__":
    # 实例化模型
    model = Regression()
    
    # 开启模型训练模式
    model.train()
    
    # 加载数据
    training_data, test_data = load_data()
    
    # 定义优化算法,使用SGD
    optimizer = opt.SGD(learning_rate=0.01, parameters=model.parameters())

Note: a model instance has two states: the training state (.train()) and the evaluation/prediction state (.eval()). Training requires both forward computation and backpropagation of gradients, while prediction only needs forward computation, so the running state must be specified for the model. There are two reasons:

  1. Some advanced operators behave differently in the two states, for example Dropout and BatchNorm (introduced in detail in the later "Computer Vision" chapter);
  2. From the perspective of performance and memory, the evaluation state saves memory (no reverse gradients need to be recorded) and runs faster.

With the PaddlePaddle framework, the optimizer is configured simply by setting the arguments of paddle.optimizer.SGD and calling it, which greatly simplifies the process; other optimizers can be swapped in the same way, as sketched below.
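
For example, swapping SGD for Momentum or Adam changes only the optimizer line. The sketch below uses a stand-in Linear layer and example learning rates, so the exact values are illustrative rather than tuned for this task.

import paddle

# Stand-in for the Regression model above, just to make the sketch self-contained.
net = paddle.nn.Linear(13, 1)

opt_sgd = paddle.optimizer.SGD(learning_rate=0.01, parameters=net.parameters())
opt_momentum = paddle.optimizer.Momentum(learning_rate=0.01, momentum=0.9,
                                         parameters=net.parameters())
opt_adam = paddle.optimizer.Adam(learning_rate=0.001, parameters=net.parameters())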

3.4 Training process

The training process uses two nested loops:

  • Inner loop: traverses the whole dataset once, in batches. If the dataset contains 1000 samples and each batch holds 10 samples, traversing the dataset once takes 1000 / 10 = 100 batches, i.e. the inner loop runs 100 times.
  • Outer loop: defines how many times the dataset is traversed, set through the epochs parameter.
# 外层循环: 定义遍历数据集的次数,通过参数 epochs 设置
for epoch in range(epochs):
    ...
    # 内层循环: 负责整个数据集的一次遍历,采用分批次方式(batch)。
    for iter_idx, mini_batch in enumerate(mini_batches):
        ...

Note: the value of batch_size affects the training result:

  • If batch_size is too large, memory consumption and computation time per step increase, while the training effect does not improve noticeably (each update only moves the parameters a small step in the direction opposite to the gradient, so the direction does not need to be especially precise);
  • If batch_size is too small, the samples in each batch have little statistical significance, and the computed gradient direction may deviate considerably.

Since the training dataset of the house price prediction model is small, batch_size is set to 10. Each pass of the inner loop performs the steps shown in the figure below; the computation is exactly the same as when the model was written by hand in Python.

Insert image description here

  • Data preparation: convert a batch of data into np.array format first, and then into Tensor format;
  • Forward computation: feed a batch of sample data into the network and compute the output;
  • Loss computation: take the forward output and the true house price as inputs and compute the loss value (Loss) with the loss-function API square_error_cost;
  • Backpropagation: run the backward function to compute the gradients of each layer from back to front, and update the parameters with the configured optimizer (the step function).
epochs = 100  # 总样本需要训练的轮次
batch_size = 10  # 每个 mini_batch 的样本数量

# 外层循环: 定义遍历数据集的次数,通过参数 epochs 设置
for epoch in range(epochs + 1):
    # 在每个 epoch 开始之前,将训练数据进行 Shuffle
    np.random.shuffle(training_data)
    
    # 将训练数据按照batch_size进行拆分
    mini_batches = [training_data[k: k + batch_size] for k in range(0, len(training_data), batch_size)]
    
    epoch_loss = []  # 记录损失
    
    # 内层循环: 负责整个数据集的一次遍历,采用分批次方式(mini-batch)。
    for iter_idx, mini_batch in enumerate(mini_batches):
        x = np.array(mini_batch[:, :-1])  # 训练样本
        y = np.array(mini_batch[:, -1:])  # 目标值

        # ndarray -> tensor
        inputs = paddle.to_tensor(x)
        targets = paddle.to_tensor(y)
        
        # infer
        preds = model(inputs)
        
        # calc losses
        losses = F.square_error_cost(input=preds, label=targets)
        
        # mean loss
        avg_loss = paddle.mean(losses, axis=0)
        epoch_loss.append(avg_loss.numpy())
            
        # 反向传播,计算每层参数的梯度值
        avg_loss.backward()

        # 让优化器更新参数
        optimizer.step()
        
        # 为下一个 epoch 清空当前梯度
        optimizer.clear_grad()
        
    # 打印损失
    print(f"Epoch: {
      
      epoch}\tLoss: {
      
      np.average(epoch_loss):.4f}")
    epoch_loss.clear()  # 清空历史损失

Result:

W0805 03:36:18.712543 116394 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 7.5, Driver API Version: 10.2, Runtime API Version: 10.2
W0805 03:36:18.715507 116394 gpu_resources.cc:149] device: 0, cuDNN Version: 8.2.
Epoch: 0        Loss: 0.2431
Epoch: 1        Loss: 0.0846
Epoch: 2        Loss: 0.0774
...
Epoch: 99       Loss: 0.0133
Epoch: 100      Loss: 0.0134

Just as in PyTorch, forward computation, loss computation and gradient backpropagation each take only one or two lines of code. PaddlePaddle automatically performs the backward gradient computation and the parameter update, so we no longer need to write that code ourselves.

3.5 Save and test the model

3.5.1 Save model

Use the paddle.save API to save the model's current parameters, model.state_dict(), to a file, so that programs performing prediction or validation can load them later.

# 保存模型
paddle.save(obj=model.state_dict(), path="LR_model.pdparams")
print(f"模型保存成功, 路径为: LR_model.pdparams")

Explanation: why save the model instead of directly using the trained model instance for prediction?

In theory, predictions could be made directly with the model instance, but in practice training the model and using the model are usually different scenarios. Training typically runs on a large number of offline servers (which do not serve external customers or users online), while prediction is usually served by online servers or by embedding the trained model into mobile phones or other end devices. Therefore the "save the model first, then load the model" approach used in this tutorial is closer to how models are used in real scenarios.

3.5.2 Test the model

Next, we select one data sample and test the model's prediction on it. The testing procedure is consistent with how the model is used in an application scenario and can be divided into three steps:

  1. Configure the machine resources used for prediction.
  2. Load the trained parameters LR_model.pdparams into the model instance. This is done with two statements:
    1. the first reads the model parameters from the file;
    2. the second loads the parameters into the model.
    • After loading, the model state needs to be switched to eval() (evaluation).
    • As mentioned above, a model in the training state must support both forward computation and backward gradient propagation, which makes its implementation heavier, while a model in the evaluation/prediction state only needs forward computation, so its implementation is simpler and its performance better.
  3. Feed the features of the sample to be predicted into the model and print the prediction result.

We use the load_one_example function to extract one sample from the dataset as the test sample; its implementation is as follows.

def load_one_example(test_data):
    # 从已经加载的测试集中随机选择一条作为测试数据
    idx = np.random.randint(low=0, high=test_data.shape[0])
    one_data, label = test_data[idx, :-1], test_data[idx, -1:]
    
    # 修改该条数据的shape为 [1, 13]
    one_data = one_data.reshape([1, -1])
    
    return one_data, label


# 读取模型
model_dict = paddle.load("LR_model.pdparams")
model = Regression()
model.load_dict(state_dict=model_dict)
# 转换模型状态
model.eval()
    
# 读取单条测试数据
one_data, label = load_one_example(test_data=test_data)
one_data = paddle.to_tensor(one_data)
    
# 模型推理
pred = model(one_data)
    
# 对推理结果做反归一化处理
pred = pred * (max_values[-1] - min_values[-1]) + min_values[-1]
# 对真实标签做反归一化处理
label = label * (max_values[-1] - min_values[-1]) + min_values[-1]
    
print(F"推理结果为: {
      
      pred.numpy()}\t对应的真实值为: {
      
      label}")

Result:

推理结果为: [[16.670935]]       对应的真实值为: [13.6]

Comparing the model's prediction with the real house price shows that the prediction is reasonably close to the true value. House price prediction is only the simplest possible model, yet writing it with PaddlePaddle already takes much less effort; for the more complex models used in industrial practice, the savings from using the framework are far greater. In addition, PaddlePaddle is performance-optimized for many application scenarios and hardware resources, and its functionality and performance go far beyond what hand-written models can offer.

4. Syntax differences between PaddlePaddle and PyTorch

4.1 Forward propagation: same

"""=============== PyTorch ==============="""
# 前向传播
loss = model(input_data, target)


"""=============== Paddle ==============="""
# 前向传播
loss = model(input_data, target)

4.2 Backpropagation: Same

"""=============== PyTorch ==============="""
loss.backward()


"""=============== Paddle ==============="""
# 反向传播
loss.backward()

4.3 Update parameters: same

"""=============== PyTorch ==============="""
# 定义优化器
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)

# 更新参数
optimizer.step()


"""=============== Paddle ==============="""
# 定义优化器
optimizer = paddle.optimizer.SGD(learning_rate=learning_rate, parameters=model.parameters())

# 更新参数
optimizer.step()

4.4 Clear gradient: different

"""=============== PyTorch ==============="""
# 清空梯度
optimizer.zero_grad()


"""=============== Paddle ==============="""
# 清空梯度
optimizer.clear_grad()

4.5 Saving model parameters: same API, different file suffix

"""=============== PyTorch ==============="""
# 保存模型参数
torch.save(model.state_dict(), "model.pt")  # 后缀一般是 .pt 或 .pth


"""=============== Paddle ==============="""
# 保存模型参数
paddle.save(model.state_dict(), "model.pdparams")  # 后缀一般是 .pdparams

4.6 Reading model: different

"""=============== PyTorch ==============="""
# 加载模型参数
model_state_dict = torch.load("model.pt")
model.load_state_dict(model_state_dict)


"""=============== Paddle ==============="""
# 加载模型参数
model_state_dict = paddle.load("model.pdparams")
model.set_state_dict(model_state_dict)

References

  1. https://www.paddlepaddle.org.cn/tutorials/projectdetail/3520300
  2. https://www.paddlepaddle.org.cn/tutorials/projectdetail/5836960
  3. https://www.paddlepaddle.org.cn/tutorials/projectdetail/4309126

Reprinted from: blog.csdn.net/weixin_44878336/article/details/132096948