Author: Liu Guangcong
ZTE senior system architect, focusing on machine learning algorithms, distributed system architecture and optimization.
Original: TensorFlow Architecture and Design: OP Essentialism
---------------------------------------------------------------------------------------------------------------------------------------------------------------
TensorFlow's architecture is split at the C API into two subsystems, the "front-end" and the "back-end". The front-end acts as the Client: it constructs the computation graph, passes the GraphDef (in Protobuf format) to the Master of the back-end system, and starts the execution of the computation graph.
Finally, the Master splits the graph and registers the resulting GraphDef subgraph fragments with the Workers through the RegisterGraph interface.
GraphDef is therefore the knowledge model that describes the computation graph, and the entire TensorFlow computation process revolves around it.
The unit of computation in TensorFlow is the OP, which represents some kind of abstract computation. This chapter first describes the metadata models NodeDef and OpDef, and then walks through the flow of metadata with a simple example.
Metadata
An OP represents some kind of abstract computation that has 0 or more "inputs/outputs" and 0 or more "attributes". The inputs and outputs take the form of Tensors.
In the system implementation, an OP's metadata is described by OpDef in Protobuf format. This enables data exchange between the front-end and the back-end and gives both a unified domain model.
Definition of OpDef
An OpDef definition includes the OP's name, input and output lists, attribute list, optimization options, and so on. Attributes are typically used to describe the type, size, default value, constraints, and other characteristics of the OP.
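As a rough illustration, the shape of this metadata can be mirrored with plain Python dataclasses. This is a simplified sketch, not the actual Protobuf definition: the field names loosely follow those of OpDef, only a subset is shown, and the ZerosLike values are filled in by hand as an assumption.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ArgDef:
    name: str
    type_attr: str = ""  # name of the attribute that supplies the element type


@dataclass
class AttrDef:
    name: str
    type: str            # kind of attribute, e.g. "type", "int", "shape"
    description: str = ""


@dataclass
class OpDef:
    name: str
    input_arg: List[ArgDef] = field(default_factory=list)
    output_arg: List[ArgDef] = field(default_factory=list)
    attr: List[AttrDef] = field(default_factory=list)
    is_stateful: bool = False  # stands in for the optimization-related options


# Hand-written metadata for an OP with one input, one output, one attribute.
zeros_like_def = OpDef(
    name="ZerosLike",
    input_arg=[ArgDef("x", type_attr="T")],
    output_arg=[ArgDef("y", type_attr="T")],
    attr=[AttrDef("T", "type")],
)
```

Keeping the metadata as plain data, separate from any computation, is what lets the same description be shared by the front-end and the back-end.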
OP naming
OPs are indexed by name, so an OP's name must be globally unique. By convention, OP names use the CamelCase style, while the Python front-end uses the snake_case style. The snake_case functions are often called "OP constructors" and form the programming interface (API) exposed to users.
OPs whose names start with an underscore are reserved for the system's internal implementation. For example, _Send and _Recv are the OPs used for inter-device communication, and _Source and _Sink identify the start and end nodes of the computation graph.
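The naming convention can be sketched in a few lines of Python. The helper names camel_to_snake and is_internal_op are hypothetical, and the conversion rule is a simplification: the real front-end code generator may treat some names specially.

```python
import re


def camel_to_snake(op_name: str) -> str:
    """Convert a CamelCase OP name to a snake_case constructor name (simplified)."""
    # Insert "_" at every boundary between a lowercase letter or digit and an
    # uppercase letter, then lowercase the whole string.
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", op_name).lower()


def is_internal_op(op_name: str) -> bool:
    """OPs whose names start with an underscore are reserved for internal use."""
    return op_name.startswith("_")


print(camel_to_snake("ZerosLike"))
print(is_internal_op("_Send"), is_internal_op("ZerosLike"))
```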
Inputs and outputs
The inputs and outputs of an OP are Tensors, and they fall into the following cases.
- 0 Tensors
  - zero inputs
  - zero outputs
- 1 Tensor
  - type determined
  - type undetermined
- Multiple Tensors
  - same type
  - different types
In contrast to an OP's attributes, its inputs are dynamic: their values change on every iteration (Step).
Attributes
An OP can have an "attribute set" that describes the type, size, default value, constraints, and other characteristics of the OP's inputs and outputs. Attribute values (AttrValue) are fixed when the computation graph is constructed; they are carried by NodeDef and passed to the back-end execution system through GraphDef.
In other words, "attribute definition" and "attribute value setting" are two separate steps. The definition is fixed when the OP is registered and is described by AttrDef; the value is fixed when the computation graph is constructed (when the OP is added to the graph) and is described by AttrValue.
Relative to its inputs, an OP's attributes are static. Attribute values, including the type, size, and shape of the inputs and outputs, are determined during graph construction and do not change across iterations.
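The separation between the two steps can be sketched as follows. This is an illustrative model only: the AttrDef class here, the ATTR_DEFS table, the set_attr_values helper, and the allowed-type constraint are all assumptions made for the example, not TensorFlow's implementation.

```python
from dataclasses import dataclass
from typing import Any, Dict, Tuple


@dataclass(frozen=True)
class AttrDef:
    name: str
    allowed: Tuple[str, ...]  # constraint on the values the attribute may take


# Step 1, at OP registration: the OP declares which attributes it has.
ATTR_DEFS = {"T": AttrDef("T", allowed=("DT_INT32", "DT_FLOAT"))}


def set_attr_values(values: Dict[str, Any]) -> Dict[str, Any]:
    """Step 2, at graph construction: check values against the definitions."""
    for name, value in values.items():
        attr_def = ATTR_DEFS.get(name)
        if attr_def is None:
            raise KeyError(f"unknown attribute: {name}")
        if value not in attr_def.allowed:
            raise ValueError(f"{name}={value} violates {attr_def.allowed}")
    # The returned mapping plays the role of the values carried by a NodeDef.
    return dict(values)


node_attrs = set_attr_values({"T": "DT_INT32"})
```

Once set, the values in node_attrs stay the same for every Step, while the Tensors flowing into the node change.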
NodeDef definition
OP index
NodeDef uses its op field to look up the corresponding OpDef in the OpRegistry.
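Input list
The input field specifies the node's input list, which is also the most important knowledge for constructing the computation graph. An entry takes one of 2 forms, denoting either a normal edge or a control dependency edge.
By convention, to simplify parsing, normal edges are stored first in the input list, followed by control dependency edges.
- node:src_output denotes a normal edge, which carries the Tensor data flow. Here node is the name of the predecessor node and src_output is the index of the predecessor's output edge. As a special case, when src_output is 0, the 0 can be omitted.
- ^node denotes a control dependency edge, where node is the name of the predecessor node.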
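The two edge forms can be parsed with a small helper. This is a sketch under stated assumptions: InputEdge and parse_input are hypothetical names, and using -1 as the slot value for control edges is a choice made for this example.

```python
from typing import NamedTuple

CONTROL_SLOT = -1  # value used in this sketch to mark a control dependency edge


class InputEdge(NamedTuple):
    node: str         # name of the predecessor node
    src_output: int   # index of the predecessor's output, or CONTROL_SLOT
    is_control: bool


def parse_input(spec: str) -> InputEdge:
    """Parse one entry of a node's input list."""
    if spec.startswith("^"):
        # "^node": control dependency edge, carries no Tensor.
        return InputEdge(spec[1:], CONTROL_SLOT, True)
    node, sep, index = spec.rpartition(":")
    if not sep:
        # No ":" present, so the output index defaults to 0.
        return InputEdge(spec, 0, False)
    return InputEdge(node, int(index), False)


edges = [parse_input(s) for s in ["n1", "n2:1", "^n3"]]
```

Because normal edges come first by convention, a parser can stop treating entries as data inputs as soon as it meets the first entry that starts with "^".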
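Device specification
The device field lets users specify their own device placement scheme. For example:
- "@other/node": place the node on the same device as the node other/node;
- "/job:worker/replica:0/task:1/gpu:3": full specification;
- "/job:worker/gpu:3": partial specification;
- "": empty specification.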
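The four forms listed above can be told apart with a short parser. The function parse_device and the keys of the dictionary it returns are hypothetical; the sketch handles only the string shapes shown in this section.

```python
from typing import Dict, Union


def parse_device(spec: str) -> Dict[str, Union[str, int]]:
    """Parse a device string; fields missing from the spec are left out."""
    if spec.startswith("@"):
        # Colocation request: use the same device as the named node.
        return {"colocate_with": spec[1:]}
    fields: Dict[str, Union[str, int]] = {}
    for part in spec.split("/"):
        if not part:
            continue  # skip the empty piece before the leading "/"
        key, _, value = part.partition(":")
        if key == "job":
            fields["job"] = value
        elif key in ("replica", "task"):
            fields[key] = int(value)
        else:
            # Remaining piece names the device type and index, e.g. "gpu:3".
            fields["device_type"] = key
            fields["device_index"] = int(value)
    return fields


full = parse_device("/job:worker/replica:0/task:1/gpu:3")
partial = parse_device("/job:worker/gpu:3")
empty = parse_device("")
```

A partial or empty result leaves the unspecified fields for the placement logic to fill in.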
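Attribute value list
During graph construction, the OP's attribute values are determined, including the types and shapes of its inputs and outputs. The attribute values are carried in the attr list of the NodeDef.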
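Symbolic programming
TensorFlow computation is deferred, which is typical of the symbolic programming paradigm. Along the time axis, the computation divides into 2 phases:
- Graph construction: builds the computation graph;
- Graph execution: runs the computation graph.
During system initialization, the system scans and registers all OPs and stores them in the OpRegistry.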
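The two phases of deferred computation can be illustrated without TensorFlow at all. The following is a toy model, not TensorFlow's implementation: Node, constant, add, and run are hypothetical names invented for this sketch.

```python
from typing import Callable, List


class Node:
    """A symbolic node: it records what to compute, not the result."""

    def __init__(self, name: str, fn: Callable[..., object], inputs: List["Node"]):
        self.name = name
        self.fn = fn
        self.inputs = inputs


def constant(value, name: str) -> Node:
    return Node(name, lambda: value, [])


def add(a: Node, b: Node, name: str) -> Node:
    return Node(name, lambda x, y: x + y, [a, b])


def run(node: Node):
    """Graph execution: evaluate the inputs recursively, then the node itself."""
    return node.fn(*(run(i) for i in node.inputs))


# Graph construction: only the structure is recorded, nothing is computed.
a = constant(2, "a")
b = constant(3, "b")
c = add(a, b, "c")

# Graph execution: the computation happens only now.
result = run(c)
```

Here c is just a description of "a plus b" until run is called, which is the essence of the symbolic style.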
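Registering an OP
In principle, OP registration happens during system initialization. The back-end system can register OPs with the REGISTER_OP utility macro, and the front-end system has a similar OP registration mechanism.
Registering an OP with REGISTER_OP is in effect a translation from the REGISTER_OP description to its OpDef representation. OpDefBuilder builds the OP's input list, output list, and attribute list through chained calls to the Input, Output, and Attr methods. Finally, the Finalize member function parses the string representations and translates them into the internal OpDef representation, which is then registered in the OpRegistry.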
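For example, REGISTER_OP("ZerosLike") registers the ZerosLike OP (zeros_like in the Python front-end) with the system; at runtime the description is translated into its OpDef representation.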
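The builder-then-register flow can be modelled in Python. This is a sketch, not the C++ implementation: the class and registry below are hypothetical stand-ins for OpDefBuilder and OpRegistry, the OpDef is reduced to a dictionary, and the spec strings "x: T", "y: T", and "T: type" are assumed for illustration since the original text does not show the registration body.

```python
class OpDefBuilder:
    """Collects string specs through chained calls, then parses them."""

    def __init__(self, name):
        self.name = name
        self._inputs, self._outputs, self._attrs = [], [], []

    def input(self, spec):
        self._inputs.append(spec)
        return self  # returning self is what enables chaining

    def output(self, spec):
        self._outputs.append(spec)
        return self

    def attr(self, spec):
        self._attrs.append(spec)
        return self

    def finalize(self):
        """Parse each "name: kind" string into a (name, kind) pair."""
        def split(spec):
            name, _, kind = spec.partition(":")
            return name.strip(), kind.strip()

        return {
            "name": self.name,
            "input_arg": [split(s) for s in self._inputs],
            "output_arg": [split(s) for s in self._outputs],
            "attr": [split(s) for s in self._attrs],
        }


OP_REGISTRY = {}  # maps OP name to its parsed definition


def register_op(builder):
    op_def = builder.finalize()
    if op_def["name"] in OP_REGISTRY:
        raise ValueError("OP names must be globally unique: " + op_def["name"])
    OP_REGISTRY[op_def["name"]] = op_def
    return op_def


register_op(OpDefBuilder("ZerosLike").input("x: T").output("y: T").attr("T: type"))
```

Looking up OP_REGISTRY["ZerosLike"] afterwards corresponds to a NodeDef resolving its op field to an OpDef.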
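Constructing an OP
On the front-end, users construct OPs through OP constructors and add them to the computation graph. During graph construction, the types and shapes of an OP's inputs and outputs are determined, and so are its attribute values.
Constructing the computation graph is in effect defining the GraphDef. OP attribute values are carried by NodeDef, and they are fixed during graph construction.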
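When execution starts, calling Session.run passes the entire GraphDef to the back-end and starts the execution of the computation graph. For example, consider the following graph construction: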
import tensorflow as tf

tensor = tf.constant([1, 2], name="n1")
zeros = tf.zeros_like(tensor, name="n2")
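The upstream node of ZerosLike is n1, whose output edge src_output=0 flows into ZerosLike. The value of ZerosLike's attribute T is then automatically inferred as DT_INT32, and the two nodes form a simple computation graph.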
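The resulting two-node graph can be mimicked with plain dictionaries to show where the inferred value ends up. This is a hand-rolled sketch: make_node and infer_dtype are hypothetical helpers, the dictionaries only imitate the layout of NodeDef and GraphDef, and the inference rule is deliberately simplistic.

```python
def make_node(name, op, inputs=(), attr=None):
    """Build a dictionary that imitates the main fields of a NodeDef."""
    return {"name": name, "op": op, "input": list(inputs), "attr": dict(attr or {})}


def infer_dtype(values):
    """Toy inference rule: all-integer data is DT_INT32, anything else DT_FLOAT."""
    if all(isinstance(v, int) for v in values):
        return "DT_INT32"
    return "DT_FLOAT"


graph_def = {"node": []}

# n1: the constant; its dtype is inferred from the Python values.
n1 = make_node("n1", "Const", attr={"dtype": infer_dtype([1, 2]), "value": [1, 2]})
graph_def["node"].append(n1)

# n2: ZerosLike reads output 0 of n1, so the input is written as just "n1".
# Its attribute T is taken from the dtype of the upstream output.
n2 = make_node("n2", "ZerosLike", inputs=["n1"], attr={"T": n1["attr"]["dtype"]})
graph_def["node"].append(n2)
```

The attribute T is filled in while the graph is being built, before any Tensor has flowed through the node.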
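Executing an OP
During graph execution, an OP's inputs are determined by what flows in from its upstream OPs. A suitable Kernel implementation is selected polymorphically according to the device type and the input/output types, and the Kernel's computation is then started.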
For example, if the upstream input of zeros_like is [1, 2, 3, 4], then after the zeros_like OP runs, the output is [0, 0, 0, 0].
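Kernel selection by device type and data type can be sketched as a lookup table keyed on both. The names KERNELS, register_kernel, and execute are hypothetical, and Python lists stand in for Tensors; the real system registers and dispatches Kernels in C++.

```python
KERNELS = {}  # maps (op name, device type, dtype) to an implementation


def register_kernel(op, device_type, dtype):
    """Decorator that records a kernel under its dispatch key."""
    def decorator(fn):
        KERNELS[(op, device_type, dtype)] = fn
        return fn
    return decorator


@register_kernel("ZerosLike", "CPU", "DT_INT32")
def zeros_like_cpu_int32(x):
    return [0 for _ in x]


@register_kernel("ZerosLike", "CPU", "DT_FLOAT")
def zeros_like_cpu_float(x):
    return [0.0 for _ in x]


def execute(op, device_type, dtype, *inputs):
    """Pick the kernel that matches the device and dtype, then run it."""
    kernel = KERNELS.get((op, device_type, dtype))
    if kernel is None:
        raise LookupError(f"no kernel for {op} on {device_type} with {dtype}")
    return kernel(*inputs)


output = execute("ZerosLike", "CPU", "DT_INT32", [1, 2, 3, 4])
print(output)
```

The same OP name maps to different implementations, and the choice is made only at execution time, once the device and the input type are known.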