Contents

Computing gradients
Gradient tapes
Gradients with respect to a model
Controlling what the tape watches
Intermediate results
Gradients of non-scalar targets
Cases where gradients returns None
References

import numpy as np
import matplotlib.pyplot as plt

import tensorflow as tf

Computing gradients

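To differentiate automatically, TensorFlow needs to remember which operations happen, and in what order, during the forward pass. Then, during the backward pass, TensorFlow traverses this list of operations in reverse order to compute gradients.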
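
To make the record-then-replay idea concrete, here is a minimal pure-Python sketch of a tape. It is only an illustration and not how TensorFlow is implemented; the names `tape`, `values`, `record_mul`, and `record_add` are hypothetical helpers invented for this example.

```python
# Toy reverse-mode differentiation (illustrative only, not TensorFlow's code).
# Each tape entry stores an output name and, for every input,
# the local partial derivative d(output)/d(input).
tape = []
values = {}

def record_mul(out, a, b):
    """Forward: out = a * b. Record d(out)/da = b and d(out)/db = a."""
    values[out] = values[a] * values[b]
    tape.append((out, [(a, values[b]), (b, values[a])]))

def record_add(out, a, b):
    """Forward: out = a + b. Both local derivatives are 1."""
    values[out] = values[a] + values[b]
    tape.append((out, [(a, 1.0), (b, 1.0)]))

# Forward pass: y = x * x + x at x = 3.0
values["x"] = 3.0
record_mul("x_sq", "x", "x")
record_add("y", "x_sq", "x")

# Backward pass: walk the tape in reverse and apply the chain rule.
grads = {"y": 1.0}
for out, inputs in reversed(tape):
    upstream = grads.get(out, 0.0)
    for name, local in inputs:
        grads[name] = grads.get(name, 0.0) + upstream * local

print(values["y"], grads["x"])  # 12.0 7.0  (dy/dx = 2 * x + 1)
```

The backward loop only needs the recorded entries, which is why the order of the forward pass has to be remembered.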

Gradient tapes

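TensorFlow provides the tf.GradientTape API for automatic differentiation, that is, computing the gradient of a computation with respect to some inputs, usually tf.Variables. TensorFlow "records" the relevant operations executed inside the context of a tf.GradientTape onto a "tape", and then uses that tape to compute the gradients with reverse mode differentiation.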

Here is a simple example of "recording":

x = tf.Variable(3.0)
with tf.GradientTape() as tape:
    y = x ** 2

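Once some operations have been recorded, use GradientTape.gradient(target, sources) to compute the gradient of a target (often a loss) with respect to a source (often the model's parameters):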

dy_dx = tape.gradient(y, x)
dy_dx.numpy()
"""
6.0
"""

The example above uses a scalar, but tf.GradientTape works just as well on any tensor:

w = tf.Variable(tf.random.normal((3, 2)), name='w')
b = tf.Variable(tf.zeros(2, dtype=tf.float32), name='b')
x = [[1., 2., 3.]]

with tf.GradientTape(persistent=True) as tape:
    y = x @ w + b
    loss = tf.reduce_mean(y**2)

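To get the gradient of loss with respect to both variables, pass both as sources to the gradient method. The tape is flexible about how sources are passed: it accepts any nested combination of lists or dictionaries and returns the gradients structured the same way.

The gradient with respect to each source has the shape of that source: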

[dl_dw, dl_db] = tape.gradient(loss, [w, b])
print(w.shape)
print(dl_dw.shape)
"""
(3, 2)
(3, 2)
"""

The sources can also be passed as a dictionary of variables:

source = {
    'w': w,
    'b': b
}

grad = tape.gradient(loss, source)
grad['b']
"""
<tf.Tensor: shape=(2,), dtype=float32, numpy=array([-0.85382605, -4.2623644 ], dtype=float32)>
"""

Gradients with respect to a model

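It is common to collect tf.Variables into a tf.Module or one of its subclasses (layers.Layer, keras.Model) for checkpointing and exporting.

In most cases, you will want gradients with respect to a model's trainable variables. Since all subclasses of tf.Module aggregate their variables in the Module.trainable_variables property, computing these gradients takes only a few lines: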

layer = tf.keras.layers.Dense(2, activation='relu')
x = tf.constant([[1., 2., 3.]])

with tf.GradientTape() as tape:
    y = layer(x)
    loss = tf.reduce_mean(y**2)

grad = tape.gradient(loss, layer.trainable_variables)

for var, gra in zip(layer.trainable_variables, grad):
    print(f'{var.name}, shape: {gra.shape}')
"""
dense/kernel:0, shape: (3, 2)
dense/bias:0, shape: (2,)
"""

Controlling what the tape watches

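By default, the tape records all operations after it accesses a trainable tf.Variable. In the following example a gradient is returned only for x0: x1 is a tf.Variable created with trainable=False, and x2 and x3 are tf.Tensors, which a tf.GradientTape does not watch by default: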

# A trainable variable
x0 = tf.Variable(3.0, name='x0')
# Not trainable
x1 = tf.Variable(3.0, name='x1', trainable=False)
# Not a Variable: A variable + tensor returns a tensor.
x2 = tf.Variable(2.0, name='x2') + 1.0
# Not a variable
x3 = tf.constant(3.0, name='x3')

with tf.GradientTape() as tape:
    y = (x0**2) + (x1**2) + (x2**2)

grad = tape.gradient(y, [x0, x1, x2, x3])

for g in grad:
    print(g)
"""
tf.Tensor(6.0, shape=(), dtype=float32)
None
None
None
"""

To record gradients with respect to a tf.Tensor, call GradientTape.watch(x):

x = tf.constant(3.0)
with tf.GradientTape() as tape:
    tape.watch(x)
    y = x ** 2

dy_dx = tape.gradient(y, x)
dy_dx.numpy()
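
Watching can also be made fully explicit. The sketch below assumes you want gradients for only one of two variables: it turns off automatic watching with the `watch_accessed_variables=False` argument of `tf.GradientTape` and then opts in with `tape.watch`. The variables `x0` and `x1` and the functions applied to them are chosen here purely for illustration.

```python
import tensorflow as tf

x0 = tf.Variable(0.0)
x1 = tf.Variable(10.0)

# Disable automatic watching of trainable variables, then watch x1 only.
with tf.GradientTape(watch_accessed_variables=False) as tape:
    tape.watch(x1)
    y0 = tf.math.sin(x0)
    y1 = tf.nn.softplus(x1)
    y = y0 + y1

grad = tape.gradient(y, {'x0': x0, 'x1': x1})

print('dy/dx0:', grad['x0'])          # None: x0 was never watched
print('dy/dx1:', grad['x1'].numpy())  # sigmoid(10.0), roughly 0.99995
```

Because x0 is not watched, its entry comes back as None even though x0 is a trainable variable used inside the tape.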

Intermediate results

You can also request gradients of the output with respect to intermediate values computed inside the tf.GradientTape context:

x = tf.constant(3.0)

with tf.GradientTape() as tape:
    tape.watch(x)
    y = x * x
    z = y * y

# dz_dy = 2 * y and y = x ** 2 = 9
print(tape.gradient(z, y).numpy())
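
A tape created without extra arguments serves a single gradient call. The following sketch, which reuses the same y = x * x, z = y * y computation, shows what a second call on such a tape is expected to do; the flag name `second_call_failed` is just a helper for this example.

```python
import tensorflow as tf

x = tf.constant(3.0)
with tf.GradientTape() as tape:  # persistent defaults to False
    tape.watch(x)
    y = x * x
    z = y * y

first = tape.gradient(z, x)  # 4 * x**3 = 108.0
print(first.numpy())

try:
    tape.gradient(y, x)  # second call on a non-persistent tape
    second_call_failed = False
except RuntimeError as err:
    second_call_failed = True
    print("second call failed:", type(err).__name__)
```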

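By default, the resources held by a GradientTape are released as soon as the GradientTape.gradient method is called. To compute multiple gradients over the same computation, create the tape with persistent=True. This allows multiple calls to the gradient method, because the resources are then released only when the tape object is garbage collected. For example: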

x = tf.constant([1, 3.0])
with tf.GradientTape(persistent=True) as tape:
    tape.watch(x)
    y = x * x
    z = y * y

print(tape.gradient(z, x).numpy())  # [4.0, 108.0] (4 * x**3 at x = [1.0, 3.0])
print(tape.gradient(y, x).numpy())  # [2.0, 6.0] (2 * x at x = [1.0, 3.0])
"""
[  4. 108.]
[2. 6.]
"""

Gradients of non-scalar targets

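A gradient is fundamentally an operation on a scalar. If you ask for the gradient of multiple targets, the result for each source is the sum of the gradients of each target, as in the following example: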

x = tf.Variable(2.0)
with tf.GradientTape() as tape:
    y0 = x**2
    y1 = 1 / x

print(tape.gradient({'y0': y0, 'y1': y1}, x).numpy())
"""
3.75
"""

Similarly, if the target is not a scalar, the gradient of the sum is computed:

x = tf.Variable(2.)

with tf.GradientTape() as tape:
    y = x * [3., 4.]

print(tape.gradient(y, x).numpy())
"""
7.0
"""
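
The summing behaviour can be checked directly. This sketch compares the gradient of the vector target with the gradient of an explicit `tf.reduce_sum`, and then uses the `output_gradients` argument of `GradientTape.gradient` to weight the elements of the target instead of summing them equally. The weights `[1.0, 0.0]` are an arbitrary choice for illustration.

```python
import tensorflow as tf

x = tf.Variable(2.0)

with tf.GradientTape(persistent=True) as tape:
    y = x * [3.0, 4.0]
    total = tf.reduce_sum(y)

g_vector = tape.gradient(y, x)      # gradient of the implicit sum
g_total = tape.gradient(total, x)   # gradient of the explicit sum
# Weight each element of y: only the first element contributes here.
g_first = tape.gradient(y, x, output_gradients=tf.constant([1.0, 0.0]))

print(g_vector.numpy(), g_total.numpy(), g_first.numpy())  # 7.0 7.0 3.0
```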

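Getting a separate gradient for each item of the target requires the Jacobian. In some cases you can skip the Jacobian: for an element-wise calculation, the gradient of the sum gives the derivative of each element with respect to its own input element, since each element is independent: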

x = tf.linspace(-10.0, 10.0, 200+1)

with tf.GradientTape() as tape:
    tape.watch(x)
    y = tf.nn.sigmoid(x)

dy_dx = tape.gradient(y, x)

plt.plot(x, y, label='y')
plt.plot(x, dy_dx, label='dy/dx')
plt.legend()
_ = plt.xlabel('x')

[Figure: plot of y = sigmoid(x) and dy/dx against x]
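
When the full matrix is needed, or to check the element-wise shortcut, `GradientTape.jacobian` can be used. The sketch below assumes a small three-element input picked for illustration and compares the diagonal of the Jacobian with the gradient of the sum.

```python
import numpy as np
import tensorflow as tf

x = tf.constant([1.0, 2.0, 3.0])

with tf.GradientTape(persistent=True) as tape:
    tape.watch(x)
    y = x ** 2  # element-wise, so the Jacobian is diagonal

jac = tape.jacobian(y, x)       # shape (3, 3), entry [i, j] = dy[i]/dx[j]
grad_sum = tape.gradient(y, x)  # gradient of sum(y), shape (3,)

print(jac.numpy())
print(grad_sum.numpy())
print(np.allclose(tf.linalg.diag_part(jac).numpy(), grad_sum.numpy()))
```

For an element-wise function the off-diagonal entries are zero, so the cheaper gradient of the sum carries the same information.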

Cases where gradients returns None

When a target is not connected to a source, gradient returns None:

x = tf.Variable(2.)
y = tf.Variable(3.)

with tf.GradientTape() as tape:
    z = y * y
print(tape.gradient(z, x))
"""
None
"""
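
If downstream code cannot handle None, the `unconnected_gradients` argument of `GradientTape.gradient` can request zeros instead. A minimal sketch with the same unconnected x and y:

```python
import tensorflow as tf

x = tf.Variable(2.0)
y = tf.Variable(3.0)

with tf.GradientTape() as tape:
    z = y * y

# Ask for a zero tensor shaped like x instead of None.
grad = tape.gradient(z, x, unconnected_gradients=tf.UnconnectedGradients.ZERO)
print(grad)
```

Note that a zero here still means "not connected", so it can hide the kinds of mistakes described in this section.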

The gradient can also be disconnected in a few less obvious ways:

Replacing a variable with a tensor

x = tf.Variable(2.0)

for epoch in range(2):
    with tf.GradientTape() as tape:
        y = x+1

    print(type(x).__name__, ":", tape.gradient(y, x))
    x = x + 1   # This should be `x.assign_add(1)`
"""
ResourceVariable : tf.Tensor(1.0, shape=(), dtype=float32)
EagerTensor : None
"""
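
As the comment in the loop suggests, updating the variable in place keeps it a tf.Variable. A sketch of the corrected loop, using `Variable.assign_add`:

```python
import tensorflow as tf

x = tf.Variable(2.0)
grads = []

for epoch in range(2):
    with tf.GradientTape() as tape:
        y = x + 1

    g = tape.gradient(y, x)
    grads.append(g)
    print(type(x).__name__, ":", g)
    x.assign_add(1)  # in-place update; x remains a tf.Variable

print(x.numpy())  # 4.0
```

Both iterations now report a gradient of 1.0, because the tape watches the same variable each time.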

Doing calculations outside of TensorFlow

The tape cannot record the gradient path if the calculation leaves TensorFlow:

x = tf.Variable([[1.0, 2.0],
                 [3.0, 4.0]], dtype=tf.float32)

with tf.GradientTape() as tape:
    x2 = x ** 2

    # This step is calculated with NumPy
    y = np.mean(x2, axis=0)

    # Like most ops, reduce_mean will cast the NumPy array to a constant tensor
    # using `tf.convert_to_tensor`.
    y = tf.reduce_mean(y, axis=0)

print(tape.gradient(y, x))
"""
None
"""
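
Keeping every step inside TensorFlow restores the gradient path. In this sketch the NumPy call is swapped for `tf.reduce_mean`; nothing else changes.

```python
import numpy as np
import tensorflow as tf

x = tf.Variable([[1.0, 2.0],
                 [3.0, 4.0]], dtype=tf.float32)

with tf.GradientTape() as tape:
    x2 = x ** 2
    y = tf.reduce_mean(x2, axis=0)  # TensorFlow op, recorded by the tape
    y = tf.reduce_mean(y, axis=0)

grad = tape.gradient(y, x)
print(grad.numpy())  # y is the mean of four squares, so dy/dx = x / 2
```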

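Taking gradients through an integer or string

Integers and strings are not differentiable. If a calculation path uses these data types, there will be no gradient.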

x = tf.constant(10)

with tf.GradientTape() as tape:
    tape.watch(x)
    y = x * x

print(tape.gradient(y, x))
"""
None
"""
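
One common way around this is to convert the value to a floating-point dtype before differentiating. A minimal sketch, assuming the integer input comes from elsewhere (the name `x_int` is introduced here for illustration):

```python
import tensorflow as tf

x_int = tf.constant(10)  # integer tensor, not differentiable

with tf.GradientTape() as tape:
    x = tf.cast(x_int, tf.float32)  # differentiate with respect to the float copy
    tape.watch(x)
    y = x * x

grad = tape.gradient(y, x)
print(grad.numpy())  # 20.0
```

The gradient is taken with respect to the float tensor x; the original integer tensor still has no gradient.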

References

TensorFlow official website, https://tensorflow.google.cn/guide/autodiff.

TensorFlow Basics (3): Gradients and Automatic Differentiation