import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
Computing gradients
To differentiate automatically, TensorFlow needs to remember which operations happened in which order during the forward pass. Then, during the backward pass, TensorFlow traverses this list of operations in reverse order to compute gradients.
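The record-then-reverse idea can be sketched in a few lines of plain Python. This is only a toy illustration for scalar addition and multiplication; the names `Tape`, `Value` and `backward` are hypothetical and say nothing about how TensorFlow is implemented internally.

```python
class Tape:
    """Hypothetical toy tape: stores operations in forward order."""
    def __init__(self):
        # Each entry: (output, [(input, local_derivative), ...])
        self.ops = []


class Value:
    """Hypothetical scalar that records its operations on a tape."""
    def __init__(self, data, tape):
        self.data = data
        self.tape = tape
        self.grad = 0.0

    def __mul__(self, other):
        out = Value(self.data * other.data, self.tape)
        # d(out)/d(self) = other.data, d(out)/d(other) = self.data
        self.tape.ops.append((out, [(self, other.data), (other, self.data)]))
        return out

    def __add__(self, other):
        out = Value(self.data + other.data, self.tape)
        self.tape.ops.append((out, [(self, 1.0), (other, 1.0)]))
        return out


def backward(tape, target):
    """Walk the recorded operations in reverse and apply the chain rule."""
    target.grad = 1.0
    for out, inputs in reversed(tape.ops):
        for inp, local_derivative in inputs:
            inp.grad += out.grad * local_derivative


tape = Tape()
x = Value(3.0, tape)
y = x * x        # forward pass: recorded first
z = y + x        # forward pass: recorded second
backward(tape, z)
print(x.grad)    # dz/dx = 2 * x + 1 = 7.0
```

Because the operations are stored in the order they ran, walking the list backwards guarantees that the gradient of each output is complete before it is pushed to its inputs.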
Gradient tapes
TensorFlow provides the tf.GradientTape API for automatic differentiation, i.e. computing the gradient of some quantity with respect to some inputs (usually tf.Variable objects). TensorFlow "records" the relevant operations performed inside the context of a tf.GradientTape onto a "tape", which is then used to compute gradients via reverse-mode differentiation.
"Recording" process:
x = tf.Variable(3.0)
with tf.GradientTape() as tape:
    y = x ** 2
After recording some operations, use GradientTape.gradient(target, sources) to compute the gradient of some target (usually a loss) with respect to some source (usually a model parameter):
dy_dx = tape.gradient(y, x)
dy_dx.numpy()
"""
6.0
"""
The example above uses scalars, but tf.GradientTape works just as well on any tensor:
w = tf.Variable(tf.random.normal((3, 2)), name='w')
b = tf.Variable(tf.zeros(2, dtype=tf.float32), name='b')
x = [[1., 2., 3.]]
with tf.GradientTape(persistent=True) as tape:
    y = x @ w + b
    loss = tf.reduce_mean(y**2)
To obtain the gradient of loss with respect to both variables, pass both as sources to the gradient method. The tape is flexible about how sources are passed: it accepts any nested combination of lists or dictionaries and returns the gradients in the same structure.
The gradient with respect to each source has the shape of the source:
[dl_dw, dl_db] = tape.gradient(loss, [w, b])
print(w.shape)
print(dl_dw.shape)
"""
(3, 2)
(3, 2)
"""
The sources can also be passed as a dictionary of variables:
source = {
    'w': w,
    'b': b
}
grad = tape.gradient(loss, source)
grad['b']
"""
<tf.Tensor: shape=(2,), dtype=float32, numpy=array([-0.85382605, -4.2623644 ], dtype=float32)>
"""
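As a sanity check, the gradients returned by the tape can be compared with a hand derivation. The sketch below rebuilds the same computation and assumes only the chain rule: because loss is the mean of y**2 over two entries, dloss/dy equals y, so dloss/dw is the transpose of x times y and dloss/db is y summed over the batch axis. The names `expected_dw` and `expected_db` are introduced here for illustration.

```python
import numpy as np
import tensorflow as tf

w = tf.Variable(tf.random.normal((3, 2)), name='w')
b = tf.Variable(tf.zeros(2, dtype=tf.float32), name='b')
x = tf.constant([[1., 2., 3.]])

with tf.GradientTape() as tape:
    y = x @ w + b
    loss = tf.reduce_mean(y**2)

# Dictionary in, dictionary out
grads = tape.gradient(loss, {'w': w, 'b': b})

# Hand derivation: loss = (y0**2 + y1**2) / 2, so dloss/dy = y
dl_dy = y.numpy()                      # shape (1, 2)
expected_dw = x.numpy().T @ dl_dy      # shape (3, 2)
expected_db = dl_dy.sum(axis=0)        # shape (2,)

print(np.allclose(grads['w'].numpy(), expected_dw, rtol=1e-4, atol=1e-5))
print(np.allclose(grads['b'].numpy(), expected_db, rtol=1e-4, atol=1e-5))
```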
Gradients with respect to a model
Typically, tf.Variable objects are collected into a tf.Module or one of its subclasses (e.g. layers.Layer, keras.Model) for checkpointing and exporting.
In most cases, we need to compute gradients with respect to the model's trainable variables. Since all subclasses of tf.Module aggregate their variables in the Module.trainable_variables attribute, computing these gradients takes only a few lines:
layer = tf.keras.layers.Dense(2, activation='relu')
x = tf.constant([[1., 2., 3.]])
with tf.GradientTape() as tape:
    y = layer(x)
    loss = tf.reduce_mean(y**2)
grad = tape.gradient(loss, layer.trainable_variables)
for var, gra in zip(layer.trainable_variables, grad):
    print(f'{var.name}, shape: {gra.shape}')
"""
dense/kernel:0, shape: (3, 2)
dense/bias:0, shape: (2,)
"""
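Once the gradients are available, they are typically used to update the variables. The following is a minimal sketch of a single manual gradient-descent step on a small tf.Module subclass; the class `Linear`, the helper `compute_loss` and the value of `learning_rate` are hypothetical choices made for this illustration, not part of the original example.

```python
import tensorflow as tf


class Linear(tf.Module):
    """Hypothetical toy model: y = x @ w + b."""
    def __init__(self):
        super().__init__()
        self.w = tf.Variable(tf.ones((3, 2)), name='w')
        self.b = tf.Variable(tf.zeros(2), name='b')

    def __call__(self, x):
        return x @ self.w + self.b


def compute_loss(model, x):
    return tf.reduce_mean(model(x) ** 2)


model = Linear()
x = tf.constant([[1., 2., 3.]])
learning_rate = 0.01  # assumed step size

with tf.GradientTape() as tape:
    loss_before = compute_loss(model, x)

# trainable_variables collects every trainable tf.Variable held by the module
grads = tape.gradient(loss_before, model.trainable_variables)

# One plain gradient-descent step: var <- var - learning_rate * grad
for var, grad in zip(model.trainable_variables, grads):
    var.assign_sub(learning_rate * grad)

loss_after = compute_loss(model, x)
print(loss_before.numpy(), loss_after.numpy())
```

The update uses assign_sub so that each variable is modified in place and remains a tf.Variable for the next step.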
Controlling what the tape watches
By default, the tape records all operations after a trainable tf.Variable is accessed. In the following example some gradients cannot be computed, either because a tf.Tensor is not "watched" by tf.GradientTape by default, or because the trainable attribute of a tf.Variable is set to False:
# A trainable variable
x0 = tf.Variable(3.0, name='x0')
# Not trainable
x1 = tf.Variable(3.0, name='x1', trainable=False)
# Not a Variable: A variable + tensor returns a tensor.
x2 = tf.Variable(2.0, name='x2') + 1.0
# Not a variable
x3 = tf.constant(3.0, name='x3')
with tf.GradientTape() as tape:
    y = (x0**2) + (x1**2) + (x2**2)
grad = tape.gradient(y, [x0, x1, x2, x3])
for g in grad:
print(g)
"""
tf.Tensor(6.0, shape=(), dtype=float32)
None
None
None
"""
To record gradients with respect to a tf.Tensor, call GradientTape.watch(x):
x = tf.constant(3.0)
with tf.GradientTape() as tape:
    tape.watch(x)
    y = x ** 2
dy_dx = tape.gradient(y, x)
dy_dx.numpy()
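The opposite is also possible: the default behavior of watching every trainable variable can be switched off by passing watch_accessed_variables=False when creating the tape, after which only explicitly watched sources are recorded. A short sketch, with variable names chosen for illustration:

```python
import tensorflow as tf

x0 = tf.Variable(1.0)
x1 = tf.Variable(10.0)

# Nothing is watched automatically on this tape
with tf.GradientTape(watch_accessed_variables=False) as tape:
    tape.watch(x1)             # only x1 is recorded
    y = x0**2 + x1**2

grads = tape.gradient(y, {'x0': x0, 'x1': x1})
print(grads['x0'])             # x0 was not watched
print(grads['x1'].numpy())     # 2 * x1
```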
Intermediate results
We can also request gradients of the output with respect to intermediate values computed inside the tf.GradientTape context:
x = tf.constant(3.0)
with tf.GradientTape() as tape:
    tape.watch(x)
    y = x * x
    z = y * y
# dz_dy = 2 * y and y = x ** 2 = 9
print(tape.gradient(z, y).numpy())
By default, the resources held by a GradientTape are released as soon as the GradientTape.gradient method is called. To compute multiple gradients over the same computation, create the tape with persistent=True. This allows multiple calls to the gradient method, because the resources are then released only when the tape object is garbage collected. For example:
x = tf.constant([1, 3.0])
with tf.GradientTape(persistent=True) as tape:
    tape.watch(x)
    y = x * x
    z = y * y
print(tape.gradient(z, x).numpy()) # [4.0, 108.0] (4 * x**3 at x = [1.0, 3.0])
print(tape.gradient(y, x).numpy()) # [2.0, 6.0] (2 * x at x = [1.0, 3.0])
"""
[ 4. 108.]
[2. 6.]
"""
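For contrast, the sketch below repeats the computation with a default (non-persistent) tape. The first call to gradient succeeds; the second call is expected to raise a RuntimeError because the tape has already released its resources. The flag `second_call_failed` is a name introduced here only to make the outcome visible.

```python
import tensorflow as tf

x = tf.constant([1.0, 3.0])
with tf.GradientTape() as tape:    # persistent defaults to False
    tape.watch(x)
    y = x * x
    z = y * y

first = tape.gradient(z, x)        # works, then the tape is released
print(first.numpy())

try:
    tape.gradient(y, x)            # second call on the same tape
    second_call_failed = False
except RuntimeError as err:
    second_call_failed = True
    print('second call failed:', err)
```

When a persistent tape is no longer needed, deleting the reference with `del tape` lets its resources be reclaimed.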
Gradients of non-scalar targets
A gradient is fundamentally an operation on a scalar. When the gradient of several targets is requested, the result for each source is the sum of the gradients of every target:
x = tf.Variable(2.0)
with tf.GradientTape() as tape:
    y0 = x**2
    y1 = 1 / x
print(tape.gradient({'y0': y0, 'y1': y1}, x).numpy())
"""
3.75
"""
If the target is not a scalar, the gradient of the sum is computed :
x = tf.Variable(2.)
with tf.GradientTape() as tape:
    y = x * [3., 4.]
print(tape.gradient(y, x).numpy())
"""
7.0
"""
Getting a separate gradient for each entry of the target requires the Jacobian. In some cases the Jacobian can be skipped: for an element-wise computation, the gradient of the sum gives the derivative of each element with respect to its own input element, since each element is independent:
x = tf.linspace(-10.0, 10.0, 200+1)
with tf.GradientTape() as tape:
    tape.watch(x)
    y = tf.nn.sigmoid(x)
dy_dx = tape.gradient(y, x)
plt.plot(x, y, label='y')
plt.plot(x, dy_dx, label='dy/dx')
plt.legend()
_ = plt.xlabel('x')
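When the per-entry derivatives are needed, GradientTape.jacobian returns them directly. The sketch below uses a small element-wise function, chosen for illustration, to show how the Jacobian relates to the summed gradient: the Jacobian is diagonal, and summing it over the target axis reproduces what gradient returns.

```python
import numpy as np
import tensorflow as tf

x = tf.constant([1.0, 2.0, 3.0])
with tf.GradientTape(persistent=True) as tape:
    tape.watch(x)
    y = x ** 2

jac = tape.jacobian(y, x)     # shape (3, 3): jac[i, j] = dy[i] / dx[j]
grad = tape.gradient(y, x)    # gradient of sum(y), shape (3,)
del tape                      # release the persistent tape

print(jac.numpy())
print(grad.numpy())
print(np.allclose(jac.numpy().sum(axis=0), grad.numpy()))
```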
Cases where gradient returns None
When the target is not connected to the source, gradient will return None:
x = tf.Variable(2.)
y = tf.Variable(3.)
with tf.GradientTape() as tape:
    z = y * y
print(tape.gradient(z, x))
"""
None
"""
We can also disconnect gradients in several less obvious ways:
- Replacing a variable with a tensor
x = tf.Variable(2.0)
for epoch in range(2):
    with tf.GradientTape() as tape:
        y = x+1
    print(type(x).__name__, ":", tape.gradient(y, x))
    x = x + 1 # This should be `x.assign_add(1)`
"""
ResourceVariable : tf.Tensor(1.0, shape=(), dtype=float32)
EagerTensor : None
"""
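A corrected version of the loop, following the hint in the code comment, might look like the sketch below. Using assign_add updates the variable in place, so x stays a tf.Variable and the tape keeps watching it on every iteration. The list `grads` is added here only to collect the results.

```python
import tensorflow as tf

x = tf.Variable(2.0)
grads = []
for epoch in range(2):
    with tf.GradientTape() as tape:
        y = x + 1
    grads.append(tape.gradient(y, x))
    print(type(x).__name__, ":", grads[-1])
    x.assign_add(1)   # in-place update; x is still a tf.Variable
```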
- Computed outside of TensorFlow
If the calculation exits TensorFlow, the gradient tape will not be able to record the gradient path:
x = tf.Variable([[1.0, 2.0],
                 [3.0, 4.0]], dtype=tf.float32)
with tf.GradientTape() as tape:
    x2 = x ** 2
    # This step is calculated with NumPy
    y = np.mean(x2, axis=0)
    # Like most ops, reduce_mean will cast the NumPy array to a constant tensor
    # using `tf.convert_to_tensor`.
    y = tf.reduce_mean(y, axis=0)
print(tape.gradient(y, x))
"""
None
"""
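Keeping every step inside TensorFlow restores the gradient path. A minimal sketch of the same computation with the NumPy call replaced by tf.reduce_mean:

```python
import tensorflow as tf

x = tf.Variable([[1.0, 2.0],
                 [3.0, 4.0]], dtype=tf.float32)
with tf.GradientTape() as tape:
    x2 = x ** 2
    # Both reductions are TensorFlow ops, so the tape can record them
    y = tf.reduce_mean(x2, axis=0)
    y = tf.reduce_mean(y, axis=0)

grad = tape.gradient(y, x)
print(grad.numpy())   # y = mean(x**2) over 4 entries, so dy/dx = x / 2
```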
- Taking gradients through an integer or string
Integers and strings are not differentiable. No gradient is produced if the computation path uses these data types.
x = tf.constant(10)
with tf.GradientTape() as tape:
    tape.watch(x)
    y = x * x
print(tape.gradient(y, x))
"""
None
"""
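One way around this, sketched below, is to cast the integer tensor to a floating-point type before watching it. The name `x_int` is introduced here for illustration.

```python
import tensorflow as tf

x_int = tf.constant(10)
x = tf.cast(x_int, tf.float32)   # make the source differentiable
with tf.GradientTape() as tape:
    tape.watch(x)
    y = x * x

grad = tape.gradient(y, x)
print(grad.numpy())              # 2 * x = 20.0
```

Writing the constant as a float in the first place, e.g. tf.constant(10.0), has the same effect.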
References
TensorFlow official website, https://tensorflow.google.cn/guide/autodiff .