Article Directory

Computing gradients
Gradient tapes
Gradients with respect to a model
Controlling what the tape watches
Intermediate results
Gradients of non-scalar targets
Cases where gradients returns None
References

import numpy as np
import matplotlib.pyplot as plt

import tensorflow as tf

Computing gradients

To achieve automatic differentiation, TensorFlow needs to remember which operations occurred in which order during the forward pass. Then, during the backward pass, TensorFlow traverses this list of operations in reverse order to compute gradients.
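The record-then-replay idea can be illustrated with a toy tape in plain Python. This is only a sketch of the principle, not TensorFlow's implementation; the names Value and ToyTape are hypothetical and exist only for this illustration.

```python
class Value:
    """A plain scalar wrapper; all bookkeeping lives on the tape."""
    def __init__(self, data):
        self.data = data


class ToyTape:
    """Hypothetical minimal tape: records ops in forward order."""
    def __init__(self):
        # Each entry: (output, [(input, d_output/d_input), ...])
        self.ops = []

    def mul(self, a, b):
        out = Value(a.data * b.data)
        self.ops.append((out, [(a, b.data), (b, a.data)]))
        return out

    def add(self, a, b):
        out = Value(a.data + b.data)
        self.ops.append((out, [(a, 1.0), (b, 1.0)]))
        return out

    def gradient(self, target, source):
        grads = {id(target): 1.0}
        # Backward pass: walk the recorded ops in reverse order.
        for out, inputs in reversed(self.ops):
            upstream = grads.get(id(out))
            if upstream is None:
                continue  # this op does not influence the target
            for inp, local in inputs:
                grads[id(inp)] = grads.get(id(inp), 0.0) + upstream * local
        return grads.get(id(source))  # None if source is not connected


tape = ToyTape()
x = Value(3.0)
y = tape.mul(x, x)            # y = x * x
z = tape.add(y, x)            # z = x**2 + x
dz_dx = tape.gradient(z, x)   # 2 * x + 1 at x = 3
print(dz_dx)
```

A source that never took part in a recorded operation has no entry in the gradient table, so the toy returns None for it, which mirrors the behaviour described later in this article.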

Gradient tapes

TensorFlow provides the tf.GradientTape API for automatic differentiation, i.e. computing the gradient of some quantity with respect to some input (usually a tf.Variable). TensorFlow "records" the relevant operations performed within the context of a tf.GradientTape onto a "tape", which is then used to compute gradients via reverse-mode differentiation.

"Recording" process:

x = tf.Variable(3.0)
with tf.GradientTape() as tape:
    y = x ** 2

After recording some operations, use GradientTape.gradient(target, sources) to compute the gradient of some target (usually a loss) with respect to some source (usually a model parameter):

dy_dx = tape.gradient(y, x)
dy_dx.numpy()
"""
6.0
"""

The example above uses a scalar, but tf.GradientTape works just as well on any tensor:

w = tf.Variable(tf.random.normal((3, 2)), name='w')
b = tf.Variable(tf.zeros(2, dtype=tf.float32), name='b')
x = [[1., 2., 3.]]

with tf.GradientTape(persistent=True) as tape:
    y = x @ w + b
    loss = tf.reduce_mean(y**2)

To obtain the gradient of loss with respect to both variables, pass both as sources to the gradient method. The tape is very flexible about how sources are passed: it accepts any nested combination of lists or dictionaries and returns the gradients structured the same way.

The gradient with respect to each source has the shape of the source:

[dl_dw, dl_db] = tape.gradient(loss, [w, b])
print(w.shape)
print(dl_dw.shape)
"""
(3, 2)
(3, 2)
"""

Sources can also be passed as a dictionary of variables:

source = {
    'w': w,
    'b': b
}

grad = tape.gradient(loss, source)
grad['b']
"""
<tf.Tensor: shape=(2,), dtype=float32, numpy=array([-0.85382605, -4.2623644 ], dtype=float32)>
"""

Gradients with respect to a model

Typically, tf.Variables are collected into a tf.Module or one of its subclasses (e.g. layers.Layer, keras.Model) for checkpointing and exporting.

In most cases, we need to compute gradients with respect to the model's trainable variables. Since all subclasses of tf.Module aggregate their variables in the Module.trainable_variables attribute, computing these gradients takes only a few lines:

layer = tf.keras.layers.Dense(2, activation='relu')
x = tf.constant([[1., 2., 3.]])

with tf.GradientTape() as tape:
    y = layer(x)
    loss = tf.reduce_mean(y**2)

grad = tape.gradient(loss, layer.trainable_variables)

for var, gra in zip(layer.trainable_variables, grad):
    print(f'{var.name}, shape: {gra.shape}')
"""
dense/kernel:0, shape: (3, 2)
dense/bias:0, shape: (2,)
"""
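The same pattern applies to a hand-written tf.Module. The sketch below assumes a hypothetical Linear module and a learning rate of 0.01, neither of which comes from the original article, and applies one plain gradient-descent step with assign_sub to show how the gradients are typically used.

```python
import tensorflow as tf


class Linear(tf.Module):
    """Hypothetical example module, used only for this illustration."""
    def __init__(self, in_features, out_features, name=None):
        super().__init__(name=name)
        self.w = tf.Variable(tf.ones((in_features, out_features)), name='w')
        self.b = tf.Variable(tf.zeros((out_features,)), name='b')

    def __call__(self, x):
        return x @ self.w + self.b


model = Linear(3, 2)
x = tf.constant([[1., 2., 3.]])

with tf.GradientTape() as tape:
    loss = tf.reduce_mean(model(x) ** 2)

# One gradient per trainable variable, in the same order.
grads = tape.gradient(loss, model.trainable_variables)

learning_rate = 0.01  # assumed value for the sketch
for var, grad in zip(model.trainable_variables, grads):
    var.assign_sub(learning_rate * grad)  # in-place update keeps var a tf.Variable

new_loss = tf.reduce_mean(model(x) ** 2)
print(loss.numpy(), new_loss.numpy())
```

Updating with assign_sub rather than rebinding the Python name keeps each parameter a tf.Variable, so the tape continues to watch it on the next step.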

Controlling what the tape watches

By default, TensorFlow records all operations after accessing a trainable tf.Variable. The following example fails to compute a gradient for some of its sources, because a tf.Tensor is not watched by tf.GradientTape by default, and neither is a tf.Variable whose trainable property is set to False:

# A trainable variable
x0 = tf.Variable(3.0, name='x0')
# Not trainable
x1 = tf.Variable(3.0, name='x1', trainable=False)
# Not a Variable: A variable + tensor returns a tensor.
x2 = tf.Variable(2.0, name='x2') + 1.0
# Not a variable
x3 = tf.constant(3.0, name='x3')

with tf.GradientTape() as tape:
    y = (x0**2) + (x1**2) + (x2**2)

grad = tape.gradient(y, [x0, x1, x2, x3])

for g in grad:
    print(g)
"""
tf.Tensor(6.0, shape=(), dtype=float32)
None
None
None
"""

To record operations on a tf.Tensor, we need to call GradientTape.watch(x):

x = tf.constant(3.0)
with tf.GradientTape() as tape:
    tape.watch(x)
    y = x ** 2

dy_dx = tape.gradient(y, x)
dy_dx.numpy()
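Watching can also be restricted in the other direction. As a minimal sketch, assuming we only want the gradient with respect to one of two variables, the watch_accessed_variables=False argument of tf.GradientTape turns off automatic watching so that only what is passed to watch is recorded:

```python
import tensorflow as tf

x0 = tf.Variable(3.0)
x1 = tf.Variable(10.0)

# Automatic watching is disabled, so only x1 is recorded.
with tf.GradientTape(watch_accessed_variables=False) as tape:
    tape.watch(x1)
    y = x0 ** 2 + x1 ** 2

grad = tape.gradient(y, {'x0': x0, 'x1': x1})

print(grad['x0'])          # x0 was never watched
print(grad['x1'].numpy())  # 2 * x1
```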

Intermediate results

We can also request the gradient of the output with respect to intermediate values computed inside the tf.GradientTape context:

x = tf.constant(3.0)

with tf.GradientTape() as tape:
    tape.watch(x)
    y = x * x
    z = y * y

# dz_dy = 2 * y and y = x ** 2 = 9
print(tape.gradient(z, y).numpy())

By default, the resources held by a GradientTape are released as soon as the GradientTape.gradient method is called. To compute multiple gradients over the same computation, create the tape with persistent=True. This allows multiple calls to the gradient method, because the resources are released only when the tape object is garbage collected. For example:

x = tf.constant([1, 3.0])
with tf.GradientTape(persistent=True) as tape:
    tape.watch(x)
    y = x * x
    z = y * y

print(tape.gradient(z, x).numpy())  # [4.0, 108.0] (4 * x**3 at x = [1.0, 3.0])
print(tape.gradient(y, x).numpy())  # [2.0, 6.0] (2 * x at x = [1.0, 3.0])
"""
[  4. 108.]
[2. 6.]
"""
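A persistent tape also makes it easy to check the chain rule numerically. The following sketch reuses the computation above at the scalar point x = 3 and compares dz/dx with the product dz/dy * dy/dx; the final del is there only to drop the reference so that the tape's resources can be freed.

```python
import tensorflow as tf

x = tf.constant(3.0)
with tf.GradientTape(persistent=True) as tape:
    tape.watch(x)
    y = x * x
    z = y * y

dz_dy = tape.gradient(z, y)   # 2 * y
dy_dx = tape.gradient(y, x)   # 2 * x
dz_dx = tape.gradient(z, x)   # 4 * x**3
del tape  # drop the reference to the persistent tape

print(dz_dy.numpy(), dy_dx.numpy(), dz_dx.numpy())
```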

Gradients of non-scalar targets

Gradients are fundamentally operations on scalars. If several targets are passed, the result for each source is the sum of the gradients of each target. In the following example that is 2x - 1/x**2 = 4 - 0.25 = 3.75 at x = 2:

x = tf.Variable(2.0)
with tf.GradientTape() as tape:
    y0 = x**2
    y1 = 1 / x

print(tape.gradient({'y0': y0, 'y1': y1}, x).numpy())
"""
3.75
"""
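To see where 3.75 comes from, the two targets can be differentiated separately and added by hand. This is a small check under the same setup, using a persistent tape so that gradient can be called more than once:

```python
import tensorflow as tf

x = tf.Variable(2.0)
with tf.GradientTape(persistent=True) as tape:
    y0 = x ** 2
    y1 = 1 / x

g0 = tape.gradient(y0, x)   # d(x**2)/dx = 2 * x
g1 = tape.gradient(y1, x)   # d(1/x)/dx = -1 / x**2
g_both = tape.gradient({'y0': y0, 'y1': y1}, x)

print(g0.numpy(), g1.numpy(), g_both.numpy())
```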

If the target is not a scalar, the gradient of the sum of its elements is computed. Here y = [3x, 4x], so the result is 3 + 4 = 7:

x = tf.Variable(2.)

with tf.GradientTape() as tape:
    y = x * [3., 4.]

print(tape.gradient(y, x).numpy())
"""
7.0
"""

Obtaining a separate gradient for each entry of the target requires the Jacobian. In some cases the Jacobian can be skipped: for an element-wise computation, the gradient of the sum gives the derivative of each element with respect to its own input element, since each element is independent:

x = tf.linspace(-10.0, 10.0, 200+1)

with tf.GradientTape() as tape:
    tape.watch(x)
    y = tf.nn.sigmoid(x)

dy_dx = tape.gradient(y, x)

plt.plot(x, y, label='y')
plt.plot(x, dy_dx, label='dy/dx')
plt.legend()
_ = plt.xlabel('x')

[Figure: y = sigmoid(x) and its derivative dy/dx plotted against x]
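The relationship between the Jacobian and the gradient of the sum can be checked directly with GradientTape.jacobian. The sketch below uses a small three-element input chosen for illustration: for an element-wise function the Jacobian is diagonal, and its diagonal matches what gradient returns.

```python
import tensorflow as tf

x = tf.Variable([1.0, 2.0, 3.0])
with tf.GradientTape(persistent=True) as tape:
    y = x * x   # element-wise: y[i] depends only on x[i]

jac = tape.jacobian(y, x)     # shape (3, 3), jac[i, j] = dy[i] / dx[j]
grad = tape.gradient(y, x)    # gradient of sum(y), shape (3,)
diag = tf.linalg.diag_part(jac)

print(jac.numpy())
print(grad.numpy(), diag.numpy())
```

When the elements are not independent, the off-diagonal entries are non-zero and the gradient of the sum no longer recovers the per-element derivatives, so the full Jacobian is needed.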

Cases where gradients returns None

When the target is not connected to the source, gradient will return None:

x = tf.Variable(2.)
y = tf.Variable(3.)

with tf.GradientTape() as tape:
    z = y * y
print(tape.gradient(z, x))
"""
None
"""

We can also disconnect gradients in several less obvious ways:

Replacing a variable with a tensor

x = tf.Variable(2.0)

for epoch in range(2):
    with tf.GradientTape() as tape:
        y = x+1

    print(type(x).__name__, ":", tape.gradient(y, x))
    x = x + 1   # This should be `x.assign_add(1)`
"""
ResourceVariable : tf.Tensor(1.0, shape=(), dtype=float32)
EagerTensor : None
"""
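As the comment in the loop suggests, the fix is to update the variable in place. A minimal corrected version of the same loop, with the gradients collected in a list for inspection, might look like this:

```python
import tensorflow as tf

x = tf.Variable(2.0)
grads = []

for epoch in range(2):
    with tf.GradientTape() as tape:
        y = x + 1
    grads.append(tape.gradient(y, x))
    x.assign_add(1)   # in-place update: x remains a tf.Variable

print(isinstance(x, tf.Variable), [g.numpy() for g in grads])
```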

Calculations done outside of TensorFlow

If the calculation leaves TensorFlow, the gradient tape cannot record the gradient path:

x = tf.Variable([[1.0, 2.0],
                 [3.0, 4.0]], dtype=tf.float32)

with tf.GradientTape() as tape:
    x2 = x ** 2

    # This step is calculated with NumPy
    y = np.mean(x2, axis=0)

    # Like most ops, reduce_mean will cast the NumPy array to a constant tensor
    # using `tf.convert_to_tensor`.
    y = tf.reduce_mean(y, axis=0)

print(tape.gradient(y, x))
"""
None
"""
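Keeping every step inside TensorFlow restores the gradient. The sketch below replaces np.mean with tf.reduce_mean and leaves the rest of the example unchanged; since y is then the mean of all four squared elements, the gradient is 2 * x / 4.

```python
import tensorflow as tf

x = tf.Variable([[1.0, 2.0],
                 [3.0, 4.0]], dtype=tf.float32)

with tf.GradientTape() as tape:
    x2 = x ** 2
    # Both reductions are TensorFlow ops, so the tape records them.
    y = tf.reduce_mean(x2, axis=0)
    y = tf.reduce_mean(y, axis=0)

grad = tape.gradient(y, x)
print(grad.numpy())
```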

Taking gradients through an integer or string

Integers and strings are not differentiable. If the computation path uses these data types, no gradient is produced. Note that tf.constant(10) creates an integer tensor because no dtype is given:

x = tf.constant(10)

with tf.GradientTape() as tape:
    tape.watch(x)
    y = x * x

print(tape.gradient(y, x))
"""
None
"""
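Using a floating-point value avoids the problem. In this sketch the only change is the literal 10.0, which makes the constant float32:

```python
import tensorflow as tf

x = tf.constant(10.0)   # float32, whereas tf.constant(10) is int32

with tf.GradientTape() as tape:
    tape.watch(x)
    y = x * x

grad = tape.gradient(y, x)
print(x.dtype, grad.numpy())
```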

References

TensorFlow official website, https://tensorflow.google.cn/guide/autodiff

TensorFlow Basics (3) Gradient and Automatic Differentiation