Computational Graph Basics
--------------------------

A computational graph contains operators (as units of operations) and tensors (as units of data). The operator nodes in a graph are connected with directed edges, which indicate the state of each tensor and the dependencies between operators. :numref:`simpledag` shows a computational graph example of :math:`\boldsymbol{Z}=ReLU(\boldsymbol{X}\times\boldsymbol{Y})`.

.. _simpledag:

.. figure:: ../img/ch03/simple-graph.png
   :width: 300px

   Simple computational graph

Tensors and Operators
~~~~~~~~~~~~~~~~~~~~~

In mathematics, tensors are a generalization of scalars and vectors. Machine learning defines multidimensional data as tensors. The rank of a tensor refers to the number of axes (or dimensions) the tensor has. A scalar is a rank-0 tensor containing a single value, without axes; a vector is a rank-1 tensor with one axis; and a three-channel RGB color image is a rank-3 tensor with three axes. See :numref:`tensor`.

.. _tensor:

.. figure:: ../img/ch03/tensor.png
   :width: 800px

   Tensors

In a machine learning framework, a tensor stores not only the data itself but also attributes such as the data type, data shape, rank, and gradient transfer status. :numref:`tensor_attr` describes the main attributes of a tensor.

.. _tensor_attr:

.. table:: Tensor attributes

   +------------------+--------------------------------------------------+
   | Tensor Attribute | Description                                      |
   +==================+==================================================+
   | shape            | Length of each dimension, for example, [3,3,3].  |
   +------------------+--------------------------------------------------+
   | dim              | Number of axes (or dimensions). The value is 0   |
   |                  | for a scalar and 1 for a vector.                 |
   +------------------+--------------------------------------------------+
   | dtype            | Data type, such as bool, uint8, int16, float32,  |
   |                  | and float64.                                     |
   +------------------+--------------------------------------------------+
   | device           | Target device, such as a CPU or GPU.             |
   +------------------+--------------------------------------------------+
   | name             | Tensor name.                                     |
   +------------------+--------------------------------------------------+

In the following, we explore each tensor attribute with image data as an example. Assume that our machine learning framework loads a 96-pixel by 96-pixel RGB (3-channel) image and converts the image data into a tensor for storage. A *rank*-3 tensor of *shape* [96,96,3] is generated, with the three dimensions representing the image height, image width, and number of channels, respectively. The pixels in the RGB image are represented by unsigned integers ranging from 0 to 255. Therefore, the *dtype* of the resulting tensor is uint8. The image data is normalized before it is fed into a CNN for training. Specifically, its data type is converted to float32 so that it is compatible with the default data type of common machine learning frameworks. Before training, the machine learning framework determines the compute device (i.e., CPU, GPU, or other hardware) and stores the data and weight parameters necessary for training in the memory of the corresponding hardware, as specified by the *device* attribute. Typically, the device attribute of a tensor is automatically assigned by the machine learning framework based on the hardware environment.
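To make these attributes concrete, the following sketch uses a NumPy array to stand in for a framework tensor (the attribute names ``shape``, ``ndim``, and ``dtype`` match NumPy's API; a ``device`` attribute is framework specific and is only indicated in a comment). The random image is an illustrative stand-in for real image data.

.. code:: python

   import numpy as np

   # A stand-in for a loaded 96 x 96 RGB image: a rank-3 tensor of shape [96, 96, 3].
   image = np.random.randint(0, 256, size=(96, 96, 3), dtype=np.uint8)

   print(image.shape)   # (96, 96, 3) -> height, width, channels
   print(image.ndim)    # 3           -> rank (number of axes)
   print(image.dtype)   # uint8       -> pixel values in [0, 255]

   # Normalize before feeding the data into a CNN: convert to float32
   # and rescale pixel values to [0, 1].
   normalized = image.astype(np.float32) / 255.0
   print(normalized.dtype)  # float32

   # A framework tensor would additionally carry a device attribute
   # (e.g., CPU or GPU), typically assigned automatically by the framework.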
Tensors are either mutable or immutable. Mutable tensors store weight parameters and are updated based on gradient information, for example, convolution kernel tensors that participate in convolution operations. Immutable tensors store initial user data or data input to models, for example, the image data tensor mentioned above.

What does a tensor look like in machine learning settings? Most tensors, like image data and convolution kernel tensors, are “rectangular” or “cubic” in shape. That is, such a tensor has the same number of elements along each of its axes. However, there are specialized tensors that have different shapes: ragged and sparse tensors. As shown in Figure :numref:`tensorclass`, a tensor is ragged if it has variable numbers of elements along some axes. Ragged tensors enable efficient storage and processing of irregularly shaped data, such as variable-length texts in natural language processing (NLP) applications. Sparse tensors are often used to handle the graph data of graph neural networks (GNNs) and are encoded using special formats such as the coordinate list (COO) to improve storage efficiency.

.. _tensorclass:

.. figure:: ../img/ch03/tensor-class.png
   :width: 800px

   Types of tensors

Operators are the basic compute units of neural networks. They process tensor data and implement common computational logic in machine learning, including data transformation, conditional control, mathematical calculation, etc. Based on their functionalities, operators are classified into tensor operators, neural network operators, data flow operators, and control flow operators; the sketch after this list illustrates these categories in code.

- **Tensor operators** involve tensor structure and mathematical operations. Typical tensor structure operations include reshaping tensors, permuting tensor dimensions, concatenating tensors, etc. For example, we may need to change the dimension order (between “channels first” and “channels last”) of image data tensors in CNN applications. Mathematical operations are tensor-based and include matrix multiplication, norm calculation, determinant calculation, eigenvalue calculation, etc. They are often seen in the gradient computation of machine learning models.
- **Neural network operators**, the foundation of neural network models, are the most common operators, including feature extraction, activation functions, loss functions, optimization algorithms, etc. Feature extraction refers to extracting feature tensors from input data in CNN tasks. With the nonlinearity introduced by activation functions, neural networks can model highly complex relationships and patterns in data. Optimization algorithms are used to update model parameters so that the loss function is minimized.
- **Data flow operators** cover data preprocessing and loading. Data preprocessing mainly refers to resizing, padding, normalization, and augmentation of mostly visual and textual data, whereas data loading involves operations such as shuffling, batching, and pre-fetching of the dataset. Data flow operators transform raw input data into a format meaningful to the machine learning framework and efficiently load the data to the network for training or inference according to the defined number of iterations, reducing memory usage and wait time.
- **Control flow operators**, usually found in flexible and complex models, are used to control data flows in computational graphs. Typical control flow operators are conditional operators and loop operators. They are provided by either the machine learning framework or the frontend language. Control flow operations affect data flows in both forward and backward computation of neural networks.
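The short sketch below illustrates the first three categories, with NumPy standing in for a framework's operator library: a tensor structure operation (switching between “channels last” and “channels first”), a mathematical operation (matrix multiplication), a neural network operator (the ReLU activation), and a minimal data flow operation (batching). The helper functions ``relu`` and ``batch`` are illustrative, not a specific framework API.

.. code:: python

   import numpy as np

   # Tensor operator (structure): switch an image from channels-last [H, W, C]
   # to channels-first [C, H, W].
   image = np.zeros((96, 96, 3), dtype=np.float32)
   image_chw = np.transpose(image, (2, 0, 1))

   # Tensor operator (math): matrix multiplication.
   x = np.random.randn(4, 8).astype(np.float32)
   w = np.random.randn(8, 16).astype(np.float32)
   y = np.matmul(x, w)

   # Neural network operator: the ReLU activation introduces nonlinearity.
   def relu(t):
       return np.maximum(t, 0.0)

   activated = relu(y)

   # Data flow operator: split a dataset into mini-batches for loading.
   def batch(dataset, batch_size):
       for i in range(0, len(dataset), batch_size):
           yield dataset[i:i + batch_size]

   batches = list(batch(np.arange(10), batch_size=4))  # [0..3], [4..7], [8..9]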
Computational Dependencies
~~~~~~~~~~~~~~~~~~~~~~~~~~

In a computational graph, the dependencies between operators influence the execution sequence and parallelism of operators. The computational graphs involved in machine learning algorithms are directed acyclic graphs, where data flows must not lead to circular dependencies. With a circular dependency, the training program would run into an infinite loop and never terminate by itself. Data caught in such a loop tends toward infinity or 0, yielding invalid results. To analyze the execution sequence and facilitate model topology design, the following describes the dependencies between the compute nodes in a computational graph.

As shown in Figure :numref:`dependence`, if the Matmul1 operator is removed from the graph, there will be no input to the downstream activation function, and the data flow will be interrupted. We can therefore conclude that the operators in this computational graph depend on each other, and that these dependencies are transitive.

.. _dependence:

.. figure:: ../img/ch03/dependence.png
   :width: 400px

   Computational dependencies

There are three types of dependencies.

- **Direct dependency**: For example, the ReLU1 node is directly dependent on the Matmul1 node. That is, ReLU1 can run properly only when it receives a direct output from Matmul1.
- **Indirect dependency**: For example, the Add node indirectly depends on the Matmul1 node. Specifically, Matmul1’s output is processed by one or more intermediate nodes and then transmitted to the Add node. The Add node directly or indirectly depends on the intermediate nodes.
- **Mutual independence**: For example, the graph shows no input/output dependency between Matmul1 and Matmul2, meaning that the two nodes are independent of each other.

In the computational graph shown in Figure :numref:`recurrent`, the Add node indirectly depends on the Matmul node; conversely, the Matmul node directly depends on the Add node. The two nodes are stuck waiting for each other’s output to start their computation. When input data is manually assigned to the two nodes at the same time, they compute endlessly, and the training process can never terminate by itself. A circular dependency produces a positive feedback data flow, where data values overflow to positive infinity, underflow to negative infinity, or tend to 0. These all lead to unexpected training results. As such, we should avoid circular dependencies between operators when designing deep learning models.

.. _recurrent:

.. figure:: ../img/ch03/recurrent.png
   :width: 300px

   Circular dependency

In machine learning frameworks, the *unrolling* method is used to represent loop iterations. Figure :numref:`unroll` shows a computational graph involving three loop iterations. The subgraph of the loop body is replicated three times (once per iteration) to produce an unrolled loop, and the resulting subgraphs are concatenated in iteration order. The subgraph of one iteration has a direct dependency on that of the previous iteration. In one computational graph, tensors and operators are uniquely identified across the loop iterations, even for the same operation. Unlike circular dependencies, loop iterations do not involve mutual dependencies between operators with unique identifiers. When a subgraph is replicated to produce an unrolled loop, the replicated tensors and operators are assigned new identifiers to avoid circular dependencies.

.. _unroll:

.. figure:: ../img/ch03/unroll.png
   :width: 800px

   Unrolled loop
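Because the graph is a directed acyclic graph, a valid execution order can be obtained with a topological sort, and mutually independent operators may run in parallel. The following sketch uses Python's standard library on a hypothetical graph loosely modeled on Figure :numref:`dependence` (the node names and the exact topology are illustrative).

.. code:: python

   from graphlib import TopologicalSorter  # Python 3.9+

   # Map each operator to the set of operators it depends on (its inputs).
   dependencies = {
       "Matmul1": set(),
       "Matmul2": set(),
       "ReLU1": {"Matmul1"},           # direct dependency
       "ReLU2": {"Matmul2"},
       "Add": {"ReLU1", "ReLU2"},      # indirectly depends on Matmul1 and Matmul2
   }

   sorter = TopologicalSorter(dependencies)
   print(list(sorter.static_order()))
   # One valid order: ['Matmul1', 'Matmul2', 'ReLU1', 'ReLU2', 'Add'].
   # Matmul1 and Matmul2 are mutually independent, so a scheduler may run them
   # in parallel. A circular dependency would raise graphlib.CycleError instead.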
Control Flows
~~~~~~~~~~~~~

A control flow maintains the sequence of computation tasks, thereby facilitating the design of flexible and complex models. By introducing a control flow to a model, we can execute a node iteratively any number of times or skip a node based on specific conditions. Many deep learning models rely on control flows for training and inference. For example, models built on recurrent neural networks (RNNs) and reinforcement learning rely on recurrence relations and input status conditions to complete the computation. Popular machine learning frameworks provide two major types of control flows:

- **Frontend control flows**: Python control flow statements are used to implement control decision-making in a computational graph. Frontend control flows are easy to use in model building. However, because the computation process of the machine learning framework runs on the backend hardware and the control flow is decoupled from the data flow, the computational graph cannot run entirely on the backend hardware. As such, control flow implementations using the frontend language are referred to as the *out-of-graph approach*.
- **Framework control primitives**: Machine learning frameworks come with built-in low-level, fine-grained control primitive operators. Such operators are executable on compute hardware. When they are introduced to a model, the computational graph can run entirely on the backend hardware. Control flow implementations of this type are referred to as the *in-graph approach*.

To explain why we need these different approaches to implement control flows, let’s look at the differences between the two. The out-of-graph approach is familiar to Python programmers. This flexible, intuitive approach allows direct use of Python statements such as ``if-else``, ``while``, and ``for`` in building control flows. The in-graph approach, by contrast, is more complicated. TensorFlow provides a range of in-graph control flow operators (such as ``tf.cond`` for conditional control, ``tf.while_loop`` for loop control, and ``tf.case`` for branch control). These operators are composites of lower-level primitive operators. The control flow representations adopted by the in-graph approach are in a different style from common programming; this improves computing performance but comes at the expense of usability.

The out-of-graph approach is easier to use. However, not all backend compute hardware is compatible with the frontend runtime environment, and extra effort may be needed to execute the frontend control flows. By contrast, control flows implemented using the in-graph approach are directly executable on hardware, independent of the frontend environment, improving efficiency throughout the model building, optimization, and execution process.

The two approaches serve different application scenarios. To run tasks such as model training, inference, and deployment on compute hardware independent of the frontend environment, the in-graph approach is recommended for building control flows. For model validation purposes, the out-of-graph approach allows model code to be generated from the model algorithm more efficiently. Major machine learning frameworks support both the out-of-graph and in-graph approaches.
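The following sketch contrasts the two approaches on the same conditional branch, assuming TensorFlow 2.x with eager execution (the ``tf.cond`` operator is the one named above; the tensor value and the branch functions are illustrative).

.. code:: python

   import tensorflow as tf

   x = tf.constant(2.0)

   # Out-of-graph approach: an ordinary Python if statement decides the branch
   # in the frontend, so only the selected branch is handed to the backend.
   if x > 0:
       y_out_of_graph = tf.square(x)
   else:
       y_out_of_graph = tf.abs(x)

   # In-graph approach: tf.cond records the condition and both branches as
   # operators in the computational graph, so the whole graph (including the
   # control decision) can run on the backend hardware.
   y_in_graph = tf.cond(x > 0, lambda: tf.square(x), lambda: tf.abs(x))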
In the following illustrations of the impact of control flows on forward and backward computation, we adopt the out-of-graph approach for control flow implementations, given that frontend control flows are more popular in practice. The most common control flows include conditional branches and loops. For a model containing control flow operations, the control flow is replicated to the gradient computational graph during backpropagation so that the required tensor gradients can be accurately calculated. The following code shows an example of simple conditional control, where ``matmul`` denotes the matrix multiplication operator.

.. code:: python

   def control(A, B, C, conditional = True):
       if conditional:
           y = matmul(A, B)
       else:
           y = matmul(A, C)
       return y

Figure :numref:`if` depicts the forward and backward computational graphs of the above code. When running a model containing ``if`` conditions, the program needs to know which branch of each condition is taken so that it can apply the gradient computation logic to the right branch. In the forward computational graph, tensor :math:`\boldsymbol{C}` does not participate in computation due to conditional control. Similarly, in the backward computational graph, tensor :math:`\boldsymbol{C}` is skipped in gradient computation.

.. _if:

.. figure:: ../img/ch03/if.png
   :width: 600px

   Computational graphs of conditional control

A control loop allows us to execute an operation zero or more times in a loop. When the loop is unrolled, each operation is assigned a unique identifier to distinguish different calls to the same operation. Each iteration directly depends on the result of the previous one. Therefore, one or more lists of tensors need to be maintained in the control loop for storing the per-iteration intermediate results used in the forward pass and gradient computation. The following code shows a control loop example. In its unrolled loop, :math:`\boldsymbol{X_i}` and :math:`\boldsymbol{W_i}` are the lists of tensors that need to be maintained.

.. code:: python

   def recurrent_control(X : Tensor, W : Sequence[Tensor], cur_num = 3):
       for i in range(cur_num):
           X = matmul(X, W[i])
       return X

   # Unroll the loop to obtain an equivalent representation.
   def recurrent_control(X : Tensor, W : Sequence[Tensor]):
       W, W1, W2 = W[0], W[1], W[2]  # Let W = W[0], W1 = W[1], and W2 = W[2].
       X1 = matmul(X, W)
       X2 = matmul(X1, W1)
       Y = matmul(X2, W2)
       return Y

The forward and backward computational graphs of this code are shown in Figure :numref:`while`. The gradient of the control loop is also a loop, with the same number of iterations as the forward loop. The gradient value output by one iteration serves as the input for calculating the gradient of the next iteration, until the loop ends.

.. _while:

.. figure:: ../img/ch03/while.png
   :width: 600px

   Computational graphs of loop control
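To make the role of the saved per-iteration tensors concrete, the following NumPy sketch runs the loop forward while recording the input of every iteration, and then replays the loop in reverse to accumulate gradients with the chain rule. The shapes and the simple scalar loss ``sum(Y)`` are illustrative assumptions, not part of the example above.

.. code:: python

   import numpy as np

   rng = np.random.default_rng(0)
   X = rng.standard_normal((2, 4)).astype(np.float32)
   W = [rng.standard_normal((4, 4)).astype(np.float32) for _ in range(3)]

   # Forward pass: keep the input of every iteration, as the control loop must.
   saved_inputs = []
   out = X
   for i in range(3):
       saved_inputs.append(out)
       out = out @ W[i]
   loss = out.sum()

   # Backward pass: iterate in reverse, consuming the saved intermediates.
   grad_out = np.ones_like(out)                  # dLoss/dY for loss = sum(Y)
   grad_W = [None] * 3
   for i in reversed(range(3)):
       grad_W[i] = saved_inputs[i].T @ grad_out  # dLoss/dW[i]
       grad_out = grad_out @ W[i].T              # propagate to the previous iteration
   grad_X = grad_out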
Gradient Computation Using the Chain Rule
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In the loop unrolling example in the previous section, when input tensor :math:`\boldsymbol{X}` is fed into the neural network, the data is propagated forward one layer at a time in the computational graph, and the intermediate variables are calculated and stored until :math:`\boldsymbol{Y}` is output after multilayer computation. In DNN training, the loss function result is calculated based on the output of forward propagation and the label value. The model backpropagates the loss function information through the computational graph and updates the training parameters based on the computed gradients.

Typically, backpropagation works by computing the gradients of the loss function with respect to each parameter. Backpropagation based on other information can also work but is not discussed here. The chain rule is used to calculate these gradients during backpropagation. In calculus, the chain rule provides a technique for finding the derivatives of composite functions: the derivative of a composite function at a given point is the product of the derivatives of the individual functions at the corresponding points. Assume that *f* and *g* are functions that map real numbers to real numbers. If :math:`y=g(x)` and :math:`z=f(y)=f(g(x))`, the derivative of *z* with respect to *x* is

.. math:: \frac{dz}{dx}=\frac{dz}{dy}\frac{dy}{dx}
   :label: ch04-1

The backpropagation algorithm of neural networks executes the chain rule in the sequence defined by the backward computational graph. Generally, neural networks accept 3D tensor inputs and output 1D vectors, so the scalar chain rule in Equation :eq:`ch04-1` needs to be generalized to tensors as follows. Assuming that :math:`\boldsymbol{X}` is an *m*-dimensional tensor, :math:`\boldsymbol{Y}` is an *n*-dimensional tensor, :math:`z` is a scalar (such as the loss value), :math:`\boldsymbol{Y}=g(\boldsymbol{X})`, and :math:`z=f(\boldsymbol{Y})`, the partial derivative of :math:`z` with respect to each element of :math:`\boldsymbol{X}` is

.. math:: \frac{\partial z}{\partial x_i}=\sum_j\frac{\partial z}{\partial y_j}\frac{\partial y_j}{\partial x_i}
   :label: ch04-2

The equivalent form of Equation :eq:`ch04-2` is

.. math:: \nabla_{\boldsymbol{X}}z = \left(\frac{\partial \boldsymbol{Y}}{\partial \boldsymbol{X}}\right)^{\top}\nabla_{\boldsymbol{Y}}z
   :label: ch04-3

where :math:`\nabla_{\boldsymbol{X}}z` represents the gradient of :math:`z` with respect to :math:`\boldsymbol{X}`.

Figure :numref:`chain` shows the application of the chain rule in neural networks, illustrating both the forward and backward passes in a single graph. The neural network performs matrix multiplication twice to obtain the predicted value :math:`\boldsymbol{Y}`, and then performs gradient backpropagation based on the error between the output value and the label value to update the weight parameters so as to minimize the error. The weight parameters to be updated include :math:`\boldsymbol{W}` and :math:`\boldsymbol{W_1}`.

.. _chain:

.. figure:: ../img/ch03/chain.png
   :width: 600px

   Backpropagation computational graph

The mean squared error (MSE) is selected as the loss function in this example. Two important questions arise here: How does the loss function transfer the gradient information to :math:`\boldsymbol{W}` and :math:`\boldsymbol{W_1}` using the chain rule? And why do we need to calculate the gradients of non-parameter data :math:`\boldsymbol{X}` and :math:`\boldsymbol{X_1}`? To answer these questions, let’s analyze the computation details of forward and backward propagation. First, the loss value is calculated through forward propagation in three steps: (1) :math:`\boldsymbol{X_1}=\boldsymbol{XW}`; (2) :math:`\boldsymbol{Y}=\boldsymbol{X_1W_1}`; and (3) :math:`{\rm Loss}=\frac{1}{2}(\boldsymbol{Y}-{\rm Label})^2`. The loss function is calculated to minimize the distance between the predicted value and the label value.
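Expressed as code, the three forward steps look as follows (a minimal NumPy sketch; the shapes and random values are illustrative assumptions):

.. code:: python

   import numpy as np

   rng = np.random.default_rng(0)
   X = rng.standard_normal((2, 4))        # input data
   W = rng.standard_normal((4, 3))        # first weight matrix
   W1 = rng.standard_normal((3, 3))       # second weight matrix
   Label = rng.standard_normal((2, 3))    # target values

   X1 = X @ W                             # step (1)
   Y = X1 @ W1                            # step (2)
   Loss = 0.5 * np.sum((Y - Label) ** 2)  # step (3), MSE-style loss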
According to the chain rule, backpropagation is performed through Equations :eq:`ch04-4` and :eq:`ch04-5` to calculate the gradients of the loss function with respect to parameters :math:`\boldsymbol{W}` and :math:`\boldsymbol{W_1}`:

.. math:: \frac{\partial {\rm Loss}}{\partial \boldsymbol{W_1}}=\frac{\partial \boldsymbol{Y}}{\partial \boldsymbol{W_1}}\frac{\partial {\rm Loss}}{\partial \boldsymbol{Y}}
   :label: ch04-4

.. math:: \frac{\partial {\rm Loss}}{\partial \boldsymbol{W}}=\frac{\partial \boldsymbol{X_1}}{\partial \boldsymbol{W}}\frac{\partial {\rm Loss}}{\partial \boldsymbol{Y}}\frac{\partial \boldsymbol{Y}}{\partial \boldsymbol{X_1}}
   :label: ch04-5

Both Equations :eq:`ch04-4` and :eq:`ch04-5` require :math:`\frac{\partial {\rm Loss}}{\partial \boldsymbol{Y}}`, which corresponds to grad :math:`\boldsymbol{Y}` in Figure :numref:`chain`. The term :math:`\frac{\partial {\rm Loss}}{\partial \boldsymbol{Y}}\frac{\partial \boldsymbol{Y}}{\partial \boldsymbol{X_1}}` in Equation :eq:`ch04-5` corresponds to grad :math:`\boldsymbol{X_1}` in Figure :numref:`chain`. To calculate the gradient of model parameter :math:`\boldsymbol{W}`, the gradient of intermediate result :math:`\boldsymbol{X_1}` is calculated first. This also answers the second question raised above: the gradients of non-parameter intermediate results are calculated to facilitate gradient computation with respect to each parameter.

Because :math:`\boldsymbol{X_1}=\boldsymbol{XW}`, :math:`\boldsymbol{Y}=\boldsymbol{X_1W_1}`, and :math:`{\rm Loss}=\frac{1}{2}(\boldsymbol{Y}-{\rm Label})^2`, Equations :eq:`ch04-4` and :eq:`ch04-5` are expanded to Equations :eq:`ch04-6` and :eq:`ch04-7`, respectively, according to Equation :eq:`ch04-3`. We can then analyze how variables participate in gradient computation when the machine learning framework uses the chain rule to build a backward computational graph.

.. math:: \frac{\partial {\rm Loss}}{\partial \boldsymbol{W_1}}=\frac{\partial \boldsymbol{Y}}{\partial \boldsymbol{W_1}}\frac{\partial {\rm Loss}}{\partial \boldsymbol{Y}}=\boldsymbol{X_1}^\top(\boldsymbol{Y}-{\rm Label})
   :label: ch04-6

.. math:: \frac{\partial {\rm Loss}}{\partial \boldsymbol{W}}=\frac{\partial \boldsymbol{X_1}}{\partial \boldsymbol{W}}\frac{\partial {\rm Loss}}{\partial \boldsymbol{Y}}\frac{\partial \boldsymbol{Y}}{\partial \boldsymbol{X_1}}=\boldsymbol{X}^\top(\boldsymbol{Y}-{\rm Label})\boldsymbol{W_1}^\top
   :label: ch04-7

Equation :eq:`ch04-6` uses intermediate result :math:`\boldsymbol{X_1}` from the forward computational graph when calculating the gradient of :math:`\boldsymbol{W_1}`. In Equation :eq:`ch04-7`, both input :math:`\boldsymbol{X}` and parameter :math:`\boldsymbol{W_1}` are used for calculating the gradient of parameter :math:`\boldsymbol{W}`. This answers the first question: the gradient information transferred backward from downstream network layers, together with the intermediate results and parameter values from forward computation, all play a role in calculating the gradient of each parameter in the graph.

As Figure :numref:`chain` and Equations :eq:`ch04-4`, :eq:`ch04-5`, :eq:`ch04-6`, and :eq:`ch04-7` show, when the chain rule is used to construct a backward computational graph, the computation process is analyzed, and the intermediate results and gradient transfer status in the model are stored. The machine learning framework improves backpropagation efficiency by reusing these buffered computation results.
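Continuing the forward sketch above, the following code evaluates Equations :eq:`ch04-6` and :eq:`ch04-7` directly and checks one entry against a numerical gradient obtained by finite differences (a minimal sanity check under the shapes assumed earlier, not a framework implementation):

.. code:: python

   # Analytical gradients from Equations ch04-6 and ch04-7.
   grad_Y = Y - Label
   grad_W1 = X1.T @ grad_Y        # dLoss/dW1 = X1^T (Y - Label)
   grad_W = X.T @ grad_Y @ W1.T   # dLoss/dW  = X^T (Y - Label) W1^T

   # Numerical check by central finite differences on one entry of W.
   def loss_fn(W_, W1_):
       Y_ = (X @ W_) @ W1_
       return 0.5 * np.sum((Y_ - Label) ** 2)

   eps = 1e-6
   W_plus, W_minus = W.copy(), W.copy()
   W_plus[0, 0] += eps
   W_minus[0, 0] -= eps
   numerical = (loss_fn(W_plus, W1) - loss_fn(W_minus, W1)) / (2 * eps)
   print(np.isclose(numerical, grad_W[0, 0]))  # True, up to numerical error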
We can generalize the chain rule to wider applications. With flexible control flows, the machine learning framework can quickly analyze the computation processes of the forward data flow and the backward gradient flow by using computational graph technology, effectively manage the lifetime of each intermediate result in memory, and improve the overall computation efficiency.