Scheduling and Executing Computational Tasks
--------------------------------------------

Training a model is conducted by scheduling the execution of the operators in
a computational graph. From a broad perspective, a training job runs a
computational graph for a defined number of iterations, relying on optimal
scheduling of tasks such as data loading and training (or inference)
execution. Within each iteration, we need to analyze operator-level
scheduling based on the graph topology, computational dependencies, and
control flows. We optimize the scheduling and execution of computational
graphs to make full use of computing resources, improve computational
efficiency, and shorten the model training and inference time. The following
introduces typical techniques of computational graph scheduling.

Based on how the graph is generated, the scheduling and execution of a
computational graph can be divided into three modes: operator scheduling,
whole-graph scheduling, and combined operator and subgraph scheduling. These
three modes correspond to the dynamic graph, static graph, and hybrid
(dynamic plus static) graph generation mechanisms, respectively. The
following describes each mode in detail.

Operator Scheduling
~~~~~~~~~~~~~~~~~~~

In operator scheduling, the operators contained in an algorithm or model are
scheduled and executed one by one by the Python runtime. This mechanism is
used when the computational graph is executed in dynamic graph mode, such as
PyTorch’s default execution mode and TensorFlow’s eager mode.

Operator scheduling consists of two steps. First, a linear operator
scheduling sequence is derived from the order in which the model code calls
its operators, yielding the dynamic computational graph. Second, the ordered
operators are dispatched to instruction streams. In Figure
:numref:`schedule`, the directed acyclic graph on the left contains five
nodes a, b, c, d, and e and four dependency edges a->d, b->c, c->d, and d->e
(e.g., a->d indicates that d depends on a). Following the operator call
sequence of the model code, a->b->c->d->e, all operator nodes are put into
the queue in turn, and the scheduling ends.

.. _schedule:

.. figure:: ../img/ch03/schedule.png
   :width: 700px

   Operator scheduling

With this ordering in place, the operators and their associated data are
dispatched to the GPU hardware for execution. Figure :numref:`single_op_exec`
shows the trace of operator scheduling. Once the Python runtime calls an
operator, the machine learning framework initializes the operator by
determining information such as the operator precision, the type and size of
each input/output, and the target device. It then allocates memory for the
operator before copying the data to the specific device for execution.

.. _single_op_exec:

.. figure:: ../img/ch03/single_op_exec.png
   :width: 700px

   Operator scheduling and execution

The operator scheduling method offers high flexibility because operators are
directly scheduled by the Python runtime. It facilitates the representation
of complex computational logic (such as control flows) and the use of
Python-native data structures for implementing complex algorithms. Operators
are driven by the Python runtime to finish computational tasks, enabling easy
collaboration with Python’s large, rich ecosystem.
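As a concrete illustration, the following minimal sketch replays the graph
from Figure :numref:`schedule` in PyTorch’s eager mode; the tensor shapes and
device choice are illustrative assumptions. Each Python-level call dispatches
one operator as soon as the interpreter reaches it, following the call
sequence a->b->c->d->e.

.. code-block:: python

   import torch

   # Eager (dynamic-graph) execution: every call below dispatches one
   # operator to the device the moment the Python interpreter reaches it.
   device = "cuda" if torch.cuda.is_available() else "cpu"

   a = torch.randn(512, 512, device=device)  # operator a
   b = torch.randn(512, 512, device=device)  # operator b
   c = torch.relu(b)                          # c depends on b (b->c)
   d = a @ c                                  # d depends on a and c (a->d, c->d)
   e = d.sum()                                # e depends on d (d->e)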
Despite its advantages, operator scheduling also has some disadvantages. One
is that context-based runtime optimizations such as operator fusion and
algebraic simplification become difficult, because global information about
the computational graph is unavailable. Another is that computational tasks
have to run serially rather than in parallel, because the computational
topology is unknown.

Graph Scheduling
~~~~~~~~~~~~~~~~

When a computational graph generated with the static graph mechanism is
scheduled and executed as a whole, operators are still delivered to the
hardware one by one in a certain execution order. However, global information
about the computational graph is available: the framework can analyze
operator dependencies and the number of computing devices, and complete the
scheduling and execution of the entire graph in either of the following two
ways:

- **Serial**: executes tasks one at a time, in the order in which they are
  added to the queue. This method expands a computational graph into a
  sequence of operators, which are then run separately. Operators are
  executed in a static order using a single thread, thereby requiring fewer
  resources.

- **Parallel**: executes tasks concurrently for higher efficiency. This
  method expands a computational graph based on operator dependencies.
  Operators are executed in the order defined by their input dependencies,
  and those without input dependencies are executed concurrently. This method
  executes operators in a dynamic order (which may vary in each iteration)
  using multiple threads, thereby consuming more system resources.

Within a computational graph, most operators depend on each other directly or
indirectly. When scheduling such operators, their sequence must be
guaranteed. Figure :numref:`order` shows a computational graph, where a
forward pass is run on the input data to produce a predicted value and then
the gradient of the loss function is computed for backpropagation. In
general, downstream operators depend on the output of upstream operators. As
such, we have to schedule the operators in this computational graph into a
serial queue in order to ensure that each operator receives the necessary
input.

.. _order:

.. figure:: ../img/ch03/order.png
   :width: 800px

   Serial operator scheduling

A computational graph may also contain operators that are independent of each
other, for example, op1 and op2 shown in Figure :numref:`para`. We can have
each operator run on a different hardware device to implement parallel
computing. Compared with the serial mode, parallel computing decreases
execution time by leveraging more computing resources at the same time.

.. _para:

.. figure:: ../img/ch03/para.png
   :width: 800px

   Parallel operator scheduling

Serial execution and parallel execution have their own advantages and
disadvantages, as summarized in :numref:`comparsion_execution`.

.. _comparsion_execution:

.. list-table:: Comparison between serial execution and parallel execution
   :header-rows: 1

   * - Execution Method
     - Serial execution
     - Parallel execution
   * - Execution order
     - Static
     - Dynamic
   * - Execution threads
     - Single thread
     - Multiple threads
   * - Resource consumption
     - Low
     - High
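To make the two strategies concrete, here is a minimal, framework-agnostic
sketch; the five-node graph and the ``run`` stub are illustrative
assumptions. The serial scheduler expands the graph into one topological
sequence on a single thread, while the parallel scheduler launches every
operator whose inputs are ready on a thread pool.

.. code-block:: python

   from concurrent.futures import ThreadPoolExecutor

   # The DAG from Figure :numref:`schedule`: deps[v] lists the operators
   # that v depends on.
   deps = {"a": [], "b": [], "c": ["b"], "d": ["a", "c"], "e": ["d"]}

   def run(op):                      # stand-in for launching a real kernel
       print("executing", op)

   def serial_schedule(deps):
       """Expand the graph into one static topological order; one thread."""
       done = set()
       while len(done) < len(deps):
           for op, inputs in deps.items():
               if op not in done and all(i in done for i in inputs):
                   run(op)
                   done.add(op)

   def parallel_schedule(deps):
       """Launch every operator whose inputs are ready; multiple threads."""
       done = set()
       with ThreadPoolExecutor() as pool:
           while len(done) < len(deps):
               ready = [op for op, inputs in deps.items()
                        if op not in done and all(i in done for i in inputs)]
               list(pool.map(run, ready))   # a and b run concurrently here
               done.update(ready)

   serial_schedule(deps)     # a, b, c, d, e in one fixed order
   parallel_schedule(deps)   # independent operators launch together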
A computing environment may contain more than one type of computing device,
such as a CPU, a GPU, or another accelerator. A computational graph whose
operators run on more than one type of computing device is referred to as a
heterogeneous computational graph. Based on the hardware they run on, the
graph contains the following types of operators:

- **CPU operators**: C++ operators that run on the host CPU. Their
  performance depends on the extent to which the multi-core capability of
  the CPU is utilized.

- **GPU operators**: Operators that run on a GPU (e.g., an NVIDIA GPU). GPU
  kernels are delivered to the GPU one by one for execution. The GPU features
  ample parallel computing units that offer significant speedup to parallel
  algorithms.

- **Python operators**: Operators that run on the host CPU. Unlike CPU
  operators, Python operators are interpreted and executed by the Python
  interpreter.

As mentioned earlier, the dynamic graph mechanism relies on the Python
interpreter to dispatch operators and execute them serially in the order
defined by the model code. In this mode, data often has to move between
different computing devices, and the resulting communication bottlenecks can
leave operators waiting for their input data, reducing the overall execution
efficiency of the computational graph. Efficient execution therefore requires
accurately identifying the device on which each operator runs and avoiding
data transfers between devices wherever possible, while scheduling
independent operators on different devices in parallel.

The static graph mechanism, by contrast, frees execution from the constraints
of the Python interpreter: the entire computational graph is delivered to the
device at once, which reduces the number of interactions between the host and
the computing chip and improves computing efficiency and performance.

Combined operator and subgraph scheduling merges the two preceding execution
modes. Because graph structures are flexible, executing an entire complex
graph on the computing chip may not be optimal: computing chips excel at
accelerating floating-point operations, whereas CPUs are better at processing
logical judgments. The parts that the computing chip executes inefficiently
can therefore be separated out and handed to a device that executes them more
efficiently, such as the CPU, balancing both performance and flexibility.
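As a rough illustration of this division of labor, the following PyTorch
sketch (the shapes and the tanh workload are arbitrary assumptions) keeps the
floating-point-heavy operators on the accelerator while the loop’s control
logic runs on the host CPU, crossing the device boundary only twice.

.. code-block:: python

   import torch

   # Compute-heavy operators stay on the accelerator; control flow (the
   # Python loop) stays on the host; data crosses the boundary only twice.
   device = "cuda" if torch.cuda.is_available() else "cpu"

   x = torch.randn(1024, 1024)       # created on the host CPU
   x = x.to(device)                  # one host-to-device transfer

   for _ in range(10):               # logical judgment handled by the CPU
       x = torch.tanh(x @ x)         # device operators; no transfers here

   result = x.norm().item()          # one device-to-host transfer at the end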
There are different levels of parallelism: operator parallelism, model
parallelism, and data parallelism. Operator parallelism is not just about
executing independent operators in parallel; where applicable, we can further
partition an operator into multiple parallel child operations. Model
parallelism refers to partitioning a computational graph among several
devices in order to shorten the time taken by each training iteration. Data
parallelism involves training the same computational graph on different data,
reducing the total number of iterations and improving training efficiency. We
discuss these three parallelism methods in Chapter Distributed Training.

Synchronous and Asynchronous Data Loading
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

As previously mentioned, a single training iteration of a computational graph
goes through three serial tasks: data loading, data preprocessing, and model
training. Each task depends on the output of the previous one. To schedule
these three types of tasks across iterations, we can use either a synchronous
or an asynchronous mechanism at the iteration level.

- **Synchronous**: Tasks are executed in order, one after another. Each task
  waits for, and coordinates with, the others.

- **Asynchronous**: As soon as a task completes, the same task in the next
  iteration can begin immediately.

If the synchronous mechanism is adopted to train the computational graph
shown in Figure :numref:`synchronization`, then in each iteration a batch of
input data is loaded, preprocessed, and passed to the computational graph for
model training and parameter update. Tasks in the next iteration wait until
the current iteration is complete. The synchronous mechanism wastes
computation and communication resources because the data preprocessing and
model training tasks must wait until a batch of data is completely loaded,
and because the I/O channel for data loading sits idle during model training.

.. _synchronization:

.. figure:: ../img/ch03/sync.png
   :width: 800px

   Synchronous mechanism

In the asynchronous setting shown in Figure :numref:`asynchronous`, after
loading and passing a batch of input data to the subsequent data
preprocessing task, the I/O channel immediately moves on to the next batch
without waiting for the current iteration to complete. In contrast with the
synchronous mechanism, the idle time between data loading, data
preprocessing, and model training is notably reduced, thereby shortening the
overall training time and improving execution efficiency.

.. _asynchronous:

.. figure:: ../img/ch03/async.png
   :width: 800px

   Asynchronous mechanism

To further shorten the training time and improve execution efficiency, we can
combine the asynchronous mechanism with parallel computing, as shown in
Figure :numref:`asyn_para`. On the one hand, the asynchronous mechanism
reduces the model’s wait time for data loading and preprocessing, allowing
the model to traverse the entire dataset quickly. On the other hand, parallel
computing increases the batch size in iterative training, improving the
efficiency of computing resources.

.. _asyn_para:

.. figure:: ../img/ch03/para-async.png
   :width: 800px

   Asynchronous mechanism combined with parallel computing
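The asynchronous mechanism of Figure :numref:`asynchronous` can be sketched
with a bounded buffer and a background loader thread. The functions
``load_batch``, ``preprocess``, and ``train_step`` below are hypothetical
stand-ins for real I/O and compute; the point is that loading the next batch
overlaps with training on the current one.

.. code-block:: python

   import queue
   import threading
   import time

   # Hypothetical stand-ins: in practice these would read from disk,
   # decode samples, and run a forward/backward pass plus optimizer step.
   def load_batch(i):
       time.sleep(0.01)              # simulate I/O latency
       return list(range(i, i + 4))

   def preprocess(batch):
       return [x * 2 for x in batch]

   def train_step(batch):
       time.sleep(0.01)              # simulate device compute

   # Asynchronous loading: a background thread fills a bounded buffer, so
   # the I/O channel moves on to the next batch without waiting for the
   # current training iteration to finish.
   def loader(num_batches, buffer):
       for i in range(num_batches):
           buffer.put(preprocess(load_batch(i)))
       buffer.put(None)              # sentinel marks the end of the dataset

   buffer = queue.Queue(maxsize=4)   # bounded prefetch buffer
   threading.Thread(target=loader, args=(100, buffer), daemon=True).start()

   while (batch := buffer.get()) is not None:
       train_step(batch)             # overlaps with loading/preprocessing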