Generating a Computational Graph
--------------------------------

In the previous section, we explored the ingredients of a computational graph. Now let’s proceed to the next question: how is a computational graph automatically generated? Machine learning frameworks support two approaches to implementing computational graphs: static and dynamic. The static approach builds a static (unchanging) graph based on information such as the network topology and parameter variables described by the frontend language. Because the resulting graph is independent of the frontend language, static graphs are especially suitable for model deployment (e.g., deploying a facial recognition application on mobile devices).

Unlike the static approach, the dynamic approach generates a temporary graph from the frontend description each time the model is executed. Dynamic graphs are easy to debug, making it possible to fine-tune models efficiently on the fly. Major machine learning frameworks such as TensorFlow and MindSpore are compatible with both approaches. And although PyTorch uses dynamic graphs, it also offers dynamic-to-static conversion support for efficient model execution. To choose the right approach for a specific task, we need to consider the task requirements as well as the pros and cons of each approach.

Static Graph
~~~~~~~~~~~~

The static graph approach decouples the definition and execution processes. That is, a static graph is compiled before it is executed, as shown in Figure :numref:`static`.

.. _static:

.. figure:: ../img/ch03/static.png
   :width: 800px

   Generating and executing a static graph

When a model program is written in the frontend language, the machine learning framework first analyzes the model topology for information such as the connections between network layers, parameter variable settings, and loss functions.
The framework then compiles the model description into fixed code (i.e., a static computational graph) that can be invoked and executed by the computing backend. In this case, subsequent training or inference on this model is no longer frontend-dependent. Specifically, when input data is fed into the static graph, the operators in the graph are directly scheduled to hardware for execution. To improve hardware computational efficiency, we can also convert a static graph into other equivalent structures through various optimization strategies.

Let us walk through generating and executing a simple static graph. In the frontend definition phase, some machine learning frameworks require developers to declare predefined configuration items including tensor placeholders, loss functions, optimization functions, network building and runtime environments, and network executors, as well as in-graph control statements using control flow operators. The design of machine learning frameworks has recently been improved to provide easy-to-use APIs and a unified model building paradigm. For example, MindSpore enables unified frontend programming representations featuring dynamic and static integration. To illustrate, let’s consider the following simple model.

.. code:: python

   def model(X, flag):
       if flag > 0:
           Y = matmul(W1, X)
       else:
           Y = matmul(W2, X)
       Y = Y + b
       Y = relu(Y)
       return Y

The machine learning framework does not load input data when generating a static graph. Instead, *placeholder* tensors are used to stand in for the input data. In the static graph defined by the above code, we need to create a placeholder for input :math:`\boldsymbol{X}` in line 1. Because no actual input is fed into the model during static graph generation, the control flow defined in line 2 cannot make control decisions at build time. As such, we need to add the control flow operator and the computational subgraph of each branch to the static graph.
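
To make the placeholder and control flow operator ideas concrete, here is a minimal sketch in plain Python. It is not the API of any framework: the dictionary-based node format and the names ``build_graph``, ``run_nodes``, and ``run_graph`` are assumptions made for illustration, and tensors are plain lists. The graph is built once without any input data, holds the subgraphs of both branches, and selects one of them only when data is fed in at run time.

.. code:: python

   # Toy static graph; every name below is hypothetical, not a framework API.

   def matmul(A, x):
       """Multiply matrix A (a list of rows) by vector x."""
       return [sum(a * v for a, v in zip(row, x)) for row in A]

   def add(x, b):
       return [u + v for u, v in zip(x, b)]

   def relu(x):
       return [max(0.0, v) for v in x]

   W1 = [[1.0, 0.0], [0.0, 1.0]]
   W2 = [[-1.0, 0.0], [0.0, -1.0]]
   b = [0.5, -0.5]

   def build_graph():
       """Build phase: no data exists yet, so both branches are recorded."""
       return [
           {"op": "placeholder", "name": "X"},
           {"op": "placeholder", "name": "flag"},
           {"op": "cond", "pred": "flag",
            "true_branch": [{"op": "matmul", "weight": W1, "in": "X", "out": "Y"}],
            "false_branch": [{"op": "matmul", "weight": W2, "in": "X", "out": "Y"}]},
           {"op": "add", "in": "Y", "bias": b, "out": "Y"},
           {"op": "relu", "in": "Y", "out": "Y"},
       ]

   def run_nodes(nodes, env):
       """Execute nodes in order, reading and writing values in env."""
       for node in nodes:
           op = node["op"]
           if op == "placeholder":
               continue  # already bound to fed data
           elif op == "cond":
               # The decision is made only now, when real data is available.
               key = "true_branch" if env[node["pred"]] > 0 else "false_branch"
               run_nodes(node[key], env)
           elif op == "matmul":
               env[node["out"]] = matmul(node["weight"], env[node["in"]])
           elif op == "add":
               env[node["out"]] = add(env[node["in"]], node["bias"])
           elif op == "relu":
               env[node["out"]] = relu(env[node["in"]])
       return env

   def run_graph(graph, feeds):
       """Execution phase: bind placeholders to data, then run the graph."""
       for node in graph:
           if node["op"] == "placeholder" and node["name"] not in feeds:
               raise KeyError("missing input for placeholder " + node["name"])
       return run_nodes(graph, dict(feeds))["Y"]

   graph = build_graph()  # built once, before any input is seen
   out_pos = run_graph(graph, {"X": [1.0, 2.0], "flag": 1})
   out_neg = run_graph(graph, {"X": [1.0, 2.0], "flag": -1})

The same ``graph`` object serves both calls: a positive ``flag`` runs the ``W1`` subgraph and a non-positive ``flag`` runs the ``W2`` subgraph.
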
When the model receives actual inputs during runtime, different branches are taken (by running the corresponding computational subgraphs) depending on the inputs. However, not all machine learning frameworks are able to compile Python control flows into their static graph equivalents. To implement control flows in this case, we can use the control primitives provided by the framework.

.. _staticgen:

.. figure:: ../img/ch03/static_gen.png
   :width: 800px

   Generating a static graph

Static computational graphs offer two distinct advantages. First, they yield better performance with less memory. When building a static graph, the machine learning framework acquires the complete model topology containing global information of the model, which facilitates the formulation of graph optimization strategies (e.g., the operator fusion strategy that fuses two or more operators into a larger one). As shown in Figure :numref:`staticgen`, the Add and ReLU operators are fused into one operator to reduce the loads and stores of intermediate results and the low-level scheduling overhead, thereby improving execution performance with a lower memory footprint. Static graphs allow for many optimization strategies at build time, which we will discuss in later sections.

Second, by converting static graphs into executable code within the machine learning framework, we can directly deploy our models on various hardware platforms to provide efficient inference services. We can also store static graphs using serialization techniques for future execution (either model training or inference), eliminating the need to rebuild the graph from the frontend source code before every execution.

Once the frontend code of the model is compiled into a static graph, the graph structure is fixed. If we introduce any optimizations to the graph, the optimized code can differ significantly from the original.
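
The operator fusion mentioned above can be sketched as a small rewriting pass over the graph. This is a toy illustration under stated assumptions: operators are represented by name strings, tensors are replaced by scalars, and ``fuse_add_relu``, ``KERNELS``, and ``run`` are hypothetical names rather than part of any framework. It shows that the optimized graph computes the same result with fewer operators, and that the original Add and ReLU nodes no longer exist as separate steps.

.. code:: python

   W, B = 2.0, -5.0  # scalar stand-ins for a weight matrix and a bias

   KERNELS = {
       "matmul": lambda v: W * v,
       "add": lambda v: v + B,
       "relu": lambda v: max(0.0, v),
       # Fused kernel: the sum is consumed immediately instead of being stored.
       "add_relu": lambda v: max(0.0, v + B),
   }

   def fuse_add_relu(ops):
       """Return a new operator list with each adjacent (add, relu) pair fused."""
       fused, i = [], 0
       while i < len(ops):
           if i + 1 < len(ops) and ops[i] == "add" and ops[i + 1] == "relu":
               fused.append("add_relu")
               i += 2
           else:
               fused.append(ops[i])
               i += 1
       return fused

   def run(ops, x):
       """Execute the operators in order on a scalar input."""
       for op in ops:
           x = KERNELS[op](x)
       return x

   original = ["matmul", "add", "relu"]
   optimized = fuse_add_relu(original)

   same_results = all(run(original, x) == run(optimized, x) for x in (1.0, 4.0))

Here ``optimized`` contains two operators instead of three, so one intermediate value and one operator launch are saved per execution.
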
The drawback is that the optimized code is not directly visible to developers, so it is sometimes impossible to locate a runtime error from the line number reported for the optimized code. Consider a simple case. Assuming that the Add and ReLU operators in the above code have been fused for optimization, if a runtime error related to the fused operator is reported, it would be hard for us to determine the exact error location (Add or ReLU). In addition, intermediate results cannot be printed in real time during model debugging and testing. To print them, we need to insert additional code into the source code and then recompile it for execution, which makes debugging less efficient. By contrast, the dynamic graph approach offers more flexibility.

Dynamic Graph
~~~~~~~~~~~~~

Figure :numref:`dynamic` shows the principle of the dynamic graph approach. A dynamic graph is defined as it runs. The frontend interpreter parses the graph code, and the machine learning framework distributes the operators in the graph to the backend for immediate (eager) execution. Adopting the user-friendly imperative programming paradigm, the dynamic graph approach allows developers to create neural network models at the frontend and is therefore favored by a vast number of deep learning researchers.

.. _dynamic:

.. figure:: ../img/ch03/eager.png
   :width: 600px

   Principle of the dynamic graph approach

Next, we reuse the pseudocode in the previous section to compare the dynamic and static graph approaches. While these two approaches differ only slightly in their frontend representations, they differ dramatically in terms of their compilation and execution mechanisms. Unlike the static graph approach, the dynamic graph approach calls the built-in operator distribution function of the machine learning framework through the Python API to distribute Python operators to the hardware backend (e.g., CPU, GPU, or NPU) for accelerated computing, which then returns the computational result to the frontend.
This process does not generate a static computational graph. Instead, the framework describes the model topology using the frontend language, schedules and executes the model based on computational dependencies, and dynamically generates a temporary graph.

Figure :numref:`dynamicgen` shows the process of generating a dynamic graph.

.. _dynamicgen:

.. figure:: ../img/ch03/eager-gen.png
   :width: 700px

   Generating a dynamic graph

Forward computation runs through the neural network in the sequence defined by the model declaration. Once the model receives input :math:`\boldsymbol{X}`, the machine learning framework starts to generate a dynamic graph by adding the input node to the graph and sending the data to the downstream node. The control flow (if any) makes a data flow decision immediately. For example, in Figure :numref:`dynamicgen`, if the conditional returns true, only the Matmul operator node with respect to tensor :math:`\boldsymbol{W1}` is added to the graph. Then, the machine learning framework inserts the Add and ReLU operator nodes based on the operator sequence and computational dependencies defined in the code. For each newly added operator node, the machine learning framework distributes and executes the operator, returns the computational result, and prepares to pass the result to the next node. When forward computation runs again, the previous dynamic graph becomes invalid and a new dynamic graph is created according to the current input and control decision. In contrast with a static graph that represents the entire model described in the frontend language, a dynamic graph is generated on the fly as the control flow and data flow evolve over time. For this reason, the machine learning framework has few opportunities to optimize the model in the dynamic graph setting.

In the static graph setting, as the model definition is entirely available, a complete forward computational graph and a complete backward computational graph can be constructed simultaneously.
However, in the dynamic graph setting, the information needed for backpropagation is gathered as the forward pass proceeds. Specifically, the machine learning framework collects information about each backward operator and tensor participating in gradient computation based on the information of each operator called in the forward pass. Once the forward pass ends, the operator and tensor information for backpropagation becomes available. With this information, the machine learning framework creates a backward computational graph and runs it on hardware to complete gradient computation and parameter update. As shown in Figure :numref:`dynamicgen`, when the Matmul operator with respect to tensor :math:`\boldsymbol{W1}` is called, the framework runs the Matmul operator to calculate the product of inputs :math:`\boldsymbol{X}` and :math:`\boldsymbol{W1}`, and then records the operator and tensor :math:`\boldsymbol{X}` that will participate in backpropagation based on the backward computation process :math:`\text{Grad}_{\boldsymbol{W1}} = \text{Grad}_{\boldsymbol{Y}} * \boldsymbol{X}`. Repeating this for every operator completes the forward pass and produces a backward computational graph.

Although the optimization techniques useful in the static graph setting do not work for dynamic graphs (because the complete network structure is unknown until the dynamic graph runs), researchers and developers can easily analyze errors and debug results during model testing and optimization. This is possible because dynamic graphs execute eagerly and return computational results immediately as each statement executes. Also, the dynamic graph approach enables flexible execution using native control flows provided by the frontend, unlike static graphs, which involve complex control flows along with programming and debugging difficulties.
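
The recording of backward information during the forward pass, described earlier in this section, can be sketched with a small "tape" in plain Python. This is only an illustration under simplifying assumptions: values are scalars, the names ``Var``, ``tape``, and ``backprop`` are hypothetical, and real frameworks record far richer metadata. Each operator executes immediately and appends its backward step to the tape, so the backward graph exists only after the forward pass has finished, and it contains only the branch that was actually taken.

.. code:: python

   class Var:
       """A scalar value together with its accumulated gradient."""
       def __init__(self, value):
           self.value = value
           self.grad = 0.0

   tape = []  # backward steps, appended in forward execution order

   def mul(w, x):
       out = Var(w.value * x.value)       # executed immediately
       def backward():
           w.grad += out.grad * x.value   # Grad_W = Grad_Y * X
           x.grad += out.grad * w.value
       tape.append(backward)
       return out

   def add(y, b):
       out = Var(y.value + b.value)
       def backward():
           y.grad += out.grad
           b.grad += out.grad
       tape.append(backward)
       return out

   def relu(y):
       out = Var(max(0.0, y.value))
       def backward():
           y.grad += out.grad * (1.0 if y.value > 0 else 0.0)
       tape.append(backward)
       return out

   def model(x, flag, w1, w2, b):
       # Native Python control flow: only the taken branch is recorded.
       y = mul(w1, x) if flag > 0 else mul(w2, x)
       return relu(add(y, b))

   def backprop(output):
       """Replay the recorded steps in reverse, then discard the tape."""
       output.grad = 1.0
       for step in reversed(tape):
           step()
       tape.clear()

   w1, w2, b, x = Var(3.0), Var(-1.0), Var(0.5), Var(2.0)
   out = model(x, 1, w1, w2, b)
   recorded_steps = len(tape)  # mul, add, relu
   backprop(out)

After ``backprop``, ``w1.grad`` holds the gradient for the branch that ran, while ``w2.grad`` stays at zero because the ``W2`` branch never entered the temporary graph.
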
Overall, the dynamic graph approach lowers the barriers to programming for beginners while also improving the iteration efficiency of algorithm development and model optimization.

Dynamic Graph vs. Static Graph
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The two approaches for implementing computational graphs have their pros and cons, as described in Table :numref:`cmp_dynamic_static`.

.. _cmp_dynamic_static:

.. table:: Static graph vs. dynamic graph

   +---------------------------------+---------------------------+----------------------------+
   | Feature                         | Static Graph              | Dynamic Graph              |
   +=================================+===========================+============================+
   | On-the-fly intermediate results | No                        | Yes                        |
   +---------------------------------+---------------------------+----------------------------+
   | Code debugging                  | Difficult                 | Easy                       |
   +---------------------------------+---------------------------+----------------------------+
   | Control flow implementation     | Specialized syntax        | Frontend syntax            |
   +---------------------------------+---------------------------+----------------------------+
   | Performance                     | Better, supporting wide   | Lower, supporting limited  |
   |                                 | optimization strategies   | graph optimizations        |
   +---------------------------------+---------------------------+----------------------------+
   | Memory footprint                | Low                       | High                       |
   +---------------------------------+---------------------------+----------------------------+
   | Direct deployment               | Yes                       | No                         |
   +---------------------------------+---------------------------+----------------------------+

Compared with the dynamic graph approach, the static graph approach is less user-friendly to developers because intermediate results are not available on the fly, code debugging is difficult, and implementing control flows is complex. However, static graphs deliver higher execution performance than dynamic graphs. Consider the following code.

.. code:: python

   def model(X1, X2):
       Y1 = matmul(X1, W1)
       Y2 = matmul(X2, W2)
       Y = Y1 + Y2
       output = relu(Y)
       return output

If the static approach is used to implement the above code, the machine learning framework creates a complete computational graph. Because tensors :math:`\boldsymbol{Y_1}` and :math:`\boldsymbol{Y_2}` are computed independently of each other, the framework can compute them in parallel to improve computational efficiency. Furthermore, the static approach allows many more optimization strategies that improve efficiency while also lowering the memory footprint, for example, fusing operators Add and ReLU to reduce the loads and stores of the intermediate variable :math:`\boldsymbol{Y}`. Conversely, if the dynamic approach is used without a manually configured parallelism strategy, the machine learning framework is unaware of the independence between operators because it lacks a complete computational graph. Consequently, the framework has to execute the operators, including Add and ReLU, in the defined order and store the intermediate variable :math:`\boldsymbol{Y}`.

To further reduce the memory footprint, the static approach determines beforehand which intermediate variables must be stored in the forward pass for backpropagation, based on the forward and backward computational graphs defined prior to execution. This is not feasible in the dynamic approach, where the backward computational graph is defined only after the forward pass is complete. As such, more intermediate variables have to be stored in the forward pass to keep backpropagation efficient, resulting in a higher memory footprint.

To choose one approach over the other, we should consider their pros and cons in addition to analyzing specific task requirements. For academic research purposes or in the model design and debugging phases, the dynamic graph approach is suggested because it allows for quick testing of experimental ideas and iterative update of the model structure.
When the model structure is fixed and the goal is to accelerate training or to deploy the model on specific hardware, the static graph approach offers higher efficiency.

Conversion Between and Combination of Dynamic and Static Graphs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Dynamic graphs are easy to debug and suitable for model design and testing, whereas static graphs improve execution efficiency and shorten model training time. Is there a way for the machine learning framework to combine the merits of both approaches? Major machine learning frameworks, such as TensorFlow, MindSpore, PyTorch, and PaddlePaddle, have added support for converting dynamic graphs to static graphs, allowing developers to program using the dynamic graph approach and letting the framework automatically convert the code to a static equivalent for execution. Table :numref:`dynamic_static_switch` lists the APIs for dynamic graph to static graph conversion provided by major frameworks.

.. _dynamic_static_switch:

.. table:: Dynamic graph to static graph conversion support of major frameworks

   +--------------+--------------------------------------------------------+
   | Framework    | Dynamic Graph to Static Graph Conversion               |
   +==============+========================================================+
   | TensorFlow   | ``@tf.function``: builds a static graph by tracing     |
   |              | operators, where AutoGraph automatically transforms a  |
   |              | control flow to the equivalent static statement.       |
   +--------------+--------------------------------------------------------+
   | MindSpore    | ``context.set_context(mode=context.PYNATIVE_MODE)``:   |
   |              | dynamic graph mode.                                    |
   |              | ``context.set_context(mode=context.GRAPH_MODE)``:      |
   |              | static graph mode.                                     |
   |              | ``@ms_function``: builds a static graph from source    |
   |              | code.                                                  |
   +--------------+--------------------------------------------------------+
   | PyTorch      | ``torch.jit.script()``: builds a static graph from     |
   |              | source code. ``torch.jit.trace()``: builds a static    |
   |              | graph by tracing operators.                            |
   +--------------+--------------------------------------------------------+
   | PaddlePaddle | ``paddle.jit.to_static()``: builds a static graph      |
   |              | from source code.                                      |
   |              | ``paddle.jit.TracedLayer.trace()``: builds a static    |
   |              | graph by tracing operators.                            |
   +--------------+--------------------------------------------------------+

These dynamic-to-static conversion methods fall into the following two categories:

- **Tracing**: A static graph is built by tracing operator scheduling in a dynamic graph.
- **Source code transformation**: The frontend code is inspected and translated into static graph code, and the static graph executor is then called automatically to run the static graph.

The **tracing** method goes through two simple phases. In the first phase, the framework generates a dynamic graph, following a workflow similar to that shown in Figure :numref:`dynamicgen`. It runs this dynamic graph and traces the data flow and operator scheduling in it to produce a static graph. Note that the dynamic graph is not destroyed; instead, it is preserved as a static graph for subsequent execution. In the second phase, when the model is called again, the framework runs the static graph for computation. The tracing technique only traces the operators scheduled when the dynamic graph is run for the first time. As a result, if the model has a data-dependent conditional, only one branch of the conditional is traced, and the traced graph is unable to take the alternate branch. Similarly, the traced graph cannot include every iteration if there is a data-dependent loop.

Unlike dynamic graph code, which is parsed and executed by the frontend interpreter, a static graph must first be created by the graph compiler of the machine learning framework before execution.
Because the graph compiler cannot directly process dynamic graph code, the source code transformation method is introduced to convert the dynamic graph code into a static graph description.

The **source code transformation** method overcomes the drawbacks of the tracing method and also consists of two phases, as shown in Figure :numref:`ast`. The first phase involves lexical and syntax analysis. Specifically, the lexical analyzer scans every character in the dynamic graph code, discards whitespace and comments, and returns a stream of tokens. Then, the syntax analyzer (parser) analyzes the token stream, checks it for syntax errors, and generates an abstract syntax tree as the output of the phase. In the second phase, the built-in translators of the machine learning framework scan and translate each part of the abstract syntax tree to map the grammatical structures from dynamic graph format into static graph format. Any control flow written in the frontend language is transformed into the corresponding static graph API in this phase, so that every branch of the control flow is included in the resulting graph. Static graph code can then be generated from the translated syntax tree.

.. _ast:

.. figure:: ../img/ch03/ast.png
   :width: 800px

   Source code transformation

To improve computational efficiency, we can transform the entire model graph for fast deployment on hardware. Alternatively, we can transform only some of the model functions into static subgraphs and embed them into the global dynamic graph as individual operators, so that these functions run in the form of static graphs at execution time. This not only improves computational efficiency but also retains flexibility for code debugging.

The code below shows a simple model, which can be built into a dynamic graph as a whole. In this example, we transform the ``add_and_relu`` module into a static subgraph.
The model runs on the input data in a predefined sequence, resulting in a temporary dynamic graph. When the ``Y = add_and_relu(Y, b)`` statement is executed, the machine learning framework automatically runs the static subgraph transformed from the module, achieving a performance gain by combining the advantages of dynamic and static graphs.

.. code:: python

   def add_and_relu(Y, b):
       # Candidate for conversion into a static subgraph.
       Y = Y + b
       Y = relu(Y)
       return Y

   def model(X, flag):
       # The rest of the model stays in dynamic graph form.
       if flag > 0:
           Y = matmul(W1, X)
       else:
           Y = matmul(W2, X)
       Y = add_and_relu(Y, b)
       return Y

Dynamic-to-static conversion is mostly used in the model deployment stage, as a workaround for the hardware constraints on dynamic graph deployment: deploying a dynamic graph requires the frontend model definition code for topology discovery in addition to the file of already-trained parameters. To remove the frontend dependency, once model training in dynamic graph mode is complete, we may convert the model into static graph format and serialize the model and parameter files, thereby expanding the list of supported hardware.
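
As a rough end-to-end illustration of converting by tracing and then serializing, the sketch below uses only the Python standard library. Everything in it is an assumption made for illustration: the ``Traced`` wrapper, the ``trace`` and ``run_static`` functions, the JSON layout, and the scalar ``WEIGHTS`` standing in for trained parameters. It records the operators executed in one eager run, saves the resulting operator list together with the parameters, and then executes the loaded graph without calling the frontend ``model`` function again.

.. code:: python

   import json

   WEIGHTS = {"W1": 2.0, "W2": -2.0, "b": 1.0}  # stand-in for trained parameters

   class Traced:
       """Wraps a scalar and logs every operator applied to it."""
       def __init__(self, value, log):
           self.value = value
           self.log = log

       def _apply(self, op, param, result):
           self.log.append([op, param])
           return Traced(result, self.log)

       def scale(self, name):
           return self._apply("scale", name, self.value * WEIGHTS[name])

       def shift(self, name):
           return self._apply("shift", name, self.value + WEIGHTS[name])

       def relu(self):
           return self._apply("relu", None, max(0.0, self.value))

   def model(x, flag):
       # A Python branch: the tracer only sees the side that actually runs.
       y = x.scale("W1") if flag > 0 else x.scale("W2")
       return y.shift("b").relu()

   def trace(fn, example_x, example_flag):
       """Run fn once eagerly and return the recorded operator list."""
       log = []
       fn(Traced(example_x, log), example_flag)
       return log

   def run_static(graph, weights, x):
       """Execute a recorded operator list; needs no frontend model code."""
       for op, param in graph:
           if op == "scale":
               x = x * weights[param]
           elif op == "shift":
               x = x + weights[param]
           elif op == "relu":
               x = max(0.0, x)
       return x

   graph = trace(model, 1.0, 1)
   blob = json.dumps({"graph": graph, "weights": WEIGHTS})  # serialized model

   loaded = json.loads(blob)
   result = run_static(loaded["graph"], loaded["weights"], 3.0)

The loaded graph contains only the ``W1`` branch, because that was the branch taken during the traced run. This is the tracing limitation discussed earlier; a source code transformation approach would instead keep both branches behind a control flow operator.
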