Advanced Efficient Techniques
=============================

In addition to standard model compression methods, several advanced techniques have been developed to accelerate decoding in large models. Some of these draft tokens with a smaller model and let the large model emit multiple tokens in a single step, speeding up decoding. Others exploit the memory hierarchy for high-throughput computation, reducing memory I/O to improve efficiency.

Speculative Decoding
--------------------

Speculative decoding is a strategy for speeding up the decoding process, based on two insights from Leviathan et al. [@leviathan2023fast]:

1. Complex modeling tasks frequently contain simpler subtasks that can be approximated well by more efficient models.

2. By combining speculative execution with a novel sampling scheme, it is possible to accelerate exact decoding from the larger model, by running it in parallel on the outputs of the approximation model.

Figure :numref:`ch-deploy/sd` gives a brief overview of speculative decoding. A draft model (a smaller, less complex model) first generates a sequence of tokens. These draft tokens are then verified in parallel by the target model (the larger model). The tokens that appear in the final output are those draft tokens accepted by the target model. If a rejection occurs, one more token is resampled from an adjusted distribution. If no rejection occurs, the target model generates one extra token using the draft tokens as context.
*Figure: Speculative Decoding Overview*
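The draft-then-verify procedure described above can be sketched in Python. This is a minimal illustration, not the paper's implementation; the names ``speculative_step``, ``draft_model``, and ``target_model`` are introduced here, and the two models are assumed to be callables returning a next-token probability distribution over the vocabulary.

```python
import random

def speculative_step(prefix, draft_model, target_model, gamma, rng=random):
    """One speculative-decoding step (illustrative sketch).

    draft_model(seq) and target_model(seq) are hypothetical callables that
    return a probability distribution over the vocabulary for the next token.
    """
    # 1. The draft model proposes gamma tokens autoregressively.
    drafts, q_dists = [], []
    seq = list(prefix)
    for _ in range(gamma):
        q = draft_model(seq)
        x = rng.choices(range(len(q)), weights=q)[0]
        drafts.append(x)
        q_dists.append(q)
        seq.append(x)

    # 2. The target model scores all gamma+1 positions
    #    (done in a single parallel forward pass in practice).
    p_dists = [target_model(list(prefix) + drafts[:i]) for i in range(gamma + 1)]

    # 3. Verify drafts left to right; on the first rejection,
    #    resample from the adjusted distribution and stop.
    out = []
    for i, x in enumerate(drafts):
        p, q = p_dists[i], q_dists[i]
        if q[x] <= p[x] or rng.random() < p[x] / q[x]:
            out.append(x)  # token accepted
            continue
        # rejected: resample from norm(max(0, p - q))
        residual = [max(0.0, pj - qj) for pj, qj in zip(p, q)]
        s = sum(residual)
        out.append(rng.choices(range(len(p)),
                               weights=[r / s for r in residual])[0])
        return out

    # 4. No rejection: the target model emits one extra token.
    out.append(rng.choices(range(len(p_dists[gamma])),
                           weights=p_dists[gamma])[0])
    return out
```

Note that each step therefore yields between 1 and :math:`\gamma + 1` output tokens for a single (parallel) call to the target model.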
To elaborate, the process begins with the draft model generating a sequence of :math:`\gamma` tokens, denoted :math:`x_1, x_2, \ldots, x_{\gamma}`. It also saves the distributions :math:`q_{1}(x), q_{2}(x), \ldots, q_{\gamma}(x)` of these tokens for later verification by the target model. The :math:`\gamma` tokens are then fed into the target model in parallel to compute the distributions :math:`p_{1}(x), p_{2}(x), \ldots, p_{\gamma+1}(x)`, derived from :math:`M_{\text{target}}(\text{prefix} + [x_1, \ldots, x_{\gamma}])`. If the condition :math:`q(x) \leq p(x)` is met, the token is retained. Otherwise, the token is rejected with probability :math:`1 - \frac{p(x)}{q(x)}`, in which case a replacement is resampled from an adjusted distribution:

.. math:: p'(x) = \text{norm}(\max(0, p(x) - q(x)))
   :eqlabel:`equ:sd_adjusted`

In the paper [@leviathan2023fast], Leviathan et al. prove that resampling from this adjusted distribution preserves the target model's exact output distribution. Assume that the execution time of a single step of the target model is denoted :math:`T`, and that of the draft model :math:`cT`, where :math:`0 < c < 1`.
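The per-token acceptance test and the adjusted resampling distribution can be illustrated with a short Python sketch over a toy two-token vocabulary. The helper names ``adjusted_dist`` and ``verify_token`` are introduced here for illustration only.

```python
import random

def adjusted_dist(p, q):
    """Adjusted distribution p'(x) = norm(max(0, p(x) - q(x)))."""
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(residual)
    return [r / total for r in residual]

def verify_token(x, p, q, rng=random):
    """Accept draft token x if q(x) <= p(x); otherwise keep it with
    probability p(x)/q(x), else resample from the adjusted distribution."""
    if q[x] <= p[x] or rng.random() < p[x] / q[x]:
        return x
    return rng.choices(range(len(p)), weights=adjusted_dist(p, q))[0]
```

The adjusted distribution keeps probability mass only where the target model assigns more probability than the draft model, which is what makes the overall sampling procedure exact.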