
Teaching Old LLMs New Tricks: The Consistency Model Makeover for Speed

Authors:

(1) Siqi Kou, Shanghai Jiao Tong University (equal contribution);

(2) Lanxiang Hu, University of California, San Diego (equal contribution);

(3) Zhezhi He, Shanghai Jiao Tong University;

(4) Zhijie Deng, Shanghai Jiao Tong University;

(5) Hao Zhang, University of California, San Diego.

Abstract and 1 Introduction

2. Related Work

3. Methodology and 3.1. Preliminary: Jacobi Decoding

3.2. Consistency Large Language Models (CLLMs)

3.3. Acceleration Mechanisms in CLLMs

4. Experiments

4.1. Evaluations

4.2. Acceleration Mechanisms in CLLMs

4.3. Ablation Studies

4.4. Limitations and Discussion

5. Conclusion, Impact Statement, and References

A. Illustration of Consistency Loss Learning Objectives

B. Comparison with Baseline Algorithms

C. Pseudo Code for Jacobi Decoding with KV Cache

3. Methodology

This section begins with a review of the Jacobi decoding method (Santilli et al., 2023) for accelerating LLM inference, and then elaborates on CLLMs, a refinement of pre-trained LLMs that enjoys a higher speedup from Jacobi decoding. In this paper, we consider only greedy sampling and leave other sampling strategies to future work. We also empirically identify the fast-forwarding phenomenon and the emergence of stationary tokens in CLLMs, which serve as the source of such acceleration.

3.1. Preliminary: Jacobi Decoding

Given a prompt x and a pre-trained LLM p(·|x), we typically obtain the model response with the standard AR decoding method under the greedy strategy, i.e.,

y_i = argmax_y p(y | y_{<i}, x), for i = 1, 2, . . . ,

generating one token per forward pass, each conditioned on all previously generated tokens.
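Jacobi decoding (Santilli et al., 2023) instead drafts an n-token block and refines every position in parallel until the block stops changing. A sketch of the underlying fixed-point update, written in the notation above (the iteration index j is our own shorthand, not taken from the paper):

```latex
% One Jacobi iteration over the n-token state y^{(j)} = (y_1^{(j)}, ..., y_n^{(j)});
% iterating until y^{(j+1)} = y^{(j)} yields the fixed point y*, which matches AR decoding.
y_i^{(j+1)} = \arg\max_{y}\; p\big(y \,\big|\, y^{(j)}_{<i}, x\big), \qquad i = 1, \dots, n.
```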

Figure 2. Comparison of Jacobi trajectories between a target LLM and a CLLM on Spider. Each point along the Jacobi trajectory is a color-coded sequence: blue for correct tokens matching the AR results, and red for inaccurate ones. The CLLM demonstrates enhanced efficiency, converging to the fixed point 2× faster than the target LLM. This increased efficiency in the CLLM can be attributed to the consistency loss, which facilitates the learning of the structure of each n-token sequence given a prefix.

Jacobi Decoding with KV Cache. The sequential nature of LLMs ensures that each token's generation depends only on preceding tokens. As decoding proceeds, an increasing prefix of tokens becomes fixed, i.e., correctly aligned with the AR generations. Thanks to the KV cache technique, we no longer need to iteratively update these tokens or recompute their keys and values for the attention computation in subsequent iterations. We therefore 1) progressively reduce the length of the iteration state by at least one token per iteration and 2) save the KV cache of fixed tokens along the decoding procedure. We elaborate on this in Algorithm 3.
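A minimal sketch of this procedure, assuming a hypothetical `step` function that returns the greedy argmax prediction for each block position and an opaque `cache` object holding keys/values of already-fixed tokens (both are illustrative stand-ins, not the paper's Algorithm 3):

```python
from typing import Callable, List, Tuple

# step(fixed_ids, block_ids, cache) -> (greedy_ids, new_cache): a hypothetical model
# wrapper where greedy_ids[i] = argmax p(. | fixed_ids + block_ids[:i], x) and cache
# holds the keys/values of fixed_ids so they are never recomputed.
Step = Callable[[List[int], List[int], object], Tuple[List[int], object]]

def jacobi_decode_with_kv_cache(step: Step, prompt_ids: List[int],
                                init_block: List[int]) -> List[int]:
    """Refine an n-token guess until the whole block is fixed, shrinking the
    iteration state as a growing prefix of tokens becomes aligned with AR decoding."""
    fixed, block, cache = list(prompt_ids), list(init_block), None
    while block:
        new_block, cache = step(fixed, block, cache)
        # Position 0 conditions only on fixed tokens, so it always matches AR decoding;
        # position i additionally matches if every guess before it was already correct.
        k = 1
        while k < len(block) and block[k - 1] == new_block[k - 1]:
            k += 1
        fixed.extend(new_block[:k])   # 1) at least one token becomes fixed per pass
        block = new_block[k:]         # 2) its keys/values can now be served from the cache
    return fixed
```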

3.2. Consistency Large Language Models (CLLMs)

Despite the promise, the speedup effect of Jacobi decoding for vanilla LLMs is minimal in practice (Santilli et al., 2023; Fu et al., 2024). The reason is that AR-trained LLMs can usually generate only one correct token in each Jacobi iteration as such models can rarely yield a correct token when there are incorrect preceding tokens. To address this, we propose to adapt pre-trained LLMs to consistently map any point y on the Jacobi trajectory J to the fixed point y∗. Surprisingly, such an objective is analogous to that of consistency models (Song et al., 2023; Song & Dhariwal, 2023), a leading acceleration approach for diffusion models (Ho et al., 2020; Song et al., 2021b).

This section first delineates our data preparation procedure for tuning CLLMs and then elaborates on their training procedure. Lastly, we discuss possible sources of CLLMs' acceleration.

3.2.1. JACOBI TRAJECTORY COLLECTION

Let p denote the target LLM we aim to adapt. Let qθ(·|x) denote the CLLM with parameters θ initialized with those of p. To realize the aforementioned adaptation, we collect a set of Jacobi trajectories by running the Jacobi decoding algorithm with the target LLM p on prompts from a certain domain of interest, forming an original training set D. We summarize the algorithm for dataset generation in Algorithm 1. Note that to generate a lengthy response l of N (N ≫ n) tokens, we can sequentially perform Jacobi decoding for every truncation of n tokens to avoid slow model evaluation on lengthy input. Consequently, l amounts to the concatenation of a set of consecutive fixed points.
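A minimal sketch of such a collection loop, again assuming the hypothetical `step` interface from the earlier sketch (this is not the paper's Algorithm 1, and random initialization of y^(0) is only one simple choice):

```python
import random
from typing import List

def collect_jacobi_trajectory(step, prompt_ids: List[int], n: int,
                              vocab_size: int) -> List[List[int]]:
    """Run Jacobi decoding with the target LLM p on one n-token block and record
    every intermediate state y^(0), ..., y* as one training trajectory J."""
    block = [random.randrange(vocab_size) for _ in range(n)]  # y^(0): random initial guess
    trajectory = [list(block)]
    cache = None
    while True:
        new_block, cache = step(prompt_ids, block, cache)
        trajectory.append(list(new_block))
        if new_block == block:      # fixed point y* reached
            return trajectory
        block = new_block
```

For a longer response of N ≫ n tokens, this loop would be repeated over consecutive n-token truncations, appending each fixed point to the prefix before decoding the next block.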

Data augmentation. In a typical Jacobi iteration process, the correct tokens often appear one after another, and n-token sequences usually exhibit a “correct, correct, wrong, wrong, wrong” pattern. In comparison, patterns like “correct, correct, wrong, correct, wrong” can be rare. To enhance the learning and generalization capabilities of CLLMs, we augment the dataset D by randomly correcting erroneously predicted tokens within the samples.
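One way such an augmentation could look (a sketch assuming each intermediate state is stored alongside its fixed point y*; the correction probability is an illustrative hyperparameter, not the paper's):

```python
import random
from typing import List

def augment_state(state: List[int], fixed_point: List[int],
                  correct_prob: float = 0.5) -> List[int]:
    """Randomly replace erroneous tokens in an intermediate Jacobi state with their
    correct values from the fixed point, producing rarer error patterns such as
    'correct, correct, wrong, correct, wrong'."""
    return [
        y_star if y != y_star and random.random() < correct_prob else y
        for y, y_star in zip(state, fixed_point)
    ]
```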

Data post-processing. Since the target LLM itself can make errors on some prompts, the resulting Jacobi trajectories often contain low-quality generations. We find that training a CLLM on n-token sequences with token-level (Holtzman et al., 2019) or sentence-level repetitions (Polišenská et al., 2015) often results in repetitive content generation and noticeably degrades performance. Recognizing the significance of high-quality datasets for training LLMs (Zhou et al., 2023a), we apply a rule-based detector in post-processing to eliminate low-quality samples from our training dataset D.
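The paper does not spell out the detector's rules here; one plausible rule-based filter could look like the following (the thresholds are illustrative, not the paper's):

```python
from typing import List

def has_token_level_repetition(ids: List[int], max_run: int = 8) -> bool:
    """Flag samples where the same token repeats more than max_run times in a row."""
    run = 1
    for prev, cur in zip(ids, ids[1:]):
        run = run + 1 if cur == prev else 1
        if run > max_run:
            return True
    return False

def has_sentence_level_repetition(text: str, min_repeats: int = 3) -> bool:
    """Flag samples where an identical sentence appears min_repeats or more times."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    return any(sentences.count(s) >= min_repeats for s in set(sentences))

def keep_sample(ids: List[int], text: str) -> bool:
    return not (has_token_level_repetition(ids) or has_sentence_level_repetition(text))
```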

3.2.2. TRAINING

We jointly optimize two losses for tuning CLLMs, one guaranteeing the prediction of multiple tokens at once and the other preventing the CLLM from deviating from the target LLM so as to maintain generation quality.

Consistency Loss. For a prompt x with Jacobi trajectory J, let y and y* denote a random state on the trajectory and the fixed point, respectively. We can directly push the CLLM to output y* with y as the input by minimizing the following loss:

L_GC = E_{(x,J)∼D, y∼J} [ Σ_{i=1}^{n} D( q_{θ−}(·|y*_{<i}, x) || q_θ(·|y_{<i}, x) ) ],

where θ− = stopgrad(θ) and we abuse notation to represent uniform sampling from the dataset. D(·||·) denotes the distance between two distributions, with forward KL, reverse KL, and their mixture (i.e., the Jensen-Shannon divergence) as popular examples (Agarwal et al., 2023). We primarily experiment with the forward KL.
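A minimal PyTorch sketch of how the forward-KL version of this loss could be computed for one trajectory state; the tensor shapes and function name are our own, and the paper's implementation may differ:

```python
import torch
import torch.nn.functional as F

def global_consistency_loss(student_logits: torch.Tensor,
                            teacher_logits: torch.Tensor) -> torch.Tensor:
    """Forward-KL consistency loss sketch.
    student_logits: q_theta logits at each of the n block positions, conditioned on a
        random Jacobi state y (shape [n, vocab]).
    teacher_logits: q_{theta^-} logits at the same positions, conditioned on the fixed
        point y*; detached to implement theta^- = stopgrad(theta)."""
    teacher_probs = F.softmax(teacher_logits.detach(), dim=-1)
    student_log_probs = F.log_softmax(student_logits, dim=-1)
    # Per-position D_KL(q_{theta^-}(.|y*_{<i}, x) || q_theta(.|y_{<i}, x)), summed over i.
    kl = F.kl_div(student_log_probs, teacher_probs, reduction="none").sum(dim=-1)
    return kl.sum()
```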

AR Loss. To prevent the CLLM from deviating from the distribution of the target LLM, we also incorporate the conventional AR loss based on the generation l produced by the target LLM p:

L_AR = E_{(x,l)∼D} [ − Σ_{i} log q_θ(l_i | l_{<i}, x) ].

This term contributes to maintaining generation quality substantially (see Table 6).

Consequently, the total loss for training a CLLM is a weighted combination of the two terms:

L(θ) = L_GC + w · L_AR,

where w balances consistency training against generation quality.

The training procedure is detailed in Algorithm 2.
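A sketch of one training step combining the two losses, reusing global_consistency_loss from the sketch above; the model(prompt_ids, block_ids) interface and helper names are assumptions, not the paper's Algorithm 2:

```python
import random
import torch
import torch.nn.functional as F

def training_step(model, optimizer, sample, w: float = 1.0) -> float:
    """One CLLM update on a single collected sample: consistency loss on a random
    Jacobi state plus AR loss on the target model's generation l.
    Assumed interface: model(prompt_ids, block_ids) returns logits of shape
    [len(block_ids), vocab], where logits[i] predicts block_ids[i] given the prompt
    and block_ids[:i]; sample carries prompt, trajectory, fixed_point, and l."""
    state = random.choice(sample.trajectory)              # y ~ J
    student_logits = model(sample.prompt, state)          # q_theta(.|y_{<i}, x)
    with torch.no_grad():                                 # theta^- = stopgrad(theta)
        teacher_logits = model(sample.prompt, sample.fixed_point)
    loss_gc = global_consistency_loss(student_logits, teacher_logits)

    # AR loss: next-token cross-entropy on the target LLM's generation l.
    ar_logits = model(sample.prompt, sample.l)            # logits[i] predicts l_i
    loss_ar = F.cross_entropy(ar_logits, torch.as_tensor(sample.l))

    loss = loss_gc + w * loss_ar
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```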

3.3. Acceleration Mechanisms in CLLMs

Next, we compare the Jacobi trajectories of the target LLM and the CLLM in Figure 2 to gain an in-depth understanding of the acceleration mechanisms in CLLMs.

As shown on the left side of Figure 2, target LLMs typically generate only one correct token per iteration. In contrast, we identify a fast-forwarding phenomenon in CLLMs, where multiple consecutive tokens are correctly predicted in a single forward pass. The average fast-forward count per forward pass in CLLMs ranges from 2 to 6 tokens, as evaluated in Table 3. Moreover, in target LLMs, tokens that are correctly generated in advance (e.g., “country” and “H” at points 5 and 6 on the left side of Figure 2) are often replaced inaccurately in subsequent iterations. Unlike pre-trained models, CLLMs exhibit the capability of predicting correct tokens preemptively, even with incorrect preceding tokens, while ensuring those tokens remain unchanged. We term such tokens stationary tokens; their existence allows simultaneous extension of discontinuous correct tokens within the n-token sequence. Both phenomena contribute to the fast convergence of Jacobi decoding in CLLMs, thereby leading to a considerable generation speedup.
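The fast-forward count could be measured directly from a recorded Jacobi trajectory; the bookkeeping below is our own, based on the definition above (stationary tokens could be counted analogously by tracking positions beyond the correct prefix that already hold their final value):

```python
from typing import List

def fast_forward_counts(trajectory: List[List[int]]) -> List[int]:
    """Per Jacobi iteration, count how many new leading tokens become aligned with the
    fixed point y* (the last state), i.e. how many tokens were fast-forwarded in that
    single forward pass."""
    fixed_point = trajectory[-1]

    def correct_prefix_len(state: List[int]) -> int:
        k = 0
        while k < len(state) and state[k] == fixed_point[k]:
            k += 1
        return k

    prefix_lens = [correct_prefix_len(s) for s in trajectory]
    return [max(b - a, 0) for a, b in zip(prefix_lens[:-1], prefix_lens[1:])]
```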

We observe that CLLMs acquire a crucial linguistic concept through training – collocations: a series of words or terms that co-occur more frequently than one would expect by random chance (Smadja, 1991). Language is not solely composed of isolated words but also relies heavily on specific word pairings. Examples of collocations are abundant in both natural and coding languages. They include verb + preposition combinations (e.g., “talk to”, “remind … of …”), verb + noun structures (e.g., “make a decision”, “catch a cold”), and many more domain-specific syntactical structures (e.g., “SELECT … FROM …”, “if … else” for programming). The consistency generation objective allows CLLMs to infer such structures from any point in the Jacobi trajectory, encouraging CLLMs to acquire proficiency in numerous collocations and thereby predict multiple words simultaneously to minimize iteration steps.

Notably, lookahead decoding (Fu et al., 2024) collects n-grams generated in previous Jacobi iterations as candidate tokens and verifies them in the next iteration to accelerate decoding. CLLMs can also be combined with lookahead decoding to achieve extra speedup (see Table 1 and Table 2), because the collocations learned by CLLMs improve the quality of the n-grams and thus increase the acceptance rate.
