Direct Nash Optimization Beats Bigger Models with Better Data
Authors:
(1) Corby Rosset, Microsoft Research and Correspondence to [email protected];
(2) Ching-An Cheng, Microsoft Research;
(3) Arindam Mitra, Microsoft Research;
(4) Michael Santacroce, Microsoft Research;
(5) Ahmed Awadallah, Microsoft Research and Correspondence to [email protected];
(6) Tengyang Xie, Microsoft Research and Correspondence to [email protected].
Table of Links
Abstract and 1 Introduction
2 Preliminaries
2.1 RLHF Based on Reward Models
2.2 RLHF with General Preferences
3 Direct Nash Optimization and 3.1 Derivation of Algorithm 1
3.2 Theoretical Analysis
4 Practical Algorithm – Iterative Contrastive Self-Improvement
5 Experiments and 5.1 Experimental Setup
5.2 Results and Analysis
6 Related Work
7 Conclusion and References
Appendix
A Extension to Regularized Preferences
B Detailed Proofs
C Additional Experimental Details
5.2 Results and Analysis
We run several head-to-head experiments that control for hyperparameters and input data. We often refer to the policy being trained as the “student” and GPT-4 as a “teacher”; GPT-4 is also used as an annotator when prompted.
SFT Baselines The first baseline is Orca-2.5 itself, which is the raw mistralai/Mistral-7B-v0.1 pretrained model fine-tuned on a new collection of Orca-2 data (Mitra et al., 2023). This model was fine-tuned for three epochs and achieves the scores shown at the top of Table 4. All other experiments in this study are initialized from Epoch 1 of Orca-2.5, which corresponds to the solid horizontal line in Figure 2.
The second baseline is continued SFT of Orca-2.5 on the positive responses in UltraFeedback (masking out the loss over the input prompts). If the original positive in that dataset was not generated by GPT-4-Turbo, we replace it with one that is. This is the red line in Figure 2. It is clear that even offline contrastive training methods are more beneficial than additional SFT, showing that the difference between the positive and negative outputs provides a more valuable training signal than the positive in isolation.
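As a rough illustration of what "masking out the loss over the input prompts" means in practice, here is a minimal sketch; the tokenizer interface, field names, and the -100 ignore-label are assumptions of a typical setup, not the authors' code:

```python
# Minimal sketch of SFT loss masking: prompt tokens get label -100 so that a
# cross-entropy loss which ignores -100 (the common convention) is computed
# only over the positive completion. Names here are illustrative assumptions.

def build_sft_example(tokenizer, prompt: str, positive: str, max_len: int = 2048):
    prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    completion_ids = tokenizer(positive, add_special_tokens=False)["input_ids"]
    completion_ids = completion_ids + [tokenizer.eos_token_id]

    input_ids = (prompt_ids + completion_ids)[:max_len]
    # -100 masks the prompt portion so it contributes no loss.
    labels = ([-100] * len(prompt_ids) + completion_ids)[:max_len]
    return {"input_ids": input_ids, "labels": labels}
```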
Large Margin Filtering of Training Pairs: We ran a simple experiment with Offline DPO for one epoch on UltraFeedback data. In the control, we trained on all 63k preference pairs in the original dataset, whereas in the treatment we kept only the 42k pairs that met a large-margin requirement: the positive's score had to exceed the negative's by at least 1.0 (out of 10) according to the GPT-4-Turbo annotator. All else was equal. Even though the treatment was trained for fewer steps on less data, it achieved an AlpacaEval 2.0 win rate of 11.60 vs 9.60 for the control, showing that fewer, higher-quality preference pairs are better than a larger quantity of noisy pairs (not shown in the tables).
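A minimal sketch of this large-margin filter, assuming each preference record carries GPT-4-Turbo scores for the chosen and rejected responses (the field names are hypothetical):

```python
# Keep only preference pairs whose positive outscores the negative by a
# large margin (>= 1.0 out of 10) according to the GPT-4-Turbo annotator.
# Field names ("score_chosen", "score_rejected") are hypothetical.

MARGIN = 1.0

def filter_large_margin(pairs: list[dict], margin: float = MARGIN) -> list[dict]:
    return [p for p in pairs if p["score_chosen"] - p["score_rejected"] >= margin]

# Per the paper's numbers: ~63k pairs in, ~42k large-margin pairs out.
# filtered = filter_large_margin(ultrafeedback_pairs)
```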
On-Policy is Better than Off-Policy One of the critical questions in this study is whether to sample "on-policy" outputs from the current student to use in training pairs, or whether "off-policy" outputs collected from models other than the student will suffice. We ran 4 epochs of Offline DPO on UltraFeedback (filtered for large margin), and as shown in Table 1, the on-policy methods, especially DNO, surpass off-policy DPO, even though off-policy DPO was trained for 4 epochs while the on-policy models were granted only three iterations. Recall that each iteration of batched on-policy training sees only a third of the UltraFeedback input data, whereas an epoch of Offline DPO sees the entire dataset.
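To make the on-policy setup concrete, here is a schematic sketch of the batched on-policy loop; the function names (sample_outputs, annotate_preferences, train_contrastive) are placeholders for components described in Section 4, not the authors' implementation:

```python
# Schematic of batched on-policy training: each iteration samples fresh
# outputs from the *current* student, annotates them against fixed teacher
# outputs, and trains contrastively on the resulting pairs.
# sample_outputs / annotate_preferences / train_contrastive are placeholders.

def batched_on_policy_training(student, teacher_outputs, prompt_partitions, num_iters=3):
    for t in range(num_iters):
        prompts = prompt_partitions[t]                      # one-third of UltraFeedback per iteration
        student_samples = sample_outputs(student, prompts)  # on-policy: from the current student
        pairs = annotate_preferences(student_samples, teacher_outputs, prompts)
        student = train_contrastive(student, pairs)         # DPO-style loss on preferred/dispreferred pairs
    return student
```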
Higher Quality Annotators In our study, we use GPT-4-Turbo to provide the annotations for preference pairs. By contrast, the Self-Rewarding Language Model uses a Llama-2-70B model (Touvron et al., 2023) that is additionally trained to give feedback as the annotator; in their study, it starts with a 65% agreement rate with human-labeled preferences, improving to 80% in the last iteration (Yuan et al., 2024). While it was not reported how well GPT-4-Turbo's annotations agree with their held-out human labels, we believe that starting with a higher-quality annotator will lead to higher-quality policies. Since both studies use UltraFeedback data, and our annotation prompt is based on theirs, we believe the comparison is valid.
We observe that DNO initialized with a 7B base model outperforms the 70B-parameter Self-Rewarding model over the same number of training iterations (24.97 vs. 20.44 win rate on AlpacaEval 2.0, and 7.46 vs. 7.25 on MT-Bench), at least in part due to the higher-quality preference annotations. See the dark blue band versus the gray line in Figure 2 and the corresponding row in Table 1. However, unlike the Self-Rewarding LM, we saw a slight gain rather than a drop on reasoning benchmarks such as ARC-Challenge (Clark et al., 2018) and HellaSwag (Zellers et al., 2019). Granted, the OpenLLM evaluation predicts the answer by taking the max logit corresponding to one of the multiple-choice options, which is not congruous with how these techniques are trained.
Training Pair Construction One of the most critical implementation questions in this study is how to construct training pairs that help the student policy exceed a strong teacher like GPT-4-Turbo. One approach, Self-Play Finetuning (SPIN), removes the preference annotation step and automatically assigns the teacher output to be the positive and all student samples to be negatives (Chen et al., 2024). We find in our re-implementation of SPIN that this is detrimental, presumably because this automatic assignment could lead to noisy training pairs in cases where the student might actually be preferred. The resulting win rate of SPIN is only 16.13 after three epochs of iterative training, compared to 24.97 for DNO, as shown in Table 1, all else being equal. Similar results hold for the OpenLLM results in Table 3.
In a second experiment, which we denote DNO-Restrictive, we annotate all preference pairs with GPT-4-Turbo as usual, but only admit training pairs where the teacher's output is the preferred one. The difference between DNO and DNO-Restrictive is illustrated in Table 2, where DNO-Restrictive creates zero student-vs-teacher and student-vs-student pairs. The same is true of SPIN, except that SPIN would admit a greater quantity of noisy teacher-vs-student examples even when the teacher is dis-preferred: Table 2 shows that after Iteration 2 of DNO-Restrictive, only 9.9k instances exist of the teacher being preferred over the student, whereas SPIN would have automatically created about 100k such pairs (5 samples × 20k inputs).
While DNO-Restrictive is slightly better (19.21 win rate) than SPIN, it still does not give the student a chance to compare its behavior to that of a powerful teacher. The absence of this signal is a major oversight: the last row of Table 2 shows that by Iteration 3, over 64% of the DNO training data (32k pairs) consists of cases where the student is in fact preferred over the teacher, a fraction which increases with each iteration. We conclude it is imperative to "allow the student to become the teacher," i.e., to learn from comparisons where its own outputs are preferred over those of a more powerful teacher. The three pair-construction policies discussed above are contrasted in the sketch below.
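The sketch assumes each candidate output carries a GPT-4-Turbo score on a 1–10 scale; all names and data structures are illustrative, and the real pipeline also applies the large-margin filter described earlier:

```python
# Sketch contrasting three ways to build training pairs from one teacher output
# and several student samples for the same prompt. All names are illustrative;
# "score" is the GPT-4-Turbo annotation (1-10).

def build_pairs(teacher, students, mode: str, margin: float = 1.0):
    """teacher/students are dicts with "text" and (except for SPIN) "score"."""
    pairs = []
    if mode == "spin":
        # SPIN: teacher is always the positive, every student sample the negative,
        # with no annotation -- noisy whenever the student is actually better.
        pairs = [(teacher["text"], s["text"]) for s in students]
    elif mode == "dno_restrictive":
        # Only admit pairs where the annotator prefers the teacher by a large margin.
        pairs = [(teacher["text"], s["text"]) for s in students
                 if teacher["score"] - s["score"] >= margin]
    elif mode == "dno":
        # Full DNO: compare all candidates (teacher and students alike) and keep any
        # large-margin pair, including student-beats-teacher and student-vs-student.
        candidates = [teacher] + students
        for a in candidates:
            for b in candidates:
                if a is not b and a["score"] - b["score"] >= margin:
                    pairs.append((a["text"], b["text"]))
    return pairs
```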
One curious phenomenon in Table 2 is that while the teacher outputs are fixed ahead of time, the annotator gives slightly lower scores to the teacher as the student improves; we are not sure whether this is an innocuous artifact of preference annotation or symptomatic of a deeper problem. Also, the total quantity of new "large margin" training pairs (not counting those sampled from previous iterations) in DNO tends to decrease as the policy improves across iterations, but we do not have enough data to quantify how this relates to a change in quality.
Lookahead to Future Iterations As a curiosity, we experimented with whether a model could benefit from knowing which training pairs it would generate if it could look into the future. We tested this by running three iterations of DNO, accumulating all the preference pairs across iterations, combining and shuffling them, and then re-starting training from the initial model. In essence, this turns batched-online DNO into an offline learning algorithm, which we denote DNO-Lookahead. We trained for one epoch on the three iterations' worth of preference data. The AlpacaEval 2.0 win rate deteriorated more than we expected (24.97 to 18.18); however, even more surprisingly, the MT-Bench score improved significantly (7.48 to 7.70). While the reasons for the relatively low correlation between MT-Bench and AlpacaEval 2.0 are not entirely clear, it is important to consider the disparity in the size of the two benchmarks: MT-Bench consists of merely 80 examples, whereas AlpacaEval 2.0 contains roughly 10x more, so we regard the findings from AlpacaEval 2.0 as more statistically significant and reliable.
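A sketch of this DNO-Lookahead variant, reusing the placeholder functions from the earlier on-policy sketch (again, not the authors' implementation):

```python
# DNO-Lookahead: run the iterative pipeline only to *collect* preference pairs,
# then discard the intermediate checkpoints, shuffle all pairs together, and
# train the initial model offline for one epoch on the combined data.
# sample_outputs / annotate_preferences / train_contrastive are placeholders.
import random

def dno_lookahead(initial_student, teacher_outputs, prompt_partitions, num_iters=3):
    student, all_pairs = initial_student, []
    for t in range(num_iters):
        prompts = prompt_partitions[t]
        samples = sample_outputs(student, prompts)
        pairs = annotate_preferences(samples, teacher_outputs, prompts)
        all_pairs.extend(pairs)
        student = train_contrastive(student, pairs)  # only used to generate later iterations' data

    random.shuffle(all_pairs)
    # Restart from the initial model and train offline on the accumulated pairs.
    return train_contrastive(initial_student, all_pairs)
```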
DNO Scales with More Data: One of the reasons we split UltraFeedback into three non-overlapping partitions is to avoid overfitting. Another strategy to avoid overfitting is to collect more data, so we increased the amount of instruction data roughly tenfold, drawing on publicly available datasets. We split this larger mixture of datasets into six non-overlapping partitions of roughly 100k inputs each (and ran GPT-4-Turbo inference to obtain outputs for all inputs), and show that DNO-More-Data scales well in this expanded regime (see the purple line in Figure 2 and the last row of Table 4).
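A minimal sketch of this partitioning step, assuming a flat list of instruction prompts; the partition count follows the paper, everything else is illustrative:

```python
# Split a large instruction mixture into six non-overlapping partitions of
# roughly equal size (about 100k prompts each in the paper's setup); one
# partition is consumed per DNO iteration.
import random

def make_partitions(prompts: list[str], num_partitions: int = 6, seed: int = 0):
    rng = random.Random(seed)
    shuffled = prompts[:]
    rng.shuffle(shuffled)
    size = len(shuffled) // num_partitions
    return [shuffled[i * size:(i + 1) * size] for i in range(num_partitions)]
```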
We make some notes on the behavior of this experiment: because each iteration builds on the outputs of the previous iteration, any anomalies or errors in critical components such as preference annotation will propagate, and the only way to combat them is to "roll back" to the iteration that introduced them. This can result in wasted time and cost, both of which are already very high, as shown in Appendix C. We suspect that the "depth" of iterations matters more than the "width," or number of samples within each iteration, and furthermore, that using an equal number of inputs per iteration may not be optimal, but we did not test this thoroughly. From an efficiency standpoint, although this algorithm is "batched," some optimizations can be made, such as starting to annotate sampled policy outputs as soon as they are ready instead of waiting for all inference jobs to finish (sketched below).
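This overlap of sampling and annotation could look roughly like the following, using a thread pool as a stand-in for whatever job scheduler is actually used; sample_outputs and annotate_preferences remain placeholders:

```python
# Overlap sampling and annotation: submit annotation work as soon as each
# batch of sampled outputs is ready, so annotation of earlier batches runs
# concurrently with inference for later batches.
from concurrent.futures import ThreadPoolExecutor, as_completed

def sample_and_annotate(student, teacher_outputs, prompt_batches, max_workers=8):
    pairs = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = []
        for batch in prompt_batches:
            samples = sample_outputs(student, batch)          # inference for this batch
            futures.append(pool.submit(annotate_preferences,  # annotate immediately
                                       samples, teacher_outputs, batch))
        for fut in as_completed(futures):
            pairs.extend(fut.result())
    return pairs
```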
"Exploding" Lengths It is known that contrastive LLM training techniques, especially DPO, lead to longer outputs from the model, which is widely suspected to be a form of "reward hacking." Curiously, Table 2 shows that the largest jump comes after the first round of contrastive training (Iteration 1), where lengths explode by at least a factor of 2 over the initializing SFT model before inching back down in the next iteration. We interpret this "length spike" as wasted computation optimizing towards a spurious signal; we wish we were better equipped to control this phenomenon.