Extending Direct Nash Optimization for Regularized Preferences

Authors:

(1) Corby Rosset, Microsoft Research (corresponding author);

(2) Ching-An Cheng, Microsoft Research;

(3) Arindam Mitra, Microsoft Research;

(4) Michael Santacroce, Microsoft Research;

(5) Ahmed Awadallah, Microsoft Research (corresponding author);

(6) Tengyang Xie, Microsoft Research (corresponding author).

Abstract and 1 Introduction

2 Preliminaries

2.1 RLHF Based on Reward Models

2.2 RLHF with General Preferences

3 Direct Nash Optimization and 3.1 Derivation of Algorithm 1

3.2 Theoretical Analysis

4 Practical Algorithm – Iterative Contrastive Self-Improvement

5 Experiments and 5.1 Experimental Setup

5.2 Results and Analysis

6 Related Work

7 Conclusion and References

Appendix

A Extension to Regularized Preferences

B Detailed Proofs

C Additional Experimental Details

A Extension to Regularized Preferences

In this section, we discuss how to extend the DNO framework to the case of regularized preferences (defined in Eq. (5)), which were first introduced by Munos et al. (2023) and solved there via the Nash-MD algorithm discussed earlier.

This paper is available on arXiv under the CC BY 4.0 DEED license.
