These slides are for reference purposes only.
The crucial role of samplers in online direct preference optimization
Logit mixing and RLHF paper reading
Decoding-time language model alignment with multiple objectives
Unleashing the power of pre-trained language models for offline reinforcement learning