Talks
(for reference only)
The crucial role of samplers in online direct preference optimization
[slide]
Logit mixing and RLHF paper reading
[slide]
Decoding-time language model alignment with multiple objectives
[slide]
Unleashing the power of pre-trained language models for offline reinforcement learning
[slide]