Ruizhe Shi

We shell time from the nuts and teach it to walk: and time returns to the shell.

Talks

(for reference only)

  • The crucial role of samplers in online direct preference optimization

    [slide]

  • Logit mixing and RLHF paper reading

    [slide]

  • Decoding-time language model alignment with multiple objectives

    [slide]

  • Unleashing the power of pre-trained language models for offline reinforcement learning

    [slide]