Knowledge-sharing seminar on Reinforcement Learning
Branislav Kveton, Principal Research Scientist at Adobe Research, gave a lecture on Reinforcement Learning with Large Language Models Through Reward-Weighted Fine-Tuning.
Lecture abstract
Reinforcement learning (RL) with large language models (LLMs) has enabled recent progress in training reasoning models. In this work, we show how to reduce offline RL with LLMs to reward-weighted supervised fine-tuning (SFT). This allows practical RL optimisation of LLM agents using just SFT, arguably the most common approach for training LLMs. Unlike offline variants of other approaches, such as PPO and GRPO, we do not need token-level rewards or reward models, and avoid propensity score ratios in the objective. We demonstrate our approach on several LLM agent optimisation problems: increasing sales, improving recommendation accuracy, and learning to reason in question-answering agents. This is joint work with Subhojyoti Mukherjee, Viet Dac Lai, Raghavendra Addanki, Ryan Rossi, Seunghyun Yoon, Trung Bui, Anup Rao, and Jayakumar Subramanian.
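The core idea of the abstract — turning offline RL into reward-weighted SFT — can be illustrated with a minimal sketch. This is not the authors' implementation: it uses a toy categorical policy instead of an LLM, and the choice to clip rewards to be nonnegative is an assumption made here so the weighted log-likelihood stays well-defined. Each logged action's log-probability is simply weighted by its reward, so no reward model, token-level rewards, or propensity ratios are needed.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def reward_weighted_sft(actions, rewards, n_actions, lr=0.5, steps=200):
    """Fit a categorical 'policy' by reward-weighted maximum likelihood:
    maximize sum_i w_i * log pi_theta(a_i), with w_i a nonnegative weight
    derived from the logged reward (clipping is an assumption, not the
    paper's prescription)."""
    theta = np.zeros(n_actions)
    w = np.clip(np.asarray(rewards, dtype=float), 0.0, None)
    for _ in range(steps):
        p = softmax(theta)
        grad = np.zeros(n_actions)
        for a, wi in zip(actions, w):
            # Gradient of wi * log softmax(theta)[a] w.r.t. theta.
            g = -p.copy()
            g[a] += 1.0
            grad += wi * g
        theta += lr * grad / len(actions)  # gradient ascent on weighted log-likelihood
    return softmax(theta)

# Offline log: action 2 received high rewards, the others low.
actions = [0, 1, 2, 2, 1]
rewards = [0.1, 0.0, 1.0, 0.9, 0.2]
probs = reward_weighted_sft(actions, rewards, n_actions=3)
```

With an LLM, the same structure applies: the per-example weight multiplies the usual SFT cross-entropy loss over the response tokens, so any standard fine-tuning pipeline can be reused.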
Photos from the lecture


