Frontiers of Offline Interactive Machine Learning: From Contextual Ban…
| Category | Applied Mathematics |
|---|---|
| Date | 2025-10-29 16:00 ~ 17:00 |
| Speaker | 전광성 (University of Arizona) |
| Notes | |
| Host Professor | 홍영준 |
At the heart of modern machine learning lies a fundamental challenge: how can an intelligent system not just learn from data but also decide which data to collect for learning? This is the essence of interactive machine learning (IML) -- a paradigm that encompasses reinforcement learning, contextual bandits, and active learning. Recently, the offline version of IML has gained popularity because the standard online version often cannot be run due to real-world constraints. In this talk, I will present two recent advances in offline IML. First, I will discuss the contextual bandit problem, which has applications in recommendation systems. I will show how an improved confidence bound for [0,∞)-valued random variables translates into a superior learning algorithm, both in theory and in practice. Second, I will show that the LLM alignment problem is an instance of offline IML and that existing training objectives for it lack theoretical justification, leaving us wondering whether they are the right ones to use. To address this, I will present a novel theoretical framework for alignment from which three different alignment algorithms are derived along with theoretical guarantees, which is a strong form of justification. Surprisingly, two of them are very similar to existing algorithms called Direct Preference Optimization (DPO) and reinforcement learning from human feedback (RLHF), respectively. Together with our theoretical guarantees, our work can be seen as providing theoretical justification for DPO and RLHF, with minor corrections. Furthermore, our theory confirms the existing empirical finding that RLHF performs better than DPO. I will conclude with empirical results and exciting future research directions.
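
As background for the first part, a confidence bound typically enters an offline bandit method through a pessimism-style selection rule. The sketch below uses generic placeholders (the value estimate \hat{V}, the width U, and the failure probability δ) and does not reproduce the talk's specific improved bound for [0,∞)-valued rewards.

```latex
% Generic confidence-bound-based policy selection from logged (offline) data.
% \hat{V}(\pi): estimate of the value of policy \pi from the logged dataset.
% U(\pi): a high-probability confidence width; the improved bound for
% [0,\infty)-valued rewards discussed in the talk would be substituted here.
\hat{\pi} \in \operatorname*{arg\,max}_{\pi \in \Pi}
  \Big[\, \hat{V}(\pi) - U(\pi) \,\Big],
\qquad
\Pr\!\Big( \big|\hat{V}(\pi) - V(\pi)\big| \le U(\pi) \ \ \forall \pi \in \Pi \Big)
  \ge 1 - \delta .
```

Under this template, a tighter width U(π) narrows the gap between the selected policy and the best policy in Π, which is the generic mechanism by which a sharper confidence bound yields a better algorithm.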

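For the second part, the two existing objectives the abstract refers to are, in their widely used published forms, the KL-regularized RLHF objective and the DPO preference loss. They are shown below for reference only; the talk's own framework, derivations, and corrections are not reproduced here.

```latex
% Standard KL-regularized RLHF objective: maximize reward under a learned
% reward model r while staying close to a reference policy \pi_{\mathrm{ref}}.
\max_{\pi_\theta}\;
\mathbb{E}_{x \sim \mathcal{D}}\Big[
  \mathbb{E}_{y \sim \pi_\theta(\cdot\mid x)}\big[ r(x,y) \big]
  - \beta\, \mathrm{KL}\!\big( \pi_\theta(\cdot\mid x) \,\big\|\, \pi_{\mathrm{ref}}(\cdot\mid x) \big)
\Big]

% Standard DPO loss on preference pairs (y_w preferred over y_l), which
% optimizes the same regularized objective without an explicit reward model.
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) =
- \mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
\left[ \log \sigma\!\left(
  \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
  - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
\right) \right]
```

Here σ is the logistic function, β controls the strength of the regularization toward the reference policy π_ref, and (y_w, y_l) are the preferred and dispreferred responses in a preference pair.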