Blog

Intuition Explained: Behavioral Supervisor Tuning

Conferences limit papers to at most X pages, where X is usually 7-10. The need to communicate most of the important (theoretical and experimental) parts of my work in such a format means making sacrifices, and this often ends in me removing text that aims... [Read More]
Tags: tech, machine learning, reinforcement learning

Thoughts on DPO and Offline RL

Direct Preference Optimization (DPO) is all the rage now in LLMs, and rightly so! The derivation is neat (and very familiar to those experienced with reinforcement learning), and the method allows direct, preference-based finetuning of regression-trained LLMs without having to learn a reward model. [Read More]
Tags: tech, machine learning, reinforcement learning