Intro
In Reinforcement Learning (RL), an agent navigates through an environment and learns to maximize a discounted sum of the scalar rewards it receives. In practice, however, no environment decides to provide a reward signal on its own; such signals are the product of human design.
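For concreteness, the objective being maximized is usually written as the expected discounted return (standard notation, not tied to any particular source in this post):

$$ J(\pi) = \mathbb{E}_{\tau \sim \pi}\Big[\sum_{t=0}^{T} \gamma^{t} \, r_t\Big], \qquad \gamma \in [0, 1), $$

where $r_t$ is the scalar reward at step $t$ and $\gamma$ discounts rewards received further in the future.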
At its simplest, a reward might be a binary signal indicating, at termination, whether the task was completed successfully. Such sparse feedback can make learning difficult, motivating reward shaping. More careful reward design can improve learning efficiency, but it demands considerably more effort to ensure the reward signal is both informative and robust to reward hacking, where the agent exploits flaws in the reward signal and produces undesirable behavior (see this for an illustrated example).
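As a toy illustration of the difference (the goal-reaching task and distance penalty here are hypothetical, not taken from any specific benchmark), a sparse terminal reward and a shaped variant might look like:

```python
import numpy as np

def sparse_reward(state: np.ndarray, goal: np.ndarray, done: bool) -> float:
    """Binary signal: +1 only if the episode terminates at the goal, else 0."""
    return 1.0 if done and np.allclose(state, goal, atol=1e-2) else 0.0

def shaped_reward(state: np.ndarray, goal: np.ndarray, done: bool) -> float:
    """Denser signal: a small per-step penalty proportional to distance from the
    goal, plus the terminal bonus. More informative, but the distance term is a
    design choice the agent may learn to exploit."""
    return -0.01 * float(np.linalg.norm(state - goal)) + sparse_reward(state, goal, done)
```

The shaped version gives the agent a learning signal at every step, at the cost of committing to a hand-designed notion of progress.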
For many tasks, then, constructing an informative reward function is challenging: without a way to explicitly communicate which behaviors are preferred, the agent can exploit the reward function. This discrepancy between human preferences and the behavior the reward function actually implies leads to misalignment between the desired and actual behavior of RL-trained policies (see Bostrom and Russell).
Reinforcement Learning from Human Feedback (RLHF) is one paradigm designed to align the behavior of RL-trained policies with human preferences; the key difference from standard RL lies in how human preferences are incorporated and iteratively used to refine the policy toward human-preferred results.
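One common way preferences are incorporated is to fit a reward model on pairwise human comparisons and then optimize the policy against it. A minimal sketch of the pairwise (Bradley-Terry-style) loss, assuming a generic `reward_model` callable and batches of `preferred`/`rejected` samples (the names are placeholders, not from this post), is:

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_model, preferred, rejected) -> torch.Tensor:
    """Push the learned reward of the human-preferred sample above the rejected one."""
    r_pref = reward_model(preferred)   # scalar reward per preferred sample
    r_rej = reward_model(rejected)     # scalar reward per rejected sample
    # -log(sigmoid(r_pref - r_rej)) is numerically softplus(r_rej - r_pref)
    return F.softplus(r_rej - r_pref).mean()
```

The trained reward model then stands in for the hand-designed reward signal when the policy is optimized.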
RLHF underpins many successful applications: first and foremost, finetuning pretrained large language models (LLMs) such as InstructGPT (this blog does a great job of describing RLHF for LLMs),