In reinforcement learning from human feedback, it is common to optimize against a reward model trained to predict human preferences. Because the reward model is an imperfect proxy, over-optimizing its value can hinder ground-truth performance, in accordance with Goodhart's law. This effect has been frequently observed but not carefully measured, owing to the expense of collecting human preference data. In this work, we use a synthetic setup in which a fixed "gold-standard" reward model plays the role of the human, providing the labels used to train a proxy reward model. We study how the gold reward model score changes as we optimize against the proxy reward model using either reinforcement learning or best-of-n sampling. We find that this relationship follows a different functional form depending on the optimization method, and that in both cases its coefficients scale smoothly with the number of reward model parameters. We also study how this relationship is affected by the size of the reward model dataset, the number of reward model and policy parameters, and the coefficient of the KL penalty added to the reward in the reinforcement learning setup. We explore the implications of these empirical results for theoretical considerations in AI alignment.
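To make the best-of-n setup concrete, here is a minimal sketch of best-of-n sampling against a proxy reward model. The policy and reward functions below (`toy_policy`, `toy_proxy_reward`) are illustrative stand-ins, not the models used in the paper; the point is only the selection procedure: sample n candidates from the policy and keep the one the proxy scores highest.

```python
import random
from typing import Callable

def best_of_n(
    sample_completion: Callable[[str], str],
    proxy_reward: Callable[[str, str], float],
    prompt: str,
    n: int,
) -> str:
    """Draw n candidate completions from the policy and return the one
    that the proxy reward model scores highest (best-of-n sampling)."""
    candidates = [sample_completion(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: proxy_reward(prompt, c))

# Toy stand-ins for the policy and the proxy reward model (illustrative only).
VOCAB = ["helpful", "harmless", "verbose", "terse", "evasive"]

def toy_policy(prompt: str) -> str:
    # A "policy" that just emits random words, standing in for an LM sampler.
    return " ".join(random.choices(VOCAB, k=5))

def toy_proxy_reward(prompt: str, completion: str) -> float:
    # A crude learned-proxy stand-in: rewards mentions of "helpful" with a
    # mild length penalty; a gold reward model may disagree with it.
    return completion.count("helpful") - 0.1 * len(completion.split())

if __name__ == "__main__":
    print(best_of_n(toy_policy, toy_proxy_reward, "How do I stay safe online?", n=16))
```

As n grows, this procedure optimizes harder against the proxy, which is exactly the regime where the proxy score and the gold score can diverge, the overoptimization effect the paper measures.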