Specification Gaming is the behaviour of satisfying the specification of an objective without actually completing the intended task
Cheating or finding shortcuts is a good and simple example of this in real life
Why does this happen?
- Misspecification - This isn’t a flaw in the RL algorithm. Rather, it’s because the objective itself is not specified correctly
- Poorly designed reward shaping - Shaping rewards can change the optimal policy, sometimes for the worse. This can happen when the shaping term is not potential-based; potential-based shaping leaves the optimal policy unchanged
- Incorrect assumptions - A common incorrect assumption is that the kinds of bugs agents exploit in simulation cannot be exploited in the real world. This is especially a problem when designers assume the task specification itself is immune to tampering → Reward tampering problem
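The reward shaping point can be illustrated with a small sketch. Everything here is an assumption made for illustration and is not from the notes: a 5-state chain with a terminal goal, a hypothetical "checkpoint" bonus, and a hypothetical potential function `potential`. The sketch runs value iteration on the true reward alone, on the true reward plus the naive bonus, and on the true reward plus a potential-based term of the form F(s, s') = γΦ(s') − Φ(s), then compares the greedy policies.

```python
import numpy as np

N_STATES = 5          # states 0..4 on a line; state 4 is the terminal goal
GOAL = N_STATES - 1
GAMMA = 0.9
ACTIONS = (-1, +1)    # index 0 = move left, index 1 = move right


def step(state, action):
    """Deterministic transition, clipped at the left wall and at the goal."""
    return min(max(state + action, 0), GOAL)


def true_reward(state, next_state):
    """Intended task: reach the goal."""
    return 1.0 if next_state == GOAL else 0.0


def checkpoint_bonus(state, next_state):
    """Naive shaping (not potential-based): bonus on every entry into state 2."""
    return 0.5 if next_state == 2 else 0.0


def potential(state):
    """Hypothetical potential: progress toward the goal, zero at the terminal state."""
    return 0.0 if state == GOAL else state / GOAL


def potential_shaping(state, next_state):
    """Potential-based shaping term F(s, s') = gamma * phi(s') - phi(s)."""
    return GAMMA * potential(next_state) - potential(state)


def greedy_policy(shaping=None, iterations=500):
    """Value iteration on true reward plus optional shaping; returns greedy action indices."""
    q = np.zeros((N_STATES, len(ACTIONS)))
    for _ in range(iterations):
        v = q.max(axis=1)
        v[GOAL] = 0.0  # the terminal state has no future value
        for s in range(GOAL):
            for a_idx, a in enumerate(ACTIONS):
                s_next = step(s, a)
                r = true_reward(s, s_next)
                if shaping is not None:
                    r += shaping(s, s_next)
                q[s, a_idx] = r + GAMMA * v[s_next]
    return q[:GOAL].argmax(axis=1)


baseline = greedy_policy()
with_checkpoint = greedy_policy(checkpoint_bonus)
with_potential = greedy_policy(potential_shaping)
print("true reward only:       ", baseline)
print("naive checkpoint bonus: ", with_checkpoint)
print("potential-based shaping:", with_potential)
```

Under these assumptions, the checkpoint bonus makes it more profitable to keep re-entering state 2 than to finish, so the greedy action in state 3 turns away from the goal. The potential-based term only shifts each state's action values by a constant, so the greedy policy matches the unshaped one.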
Note
Specification gaming, while usually thought of as unintended or bad, can actually be a good sign in some cases. Such behaviour may show that the system is learning novel and innovative methods to achieve the objective. After all, there can be multiple solutions to a given problem
Task Specification - Reward function design, training environments, auxiliary rewards etc. More of a catch-all term
Correctness of task specification ⇔ Intended outcome
Challenges summarized
- How do we avoid reward tampering?
- How do we faithfully capture human concepts?
- How do we avoid making incorrect assumptions?
Important Takeaways
- Always try to properly specify the intent
- A combination of RL Algorithm Design and Reward Design is crucial
- Don’t try to cover every possible case while specifying; instead, learn the reward function from human feedback
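As a rough sketch of the last takeaway, a reward function can be fitted from pairwise human preferences instead of being written by hand. The example below assumes a linear reward over hypothetical trajectory features and a Bradley-Terry style preference model (a common choice, not something the notes prescribe). The "human" is simulated by hidden weights `hidden_w`, which are made up for illustration, as are all function names.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def fit_reward_from_preferences(features_a, features_b, prefers_a, lr=0.5, steps=2000):
    """Fit linear reward weights w so that sigmoid(w . (f_a - f_b)) matches the labels.

    features_a, features_b: (n, d) feature vectors of the two compared trajectories.
    prefers_a: (n,) array with 1.0 where the human preferred trajectory A, else 0.0.
    """
    diff = features_a - features_b
    w = np.zeros(diff.shape[1])
    for _ in range(steps):
        p_a = sigmoid(diff @ w)                       # predicted P(A preferred)
        grad = diff.T @ (prefers_a - p_a) / len(diff) # gradient of mean log-likelihood
        w += lr * grad                                # gradient ascent step
    return w


# Simulated feedback: hidden_w stands in for the human's unstated intent (assumption).
rng = np.random.default_rng(0)
hidden_w = np.array([2.0, -1.0, 0.5])
feats_a = rng.normal(size=(500, 3))
feats_b = rng.normal(size=(500, 3))
labels = (rng.random(500) < sigmoid((feats_a - feats_b) @ hidden_w)).astype(float)

learned_w = fit_reward_from_preferences(feats_a, feats_b, labels)
cosine = learned_w @ hidden_w / (np.linalg.norm(learned_w) * np.linalg.norm(hidden_w))
print("learned weights:", learned_w)
print("cosine similarity to hidden weights:", cosine)
```

The designer never enumerates cases here: the comparisons carry the intent, and the fitted weights only need to rank behaviours the way the human would. A learned reward can still be gamed where the feedback is sparse or mistaken, so this reduces the specification burden without removing it.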