Q-Learning
Q-Learning is a model-free, value-based, off-policy RL algorithm that learns the value of taking each action in each state, from which the best policy can be derived.
The agent uses a Q-table, indexed by state and action, to pick the action with the highest expected return in its current state. The Q-values are updated with the Bellman equation: Q(s, a) ← Q(s, a) + α [r + γ max_a' Q(s', a') − Q(s, a)]
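A minimal sketch of the tabular update above, on a toy 5-state chain MDP (the environment and hyperparameters here are illustrative assumptions, not part of the original notes):

```python
import numpy as np

n_states, n_actions = 5, 2  # actions: 0 = left, 1 = right
gamma, alpha, eps = 0.9, 0.5, 0.1
rng = np.random.default_rng(0)

Q = np.zeros((n_states, n_actions))  # the Q-table: one row per state

def step(s, a):
    """Deterministic chain: move left/right; reaching the last state gives reward 1."""
    s2 = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    r = 1.0 if s2 == n_states - 1 else 0.0
    return s2, r, s2 == n_states - 1

for _ in range(500):  # episodes
    s, done = 0, False
    while not done:
        # epsilon-greedy behaviour policy; the update below bootstraps
        # from the greedy (max) action, which is what makes this off-policy
        a = rng.integers(n_actions) if rng.random() < eps else int(Q[s].argmax())
        s2, r, done = step(s, a)
        # Bellman update toward r + gamma * max_a' Q(s', a')
        Q[s, a] += alpha * (r + gamma * Q[s2].max() * (not done) - Q[s, a])
        s = s2

print(Q[0].argmax())  # greedy action in the start state
```

After training, the greedy policy walks right toward the rewarding terminal state.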
Limitations
- Traditional Q-learning suffers from overestimation bias (the max operator propagates upward noise) ← addressed by Double Q-Learning, which keeps two separate Q-estimates: one selects the greedy action while the other evaluates it. (Taking the minimum of the two estimates is the related "clipped" variant popularized by TD3.)
Randomized Ensemble Double Q-learning (REDQ) is another improvement. It combines:
- A Double Q-learning-style minimum over Q-estimates to counter overestimation bias
- A randomized ensemble of N Q-functions, with a random subset of size M used for each target, which keeps the bias correction in check and enables a high update-to-data ratio for better sample efficiency
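The two mechanisms above can be sketched as tabular update rules (the tables, sizes, and hyperparameters below are illustrative assumptions; REDQ is normally used with neural Q-functions, shown here with tables for clarity):

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 5, 2
gamma, alpha = 0.9, 0.5

# Double Q-learning: two tables; one selects the action, the other evaluates it.
QA = np.zeros((n_states, n_actions))
QB = np.zeros((n_states, n_actions))

def double_q_update(s, a, r, s2, done):
    # Randomly pick which table to update; decoupling action selection
    # from evaluation removes the max operator's upward bias.
    if rng.random() < 0.5:
        a_star = int(QA[s2].argmax())                      # QA selects
        target = r + gamma * QB[s2, a_star] * (not done)   # QB evaluates
        QA[s, a] += alpha * (target - QA[s, a])
    else:
        a_star = int(QB[s2].argmax())
        target = r + gamma * QA[s2, a_star] * (not done)
        QB[s, a] += alpha * (target - QB[s, a])

# REDQ-style target: minimum over a random subset of M out of N Q-estimates.
N, M = 10, 2
ensemble = [np.zeros((n_states, n_actions)) for _ in range(N)]

def redq_target(r, s2, done):
    idx = rng.choice(N, size=M, replace=False)  # fresh random subset per target
    min_q = np.minimum.reduce([ensemble[i][s2] for i in idx])
    return r + gamma * min_q.max() * (not done)
```

The random subset lets REDQ tune how pessimistic the target is (larger M → more pessimism) while the full ensemble keeps the estimates diverse across many gradient updates per environment step.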