Q-Learning

Q-Learning is a model-free, value-based, off-policy RL algorithm for finding the best action to take in the agent’s current state, and hence the best overall policy.

The agent uses a Q-table, a table indexed by state and action, to take the best possible action based on the expected reward in each state. The Q-function is updated with the Bellman equation: Q(s, a) ← Q(s, a) + α [r + γ max_a′ Q(s′, a′) − Q(s, a)]
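
A minimal tabular sketch of this update loop in Python; the Gym-style `env` interface and the `n_states`/`n_actions` sizes are assumptions for illustration, not something fixed by the notes above:

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning sketch; env is assumed to follow a Gym-style API:
    reset() -> state, step(action) -> (next_state, reward, done, info)."""
    Q = np.zeros((n_states, n_actions))   # Q-table: one value per (state, action)
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection from the Q-table
            if np.random.rand() < epsilon:
                action = np.random.randint(n_actions)
            else:
                action = int(np.argmax(Q[state]))
            next_state, reward, done, _ = env.step(action)
            # Bellman update: move Q(s, a) toward r + gamma * max_a' Q(s', a')
            td_target = reward + gamma * np.max(Q[next_state]) * (not done)
            Q[state, action] += alpha * (td_target - Q[state, action])
            state = next_state
    return Q
```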

Limitations

  • Traditional Q-learning suffers from overestimation bias. Double Q-Learning addresses this by maintaining two separate Q-estimates: one selects the greedy next action and the other evaluates it (clipped variants take the minimum of the two estimates); see the sketch below
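
A sketch of the corresponding tabular Double Q-learning update, reusing the same assumed table shapes and transition tuple as the sketch above:

```python
import numpy as np

def double_q_update(Q_a, Q_b, state, action, reward, next_state, done,
                    alpha=0.1, gamma=0.99):
    """One tabular Double Q-learning step on two Q-tables."""
    # randomly pick which table to update this step
    if np.random.rand() < 0.5:
        Q_upd, Q_eval = Q_a, Q_b
    else:
        Q_upd, Q_eval = Q_b, Q_a
    # the updated table selects the greedy next action,
    # the other table evaluates it, which reduces overestimation
    best_next = int(np.argmax(Q_upd[next_state]))
    target_q = Q_eval[next_state, best_next]
    # a clipped variant (as in TD3) would instead use
    # min(Q_a[next_state, best_next], Q_b[next_state, best_next])
    td_target = reward + gamma * target_q * (not done)
    Q_upd[state, action] += alpha * (td_target - Q_upd[state, action])
```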

Randomized Ensembled Double Q-Learning (REDQ) is another improvement, combining two ideas:

  • Double Q-learning to counter overestimation bias
  • Randomized ensembles (multiple Q-estimates, with a random subset used to form each update target) to encourage thorough exploration; see the sketch below
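
A rough sketch of how a REDQ-style target could be computed in a discrete-action, tabular setting; the published algorithm uses an ensemble of neural-network critics with a SAC-style actor, so the ensemble of Q-tables, the subset size, and the greedy action choice here are simplifying assumptions:

```python
import numpy as np

def redq_target(Q_ensemble, reward, next_state, done,
                gamma=0.99, subset_size=2):
    """Form a REDQ-style target: min over a random subset of the ensemble."""
    # sample a random subset of the ensemble for this update
    idx = np.random.choice(len(Q_ensemble), size=subset_size, replace=False)
    # pick the next action greedily under the mean of the full ensemble
    mean_q = np.mean([Q[next_state] for Q in Q_ensemble], axis=0)
    next_action = int(np.argmax(mean_q))
    # take the minimum over the sampled subset to keep overestimation in check
    min_q = min(Q_ensemble[i][next_state, next_action] for i in idx)
    return reward + gamma * min_q * (not done)
```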