You wrote: "We then use those Q-Values to decide on our policy π (where a policy is simply each action’s probability to be selected)".

As I understand it, Q-Learning (or DQN) does not give you probabilities of actions, only a Q-Value for each action.

Is there any way to get probabilities then?
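
To make the question concrete, here is a minimal sketch of the two conversions I am aware of: a softmax (Boltzmann) distribution over the Q-Values, and the probabilities implied by epsilon-greedy selection. I am not claiming either is what the article means. The function names, the `temperature` and `epsilon` parameters, and the sample Q-Values are all my own assumptions for illustration.

```python
import math


def softmax_policy(q_values, temperature=1.0):
    """Turn Q-Values into probabilities with a Boltzmann (softmax) distribution.

    `temperature` is an assumed knob: lower values make the policy greedier.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Subtract the max before exponentiating for numerical stability.
    max_q = max(q_values)
    exps = [math.exp((q - max_q) / temperature) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]


def epsilon_greedy_policy(q_values, epsilon=0.1):
    """Probabilities implied by epsilon-greedy action selection.

    Every action gets epsilon / n; the greedy action gets the remaining mass.
    Ties are broken in favour of the lowest action index.
    """
    n = len(q_values)
    best = max(range(n), key=lambda a: q_values[a])
    probs = [epsilon / n] * n
    probs[best] += 1.0 - epsilon
    return probs


# Hypothetical Q-Values for three actions in a single state.
q = [1.0, 2.0, 0.5]
print(softmax_policy(q))
print(epsilon_greedy_policy(q, epsilon=0.1))
```

In both cases the probabilities come from a rule applied on top of the Q-Values, not from the Q-function itself, which is the distinction I am asking about.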
