Negative rewards in QLearning

Let's assume we're in a room where our agent can move along the x and y axes. At each point he can move up, down, right and left. So our state space is defined by (x, y) and the actions at each point are (up, down, right, left). Let's assume that whenever our agent takes an action that makes him hit a wall, we give him a negative reward of -1 and put him back in the state he was in before. If he finds a puppet in the center of the room, he wins a +10 reward.

When we update the QValue for a given state/action pair, we look at the actions available in the new state and compute the maximum QValue obtainable there, and we use it to update Q(s, a) for the current state/action. This means that if we have a goal state at the point (10, 10), the states around it will have QValues that get smaller and smaller the farther they are from it. For the walls, however, it seems to me that the same is not true.

When the agent hits a wall (let's assume he's in position (0, 0) and takes the action UP), he receives a reward of -1 for that state/action, thus getting a QValue of -1.

Now, if later I am in the state (0, 1), and assuming all the other actions of state (0, 0) have a QValue of zero, the QValue of (0, 1) for the action LEFT will be computed the following way:

Q([0,1], LEFT) = 0 + gamma * (max { 0, 0, 0, -1 } ) = 0 + 0 = 0
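
The same arithmetic can be written as a minimal Python sketch. It assumes a discount factor of 0.9 (the post gives no value for gamma) and a learning rate of 1, which the formula above uses implicitly. The `backup` helper and the goal-adjacent state are hypothetical and only there for contrast.

```python
GAMMA = 0.9  # assumed discount factor; the post does not give a value


def backup(reward, next_q_values, gamma=GAMMA):
    """One-step Q-learning target: r + gamma * max over a' of Q(s', a')."""
    return reward + gamma * max(next_q_values)


# QValues of state (0, 0): three untried actions at 0, and UP at -1.
q_next_wall = [0.0, 0.0, 0.0, -1.0]
wall_target = backup(0.0, q_next_wall)

# For contrast, a hypothetical state whose successor has one action worth +10.
q_next_goal = [0.0, 0.0, 0.0, 10.0]
goal_target = backup(0.0, q_next_goal)

print("next to the wall:", wall_target)
print("next to the goal:", goal_target)
```

The max discards the -1 as long as any other action in the next state looks better, while a +10 always survives the max.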

That is, having hit the wall doesn't propagate to nearby states, contrary to what happens with positive reward states.

From my point of view this seems odd. At first I thought that finding state/action pairs with negative rewards would be as useful for learning as finding positive ones, but from the example above, that doesn't seem to hold. There seems to be a bias in the algorithm toward taking positive rewards into consideration far more than negative ones.

Is this the expected behavior of QLearning? Shouldn't bad rewards be just as important as positive ones? What are "work-arounds" for this?


Negative feedback only propagates when it is the only possible outcome from a particular move.

Whether this is deliberate or unintentional I do not know.


You can avoid negative rewards by increasing the default reward from 0 to 1, the goal reward from 10 to 11, and the penalty from -1 to 0.

There are tons of scientific publications on Q-learning, so I'm sure there are other formulations that would allow for negative feedback.

EDIT: I stand corrected, this doesn't change the behaviour as I stated earlier. My thought process was that the formulation with negative feedback could be replaced by one without.

The reason for your observation is that you have no uncertainty on the outcome of your actions or the state it is in, therefore your agent can always choose the action it believes has optimal reward (thus, the max Q-value over all future actions). This is why your negative feedback doesn't propagate: the agent will simply avoid that action in the future.

If, however, your model included uncertainty over the outcome of your actions (e.g. there is always a 10% probability of moving in a random direction), your learning rule should integrate over all possible future rewards (basically replacing the max by a weighted sum). In that case negative feedback can be propagated too (this is why I thought it should be possible :p ). Examples of such models are POMDPs.
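
Here is a rough sketch of that effect, under assumptions that are not in the question: a hypothetical 3x3 room with no goal (to isolate the wall penalty), a discount factor of 0.9, and a known model solved with repeated expected backups rather than sampled Q-learning. Note that in this particular formulation the weighted sum is taken over the possible outcomes of an action, while the max over the next state's actions is kept.

```python
import itertools

ACTIONS = {"UP": (0, -1), "DOWN": (0, 1), "LEFT": (-1, 0), "RIGHT": (1, 0)}
SIZE = 3     # hypothetical 3x3 room without a goal
GAMMA = 0.9  # assumed discount factor


def step(state, action):
    """Deterministic move; hitting a wall gives -1 and leaves the state unchanged."""
    x, y = state
    dx, dy = ACTIONS[action]
    nx, ny = x + dx, y + dy
    if 0 <= nx < SIZE and 0 <= ny < SIZE:
        return (nx, ny), 0.0
    return state, -1.0


def solve(slip, sweeps=200):
    """Expected backups; with probability `slip` a uniformly random action is executed."""
    states = list(itertools.product(range(SIZE), repeat=2))
    q = {(s, a): 0.0 for s in states for a in ACTIONS}
    for _ in range(sweeps):
        # State values from the previous sweep (max over actions is kept).
        v = {s: max(q[(s, a)] for a in ACTIONS) for s in states}
        for s in states:
            for a in ACTIONS:
                total = 0.0
                for executed in ACTIONS:
                    # Probability that `executed` is what actually happens.
                    p = slip / len(ACTIONS)
                    if executed == a:
                        p += 1.0 - slip
                    s2, r = step(s, executed)
                    total += p * (r + GAMMA * v[s2])
                q[(s, a)] = total
    return {s: max(q[(s, a)] for a in ACTIONS) for s in states}


v_det = solve(slip=0.0)   # deterministic moves
v_slip = solve(slip=0.1)  # 10% chance of a random move
print("center, deterministic:", v_det[(1, 1)])
print("center, with slip:    ", v_slip[(1, 1)])
```

With deterministic moves every state keeps a value of 0, because the agent can always pick a move that avoids the wall. With the 10% slip, even the center state, which has no wall next to it, ends up with a negative value, so the penalty has propagated.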


Your question is answered in the book "Reinforcement Learning: An Introduction", which has a section on "Maximization Bias and Double Learning".

The "Q-Learning" algorithm has a drawback: a maximum over estimated values is used implicitly as an estimate of the maximum value, which can lead to a significant positive bias.

The "Double Q-Learning" algorithm can avoid maximization bias and address your question. It learns two independent estimates, called Q_1(a) and Q_2(a). Here is the pseudocode: [image: Double Q-Learning]
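
As a rough illustration of the two-estimate idea, here is a minimal tabular sketch in Python. It is not the book's pseudocode: the function name `double_q_update`, the learning rate of 0.1, the discount factor of 0.9 and the dictionary-based tables are all assumptions made for this example.

```python
import random
from collections import defaultdict

ACTIONS = ["UP", "DOWN", "LEFT", "RIGHT"]


def double_q_update(q1, q2, s, a, r, s2, done, alpha=0.1, gamma=0.9, rng=random):
    """One update step: one table picks the greedy action, the other evaluates it."""
    # Flip a coin to decide which table is updated on this step.
    if rng.random() < 0.5:
        learner, evaluator = q1, q2
    else:
        learner, evaluator = q2, q1
    if done:
        target = r
    else:
        # Greedy action according to the table being updated...
        best = max(ACTIONS, key=lambda b: learner[(s2, b)])
        # ...but valued with the other, independent table.
        target = r + gamma * evaluator[(s2, best)]
    learner[(s, a)] += alpha * (target - learner[(s, a)])


q1 = defaultdict(float)  # unseen state/action pairs default to 0
q2 = defaultdict(float)

# The wall hit from the question: UP in (0, 0) gives -1 and the state stays (0, 0).
double_q_update(q1, q2, (0, 0), "UP", -1.0, (0, 0), done=False)
print(q1[((0, 0), "UP")], q2[((0, 0), "UP")])
```

When acting, a common choice is to be greedy (or epsilon-greedy) with respect to the sum of the two tables.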
