Negative rewards in QLearning

2022-12-13 01:31 Q&A author:

Let's assume we're in a room where our agent can move along the x and y axes. At each point he can move up, down, right and left. So our state space can be defined by (x, y) and our actions at each point are given by (up, down, right, left). Let's assume that whenever our agent takes an action that makes him hit a wall, we give him a negative reward of -1 and put him back in the state he was in before. If he finds a puppet in the center of the room, he wins a +10 reward.

When we update our QValue for a given state/action pair, we look at which actions can be taken in the new state and compute the maximum QValue reachable from there, so we can update our Q(s, a) value for the current state/action. What this means is that if we have a goal state at the point (10, 10), all states around it will have QValues that get smaller and smaller as they get farther away. Now, with regard to the walls, it seems to me the same is not true.

When the agent hits a wall (let's assume he's in the position (0, 0) and did the action UP), he will receive a reward of -1 for that state/action, thus getting a QValue of -1.

Now, if later I am in the state (0, 1), and assuming all the other actions of state (0, 0) are zero, when calculating the QValue of (0, 1) for the action LEFT, it will compute it the following way:

Q([0,1], LEFT) = 0 + gamma * (max { 0, 0, 0, -1 } ) = 0 + 0 = 0

That is, having hit the wall doesn't propagate to nearby states, contrary to what happens when you have positive reward states.
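To make the asymmetry concrete, here is a minimal Python sketch of the same calculation. The function name `q_target` and the value gamma = 0.9 are assumptions for illustration; like the formula above, it leaves out the learning rate (equivalent to a learning rate of 1 with an old QValue of 0).

```python
def q_target(reward, next_q_values, gamma=0.9):
    """One-step Q-learning target: r + gamma * max_a' Q(s', a').

    gamma=0.9 is an assumed discount factor, not taken from the question.
    """
    return reward + gamma * max(next_q_values)


# Next state (0, 0): three untried actions at 0, and UP at -1 (wall hit).
wall_neighbor = q_target(0.0, [0.0, 0.0, 0.0, -1.0])

# Same situation, but one action in the next state leads to the +10 goal.
goal_neighbor = q_target(0.0, [0.0, 0.0, 0.0, 10.0])

print(wall_neighbor)  # the -1 is hidden by the max
print(goal_neighbor)  # the +10 is passed on, discounted by gamma
```

The max picks the best entry in the next state, so a single bad action is masked as long as any other action there looks better.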

From my point of view this seems odd. At first I thought finding state/action pairs giving negative rewards would be, learning-wise, as good as finding positive rewards, but from the example I have shown above, that statement doesn't seem to hold true. There seems to be a bias in the algorithm for taking positive rewards into consideration far more than negative ones.

Is this the expected behavior of QLearning? Shouldn't bad rewards be just as important as positive ones? What are "work-arounds" for this?

Negative feedback only propagates when it is the only possible outcome from a particular move.

Whether this is deliberate or unintentional I do not know.

You can avoid negative rewards by increasing the default reward from 0 to 1, the goal reward from 10 to 11, and the penalty from -1 to 0.

There are tons of scientific publications on Q-learning, so I'm sure there are other formulations that would allow for negative feedback.

EDIT: I stand corrected, this doesn't change the behaviour as I stated earlier. My thought process was that the formulation with negative feedback could be replaced by one without.

The reason for your observation is that you have no uncertainty on the outcome of your actions or the state it is in, therefore your agent can always choose the action it believes has optimal reward (thus, the max Q-value over all future actions). This is why your negative feedback doesn't propagate: the agent will simply avoid that action in the future.

If, however, your model included uncertainty over the outcome of your actions (e.g. there is always a 10% probability of moving in a random direction), your learning rule should integrate over all possible future rewards (basically replacing the max by a weighted sum). In that case negative feedback can be propagated too (this is why I thought it should be possible :p ). Examples of such models are POMDPs.
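As a rough sketch of the "weighted sum" idea, assume that in the next state the greedy action is carried out with probability 1 - slip and a uniformly random action otherwise. The function name `expected_backup`, slip = 0.1 and gamma = 0.9 are assumptions for illustration; this is one possible way to weight the outcomes, not a specific published rule.

```python
def expected_backup(reward, next_q_values, gamma=0.9, slip=0.1):
    """Back up a weighted sum over next-state action values instead of the max.

    Assumption: the greedy action is executed with probability 1 - slip,
    and a uniformly random action with probability slip.
    """
    greedy = max(next_q_values)
    uniform = sum(next_q_values) / len(next_q_values)
    return reward + gamma * ((1 - slip) * greedy + slip * uniform)


# Same next state as in the question: three actions at 0, wall action at -1.
noisy_value = expected_backup(0.0, [0.0, 0.0, 0.0, -1.0])
print(noisy_value)  # slightly negative: the wall penalty now leaks through
```

With slip set to 0 this reduces to the plain max backup, which gives 0 for the same inputs.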

Your question is answered in the book "Reinforcement Learning: An Introduction", which has a section on "Maximization Bias and Double Learning".

The "Q-Learning" algorithm has a drawback: a maximum over estimated values is used implicitly as an estimate of the maximum value, which can lead to a significant positive bias.

The "Double Q-Learning" algorithm can avoid maximization bias and address your question; it requires you to learn two independent estimates, called Q_1(a) and Q_2(a). Here I paste the pseudocode for you: [image: Double Q-Learning pseudocode]
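For reference, a tabular Double Q-learning update can be sketched in Python roughly as follows. The helper names (`make_table`, `double_q_update`), the action labels, and the values alpha = 0.1 and gamma = 0.9 are assumptions for illustration, and terminal-state handling is left out.

```python
import random
from collections import defaultdict

ACTIONS = ("up", "down", "right", "left")


def make_table():
    # Q-table mapping state -> {action: value}, initialised to 0.
    return defaultdict(lambda: {a: 0.0 for a in ACTIONS})


def double_q_update(q1, q2, s, a, r, s_next, alpha=0.1, gamma=0.9, rng=random):
    """One Double Q-learning step (alpha and gamma are assumed values).

    A coin flip decides which table is updated. That table selects the
    best next action; the other table supplies the value of that action.
    """
    if rng.random() < 0.5:
        q1, q2 = q2, q1  # swap roles so the other table gets updated
    best = max(ACTIONS, key=lambda b: q1[s_next][b])
    target = r + gamma * q2[s_next][best]
    q1[s][a] += alpha * (target - q1[s][a])


q_a, q_b = make_table(), make_table()
# Wall hit at (0, 0) with action "up": reward -1, agent stays in (0, 0).
double_q_update(q_a, q_b, (0, 0), "up", -1.0, (0, 0), rng=random.Random(0))
```

When acting, the two tables are typically combined (for example, summed or averaged) before choosing an action.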
