
Reducing the number of Markov states in reinforcement learning

I've started toying with reinforcement learning (using the Sutton book). What I fail to fully understand is the apparent paradox between having to reduce the Markov state space on the one hand and, on the other, not making assumptions about what is important and what is not.

Background

E.g., in the checkers example, Sutton says that one should not assign rewards to particular actions in the game, such as capturing an opponent's piece. He argues that this may optimize the AI for taking pieces rather than for winning the game. Thus, rewards should only be given for the result you actually want to achieve (e.g. winning the game).
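For illustration, here is a minimal sketch (against a hypothetical checkers environment; the function name and the "agent" label are my assumptions, not from the book) of what "reward only the result" looks like in practice:

    def reward(state, winner=None):
        """Reward only the outcome we actually care about.

        Intermediate events such as capturing a piece get no reward;
        only a finished game does.
        """
        if winner is None:          # game still in progress
            return 0.0
        return 1.0 if winner == "agent" else -1.0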

Question 1

Assume a (Texas hold'em) poker AI whose Markov state consists only of the player's hand and the cards on the table. That gives roughly 52·51·50·49·48·47·46 / (1·2·3·4·5·6·7) = C(52, 7) ≈ 134 million states. Now assume we want the AI to also take each player's money pool and their bets into consideration. The Markov state space then approaches an "infinite number of combinations" if we assume 8 players, each holding between $1 and $200,000.

Question 2

One state-reducing strategy could be to divide each player's cash into poor, medium, or rich. This drastically reduces our state space; however, how do I know that (a) three groups are sufficient, and (b) what the discriminating limits for each group should be?

cheers,


The general approach is to use function approximation to reduce the state space when it gets too large. The key is that you generalize rewards between states that are similar. Of course, this requires that you come up with meaningful features by exploiting domain knowledge. Unfortunately, there are no algorithms that solve both the feature-selection problem and the control problem together with any guarantees of optimality (in polynomial time), nor do we expect any to be invented.
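As a concrete illustration (my sketch, not part of the original answer), here is a minimal semi-gradient Q-learning step on a linear approximator; the feature vectors are assumed to come from some hand-crafted, domain-specific feature map and to be NumPy arrays:

    import numpy as np

    def q_value(weights, features):
        """Linear approximation: Q(s, a) is approximated by w . phi(s, a)."""
        return float(np.dot(weights, features))

    def semi_gradient_q_update(weights, features, reward,
                               next_features_per_action,
                               alpha=0.1, gamma=0.99):
        """One semi-gradient Q-learning step on the linear approximator.

        `features` is phi(s, a) for the transition just taken, and
        `next_features_per_action` is a list of phi(s', a') vectors,
        one per action available in the next state.
        """
        target = reward + gamma * max(q_value(weights, f)
                                      for f in next_features_per_action)
        td_error = target - q_value(weights, features)
        return weights + alpha * td_error * features

Because states that share features share weights, experience in one state updates the value estimates of all similar states at once.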

To answer your questions: 1) even beginner-level performance in 8-player limit Texas hold'em is far beyond the current state of the art in research. Check out the current "world's best computer poker player" research at http://poker.cs.ualberta.ca/. That said, you could try to divide the space up using arbitrary features like (player[1].cash > 1000) 0:1, (player[1].cash > 2500) 0:1, etc., as in the sketch below.
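A minimal sketch of those threshold features, with hypothetical cut-off values and an assumed `players` list whose elements have a `cash` attribute:

    CASH_THRESHOLDS = [1000, 2500, 5000, 10000, 50000]   # illustrative cut-offs

    def cash_features(players):
        """Binary indicators: does each player's stack exceed each threshold?"""
        return [1 if p.cash > t else 0
                for p in players
                for t in CASH_THRESHOLDS]

Concatenated with whatever card features you use, these indicators give a compact representation for the function approximator instead of one distinct state per dollar amount.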

2) It's hard to know how good your representation is; usually people just run it until it starts to converge and see how well it performs...


A proposed approach to reducing the state space in RL is to use a state-action hierarchy. Instead of having a single state variable X, you break it up into smaller variables, say x1, x2, x3. You then measure their transition frequencies and determine the dependencies between them (e.g. x1 usually changes when x2 = abc). You can then form a policy describing how best to transition the faster-changing variables in order to change the slower-changing ones, and thereby maximize the reward.
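As a rough illustration of the first step (measuring which sub-variables change fast and which change slowly), here is a sketch that assumes trajectories are recorded as lists of dicts like {"x1": ..., "x2": ..., "x3": ...}; the data format is my assumption:

    from collections import Counter

    def change_frequencies(trajectory):
        """Count, per sub-variable, how many transitions change its value."""
        changes = Counter()
        for prev, curr in zip(trajectory, trajectory[1:]):
            for var in prev:
                if prev[var] != curr[var]:
                    changes[var] += 1
        return changes

Sub-variables that change rarely are candidates for the upper levels of the hierarchy, with the fast-changing ones handled by lower-level sub-policies.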

This approach is still relatively new, and I'm not aware of any public implementations. However, several papers propose possible algorithms: MAXQ assumes a human-defined hierarchy, whereas HEXQ describes a method for learning the hierarchy as well as the policies.


To add to @fairidox's point (which I agree with): if the cash each player holds is between $1 and $200,000 then, given function approximation as they say, it stands to reason that a player would not alter their behaviour between a state of having $1,000 and one of having $1,001. Now, as you mention in question 2, there is a problem with choosing those boundaries arbitrarily.

The same logic applies to a robot in a physical space, where the value of one state is likely to be the same as that of a state 1 mm to the right. If we did not approximate, we could argue that we should also have a new state for 0.5 mm, 0.25 mm, and so on. Sutton and Barto (and the David Silver lectures) discuss tile coding for this, which I think may also be appropriate for your poker example.
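For reference, a minimal sketch of one-dimensional tile coding over a player's stack size; the range, tile width, and number of tilings are illustrative assumptions, not values from the text:

    def tile_indices(value, low=0.0, high=200_000.0,
                     n_tilings=8, tiles_per_tiling=10):
        """Return one active tile index per tiling for a scalar `value`."""
        tile_width = (high - low) / tiles_per_tiling
        active = []
        for t in range(n_tilings):
            offset = t * tile_width / n_tilings       # each tiling is shifted slightly
            idx = int((value - low + offset) // tile_width)
            idx = min(max(idx, 0), tiles_per_tiling)  # clamp to the padded edge tile
            active.append(t * (tiles_per_tiling + 1) + idx)
        return active

Each returned index can be treated as an active binary feature; summing the weights of the active tiles gives the approximate value, so nearby stack sizes share most of their tiles and generalize to one another.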

