Reinforcement Learning for Poker
When I start solving a new machine learning task with Python and TensorFlow, I try to keep things simple at the beginning. I focus on the core of the problem and try to isolate it. It usually pays off to start with a minimal amount of data and to set up small, simple and effective models as a baseline. At this stage you should forget about corner cases for a while. I think this is a workable scenario for most problems, and a good strategy for staying out of big trouble early on.
To solve poker with reinforcement learning (RL), I started by building the basics: the code of poker classes such as Deck, Table and Player. I developed the interfaces and the two most important methods: hand evaluation for the Deck and the game loop for the Table. Players were ready to take seats at the table, get their cards and play a hand to the end, when the winner was rewarded. I decided on a simple scenario of 3 players at one table and a limited bet size. Such assumptions reduce the problem a bit at the beginning but do not require any changes to the system code in the future.
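A minimal sketch of how such classes might fit together is shown below. It is not the pypoks code: the method names, the fixed ante and the scoring rule are assumptions made for illustration. Betting rounds are left out entirely, and `Deck.evaluate` is only a placeholder that compares sorted card ranks instead of real poker hand rankings.

```python
import random
from dataclasses import dataclass, field

RANKS = "23456789TJQKA"
SUITS = "shdc"


class Deck:
    """A shuffled 52-card deck; cards are strings such as 'As' or 'Td'."""

    def __init__(self, rng=None):
        self.rng = rng if rng is not None else random.Random()
        self.cards = [rank + suit for rank in RANKS for suit in SUITS]
        self.rng.shuffle(self.cards)

    def deal(self, n):
        """Remove and return the top n cards."""
        dealt, self.cards = self.cards[:n], self.cards[n:]
        return dealt

    @staticmethod
    def evaluate(cards):
        """Placeholder score (assumption): card ranks sorted from high to low.

        A real evaluator would recognise pairs, straights, flushes and so on.
        """
        return sorted((RANKS.index(card[0]) for card in cards), reverse=True)


@dataclass
class Player:
    name: str
    cash: int = 500
    hand: list = field(default_factory=list)


class Table:
    """Runs one simplified hand: ante, deal, showdown, reward."""

    def __init__(self, players, ante=10):
        self.players = players
        self.ante = ante  # hypothetical fixed contribution per hand

    def run_hand(self, rng=None):
        deck = Deck(rng)
        pot = 0
        for player in self.players:
            player.hand = deck.deal(2)
            player.cash -= self.ante
            pot += self.ante
        board = deck.deal(5)
        # On a tie, max() keeps the first of the tied players.
        winner = max(self.players, key=lambda p: Deck.evaluate(p.hand + board))
        winner.cash += pot
        return winner
```

With three players and the default ante, the winner ends the hand with 520 and the other two with 490, so the total cash at the table is unchanged.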
Next, I added a bunch of lines to set up the first full RL environment with players controlled by neural networks (NN). Each NN received table states: the player’s cards, the moves of other players, the cash on the table etc. All this information was transformed into an input vector for the network. Then the neural model was asked to decide on the player’s move based on that input. From time to time the player received a reward: the cash balance after a finished hand, a positive amount if the hand was won and a negative one if it was lost. This reward was used so that the network could learn the policy with backpropagation. The first setup was as simple as possible. It was a single process running all the tables and neural networks. When one network was running a forward pass, the whole system waited. It was slow as hell: it ran about 150 hands per second, which for a typical RL setup is, in the simplest words, not enough. But at least it ran.
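The loop described above (state vector in, move out, reward used to adjust the policy) can be illustrated with a tiny policy-gradient sketch. This is not the network from the post: a linear softmax policy in plain Python stands in for the neural model, and the action set, the feature vector and the REINFORCE-style update rule are all assumptions made for illustration.

```python
import math
import random

ACTIONS = ("fold", "call", "raise")  # hypothetical action set


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


class LinearPolicy:
    """Softmax policy with one weight per (feature, action) pair."""

    def __init__(self, n_features, n_actions, learning_rate=0.1):
        self.weights = [[0.0] * n_actions for _ in range(n_features)]
        self.learning_rate = learning_rate

    def action_probs(self, state):
        n_actions = len(self.weights[0])
        logits = [
            sum(s * row[a] for s, row in zip(state, self.weights))
            for a in range(n_actions)
        ]
        return softmax(logits)

    def act(self, state, rng=random):
        """Sample a move index according to the current policy."""
        probs = self.action_probs(state)
        return rng.choices(range(len(probs)), weights=probs)[0]

    def update(self, state, action, reward):
        """REINFORCE step: move weights along reward * grad log pi(action | state)."""
        probs = self.action_probs(state)
        for f, s in enumerate(state):
            for a, p in enumerate(probs):
                indicator = 1.0 if a == action else 0.0
                self.weights[f][a] += self.learning_rate * reward * s * (indicator - p)
```

A positive reward makes the chosen move more likely for similar states and a negative reward makes it less likely, which is the basic mechanism behind learning from cash won or lost.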
This setup was a disaster. It was probably too simple and based on bad assumptions. The players (NNs) learned almost nothing. Every player became as aggressive as possible and, after a while, got stuck in a crazy game without any further improvement. It looked very unstable. I wondered why that happened.
In the early research phase of a project, you sometimes sacrifice good code design for quick experiments and frequent changes. A messy desk and creative work often go hand in hand. But there is also a trap here. I reached the point where my code was not ready for any bigger changes. I experimented with options like neural architecture adjustments and different optimization algorithms, but nothing helped. Then I realized that the NN was probably not able to understand the card information at all: players always played the same way regardless of what cards they held. This motivated me to leave the RL problem for a while and focus only on card understanding as a small subproblem.
Working on the cardNet network concept was great fun. It is described in detail in my previous post. To summarize the concept: cardNet is a network trained in a supervised manner to encode a set of cards into a single vector representation according to the rules of poker. People familiar with NLP concepts may think of this network as a sentence encoder where, for a given sequence of characters (a sentence), a neural net builds its rich representation. The rich representation holds a lot of information about the meaning of the text and helps the machine learning system understand what was written. It may be used later for many NLP tasks. As was shown before, a properly trained cardNet representation is rich in information such as rank, rank value and even winning probability for a given set of cards. This information is essential for making good decisions at the poker table. While developing cardNet, I learned a lot and observed that, unfortunately, “simple” is sometimes not enough to make a step forward. I realized that my first RL solution couldn’t use information about cards properly. But now I was able to use the cardNet encoder as a ready-made block in future work.
I decided to give myself a second chance and build an improved RL setup. I wrote down all the drawbacks of the first one. With the knowledge I gained from the first implementation, I designed everything from scratch and planned a new setup of the system. At one point I stopped and realized that the problem I was trying to solve is really quite complex. I thought about giving up at that moment.
I started with parallelization from the very beginning. I drafted a concept and built a kind of multiprocessing mock-up for tests. I designed the poker table interface as a separate process with communication queues. Neural models were first substituted with simple random ones. While developing the system, I ran a lot of tests to find the optimal settings in terms of processing speed. With random players, I was able to process about 30000 hands per second. Each second the system sent table states (data) through queues to separate processes that made (random) decisions and then sent them back. On average, every hand has about 30 states sent by each player through queues and about 10 decisions sent back to each of them. With 3 players at one table, that gives about 30000 * 3 * (30+10) = 3.6M objects sent through Python queues between processes every second. If you are aware of how slow a single Python queue usually is, then you know that this number is really big. I knew that adding neural networks would slow it down a lot, but it was a good starting point with a lot of development potential.
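The mock-up described above can be sketched with the standard `multiprocessing` module. This is only an illustration under stated assumptions, not the pypoks implementation: the function names, the message format, the one-decision-per-state exchange and the `None` shutdown sentinel are all hypothetical. The helper `messages_per_second` simply reproduces the throughput arithmetic from the paragraph.

```python
import multiprocessing as mp
import random

ACTIONS = ("fold", "call", "raise")  # hypothetical action set


def messages_per_second(hands_per_second, players_per_table,
                        states_per_hand, decisions_per_hand):
    """Objects crossing the queues each second, as estimated in the text."""
    return hands_per_second * players_per_table * (states_per_hand + decisions_per_hand)


def decision_worker(state_queue, decision_queues):
    """Stand-in for a neural model: answers every state with a random move."""
    while True:
        message = state_queue.get()
        if message is None:  # sentinel: shut down
            break
        table_id, _state = message  # a real model would read the state here
        decision_queues[table_id].put(random.choice(ACTIONS))


def table_worker(table_id, n_requests, state_queue, decision_queue, results):
    """Stand-in for a poker table: sends states and waits for decisions."""
    received = 0
    for step in range(n_requests):
        state_queue.put((table_id, {"step": step}))
        decision_queue.get()  # block until the decision comes back
        received += 1
    results.put((table_id, received))


def run_demo(n_tables=2, n_requests=5):
    """Start one decision process and n_tables table processes."""
    state_queue = mp.Queue()
    results = mp.Queue()
    decision_queues = [mp.Queue() for _ in range(n_tables)]
    decider = mp.Process(target=decision_worker,
                         args=(state_queue, decision_queues))
    tables = [
        mp.Process(target=table_worker,
                   args=(i, n_requests, state_queue, decision_queues[i], results))
        for i in range(n_tables)
    ]
    decider.start()
    for table in tables:
        table.start()
    counts = dict(results.get() for _ in range(n_tables))
    for table in tables:
        table.join()
    state_queue.put(None)  # tell the decision process to stop
    decider.join()
    return counts
```

`messages_per_second(30000, 3, 30, 10)` gives the 3.6M figure quoted above. On platforms that spawn rather than fork worker processes, `run_demo()` should be called from under an `if __name__ == "__main__":` guard.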
Besides the new architecture with multiprocessing and queues, the main difference from the first RL setup is a concept I call Many States One Decision (MSOD for short). I will describe it later while explaining RL basics. I also used the cardNet encoder as a ready-made subgraph of the whole neural network and completely changed the algorithm that prepares input data for the NN. I think the format of the input data was also a big problem in the first implementation.
When everything was ready and I replaced the random players with neural models, things really took a turn for the better. I am amazed by the player statistics, which look very promising. The players evolve and quickly learn strong tactics from self-play. The setup is also very stable as a reinforcement learning environment: it does not get stuck and keeps improving with time. It is also quite fast: it runs about 7000 hands per second while running forward and backward passes of 14 neural networks, and still uses no more than half of the hardware resources. After a few hours of training, there are usually a few players with very strong skills and others that are catching up with them. I was positively shocked by the new results. Below are some graphs of player statistics that evolve while running the RL process.
The graphs above present:
- $won: total winnings of each NN (every NN starts every hand with $500, regardless of whether it won or lost the previous one)
- VPIP: voluntarily put $ in pot
- PFR: preflop raise. VPIP and PFR are the two most popular statistics of a poker player; in the simplest words, they show how well a player understands card value and position at the table
- AGG: postflop aggression, tells how aggressive a player is after the flop
- HF: hands folded, a custom statistic developed for pypoks: the percentage of hands in which the player decided to fold
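As a rough illustration, the percentage-style statistics from this list could be computed from per-hand records as sketched below. The `HandRecord` structure and its field names are hypothetical and not taken from pypoks. VPIP and PFR follow their usual poker definitions (share of hands with money voluntarily put in preflop, and share of hands with a preflop raise). AGG is left out because the post does not give its exact formula.

```python
from dataclasses import dataclass


@dataclass
class HandRecord:
    """Hypothetical summary of one hand from a single player's point of view."""
    put_money_preflop: bool  # called or raised preflop by choice (forced blinds excluded)
    raised_preflop: bool
    folded: bool


def percentage(count, total):
    """Share of hands as a percentage; 0.0 when no hands were played."""
    return 100.0 * count / total if total else 0.0


def player_stats(records):
    """Compute VPIP, PFR and HF over a list of HandRecord objects."""
    n = len(records)
    return {
        "VPIP": percentage(sum(r.put_money_preflop for r in records), n),
        "PFR": percentage(sum(r.raised_preflop for r in records), n),
        "HF": percentage(sum(r.folded for r in records), n),
    }
```

For example, a player who raised preflop once, called preflop once and folded the other two of four hands would have VPIP 50, PFR 25 and HF 50.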
Since all the players start without any knowledge of poker rules and strategy, they quickly begin to play with high aggression. As long as some players fold too often, the aggressive ones keep winning. This always happens at the beginning of training. After about 50K hands, aggression alone is no longer enough. Players start to explore the space and choose less aggressive strategies. On average, all players lower their aggression (VPIP, PFR, AGG), since a less aggressive strategy is more profitable. The setup presented above is very dynamic: players change their strategies very often. I have noticed that the main reason is the optimizer type, which here is Adam.
It looks like RL for poker in the pypoks project is now on a really good track ;). I have since improved some elements very quickly and got even better results. There are still many things to do, but I can see that the system runs a multiplayer poker game with RL with very good results.
I have to say that the current RL setup is no longer simple. Here are some numbers: the training runs presented in the graphs used 400 poker tables (1200 players) simultaneously; each table was a separate process communicating through multiprocessing queues with 14 neural networks running in separate processes on 2 GPUs. Each neural network was built with the Transformer-based cardNet encoder and a convolutional encoder for incoming table states. There are still a lot of interesting things to test and implement. In addition, I think the code is still nicely designed. It is scalable and easy to maintain.
I will try to explain all the concepts in depth in the next posts, and I hope to keep my explanations simple.