Computer Plays TicTacToe

This is an attempt at making a computer play TicTacToe. I plan to write code for playing the game using rule-based logic, Q-Learning, and eventually a mix of search + RL algorithms (the AlphaZero way). I'm exploring RL through projects, and this is the first one.

The code is public and available on GitHub.

TicTacToe is a simple game we've all played as kids — and it's a great choice for implementing Reinforcement Learning because the rules are precise and the constraints are limited. That simplicity helps us focus on the concepts rather than the game mechanics.

Rules

The game is played on a 3×3 grid with two players (X and O)
Players take turns placing their symbol in an empty square
Win by placing three symbols in a row (horizontal, vertical, or diagonal)
If all nine squares fill up with no winner, it's a draw
X always goes first

The Smart Strategy

We start with rule-based logic. For a 3×3 grid this works fine, but as we scale up the grid, the number of possible valid games explodes — from 255,168 on a 3×3 grid to somewhere around 10^80 to 10^100 on a 10×10 grid. That's more than the estimated atoms in the observable universe.

The smart strategy boils down to 5 rules, checked sequentially each move:

Win — If you can win in the next move, take it
Block — If the opponent can win next move, block them
Center — If the center is open, take it
Corner — If a corner is open, take it
Side — Otherwise, take a side square

Each move of the game is tested sequentially against the above rules, whichever rule fits the condition best is implemented. If no rule can be selected, it's most likely going to be a draw.

Simulations: 3×3 Grid

We ran 10,000 simulations for Random vs Random, Smart vs Random, and Smart vs Smart strategies using self play. The simulation results are as follows:

Random vs Random Strategy [3x3 Grid]

Smart vs Random Strategy [3x3 Grid]

Smart vs Smart Strategy [3x3 Grid]

Some observations:

The minimum number of moves to end a game is 5 while the maximum is 9 (the grid size)
Random vs Random is fastest at 0.9 seconds while Smart vs Smart takes 10.32 seconds — makes sense since the smart strategy has to evaluate conditions each move
In Smart vs Random, the random strategy still manages to win a few games (55 wins) even while making random moves, though the number is significantly low
Smart vs Smart shows that when both players play smart, all games end up as draws

Scaling Up: 5×5 Grid

Now let's see how the current rule-based logic performs when the number of blocks increase as we scale the grid. We implemented and simulated 10k games for the 5×5 grid.

Random vs Random Strategy [5x5 Grid]

Smart vs Random Strategy [5x5 Grid]

Smart vs Smart Strategy [5x5 Grid]

Observations:

Like the 3×3 grid, Random vs Random is fastest while Smart vs Smart takes the most time — expected given how they work
Whenever Random strategy is involved, the maximum moves in the game end up being 25 (5×5), covering the entire grid
The number of moves required drops significantly with the smart strategy, but it comes at a severe time cost — up to 20x longer
Smart vs Smart still results in all draws even on the 5×5 grid
In Smart vs Random, the random strategy O still wins a small number of games

Scaling Further: 9×9 Grid

I ran these on my Mac Mini M4 — Google Colab Free was too slow for this. Since we ran 1k and 10k games, I'll focus on the 10k results.

Random vs Random Strategy [9x9 Grid]

Smart vs Random Strategy [9x9 Grid]

Smart vs Smart Strategy [9x9 Grid]

Key observations across all grid sizes:

Compute scales dramatically — Random vs Random takes 1 min 40 seconds while Smart vs Smart requires ~58.5 minutes. Compared to the 3×3 grid, that's 106x, 414x and 338x more time respectively
Random wins disappear — On larger grids, the random strategy's chance of winning drops to literally zero in Smart vs Random
Smart vs Smart always draws — This holds across every grid size tested
Minimum moves are 33 in Smart vs Smart while the maximum is 81 (the grid size) in Random vs Random
X has a slight advantage in Random vs Random as it starts first

What's Next

The rule-based approach works for small grids but doesn't scale. The next step is implementing Q-Learning to see if we can find better strategies without explicitly programming rules — and maybe finally break the Smart vs Smart deadlock on larger grids.