Simulating The Right Thing: Property-Based Testing

12 March 2019, by Justin Worthe

Property-based testing is a useful technique for testing software. It works exceedingly well in a small set of situations, such as my tower defence game bot. Here’s how I tested whether my bot was simulating valid moves.


I recently entered the Entelect Challenge, an annual AI programming competition. This year, I had to write a program that could play a tower defence game against an opponent. But how do you conclusively say that a piece of software is working as intended? A buggy bot wouldn't do well in the tournament, so I needed confidence that mine was doing the right thing.

In the field of software testing, there are many possible ways to deal with this. One is to boot the program up and see if it does the right thing. However, as your program grows, it becomes impossible to manually check that everything in the system still works after every change. That’s why we write programs that test that the original program is doing the right thing.

I’ve previously written about the tournament and my approach to writing an entry. In this article, I’d like to focus on one specific method I used in writing test code for my Entelect Challenge bot: property-based testing. The full source code for my bot is up on GitHub if you want to dig through the code to see how it works.

A quick recap: What is the game all about?

Each tower defence game is a one-on-one battle between two programs that have been written by two competitors. In each round, the programs need to start up, read the state of the game board and make a move - all within a strict two-second time limit.

Like a board game, the competition came with a set of rules explaining what moves a bot could make and the results a given move would have. The moves that the bots could choose from all revolve around building various towers on their side of the board:

  • Energy towers produce energy for you to build other towers,
  • Attack towers shoot missiles at your opponent and
  • Defence towers block opponent missiles.

You can see how these towers come together by watching the recording of the competition’s finals!

I programmed my bot with an algorithm called Monte Carlo Tree Search. It tries every valid move the bot can make, and simulates random games that start with that move. A single simulation using random moves doesn't tell you much, but that changes when you run a thousand simulations from the same starting move: together they tell you statistically whether that starting move was good.
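
In code, the heart of that idea looks something like the sketch below. This is not my bot's actual implementation: GameState, Move, random_move, apply, is_over and winner are hypothetical stand-ins for whatever your game representation provides.

fn score_move(state: &GameState, first_move: Move, simulations: u32) -> f64 {
    let mut wins: u32 = 0;
    for _ in 0..simulations {
        // Start every simulation from the current board, playing first_move first.
        let mut game = state.clone();
        game.apply(first_move.clone(), random_move(&game, Player::Opponent));

        // Then play random moves for both players until the game ends.
        while !game.is_over() {
            let my_move = random_move(&game, Player::Me);
            let opponent_move = random_move(&game, Player::Opponent);
            game.apply(my_move, opponent_move);
        }

        if game.winner() == Player::Me {
            wins += 1;
        }
    }
    // The fraction of random games won estimates how good first_move is.
    f64::from(wins) / f64::from(simulations)
}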

In order to do an effective Monte Carlo simulation, I needed my implementation of the game rules to conform to two requirements:

  1. It had to be fast enough to simulate many random games: my bot had to be able to say which move was statistically better within the two-second time limit. For this, I used profiling and benchmarking to improve performance.
  2. It had to implement the rules of the game correctly. If I messed up the game rules at all, my moves would be based on a buggy understanding of the game. This is where I used property-based testing.

What is property-based testing?

Property-based testing broadly falls into the category of automated software testing. It's easiest to understand by contrasting it with the much more common example-based testing. In example-based testing, you set up an example use case of your software. The programmer works out the expected result of running the example by hand, and then hard-codes it into the test.

For the tower defence game, I would then have written a test (sketched in code below) where:

  1. I start with an empty game board.
  2. Player A builds an attack tower at position (0, 0).
  3. Player B builds an energy tower at position (13, 2).
  4. Assert that the game board has the two new towers in the correct positions after the simulation. If this isn’t the case, the test fails.
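
Written against Rust's built-in test framework, that example test might look roughly like this. GameBoard, Move, TowerType and tower_at are illustrative names, not the real types from my bot.

#[test]
fn building_towers_places_them_on_the_board() {
    // 1. Start with an empty game board.
    let mut board = GameBoard::new();

    // 2 & 3. Player A builds an attack tower at (0, 0),
    //        Player B builds an energy tower at (13, 2).
    board.simulate(
        Move::build(0, 0, TowerType::Attack),
        Move::build(13, 2, TowerType::Energy),
    );

    // 4. Assert that the two new towers are in the correct positions.
    assert_eq!(board.tower_at(0, 0), Some(TowerType::Attack));
    assert_eq!(board.tower_at(13, 2), Some(TowerType::Energy));
}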

In software, it's very easy to accidentally break existing functionality without realising. Sometimes a small change in one part of the system has knock-on effects that you didn't know about, and your little fix on one side of the codebase breaks something on the other side. These example tests are super useful for checking that things still work. They are also quite easy to apply to many situations because they are typically directly related to how a person might use the code being tested.

The problem with example-based testing is that it requires the programmer to work out the results of each example by hand. In the tower defence game, each player has up to 192 valid moves that they can make because they can build 3 different towers in 64 different spots (64x3=192). That means that there are 36,864 different combinations of moves to test because each of the two players has 192 valid moves (192^2=36,864). If you take the state of the game board before the move into account, then there are even more cases to handle! I think it’s safe to say that no human would be willing to write example test cases to handle all of those possible inputs.

Property-based testing is a different way of approaching tests. Instead of working with specific examples, you generate random inputs for your program. Calling your function with random inputs means that you can’t work out the correct answer beforehand and hard code it into your test. Instead, you assert that certain properties of your system hold true for all inputs. Since the computer is generating the test cases, you can test a few thousand different cases at a time.

For example, if you’re testing a function that sorts a list, you might write property-based tests that take in a random list, sort it, and assert properties like:

  • The output list has the same number of elements as the input list
  • All of the elements in the input list appear in the output list
  • The output list is sorted

No matter what your random input list is, those properties should always hold for a function that sorts a list. If you check this for enough random input lists, you will probably catch most errors in your sorting algorithm.
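
As a concrete sketch, here is how those three properties could be written in Rust using the proptest crate (any property-based testing library would do; this is just one way to express them):

// [dev-dependencies] proptest = "1"
use proptest::prelude::*;
use std::collections::HashMap;

// Count how many times each element appears, so we can compare multisets.
fn counts(xs: &[i32]) -> HashMap<i32, usize> {
    let mut counts = HashMap::new();
    for &x in xs {
        *counts.entry(x).or_insert(0) += 1;
    }
    counts
}

proptest! {
    #[test]
    fn sorting_properties_hold(input in prop::collection::vec(any::<i32>(), 0..100)) {
        let mut output = input.clone();
        output.sort();

        // The output list has the same number of elements as the input list.
        prop_assert_eq!(output.len(), input.len());

        // All of the elements in the input list appear in the output list.
        prop_assert_eq!(counts(&input), counts(&output));

        // The output list is sorted.
        prop_assert!(output.windows(2).all(|pair| pair[0] <= pair[1]));
    }
}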

For my tower defence game bot, there were two properties that I wanted to focus my attention on:

  1. For all moves made by either player and all initial game board states, my bot should simulate the next move correctly.
  2. When simulating a game all the way to the end with random moves, my bot must only be making valid random moves.

My bot should simulate the next move correctly

When Entelect released the rules for the challenge, they also released the source code for their game runner: the program that is used to run the game between two bots in the tournament.

I couldn't just copy the code from their game runner and embed it into my code. It obviously implemented the whole game correctly, but my Monte Carlo Tree Search algorithm had much stricter performance requirements for the simulation: while following the rules identically, I needed my code to be as fast as possible.

Given the same game board and player moves, my program had to produce the same next game board as the official implementation.

To ensure that this property held, I wrote a test (sketched below) that:

  1. Opened a game replay that had been written by the official game runner
  2. Read each game state in the replay and the moves that the players made
  3. Simulated that move
  4. Checked that the result was identical to the next state in the replay
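
In outline, the test looked something like this. The replay-reading helpers, the path and the simulate method are placeholders for illustration rather than my bot's real API.

#[test]
fn simulation_matches_the_official_replay() {
    // A replay directory written out by the official game runner (placeholder path).
    let replay = read_replay("tests/replays/example-match");

    for round in replay.rounds() {
        // Read the game state and the moves both players made in this round...
        let mut state = round.starting_state();

        // ...simulate that move with my implementation of the rules...
        state.simulate(round.player_move(), round.opponent_move());

        // ...and check the result against the next state in the replay.
        assert_eq!(state, round.next_state(),
                   "simulation diverged in round {}", round.number());
    }
}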


This test caught several bugs that I wouldn’t have discovered otherwise.

The exact order in which you apply the rules matters: things like updating the cooldown on an attack tower before or after the attack towers fire can result in missiles being fired a round early or late. My property-based test caught a few cases where I was doing this wrong. I would not have caught these differences with example-based tests, because the problem was in my understanding of the rules, and any examples I wrote would have been based on the same misunderstandings.
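
To make that concrete, a simulation step tends to boil down to a sequence of phases along these lines. The phase names and their order here are purely illustrative; the point is that every phase has to happen in exactly the same order as the official engine.

fn simulate_round(state: &mut GameState, player_move: Move, opponent_move: Move) {
    // The order of these phases has to match the official game runner exactly.
    // Swapping, say, the cooldown update and the firing phase makes missiles
    // fire a round early or a round late.
    apply_build_commands(state, player_move, opponent_move);
    update_construction_timers(state);
    update_attack_tower_cooldowns(state);
    fire_missiles(state);
    move_missiles_and_apply_damage(state);
    update_game_status(state);
}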

I also found a situation, which doesn't come up much in practice, where the official implementation behaved strangely. This happened when a new tower called the "Tesla tower" was first introduced to the game. The Tesla tower is a powerful but expensive tower that could destroy an entire row of enemy towers at once. Unfortunately, it also cost so much energy and took so long to construct that in most games it just wasn't built. In fact, some bots in the finals ignored it altogether.

The Tesla tower's initial implementation in the official game engine had an odd edge case: it could not hit an opponent unless there was at least one tower for it to destroy as well. Usually you'd only use a Tesla tower if it could hit your opponent's towers, so this situation wasn't specifically mentioned in the rules. I asked about the strange behaviour on the Entelect Challenge forums, which led to an update to the official implementation. This benefited me in the same way that finding issues in my own code did: in the end, my interpretation of the rules had to match the official game runner to be correct. The code change in this case just fell on Entelect instead of me.

This, of course, is only one way of discovering misunderstandings while programming. Another way that works well is collaborating closely with other people. In my day job, we make heavy use of pair and mob programming to help catch issues as early as possible and to spread knowledge about our system as we go. After all, two sets of eyes on a problem are more likely to catch issues than one.

My bot can only make valid random moves

Because of the Monte Carlo Tree Search algorithm, my bot’s decision-making was heavily reliant on simulating many random games. The more random games I could simulate, the better my bot was at picking the statistically best next move.

Unfortunately, as I tweaked my performance to play as many random games as possible, I found my code became less readable. It got to its worst point when my selection of a random move depended on code I adapted from an article called Bit Twiddling Hacks. It’s a very clever piece of bitwise code that I managed to translate from the original C reference into Rust, but could not understand at all.

This code, with all of its bitwise shifts and overflowing arithmetic, somehow returns the index of the n-th unset bit very quickly. I had the game board represented as a bitfield. The unset bits were free to build on, so using this function I could choose a random empty spot on the board:

fn find_bit_index_from_rank(occupied: u64, n: u64) -> u8 {
    // Invert the board so that the free (unset) positions become set bits.
    let v = !occupied;

    // The reference routine selects a bit by its rank counted from the most
    // significant end, so derive that rank from n and the number of free bits.
    let mut r = u64::from(v.count_ones()) - n;

    // A parallel bit count (popcount), keeping every intermediate step: a, b,
    // c and d hold the set-bit counts of each 2-, 4-, 8- and 16-bit block.
    let a: u64 =  v - ((v >> 1) & (!0u64/3));
    let b: u64 = (a & (!0u64/5)) + ((a >> 2) & (!0u64/5));
    let c: u64 = (b + (b >> 4)) & (!0u64/0x11);
    let d: u64 = (c + (c >> 8)) & (!0u64/0x101);
    let mut t: u64 = (d >> 32) + (d >> 48);

    // A branchless binary search down those partial counts: each pair of
    // lines narrows the range in which the bit with rank r can lie.
    let mut s: u64 = 64;
    s -= (t.wrapping_sub(r) & 256) >> 3;
    r -= t & (t.wrapping_sub(r) >> 8);
    t  = (d >> (s - 16)) & 0xff;
    s -= (t.wrapping_sub(r) & 256) >> 4;
    r -= t & (t.wrapping_sub(r) >> 8);
    t  = (c >> (s - 8)) & 0xf;
    s -= (t.wrapping_sub(r) & 256) >> 5;
    r -= t & (t.wrapping_sub(r) >> 8);
    t  = (b >> (s - 4)) & 0x7;
    s -= (t.wrapping_sub(r) & 256) >> 6;
    r -= t & (t.wrapping_sub(r) >> 8);
    t  = (a >> (s - 2)) & 0x3;
    s -= (t.wrapping_sub(r) & 256) >> 7;
    r -= t & (t.wrapping_sub(r) >> 8);
    t  = (v >> (s - 1)) & 0x1;
    s -= (t.wrapping_sub(r) & 256) >> 8;
    s = 65 - s;

    // Convert the 1-based position from the reference code into a 0-63 bit index.
    64 - s as u8
}
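
For illustration, picking a random free cell with this function might look like the sketch below, using the rand crate. The helper name is made up, and it assumes the board has at least one free cell.

use rand::Rng;

fn random_free_position(occupied: u64) -> u8 {
    // Every unset bit in `occupied` is a free cell.
    let free_count = u64::from(occupied.count_zeros());
    // Pick the n-th free cell uniformly at random, then map n back to a bit index.
    let n = rand::thread_rng().gen_range(0..free_count);
    find_bit_index_from_rank(occupied, n)
}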

Since I didn't understand how it worked, the chances of me implementing and using it correctly the first time were low. It was crucial that my tests would catch it if my bot ever proposed an invalid move. I decided to let property-based testing help me here as well.

The second property that my program needed to maintain was that my random move generator only ever returned valid moves.

  1. While I was implementing the game rules, I watched out for any cases of invalid moves. For example, a valid move wouldn’t try to build in an occupied spot on the board, and a valid move wouldn’t build an expensive tower if the player didn’t have enough energy to pay for it.
  2. When I encountered these situations, I added test assertions directly into the code. Rust has a macro called debug_assert which is only included in debug builds, so it would have no performance impact in the actual tournament (see the sketch after this list).
  3. I then created a test case which just ran my bot for a few seconds. My Monte Carlo Tree Search ran as many random games as it could for those few seconds.
  4. As the bot ran, if it happened to encounter any situation where the random move generation gave it an invalid move, it would hit one of the test assertions and the test would fail.
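
A sketch of what one of those assertions could look like inside the move-applying code, with illustrative names for the state and its fields:

fn build_tower(state: &mut GameState, pos: Point, tower: TowerType) {
    // debug_assert! is compiled out of release builds, so these checks cost
    // nothing in the tournament but fail loudly while running the tests.
    debug_assert!(!state.is_occupied(pos),
                  "invalid move: building on an occupied cell {:?}", pos);
    debug_assert!(state.player.energy >= tower.cost(),
                  "invalid move: not enough energy for {:?}", tower);

    // ...place the tower and subtract the energy...
}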

As predicted, this actually failed the first few times. After tweaking the random move generation for a while, I eventually ironed out all of the bugs and ended up with something that was both fast and correct.

What difficulties did I encounter?

These two tests are very high-level: a small number of test cases that touch almost all of the system. This is useful in that it can find many issues, but when you do find issues, the bug could be anywhere!

This isn't a problem with property-based testing generally; it's a problem with my specific property-based tests. I took the lazy approach and just wrote two tests that both touched the whole system. I would have gotten more specific feedback if I'd written more tests that each examined a subset of the rules. For example, if I had written a test that generated random missiles on a board without towers, I would have known whether the problem was in my missile movement function.

You can run into this same problem with example-based testing, when you are writing examples that touch the whole system. Generally speaking, the way to avoid this problem is to rather write more tests, each testing a small isolated part of the system. It’s still valuable to have the test that hits the whole system at once, to make sure that the individual pieces work well together. Martin Fowler refers to this advice as the Test Pyramid: you want to have most of your tests run on small units of your code in isolation, and fewer tests that bring your whole system together.

The second problem my tests gave me, closely related to the first, is that I didn't have a good visualisation of the game boards. When a test failed, saying a game board didn't look the way it expected, it printed a gigantic string describing the two game boards. A failing test looked something like this:

Expected game board

BitwiseGameState { status: Continue, player: Player { energy: 20, health: 100, unconstructed: [UnconstructedBuilding { pos: Point { index: 15 }, construction_time_left: 0, building_type: Energy }], buildings: [144115188092706817, 0, 0, 0], occupied: 144115188092739585, energy_towers: 144115188092698625, missile_towers: [0, 0, 0, 8192], firing_tower: 0, missiles: [(32768, 0), (0, 0), (0, 0), (0, 0)], tesla_cooldowns: [], iron_curtain_available: false, iron_curtain_remaining: 0 }, opponent: Player { energy: 20, health: 100, unconstructed: [UnconstructedBuilding { pos: Point { index: 1 }, construction_time_left: 0, building_type: Energy }], buildings: [562949953422097, 0, 0, 0], occupied: 562949953422099, energy_towers: 562949953422081, missile_towers: [0, 0, 0, 16], firing_tower: 0, missiles: [(64, 0), (0, 0), (0, 0), (0, 0)], tesla_cooldowns: [], iron_curtain_available: false, iron_curtain_remaining: 0 }, round: 11 }

Actual game board

BitwiseGameState { status: Continue, player: Player { energy: 20, health: 100, unconstructed: [UnconstructedBuilding { pos: Point { index: 15 }, construction_time_left: 0, building_type: Energy }], buildings: [144115188092706817, 0, 0, 0], occupied: 144115188092739585, energy_towers: 144115188092698625, missile_towers: [0, 0, 0, 8192], firing_tower: 0, missiles: [(16384, 0), (0, 0), (0, 0), (0, 0)], tesla_cooldowns: [], iron_curtain_available: false, iron_curtain_remaining: 0 }, opponent: Player { energy: 20, health: 100, unconstructed: [UnconstructedBuilding { pos: Point { index: 1 }, construction_time_left: 0, building_type: Energy }], buildings: [562949953422097, 0, 0, 0], occupied: 562949953422099, energy_towers: 562949953422081, missile_towers: [0, 0, 0, 16], firing_tower: 0, missiles: [(32, 0), (0, 0), (0, 0), (0, 0)], tesla_cooldowns: [], iron_curtain_available: false, iron_curtain_remaining: 0 }, round: 11 }

As you can see, the difference between the two was hard to spot and understand. If you’re going to do what I did and test your entire system in a small number of massive tests, think carefully about how your test should report its failures. You’ll save time if the test failure tells you exactly what the difference is between the expected and actual game.
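
One low-effort option is the pretty_assertions crate: its assert_eq! is a drop-in replacement for the standard macro and prints a coloured, line-by-line diff of the two values' Debug output, so only the differing fields stand out. A tiny self-contained demonstration (the Missile struct is just an example, and the assertion fails on purpose to show the output):

use pretty_assertions::assert_eq;

#[derive(Debug, PartialEq)]
struct Missile {
    x: u8,
    y: u8,
}

#[test]
fn failure_output_demo() {
    // This fails deliberately: instead of two walls of text, the failure
    // message is a diff with only the differing field highlighted.
    assert_eq!(Missile { x: 15, y: 0 }, Missile { x: 14, y: 0 });
}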

The third difficulty I encountered was uncovering the properties that hold true for all inputs. You don't always have an implementation that is known to be correct on hand. To help with this, Scott Wlaschin has a list of patterns which appear in property-based testing. Reading through this list makes it easier to spot where they could be used in your own programming.

How did it work out in the end?

When it came time for the competition, I was reasonably sure that my implementation matched the rules of the official game engine. I had discovered several places where I'd misunderstood the rules of the game and fixed them. As such, I knew that my bot would simulate into the future both quickly and correctly.

I was also reasonably sure that my program was stable enough to handle whatever game states were thrown at it: it would keep simulating random moves into the future without ever trying to simulate an invalid move.

This doesn't mean that my bot was completely bug free (is any software ever COMPLETELY bug free?), but these two properties together gave me confidence. In the end, my bot came in third place, and I'm really happy with that result. If I hadn't done this testing, I doubt I would have made it into the finals at all!


Justin Worthe is a software engineer with an interest in music, games, good coffee, and using programming to get stuff done. He can often be found trying to find ways to play with all of these interests simultaneously. In the meantime, you can catch him on Twitter or Github.

