How Hypothesis and VCR Changed the Way I Do Testing in Python

03 August 2022, by Jethro Muller

A key concern for any developer is preventing new changes from breaking existing systems. However, it’s hard to know exactly what is going to break your systems because it’s difficult to test all the possibilities. Hypothesis and VCR make it easier to do this. In this article, I’ll discuss the benefits and drawbacks that I’ve found in starting out with these tools, by using examples from my real-world tests.


I work as a software developer at Flickswitch, a business service provider in the telecommunications space. At Flickswitch, one of the services we provide is a recharge API, Hotsocket, which allows for recharges across various networks throughout the world.

Hotsocket is a fairly old product and was originally written in Java. It has served Flickswitch well over the many years that it has been offered, providing a simple interface to service high-volume API customers.

Unfortunately, it was unstable and caused many headaches for the support and development teams. Often, recharges would be submitted but their state would not update to indicate that they had been processed, which caused issues for customers. The problem would then escalate and eventually require manual intervention. Similarly, adding a new user to the platform required cloning the repo and running a specific function with the user’s information to create all the correct table entries and insert the account details in the correct places. Again, this used the time of both the support and development teams.

I joined Flickswitch in January 2017, along with three other developers. After having completed our first major project, we were tasked with rewriting the Hotsocket API using GraphQL. When we set out to rebuild the functionality of Hotsocket, our goal was threefold: we wanted to improve its speed and reliability, as well as the transparency of the system.

On a practical level, we needed the new API to:

Handle the peak load requirements of our biggest customers
Handle errors properly
Provide clear logging to make debugging issues simple

The new GraphQL version of the Hotsocket API provided a more flexible experience for both internal and third-party use. Unfortunately, the use of GraphQL meant that customers with existing integrations had to update those integrations to keep working with us.

The solution was to build a drop-in replacement for the old Hotsocket API that leveraged the new GraphQL API’s capabilities and would generally be more reliable in the long-term.

Ultimately, building the drop-in replacement system was relatively easy. We used py2graphql to talk to the new API, Flask to handle routing and requests, and Google’s DataStore to store extra data not suited for the new API’s database.

The problem lay in how we could test it thoroughly enough to ensure that replacing the old system wouldn’t prevent our customers from recharging their SIMs.

Enter Hypothesis and pytest-vcr, two tools that proved incredibly useful in helping us write these tests. Here is how we worked through testing our new system and some of the benefits and drawbacks that we experienced using these tools.

The testing phase

The problem

To be confident in the API proxy we had built, we knew that we needed to write tests for the various critical functions, and then for the full flow in order to ensure that there were no errors with the validation.

This approach, while slow, could provide a reasonable basis for trust in the system. Our existing methods for writing tests were very much manual: we were mocking out functions that created outbound calls and manually generating test cases for different inputs. This created a large amount of boilerplate code and meant setting up mock objects correctly for every test case. This, coupled with needing to manually determine and write out the possible problematic and expected test cases, used valuable developer time.

The solution

Enter the two tools: Hypothesis, a property-based testing framework, and pytest-vcr, a tool that records and replays requests.

These tools were ideal for our project as we were calling our GraphQL Hotsocket API in almost all the functions we had to test, while the large number of possible inputs required that a large number of tests be written to cover just the acceptable inputs – not even accounting for edge cases.

This is where pytest-vcr and Hypothesis shine.

Hypothesis provides a convenient way to define test cases that simply cover a large range of inputs. It also allows you to define “contracts” for your test inputs. These definitions are then turned into example tests at runtime.

This example from the readme on GitHub helps illustrate the power of this kind of system:

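from hypothesis import given, strategies as st

# mean() is the function under test; it is not shown in this excerpt.
@given(st.lists(
  st.floats(allow_nan=False, allow_infinity=False), min_size=1))
def test_mean(xs):
    assert min(xs) <= mean(xs) <= max(xs)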

Falsifying example: test_mean(
  xs=[1.7976321109618856e+308, 6.102390043022755e+303]
)

In the above example, the decorator provides the definition of all possible inputs to the test. It says to expect a list of at least one float, none of which is NaN (not a number) or infinity. Values that meet those criteria are generated and passed into the test. The falsifying example is the counterexample Hypothesis reports when one of those generated inputs makes the assertion fail.
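To see how two perfectly finite floats can break a property like this, it helps to reproduce an overflow by hand. The sketch below assumes the function under test is a naive `sum(xs) / len(xs)`, which the excerpt does not show. The names `naive_mean` and `scaled_mean` and the input values are illustrative only.

```python
import math


def naive_mean(xs):
    """Sum first, then divide: the sum can overflow to infinity."""
    return sum(xs) / len(xs)


def scaled_mean(xs):
    """Divide each term first so the running total stays in range.

    One possible mitigation for overflow only; it can still lose precision.
    """
    n = len(xs)
    return sum(x / n for x in xs)


# Two finite floats whose sum exceeds the largest representable float.
xs = [1.5e308, 1.5e308]

overflowed = naive_mean(xs)   # infinite, so it is greater than max(xs)
in_range = scaled_mean(xs)    # stays finite for this input

print(math.isinf(overflowed), min(xs) <= in_range <= max(xs))
```

Neither input is NaN or infinity, so both satisfy the strategy in the decorator, yet the naive mean still lands outside the range of the inputs.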

Pytest-vcr is very simple to understand. A test decorated with @pytest.mark.vcr() will automatically record any requests made in the test to a YAML file, which allows pytest-vcr to fully recreate the request-response cycle the next time that test is called. Pytest-vcr removes the need to write explicit network mocks and lets the API calls resolve naturally instead.

These two tools helped us write succinct tests that very accurately represented the class of inputs and responses we needed to handle.

The rest of the article explores the challenges posed by a few tests written for the Hotsocket API proxy using Hypothesis and pytest-vcr. It also covers cases where the tools weren’t applicable and how we got around their limitations.

The tests we had to work on

Initial validation check

The first part of the API proxy that needed to be tested was the input validation. We needed to ensure that, for all valid inputs, the validation checks passed. If they failed for valid inputs they would block customers’ recharges.

The Problem

Traditionally, we would have used pytest.mark.parametrize – a decorator that allows inputs to be enumerated and then creates test cases for each. An example of this is as follows:

@pytest.mark.parametrize('x, y, answer', [(1, 1, 1), (2, 3, 6)])
def test_multiply_gets_correct_answer(x, y, answer):
    assert x * y == answer

Here, you can see how the test cases have to be manually determined and written out. Each test case has to be figured out by the developer. Determining the most common flows is not hard, but fully capturing the edge cases, and cases likely to cause errors, is near impossible. Even without that, writing out the required cases for an API that is meant to be flexible would require many, many examples to prove it was a safe replacement for the old API.

The Solution

Using Hypothesis, we were able to succinctly capture the requirements of how our validation checks should behave. An example of that is a test that ensures that the initial validation check succeeds for any truth-y value, as long as it isn’t an airtime recharge.

Stated as text, the requirement is that, for any recharge with a positive non-zero denomination, of SMS or Data, on Vodacom, MTN, Cell C, or Telkom, the initial validation check will return True.

from hypothesis import given, strategies as s

# recharge and the network/product constants come from the project under test.
class TestInitialPurchaseCheck:
    @given(
        network_code=s.sampled_from([VODACOM, MTN, CELLC, TELKOM]),
        product_code=s.sampled_from([DATA, SMS]),
        denomination=s.integers(min_value=1),
    )
    def test_truthy_inputs_if_not_airtime_purchase_returns_true(
        self,
        network_code,
        product_code,
        denomination,
    ):
        assert recharge.initial_purchase_check(network_code, product_code, denomination) is True

This test, with its single assert, exercises the full range of valid initial conditions and, in my opinion, reads fairly easily. Hypothesis samples from that range on each run rather than enumerating every permutation. This type of test is the main reason we decided to use Hypothesis for this project.

The previous method, of manually determining test cases and then creating the permutations manually or programmatically, created annoying boilerplate code that only served to make the test harder to understand.
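For contrast, the manual approach described above might look roughly like the sketch below. It is not the code we had: the constants' values, the hand-picked denominations, and the stand-in `initial_purchase_check` are all assumptions made so the example runs on its own.

```python
from itertools import product

# Hypothetical stand-ins for the real network and product constants.
VODACOM, MTN, CELLC, TELKOM = "vodacom", "mtn", "cellc", "telkom"
DATA, SMS, AIRTIME = "data", "sms", "airtime"


def initial_purchase_check(network_code, product_code, denomination):
    """Stand-in for the real check: accept non-airtime, positive amounts."""
    return (
        network_code in {VODACOM, MTN, CELLC, TELKOM}
        and product_code in {DATA, SMS}
        and denomination > 0
    )


# Every case has to be chosen by hand; the denominations are only a sample.
NETWORKS = [VODACOM, MTN, CELLC, TELKOM]
PRODUCTS = [DATA, SMS]
DENOMINATIONS = [1, 5, 100, 10_000]

CASES = list(product(NETWORKS, PRODUCTS, DENOMINATIONS))


def test_truthy_inputs_if_not_airtime_purchase_returns_true():
    for network_code, product_code, denomination in CASES:
        assert initial_purchase_check(network_code, product_code, denomination) is True


test_truthy_inputs_if_not_airtime_purchase_returns_true()
print(len(CASES))
```

Even this small grid produces 32 cases and says nothing about denominations outside the four listed, which is the gap that a strategy such as `s.integers(min_value=1)` closes.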

The Drawbacks

The tests that can be created when using Hypothesis are amazingly simple and allow us to check that the code under test behaves correctly across a whole class of inputs. This property is great; however, it means that tests of slower functions take longer, because Hypothesis generates a large number of test cases for each test (100 by default).

This didn’t affect the project much as most of the functions we needed to test weren’t computationally intensive, but the login functionality that tested passwords does take a few seconds to run because of all the test cases Hypothesis generates.

Similarly, something we didn’t experience, but which seems like it could prove problematic, is using Hypothesis with ORMs such as SQLAlchemy. This can mostly be solved by using plugins like hypothesis-sqlalchemy. The Hypothesis project has Django support built in, which is great; however, it only seems to work with tests that use the special version of TestCase that Hypothesis provides.

Outbound Network Calls

The Problem

At Flickswitch, we test everything we write. According to codecov.io we have 97% test coverage with over 2200 tests just for SIMcontrol, our main product.

Our tests allow us to upgrade, change, and deploy our code with confidence.

This peace-of-mind serves as great motivation to find better ways of writing more accurate tests that better replicate real-world usage.

Many of our tests require making outbound network calls to third-party services. This is problematic in a test environment because ideally, you’d have a test version of the service to call, but often that is not possible. Even with a test service, making outbound network calls is slow and, as a test-suite grows, speed becomes a necessary consideration.

Traditionally we’ve used the unittest.mock.patch decorator and the responses library to prevent outbound network calls – using mocks to instead return fake data. This is great for test speed; however, it does not accurately model reality. Slight deviations in the dummy return data can cause your tests to pass and your production code to fail.
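A small sketch of how that can happen, using only `unittest.mock`. Everything here is illustrative: `ApiClient`, `get_balance`, and the fake payload are hypothetical, and the "real" response shape is borrowed from the recorded cassette shown later in this article, where `account` is a list.

```python
from unittest.mock import patch


class ApiClient:
    """Hypothetical client; the real one would make an HTTP request."""

    def fetch_account(self):
        raise RuntimeError("no network access in tests")


def get_balance(client):
    # Written against the hand-made fake below, where "account" is a dict.
    return client.fetch_account()["data"]["account"]["balance"]


# Hand-written dummy data: "account" is a dict here...
FAKE_RESPONSE = {"data": {"account": {"balance": 1444.0}}}
# ...but the recorded API response wraps it in a list.
REAL_RESPONSE = {"data": {"account": [{"balance": 1444.0}]}}


def balance_with(payload):
    """Run get_balance with fetch_account patched to return payload."""
    with patch.object(ApiClient, "fetch_account", return_value=payload):
        return get_balance(ApiClient())


# The test built on the fake passes.
mocked_result = balance_with(FAKE_RESPONSE)

# The same code fails on the real shape: a list cannot be indexed by "balance".
try:
    balance_with(REAL_RESPONSE)
    real_shape_failed = False
except TypeError:
    real_shape_failed = True

print(mocked_result, real_shape_failed)
```

The mocked test stays green, while the code would raise as soon as it met a response shaped like the real one.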

The drop-in replacement for the old API is essentially a tool that translates old API calls into the style of the new API. This means that, for almost every function we had to test in the drop-in replacement, we would have to manually mock out the network call and define fake return data.

The Solution

Previously, we would have used responses. Using responses involves determining the URL and HTTP verb of the request, and then setting up a responses Response for that. The Response is what the test receives in place of a real reply, and it includes the data you wish to return. This works well when the data that results from the request isn’t too important and it isn’t likely that the request will change. Below is an example of how that would work:

import responses

# API_URL and make_api_call() belong to the code under test.
@responses.activate
def test_failed_api_request_returns_none():
    responses.add(
        responses.GET,
        API_URL,
        status=500,
    )

    result = make_api_call()

    assert result is None

This method works well for simple cases, but the assurances that we needed were more complicated and we didn’t want to manually specify the structure of the responses for each test. We needed something that abstracted that all away and handled it automagically.

To do this, we used pytest-vcr. It’s as simple as decorating a test with @pytest.mark.vcr(). Doing this records all outbound requests inside the test the first time that test is run. The recording of the requests is saved to a YAML file in a cassettes folder. These recorded request-response pairs are used when the test is run again if the requests inside it are the same.

We used this to our advantage by recording and then blocking outbound requests to our production GraphQL API so only the recorded responses were used. By setting @pytest.mark.vcr(record_mode='none'), it’s possible to make VCR raise an error if any requests not handled by the recorded cassette are made. We applied this to all our tests to prevent VCR from trying to create new cassettes while running tests in the CI environment or at build-time.
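The record-then-replay behaviour, and the effect of a replay-only mode, can be illustrated with a toy model. This is not how pytest-vcr or VCR.py is implemented. `MiniCassette`, its method names, and the in-memory dictionary standing in for the YAML file are all assumptions made for the sake of a short, runnable sketch. The two mode names are loosely modelled on VCR.py's `new_episodes` and `none` record modes.

```python
class UnrecordedRequestError(Exception):
    """Raised when replay-only mode sees a request with no recording."""


class MiniCassette:
    """Toy model of a cassette: maps a request key to a stored response."""

    def __init__(self, record_mode="new_episodes"):
        self.record_mode = record_mode
        self.interactions = {}

    def request(self, method, uri, body, send):
        key = (method, uri, body)
        if key in self.interactions:
            # Replay: the real sender is never called.
            return self.interactions[key]
        if self.record_mode == "none":
            # Replay-only: an unknown request is an error, not a network call.
            raise UnrecordedRequestError(f"{method} {uri}")
        # Record: make the real call once and remember the response.
        response = send(method, uri, body)
        self.interactions[key] = response
        return response


calls = []


def fake_network(method, uri, body):
    """Stands in for the real HTTP call and counts how often it is used."""
    calls.append(uri)
    return '{"data":{"account":[{"balance":1444.0}]}}'


cassette = MiniCassette(record_mode="new_episodes")
first = cassette.request("POST", "/graphql/", "query", fake_network)   # recorded
second = cassette.request("POST", "/graphql/", "query", fake_network)  # replayed

cassette.record_mode = "none"
try:
    cassette.request("POST", "/graphql/", "other query", fake_network)
    blocked = False
except UnrecordedRequestError:
    blocked = True

print(len(calls), first == second, blocked)
```

The network stand-in is called only once, the second identical request is served from the recording, and an unrecorded request in replay-only mode fails loudly instead of reaching the network.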

An example of a test where this was used is the following test of the query of a customer’s account balance. This is a very simple query, but it should illustrate how pytest-vcr works. It uses the pytest client fixture which is a custom fixture that creates a Flask test client.

@pytest.mark.vcr(record_mode='none')
def test_valid_request_returns_balance(self, client):
    username = 'username'
    token = 'token1'

    response = client.post('/balance', data={
        'as_json': True,
        'username': username,
        'token': token,
    })

    data = response.json
    assert data['response'] == {
        'status': '0000',
        'message': 'Balance lookup successful.',
        'running_balance': 1444.0,
    }

This test ensures both the data and structure of the data returned in the response is correct, given the specific JSON input to the drop-in replacement API. The incoming JSON request is converted to GraphQL, which is sent to the GraphQL API. The GraphQL API returns a response that is parsed and returned from the drop-in replacement as JSON.
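The response-side half of that translation can be sketched as follows. `to_legacy_balance_response` is a hypothetical name, and the real proxy presumably does more, such as error handling. Only the input body (the one recorded in the cassette further down) and the expected output (what the test above asserts) come from our project.

```python
import json


def to_legacy_balance_response(graphql_body):
    """Convert a GraphQL balance response body into the legacy JSON shape."""
    payload = json.loads(graphql_body)
    # In the recorded response, "account" is a list with one entry.
    account = payload["data"]["account"][0]
    return {
        "response": {
            "status": "0000",
            "message": "Balance lookup successful.",
            "running_balance": account["balance"],
        }
    }


recorded_body = '{"data":{"account":[{"balance":1444.0}]}}'
legacy = to_legacy_balance_response(recorded_body)
print(legacy["response"]["running_balance"])
```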

The cassette below shows the recorded request-response cycle from the GraphQL API.

interactions:
- request:
    body: '{"query": "query {\n  account {\n    balance\n  }\n}"}'
    headers:
      Accept: ['*/*']
      Accept-Encoding: ['gzip, deflate']
      Connection: [keep-alive]
      Content-Length: ['54']
      Content-Type: [application/json]
      User-Agent: [python-requests/2.19.1]
      simcontrol-api-key: [token1]
    method: POST
    uri: https://new.simcontrol.co.za/graphql/
  response:
    body: {string: '{"data":{"account":[{"balance":1444.0}]}}'}
    headers:
      Connection: [keep-alive]
      Content-Length: ['41']
      Content-Type: [application/json]
      Date: ['Tue, 30 Oct 2018 14:15:24 GMT']
      … (truncated for readability)
    status: {code: 200, message: OK}
version: 1

As you can see in the cassette, the outbound query that has been recorded is a query to our GraphQL API and the response is a GraphQL response with the appropriate data in the correct format. When the test is run in the future, the cassette will be used to replay the exact same request-response cycle each time, provided the outbound request doesn’t change.

This method was used because it enabled us to prevent any unexpected outbound queries.

This was particularly useful in our recharge testing, because any rogue requests in our test environment could be costly. It also provided the reassurance that the requests and responses themselves were handled correctly, which would not necessarily be the case if we were to use the method described earlier using responses.

The Drawbacks

Using pytest-vcr allowed us to develop tests incredibly quickly compared to the old responses way of doing things. However, it did not work for all test cases.

Unfortunately, Google DataStore, Google’s NoSQL database, didn’t play nicely with pytest-vcr. In my brief time trying to work out the issue, it seemed that the system that DataStore uses to authenticate requests is time-based, which prevented pytest-vcr from being able to use the recorded cassettes later.

We didn’t have the time to fully research ways around this issue and so we built a fake version of DataStore. The fake version is a simple class that provides the same interface as DataStore with the storage implemented in-memory. We made sure it was used in all our tests by using it to monkey-patch DataStore completely.
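A minimal sketch of that idea is shown below. The interface is invented for illustration: `get`, `put`, and `delete` are not Google's client API, and `FakeDataStore`, `storage`, and `save_extra_data` are hypothetical names. Our actual fake followed DataStore's own interface instead.

```python
from types import SimpleNamespace
from unittest.mock import patch


class FakeDataStore:
    """In-memory stand-in with a hypothetical get/put/delete interface."""

    def __init__(self):
        self._entities = {}

    def put(self, key, entity):
        # Store a copy so later changes by the caller don't leak in.
        self._entities[key] = dict(entity)

    def get(self, key):
        entity = self._entities.get(key)
        return dict(entity) if entity is not None else None

    def delete(self, key):
        self._entities.pop(key, None)


# Stand-in for the module that would normally hold the real client.
storage = SimpleNamespace(client=None)


def save_extra_data(recharge_id, data):
    """Code under test: it only knows about storage.client."""
    storage.client.put(recharge_id, data)
    return storage.client.get(recharge_id)


# Monkey-patch the client for the duration of a test.
with patch.object(storage, "client", FakeDataStore()):
    saved = save_extra_data("recharge-1", {"reference": "abc"})

print(saved)
```

Because the patch is scoped to the `with` block, the original client is restored afterwards and no test can reach the real service by accident.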

This solution didn’t feel good, but it seemed to be the only way to solve the problem we were having without spending too much time figuring out the interaction between Google DataStore and pytest-vcr.

It definitely violates the testing principle that we espoused earlier, of keeping the test system as close to real life as possible. However, in this case, finding another solution would have taken longer than creating a custom class to replicate the DataStore functionality locally.

Useful resources

Our experience with these tools has so far been very valuable, albeit fairly limited. It is entirely possible that what I’ve listed as shortfalls of Hypothesis and pytest-vcr already have solutions.

This is not meant to be a comprehensive guide and more information can be found for both Hypothesis and pytest-vcr on their respective GitHub pages.

Hypothesis: https://github.com/HypothesisWorks/hypothesis/tree/master/hypothesis-python
pytest-vcr: https://github.com/ktosiek/pytest-vcr

Hopefully, the examples I’ve shared have excited you about making improvements to tests in your own code base – the simpler and easier to write they can be, the better!

Jethro Muller is a software developer at Flickswitch. He primarily works in Python with Django on server-side code. He enjoys ORM query optimisation, building pipelines and tooling, optimising workflows, and playing with AI.