Property testing in depth: The size parameter

My work in property testing has recently allowed me to focus on some deeper parts. This post explains the size parameter of a property testing generator, and expands on its tradeoffs.

The size parameter

A QuickCheck generator

A property testing generator uses two important things when generating a value. The first is a seed for deterministic randomness, and the second is the so-called size parameter.

newtype Gen a = MkGen
  { unGen
  :: QCGen -- ^ The seed
  -> Int -- ^ The size parameter
  -> a -- ^ The generated value
  }

If you're curious as to why the Gen is not a state monad, that's because QuickCheck generators are splittable indefinitely, so they do not need to be threaded around.

The purpose of a size parameter

The size parameter is used to guide what kind of values are generated, in particular with respect of the size of the values to generate. The exact semantics that this guidance should follow is unclear and most of that design space is unexplored. However, there are some desired attributes of this guidance:

An increasing size parameter should vaguely make for values that are more likely to make complex incorrect functions fail.
It should be possible to use the size parameter to ensure that generated values do not become arbitrarily large.
The size parameter feature should not make false negatives any more likely than is necessary for generation to work.

There are many unanswered questions in this design space. In this post we will explore two of them.

Should the size parameter of a composite structure distribute the total size among components? How?

For example, when generating a list of values, what size should we give to the generator of the elements of the list?

One option is to generate the size of the list using the size parameter, and then generating elements of the same size. This is what QuickCheck does:

listOf :: Gen a -> Gen [a]
listOf gen =
  sized $ \n -> do
    k <- choose (0,n)
    replicateM k gen

The advantage of this approach is that it is a). simpler and b). it generates lists with bigger elements, which are more likely to cause test failures.

The disadvantage of such an approach is that composite structures become exponentially bigger the more layers they contain. For example, generating a list of lists will result in a list of a quadratic size while generating a list of lists of lists will result in a list of cubic size. This becomes a real problem when values become too big for your test suite to have a predictable and reasonable runtime.

Another option is to distribute the total size across among the elements of the list. This is what genvalidity does:

genListOf :: Gen a -> Gen [a]
genListOf func =
    sized $ \n -> do
        pars <- arbPartition n
        forM pars $ \i -> resize i func

Here, the size parameter is first partitioned into a list of size parameters that sum to the original size parameter. This sizes in this partition are then used to generate the elements of the list.

The advantage of this approach is that both value size and test suit runtimes stay both predictable and manageable. The disadvantage is that some false negatives may occur if the success of a property test is too dependant on the size parameter. The reasoning is that testing with smaller values is better than no testing at all because the test suite takes too long to run.

Should the size parameter influence the values that are generator in the case of fixed-byte-size types?

When generating an Int, should the size parameter influence the magnitude of the integer?

One answer is "Yes, because an Int will be used to generate the size of a list.". Another answer is "No, because otherwise there will be real false-negatives.". In particular, see the following:

$ stack ghci  --package QuickCheck
Prelude> import Test.QuickCheck
Prelude Test.QuickCheck> quickCheck $ \i -> i < 100
+++ OK, passed 100 tests.

Both QuickCheck and genvalidity currently use the same generator for Int values. This generator generates values between - size and + size.

I am currently experimenting with having genvalidity always generate values between minBound and maxBound, irrespective of the size parameter. If nothing too terrible happens, genvalidity will start doing that in the future. It could be that there are real problems with respect to runtimes of test suites.

genvalidity already uses this second approach in the case of Char, with a special case for generating extra annoying characters like UTF16 surrogate code points. QuickCheck on the other hand, generates mostly ASCII characters, and never UTF16 surrogate code points. The reasoning is that property testing should show the programmer what could go wrong as much as possible, and not ignore any problems just because they are rare (by default).