Why mocking is a bad idea

Date 2021-10-22

Mocking is a very common testing mechanism, and it is a bad idea. This post details why you should not use mocking, and why and how you should write integration tests instead.

TL;DR: Mocking provides false confidence by hiding real failures.

Disclaimer: This post uses example code in Haskell, but the same principles apply in other programming languages.

What is mocking

Writing a test using a mock involves using a fake version of some dependency instead of the real one. This is usually done in situations where that dependency should be irrelevant to the test.

For example, you can use an in-memory stand-in for a file system instead of interacting with a real file system. Or, instead of interacting with the real Facebook API, you can use a fake one that responds with what you tell it to respond with.

In diagram form:

Test -[calls]-> Code that we want to test <-[uses]-> Fake dependency

instead of 

Test -[calls]-> Code that we want to test <-[uses]-> Real dependency

Using mocks can have certain advantages, for example:

  • Using a fake in-memory file system is likely much faster than dealing with a real one.

  • Using a fake dependency lets you only think about the parts that you think are relevant.

  • Using a fake network response means your test cannot be flaky because of an unreliable network.

  • Creating certain rare situations is much easier in a fake system that you fully control than in the real system in which those situations are actually rare.

Mocking in Haskell

Developers have come up with many ways to do mocking in Haskell. The most popular methods include:

  • Passing around a record of functions (a "handle") that can be swapped out for a fake implementation.

  • Abstracting the effects behind mtl-style type-class constraints.

  • Writing the effects as a free monad and interpreting it differently in tests than in production.

I will not go into too much detail about how these work, but I will use the first method as an example to make my point. The same problems that I will point out occur with the other methods as well.

Why mocking is a bad idea

To argue that mocking is a bad idea, I will make two points:

  1. Making code mockable makes it more complex and thus more likely to be wrong.

  2. Mocking hides real bugs. It makes tests pass that would have failed if not for the fake objects.

Conclusion: Writing tests with mocking is worse than not writing those tests.

An example

Consider the following example of a very simple function that reads a file and then writes to the same file. This could be a naive way to refresh the modification time stamp, or a way to exercise the file system for load testing, but it does not really matter. It just happens to be a nice example.

refreshFile :: FilePath -> IO ()
refreshFile path = do
  contents <- readFile path
  writeFile path contents

The careful reader may have already noticed that this code contains a bug. Hint: you can find it by having a look at this list of Haskell's dangerous functions.

Suppose you have not noticed this bug yet. But maybe you are careful enough to write a test anyway.

You could write a test like this:

module RefreshSpec (spec) where

import Path
import Path.IO
import Refresh (refreshFile) -- assuming the code under test lives in a module called 'Refresh'
import Test.Syd

spec :: Spec
spec = describe "refreshFile" $
  it "does not change the contents of the file it refreshes" $ do
    withSystemTempDir "my-test" $ \tdir -> do
      path <- resolveFile tdir "dummy"
      let contents = "hello world"
      writeFile (fromAbsFile path) contents
      refreshFile (fromAbsFile path)
      actual <- readFile (fromAbsFile path)
      actual `shouldBe` contents

Note that this test would have found the bug. Indeed, the test suite fails with this result:

    test/RefreshSpec.hs:9
  ✗ 1 RefreshSpec.refreshFile.does not change the contents of the file it refreshes
      /run/user/1001/my-test-b09fb993f4b767ec/dummy: openFile: resource busy (file is locked)

(Because readFile is lazy, the file is still open for reading when writeFile tries to open the same file for writing, which is why you cannot actually write a refresh function like this.)
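For reference, here is one possible fix once a test like this has exposed the bug. This sketch is mine, not from the original post; it uses the strict readFile from Data.Text.IO (from the text package), so the whole file is read and the handle is closed before the file is opened again for writing:

import qualified Data.Text.IO as T

-- Data.Text.IO.readFile is strict: the read handle is already closed by
-- the time we open the same file again for writing.
refreshFile :: FilePath -> IO ()
refreshFile path = do
  contents <- T.readFile path
  T.writeFile path contents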

However, you decide to use mocking instead, so that you don't have to actually read or write any files during testing.

The big up-front refactor and extra complexity

You will need to refactor the original code so that the functions that you want to mock can be replaced. Then you need both a real version and the mocked fake version. You could do something like the following:

{-# LANGUAGE RecordWildCards #-}

import Control.Monad.State
import Data.Map (Map)
import qualified Data.Map as M
import Data.Maybe

-- | An abstract type that represents a way to deal with the filesystem
data FileSystemHandle m = FileSystemHandle
  { fileSystemReadFile :: FilePath -> m String,
    fileSystemWriteFile :: FilePath -> String -> m ()
  }

-- Our original code, now with an abstracted 'FileSystemHandle'.
refreshFile :: Monad m => FileSystemHandle m -> FilePath -> m ()
refreshFile FileSystemHandle {..} path = do
  contents <- fileSystemReadFile path
  fileSystemWriteFile path contents

-- The file system handle that we will use in production
ioFileSystemHandle :: FileSystemHandle IO
ioFileSystemHandle =
  FileSystemHandle
    { fileSystemReadFile = readFile,
      fileSystemWriteFile = writeFile
    }

-- The file system handle that we will use in testing
stateFileSystemHandle :: Monad m => FileSystemHandle (StateT (Map FilePath String) m)
stateFileSystemHandle =
  FileSystemHandle
    { fileSystemReadFile = \path -> fromMaybe (error "file does not exist") <$> gets (M.lookup path),
      fileSystemWriteFile = \path contents -> modify (M.insert path contents)
    }

Note that, for the sake of brevity, I am just passing in a dictionary here. You could do the same with any of multiple approaches, like mtl-style monad constraints, free monad interpreters, and many others. (You could also use an approach that uses ptrace to hijack syscalls, but I have never actually seen anyone use this type of mocking in practice, so I will ignore it here.)
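For comparison, the mtl-style constraint approach might look roughly like the following self-contained sketch. This is my own illustration, not from the original post or from any library; the class MonadFileSystem and its methods are made-up names, and the approach needs a comparable amount of extra machinery:

{-# LANGUAGE FlexibleInstances #-}

import Control.Monad.State
import Data.Map (Map)
import qualified Data.Map as M
import Data.Maybe (fromMaybe)

-- A hypothetical type class abstracting over the file system.
class Monad m => MonadFileSystem m where
  fsReadFile :: FilePath -> m String
  fsWriteFile :: FilePath -> String -> m ()

-- The instance used in production.
instance MonadFileSystem IO where
  fsReadFile = readFile
  fsWriteFile = writeFile

-- The instance used in testing, backed by an in-memory map.
instance Monad m => MonadFileSystem (StateT (Map FilePath String) m) where
  fsReadFile path = fromMaybe (error "file does not exist") <$> gets (M.lookup path)
  fsWriteFile path contents = modify (M.insert path contents)

-- The original code, now with a constraint instead of a handle argument.
refreshFile :: MonadFileSystem m => FilePath -> m ()
refreshFile path = do
  contents <- fsReadFile path
  fsWriteFile path contents

In the rest of this post I will stick with the handle-based version above; the same arguments apply to both.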

At this point, multiple alarm bells should be going off:

  1. This code is more than four times the size of the original code, requires extra dependencies, and is much more complex. The potential for getting it wrong is much greater (as we will see below, we already have), and it is much more difficult for newcomers to the code to understand.

  2. The type of refreshFile now lies because it cannot promise to behave correctly for every value of type FileSystemHandle that we can come up with. Indeed, you could pass in some very silly FileSystemHandle that does nothing and just returns empty values. In that case the refreshFile function will not do what it should.

For the record, you can solve the latter problem by using a GADT that enumerates the allowed approaches, like this:

data FileSystemApproach m where
  IOApproach :: FileSystemApproach IO
  StateApproach :: Monad m' => FileSystemApproach (StateT (Map FilePath String) m')

-- Our original code, now with an abstracted 'FileSystemApproach'.
refreshFile :: Monad m => FileSystemApproach m -> FilePath -> m ()
refreshFile approach path = do
  let FileSystemHandle {..} = case approach of
        IOApproach -> ioFileSystemHandle
        StateApproach -> stateFileSystemHandle
  contents <- fileSystemReadFile path
  fileSystemWriteFile path contents

As you can see, this problem of the lying type can be solved, but it does require even more extra complexity.

Note that Haskell at least allows you to do such a big refactor relatively safely. In other languages, particularly those without static types, the refactor itself would already be a big risk.

Suppose you are OK with all this extra complexity (even though you really should not be). Let's look at the other problems that this mocking approach causes.

Inaccurate mocks causing false negatives

Now that you've done the big and complex refactor, you get to write your test that uses mocking:

spec :: Spec
spec = describe "refreshFile" $
  it "does not change the contents of the file it refreshes" $ do
    let path = "dummy"
        contents = "hello world"
        beginState = M.singleton path contents
    endState <- execStateT (refreshFile stateFileSystemHandle path) beginState
    M.lookup path endState `shouldBe` Just contents

This test does indeed not touch any file system. However, and this is the big problem: this test does not catch the bug, because the bug only exists in the IO-based implementation. At this point you will be more confident in your code because your test passes, you deploy to production, and you experience the bug in production instead. The confidence you have gained through this test is false confidence. In this case, using a mock to test the code is actually worse than not testing the code at all, because if you hadn't tested the code, at least you wouldn't have any false confidence in it.

So this is the big issue with mocking, but it gets worse: You might think "well I can just ..." and you will still run into trouble because there is no way to figure out where the actual bugs will be. Those are unknown unknowns.

Another example: mocking external integrations

You may be thinking "Sure, mocking the file system is not a good idea because it is easy and cheap enough to just use the real file system, but what if you do not have control over the integration you want to mock?" Let's say that you integrate with an external API that you have no control over, like Stripe for payments. In that case you may think "I cannot write an integration test against Stripe, because I would have to make a real payment in my test". (Never mind that Stripe actually has a testing version of its API exactly for integration tests. Let's assume, for the sake of argument, that it doesn't.)

Your API call will probably look something like this:

My code -[function call]-> HTTP Library -[HTTP Request]-> Stripe Server -[HTTP Response]-> HTTP Library -[function response]-> My Code

Your idea may be to write a test with this architecture (with Hoverfly, for example):

My code -[function call]-> HTTP Library -[HTTP Request]-> Mock Stripe Server -[HTTP Response]-> HTTP Library -[function response]-> My Code

In this case you will specify that when your mock receives a given request, it will respond with a very specific response, just like you expect Stripe to respond.

This works fine if Stripe follows exactly the pattern that you expect it to follow. However, if Stripe changes its API, or if it responds in a different way than you expect it to, your test is hiding a failure case again.

Instead of using mocking to achieve a false sense of confidence that your code works, you can admit that you cannot truly be confident about integrating with an external service, and instead test the parts of your code that you do control.

You will want to write two tests, one for each end of the integration:

1. My code -[function call]-> HTTP Library -[HTTP Request]-> 
2. -[HTTP Response]-> HTTP Library -[function response]-> My Code

This way you still test the parts of the code that you can control, without falsely asserting that you know how the external service will react. You don't need to write any mocks, and you don't need any external tooling that adds unnecessary complexity to your tests!
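As a rough illustration, here is what those two tests could look like. This sketch is mine: renderChargeRequest and parseChargeResponse are made-up names for the request-building and response-parsing halves of your own client code, and the JSON body is a stand-in rather than Stripe's real response format:

module StripeClientSpec (spec) where

import Test.Syd

-- Hypothetical pure halves of the integration, i.e. the parts you control.
-- In real code these would live in your own Stripe client module.
renderChargeRequest :: Int -> String -> [(String, String)]
renderChargeRequest amountInCents currency =
  [("amount", show amountInCents), ("currency", currency)]

parseChargeResponse :: String -> Either String Bool
parseChargeResponse "{\"paid\":true}" = Right True
parseChargeResponse "{\"paid\":false}" = Right False
parseChargeResponse _ = Left "unrecognised response"

spec :: Spec
spec = do
  describe "renderChargeRequest" $
    it "renders exactly the parameters we intend to send" $
      renderChargeRequest 1000 "usd"
        `shouldBe` [("amount", "1000"), ("currency", "usd")]
  describe "parseChargeResponse" $
    it "parses a response body we have previously seen Stripe send" $
      parseChargeResponse "{\"paid\":true}" `shouldBe` Right True

Neither test needs a network connection or a mock server; they only exercise the code that you control.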

After all that, you can still have automated end-to-end tests and/or manual QA in the staging environment to make extra sure that the payment infrastructure behaves as expected.

Alternatives

So what should you do instead?

Test real code in a real environment instead of fake code in a pretend environment.

If your code does not use any state or resources, write a pure function. Simple enough: nothing to mock -> no need to mock anything.

If your code depends on a resource (like a file system, an internal service, a system resource, ...), then try to spin that resource up specifically for the test (ideally without test pollution) so that you can see how your system reacts in a real situation. See my post about test pollution for specific examples of this.

If your code depends on a resource that you do not control, you can write an end-to-end test instead. This way you observe your system the way your customers would, and you can still test it in staging.

If using end-to-end tests is also not an option, you still have other options.

For example, if someone needs to check that the integration with a third party provider "still feels snappy", then you can use manual testing before deploying from staging to production.

In another example, certain problems only become apparent in a larger scenario. In that case you could dogfood your product in staging.

In the extremely unlikely case that you have the luxury problem of needing to test what happens when 1000 payments are made via your external payment provider, you can use canary deployments to roll out your release to a small part of your user base before rolling it out to everyone.

Addendum: Mathematically rigorous mocking

At this point I hope that I have been able to convince you that you should not be using mocking. However, if you still insist that you want to do it, then you should at least understand the mathematical theory that underlies well-founded mocking.

We start by defining the set of all possible executions of a program $P$ as $E_P$. Indeed, a program can have many different executions depending on the inputs that it uses, including the current state of the world, time, and any unknown number of other inputs.

Next, we work towards a partial order on programs. We call anything that may fail in production and/or make a test fail a defect. For an execution $e \in E_A$ of a program $A$, call the set of defects that $e$ manifests $D_e$. We can then define the set of defects of the program $A$ as $D_A = \bigcup_{e \in E_A} D_e$.

Now it's very important to realise that $D_A$ is unknown and there is no way to find it out. That is to say: it's an unknown unknown. There is no way to know this set, there is no way to estimate it, there is no way to have an intuition for it. The only thing we can do is show that it contains certain elements. We can never show that it does not have a certain element.

Next, we define the partial order $\prec$ on programs such that $A \prec B$ holds if and only if $D_A \subseteq D_B$. To see that this is indeed a partial order, we only need the fact that $\subseteq$ is a partial order. Vaguely speaking, $A \prec B$ means that program $B$ exhibits at least all the defects that program $A$ does.

Mocking involves using a program $P$ during deployment but another program $Q$ during testing. We can define a safe mock for a program $P$ as a program $Q$ such that $P \prec Q$ holds. (Not $Q \prec P$!) We give it this name because a safe mock exhibits at least all the defects that the real program does, so any test that passes for the mock would also pass for the real program. (Note that this does not imply that "any test that fails for the mock would also fail for the real program" or that "any test that passes for the real program would pass for the mock".) In other words: a safe mock does not hide defects. Formally: a safe mock is an over-approximation of the real program.
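To recap the definitions so far in one place (this summary only restates what is defined above):

$$D_A = \bigcup_{e \in E_A} D_e, \qquad A \prec B \iff D_A \subseteq D_B, \qquad Q \text{ is a safe mock for } P \iff P \prec Q.$$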

It is easy to see that every program $P$ is a safe (if degenerate) mock for itself. It should also become clear that there are very few safe mocks for any nontrivial program, because the things that make a mock interesting to use also prevent it from being a safe mock. Indeed, an in-memory stand-in for a file system does not exhibit certain defects that a real file system might exhibit. Even if that were not the case, and your in-memory stand-in also exhibited all possible defects that a real file system might exhibit, you still could not be sure of that, because it is an unknown unknown. And if you could somehow be sure, congratulations: you would have effectively implemented a real file system instead of a mock.

You are probably thinking "well, that's a very restrictive way of thinking about it, I can just ..." and let me tell you: no, you can't, because of unknown unknowns.

Conclusion

Test real code in a real environment instead of fake code in a pretend environment.
