Test driven design, or planning driven development

  • user
  • date
    December 15, 2020
Test driven design, or planning driven development

Design processes in most business software development resemble peer review or crowd-sourcing. A putative design is presented to peers, who will do their best to find problems while the originator of the design defends it against these challenges. Ideally, where they are demonstrated incorrect or incomplete the process will iterate and an updated design produced and defended.

For many aspects of software design this is usually both efficient and sufficient, if subject to the biases and power relations of the individuals and groups involved. For the scenario below however, collective agreement could well be dangerously insufficient.

As programmers we learned to internalise the process of computation - running fragments of programs in our heads and reasoning about their behaviour. The shift to test-first methodologies in the early 2000 demonstrated to many of us that the our intution about the behviour of software even at a small scale can be misleading, particularly as we evolve a program and components begin to interact. With the distributed systems that we now work with the complexity is even greater and the conditions for the controlled failure required by test driven development difficult to reproduce.

There are technoloigical aids - processes and distributed systems can and should be heavily instrumented so that SRE teams and developers can be aware of, pinpoint and resolve these issues in production and staging systems. We build feedback loops back into our organisations to iteratively improve systems as their faults are discovered.

But what if we could find some of these faults before a line of code was written?

A familiar scenario

We are designing a job processing subsystem and the prototype sometimes doesn’t emit a message saying that that the job is has been executing is completed, despite their status in the database being set to complete.

The subsystem consists of a task pool, a message queue, a supervisor pool and a nosql store for storing task state. Once a subtask is complete it posts a message to a message queue. This message queue is subscribed to by one or more task monitor proceses. The task monitor process is responsible for determining when all the subtasks for a particular task have been completed and triggering zero of more possible actions depending on the task.

A supervisor process, on reciept of a task completion notification does the following:

  • Loads the job from the database
  • Append the new task status to the set of task statuses
  • Loads the job again from the database
  • Checks if the new set of task statuses is considered complete. If this set is complete and the previous was not, emit a completed message for the job.

When running the integration tests for this prototype system they intermittently fail when checking for the notification.

Formal modelling

When scientists or engineers in other fields cannot reason about or easily manipulate real systems due to their complexity or other constraints they work with models instead. We do this in software to an extent - our specificiations and architecture documents have semi formal descriptions of desired system behviour, and our test suites also implicity contain a model of system behaviour under certain conditions. Where we usually differ is that scientists and engineers in other fields use formal, mathematical models of the systems that they want to understand. This is much less common in our field.

Formal models can be difficult to create, and fully modelling a non trival software system is best left to very specialist safety or security critical problems. This does not mean that they are not useful for more standard systems.

A sufficent model for the scenario above

We can create a formal model for the properties of this design that we are interested in, accounting for messsage passing, concurrency and potential failure. There are various tools for doing this, but for this example we will use TLA+. TLA+ is probably the most tractable modelling language for a few reasons; it is based on relitively simple mathematics that should be familiar to most programmers - set theory and temporal logic, it does not require proofs - it comes with a toolkit for iteratively exploring the state spaces of models and it is beginning to be used outside of academic and critical software. Both Amazon and Micrsoft use it to model many of their cloud services and it is increasingly being used by application developers, particularly those working with distributed systems.

For the puposes of this article, we will wave our hands a little here. The details of TLA+ syntax and semantics are beyond its scope. We can model the important factors of this design in less than 50 lines of Pluscal (the high level TLA+ syntax).

EXTENDS Sequences, TLC

\* Receive a message from the queue
macro recv(queue, receiver)
  await queue # <<>>;
  receiver := Head(queue);
  queue := Tail(queue);
end macro

procedure appendProgress(writeNodes, status)
variable nodes = writeNodes;
begin P0:
  while (nodes # {}) do
    with n \in nodes do
      progress[n] := progress[n] \union {status};
      nodes := nodes \ {n};
    end with
  end while;
end procedure

\* Reads the progress set from the given nodes
ReadProgress(nodes) == UNION {progress[n] : n \in nodes}

\* Returs a set with all subsets of nodes with the given cardinality
NNodes(n) == {x \in SUBSET Nodes : Cardinality(x) = n}
That helper can then be used to describe the variables in the process that describes the process handler:

\* Handles a progress message from the queue
fair process progressHandler \in {"handler1", "handler2"}
  writeQuorumNodes \in NNodes(Quorum),
  readQuorumNodes \in NNodes(Quorum),
  secondReadQuorumNodes \in NNodes(Quorum),
  completedBefore = FALSE,
  message = "";
begin P0:
  while TRUE do
    recv(queue, message);
    unacked := unacked + 1;
    completedBefore := ProcessingComplete(ReadProgress(readQuorumNodes));
    call appendProgress(writeQuorumNodes, message);
    if ~completedBefore /\ ProcessingComplete(ReadProgress(secondReadQuorumNodes)) then
      \* The real progress handler would trigger some action here
      switchHappened := switchHappened + 1;
    end if;
    unacked := unacked - 1;
  end while;
end process;

The invariant we use to find the defect. We check that there is a state where the queue is empty, all messages are acked, and that the job completion swtich has happened.

Correctness == \/ queue # <<>>
               \/ unacked > 0
               \/ switchHappened > 0

And a liveness property, this is a statement in temporal logic that asserts that at some point while exploring the state space of the model the switch to complete happened. This is required as it is quite possible for an invariant to be satisfied without any state changes at all.

Liveness == <>[](switchHappened > 0)

When we execute this model in the TLA+ environment, the model executes without error. So we have encountered one of the most common issues with using model based approaches - which is real, the map or the territory? If satisified that the model is sufficient, we can check that the actual software system matches its design. In this case, the location of the problem is quite obvious - the guarantees of our use of the nosql store do not match those specified by the model. Enabling quorum for the reload of the task list will resolve the problem with a minimal and suffient change to the real system.

As you can see, the effort required to produce models sufficient to prove properties about a fairly realistic disitrubted system scenario is actually quite reasonable, particularly with a pattern library of distributed system components available. San Digital are able to offer this level of assurance for applicable parts of systems during the Design phase of our process.

Get in touch

Let’s do something great