Test driven design, or planning driven development

Design processes in most business software development resemble peer review or crowd-sourcing. A putative design is presented to peers, who will do their best to find problems while the originator of the design defends it against these challenges. Ideally, where they are demonstrated incorrect or incomplete the process will iterate and an updated design produced and defended.

For many aspects of software design this is usually both efficient and sufficient, if subject to the biases and power relations of the individuals and groups involved. For the scenario below however, collective agreement could well be dangerously insufficient.

As programmers we learned to internalise the process of computation - running fragments of programs in our heads and reasoning about their behaviour. The shift to test-first methodologies in the early 2000 demonstrated to many of us that the our intution about the behviour of software even at a small scale can be misleading, particularly as we evolve a program and components begin to interact. With the distributed systems that we now work with the complexity is even greater and the conditions for the controlled failure required by test driven development difficult to reproduce.

There are technoloigical aids - processes and distributed systems can and should be heavily instrumented so that SRE teams and developers can be aware of, pinpoint and resolve these issues in production and staging systems. We build feedback loops back into our organisations to iteratively improve systems as their faults are discovered.

But what if we could find some of these faults before a line of code was written?

A familiar scenario

We are designing a job processing subsystem and the prototype sometimes doesn't emit a message saying that that the job is has been executing is completed, despite their status in the database being set to complete.

The subsystem consists of a task pool, a message queue, a supervisor pool and a nosql store for storing task state. Once a subtask is complete it posts a message to a message queue. This message queue is subscribed to by one or more task monitor proceses. The task monitor process is responsible for determining when all the subtasks for a particular task have been completed and triggering zero of more possible actions depending on the task.

A supervisor process, on reciept of a task completion notification does the following:

  • Loads the job from the database
  • Append the new task status to the set of task statuses
  • Loads the job again from the database
  • Checks if the new set of task statuses is considered complete. If this set is complete and the previous was not, emit a completed message for the job.

When running the integration tests for this prototype system they intermittently fail when checking for the notification.

Formal modelling

When scientists or engineers in other fields cannot reason about or easily manipulate real systems due to their complexity or other constraints they work with models instead. We do this in software to an extent - our specificiations and architecture documents have semi formal descriptions of desired system behviour, and our test suites also implicity contain a model of system behaviour under certain conditions. Where we usually differ is that scientists and engineers in other fields use formal, mathematical models of the systems that they want to understand. This is much less common in our field.

Formal models can be difficult to create, and fully modelling a non trival software system is best left to very specialist safety or security critical problems. This does not mean that they are not useful for more standard systems.

A sufficent model for the scenario above

We can create a formal model for the properties of this design that we are interested in, accounting for messsage passing, concurrency and potential failure. There are various tools for doing this, but for this example we will use TLA+. TLA+ is probably the most tractable modelling language for a few reasons; it is based on relitively simple mathematics that should be familiar to most programmers - set theory and temporal logic, it does not require proofs - it comes with a toolkit for iteratively exploring the state spaces of models and it is beginning to be used outside of academic and critical software. Both Amazon and Micrsoft use it to model many of their cloud services and it is increasingly being used by application developers, particularly those working with distributed systems.

For the puposes of this article, we will wave our hands a little here. The details of TLA+ syntax and semantics are beyond its scope. We can model the important factors of this design in less than 50 lines of Pluscal (the high level TLA+ syntax).

EXTENDS Sequences, TLC

\* Receive a message from the queue
macro recv(queue, receiver)
begin
  await queue # <<>>;
  receiver := Head(queue);
  queue := Tail(queue);
end macro

procedure appendProgress(writeNodes, status)
variable nodes = writeNodes;
begin P0:
  while (nodes # {}) do
  P1:
    with n \in nodes do
      progress[n] := progress[n] \union {status};
      nodes := nodes \ {n};
    end with
  end while;
  return;
end procedure

\* Reads the progress set from the given nodes
ReadProgress(nodes) == UNION {progress[n] : n \in nodes}

\* Returs a set with all subsets of nodes with the given cardinality
NNodes(n) == {x \in SUBSET Nodes : Cardinality(x) = n}
That helper can then be used to describe the variables in the process that describes the process handler:

\* Handles a progress message from the queue
fair process progressHandler \in {"handler1", "handler2"}
variable
  writeQuorumNodes \in NNodes(Quorum),
  readQuorumNodes \in NNodes(Quorum),
  secondReadQuorumNodes \in NNodes(Quorum),
  completedBefore = FALSE,
  message = "";
begin P0:
  while TRUE do
  Poll:
    recv(queue, message);
    unacked := unacked + 1;
  Read:
    completedBefore := ProcessingComplete(ReadProgress(readQuorumNodes));
  Write:
    call appendProgress(writeQuorumNodes, message);
  ReadAfterWrite:
    if ~completedBefore /\ ProcessingComplete(ReadProgress(secondReadQuorumNodes)) then
      \* The real progress handler would trigger some action here
      switchHappened := switchHappened + 1;
    end if;
  Ack:
    unacked := unacked - 1;
  end while;
end process;

The invariant we use to find the defect. We check that there is a state where the queue is empty, all messages are acked, and that the job completion swtich has happened.

Correctness == \/ queue # <<>>
               \/ unacked > 0
               \/ switchHappened > 0

And a liveness property, this is a statement in temporal logic that asserts that at some point while exploring the state space of the model the switch to complete happened. This is required as it is quite possible for an invariant to be satisfied without any state changes at all.

Liveness == <>[](switchHappened > 0)

When we execute this model in the TLA+ environment, the model executes without error. So we have encountered one of the most common issues with using model based approaches - which is real, the map or the territory? If satisified that the model is sufficient, we can check that the actual software system matches its design. In this case, the location of the problem is quite obvious - the guarantees of our use of the nosql store do not match those specified by the model. Enabling quorum for the reload of the task list will resolve the problem with a minimal and suffient change to the real system.

As you can see, the effort required to produce models sufficient to prove properties about a fairly realistic disitrubted system scenario is actually quite reasonable, particularly with a pattern library of distributed system components available. San Digital are able to offer this level of assurance for applicable parts of systems during the Design phase of our process.

Jumping into the FHIR - type systems and objects

We have been doing a deep-dive on FHIR implementations and tooling following our initial FHIR investigation. A critical area of investigation for any system, particularly a large distributed system with many clients and peers that need longevity and guided evolution is its type system. Use of a strict type system can have many benefits - interoperability, governance, data quality, security, performance, developer experience, reduced component complexity and the ability to evolve services with confidence

Integrating with Events

The San Digital team has worked with numerous organisations in both the public and private sectors to transform their applications architecture into a flexible and business-focused model. Working with events at scale is key to maintaining individual teams' agility.

The process of building a mobile app

The team at San Digital has extensive experience developing apps for mobile devices, smartwatches, and smart TVs; using native and hybrid technologies (and everything in-between!) including using Rust for complex comms.

Low friction development environments

While setting up a sample project from an unnamed large vendor the other week I was disappointed by having to read large amounts of documentation and run various bits of script to install dependencies and set up infrastructure. We live in a world that has tools old (Make) and new (Docker) that can be combined to make onboarding engineers low or zero friction.

Cloud-native FHIR platforms

Continuing our series of posts on web protocols, we have been investigating more specialist protocols, in this case, "FHIR". We have produced a document based on our research, investigations and experience.

Team Structures

Multiple team structures can work to deliver software projects. There is no real one size fits all, however, there are common components that can be seen across different structures. At San Digital we believe that Engineer-led teams deliver great results for short duration high-impact projects.

Rules of the Road

This is called rules of the road but they aren't rules they're more guidelines, so they're rules until there is a good reason to ignore them.

Estimating and delivering defined outcomes

Recently there has been a shift away from time and materials projects towards defined outcomes, driven by various legislative changes, specifically IR35, but also cost control in the procurement function of larger organisations.

The San Digital Stack

San Digital has been designed as a remote first business from inception, on the assumption that it's easier to add offices later if they are necessary in an agile way. To work in collaborative way completely remotely takes a carefully thought out set of tools. Some of the ones that we use are really standard and some are a little more interesting.

Test driven design, or planning driven development

Design processes in most business software development resemble peer review or crowd-sourcing. A putative design is presented to peers, who will do their best to find problems while the originator of the design defends it against these challenges. Ideally, where they are demonstrated incorrect or incomplete the process will iterate and an updated design produced and defended.

A human view of computer vision at scale

Computers analysing and acting on what they see is not science fiction or even a new concept, it has been a reality of humankind's drive towards hyper-efficiency since around the time I was born.

Building scalable frontends

Scaling frontends is hard, actually scaling all codebases is hard, frontends just happen to be particularly visible and have a tighter feedback loop and a higher rate of change. As with all codebases, it is in principle possible to scale development through standards and integration processes, but these are a poor substitute for communication. Once development moves beyond the scope of a single team, either progress slows to take into account of different processes or implementations drift away from each other over time. Teams need to find a way to operate independently towards a goal.

Cross platform native mobile development with Rust

San Digital have extensive experience of mobile development and the use of Android as an embedded operating system. We treated android as a deployment target target for Rust firmware as well as writing our intricate real time communications component for both iOS and Android. This approach has advantages, you can maintain a single code base for a complicated communications layer, while also taking advantage of the full native capabilities of each platform

The evolution of web service protocols pt 2

At San Digital, some of us have been building things for people since the dawn of the web. Our historical perspective helps inform us about technological culture and trends today, almost compensating for the creaking knees.

The evolution of web service protocols pt 1

At San Digital, some of us have been building things for people since the dawn of the web. Our historical perspective helps inform us about technological culture and trends today, almost compensating for the creaking knees.