At Scala Matsuri a few weeks ago (incidentally, an excellent conference), I was fortunate to be able to attend Jonas Bonér’s impassioned talk about resilience and reactive software. His theme: “without resilience, nothing else matters”.
At the core of it is a certain way of thinking about the ways that complex systems fail. Importantly, complex systems are not the same as complicated systems, although in everyday speech we tend to confuse the two. Perhaps a related or even identical question is: how do composite systems fail?
Using terminology that originates with the Erlang language, Bonér talked about the “error kernel”, which is the part of a software system that must never fail, no matter what. As long as this innermost part stays alive, other parts are allowed to fail. There are mechanisms to replace, restart or route around failures in the outer parts.
This style of design leads to a well-structured failure and supervision hierarchy. Maybe this style of thinking itself is the most important contribution. In most software systems being designed today, the possibility of errors or failures is treated as a second-class citizen: swept under the carpet, and certainly not part of a carefully considered structure of possible failures. What if this structure became a primary concern?
Once errors are well structured and organised in a hierarchy, it also becomes easy to decide what to do when errors occur. The hierarchy structure clearly indicates which parts of a system have become defunct and need to be replaced or bypassed. Recoverability – being able to crash safely – at every level takes the software system a little bit closer, it seems, to biological systems.
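To make the idea concrete for myself, here is a minimal sketch in plain Python. It is not Akka and not anything shown in the talk; the names `Worker`, `Supervisor`, `dispatch` and `max_restarts` are my own hypothetical choices. The supervisor plays the role of the error kernel: it does no risky work itself, it only delegates, and when the worker fails it replaces the worker with a fresh one. If the restart budget runs out, it escalates the failure to whoever sits above it in the hierarchy.

```python
class Worker:
    """Hypothetical outer component that is allowed to fail."""

    def __init__(self, name):
        self.name = name
        self.handled = 0  # state that is lost when the worker is replaced

    def handle(self, message):
        if message == "poison":
            raise ValueError(f"{self.name} cannot handle {message!r}")
        self.handled += 1
        return f"{self.name} handled {message!r}"


class Supervisor:
    """Hypothetical error kernel: delegates risky work, never does it itself."""

    def __init__(self, worker_factory, max_restarts=3):
        self.worker_factory = worker_factory
        self.max_restarts = max_restarts
        self.restarts = 0
        self.worker = worker_factory()

    def dispatch(self, message):
        try:
            return self.worker.handle(message)
        except Exception as error:
            if self.restarts >= self.max_restarts:
                # Out of budget: escalate one level up the hierarchy.
                raise RuntimeError("restart budget exhausted") from error
            # Replace the defunct worker with a fresh instance.
            self.restarts += 1
            self.worker = self.worker_factory()
            return f"restarted after: {error}"


supervisor = Supervisor(lambda: Worker("worker-1"), max_restarts=2)
results = [supervisor.dispatch(m) for m in ["a", "poison", "b"]]
```

In this sketch the failing message is dropped and the replacement worker starts with clean state, which is only one of several possible policies. A real supervision framework would also offer choices such as resuming, retrying the message, or stopping the child altogether.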
Biological systems, Bonér pointed out, usually operate with some degree of inherent failure, be it disease, weakness, mutations or environmental stress. Perfect functioning is not typical, and it seems to me that for most organisms such a state may not even exist.
Recoverability at every level, resilience, and error hierarchies – “let it fail” – together make up a significant and very humble way of thinking about software. It means that as the developer, I acknowledge that the software I am writing does not control the universe (although as a developer I often fall prey to that illusion). The active principle, the “prime mover”, is somewhere outside the scope that I control. When it produces some unforeseen circumstance, the software must respond properly. Reactive software to me seems to quietly acknowledge this order of things.
I have only had a very brief opportunity to try out Akka, Typesafe’s actor framework, in my projects so far, but I felt inspired by Bonér’s talk and hope to use it more extensively in the future.