The “Friedrich principles” for bioinformatics software

I’ve just come back from Biohackathon 2012 in Toyama, an annual event, traditionally hosted in Japan, where users of semantic web technologies (such as RDF and SPARQL) in biology and bioinformatics come together to work on projects. This was a nice event with an open and productive atmosphere, and I got a lot out of attending. I participated in a little project that is not quite ready to be released to the wider public yet. More on that in the future.

Recently I’ve also had a paper accepted at the PRIB (Pattern Recognition in Bioinformatics) conference, jointly with Gabriel Keeble-Gagnère. The paper is a slight mismatch for the conference, as it really focuses on software engineering more than pattern recognition as such. In this paper, titled “An Open Framework for Extensible Multi-Stage Bioinformatics Software” (arxiv), we make a case for a new set of software development principles for experimental software in bioinformatics, and for big data sciences in general. We provide a software framework that supports application development with these principles – Friedrich – and illustrate its application by describing a de novo genome assembler we have developed.

The gestation of this paper in fact occurred in the reverse order from the above. In 2010, we began development on the genome assembler, at the time a toy project. As it grew, it became a software framework, and eventually something of a design philosophy. We hope to keep building on these ideas and demonstrate their potential more thoroughly in the near future.

For the time being, these are the “Friedrich principles” in no particular order.

  • Expose internal structure.
  • Conserve dimensionality maximally. (“Preserve intermediate data”)
  • Multi-stage applications. (Experimental and “production”, and moving between the two)
  • Flexibility with performance.
  • Minimal finality.
  • Ease of use.

Particularly striking here is (I think) the idea that internal structure should be exposed. This is the opposite of encapsulation, an important principle in software engineering. We believe that when the users are researchers, they are better served by transparent software, since the workflows are almost never final but subject to constant revision. But of course, the real trick is knowing what to make transparent and what to hide – an economy is still needed.
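To make the idea of exposed structure a little more concrete, here is a minimal sketch of a pipeline in the spirit of two of the principles: intermediate data is preserved, and the internal stages are inspectable rather than hidden behind a single opaque result. This is a hypothetical illustration only, not the actual Friedrich API (which is described in the paper).

```python
# Hypothetical sketch (not the actual Friedrich API): a pipeline that
# exposes its internal structure and conserves intermediate data.

class Pipeline:
    def __init__(self, stages):
        self.stages = stages          # exposed: the workflow is inspectable
        self.intermediates = []       # conserved: every stage's output is kept

    def run(self, data):
        for name, stage in self.stages:
            data = stage(data)
            self.intermediates.append((name, data))
        return data

# Toy stages for a string-processing "workflow":
p = Pipeline([
    ("uppercase", str.upper),
    ("reverse",   lambda s: s[::-1]),
])
result = p.run("acgt")
assert result == "TGCA"
# Unlike an encapsulated black box, each intermediate remains available
# for inspection, so the researcher can revise any stage of the workflow:
assert p.intermediates[0] == ("uppercase", "ACGT")
```

The point of the sketch is simply that a researcher can re-examine or re-run any stage, rather than being handed only the final output.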

Affirmation and negation

Some thoughts about the possible ways of affirming or negating something in the world led to the following formal structure.

Affirmation

1. Affirmation as associating yourself, taking the path, making the choice

2. Affirmation by copying what you affirm, assuming its form

3. Affirmation by appropriating something in a sophisticated way, making it your own

Negation

1. Negation by not taking the path, making a different choice, looking away, denying attention

2. Negation by assuming the exact opposite shape of the object, trying to become its inversion. This is in fact a slavish crypto-affirmation.

3. Negation through a cunning, sophisticated attack on the object, in which one doesn’t assume its form.

Are there other forms?

How one might develop a Heideggerian AI that uses software equipment

This year I’ve spent a fair amount of time trying to read Martin Heidegger’s great work Being and Time, using Hubert Dreyfus’ Berkeley lectures on the book as supporting material. By now I’ve almost finished division 1. I’m learning a lot, but it’s fair to say that this is one of the most difficult books I’ve read. I’m happy to have come this far and think I have some kind of grasp of what’s going on.

I’ve also come to understand that Heidegger played an important role in the so-called “AI debate” of the 70’s and 80’s. At the time, people at MIT, DARPA and other institutions were trying to build AI software on the presumptions of an Aristotelian world view, representing facts as logical propositions. (John McCarthy, of Lisp fame, and Marvin Minsky were some of the key people working on these projects.) Dreyfus made himself known as a proponent of views that were uncomfortable for the AI establishment at the time, such as Heidegger’s claim that you cannot represent human significance and meaning using predicate logic (more on that in a different post, when I understand it better).

There were even attempts at making a “Heideggerian AI” in response to Dreyfus’ criticism, when it became apparent that “good old fashioned AI”, GOFAI, had failed. But apparently the Heideggerian AI also failed – according to Dreyfus, this was because it wasn’t Heideggerian enough.

Using part 1 of Being and Time as inspiration, I have come up with a possibly novel idea for a “Heideggerian” AI. This is also a first attempt at expressing some of the (incomplete, early) understanding I think I have excavated from Being and Time. As my point of departure, I use the notion of equipment. Heidegger’s Dasein essentially needs equipment in order to take a stance on its being. It has a series of “for-the-sake-of-whichs” which lead up to an “ultimate for-the-sake-of-which”. In the case of an equipment-wielding AI, we might start by hardcoding its ultimate FTSOW as the desire to serve its human masters well by carrying out useful tasks. Dasein can ponder and modify its ultimate FTSOW over time, but at least initially, our AI might not need this capability.

Heidegger’s Dasein is essentially mitdasein, being-with-others. Furthermore, it has an essential tendency to do what “the they”/”the one” does, imitating the averageness of the Dasein around it. This is one of its basic ways of gaining familiarity with practices. By observing human operators using equipment, a well-programmed AI might be able to extract a working knowledge of how to use the same equipment. But what equipment should the AI use to train and evolve itself in its nascent, most basic stages? If the equipment exists in the physical world, the AI will need a sophisticated way of identifying this equipment and detecting how it is used, for example by applying feature detection and image processing to a video feed. This process is error prone and would complicate the task of creating the essential core of a rudimentary but well-functioning AI. Instead, I propose that the AI should use software tools as equipment alongside human operators who use the same tools.

Let’s now consider some of the characteristics of the being of equipment (the ready-to-hand) that Heidegger mentions. When Dasein is using equipment in skilled activity, the equipment nearly disappears. Dasein becomes absorbed in the activity. But if there is a problem with the activity or with the equipment, the equipment becomes more obtrusive. Temporarily broken equipment stands out and draws our attention. Permanently broken equipment is uncanny and disturbing. Various levels of obtrusiveness correspond to levels of breakdown of the skilled activity. And not only obtrusiveness: we become more aware of the finer details of the equipment as it breaks down, in a positive way, so that we may fix it. All this is certainly true for a hammer, car or sewing machine, but is it true of software tools? We may consider both how human users relate to software today, and how our hypothetical AI would relate to it.

Unfortunately it can be said that a lot of software today is designed — including but not limited to user interfaces — in such a way that when it breaks down, the essential details that need to be perceived in order to fix the problems are not there to be seen, for the vast majority of users with an average level of experience. When the software equipment breaks down, presumably our human cognition goes into alert and readies itself to perceive more details so that it can form hypotheses of how to remedy the errors that have arisen. But those details are not on offer. The software has been designed to hide them. In this sense, the vast majority of software that people use today does not fail smoothly. It fails to engage the natural problem solving capacity of humans when it does break, because the wrong things are exposed, and in the wrong ways and contexts. Software equipment has a disadvantage compared with physical equipment: we cannot inspect it freely with all of our senses, and the scrutiny of its inner details may involve some very artificial and sophisticated operations. The makers may even actively seek to block this scrutiny (code obfuscation, etc). In the case of software equipment, such scrutiny is greatly separated from the everyday use by numerous barriers. In the case of physical equipment, there is often a smooth continuum.

We now have the opportunity to tackle two challenges at once. First, we should develop an AI that can use software equipment alongside humans – that is, use it in the same way or nearly the same way that they use it, and for the same or nearly the same purposes. Second, we should simultaneously develop software that “breaks down well”, in the sense that its inner structure becomes visible to users when it fails, in such a way that they can restore normal functioning in a natural way. These users can be both humans and the AIs that we are trying to develop. Since the AI should mimic human cognition, a design that is good for one of these categories should be good for the other as well. In this way we can potentially develop a groundbreaking AI in tandem with a groundbreaking new category of software. Initially, both the AI and the software equipment should be very simple, and the complexity of both would increase gradually.

There would be one crucial difference between the way that humans use the software equipment and the way that the AI would use it. Human beings interact with software through graphical (GUI) or command-line (CLI) interfaces. This involves vision, reading and linguistic comprehension. These are also higher-order capabilities that may get in the way of developing a basic AI with core functionality as smoothly as possible. In order to avoid depending on these capabilities, we can give the AI a direct window into the software equipment. This would effectively be an artificial sense that tells the AI what the equipment is doing at various levels of detail, depending on how smooth the current functioning is. This would be useful both for the AI’s own use of equipment and for its observation of how humans use the same equipment. In this way we can circumvent the need for capacities such as vision, language, locomotion and physical actuators, and focus only on the core problem of skilled activity alongside humans. Of course this kind of system might later serve as a foundation on which these more advanced capacities can be built.

Many questions have been left unanswered here. For example, the AI must be able to judge the outcome of its work. But the problems that it solves inside the computer will reference the “external” world at all times (external in so far as the computer is separated from the world, which is to say, not really external). I am not aware of many problems I solve on computers that do not carry, directly or indirectly, references to the world outside the computer. Such references to the external world mean that the “common sense” problem must be addressed: arbitrary details that have significance for a problem may appear, pop up, or emerge from the background, and Dasein would know the relation of these details to the problem at hand, since this is intelligible on the basis of the world which it already understands. It remains to be seen if our limited AI can gain a sufficient understanding of the world by using software equipment alongside Dasein. However, I believe that the simultaneous development of software equipment and a limited AI that is trained to use it holds potential as an experimental platform on which to investigate AIs and philosophy, as well as software development principles.

 

Complex data: its origin and aesthetics

Kolmogorov complexity is a measure of the complexity of data. It is simple to define but appears to have deep implications. The K. complexity of a string is defined as the length of the shortest program, with respect to a given underlying computer, that outputs the string. For example, the string “AAAAAA” has lower complexity than “ACTGTT”, since the former can be described as “output ‘A’ 6 times”, while the latter has no obvious algorithmic generator. This point becomes very clear when the strings are very long. If no obvious algorithm is available, one has no option but to encode the whole string in the program.
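The intuition can be sketched in a few lines of Python. This is only an illustration: K. complexity itself is uncomputable, and the “programs” here are just Python expressions whose length we compare with the length of their output.

```python
# Toy illustration of the description-length intuition behind
# Kolmogorov complexity. A repetitive string has a description far
# shorter than itself; a patternless string does not.

program = "'A' * 2000"          # a 10-character description...
output = eval(program)          # ...of a 2000-character string
assert len(program) == 10
assert len(output) == 2000

# For a patternless string, the shortest "program" we can write
# essentially spells out the string itself, quote marks and all.
patternless = "ACTGTTGACCATGA"
literal_program = repr(patternless)       # "'ACTGTTGACCATGA'"
assert len(literal_program) == len(patternless) + 2
```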

In this case, when writing the “program” “output ‘A’ 6 times”, I assumed an underlying computer with the operations “repetition” and “output”. A different computer could of course be assumed, but provided that the computer is Turing-complete, the length of the shortest program differs only by a constant that depends on the computers, not on the string (this is the invariance theorem).

An essential observation to make here is that the output of a program can be much longer than the program itself; for example, consider the program “output ‘A’ 2000 times”. K. complexity has an inverse relation to compression: data with low K. complexity is generally very easy to compress, since compression basically amounts to constructing a minimal program that, when run, reproduces the given data. Data with high K. complexity cannot, by definition, be compressed to a size smaller than the K. complexity itself.
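The relationship to compression is easy to demonstrate with a general-purpose compressor, whose output size serves as a rough, computable upper bound on K. complexity. A sketch using Python’s standard library (the exact compressed sizes depend on the compressor, but the contrast does not):

```python
import os
import zlib

# Low-complexity data compresses dramatically; patternless (random)
# data barely compresses at all.
low = b"A" * 100_000          # "output 'A' 100000 times"
high = os.urandom(100_000)    # effectively incompressible

low_size = len(zlib.compress(low, 9))
high_size = len(zlib.compress(high, 9))

assert low_size < 1_000       # enormous reduction
assert high_size > 99_000     # essentially no reduction
```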

Now that the concept is clear, where does data with high K. complexity come from? Can we generate it? What if we write a program that generates complex programs that generate data? Unfortunately this doesn’t work: because an interpreter for the generated programs can be embedded in the generating program itself, a program-generating program cannot create data with higher K. complexity than (roughly) the size of the initial, first-level program. A high-complexity algorithm is necessary, and this algorithm must be produced by a generating process that cannot itself be reduced to a simple algorithm. So if a human being were to sit down and type in the algorithm, they might have to actively make sure that they are not inserting patterns into what they type.

But we can obtain vast amounts of high-complexity data if we want it, by turning our cameras, microphones, thermometers, telescopes and seismic sensors toward nature. The data thus recorded comes from an immensely complex process that, as far as we know, is not easily reduced to a simple algorithm. Arguably, this also explains aesthetic appeal. We do not like sensory impressions that are easily explained or reduced to simple rules. At first glance at least, hundreds of blocks of identical high-density houses are less attractive than low-density houses that have grown spontaneously over a long period of time (although we may eventually change our minds). Objects made by artisans are more attractive than those produced in high volumes at low cost. Life is more enjoyable when we can’t predict (on some level, but not all) what the next day will contain.

The deep aesthetic appeal of nature may ultimately have as its reason the intense complexity of the process that generates it. Even a view of the sky, the desert or the sea is something complex, not a case of a single color repeated endlessly but a spectrum of hues that convey information.

 

Identity games

I’ve recently seen the film Tinker, Tailor, Soldier, Spy, based on John le Carré’s novel with the same name. In the 1970’s a TV series based on the same novel, with Alec Guinness as George Smiley, was very popular in Britain. This film, with Gary Oldman as the protagonist, is supposed to be something like an update for the new generation.

It is a very good film indeed. (I cannot remember the last time I was so gripped by a film shortly after its release.) I was also inspired to read several of le Carré’s novels, including but not limited to Tinker, Tailor, Soldier, Spy. What they have in common is a subtle, rich portrayal of the spy trade from the viewpoint of Britain during the cold war; a world that seems to be, increasingly, a thing of the past. Voice recognition, social profiling and data mining seem to be taking the place of a good chunk of what le Carré calls tradecraft – the concrete skills that spies with 1970’s technology needed in order to perform their work on the ground in enemy territory – and computer scientists like myself are to blame.

Though hailed as the anti-Ian Fleming for his relatively gritty realism, le Carré is not without his own spy romanticism. But the bleakness inherent in the work comes through on every page.

In his commentary on the film, le Carré states that

[The world of spies is] not so far from corporate life, from the ordinary world. At the time of writing the novel, I thought that there was a universality that I could exploit. The book definitely resonated with the public; people wanted to reference their lives in terms of conspiracy, and that remains central to the relationship between man and the institutions he creates.

There is something profound in this. Spies are merely concentrated versions of something that we all are ourselves, something that we must be every day. Spies project false personalities in order to gain access and information, either about enemy assets or about other spies. They hide to survive, and they hide so that they may uncover a kind of truth. With a view to the spy as the most concentrated form of a certain kind of existence, let us take a look at some other forms that this existence may take.

The modern professional. To be professional means to effectively project a professional identity in the workplace. To be unprofessional almost always means that too much of another, possibly more genuine personality shines through – one has become too unrestrained. The professional needs to always be projecting, to a degree, in order to remain compatible with the workplace and retain his income and career prospects. Young people are socialised into this condition very early – at career workshops, students learn how to polish their CVs, how to embellish their record, and to hide their flaws. This is essentially a partial course in spycraft. But all this is only at the entry level. When any kind of sophisticated politics enters the organisation – as it does – the professional may be pushed ever closer to the spy. A recruiter: “Too bad that we couldn’t hire him, he seemed genuine.”

The academic. The academic can be thought of as a special version of the professional with some essential differences. First, professionals do not yet have universal records that follow them around for their entire lifetime – much of the “record” that they create, which is associated with the persona they are supposed to project, exists only in the memory of people and of one organisation. Academics build their records with units such as publications and conference attendance. Publications in particular form an atomic record that does not go away. On the other hand, the everyday life of the academic may – possibly – be less artificial than that of the professional, since focus is on the production of publishable units, not on pleasing people in one’s surroundings as much as possible.

The philosopher. Philosophers seek to uncover some hidden truth about the world. In this sense, they are spies without enemies. The philosopher lives among people with a view to analysing them and understanding their behaviour, so that he can explain it to them. But most of the time the philosopher is likely to be a flaneur or a quiet observer, like the spy often is: someone who seeks to learn something hidden from situations that other participants may regard as being routine and their everyday existence. In this sense spies may have something in common with philosophers.

Here I have highlighted a phenomenon but not made any recommendations. Maybe it’s for the better that we are all a little bit like spies. Masks of some kind are worn in most social interactions, not just the ones above, and they are not a recent phenomenon. Exposing something like a true inner self requires that the inner self remains static long enough for it to be possible to expose. But the difference between most social relationships and the relationships we have with institutions today is that the former can change or dissolve naturally to fit spontaneous changes in people’s characters or needs. Relationships between people and modern institutions do not yet seem capable of this dynamic.