2011-03-11 14:59 by pjotrp

1 Designing the BioScala library

Copyright (C) 2010-2011, Pjotr Prins

HTML version

Scala is a strongly typed functional language, that allows for object oriented programming (OOP). Too functional purists this is something of a contradiction. However, it does make sense for complex libraries - OOP allows for partitioning of data and code.

Designing a library from scratch is a good opportunity to make fundamental choices. OOP can easily be driven too far and can make hierarchies of objects hard to maintain. With BioScala we opt for a functional approach:

OOP is used in the most minimalistic fashion to partition functionality.

Maybe an example will help. With functional programming we emphasize immutable values and no side effects. We also design functions in a way that they use basic objects, when possible.

So, it makes sense to have a Nucleotide class - which allows for strong type checking. A Sequence has a list of nucleotides and a list of attributes, like ID and descriptions.

Now we are to add transcription. Should it be part of the Sequence class? In classic OOP design: yes. However, with BioScala we don't think transcription is a part of a Sequence object. Rather than cluttering Sequence - which is an object - we create a SequenceTranslation object, which carries no state. The translate method is without side effects. It takes as input a list of nucleotide, and returns a fresh transcribed list of nucleotides.

For convenience we add a transcribe method to Sequence, which delegates to the new function. Someone may wish to say 'seq.transcribe' - it should return a new transcribed Sequence. At least, when there is a reason to, for example a deeper copy happens of, for example, attributes. The rule is:

The main difference with a traditional OOP delegator pattern, like you would write in Ruby or JAVA, is that the list of nucleotides get passed to the function, as input. Traditionally a Sequence object would be passed in.

What is the advantage? First, SequenceTranslation is not aware about Sequence objects. So you can create an alternative implementation easily. Second, the transcribe method is cleaner - it does not refer to Sequence object internals.

1.1 FastaReader

A similar example is FastaReader. Compare the BioScala implementation with the current one in BioJAVA. In BioJAVA the FastaReader takes care of reading a FASTA file, transforming it to BioJAVA Sequence objects in memory, and returns the type automatically (DNA, RNA, Protein). It looks very attractive, as it takes care of everything. However, it is not a flexible approach.

The BioScala FastaReader, in contrast, does very little. The FASTA file is iterated, returning a Tuple every time. The tuple consists of the record ID, description and sequence - as Strings. We can assume that is the minimal layout of a FASTA record. Even if we no nothing of the content. The only state contained in the FastaReader object is the file handle. You can use the same FastaReader for nucteotides, amino acids, alignments, degenerated data etc.

FastaReader makes no real assumptions on the content of the FASTA file. It only split out records, based on lines starting with '>'. And it splits out the ID from the description. If you encounter some dirty file you can easily clean it up by filtering on the data, before passing it into the Sequence object(s). Also it is trivially simple to plug in a new implementation of a FastaReader.

FastaReader does not load everything in memory.

Reading files in RAM is a big NO, in this age of Big Data.

BioJava also provides a listener interface for reading files, to circumvent reading everything in memory. However the listener requires a lot of boiler plate code. It is much harder to use than an iterator. Unfortunately, it appears to be impossible to use a listener and provide an iterator, without loading everything in memory. If someone knows how to do that - it would be very<font color='GRAY'> interesting. Maybe an actor could do it, with the BioJava listener in a separate actor. Every time the iterator requests the actor, the listener moves one record forward.

1.2 Support for ambiguous nucleotides

BioScala has several nucleotide classes which support data that is pure (AGCT only), gapped (adds '-'), or has ambigous data like IUPAC. If you use DNA.IUPACSequence over DNA.Sequence, you'll get type checking for IUPAC.

bio.Sequence[T] is the abstract base class for all Sequences. DNA.Sequence supports Nucleotide - so only AGCT values. DNA.IPACSequence supports IUPAC. And IUPACGappedSequence adds gaps. The design is similar for RNA and Protein sequences. Protein contains support for Amino Acid and Codon type Sequences. CodonSequence supports the widest definition of a Sequence (IUPAC and Gaps).

Support for translation etc. is mixed into the Sequence types.

The current implementation is somewhat inheritance based - mixed traits may be a later improvement.

1.3 On Gaps

Aligned sequences contain 'gaps'. Each Sequence stores a list of:

  Nucleotide | Gap
  AminoAcid | Gap
  Codon | Gap

Where Gap is a bio.Symbol and Codon is an amino acid with matching codon sequence. One way to handle it is to store a List[Symbol] that supports a Gap and one of the other types in, for example, GappedSequence[Codon]. The type safety is not really maintained at the lowest level - as it is storing two different types. Still, the higher interface, which constructs the Sequences, can be type safe.

Maybe we should support special IUPAC Gap types. More complex Gaps can store information as attributes, so we can do with a single symbol, the dash. Thus the single types

  DNA.GappedIUPAC for DNA.GappedSequence
  Protein.GappedIUPAC for Protein.GappedSequence
  GappedCodon for GappedCodonSequence

and also

  DNA.GappedNucleotide for DNA.Sequence
  Protein.GappedAminoAcid for Protein.Sequence

are type safe. This implies that for these symbols we can safely use the base abstract class bio.Sequence[T], rather than having a specialised class.

Gapped Sequences share traits. One of them is to split the Sequence into gapped and non-gapped sections. This functionality goes into the RichGappedSequence trait.

1.4 Generic functions

To enforce type safety we end up with different lists of Symbols having a similar meaning. For example Protein.Gap and Nucleotide.Gap are really different alignment spacing objects.

We could create different implementations for each type, to ensure type safety. Still the whole point of writing functional code is to write code once. So how do we do it with Scala?

Let's say we want to split a Sequence into gapped and non-gapped sections. We want to write something like:

    def sections(seq: List[Symbol], gap: Symbol): List[List[Symbol]] = {
      def isMatch(inGapped: Boolean, c: Symbol) = {
        if (inGapped) 
          c == gap
          c != gap
      val isGap = (seq(0) == gap)
      val s = seq.takeWhile{ isMatch(isGap,_) }
      val tail = seq.takeRight(seq.length - s.length)
      tail match {
        case Nil => s :: Nil
        case _   => s :: sections(tail, gap)

the problem is - what type is gap? As it is an object we can pass it in from the start:

  sections(seq, gap) ...

which would work fine. However, we may not want to make this exception every time.

In many cases I have seen a solution which gives the object a query method, like isGap? So a that Gap symbol returns True, and a non-gap returns False. This is an often seen OOP solution. The main issue I have with it is that object become unneccessarily 'rich'. Besides, why should an Amino Acid object care whether it is a gap (or not)?

Similarly 'properties' are added, like isGap (true or false). This is seen in non-OOP languages, like in the EMBOSS C code. This way you end up with a long list of (sometimes conflicting!) properties.

Dynamic languages, like Python and Ruby, but in this case also Java, allow for querying the name of the object's class. Thus, one can check whether it is a gap, or not, by quering the name of the object. This is nicer - and one reason you get more compact class libraries with Ruby. In the same vein you could use method_missing? to query for an isGap? method, in Ruby. Saving sprinkling Gap knowlegde over the object hierarchy.

For me, being new to Scala, I confront myself with this issue. Apart from passing in the Gap object, Scala allows the OOP solution (a method, or property for every object), and we can query the class name (Java style). My gut says there should be a better way.

The solution is a class using generics, and som lower bound typing on functions. I blogged about the different options, starting here

The final solution reads something like

    class Splitter[T] {
      def section(seq: List[T]): List[List[T]] = {
        def isGapType[Q >: T](c : Q) = (c == DNA.Gap || c == Protein.Gap)
        def isMatch(c : T) = { if (isGapType(seq(0))) isGapType(c) else !isGapType(c) }
        val (s, tail) = seq.span{ isMatch }
        tail match {
          case Nil => s :: Nil
          case _   => s :: section(tail)

with the matching Specification:

    "section" should "split on Nucleotide" in {
      new Splitter[Nucleotide].section(List(A,G,Gap,Gap,C,T,Gap,T)) should equal (List(List(A,G),List(Gap,Gap),List(C,T),List(Gap),List(T)))

Basically we can write the function once, so its implementation gets shared between different types - in a type safe way. We can even hide the type specification by adding in the bio.DNA and bio.Protein packages:

    type UseSymbol = Nucleotide
    object MySplitter extends Splitter8[UseSymbol]


    "section" should "split on Nucleotide" in {
      MySplitter.section(List(A,G,Gap,Gap,C,T,Gap,T)) should equal (List(List(A,G),List(Gap,Gap),List(C,T),List(Gap),List(T)))

1.5 File readers

BioScala's file readers are Iterators. An iterator reads a file on demand, so it does not store everything in memory first.

1.6 File writers

BioScala has file writers. The model of a filewriter is in bio/db/paml/pamlwriter.scala. Basically it is a class that contains state on the file stream. To start the PamlWriter you can use a file name or stream. Use the bio.io.Control.using() method to wrap a close() statement around the writer, so the stream always gets closed. The way it is implemented you can write sequentially:

     import bio.io.Control._
      using(new FileOutputStream(tmpfn)) { stream =>
        new PamlWriter(stream).write(seqlist)
        new PamlWriter(stream).write(seqlist)

1.7 On Alignments

An alignment is a list of aligned sequences. To deal with empty space in sequences a 'Gap' object is inserted for every missing nucleotide or amino acid. In this edition of BioScala the Alignment operations are on the most simple data representation, often sequences represented by lists of Symbols. An Alignment is


For some operations List may be changed into Array, e.g. for constant time (fast) access of matrix elements. The Symbols may contain attributes - which transparently migrate along - e.g. a Codon Symbol carries both amino acid and nucleotide representation. That way, any matrix operation retains the Codon information.

Matrix operations that require some form of meta-Sequence information, such as special features, either need to be explicit, or those operation Sequence objects may be passed in, with their meta attributes.

For returning meta-information also explict return values may be used. This is preferred over modifying the attributes of Sequence objects internally. This, again, to provide a functional implementation. So use the 'lower' type

  List[Symbol] instead of Sequence[Symbol]

when possible.

What does this mean for designing classes and traits?

For example, we may want to remove empty columns in a sparse alignment matrix. This is a special action, which will have the matrix as input, and a new matrix as output, accompanied with a 'log' of removed columns.

As there is no further state to track, the implementation does not need to be a class or object. So, make it a trait. Name it SparseAlignment. The trait performs an action, and we define a method named 'removeEmptyColumns'. The method removes columns, and we want to know which ones were removed (this is the log). The minimal interface returns both the new alignment and a list of removed column numbers.

Resulting in:

  trait SparseAlignment[T] {
    type Alignment = List[List[T]]  // where T is a bio.Symbol
    def removeEmptyColumns(in: Alignment) : (Alignment, List[Integer]) 

We can predefine for every Symbol a companion class

  package bio.DNA
  object SparseAlignment extends SparseAlignment[NTSymbol]

which allows for the following example

  val s1 = new GappedSequence("agc--taacg---")
  val s2 = new GappedSequence("agc---aaca---")
  val s3 = new GappedSequence("agc---aaca---")
  val m = List(s1.toList,s2.toList,s3.toList)
  val (m1,log) = SparseAlignment.removeSparseColumns(m,1)
  # tests
  m1.head.mkString should equal ("agctaacg")
  log1 should equal (List(3,4,10,11,12))

Whatever object you come up with, as long as it can be transformed to a simple list, it will work with this algorithm.

1.7.1 Finding SNPs in an alignment

The next example is to look for SNPs in an alignment. This is fairly similar to removeSparseColumns, but now >we want to return a list of columns containing SNPs. The simplistic method is to return all columns that contain more than one type of nucleotide.

One could write anothe specialized function GetColumnsWithNucleotides, e.g.

  val s1 = new GappedSequence("ag---ctaacaa")
  val s2 = new GappedSequence("ag---caaacag")
  val s3 = new GappedSequence("ag--ccaaacgg")
  val m = List(s1.toList,s2.toList,s3.toList)
  val snp = SparseAlignment.GetColumnsWithNucleotides(m,1)
  # test
  snp should equal (List(6,10,11))

but we should stick to a functional approach. Basically the generic method is to visit every column and run a user defined function

  def hasMultipleNucleotides(col: List[Symbol]) {
    val uniquelist = col.removeDuplicates.filter { _ != DNA.Gap }
    uniquelist.size > 1
  val snp = SparseAlignment.columns.map { hasMultipleNucleotides }
  # test
  snp should equal (List(6,10,11))

The point is that the library does not need to cater for every implementation detail, as long as generic functions work. The function hasMultipleNucleotides is too trivial to add to the library - thank functional programming for that.

1.7.2 Tracking Sequence information in Alignment

To facilitate further functionality you may want to create specialized versions of this trait. In above example the GappedSequence may contain an identifier 'id', which is now lost in the result. A specialized implementation for Sequence may want to put humpty dumpty together again. In this case it would be code that builds on:

  val r1 = new GappedSequence(s1.id,m1(0))

creating an updated Sequence object with both the id and the sequence content.

1.8 See also


wikiTEXer 0.55 by Pjotr Prins - generated 2011-03-11 14:59 by pjotrp