2011-03-11 14:59 by pjotrp

1 BioScala Tutorial

Copyright (C) 2010-2011, Pjotr Prins

HTML version

1.1 Introduction

The goal of the BioScala project is to create a scalable and functional bioinformatics library, building on existing libraries available on the JAVA Virtual Machine (JVM) - including BioJAVA and BioRuby.

1.2 Install

For installation and compilation of bioscala, follow the instructions in ./INSTALL

1.3 Run BioScala

Fire up Scala using the bioscala script with the -p switch, which makes sure the classpath is loaded:

  cd bioscala
  ./bin/bioscala -p

  Welcome to Scala version 2.7.7 final 
  scala> _

Import BioScala

  scala> import bio._

and create a Sequence object

  scala> val dna = new DNA.Sequence("agctaacg")
  seq: bio.DNA.Sequence = agctaacg 

1.4 Working with Sequences

Sequence is strongly typed. Creating a DNA sequence from and RNA string will give an error

  scala> val dna2 = new DNA.Sequence("agcuaacg")
  java.lang.IllegalArgumentException: Unexpected value for nucleotide u

You can transcribe DNA

  scala> val rna = dna.transcribe    
  rna: bio.RNA.Sequence = agcuaacg

Notice it is an RNA object now. Nucleotides are proper lists.

  val l = RNA.A :: rna.toList
  l: List[bio.Nucleotide] = List(a, a, g, c, u, a, a, c, g)

to translate nucleotides to amino acids:

  scala> SequenceTranslation.translate(l)
  res3: String = KLT

Create a Sequence with ID

  scala> val seq = new DNA.Sequence("ID456","agctaacg")
  scala> seq.id
  String = ID456

ID and description

  scala> val seq = new DNA.Sequence("ID456","My gene", "agctaacg")
  scala> seq.description
  String = My gene

A Sequence, like in the real world, can have multiple ID's and descriptions.

  scala> import bio.attribute._
  scala> val seq2 = seq.attrAdd(Id("Pubmed:456"))
  scala> seq2.idList
  List[bio.Attribute] = List(Pubmed:456, ID456)

  scala> val seq3 = seq2.attrAdd(Description("Another description")) 
  scala> seq3.descriptionList
  List[bio.Attribute] = List(Another description, My gene)

Note that Sequence is immutable. Every time you add an attribute a new copy gets created.

1.5 Sequence Attributes

BioScala allows you to add any attribute to a Sequence object. You can even define your own attributes. Predefined attributes are one or more id's, descriptions, annotations, features, gaps etc.

BioScala uses a message paradigm to access the list of attributes. This way an attribute can 'decide' to respond to a message. This also allows unanticipated usage of the Sequence object. Anyone can create special attributes that react to special messages.

One example is that the BioScala Description attribute responds to the GetXML message. If you check the source code - it is not part of the core Sequence implementation, but part of the Description attribute(!) We can generate XML:

    "Description Attribute" should "respond to GetXML" in {
      val descr = new Description("Some description")
      val msg = descr.send(GetXML) 
      msg should equal (Ok,"<Description>Some description</Description>")

Do turn a Sequence into XML we can do

  scala> val seq = new DNA.Sequence("ID356","Describe gene 356","agctgaatc")

First we filter on attributes that understand the messsage GetXML

  scala> val xml = seq.attribValues(GetXML,seq.attributes)

Next we generate output

  scala> xml.mkString should equal ("<Id>ID456</Id><Description>Gene 456</Description>")

Again, the Sequence object did not need to understand XML to achieve this. Only the Id and Description objects understand to interpret the GetXML message.

You can add *any* type of Attribute to a Sequence object and to its contained nucleotides, or amino acids. An interesting attribute would be a QualityAttribute for every nucleotide.

In fact, this is how Protein.CodonSequence is implemented. CodonSequence is an AminoAcidSequence, where every Codon has an AminoAcid and the matching (three letter) DNA Sequence information. By treating an amino acid and codon together we can (1) utilise the generic Sequence container and (2) reorder the sequence without having to split the logic. I.e. say we want to delete three amino acids from the Sequence, or insert three gaps, we do that in one go. See the CodonSequence section.

1.6 CodonSequence

The CodonSequence makes use of the Attribute list of the standard AminoAcidSequence object. When creating a CodonSequence, e.g.

  scala> val seq = new Protein.CodonSequence("ID356","Describe gene 356","agctgaatc")
  seq: bio.Protein.CodonSequence = S*I 
  scala> seq.toString
  String = S*I
  scala> seq.toDNA
  List[bio.DNA.NTSymbol] = List(a, g, c, t, g, a, a, t, c)

it stored the DNA sequence as a CodonAttribute to the AminoAcid. When you want to fetch the codon sequence of Attribute with the third Amino Acid, you can query for the codon information

  scala> seq(2).getCodon
  List(a, t , c)

and even directly from the CodonSequence

  scala> seq.getCodon(2)
  List(a, t , c)

Now we want to delete the middle codon:

  scala> seq2 = seq.delete(1,1)
  seq: bio.Protein.CodonSequence = SI

We have another immutable CodonSequence object seq2, which contains a new sequence with matching amino acids and codons:

  scala> seq.toString
  String = SI
  scala> seq.toDNA
  List[bio.DNA.NTSymbol] = List(a, g, c, a, t, c)

1.7 Alignment

Alignment is a container for GappedSequence(s):

  scala> val s1 = new DNA.GappedSequence("agc--taacg---")
  scala> val s2 = new DNA.GappedSequence("agc---aaca---")
  scala> val aln = new Alignment(List(s1,s2))

Each Sequence can have arbitrary attributes - see the description of Sequence above - but also at the alignment level you can have attributes. For example if you want to keep track of a predicted domain, an attribute derived of type AlignmentColumn may work.

  scala> class Domain(first: Int,last: Int) extends AlignmentColumn(first,last)
  scala> val domain = new Domain(20,30)
  scala> val aln2 = aln.attrAdd(domain)

Note that Alignment is immutable, so we create a new version by adding an attribute.

Or if you want to highlight a cut-out of the alignment you may derive from AlignmentBlock.

Likewise, if you want to keep track of a removed column you may also consider creating an Alignment attribute based on AlignmentColumn. The featured GetCleanAlignment message would simply not generate output:

  scala> class Removed(first: Int,last: Int) extends AlignmentColumn(first,last)
  scala> val aln3 = aln.attrAdd(new Removed(20,30))
  scala> aln3.toText(GetCleanAlignment)

Thus, like Sequence feature attributes, you can handle arbitrary attributes in a flexible way using the same type of message paradigm that Sequence uses. Say you want to create HTML output of the alignment, where domains are highlighted, you can simply implement a GetDomainHTML message that gets sent to all attributes to respond with a highlight. E.g.

  scala> println aln.toHTML(GetDomainHTML) // not yet implemented

1.8 Using BioJAVA from Scala

One of the great features of Scala is that it runs on the JAVA virtual machine and creates JAVA byte code. In effect, the generated byte code is equal to that of JAVA generated byte code. This means you can deploy Scala with JAVA, and you can interact between the two languages. An example is in BioScala's Sequence translation, from ./src/main/scala/sequence/translate.scala

  import org.biojava.bio.symbol._
  import org.biojava.bio.seq._
  import bio._

  package bio {
    object SequenceTranslation {
       * Translate nucleotides to amini acids (will change to returning List)
      def translate(nucleotides: List[Nucleotide]): String = {
        val rna = RNATools.createRNA(nucleotides.mkString);
        val aa = RNATools.translate(rna);

You can see we import BioJAVA classes, and call directly into BioJAVA's RNATools to create an RNA object, followed by translation to an amino acid string.

1.9 Using BioRuby from Scala

BioRuby compiles against JRuby on the JAVA Virtual Machine. Like, the description above, on BioJAVA, we can call directly into compiled Ruby byte code. For a BioRuby translation example, create a Ruby class like:

  require 'java'
  require 'bio'

  class RbSequence
    def translate(sequence, frame, codon_table)
      seq = Bio::Sequence::NA.new(sequence)

After compiling above with jrubyc it can be invoked from Scala

  scala> val rbseq = new RbSequence
  scala> rbseq.translate("agtcat",1,1)

assuming you have the jruby-complete.jar file in the class path and BioRuby installed.

1.10 Further Reading

Giving a full description of BioScala is beyond the objective of this tutorial. It is worthwhile checking the tests in the source tree - these are written as specifications and self explanatory. See the files in ./src/test/scala/bio. For example:

    "A DNA Sequence" should "allow multiple IDs" in {
      val s = new DNA.Sequence("ID456","Gene 456","agctaacg")
      val s2 = s.attrAdd(List(Id("GEO:456"),Id("Pubmed:456")))
      s2.idList === (List("ID456","Geo:456","Pubmed:456"))

another source of documentation is the information presented by the BioJAVA project. JAVA code maps easily to Scala, see the section on BioJAVA in this document.

1.11 See also


wikiTEXer 0.55 by Pjotr Prins - generated 2011-03-11 14:59 by pjotrp