Copyright (C) 2010-2011, Pjotr Prins
The goal of the BioScala project is to create a scalable and functional bioinformatics library, building on existing libraries available on the JAVA Virtual Machine (JVM) - including BioJAVA and BioRuby.
For installation and compilation of bioscala, follow the instructions in ./INSTALL
Fire up Scala using the bioscala script with the -p switch, which makes sure the classpath is loaded:
cd bioscala ./bin/bioscala -p
Welcome to Scala version 2.7.7 final scala> _
Import BioScala
scala> import bio._
and create a Sequence object
scala> val dna = new DNA.Sequence("agctaacg") seq: bio.DNA.Sequence = agctaacg
Sequence is strongly typed. Creating a DNA sequence from and RNA string will give an error
scala> val dna2 = new DNA.Sequence("agcuaacg") java.lang.IllegalArgumentException: Unexpected value for nucleotide u
You can transcribe DNA
scala> val rna = dna.transcribe rna: bio.RNA.Sequence = agcuaacg
Notice it is an RNA object now. Nucleotides are proper lists.
val l = RNA.A :: rna.toList l: List[bio.Nucleotide] = List(a, a, g, c, u, a, a, c, g)
to translate nucleotides to amino acids:
scala> SequenceTranslation.translate(l) res3: String = KLT
Create a Sequence with ID
scala> val seq = new DNA.Sequence("ID456","agctaacg") scala> seq.id String = ID456
ID and description
scala> val seq = new DNA.Sequence("ID456","My gene", "agctaacg") scala> seq.description String = My gene
A Sequence, like in the real world, can have multiple ID's and descriptions.
scala> import bio.attribute._ scala> val seq2 = seq.attrAdd(Id("Pubmed:456")) scala> seq2.idList List[bio.Attribute] = List(Pubmed:456, ID456)
scala> val seq3 = seq2.attrAdd(Description("Another description")) scala> seq3.descriptionList List[bio.Attribute] = List(Another description, My gene)
Note that Sequence is immutable. Every time you add an attribute a new copy gets created.
BioScala allows you to add any attribute to a Sequence object. You can even define your own attributes. Predefined attributes are one or more id's, descriptions, annotations, features, gaps etc.
BioScala uses a message paradigm to access the list of attributes. This way an attribute can 'decide' to respond to a message. This also allows unanticipated usage of the Sequence object. Anyone can create special attributes that react to special messages.
One example is that the BioScala Description attribute responds to the GetXML message. If you check the source code - it is not part of the core Sequence implementation, but part of the Description attribute(!) We can generate XML:
"Description Attribute" should "respond to GetXML" in { val descr = new Description("Some description") val msg = descr.send(GetXML) msg should equal (Ok,"<Description>Some description</Description>") }
Do turn a Sequence into XML we can do
scala> val seq = new DNA.Sequence("ID356","Describe gene 356","agctgaatc")
First we filter on attributes that understand the messsage GetXML
scala> val xml = seq.attribValues(GetXML,seq.attributes)
Next we generate output
scala> xml.mkString should equal ("<Id>ID456</Id><Description>Gene 456</Description>")
Again, the Sequence object did not need to understand XML to achieve this. Only the Id and Description objects understand to interpret the GetXML message.
You can add *any* type of Attribute to a Sequence object and to its contained nucleotides, or amino acids. An interesting attribute would be a QualityAttribute for every nucleotide.
In fact, this is how Protein.CodonSequence is implemented. CodonSequence is an AminoAcidSequence, where every Codon has an AminoAcid and the matching (three letter) DNA Sequence information. By treating an amino acid and codon together we can (1) utilise the generic Sequence container and (2) reorder the sequence without having to split the logic. I.e. say we want to delete three amino acids from the Sequence, or insert three gaps, we do that in one go. See the CodonSequence section.
The CodonSequence makes use of the Attribute list of the standard AminoAcidSequence object. When creating a CodonSequence, e.g.
scala> val seq = new Protein.CodonSequence("ID356","Describe gene 356","agctgaatc") seq: bio.Protein.CodonSequence = S*I scala> seq.toString String = S*I scala> seq.toDNA List[bio.DNA.NTSymbol] = List(a, g, c, t, g, a, a, t, c)
it stored the DNA sequence as a CodonAttribute to the AminoAcid. When you want to fetch the codon sequence of Attribute with the third Amino Acid, you can query for the codon information
scala> seq(2).getCodon List(a, t , c)
and even directly from the CodonSequence
scala> seq.getCodon(2) List(a, t , c)
Now we want to delete the middle codon:
scala> seq2 = seq.delete(1,1) seq: bio.Protein.CodonSequence = SI
We have another immutable CodonSequence object seq2, which contains a new sequence with matching amino acids and codons:
scala> seq.toString String = SI scala> seq.toDNA List[bio.DNA.NTSymbol] = List(a, g, c, a, t, c)
Alignment is a container for GappedSequence(s):
scala> val s1 = new DNA.GappedSequence("agc--taacg---") scala> val s2 = new DNA.GappedSequence("agc---aaca---") scala> val aln = new Alignment(List(s1,s2))
Each Sequence can have arbitrary attributes - see the description of Sequence above - but also at the alignment level you can have attributes. For example if you want to keep track of a predicted domain, an attribute derived of type AlignmentColumn may work.
scala> class Domain(first: Int,last: Int) extends AlignmentColumn(first,last) scala> val domain = new Domain(20,30) scala> val aln2 = aln.attrAdd(domain)
Note that Alignment is immutable, so we create a new version by adding an attribute.
Or if you want to highlight a cut-out of the alignment you may derive from AlignmentBlock.
Likewise, if you want to keep track of a removed column you may also consider creating an Alignment attribute based on AlignmentColumn. The featured GetCleanAlignment message would simply not generate output:
scala> class Removed(first: Int,last: Int) extends AlignmentColumn(first,last) scala> val aln3 = aln.attrAdd(new Removed(20,30)) scala> aln3.toText(GetCleanAlignment)
Thus, like Sequence feature attributes, you can handle arbitrary attributes in a flexible way using the same type of message paradigm that Sequence uses. Say you want to create HTML output of the alignment, where domains are highlighted, you can simply implement a GetDomainHTML message that gets sent to all attributes to respond with a highlight. E.g.
scala> println aln.toHTML(GetDomainHTML) // not yet implemented
One of the great features of Scala is that it runs on the JAVA virtual machine and creates JAVA byte code. In effect, the generated byte code is equal to that of JAVA generated byte code. This means you can deploy Scala with JAVA, and you can interact between the two languages. An example is in BioScala's Sequence translation, from ./src/main/scala/sequence/translate.scala
import org.biojava.bio.symbol._ import org.biojava.bio.seq._ import bio._
package bio { object SequenceTranslation { /** * Translate nucleotides to amini acids (will change to returning List) */ def translate(nucleotides: List[Nucleotide]): String = { val rna = RNATools.createRNA(nucleotides.mkString); val aa = RNATools.translate(rna); aa.seqString } } }
You can see we import BioJAVA classes, and call directly into BioJAVA's RNATools to create an RNA object, followed by translation to an amino acid string.
BioRuby compiles against JRuby on the JAVA Virtual Machine. Like, the description above, on BioJAVA, we can call directly into compiled Ruby byte code. For a BioRuby translation example, create a Ruby class like:
require 'java' require 'bio'
class RbSequence def translate(sequence, frame, codon_table) seq = Bio::Sequence::NA.new(sequence) seq.translate(frame,codon_table) end end
After compiling above with jrubyc it can be invoked from Scala
scala> val rbseq = new RbSequence scala> rbseq.translate("agtcat",1,1)
assuming you have the jruby-complete.jar file in the class path and BioRuby installed.
Giving a full description of BioScala is beyond the objective of this tutorial. It is worthwhile checking the tests in the source tree - these are written as specifications and self explanatory. See the files in ./src/test/scala/bio. For example:
"A DNA Sequence" should "allow multiple IDs" in { val s = new DNA.Sequence("ID456","Gene 456","agctaacg") val s2 = s.attrAdd(List(Id("GEO:456"),Id("Pubmed:456"))) s2.idList === (List("ID456","Geo:456","Pubmed:456")) }
another source of documentation is the information presented by the BioJAVA project. JAVA code maps easily to Scala, see the section on BioJAVA in this document.