BioJulia 2016 – The First Report

By: Julia Developers

Re-posted from: http://feedproxy.google.com/~r/JuliaLang/~3/12oUw98qs-Q/biojulia2016-mid

We are pleased to announce releasing
Bio.jl 0.4, a minor release including
significant functionality improvements as I promised in the previous blog
post
.

The following features are added since the post:

  • Online sequence search algorithms.
  • Sequence data structure for reference genomes.
  • Data reader and writer for the .2bit file format.
  • Data reader and writer for the SAM and BAM file formats.
  • Sequence demultiplexing tool.
  • Package to handle BGZF files.

And many other miscellaneous performance and usability improvements! Tutorial
notebooks are available at https://github.com/BioJulia/BioTutorials. Here I
briefly introduce you to these new features one by one.

Online sequence search algorithms

Sequence search is an indispensable tool in sequence analysis. Since the last
post, I have added exact, approximate and regex search algorithms. The search
interface of Bio.jl mimics that of Julia’s standard library.

julia> using Bio.Seq

julia> seq = dna"ACAGCGTAGCT"
11nt DNA Sequence:
ACAGCGTAGCT

# Exact search.
julia> search(seq, dna"AGCG")
3:6

# Approximate search with one error or less.
julia> approxsearch(seq, dna"AGGG", 1)
3:6

# Regular expression search.
julia> search(seq, biore"AGN*?G"d)
3:6

Sequence data structure for reference genomes

In Bio.jl DNA sequences are encoded using 4 bits per base by default in order to
store ambiguous nucleotides and this encoding does well in most cases. However,
some biological sequences such as chromosomal sequences are so long especially
for eukaryotic organisms and the default DNA sequences may result in a waste of
memory space. ReferenceSequence is a new type introduced in Bio.jl that
compresses positions of ambiguous nucleotides using a sparse bit vector. This
type can achieve almost 2-bit encoding in common reference sequences because
most of the ambiguous nucleotides are clustered in a sequence and the number of
them is small compared to other unambiguous nucleotides.

# Converting a DNASequence object to ReferenceSequence.
julia> ReferenceSequence(dna"ACGT"^10000)
40000nt Reference Sequence:
ACGTACGTACGTACGTACGTACGTACGTACGTACGTACG…CGTACGTACGTACGTACGTACGTACGTACGTACGTACGT

# Reading chromosome 1 of human from a FASTA file.
julia> open(first, FASTAReader{ReferenceSequence}, "hg38.fa")
Bio.Seq.SeqRecord{Bio.Seq.ReferenceSequence,Bio.Seq.FASTAMetadata}:
  name: chr1
  sequence: NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN…NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
  metadata: Bio.Seq.FASTAMetadata("")

julia> sequence(ans)
248956422nt Reference Sequence:
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN…NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

Data reader and writer for the 2bit file format

2bit is a binary file format to store reference sequences. This is a kind of
binary counterpart of FASTA but
specialized for DNA reference sequences to enable smaller file size and faster
loading. Reference sequences of various organisms are distributed from the
download page of UCSC
in this
file format. An important advantage of 2bit is that sequences are indexed by its
name and can be accessed immediately.

# Opening a sequence file of yeast (S.cerevisiae).
julia> reader = open(TwoBitReader, "sacCer3.2bit");

# Loading a chromosome VI using random access index.
julia> reader["chrVI"]
Bio.Seq.SeqRecord{Bio.Seq.ReferenceSequence,Array{UnitRange{Int64},1}}:
  name: chrVI
  sequence: GATCTCGCAAGTGCATTCCTAGACTTAATTCATATCTGC…GTGTGGTGTGTGGGTGTGGTGTGTGGGTGTGGTGTGTGG
  metadata: UnitRange{Int64}[]

Data reader and writer for the SAM and BAM file formats

The SAM and BAM file formats are designed for storing sequences aligned to
reference sequences. SAM is a line-oriented text file format and easy to handle
with UNIX command line tools. BAM is a compressed binary version of SAM and
suitable for storing data in disks and processing with purpose-built softwares
like samtools. The BAM data reader is carefully
tuned so that users can use it in real analysis with large files. It is also
feasible to read a CRAM file
combining the BAM reader and samtools view command.

An experimental feature is parallel processing using multiple threads.
Multi-threading support is introduced in Julia 0.5 and we use it to parallelize
decompression of BAM files. Here is a simple benchmark script to show how
much reading speed can be improved with multiple threads:

using Bio.Align

# Count the number of mapped records.
function countmapped(reader)
    ret = 0
    record = BAMRecord()
    while !eof(reader)
        # in-place reading
        read!(reader, record)
        if ismapped(record)
            ret += 1
        end
    end
    return ret
end

println(open(countmapped, BAMReader, ARGS[1]))

JULIA_NUM_THREADS environment variable controls the number of worker threads.
The result below shows that the elapsed time is almost halved using two threads:

~/.j/v/Bio $ time julia countmapped.jl SRR1238088.sort.bam
28040186
       29.27 real        28.64 user         0.66 sys
~/.j/v/Bio $ env JULIA_NUM_THREADS=2 time julia countmapped.jl SRR1238088.sort.bam
28040186
       17.40 real        32.31 user         0.63 sys

Package to handle BGZF files

BGZF (Blocked GZip Format) is a gzip-compliant file format commonly used in
bioinformatics. BGZF can be read using standard gzip tools but files in the
format are compressed block by block and special metadata are added to index the
compressed files for random access. BAM files are compressed in this file format
and sequence alignments in a specific genomic region can be retrieved
efficiently. BGZFStreams.jl is a new package to handle BGZF files like usual
I/O streams and it is built on top of our
Libz.jl package. Parallel decompression
mentioned above is implemented in this package layer.

julia> using BGZFStreams

julia> stream = BGZFStream("/Users/kenta/.julia/v0.5/BGZFStreams/test/bar.bgz")
BGZFStreams.BGZFStream{IOStream}(<mode=read>)

julia> readstring(stream)
"bar"

https://github.com/BioJulia/BGZFStreams.jl

Sequence demultiplexing tool

Sequence demultiplexing is a technique to distinguish the origin of a sequence
using its artificially-attached “barcode” sequence. This is often used at a
preprocessing phase after multiplexed
sequencing
,
a common technique to sequence multiple samples simultaneously. A barcode
sequence, however, may be corrupted due to sequencing error, and we need to find
the best matching barcode from a barcode set. The demultiplexer algorithm
implemented in Bio.jl is based on a trie-like data structure, and efficiently
finds the optimal barcode from the prefix of a given DNA sequence.

# Set DNA barcode pool.
julia> barcodes = DNASequence["ATGG", "CAGA", "GGAA", "TACG"];

# Create a sequence demultiplexer that allows one errors at most.
julia> dplxr = Demultiplexer(barcodes, n_max_errors=1, distance=:hamming)
Bio.Seq.Demultiplexer{Bio.Seq.BioSequence{Bio.Seq.DNAAlphabet{4}}}:
  distance: hamming
  number of barcodes: 4
  number of correctable errors: 1

# Demultiplex a given sequence from its prefix.
julia> demultiplex(dplxr, dna"ATGGCGNT")  # 1st barcode with no errors
(1,0)

julia> demultiplex(dplxr, dna"CAGGCGNT")  # 2nd barcode with one error
(2,1)

Next step

This is still the first half of my project this year. The next term will come
with:

  • Supporting more file formats including GFF3, VCF and BCF.
  • Integration with databases.
  • Integration with genome browsers.

And, of course, improving existing features of Bio.jl and other packages. We
welcome any contributions and feature requests from you all. Gitter chat
channel
is the best place to communicate
with developers and other users. If you love Julia and/or biology, any reason
not to join us?

Acknowledgements

I gratefully acknowledge the Moore Foundation and the Julia project for
supporting the BioJulia project. I also would like to thank Ben J.
Ward
and Kevin
Murray
for comments on my program code and other
contributions.