[1912] Alignment in a SNAP: Cancer Diagnosis in the Genomic Age

Matei Zaharia, Bill Bolosky, Kristal Curtis, David Patterson, Armando Fox, Scott Shenker, Ion Stoica, Taylor Sittler. UCSF, San Francisco, CA; UC Berkeley, Berkeley, CA; Microsoft, Redmond, WA

Background: As the cost of DNA sequencing continues to drop at a pace exceeding that of Moore's Law, there is a growing need for tools that can efficiently analyze ever larger bodies of sequence data. The $1000 genome is estimated to arrive by mid-2013. At that price, sequencing a person's genome enters the realm of routine clinical practice, and each cancer patient is expected to have both their own genome and their cancer's genome sequenced. Assembling and interpreting this information from the massive numbers of short reads generated by current sequencing machines requires significant technological advancement. Here, we address the first step in interpreting a cancer genome from raw sequence information: sequence alignment.
Design: We tested SNAP (Scalable Nucleotide Alignment Package) against the most popular short read aligners, including BWA, Bowtie, and SOAP. Trials used reads generated from the hg19 build of the human genome with simulated mutations, insertions, and deletions. Additional trials demonstrating superior performance on longer reads and on actual whole genome sequencing data sets will be presented at the conference.
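The read simulation described above can be illustrated with a minimal Python sketch. This is not the simulator used in the trials: the function name `simulate_read`, the error rates, and the windowing strategy are all assumptions chosen only to show how substitutions, insertions, and deletions might be injected into reads drawn from a reference sequence.

```python
import random

def simulate_read(reference, read_len=125, sub_rate=0.02, indel_rate=0.002, rng=None):
    """Draw one read from `reference` and inject random edits.

    Hypothetical helper: the rates are illustrative, not the values used in the abstract.
    Returns (true_start, read) so an aligner's answer can be checked against true_start.
    """
    rng = rng or random.Random()
    # Use a window twice the read length so deletions do not leave the read short.
    start = rng.randrange(len(reference) - 2 * read_len + 1)
    window = reference[start:start + 2 * read_len]
    out = []
    i = 0
    while len(out) < read_len and i < len(window):
        r = rng.random()
        base = window[i]
        if r < sub_rate:
            # Substitution: emit a different base and consume the reference base.
            out.append(rng.choice([b for b in "ACGT" if b != base]))
            i += 1
        elif r < sub_rate + indel_rate:
            # Insertion: emit an extra base without consuming the reference.
            out.append(rng.choice("ACGT"))
        elif r < sub_rate + 2 * indel_rate:
            # Deletion: consume the reference base without emitting anything.
            i += 1
        else:
            # No edit: copy the reference base.
            out.append(base)
            i += 1
    return start, "".join(out)

# Example: a random stand-in for a reference (a real trial would load hg19).
demo_rng = random.Random(0)
demo_ref = "".join(demo_rng.choice("ACGT") for _ in range(10_000))
demo_start, demo_read = simulate_read(demo_ref, rng=demo_rng)
```

Keeping the true start position alongside each read is what makes it possible to score an aligner's accuracy and false positive rate on simulated data.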
Results: SNAP significantly outperforms existing aligners in terms of speed while achieving higher accuracy.

Comparison of Aligners using 125bp Simulated Single End Reads
Aligner   Seconds per Million Reads   Accuracy (%)   False Positive (%)
bowtie*   1966                        88             0.07
BWA*      3021                        93             0.05
MAQ*      17506                       92.7           0.08
SOAP2*    555                         91.5           0.17
SNAP      10                          94             0.05
* These numbers were previously published in [Li et al. Bioinformatics Vol. 25 no. 14 2009, pages 1754–1760]
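As a rough check on the speed comparison, the per-aligner speedups can be computed directly from the table above. The sketch below assumes each row reads as (seconds per million reads, accuracy %, false positive %); the dictionary layout and the `speedup_over` helper are illustrative, not part of SNAP.

```python
# Rows as read from the table above (an assumption about how the columns split):
# (seconds per million reads, accuracy %, false positive %)
results = {
    "bowtie": (1966, 88.0, 0.07),
    "BWA": (3021, 93.0, 0.05),
    "MAQ": (17506, 92.7, 0.08),
    "SOAP2": (555, 91.5, 0.17),
    "SNAP": (10, 94.0, 0.05),
}

def speedup_over(baseline, candidate="SNAP", table=results):
    """Ratio of the baseline's alignment time to the candidate's alignment time."""
    return table[baseline][0] / table[candidate][0]

# Speedup of SNAP relative to every other aligner in the table.
speedups = {name: speedup_over(name) for name in results if name != "SNAP"}
for name, ratio in speedups.items():
    print(f"{name}: {ratio:.1f}x")
```

Under that reading of the table, the ratios range from roughly 55x (SOAP2) to roughly 1750x (MAQ), with BWA and bowtie in the low hundreds.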


Conclusions: Currently, aligning a single genome takes roughly 1000 processor hours. We present SNAP, a new algorithm and software package capable of aligning a genomic dataset of up to 3 billion 100bp reads in 1 hour on a machine rented from Amazon for $2. This is a 100X improvement over current technologies, with greater accuracy, higher error tolerance, and better performance on longer read lengths, making the package compatible with upcoming developments in sequencing technology. Additionally, SNAP can align against a consensus of genomes rather than a single sequence, allowing it to discriminate more effectively between hereditary sequence variation and somatic mutations. Using SNAP, we can begin to realize the benefits of large sequencing projects such as The Cancer Genome Atlas (TCGA), and to translate their results into personalized therapeutic recommendations for each patient.
Category: Special Category - Pan-genomic/Pan-proteomic approaches to Cancer

Tuesday, March 20, 2012 1:15 PM

Platform Session: Section H1, Tuesday Afternoon

 
