Running Genome Resequencing Applications
Genome resequencing in patients is an important step in the detection of mutation for congenital diseases. Traditionally, genomics software has been run on High Performance Computing (HPC) architectures. Hadoop* and MapReduce* technologies are slowly transforming the Life Sciences arena by allowing parallel read-mapping algorithms to scale effectively and resulting in shorter execution times and lower costs (from software execution and hardware). Michael Schatz (University of Maryland) and Ben Langmead (Johns Hopkins University) has introduced various software applications like Crossbow* into the Hadoop ecosystem, enabling gene resequencing to run on Hadoop clusters as well as on the cloud (Amazon Web Services via Elastic MapReduce* service). Crossbow provides a scalable software pipeline that can analyze over 35 times coverage of the human genome on a 10-node Hadoop cluster in about one day.
However, being open source, Hadoop seems less polished in some areas and can be difficult to manage in others. Innovative companies like Intel Corporation have started with the Apache Hadoop* Distribution and added components to it for better manageability and performance – optimized for Intel® Xeon® processors, in order to provide businesses with an open enterprise Hadoop platform for next generation analytics and life sciences, called the Intel® Distribution for Apache Hadoop* software.
In this paper, we demonstrate how to install and configure Crossbow and its required components – Bowtie, SOAPsnp and SRA toolkit within Intel® Hadoop Distribution.
Read the full Running Genome Resequencing Applications.