Assembly by Short Sequences* (ABySS*)

ABySS* is an open-source de novo genome assembler for short paired-end reads.

Wall Clock Time Sees 4X Improvement¹

The Michael Smith Genome Sciences Centre at the BC Cancer Agency faced two challenges: reducing the execution time of ABySS, its parallel de novo genome assembler, and reducing the memory requirements of general alignment tools such as BWA, Bowtie, Novoalign, and ABySS-map. Intel worked with the agency to improve parallelization in ABySS version 1.9.0.

ABySS stands out for its ability to scale to large genomes, which comes from its Message Passing Interface (MPI)-based implementation of the de Bruijn graph assembly algorithm. The single-processor version is useful for assembling genomes up to 100 Mbases in size. The relevant code optimizations are included and enabled by default in ABySS* 1.9.0.
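To make the de Bruijn graph idea concrete, here is a minimal single-process sketch in Python. It is not ABySS's implementation: the names `build_de_bruijn` and `owner_rank`, the toy reads, and the CRC-based partitioning rule are all assumptions chosen for illustration. It shows k-mers as nodes, edges between k-mers that overlap by k-1 bases, and one plausible way a distributed assembler could decide which process owns a given k-mer.

```python
from collections import defaultdict
import zlib


def kmers(read, k):
    """Yield every k-length substring of a read, left to right."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def build_de_bruijn(reads, k):
    """Map each k-mer to the set of k-mers observed directly after it.

    Consecutive k-mers in a read overlap by k-1 bases, so each
    (previous, current) pair is an edge of the graph.
    """
    graph = defaultdict(set)
    for read in reads:
        previous = None
        for kmer in kmers(read, k):
            graph[kmer]  # touch the key so nodes without successors still exist
            if previous is not None:
                graph[previous].add(kmer)
            previous = kmer
    return dict(graph)


def owner_rank(kmer, n_ranks):
    """Hypothetical partitioning rule: a stable hash of the k-mer picks its owner."""
    return zlib.crc32(kmer.encode("ascii")) % n_ranks


reads = ["ATGGCGT", "GGCGTGC"]  # toy reads, not real sequencing data
graph = build_de_bruijn(reads, k=4)
for kmer in sorted(graph):
    print(kmer, "->", sorted(graph[kmer]), "owned by rank", owner_rank(kmer, 8))
```

Because the owner of a k-mer is computed from its sequence alone, any process can work out where a neighbouring k-mer lives without a lookup table, which is the property that lets the graph be spread across the memory of several nodes.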

Performance Results

ABySS 1.3.5, the baseline version, required 25 hours to perform a human genome assembly. The optimized version, ABySS 1.9.0, took only 6 hours of wall clock time to assemble the same genome when run on multiple processors, with the input file split to take further advantage of that parallelism. This is roughly a 4X improvement over the baseline version on the same data set¹.
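The 4X figure follows directly from the two wall clock times quoted above. As a quick check, using only those two numbers (the variable names are illustrative):

```python
baseline_hours = 25   # ABySS 1.3.5, as reported above
optimized_hours = 6   # ABySS 1.9.0 with split input, as reported above

speedup = baseline_hours / optimized_hours
time_saved_fraction = 1 - optimized_hours / baseline_hours

print(f"speedup: {speedup:.2f}x")                                # speedup: 4.17x
print(f"wall clock time reduced by {time_saved_fraction:.0%}")   # wall clock time reduced by 76%
```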

Wall clock times for the main genome assembly stage of the ABySS pipeline, using a human genome read dataset (NA12878), are shown in the figure to the right. The leftmost bar is the baseline run time before optimization. The middle bar is the run time of the optimized version with all data contained in a single, monolithic input file. The rightmost bar shows the combined effect of the code optimizations and splitting the input file into 10 equal-sized parts.
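The benchmark split a BAM file; the sketch below illustrates the same idea on a gzipped FASTQ file using only the Python standard library. It is an assumption-laden illustration, not the recipe used for the measurements: the function names, the output naming scheme, and the round-robin strategy are hypothetical, and `group_size=2` assumes interleaved paired-end reads so that mates land in the same part.

```python
import gzip
import itertools


def fastq_records(lines):
    """Group an iterable of FASTQ lines into 4-line records."""
    iterator = iter(lines)
    while True:
        record = list(itertools.islice(iterator, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield record


def split_fastq(in_path, out_prefix, n_parts=10, group_size=2):
    """Stream a gzipped FASTQ file into n_parts gzipped files, round-robin.

    Records are dealt out in groups of group_size, so each part receives
    an approximately equal share. Returns the list of output paths.
    """
    out_paths = [f"{out_prefix}.part{i:02d}.fastq.gz" for i in range(n_parts)]
    outputs = [gzip.open(path, "wt") for path in out_paths]
    try:
        with gzip.open(in_path, "rt") as handle:
            for index, record in enumerate(fastq_records(handle)):
                outputs[(index // group_size) % n_parts].writelines(record)
    finally:
        for out in outputs:
            out.close()
    return out_paths
```

A call such as `split_fastq("reads.fastq.gz", "reads", n_parts=10)` (file names hypothetical) would produce ten parts that can then be read concurrently. Streaming record by record keeps memory use flat, which matters for inputs in the hundreds of gigabytes.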


Related Codes

Distributed Indexing Dispatched Alignment* (DIDA*)

Publications

J. T. Simpson, K. Wong, S. D. Jackman, J. E. Schein, S. J. Jones, and I. Birol. "ABySS: A Parallel Assembler for Short Read Sequence Data." Genome Research 19, no. 6 (2009): 1117-1123. doi:10.1101/gr.089532.108.

İnanç Birol, Shaun D. Jackman, Cydney Nielsen, Jenny Q. Qian, Richard Varhol, Greg Stazyk, Ryan D. Morin, Yongjun Zhao, Martin Hirst, Jacqueline E. Schein, Doug E. Horsman, Joseph M. Connors, Randy D. Gascoyne, Marco A. Marra, and Steven J. M. Jones. "De Novo Transcriptome Assembly with ABySS." Bioinformatics 25, no. 21 (2009): 2872-2877. doi:10.1093/bioinformatics/btp367.

Gordon Robertson, Jacqueline Schein, Readman Chiu, Richard Corbett, Matthew Field, Shaun D. Jackman, Karen Mungall, Sam Lee, Hisanaga Mark Okada, Jenny Q. Qian, Malachi Griffith, Anthony Raymond, Nina Thiessen, Timothee Cezard, Yaron S. Butterfield, Richard Newsome, Simon K. Chan, Rong She, Richard Varhol, Baljit Kamoh, Anna-Liisa Prabhu, Angela Tam, YongJun Zhao, Richard A. Moore, Martin Hirst, Marco A. Marra, Steven J. M. Jones, Pamela A. Hoodless, and İnanç Birol. "De Novo Assembly and Analysis of RNA-seq Data." Nature Methods (10 October 2010).

Configuration Table

System Overview

 

Nodes: Eight HPC nodes interconnected by 40 Gbps InfiniBand
Processor: Two Intel® Xeon® X5650 processors (2.67 GHz) per node
RAM: 48 GB per node
Operating System: CentOS 5.4, with Intel® Cluster Studio 2013
Baseline: ABySS version 1.3.5
Optimized: ABySS version 1.9.0

Input dataset: Subset of the following BAM file (272GB)

Input data were split into 10 approximately equal-sized BAM files. Equivalent gzipped FASTQ files should perform equally well.

Data subset: The data subset corresponds to the following eight lane IDs:

1. 20FUKAAXX100202_1
2. 20FUKAAXX100202_2
3. 20FUKAAXX100202_3
4. 20FUKAAXX100202_4
5. 20FUKAAXX100202_5
6. 20FUKAAXX100202_6
7. 20FUKAAXX100202_7
8. 20FUKAAXX100202_8

Product and Performance Information

1. Benchmark results were obtained before the deployment of recent software patches and firmware updates intended to address the security vulnerabilities known as "Spectre" and "Meltdown". Installing these updates may make these results inapplicable to your device or system.

Software and workloads used in performance tests may have been optimized only for Intel® microprocessors. Performance tests such as SYSmark* and MobileMark* are measured using specific configurations, components, software, operations, and functions. Results may vary depending on these factors. For a complete evaluation of a product, you should consult other tests and other sources of information, in particular to learn how the product behaves when combined with other components. For more information, see http://www.intel.fr/benchmarks.