Halvade*
Halvade* is a MapReduce implementation of the best-practice DNA sequencing pipeline as recommended by the Broad Institute.
Parallel Efficiency Reaches 91 Percent¹
Post-sequencing DNA analysis typically consists of read mapping followed by variant calling. Especially for whole genome sequencing, this computational step is very time-consuming, even when using multithreading on a multi-core machine.
Halvade* is a framework that enables sequencing pipelines to be executed in parallel on a multi-node and/or multi-core compute infrastructure in a highly efficient manner. As an example, a DNA sequencing analysis pipeline for variant calling has been implemented according to the GATK* Best Practices recommendations, supporting both whole genome and whole exome sequencing. Halvade is implemented in Java and uses the Apache Hadoop* MapReduce 2.0 API. For example, it supports the Cloudera Hadoop* Distribution as well as Amazon EMR*.
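To make the MapReduce idea concrete, the following is a minimal, plain-Python sketch of the pattern applied to a "map reads, then call variants" workflow. It is not Halvade's code and does not use the Hadoop API: `map_read`, `shuffle`, `reduce_region`, `REGION_SIZE`, and the toy reads are all hypothetical, and keying the reads by genomic region is an assumption made for illustration only.

```python
from collections import defaultdict

REGION_SIZE = 1000  # hypothetical region width in base pairs


def map_read(read):
    """Map step: stand-in for aligning one read.

    A real pipeline would call an aligner here. This toy version assumes
    the read already carries its chromosome and position, and only
    derives a region key from them.
    """
    name, chrom, pos = read
    region = (chrom, pos // REGION_SIZE)
    return region, (name, pos)


def shuffle(pairs):
    """Group mapped values by key, as a MapReduce framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_region(region, aligned_reads):
    """Reduce step: stand-in for per-region variant calling.

    Here it only sorts the reads by position and reports how many there are.
    """
    ordered = sorted(aligned_reads, key=lambda item: item[1])
    return region, len(ordered), [name for name, _ in ordered]


# Toy input: (read name, chromosome, position)
reads = [
    ("r1", "chr1", 150),
    ("r2", "chr1", 1200),
    ("r3", "chr1", 90),
    ("r4", "chr2", 40),
]

grouped = shuffle(map_read(r) for r in reads)
results = [reduce_region(key, values) for key, values in sorted(grouped.items())]
for region, count, names in results:
    print(region, count, names)
```

Because each map call and each reduce call touches only its own input, both phases can be spread over many tasks, which is the property a multi-node framework relies on.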
Performance Results
Using a 15-node computer cluster with 360 CPU cores in total, Halvade processes the NA12878 dataset (human, 100 bp paired-end reads, 50x coverage) in less than 3 hours with high parallel efficiency¹.
The speed-up curve shows almost linear scaling: performance improves as more Hadoop tasks are added. Here, each task uses six physical Intel® Xeon® CPU cores, which amounts to 12 hardware threads per Hadoop task. The efficiency curve shows the same result: with 360 cores in total, parallel efficiency is 91.1 percent, indicating that the available resources are used effectively.
Without Halvade, the same pipeline would run for an estimated 288 hours (ca. 12 days) on a single node. Even with multithreading enabled within the tools that support it, a runtime of 120 hours (ca. 5 days) was measured. With Halvade, the runtime is reduced to less than 3 hours on a 15-node Intel® Xeon® CPU cluster running the Cloudera Hadoop* Distribution. Even on a single node, Halvade runs the whole pipeline in 48 hours (ca. 2 days).
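The runtimes quoted above translate into speed-up factors with simple division. The sketch below uses only the hour figures from the text; the dictionary labels are informal, and reading the 48-hour figure as Halvade on one node is an interpretation of the paragraph above. Because the cluster run took less than 3 hours, the resulting factors are lower bounds based on rounded numbers.

```python
# Runtimes in hours, as quoted in the text.
runtimes_h = {
    "single node, no multithreading (estimated)": 288,
    "single node, multithreaded tools": 120,
    "single node, Halvade": 48,  # assumption: the 48 h figure is Halvade on one node
}

# Upper bound for the 15-node run ("less than 3 hours").
cluster_h = 3

# Speed-up = baseline runtime / cluster runtime.
speedups = {label: hours / cluster_h for label, hours in runtimes_h.items()}

for label, factor in speedups.items():
    print(f"{label}: at least {factor:.0f}x faster on the 15-node cluster")
```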
Publications
Dries Decap, Joke Reumers, Charlotte Herzeel, Pascal Costanza, and Jan Fostier. "Halvade: scalable sequence analysis with MapReduce." Bioinformatics (2015) 31 (15): 2482-2488, first published online March 26, 2015.
More Information
Configuration Table
System Overview | |
---|---|
Nodes | 15 nodes, with 64 GB RAM each |
Processor | In total: 30 Intel® Xeon® E5-2695 v2 CPUs @ 2.40 GHz each |
Cores | In total: 360 physical cores (720 threads) |
RAM | In total: 960 GB RAM |
Apache Hadoop* Distribution | Cloudera version 5.0.1b |
Tasks per Node | 4 tasks per node, each task using 6 physical cores (12 threads) |
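The totals in the configuration table follow from the per-node figures, which a few lines of arithmetic can confirm. All input values below are taken from the table; the variable names are illustrative, and the RAM-per-task figure assumes an even split of node memory across tasks, which the table does not state.

```python
# Values taken from the configuration table.
nodes = 15
ram_per_node_gb = 64
cpus_total = 30
cores_total = 360
tasks_per_node = 4
cores_per_task = 6
threads_per_task = 12

# Derived quantities.
total_ram_gb = nodes * ram_per_node_gb            # total cluster memory
cpus_per_node = cpus_total // nodes               # CPU sockets per node
cores_per_cpu = cores_total // cpus_total         # physical cores per CPU
cores_per_node = tasks_per_node * cores_per_task  # cores used by tasks on one node
total_tasks = nodes * tasks_per_node              # concurrent Hadoop tasks
total_threads = total_tasks * threads_per_task    # hardware threads in use

# Assumption: node memory is divided evenly among its tasks.
ram_per_task_gb = ram_per_node_gb / tasks_per_node

print(f"Total RAM: {total_ram_gb} GB")
print(f"CPUs per node: {cpus_per_node}, cores per CPU: {cores_per_cpu}")
print(f"Cores per node: {cores_per_node}, cluster-wide: {cores_per_node * nodes}")
print(f"Concurrent tasks: {total_tasks}, hardware threads: {total_threads}")
print(f"RAM per task (even split): {ram_per_task_gb} GB")
```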
Genomics Codes
BWA-ALN* 0.5.10
A popular software package for mapping low-divergent sequences against a large reference genome, such as the human genome.
MPI-HMMER* v2.3
An open-source implementation of the HMMER* protein sequence analysis suite.
BLASTn*/BLASTp*
An algorithm for comparing primary biological sequence information.
GATK*
A software package developed at the Broad Institute to analyze next-generation sequencing data.
QIAGEN
QIAGEN Bioinformatics* solutions deliver faster time to insight by combining powerful analytics that are able to interpret complex biological processes.
Halvade*
Halvade* is a MapReduce implementation of the best-practice DNA sequencing pipeline as recommended by the Broad Institute.
ABySS*
ABySS* is an open-source de novo genome assembler for short paired-end reads.
DIDA*
DIDA* performs large-scale alignment tasks by distributing the indexing and alignment stages into smaller subtasks over a cluster of compute nodes.
elPrep*
elPrep* is a high-performance tool for preparing SAM/BAM/CRAM files for variant calling in genomic sequencing pipelines.
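Product and Performance Information
Benchmark results were obtained before recent software patches and firmware updates addressing the "Spectre" and "Meltdown" security vulnerabilities were applied. Applying these updates may make these results inapplicable to your device or system.
Performance varies by use, configuration, and other factors. Learn more at www.Intel.com.tw/PerformanceIndex.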