Analyzing large scale genomic data on the cloud with Sparkhit.

نویسندگان

  • Liren Huang
  • Jan Krüger
  • Alexander Sczyrba
چکیده

Motivation The increasing amount of next-generation sequencing data poses a fundamental challenge on large scale genomic analytics. Existing tools use different distributed computational platforms to scale-out bioinformatics workloads. However, the scalability of these tools is not efficient. Moreover, they have heavy run time overheads when pre-processing large amounts of data. To address these limitations, we have developed Sparkhit: a distributed bioinformatics framework built on top of the Apache Spark platform. Results Sparkhit integrates a variety of analytical methods. It is implemented in the Spark extended MapReduce model. It runs 92 to 157 times faster than MetaSpark on metagenomic fragment recruitment and 18 to 32 times faster than Crossbow on data pre-processing. We analyzed 100 terabytes of data across four genomic projects in the cloud in 21 hours, which includes the run times of cluster deployment and data downloading. Furthermore, our application on the entire Human Microbiome Project shotgun sequencing data was completed in 2 hours, presenting an approach to easily associate large amounts of public datasets with reference data. Availability Sparkhit is freely available at: https://rhinempi.github.io/sparkhit/. Contact [email protected]. Supplementary information Supplementary data are available at Bioinformatics online.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

(Supplementary document) Analyzing large scale genomic data on the cloud with Sparkhit

(Supplementary document) Analyzing large scale genomic data on the cloud with Sparkhit Liren Huang 1,2,3, Jan Krüger 1,2 and Alexander Sczyrba 1,2,3⇤ 1Faculty of Technology, Bielefeld University, Bielefeld, 33615, Germany 2Center for Biotechnology – CeBiTec, Bielefeld University, Bielefeld, 33615, Germany 3Computational Methods for the Analysis of the Diversity and Dynamics of Genomes, Bielefel...

متن کامل

IMPACTS AND CHALLENGES OF CLOUD COMPUTING FOR SMALL AND MEDIUM SCALE BUSINESSES IN NIGERIA

Cloud computing technology is providing businesses, be it micro, small, medium, and large scale enterprises with the same level playing grounds. Small and Medium enterprises (SMEs) that have adopted the cloud are taking their businesses to greater heights with the competitive edge that cloud computing offers. The limitations faced by (SMEs) in procuring and maintaining IT infrastructures has be...

متن کامل

Cloudbreak: Accurate and Scalable Genomic Structural Variation Detection in the Cloud with MapReduce

The detection of genomic structural variations (SV) remains a difficult challenge in analyzing sequencing data, and the growing size and number of sequenced genomes have rendered SV detection a bona fide big data problem. MapReduce is a proven, scalable solution for distributed computing on huge data sets. We describe a conceptual framework for SV detection algorithms in MapReduce based on comp...

متن کامل

Cloud Computing Technology Algorithms Capabilities in Managing and Processing Big Data in Business Organizations: MapReduce, Hadoop, Parallel Programming

The objective of this study is to verify the importance of the capabilities of cloud computing services in managing and analyzing big data in business organizations because the rapid development in the use of information technology in general and network technology in particular, has led to the trend of many organizations to make their applications available for use via electronic platforms hos...

متن کامل

A Genetic Based Resource Management Algorithm Considering Energy Efficiency in Cloud Computing Systems

Cloud computing is a result of the continuing progress made in the areas of hardware, technologies related to the Internet, distributed computing and automated management. The Increasing demand has led to an increase in services resulting in the establishment of large-scale computing and data centers, in addition to high operating costs and huge amounts of electrical power consumption. Insuffic...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:
  • Bioinformatics

دوره   شماره 

صفحات  -

تاریخ انتشار 2017