Boiler: lossy compression of RNA-seq alignments using coverage vectors

Authors

  • Jacob Pritt
  • Ben Langmead
Abstract

We describe Boiler, a new software tool for compressing and querying large collections of RNA-seq alignments. Boiler discards most per-read data, keeping only a genomic coverage vector plus a few empirical distributions summarizing the alignments. Since most per-read data is discarded, storage footprint is often much smaller than that achieved by other compression tools. Despite this, the most relevant per-read data can be recovered; we show that Boiler compression has only a slight negative impact on results given by downstream tools for isoform assembly and quantification. Boiler also allows the user to pose fast and useful queries without decompressing the entire file. Boiler is free open source software available from github.com/jpritt/boiler.
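The central idea in the abstract, replacing per-read records with a coverage vector plus summary distributions, can be illustrated with a small sketch. This is not Boiler's actual code or file format: the function names, the representation of alignments as unspliced half-open `(start, end)` intervals, and the run-length encoding step are all assumptions made for illustration.

```python
from collections import Counter
from itertools import groupby


def coverage_vector(alignments, genome_length):
    """Per-base read depth from (start, end) half-open, 0-based intervals.

    Hypothetical helper: real RNA-seq alignments may be spliced, which
    this sketch ignores.
    """
    # Difference array: +1 where a read starts, -1 just past where it ends.
    diff = [0] * (genome_length + 1)
    for start, end in alignments:
        diff[start] += 1
        diff[end] -= 1
    # A prefix sum of the difference array gives the depth at each position.
    cov, depth = [], 0
    for d in diff[:genome_length]:
        depth += d
        cov.append(depth)
    return cov


def run_length_encode(values):
    """Collapse runs of equal depth into (value, run_length) pairs."""
    return [(v, sum(1 for _ in group)) for v, group in groupby(values)]


def read_length_distribution(alignments):
    """Empirical distribution of aligned read lengths (length -> count)."""
    return Counter(end - start for start, end in alignments)


# Toy input: four reads on a 10-base reference.
alignments = [(0, 4), (2, 6), (2, 6), (8, 10)]
cov = coverage_vector(alignments, 10)
rle = run_length_encode(cov)
lengths = read_length_distribution(alignments)
print(cov)      # [1, 1, 3, 3, 2, 2, 0, 0, 1, 1]
print(rle)      # [(1, 2), (3, 2), (2, 2), (0, 2), (1, 2)]
print(lengths)
```

After this step the individual read boundaries are gone: only the depth profile and the length histogram remain, which is why such a scheme is lossy and why any reads recovered from it are a plausible reconstruction rather than the original set.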

Similar resources

ChIPWig: a random access-enabling lossless and lossy compression method for ChIP-seq data

Motivation: Chromatin immunoprecipitation sequencing (ChIP-seq) experiments are inexpensive and time-efficient, and result in massive datasets that introduce significant storage and maintenance challenges. To address the resulting Big Data problems, we propose a lossless and lossy compression framework specifically designed for ChIP-seq Wig data, termed ChIPWig. ChIPWig enables random access, su...

Fast Indexing of Lattice Vectors for Image Compression

Visual communication is becoming increasingly important, with applications in areas such as multimedia, communication, data transmission, storage of remote sensing and satellite images, education, and medicine. Image data occupies a large amount of space. Meeting bandwidth requirements while maintaining acceptable image quality is a challenge. Hence image compression is requ...

RSeQC: quality control of RNA-seq experiments

Motivation: RNA-seq has been extensively used for transcriptome study. Quality control (QC) is critical to ensure that RNA-seq data are of high quality and suitable for subsequent analyses. However, QC is a time-consuming and complex task, due to the massive size and versatile nature of RNA-seq data. Therefore, a convenient and comprehensive QC tool to assess RNA-seq quality is sorely needed. ...

Rail-RNA: scalable analysis of RNA-seq splicing and coverage

Motivation: RNA sequencing (RNA-seq) experiments now span hundreds to thousands of samples. Current spliced alignment software is designed to analyze each sample separately. Consequently, no information is gained from analyzing multiple samples together, and it requires extra work to obtain analysis products that incorporate data from across samples. Results: We describe Rail-RNA, a cloud-enabl...

Working with aligned nucleotides (WORK-IN-PROGRESS!)

This vignette belongs to the GenomicAlignments package. It illustrates how to use the package for working with the nucleotide content of aligned reads. After the reads generated by a high-throughput sequencing experiment have been aligned to a reference genome, the questions that are being asked about these alignments typically fall in two broad categories: positional only and nucleotide-related...

Volume 44

Publication date: 2016