The Hunt for Genomic Dark Matter: Aligning Non-coding Functional DNA
Abstract
Motivation: Statistical analysis of conservation patterns in the human and mouse genomes suggests that as much as 5% of human genomic DNA is under purifying selection. Known genes account for 1.5%; identifying and characterising the remaining 3.5% is still a major challenge. As more genomes become available, an intuitive way to detect conservation patterns is to align the human genome with the genomes of other species such as mouse or rat. Aligning non-coding DNA is harder than aligning coding DNA because sequence divergence is relatively high and the absence of codon structure allows a higher incidence of insertions and deletions. So far, it has been difficult to assess or compare the performance of alignment algorithms because no suitable gold standards or evaluation procedures have been proposed.
Results: We propose an objective measure of the alignment accuracy of non-coding DNA called "Gap Attraction". "Gap Attraction" measures the proportion of ungaps that have been misaligned, where an ungap is the conserved region between two random neighbouring indel events. The measure is derived from a model which assumes that insertions and deletions (indels) fall on the genome independently of each other and uniformly along the sequence. It follows that the length of ungaps is geometrically distributed. The data verify this hypothesis for the alignment of human chromosome 21, and then of the whole human genome, against the mouse genome: the histogram counts for ungaps of medium length lie within the 95% confidence intervals. "Gap Attraction" does not require knowledge of the true alignment. We measured the "Gap Attraction" index for two widely used alignment algorithms: Blastz, which was developed specifically for aligning human and mouse DNA, and Clustalw, a global alignment algorithm for pairwise protein and DNA sequences. As expected, Blastz performs better than Clustalw according to our evaluation measure.
Contacts: [email protected], [email protected]
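The geometric model of ungap lengths described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation and it does not compute the "Gap Attraction" index, whose formula the abstract does not give. The function names (`ungap_lengths`, `fit_geometric`, `expected_counts`), the `-` gap character, and the toy alignment are all assumptions made for illustration. The sketch extracts ungaps flanked by gaps on both sides from a pairwise alignment, estimates the geometric parameter by maximum likelihood, and lists observed against expected histogram counts.

```python
from collections import Counter


def ungap_lengths(row_a, row_b, gap="-"):
    """Lengths of gap-free runs that lie between two gapped columns.

    Leading and trailing runs are dropped because they are not bounded
    by an indel on both sides. Zero-length runs (adjacent gapped
    columns) are ignored.
    """
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have equal length")
    lengths = []
    run = 0
    seen_gap = False  # becomes True once the first gapped column is seen
    for a, b in zip(row_a, row_b):
        if a == gap or b == gap:
            if run and seen_gap:
                lengths.append(run)
            run = 0
            seen_gap = True
        else:
            run += 1
    return lengths  # the trailing run is never closed by a gap, so it is dropped


def fit_geometric(lengths):
    """Maximum-likelihood p for P(L = k) = (1 - p)**(k - 1) * p, k >= 1."""
    if not lengths:
        raise ValueError("need at least one ungap")
    return len(lengths) / sum(lengths)


def expected_counts(p, n, max_len):
    """Expected histogram counts for n ungaps under the geometric model."""
    return {k: n * (1 - p) ** (k - 1) * p for k in range(1, max_len + 1)}


if __name__ == "__main__":
    # Hypothetical toy alignment, far too short for a real test of the model.
    row_a = "ACGT-ACG--ACGTA-C"
    row_b = "ACGTTAC-GTACGTAAC"
    lengths = ungap_lengths(row_a, row_b)
    p_hat = fit_geometric(lengths)
    observed = Counter(lengths)
    expected = expected_counts(p_hat, len(lengths), max(lengths))
    for k in sorted(expected):
        print(k, observed.get(k, 0), round(expected[k], 3))
```

On a real genome-scale alignment, the observed counts would be compared with the expected counts length by length; the abstract reports that counts for ungaps of medium length fall within the 95% confidence intervals, which this sketch does not compute.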
Similar resources
Genomic “Dark Matter”: Implications for Understanding Human Disease Mechanisms, Diagnostics, and Cures
What is genomic “dark matter?” The realization that protein-coding genes use only a tiny fraction of the three billion base pairs that make up the human genome has given birth to perhaps the largest and most persistent question in modern genetics: of what use, if any, is the vast non-coding sequences that we all carry in each of our cells. Is it really non-functional “junk” DNA as referred to b...
Long non-coding RNAs and their significance in human diseases
Protein-coding genes account for only a small fraction of the human genome and most of the genomic sequences are transcriptionally silent, but recent observations indicate significant functional elements, including non-coding protein transcripts in the human genome. Long non-coding RNAs (lncRNAs) have been defined as transcripts of >200 nucleotides without protein-coding capacity that perform t...
Evolutionary conservation and functional roles of ncRNA
Non-coding RNAs (ncRNAs) are a class of transcribed RNA molecules without protein-coding potential. They were regarded as transcriptional noise, or the byproduct of genetic information flow from DNA to protein for a long time. However, in recent years, a number of studies have shown that ncRNAs are pervasively transcribed, and most of them show evidence of evolutionary conservation, although le...
Non-coding DNA programs express adaptation and its universal law
Mark Ya. Azbel’ School of Physics and Astronomy, Tel-Aviv University, Ramat Aviv, 69978 Tel Aviv, Israel Summary Significant fraction (98.5% in humans) of most animal genomes is noncoding “dark matter”. Its largely unknown function (1-5) is related to programming (rather than to spontaneous mutations) of accurate adaptation to rapidly changing environment. Programmed adaptation to the same univ...
Deep Investigation of Arabidopsis thaliana Junk DNA Reveals a Continuum between Repetitive Elements and Genomic Dark Matter
Eukaryotic genomes contain highly variable amounts of DNA with no apparent function. This so-called junk DNA is composed of two components: repeated and repeat-derived sequences (together referred to as the repeatome), and non-annotated sequences also known as genomic dark matter. Because of their high duplication rates as compared to other genomic features, transposable elements are predominan...
Publication date: 2004