Statistical data mining for symbol associations in genomic databases
Abstract
A methodology is proposed to automatically detect significant symbol associations in genomic databases. A new statistical test assesses the significance of a group of symbols found together in several genesets of a given database. Applied to symbol pairs, the thresholded p-values of the test define a graph structure on the set of symbols. The cliques of that graph are significant symbol associations, each linked to the set of genesets in which it can be found. The method can be applied to any database, and is illustrated on the MSigDB C2 database. Many of the symbol associations detected in C2 or in non-specific selections corresponded to already known interactions. On more specific selections of C2, many previously unknown symbol associations were detected. These associations reveal new candidates for gene or protein interactions, which require further investigation to establish biological evidence.
Background
Large-scale genomic databases have been developed for over a decade as catalogs of genesets [1,2]; a geneset is a list of genes/proteins whose expression level was found to be associated with some biological process, cellular component, metabolic function, type of cancer, etc. Examples include the KEGG database [3],
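The pipeline summarized in the abstract (a pairwise test, thresholded p-values defining a graph, cliques read off as associations) can be sketched roughly as follows. This is only an illustration under stated assumptions: the paper's own statistical test is not given in this excerpt, so an upper-tail hypergeometric co-occurrence test is swapped in as a stand-in, and the function names, the toy genesets, and the threshold `alpha = 0.05` are all hypothetical.

```python
from itertools import combinations
from math import comb


def cooccurrence_pvalue(n_sets, k_a, k_b, k_ab):
    """Stand-in test (NOT the paper's): hypergeometric P(X >= k_ab), where
    X is the number of genesets shared by two symbols that occur in k_a and
    k_b of the n_sets genesets, under independent random placement."""
    total = comb(n_sets, k_b)
    tail = sum(
        comb(k_a, x) * comb(n_sets - k_a, k_b - x)
        for x in range(k_ab, min(k_a, k_b) + 1)
    )
    return tail / total


def build_graph(genesets, alpha):
    """Link two symbols when their co-occurrence p-value is <= alpha."""
    members = {}  # symbol -> indices of the genesets containing it
    for index, geneset in enumerate(genesets):
        for symbol in geneset:
            members.setdefault(symbol, set()).add(index)
    graph = {symbol: set() for symbol in members}
    for a, b in combinations(sorted(members), 2):
        shared = len(members[a] & members[b])
        if shared == 0:
            continue
        p = cooccurrence_pvalue(len(genesets), len(members[a]), len(members[b]), shared)
        if p <= alpha:
            graph[a].add(b)
            graph[b].add(a)
    return graph, members


def maximal_cliques(graph):
    """Bron-Kerbosch enumeration of maximal cliques with at least 2 symbols."""
    found = []

    def expand(current, candidates, excluded):
        if not candidates and not excluded:
            if len(current) > 1:
                found.append(sorted(current))
            return
        for v in list(candidates):
            expand(current | {v}, candidates & graph[v], excluded & graph[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand(set(), set(graph), set())
    return sorted(found)


def supporting_sets(members, clique):
    """Indices of the genesets that contain every symbol of the clique."""
    return set.intersection(*(members[s] for s in clique))


# Toy database with placeholder symbols (hypothetical, not MSigDB data).
genesets = [
    {"A", "B", "C"}, {"A", "B", "C"}, {"A", "B", "C", "D"},
    {"D", "E"}, {"E", "F"}, {"D", "F"}, {"G"}, {"H"},
]
graph, members = build_graph(genesets, alpha=0.05)
cliques = maximal_cliques(graph)
for clique in cliques:
    print(clique, "found in genesets", sorted(supporting_sets(members, clique)))
```

On this toy input the only clique is A, B, C, supported by the first three genesets. In a real application the number of symbol pairs is large, so the threshold would normally need a multiple-testing correction; the sketch leaves that out.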
Similar resources
Discovering Matrix Attachment Regions (MARs) in Genomic Databases
Lately, there has been considerable interest in applying Data Mining techniques to scientific and data analysis problems in bioinformatics. Data mining research is being fueled by novel application areas that are helping the development of newer applied algorithms in the field of bioinformatics, an emerging discipline representing the integration of biological and information sciences. This is ...
Clinic-Genomic Association Mining for Colorectal Cancer Using Publicly Available Datasets
In recent years, a growing number of researchers have begun to focus on how to establish associations between clinical and genomic data. However, up to now, there is a lack of research mining clinic-genomic associations by comprehensively analysing available gene expression data for a single disease. Colorectal cancer is one of the malignant tumours. A number of genetic syndromes have been proven to b...
Mining Associations for Organism Characteristics in Prokaryotes - an Integrative Approach
Correlations and associations between specific organism characteristics (such as genome size, genome GC content, optimal growth temperature, habitat, oxygen requirements) may provide a deeper comprehension of evolutionary processes as well as some prediction possibilities, e.g., predicting trends of some pandemics. There is plenty of genotype data and gene sequences for different organis...
BioMart and Bioconductor: a powerful link between biological databases and microarray data analysis
biomaRt is a new Bioconductor package that integrates BioMart data resources with data analysis software in Bioconductor. It can annotate a wide range of gene or gene product identifiers (e.g. Entrez-Gene and Affymetrix probe identifiers) with information such as gene symbol, chromosomal coordinates, Gene Ontology and OMIM annotation. Furthermore biomaRt enables retrieval of genomic sequences a...
Application of Rough Set Theory in Data Mining for Decision Support Systems (DSSs)
Decision support systems (DSSs) are prevalent information systems for decision making in many competitive business environments. In a DSS, decision making process is intimately related to some factors which determine the quality of information systems and their related products. Traditional approaches to data analysis usually cannot be implemented in sophisticated Companies, where managers ne...