A geometric view of Biodiversity: scaling to metagenomics
Abstract
We have designed a new, efficient dimensionality reduction algorithm to investigate new ways of accurately characterizing biodiversity, namely from a geometric point of view, that scales to the large environmental data sets produced by NGS (∼ 10 sequences). The approach is based on Multidimensional Scaling (MDS), which maps n items to a set of n points in a low-dimensional Euclidean space, given their pairwise distances. We compute all pairwise distances between reads in a given sample, run MDS on the distance matrix, and analyze the projections onto the first axes with visualization tools. We have circumvented the quadratic complexity of computing pairwise distances by running the computation on a massively parallel computer (Turing, a Blue Gene Q), and the cubic complexity of the spectral decomposition by implementing an algorithm based on dense random projection. We have applied this data analysis scheme to a set of 10 reads, which are amplicons of a diatom environmental sample from Lake Geneva. Analyzing the shape of the point cloud paves the way for a geometric analysis of biodiversity, and for accurately building OTUs (Operational Taxonomic Units) when the data set is too large for unsupervised, hierarchical, high-dimensional clustering.

Key-words: Biodiversity, metabarcoding, Multidimensional Scaling, Singular Value Decomposition, Random Projection

∗ BIOGECO, INRA, Univ. Bordeaux, 33610 Cestas, France
† Pleiade team, INRIA Bordeaux-Sud-Ouest, France
‡ UMR Carrtel, INRA, Thonon-les-Bains, France
§ IDRIS, CNRS, Orsay, France
¶ HiePACS team, Inria Bordeaux-Sud-Ouest, France
‖ Corresponding author, [email protected]

Caractérisation géométrique de la biodiversité : passage à l’échelle en métagénomique (Geometric characterization of biodiversity: scaling up to metagenomics)

Résumé (translated from the French): We have designed a dimensionality reduction algorithm to explore new ways of precisely characterizing biodiversity, here through a geometric approach, that meets the scaling requirements of the data sets produced by NGS (currently 10 reads). This approach is based on the technique known as Multidimensional Scaling, which projects the items under study onto a set of n points in a low-dimensional Euclidean space, given their mutual distances. We computed all pairwise distances between the reads of an environmental sample, ran an MDS on the distance table, and analyzed the projections onto the first axes with visualization techniques. We addressed the quadratic complexity of computing the pairwise distances by running the computations at a national computing centre equipped with a hyperparallel machine (Turing, an IBM Blue Gene Q), and the cubic complexity of the spectral decomposition in MDS by using a dense random projection algorithm. We applied this procedure to a set of 10 reads from an environmental sample of diatoms from Lake Geneva. Analyzing the shape of the resulting point cloud opens the way to a geometric analysis of biodiversity, and to a rigorous construction of OTUs (Operational Taxonomic Units) when the data set is too large for unsupervised hierarchical agglomerative clustering.

Keywords (translated from the French): Biodiversity, Metabarcoding, Multidimensional Scaling, Singular Value Decomposition, Random Projection
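The abstract names the steps (pairwise distances, MDS on the distance matrix, and a dense random projection in place of a full spectral decomposition) without giving details. As a minimal sketch only, and not the authors' implementation, the following Python code shows one common way to combine classical MDS with a Gaussian random projection (a randomized range finder followed by a small eigenproblem). The function name, the `oversample` and `n_power_iter` parameters, and the use of NumPy are all assumptions made for illustration.

```python
import numpy as np


def classical_mds_random_projection(dist, k=2, oversample=10, n_power_iter=2, seed=0):
    """Embed n items in k dimensions from an (n, n) matrix of pairwise distances.

    Hypothetical helper: classical MDS in which the top-k eigenpairs of the
    Gram matrix are approximated through a dense Gaussian random projection
    instead of a full (cubic-cost) eigendecomposition.
    """
    dist = np.asarray(dist, dtype=float)
    n = dist.shape[0]

    # Double-centre the squared distances: gram = -1/2 * J * D^2 * J.
    sq = dist ** 2
    gram = -0.5 * (
        sq
        - sq.mean(axis=1, keepdims=True)
        - sq.mean(axis=0, keepdims=True)
        + sq.mean()
    )

    # Dense Gaussian random projection onto k + oversample directions.
    rng = np.random.default_rng(seed)
    width = min(n, k + oversample)
    omega = rng.standard_normal((n, width))
    q, _ = np.linalg.qr(gram @ omega)

    # A few power iterations sharpen the captured subspace.
    for _ in range(n_power_iter):
        q, _ = np.linalg.qr(gram @ q)

    # Solve the small (width x width) symmetric eigenproblem.
    small = q.T @ gram @ q
    small = 0.5 * (small + small.T)
    eigvals, eigvecs = np.linalg.eigh(small)

    # Keep the k largest eigenvalues; clip negatives from non-Euclidean input.
    order = np.argsort(eigvals)[::-1][:k]
    top_vals = np.clip(eigvals[order], 0.0, None)

    # Coordinates are eigenvectors scaled by the square roots of the eigenvalues.
    return (q @ eigvecs[:, order]) * np.sqrt(top_vals)
```

Calling `classical_mds_random_projection(d, k=2)` on a distance matrix `d` returns an `(n, 2)` array of coordinates that can be plotted as a point cloud. The dominant cost is the product of the n × n Gram matrix with a thin n × (k + oversample) matrix, which is what makes this kind of scheme attractive when a full eigendecomposition is too expensive.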
Similar resources
The power law scaling, geometric and kinematic characteristic of faults in the Northern part of the Kerman Coal Province (KCP), Iran
According to numerous studies, basic scaling relationships exist for the geometric and kinematic characteristics of faults. The study area is located in the northern part of the Kerman coal province. The statistical calculations consist of: measuring the surface density of faults per unit area and division of the area, determining the direction of the dominant faulting and...
A Geometric View of Similarity Measures in Data Mining
The main objective of data mining is to acquire information from a set of data for prospect applications using a measure. The concerning issue is that one often has to deal with large scale data. Several dimensionality reduction techniques like various feature extraction methods have been developed to resolve the issue. However, the geometric view of the applied measure, as an additional consid...
Extended Geometric Processes: Semiparametric Estimation and Application to Reliability
Keywords: Imperfect repair, Markov renewal equation, replacement policy
Lam (2007) introduces a generalization of renewal processes named Geometric processes, where inter-arrival times are independent and identically distributed up to a multiplicative scale parameter, in a geometric fashion. We here envision a more general scaling, not necessarily geometric. The corresponding counting process is named Extended Geometric Process (EGP). Semiparametric estimates are...
Microbial Genetic Biodiversity and Molecular Approach
Biodiversity is given by the variety of species on Earth resulting from billions of years of evolution. Molecular-phylogenetic studies have revealed that the main diversity of life is microbial and that it is distributed among three domains: Archaea, Bacteria, and Eukarya. The functioning of the whole biosphere depends absolutely on the activities of the microbial world. Due to their versatil...
Geometric distortion evaluation of magnetic resonance images by a new large field of view phantom for magnetic resonance based radiotherapy purposes
Background: Magnetic resonance imaging (MRI)-based radiotherapy planning has been considered in recent years because of the advantages of MRI and the problems of planning with two imaging modalities. The first step in MRI-based radiotherapy is to evaluate the geometric distortion of magnetic resonance (MR) images. Therefore, the present study aimed to evaluate system related geometric distort...