Identification of Small Molecules using Mass Spectrometry
Abstract
Metabolites, small molecules that are produced by an organism, possess a broad range of functions from energy provision to the transfer of complex messages. A large number of metabolites still remain unknown. As metabolites often directly influence the phenotype, the biology of an organism cannot be fully understood without uncovering most of its metabolites. Additionally, a newly found metabolite may serve as a lead for the discovery of new drugs, especially antibiotics. Two major techniques exist for the discovery of metabolites: nuclear magnetic resonance provides full structural information but lacks sensitivity. In contrast, mass spectrometry (MS) provides much less information, but requires only a small amount of sample and allows for high-throughput analysis. Here, we present algorithms and workflows for the fully automated analysis of tandem MS spectra from small molecules. In the first step, we annotate spectra with a fragmentation tree. Such a tree assigns molecular formulas to the peaks and proposes fragmentation reactions between them. A graph-theoretical formulation of this task leads to the NP-hard Maximum Colorful Subtree problem. We present algorithms for the de novo calculation of fragmentation trees, based on the spectra alone. Using an ILP formulation, the tree calculation is usually faster than the spectra can be acquired. Mass spectrometry experts confirm that the trees agree well with their knowledge of fragmentation chemistry. Additionally, we can use fragmentation trees to improve molecular formula determination by isotopic pattern. The next step in the pipeline compares the fragmentation tree of an unknown with reference fragmentation trees. We propose to use tree alignment for this, since alignments define similarity in a biologically meaningful way. To perform tree alignments, we adapt a dynamic programming algorithm by Jiang et al. (1995). Unfortunately, tree alignment is again NP-hard on unordered trees such as fragmentation trees.
However, the runtime is exponential only in the maximum out-degree of the tree. Fragmentation trees usually have small out-degrees, rendering the approach feasible. The fragmentation tree alignment similarities can be processed in several ways. For evaluation, we compare these scores with structural similarities of the corresponding compounds and find the two measures highly correlated, even if the spectra have been measured on different instrument types. Using tree similarities as input for hierarchical clustering results in groups that agree well with chemical compound classes. We also developed FT-BLAST, a tool to search a database of reference trees for an unknown tree. In addition to finding highly similar compounds in the database, it can filter out spurious hits by applying a decoy database strategy. Evaluations show that most of the remaining hits are meaningful. We apply the full workflow, starting with molecular formula determination, to a biological sample from Icelandic poppy (P. nudicaule). Clustering the unknowns together with reference compounds enabled the prediction of compound classes for some unknowns. The FT-BLAST analysis gave hints about structural features of the unknowns. An independent manual analysis of the unknowns confirmed our findings. In addition, reference fragmentation trees can be annotated with structural features using an in-silico fragmentation approach. Although the theoretical formulation of this problem turns out to be one NP-hard problem nested in another, we managed to develop a branch-and-bound heuristic for this task. In the future, it may help to further interpret tree alignments. This workflow may help researchers in the dereplication of drug leads by telling them early in the process whether a promising compound is similar to an already tested lead. Another application may be the reliable reconstruction of metabolic networks from mass spectrometric data, similar to Watrous et al. (2012).
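To make the Maximum Colorful Subtree problem named in the abstract more concrete, the following is a minimal sketch of an exact solver for a toy instance. It is not the dynamic programming, heuristic, or ILP approach of the thesis; it is a plain memoized search that only works for very small graphs. The assumptions are ours: each vertex stands for a candidate molecular formula, its color stands for the peak it explains, each edge weight is a hypothetical score for a fragmentation step, and all vertex names and weights are invented for illustration.

```python
from functools import lru_cache


def max_colorful_subtree(root, color, edges):
    """Best total edge weight of a subtree rooted at `root` in which
    every color (peak) is used at most once.

    color: dict vertex -> color (hypothetical peak index)
    edges: dict (parent, child) -> weight (hypothetical score)
    Exponential time; intended only for toy instances.
    """

    @lru_cache(maxsize=None)
    def best(tree):
        # `tree` is the frozenset of vertices already in the subtree.
        used = {color[v] for v in tree}
        score = 0.0  # stopping here is always allowed
        for (u, v), w in edges.items():
            # Attach v below u if v is new and its color is still free.
            if u in tree and v not in tree and color[v] not in used:
                score = max(score, w + best(tree | {v}))
        return score

    return best(frozenset([root]))


# Toy instance: peak 1 has two candidate formulas (a1, a2), peak 2 has one (b1).
toy_color = {"root": 0, "a1": 1, "a2": 1, "b1": 2}
toy_edges = {
    ("root", "a1"): 3.0,
    ("root", "a2"): 2.0,
    ("a1", "b1"): -1.0,
    ("a2", "b1"): 4.0,
    ("root", "b1"): 1.0,
}
print(max_colorful_subtree("root", toy_color, toy_edges))
```

In this toy instance the locally best first edge (root to a1) does not belong to the best tree, which illustrates why a greedy choice per peak is not sufficient in general.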
Summary
Metabolites, small molecules produced by a living organism, have diverse functions, for example as energy stores or as complex messengers. A large number of metabolites are still unknown. Because metabolites often directly influence the phenotype, the processes of an organism cannot be fully understood without knowing the majority of its metabolites. In addition, new metabolites can serve as lead structures for drug development. Two techniques are mainly used to discover metabolites: nuclear magnetic resonance spectroscopy can reveal the complete structure of a substance but is not very sensitive. In contrast, mass spectrometry (MS) provides less information but manages with a much smaller amount of sample and therefore enables high-throughput analysis. In this thesis, we present algorithms for the fully automated analysis of tandem mass spectra of small molecules. First, the spectra are annotated with a fragmentation tree. A fragmentation tree assigns molecular formulas to the peaks and postulates fragmentation reactions between them. Formulating this task in graph-theoretic terms leads to the NP-hard Maximum Colorful Subtree problem. We present algorithms for computing fragmentation trees without the aid of a database. Using an integer linear program, the trees can usually be computed faster than the spectra are measured. Mass spectrometry experts confirm that the trees can be explained well by their knowledge of fragmentation reactions. Fragmentation trees can also be used to improve the identification of the molecular formula from isotope patterns. In the next step, the tree of an unknown substance is compared with reference trees. We use tree alignments for this purpose, since alignments define similarity in a biologically meaningful way. To compute tree alignments, we adapted an algorithm by Jiang et al., which is based on dynamic programming. Unfortunately, aligning unordered trees such as fragmentation trees is again NP-hard. However, the running time grows exponentially only with the maximum out-degree. Fragmentation trees usually have small out-degrees, which makes the approach practical. There are several ways to work further with the fragmentation tree similarities. For testing purposes, we compare this similarity with the structural similarity of the corresponding substances. These similarities are strongly correlated, even when the spectra were measured with different types of spectrometers. When we cluster the substances based on their fragmentation tree similarities, groups form that roughly correspond to chemical compound classes. In addition, we developed FT-BLAST, a program for searching fragmentation trees in a reference database. Besides finding very similar substances in the database, it can calculate the significance of the hits using a decoy database. Tests show that the majority of hits above a significance threshold are meaningful. We apply the entire workflow to the identification of a biological sample of Icelandic poppy (P. nudicaule). By clustering the unknown substances together with reference measurements, we were able to predict the compound classes of some unknowns. In addition, the FT-BLAST analysis shed light on characteristic substructures. An independent manual analysis confirmed our results. Reference fragmentation trees can also be annotated with structural formulas using an in-silico fragmentation approach. Although the theoretical formulation of this problem involves two nested NP-hard problems, we were able to develop a branch-and-bound heuristic for it. In the future, this heuristic may help to interpret tree alignments even better. The interpretation of spectra presented here could help in the dereplication of lead structures for drugs, since it can be determined at an early stage whether a new lead structure resembles one that has already been tested. A further application could be the reliable reconstruction of metabolic networks from mass spectrometry data, similar to the approach of Watrous et al. (2012).
Acknowledgements
As good science is always teamwork, a lot of people supported me on this project. Here I would like to express my gratitude to them. I thank my supervisor, Sebastian Böcker, for numerous fruitful discussions, great ideas, and interesting insights into many areas of bioinformatics. His door was always open for questions of any kind, and I am grateful for his support and understanding concerning personal problems that arose during the time of this work. The same is true for my co-workers, especially Franziska Hufsky and Kerstin Scheubert, who continue the fragmentation tree project. It's been a pleasure working with you! I also thank the rest of the mass spec team, Martin Engler, Marvin Meusel and Imran Rauf, for the great work together. Malte Brinkmeyer, Bao Bui, Thasso Griebel, Frank Mäurer, Anton Pervukhin and Anke Truss warmly welcomed me in the group, and I hope I could extend this welcome to Raffael Fassler, François Nicolas, Florian Sikora, and Sascha Winter. We had a great time together. Kathrin Schowtka always helped with bureaucracy and various other problems. Thanks a lot!
I would also like to thank our collaborators in the tandem mass spectrometry project, Aleš Svatoš, Marco Kai and Ravi Maddula of the Max-Planck-Institute for Chemical Ecology, Jena, Christoph Böttcher and Steffen Neumann of the Leibniz-Institute for Plant Biochemistry, Halle, and Rolf Müller and Daniel Krug from Saarland University, Saarbrücken. Thanks for your explanations, your patience and the trust in our work. I thank Masanori Arita (University of Tokyo, Japan) and Takaaki Nishioka (Keio University, Japan) for making the MassBank data available and Miroslav Strnad (Palacký University, Olomouc, Czech Republic) and Evangelos Tatsis (MPI-CE Jena) for providing samples. Several students worked on their diploma theses in the field of tandem MS analysis, and their ideas and results helped a great deal in the development of this work. Thus, I'm grateful to Kai Dührkop, Birte Kehr, Tamara Steijger and Thomas Zichner for their excellent work. The student assistants Martin Bens, Antje Biering, Claudia Dahl, Markus Fleischauer, Johannes Helmuth and Michael Probst implemented the methods described here in a graphical user interface software. Thank you for digging deep into the realms of software development. I thank Sebastian Wernicke, my supervisor while I was a student research assistant, for introducing me to scientific work. My projects would not have been as successful without his foundations. Last but not least, I would like to thank my wife Sarah for her support, for catalyzing ideas and bearing my moods and, of course, my daughter Maya for many hours of joy.
Preface
This thesis presents large parts of my research in the automated analysis of tandem mass spectra from small molecules. During this work, I was a member of the Bioinformatics Group led by Professor Sebastian Böcker at the Friedrich-Schiller-Universität Jena. My studies were supported by the university's basic funding.
Most of the results presented here have been published in [12, 76–78] and were achieved in cooperation with my supervisor Sebastian Böcker, our collaborators Aleš Svatoš, Marco Kai, Ravi Kumar Maddula and Christoph Böttcher, my colleagues Franziska Hufsky, François Nicolas, Kerstin Scheubert and Imran Rauf and, last but not least, my diploma students Tamara Steijger and Thomas Zichner. Together with other co-workers, I also participated in the analysis of glycan mass spectra [7, 8], the calculation of fragmentation trees from MS data [83, 84] and GC-MS spectra [45], as well as a faster method to calculate fragmentation tree alignments [44]. Before working with the Bioinformatics Group, I implemented two methods for biological network analysis under the supervision of Sebastian Wernicke and Prof. Rolf Niedermeier [103–105]. The main results of this thesis are presented in Chapters 3–6 and have also appeared in the following publications. Chapter 3 describes the calculation of fragmentation trees using results from both [77] and [78]. Sebastian Böcker and I developed the concept and the dynamic programming approach, which I also implemented. I designed the scoring function and developed and implemented the insertion heuristic. I was also involved in the development of the other heuristics and the ILP. Chapter 4 presents the evaluation of the trees against expert knowledge. Here, I only performed the comparison with the Mass Frontier results, since the manual evaluation had to be performed by skilled biochemists. Chapter 5 deals with the alignment concept to compare fragmentation trees. It is published in [76]. In this project, I participated in the development of the method, the scoring and the significance estimation. I implemented the scoring, decoy database generation and q-value calculation. Furthermore, I performed large parts of the analyses. In Chapter 6, an approach for the annotation of fragmentation trees with molecular structures is described.
These results have been published at the 9th Workshop on Algorithms in Bioinformatics (WABI 2009) [12]. Here, I worked on the problem formulation and developed the algorithms together with all co-authors. For the remainder of this thesis, I will use “we” as the first-person pronoun, as is common in scientific literature. This may be interpreted as “the reader and I” or as “my collaborators and I”, whichever suits the situation best.
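The decoy database strategy and q-value calculation mentioned for Chapter 5 can be illustrated with a generic target-decoy estimate. The sketch below is an assumption about the general technique, not the procedure used in the thesis or in FT-BLAST: it estimates the false discovery rate at each score threshold as the number of decoy hits divided by the number of target hits at or above that threshold, and takes the q-value of a hit as the smallest such estimate at its own or any more permissive threshold. The function name and the example scores are hypothetical, and tied target scores get no special treatment.

```python
def q_values(target_scores, decoy_scores):
    """Return one q-value per target score, in input order.

    Generic target-decoy estimate (an assumption, not the thesis method):
    FDR(t) ~ (#decoys with score >= t) / (#targets with score >= t).
    """
    decoys = sorted(decoy_scores, reverse=True)
    # Indices of target hits from best to worst score.
    order = sorted(range(len(target_scores)),
                   key=lambda i: target_scores[i], reverse=True)

    fdr = []
    d = 0  # number of decoys scoring at least as high as the current hit
    for rank, i in enumerate(order, start=1):
        while d < len(decoys) and decoys[d] >= target_scores[i]:
            d += 1
        fdr.append(min(1.0, d / rank))

    # q-value: smallest FDR at this threshold or any more permissive one.
    q = [0.0] * len(order)
    running = 1.0
    for pos in range(len(order) - 1, -1, -1):
        running = min(running, fdr[pos])
        q[order[pos]] = running
    return q


# Hypothetical similarity scores of hits against a reference and a decoy database.
print(q_values([9, 8, 5, 3], [6, 2]))
```

Hits would then be reported only if their q-value falls below a chosen threshold; the choice of that threshold is left open here.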
Similar resources
A direct comparison between fatty acid analysis and intact phospholipid profiling for microbial identification
Two chemical methods for characterization of microorganisms were compared: phospholipid ester-linked fatty acid (PLFA) analysis by gas chromatography/mass spectrometry, and intact phospholipid profiling (IPP) using liquid chromatography/electrospray ionization/mass spectrometry. Both methods were tested on five reference pseudomonad strains: Pseudomonas putida mt-2, Pseudomonas putida F1, Burkho...
SECURING INTERPRETABILITY OF FUZZY MODELS FOR MODELING NONLINEAR MIMO SYSTEMS USING A HYBRID OF EVOLUTIONARY ALGORITHMS
In this study, a Multi-Objective Genetic Algorithm (MOGA) is utilized to extract interpretable and compact fuzzy rule bases for modeling nonlinear Multi-input Multi-output (MIMO) systems. In the process of nonlinear system identification, structure selection, parameter estimation, model performance and model validation are important objectives. Furthermore, securing low-level and high-level ...
Structural characterization of unsaturated phosphatidylcholines using traveling wave ion mobility spectrometry.
A number of phosphatidylcholine (PC) cations spanning a mass range of 400-1000 Da are investigated using electrospray ionization mass spectrometry coupled with traveling wave ion mobility spectrometry (TWIMS). A high correlation between mass and mobility is demonstrated with saturated phosphatidylcholine cations in N(2). A significant deviation from this mass-mobility correlation line is observ...
Cytokinin production by Paenibacillus polymyxa
The production of hormones has been suggested to be one of the mechanisms by which plant growth-promoting rhizobacteria (PGPR) stimulate plant growth. To evaluate whether the free-living soil bacterium, Paenibacillus polymyxa, releases the hormone group cytokinins and, if so, their identity, the content of cytokinins in the growth media, before and after cultivation of this bacterium, was deter...
Gas-Phase Fragmentation of Protonated N,2-Diphenyl-N'-(p-Toluenesulfonyl)Ethanimidamides: Tosyl Cation Transfer Versus Proton Transfer.
The gas-phase dissociation chemistry of protonated N,2-diphenyl-N'-(p-toluenesulfonyl) ethanimidamides was investigated by electrospray ionization mass spectrometry in combination with density functional theory calculation. The protonated molecules underwent fragmentation via two main competing channels: (1) migration of the tosyl cation to the anilinic N atom and the subsequent loss of 2-pheny...
Generating Peptide Candidates from Amino-Acid Sequence Databases for Protein Identification via Mass Spectrometry
Protein identification via mass spectrometry forms the foundation of high-throughput proteomics. Tandem mass spectrometry, when applied to a complex mixture of peptides, selects and fragments each peptide to reveal its amino-acid sequence structure. The successful analysis of such an experiment typically relies on amino-acid sequence databases to provide a set of biologically relevant peptides ...