Text Mining Bridging the Gap between Knowledge and Text
Author
Abstract
Useful pathway models require a complete and accurate representation of the system, which means that all relevant molecular species must be captured, together with their physical interactions and chemical reactions. Pathway model reconstruction is currently carried out largely by hand by domain experts, who must carefully read the scientific literature in order to retrieve, evaluate, interpret and distil relevant fine-grained statements. Moreover, due to the proliferation of scientific databases and ontologies, discovery of previously unknown knowledge demands that scientists take into account information from many different resources, covering different levels of contextual information (e.g., degree of confidence or certainty expressed towards a finding). Thus, given the highly complex mechanisms involved in pathway models, whose detailed description can only be derived from analysis of heterogeneous, fragmented and incomplete sources, reconstructing pathway models is a slow, difficult and laborious process. Accordingly, there is a need to develop methods that help experts to make sense of the continuously growing body of literature, in order to increase the speed and reliability of knowledge discovery. In response, text mining (TM) aims to automate this process by finding relations (such as interactions) that hold between concepts of different types (e.g., genes/proteins, chemical compounds, metabolites, subcellular components, anatomical entities, organisms, cell lines, strains, diseases). A large number of TM methods aim to extract simple binary relations from text, e.g., A binds B. This is mainly achieved by focusing on textual co-occurrences, using bag-of-words approaches, analysis of controlled vocabulary metadata, and other shallow techniques. However, these approaches have several disadvantages, including the identification of many false positive relations.
Additionally, they fail to take into account contextual information about relations, e.g., the cellular context of a signaling event, such as cell type and localization. In contrast, our work involves the development of more sophisticated TM techniques to extract events, which encapsulate typed n-ary relationships, i.e., interactions between any number of concepts. Events are able to capture detailed information about mechanisms of biological pertinence (e.g., reactions such as negative regulation, phosphorylation and carboxylation) by linking together interacting participants, which play specific roles (e.g., modifier, reactant, product, cause, location). As such, they are able to encode several types of contextual information that are frequently missing when only binary relations are considered. Consider an intuitive example from the literature to explain our goal: "The results suggest that the narL gene product activates the nitrate reductase operon." (PMID: 3035558). This sentence provides interpretative information about the reaction between the narL gene product and the nitrate reductase operon, namely that the information stated is based on an analysis/interpretation of experimental results, and that a certain amount of speculation is expressed towards the reaction (signalled by the use of the verb "suggest", rather than a more definite verb, such as "demonstrate"). Next, consider a more complex example: "The analysis showed that IEXC29S was unable to significantly transactivate the c-sis/PDGFB promoter." Whilst a conventional TM analysis to find binary relationships would simply discover that some type of interaction occurs between IEXC29S and c-sis/PDGFB, a more detailed contextual analysis would allow the construction of a representation that encodes the complex details of the interaction, e.g., that the information is stated based on an experimental analysis, and that the interaction has been shown to occur with a low level of intensity.
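As a rough illustration of the contrast drawn above, the following Python sketch sets a sentence-level co-occurrence baseline next to an event-style representation of the two example sentences. Every name in it (`cooccurrence_pairs`, `Event`, the event type `Positive_regulation`, the role label `theme`, and the context attributes `knowledge_type`, `certainty` and `intensity`) is a hypothetical choice made for illustration; it is not the schema or API of any particular system.

```python
from dataclasses import dataclass, field
from itertools import combinations


def cooccurrence_pairs(entities):
    """Shallow baseline: every pair of entities mentioned in one sentence
    is reported as an untyped binary relation."""
    return list(combinations(entities, 2))


@dataclass
class Event:
    """Typed n-ary relation: a trigger, role-labelled participants,
    and contextual attributes (all attribute names are illustrative)."""
    event_type: str
    trigger: str
    participants: dict  # role -> entity text
    context: dict = field(default_factory=dict)


# "The results suggest that the narL gene product activates the
#  nitrate reductase operon." (PMID: 3035558)
narl = Event(
    event_type="Positive_regulation",  # assumed type label
    trigger="activates",
    participants={"cause": "narL gene product",
                  "theme": "nitrate reductase operon"},
    context={"knowledge_type": "analysis", "certainty": "speculated"},
)

# "The analysis showed that IEXC29S was unable to significantly
#  transactivate the c-sis/PDGFB promoter."
iex = Event(
    event_type="Positive_regulation",  # assumed type label
    trigger="transactivate",
    participants={"cause": "IEXC29S", "theme": "c-sis/PDGFB promoter"},
    context={"knowledge_type": "analysis", "intensity": "low"},
)

pairs = cooccurrence_pairs(["IEXC29S", "c-sis/PDGFB promoter"])
print(pairs)        # the baseline only says that the two entities co-occur
print(iex.context)  # the event also records how the finding is reported
```

The baseline output carries no relation type, roles or context, which is the information the event structure is meant to retain.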
In order to extract such complex events automatically, we have developed a pipeline-based event extraction system, EventMine [1], which employs a series of classifier modules to capture core event elements: detection of triggers (words or phrases that characterise the event; typically verbs or their nominalisations), detection of edges (finding links between pairs of concepts), and complex event detection (combining multiple edges into complex n-ary relations). EventMine utilises a rich set of features, including those obtained from dependency parse trees supplied by the GENIA Dependency Parser [2], as well as from predicate-argument structures determined by Enju [3], which has been adapted for application to biomedical text. EventMine is capable of extracting interactions across different sentences, owing to its ability to incorporate results from a pre-executed coreference resolution method [4]. In this way, event participants [...]
Proceedings of the XVIII International Conference «Data Analytics and Management in Data Intensive Domains» (DAMDID/RCDL’2016), Ershovo, Russia, October 11-14, 2016
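The three stages named in the abstract (trigger detection, edge detection, complex event detection) can be pictured with a deliberately simplified sketch. It replaces the trained classifiers and parser-derived features with a hand-written trigger lexicon and a position rule, so it illustrates the data flow only and is not how EventMine itself works. All function names, the lexicon and the role labels are hypothetical.

```python
# Toy stand-in for a trigger -> edge -> event pipeline.
# Assumption: a tiny lexicon and a left/right rule replace the classifiers.
TRIGGER_LEXICON = {"activates": "Positive_regulation",
                   "phosphorylates": "Phosphorylation"}


def detect_triggers(tokens):
    """Stage 1: return (index, event_type) for tokens found in the lexicon."""
    return [(i, TRIGGER_LEXICON[t]) for i, t in enumerate(tokens)
            if t in TRIGGER_LEXICON]


def detect_edges(triggers, entities):
    """Stage 2: link each trigger to each entity with a role label.
    Rule of thumb (assumption): entity before the trigger -> cause,
    entity after the trigger -> theme."""
    edges = []
    for t_idx, _ in triggers:
        for e_idx, e_text in entities:
            role = "cause" if e_idx < t_idx else "theme"
            edges.append((t_idx, role, e_text))
    return edges


def build_events(triggers, edges):
    """Stage 3: merge all edges that share a trigger into one n-ary event."""
    events = []
    for t_idx, e_type in triggers:
        args = {role: text for idx, role, text in edges if idx == t_idx}
        events.append({"type": e_type, "args": args})
    return events


tokens = ["narL", "activates", "the", "operon"]
entities = [(0, "narL"), (3, "operon")]  # (token index, text), given as input

triggers = detect_triggers(tokens)
edges = detect_edges(triggers, entities)
events = build_events(triggers, edges)
print(events)
```

Keeping the stages as separate functions mirrors the pipeline idea: each stage consumes the previous stage's output, so any one of them could be swapped for a learned model without changing the others.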
Similar resources
A model for extracting information from text documents, based on text mining in the e-learning domain
As computer networks become the backbones of science and the economy, enormous quantities of documents become available. Text mining techniques have therefore been used to extract useful information from textual data. Text mining has become an important research area that discovers unknown information, facts or new hypotheses by automatically extracting information from different written documents. T...
Document clustering based on ontology and a fuzzy approach
Data mining, also known as knowledge discovery in databases, is the process of discovering unknown knowledge from a large amount of data. Text mining applies data mining techniques to extract knowledge from unstructured text. Text clustering, one of the important techniques of text mining, is the unsupervised classification of similar documents into different groups. The most important step...
A review of text mining approaches and their function in discovering and extracting a topic
Background and aim: Four text mining methods are examined and focused on understanding and identifying their properties and limitations in subject discovery. Methodology: The study is an analytical review of the literature of text mining and topic modeling. Findings: LSA could be used to classify specific and unique topics in documents that address only a single topic. The other three text min...
Topic Modeling and Classification of Cyberspace Papers Using Text Mining
The global cyberspace networks provide individuals with platforms to interact, exchange ideas, share information, provide social support, conduct business, create artistic media, play games, engage in political discussions, and many more. The term cyberspace has become a conventional means to describe anything associated with the Internet and the diverse Internet culture. In fact, cyberspac...
-
The development and evolution of any system (person, organization, nation) depends on how well the system succeeds in bridging the gap between what it knows and what it does with that knowledge. We call this the gap between knowing and doing, or the knowing-doing gap. If the system does not do what it knows, it will lose out in competition with other systems, its relative performance in...
@Note: A workbench for Biomedical Text Mining
Biomedical Text Mining (BioTM) is providing valuable approaches to the automated curation of scientific literature. However, most efforts have addressed the benchmarking of new algorithms rather than user operational needs. Bridging the gap between BioTM researchers and biologists' needs is crucial to solve real-world problems and promote further research. We present @Note, a platform for BioTM...