Scalable Ontological Query Processing over Semantically Integrated Life Science Datasets using MapReduce
Authors
Abstract
To give a comprehensive view of life-sciences data, Semantic Web technologies have been adopted for standardized representations of data and of the links between data. This has resulted in data warehouses such as UniProt, Bio2RDF, and Chem2Bio2RDF that integrate different kinds of biological and chemical data using ontologies. Unfortunately, processing queries over ontologically integrated collections remains a challenge, particularly when the data is large. The reason is that, besides the traditional challenges of processing graph-structured data, complete query answering requires inferencing to make implicitly represented facts explicit. Since traditional inferencing techniques such as forward chaining are difficult to scale up and must be repeated each time the data is updated, recent work has focused on inferencing that can be supported by database technologies via query rewriting. However, because most biomedical ontologies are richer than other domain ontologies, the queries produced by query rewriting are often more complex than existing query optimization techniques can cope with. This is particularly so on the emerging class of cloud data processing platforms for big data, because of the additional overhead they introduce. In this paper, we present an approach for dealing with such complex queries on big data using MapReduce, along with an evaluation on existing real-world datasets and benchmark queries.
Similar resources
Cascading map-side joins over HBase for scalable join processing
One of the major challenges in large-scale data processing with MapReduce is the smart computation of joins. Since Semantic Web datasets published in RDF have increased rapidly over the last few years, scalable join techniques become an important issue for SPARQL query processing as well. In this paper, we introduce the Map-Side Index Nested Loop Join (MAPSIN join) which combines scalable index...
Effective Spatial Data Partitioning for Scalable Query Processing
Recently, MapReduce based spatial query systems have emerged as a cost effective and scalable solution to large scale spatial data processing and analytics. MapReduce based systems achieve massive scalability by partitioning the data and running query tasks on those partitions in parallel. Therefore, effective data partitioning is critical for task parallelization, load balancing, and directly ...
Scalable Ontology Systems
Title of dissertation: SCALABLE ONTOLOGY SYSTEMS. Octavian Udrea, Doctor of Philosophy, 2008. Dissertation directed by: Professor V.S. Subrahmanian, Department of Computer Science. Since the adoption of the Resource Description Framework (RDF) by the World Wide Web Consortium (W3C), ontologies have become commonplace as a way to represent both knowledge and data. RDF databases have flexible schemas...
Ontology-based retrieval of scientific data in LIFE
LIFE is an epidemiological study examining thousands of Leipzig inhabitants with a wide spectrum of interviews, questionnaires, and medical investigations. The heterogeneous data are centrally integrated into a research database and are analyzed by specific analysis projects. To semantically describe the large set of data, we have developed an ontological framework. Applicants of analysis pro...
SeqPig: simple and scalable scripting for large sequencing data sets in Hadoop
SUMMARY Hadoop MapReduce-based approaches have become increasingly popular due to their scalability in processing large sequencing datasets. However, as these methods typically require in-depth expertise in Hadoop and Java, they are still out of reach of many bioinformaticians. To solve this problem, we have created SeqPig, a library and a collection of tools to manipulate, analyze and query se...
Journal title:
- CoRR
Volume abs/1602.01040, Issue
Pages -
Publication date 2014