Management of Flexible Schema Data in RDBMSs - Opportunities and Limitations for NoSQL -
نویسندگان
چکیده
ly, schema based data can be defined as a set of data (which is denoted as 'S') that satisfies the following properties: there exists a set of finite size of dimension (which is denoted as 'D') such that every element of S can be expressed as a linear combination of elements from D. Flexible schema based data is the negation of Schema based data. That is, there does NOT exit a set of finite size of dimension D such that every element of S can be expressed as a linear combination of elements from set D. Intuitively, schema based data can have unbounded number of elements but has a bounded dimensions as schema definition whereas flexible schema based data has unbounded dimensions. Because schema based data has finite dimensions, therefore, schema based data can be processed by separating the data away from its dimension so that an element in a schema based data set can be expressed by a vector of values, each of which represents the projection of the element in a particular dimension. All the dimensions are known as schema. Flexible schema based data cannot be processed by separating the data away from its dimension. Each element in a flexible schema based data has to keep track of its dimensions and the corresponding value. An element in a flexible schema based data is expressed by a vector of dimension and value (namevalue pair). Therefore, flexible schema based data requires store, query and index both schema and data together. 3.2 FSD Storage Current Practises Self-contained Document-object-store model: The current practice for storing FSD is to store FSD instances in a FSD collection using document-object-store model where both structure and data are stored together for each FSD instance so that it is self-descriptive without relying on a central schema dictionary. New structures can be added on a per-record basis without dealing with schema evolution. Aggregated storage supports full document-object retrieval efficiently without the cost of querying and stitching pieces of data from multiple relational tables. Each FSD instance can be independently imported, exported, distributed without any schema dependency. Table1 shows DDL to create resumeDoc_tab collection of resume XML documents, a shoppingCar_tab collection of shopping cart JSON objects. SQL/XML standard defines XML as a built-in datatype in SQL. For upcoming SQL/JSON standard [21], it supports storing JSON in SQL varchar, varbinary, CLOB, BLOB datatype with the new ‘IS JSON’ check constraint to ensure the data stored in the column is a valid JSON object. Adding a new domain FSD by storing into existing SQL datatype, such as varchar or LOB, without adding a new SQL type allows the new domain FSD to have full data operational completeness capability (Transactions, Replication, Partition, Security, Provenance, Export/Export, Client APIs etc) support with minimal development efforts. T1 CREATE TABLE resumeDoc_tab (id number, docEnterDate date, docVerifyDate date, resume XMLType) T2 CREATE TABLE shoppingCar_tab (oid number, shoppingCar BLOB check (shoppingCar IS JSON)) Table 1 – Document-Object-Store Table Examples Data-Guide as soft Schema: The data-guide can be computed from FSD collections to understand the complete structures of the data which helps to form queries over FSD collection. That is, FSD management with data-guide supports the paradigm of “storage without schema but query with schema”. For common top-level scalar attributes that exist in all FSD instances of a FSD collection, they can be automatically projected out as virtual columns or flexible table view [21, 22, 24]. For nested master-detail hierarchical structures exist in FSD instances, relational table indexes [11] and materialized views [35], are defined using FSD_TABLE() table function (Q4 in Table 2). They can be built as secondary structures on top of the primary hierarchical FSD storage to provide efficient relational view access of FSD. FSD_TABLE() serves as a bridge between FSD data and relational data. They are flexible because they can be created on demand. See section 5.2 for how to manage FSD_TABLE() and virtual columns as indexing or in-memory columnar structures. Furthermore, to ensure data integrity, soft schema can be defined as check constraint as verification mechanism but not storage mechanism. 3.3 FSD Storage Limitations and Research Challenges Single Hierarchy: The document-object-storage model is essentially a de-normalized storage model with single root hierarchy. When XML support was added into RDBMSs, the IMS hierarchical data model issues were brought up [32]. Fundamentally, the hierarchy storage model re-surfaces the single root hierarchy problem that relational model has resolved successfully. In particular, supporting m-n relationship in one hierarchy is quite awkward. Therefore, a research challenge is how to resolve single hierarchy problem in document-objectstorage mode that satisfies ‘data first, structural later’ requirement. Is there an aggregated storage model, other than E/R model, that can support multi-hierarchy access efficiently? Papers [20, 23] have proposed ideas on approaching certain aspects of this problem. Optimal instance level binary FSD format: The documentobject-storage model is essentially a de-normalized storage where master and detail data are stored together as one hierarchical tree structure, therefore, it is feasible to achieve better query performance than with normalized storage at the expense of update. Other than storing FSD instances in textual form, they can also be stored in a compact binary form native to the FSD domain data so that the binary storage format can be used to efficiently process FSD domain specific query language [3, 22]. In particular, since FSD is a hierarchical structure based, the domain language for hierarchical data is path-driven. The underlying native binary storage form of FSD is tree navigation friendly which improves significant performance improvement than text parsing based processing. The challenge in designing the binary storage format of FSD instance is to optimize the format for both query and update. A query friendly format typically uses compact structures to achieve ultra query performance while leaving no room for accommodating update, especially for the delta-update of a FSD instance involving structural change instead of just leaf value change. The current practise is to do full FSD instance update physically even though logically only components of a FSD instance need to be updated. Although typically a FSD instance is of small to medium size, the update may still cause larger transaction log than updating simple relational columns. A command level logging approach [27] can be investigated to see if it is optimal for high frequent delta-update of FSD instances. Optimal FSD instance size: Although the size of FSD collections can be scaled to very large number, in practise, each FSD instances is of small to medium size instead of single large size. In fact, many vendors have imposed size limit per FSD instance. This is because each FSD instance provides a logical unit for concurrency access control, document and Index update and logging granularity. Supporting single large FSD instance requires RDBMS locking, logging to provide intra-document scalability [43] in addition to the current mature inter-document scalability. 4. Querying and Updating FSD 4.1 FSD Query and Update Requirements A FSD collection is stored as a table of FSD instances. A FSD instance itself is domain specific and typically has its own domain-specific query language. For FSD of XML documents, the domain-specific query language is XQuery. For FSD of JSON objects, the domain-specific query language is the SQL/JSON path language as described in [21]. Table 2 shows the example of SQL/XML[10] and SQL/JSON[21] queries and DML statements embedding XQuery and SQL/JSON path language. In general, the domain-specific query language provides the following requirements: • Capability of querying and navigating document-object structures declaratively: A FSD instance is not shredded into tables since hierarchies in a FSD can be flexible and dynamic without being modelled as a fixed master-detail join pattern. Therefore, it is natural to express hierarchical traversal of FSD as path navigation with value predicate constructs in the FSD domain language. The path name can contain a wildcard name match and the path step can be recursive to facilitate exploratory query of the FSD data. For example, capabilities of the wildcard tag name match and recursive descendant tag match in XPath expressions support the notation of navigating structures without knowing the exact names or the exact hierarchy of the structures. See ‘.//experience’ XPath expression in Q1 and Q2. Such capability is needed to provide flexibility of writing explorative and discovery queries. • Capability of doing full context aware text search declaratively: FSD instances can be document centric with mixture of textual content and structures. There is a significant amount of full text content in FSD that are subject to full text search. However, unlike plain textual document, FSD has text content that is embedded inside hierarchical structure. Full text search can be further confined within a context identified by path navigation into the FSD instance. Therefore, context aware full text search is needed in FSD domain languages. See XQuery full text search expression in XMLEXISTS() predicate of Q1 and Q2 and path-aware full text search expression in JSON_TEXTCONTAINS() predicate of Q3. • Capability of projecting, transforming object component and constructing new document or object: Unlike relational query results which are tuples of scalar data, results of path navigational queries can be fragments of FSD. New FSD can be constructed by extracting components of existing FSD and combine them through construction and transformation. Therefore, constructing and transforming FSD instances are required in any FSD language. See XQuery constructor expression in the XMLQUERY() function in Q1. • Capability of performing component-wise update: FSD instance shall be updatable at component-wise level. New structure shall be addable to existing structures; existing structures and their values shall be updateable and deletable. XQuery update facility has provided all of these functionalities for XML document. See XQuery update facility expression in XMLQUERY() function in Q2. 4.2 FSD Query and Update Current Practises While a FSD domain-specific query and update language serves as an intra-document query language, SQL can be used as an inter-document query language. The current practices of querying FSD is to position SQL as a set-oriented language to provide declarative access of a set of FSD instances by leveraging the set based algebra supported by SQL. By positioning SQL as a Set (oriented) Query Language, SQL provides the necessary constructs to express set algebra operators, such as selection, projection, join, group by, aggregation, union, intersection and difference among FSD instances. SQL is openable to support a set of FSD_XXX() functions that can embed FSD domain specific query language. These FSD_XXX() functions are used in strategic places in SQL to filter, process, transform and update FSD instances. See Figure 1 for details. • FSD Filtering: FSD_EXISTS() is used as a conditional expression in a SQL WHERE clause to filter FSD instances. • FSD un-nesting: FSD_TABLE() is used as a table function in SQL FROM clause to unnest collection components within FSD instances into a virtual relational table. Unnesting can be done recursively, therefore Q4 shows the example NESTED PATH support in JSON_TABLE() to un-nest master-detail-detail relationships. Being concrete form of FSD_TABLE(), XMLTABLE() and JSON_TABLE() are very popular features in RDBMS to provide a relational bridge between hierarchical FSD and flattened relational table. Supporting FSD un-nesting concept can be traced back to SQL over NF2 model [42]. • FSD Scalar Projection: FSD_VALUE() is used to extract scalar value within a FSD and then to cast it as SQL builtin type values so that it can be used in a scalar value expression in SELECT, GROUP BY, ORDER BY clauses where scalar values are typically expected. • FSD Component Projection and Construction: FSD_Query() is used to query components within FSD or to construct new FSD in SELECT and UPDATE clause. • FSD Update: FSD_Query() is used at RHS side of UPDATE expression to generate a new FSD instance. • SQL JOIN of FSD Tables: SQL can be used to join multiple FSD tables. This can be accomplished by leveraging the SQL JOIN concept and FSD_VALUE() function. Q6 in table 2 shows the join of resumeDoc_tab, shoppingCar_tab. Figure 1 – FSD_XXX() Function Usages in Open-SQL Lateral
منابع مشابه
Dial M for Management: Next Generation NoSQL
NoSQL databases offer a powerful and flexible means of querying non-relational data. However, leading NoSQL systems typically achieve high performance goals while minimizing support for traditional data management services and defying the establishment of solid formal models. In particular, the systems generally shun tools and design principles rendered conventional by the long history of RDBMS...
متن کاملCan the Elephants Handle the NoSQL Onslaught?
In this new era of “big data”, traditional DBMSs are under attack from two sides. At one end of the spectrum, the use of document store NoSQL systems (e.g. MongoDB) threatens to move modern Web 2.0 applications away from traditional RDBMSs. At the other end of the spectrum, big data DSS analytics that used to be the domain of parallel RDBMSs is now under attack by another class of NoSQL data an...
متن کاملEfficient Implementation of Joins over Cassandra DBs
Over the last few years we witness an explosion on the development of data management solutions for big data applications. To this direction NoSQL databases provide new opportunities by enabling elastic scaling, fault tolerance, high availability and schema flexibility. Despite these benefits, their limitations in the flexibility of query mechanisms impose a real barrier for any application tha...
متن کاملEnabling Efficient Agile Software Development of NoSQL-backed Applications
NoSQL databases are popular in agile software development, where a frequently changing database schema imposes challenges for the production database. In this demo, we present Darwin, a middleware for systematic, tool-based support specifically designed for NoSQL database systems. Darwin carries out schema evolution and data migration tasks. To the best of our knowledge, Darwin is the first too...
متن کاملIntensional RDB Manifesto: a Unifying NewSQL Model for Flexible Big Data
In this paper we present a new family of Intensional RDBs (IRDBs) which extends the traditional RDBs with the Big Data and flexible and ’Open schema’ features, able to preserve the user-defined relational database schemas and all preexisting user’s applications containing the SQL statements for a deployment of such a relational data. The standard RDB data is parsed into an internal vector key/v...
متن کاملControVol Flex: Flexible Schema Evolution for NoSQL Application Development
We demonstrate ControVol Flex, an Eclipse plugin for controlled schema evolution in Java applications backed by NoSQL document stores. The sweet spot of our tool are applications that are deployed continuously against the same production data store: Each new release may bring about schema changes that conĆict with legacy data already stored in production. The type system internal to the predece...
متن کامل