Data Driven XPath Generation

Authors

  • Robin De Mol
  • Antoon Bronselaer
  • Joachim Nielandt
  • Guy De Tré
Abstract

The XPath query language offers a standard for information extraction from HTML documents. For this purpose, the DOM tree representation, which models the hierarchical structure of the document, is typically used. One of the key aspects of HTML is the separation of data from the structure used to represent it. As a consequence, data extraction algorithms usually fail to identify data when the structure of a document changes. This paper investigates how a set of tabular-oriented XPath queries can be adapted so that it copes with modifications in the DOM tree of an HTML document. The basic idea is that data already extracted in the past can be used to reconstruct XPath queries that retrieve the data from a different DOM tree. Experimental results show the accuracy of our method.
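The basic idea in the abstract can be illustrated with a small sketch. This is not the authors' algorithm, which the abstract does not spell out; it is one simple way to rebuild a tabular query from known values, assuming the new document is well-formed XHTML that Python's standard `xml.etree.ElementTree` can parse. The helper names `paths_to_values` and `generalize`, and the sample data, are hypothetical.

```python
import xml.etree.ElementTree as ET


def paths_to_values(root, known_values):
    """Return one step list per element whose text is a known value.

    Each step is a (tag, position) pair, where position is the 1-based
    index of the element among its same-tag siblings.
    """
    found = []

    def walk(node, steps):
        if (node.text or "").strip() in known_values:
            found.append(steps)
        seen = {}
        for child in node:
            seen[child.tag] = seen.get(child.tag, 0) + 1
            walk(child, steps + [(child.tag, seen[child.tag])])

    walk(root, [])
    return found


def generalize(paths):
    """Merge equal-length step lists into one query relative to the root.

    A position is kept where all paths agree and dropped where they
    differ. Returns None if the paths share no common tag sequence.
    """
    if not paths or len({len(p) for p in paths}) != 1:
        return None
    parts = []
    for column in zip(*paths):
        tags = {tag for tag, _ in column}
        positions = {pos for _, pos in column}
        if len(tags) != 1:
            return None
        tag = tags.pop()
        parts.append(f"{tag}[{positions.pop()}]" if len(positions) == 1 else tag)
    return "./" + "/".join(parts) if parts else "."


# Hypothetical values extracted earlier from an older version of the page.
old_values = {"Alice", "Bob"}

# Hypothetical new version: the table is now wrapped in an extra <div>.
new_doc = """<html><body><div><table>
<tr><td>1</td><td>Alice</td></tr>
<tr><td>2</td><td>Bob</td></tr>
<tr><td>3</td><td>Carol</td></tr>
</table></div></body></html>"""

root = ET.fromstring(new_doc)
query = generalize(paths_to_values(root, old_values))
names = [td.text for td in root.findall(query)]
```

Here the known values locate two cells in the changed tree, and dropping the row index on which they disagree yields a column query that also reaches rows not seen before. Real HTML is rarely well-formed XML, so a practical version would need an HTML parser and a more tolerant matching step.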


Similar resources

When Grammars do not Suffice: Data and Content Integrity Constraints Verification in XML through a Conceptual Model

Complex applications can benefit greatly from using conceptual models and Model Driven Architecture during development, deployment and runtime. XML applications are no different. In this paper, we examine the possibility of using Object Constraint Language (OCL) for expressing constraints over a conceptual model for XML data. We go through the different classes of OCL expression and show how e...


Testing XPath Queries using Model Checking

XML’s rapid adoption as the data representation standard in web-based systems is increasing the interest in applying XML query languages (such as XPath) to access XML repositories. This technology entails new challenges related to testing, mainly derived from the hierarchical data representation in XML documents and the expressiveness of the query language. In this paper, we present a technique for ...


A Data Model for Temporal XML Documents

XML is expected to become the next generation standard language for exchanging data over the Internet. In general, the contents of XML documents may change as time goes by, and then, it is important to capture entire histories of those documents. In this paper, we propose a logical data model for representing histories of XML documents. The proposed model extends the XPath data model, and is ca...


LoPiX: A System for XML Data Integration and Manipulation

LoPiX is an implementation of XPathLog [May01b], an XML/XPath-native, rule-based programming language for manipulation and integration of XML documents. The main syntactical constructs are XPath expressions, extended with variables. Due to the close relationship with XPath, the semantics of rules is easy to grasp. In contrast to other approaches, the XPath syntax and semantics is also used for ...


VAMANA: A High Performance, Scalable and Cost Driven XPath Engine

Many applications are migrating to or beginning to make use of native XML data. We anticipate that queries will emerge that emphasize the structural semantics of XML query languages like XPath and XQuery. This brings a need for an efficient query engine and database management system tailored for XML data similar to traditional relational engines. While mapping large XML documents into relational dat...




Publication date: 2014