Efficient structural similarity computation between XML documents
Abstract
This work describes a new approach for calculating the structural similarity of XML documents. Most existing work on XML document clustering treats the tree structures of these documents as mere vectors and therefore does not take their hierarchical contexts into account. Furthermore, to calculate structural similarity, most of these methods visit the nodes of the tree structures by depth-first traversal, most often a preorder tree walk. More recent studies present an alternative approach that does take the hierarchical contexts of these tree structures into account, but they have particularly high time complexity in the calculation of structural similarity. In this paper, we propose a new method based on breadth-first traversal of these tree structures. The goal is to cluster XML documents that share similar structures more rapidly. The method is fast and also takes the hierarchical contexts of XML documents into account; reconciling these two requirements makes the proposed method more reliable. To validate our proposal, experiments were conducted on both real and synthetic XML data. The results demonstrate the viability of our approach.
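The contrast the abstract draws between a preorder walk and a breadth-first traversal can be illustrated with a minimal Python sketch. It is not the authors' method: the function names, the sample documents, and the Jaccard-style `level_similarity` measure are all assumptions chosen only to show how a breadth-first pass keeps each tag together with its depth, whereas a flat list of tags loses that hierarchical context.

```python
from collections import deque
import xml.etree.ElementTree as ET


def preorder_tags(root):
    """Depth-first (preorder) walk: returns a flat list of tags, depth is lost."""
    return [elem.tag for elem in root.iter()]


def bfs_levels(root):
    """Breadth-first walk: returns one list of tags per tree level."""
    levels = []
    queue = deque([(root, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == len(levels):
            levels.append([])  # first node seen at this depth
        levels[depth].append(node.tag)
        for child in node:
            queue.append((child, depth + 1))
    return levels


def level_similarity(root_a, root_b):
    """Toy measure (an assumption, not the paper's formula):
    Jaccard overlap of the (depth, tag) pairs of the two trees."""
    pairs_a = {(d, t) for d, tags in enumerate(bfs_levels(root_a)) for t in tags}
    pairs_b = {(d, t) for d, tags in enumerate(bfs_levels(root_b)) for t in tags}
    union = pairs_a | pairs_b
    return len(pairs_a & pairs_b) / len(union) if union else 1.0


# Hypothetical sample documents: all three use the same set of tags.
doc_a = ET.fromstring("<book><title/><author><name/></author></book>")
doc_b = ET.fromstring("<book><author><name/></author><title/></book>")
doc_c = ET.fromstring("<book><name><author/></name><title/></book>")

print(bfs_levels(doc_a))
print(level_similarity(doc_a, doc_b))
print(level_similarity(doc_a, doc_c))
```

All three sample documents contain exactly the same tags, so a representation that ignores depth cannot tell them apart. The level-aware comparison treats `doc_a` and `doc_b` as identical (only sibling order differs) and scores `doc_c` lower, because `author` and `name` swap levels there.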
Similar resources
Metaheuristic Clustering of Persian XML Documents Based on Structural and Content Similarity
Given the growing number of XML documents, organizing them effectively so that useful information can be retrieved from them is essential. One possible solution is to cluster XML documents in order to discover knowledge. A key issue in clustering XML documents is how to measure the similarity between them. Conventional clustering of text documents using a do...
A New Sequential Mining Approach to XML Document Similarity Computation
Manuscript submitted to Postgraduate Research Day. Measuring the structural similarity among XML documents is the task of finding their semantic correspondence and is fundamental to many web-based applications. While there exist several methods to address the problem, the data mining approach seems to be a novel, interesting and promising one. It works on the id...
Semantic and Structure Based XML Similarity: The XS3 Prototype
Due to the ever-increasing web availability of XML-based data, an efficient approach to compare XML documents becomes crucial in information retrieval. Such comparison of XML documents has applications in version control (finding, scoring and browsing changes between different versions of a document), change management and data warehousing (support of temporal queries and index maintenance) [3,...
Semantic and Structure Based XML Similarity: An Integrated Approach
Over the last decade, XML has gained growing importance as a major means for information management, and has become indispensable for complex data representation. Due to an unprecedented increase in the use of the XML standard, developing efficient techniques for comparing XML-based documents becomes crucial in information retrieval (IR) research. A range of algorithms for comparing hierarchically str...
Structural Similarity Evaluation Between XML Documents and DTDs
The automatic processing and management of XML-based data are ever more popular research issues due to the increasingly abundant use of XML, especially on the Web. Nonetheless, several operations based on the structure of XML data have not yet received strong attention. Among these is the process of matching XML documents and XML grammars, useful in various applications such as documents classifi...