A Hybrid Approximate XML Subtree Matching Method Using Syntactic Features and Word Semantics
Abstract
With the exponential increase in the amount and size of XML data on the Internet, XML subtree matching has become important for many application areas, such as change detection, keyword retrieval, and knowledge discovery over XML documents. In our previous work, we proposed leaf-clustering based approximate XML subtree matching methods that use syntactic information from both the clustered leaf nodes and the corresponding paths. In this paper, we propose a hybrid subtree matching method in which a match is determined by the word semantics of the leaf nodes, based on the WordNet thesaurus, together with the syntactic features of the relevant paths. We also propose a one-pass hash join technique to reduce the additional join cost caused by the extra words introduced through WordNet expansion. We perform experiments to evaluate execution performance, matching precision, and recall, comparing the hybrid method with the original syntax-based methods. The experimental results indicate that, compared with the existing path-based SLAX algorithm, the proposed hybrid method with one-pass hash join effectively improves precision and recall while increasing the execution time of leaf-clustering based subtree matching by only about 5%.
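To make the idea of joining leaf words after thesaurus expansion concrete, the following is a minimal sketch and not the paper's algorithm. It assumes a small hand-written synonym table (`SYNONYMS`) in place of WordNet, and the names `expand` and `hash_join_leaves` are hypothetical. It uses a generic build-and-probe hash join: the expanded words of one side are indexed once, and the other side is scanned once against that index.

```python
from collections import defaultdict

# Hypothetical toy thesaurus standing in for WordNet (assumption, not real WordNet data).
SYNONYMS = {
    "car": {"automobile", "auto"},
    "writer": {"author"},
}


def expand(word):
    """Return the lowercased word together with its synonyms from the toy thesaurus."""
    w = word.lower()
    expanded = {w}
    for head, syns in SYNONYMS.items():
        if w == head or w in syns:
            expanded |= {head} | syns
    return expanded


def hash_join_leaves(left_leaves, right_leaves):
    """Join two lists of (leaf_id, text) pairs on shared or synonymous words.

    Returns a dict mapping (left_id, right_id) to the number of distinct
    right-side words that hit the expanded left-side index.
    """
    # Build phase: index every expanded word of the left leaves (one scan).
    index = defaultdict(set)
    for leaf_id, text in left_leaves:
        for word in text.split():
            for term in expand(word):
                index[term].add(leaf_id)

    # Probe phase: scan the right leaves once and look each word up in the index.
    matches = defaultdict(int)
    for leaf_id, text in right_leaves:
        for word in set(text.lower().split()):
            for left_id in index.get(word, ()):
                matches[(left_id, leaf_id)] += 1
    return dict(matches)


if __name__ == "__main__":
    left = [("a1", "red car"), ("a2", "famous writer")]
    right = [("b1", "red automobile"), ("b2", "unknown author")]
    print(hash_join_leaves(left, right))
```

Expanding only the build side keeps the probe side at its original size, which is one plausible way to limit the extra join cost that expansion introduces; how the paper itself handles this is not stated in the abstract.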
Similar resources
Efficient Processing of XML Tree Pattern Queries
In this paper, we present a polynomial-time algorithm for TPQ (tree pattern queries) minimization without XML constraints involved. The main idea of the algorithm is a dynamic programming strategy to find all the matching subtrees within a TPQ. A matching subtree implies a redundancy and should be removed in such a way that the semantics of the original TPQ is not damaged. Our algorithm consist...
Semantics of haq in the Glorious Quran
Meaning plays a very important role at all levels of linguistic analysis. A word on its own, outside the chain of speech, does not show its true meaning; its meaning emerges only in relation to other signs within the language. Quran, the precious word of Allah, contains words that take a variety of meanings in the syntactic and topical con...
TASM: Top-k Approximate Subtree Matching
We consider the Top-k Approximate Subtree Matching (TASM) problem: finding the k best matches of a small query tree, e.g., a DBLP article with 15 nodes, in a large document tree, e.g., DBLP with 26M nodes, using the canonical tree edit distance as a similarity measure between subtrees. Evaluating the tree edit distance for large XML trees is difficult: the best known algorithms have cubic runti...
Exploring Syntactic Structural Features for Sub-Tree Alignment Using Bilingual Tree Kernels
We propose Bilingual Tree Kernels (BTKs) to capture the structural similarities across a pair of syntactic translational equivalences and apply BTKs to sub-tree alignment along with some plain features. Our study reveals that the structural features embedded in a bilingual parse tree pair are very effective for sub-tree alignment and the bilingual tree kernels can well capture such features. Th...