Cluster Generation and Labeling for Web Snippets: A Fast, Accurate Hierarchical Solution

نویسندگان

  • Filippo Geraci
  • Marco Pellegrini
  • Marco Maggini
  • Fabrizio Sebastiani
چکیده

This paper describes Armil, a meta-search engine that groups the web snippets returned by auxiliary search engines into disjoint labeled clusters. The cluster labels generated by Armil provide the user with a compact guide to assessing the relevance of each cluster to his/her information need. Striking the right balance between running time and cluster well-formedness was a key point in the design of our system. Both the clustering and the labeling tasks are performed on the fly by processing only the snippets provided by the auxiliary search engines, and they use no external sources of knowledge. Clustering is performed by means of a fast version of the furthest-pointfirst algorithm for metric k-center clustering. Cluster labeling is achieved by combining intra-cluster and inter-cluster term extraction based on a variant of the information gain measure. We have tested the clustering effectiveness of Armil against Vivisimo, the de facto industrial standard in web snippet clustering, using as benchmark a comprehensive set of snippets obtained from the Open Directory Project hierarchy. According to two widely accepted “external” metrics of clustering quality, Armil achieves better performance levels by 10%. We also report the results of a thorough user evaluation of both the clustering and the cluster labeling algorithms. On a standard desktop PC (AMD Athlon 1-Ghz Clock with 750 Mbytes RAM), Armil performs clustering and labeling altogether of up to 200 snippets in less than one second.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Cluster Generation and Cluster Labelling for Web Snippets: A Fast and Accurate Hierarchical Solution

This paper describes Armil, a meta-search engine that groups the Web snippets returned by auxiliary search engines into disjoint labelled clusters. The cluster labels generated by Armil provide the user with a compact guide to assessing the relevance of each cluster to his/her information need. Striking the right balance between running time and cluster well-formedness was a key point in the de...

متن کامل

A Phrase-Based Method for Hierarchical Clustering of Web Snippets

Document clustering has been applied in web information retrieval, which facilitates users’ quick browsing by organizing retrieved results into different groups. Meanwhile, a tree-like hierarchical structure is wellsuited for organizing the retrieved results in favor of web users. In this regard, we introduce a new method for hierarchical clustering of web snippets by exploiting a phrase-based ...

متن کامل

Cluster Generation and Cluster Labelling for Web Snippets

This paper describes Armil, a meta-search engine that groups into disjoint labelled clusters the Web snippets returned by auxiliary search engines. The cluster labels generated by Armil provide the user with a compact guide to assessing the relevance of each cluster to her information need. Striking the right balance between running time and cluster well-formedness was a key point in the design...

متن کامل

Quadtree and Octree Grid Generation

Engineering analysis often involves the accurate numerical solution of boundary value problems in discrete form. Hierarchical quadtree (or octree) grid generation offers an efficient method for the spatial discretisation of arbitrary-shaped two- (or three-) dimensional domains. It consists of recursive algebraic splitting of sub-domains into quadrants (or cubes), leading to an ordered hierarchi...

متن کامل

unimelb: Topic Modelling-based Word Sense Induction for Web Snippet Clustering

This paper describes our system for Task 11 of SemEval-2013. In the task, participants are provided with a set of ambiguous search queries and the snippets returned by a search engine, and are asked to associate senses with the snippets. The snippets are then clustered using the sense assignments and systems are evaluated based on the quality of the snippet clusters. Our system adopts a preexis...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:
  • Internet Mathematics

دوره 3  شماره 

صفحات  -

تاریخ انتشار 2007