Using XPaths of inbound links to cluster template-generated web pages
Abstract
Template-generated Web pages contain most of the structured data on the Web. Clustering these pages according to their template structure is an important problem in wrapper-based structured data extraction systems. These systems extract structured data using wrappers, each of which must be matched only to pages of a particular template. Selecting pages of a single template type from all crawled Web pages is a time-consuming task. Although there are methods that cluster Web pages according to their structural similarity, in most cases they are too computationally expensive to be applicable at Web scale. We propose a novel, highly scalable approach that structurally clusters Web pages by employing the XPath addresses of inbound inner-site links. We demonstrate the effectiveness of our method by clustering more than one million Web pages from many real-world Websites in a few minutes, achieving more than 90% accuracy.
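The abstract does not spell out the algorithm, so the following Python is only a minimal sketch of the general idea, not the authors' method. It assumes that each link's XPath can be approximated by the plain tag path to its `<a>` element (no positional indices), that hrefs are already normalized site-internal URLs, and that two target pages belong to the same cluster when the sets of XPaths of their inbound links are identical. The names `LinkXPathParser` and `cluster_by_inbound_xpath` are hypothetical.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class LinkXPathParser(HTMLParser):
    """Collects (href, tag-path) pairs for every <a href=...> in one page."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, outermost first
        self.links = []   # (href, simplified XPath of the <a> element)

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self.stack.append(tag)
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append((href, "/" + "/".join(self.stack)))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerates unclosed inner tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def cluster_by_inbound_xpath(pages):
    """pages: dict mapping source URL -> HTML. Returns a list of URL sets.

    Assumption: target pages are grouped when their sets of inbound-link
    XPaths are exactly equal. A real system would likely need a softer match.
    """
    inbound = defaultdict(set)            # target URL -> XPaths of links to it
    for html in pages.values():
        parser = LinkXPathParser()
        parser.feed(html)
        for href, xpath in parser.links:
            inbound[href].add(xpath)

    clusters = defaultdict(set)           # XPath signature -> target URLs
    for url, xpaths in inbound.items():
        clusters[frozenset(xpaths)].add(url)
    return list(clusters.values())


# Toy site: two product links share a list template, one link sits in a footer div.
example_pages = {
    "/": "<html><body><ul>"
         "<li><a href='/p/1'>A</a></li>"
         "<li><a href='/p/2'>B</a></li>"
         "</ul><div><a href='/about'>About</a></div></body></html>",
}
example_clusters = cluster_by_inbound_xpath(example_pages)
print(sorted(sorted(c) for c in example_clusters))
# → [['/about'], ['/p/1', '/p/2']]
```

Because the grouping only reads link positions on the referring pages, the target pages themselves never need to be parsed or compared pairwise, which is one plausible reason an approach of this kind could scale well.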
Similar resources
Weighted Page Rank Algorithm Based on Number of Visits of Links of Web Page
The World Wide Web consists of billions of web pages, and a huge amount of information is available within them. To retrieve the required information from the World Wide Web, search engines perform a number of tasks based on their respective architectures. When a user submits a query to a search engine, it generally returns a large number of pages in response. To support the users to navigate...
Methods for Sampling Pages Uniformly from the World Wide Web
We present two new algorithms for generating uniformly random samples of pages from the World Wide Web, building upon recent work by Henzinger et al. (Henzinger et al. 2000) and Bar-Yossef et al. (Bar-Yossef et al. 2000). Both algorithms are based on a weighted random-walk methodology. The first algorithm (DIRECTED-SAMPLE) operates on arbitrary directed graphs, and so is naturally applicable to...
A New Hybrid Method for Web Pages Ranking in Search Engines
There are many algorithms for optimizing search engine results. Ranking takes place according to one or more parameters, such as backward links, forward links, content, and click-through rate. The quality and performance of these algorithms depend on the listed parameters. Ranking is one of the most important components of a search engine that represents the degree of the vitality...
A Technique for Improving Web Mining using Enhanced Genetic Algorithm
The World Wide Web is growing at a very fast pace and makes a lot of information available to the public. Search engines use conventional methods to retrieve information on the Web; however, their search results can still be refined, and their accuracy is not high enough. One family of methods for web mining is evolutionary algorithms, which search according to the user interests...
RoadRunner for Heterogeneous Web Pages Using Extended MinHash
The Internet presents a large amount of useful information that is usually formatted for human users, which makes it hard to extract relevant data from diverse sources. Therefore, there is a significant need for robust, flexible Information Extraction (IE) systems that transform web pages into program-friendly structures such as a relational database. IE produces structure...
Journal: Comput. Sci. Inf. Syst.
Volume: 11
Publication date: 2014