Webspam demotion: Low complexity node aggregation methods

Authors

  • Thomas Largillier
  • Sylvain Peyronnet
Abstract

Search engine results pages (SERPs) for a specific query are constructed according to several mechanisms. One of them consists in ranking Web pages by their importance, regardless of their semantics. Indeed, relevance to a query is not enough to provide high-quality results, and popularity is used to arbitrate between equally relevant Web pages. The best-known algorithm that ranks Web pages according to their popularity is PageRank. The term Webspam was coined to denote Web pages created with the sole purpose of fooling ranking algorithms such as PageRank. The goal of Webspam is to promote a target page by increasing its rank. Spotting and discarding Webspam is therefore an important issue for Web search engines, which need to provide their users with an unbiased list of results. Webspam techniques evolve constantly to remain effective, but most of the time they still consist in creating a specific linking architecture around the target page to increase its rank. In this paper we propose to study the effects of node aggregation on Google's well-known ranking algorithm (PageRank) in the presence of Webspam. Our node aggregation methods aim to construct clusters of nodes that are treated as a single node in the PageRank computation. Since the Web graph is far too big for classic clustering techniques, we present four lightweight aggregation techniques suited to its size. Experimental results on the WEBSPAM-UK2007 dataset show the interest of the approach, which is moreover confirmed by statistical evidence.
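The idea of treating a cluster as a single node in the PageRank computation can be illustrated with a small sketch. The abstract does not describe the four aggregation techniques, so the clustering below is hand-picked and purely hypothetical, and the toy graph, the function names `aggregate` and `pagerank`, and the damping value 0.85 are all assumptions made for illustration rather than the authors' method.

```python
from collections import defaultdict


def aggregate(nodes, edges, cluster_of):
    """Collapse every cluster into one super-node (hypothetical helper).

    cluster_of maps a node to its cluster label; unmapped nodes are kept.
    Links inside a cluster and duplicate links between clusters are dropped.
    """
    def label(v):
        return cluster_of.get(v, v)

    merged_nodes = sorted({label(v) for v in nodes})
    merged_edges = sorted(
        {(label(s), label(d)) for s, d in edges if label(s) != label(d)}
    )
    return merged_nodes, merged_edges


def pagerank(nodes, edges, damping=0.85, iterations=100):
    """Plain power-iteration PageRank; dangling mass is spread uniformly."""
    nodes = list(nodes)
    n = len(nodes)
    out_links = defaultdict(list)
    for src, dst in edges:
        out_links[src].append(dst)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        dangling = sum(rank[v] for v in nodes if not out_links[v])
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {v: base for v in nodes}
        for src in nodes:
            targets = out_links[src]
            if targets:
                share = damping * rank[src] / len(targets)
                for dst in targets:
                    new_rank[dst] += share
        rank = new_rank
    return rank


# Toy graph (assumed): honest pages a, b, c form a cycle, and a link farm
# s1, s2, s3 exchanges links with the target page t to inflate its rank.
nodes = ["a", "b", "c", "t", "s1", "s2", "s3"]
edges = [
    ("a", "b"), ("b", "c"), ("c", "a"), ("a", "t"),
    ("s1", "t"), ("s2", "t"), ("s3", "t"),
    ("t", "s1"), ("t", "s2"), ("t", "s3"),
]
# Hand-picked cluster standing in for the output of an aggregation method.
farm = {v: "farm" for v in ("t", "s1", "s2", "s3")}

before = pagerank(nodes, edges)
after = pagerank(*aggregate(nodes, edges, farm))
```

In this toy graph the target `t` outranks the honest page `a` before aggregation, while the merged `farm` node ranks below `a` afterwards, because the farm's internal links no longer recycle rank towards the target. Dropping intra-cluster links is one possible design choice; a real method also has to find the clusters cheaply, which is the part the paper addresses.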

Similar resources

Detecting Webspam Beneficiaries Using Information Collected by the Random Surfer

Search engines use several criteria to rank webpages and choose which pages to display when answering a request. Those criteria can be separated into two notions, relevance and popularity. Popularity is calculated by the search engine and is related to the links made to the webpage. Malicious webmasters want to artificially increase their popularity; the techniques they use are often ...

Community Detection using a New Node Scoring and Synchronous Label Updating of Boundary Nodes in Social Networks

Community structure is vital for discovering the important structures and potential properties of complex networks. In recent years, improving the quality of local community detection approaches has become a hot topic in the study of complex networks, owing to their linear time complexity and applicability to large-scale networks. However, there are many shortcomings in these methods such as in...

Machine Learning Methods for Spamdexing Detection

In this paper, we present recent contributions to the battle against one of the main problems faced by search engines: spamdexing, or web spamming. These are malicious techniques used in web pages with the purpose of circumventing search engines in order to achieve good visibility in search results. To better understand the problem and find the best setup and methods to avoid such virtua...

Universal Steiner Trees for Data Aggregation in Low Doubling Metrics

We describe a novel approach for constructing a single spanning tree for data aggregation towards a sink node, which we call a Universal Steiner Tree (UST). The tree is universal in the sense that it is static and independent of the number of data sources and of the fusion costs at intermediate nodes. The tree is constructed in polynomial time, and for low doubling dimension topologies it guarantees a...

Prediction-based data aggregation in wireless sensor networks: Combining grey model and Kalman Filter

In many environmental monitoring applications, since the data periodically sensed by wireless sensor networks usually are of high temporal redundancy, prediction-based data aggregation is an important approach for redu...

Journal:
  • Neurocomputing

Volume 76

Publication date: 2012