Fully distributed PageRank computation with exponential convergence

Authors

  • Liang Dai
  • Nikolaos M. Freris
Abstract

This work studies a fully distributed algorithm for computing the PageRank vector. The algorithm is inspired by Matching Pursuit and has three features: 1) it is fully distributed; 2) it converges in expectation at an exponential rate; 3) it has a low storage requirement (two scalar values per page). Illustrative experiments are conducted to verify the findings.

I. PROBLEM STATEMENT

The PageRank vector was proposed by the founders of Google to quantify the importance ranking of webpages on the Internet [1], [3]. Owing to the generality of the idea, PageRank has been extended to applications in biology, chemistry and other domains; more details can be found in the review paper [4].

Suppose there are N pages in a network. The connectivity (i.e., the topology induced by the hyperlinks present in websites) can be characterized by the hyperlink matrix A ∈ R^{N×N}, defined as follows: its (i, j)-th element is A_{i,j} = 1/N_j if there is a link from page j to page i, where N_j denotes the number of outgoing links of page j (the number of pages that page j points to); otherwise A_{i,j} = 0. By construction, A is a non-negative, column-stochastic matrix (i.e., a matrix with non-negative elements whose columns each sum to one). In this work, we assume without loss of generality that there are no dangling pages (i.e., pages with no outgoing links), so that A has no zero columns.

One potential choice for the PageRank vector is the normalized principal eigenvector of A (with elements summing to one). However, one drawback of this choice is that, when the network is not fully connected, the principal eigenvector of A may not be unique. To overcome this problem, a 'perturbed' version of A is adopted for defining the PageRank vector, given by M = αA + (1 − α)S, where S = (1/N) 1 1^T, with 1 = [1, 1, ..., 1]^T ∈ R^N and α ∈ (0, 1). The suggested value for α is 0.85 [1].
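The construction of A and M can be illustrated on a toy example. The following is a minimal sketch, assuming pages are labelled 0, ..., N − 1 and each page's outgoing links are given as a list without repeated entries; the function names and the 4-page graph are hypothetical and chosen only for illustration.

```python
def hyperlink_matrix(out_links):
    """Return A as a list of rows, with A[i][j] = 1/N_j if page j links to page i."""
    n = len(out_links)
    A = [[0.0] * n for _ in range(n)]
    for j, targets in out_links.items():
        if not targets:
            # The text assumes there are no dangling pages.
            raise ValueError(f"page {j} is dangling (no outgoing links)")
        for i in targets:
            A[i][j] = 1.0 / len(targets)
    return A


def perturbed_matrix(A, alpha=0.85):
    """Return M = alpha * A + (1 - alpha) * S, where every entry of S equals 1/N."""
    n = len(A)
    return [[alpha * A[i][j] + (1.0 - alpha) / n for j in range(n)]
            for i in range(n)]


# Hypothetical 4-page network: page j -> list of pages that j links to.
out_links = {0: [1, 2], 1: [2], 2: [0], 3: [0, 2]}
A = hyperlink_matrix(out_links)
M = perturbed_matrix(A)

# Each column of M should sum to one (column-stochastic).
column_sums = [sum(M[i][j] for i in range(4)) for j in range(4)]
```

Every entry of M is strictly positive even though A is sparse, which is why a dense M becomes impractical to store for large N.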
The original PageRank vector is defined as follows.

Definition 1 (PageRank): Given the perturbed hyperlink matrix M, the PageRank vector x* is the unique vector that satisfies: 1) Mx* = x*; 2) ∑_{i=1}^{N} x*_i = 1 and x* ≥ 0.

Note that since M is a positive, column-stochastic, and irreducible matrix, the Perron-Frobenius Theorem [5] guarantees existence and uniqueness of a positive right-eigenvector with corresponding eigenvalue equal to 1, which is precisely x*. The second property is simply a normalization of its entries to sum to one.

(L. Dai and N. Freris are with the Department of Information Technology of Halmstad University, Halmstad, P.O. Box 823, Sweden and the Engineering Division of New York University Abu Dhabi, Saadiyat Island, P.O. Box 129188, UAE, respectively.)

It is plain to see that the ranking is not affected by a positive rescaling of the PageRank vector. In our work, we adopt the following scaled PageRank vector (positively rescaled by the network size N). The main advantage of this choice is that computations will not involve the network size N, as will be made clear in the sequel.

Definition 2 (Scaled PageRank): Given the perturbed hyperlink matrix M, the scaled PageRank vector x* is the vector that satisfies: 1) Mx* = x*; 2) ∑_{i=1}^{N} x*_i = N and x* ≥ 0.

As the Internet is of huge scale, it becomes very difficult to store the entire matrix M and solve Mx = x on a single machine. This is performed by Google on a regular basis using the centralized power iteration [3], which requires large storage and computational power. Additionally, a change in the matrix M (for example, the creation or deletion of a website, or changes in the hyperlinks present in a page) typically entails re-computation of the PageRank vector from scratch.
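For reference, the centralized power iteration mentioned above can be sketched as follows for the scaled PageRank vector of Definition 2. This is a minimal illustration and not the production method of [3]; the function name, the stopping rule and the 4-page graph are assumptions. It uses the fact that Sx = 1 whenever the entries of x sum to N, so M never has to be formed explicitly.

```python
def scaled_pagerank_power(out_links, alpha=0.85, tol=1e-12, max_iter=10_000):
    """Iterate x <- M x, starting from the all-one vector (which sums to N)."""
    n = len(out_links)
    x = [1.0] * n
    for _ in range(max_iter):
        # (1 - alpha) * S x equals (1 - alpha) * 1 because sum(x) = N is preserved.
        new_x = [1.0 - alpha] * n
        # alpha * A x: page j passes an equal share of its value to each page it links to.
        for j, targets in out_links.items():
            share = alpha * x[j] / len(targets)
            for i in targets:
                new_x[i] += share
        change = sum(abs(a - b) for a, b in zip(new_x, x))
        x = new_x
        if change < tol:
            break
    return x


# Hypothetical 4-page network: page j -> list of pages that j links to.
out_links = {0: [1, 2], 1: [2], 2: [0], 3: [0, 2]}
x = scaled_pagerank_power(out_links)
```

Each sweep touches every page and needs the whole vector x, which is the storage and synchronization cost that distributed schemes aim to avoid.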
To overcome these difficulties, several distributed methods (in which each page updates its PageRank value by exchanging information only with neighbouring pages, i.e., pages that it links to or pages that link to it) have been suggested for this problem. Based on the idea of Monte Carlo simulation, [9] proposed the following approach: starting from each node, the algorithm performs multiple rounds of random walks via certain absorbing Markov chains, and the PageRank value of a node is estimated by the frequency of visits to that node over all the random walks. The method features fast convergence as well as distributed implementation; however, running a large number of random walks simultaneously may lead to congestion in the network.

In the following, we focus on reviewing the ideas based on linear-algebraic techniques. In [6], a randomized distributed algorithm was proposed based on stochastic power iterations together with the Polyak averaging scheme. Recently, based on an application of the Stochastic Approximation (SA) framework [13], a randomized distributed algorithm was designed in [12]. Nonetheless, in both [6] and [12], during each update a webpage needs to request information from its incoming neighbours (i.e., the set of webpages that link to it), which might impose practical limitations: 1) it either requires additional storage of a list of incoming neighbours, which could sometimes be of huge size; 2) or it might incur delays (for example, waiting until all the information has been transmitted) in obtaining the values from the incoming neighbours. Furthermore, the approaches in [6] and [12] are (or can be reformulated as) SA-type algorithms, which feature a sub-exponential convergence rate [14], [12]. In [15], a distributed algorithm based on randomized incremental optimization was proposed; nonetheless, similarly to the work in [6] and [12], information from incoming pages is required for the algorithm's updates.
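The Monte Carlo idea reviewed at the start of this discussion can be illustrated with a short simulation. The sketch below is an assumption-laden illustration rather than the exact scheme of [9]: each walk is absorbed with probability 1 − α at every step, otherwise it moves to a uniformly chosen outgoing page, and the scaled PageRank estimate of a page is (1 − α) times its average number of visits per round of walks. The function name, the number of walks and the 4-page graph are hypothetical.

```python
import random


def scaled_pagerank_monte_carlo(out_links, alpha=0.85, walks_per_page=20_000, seed=0):
    """Estimate the scaled PageRank vector from visit counts of absorbing random walks."""
    rng = random.Random(seed)
    n = len(out_links)
    visits = [0] * n
    for start in range(n):
        for _ in range(walks_per_page):
            page = start
            while True:
                visits[page] += 1
                if rng.random() >= alpha:
                    break  # the walk is absorbed with probability 1 - alpha
                page = rng.choice(out_links[page])
    # Expected visits to page j per round (one walk from every page) equal x_j / (1 - alpha).
    return [(1.0 - alpha) * v / walks_per_page for v in visits]


# Hypothetical 4-page network: page j -> list of pages that j links to.
out_links = {0: [1, 2], 1: [2], 2: [0], 3: [0, 2]}
estimate = scaled_pagerank_monte_carlo(out_links)
```

The estimate is noisy, and its accuracy improves only with the number of walks launched, which is the source of the congestion concern noted above.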
In this work, seeking to overcome these issues, we propose a fully distributed algorithm (in which an updating webpage uses only the PageRank values of its outgoing pages, and no knowledge of the network size is required) with provable exponential convergence (in expectation). From a signal decomposition point of view, the proposed method can be seen as a randomized Matching Pursuit algorithm. The main attributes of the new algorithm are: 1) it uses only the knowledge of the outgoing webpages, and no knowledge of the network size is assumed; 2) it converges exponentially fast, in expectation; 3) it only requires storing two scalar values per webpage (the PageRank estimate along with a residual value, explicated below).

II. THE PROPOSED ALGORITHM

In the following, U[m, n] will be used to denote the uniform sampling of a natural number between m and n. Matlab conventions will be used to denote the rows and columns of a matrix. I, 1 and 0 denote the identity matrix, the all-one vector and the all-zero vector respectively, where the dimension will be clear from the context. e_k denotes the k-th unit vector (1 in the k-th entry and 0 elsewhere), while ‖ · ‖ represents the ℓ2 norm of a vector.

A. Problem Reformulation

Substituting M into Definition 2 of the scaled PageRank vector, and using the property of the matrix S that Sx = 1 for any x with ∑_i x_i = N, we have the following equivalent characterization of the scaled PageRank vector
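Substituting M = αA + (1 − α)S into Mx = x and using Sx = 1 gives αAx + (1 − α)1 = x, that is, (I − αA)x = (1 − α)1. Assuming this linear system is the characterization meant here, the sketch below applies a textbook randomized Matching Pursuit step to the columns of I − αA. It is an illustration of the general idea under that assumption, not a transcription of the authors' algorithm, whose details are not part of this text; the function name, the number of updates and the 4-page graph are hypothetical.

```python
import random


def matching_pursuit_pagerank(out_links, alpha=0.85, num_updates=20_000, seed=0):
    """Randomized Matching Pursuit on (I - alpha*A) x = (1 - alpha) * 1."""
    rng = random.Random(seed)
    pages = list(out_links)
    x = {k: 0.0 for k in pages}          # PageRank estimate, one scalar per page
    r = {k: 1.0 - alpha for k in pages}  # residual (1 - alpha)*1 - (I - alpha*A) x
    for _ in range(num_updates):
        k = rng.choice(pages)            # page chosen uniformly at random
        targets = out_links[k]
        weight = alpha / len(targets)
        # Column k of I - alpha*A: +1 at k and -alpha/N_k at each page that k links to.
        col = {k: 1.0}
        for i in targets:
            col[i] = col.get(i, 0.0) - weight
        norm_sq = sum(v * v for v in col.values())
        # Project the residual onto the chosen column and remove that component.
        step = sum(v * r[i] for i, v in col.items()) / norm_sq
        x[k] += step
        for i, v in col.items():
            r[i] -= step * v
    return x, r


# Hypothetical 4-page network: page j -> list of pages that j links to.
out_links = {0: [1, 2], 1: [2], 2: [0], 3: [0, 2]}
x, r = matching_pursuit_pagerank(out_links)
```

In this sketch an update of page k reads and writes only its own two scalars and the residuals of the pages it links to, and the initialization does not depend on N, which is consistent with the attributes listed above.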




Journal:
  • CoRR

Volume: abs/1705.09927

Publication year: 2017