Keeping Web Indices up-to-date

Authors

  • Marcel Dasen
  • Erik Wilde
Abstract

Search engines play a crucial role in the Web. Without search engines, large parts of the Web become inaccessible to the majority of users. Search engines can make new and smaller sites accessible at low cost. Without them, other media, such as television, would be needed to advertise the existence of a new site on the Web, and only large commercial sites can follow this path. The Web would be in danger of becoming dominated by a few well-known sites. A crucial problem for search engines is keeping their index up-to-date. As the index grows, the effort needed to update it increases, since Web documents are dynamic and already stored data thus becomes obsolete. There have been various attempts to monitor the evolution of the Web [1][2]. However, we believe that prior work overestimates the rate of change because it uses an inadequate change model. Our change model has been adapted from the information retrieval field to distinguish index-relevant changes from irrelevant modifications in Web documents, e.g. simple spelling corrections or dynamic advertisement links. We have monitored multiple smaller collections of documents over a period of six months to measure how the documents change. Not all changes in Web documents are index relevant: for example, links in a document might have been updated, some spelling might have been improved, or the document might have been extended with more material of the same kind. Therefore, Web change estimations based on bit identity, e.g. using checksums, typically overestimate the change in documents [1][2]. We have applied a more refined change model using a well-understood technique from information retrieval, the vector retrieval model [7][8]. In this model, the frequencies of occurrence of all words in a document form a vector describing the document. To account for the relative relevance of words, the words are additionally weighted inversely to their appearance in documents (inverse document frequency).
This model has been widely applied in digital library indexing and also in the context of the Web [3][4][5]. The change between two instances of a document is calculated by forming the scalar product of their vectors. The change between two documents is calculated by the following equality in the vector retrieval model. Multiple samples of 10K documents from the Web have been regularly revisited, and the change relative to the original documents has been assessed using the vector retrieval model (Figure 2). The sample has …
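The contrast between a bit-identity test and a vector-based change measure can be illustrated with a minimal Python sketch. It is not the authors' implementation: the excerpt does not reproduce their formula, so the sketch assumes the common length-normalized scalar product (cosine similarity) over idf-weighted word counts, a naive whitespace tokenizer, and a SHA-256 checksum. All function names and the tiny corpus are hypothetical.

```python
import hashlib
import math
from collections import Counter


def tokenize(text):
    """Lowercase the text and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def idf_weights(corpus):
    """Inverse document frequency per word: log(N / number of documents containing it)."""
    n_docs = len(corpus)
    doc_freq = Counter()
    for doc in corpus:
        doc_freq.update(set(tokenize(doc)))
    return {word: math.log(n_docs / df) for word, df in doc_freq.items()}


def tfidf_vector(text, idf):
    """Sparse word-frequency vector weighted by idf; unseen words get weight 0 (assumption)."""
    counts = Counter(tokenize(text))
    return {word: tf * idf.get(word, 0.0) for word, tf in counts.items()}


def similarity(vec_a, vec_b):
    """Scalar product of two sparse vectors, normalized by their lengths (cosine)."""
    dot = sum(w * vec_b.get(word, 0.0) for word, w in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def checksum_changed(old, new):
    """Bit-identity test: any byte difference counts as a change."""
    old_digest = hashlib.sha256(old.encode("utf-8")).digest()
    new_digest = hashlib.sha256(new.encode("utf-8")).digest()
    return old_digest != new_digest


# Hypothetical three-document corpus used only to derive idf weights.
corpus = [
    "search engines index web documents",
    "television advertises large commercial sites",
    "users visit web sites",
]
idf = idf_weights(corpus)

old = corpus[0]
minor_edit = "search engines index web web documents"        # one repeated word
rewritten = "television advertises large commercial sites"   # entirely different content

for label, new in [("minor edit", minor_edit), ("rewritten", rewritten)]:
    sim = similarity(tfidf_vector(old, idf), tfidf_vector(new, idf))
    print(label, "checksum changed:", checksum_changed(old, new), "similarity:", sim)
```

In this sketch the checksum reports a change for both revisions, while the similarity stays close to 1 for the minor edit and drops to 0 for the rewritten page. A crawler could treat a revision as index relevant only when the similarity falls below some threshold; the choice of threshold is not given in the excerpt.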


Similar resources

Information-Seeking Behaviors of Computer Scientists: Challenges for Electronic Literature Search Tools

Since the recent emergence of electronic literature resources, researchers have begun to adopt new information-seeking practices. The purpose of this research is to investigate the information needs and searching behaviors of researchers, and their implications for electronic literature search tools. We conducted mixed-method case studies involving interviews, diary logs, and observations of com...


Keeping Web Pages Up-to-Date with SQL: 1999

From the beginnings of the World Wide Web (WWW or Web) and the definition of the Common Gateway Interface (CGI), Web site administrators have used dynamically generated HTML pages to provide up-to-date information. Due to the high resource consumption of dynamic page generation approaches, many sites have switched over to periodical updates of frequently visited pages, e.g., a headline index of...


Comparison of ISI web of knowledge, SCOPUS, and Google Scholar h-indices of Iranian nuclear medicine scientists

Introduction: In the current study, we compared the h-indices of Iranian nuclear medicine scientists in Web of Science (WOS), SCOPUS, and Google Scholar (GS). Methods: Full-time members of two major nuclear medicine research centers of Iran with more than 5 years of experience (Nuclear Medicine Research Center of Mashhad University of Medical Sciences, and Research Institute for Nuclear Medicine of Tehran Un...


Acquiring XML pages for a WebHouse

Xyleme is a dynamic warehouse for the XML data of the web supporting change control and data integration. Major issues are the acquisition of XML data and keeping data up to date with the web as best as possible. This is the topic of the present paper.


Designed for Success - Empirical Evidence on Features of Corporate Web Pages

We investigate how eight concepts derived from the media characteristics of the WWW impact corporate success in E-Business if implemented as features of companies’ web sites. We construct a path model for testing our research hypotheses on three subsets of a representative survey of 1,308 cases, 469 general-, 215 which target businesses (B2B), and 224 companies which target consumers (B2C). We ...




Publication date: 2001