Towards Distribution of Web Sites in a Crawler Used for Large Scale Web Accessibility Assessment

Authors

  • LEIMING CHEN
  • DONGMEI WANG
Abstract

A mechanism used for large-scale accessibility measurement may involve a distributed web crawler. Furthermore, it makes sense to spread the web sites involved across different access points (crawler locations / crawler nodes) of the distributed crawler. In this publication we present an algorithm that utilises the available resources to a much greater extent than the traditional uniform distribution of web sites. Our novel algorithm, the Time Weighted Object Migration Automaton (TWOMA), is an extension of the Object Migration Automaton (OMA) presented in [1]. The heart of our scheme involves continuously accessing web sites while measuring the duration of each access. Note that accessing a site involves downloading it and measuring its accessibility. When a web site is accessed, the following happens. If the duration of accessing the web site is less than the average duration for all web sites in the corresponding access point, the web site is moved one state closer to the most internal state of this access point. If the duration of accessing the web site is more than the average duration in the corresponding access point, the web site is moved one state closer to the boundary state of this access point. If the site is already located in the boundary state, the site is moved to another random access point. The above scheme is repeated as long as the crawling / measurement is ongoing. This ensures that the scheme works in a dynamic environment (such as the real web). Furthermore, using experimental data, we show that the algorithm works towards an optimal distribution of web sites across the available access points.
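The update rule described in the abstract can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the class and method names (`Site`, `AccessPoint`, `Twoma`, `record_access`), the number of states per access point (`depth`), the use of each site's most recent duration to compute the average, and the choice to place a migrated site in the boundary state of its new access point are all assumptions, since the abstract does not specify them.

```python
import random
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Site:
    name: str
    state: int = 0  # 0 = boundary state; depth - 1 = most internal state
    last_duration: Optional[float] = None  # most recent measured access time


@dataclass
class AccessPoint:
    name: str
    sites: Dict[str, Site] = field(default_factory=dict)

    def average_duration(self) -> Optional[float]:
        """Mean of the latest measured durations of the sites held here."""
        known = [s.last_duration for s in self.sites.values()
                 if s.last_duration is not None]
        return sum(known) / len(known) if known else None


class Twoma:
    """Hypothetical sketch of the time-weighted migration rule."""

    def __init__(self, names: List[str], depth: int = 4,
                 rng: Optional[random.Random] = None) -> None:
        self.access_points = [AccessPoint(n) for n in names]
        self.depth = depth  # assumed number of states per access point
        self.rng = rng or random.Random()

    def add_site(self, ap_index: int, site_name: str) -> None:
        # New sites start in the boundary state (an assumption).
        self.access_points[ap_index].sites[site_name] = Site(site_name)

    def record_access(self, ap_index: int, site_name: str,
                      duration: float) -> int:
        """Apply one update; return the index of the access point
        that holds the site afterwards."""
        ap = self.access_points[ap_index]
        site = ap.sites[site_name]
        site.last_duration = duration
        average = ap.average_duration()  # never None: this site has a duration
        if duration < average:
            # Faster than average: one state closer to the most internal state.
            site.state = min(site.state + 1, self.depth - 1)
        elif duration > average:
            if site.state > 0:
                # Slower than average: one state closer to the boundary.
                site.state -= 1
            else:
                # Already in the boundary state: move to another access point.
                return self._migrate(ap_index, site)
        return ap_index

    def _migrate(self, ap_index: int, site: Site) -> int:
        others = [i for i in range(len(self.access_points)) if i != ap_index]
        if not others:
            return ap_index  # nowhere else to go
        target = self.rng.choice(others)  # "another random access point"
        del self.access_points[ap_index].sites[site.name]
        site.state = 0             # assumption: enter at the boundary state
        site.last_duration = None  # assumption: old timing is no longer valid
        self.access_points[target].sites[site.name] = site
        return target
```

In a crawler loop, `record_access` would be called after every download-and-measure cycle with the measured duration, so that sites which are consistently slow from one access point drift to the boundary state and eventually migrate elsewhere.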

Similar resources

Prioritize the ordering of URL queue in Focused crawler

The enormous growth of the World Wide Web in recent years has made it necessary to perform resource discovery efficiently. For a crawler it is not a simple task to download the domain-specific web pages. This unfocused approach often shows undesired results. Therefore, several new ideas have been proposed, among them a key technique is focused crawling which is able to crawl particular topical...

Positioning of Industries in Cyberspace Evaluation of Web Sites Using Correspondence Analysis

In today’s extremely competitive markets it is crucial for companies to strategically position their brands, products and services relative to their competitors. With the emerging trend in internationalization of companies, especially SMEs, and the growing use of the Internet in this regard, a great deal of attention has been turned to effective involvement of the Internet channel in the mar...

Crawling the Web: Discovery and Maintenance of Large-scale Web Data

This dissertation studies the challenges and issues faced in implementing an effective Web crawler. A crawler is a program that retrieves and stores pages from the Web, commonly for a Web search engine. A crawler often has to download hundreds of millions of pages in a short period of time and has to constantly monitor and refresh the downloaded pages. In addition, the crawler should avoid putt...

Early Results from Automatic Accessibility Benchmarking of Publ

Benchmarking of web accessibility is performed throughout Europe, to assess and raise awareness of web accessibility. The evaluation is often based on manual assessments with a high cost and with long intervals. The Web Content Accessibility Guidelines from W3C/WAI are the basis of most evaluations. Although the same guidelines are used, a range of different evaluation methodologies and scoring...

Reliability, Readability and Quality of Online Information about Femoracetabular Impingement

Background: The Internet has become the most widely used source for patients seeking more information about their health, and many sites geared towards this audience have gained widespread use in recent years. Additionally, many healthcare institutions publish their own patient-education web sites with information regarding common conditions. Little is known about how these resources impact pati...

Publication date: 2006