The Availability and Persistence of Web References in D-Lib Magazine

نویسندگان

  • Frank McCown
  • Sheffan Chan
  • Michael L. Nelson
  • Johan Bollen
چکیده

We explore the availability and persistence of URLs cited in articles published in D-Lib Magazine. We extracted 4387 unique URLs referenced in 453 articles published from July 1995 to August 2004. The availability was checked three times a week for 25 weeks from September 2004 to February 2005. We found that approximately 28% of those URLs failed to resolve initially, and 30% failed to resolve at the last check. A majority of the unresolved URLs were due to 404 (page not found) and 500 (internal server error) errors. The content pointed to by the URLs was relatively stable; only 16% of the content registered more than a 1 KB change during the testing period. We explore possible factors which may cause a URL to fail by examining its age, path depth, top-level domain and file extension. Based on the data collected, we found the half-life of a URL referenced in a D-Lib Magazine article is approximately 10 years. We also found that URLs were more likely to be unavailable if they pointed to resources in the .net, .edu or country-specific top-level domain, used non-standard ports (i.e., not port 80), or pointed to resources with uncommon or deprecated extensions (e.g., .shtml, .ps, .txt).

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Multilingual Federated Searching Across Heterogeneous Collections

This article describes a scalable system for searching heterogeneous multilingual collections on the World Wide Web. It details a markup language for describing the characteristics of a search engine and its interface, and a protocol for requesting word translations between languages.

متن کامل

Object Persistence and Availability in Digital Libraries

We have studied object persistence and availability of 1,000 digital library (DL) objects. Twenty World Wide Web accessible DLs were chosen and from each DL, 50 objects were chosen at random. A script checked the availability of each object three times a week for just over 1 year for a total of 161 data samples. During this time span, we found 31 objects (3% of the total) that appear to no long...

متن کامل

Observed Web Robot Behavior on Decaying Web Subsites

We describe the observed crawling patterns of various search engines (including Google, Yahoo and MSN) as they traverse a series of web subsites whose contents decay at predetermined rates. We plot the progress of the crawlers through the subsites, and their behaviors regarding the various file types included in the web subsites. We chose decaying subsites because we were originally interested ...

متن کامل

A Method for Identifying Personalized Representations in the Archives

Web resources are becoming increasingly personalized — two different users clicking on the same link at the same time can see content customized for each individual user. These changes result in multiple representations of a resource that cannot be canonicalized in Web archives. We identify characteristics of this problem by presenting a potential solution to generalize personalized representat...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:
  • CoRR

دوره abs/cs/0511077  شماره 

صفحات  -

تاریخ انتشار 2005