InfoMall: A Large-Scale Storage System for Web Archiving

Authors

  • Lian'en Huang
  • Jinping Li
  • Xiaoming Li
Abstract

The World Wide Web is a fluid medium: Web pages and entire Web sites frequently change or disappear, often without leaving any trace. Given the great value of the Web, it is necessary to archive the current Web for the future, and doing so requires a large-scale storage system. In this paper we propose such a system, designed to store the massive collection of Web pages we have been gathering consistently since 2001. One significant feature of this collection is that it has space-time dimensions: every Web page is associated with a URL and a crawl time, and a single URL may correspond to many Web pages crawled at different times. Our system is designed so that sorted Web pages are clustered and stored together at some degree of space-time granularity. As a result, users can effectively retrieve individual Web pages by specifying a URL and a time, or batches of Web pages by specifying URL ranges and time ranges.
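The space-time keying described in the abstract can be illustrated with a minimal in-memory sketch. This is not the InfoMall implementation, which the abstract does not detail; the class name `SpaceTimeStore`, its methods, and the use of ISO date strings as crawl times are all assumptions made for illustration. The sketch keeps page versions sorted by (URL, crawl time), so that versions of the same URL sit next to each other and both point lookups and range scans are possible.

```python
from bisect import bisect_left


class SpaceTimeStore:
    """Hypothetical toy index of page versions ordered by (url, crawl_time)."""

    def __init__(self):
        self._keys = []   # sorted list of (url, crawl_time) tuples
        self._pages = {}  # (url, crawl_time) -> page content

    def put(self, url, crawl_time, content):
        """Store one crawled version of a page, keeping keys sorted."""
        key = (url, crawl_time)
        if key not in self._pages:
            self._keys.insert(bisect_left(self._keys, key), key)
        self._pages[key] = content

    def get(self, url, crawl_time):
        """Point lookup: the version of `url` crawled at `crawl_time`, or None."""
        return self._pages.get((url, crawl_time))

    def scan(self, url_lo, url_hi, t_lo, t_hi):
        """Batch lookup: versions with url_lo <= url <= url_hi and t_lo <= time <= t_hi."""
        start = bisect_left(self._keys, (url_lo, t_lo))
        results = []
        for url, t in self._keys[start:]:
            if url > url_hi:
                break  # keys are sorted by URL first, so nothing later can match
            if t_lo <= t <= t_hi:
                results.append((url, t, self._pages[(url, t)]))
        return results


# Example usage with made-up URLs and crawl dates (ISO strings sort chronologically).
store = SpaceTimeStore()
store.put("http://example.com/a", "2001-05-01", "a v1")
store.put("http://example.com/a", "2003-07-15", "a v2")
store.put("http://example.com/b", "2002-01-10", "b v1")
store.put("http://example.org/", "2002-06-30", "org v1")

print(store.get("http://example.com/a", "2003-07-15"))
print(store.scan("http://example.com/a", "http://example.com/b",
                 "2002-01-01", "2003-12-31"))
```

A real archive at this scale would presumably keep the sorted runs on disk and group them into blocks at the chosen space-time granularity; the sorted list here only stands in for that ordering.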

Similar resources

A New Data Storage and Service Model of China Web InfoMall

The Web consists of an enormous number of pages, which vanish more easily than traditional media such as newspapers and journals. To preserve Web resources, we began the China Web archiving project, named Web InfoMall, in 2001. The paper describes the data storage and service model of Web InfoMall 2.0 to meet the goals of collecting material broadly, storing it perennially, and locating requests effici...

Implementation Issues of A Cloud Computing Platform

Cloud computing is Internet-based system development in which large, scalable computing resources are provided “as a service” over the Internet to users. The concept of cloud computing incorporates web infrastructure, software as a service (SaaS), Web 2.0 and other emerging technologies, and has attracted more and more attention from industry and the research community. In this paper, we describe ou...

ArcLink: Optimization techniques to build and retrieve the Temporal Web Graph

Archiving the web is socially and culturally critical, but presents problems of scale. In this paper, we present ArcLink, an exemplary system to optimize the construction and storage of, and access to, the temporal web graph from a large-scale web archive. We divide the web archive construction into four stages (filtering, extraction, storage, and access) and explore optimizations for each stage. We wer...

Stability analysis and selection of optimum support system of large scale underground space: a case study

The Azad pumped storage power plant, including the pumping and transformer cavern and surge tanks, is located in the Sanandaj-Sirjan formation, with alternating slate, phyllite and meta-sandstone. Due to the sensitivity and special use of these spaces, stability analysis and ensuring the safety of the cavern are very important. Nowadays, using surface storage tanks is very costly; The...

Publication date: 2013