Distributed Analytics over Web Archives

Authors

  • Yagiz Kargin
  • Srikanta Bedathur
  • Avishek Anand
  • Gerhard Weikum
Abstract

The evolving content of the Web is accumulated over time into Web archive collections. This creates a need for time-travel search to explore the dynamics of that content. Text analytics also plays a key role in extracting interesting information from text collections. In particular, frequent phrase mining, a special case of text analytics, is an important task motivated by the need for knowledge of frequent phrases in areas of computer science such as information retrieval and machine translation. However, time-travel search and frequent phrase mining have to be conducted on increasingly large-scale data. Distributed approaches such as MapReduce, which is designed mainly to process vast amounts of text, can be used in this setting.

We address two separate problems in this thesis. The first is that the time-travel inverted index, which enables search along the time dimension, has been proposed for a centralized setting. We parallelize the construction of the time-travel inverted index using MapReduce, producing a distributed index as the end product. The second problem is that finding frequent phrases through naïve counting, even in MapReduce, is time-consuming, because the data to be processed grows much larger once phrases are considered. We present partitioned approximate phrase counting, a very fast way to retrieve most of the frequent phrases, together with their counts, from a collection to enable interactive analysis of the content. As part of this, we propose a novel technique, partitioned in-mapper combining, which lets us aggregate data in memory correctly even when the data to be aggregated is larger than the available memory.

Experiments on the New York Times Annotated Corpus, which contains roughly 2 million documents, show that our approach runs at least 2 times faster than the naïve approach. It finds more than 90% of frequent phrases with high precision. Moreover, it finds all highly frequent phrases exactly, along with their accurate counts. Furthermore, with a quick second pass over the data, we provide most of the frequent phrases with their true counts, while still being significantly faster than the naïve approach.

Figure 1: Tag cloud of phrases in New York Times Annotated Corpus (1987-2007)
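The abstract does not spell out how partitioned in-mapper combining works, so the following is only a minimal single-process sketch of one plausible reading, not the thesis's implementation. It assumes that phrases are word n-grams, that the key space is split by a hash into `num_partitions` groups, and that the mapper makes one pass per group so that its in-memory buffer only ever holds one group's counts. All function names and parameters here are hypothetical.

```python
import zlib
from collections import Counter
from typing import Dict, Iterator, List, Sequence


def ngrams(tokens: List[str], max_len: int) -> Iterator[str]:
    """Yield every contiguous phrase of 1..max_len tokens."""
    for start in range(len(tokens)):
        for n in range(1, max_len + 1):
            if start + n <= len(tokens):
                yield " ".join(tokens[start:start + n])


def partition_of(phrase: str, num_partitions: int) -> int:
    """Deterministic hash partition (assumption: any stable hash would do)."""
    return zlib.crc32(phrase.encode("utf-8")) % num_partitions


def count_phrases_partitioned(
    docs: Sequence[str], max_len: int = 3, num_partitions: int = 4
) -> Counter:
    """Count phrases with one in-memory buffer per key partition.

    Each pass re-reads the input but buffers only the phrases that hash
    to the current partition, so the buffer stays roughly
    1/num_partitions of the full key space.
    """
    totals: Counter = Counter()
    for p in range(num_partitions):
        buffer: Counter = Counter()  # in-mapper combiner for partition p only
        for doc in docs:
            tokens = doc.lower().split()
            for phrase in ngrams(tokens, max_len):
                if partition_of(phrase, num_partitions) == p:
                    buffer[phrase] += 1
        # Stand-in for emitting the combined (phrase, count) pairs to reducers.
        totals.update(buffer)
    return totals


def frequent_phrases(counts: Counter, min_support: int) -> Dict[str, int]:
    """Keep only phrases whose count reaches min_support."""
    return {p: c for p, c in counts.items() if c >= min_support}


docs = ["new york times", "new york city", "the new york times"]
counts = count_phrases_partitioned(docs, max_len=2, num_partitions=3)
print(frequent_phrases(counts, min_support=3))
```

Because every phrase belongs to exactly one partition, the combined counts are exact in this sketch; the trade-off is that the input is scanned once per partition. The approximate counting and the second pass mentioned in the abstract are not modeled here.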

Similar resources

Observing the Web of Data through Efficient Distributed SPARQL Queries

Dealing with heterogeneity is one of the key challenges of Big Data analytics. The emergence of Linked Data provides better interoperability and thus further enhances potential of Big Data analytics. The use of Linked Data for analytics raises performance challenges when considering the distribution of data sources and the performance of Linked Data stores in comparison to other storage technol...

Exploiting Multimedia in Creating and Analysing Multimedia Web Archives

The data contained on the web and the social web are inherently multimedia and consist of a mixture of textual, visual and audio modalities. Community memories embodied on the web and social web contain a rich mixture of data from these modalities. In many ways, the web is the greatest resource ever created by human-kind. However, due to the dynamic and distributed nature of the web, its conten...

Event Processing over a Distributed JSON Store: Design and Performance

Web applications are increasingly built to target both desktop and mobile users. As a result, modern Web development infrastructure must be able to process large numbers of events (e.g., for location-based features) and support analytics over those events, with applications ranging from banking (e.g., fraud detection) to retail (e.g., just-in-time personalized promotions). We describe a system ...

Longitudinal Analytics on Web Archive Data: It's About Time!

Organizations like the Internet Archive have been capturing Web contents over decades, building up huge repositories of time-versioned pages. The timestamp annotations and the sheer volume of multi-modal content constitute a gold mine for analysts of all sorts, across different application areas, from political analysts and marketing agencies to academic researchers and product developers. In ...

Privacy-preserving Distributed Analytics: Addressing the Privacy-Utility Tradeoff Using Homomorphic Encryption for Peer-to-Peer Analytics

Data is becoming increasingly valuable, but concerns over its security and privacy have limited its utility in analytics. Researchers and practitioners are constantly facing a privacy-utility tradeoff where addressing the former is often at the cost of the data utility and accuracy. In this paper, we draw upon mathematical properties of partially homomorphic encryption, a form of asymmetric key...


Publication date: 2011