Approximate Frequency Counts over Data Streams
Authors
Abstract
Research in data stream algorithms has blossomed since the late 1990s. The talk will trace the history of the Approximate Frequency Counts paper, how it was conceptualized, and how it influenced data stream research. The talk will also touch upon a recent development: the analysis of personal data streams for improving our quality of life.

1. BIOGRAPHICAL SKETCHES

Gurmeet Manku (1973-) has been a software engineer at Google since 2004. He has worked for the Infrastructure, Google+ and Ads teams. Presently, he is part of the Google Analytics team, which is focused on data mining of clickstream data. Gurmeet finished his B.Tech in Computer Science at IIT Delhi (1995). He then received his M.S. and Ph.D. from UC Berkeley (1997) and Stanford University (2004), respectively. In between, he worked in the Exploratory Database Group at the IBM Almaden Research Center for two years. Gurmeet has written over 20 research papers in top conferences. His areas of interest have included data stream algorithms, peer-to-peer systems, and data compression.

Rajeev Motwani (1962-2009) was a professor of Computer Science at Stanford University whose research focused on theoretical computer science. He was an early advisor and supporter of companies including Google and PayPal, and a special advisor to Sequoia Capital. He completed his B.Tech in Computer Science at IIT Kanpur in 1983, received his Ph.D. in Computer Science from U.C. Berkeley in 1988 under the supervision of Richard Karp, and joined Stanford soon afterwards. Motwani was one of the co-authors (with Larry Page, Sergey Brin, and Terry Winograd) of an influential early paper on the PageRank algorithm, the basis for Google's search techniques in its early days. He also co-authored another seminal search paper, "What Can You Do With A Web In Your Pocket", with those same authors.
He was also an author of two widely used theoretical computer science textbooks: Randomized Algorithms (Cambridge University Press, 1995, with Prabhakar Raghavan) and Introduction to Automata Theory, Languages, and Computation (2nd ed., Addison-Wesley, 2000, with John Hopcroft and Jeffrey Ullman). Prior to his involvement with Google, Motwani founded the Mining Data at Stanford project (MIDAS), an umbrella organization for several groups looking into new and innovative data management concepts. His research included data privacy, web search, robotics, and computational drug design. He was an avid angel investor and had funded a number of successful startups that emerged from Stanford. He sat on the boards of Google, Kaboodle, Mimosa Systems, Adchemy, Baynote, Vuclip, NeoPath Networks (acquired by Cisco Systems in 2007), Tapulous, and Stanford Student Enterprises, among others. He was also active in the Business Association of Stanford Entrepreneurial Students (BASES). He won the Gödel Prize in 2001 for his work on the PCP theorem and its applications to the hardness of approximation. He served on the editorial boards of the SIAM Journal on Computing, the Journal of Computer and System Sciences, ACM Transactions on Knowledge Discovery from Data, and IEEE Transactions on Knowledge and Data Engineering.

2. CITATION FROM THE TEN-YEAR BEST-PAPER AWARD COMMITTEE

This paper [1], one of many on the hot topic of data streams that year (2002), presents algorithms for computing frequency counts exceeding a user-specified threshold. The paper deftly combines theory, algorithms, and experiments, introducing novel algorithms for sticky sampling and lossy counting (with provably bounded error), which are important for many applications in databases in general, in data mining, in web-server logs, and in networking.
A beautifully written paper, it has garnered a truly amazing number of citations over the last decade (a quality shared by some of the other papers appearing in that unusually impactful conference), including a good number just in the last year, a sign that the paper is still quite relevant. One general concept highlighted in the paper is that of summary data structures with a small memory footprint.
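The lossy counting algorithm singled out by the committee maintains, for each tracked item, a count and a bound on its possible undercount, pruning infrequent entries at bucket boundaries. A minimal Python sketch of the published algorithm (function and variable names here are illustrative, not from the paper):

```python
import math

def lossy_count(stream, epsilon):
    """Lossy Counting: one-pass approximate frequency counts.

    Every tracked item's true count lies between freq and freq + delta,
    and no item whose true count exceeds epsilon * N is ever dropped.
    """
    w = math.ceil(1 / epsilon)            # bucket width
    counts = {}                           # item -> [freq, delta]
    n = 0
    for item in stream:
        n += 1
        b_current = math.ceil(n / w)      # id of the current bucket
        if item in counts:
            counts[item][0] += 1
        else:
            # delta = b_current - 1 bounds how many earlier occurrences were missed
            counts[item] = [1, b_current - 1]
        if n % w == 0:                    # bucket boundary: prune infrequent entries
            counts = {k: v for k, v in counts.items() if v[0] + v[1] > b_current}
    return counts, n

def frequent_items(counts, n, support, epsilon):
    """Report items whose true frequency may exceed support * n."""
    return {k for k, (freq, _) in counts.items() if freq >= (support - epsilon) * n}
```

For example, on a 100-item stream dominated by one value, querying with support s = 0.5 and epsilon = 0.1 reports the dominant item; any false positive is guaranteed to have true frequency at least (s - epsilon) * n.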
Similar Resources
Error-Adaptive and Time-Aware Maintenance of Frequency Counts over Data Streams
Maintaining frequency counts for items over a data stream has a wide range of applications, such as web advertisement fraud detection. The study of this problem has attracted great attention from both researchers and practitioners, and many algorithms have been proposed. In this paper, we propose a new method, an error-adaptive pruning method, to maintain frequency counts more accurately. We also propose a method c...
Mining Frequent Itemsets Over Arbitrary Time Intervals in Data Streams
Mining frequent itemsets over a stream of transactions presents difficult new challenges over traditional mining in static transaction databases. Stream transactions can only be looked at once, and streams have a much richer frequent itemset structure due to their inherent temporal nature. We examine a novel data structure, an FP-stream, for maintaining information about itemset frequency historie...
Streaming for large scale NLP: Language Modeling
In this paper, we explore a streaming algorithm paradigm to handle large amounts of data for NLP problems. We present an efficient low-memory method for constructing high-order approximate n-gram frequency counts. The method is based on a deterministic streaming algorithm which efficiently computes approximate frequency counts over a stream of data while employing a small memory footprint. We s...
Streaming Large Language Models for Statistical Machine Translation
This paper presents an efficient low-memory method for constructing high-order approximate n-gram frequency counts. The method is based on a deterministic streaming algorithm which efficiently computes approximate frequency counts over a stream of data while employing a small memory footprint. We show that this method easily scales to billion-word monolingual corpora using a conventional (4 GB ...
The Value of Multiple Read/Write Streams for Approximating Frequency Moments
We consider the read/write streams model, an extension of the standard data stream model in which an algorithm can create and manipulate multiple read/write streams in addition to its input data stream. As in the data stream model, the most important parameter for this model is the amount of internal memory used by such an algorithm. The other key parameters are the number of streams the algorit...
Synopsis Construction in Data Streams
Unlike traditional data sets, stream data flow in and out of a computer system continuously and with varying update rates. It may be impossible to store an entire data stream due to its tremendous volume. To discover knowledge or patterns from data streams, it is necessary to develop data stream summarization techniques. Much work has been done to summarize the contents of data streams in or...
Journal: PVLDB
Volume: 5, Issue: -
Pages: -
Publication date: 2002