Top-k Item Identification on Dynamic and Distributed Datasets
Abstract
The problem of identifying the most frequent items across multiple datasets has received considerable attention over the last few years. When storage is a scarce resource, the problem is already a challenge; its complexity is further exacerbated not only by the many independent data sources, but also by the dynamism of the data, i.e., the fact that new items may appear and old ones disappear at any time. In this work, we provide a novel approach to the problem: we use an existing gossip-based algorithm for identifying the k most frequent items over a distributed collection of datasets, in ways that cope with the dynamic nature of the data. The algorithm has been thoroughly analyzed through trace-based simulations and compared to state-of-the-art decentralized solutions, showing better precision at reduced communication overhead.
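The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea behind gossip-based top-k identification, not the authors' method. It assumes plain push-pull averaging: in each round every node picks a random peer, and both replace their per-item counts with the pairwise average. Averaging preserves the total count of every item, so each node's values converge to the global count divided by the number of nodes, and the k largest entries identify the globally most frequent items. The function name `gossip_top_k`, its parameters, and the toy data are hypothetical. The sketch exchanges full count tables and ignores item arrival and removal, so it does not address the storage and dynamism constraints the abstract mentions.

```python
import random
from collections import Counter


def gossip_top_k(local_counts, k, rounds=30, seed=0):
    """Return each node's estimate of the k globally most frequent items.

    local_counts: one mapping item -> count per node (at least two nodes).
    Illustrative push-pull averaging only; an assumption, not the paper's algorithm.
    """
    rng = random.Random(seed)
    n = len(local_counts)
    if n < 2:
        raise ValueError("gossip needs at least two nodes")

    # Every node starts from its own local counts.
    state = [{item: float(c) for item, c in counts.items()}
             for counts in local_counts]

    for _ in range(rounds):
        for i in range(n):
            # Pick a random peer j different from i.
            j = rng.randrange(n - 1)
            if j >= i:
                j += 1
            # Push-pull step: both peers adopt the per-item average.
            # Items unknown to one peer count as 0 there.
            merged = {
                item: (state[i].get(item, 0.0) + state[j].get(item, 0.0)) / 2
                for item in state[i].keys() | state[j].keys()
            }
            state[i] = merged
            state[j] = dict(merged)

    # Rank by estimated frequency, breaking ties by item name.
    return [
        [item for item, _ in
         sorted(s.items(), key=lambda kv: (-kv[1], kv[0]))[:k]]
        for s in state
    ]


if __name__ == "__main__":
    # Toy data: four nodes, global counts a=100, b=60, c=10, d=5.
    nodes = [
        Counter(a=40, b=5, c=10),
        Counter(a=10, b=30),
        Counter(a=30, b=5, d=5),
        Counter(a=20, b=20),
    ]
    for node_id, top in enumerate(gossip_top_k(nodes, k=2)):
        print(node_id, top)
```

Multiplying a node's converged value by the number of nodes gives its estimate of the global count; the ranking itself does not need that scaling.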
Similar resources
Ensemble-based Top-k Recommender System Considering Incomplete Data
Recommender systems have been widely used in e-commerce applications. They are a subclass of information filtering systems, used either to predict whether a user will prefer an item (prediction problem) or to identify a set of k items that will be of interest to the user (Top-k recommendation problem). Demanding sufficient ratings to make robust predictions and suggesting qualified recommendations are two si...
Extracting Support Based k most Strongly Correlated Item Pairs in Large Transaction Databases
The support-confidence framework is misleading when finding statistically meaningful relationships in market basket data. The alternative is to find strongly correlated item pairs from the basket data. However, strongly correlated pair queries suffer from the problem of setting a suitable threshold. To overcome that, the top-k pairs finding problem has been introduced. Most of the existing techniques are multi-p...
Retrieval of the most relevant facts from data streams joined with slowly evolving dataset published on the Web of Data
Finding the most relevant facts among dynamic and heterogeneous data published on the Web of Data has received growing attention in recent years. RDF Stream Processing (RSP) engines offer a baseline solution to integrate and process streaming data with data distributed on the Web. Unfortunately, the time to access and fetch the distributed data can be so high as to put the RSP engine at risk of lo...
A Boosting Algorithm for Item Recommendation with Implicit Feedback
Many recommendation tasks are formulated as top-N item recommendation problems based on users’ implicit feedback instead of explicit feedback. Here explicit feedback refers to users’ ratings to items while implicit feedback is derived from users’ interactions with items, e.g., number of times a user plays a song. In this paper, we propose a boosting algorithm named AdaBPR (Adaptive Boosting Per...
Entropy-based Scheduling Policy for Cross Aggregate Ranking Workloads
Many data exploration applications require the ability to identify the top-k results according to a scoring function. We study a class of top-k ranking problems where top-k candidates in a dataset are scored with the assistance of another set. We call this class of workloads cross aggregate ranking. Example computation problems include evaluating the Hausdorff distance between two datasets, fin...