Large-scale Submodular Greedy Exemplar Selection with Structured Similarity Matrices
Abstract
Exemplar clustering seeks a subset of data points that summarizes the entire data set by minimizing the sum of distances from each point to its closest exemplar. It has many important applications in machine learning, including document and video summarization, data compression, scaling kernel methods and Gaussian processes, active learning, and feature selection. A key challenge in adopting exemplar clustering for large-scale applications has been the availability of accurate and scalable algorithms. We propose an approach that combines structured similarity matrix representations with submodular greedy maximization; it can dramatically increase the scalability of exemplar clustering while still enjoying good approximation guarantees. Exploiting structured similarity matrices within submodular greedy algorithms is by no means trivial, as naive approaches still require computing all the entries of the matrix. We propose a randomized approach based on sampling sign-patterns of columns of the similarity matrix and establish accuracy guarantees. We demonstrate significant computational speed-ups while still achieving highly accurate solutions, and solve problems with up to millions of data points in around a minute or less on a single commodity computer.
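To make the objective concrete, here is a minimal sketch of the naive greedy baseline the abstract alludes to, the one that reads every entry of a dense similarity matrix. It is not the authors' randomized sign-pattern method. The function name greedy_exemplars, the facility-location form f(S) = sum_i max_{j in S} sim[i][j], and the toy similarity 1 / (1 + |x_i - x_j|) are assumptions chosen for illustration.

```python
def greedy_exemplars(sim, k):
    """Naive greedy maximization of f(S) = sum_i max_{j in S} sim[i][j].

    sim: n x n list of lists of non-negative similarities (an assumption
    of this sketch). Returns the chosen exemplar indices in pick order.
    """
    n = len(sim)
    best = [0.0] * n  # best[i]: similarity of point i to its closest exemplar so far
    chosen = []
    for _ in range(min(k, n)):
        best_gain, best_j = -1.0, None
        for j in range(n):
            if j in chosen:
                continue
            # Marginal gain of adding j: total improvement over current best.
            gain = sum(max(sim[i][j] - best[i], 0.0) for i in range(n))
            if gain > best_gain:
                best_gain, best_j = gain, j
        chosen.append(best_j)
        best = [max(best[i], sim[i][best_j]) for i in range(n)]
    return chosen


# Toy usage: two well-separated groups of points on a line (hypothetical data).
points = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
sim = [[1.0 / (1.0 + abs(a - b)) for b in points] for a in points]
exemplars = greedy_exemplars(sim, 2)
```

Each greedy step here scans all n columns and all n rows, so selecting k exemplars costs on the order of k * n^2 similarity look-ups. That cost is what makes the dense approach impractical at the scale of millions of points.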
Similar resources
Distributed Submodular Cover: Succinctly Summarizing Massive Data
How can one find a subset, ideally as small as possible, that well represents a massive dataset? I.e., its corresponding utility, measured according to a suitable utility function, should be comparable to that of the whole dataset. In this paper, we formalize this challenge as a submodular cover problem. Here, the utility is assumed to exhibit submodularity, a natural diminishing returns condit...
Fast Multi-stage Submodular Maximization
Motivated by extremely large-scale machine learning problems, we introduce a new multistage algorithmic framework for submodular maximization (called MultGreed), where at each stage we apply an approximate greedy procedure to maximize surrogate submodular functions. The surrogates serve as proxies for a target submodular function but require less memory and are easy to evaluate. We theoreticall...
Submodular meets Structured: Finding Diverse Subsets in Exponentially-Large Structured Item Sets
To cope with the high level of ambiguity faced in domains such as Computer Vision or Natural Language processing, robust prediction methods often search for a diverse set of high-quality candidate solutions or proposals. In structured prediction problems, this becomes a daunting task, as the solution space (image labelings, sentence parses, etc.) is exponentially large. We study greedy algorith...
Fast Multi-Stage Submodular Maximization: Extended version
Motivated by extremely large-scale machine learning problems, we introduce a new multistage algorithmic framework for submodular maximization (called MultGreed), where at each stage we apply an approximate greedy procedure to maximize surrogate submodular functions. The surrogates serve as proxies for a target submodular function but require less memory and are easy to evaluate. We theoreticall...
Submodular Maximization and Diversity in Structured Output Spaces
We study the greedy maximization of a submodular set function F : 2^V → R when each item in the ground set V is itself a combinatorial object, e.g. a configuration or labeling of a base set of variables z = {z1, ..., zm}. This problem arises naturally in a number of domains, such as Computer Vision or Natural Language Processing, where we want to search for a set of diverse high-quality solutions...