An improved approximation algorithm for the column subset selection problem

Authors

  • Christos Boutsidis
  • Michael W. Mahoney
  • Petros Drineas
Abstract

We consider the problem of selecting the “best” subset of exactly $k$ columns from an $m \times n$ matrix $A$. In particular, we present and analyze a novel two-stage algorithm that runs in $O(\min\{mn^2, m^2n\})$ time and returns as output an $m \times k$ matrix $C$ consisting of exactly $k$ columns of $A$. In the first stage (the randomized stage), the algorithm randomly selects $O(k \log k)$ columns according to a judiciously chosen probability distribution that depends on information in the top-$k$ right singular subspace of $A$. In the second stage (the deterministic stage), the algorithm applies a deterministic column-selection procedure to select and return exactly $k$ columns from the set of columns selected in the first stage. Let $C$ be the $m \times k$ matrix containing those $k$ columns, let $P_C$ denote the projection matrix onto the span of those columns, and let $A_k$ denote the “best” rank-$k$ approximation to the matrix $A$ as computed with the singular value decomposition. Then, we prove that $\|A - P_C A\|_2 \le O\left(k^{3/4} \log^{1/2}(k)\,(n-k)^{1/4}\right) \|A - A_k\|_2$, with probability at least 0.7. This spectral norm bound improves upon the best previously existing result (of Gu and Eisenstat [23]) for the spectral norm version of this Column Subset Selection Problem. We also prove that $\|A - P_C A\|_F \le O\left(k \sqrt{\log k}\right) \|A - A_k\|_F$, with the same probability. This Frobenius norm bound is only a factor of $\sqrt{k \log k}$ worse than the best previously existing existential result and is roughly $O(\sqrt{k!})$ better than the best previous algorithmic result (both of Deshpande et al. [12]) for the Frobenius norm version of this Column Subset Selection Problem.
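The two-stage structure described in the abstract can be illustrated with a minimal NumPy sketch. The abstract does not spell out the sampling distribution or the deterministic procedure, so the choices below are assumptions made only for illustration: stage one samples with normalized leverage scores of the top-k right singular vectors, stage two uses greedy norm-based pivoting, and the function names, the oversampling constant 4, and the top-up fallback are all hypothetical. The sketch makes no claim to reproduce the paper's algorithm or its error bounds.

```python
import numpy as np


def two_stage_column_select(A, k, c=None, seed=0):
    """Illustrative two-stage selection of exactly k column indices of A."""
    rng = np.random.default_rng(seed)
    n = A.shape[1]
    if c is None:
        # Oversample on the order of k log k; the constant 4 is an assumption.
        c = min(n, max(k, int(np.ceil(4 * k * np.log(k + 1)))))

    # Top-k right singular vectors of A, stored as the rows of Vt_k (k x n).
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    Vt_k = Vt[:k, :]

    # Stage 1 (randomized): sample candidate columns with probabilities taken
    # from normalized leverage scores (an assumed distribution).
    probs = np.sum(Vt_k ** 2, axis=0)
    probs = probs / probs.sum()
    candidates = np.unique(rng.choice(n, size=c, replace=True, p=probs))
    if candidates.size < k:
        # Fallback for this sketch only: top up with high-leverage columns.
        seen = set(candidates.tolist())
        extra = [int(j) for j in np.argsort(-probs) if int(j) not in seen]
        candidates = np.array(sorted(seen | set(extra[: k - len(seen)])))

    # Stage 2 (deterministic): greedy pivoting on the candidate columns of
    # Vt_k, a simple stand-in for a rank-revealing QR procedure.
    R = Vt_k[:, candidates].copy()
    picked = np.zeros(R.shape[1], dtype=bool)
    for _ in range(k):
        norms = np.linalg.norm(R, axis=0)
        norms[picked] = -1.0              # never pick the same column twice
        j = int(np.argmax(norms))
        picked[j] = True
        q = R[:, j] / max(np.linalg.norm(R[:, j]), 1e-12)
        R = R - np.outer(q, q @ R)        # remove the chosen direction
    return np.sort(candidates[picked])


def spectral_residual(A, cols):
    """Spectral norm of A - P_C A, where C = A[:, cols]."""
    C = A[:, cols]
    return np.linalg.norm(A - C @ np.linalg.pinv(C) @ A, 2)
```

Calling `two_stage_column_select(A, k)` on a matrix with at least k columns returns k distinct indices, and `spectral_residual(A, cols)` can then be compared against the (k+1)-th singular value of A, which equals $\|A - A_k\|_2$ and is a lower bound on the residual of any k-column choice.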

Similar resources

A Parallel Genetic Algorithm Based Method for Feature Subset Selection in Intrusion Detection Systems

Intrusion detection systems are designed to provide security in computer networks, so that if the attacker crosses other security devices, they can detect and prevent the attack process. One of the most essential challenges in designing these systems is the so called curse of dimensionality. Therefore, in order to obtain satisfactory performance in these systems we have to take advantage of app...

Greedy Column Subset Selection: New Bounds and Distributed Algorithms

The problem of column subset selection has recently attracted a large body of research, with feature selection serving as one obvious and important application. Among the techniques that have been applied to solve this problem, the greedy algorithm has been shown to be quite effective in practice. However, theoretical guarantees on its performance have not been explored thoroughly, especially i...

Online Streaming Feature Selection Using Geometric Series of the Adjacency Matrix of Features

Feature Selection (FS) is an important pre-processing step in machine learning and data mining. All the traditional feature selection methods assume that the entire feature space is available from the beginning. However, online streaming features (OSF) are an integral part of many real-world applications. In OSF, the number of training examples is fixed while the number of features grows with t...

Publication date: 2009