A fast algorithm for two-dimensional Kolmogorov-Smirnov two sample tests
Author: Y. Xiao
Abstract
With the brute force algorithm, applying the two-dimensional two-sample Kolmogorov–Smirnov test can be prohibitively expensive computationally. A fast algorithm for computing the two-sample Kolmogorov–Smirnov test statistic is therefore proposed to alleviate this problem. The proposed algorithm is O(n) times more efficient than the brute force algorithm, where n is the sum of the two sample sizes. The proposed algorithm is parallel and can be generalized to higher dimensional spaces. © 2016 Elsevier B.V. All rights reserved.

Y. Xiao / Computational Statistics and Data Analysis 105 (2017) 53–58
http://dx.doi.org/10.1016/j.csda.2016.07.014
ISSN 0167-9473

1. A fast algorithm for one-dimensional Kolmogorov–Smirnov test

Given two continuous probability distribution functions $F^1$ and $F^2$ in one-dimensional space, consider the hypothesis test problem

$$H_0 : F^1 = F^2 \quad \text{vs.} \quad H_a : F^1 \neq F^2 \tag{1}$$

based on the samples $\{X^1_i\}_{i=1}^{n_1}$ and $\{X^2_j\}_{j=1}^{n_2}$ from the respective distributions. The classical Kolmogorov–Smirnov test uses the maximum difference of the empirical distribution functions (or cumulative frequency functions) at the observed values. Specifically, let $F^k_{n_k}$ ($k = 1, 2$) be the empirical distribution function based on the sample $\{X^k_t\}_{t=1}^{n_k}$, that is,

$$F^k_{n_k}(x) = \frac{\#\{t : X^k_t \le x,\ 1 \le t \le n_k\}}{n_k}, \quad -\infty < x < \infty, \tag{2}$$

where # means "the number of". The Kolmogorov–Smirnov test statistic $D_{KS}$ is then computed as (up to a multiple)

$$D_{KS} = \max\Bigl\{ \max_{1 \le i \le n_1} \bigl|F^1_{n_1}(X^1_i) - F^2_{n_2}(X^1_i)\bigr|,\ \max_{1 \le j \le n_2} \bigl|F^1_{n_1}(X^2_j) - F^2_{n_2}(X^2_j)\bigr| \Bigr\}. \tag{3}$$

The value of $D_{KS}$ is often computed by a brute force algorithm, which simply counts the number of sample values that do not exceed $X^1_i$ or $X^2_j$ for each $i = 1, 2, \ldots, n_1$ and $j = 1, 2, \ldots, n_2$. The number of comparisons needed by the brute force algorithm is $O(n^2)$, where $n = n_1 + n_2$.

However, there exists a faster algorithm. Let $L$ be the least common multiple of $n_1$ and $n_2$, $d_1 = L/n_1$, $d_2 = L/n_2$, and let

$$\{X^0_{(t)} : 1 \le t \le n\} = \{X^0_{(1)} \le X^0_{(2)} \le \cdots \le X^0_{(n)}\} \tag{4}$$

be the pooled sample arranged in ascending order. (Throughout this paper we assume all the observed values have no ties when necessary.) Define

$$h_t = L \times \bigl[F^1_{n_1}(X^0_{(t)}) - F^2_{n_2}(X^0_{(t)})\bigr], \quad 0 \le t \le n. \tag{5}$$

The value of $h_0$ is set to be 0. The reader can easily verify the following recurrence:

$$h_t = \begin{cases} h_{t-1} + d_1 & \text{if } X^0_{(t)} = X^1_i \text{ for some } i, \\ h_{t-1} - d_2 & \text{if } X^0_{(t)} = X^2_j \text{ for some } j. \end{cases} \tag{6}$$

See Burr (1963), Hájek and Šidák (1967) and Xiao et al. (2007). The value of the Kolmogorov–Smirnov test statistic is the maximum value of $|h_t|/L$:

$$D_{KS} = \max_{0 \le t \le n} |h_t|/L. \tag{7}$$

If the quick sort method is used, this algorithm needs only $O(n \log_2 n)$ comparisons on average (Hoare, 1961), which is $O(n / \log_2 n)$ times fewer than the brute force algorithm, that is, O(n) times more efficient up to a logarithmic factor. In addition, the use of $L$ speeds up the algorithm further, since all the intermediate results are integers.

2. Generalization to two-dimensional spaces

The generalization of the Kolmogorov–Smirnov test to high dimensional probability distributions is a challenge. To generalize the Kolmogorov–Smirnov test to two-dimensional space, Peacock (1983) proposed a procedure which makes use of four (rather than just one) pairs of cumulative frequency functions. Denote the two given samples in a plane by $\{(X^k_i, Y^k_i)\}_{i=1}^{n_k}$, $k = 1, 2$, respectively. The four pairs of cumulative frequency functions used by Peacock's test are given by

$$F^k_{++}(x, y) = \#\{i : X^k_i > x,\ Y^k_i > y,\ 1 \le i \le n_k\}/n_k, \tag{8}$$

$$F^k_{+-}(x, y) = \#\{i : X^k_i > x,\ Y^k_i \le y,\ 1 \le i \le n_k\}/n_k, \tag{9}$$

$$F^k_{-+}(x, y) = \#\{i : X^k_i \le x,\ Y^k_i > y,\ 1 \le i \le n_k\}/n_k, \tag{10}$$

and

$$F^k_{--}(x, y) = \#\{i : X^k_i \le x,\ Y^k_i \le y,\ 1 \le i \le n_k\}/n_k, \tag{11}$$

where $-\infty < x, y < \infty$ and $k = 1, 2$. Let $\{X^0_t : t = 1, 2, \ldots, n\}$ be the pooled data set consisting of the values of the X-components of the given samples and $\{Y^0_t : t = 1, 2, \ldots, n\}$ the pooled data set consisting of the values of the Y-components of the given samples. Define

$$D_{++} \overset{\text{def}}{=} \max_{1 \le s \le n,\ 1 \le t \le n} \bigl|F^1_{++}(X^0_s, Y^0_t) - F^2_{++}(X^0_s, Y^0_t)\bigr|, \tag{12}$$

$$D_{+-} \overset{\text{def}}{=} \max_{1 \le s \le n,\ 1 \le t \le n} \bigl|F^1_{+-}(X^0_s, Y^0_t) - F^2_{+-}(X^0_s, Y^0_t)\bigr|, \tag{13}$$

$$D_{-+} \overset{\text{def}}{=} \max_{1 \le s \le n,\ 1 \le t \le n} \bigl|F^1_{-+}(X^0_s, Y^0_t) - F^2_{-+}(X^0_s, Y^0_t)\bigr|, \tag{14}$$

and

$$D_{--} \overset{\text{def}}{=} \max_{1 \le s \le n,\ 1 \le t \le n} \bigl|F^1_{--}(X^0_s, Y^0_t) - F^2_{--}(X^0_s, Y^0_t)\bigr|. \tag{15}$$

Peacock's test statistic is then defined as

$$D_{2DKS} = \max\{D_{++}, D_{+-}, D_{-+}, D_{--}\}. \tag{16}$$

The test is often performed by a brute force algorithm, and its application is very expensive in terms of computing time unless the sample sizes $n_1$ and $n_2$ are very small. Indeed, to compute the value of $D_{--}$, we need to compute the difference of the cumulative frequency functions $F^1_{--}$ and $F^2_{--}$ at all the $n^2$ pairs $(X^0_s, Y^0_t)$, where $X^0_s$ and $Y^0_t$ are coordinates of any points in the given samples. It takes $O(n)$ comparisons to compute the difference of $F^1_{--}$ and $F^2_{--}$ at a single point. Thus, it takes $O(n^3)$ comparisons to compute the value of $D_{--}$. Similar conclusions hold for $D_{++}$, $D_{+-}$ and $D_{-+}$.

To alleviate the problem, Fasano and Franceschini (1987, F&F for short) revised Peacock's test by comparing the cumulative frequency functions at the observed sample points only, so the number of comparisons needed is only $O(n^2)$. The F&F test is widely used in practice. However, it is a variant of Peacock's test and, in essence, a different approach.

In fact, there exists a fast algorithm for evaluating the value of Peacock's test statistic. Denote by $\{(X'_{(t)}, Y'_t) : 1 \le t \le n\}$ the pooled sample sorted in ascending order by the values of the X-components of the data points, and by $\{(X'_t, Y'_{(t)}) : 1 \le t \le n\}$ ...
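The definitions given so far can be illustrated with a minimal Python sketch. It is not the author's implementation, and all function names are hypothetical. It shows the one-dimensional integer recurrence (5)–(7), a direct evaluation of (3) for comparison, and the brute force evaluation of Peacock's statistic (12)–(16). It assumes no ties in the pooled samples, as the text does, and it does not attempt the fast two-dimensional algorithm, which is not described in this excerpt.

```python
from math import gcd


def ks_statistic_fast(sample1, sample2):
    """One-dimensional D_KS via the integer recurrence (5)-(7)."""
    n1, n2 = len(sample1), len(sample2)
    L = n1 * n2 // gcd(n1, n2)          # least common multiple of n1 and n2
    d1, d2 = L // n1, L // n2
    # Pool the samples with a label and sort by value (assumes no ties).
    pooled = sorted([(x, 1) for x in sample1] + [(x, 2) for x in sample2])
    h = 0                                # h_0 = 0
    max_abs_h = 0
    for _, label in pooled:
        h += d1 if label == 1 else -d2   # recurrence (6), integers only
        max_abs_h = max(max_abs_h, abs(h))
    return max_abs_h / L                 # equation (7)


def ks_statistic_brute(sample1, sample2):
    """One-dimensional D_KS by direct counting, as in equation (3)."""
    n1, n2 = len(sample1), len(sample2)

    def ecdf_diff(x):
        f1 = sum(v <= x for v in sample1) / n1
        f2 = sum(v <= x for v in sample2) / n2
        return abs(f1 - f2)

    return max(ecdf_diff(x) for x in list(sample1) + list(sample2))


def peacock_statistic_brute(points1, points2):
    """Peacock's D_2DKS by direct counting over all n^2 pairs (X_s, Y_t)."""
    pooled = list(points1) + list(points2)
    xs = [p[0] for p in pooled]
    ys = [p[1] for p in pooled]

    def quadrant_fractions(points, x, y):
        # Fractions of points in the four quadrants, equations (8)-(11).
        n = len(points)
        pp = sum(px > x and py > y for px, py in points)
        pm = sum(px > x and py <= y for px, py in points)
        mp = sum(px <= x and py > y for px, py in points)
        mm = n - pp - pm - mp
        return (pp / n, pm / n, mp / n, mm / n)

    best = 0.0
    for x in xs:
        for y in ys:
            q1 = quadrant_fractions(points1, x, y)
            q2 = quadrant_fractions(points2, x, y)
            best = max(best, max(abs(a - b) for a, b in zip(q1, q2)))
    return best
```

For example, `ks_statistic_fast([0.1, 0.4, 0.7], [0.2, 0.3, 0.5, 0.9])` and `ks_statistic_brute` on the same inputs should agree. The brute force two-dimensional routine is only practical for small samples, which is the cost the text sets out to reduce.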
Related resources
Non-Parametric Testing of Distributions – the Epps-Singleton two-sample test using the Empirical Characteristic Function
In statistics, two-sample tests are used to determine whether two samples have been drawn from the same population. One widely used such test is the Kolmogorov-Smirnov two-sample test. There are other distribution-free tests which might be applied on similar occasions. In this article, we describe a two-sample omnibus test introduced by Epps and Singleton, which has – albeit being distribution...
Cooperative Spectrum Sensing: Two-sample Kolmogorov-Smirnov Test under Rician Fading Channel
Signal detection performance in a cognitive radio architecture is enhanced by cooperation among sensing detectors when fading and shadowing effects are present on the channel. This paper proposes a cooperative spectrum sensing technique for cognitive radio networks based on the two-sample Kolmogorov-Smirnov test, and the proposed sensing scheme is examined under a Rician fading channel. The performance of...
A Kolmogorov-Smirnov test for the molecular clock based on Bayesian ensembles of phylogenies
Divergence date estimates are central to understanding evolutionary processes and depend, in the case of molecular phylogenies, on tests of molecular clocks. Here we propose two non-parametric tests of strict and relaxed molecular clocks built upon a framework that uses the empirical cumulative distribution (ECD) of branch lengths obtained from an ensemble of Bayesian trees and well-known non-para...
The signed Kolmogorov-Smirnov test: why it should not be used
The two-sample Kolmogorov-Smirnov (KS) test is often used to decide whether two random samples have the same statistical distribution. A popular modification of the KS test is to use a signed version of the KS statistic to infer whether the values of one sample are statistically larger than the values of the other. The underlying hypotheses of the KS test are intrinsically incompatible with thi...
Feature Selection for High-Dimensional Data: A Kolmogorov-Smirnov Correlation-Based Filter
An algorithm for filtering information based on the Kolmogorov-Smirnov correlation-based approach has been implemented and tested on feature selection. The only parameter of this algorithm is the statistical confidence level that two distributions are identical. Empirical comparisons with four other state-of-the-art feature selection algorithms (FCBF, CorrSF, ReliefF and ConnSF) are very encouraging.
Journal: Computational Statistics & Data Analysis
Volume: 105
Pages: 53–58
Publication date: 2017