Random Projection and Geometrization of String Distance Metrics
Abstract
Edit distance is not the only way to calculate the distance between two character sequences. Strings can also be compared in somewhat subtler geometric ways. A procedure inspired by Random Indexing can assign a D-dimensional geometric coordinate to every character N-gram present in the corpus and can then represent a word as the sum of the N-gram fragments that the string contains. Thus, any word can be described as a point in a dense D-dimensional space, and the distance between two words can be calculated with traditional Euclidean measures. Within the Keats Hyperion corpus, a strong correlation exists between the cosine measure computed in this space and Levenshtein distance. Overlaps between the centroid of the Levenshtein distance matrix space and the centroids of vector spaces generated by Random Projection were also observed. In contrast to the standard non-random “sparse” method of measuring cosine distances between two strings, the method based on Random Projection naturally tends to promote longer strings rather than the shortest ones. The geometric approach yields a finer output range than Levenshtein distance, and, owing to the limited dimensionality of the Randomly Projected space, retrieving the nearest neighbor of a text’s centroid could have lower complexity than with other vector methods.
Μηδεὶς ἀγεωμέτρητος εἰσίτω μου τὴν στέγην (“Let no one ignorant of geometry enter under my roof”)
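The procedure outlined in the abstract can be illustrated with a minimal Python sketch. It is not the author's implementation: the function names, the choice of bigrams, the dimensionality of 100, the ten non-zero ternary entries per N-gram vector, and the hash-based seeding are all assumptions made for illustration.

```python
import hashlib
import math
import random


def ngrams(word, n=2):
    """Return the character n-grams of a word (the word itself if it is too short)."""
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def ngram_vector(gram, dim=100, nonzero=10):
    """Assumed scheme: a sparse random vector of +1/-1 entries, seeded by the n-gram."""
    # hashlib gives a seed that is stable across runs (built-in hash() is salted).
    digest = hashlib.sha256(gram.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    vec = [0.0] * dim
    for pos in rng.sample(range(dim), nonzero):
        vec[pos] = rng.choice((-1.0, 1.0))
    return vec


def word_vector(word, n=2, dim=100, nonzero=10):
    """Represent a word as the sum of the random vectors of its n-grams."""
    total = [0.0] * dim
    for gram in ngrams(word, n):
        for i, value in enumerate(ngram_vector(gram, dim, nonzero)):
            total[i] += value
    return total


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    base = word_vector("hyperion")
    for other in ("hyperions", "xqzv"):
        print(other, round(cosine(base, word_vector(other)), 3))
```

Under these assumptions, two words that share most of their bigrams end up with nearly parallel vectors, while words with no bigrams in common are close to orthogonal. Because the similarity is a real number, it can distinguish cases that an integer edit distance would treat as ties.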
Similar resources
Detecting Transliterated Orthographic Variants via Two Similarity Metrics
We propose a method for detecting orthographic variants caused by transliteration in a large corpus. The method employs two similarities. One is string similarity based on edit distance. The other is contextual similarity computed with a vector space model. Experimental results show that the method achieved an F-measure of 0.889 in an open test.
A Comparison of String Distance Metrics for Name-Matching Tasks
Using an open-source Java toolkit of name-matching methods, we experimentally compare string distance metrics on the task of matching entity names. We investigate a number of different metrics proposed by different communities, including edit-distance metrics, fast heuristic string comparators, token-based distance metrics, and hybrid methods. Overall, the best-performing method is a hybrid s...
A Comparison of String Metrics for Matching Names and Records
We describe an open-source Java toolkit of methods for matching names and records. We summarize results obtained from using various string distance metrics on the task of matching entity names. These metrics include distance functions proposed by several different communities, such as edit-distance metrics, fast heuristic string comparators, token-based distance metrics, and hybrid methods. We ...
Scalar Curvature and Geometrization Conjectures for 3-Manifolds
We first summarize very briefly the topology of 3-manifolds and the approach of Thurston towards their geometrization. After discussing some general properties of curvature functionals on the space of metrics, we formulate and discuss three conjectures that imply Thurston’s Geometrization Conjecture for closed oriented 3-manifolds. The final two sections present evidence for the validity of the...
Random Projections with Bayesian Priors
The technique of random projection is one of dimension reduction, where high dimensional vectors in R^D are projected down to a smaller subspace in R^k. Certain forms of distances or distance kernels such as Euclidean distances, inner products [10], and l_p distances [12] between high dimensional vectors are approximately preserved in this smaller dimensional subspace. Word vectors which are repre...