Random Projection and Geometrization of String Distance Metrics

Author

  • Daniel Devatman Hromada
Abstract

Edit distance is not the only way to calculate the distance between two character sequences. Strings can also be compared in somewhat subtler geometric ways. A procedure inspired by Random Indexing can attribute a D-dimensional geometric coordinate to any character N-gram present in the corpus and can subsequently represent a word as the sum of the N-gram fragments that the string contains. Thus, any word can be described as a point in a dense D-dimensional space, and the distance between words can be calculated by applying traditional Euclidean measures. Within the Keats Hyperion corpus, a strong correlation exists between the cosine measure computed in this space and Levenshtein distance. Overlaps between the centroid of the Levenshtein distance matrix space and the centroids of the vector spaces generated by Random Projection were also observed. Contrary to the standard non-random "sparse" method of measuring cosine distances between two strings, the method based on Random Projection tends to promote longer strings rather than the shortest ones. The geometric approach yields a finer output range than Levenshtein distance, and the retrieval of the nearest neighbor of a text's centroid could have, due to the limited dimensionality of the Randomly Projected space, smaller complexity than other vector methods.

Μηδεὶς ἀγεωμέτρητος εἰσίτω μου τὴν στέγην
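The procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the author's implementation: the abstract does not state the dimensionality, the N-gram length, or how the random coordinates are drawn, so `dim`, `n`, `nonzero`, the hash-based seeding, and all function names are assumptions made here for the sake of a runnable example. Each character N-gram receives a sparse random vector of +1/-1 entries, a word is the sum of its N-gram vectors, and words are compared by cosine similarity, with a plain Levenshtein distance included for comparison.

```python
import hashlib
import math
import random


def ngrams(word, n=2):
    """Return the list of character N-grams contained in the word."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def ngram_vector(gram, dim=200, nonzero=8):
    """Assign a sparse random +1/-1 vector to an N-gram.

    Assumption: the generator is seeded from a hash of the N-gram, so the
    same N-gram always receives the same coordinate.
    """
    digest = hashlib.sha256(gram.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    vec = [0.0] * dim
    for pos in rng.sample(range(dim), nonzero):
        vec[pos] = rng.choice((-1.0, 1.0))
    return vec


def word_vector(word, n=2, dim=200, nonzero=8):
    """Represent a word as the sum of the vectors of its N-grams."""
    vec = [0.0] * dim
    for gram in ngrams(word, n):
        gram_vec = ngram_vector(gram, dim, nonzero)
        vec = [a + b for a, b in zip(vec, gram_vec)]
    return vec


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def levenshtein(a, b):
    """Classic dynamic-programming edit distance, kept for comparison."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


if __name__ == "__main__":
    for other in ("hyperions", "titan"):
        sim = cosine_similarity(word_vector("hyperion"), word_vector(other))
        print(other, round(sim, 3), levenshtein("hyperion", other))
```

Because the cosine is a real number rather than an integer count of edits, a sketch like this also illustrates why the geometric approach can produce a finer output range than Levenshtein distance.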

Similar resources

Detecting Transliterated Orthographic Variants via Two Similarity Metrics

We propose a method for detecting orthographic variants caused by transliteration in a large corpus. The method employs two similarities. One is string similarity based on edit distance. The other is contextual similarity computed with a vector space model. Experimental results show that the method achieved an F-measure of 0.889 in an open test.

A Comparison of String Distance Metrics for Name-Matching Tasks

Using an open-source Java toolkit of name-matching methods, we experimentally compare string distance metrics on the task of matching entity names. We investigate a number of different metrics proposed by different communities, including edit-distance metrics, fast heuristic string comparators, token-based distance metrics, and hybrid methods. Overall, the best-performing method is a hybrid s...

A Comparison of String Metrics for Matching Names and Records

We describe an open-source Java toolkit of methods for matching names and records. We summarize results obtained from using various string distance metrics on the task of matching entity names. These metrics include distance functions proposed by several different communities, such as edit-distance metrics, fast heuristic string comparators, token-based distance metrics, and hybrid methods. We ...

Scalar Curvature and Geometrization Conjectures for 3-Manifolds

We first summarize very briefly the topology of 3-manifolds and the approach of Thurston towards their geometrization. After discussing some general properties of curvature functionals on the space of metrics, we formulate and discuss three conjectures that imply Thurston’s Geometrization Conjecture for closed oriented 3-manifolds. The final two sections present evidence for the validity of the...

Random Projections with Bayesian Priors

The technique of random projection is one of dimension reduction, where high-dimensional vectors in R^D are projected down to a smaller subspace in R^k. Certain forms of distances or distance kernels, such as Euclidean distances, inner products [10], and l_p distances [12] between high-dimensional vectors, are approximately preserved in this lower-dimensional subspace. Word vectors which are repre...

Publication date: 2013