Word2Vec is only a special case of Kernel Correspondence Analysis and Kernels for Natural Language Processing

Author

  • Hirotaka Niitsuma
Abstract

We show that Correspondence Analysis (CA) is equivalent to defining the Gini index with appropriately scaled one-hot encoding. Using this relation, we introduce a nonlinear kernel extension of CA. By specializing the kernel, the extended CA reproduces well-known analyses for categorical data (CD) and natural language processing: for example, our formulation yields the G-test, skip-gram with negative sampling (SGNS), and GloVe as special cases. We introduce two kernels for natural language processing based on our formulation: a stop word (SW) kernel and a word similarity (WS) kernel. The SW kernel introduces appropriate weights for stop words. The WS kernel enables the use of WS test data as training data for vector space representations of words. We show that these kernels enhance accuracy when training data are scarce.
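As background for the abstract, the following is a minimal sketch of classical (linear) CA on a small word-by-context count table, computed through the SVD of standardized residuals. It is not the kernel extension proposed in the paper; the function name `correspondence_analysis` and the toy counts are assumptions made up for illustration.

```python
import numpy as np

def correspondence_analysis(counts, n_components=2):
    """Classical CA of a contingency table via SVD of standardized residuals."""
    N = np.asarray(counts, dtype=float)
    P = N / N.sum()               # correspondence matrix (relative frequencies)
    r = P.sum(axis=1)             # row masses
    c = P.sum(axis=0)             # column masses
    E = np.outer(r, c)            # expected frequencies under independence
    S = (P - E) / np.sqrt(E)      # standardized residuals
    U, sigma, Vt = np.linalg.svd(S, full_matrices=False)
    k = n_components
    # Principal coordinates: singular vectors scaled by singular values and masses.
    rows = (U[:, :k] * sigma[:k]) / np.sqrt(r)[:, None]
    cols = (Vt.T[:, :k] * sigma[:k]) / np.sqrt(c)[:, None]
    return rows, cols, sigma

# Hypothetical co-occurrence counts (rows: words, columns: contexts).
counts = np.array([[10, 2, 1],
                   [3, 8, 2],
                   [1, 2, 9]])
rows, cols, sigma = correspondence_analysis(counts)

# Total inertia (sum of squared singular values) equals Pearson's chi-square / n.
n = counts.sum()
expected = np.outer(counts.sum(axis=1), counts.sum(axis=0)) / n
chi2 = ((counts - expected) ** 2 / expected).sum()
print(np.isclose((sigma ** 2).sum(), chi2 / n))
```

The row coordinates in `rows` can be read as low-dimensional word vectors for this toy table. The final check ties the decomposition to the chi-square statistic, which is the standard link between CA and tests of independence for categorical data.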

Similar resources

Neural Network-Based Learning Kernel for Automatic Segmentation of Multiple Sclerosis Lesions on Magnetic Resonance Images

Background: Multiple Sclerosis (MS) is a degenerative disease of central nervous system. MS patients have some dead tissues in their brains called MS lesions. MRI is an imaging technique sensitive to soft tissues such as brain that shows MS lesions as hyper-intense or hypo-intense signals. Since manual segmentation of these lesions is a laborious and time consuming task, automatic segmentation ...

Word Agent Based Natural Language Processing

Natural language processing (NLP) is often based on declaratively represented grammars with emphasis on the competence of an ideal speaker/hearer. In contrast to these broadly used methods, we present a procedural and performance-oriented approach to the analysis of natural language expressions. The method is based on the idea that each word-class can be connected with a functional construct, t...

Composite Kernel Optimization in Semi-Supervised Metric

Machine-learning solutions to classification, clustering and matching problems critically depend on the adopted metric, which in the past was selected heuristically. In the last decade, it has been demonstrated that an appropriate metric can be learnt from data, resulting in superior performance as compared with traditional metrics. This has recently stimulated a considerable interest in the to...

Can string kernels pass the test of time in Native Language Identification?

We describe a machine learning approach for the 2017 shared task on Native Language Identification (NLI). The proposed approach combines several kernels using multiple kernel learning. While most of our kernels are based on character p-grams (also known as n-grams) extracted from essays or speech transcripts, we also use a kernel based on i-vectors, a low-dimensional representation of audio rec...

Learning Sequence Kernels

Kernel methods are used to tackle a variety of learning tasks including classification, regression, ranking, clustering, and dimensionality reduction. The appropriate choice of a kernel is often left to the user. But, poor selections may lead to a sub-optimal performance. Instead, sample points can be used to learn a kernel function appropriate for the task by selecting one out of a family of k...

Journal:
  • CoRR

Volume: abs/1605.05087

Publication date: 2016