Word Context and Token Representations from Paradigmatic Relations and Their Application to Part-of-Speech Induction

Author

  • Enis Rıfat Sert
Abstract

Representing words as dense real-valued vectors in Euclidean space provides an intuitive definition of relatedness in terms of the distance or the angle between them. The regions occupied by these word representations reveal syntactic and semantic traits of the words. In addition, word representations can be incorporated into other natural language processing algorithms as features. In this thesis, we generate word representations in an unsupervised manner by utilizing paradigmatic relations, which concern the substitutability of words. We employ a Euclidean embedding algorithm (SCODE) to generate word context and word token representations from the substitute word distributions, in addition to word type representations. Word context and word token representations can handle the syntactic category ambiguities of word types because they are not restricted to a single representation per word type. We apply the word type, word context and word token representations to the part-of-speech induction problem by clustering the representations with the k-means algorithm, and obtain type-based and token-based part-of-speech induction for the Wall Street Journal section of the Penn Treebank with 45 gold-standard tags. To the best of our knowledge, these results are the state of the art for both type-based and token-based part-of-speech induction, with Many-To-One mapping accuracies of 0.8025 and 0.8039, respectively. We also introduce a measure of ambiguity, Gold-standard-tag Perplexity, which we use to show that our token-based part-of-speech induction is indeed successful at inducing the part-of-speech categories of ambiguous word types.
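The two evaluation measures named in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation. `many_to_one_accuracy` follows the usual reading of Many-To-One mapping: each induced cluster is mapped to the gold tag it co-occurs with most often, and accuracy is the fraction of tokens whose mapped tag matches the gold tag. The abstract does not define Gold-standard-tag Perplexity, so `tag_perplexity` is an assumption: it computes 2 raised to the entropy of the empirical gold-tag distribution over the tokens of one word type. Both function names and the toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def many_to_one_accuracy(induced, gold):
    """Many-To-One accuracy: map each induced cluster to its most
    frequent gold tag, then score the fraction of matching tokens."""
    if len(induced) != len(gold) or not gold:
        raise ValueError("need two non-empty sequences of equal length")
    # Count gold tags inside each induced cluster.
    per_cluster = defaultdict(Counter)
    for cluster, tag in zip(induced, gold):
        per_cluster[cluster][tag] += 1
    # Tokens carrying the majority gold tag of their cluster are correct.
    correct = sum(max(counts.values()) for counts in per_cluster.values())
    return correct / len(gold)


def tag_perplexity(tags):
    """ASSUMED definition (not given in the abstract): 2 ** entropy of
    the empirical gold-tag distribution for one word type's tokens.
    A value of 1.0 means the word type is unambiguous."""
    if not tags:
        raise ValueError("need at least one tag")
    n = len(tags)
    entropy = -sum((k / n) * math.log2(k / n) for k in Counter(tags).values())
    return 2 ** entropy


# Toy example: six tokens, three induced clusters.
induced = [0, 0, 0, 1, 1, 2]
gold = ["NN", "NN", "VB", "VB", "VB", "DT"]
accuracy = many_to_one_accuracy(induced, gold)  # 5 of 6 tokens match
```

Under this reading, a word type whose tokens are split evenly between two gold tags has a perplexity of 2, while a word type that always carries the same tag has a perplexity of 1.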

Similar resources

Descriptive Semantics of the Nominal Hapax Legomenon of the Word Menhaj and the Pathology of its Three Translations (Meybodi, Makarem Shirazi and Ansarian)

Understanding the Quran depends upon appreciating the meanings of single words and concepts that are interconnected and interrelated like a chain. A nominal hapax legomenon in the Quran is a word that occurs only once in the Holy Quran. Hence, such words need semantic scrutiny since they are difficult to understand. Accordingly, understanding hapax legomena calls for examining and identifying t...

Studying impressive parameters on the performance of Persian probabilistic context free grammar parser

In linguistics, a tree bank is a parsed text corpus that annotates syntactic or semantic sentence structure. The exploitation of tree bank data has been important ever since the first large-scale tree bank, the Penn Treebank, was published. However, although tree banks originated in computational linguistics, their value is becoming more widely appreciated in linguistics research as a whole. F...

Correspondence of Syntagmatic and Paradigmatic Axes Relations, and Their Transformation in Relation to the Communicative Role of Shahnameh Illustration in Shiraz School of Miniature

When treated like texts with their own visual language, illustrations from the Shiraz School of miniature are a mixture of the syntagmatic and paradigmatic relations of signs. Syntagmatic relations reveal the different ways the elements of a text are connected, while paradigmatic relations identify the sets of signifiers that signify the content of the text, dealing with intratextual and intert...

Using functional magnetic resonance imaging (fMRI) to explore brain function: cortical representations of language critical areas

Pre-operative determination of the dominant hemisphere for speech and of speech-associated sensory and motor regions has been of great interest to neurological surgeons. This determination is of utmost importance but difficult to achieve, requiring either invasive (Wada test) or non-invasive (brain mapping) methods. In the present study we have employed functional Magnetic Resonance Imaging...

Capturing Paradigmatic and Syntagmatic Lexical Relations: Towards Accurate Chinese Part-of-Speech Tagging

From the perspective of structural linguistics, we explore paradigmatic and syntagmatic lexical relations for Chinese POS tagging, an important and challenging task for Chinese language processing. Paradigmatic lexical relations are explicitly captured by word clustering on large-scale unlabeled data and are used to design new features to enhance a discriminative tagger. Syntagmatic lexical rel...

Publication date: 2013