Choosing a Distance Metric for Automatic Word Categorization

Authors

  • Emin Erkan Korkmaz
  • Göktürk Üçoluk
Abstract

Affiliation: Department of Computer Engineering, Middle East Technical University, Ankara, Turkey.

This paper analyzes the functionality of different distance metrics that can be used in a bottom-up unsupervised algorithm for automatic word categorization. The proposed method uses a modified greedy-type algorithm. The formulations of fuzzy theory are also used to calculate the degree of membership for the elements in the linguistic clusters formed. The unigram and the bigram statistics of a corpus of about two million words are used. Empirical comparisons are made in order to support the discussions proposed for the type of distance metric that would be most suitable for measuring the similarity between linguistic elements.
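As a rough illustration of what comparing distance metrics over bigram statistics can look like, the sketch below builds left/right neighbour distributions for each word in a toy corpus and measures word-to-word distance with two common metrics. The abstract does not name the metrics the paper evaluates, so the Manhattan and Euclidean distances here, the toy sentence, and every function name are assumptions chosen for illustration only, not the authors' method.

```python
from collections import Counter
import math


def context_vectors(tokens):
    """Map each word to counts of its left ("L") and right ("R") bigram neighbours."""
    vectors = {}
    for left, right in zip(tokens, tokens[1:]):
        vectors.setdefault(left, Counter())[("R", right)] += 1
        vectors.setdefault(right, Counter())[("L", left)] += 1
    return vectors


def normalize(vec):
    """Turn raw neighbour counts into a probability distribution."""
    total = sum(vec.values())
    return {key: count / total for key, count in vec.items()}


def manhattan(p, q):
    """L1 distance between two sparse distributions."""
    keys = set(p) | set(q)
    return sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def euclidean(p, q):
    """L2 distance between two sparse distributions."""
    keys = set(p) | set(q)
    return math.sqrt(sum((p.get(k, 0.0) - q.get(k, 0.0)) ** 2 for k in keys))


# Toy corpus (hypothetical); the paper uses a corpus of about two million words.
tokens = "the cat sat on the mat the dog sat on the rug".split()
vecs = {word: normalize(vec) for word, vec in context_vectors(tokens).items()}

for a, b in [("cat", "dog"), ("cat", "sat")]:
    print(a, b, manhattan(vecs[a], vecs[b]), euclidean(vecs[a], vecs[b]))
```

In this toy corpus "cat" and "dog" occur in identical contexts, so both metrics place them at distance zero, while "cat" and "sat" share no contexts and end up far apart. A bottom-up clustering step would repeatedly merge the closest items under whichever metric is chosen.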

Similar resources

A Method for Improving Automatic Word Categorization

Korkmaz, Emin Erkan. M.S., Department of Computer Engineering. Supervisor: Assist. Prof. Dr. Göktürk Üçoluk. September 1997, 57 pages. In this thesis, a new approach to automatic word categorization, which improves both the efficiency of the algorithm and the quality of the formed clusters, is presented. The unigram and the bigram statistics ...

Semi-Supervised Composite Kernel Learning Using Distance Metric Learning Techniques

Distance metric has a key role in many machine learning and computer vision algorithms so that choosing an appropriate distance metric has a direct effect on the performance of such algorithms. Recently, distance metric learning using labeled data or other available supervisory information has become a very active research area in machine learning applications. Studies in this area have shown t...

Emin Erkan Korkmaz and Göktürk Üçoluk (1997). A Method for Improving Automatic Word Categorization

This paper presents a new approach to automatic word categorization which improves both the efficiency of the algorithm and the quality of the formed clusters. The unigram and the bigram statistics of a corpus of about two million words are used with an efficient distance function to measure the similarities of words, and a greedy algorithm to put the words into clusters. The notions of fuzzy clust...

One Size Fits All? A Simple Technique to Perform Several NLP Tasks

Word fragments or n-grams have been widely used to perform different Natural Language Processing tasks such as information retrieval [1] [2], document categorization [3], automatic summarization [4] or, even, genetic classification of languages [5]. All these techniques share some common aspects such as: (1) documents are mapped to a vector space where n-grams are used as coordinates and their ...

Method for Improving Automatic Word Categorization

This paper presents a new approach to automatic word categorization which improves both the efficiency of the algorithm and the quality of the formed clusters. The unigram and the bigram statistics of a corpus of about two million words are used with an efficient distance function to measure the similarities of words, and a greedy algorithm to put the words into clusters. The notions of fuzzy c...

Publication date: 1998