Beyond Bag-of-Words: A New Distance Metric for Keywords Extraction and Clustering

Author

  • Shengqi Zhu
Abstract

Bag-of-Words (BoW) is a widely used model in a variety of Natural Language Processing (NLP) tasks. However, the model ignores relations between the words in the bag, which causes problems in several areas of NLP. In this project, I propose a framework for calculating pairwise word relations within a bag, using both the deterministic WordNet database and stochastic context information. The final relation matrix can be viewed as both a state transition matrix and an inner product matrix, which is helpful for the keyword extraction and clustering tasks commonly seen in meta-search engines.
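The abstract does not give the construction of the relation matrix, so the following is only a minimal sketch of the two readings it mentions, not the author's method. It assumes a small, made-up symmetric relation matrix `R` with positive entries; the word list, the scores, and the helper names `row_normalize`, `stationary_distribution`, and `induced_distance` are all hypothetical. Row-normalizing `R` gives a state transition matrix whose stationary distribution can serve as a keyword score, and treating `R` as an inner product matrix gives a distance that a clustering step could use.

```python
import math

# Hypothetical pairwise relation scores (made-up values, not from the paper).
words = ["car", "automobile", "engine", "banana"]
R = [
    [1.0, 0.9, 0.6, 0.1],
    [0.9, 1.0, 0.5, 0.1],
    [0.6, 0.5, 1.0, 0.1],
    [0.1, 0.1, 0.1, 1.0],
]


def row_normalize(matrix):
    """Scale each row to sum to 1, giving a state transition matrix."""
    return [[value / sum(row) for value in row] for row in matrix]


def stationary_distribution(P, iterations=200):
    """Approximate the stationary distribution of P by power iteration."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iterations):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi


def induced_distance(K, i, j):
    """Distance induced by reading K as an inner product matrix.

    Assumes K is symmetric and positive semidefinite; the clamp guards
    against small negative values from rounding.
    """
    squared = K[i][i] + K[j][j] - 2.0 * K[i][j]
    return math.sqrt(max(squared, 0.0))


# Reading 1: transition matrix, words scored by stationary probability.
scores = stationary_distribution(row_normalize(R))
ranking = sorted(zip(words, scores), key=lambda pair: pair[1], reverse=True)
print(ranking)

# Reading 2: inner product matrix, pairwise distances for clustering.
print(induced_distance(R, 0, 1), induced_distance(R, 0, 3))
```

In this toy data, words that relate strongly to many others receive higher scores, and closely related words end up at a smaller distance than unrelated ones. Any real use would depend on how the relation scores are actually derived from WordNet and context information, which the abstract leaves unspecified.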


Similar resources

Semi-Supervised Composite Kernel Learning Using Distance Metric Learning Techniques

The distance metric plays a key role in many machine learning and computer vision algorithms, so choosing an appropriate distance metric directly affects the performance of such algorithms. Recently, distance metric learning using labeled data or other available supervisory information has become a very active research area in machine learning applications. Studies in this area have shown t...


Composite Kernel Optimization in Semi-Supervised Metric

Machine-learning solutions to classification, clustering and matching problems critically depend on the adopted metric, which in the past was selected heuristically. In the last decade, it has been demonstrated that an appropriate metric can be learnt from data, resulting in superior performance as compared with traditional metrics. This has recently stimulated a considerable interest in the to...


Efficient Codebook for Human Activity Recognition in Surveillance Video

M. E. Student, Dept of Computer Science, Annamalai University, India. Associate Professor, Dept of Computer Science, Annamalai University, India. Research Scholar, Dept of Computer Science, Annamalai University, India. ABSTRACT—Automatic human activity recognition methods are useful for many applications such as Video Surveillanc...


Metric Learning in Codebook Generation of Bag-of-Words for Person Re-identification

Person re-identification is generally divided into two parts: first, how to represent a pedestrian by discriminative visual descriptors, and second, how to compare them by suitable distance metrics. Conventional methods isolate these two parts, the first part usually unsupervised and the second part supervised. The Bag-of-Words (BoW) model is a widely used image representation descriptor in part one....


Clustering Unstructured Data (Flat Files) - An Implementation in Text Mining Tool

With the advancement of technology and reduced storage costs, individuals and organizations are tending towards the usage of electronic media for storing textual information and documents. It is time consuming for readers to retrieve relevant information from unstructured document collection. It is easier and less time consuming to find documents from a large collection when the collection is o...


Publication date: 2009