Beyond Bag-of-Words: A New Distance Metric for Keywords Extraction and Clustering
Author
Abstract
Bag-of-Words (BoW) is a widely used model for a variety of tasks in Natural Language Processing (NLP). However, the model ignores relations between the words in a bag, which causes problems in several NLP tasks. In this project, I propose a framework for calculating pairwise word relations within a bag, using both the deterministic WordNet database and stochastic context information. The final relation matrix can be viewed as both a state transition matrix and an inner product matrix, which is helpful for the keyword extraction and clustering tasks commonly seen in meta-search engines.
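The two readings of the relation matrix can be illustrated with a minimal sketch. It is not the author's implementation: the toy words, the relation values, and the helper names `row_normalize`, `stationary_scores`, and `gram_distance` are all hypothetical, and the matrix is assumed to be symmetric and non-negative. Read as a state transition matrix (after row normalization), it yields a stationary distribution that could serve as a keyword score; read as an inner product matrix, it induces a distance that a clustering algorithm could consume.

```python
import math


def row_normalize(relation):
    """Turn a non-negative relation matrix into a row-stochastic transition matrix."""
    transition = []
    for row in relation:
        total = sum(row)
        if total <= 0:
            raise ValueError("each word needs at least one positive relation")
        transition.append([value / total for value in row])
    return transition


def stationary_scores(transition, iterations=500, tol=1e-12):
    """Approximate the stationary distribution by power iteration from a uniform start."""
    n = len(transition)
    scores = [1.0 / n] * n
    for _ in range(iterations):
        updated = [
            sum(scores[i] * transition[i][j] for i in range(n)) for j in range(n)
        ]
        if max(abs(a - b) for a, b in zip(updated, scores)) < tol:
            return updated
        scores = updated
    return scores


def gram_distance(relation, i, j):
    """Distance induced by treating relation[i][j] as an inner product <w_i, w_j>."""
    squared = relation[i][i] + relation[j][j] - 2.0 * relation[i][j]
    # Clamp at zero in case the matrix is not exactly positive semidefinite.
    return math.sqrt(max(squared, 0.0))


# Hypothetical toy bag and relation values, chosen only for illustration.
words = ["car", "automobile", "engine", "banana"]
relation = [
    [1.0, 0.9, 0.6, 0.1],
    [0.9, 1.0, 0.5, 0.1],
    [0.6, 0.5, 1.0, 0.1],
    [0.1, 0.1, 0.1, 1.0],
]

# Transition-matrix view: rank words by their stationary probability.
scores = stationary_scores(row_normalize(relation))
ranked = sorted(zip(words, scores), key=lambda pair: pair[1], reverse=True)

# Inner-product view: pairwise distances usable by a clustering algorithm.
near = gram_distance(relation, 0, 1)  # car vs. automobile
far = gram_distance(relation, 0, 3)   # car vs. banana
```

For a symmetric matrix like this one, the stationary probability of each word is proportional to its row sum, so strongly related words rank above the isolated one. Under the inner-product reading, the distance between two strongly related words is smaller than the distance between two weakly related ones.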
Similar resources
Semi-Supervised Composite Kernel Learning Using Distance Metric Learning Techniques
The distance metric plays a key role in many machine learning and computer vision algorithms, so the choice of an appropriate distance metric directly affects their performance. Recently, distance metric learning using labeled data or other available supervisory information has become a very active research area in machine learning applications. Studies in this area have shown t...
Composite Kernel Optimization in Semi-Supervised Metric
Machine-learning solutions to classification, clustering and matching problems critically depend on the adopted metric, which in the past was selected heuristically. In the last decade, it has been demonstrated that an appropriate metric can be learnt from data, resulting in superior performance compared with traditional metrics. This has recently stimulated considerable interest in the to...
Efficient Codebook for Human Activity Recognition in Surveillance Video
M. E. Student, Dept of Computer Science, Annamalai University, India. Associate Professor, Dept of Computer Science, Annamalai University, India. Research Scholar, Dept of Computer Science, Annamalai University, India. ABSTRACT: Automatic human activity recognition methods are useful for many applications such as Video Surveillanc...
Metric Learning in Codebook Generation of Bag-of-Words for Person Re-identification
Person re-identification is generally divided into two parts: first, how to represent a pedestrian with discriminative visual descriptors, and second, how to compare them with suitable distance metrics. Conventional methods isolate these two parts, with the first usually unsupervised and the second supervised. The Bag-of-Words (BoW) model is a widely used image representation descriptor in part one....
Clustering Unstructured Data (Flat Files) - An Implementation in Text Mining Tool
With the advancement of technology and reduced storage costs, individuals and organizations are increasingly using electronic media to store textual information and documents. It is time consuming for readers to retrieve relevant information from an unstructured document collection. It is easier and less time consuming to find documents from a large collection when the collection is o...