similarity score

Text Clustering Using a Suffix Tree Similarity Measure

Journal: JCP 2011

Chenghui Huang Jian Yin Fang Hou

In the text mining area, popular methods use bag-of-words models, which represent a document as a vector. These methods ignore word sequence information, and their good clustering results are limited to some special domains. This paper proposes a new similarity measure based on the suffix tree model of text documents. It analyzes the word sequence information, and then computes the similarity between ...


Constructing Curriculum Ontology and Dynamic Learning Path Based on Resource Description Framework

2016

Makoto Urakawa Masaru Miyazaki Hiroshi Fujisawa Masahide Naemura Ichiro Yamada

School curricula are organized by academic year. Because students have to study several subjects every year, related topics are placed into the curricula discretely. In this study, we propose a method to construct a dynamic learning path that enables related topics to be learned continuously. In this process, we define two kinds of similarity score, inheritance score a...


Shortest-Path Graph Kernels for Document Similarity

2017

Giannis Nikolentzos Polykarpos Meladianos François Rousseau Yannis Stavrakas Michalis Vazirgiannis

In this paper, we present a novel document similarity measure based on the definition of a graph kernel between pairs of documents. The proposed measure takes into account both the terms contained in the documents and the relationships between them. By representing each document as a graph-of-words, we are able to model these relationships and then determine how similar two documents are by usi...
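
As a rough illustration of the graph-of-words idea described above, the sketch below builds a co-occurrence graph per document and compares two documents by the shortest paths they share. It is a minimal sketch under stated assumptions, not the authors' kernel: the sliding-window size, the path-length cap, the 1/(length product) weighting, and all function names (`graph_of_words`, `shortest_paths`, `sp_kernel`, `sp_similarity`) are hypothetical choices made for this example.

```python
from collections import deque
from math import sqrt


def graph_of_words(tokens, window=2):
    """Undirected graph: terms are nodes; terms within `window` tokens share an edge."""
    adj = {t: set() for t in tokens}
    for i, t in enumerate(tokens):
        for u in tokens[i + 1:i + window]:
            if u != t:
                adj[t].add(u)
                adj[u].add(t)
    return adj


def shortest_paths(adj, max_len=2):
    """Map each unordered term pair to its shortest-path length (BFS, capped at max_len)."""
    paths = {}
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            node = queue.popleft()
            if dist[node] == max_len:
                continue  # do not expand beyond the cap (assumed parameter)
            for nxt in adj[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        for dst, d in dist.items():
            if dst != src:
                paths[frozenset((src, dst))] = d
    return paths


def sp_kernel(paths_a, paths_b):
    """Sum of 1 / (len_a * len_b) over term pairs connected in both graphs (assumed weighting)."""
    shared = paths_a.keys() & paths_b.keys()
    return sum(1.0 / (paths_a[p] * paths_b[p]) for p in shared)


def sp_similarity(tokens_a, tokens_b, window=2, max_len=2):
    """Kernel value normalised by the self-similarities, so identical documents score 1."""
    pa = shortest_paths(graph_of_words(tokens_a, window), max_len)
    pb = shortest_paths(graph_of_words(tokens_b, window), max_len)
    denom = sqrt(sp_kernel(pa, pa) * sp_kernel(pb, pb))
    return sp_kernel(pa, pb) / denom if denom else 0.0
```

Calling `sp_similarity("graph kernel for documents".split(), "kernel for graph documents".split())` compares two short token lists; unlike a pure bag-of-words comparison, the score depends on which terms appear near each other as well as on which terms occur.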


Building a Hierarchy of Events and Topics for Newspaper Digital Libraries

2003

Aurora Pons-Porrata Rafael Berlanga Llavori José Ruiz-Shulcloper

In this paper we propose an incremental hierarchical clustering algorithm for on-line event detection. This algorithm is applied to a set of newspaper articles in order to discover the structure of topics and events that they describe. In the first level, articles with a high temporal-semantic similarity are clustered together into events. In the next levels of the hierarchy, these events are s...


Detecting Transliterated Orthographic Variants via Two Similarity Metrics

2004

Kiyonori Ohtake Youichi Sekiguchi Kazuhide Yamamoto

We propose a detection method for orthographic variants caused by transliteration in a large corpus. The method employs two similarities. One is string similarity based on edit distance. The other is contextual similarity based on a vector space model. Experimental results show that the method achieved an F-measure of 0.889 in an open test.
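
The two similarities mentioned above can be sketched as follows. This is a minimal illustration, assuming a standard Levenshtein distance normalised by the longer string and a cosine similarity over bag-of-words context vectors; the abstract does not specify the normalisation, the context definition, or how the two scores are combined, so those details and the function names are assumptions.

```python
from collections import Counter
from math import sqrt


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling one-row table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def string_similarity(a: str, b: str) -> float:
    """Edit distance mapped to [0, 1]; this normalisation is an assumption."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def context_similarity(ctx_a: list[str], ctx_b: list[str]) -> float:
    """Cosine similarity between bag-of-words vectors of surrounding words."""
    va, vb = Counter(ctx_a), Counter(ctx_b)
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm = sqrt(sum(c * c for c in va.values())) * sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0
```

A candidate pair would plausibly count as a variant only when both scores are high: spelling variants of one loanword are close in edit distance and also tend to appear in similar contexts, whereas unrelated words that merely look alike fail the contextual check.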


A Hybrid Approach for Extending Ontology from Text

2013

Wei He Shuang Li Xiaoping Yang

Ontology is applied in various fields of computing as a conceptual modeling tool, and is used to organize information and manage knowledge. Ontology extension, which adds new concepts and relationships to an existing ontology, is a more complex task. In this paper, we propose a hybrid approach for ontology extension from text using semantic relatedness between words, which exploit...


Post Summarization of Microblogs of Sporting Events

2017

Mehreen Gillani Muhammad U. Ilyas Saad Saleh Jalal S. Alowibdi Naif R. Aljohani Fahad S. Alotaibi

Every day 645 million Twitter users generate approximately 58 million tweets. This raises the question of whether it is possible to generate a summary of events from this rich set of tweets alone. Key challenges in post summarization from microblog posts include circumnavigating spam and conversational posts. In this study, we present a novel technique called lexi-temporal clustering (LTC), which ide...


Are Word Embedding-based Features Useful for Sarcasm Detection?

2016

Aditya Joshi Vaibhav Tripathi Kevin Patel Pushpak Bhattacharyya Mark James Carman

This paper makes a simple increment to the state of the art in sarcasm detection research. Existing approaches are unable to capture subtle forms of context incongruity, which lies at the heart of sarcasm. We explore whether prior work can be enhanced using semantic similarity/discordance between word embeddings. We augment word embedding-based features to four feature sets reported in the past. We also e...


Methodology and Results for the Competition on Semantic Similarity Evaluation and Entailment Recognition for PROPOR 2016

Journal: CoRR 2017

Luciano Barbosa Paulo Rodrigo Cavalin Victor Guimaraes Matthias Kormaksson

In this paper, we present the methodology and the results obtained by our team, dubbed Blue Man Group, in the ASSIN (from the Portuguese Avaliação de Similaridade Semântica e Inferência Textual) competition, held at PROPOR 2016. Our team’s strategy consisted of evaluating methods based on semantic word vectors, following two distinct directions: 1) to make use of low-dimensional, compact, feat...


Triple Scoring Using Paragraph Vector - The Gailan Triple Scorer at WSDM Cup 2017

Journal: CoRR 2017

Esraa Ali Annalina Caputo Séamus Lawless

In this paper we describe our solution to the WSDM Cup 2017 Triple Scoring task. Our approach generates a relevance score based on the textual description of the triple’s subject and value (Object). It measures how similar (related) the text description of the subject is to the text description of its values. The generated similarity score can then be used to rank the multiple values associated...
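
The ranking step described above can be sketched as follows, assuming each text description has already been turned into a fixed-length vector (for example by a paragraph-vector model, which is not shown here). Cosine similarity as the score and the names `cosine` and `rank_values` are assumptions made for illustration; the abstract does not state the exact scoring function.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def rank_values(subject_vec, value_vecs):
    """Return (value, score) pairs sorted by descending similarity to the subject.

    `value_vecs` maps each candidate value name to its description vector.
    """
    scored = [(name, cosine(subject_vec, vec)) for name, vec in value_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Given a subject vector and a dictionary of candidate value vectors, `rank_values` returns the candidates ordered from most to least similar, so the multiple values associated with one subject can be ranked by how closely their descriptions match it.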
