Linguistically Fuelled Text Similarity
Abstract
This paper describes TEXTSIM, a system for determining the similarity between texts. We also report a comparison between two configurations of TEXTSIM: one with and one without deeper linguistic analysis. To evaluate and compare the two models, we used two sets of examples: a set of automatically generated examples and a set of examples acquired from two assessors. Depending on the type of documents, the model using linguistic analysis performed as well as or better than the model without it.
Similar Resources
How Noisy Social Media Text, How Diffrnt Social Media Sources?
While various claims have been made about social media text being noisy, there has never been a systematic study to investigate just how linguistically noisy or otherwise it is over a range of social media sources. We explore this question empirically over popular social media text types, in the form of YouTube comments, Twitter posts, web user forum posts, blog posts and Wikipedia, whi...
Baldwin, Timothy, Paul Cook, Marco Lui, Andrew MacKinlay and Li Wang (to appear) How Noisy Social Media Text, How Diffrnt Social Media Sources? In Proceedings of the 6th International Joint Conference on Natural Language Processing (IJCNLP 2013), Nagoya, Japan
While various claims have been made about social media text being noisy, there has never been a systematic study to investigate just how linguistically noisy or otherwise it is over a range of social media sources. We explore this question empirically over popular social media text types, in the form of YouTube comments, Twitter posts, web user forum posts, blog posts and Wikipedia,...
Effects of Creativity and Cluster Tightness on Short Text Clustering Performance
Properties of corpora, such as the diversity of vocabulary and how tightly related texts cluster together, impact the best way to cluster short texts. We examine several such properties in a variety of corpora and track their effects on various combinations of similarity metrics and clustering algorithms. We show that semantic similarity metrics outperform traditional n-gram and dependency simi...
Linguistically Optimized Text Entry on a Mobile Phone
We present an analysis of linguistically optimized text entry on mobile phones. This analysis compares the behavior of a linguistically optimized system with word-based disambiguation methods. Through theoretical analysis, it is shown that in real-world situations in which typing errors are common and dictionaries are incomplete, the speed of text entry using a word-guessing method degrades to...
Bridging the Gap between Domain-Oriented and Linguistically-Oriented Semantics
This paper compares domain-oriented and linguistically-oriented semantics, based on the GENIA event corpus and FrameNet. While the domain-oriented semantic structures are direct targets of Text Mining (TM), their extraction from text is not straightforward due to the diversity of linguistic expressions. The extraction of linguistically-oriented semantics is more straightforward, and has been stud...