Global Evaluation of Random Indexing through Swedish Word Clustering Compared to the People's Dictionary of Synonyms
Abstract
Evaluation of word space models is usually local in the sense that it only considers words that are deemed very similar by the model. We propose a global evaluation scheme based on clustering of the words. A clustering of high quality in an external evaluation against a semantic resource, such as a dictionary of synonyms, indicates a word space model of high quality. We use Random Indexing to create several different models and compare them by clustering evaluation against the People’s Dictionary of Synonyms, a list of Swedish synonyms that are graded by the public. Most notably, we get better results for models based on syntagmatic information (words that appear together) than for models based on paradigmatic information (words that appear in similar contexts). This is quite contrary to previous results that have been presented for local evaluation. Clustering into ten clusters gives a recall of 83% for a syntagmatic model, compared to 34% for a comparable paradigmatic model, and 10% for a random partition.
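The recall figures above can be illustrated with a minimal sketch. It assumes that recall means the fraction of dictionary synonym pairs whose two words end up in the same cluster; the abstract does not spell out the exact definition, so this is an assumption. The function names `pair_recall` and `random_partition` and the toy Swedish word list are hypothetical, not taken from the paper.

```python
import random


def pair_recall(clusters, synonym_pairs):
    """Fraction of synonym pairs whose two words share a cluster.

    clusters: dict mapping word -> cluster id (assumed representation).
    synonym_pairs: iterable of (word, word) tuples from the dictionary.
    Pairs with a word missing from the clustering are skipped.
    """
    scored = [(a, b) for a, b in synonym_pairs if a in clusters and b in clusters]
    if not scored:
        return 0.0
    hits = sum(1 for a, b in scored if clusters[a] == clusters[b])
    return hits / len(scored)


def random_partition(words, k, seed=0):
    """Assign every word to one of k clusters uniformly at random."""
    rng = random.Random(seed)
    return {w: rng.randrange(k) for w in words}


# Toy data, invented for illustration only.
clusters = {"bil": 0, "fordon": 0, "glad": 1, "lycklig": 1, "snabb": 2, "kvick": 0}
pairs = [("bil", "fordon"), ("glad", "lycklig"), ("snabb", "kvick")]
print(pair_recall(clusters, pairs))  # 2 of the 3 pairs share a cluster

# Random baseline: with k clusters, a pair is co-clustered with probability 1/k.
words = [f"w{i}" for i in range(2000)]
baseline_pairs = [(words[i], words[i + 1]) for i in range(0, 2000, 2)]
print(pair_recall(random_partition(words, 10), baseline_pairs))
```

Under this reading, a uniformly random partition into ten clusters places both words of a pair in the same cluster with probability 1/10, which agrees with the 10% baseline quoted in the abstract.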
Similar resources
Synonym Dictionary Improvement through Markov Clustering and Clustering Stability
Abstract. The aim of the work presented here is to clean up a dictionary of synonyms which appeared to be ambiguous, incomplete and inconsistent. The key idea is to use Markov Clustering and Clustering Stability techniques on the network that represents the synonymy relation contained in the dictionary. Each densely connected cluster is considered to correspond to a specific concept, and ambigu...
Distributional Semantics Approach to Detecting Synonyms in Croatian Language
Identifying synonyms is important for many natural language processing and information retrieval applications. In this paper we address the task of automatically identifying synonyms in Croatian language using distributional semantic models (DSM). We build several DSMs using latent semantic analysis (LSA) and random indexing (RI) on the large hrWaC corpus. We evaluate the models on a dictionary...
Text Clustering Exploration Swedish Text Representation and Clustering Results Unraveled
Text clustering divides a set of texts into clusters (parts), so that texts within each cluster are similar in content. It may be used to uncover the structure and content of unknown text sets as well as to give new perspectives on familiar ones. The main contributions of this thesis are an investigation of text representation for Swedish and some extensions of the work on how to use text clust...
Extracting Clinical Findings from Swedish Health Record Text
Information contained in the free text of health records is useful for the immediate care of patients as well as for medical knowledge creation. Advances in clinical language processing have made it possible to automatically extract this information, but most research has, until recently, been conducted on clinical text written in English. In this thesis, however, information extraction from Sw...
Automatic Induction of Synsets from a Graph of Synonyms
This paper presents a new graph-based approach that induces synsets using synonymy dictionaries and word embeddings. First, we build a weighted graph of synonyms extracted from commonly available resources, such as Wiktionary. Second, we apply word sense induction to deal with ambiguous words. Finally, we cluster the disambiguated version of the ambiguous input graph into synsets. Our meta-clus...