Detecting Semantic Shifts in Slovene Twitterese
Abstract
This paper presents first results of automatic semantic shift detection in Slovene tweets. We use word embeddings to compare the semantic behaviour of common words that occur frequently in a reference corpus of Slovene with their behaviour on Twitter. Words with the highest model distance between the two corpora are treated as semantic shift candidates. These candidates are manually analysed and classified, both to evaluate the proposed approach and to gain a better qualitative understanding of the nature of the problem. Apart from noise due to preprocessing errors (45%), the approach yields many valuable candidates, especially novel senses that arise from daily events and senses produced in informal communication settings.
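The ranking step described in the abstract can be illustrated with a minimal sketch. It assumes that each word already has one embedding per corpus and that both sets of vectors live in a comparable space; the abstract does not say how the models were trained or aligned, nor which distance was used, so cosine distance, the function names, and the toy vectors below are all hypothetical choices made for illustration only.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def rank_shift_candidates(ref_vectors, tweet_vectors, top_n=10):
    """Rank words shared by both models by the distance between their vectors.

    Assumption: both dictionaries map a word to a vector in a comparable
    space. Words with the largest distance come first.
    """
    shared = set(ref_vectors) & set(tweet_vectors)
    scored = [(w, cosine_distance(ref_vectors[w], tweet_vectors[w])) for w in shared]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]

# Toy, made-up vectors (not data from the paper).
ref_vectors = {"word_a": [1.0, 0.0], "word_b": [0.0, 1.0], "word_c": [1.0, 1.0]}
tweet_vectors = {"word_a": [0.0, 1.0], "word_b": [0.0, 2.0], "word_c": [1.0, 0.0]}

candidates = rank_shift_candidates(ref_vectors, tweet_vectors, top_n=2)
for word, distance in candidates:
    print(f"{word}\t{distance:.3f}")
```

In a setting like the one described, the top of such a ranking would then go to manual analysis, since a high distance can reflect preprocessing noise as well as a genuine change in sense.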
Related resources
The JOS Linguistically Tagged Corpus of Slovene
The JOS language resources are meant to facilitate developments of HLT and corpus linguistics for the Slovene language and consist of the morphosyntactic specifications, defining the Slovene morphosyntactic features and tagset; two annotated corpora (jos100k and jos1M); and two web services (a concordancer and text annotation tool). The paper introduces these components, and concentrates on jos...
Learning to Mine Definitions from Slovene Structured and Unstructured Knowledge-Rich Resources
The paper presents an innovative approach to extract Slovene definition candidates from domain-specific corpora using morphosyntactic patterns, automatic terminology recognition and semantic tagging with wordnet senses. First, a classification model was trained on examples from Slovene Wikipedia which was then used to find well-formed definitions among the extracted candidates. The results of t...
‘Knowing How’ in Slovene: Treading the Other Path*
For the linguistic expression of the concept of knowledge, the Slavic languages use verbs deriving from the Indo-European roots *ĝnō and *u̯ei̯d. They differ in terms of the availability of both types of verbs in the contemporary standard languages and in terms of their semantic range. As will be shown in this paper, these differences are interesting not only from a language-specific lexicolog...
A Multilingual Approach to Building Slovene Wordnet
The paper presents an experiment in which synsets for Slovene wordnet were induced automatically from several multilingual resources. Our research is based on the assumption that translations are a plausible source of semantically relevant information. More specifically, we argue that the translational relation on the one hand reduces ambiguity of a source word and on the other conveys semantic...
Morphosyntactic Tagging of Slovene Using Progol
We consider the task of tagging Slovene words with morphosyntactic descriptions (MSDs). MSDs contain not only part-of-speech information but also attributes such as gender and case. In the case of Slovene there are 2,083 possible MSDs. P-Progol was used to learn morphosyntactic disambiguation rules from annotated data (consisting of 161,314 examples) produced by the MULTEXT-East project. P-Prog...