Using Semantic Clustering for Detecting Bengali Multiword Expressions
Abstract
Multiword Expressions (MWEs), a known nuisance for both linguistics and NLP, blur the line between syntax and semantics. The semantics of an MWE cannot be derived by combining the semantics of its constituents. In this study, we propose a novel approach called “semantic clustering” as an instrument for extracting MWEs, especially for resource-constrained languages like Bengali. First, it locates clusters of the synonymous noun tokens present in the document. These clusters in turn help measure the similarity between the constituent words of a potential candidate using a vector space model. Finally, whether the phrase qualifies as an MWE is judged against a predefined threshold. In this experiment, we apply the semantic clustering approach only to noun-noun bigram MWEs; however, we believe that it can be extended to any type of MWE. We compare our approach with the state-of-the-art statistical approach. The evaluation results show that semantic clustering outperforms all other competing methods. As a byproduct of this experiment, we have started developing a standard lexicon in Bengali that serves as a productive Bengali linguistic thesaurus.
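The three steps named in the abstract (cluster synonymous nouns, compare the two constituents in a vector space, apply a threshold) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy English synonym sets, the function names, the overlap-count vectors, the cosine measure, and the threshold value are all hypothetical. The abstract also does not say which side of the threshold marks a candidate as an MWE, so the sketch only reports the score and whether it reaches the threshold.

```python
import math

# Hypothetical toy synonym lexicon (English placeholders, not the paper's Bengali data).
SYNSETS = {
    "sun": {"star", "daystar"},
    "star": {"sun"},
    "moon": {"satellite"},
    "river": {"stream"},
}

def build_clusters(doc_nouns, synsets):
    """Step 1 (assumed form): one cluster per noun, holding the noun and
    those of its synonyms that also occur in the document."""
    vocab = set(doc_nouns)
    return {n: (synsets.get(n, set()) | {n}) & vocab for n in sorted(vocab)}

def word_vector(word, clusters):
    """Step 2a (assumed form): represent a word by how much its own cluster
    overlaps with every cluster found in the document."""
    own = clusters.get(word, {word})
    return [len(own & members) for members in clusters.values()]

def cosine(u, v):
    """Step 2b: cosine similarity, returning 0.0 for zero-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def score_candidate(first, second, clusters, threshold=0.5):
    """Step 3: score a noun-noun bigram and compare it with a threshold.
    The threshold value and the >= comparison are assumptions."""
    score = cosine(word_vector(first, clusters), word_vector(second, clusters))
    return score, score >= threshold

clusters = build_clusters(["sun", "star", "moon", "river"], SYNSETS)
print(score_candidate("sun", "star", clusters))
print(score_candidate("sun", "river", clusters))
```

In this toy setup two nouns that share a cluster score close to 1 and unrelated nouns score 0; a real lexicon with partially overlapping synonym sets would produce intermediate values, which is where the choice of threshold matters.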
Similar resources
Identifying Bengali Multiword Expressions using Semantic Clustering
One of the key issues in both natural language understanding and generation is the appropriate processing of Multiword Expressions (MWEs). MWEs pose a huge problem for precise language processing due to their idiosyncratic nature and diversity in lexical, syntactic and semantic properties. The semantics of an MWE cannot be derived by combining the semantics of its constituents. Therefore...
Semantic Clustering: an Attempt to Identify Multiword Expressions in Bengali
One of the key issues in both natural language understanding and generation is the appropriate processing of Multiword Expressions (MWEs). An MWE can be defined as a semantic issue of a phrase where the meaning of the phrase may not be obtained from its constituents in a straightforward manner. This paper presents an approach to identifying bigram noun-noun MWEs from a medium-size Bengali corpus b...
A Machine Learning Approach for the Identification of Bengali Noun-Noun Compound Multiword Expressions
This paper presents a machine learning approach for identification of Bengali multiword expressions (MWE) which are bigram nominal compounds. Our proposed approach has two steps: (1) candidate extraction using chunk information and various heuristic rules and (2) training the machine learning algorithm called Random Forest to classify the candidates into two groups: bigram nominal compound MWE ...
Building an Arabic Multiword Expressions Repository
We introduce a list of Arabic multiword expressions (MWE) collected from various dictionaries. The MWEs are grouped based on their syntactic type. Every constituent word in the expressions is manually annotated with its full context-sensitive morphological analysis. Some of the expressions contain semantic variables as place holders for words that play the same semantic role. In addition, we ha...
Identification of Reduplication in Bengali Corpus and their Semantic Analysis: A Rule Based Approach
In linguistic studies, reduplication generally means the repetition of any linguistic unit such as a phoneme, morpheme, word, phrase, clause or the utterance as a whole. The identification of reduplication is part of the general task of identifying multiword expressions (MWEs). In the present work, reduplications have been identified from the Bengali corpus of the articles of Rabindranath Ta...
Journal:
- Informatica (Slovenia)
Volume 38
Publication date: 2014