Data Mining Meets Collocations Discovery
Authors
Abstract
In this paper we discuss the problem of discovering interesting word sequences in the light of two traditions: sequential pattern mining (from data mining) and collocations discovery (from computational linguistics). Smadja (1993) defines a collocation as “a recurrent combination of words that cooccur more often than chance and that correspond to arbitrary word usages.” The notion of arbitrariness underlines the fact that if one word of a collocation is replaced by a synonym, the resulting phrase may become peculiar or even incorrect. For instance, “strong tea” cannot be replaced with “powerful tea”. Acquisition of collocations, also known as multi-word units, is crucial for many fields, such as lexicography, machine translation, foreign language learning, and information retrieval. We attempt to describe the collocations discovery problem as a general problem of discovering interesting sequences in text. Moreover, we survey common approaches from both collocations discovery and data mining and propose new avenues for fruitfully combining these approaches.
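The phrase "cooccur more often than chance" in the definition above can be made concrete with an association measure. The following is a minimal sketch, not the method of the paper: it assumes pointwise mutual information (PMI) over adjacent word pairs as the measure, and the function name `bigram_pmi`, the `min_count` threshold, and the toy token list are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def bigram_pmi(tokens, min_count=2):
    """Score adjacent word pairs by pointwise mutual information (PMI).

    PMI compares the observed probability of a pair with the probability
    expected if the two words occurred independently. A positive score
    means the pair co-occurs more often than chance would predict.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)

    scores = {}
    for (w1, w2), count in bigrams.items():
        # Rare pairs get inflated PMI values, so skip them (assumed threshold).
        if count < min_count:
            continue
        p_pair = count / n_bi
        p_independent = (unigrams[w1] / n_uni) * (unigrams[w2] / n_uni)
        scores[(w1, w2)] = math.log2(p_pair / p_independent)
    return scores


# Toy input, invented for illustration only.
tokens = ["strong", "tea", "is", "good", "strong", "tea", "is", "hot",
          "powerful", "computer"]
scores = bigram_pmi(tokens)
```

On this toy input only pairs seen at least twice are scored, so ("strong", "tea") receives a score while ("powerful", "computer") is filtered out. PMI captures recurrence beyond chance, but not the arbitrariness criterion in the definition, which needs additional evidence such as synonym substitution tests.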
Similar resources
News Article Classification Based on a Vector Representation Including Words’ Collocations
In this paper we present a proposal for including collocations in the pre-processing stage of text mining, which we use for fast news article recommendation, together with experiments based on real data from the biggest Slovak newspaper. The news article section can be predicted from several characteristics of an article, such as its name, content, keywords, etc. We provided experiments aimed at comparison o...
Automatic Discovery of Technology Networks for Industrial-Scale R&D IT Projects via Data Mining
Industrial-Scale R&D IT Projects depend on many sub-technologies, which need to be understood and have their risks analysed before the project can begin if it is to succeed. When planning such an industrial-scale project, the list of technologies and their associations with each other is often complex and forms a network. Discovery of this network of technologies is time consumi...
An Integrated Data Mining System to Automate Discovery of Measures of Association
Many data analysts require tools which can integrate their database management packages (e.g. Microsoft Access) with their data analysis ones (e.g. SAS, SPSS), and provide guidance for the selection of appropriate mining algorithms. In addition, the analysts need to extract and validate statistical results to facilitate data mining. In this paper, we describe an integrated data mining system ca...
Mind the Time: Unleashing the Temporal Aspects in Pattern Discovery
Temporal Data Mining is a core concept of Knowledge Discovery in Databases handling time-oriented data. State-of-the-art methods are capable of preserving the temporal order of events as well as the information in between. The temporal nature of the events themselves, however, can likely be misinterpreted by current algorithms. We present a new definition of the temporal aspects of events and ex...
Expert Discovery: A web mining approach
Expert discovery is a quest to answer a question: “Who is the best expert of a specific subject in a particular domain within peculiar array of parameters?” An expert with domain knowledge in any field is crucial for consulting in industry, academia and the scientific community. The aim of this study is to address the issues of the expert-finding task in a real-world community. Collabor...