Extraction of Folksonomies from Noisy Texts
Authors
Abstract
We built a system that automatically creates a text-based topic hierarchy for use in a geographically defined community. This poses two main problems. First, the texts mix standard language with a community-specific dialect, so dialect words should be corrected to standard words as far as possible. Second, the texts must be clustered hierarchically by topic without manual intervention. We address the first problem by performing a nearest-neighbor search over a dynamic set of known words, using a set of transition rules from dialect to standard words that are learned from a pair-wise lexicon. We address the second by implementing a hierarchical co-clustering algorithm that automatically generates a topic hierarchy of the collection while simultaneously grouping documents and words into clusters.
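The dialect-correction step can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the transition rules are already available as plain substring rewrites, uses unweighted Levenshtein distance as the nearest-neighbor metric, and all names (`edit_distance`, `candidates`, `correct`) and the toy rules and vocabulary are hypothetical.

```python
from typing import Iterable, Set, Tuple


def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance (insert, delete, substitute all cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def candidates(word: str, rules: Iterable[Tuple[str, str]]) -> Set[str]:
    """Return the word itself plus one rewritten variant per matching rule."""
    out = {word}
    for src, dst in rules:
        if src in word:
            out.add(word.replace(src, dst))
    return out


def correct(word: str, known_words: Set[str],
            rules: Iterable[Tuple[str, str]], max_distance: int = 1) -> str:
    """Map a possibly dialectal word to its nearest known word.

    The word is returned unchanged when no known word lies within
    max_distance of any rewritten candidate.
    """
    if word in known_words:
        return word
    rules = list(rules)
    best_word, best_dist = word, max_distance + 1
    for cand in candidates(word, rules):
        for known in sorted(known_words):  # sorted for deterministic ties
            dist = edit_distance(cand, known)
            if dist < best_dist:
                best_word, best_dist = known, dist
    return best_word


# Hypothetical rule set and vocabulary, for illustration only.
RULES = [("uu", "au")]
KNOWN = {"haus", "maus", "baum"}

print(correct("huus", KNOWN, RULES))
print(correct("xyz", KNOWN, RULES))
```

In this sketch the known-word set is an ordinary Python set, so adding newly seen standard words to it is enough to make the search "dynamic". A system at realistic scale would likely need an index rather than a linear scan, and rules learned from a lexicon would probably carry weights; both are left out here.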
Similar resources
Folksonomies versus Automatic Keyword Extraction: an Empirical Study
Semantic Metadata, which describes the meaning of documents, can be produced either manually or else semi-automatically using information extraction techniques. Manual techniques are expensive if they rely on skilled cataloguers, but a possible alternative is to make use of community produced annotations such as those collected in folksonomies. This paper reports on an experiment that we carrie...
Enabling Folksonomies for Knowledge Extraction: A Semantic Grounding Approach
Folksonomies emerge as the result of the free tagging activity of a large number of users over a variety of resources. They can be considered as valuable sources from which it is possible to obtain emerging vocabularies that can be leveraged in knowledge extraction tasks. However, when it comes to understanding the meaning of tags in folksonomies, several problems mainly related to the appearanc...
Towards Ontological Structures Extraction from Folksonomies: An Efficient Fuzzy Clustering Approach
Folksonomies are one of the technologies of Web 2.0 that permit users to annotate resources on the Web. In this paper, the authors propose an integrated approach to extract ontological structures from unstructured and semi-structured resources. Our proposal overcomes limitations of existing approaches. It gives a formal, simple, and efficient solution to the tag clustering and disambiguation pr...
Exploring the Value of Folksonomies for Creating Semantic Metadata
Finding good keywords to describe resources is an on-going problem. Typically, we select such words manually from a thesaurus of terms, or they are created using automatic keyword extraction techniques. Folksonomies are an increasingly well-populated source of unstructured tags describing Web resources. This article explores the value of the folksonomy tags as a potential source of keyword meta...
Definition Extraction Using a Sequential Combination of Baseline Grammars and Machine Learning Classifiers
The paper deals with the task of definition extraction from a small and noisy corpus of instructive texts. Three approaches are presented: Partial Parsing, Machine Learning, and a sequential combination of both. We show that applying ML methods with the support of a trivial grammar gives better results than a relatively complicated partial grammar, and much better results than a pure ML approach.