DKPro-UGD: A Flexible Data-Cleansing Approach to Processing User-Generated Discourse
Authors
Abstract
User-generated discourse from Web 2.0 poses particular challenges to natural language processing (NLP) because it is noisy and error-prone. A data cleansing step preceding the analysis steps in an NLP pipeline can reduce these problems. While recent efforts provide general-purpose collections of UIMA-based analysis components, data cleansing does not yet seem to be covered. The five-stage data cleansing approach proposed here offers maximum flexibility in identifying problematic artifacts, deciding how to deal with them, and analysing the cleansed data. At the same time, it allowed us to create reusable UIMA-based components for the actual data cleansing and for mapping annotations created on the clean data back to the original representation. These components are released as part of the Darmstadt Knowledge Processing Software Repository (DKPro) under the name DKPro-UGD.
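The idea of mapping annotations from the cleansed text back to the original representation can be illustrated with a minimal Python sketch. It is not the DKPro-UGD implementation or its API, and it does not reproduce the five stages; the function names `cleanse` and `map_back`, the segment layout, and the example pattern are all assumptions made for illustration. The sketch replaces each match of one pattern with a fixed string and records how character offsets in the clean text align with offsets in the original.

```python
import re


def cleanse(text, pattern, replacement):
    """Replace each non-empty match of `pattern` with the literal `replacement`.

    Returns the clean text and a list of alignment segments
    (clean_start, clean_end, orig_start, orig_end, changed).
    Hypothetical helper, not part of DKPro-UGD.
    """
    parts, segments = [], []
    clean_pos = orig_pos = 0
    for m in re.finditer(pattern, text):
        if m.start() == m.end():
            continue  # ignore empty matches
        if m.start() > orig_pos:
            # Text between artifacts is copied unchanged.
            kept = text[orig_pos:m.start()]
            segments.append((clean_pos, clean_pos + len(kept),
                             orig_pos, m.start(), False))
            parts.append(kept)
            clean_pos += len(kept)
        # The artifact itself is replaced; remember which span it came from.
        segments.append((clean_pos, clean_pos + len(replacement),
                         m.start(), m.end(), True))
        parts.append(replacement)
        clean_pos += len(replacement)
        orig_pos = m.end()
    if orig_pos < len(text):
        kept = text[orig_pos:]
        segments.append((clean_pos, clean_pos + len(kept),
                         orig_pos, len(text), False))
        parts.append(kept)
    return "".join(parts), segments


def _segment_at(pos, segments):
    """Find the segment that covers clean-text character index `pos`."""
    for seg in segments:
        if seg[0] <= pos < seg[1]:
            return seg
    raise ValueError(f"offset {pos} is outside the clean text")


def map_back(begin, end, segments):
    """Map a non-empty clean-text span [begin, end) to a span in the original."""
    if begin >= end:
        raise ValueError("annotation span must be non-empty")
    c0, _, o0, _, changed = _segment_at(begin, segments)
    # Inside a replaced artifact, snap to the start of the original artifact.
    orig_begin = o0 if changed else o0 + (begin - c0)
    c0, _, o0, o1, changed = _segment_at(end - 1, segments)
    # Inside a replaced artifact, snap to the end of the original artifact.
    orig_end = o1 if changed else o0 + (end - c0)
    return orig_begin, orig_end


original = "This is sooooo cool"
clean, segments = cleanse(original, r"o{3,}", "o")
begin, end = map_back(8, 10, segments)  # the token "so" in the clean text
print(clean)
print(original[begin:end])
```

In this sketch an annotation that touches a replaced artifact is widened to cover the whole original artifact, which is one possible design choice; other policies (for example, rejecting such annotations) are equally conceivable.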
Similar resources
DKPro TC: A Java-based Framework for Supervised Learning Experiments on Textual Data
We present DKPro TC, a framework for supervised learning experiments on textual data. The main goal of DKPro TC is to enable researchers to focus on the actual research task behind the learning problem and let the framework handle the rest. It enables rapid prototyping of experiments by relying on an easy-to-use workflow engine and standardized document preprocessing based on the Apache Unstruc...
DKPro Keyphrases: Flexible and Reusable Keyphrase Extraction Experiments
DKPro Keyphrases is a keyphrase extraction framework based on UIMA. It offers a wide range of state-of-the-art keyphrase extraction approaches. At the same time, it is a workbench for developing new extraction approaches and evaluating their impact. DKPro Keyphrases is publicly available under an open-source license.
Exploiting Debate Portals for Semi-Supervised Argumentation Mining in User-Generated Web Discourse
Analyzing arguments in user-generated Web discourse has recently gained attention in argumentation mining, an evolving field of NLP. Current approaches, which employ fully-supervised machine learning, are usually domain dependent and suffer from the lack of large and diverse annotated corpora. However, annotating arguments in discourse is costly, error-prone, and highly context-dependent. We ask...
Fast SFFS-Based Algorithm for Feature Selection in Biomedical Datasets
Biomedical datasets usually include a large number of features relative to the number of samples. However, some data dimensions may be less relevant or even irrelevant to the output class. Selection of an optimal subset of features is critical, not only to reduce the processing cost but also to improve the classification results. To this end, this paper presents a hybrid method of filter and wr...
A Novel Multi-user Detection Approach on Fluctuations of Autocorrelation Estimators in Non-Cooperative Communication
Recently, blind multi-user detection has become an important topic in code division multiple access (CDMA) systems. Direct-Sequence Spread Spectrum (DSSS) signals are well known for their low probability of detection and secure communication. In this article, the problem of blind multi-user detection is studied in variable processing gain direct-sequence code division multiple access (VPG D...