Segmenting Chinese Microtext: Joint Informal-Word Detection and Segmentation with Neural Networks
نویسندگان
چکیده
State-of-the-art Chinese word segmentation systems typically exploit supervised models trained on a standard manually-annotated corpus, achieving performances over 95% on a similar standard testing corpus. However, the performances may drop significantly when the same models are applied onto Chinese microtext. One major challenge is the issue of informal words in the microtext. Previous studies show that informal word detection can be helpful for microtext processing. In this work, we investigate it under the neural setting, by proposing a joint segmentation model that integrates the detection of informal words simultaneously. In addition, we generate training corpus for the joint model by using existing corpus automatically. Experimental results show that the proposed model is highly effective for segmentation of Chinese microtext.
منابع مشابه
Mining Informal Language from Chinese Microtext: Joint Word Recognition and Segmentation
We address the problem of informal word recognition in Chinese microblogs. A key problem is the lack of word delimiters in Chinese. We exploit this reliance as an opportunity: recognizing the relation between informal word recognition and Chinese word segmentation, we propose to model the two tasks jointly. Our joint inference method significantly outperforms baseline systems that conduct the t...
متن کاملFeature-based Neural Language Model and Chinese Word Segmentation
In this paper we introduce a feature-based neural language model, which is trained to estimate the probability of an element given its previous context features. In this way our feature-based language model can learn representation for more sophisticated features. We introduced the deep neural architecture into the Chinese Word Segmentation task. We got a significant improvement on segmenting p...
متن کاملChinese Informal Word Normalization: an Experimental Study
We study the linguistic phenomenon of informal words in the domain of Chinese microtext and present a novel method for normalizing Chinese informal words to their formal equivalents. We formalize the task as a classification problem and propose rule-based and statistical features to model three plausible channels that explain the connection between formal and informal pairs. Our two-stage selec...
متن کاملA Gap-Based Framework for Chinese Word Segmentation via Very Deep Convolutional Networks
Most previous approaches to Chinese word segmentation can be roughly classified into character-based and word-based methods. The former regards this task as a sequence-labeling problem, while the latter directly segments character sequence into words. However, if we consider segmenting a given sentence, the most intuitive idea is to predict whether to segment for each gap between two consecutiv...
متن کاملAn Automated MR Image Segmentation System Using Multi-layer Perceptron Neural Network
Background: Brain tissue segmentation for delineation of 3D anatomical structures from magnetic resonance (MR) images can be used for neuro-degenerative disorders, characterizing morphological differences between subjects based on volumetric analysis of gray matter (GM), white matter (WM) and cerebrospinal fluid (CSF), but only if the obtained segmentation results are correct. Due to image arti...
متن کامل