Beyond Normalization: Pragmatics of Word Form in Text Messages

نویسندگان

  • Tyler Baldwin
  • Joyce Yue Chai
چکیده

Non-standard spellings in text messages often convey extra pragmatic information not found in the standard word form. However, text message normalization systems that transform non-standard text message spellings to standard form tend to ignore this information. To address this problem, this paper examines the types of extra pragmatic information that are conveyed by non-standard word forms. Empirical analysis of our data shows that 40% of non-standard word forms contain emotional information not found in the standard form, and 38% contain additional emphasis. This extra information can be important to downstream applications such as text-to-speech synthesis. We further investigated the automatic detection of non-standard forms that display additional information. Our empirical results show that character level features can provide important cues for such detection.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Normalizing Microtext

The use of computer mediated communication has resulted in a new form of written text—Microtext—which is very different from well-written text. Tweets and SMS messages, which have limited length and may contain misspellings, slang, or abbreviations, are two typical examples of microtext. Microtext poses new challenges to standard natural language processing tools which are usually designed for ...

متن کامل

Linguistic Features of English Textese and Digitalk of Iranian EFL Students

This study aimed at investigating the English textese of Iranian EFL learners by scrutinizing the linguistic features through a qualitative design. In doing so, 700 messages were collected from 43 MA Iranian EFL learners of both genders. The features were categorized and analyzed calculating the frequency and percentage. The findings of the study showed that Iranian EFL students used different ...

متن کامل

Normalization of Dutch User-Generated Content

This paper describes a phrase-based machine translation approach to normalize Dutch user-generated content (UGC). We compiled a corpus of three different social media genres (text messages, message board posts and tweets) to have a sample of this recent domain. We describe the various characteristics of this noisy text material and explain how it has been manually normalized using newly develop...

متن کامل

Morphological Disambiguation and Text Normalization for Southern Quechua Varieties

We built a pipeline to normalize Quechua texts through morphological analysis and disambiguation. Word forms are analyzed by a set of cascaded finite state transducers which split the words and rewrite the morphemes to a normalized form. However, some of these morphemes, or rather morpheme combinations, are ambiguous, which may affect the normalization. For this reason, we disambiguate the morp...

متن کامل

Influence of Word Normalization on Text Classification

In this paper we focus our attention on the comparison of various lemmatization and stemming algorithms, which are often used in nature language processing (NLP). Sometimes these two techniques are considered to be identical, but there is an important difference. Lemmatization is generally more utilizable, because it produces the basic word form which is required in many application areas (i.e....

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2011