Statistical denormalization for Arabic text

نویسندگان

  • Mohammed Moussa
  • Mohammed Fakhr
  • Kareem Darwish
چکیده

In this paper, we focus on a sub-problem of Arabic text error correction, namely Arabic Text Denormalization. Text Denormalization is considered an important post-processing step when performing machine translation into Arabic. We examine different approaches for denormalization via the use of language modeling, stemming, and sequence labeling. We show the effectiveness of different approaches and how they can be combined to attain better results. We perform intrinsic evaluation as well as extrinsic evaluation in the context of machine translation.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Techniques for Arabic Morphological Detokenization and Orthographic Denormalization

The common wisdom in the field of Natural Language Processing (NLP) is that orthographic normalization and morphological tokenization help in many NLP applications for morphologically rich languages like Arabic. However, when Arabic is the target output, it should be properly detokenized and orthographically correct. We examine a set of six detokenization techniques over various tokenization sc...

متن کامل

High capacity steganography tool for Arabic text using 'Kashida'

Steganography is the ability to hide secret information in a cover-media such as sound, pictures and text. A new approach is proposed to hide a secret into Arabic text cover media using "Kashida", an Arabic extension character. The proposed approach is an attempt to maximize the use of "Kashida" to hide more information in Arabic text cover-media. To approach this, some algorithms have been des...

متن کامل

Arabic News Articles Classification Using Vectorized-Cosine Based on Seed Documents

Besides for its own merits, text classification (TC) has become a cornerstone in many applications. Work presented here is part of and a pre-requisite for a project we have overtaken to create a corpus for the Arabic text process. It is an attempt to create modules automatically that would help speed up the process of classification for any text categorization task. It also serves as a tool for...

متن کامل

Using Machine Learning Algorithms for Automatic Cyber Bullying Detection in Arabic Social Media

Social media allows people interact to express their thoughts or feelings about different subjects. However, some of users may write offensive twits to other via social media which known as cyber bullying. Successful prevention depends on automatically detecting malicious messages. Automatic detection of bullying in the text of social media by analyzing the text "twits" via one of the machine l...

متن کامل

A Fault-Tolerant Three-Way Merge for XML and HTML

Three-way merging is a technique that is used to reintegrate changes to a document when multiple independently modified copies have been made. Tools for three-way merge of ASCII text files exist in the form of the ubiquitous diff and patch tools, but these are of limited applicability when parts of the documents have been rearranged. Our fault-tolerant three-way merge for XML and HTML was desig...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2012