From Spelling Correction to Text Cleaning - Using Context Information

نویسندگان

  • Martin Schierle
  • Sascha Schulz
  • Markus Ackermann
چکیده

Spelling correction is the task of correcting words in texts. Most of the available spelling correction tools only work on isolated words and compute a list of spelling suggestions ranked by edit-distance, letter-n-gram similarity or comparable measures. Although the probability of the best ranked suggestion being correct in the current context is high, user intervention is usually necessary to choose the most appropriate suggestion (Kukich, 1992). Based on preliminary work by Sabsch (2006), we developed an efficient context sensitive spelling correction system dcClean by combining two approaches: the edit distance based ranking of an open source spelling corrector and neighbour cooccurrence statistics computed from a domain specific corpus. In combination with domain specific replacement and abbreviation lists we are able to significantly improve the correction precision compared to edit distance or context based spelling correctors applied on their own.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Integrated Scoring For Spelling Error Correction, Abbreviation Expansion and Case Restoration in Dirty Text

An increasing number of language and speech applications are gearing towards the use of texts from online sources as input. Despite such rise, not much work can be found in the aspect of integrated approaches for cleaning dirty texts from online sources. This paper presents a mechanism of Integrated Scoring for Spelling error correction, Abbreviation expansion and Case restoration (ISSAC). The ...

متن کامل

Context-sensitive Spelling Correction Using Google Web 1T 5-Gram Information

In computing, spell checking is the process of detecting and sometimes providing spelling suggestions for incorrectly spelled words in a text. Basically, a spell checker is a computer program that uses a dictionary of words to perform spell checking. The bigger the dictionary is, the higher is the error detection rate. The fact that spell checkers are based on regular dictionaries, they suffer ...

متن کامل

Enhanced Integrated Scoring for Cleaning Dirty Texts

An increasing number of approaches for ontology engineering from text are gearing towards the use of online sources such as company intranet and the World Wide Web. Despite such rise, not much work can be found in aspects of preprocessing and cleaning dirty texts from online sources. This paper presents an enhancement of an Integrated Scoring for Spelling error correction, Abbreviation expansio...

متن کامل

Design and implementation of Persian spelling detection and correction system based on Semantic

Persian Language has a special feature (grapheme, homophone, and multi-shape clinging characters) in electronic devices. Furthermore, design and implementation of NLP tools for Persian are more challenging than other languages (e.g. English or German). Spelling tools are used widely for editing user texts like emails and text in editors.  Also developing Persian tools will provide Persian progr...

متن کامل

Unsupervised Context-Sensitive Spelling Correction of English and Dutch Clinical Free-Text with Word and Character N-Gram Embeddings

We present an unsupervised context-sensitive spelling correction method for clinical free-text that uses word and character n-gram embeddings. Our method generates misspelling replacement candidates and ranks them according to their semantic fit, by calculating a weighted cosine similarity between the vectorized representation of a candidate and the misspelling context. To tune the parameters o...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2007