Statistical/Rule-based Hybrid Phrase Break

Author

  • Byeongchang Kim
Abstract

In this paper, we present a new phrase break detection architecture that integrates a probabilistic approach with rule-based error correction. The architecture consists of a probabilistic phrase break detector and a transformational rule-based post error corrector. The probabilistic method alone usually suffers from performance degradation due to inherent data sparseness problems, so we adopted transformational rule-based error correction to overcome these training data limitations. The probabilistic phrase break detector segments the POS sequences into several phrases according to word trigram probabilities. Probabilistic phrase break detection covers only a limited range of contextual information. Moreover, the module cannot selectively consider morpheme tags or the relative distance to other phrase breaks. The morpheme sequence, initially tagged with phrase breaks, is therefore corrected with error-correcting rules. The rules are learned by comparing the correctly tagged phrase break corpus with the output of the probabilistic phrase break detector. Our main contributions include presenting transformational rule-based error correction for phrase break detection and implementing probabilistic phrase break detection as the initial annotator for that error correction. The architecture can provide accurate results even when the phrase break detector has poor initial performance. Moreover, the system can be flexibly tuned to a new corpus without massive retraining.
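The rule-learning step described in the abstract, comparing the initial detector's output with a correctly tagged corpus, can be illustrated with a minimal sketch of generic transformation-based learning. This is not the paper's implementation. The rule template (previous POS tag, current POS tag, current label) is an assumption made for illustration; the abstract indicates that the actual rules also consider morpheme tags and the relative distance to other phrase breaks, but does not give the templates. The initial labels stand in for the output of the probabilistic detector, and all function names are hypothetical.

```python
from collections import Counter

# Label convention (assumed): labels[i] == 1 means a phrase break is placed
# between token i-1 and token i; labels[0] is always 0.


def apply_rule(tags, labels, rule):
    """Apply one rule (prev_tag, cur_tag, src, dst) to a label sequence."""
    prev_tag, cur_tag, src, dst = rule
    out = list(labels)
    for i in range(1, len(tags)):
        if (tags[i - 1], tags[i]) == (prev_tag, cur_tag) and labels[i] == src:
            out[i] = dst
    return out


def learn_rules(corpus, max_rules=10):
    """Greedily learn correction rules.

    corpus: list of (pos_tags, initial_labels, gold_labels) triples, where
    initial_labels would come from the probabilistic detector.
    """
    current = [list(init) for _, init, _ in corpus]
    rules = []
    for _ in range(max_rules):
        fixes, damage = Counter(), Counter()
        for (tags, _, gold), cur in zip(corpus, current):
            for i in range(1, len(tags)):
                key = (tags[i - 1], tags[i], cur[i])
                if cur[i] == gold[i]:
                    damage[key] += 1  # flipping here would break a correct label
                else:
                    fixes[key] += 1   # flipping here would repair an error
        if not fixes:
            break  # no remaining errors
        best = max(fixes, key=lambda k: fixes[k] - damage[k])
        if fixes[best] - damage[best] <= 0:
            break  # no rule yields a net improvement
        rule = (best[0], best[1], best[2], 1 - best[2])
        rules.append(rule)
        current = [apply_rule(tags, cur, rule)
                   for (tags, _, _), cur in zip(corpus, current)]
    return rules


def correct(tags, initial_labels, rules):
    """Post-correct the detector's labels by applying the rules in learned order."""
    labels = list(initial_labels)
    for rule in rules:
        labels = apply_rule(tags, labels, rule)
    return labels
```

Because the rules are learned from the detector's own errors, a weak initial annotator simply yields more corrections to learn, which is consistent with the abstract's claim that the architecture remains accurate when initial performance is poor.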


Similar resources

The Universitat d'Alacant hybrid machine translation system for WMT 2011

This paper describes the machine translation (MT) system developed by the Transducens Research Group, from Universitat d’Alacant, Spain, for the WMT 2011 shared translation task. We submitted a hybrid system for the Spanish–English language pair consisting of a phrase-based statistical MT system whose phrase table was enriched with bilingual phrase pairs matching transfer rules and dictionary e...


TCtract-A Collocation Extraction Approach for Noun Phrases Using Shallow Parsing Rules and Statistic Models

This paper presents a hybrid method for extracting Chinese noun phrase collocations that combines a statistical model with rule-based linguistic knowledge. The algorithm first extracts all the noun phrase collocations from a shallow parsed corpus by using syntactic knowledge in the form of phrase rules. It then removes pseudo collocations by using a set of statistic-based association measures (...


The UA-Prompsit hybrid machine translation system for the 2014 Workshop on Statistical Machine Translation

This paper describes the system jointly developed by members of the Departament de Llenguatges i Sistemes Informàtics at Universitat d’Alacant and the Prompsit Language Engineering company for the shared translation task of the 2014 Workshop on Statistical Machine Translation. We present a phrase-based statistical machine translation system whose phrase table is enriched with information obtain...


A Hybrid Approach Using Phrases and Rules for Hindi to English Machine Translation

The present work focuses on developing a hybrid approach for developing a machine translation (MT) scheme for automatic translation of Hindi sentences to English. Development of machine translation (MT) systems for Indian languages to English almost invariably suffers from the limited availability of linguistic resources. As a consequence, statistical, rule-based or example-based approaches hav...


Integrating Rules and Dictionaries from Shallow-Transfer Machine Translation into Phrase-Based Statistical Machine Translation

We describe a hybridisation strategy whose objective is to integrate linguistic resources from shallow-transfer rule-based machine translation (RBMT) into phrase-based statistical machine translation (PBSMT). It basically consists of enriching the phrase table of a PBSMT system with bilingual phrase pairs matching transfer rules and dictionary entries from a shallow-transfer RBMT system. This n...



Publication date: 1999