Automatic conversion of colloquial Finnish to standard Finnish
Abstract
This paper presents a rule-based method for converting between colloquial Finnish and standard Finnish. The method relies on a small number of orthographical rules combined with a large language model of standard Finnish, which ranks the possible conversions. In addition, the paper presents an evaluation corpus consisting of aligned sentences in colloquial Finnish, orthographically standardised colloquial Finnish and standard Finnish. The method outperforms the baseline of simply treating colloquial Finnish as standard Finnish, but is outperformed by a phrase-based MT system trained on the evaluation corpus. The paper also presents preliminary results that show promise for using normalisation in the machine translation task.
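As a rough illustration of the idea in the abstract (rules propose candidate conversions, a language model of standard Finnish ranks them), here is a minimal Python sketch. The rule table, the bigram counts and all function names are hypothetical and are not taken from the paper. The toy counts merely stand in for the large language model, and the score is a smoothed count sum rather than a normalised probability.

```python
from itertools import product
import math

# Hypothetical rewrite rules: colloquial token -> possible standard forms.
# "mut" is ambiguous ("mutta" = but, "minut" = me), so ranking is needed.
RULES = {
    "mä": ["minä"],
    "sä": ["sinä"],
    "oon": ["olen"],
    "mut": ["mutta", "minut"],
}

# Toy bigram counts standing in for a language model of standard Finnish.
# These numbers are invented for illustration only.
BIGRAM_COUNTS = {
    ("<s>", "minä"): 5,
    ("minä", "olen"): 8,
    ("olen", "täällä"): 3,
    ("täällä", "mutta"): 2,
    ("mutta", "sinä"): 4,
    ("sinä", "näit"): 2,
    ("näit", "minut"): 3,
}


def candidates(token):
    """Return the standard forms proposed by the rules, plus the token itself."""
    return RULES.get(token, []) + [token]


def score(tokens):
    """Sum of log(count + 1) over bigrams; unseen bigrams contribute 0."""
    padded = ["<s>"] + list(tokens)
    return sum(
        math.log(BIGRAM_COUNTS.get((prev, cur), 0) + 1)
        for prev, cur in zip(padded, padded[1:])
    )


def normalise(sentence):
    """Enumerate every combination of candidates and keep the best-scoring one."""
    options = [candidates(tok) for tok in sentence.split()]
    best = max(product(*options), key=score)
    return " ".join(best)


print(normalise("mä oon täällä mut sä näit mut"))
```

Exhaustive enumeration is only workable for short sentences with few ambiguous tokens. A realistic system would more likely search the candidate lattice with beam search or Viterbi decoding.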
Similar resources
Unsupervised Word Categorization Using Self-Organizing Maps and Automatically Extracted Morphs
Automatic creation of syntactic and semantic word categorizations is a challenging problem for highly inflecting languages due to excessive data sparsity. Moreover, the study of colloquial language resources requires the utilization of fully corpus-based tools. We present a completely automated approach for producing word categorizations for morphologically rich languages. Self-Organizing Map (...
Studies on Training Text Selection for Conversational Finnish Language Modeling
Current ASR and MT systems do not operate on conversational Finnish, because training data for colloquial Finnish has not been available. Although speech recognition performance on literary Finnish is already quite good, those systems have very poor baseline performance in conversational speech. Text data for relevant vocabulary and language models can be collected from the Internet, but web da...
CRF-based Diacritisation of Colloquial Arabic for Automatic Speech Recognition
Most of the available resources of colloquial Arabic speech are transcribed without diacritics. Those diacritics provide short vowels and other pronunciation information and by omitting them a considerable amount of ambiguity is introduced. In this paper, we propose the use of an automatic diacritisation method as front-end for training of automatic speech recognition systems of colloquial Arab...
Tone choice in the English intonation of proficient non-native speakers
An experiment is reported in which twelve Finnish test subjects, first-year university students of English, acted a pre-written conversational dialogue representing colloquial English. To obtain baseline data, twelve native speakers of English were recruited to act the same dialogue. The speech data was investigated acoustically in terms of f0. Most of the Finnish test subjects could make, both...
Analysis of Morph-Based Speech Recognition and the Modeling of Out-of-Vocabulary Words Across Languages
We analyze subword-based language models (LMs) in large-vocabulary continuous speech recognition across four “morphologically rich” languages: Finnish, Estonian, Turkish, and Egyptian Colloquial Arabic. By estimating n-gram LMs over sequences of morphs instead of words, better vocabulary coverage and reduced data sparsity are obtained. Standard word LMs suffer from high out-of-vocabulary (OOV) r...
Publication date: 2015