Several measures for selecting suitable speech corpora

Authors

  • Shuichi Itahashi
  • Naoko Ueda
  • Mikio Yamamoto
Abstract

We carry out statistical investigations of various speech corpora to extract useful information reflecting the contents of each corpus, with the aim of creating guidelines for selecting the most suitable corpus. Words are not separated by spaces in Japanese text. Accordingly, we adopt n-gram counting methods to extract frequent mora sequences instead of words. A mora roughly corresponds to a syllable. By investigating the frequencies of 1- to 10-mora sequences in six existing corpora, we can find distinctions between written and spoken language, as well as keywords and topics of dialogues. This paper shows that simple statistical investigation makes it possible to represent the contents of a corpus to some extent without complicated processing such as morphological analysis.
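The n-gram counting idea in the abstract can be illustrated with a minimal sketch. It assumes the text has already been segmented into morae (for example from a kana transcription); the function name `count_mora_ngrams` and the sample input are hypothetical and are not taken from the paper, whose actual procedure is not given here.

```python
from collections import Counter


def count_mora_ngrams(morae, max_n=10):
    """Count every contiguous mora sequence of length 1..max_n.

    `morae` is a list of mora strings (assumed pre-segmented).
    Returns a dict mapping n to a Counter of n-mora tuples.
    """
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for n in range(1, max_n + 1):
        # Slide a window of width n over the mora sequence.
        for i in range(len(morae) - n + 1):
            counts[n][tuple(morae[i:i + n])] += 1
    return counts


# Hypothetical input: "Tokyo" (to-u-kyo-u) repeated twice.
sample = ["to", "u", "kyo", "u", "to", "u", "kyo", "u"]
ngrams = count_mora_ngrams(sample, max_n=4)

# The most frequent 4-mora sequence is a candidate keyword.
print(ngrams[4].most_common(1))
```

Ranking the most frequent sequences at each length is one plausible way to surface keyword candidates without morphological analysis; comparing such rankings across corpora would be a separate step not shown here.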


Similar resources

CorpVis: An Online Emotional Speech Corpora Visualisation Interface

Our research in emotional speech analysis has led to the construction of several dedicated high quality, online corpora of natural emotional speech assets. The requirements for querying, retrieval and organization of assets based on both their metadata descriptors and their analysis data led to the construction of a suitable interface for data visualization and corpus management. The CorpVis in...


The EASR Corpora of European Portuguese, French, Hungarian and Polish Elderly Speech

Currently available speech recognisers do not usually work well with elderly speech. This is because several characteristics of speech (e.g. fundamental frequency, jitter, shimmer and harmonic noise ratio) change with age and because the acoustic models used by speech recognisers are typically trained with speech collected from younger adults only. To develop speech-driven applications capable ...


An analytical approach to similarity measure selection for self-training

We present a framework for investigating properties of similarity measures as a criterion for selecting the best-suited measure for a specific task, in this paper: corpus selection for self-training. We focus on the squared Pearson’s correlation coefficient as the property to rank similarity measures. Self-training is an unsupervised domain adaptation technique, in which three corpora are involv...


Automatic generation of phonetic transcriptions for large speech corpora

We describe a method for the automatic production of phonetic transcriptions in large speech corpora. First, we focus on the application of different techniques for the generation of pronunciation variants. Then, we explain the application of a speech recognition system for selecting the acoustically best matching phonetic transcription. The system is evaluated on different test sets selected f...


Measuring the homogeneity and similarity of language corpora

Corpus-based methods are now dominant in Natural Language Processing (NLP). Creating big corpora is no longer difficult and the technology to analyze them is growing faster, more robust and more accurate. However, when an NLP application performs well on one corpus, it is unclear whether this level of performance would be maintained on others. To make progress on these questions, we need metho...




Publication date: 1997