HKUST/MTS: A Very Large Scale Mandarin Telephone Speech Corpus

Authors

Yi Liu

Pascale Fung

Yongsheng Yang

Christopher Cieri

Shudong Huang

David Graff

Abstract

This paper describes the design, collection, transcription and analysis of the HKUST Mandarin Telephone Speech Corpus (HKUST/MTS), 200 hours of speech from over 2100 Mandarin speakers in mainland China, collected under the DARPA EARS framework. The corpus includes speech data, transcriptions and speaker demographic information. The speech data comprise 1206 ten-minute natural Mandarin conversations between either strangers or friends. Each conversation focuses on a single topic. All calls are recorded over public telephone networks and manually annotated with standard Chinese characters (GBK) as well as specific mark-ups for spontaneous speech. A file with speaker demographic information is also provided. The corpus is the largest and first of its kind for Mandarin conversational telephone speech, providing abundant and diversified samples for Mandarin speech recognition and other application-dependent tasks such as topic detection, information retrieval, keyword spotting and speaker recognition. In a 2004 evaluation by NIST, the corpus was found to improve system performance significantly.

Similar resources

A Very Large Scale Mandarin Chinese Broadcast Collection for the GALE Program

In this paper, we present the design, collection, transcription and analysis of a Mandarin Chinese Broadcast Collection of over 3000 hours. The data was collected by Hong Kong University of Science and Technology (HKUST) in China on a cable TV and satellite transmission platform established in support of the DARPA Global Autonomous Language Exploitation (GALE) program. The collection includes b...

MAT - A Project to Collect Mandarin Speech Data Through Telephone Networks in Taiwan

A cooperative project, called Polyphone, was initiated by the Coordinating Committee on Speech Databases and Speech I/O Systems Assessment (COCOSDA) in 1992. Accordingly, a project to collect Mandarin speech data across Taiwan (MAT) was conducted by a group of researchers from several universities and research organizations in Taiwan. The purpose was to generate a speech corpus for the developm...

A Mandarin-English Code-Switching Corpus

Generally, existing monolingual corpora are not suitable for large vocabulary continuous speech recognition (LVCSR) of code-switching speech. The motivation of this paper is to study the rules and constraints that code-switching follows and to design a corpus for the code-switching LVCSR task. This paper presents the development of a Mandarin-English code-switching corpus. This corpus consists of four pa...

Statistical Analysis of Mandarin Acoustic Units and Automatic Extraction of Phonetically Rich Sentences Based Upon a Very Large Chinese Text Corpus

Automatic speech recognition by computers can provide humans with the most convenient method to communicate with computers. Because the Chinese language is not alphabetic and input of Chinese characters into computers is very difficult, Mandarin speech recognition is very highly desired. Recently, high performance speech recognition systems have begun to emerge from research institutes. However...

2000 NIST Evaluation of Conversational Speech Recognition over the Telephone: English and Mandarin Performance Results

This paper documents the use of conversational telephone speech test materials in the NIST coordinated evaluation conducted early in 2000. The primary evaluation was of General American English speech, but a subsidiary evaluation of Mandarin speech was also offered. The primary test data consisted of twenty conversations collected for the original Switchboard Corpus but not released with the pu...

Publication date: 2006
