HKUST/MTS: A Very Large Scale Mandarin Telephone Speech Corpus
نویسندگان
چکیده
The paper describes the design, collection, transcription and analysis of 200 hours of HKUST Mandarin Telephone Speech Corpus (HKUST/MTS) from over 2100 Mandarin speakers in mainland China under the DARPA EARS framework. The corpus includes speech data, transcriptions and speaker demographic information. The speech data include 1206 ten-minute natural Mandarin conversations between either strangers or friends. Each conversation focuses on a single topic. All calls are recorded over public telephone networks. All calls are manually annotated with standard Chinese characters (GBK) as well as specific mark-ups for spontaneous speech. A file with speaker demographic information is also provided. The corpus is the largest and first of its kind for Mandarin conversational telephone speech, providing abundant and diversified samples for Mandarin speech recognition and other applicationdependent tasks, such as topic detection, information retrieval, keyword spotting, speaker recognition, etc. In a 2004 evaluation test by NIST, the corpus is found to improve system performance quite significantly.
منابع مشابه
A Very Large Scale Mandarin Chinese Broadcast Collection for the GALE Program
In this paper, we present the design, collection, transcription and analysis of a Mandarin Chinese Broadcast Collection of over 3000 hours. The data was collected by Hong Kong University of Science and Technology (HKUST) in China on a cable TV and satellite transmission platform established in support of the DARPA Global Autonomous Language Exploitation (GALE) program. The collection includes b...
متن کاملMAT - A Project to Collect Mandarin Speech Data Through Telephone Net works in Taiwan
A cooperative project, called Polyphone, was initiated by the Coordinating Committee on Speech Databases and Speech I/O Systems Assessment (COCOSDA) in 1992. Accordingly, a project to collect Mandarin speech data across Taiwan (MAT) was conducted by a group of researchers from several universities and research organizations in Taiwan. The purpose was to generate a speech corpus for the developm...
متن کاملA Mandarin-English Code-Switching Corpus
Generally the existing monolingual corpora are not suitable for large vocabulary continuous speech recognition (LVCSR) of codeswitching speech. The motivation of this paper is to study the rules and constraints code-switching follows and design a corpus for code-switching LVCSR task. This paper presents the development of a Mandarin-English code-switching corpus. This corpus consists of four pa...
متن کاملStatistical Analysis of Mandarin Acoustic Units and Automatic Extraction of Phonetically Rich Sentences Based Upon a very Large Chinese Text Corpus
Automatic speech recognition by computers can provide humans with the most convenient method to communicate with computers. Because the Chinese language is not alphabetic and input of Chinese characters into computers is very difficult, Mandarin speech recognition is very highly desired. Recently, high performance speech recognition systems have begun to emerge from research institutes. However...
متن کامل2000 Nist Evaluation of Conversational Speech Recognition over the Telephone: English and Mandarin Performance Results
This paper documents the use of conversational telephone speech test materials in the NIST coordinated evaluation conducted early in 2000. The primary evaluation was of General American English speech, but a subsidiary evaluation of Mandarin speech was also offered. The primary test data consisted of twenty conversations collected for the original Switchboard Corpus but not released with the pu...
متن کامل