Developing an SDR Test Collection from Japanese Lecture Audio Data
Abstract
The lecture is one of the most valuable genres of audiovisual data. However, spoken lectures are hard to reuse because they are difficult to browse and to search efficiently. To promote research on spoken lecture retrieval, this paper reports a test collection for its evaluation. The test collection consists of target spoken documents drawn from about 2,700 lectures (604 hours) in the Corpus of Spontaneous Japanese (CSJ), 39 retrieval queries, the relevant passages in the target documents for each query, and automatic transcriptions of the target speech data. We report the retrieval performance obtained on the constructed test collection with a standard spoken document retrieval (SDR) method, which serves as a baseline for forthcoming SDR studies that use the collection. We also introduce several studies conducted by users of the test collection.
Similar resources
Construction of a Test Collection for Spoken Document Retrieval from Lecture Audio Data
The lecture is one of the most valuable genres of audiovisual data. Though spoken document processing is a promising technology for utilizing the lecture in various ways, it is difficult to evaluate because the evaluation requires a subjective judgment and/or the verification of large quantities of evaluation data. In this paper, a test collection for the evaluation of spoken lecture retrieval i...
Test Collections for Spoken Document Retrieval from Lecture Audio Data
The Spoken Document Processing Working Group, which is part of the special interest group of spoken language processing of the Information Processing Society of Japan, is developing a test collection for evaluation of spoken document retrieval systems. A prototype of the test collection consists of a set of textual queries, relevant segment lists, and transcriptions by an automatic speech recog...
Constructing Japanese test collections for spoken term detection
Spoken Document Retrieval (SDR) and Spoken Term Detection (STD) have been two of the most intensively investigated topics in spoken document processing research since the establishment of the SDR and STD test collections by the Text REtrieval Conference (TREC) and NIST. Because Japanese spoken document processing researchers also require such test collections for SDR and STD, we have es...
The TREC Spoken Document Retrieval Track: A Success Story
This paper describes work within the NIST Text REtrieval Conference (TREC) over the last three years in designing and implementing evaluations of Spoken Document Retrieval (SDR) technology within a broadcast news domain. SDR involves the search and retrieval of excerpts from spoken audio recordings using a combination of automatic speech recognition and information retrieval technologies. The T...
Unsupervised topic adaptation for lecture speech retrieval
We are developing a cross-media information retrieval system, in which users can view specific segments of lecture videos by submitting text queries. To produce a text index, the audio track is extracted from a lecture video and a transcription is generated by automatic speech recognition. In this paper, to improve the quality of our retrieval system, we extensively investigate the effects of a...