Tagging a Corpus of Spoken Swedish
Abstract
In this article, we present and evaluate a method for training a statistical part-of-speech tagger on data from written language and then adapting it to the requirements of tagging a corpus of transcribed spoken language, in our case spoken Swedish. This is currently a significant problem for many research groups working with spoken language, since the availability of tagged training data from spoken language is still very limited for most languages. The overall accuracy of the tagger developed for spoken Swedish is quite respectable, varying from 95% to 97% depending on the tagset used. In conclusion, we argue that the method presented here gives good tagging accuracy with relatively little effort.
Similar Resources
Tagging spoken corpus
Spoken languages are more flexible in usage than written languages. Thus, tagging a spoken corpus differs from tagging a traditional written corpus. This paper proposes a new framework for tagging spoken corpora. The framework adopts a written-language tagger to process spoken data, with special consideration of the characteristics of spoken language. Besides, the problems of different tagging sets betw...
The Swedish NICE Corpus – Spoken and embodied characters in a c
This article describes the collection and analysis of a Swedish database of spontaneous and unconstrained children–machine dialogues. The Swedish NICE corpus consists of spoken dialogues between children aged 8 to 15 and embodied fairytale characters in a computer game scenario. Compared to previously collected corpora of children’s computer-directed speech, the Swedish NICE corpus contains ext...
A semantic tagging tool for spoken dialogue corpus
In this paper, we report on our semantic tagging tool for spoken dialogue corpora. The tool can acquire analysis rules from a small-scale training corpus using Transformation-based Learning (TBL). It can learn dialogue act tagging rules and semantic frame tagging rules. In an open test, precision is 72% for dialogue act tagging and 58% for semantic frame tagging.
The Swedish NICE corpus - spoken dialogues between children and embodied characters in a computer game scenario
This article describes the collection and analysis of a Swedish database of spontaneous and unconstrained children–machine dialogues. The Swedish NICE corpus consists of spoken dialogues between children aged 8 to 15 and embodied fairytale characters in a computer game scenario. Compared to previously collected corpora of children’s computer-directed speech, the Swedish NICE corpus contains ext...
Tagging Spoken Language Using Written Language Statistics
This paper reports on two experiments with a probabilistic part-of-speech tagger, trained on a tagged corpus of written Swedish, being used to tag a corpus of (transcribed) spoken Swedish. The results indicate that with very few adaptations an accuracy rate of 85% can be achieved, with an accuracy rate for known words of 90%. In addition, two different treatments of pauses were explored but...