Information Explication from Computer-Transcribed Conversations: The Weighted Markov Classifier
Abstract
In this research, I attempt to extract information from computer-transcribed conversations. Leveraging the SDK of a sophisticated commercial speech recognition system, I have created a classifier that weights words based on the confidence factor output by the recognition engine, and maximizes the probability of the word frequency using both zeroth- and first-order Markov models. I demonstrate the success of this algorithm when applied to location classification, even across different speakers. While location certainty cannot be achieved from this data alone, given enough labeled training sets, a type of 'situational awareness' can be developed. Applications for such awareness could potentially include topic spotting, conversation mediation, and virtual memory augmentation. Finally, I discuss other potential impacts of this technology within the realm of ubiquitous computing.

Background

The accuracy of commercial voice recognition systems has reached a point where almost half of all words spoken by a single individual can be correctly recognized during most conversations. Extracting information from gigabytes of transcribed conversation data is a job for an appropriate machine learning algorithm. However, analysis of a sequence becomes more difficult when its elements are drawn from an unbounded set. In Problem Set 4 we looked at character frequency and took data from 27 possible values. Words, however, are in principle unbounded, because there is always a non-zero probability of encountering a word never seen before. Despite this fact, this project is able to take advantage of limitations in the vocabulary of both the speaker and the speech recognition engine to condense a working vocabulary into a set of approximately 5,000 words.

Data Gathering

I collected data for one month, between November and December 2001. Both at home and at the lab, I installed a wireless network to support streaming audio directly to my computer.
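The paper itself contains no code; as a hypothetical sketch of the confidence-weighted zeroth-order model described in the abstract, one might weight each word count by the engine's confidence score and score each location class by smoothed log-likelihood. All function and variable names below are my own, and the rescaling of the confidence factor to [0, 1] is an assumption:

```python
import math
from collections import defaultdict

def train(transcripts):
    """transcripts: {location: [(word, confidence), ...]}.
    Accumulate confidence-weighted word counts per location.
    Confidence is assumed rescaled from [-100, 100] to [0, 1]."""
    model = {}
    for loc, words in transcripts.items():
        counts = defaultdict(float)
        total = 0.0
        for word, conf in words:
            w = (conf + 100) / 200.0      # rescale to [0, 1]
            counts[word] += w
            total += w
        model[loc] = (counts, total)
    return model

def classify(model, words, vocab_size=5000):
    """Pick the location maximizing the confidence-weighted
    log-likelihood, with add-one smoothing over the vocabulary."""
    best_loc, best_score = None, float("-inf")
    for loc, (counts, total) in model.items():
        score = 0.0
        for word, conf in words:
            w = (conf + 100) / 200.0
            p = (counts[word] + 1.0) / (total + vocab_size)
            score += w * math.log(p)
        if score > best_score:
            best_loc, best_score = loc, score
    return best_loc
```

Low-confidence words thus contribute little to either training counts or the classification score, which dampens the effect of the engine's roughly 50% recognition errors.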
I wore a wireless microphone during most of the day and, after informing people of the purpose of the apparatus, began to record many of my daily conversations. I captured data in two ways. For the first fifteen days, the audio data was sent directly into a recognition engine and transcribed in real time [Pereira, F.; Singer, Y.; Tishby, N., Beyond Word N-Grams]. During the latter portion of the data gathering process, I captured my conversations as audio files for later processing. Saving the raw audio data enabled me to gather more information about the data, such as word confidence factor and time stamp. This richer set of data was later used as my test set. After each conversation I saved the audio file with a "label name" including place, people involved, and subject matter (e.g., lab.vish.vikram.mla.wav or home.liz.dinner.wav). During that month I recorded over 30 hours of my daily conversations and transcribed approximately 30,000 words.

Enabling Software and Hardware and Additional Data

TCL Interface to the ViaVoice Recognition Engine

Leveraging the sophisticated algorithms behind a commercial speech recognition system, a TCL script was written to interface with the ViaVoice SDK. Recorded conversations in a 22 kHz, 16-bit .wav file format were placed in a select directory, where they were input into the script. The script's output corresponds to the transcribed words from the ViaVoice recognition engine. Along with each word is a time stamp marking the beginning of the utterance and a recognition confidence factor ranging from -100 to 100. The three data types in each output data file (transcribed word, timestamp, and confidence factor) are then input into a Matlab program that parses the data into a 3xn cell array.

Wireless Microphone Transmitters and Receivers

AudioTechnica wireless microphones and receivers were used for all of the data collection. The settings on each microphone were adjusted manually, and the AF gain was tuned for optimal sound quality.
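The parsing step described above, reading the script's (word, timestamp, confidence) triples into a 3xn structure, can be sketched in outline. The whitespace-delimited line format assumed here is my own; the actual TCL script output may differ:

```python
def parse_transcript(lines):
    """Parse recognition output into three parallel lists of
    words, timestamps, and confidence factors -- the analogue of
    the 3xn cell array described above. The 'word timestamp
    confidence' line format is an assumption."""
    words, times, confs = [], [], []
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue                      # skip malformed lines
        word, t, c = fields
        words.append(word)
        times.append(float(t))            # utterance start time
        confs.append(int(c))              # confidence, -100 to 100
    return words, times, confs
```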
The two receivers were set up in my office and at home. They input raw sound into either the recognition engine or a sound editor that converted the file to the correct data type (22 kHz, 16-bit, mono) for later processing.

Third-Person Audio Data

For additional training data, I used 10 hours of labeled audio data gathered in similar environments with the help of Brian Clarkson [enabled by Geva Patz].

Discussion of Possible Classification Models: Prediction Suffix Trees (PSTs)

Modeling longer-range regularities is quite computationally intensive due to the size explosion caused by model order. Indeed, each bi-gram model used in this paper, built from a simple bounded 5000-word vocabulary, takes well over 150 MB of memory; an uncompressed tri-gram would need almost 2 GB of RAM. These computational constraints prevent traditional classification methods from capturing even relatively local dependencies that exceed the model order. However, a prediction suffix tree (PST) has nodes that store suffixes of previous inputs and can provide a probability distribution over possible subsequent words in a computationally efficient manner. This ability makes PSTs a good model choice for many text classification problems. Unfortunately, even with advanced features such as 'wildcards' that allow a particular word position to be ignored in a prediction, PSTs are not the best candidate for modeling the output of today's voice recognition engines. The engine's output has a word accuracy of approximately 50%, and when the engine is incorrect, it often transcribes one word as several, or vice versa. Longer streams of data are therefore unlikely to be reproduced consistently by the engine. In addition, the grammar model discovered by the PST would likely be more highly correlated with the language model of the speech recognition engine than with the grammar actually spoken. Hence the PST model is not appropriate for this application.
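For intuition on the memory figures above: a dense transition table over a 5000-word vocabulary already holds 5000 x 5000 = 25 million cells (200 MB as 8-byte floats), which motivates storing only observed word pairs. A sparse first-order (bigram) Markov model of the kind the paper favors over PSTs might be sketched as follows; names and the add-one smoothing choice are my own:

```python
import math
from collections import defaultdict

def train_bigrams(words):
    """Count observed (previous word -> next word) transitions,
    storing only pairs that actually occur in the transcript."""
    trans = defaultdict(lambda: defaultdict(int))
    for prev, cur in zip(words, words[1:]):
        trans[prev][cur] += 1
    return trans

def sequence_logprob(trans, words, vocab_size=5000):
    """First-order Markov log-probability of a word sequence,
    with add-one smoothing over the ~5000-word vocabulary."""
    logp = 0.0
    for prev, cur in zip(words, words[1:]):
        row = trans.get(prev, {})
        total = sum(row.values())
        logp += math.log((row.get(cur, 0) + 1.0) / (total + vocab_size))
    return logp
```

For location classification, one such model would be trained per location label, and a test transcript assigned to the label whose model gives it the highest log-probability, optionally combined with the confidence-weighted zeroth-order score.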