Corpus-Driven Knowledge Acquisition for Discourse Analysis
Abstract
The availability of large on-line text corpora provides a natural and promising bridge between the worlds of natural language processing (NLP) and machine learning (ML). In recent years, the NLP community has been aggressively investigating statistical techniques to drive part-of-speech taggers, but application-specific text corpora can be used to drive knowledge acquisition at much higher levels as well. In this paper we will show how ML techniques can be used to support knowledge acquisition for information extraction systems. It is often very difficult to specify an explicit domain model for many information extraction applications, and it is always labor intensive to implement hand-coded heuristics for each new domain. We have discovered that it is nevertheless possible to use ML algorithms in order to capture knowledge that is only implicitly present in a representative text corpus. Our work addresses issues traditionally associated with discourse analysis and intersentential inference generation, and demonstrates the utility of ML algorithms at this higher level of language analysis. The benefits of our work address the portability and scalability of information extraction (IE) technologies. When hand-coded heuristics are used to manage discourse analysis in an information extraction system, months of programming effort are easily needed to port a successful IE system to a new domain. We will show how ML algorithms can reduce this development time to a few days of automated corpus analysis without any resulting degradation of overall system
Similar resources
The Effect of Corpus Size on Case Frame Acquisition for Discourse Analysis
This paper reports the effect of corpus size on case frame acquisition for discourse analysis in Japanese. For this study, we collected a Japanese corpus consisting of up to 100 billion words, and constructed case frames from corpora of six different sizes. Then, we applied these case frames to syntactic and case structure analysis, and zero anaphora resolution. We obtained better results by us...
Inducing Clause-Combining Rules: A Case Study with the SPaRKy Restaurant Corpus
We describe an algorithm for inducing clause-combining rules for use in a traditional natural language generation architecture. An experiment pairing lexicalized text plans from the SPaRKy Restaurant Corpus with logical forms obtained by parsing the corresponding sentences demonstrates that the approach is able to learn clause-combining operations which have essentially the same coverage as tho...
A Conversation Analysis of Ellipsis and Substitution in Global Business English Textbooks
Despite the body of research on textbook evaluation from the discourse analysis perspective, cohesive devices have rarely been analyzed in English for Specific Purposes (ESP) textbooks. The acquisition and use of cohesive devices is inherent to naturalistic communication, including business interactions. Hence, L2 learners of business English should be exposed to these devices through cohesion-...
The Influence of Data-Driven Exercises Through Using a Computer Program on Vocabulary Improvement in an EFL Context
The present study was conducted to evaluate data driven learning (DDL) combined with Computer Assisted Language Learning (CALL) as an approach to improving vocabulary knowledge of Iranian postgraduates majoring in teaching English, English literature and translation. The purpose was to help language learners get familiar with DDL as a student-centered method taking advantage of a computer progr...