COSY-MATS: An Intelligent and Scalable Summarisation Shell
[Figure 4: The Architecture of COSY-MATS. Labels recoverable from the diagram: generic and application-specific symbolic analysers (lexical, morphological, syntactic, semantic, pragmatic); blackboards; surface, intermediary and pragmatic ANN encoders; domain ANNs; the pragmatic content selection ANN, which outputs a list of important propositions; generic and application-specific synthesisers; summary.]

3 A Scalable Architecture for Intelligent Summarisation

Once 'universal' content selection features had been identified, along with some of the ways in which they interact, the following architecture was designed for a full-scale implementation of the COSY-MATS summarisation shell (Fig. 4) (Aretoulaki, 1996). Every sentence in the text to be summarised [1] is first processed by a cluster of standard symbolic analysers: morphological, syntactic, semantic and pragmatic. The result of this processing is the evaluation of a set of basic linguistic and extralinguistic features that provide the input for a cascade of low and higher-level Artificial Neural Networks (ANNs), each responsible for specific subtasks. The low-level ANNs map linguistic features (surface and intermediary) into extralinguistic features (intermediary and pragmatic). The pragmatic features provide the input to the highest-level content selection ANN, which ultimately determines the relative degree of importance of each sentence. This latter ANN is also the only component of COSY-MATS that has been implemented to date. Finally, the sentences selected as important during the content selection phase will be used as the basis for generating either a comprehensive summary or a more concise abstract (Aretoulaki, 1996).
This processing will take place in another cluster of symbolic processors, almost symmetric to that used for text analysis and interpretation. It is here that the planning and the actual synthesis of the summary/abstract will be realised. However, it is important to note that the output list of the best-scoring sentences produced by the content selection ANN can also be used to provide a draft summary, i.e. a concatenation of already-existing sentences instead of an original text (cf. Kupiec et al., 1995). This is also the only type of generation with which this research is currently concerned (cf. Section 4.1).

[1] The text is assumed to be integral and coherent, rather than a random collection of propositions.

Despite the dominance of the generic modules therein, COSY-MATS does provide for the incorporation of application-specific information. First of all, the architecture is highly modular, so that new specialised processors can (in principle) simply be plugged in. The simplicity of the interface between the various modules means that new modules, whether symbolic or connectionist, can be accommodated equally well. For example, in addition to the existing lower-level ANNs, other ANNs can easily be incorporated which have been trained to recognise specific keywords and structural phrases that differentiate one domain or text type from another in expressing the same rhetorical and pragmatic functions. Hence, COSY-MATS can function as a shell for building specialised summarisers.

As regards the front-end symbolic analysers, the processing that takes place therein will be dictated by the type of data that needs to be computed in the ANNs. The latter computation, in turn, will be based on the identified generic and application-specific mappings across the three levels of description: the pragmatic, the intermediary and the surface (Section 2).
In addition, it is the implementation of the content selection ANN that will determine the eventual type and number of pragmatic features required for the whole process of summarisation (Section 4). As a result, only a partial analysis and interpretation of the input text needs to be performed in COSY-MATS. The problem of combinatorial explosion and inefficient computation in the search for a solution, which is common in NLU-based systems, will thus be largely avoided. At the same time, this pragmatism in the analysis and interpretation processes does not decrease the amount of deep processing (semantic, discourse and pragmatic) carried out in the system. High-level processing is salient in the pragmatic features identified. These are, nonetheless, 'grounded' by means of the generic lower-level features, as well as other surface and semantic characteristics of texts pertaining to the specific application of interest.

In summary, the proposed architecture is both modular and hybrid. The complex task of content selection is systematically decomposed into much more manageable computations. In addition, the strong points of both symbolic and connectionist processing are combined in a complementary way (cf. Aretoulaki, 1996). The symbolic analysers can work with structured data of arbitrary length laden with variables. They also have powerful symbol-matching facilities (as is appropriate for lower-level text analysis). In contrast, the ANNs are able to deal with fuzzy and inexact processing (as is involved in importance determination and interlevel feature mappings) (McClelland and Rumelhart, 1986; Rumelhart and McClelland, 1986).

4 Empirical Evidence

As the first and most crucial step in implementing COSY-MATS, a prototype of its content selection ANN was developed. This is a standard feed-forward back-propagation network (Rumelhart et al., 1986).
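
As an illustration of the kind of network just described, the following is a minimal sketch of a feed-forward network trained by back-propagation, written in plain Python. It is not the COSY-MATS prototype: the single hidden layer, its size, the learning rate, the sigmoid units, the squared-error loss and the toy training data are all assumptions made for the example. Only the input width of 24 pragmatic features is taken from the paper (Section 4.1). The toy target marks a sentence as important only when two features co-occur, to mirror the assumption that feature combinations rather than individual features characterise importance.

```python
import math
import random

N_FEATURES = 24  # pragmatic features per sentence (figure from Section 4.1)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class ContentSelectionNet:
    """Hypothetical one-hidden-layer network; all sizes are assumptions."""

    def __init__(self, n_in=N_FEATURES, n_hidden=8, seed=0):
        rng = random.Random(seed)
        # Each weight row carries one extra entry for the bias input.
        self.w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
                         for _ in range(n_hidden)]
        self.w_out = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]

    def forward(self, features):
        x = list(features) + [1.0]  # append the bias input
        hidden = [sigmoid(sum(w * v for w, v in zip(row, x)))
                  for row in self.w_hidden]
        h = hidden + [1.0]  # append the bias unit
        score = sigmoid(sum(w * v for w, v in zip(self.w_out, h)))
        return x, h, score

    def importance(self, features):
        """Degree of importance in (0, 1) for one encoded sentence."""
        return self.forward(features)[2]

    def train_step(self, features, target, lr=0.5):
        """One back-propagation update; returns the squared-error loss."""
        x, h, score = self.forward(features)
        delta_out = (score - target) * score * (1.0 - score)
        # Hidden deltas use the output weights before they are updated.
        delta_hidden = [delta_out * self.w_out[j] * h[j] * (1.0 - h[j])
                        for j in range(len(self.w_hidden))]
        for j in range(len(self.w_out)):
            self.w_out[j] -= lr * delta_out * h[j]
        for j, row in enumerate(self.w_hidden):
            for i in range(len(row)):
                row[i] -= lr * delta_hidden[j] * x[i]
        return 0.5 * (score - target) ** 2


def toy_corpus(n=200, seed=1):
    """Invented data: a sentence is important iff features 0 and 1 co-occur."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        f = [float(rng.random() < 0.5) for _ in range(N_FEATURES)]
        data.append((f, 1.0 if f[0] and f[1] else 0.0))
    return data


def train(net, data, epochs=30):
    """Return the mean loss of each pass over the data."""
    losses = []
    for _ in range(epochs):
        losses.append(sum(net.train_step(f, t) for f, t in data) / len(data))
    return losses


if __name__ == "__main__":
    net = ContentSelectionNet()
    losses = train(net, toy_corpus())
    print(f"mean loss: first epoch {losses[0]:.4f}, last epoch {losses[-1]:.4f}")
```

In this sketch each sentence is a vector of 24 binary feature values, and `importance` returns a graded score rather than a hard decision, which leaves the choice of a selection threshold to a later step.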
This ANN receives individual sentences from the text to be summarised, hand-coded [2] by means of the identified pragmatic features, and assigns degrees of importance to them. It has been a major assumption behind this work that it is feature combinations rather than individual features that characterise sentence importance (Sections 1 & 2). An ANN learns such interactions naturally, which is why the connectionist paradigm was adopted for the content selection task.

The training corpus consisted of 1,880 sentences in total, taken from the real-world text collection described in Section 2. Of these, 1,100 are sentences largely out of their context, while the remaining 780 sentences make up 29 full texts. In contrast to the diversity of the former subcorpus, each of the latter texts is approximately 23 sentences long and was fully encoded. The encoding was carried out by 5 individuals on the basis of the above-mentioned manual, which exemplifies the correlations between the surface and the more abstract features in the proposed scheme. The manual was used in order to standardise the encoding process as much as possible, as well as to validate the proposed ways in which the evaluation of the abstract pragmatic features can be objectified and fully automated later on in the completed system.

Experiments to date (cf. Aretoulaki, 1996) have demonstrated the superiority of the pragmatic features over input to the ANN from across the three levels of abstraction (58.1% vs 56.1% success on average, where 'success' coincides with agreement with the judgement made by the human encoder regarding the level of importance of the corresponding sentence). The simultaneous use of control experiments with noisy data [3] has ensured the validity of these results (50.1% success).
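
The 'success' measure used above can be made concrete with a short sketch. The function below simply reports the percentage of sentences on which the network's importance label coincides with the human encoder's label. The function name and the example labels are hypothetical and are not taken from the reported experiments.

```python
def agreement_rate(network_labels, human_labels):
    """Percentage of sentences on which the two label sequences agree."""
    if len(network_labels) != len(human_labels):
        raise ValueError("both sequences must label the same sentences")
    if not network_labels:
        raise ValueError("at least one sentence is required")
    matches = sum(1 for n, h in zip(network_labels, human_labels) if n == h)
    return 100.0 * matches / len(network_labels)


# Hypothetical labels for a four-sentence text (1 = important, 0 = not).
network = [1, 0, 1, 1]
human = [1, 0, 0, 1]
print(agreement_rate(network, human))  # 3 of 4 labels match -> 75.0
```

The same function applies unchanged if importance is encoded with more than two levels, since it only tests the labels for equality.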
In addition, testing on whole texts has provided results comparable to those acquired with isolated sentences, namely 56.8% success on average; this suggests that the pragmatic features are sufficiently abstract to capture hierarchical and structural aspects of the corresponding discourses. The diversity of the corpus in terms of subject matter, text type and length provides sufficient evidence for the appropriateness of the pragmatic features for the high-level representation of texts from any domain or text type. Moreover, the portability of these pragmatic content selection features has also been partly proved with experiments on whole texts (Aretoulaki, 1996). These indicated that only a small amount of retraining, involving a limited number of representative texts, is required for the ANN to deal with new text types. Thus, what is predicted to differ between text types is the relative influence of each of the identified features in the final weighting of the corresponding sentence.

[2] Hand-coding is necessary given that the remaining components of COSY-MATS have not been implemented as yet.

[3] These used characteristics of the text that should be irrelevant to the content selection task, such as 'The second word in the sentence ends in a vowel'.

4.1 Generating Draft Summaries

The 'draft' summaries that result from concatenating the sentences of the input text that were selected by the ANN as important are, on the whole, adequate for current awareness purposes (see Aretoulaki, to appear, for a detailed evaluation of this and other draft output). The ANN receives a single (coherent and largely cohesive) text each time, rather than a collection of unrelated texts. Sentence selection was based on the 24 pragmatic features used for their encoding and the statistical correlations among them, as indicated in the training corpus. Most importantly, by filtering out the sentences for which the ANN did not have a clear decision, i.e.
by adapting the corresponding threshold on-line, content selection can be more fine-grained and the output summaries briefer. An example draft summary for a newspaper article after the application of this type of filtering is shown below. In this case, there was 82.6% agreement between the ANN decision and the corresponding human judgement regarding the importance of individual sentences in this article [4].

(1) Moscow editors feel the old-fashioned grip of the state (Headline)

(3) Intense party pressure for the dismissal of a prominent liberal editor and a new campaign to discredit the radical politician Boris Yeltsin both apparently with the backing of President Gorbachev have raised fears among reformers of a conservative swing by the Soviet leadership. (5) On Monday evening, he was summoned to the Central Committee to be told in so many words by Vadim Medvedev, the Politburo member in charge of ideology, that he should leave his post. (6) The move follows a harsh talk delivered last week by Mr Gorbachev to a group of senior Soviet editors, in which he gave several a dressing down. (12) Some journalists are talking of a protest strike. (13) 'The press is quite simply now facing bans on what it can write about, we're going back to the situation of years ago,' one complained yesterday. (16) The motion, which could prefigure a head-on clash between the party and a steadily more assertive parliament, attacks the Central Committee ideology department for its 'unacceptable attempts' to cow a newspaper. (22) Backing for Mr Yeltsin is not universal. (23) But the fact that the parliamentary exchanges were broadcast on prime time television leaves no doubt that a campaign is under way to smear a man whose huge following makes him Mr Gorbachev's only real rival.

Despite the coincidental cohesiveness therein, this draft output comprises the majority of the semantically substantial sentences in the input text.
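
The threshold-based filtering described in this section can be sketched as follows. This is one possible reading of it, not the implemented procedure: it assumes that the ANN output for each sentence is a score between 0 and 1, and that a sentence enters the draft when its score reaches an adjustable threshold. The function name, the scores and the borderline sentence (14) are invented for the illustration; sentences (12) and (22) are quoted from the example above.

```python
def draft_summary(scored_sentences, threshold=0.5):
    """Concatenate, in text order, the sentences scoring at least `threshold`.

    scored_sentences: iterable of (sentence_number, text, score) triples,
    where score is assumed to be the network output on a 0-1 scale.
    """
    kept = sorted((number, text)
                  for number, text, score in scored_sentences
                  if score >= threshold)
    return " ".join(f"({number}) {text}" for number, text in kept)


# Invented scores; only the wording of sentences (12) and (22) is from the paper.
scored = [
    (22, "Backing for Mr Yeltsin is not universal.", 0.78),
    (12, "Some journalists are talking of a protest strike.", 0.91),
    (14, "A hypothetical borderline sentence.", 0.55),
]

loose = draft_summary(scored, threshold=0.5)   # keeps all three sentences
strict = draft_summary(scored, threshold=0.7)  # drops the borderline one
print(strict)
```

Raising the threshold removes the sentences on which the network is least decided, so the draft becomes shorter while the original sentence order is preserved.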
The concatenation of sentences from the original is undoubtedly a much simpler task than the generation of an extended summary or a concise abstract. Novel text synthesis in the fully-developed COSY-MATS will also benefit from the proposed mappings between the surface and the more abstract content selection features. Since the corresponding modules, however, have not been implemented yet, the processes involved will not be exemplified here.

5 Conclusion: COSY-MATS is not a Utopia

All experimental results to date indicate that content selection in the completed COSY-MATS environment can be robust and efficient, even in the absence of any customisation to the specific application (domain or text type) or the user requirements. This is due to the adoption of the connectionist paradigm for this fuzzy task and the proven generic nature of the pragmatic and lower-level features used therein.

In the context of further implementing this summarisation shell, current work includes the testing of alternative learning algorithms for the prototype content selection ANN in order to improve its success rate. In addition, a more rigorous specification of the mappings between the surface cues and the intermediary and pragmatic features is being attempted, for the subsequent development of specialised processors that compute them. Thus, the encoding of the pragmatic features will be fully automated and it will also be possible to measure the precise effect that this will have on the training of the whole cascade of ANNs, given the current practice of hand-coding. Moreover, the impact on the content selection ANN of incorporating application-dependent information in the system will also be studied (cf. Aretoulaki, 1996). What is important is that research to date has proved that the realisation of the COSY-MATS intelligent and scalable summarisation shell is by no means a utopia.

6 Acknowledgements

The research reported in this paper was carried out as part of a Ph.D.
programme at the Centre for Computational Linguistics at U.M.I.S.T. I am indebted to my supervisor, Prof. Jun-ichi Tsujii, for his invaluable feedback and encouragement during that time. I am also grateful to Canon Europe Ltd. for granting me a 2-year Research Studentship, without which this research would never have been possible. Finally, I would like to thank the two anonymous reviewers for constructive comments which have increased the clarity of this paper.

[4] The 5 subjects were free as to the number of sentences they could pick out from any text as important. Importance, in turn, was defined as the relative indispensability from the final summary of the propositions expressed in the corresponding sentence. This was determined on the basis of the whole text the sentence belongs to.

References

M. Aretoulaki. COSY-MATS: A Hybrid Connectionist Symbolic Approach To The Pragmatic Analysis Of Texts For Their Automatic Summarisation. Ph.D. Thesis, Dept. of Language Engineering, U.M.I.S.T., Manchester, U.K., March 1996.

M. Aretoulaki. A Hybrid Connectionist-Symbolic Approach To Pragmatics and Text Summarisation. University College London (UCL), London, U.K., To Appear.

J. L. Austin. How To Do Things With Words. Oxford University Press, Oxford, U.K., 1962.

Science and Technology Feature. Short Cuts. The Economist, pages 97-98, December 17th, 1994.

M. Coulthard, editor. Advances in Written Text Analysis. Routledge, London, U.K., 1994.

B. Endres-Niggemeyer and E. Neugebauer. Professional Summarising: No Cognitive Simulation without Observation. In Proceedings of the 4th International Colloquium on Cognitive Science (ICCS-95), Donostia, San Sebastian, Spain, 1995.

J. Fukumoto and J. Tsujii. Breaking Down Rhetorical Relations for the Purpose of Analysing Discourse Structures. In Proceedings of the 15th International Conference on Computational Linguistics (COLING-94), volume 2, pages 1177-1183, Kyoto, Japan, August 1994. Association for Computational Linguistics.

R. Garigliano, R. G. Morgan, and M. H. Smith. The LOLITA System as a