Shake-And-Bake Machine Translation

Author

  • John L. Beaven
Abstract

A novel approach to Machine Translation (MT), called Shake-and-Bake, is presented, which exploits recent advances in Computational Linguistics in terms of the increased spread of lexicalist unification-based grammar theories. It is argued that it overcomes some difficulties encountered by transfer and interlingual methods. It offers a greater modularity of the monolingual components, which can be written independently of each other, using purely monolingual considerations. These are put into correspondence by means of a bilingual lexicon. The Shake-and-Bake approach for MT consists of parsing the Source Language in any usual way, then looking up the words in the bilingual lexicon, and finally generating from the set of translations of these words, but allowing the Target Language grammar to instantiate the relative word ordering, taking advantage of the fact that the parse produces lexical and phrasal signs which are highly constrained (specifically in the semantics). The main algorithm presented for generation is a variation on the well-known CKY one used for parsing. A toy bidirectional MT system was written to translate between Spanish and English, and some of the entries are shown.
* The work reported here was carried out at the University of Edinburgh under the support of a studentship from the Science and Engineering Research Council. Thanks to Ann Copestake, Mark Hepple, Antonio Sanfilippo, Arturo Trujillo, Pete Whitelock and the anonymous reviewers for their comments. Any errors remain my own.
1 Motivation
The research reported here was motivated by the desire to exploit recent trends in Computational Linguistics, such as the appearance of lexicalist unification-based grammar formalisms for the purposes of machine translation, in an attempt to overcome what are perceived to be some of the major shortcomings of transfer and interlingual approaches.
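As a rough illustration of the three steps named in the abstract (parse the source sentence, look up each word in the bilingual lexicon, generate from the resulting unordered bag), the sketch below may help. It is not the CKY-style generator the paper presents: it simply tries every ordering of the bag, and the toy parser, the role labels, the lexicon entries and all function names are assumptions introduced here for illustration.

```python
from itertools import permutations

# Hypothetical bilingual lexicon (Spanish -> English); illustrative entries only.
BILINGUAL = {"Maria": "Mary", "visitó": "visited", "Madrid": "Madrid"}

ROLES = ["agent", "event", "patient"]  # assumed role labels, not from the paper


def parse_sl(words):
    """Toy source-language 'parser': assumes a three-word subject-verb-object
    sentence and tags each word with the semantic role it fills."""
    if len(words) != len(ROLES):
        raise ValueError("toy parser only handles three-word SVO sentences")
    return list(zip(words, ROLES))


def lookup(tagged):
    """Translate each word, keep its role, and return an unordered bag."""
    return frozenset((BILINGUAL[word], role) for word, role in tagged)


def tl_accepts(sequence):
    """Toy target-language 'grammar': accepts the order agent, event, patient."""
    return [role for _, role in sequence] == ROLES


def generate(bag):
    """Brute-force generation: try orderings until the grammar accepts one."""
    for candidate in permutations(sorted(bag)):
        if tl_accepts(candidate):
            return " ".join(word for word, _ in candidate)
    return None


def translate(sentence):
    return generate(lookup(parse_sl(sentence.split())))


print(translate("Maria visitó Madrid"))
```

The point of the sketch is that word order is never carried over from the source sentence: the bag is unordered, and only the target-language grammar, guided by the instantiated semantics, decides the final order.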
With a transfer-based MT system, the transfer component is very much language-pair specific, and must be written bearing very closely in mind both monolingual components in order to ensure compatibility. Depending on how much work is done by the analysis and generation components, the tasks carried out by the transfer element may vary, but in general this module is very idiosyncratic and will involve several hundred transfer rules. Writing these transfer rules is the most time-consuming aspect of the design of a transfer-based system, as it must be consistent with both monolingual grammars. The process is therefore error-prone, and the result is not very portable, since the consequences of making changes to the monolingual components may be far-reaching as far as the transfer rules are concerned.
One of the main difficulties with interlingual approaches is what Landsbergen [Landsbergen 87] refers to as the subset problem. If the system is to be robust, it is essential to guarantee that any interlingual formula derived from any Source Language (SL) expression is amenable to generation into the Target Language (TL). If the interlingua is powerful enough to represent all the meanings in all the languages involved, there will be several (probably infinitely many) formulae in that interlingua which are logically equivalent to the one produced by the analyser. It cannot then be guaranteed that this formula comes under the coverage of the TL generator, unless we can draw logical inferences in the interlingua. The complexity of this task may be computationally daunting, since sub-problems of this (such as satisfiability and non-tautology) are known to be NP-complete ([Garey and Johnson 1979]).
The approach presented here bears some similarity with that of [Alshawi et al 91], which uses the algorithm of [Shieber et al. 90] for generation from quasi-logical forms.
Actes de COLING-92, Nantes, 23-28 août 1992, p. 603 (Proc. of COLING-92, Nantes, Aug. 23-28, 1992)
On the other hand, generation here takes place from a set of TL lexical items, with instantiated semantics, which makes the task easier. This approach was tested with independently written grammars for small yet linguistically interesting fragments of Spanish and English, which are used both for parsing and generation. These are put into correspondence by means of a bilingual lexicon containing the kind of information one might expect to find in an ordinary bilingual dictionary.
2 The grammar formalism
A version of Unification Categorial Grammar (UCG) ([Calder et al. 88]) is used. Like many other current grammatical formalisms ([Shieber 86], [Pollard and Sag 87], [Uszkoreit 86]), it represents linguistic objects by sets of feature (or attribute)-value pairs, called signs. The values of these signs may be atomic, variables or further sets of feature-value pairs. They can therefore be represented as directed acyclic graphs or as attribute-value matrices using the PATR-II notation of [Shieber 86]. The notion of unification is then used to combine these.
The main features used in the signs are ORTHOGRAPHY, CAT (the categorial grammar syntax), ORDER (the directionality of the "slash", which specifies linear ordering), FEATS (a set of syntactic features), CASES (a case-assignment mechanism built on top of standard UCG), and SEM, a unification-based semantics with a neo-Davidsonian treatment of roles ([Parsons 80, Dowty 89]). The semantics of an expression is of the form I:P, where I is a variable for the semantic index of the whole expression, and P is a conjunction of propositions in which that index appears. In addition, features called ARG0, ARG1 and so on provide useful "handles" for allowing the bilingual lexicon to access the semantic indices, but they are not strictly necessary for the grammars. The signs presented are only shorthand abbreviations of the full ones used, and the interested reader is referred to [Beaven 92] for a more complete view.
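To make the idea of signs and unification concrete, here is a minimal sketch in which a sign is a nested Python dictionary whose values are atoms, variables or further dictionaries. It is an assumption-laden toy, not the UCG implementation used in the paper: the `Var` class, the `INDEX` and `PRED` feature names, and the omission of an occurs check are all choices made here for brevity.

```python
class Var:
    """A named variable that may appear as a value in a sign."""

    def __init__(self, name):
        self.name = name

    def __repr__(self):
        return self.name


def walk(term, bindings):
    """Follow bindings until reaching a non-variable or an unbound variable."""
    while isinstance(term, Var) and term in bindings:
        term = bindings[term]
    return term


def unify(a, b, bindings=None):
    """Return extended bindings if a and b unify, otherwise None."""
    bindings = {} if bindings is None else dict(bindings)
    a, b = walk(a, bindings), walk(b, bindings)
    if a is b:
        return bindings
    if isinstance(a, Var):
        bindings[a] = b
        return bindings
    if isinstance(b, Var):
        bindings[b] = a
        return bindings
    if isinstance(a, dict) and isinstance(b, dict):
        # Only shared features must agree; features present on one side
        # only are compatible with anything.
        for feature in a.keys() & b.keys():
            bindings = unify(a[feature], b[feature], bindings)
            if bindings is None:
                return None
        return bindings
    return bindings if a == b else None


def resolve(term, bindings):
    """Substitute bound variables throughout a sign."""
    term = walk(term, bindings)
    if isinstance(term, dict):
        return {f: resolve(v, bindings) for f, v in term.items()}
    return term


I = Var("I")
verb = {"CAT": "s", "SEM": {"INDEX": I, "PRED": "visit"}}
wanted = {"CAT": "s", "SEM": {"INDEX": "e1"}}
bindings = unify(verb, wanted)
print(resolve(verb, bindings))
```

Unifying `verb` with `wanted` binds the index variable `I` to `e1`, while unifying two signs with different atomic `CAT` values fails and returns `None`.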
The PATR-II notation will be used, with the Prolog convention that names starting with upper case stand for variables. In addition, for the sake of clarity and brevity, the nonessential features will be omitted, as will their names when these are obvious. The grammar rules used subsume both functional application and composition, but for the examples given here, only functional application will be necessary.
An important feature of this approach is that this will make it possible to have an MT system in which no meaningful elements in the translation relation are introduced syncategorematically (in the form of transfer rules or operations with interlingual representations). In particular, assuming we have very rich lexical entries (which contain information about various dimensions of the language, such as orthography, syntax and semantics), all that is needed is a correspondence between the lexical entries, supplied by a bilingual lexicon, together with a set of constraints for each correspondence. The design of such a translation system will therefore involve three components: two monolingual lexicons for the languages concerned, and a bilingual lexicon.
The Spanish and English components were designed using purely monolingual considerations, and as a consequence the treatments of English and Spanish grammars are quite different. The basics of the grammar will be explained by presenting the monolingual lexical entries required for the Spanish sentence Maria visitó Madrid, which corresponds to the English Mary visited Madrid. More linguistically interesting sentences will be offered at a later stage.
2.1 The Spanish Grammar
The Spanish grammar is a somewhat unconventional version of UCG, in that VPs are treated as sentences (S), and NPs as sentence modifiers (S/S in the categorial notation). The reasons for this decision have to do with accounting for subject pro-drop, and are discussed in [Whitelock 88] and [Beaven 92].
A case-assignment mechanism is added to standard UCG. Amongst other uses, it provides a coverage of clitic placement. NPs are sentence modifiers. The following one, for instance, looks for a sentence with semantics I1:Sem1, and returns another sentence, in which the semantics have been modified to state that F3 (an index standing for Maria) plays a certain (unspecified) role in the semantics of I1. The operation ∪ stands for set union, and all the propositions in the semantics are interpreted here as being conjoined.
(1) [ ORTHO 'Maria' ... ]
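The behaviour of an NP as a sentence modifier can be sketched as a function from sentence signs to sentence signs that keeps the index and extends the set of propositions by set union. Since the sign in (1) is cut off in this excerpt, the proposition names `name` and `role`, the lower-case index strings and the reduced `CAT`/`SEM` layout below are assumptions made for illustration, not the paper's actual entry.

```python
def maria_modifier(sentence):
    """Toy S/S sign for 'Maria': takes a sentence sign with semantics
    (I1, Sem1) and returns a sentence sign with the same index I1 and
    semantics Sem1 united with the propositions contributed by the NP."""
    index, propositions = sentence["SEM"]
    contributed = frozenset({
        ("name", "f3", "maria"),   # f3 is the index standing for Maria
        ("role", index, "f3"),     # f3 plays some unspecified role in I1
    })
    return {"CAT": "s", "SEM": (index, propositions | contributed)}


# The verb is already a sentence (S) in this grammar, which is what
# allows a subject-less clause under pro-drop.
visito = {"CAT": "s", "SEM": ("e1", frozenset({("visit", "e1")}))}

result = maria_modifier(visito)
print(sorted(result["SEM"][1]))
```

Every proposition in the resulting set is read as conjoined with the others, so applying further modifiers (for example one for Madrid) would simply keep growing the same set around the index `e1`.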


Similar resources

Some Aspects of Shake-and-bake Machine Translation between English and Italian

Shake-and-Bake is an approach to bidirectional and multilingual Machine Translation which takes advantage of the features of lexically-based unification grammars to design modular systems, where grammars are written on purely monolingual considerations. An extension to the standard Shake-and-Bake model is proposed in order to increase such peculiarities. It consists in introducing in the system ...


A Chart Generator for Shake and Bake Machine Translation

A generation algorithm based on an active chart parsing algorithm is introduced which can be used in conjunction with a Shake and Bake machine translation system. A concise Prolog implementation of the algorithm is provided, and some performance comparisons with a shift-reduce based algorithm are given which show the chart generator is much more efficient for generating all possible sentences f...


Improving the Efficiency of a Generation Algorithm for Shake and Bake Machine Translation Using Head-Driven Phrase Structure Grammar

A Shake and Bake machine translation algorithm for Head-Driven Phrase Structure Grammar is introduced based on the algorithm proposed by Whitelock for unification categorial grammar. The translation process is then analysed to determine where the potential sources of inefficiency reside, and some proposals are introduced which greatly improve the efficiency of the generation algorithm. Prelimin...


Using Template-Grammars for Shake & Bake Paraphrasing

In this paper we propose an approach to corpus-based generation in a machine translation framework that is similar to shake & bake (Whitelock, 1992). A bag of words is mapped against an automatically induced TL template grammar and a sentence is generated by recursively applying rules that are extracted from the template grammar. A test version of the template grammar is enriched with further l...


An Efficient Generation Algorithm for Lexicalist MT

The lexicalist approach to Machine Translation offers significant advantages in the development of linguistic descriptions. However, the Shake-and-Bake generation algorithm of (Whitelock, 1992) is NP-complete. We present a polynomial time algorithm for lexicalist MT generation provided that sufficient information can be transferred to ensure more determinism.


A Lexicalist Approach to the Translation of Colloquial Text

Colloquial English (CE) as found in television programs or typical conversations is different than text found in technical manuals, newspapers and books. Phrases tend to be shorter and less sophisticated. In this paper, we look at some of the theoretical and implementational issues involved in translating CE. We present a fully automatic large-scale multilingual natural language processing syst...



Publication date: 1992