The MetaMorpho Machine Translation System

About machine translation (MT) software

Electronic translation tools can be classified according to the nature of the source text and the applied linguistic knowledge:

source text                   without linguistic knowledge    with linguistic knowledge
single word or phrase         dictionary application          conjugating dictionary application
sentence, word by word        -                               word by word dictionary
sentence, phrase by phrase    statistical translation tool    example-based translation tool
sentence                      translation memory              rule-based translation tool

Translation tools parse the source text to create an intermediary representation, from which they produce the text in the target language. The systems can be differentiated by the complexity of this intermediary representation: it is simplest in a dictionary application and most complex in rule-based translation tools.

As the table indicates, a word by word dictionary reveals only the relationships between words, an example-based translation tool only the relationships between phrases, and a rule-based translation tool all relationships in the sentence. The quality of the translation improves with the number of relationships revealed in the sentence.

Rule-based translation tools can be further classified. The primary distinction is the depth to which the source text is understood. Early, so-called direct translation programs tried to translate by moving words and constituents without revealing the entire structure of the sentence. Today, MT software attempts to parse the entire sentence syntactically. Here again two methods can be distinguished. In an interlingual solution, source language parsing and target language generation are completely independent of each other. In a transfer solution, the description also contains rules of conversion between the two languages. Most machine translation software currently available is a transfer solution that aims to minimize the transfer rules, i.e. to be as independent of specific language pairs as possible. It must be noted, however, that no perfect interlingual solution has been created so far, despite many attempts.

The ultimate goal is the interlingual solution, because parsing that is independent of the target language could be applied to any language pair. Thus, whereas transfer solutions require a separate module per language pair, the interlingual solution would require only a single parser and a single generator module for each language. For the twenty-three official languages of the European Union (n = 23), this would mean n*(n-1), that is 506, transfer modules, whereas the interlingual solution would need only 2*n, that is 46, modules.
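The module counts above can be checked with a few lines of Python. This is only an arithmetic sketch; the function names are ours and do not come from any MetaMorpho component.

```python
def transfer_modules(n: int) -> int:
    """One transfer module per ordered (source, target) language pair."""
    return n * (n - 1)


def interlingual_modules(n: int) -> int:
    """One parser plus one generator per language."""
    return 2 * n


n = 23  # official EU languages, as stated in the text
print(transfer_modules(n), interlingual_modules(n))  # 506 46
```

The gap widens quickly: the transfer count grows quadratically with the number of languages, the interlingual count only linearly.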

The two types of sentence translating tools without linguistic knowledge are translation memories and statistical translation tools. Their operation is similar: the translation memory that serves as the basis for translation is built up from previous translations done by a professional translator, whereas statistical systems attempt to find such previous translations automatically in bilingual corpora and on the Internet. The problem with statistical translation tools is that they could provide translations of acceptable quality only if their building blocks were complete sentences rather than phrases, which is unlikely given the limited amount of available material. For this reason, linguistic knowledge is often built in, which in turn raises both theoretical and practical issues. To date, no statistical translation system of marketable quality has been developed. Nevertheless, statistical methods may play a role in so-called hybrid systems, where rule-based operation is sometimes supplemented by decisions based on statistics.

Conceptual basics of the MetaMorpho system

The MetaMorpho system was created to combine the advantageous features of different translation tools. Translation memories provide good translations because they are built up from direct mappings, but they only work in a limited number of cases. MT software will translate anything, but if the translation is generated by linguistic algorithms, linguistic quality may deteriorate. Our objective has been to build a system that uses direct mappings wherever possible and, where necessary, generates productive translations with the help of grammar.

This concept led to the development of an essentially new architecture. MetaMorpho stores all linguistic data in pairs. This makes the system neither a transfer nor an interlingual procedure, since those consist of separate parsing and generating rules. In MetaMorpho, each parsing rule has its generating counterpart. Its operation can best be described as a kind of transfer taking place on all linguistic levels. We have abandoned the popular interlingual idea and develop a separate language module for each language pair. At the same time, using language pairs as building blocks is rewarding because:

dictionaries can be built and used in a natural way,
human translations and translation memories can be integrated in a natural way,
user extensibility can be easily realized,
grammar is only required for the translation of the given language pair.

Building a database based on language pairs does not pose a serious problem: on the one hand, the number of really important languages is low, and on the other, these can be relatively easily produced through offline use of data.

Therefore, MetaMorpho cannot be classified into any of the known translation method categories. It is essentially a rule-based system, but, as opposed to transfer and interlingual methods, it consists exclusively of direct mappings. These mappings are not applied directly, however; they are used in the separate generating phase. The examples describe both the grammar and the dictionary.

About the database

To describe MetaMorpho's rules, we have developed a descriptive language capable of formulating context-free linguistic rules. We call this formalism the MetaMorpho Dictionary (MMD) format. As described above, parsing and generating lines are not isolated within this format; each rule contains both a parsing part and a generating part. Words, phrases and grammatical rules are stored in the same database. Dictionary entries differ from grammatical rules (e.g. a sentence consists of a subject and a predicate) only in that the former contain specific words and the latter contain abstract linguistic symbols.

Another important feature of the MetaMorpho system is that rules are represented on two levels. The concept resembles that of high-level and assembly languages in programming, except that here the language and syntax of the two levels do not differ so greatly. The solution was necessary for similar reasons: the high-level language is easy to read and develop, while the information required for successful program operation appears only in the low-level language. A good example of the difference between the two levels is feature inheritance, which is hidden on the high level but represented explicitly on the low level.

Below are two dictionary examples that illustrate the database. The first is a simple nominal entry, and the second is a verb phrase.

*dog:24263
EN.NX[animtype=YES, ct=CNT] = N(lex="dog")
HU.NX = N[lex="kutya"]

*VP=love+DOBJ:530
EN.VP[idiom=NO] = TV(lex="love", :passtr=BOTH) + DOBJ
HU.VP = TV[:lex="szeret"] + DOBJ[addet=YES]

Interpretation of the nominal example: if the English word 'dog' occurs, create a symbol of type NX that is animate and countable. If it has to be translated, the translation must be 'kutya'. Interpretation of the verb phrase example: if the English word 'love' occurs with an object in the sentence, it must be translated as 'szeret'. If the English object has no article, the Hungarian object must be given a definite article (determiner). Numerous features do not appear explicitly: apart from their lexical attribute (dog, kutya), nouns have number, case and many other features that are inherited through the rules but hidden in the high-level rule.

The syntactic database of the English to Hungarian translation tool consists of approx. 200,000 linguistic examples, all formulated in the MMD format described above. Converters are available for converting the rules to XML and back if required, which makes the system compatible with other linguistic descriptions. The syntactic descriptions are supplemented by monolingual morphological analyzer and generator databases and by additional linguistic data (such as statistics for morphological and semantic disambiguation).
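The idea that every rule carries both its parsing part and its generating part can be pictured with a small Python data structure. This is a minimal sketch, not the MMD implementation: the class names, the field names, the `is_dictionary_entry` helper and the S = SUBJ + PRED grammar rule are all assumptions made for illustration. Only the contents of the two entries are copied from the examples above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Side:
    """One half of a paired rule (hypothetical layout)."""
    symbol: str            # symbol being built, e.g. "EN.NX"
    parts: tuple           # right-hand side items, kept as plain strings
    features: tuple = ()   # explicit features of the built symbol


@dataclass(frozen=True)
class PairedRule:
    name: str
    source: Side   # parsing part
    target: Side   # generating part


DOG = PairedRule(
    "dog",
    Side("EN.NX", ('N(lex="dog")',), (("animtype", "YES"), ("ct", "CNT"))),
    Side("HU.NX", ('N[lex="kutya"]',)),
)
LOVE = PairedRule(
    "VP=love+DOBJ",
    Side("EN.VP", ('TV(lex="love", :passtr=BOTH)', "DOBJ"), (("idiom", "NO"),)),
    Side("HU.VP", ('TV[:lex="szeret"]', "DOBJ[addet=YES]")),
)
# Hypothetical grammar rule: only abstract symbols, no specific words.
SENTENCE = PairedRule(
    "S=SUBJ+PRED",
    Side("EN.S", ("SUBJ", "PRED")),
    Side("HU.S", ("SUBJ", "PRED")),
)


def is_dictionary_entry(rule: PairedRule) -> bool:
    """Assumed test: a dictionary entry names a specific word via a lex constraint."""
    return any("lex=" in part for part in rule.source.parts)


for rule in (DOG, LOVE, SENTENCE):
    print(rule.name, is_dictionary_entry(rule))
```

Keeping dictionary entries and grammar rules in one type mirrors the statement that words, phrases and grammatical rules live in the same database and differ only in whether they mention specific words.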

About parsing

Parsing is controlled by an essentially context-free grammar and is done bottom-up. This means that, starting from the words, the rules create higher-order linguistic symbols by concatenation. Parsing is considered successful if it creates a sentence symbol that uses every word of the source sentence. In that case, the sentence is likely to be translatable in acceptable quality. Since there is no transfer phase, generation merely consists of executing the generating lines that belong to the symbols created during parsing.
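The procedure just described can be sketched as a toy chart parser in Python. It is an illustration under strong assumptions, not MetaMorpho's parser: the rule table, the DOBJ = NX rule, the "DET" placeholder for the added article and the use of VP as the goal symbol are all invented for the sketch. Features and morphology are ignored, so the output is a list of Hungarian lemmas rather than an inflected sentence.

```python
# Toy paired rules: (symbol built, parsing part, generating part).
# Lowercase items in the parsing part are source words, uppercase items are symbols.
RULES = [
    ("TV",   ("love",),      lambda w: ["szeret"]),
    ("N",    ("dog",),       lambda w: ["kutya"]),
    ("NX",   ("N",),         lambda n: n),
    ("DOBJ", ("NX",),        lambda nx: nx),                         # hypothetical rule
    ("VP",   ("TV", "DOBJ"), lambda tv, dobj: tv + ["DET"] + dobj),  # "DET" stands for addet=YES
]


def parse(words):
    """Fill a chart bottom-up: (start, end) -> {symbol: (generator, child nodes)}."""
    n = len(words)
    chart = {(i, i + 1): {w: (None, ())} for i, w in enumerate(words)}
    for length in range(1, n + 1):
        for start in range(n - length + 1):
            end = start + length
            cell = chart.setdefault((start, end), {})
            for mid in range(start + 1, end):      # concatenate two adjacent spans
                left, right = chart[(start, mid)], chart[(mid, end)]
                for lhs, rhs, gen in RULES:
                    if len(rhs) == 2 and rhs[0] in left and rhs[1] in right:
                        cell.setdefault(lhs, (gen, (left[rhs[0]], right[rhs[1]])))
            changed = True
            while changed:                         # single-child rules, up to a fixed point
                changed = False
                for lhs, rhs, gen in RULES:
                    if len(rhs) == 1 and rhs[0] in cell and lhs not in cell:
                        cell[lhs] = (gen, (cell[rhs[0]],))
                        changed = True
    return chart


def generate(node):
    """Read out the target side by running the generating parts of the parse."""
    gen, children = node
    if gen is None:        # a raw source word contributes nothing by itself
        return []
    return gen(*(generate(child) for child in children))


def translate(words, goal="VP"):
    """Succeed only if a single goal symbol covers every word of the input."""
    node = parse(words).get((0, len(words)), {}).get(goal)
    return None if node is None else generate(node)


print(translate(["love", "dog"]))   # ['szeret', 'DET', 'kutya']
print(translate(["dog", "love"]))   # None
```

Generation is kept as a separate pass over the finished parse, which reflects the point that there is no transfer phase: the target side is simply read out from the symbols that parsing created. The toy input is a two-word phrase chosen to reuse the two database entries; it is not the sentence of the original parse-tree figure.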

Note that even the very simple sentence of the example parse tree went through several linguistic transformations: the English pronoun is represented by the verbal suffix in Hungarian, and a definite article was added to the Hungarian object. In the parse trees, symbols below one another and in the same character position are siblings. A symbol above and to the left of a node is its parent, and a symbol below and to the right of a node is its child. The tree also shows that the parser used the VP rule from the example. By contrast, the NX rule from the example is not visible, because for the sake of clarity the tree-drawing application drew only the most important parsing levels. To complete parsing and create the parse tree, the parser had to create about ten times as many symbols as are shown. During generation, only the symbols shown had to be created, since generation is a simple reading out of the results. The tree only displays some important node attributes (e.g. lexical form and number of the English noun).

In reality, an English nominal symbol has about a hundred features, all of which influence the outcome of parsing. Parsing is not always successful. If there is no solution for the whole sentence, partial parsing results are merged in an attempt to cover the whole sentence and produce the best possible translation. This is called mosaic translation. Several aspects, including statistical ones, are considered when the partial results are collected. Sometimes a sentence has more than one translation. Whether the different results are displayed depends on the user interface: they are displayed in MoBiCAT, which translates texts sentence by sentence, but not in MorphoWord, which performs continuous text translation. These are usually technical ambiguities, and we seek to minimize their occurrence. At the same time, we plan to add a feature to the user interface that lets users choose among the results.
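The mosaic translation mentioned above can be illustrated with a small covering routine in Python. The selection criterion used here, tiling the sentence with as few partial results as possible, is an assumption chosen for simplicity; the text only says that several aspects, including statistical ones, are considered. The function name and the data layout are likewise hypothetical.

```python
def mosaic(n, partials):
    """Tile word positions 0..n with partial results, using as few pieces as possible.

    partials maps (start, end) spans to their partial translations.
    Returns the chosen pieces from left to right, or None if no tiling exists.
    """
    best = {0: (0, [])}    # position reached -> (piece count, pieces so far)
    for end in range(1, n + 1):
        for (start, stop), translation in partials.items():
            if stop == end and start in best:
                count, pieces = best[start]
                candidate = (count + 1, pieces + [((start, stop), translation)])
                if end not in best or candidate[0] < best[end][0]:
                    best[end] = candidate
    return best[n][1] if n in best else None


# Abstract placeholders stand in for real partial translations.
partials = {
    (0, 1): ["A"], (1, 2): ["B"], (0, 2): ["AB"],
    (2, 3): ["C"], (3, 4): ["D"], (2, 4): ["CD"],
}
print(mosaic(4, partials))
# [((0, 2), ['AB']), ((2, 4), ['CD'])]
```

Preferring fewer, longer pieces is a common heuristic because longer spans were analysed with more context. A fuller scoring function could weigh each piece by other criteria instead of counting pieces.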

Architecture

The MetaMorpho system incorporates the following modules: tokenizer, morphological analyzer, morphological tagger, sentence segmenter, morphosyntactic converter for parsing, syntactic parser, sense disambiguator, syntactic generator, morphosyntactic converter for generation, morphological generator and word concatenator. Every sentence goes through all of these steps. During parsing, the accumulated knowledge keeps expanding, as every parsing step has access to the information generated in the previous steps. The program can be run in client-server mode, which lets it serve several translation requests from the network or the Internet; numerous clients can connect to the server program. So far, the following applications have been realized:

MoBiCAT: popup translation service
MorphoWord: Microsoft Word add-on for translating
RuBi: extension module for MetaMorpho. Enables the teaching of words and phrases, one by one or an entire dictionary in one step.
MorphoWeb: web page translator
MorphoWAP: WAP-based translator
Microsoft Office 2003 translator: sentence translator demo available in Office 2003
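The module chain listed at the start of this section, in which each step can read what earlier steps produced, can be pictured as a pipeline over a shared context. The Python sketch below shows only this data flow. The module names are taken from the text, while `run_pipeline`, `placeholder` and the dictionary-based context are assumptions. The real system is written in C++ and its interfaces are not described here.

```python
MODULES = [
    "tokenizer",
    "morphological analyzer",
    "morphological tagger",
    "sentence segmenter",
    "morphosyntactic converter for parsing",
    "syntactic parser",
    "sense disambiguator",
    "syntactic generator",
    "morphosyntactic converter for generation",
    "morphological generator",
    "word concatenator",
]


def placeholder(context):
    """Stand-in for a real module: report which earlier results it could read."""
    return tuple(context)   # keys of the context, in insertion order


def run_pipeline(text, modules=MODULES, stage=placeholder):
    """Run every module in order; each one sees everything produced before it."""
    context = {"input": text}
    for name in modules:
        context[name] = stage(context)
    return context


result = run_pipeline("some sentence")
print(result["tokenizer"])            # ('input',)
print(len(result["syntactic parser"]))  # 6
```

The last line prints 6 because the syntactic parser stage can see the input plus the results of the five modules that ran before it.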

The program is written in C++. It consists of more than 250 projects and more than 2,000 source files of our own. Apart from code written by us, the MetaMorpho system integrates numerous external solutions with freely available source code (database manager, graphical UI, etc.).

Tihanyi László
project manager