Hybrid technology “SMT and RBMT”

Hybrid translation technology uses statistical methods to build dictionary databases automatically from parallel corpora, to generate several candidate translations at both the lexical level and the level of the target sentence's syntactic structure, to apply automatic post-editing, and to select the best (most probable) candidate using a language model built on a specific target-language corpus.

Hybrid (SMT + RBMT) System

  • Rule-based MT with statistical post-processing;
  • Statistical MT with rule-based preprocessing;
  • Full integration of RBMT and SMT.

Statistical MT seeks to use linguistic data, while systems with a “classic” rule-based approach apply statistical methods.

Adding some “end-to-end” rules, that is, building a hybrid system, somewhat improves translation quality, especially when there is too little input data to build the index files in which an N-gram-based machine translator stores its linguistic information.

Combining RBMT and Statistical Machine Translation:

  • Linguistic analysis of the input sentence;
  • Generation of translation variants;
  • Use of statistical technologies;
  • Evaluation and selection of the best translation option using the Language Model.
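
The last step in the list above, picking the best variant with a language model, can be sketched in a few lines. This is only a toy illustration, not how any particular hybrid system does it: it assumes a bigram model with add-one smoothing, and the three-sentence corpus and the names `train_bigram_lm`, `log_prob` and `select_best` are made up for the example.

```python
import math
from collections import Counter

def train_bigram_lm(sentences):
    """Count bigrams and their left contexts over a target-language corpus."""
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        vocab.update(tokens)
        contexts.update(tokens[:-1])             # every token that starts a bigram
        bigrams.update(zip(tokens, tokens[1:]))
    return contexts, bigrams, len(vocab)

def log_prob(candidate, model):
    """Log-probability of a candidate under the bigram model (add-one smoothing)."""
    contexts, bigrams, vocab_size = model
    tokens = ["<s>"] + candidate.lower().split() + ["</s>"]
    return sum(
        math.log((bigrams[(a, b)] + 1) / (contexts[a] + vocab_size))
        for a, b in zip(tokens, tokens[1:])
    )

def select_best(candidates, model):
    """Return the candidate translation the language model finds most probable."""
    return max(candidates, key=lambda c: log_prob(c, model))

# Hypothetical target-language corpus and translation variants.
CORPUS = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "a cat sleeps on the mat",
]
MODEL = train_bigram_lm(CORPUS)
CANDIDATES = ["cat the sat mat on the", "the cat sat on the mat"]
BEST = select_best(CANDIDATES, MODEL)
```

Both candidates contain the same words, so the model can only tell them apart by word order, which is exactly the job the language model has in this scheme. A real system would use a much larger corpus and would usually normalize for sentence length.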

Stages of Hybrid SMT and RBMT technology:

  • RBMT training based on a parallel corpus using statistical technologies;
  • Translation using the trained system.

Syntax-Based Statistical Translation (Syntax-based SMT)

This method deserves a brief mention. For many years before the advent of neural networks, syntax-based translation was touted as the “future of translation”, but it never really took off.

The adherents of syntax-based translation believed in combining SMT with the old rule-based transfer approach. All you need is to learn to parse a sentence fairly accurately: determine the subject, the predicate, and their dependents, and then build a tree.

With such a tree, one can train the machine to correctly convert the syntactic structures of one language into those of another, translating the rest word by word or phrase by phrase.

Only now this is done by machine learning rather than by hand. In theory, this would solve the word order problem for good.
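
To make the idea a bit more concrete, here is a rough sketch of learning reorderings from data instead of writing them by hand. Everything in it is an assumption made for illustration: the trees are already parsed, the training examples are hand-written stand-ins for what would be extracted from aligned parallel parse trees, and the tag `NBAR` and the helper names are hypothetical.

```python
from collections import Counter, defaultdict

# A node is (label, children); a leaf is (pos_tag, word).

def learn_reorderings(examples):
    """For each sequence of child labels, keep the most frequently observed permutation."""
    counts = defaultdict(Counter)
    for labels, permutation in examples:
        counts[labels][permutation] += 1
    return {labels: c.most_common(1)[0][0] for labels, c in counts.items()}

def reorder(node, rules):
    """Recursively reorder the children of each node according to the learned rules."""
    label, children = node
    if isinstance(children, str):                # leaf: nothing to reorder
        return node
    children = [reorder(child, rules) for child in children]
    labels = tuple(child[0] for child in children)
    permutation = rules.get(labels, tuple(range(len(children))))  # default: keep order
    return (label, [children[i] for i in permutation])

def leaves(node):
    """Read the words off the tree from left to right."""
    label, children = node
    if isinstance(children, str):
        return [children]
    return [word for child in children for word in leaves(child)]

# Hypothetical observations: in the target language the adjective usually follows the noun.
EXAMPLES = [
    (("ADJ", "NOUN"), (1, 0)),
    (("ADJ", "NOUN"), (1, 0)),
    (("ADJ", "NOUN"), (0, 1)),
    (("DET", "NBAR"), (0, 1)),
]
RULES = learn_reorderings(EXAMPLES)
TREE = ("NP", [("DET", "the"), ("NBAR", [("ADJ", "red"), ("NOUN", "car")])])
REORDERED_WORDS = leaves(reorder(TREE, RULES))
```

The words themselves are left untranslated here; in the scheme described above they would then be translated word by word or phrase by phrase. The whole thing stands or falls on the quality of the parse.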

The problem is that although humanity has long considered parsing a solved problem (ready-made libraries exist for many languages), in reality it works pretty badly. I have personally tried many times to use syntax trees for tasks more complex than separating subject and verb, and each time I have given up in favor of other methods.

Rule-based machine translation

Ideas for rule-based machine translation began to emerge as early as the 1970s. Scientists watched how human translators worked and tried to program their big, slow computers to do the same. Their systems consisted of:

  • Bilingual dictionary
  • A set of linguistic rules for each language

That is basically it. Optionally, these systems were supplemented with hacks such as lists of names, spelling correctors, and transliterators.
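
A toy version of such a system fits on one screen. The sketch below is an assumption-heavy illustration, not a reconstruction of any historical system: the tiny English-to-Spanish dictionary, the name list and the single hand-written rule (adjective goes after the noun) are all invented for the example.

```python
# Hypothetical bilingual dictionary: source word -> (target word, part of speech).
DICTIONARY = {
    "wants": ("quiere", "VERB"),
    "a": ("un", "DET"),
    "red": ("rojo", "ADJ"),
    "car": ("coche", "NOUN"),
}

# The "list of names" hack: these words are copied through untranslated.
NAMES = {"Maria"}

def lookup(word):
    """Translate one word via the name list or the dictionary; unknown words pass through."""
    if word in NAMES:
        return (word, "NAME")
    return DICTIONARY.get(word.lower(), (word, "UNK"))

def swap_adjective_noun(tagged):
    """Hand-written linguistic rule: ADJ NOUN in the source becomes NOUN ADJ in the target."""
    result = list(tagged)
    i = 0
    while i < len(result) - 1:
        if result[i][1] == "ADJ" and result[i + 1][1] == "NOUN":
            result[i], result[i + 1] = result[i + 1], result[i]
            i += 2
        else:
            i += 1
    return result

def translate(sentence):
    """Dictionary lookup word by word, then apply the rule."""
    tagged = [lookup(word) for word in sentence.split()]
    tagged = swap_adjective_noun(tagged)
    return " ".join(word for word, _ in tagged)

TRANSLATION = translate("Maria wants a red car")
```

Note that this only produces correct Spanish because the dictionary forms happen to agree with each other (everything is masculine singular). As soon as a feminine noun shows up, a word-by-word system like this needs more rules or more dictionary entries.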

Transfer systems

Here we do not rush straight into translating from the dictionary, but prepare a little first. We parse the text into subject and predicate, look for modifiers, and so on, as taught at school.

The grown-ups call this “identifying syntactic constructions.” After that, we no longer write rules into the system for translating each word; we manipulate entire structures instead. In theory, we can even achieve a more or less decent conversion of word order between languages.

In practice, it is still hard: linguists are still dying of physical exhaustion, and the translation still comes out literal. On the one hand, it is easier: you can set general rules for agreement by gender and case.

On the other hand, it is harder: there are far more combinations of words than there are words, and not every combination can be accounted for by hand.
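
Here is roughly what "manipulating entire structures" with a general agreement rule might look like. It is a minimal sketch under loud assumptions: the analysis step only handles the fixed pattern determiner + adjective + noun, the English-to-Spanish lexicons are tiny and hand-made, and the names `NounPhrase`, `analyse`, `transfer` and `generate` are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class NounPhrase:
    determiner: str
    adjective: str
    noun: str

# Hypothetical lexicons: nouns carry a gender, other words have one form per gender.
NOUNS = {"house": ("casa", "f"), "car": ("coche", "m")}
DETERMINERS = {"the": {"m": "el", "f": "la"}}
ADJECTIVES = {
    "white": {"m": "blanco", "f": "blanca"},
    "red": {"m": "rojo", "f": "roja"},
}

def analyse(text):
    """Analysis: assume the phrase is exactly determiner + adjective + noun."""
    determiner, adjective, noun = text.lower().split()
    return NounPhrase(determiner, adjective, noun)

def transfer(phrase):
    """Transfer: translate the structure; one general rule handles gender agreement."""
    noun, gender = NOUNS[phrase.noun]
    return NounPhrase(
        determiner=DETERMINERS[phrase.determiner][gender],
        adjective=ADJECTIVES[phrase.adjective][gender],
        noun=noun,
    )

def generate(phrase):
    """Generation: the target language puts the adjective after the noun."""
    return " ".join([phrase.determiner, phrase.noun, phrase.adjective])

def translate(text):
    return generate(transfer(analyse(text)))

FEMININE = translate("the white house")
MASCULINE = translate("the red car")
```

One agreement rule covers every adjective and noun pair in the lexicons, which is the "easier" part. The "harder" part is that every new construction (plurals, two adjectives, a prepositional phrase) needs its own structure and its own rules, and nobody can write them all by hand.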