Can the “huge amounts” of synthetic data in the field improve machine translation?

With the many notable advancements in machine translation (MT) and natural language processing (NLP), it’s no wonder that large and small-scale users now expect each new iteration of MT to measurably surpass its predecessor.

Functionally, MT is getting better and better – thanks in large part to research and to the large datasets freely available for training equally large MT engines. However, domain-specific machine translation (see a recent example) is still a work in progress.

Researchers Yasmin Moslem, Rejwanul Haque, John D. Kelleher, and Andy Way, of the ADAPT Centre at Dublin City University, the National College of Ireland, and Technological University Dublin, set out to tackle this domain-specific problem with an experiment using three different configurations.

In a paper published in August 2022, this group of NLP specialists defined the problem as the scarcity of in-domain data, “common in translation settings due to the lack of specialized datasets and terminology, or the inconsistency and inaccuracy of available in-domain translations.”

The researchers also cited the lack of adequate computing resources and specialized in-house translation memories as part of the problem. Moreover, they consider the process of exploring open data sets “ineffective”.

MT Fine-tuning and HITL Models

Many researchers have already worked on the domain-specific MT problem, as shown in the article’s detailed bibliography. Some of the approaches attempted include using large monolingual datasets, selecting subsets of them, forward-translating those automatically, and then refining the output.

Other approaches have used fuzzy matches from bilingual datasets, followed by further editing and adjustment; or general MT engine training followed by domain-specific fine-tuning.

The researchers present an approach to domain adaptation that still uses pre-trained language models, but adds an augmentation of domain-specific data through back-translation.

The methodology also takes into account certain linguistic characteristics, such as fluency. It uses mixed fine-tuning (i.e. additional MT training on a mixture of general and in-domain data) and oversampling (i.e. repeating samples from the smaller in-domain dataset so it is not drowned out by the larger corpus) to generate what the researchers call “huge amounts of synthetic bilingual in-domain data.”
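As a rough illustration, mixed fine-tuning with oversampling can be sketched in a few lines of Python. The corpora, sizes, and helper name below are invented placeholders for illustration, not the paper’s actual data or code.

```python
import random

def build_mixed_corpus(general, in_domain, seed=0):
    """Oversample the small in-domain set up to the size of the general
    corpus, then shuffle the two together for a single fine-tuning run."""
    reps = -(-len(general) // len(in_domain))       # ceiling division
    oversampled = (in_domain * reps)[:len(general)]  # repeat, then trim
    mixed = general + oversampled
    random.Random(seed).shuffle(mixed)               # deterministic mix
    return mixed

# Toy stand-ins: 1,000 general-domain pairs vs. 50 in-domain pairs.
general = [("general src %d" % i, "general tgt %d" % i) for i in range(1000)]
in_domain = [("covid src %d" % i, "covid tgt %d" % i) for i in range(50)]

mixed = build_mixed_corpus(general, in_domain)
# After oversampling, in-domain pairs make up half of the training mix.
```

The point of the oversampling step is visible in the final corpus: without it, the 50 in-domain pairs would be a negligible fraction of the 1,050 training examples.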

Their approach, the paper reports, produced “significant improvements in the translation quality of the in-domain test set.” The proposed methodology follows these main steps:

  • Text generation with a large language model in the target language to augment domain data;
  • Back-translation to obtain parallel source sentences;
  • Mixed fine-tuning; and
  • Oversampling.
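The first two steps above can be sketched as a small data-generation pipeline. The two functions below are trivial stubs standing in for the large language model and the back-translation MT engine (both hypothetical placeholders, not the paper’s actual models); they only show how synthetic parallel pairs are assembled before the mixing and oversampling stages.

```python
def generate_target_sentences(domain_seeds, n):
    """Step 1: generate new in-domain text in the TARGET language.
    (Stub for the large language model used in the paper.)"""
    return [f"{domain_seeds[i % len(domain_seeds)]} #{i}" for i in range(n)]

def back_translate(target_sentence):
    """Step 2: translate target -> source to obtain the parallel side.
    (Stub for a target-to-source MT engine.)"""
    return f"<bt> {target_sentence}"

seeds = ["Wash your hands regularly.", "Vaccination reduces severe illness."]
targets = generate_target_sentences(seeds, 4)

# Pair each synthetic target sentence with its back-translated source.
synthetic_pairs = [(back_translate(t), t) for t in targets]
# synthetic_pairs then feeds the mixed fine-tuning / oversampling stages.
```

Note that the synthetic target side is the “clean” half of each pair: back-translation introduces noise only on the source side, which is generally considered less harmful to the trained model’s output fluency.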

The researchers used a baseline configuration and two different domain-specific configurations. One domain-specific configuration is based on a small bilingual dataset; the other on a source-language dataset only, using forward translation to generate the target side.

The experiment was conducted using Arabic-English and English-Arabic language pairs, and the domain chosen was public health. The datasets used were from the Open Parallel Corpus (OPUS) and a domain-specific dataset, the Covid-19 Translation Initiative (or TICO-19) dataset.

The results were automatically assessed using spBLEU, a BLEU variant computed on SentencePiece-tokenized text, introduced with a benchmark of 3,001 multi-domain sentences professionally translated into 101 languages.

The results were also linguistically assessed by Dr. Muhammed Yaman Muhaisen, a native Arabic speaker, subject matter expert and ophthalmologist at the Eye Surgery Hospital in Damascus, Syria.

Dr. Muhaisen conducted the bilingual evaluation on a randomly selected sample of 50 sentences from the original test set. He was asked to use a scale on which quality ranged from 1 (unacceptable translation) to 4 (ideal translation).

The results

The fine-tuned, domain-specific models generated translations that “are more idiomatic or better capture the meaning in the context of public health,” the researchers concluded.

Since some expressions can have multiple valid translations, the human evaluation sometimes assigned the same score to different translations. The two domain-specific configurations were nonetheless comparable in translation quality.

Here are some examples (English side only) cited in the article for comparison.

  • “non-pathogenic in their natural host” (baseline) versus “non-pathogenic in their natural reservoir hosts” (in-domain). The in-domain translation was found to be more idiomatically correct in the medical context.
  • “maternities” (baseline) versus “birthing pools” and “birthing baths” (in-domain). The baseline was considered a translation error.
  • “serum tests” (baseline) versus “serological tests” (in-domain). The in-domain translation was considered more idiomatically correct.

The researchers also mentioned that in some cases, only the baseline and one of the in-domain systems produced an accurate translation. An example of this was the Arabic translation of “If you wear a mask,” which was incorrect in one of the three configurations.

What’s next for domain-specific machine translation?

The researchers concluded that more work is needed on the use of terminology for domain-specific data generation, and they propose experimenting with this approach for low-resource languages and multilingual settings. More domain-specific datasets would undoubtedly help further validate the approach.

Citing the work of other researchers, the group also highlighted the role of back-translation as a key part of their approach. They added that a study showed that forward translation can lead to quality improvements, but back-translation yields superior results.

The expert-in-the-loop model also continues to prove its worth in these research endeavors as well as in practical applications. This is the case on both the buyer side and the language service provider (LSP) side. Without the qualified judgment of a domain expert, NLP scientists would not be able to properly assess the results of certain domain-specific MT experiments.

More empirical data will certainly benefit all types of users, but the question remains whether they will have the resources to conduct their own domain-specific machine translation experiments.

The paper on this domain-specific machine translation experiment is among those selected for this year’s Association for Machine Translation in the Americas (AMTA) conference, taking place September 12-16, 2022.