Scaling Low-Resource MT via Synthetic Data Generation with LLMs
Abstract
We investigate the potential of LLM-generated synthetic data for improving low-resource machine translation (MT). Focusing on seven diverse target languages, we construct a document-level synthetic corpus from English Europarl, and extend it via pivoting to 147 additional language pairs. Automatic and human evaluation confirm its high overall quality. We study its practical application by (i) identifying effective training regimes, (ii) comparing our data with the HPLT dataset, and (iii) testing its utility beyond English-centric MT. Finally, we introduce SynOPUS, a public repository for synthetic parallel datasets. Our findings show that LLM-generated synthetic data, even when noisy, can substantially improve MT performance for low-resource languages.
Ona de Gibert Joseph Attieh Teemu Vahtola Mikko Aulamo Zihao Li Raúl Vázquez Tiancheng Hu Jörg Tiedemann University of Helsinki University of Cambridge {first.last}@helsinki.fi, [email protected]
1 Introduction
Machine translation (MT) has achieved remarkable success for high-resource languages, but its application to the vast majority of the world’s languages remains severely hampered by the scarcity of high-quality parallel corpora. Traditional data augmentation techniques like back-translation Sennrich et al. (2016) and pivoting Costa-jussá et al. (2018); Cheng (2019) preserve the human-written target and synthesize the other. The advent of Large Language Models (LLMs) presents a transformative opportunity, as reflected by the growing number of survey papers on the subject Zhou et al. (2024); Ding et al. (2024); Wang et al. (2024); Nadas et al. (2025). LLM-based synthetic data generation, akin to sequence-level knowledge distillation Kim and Rush (2016), opens up the possibility of creating vast amounts of training data even where human-translated resources are virtually non-existent.
This raises the question: Can an MT system trained on LLM-generated data benefit truly low-resource language pairs? To date, there is little to no systematic investigation of (a) generating large-scale synthetic data using LLMs for low-resource languages, (b) evaluating its intrinsic quality, and (c) quantifying its downstream impact when training or fine-tuning modern MT systems. This paper provides such a systematic investigation. We make the following main contributions:
• We use GPT-4o (specifically, the gpt-4o-2024-08-06 model) to generate a document-level synthetic parallel corpus by forward-translating English Europarl Koehn (2005) into seven diverse low-resource languages.
• We assess the corpus quality using both automatic metrics and human evaluation, finding the data to be generally of high quality.
• We comprehensively evaluate the utility of this synthetic data by demonstrating that:
  1) Compact MT models trained from scratch solely on this data achieve strong baseline performance (e.g., 49.49 ChrF for English-Georgian, compared to NLLB’s 48.31).
  2) Fine-tuning pretrained state-of-the-art (SOTA) systems (OPUS-MT, NLLB-200-1.3B, LLaMA-3B) consistently yields substantial improvements (e.g., average gains of +2.95 ChrF for NLLB and +20.63 ChrF for LLaMA-3B).
  3) Our synthetic data is complementary to existing corpora like HPLT, e.g., leading to further ChrF increases of up to +2.79 when combined (for English-Icelandic).
  4) Fine-tuned models that are 10-20 times smaller than state-of-the-art multilingual models perform similarly to or better than their large counterparts.
• We extend our dataset into a multi-parallel corpus via pivoting and, as a case study, we demonstrate that fine-tuning OPUS-MT on it improves translation by +14.78 ChrF for Finnish-Somali and +21.64 ChrF for Somali-Finnish.
• To promote reproducibility and future research, we introduce SynOPUS, a public repository of synthetic parallel corpora. We also publicly release our datasets (https://opus.nlpl.eu/synthetic/Europarl.php; the data is subject to the terms and conditions defined by the usage policies of OpenAI) and code (https://github.com/Helsinki-NLP/low-res-lmt).
Our results show that LLM-generated synthetic data, even when noisy, can train competitive MT models from scratch and consistently improves pretrained systems, especially for the least resourced languages in the resource spectrum. Our work demonstrates a clear path towards open high-quality MT for underrepresented languages, by harnessing widely available high-resource monolingual corpora and powerful LLMs.
2 Related Work
Low-resource MT.
Low-resource MT targets language pairs with little to no parallel data available Haddow et al. (2022). To mitigate the data scarcity problems, two main lines of research have emerged: (a) transfer learning Zoph et al. (2016) and multilingual training Johnson et al. (2017) and (b) data augmentation Xia et al. (2019). First, transfer learning involves using a model trained on a high-resource language as a starting point for training the low-resource language, while multilingual training proposes to train jointly on multiple language pairs to compensate for the lack of text in a specific language. Second, data augmentation proposes to generate synthetic samples to train on, by perturbing, translating or otherwise modifying existing sentences Fadaee et al. (2017). Below, we focus on data augmentation and recent work that uses LLMs to generate such data.
Classical data augmentation.
The most popular approach for low-resource languages is back-translation, which involves translating monolingual target-language data into the source language Ko et al. (2021); Khenglawt et al. (2024). The reverse process, forward translation of source-side monolingual sentences, has also been explored; while less common in MT, it has proved valuable for LLM pretraining. For example, Wang et al. (2025) used NLLB to forward-translate monolingual corpora in nine languages for this purpose. This process also relates closely to sequence-level knowledge distillation Kim and Rush (2016), where a large model is compressed by training a small student model on synthetic data produced by forward-translating with the teacher model Gordon and Duh (2019).
LLM-based data augmentation.
LLMs have opened new avenues for synthetic data generation, driven by their strong performance in low-resource language settings. Several studies assess the translation performance of LLMs: Claude on Yoruba-English Enis and Hopkins (2024), Claude on 13 low-resource languages of Mali Dembele et al. (2025), and GPT-4 on 3 languages Jiao et al. (2023). These efforts encouraged researchers to use LLMs for synthetic data generation. For instance, Oh et al. (2023) explore different prompting strategies to generate synthetic data for German-Korean translation with ChatGPT. Our work is most similar to Yang and Nicolai (2023), where they exploit data generation for MT between German and Galician with ChatGPT. However, the authors generate source synthetic sentences that are later translated, while we use original English sentences as source data, and experiment on more languages.
Gap addressed in this work.
Despite the above advances, there is still no systematic study that produces a fully synthetic multi-parallel corpus with state-of-the-art LLMs for low-resource languages and evaluates that corpus both intrinsically and on downstream MT. We close this gap by extending Europarl, a multilingual resource with alignments across all the official EU languages, into seven low-resource languages and evaluating its quality and usefulness.
3 Dataset Construction
Step | eu | gd | ka | is | mk | so | uk |
---|---|---|---|---|---|---|---|
n. after segmentation | 2 167 164 | 2 192 082 | 2 504 071 | 2 370 036 | 2 054 167 | 2 373 145 | 2 359 720 |
n. after lang. id. | 2 160 061 | 2 182 553 | 2 481 357 | 2 362 411 | 2 044 219 | 2 364 985 | 2 351 562 |
n. aligned sentences | 2 138 713 | 2 164 999 | 2 317 070 | 2 348 030 | 2 027 406 | 2 353 915 | 2 341 706 |
Our goal is to study real-world cases instead of selecting common language pairs in an artificially constructed low-resource scenario. Therefore, we conduct a preliminary experiment to help us select the languages to prioritize (Section 3.1). We then forward-translate the English Europarl corpus (Section 3.2), and, in a final post-processing step, we filter out noise to ensure high-quality translations (Section 3.3). Finally, we expand our dataset via pivoting to all existing languages of Europarl (Section 3.4).
3.1 Language Selection
We start by selecting a small set of low-resource languages for which GPT-4o can produce usable translations. To do so, we begin with a list of 204 European minority languages (derived from https://en.wikipedia.org/wiki/Regional_and_minority_languages_in_Europe) and retain only the 39 languages that are supported by the FLORES-200 benchmark Goyal et al. (2022). For each of the 39 languages, we prompt GPT-4o to produce translations of (i) 100 random samples from FLORES-200 and (ii) 20 five-sentence chunks to simulate paragraph-level translation. We explicitly specify the script of the target language in the prompt, because our initial experiments suggested that GPT-4o occasionally produces translations using a script different from that used in the FLORES dataset; this issue was most prominent in Serbian, which uses both Cyrillic and Latin scripts.
To contextualize GPT-4o’s performance against existing well-performing translation models, we translate the same datasets with EMMA (Ji et al., 2024) in both zero-shot and 3-shot settings, selecting 3 unused examples from the FLORES-200 dataset for the latter. Additionally, we compare the results to the best available OPUS-MT model (Tiedemann et al., 2024) per language, selected from the OPUS-MT Dashboard Tiedemann and De Gibert (2023). We compare the performance of the three systems using ChrF Popović (2015). Table 6 presents the results of the pilot evaluation. We proceed to select seven languages that are of special interest to us: Basque (eu), Scottish Gaelic (gd), Icelandic (is), Georgian (ka), Macedonian (mk), Somali (so), and Ukrainian (uk).
3.2 Synthetic Data Generation
We use the English Europarl Koehn (2005) as the source for generating the synthetic dataset; to the best of our knowledge, the Europarl corpus is not subject to any copyright restrictions. Europarl, which is derived from the proceedings of the European Parliament, offers well-defined document boundaries and is multi-parallel across 21 European languages. We leverage the metadata within the Europarl corpus to segment the data into paragraphs in a way that each generated translation can be matched back to its exact English source, preserving the multi-parallel structure. Paragraphs are sent in bulk to OpenAI’s Batch API (https://platform.openai.com/docs/guides/batch). We instruct the model to generate translations for source-target language pairs using the following prompt:
This is an English to TARGET translation, please provide the TARGET translation to this sentence in SCRIPT script. Do not provide any explanation or text apart from the translation.
For instance, in the English-Ukrainian direction, we set the target language (TARGET) to Ukrainian, and the script information (SCRIPT) to Cyrl.
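As a minimal sketch of how the prompt above can be instantiated per language, the snippet below fills the TARGET and SCRIPT slots with string formatting. The names `PROMPT_TEMPLATE` and `build_prompt` are hypothetical and chosen for illustration; how the filled prompt is packaged into a batch request is not shown.

```python
# Prompt wording copied from the text; {target} and {script} stand in for
# the TARGET and SCRIPT placeholders. Names below are illustrative only.
PROMPT_TEMPLATE = (
    "This is an English to {target} translation, please provide the {target} "
    "translation to this sentence in {script} script. Do not provide any "
    "explanation or text apart from the translation."
)


def build_prompt(target: str, script: str) -> str:
    """Fill the translation prompt for one target language and script."""
    return PROMPT_TEMPLATE.format(target=target, script=script)


if __name__ == "__main__":
    print(build_prompt("Ukrainian", "Cyrl"))
```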
3.3 Data Post-Processing
After translating the data with GPT-4o, we align the translated sequences with the original English sentences to produce parallel datasets. To produce aligned sentence pairs, we must first segment the paragraphs into individual sentences. For every language except Georgian, we use a sentence splitter from the Moses package (Koehn et al., 2007), selecting the language-specific system whenever it exists. Otherwise, we rely on the settings for the closest available language. For Somali we use the fallback to English, which seems to perform reasonably well. For Georgian we apply WtP (Minixhofer et al., 2023) with the sat-3l-sm model for sentence segmentation.
Because the translation process is inherently noisy and LLMs tend to hallucinate, some of the generated output is erroneous. To filter such cases, we apply the HeLI-OTS language identifier (Jauhiainen et al., 2022) to every generated segment after sentence segmentation, and discard those that are not identified as the intended target language. On average, this step removes only about 0.45% of sentences in each language.
Lastly, we align the cleaned target sentences with their English counterparts using the Yasa alignment tool (Lamraoui and Langlais, 2013), while preserving document boundaries so each sentence can be matched to its document and sentence identifiers. The resulting corpus contains roughly 2.0–2.35 million aligned sentences per language. The data statistics are presented in Table 1.
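The average removal rate quoted above can be reproduced from the sentence counts in Table 1. The short sketch below recomputes the per-language share of segments dropped by language identification and their unweighted mean; the variable names are ours.

```python
# Sentence counts copied from Table 1 (before and after language identification).
after_segmentation = {
    "eu": 2_167_164, "gd": 2_192_082, "ka": 2_504_071, "is": 2_370_036,
    "mk": 2_054_167, "so": 2_373_145, "uk": 2_359_720,
}
after_lang_id = {
    "eu": 2_160_061, "gd": 2_182_553, "ka": 2_481_357, "is": 2_362_411,
    "mk": 2_044_219, "so": 2_364_985, "uk": 2_351_562,
}

# Percentage of segments removed by the language-identification filter.
removed_pct = {
    lang: 100 * (after_segmentation[lang] - after_lang_id[lang]) / after_segmentation[lang]
    for lang in after_segmentation
}
mean_removed_pct = sum(removed_pct.values()) / len(removed_pct)

if __name__ == "__main__":
    for lang, pct in removed_pct.items():
        print(f"{lang}: {pct:.2f}%")
    print(f"mean: {mean_removed_pct:.2f}%")
```

Georgian has the highest removal rate (about 0.9%), while the other languages stay below 0.5%.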



3.4 MultiEuroparl: a Multi-Parallel Document-Level Corpus
An important design decision in our experiments was the focus on an inherently multi-parallel dataset. Since all languages are aligned through English, we can use it as a pivot to project synthetic translations onto existing alignments.
To achieve that, we preserve sentence, paragraph, and document IDs from the original dataset during translation. Then, translated paragraphs are sentence-aligned to their input paragraphs, and the alignments of English sentences to the other existing languages in the original Europarl corpus are retrieved from OPUS. A minor complication is that sentence alignment is not one-to-one in all cases. We expand alignments by including neighboring sentence pairs until we get a match. In the worst case, this would cover the entire paragraph but, fortunately, the data is well-behaved and also aligns cleanly across language pairs.
Using the procedures above, we add 147 new language pairs to the original Europarl corpus while keeping document information. All of them are now available as training data for non-English-centric MT, a valuable resource that comes for "free" due to the multilinguality and metadata of the source data. We study the usefulness of this data in Section 5.4.
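One way to picture the expansion step is as a two-pointer merge over two alignments that share the same English side. The sketch below is an illustration under our own assumptions, not the released implementation: each alignment is a list of beads `(english_ids, other_ids)`, beads are ordered and contiguous, both alignments cover the same English sentences, and every bead contains at least one English sentence. The function name `merge_alignments` is hypothetical.

```python
def merge_alignments(en_to_a, en_to_b):
    """Pivot two English-anchored alignments into an A-B alignment.

    Each input is a list of beads (english_ids, other_ids). Neighbouring
    beads are merged until both sides end on the same English sentence.
    """
    merged = []
    i = j = 0
    while i < len(en_to_a) and j < len(en_to_b):
        en_a, ids_a = list(en_to_a[i][0]), list(en_to_a[i][1])
        en_b, ids_b = list(en_to_b[j][0]), list(en_to_b[j][1])
        # Expand whichever side currently ends earlier in the English text.
        while en_a[-1] != en_b[-1]:
            if en_a[-1] < en_b[-1]:
                i += 1
                en_a += en_to_a[i][0]
                ids_a += en_to_a[i][1]
            else:
                j += 1
                en_b += en_to_b[j][0]
                ids_b += en_to_b[j][1]
        merged.append((ids_a, ids_b))
        i += 1
        j += 1
    return merged


if __name__ == "__main__":
    # Toy paragraph with four English sentences (ids 0..3).
    en_so = [([0], [0]), ([1, 2], [1]), ([3], [2, 3])]
    en_fi = [([0, 1], [0]), ([2], [1]), ([3], [2])]
    print(merge_alignments(en_so, en_fi))
```

In the toy example, the first three English sentences have to be grouped before both alignments agree on a boundary, so the first output bead pairs two Somali segments with two Finnish segments.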
4 Dataset Quality Analysis
In this section, we delve into the quality of the generated low-resource data. We first conduct a quantitative analysis (Section 4.1) producing numerical scores for each sentence pair and then proceed to ask native speakers of the target languages to rate a subsample of the dataset (Section 4.2). Finally, we compute inter-annotator agreement scores and correlation metrics.
4.1 Quantitative Analysis
To evaluate the quality of the generated parallel dataset, we compute two neural metrics at the segment level: Bicleaner-AI Zaragoza-Bernabeu et al. (2022), for which we use the bitextor/bicleaner-ai-full-large-en-xx model, and CometKiwi Rei et al. (2022). These two metrics are optimized for different tasks and therefore behave differently: Bicleaner-AI is a binary classifier trained to determine whether two sentences are valid translations of each other. In contrast, CometKiwi is a reference-free Quality Estimation (QE) metric based on COMET Rei et al. (2020), trained to predict human judgment scores (on a 0–100 scale, normalized to 0–1) for machine-translated sentences.
Figures 1(a) and 1(b) show the distribution of Bicleaner-AI and CometKiwi scores per language pair. Looking at the Bicleaner-AI scores, we observe that over 92% of the sentences in Ukrainian and Macedonian fall in the highest bin, while over 12% of the sentences for Somali, Georgian and Scottish Gaelic fall in the lowest bin. Although the general trend of CometKiwi is similar to that of Bicleaner-AI, the scores must be interpreted differently, since CometKiwi estimates the actual translation quality of the generated sentences. More than 85% of the sentences per language are in the top quality bins, and the lower-quality sentences are concentrated in Scottish Gaelic, Somali, and Georgian. However, CometKiwi has not been explicitly trained on any of the languages in our dataset, even though its underlying model, XLM-R Conneau et al. (2020), includes them. The fine-tuning for QE was conducted using data from the WMT General Shared Tasks (2017–2020). As such, these results are zero-shot and should be interpreted with caution.
4.2 Human Evaluation
To further assess the quality of our synthetic dataset, we conduct a human evaluation for five languages (we were, unfortunately, unable to find annotators for Scottish Gaelic and Somali for this submission). For each language pair, we randomly sample 100 sentences and ask native speakers to evaluate them. Each language pair was scored by 1–3 annotators. Following the Direct Assessment (DA) protocol Graham et al. (2013) from the WMT2017 Shared Task Bojar et al. (2017), annotators were shown the source sentence and its translation, and were asked to assign a score on a 0–100 scale, using the guidelines provided (Appendix B). All ratings were collected through a custom web interface built with Gradio Abid et al. (2019).
We measure Inter-Annotator Agreement (IAA) using Krippendorff’s alpha (interval level) and compute the z-scored version by normalizing each annotator’s scores to account for differences in scoring behavior. Table 2 shows the details of the human evaluation. The IAA indicates moderate consistency among annotators.
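The per-annotator normalization can be sketched as follows. This is an illustrative snippet rather than the evaluation code: it assumes the population standard deviation is used, and the function name `z_normalize` is ours. Krippendorff’s alpha itself is not computed here.

```python
from statistics import mean, pstdev


def z_normalize(scores):
    """Map one annotator's raw 0-100 scores to z-scores."""
    mu = mean(scores)
    sigma = pstdev(scores)  # assumption: population standard deviation
    if sigma == 0:
        # An annotator who gives a constant score carries no ranking signal.
        return [0.0 for _ in scores]
    return [(s - mu) / sigma for s in scores]


if __name__ == "__main__":
    lenient = [80, 90, 100]  # toy scores from a generous annotator
    strict = [40, 50, 60]    # same ranking, harsher scale
    print(z_normalize(lenient))
    print(z_normalize(strict))
```

Both toy annotators end up with the same z-scores, which is the point of the normalization: agreement is then measured on relative judgments rather than on each annotator’s personal use of the scale.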
Figure 1(c) presents the results of our human evaluation. Consistent with the findings in the previous section, Macedonian and Ukrainian stand out with high-quality outputs, where human annotators rated over 92% of the data within the 80–100 score range. Icelandic and Basque exhibited more variability, with approximately 40% of data rated as good quality (scores between 60 and 100) and around 30% considered acceptable (scores between 40 and 60). In contrast, Georgian data was considerably lower, with about 40% judged by annotators as being of unacceptable quality (scores below 40).
Correlation scores:
We report Spearman’s rank correlation (ρ) between the automatic metrics (Bicleaner-AI and CometKiwi) and human judgments in Table 2. Overall, the correlations with Bicleaner-AI are weak across most language pairs (ρ ≤ 0.27 for every pair except Georgian), suggesting limited alignment. Georgian shows a relatively higher correlation (ρ = 0.39), likely due to the greater variance in human scores for this language. We observe that Bicleaner-AI tends to assign lower scores to samples that received very high ratings from human annotators, indicating a potential underestimation of high-quality translations. In contrast, correlations for CometKiwi are at least as high for every pair, and markedly higher for Basque, Icelandic and Georgian. This is expected, as CometKiwi is designed to evaluate translation quality (a task more closely aligned with human judgments).
Pair | Ann. Count | z-IAA | ρ (Bicleaner-AI) | ρ (CometKiwi) |
---|---|---|---|---|
en-eu | 3 | 0.49 | 0.25 | 0.43 |
en-is | 2 | 0.53 | 0.27 | 0.43 |
en-ka | 1 | - | 0.39 | 0.64 |
en-mk | 1 | - | 0.21 | 0.21 |
en-uk | 2 | 0.39 | 0.15 | 0.22 |
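For readers unfamiliar with the statistic in Table 2, the following dependency-free sketch computes Spearman’s rank correlation as the Pearson correlation of average ranks. The scores in the example are made-up toy values, not data from our evaluation, and the function names are ours.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = avg
        pos = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long, non-constant sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on ranks."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    human = [95, 80, 60, 30, 10]          # toy DA scores
    metric = [0.90, 0.85, 0.50, 0.60, 0.10]  # toy metric scores
    print(round(spearman(human, metric), 3))
```

Because only ranks matter, the statistic is insensitive to the different scales of the metrics (0–1) and the human scores (0–100).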
Overall, we can conclude that the generated dataset is of high quality, with both automatic and human metrics indicating that most sentence pairs are of good to excellent quality, particularly for Ukrainian and Macedonian. Lower performance is observed for Georgian, Somali, and Scottish Gaelic. This is consistent with the GPT-4o pilot evaluation that we conducted on FLORES-200, where GPT-4o performs best in terms of translation accuracy on Ukrainian and Macedonian, and worse for the rest of the languages (see Table 6).
Model | en-eu | en-gd | en-is | en-ka | en-mk | en-so | en-uk | Params |
---|---|---|---|---|---|---|---|---|
Synthetic | 53.00 | 51.10 | 49.91 | 49.49 | 57.72 | 45.10 | 51.71 | 60.6M |
OPUS-MT | 54.99 | 41.60 | 51.97 | 42.69 | 64.45 | 44.20 | 60.14 | 191.6M |
OPUS-MT-ft | 55.68 | 52.07 | 53.80 | 50.16 | 61.99 | 46.35 | 56.98 | |
Δ | +0.69* | +10.47* | +1.83* | +7.47* | -2.46* | +2.15* | -3.16* | |
NLLB | 52.05 | 49.94 | 47.98 | 48.31 | 60.13 | 45.90 | 54.44 | 1.3B |
NLLB-ft | 56.32 | 51.81 | 52.93 | 52.75 | 62.32 | 46.14 | 57.13 | |
Δ | +4.27* | +1.87* | +4.95* | +4.44* | +2.19* | +0.24 | +2.69* | |
LLaMA | 29.25 | 26.56 | 22.66 | 13.17 | 26.58 | 22.76 | 30.24 | 3B |
LLaMA-ft | 49.85 | 47.01 | 46.12 | 25.06 | 55.60 | 42.31 | 49.68 | |
Δ | +20.59* | +20.45* | +23.46* | +11.89* | +29.02* | +19.55* | +19.44* | |
5 Leveraging our Synthetic Data for MT
We evaluate the quality of our synthetic data by analyzing model performance both before and after fine-tuning across multiple architectures (Sections 5.1 and 5.2), serving as a proxy for data quality. Furthermore, we compare our dataset to a web-crawled SOTA corpus (Section 5.3) and study the usefulness of MultiEuroparl (Section 5.4).
5.1 Experimental Setup
Data
We focus on the translation direction from English into the low-resource language, as this is typically the more challenging scenario. For all experiments we use the synthetic data as training set, and the FLORES-200 Goyal et al. (2022) development and test sets for model selection and evaluation, respectively.
Models
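Since the languages that we are dealing with are not linguistically similar, we train individual models for each target language. We experiment with the following models (more details are provided in Appendix C): a compact transformer-base model trained from scratch on the synthetic data (Synthetic), OPUS-MT, NLLB-200-1.3B, and LLaMA-3B.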
All models are run on four 32 GB NVIDIA Volta V100 GPUs and take less than 9 hours to train.
Evaluation
We evaluate all models before and after fine-tuning. We report ChrF (Popović, 2015) as our main automatic metric, as it has become the standard metric for low-resource MT and has been shown to correlate more closely with human judgments than BLEU (Papineni et al., 2002). We report COMET (Rei et al., 2020) in Appendix E.
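To make the metric concrete, the sketch below implements a simplified sentence-level character n-gram F-score in the spirit of ChrF. It is an illustration only and will not reproduce the scores of standard toolkits exactly: we assume n-grams up to order 6, β = 2, whitespace removed before n-gram extraction, and plain averaging of precision and recall over orders.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams after removing spaces."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified character n-gram F-beta score on a 0-100 scale."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # text too short to contain n-grams of this order
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return 100 * (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


if __name__ == "__main__":
    print(chrf("the cat sat", "the cat sits"))
```

Working at the character level gives partial credit for near-miss word forms, which is one reason the metric is popular for morphologically rich, low-resource languages.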
5.2 Overall Results and Analysis
Table 3 summarizes the ChrF scores across three experimental conditions: off-the-shelf inference, fine-tuned training, and their performance differentials. We assessed the statistical significance of all the differences using paired Student’s t-tests and paired bootstrap resampling (5000 iterations at 95% confidence).
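Paired bootstrap resampling can be sketched as below. This is a simplified illustration, not our evaluation script: it assumes segment-level scores that are summed per resample, whereas a corpus-level metric would be recomputed on each resampled test set. The function name, the toy scores and the fixed seed are ours.

```python
import random


def paired_bootstrap(scores_a, scores_b, iterations=5000, seed=0):
    """Fraction of resampled test sets on which system B beats system A.

    scores_a and scores_b hold segment-level scores for the same test
    segments, in the same order.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems must be scored on the same segments")
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(iterations):
        # Draw n segment indices with replacement and reuse them for both systems.
        sample = [rng.randrange(n) for _ in range(n)]
        if sum(scores_b[i] for i in sample) > sum(scores_a[i] for i in sample):
            wins += 1
    return wins / iterations


if __name__ == "__main__":
    baseline = [50.0, 52.0, 48.0, 51.0]    # toy segment scores
    fine_tuned = [55.0, 57.0, 53.0, 56.0]  # toy segment scores
    print(paired_bootstrap(baseline, fine_tuned))
```

Under the usual convention, a difference is called significant at 95% confidence when B wins in at least 95% of the resamples.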
Effectiveness for Training from Scratch
The 60.6M-parameter baseline, trained exclusively on our dataset, surpasses the out-of-the-box performance of the billion-parameter NLLB for Basque, Scottish Gaelic, Icelandic and Georgian, comes within one ChrF point of it for Somali, and outperforms off-the-shelf LLaMA for all seven languages. This shows that our corpus is rich enough to train functional MT systems without any external pretraining or multilingual transfer.
Impact on Fine-Tuning Pretrained Models
Fine-tuning consistently improves NLLB and LLaMA, confirming that the synthetic data is well-suited for adaptation. OPUS-MT also benefits from fine-tuning in five of seven cases; however, performance drops for Macedonian and Ukrainian, the two highest-resource low-resource pairs in our set. This suggests that when a model has already been trained on enough real parallel data, further fine-tuning makes it fit too closely to the synthetic examples.
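The average gains can be recomputed directly from the ChrF scores in Table 3. The snippet below does this for each pretrained system; the dictionary layout and names are ours, and the language order is eu, gd, is, ka, mk, so, uk.

```python
# (before fine-tuning, after fine-tuning) ChrF scores copied from Table 3.
table3 = {
    "OPUS-MT": (
        [54.99, 41.60, 51.97, 42.69, 64.45, 44.20, 60.14],
        [55.68, 52.07, 53.80, 50.16, 61.99, 46.35, 56.98],
    ),
    "NLLB": (
        [52.05, 49.94, 47.98, 48.31, 60.13, 45.90, 54.44],
        [56.32, 51.81, 52.93, 52.75, 62.32, 46.14, 57.13],
    ),
    "LLaMA": (
        [29.25, 26.56, 22.66, 13.17, 26.58, 22.76, 30.24],
        [49.85, 47.01, 46.12, 25.06, 55.60, 42.31, 49.68],
    ),
}


def mean_gain(before, after):
    """Average ChrF change across languages after fine-tuning."""
    return sum(a - b for b, a in zip(before, after)) / len(before)


gains = {model: mean_gain(before, after) for model, (before, after) in table3.items()}

if __name__ == "__main__":
    for model, gain in gains.items():
        print(f"{model}: {gain:+.2f} ChrF")
```

The NLLB and LLaMA averages match the +2.95 and +20.63 ChrF figures quoted in the introduction; the OPUS-MT average is pulled down by the two negative deltas for Macedonian and Ukrainian.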
Quality versus Usefulness
Based on the results from the previous section, we observe high quality for Ukrainian and Macedonian, medium quality for Basque and Icelandic, and noticeably lower quality for the rest. Yet, noisy does not mean useless. In fact, Table 3 shows that the languages with the noisiest synthetic corpora also see some of the largest downstream gains. When the alternative is no data at all, quantity is better than quality. However, for mid-resource languages such as Macedonian and Ukrainian, where cleaner text is already available, additional synthetic data offers diminishing returns and can even hurt the performance of these systems. The lower the resource level, the more tolerant MT training is to noise.
Challenges with General Purpose LLMs
LLaMA initially struggles (13–30 ChrF), reflecting its lack of inherent translation capability for low-resource languages. While adapter training yields substantial improvements (+12–29 ChrF), the model still underperforms smaller, translation-specific models. This indicates that while synthetic data allows for adaptation, it cannot fully compensate for mismatches between the model architecture or pretraining objective and the translation task itself. In LLaMA’s case, the 3B parameter scale appears unnecessarily large for this specific MT task, and leads to unnecessarily long fine-tuning times.
A practical recipe for exploiting fully–synthetic low-resource data
These experiments point to a clear best practice:
1. Generate in bulk for truly low-resource languages. Prioritize volume over perfection, as even noisy data drives significant gains when no alternatives exist.
2.
3. Avoid general-purpose LLMs for low-resource MT. Despite LLaMA’s large gains, its high computational cost and inferior translation performance confirm that translation-specific encoder-decoder models leverage synthetic data more effectively.
5.3 Comparison with HPLT v2
To further assess our dataset’s utility, we conduct comparative experiments against HPLT v2 Burchell et al. (2025), a "real" parallel corpus derived from web sources (Internet Archive, https://archive.org/, and Common Crawl, https://commoncrawl.org/).
We train systems for three out of the four overlapping language pairs (English paired with Basque, Icelandic and Macedonian). This decision was motivated by the significant variation in dataset sizes (Table 8): Ukrainian has approximately ten times more data than the others, so we leave it out. We train three models: (1) the same synthetic baseline as described earlier (Synthetic), (2) a model trained on the HPLT dataset (HPLT), and (3) a model trained on the concatenation of our synthetic dataset and HPLT (HPLT + Synthetic). All models are evaluated using ChrF on the same test set described previously. Table 4 reports the detailed ChrF scores.
Training Data | en-eu | en-is | en-mk |
---|---|---|---|
Synthetic | 53.00 | 49.91 | 57.72 |
HPLT | 54.63 | 50.60 | 62.09 |
Δ | +1.63* | +0.69* | +4.37* |
HPLT | 54.63 | 50.60 | 62.09 |
HPLT + Synthetic | 56.20 | 53.39 | 62.92 |
Δ | +1.57* | +2.79* | +0.83* |
Comparable Performance
The models trained on our synthetic data alone perform in the same ballpark as the ones trained on the HPLT dataset, with an average difference of 2.23 ChrF points. The largest performance difference is observed for Macedonian, following a similar pattern as our previous experiments. These results suggest that our synthetic dataset is of sufficiently high quality to rival real-world parallel corpora, even when models are trained from scratch.
Complementary when Combined
Adding our corpus to HPLT yields the best overall performance across all three language pairs, with significant improvements. This supports the effectiveness of our synthetic data in low-resource MT. The consistent increases suggest that our data introduces useful diversity and complements the HPLT dataset, as it represents previously unseen material.
Combined beats Transfer Learning
If we compare these results with Table 3, we can observe that training on the combined HPLT and synthetic datasets not only nearly matches the performance of the fine-tuned NLLB for Basque, but surpasses it for both Icelandic and Macedonian, even though the model trained from scratch is 21.6 times smaller. This highlights the power of data augmentation: enriching real-world corpora with high-quality synthetic data can outperform SOTA transfer learning approaches in low-resource settings.
5.4 Beyond English-Centric MT: The Finnish Use Case
To explore the multilingual potential of our expanded dataset via pivoting (Section 3.4) and move beyond English-centric translation, we train MT models for two additional language pairs: Finnish-Somali and Finnish-Ukrainian. These languages were selected due to their prominence in Finland’s linguistic landscape, where Ukrainian and Somali are among the most widely spoken foreign languages, accounting for approximately 0.7% and 0.5% of the population, respectively Official Statistics of Finland (2024) (OSF). Developing high-quality models for these pairs is therefore both practical and relevant.
Our setup for this experiment is similar to the ones above. We first evaluate the out-of-the-box capabilities of OPUS-MT and NLLB. Next, we train a synthetic baseline (transformer-base), and finally, we fine-tune OPUS-MT and NLLB. We exclude LLaMA fine-tuning from this stage, as previous results have shown that it consistently underperforms. Table 5 reports the results.
Usefulness of Pivoted Data
It is important to note that our synthetic baselines are weaker here than in previous experiments, providing greater headroom for fine-tuning improvements. OPUS-MT obtains clear gains from fine-tuning on synthetic data for the low-resource Finnish-Somali pair (+14.78 ChrF for fi-so and +21.64 ChrF for so-fi). However, for Finnish-Ukrainian, fine-tuning slightly lowers its scores. NLLB, which already exhibits a strong baseline, sees consistent gains across all directions. Overall, these results highlight the utility of synthetic data, particularly for low-resource language pairs.
Model | fi-so | so-fi | fi-uk | uk-fi |
---|---|---|---|---|
Synthetic | 38.72 | 35.87 | 42.01 | 46.01 |
OPUS-MT | 26.31 | 15.86 | 50.56 | 55.24 |
OPUS-MT-ft | 41.09 | 37.50 | 49.36 | 53.46 |
Δ | +14.78* | +21.64* | -1.21* | -1.78* |
NLLB | 40.33 | 39.66 | 45.59 | 47.78 |
NLLB-ft | 42.26 | 42.31 | 47.61 | 51.62 |
Δ | +1.93* | +2.65* | +2.02* | +3.84* |
6 SynOPUS: a New Synthetic Parallel Corpus Repository
The increasing adoption of LLMs for generating synthetic data underscores the growing need to systematically organize synthetic datasets. Although it is well known that many parallel datasets already contain MT content Thompson et al. (2024), when synthetic data is intentionally produced, especially when this involves significant financial or computational resources, proper archiving is paramount for promoting reuse, ensuring transparency, and maximizing resource utility. Therefore, with the release of our dataset, we introduce SynOPUS (https://github.com/Helsinki-NLP/synOPUS), a new repository for parallel synthetic datasets, i.e., data that has been (partially) generated by translating text into other languages using MT systems or LLMs. We invite the community to contribute their own datasets.
7 Conclusions
In this work, we thoroughly studied the quality and usefulness of LLM-generated synthetic data for low-resource MT. We presented a new document-level synthetic corpus built by forward-translating Europarl, a parliamentary corpus, with GPT-4o. Then, we evaluated the resulting dataset both quantitatively and through human evaluation. Furthermore, we investigated the usefulness of this dataset for low-resource MT by: (i) identifying the most effective strategy for training, (ii) comparing our dataset with the public HPLT dataset, and (iii) extending our analysis beyond English-centric MT by generating a multi-parallel corpus via pivoting through English.
Our study highlights a crucial and often overlooked opportunity: the ability to create valuable parallel resources for low-resource MT by leveraging widely available high-resource monolingual data. This challenges the traditional reliance on scarce real target-language data for data augmentation approaches, and opens new directions for scalable MT development.
For future work, we aim to explore optimal methods for combining real and synthetic data, to extend our experiments to the document level, and to investigate the use of synthetic data for monolingual LLM pretraining.
Limitations
Domain Bias
First and foremost, because our source data is Europarl, a corpus compiled from parliamentary proceedings, the presented dataset belongs to a very specific domain. This implies that our models may suffer from domain bias and that any system trained on this data may not generalize well to informal, conversational, or domain-specific language, where linguistic style, vocabulary, and discourse structure differ significantly.
Language Coverage
While we focus on seven diverse languages (varied language families, linguistic typologies, and scripts), our approach relies on GPT-4o’s ability to produce a given language reasonably well. Even though we rely on the results of our pilot study (shown in Table 6), our method may not transfer well to other languages. Determining where the threshold lies, that is, how well a language must be supported for GPT-generated data to be viable, remains an open question.
Human Evaluation Scope
We aim to provide enough pointers to evaluate the quality of our dataset both numerically and qualitatively. However, our human evaluation is limited to 100 samples per language pair due to a lack of resources. Furthermore, we use Direct Assessment (DA), a widely accepted but increasingly outdated method. More recent evaluation approaches, such as Error Span Annotation (ESA) Kocmi et al. (2024), offer more fine-grained insights into translation errors, but were beyond our reach for this study.
Ethical Considerations
Hallucinated Content
LLMs are known to generate hallucinated content, i.e., outputs that are fluent and well-formed but factually incorrect or irrelevant Vázquez et al. (2025). This phenomenon is a risk in itself, as it can introduce noise and propagate misinformation in downstream MT models. We observed that the model sometimes disregards the input and instead generates a response similar to “You have been trained on data up until October 2023” in the target language. This issue is most prevalent in Georgian, with around 9,000 cases, and Ukrainian, with approximately 4,000 cases. For these languages, we removed each line containing the string “2023”. While such hallucinations appear to be an intrinsic limitation of current LLMs, they highlight the need for careful post-processing and validation when using synthetic data.
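The string-based filter described above can be sketched as follows. This is a minimal illustration, not the released preprocessing script: the function name `drop_marked_pairs` is hypothetical, and we assume that the check is applied to the target side of each sentence pair and that the aligned source line is dropped together with it.

```python
from typing import Iterable, Iterator, Tuple

# Marker string taken from the text; everything else here is an assumption.
HALLUCINATION_MARKER = "2023"


def drop_marked_pairs(
    pairs: Iterable[Tuple[str, str]],
    marker: str = HALLUCINATION_MARKER,
) -> Iterator[Tuple[str, str]]:
    """Yield only the (source, target) pairs whose target lacks the marker.

    Dropping the whole pair keeps the two sides of the corpus aligned.
    """
    for source, target in pairs:
        if marker not in target:
            yield source, target


if __name__ == "__main__":
    corpus = [
        ("Resumption of the session", "Відновлення сесії"),
        ("I declare the session open.",
         "You have been trained on data up until October 2023"),
    ]
    for source, target in drop_marked_pairs(corpus):
        print(source, "|||", target)
```

A substring match of this kind is deliberately blunt: it also discards any legitimate sentence that happens to contain the marker, trading some recall for simplicity.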
Data Contamination
Due to the closed-source nature of GPT-4o, there is a risk of data contamination, since the model may have already seen our test set (FLORES-200) during pretraining. Recent studies Mansurov et al. (2024) have shown that distilled data may inherit intrinsic biases from the teacher model, which may affect benchmark results. While this risk is inherent to most widely used test sets and cannot be fully controlled, we acknowledge it here for transparency.
Reproducibility
Since our synthetic data is generated using a closed-source LLM, the exact reproduction of our work is not possible. To mitigate this, we publicly release the generated dataset along with all preprocessing scripts and training code.
ChatGPT was used to assist code development for this project.
References
- Abid et al. (2019) Abubakar Abid, Ali Abdalla, Ali Abid, Dawood Khan, Abdulrahman Alfozan, and James Zou. 2019. Gradio: Hassle-free sharing and testing of ml models in the wild. arXiv preprint arXiv:1906.02569.
- Bojar et al. (2017) Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Yvette Graham, Barry Haddow, Shujian Huang, Matthias Huck, Philipp Koehn, Qun Liu, Varvara Logacheva, et al. 2017. Findings of the 2017 conference on machine translation (wmt17). In Proceedings of the Second Conference on Machine Translation, pages 169–214.
- Burchell et al. (2025) Laurie Burchell, Ona de Gibert, Nikolay Arefyev, Mikko Aulamo, Marta Bañón, Pinzhen Chen, Mariia Fedorova, Liane Guillou, Barry Haddow, Jan Hajič, Jindřich Helcl, Erik Henriksson, Mateusz Klimaszewski, Ville Komulainen, Andrey Kutuzov, Joona Kytöniemi, Veronika Laippala, Petter Mæhlum, Bhavitvya Malik, Farrokh Mehryary, Vladislav Mikhailov, Nikita Moghe, Amanda Myntti, Dayyán O’Brien, Stephan Oepen, Proyag Pal, Jousia Piha, Sampo Pyysalo, Gema Ramírez-Sánchez, David Samuel, Pavel Stepachev, Jörg Tiedemann, Dušan Variš, Tereza Vojtěchová, and Jaume Zaragoza-Bernabeu. 2025. An expanded massive multilingual dataset for high-performance language technologies. Preprint, arXiv:2503.10267.
- Cheng (2019) Yong Cheng. 2019. Joint training for pivot-based neural machine translation. In Joint training for neural machine translation, pages 41–54. Springer.
- Conneau et al. (2020) Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Édouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8440–8451.
- Costa-jussá et al. (2018) Marta R Costa-jussá, Noé Casas, and Maite Melero. 2018. English-catalan neural machine translation in the biomedical domain through the cascade approach. arXiv preprint arXiv:1803.07139.
- Daniel Han and team (2023) Daniel Han, Michael Han, and Unsloth team. 2023. Unsloth.
- De Gibert et al. (2025) Ona De Gibert, Robert Pugh, Ali Marashian, Raul Vazquez, Abteen Ebrahimi, Pavel Denisov, Enora Rice, Edward Gow-Smith, Juan Prieto, Melissa Robles, Rubén Manrique, Oscar Moreno, Angel Lino, Rolando Coto-Solano, Aldo Alvarez, Marvin Agüero-Torales, John E. Ortega, Luis Chiruzzo, Arturo Oncevay, Shruti Rijhwani, Katharina Von Der Wense, and Manuel Mager. 2025. Findings of the AmericasNLP 2025 shared tasks on machine translation, creation of educational material, and translation metrics for indigenous languages of the Americas. In Proceedings of the Fifth Workshop on NLP for Indigenous Languages of the Americas (AmericasNLP), pages 134–152, Albuquerque, New Mexico. Association for Computational Linguistics.
- Dembele et al. (2025) Alou Dembele, Nouhoum Souleymane Coulibaly, and Michael Leventhal. 2025. The serendipity of claude ai: Case of the 13 low-resource national languages of mali. arXiv preprint arXiv:2503.03380.
- Ding et al. (2024) Bosheng Ding, Chengwei Qin, Ruochen Zhao, Tianze Luo, Xinze Li, Guizhen Chen, Wenhan Xia, Junjie Hu, Anh Tuan Luu, and Shafiq Joty. 2024. Data augmentation using large language models: Data perspectives, learning paradigms and challenges. arXiv preprint arXiv:2403.02990.
- Enis and Hopkins (2024) Maxim Enis and Mark Hopkins. 2024. From llm to nmt: Advancing low-resource machine translation with claude. arXiv preprint arXiv:2404.13813.
- Fadaee et al. (2017) Marzieh Fadaee, Arianna Bisazza, and Christof Monz. 2017. Data augmentation for low-resource neural machine translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 567–573, Vancouver, Canada. Association for Computational Linguistics.
- Gordon and Duh (2019) Mitchell A Gordon and Kevin Duh. 2019. Explaining sequence-level knowledge distillation as data-augmentation for neural machine translation. arXiv preprint arXiv:1912.03334.
- Goyal et al. (2022) Naman Goyal, Cynthia Gao, Vishrav Chaudhary, Peng-Jen Chen, Guillaume Wenzek, Da Ju, Sanjana Krishnan, Marc’Aurelio Ranzato, Francisco Guzmán, and Angela Fan. 2022. The flores-101 evaluation benchmark for low-resource and multilingual machine translation. Transactions of the Association for Computational Linguistics, 10:522–538.
- Graham et al. (2013) Yvette Graham, Timothy Baldwin, Alistair Moffat, and Justin Zobel. 2013. Continuous measurement scales in human evaluation of machine translation. In Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse, pages 33–41.
- Grattafiori et al. (2024) Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. 2024. The llama 3 herd of models. Preprint, arXiv:2407.21783.
- Haddow et al. (2022) Barry Haddow, Rachel Bawden, Antonio Valerio Miceli Barone, Jindřich Helcl, and Alexandra Birch. 2022. Survey of low-resource machine translation. Computational Linguistics, 48(3):673–732.
- Hu et al. (2022) Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. 2022. Lora: Low-rank adaptation of large language models. ICLR, 1(2):3.
- Jauhiainen et al. (2022) Tommi Jauhiainen, Heidi Jauhiainen, and Krister Lindén. 2022. HeLI-OTS, off-the-shelf language identifier for text. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 3912–3922, Marseille, France. European Language Resources Association.
- Ji et al. (2024) Shaoxiong Ji, Zihao Li, Indraneil Paul, Jaakko Paavola, Peiqin Lin, Pinzhen Chen, Dayyán O’Brien, Hengyu Luo, Hinrich Schütze, Jörg Tiedemann, et al. 2024. Emma-500: Enhancing massively multilingual adaptation of large language models. arXiv preprint arXiv:2409.17892.
- Jiao et al. (2023) Wenxiang Jiao, Wenxuan Wang, Jen-tse Huang, Xing Wang, Shuming Shi, and Zhaopeng Tu. 2023. Is chatgpt a good translator? yes with gpt-4 as the engine. arXiv preprint arXiv:2301.08745.
- Johnson et al. (2017) Melvin Johnson, Mike Schuster, Quoc V Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, et al. 2017. Google’s multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics, 5:339–351.
- Junczys-Dowmunt et al. (2018) Marcin Junczys-Dowmunt, Kenneth Heafield, Hieu Hoang, Roman Grundkiewicz, and Anthony Aue. 2018. Marian: Cost-effective high-quality neural machine translation in c++. In Proceedings of the 2nd Workshop on Neural Machine Translation and Generation, pages 129–135.
- Khenglawt et al. (2024) Vanlalmuansangi Khenglawt, Sahinur Rahman Laskar, Partha Pakray, and Ajoy Kumar Khan. 2024. Addressing data scarcity issue for english–mizo neural machine translation using data augmentation and language model. Journal of Intelligent & Fuzzy Systems, 46(3):6313–6323.
- Kim and Rush (2016) Yoon Kim and Alexander M Rush. 2016. Sequence-level knowledge distillation. In Proceedings of the 2016 conference on empirical methods in natural language processing, pages 1317–1327.
- Ko et al. (2021) Wei-Jen Ko, Ahmed El-Kishky, Adithya Renduchintala, Vishrav Chaudhary, Naman Goyal, Francisco Guzmán, Pascale Fung, Philipp Koehn, and Mona Diab. 2021. Adapting high-resource NMT models to translate low-resource related languages without parallel data. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 802–812, Online. Association for Computational Linguistics.
- Kocmi et al. (2024) Tom Kocmi, Vilém Zouhar, Eleftherios Avramidis, Roman Grundkiewicz, Marzena Karpinska, Maja Popović, Mrinmaya Sachan, and Mariya Shmatova. 2024. Error span annotation: A balanced approach for human evaluation of machine translation. In Proceedings of the Ninth Conference on Machine Translation, pages 1440–1453.
- Koehn (2005) Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In Proceedings of machine translation summit x: papers, pages 79–86.
- Koehn et al. (2007) Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions, pages 177–180, Prague, Czech Republic. Association for Computational Linguistics.
- Kudo and Richardson (2018) Taku Kudo and John Richardson. 2018. Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66–71.
- Lamraoui and Langlais (2013) Fethi Lamraoui and Philippe Langlais. 2013. Yet another fast, robust and open source sentence aligner. Time to reconsider sentence alignment? In Proceedings of Machine Translation Summit XIV: Papers, Nice, France.
- Mansurov et al. (2024) Jonibek Mansurov, Akhmed Sakip, and Alham Fikri Aji. 2024. Data laundering: Artificially boosting benchmark results through knowledge distillation. arXiv preprint arXiv:2412.15255.
- Meta AI (2022) Meta AI. 2022. NLLB-200-distilled-1.3B: Hugging face model card. https://huggingface.co/facebook/nllb-200-distilled-1.3B.
- Minixhofer et al. (2023) Benjamin Minixhofer, Jonas Pfeiffer, and Ivan Vulić. 2023. Where’s the point? self-supervised multilingual punctuation-agnostic sentence segmentation. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7215–7235, Toronto, Canada. Association for Computational Linguistics.
- Nadas et al. (2025) Mihai Nadas, Laura Diosan, and Andreea Tomescu. 2025. Synthetic data generation using large language models: Advances in text and code. arXiv preprint arXiv:2503.14023.
- Official Statistics of Finland (OSF) (2024) Official Statistics of Finland (OSF). 2024. Number of foreign-language speakers exceeded 600,000 during 2024. https://stat.fi/en/publication/cm1jg8tr20lco07vwvoif9s6i. Accessed: 2025-04-25.
- Oh et al. (2023) Seokjin Oh, Su Ah Lee, and Woohwan Jung. 2023. Data augmentation for neural machine translation using generative language model. arXiv preprint arXiv:2307.16833.
- Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318.
- Popović (2015) Maja Popović. 2015. chrF: character n-gram F-score for automatic MT evaluation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 392–395, Lisbon, Portugal. Association for Computational Linguistics.
- Rasley et al. (2020) Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. 2020. Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD ’20, pages 3505–3506, New York, NY, USA. Association for Computing Machinery.
- Rei et al. (2020) Ricardo Rei, Craig Stewart, Ana C Farinha, and Alon Lavie. 2020. Comet: A neural framework for mt evaluation. arXiv preprint arXiv:2009.09025.
- Rei et al. (2022) Ricardo Rei, Marcos Treviso, Nuno M. Guerreiro, Chrysoula Zerva, Ana C Farinha, Christine Maroti, José G. C. de Souza, Taisiya Glushkova, Duarte Alves, Luisa Coheur, Alon Lavie, and André F. T. Martins. 2022. CometKiwi: IST-unbabel 2022 submission for the quality estimation shared task. In Proceedings of the Seventh Conference on Machine Translation (WMT), pages 634–645, Abu Dhabi, United Arab Emirates (Hybrid). Association for Computational Linguistics.
- Scalvini et al. (2025) Barbara Scalvini, Iben Nyholm Debess, Annika Simonsen, and Hafsteinn Einarsson. 2025. Rethinking low-resource MT: the surprising effectiveness of fine-tuned multilingual models in the LLM age. In Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025), pages 609–621, Tallinn, Estonia. University of Tartu Library.
- Sennrich et al. (2016) Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Improving neural machine translation models with monolingual data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 86–96.
- Tapo et al. (2025) Allahsera Auguste Tapo, Kevin Assogba, Christopher M Homan, M. Mustafa Rafique, and Marcos Zampieri. 2025. Bayelemabaga: Creating resources for Bambara NLP. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 12060–12070, Albuquerque, New Mexico. Association for Computational Linguistics.
- Thompson et al. (2024) Brian Thompson, Mehak Dhaliwal, Peter Frisch, Tobias Domhan, and Marcello Federico. 2024. A shocking amount of the web is machine translated: Insights from multi-way parallelism. In Findings of the Association for Computational Linguistics: ACL 2024, pages 1763–1775, Bangkok, Thailand. Association for Computational Linguistics.
- Tiedemann et al. (2024) Jörg Tiedemann, Mikko Aulamo, Daria Bakshandaeva, Michele Boggia, Stig-Arne Grönroos, Tommi Nieminen, Alessandro Raganato, Yves Scherrer, Raúl Vázquez, and Sami Virpioja. 2024. Democratizing neural machine translation with opus-mt. Language Resources and Evaluation, 58(2):713–755.
- Tiedemann and De Gibert (2023) Jörg Tiedemann and Ona De Gibert. 2023. The opus-mt dashboard–a toolkit for a systematic evaluation of open machine translation models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), pages 315–327.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems, 30.
- Vázquez et al. (2025) Raúl Vázquez, Timothee Mickus, Elaine Zosa, Teemu Vahtola, Jörg Tiedemann, Aman Sinha, Vincent Segonne, Fernando Sánchez-Vega, Alessandro Raganato, Jindřich Libovickỳ, et al. 2025. Semeval-2025 task 3: Mu-shroom, the multilingual shared task on hallucinations and related observable overgeneration mistakes. arXiv preprint arXiv:2504.11975.
- Wang et al. (2025) Jiayi Wang, Yao Lu, Maurice Weber, Max Ryabinin, David Adelani, Yihong Chen, Raphael Tang, and Pontus Stenetorp. 2025. Multilingual language model pretraining using machine-translated data. arXiv preprint arXiv:2502.13252.
- Wang et al. (2024) Ke Wang, Jiahui Zhu, Minjie Ren, Zeming Liu, Shiwei Li, Zongye Zhang, Chenkai Zhang, Xiaoyu Wu, Qiqi Zhan, Qingjie Liu, et al. 2024. A survey on data synthesis and augmentation for large language models. arXiv preprint arXiv:2410.12896.
- Xia et al. (2019) Mengzhou Xia, Xiang Kong, Antonios Anastasopoulos, and Graham Neubig. 2019. Generalized data augmentation for low-resource translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5786–5796.
- Yang and Nicolai (2023) Wayne Yang and Garrett Nicolai. 2023. Neural machine translation data generation and augmentation using chatgpt. arXiv preprint arXiv:2307.05779.
- Zaragoza-Bernabeu et al. (2022) Jaume Zaragoza-Bernabeu, Gema Ramírez-Sánchez, Marta Bañón, and Sergio Ortiz Rojas. 2022. Bicleaner AI: Bicleaner goes neural. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 824–831, Marseille, France. European Language Resources Association.
- Zhou et al. (2024) Yue Zhou, Chenlu Guo, Xu Wang, Yi Chang, and Yuan Wu. 2024. A survey on data augmentation in large model era. arXiv preprint arXiv:2401.15422.
- Zoph et al. (2016) Barret Zoph, Deniz Yuret, Jonathan May, and Kevin Knight. 2016. Transfer learning for low-resource neural machine translation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1568–1575.
Appendix A Pilot Evaluation Results
Language | Code | OPUS-MT Sent. | GPT Sent. | GPT Chunk | EMMA Sent./3 | EMMA Sent./0 |
---|---|---|---|---|---|---|
Tosk Albanian | als_Latn | 53.9 | 61.3 | 63.1 | 44.7 | 54.4 |
Aragonese | arg_Latn | - | 43.9 | 47.5 | 38.1 | 39.8 |
Aranese | arn_Latn | - | 41.2 | 45.4 | 16.7 | 27.6 |
Asturian | ast_Latn | 59.6 | 59.6 | 62.8 | 52.7 | 53.3 |
Bashkir | bak_Cyrl | 40.6 | 51.4 | 52.2 | 24.2 | 29.1 |
Belarusian | bel_Cyrl | 44.4 | 46.6 | 49.2 | 39.0 | 37.9 |
Bosnian | bos_Latn | 58.8 | 63.9 | 65.4 | 51.6 | 53.3 |
Catalan | cat_Latn | 67.4 | 67.9 | 69.8 | 56.7 | 56.5 |
Crimean Tatar | crh_Latn | 35.9 | 37.6 | 40.0 | 13.7 | 23.2 |
Welsh | cym_Latn | 64.9 | 73.3 | 73.6 | 59.4 | 59.8 |
Esperanto | epo_Latn | 59.7 | 63.2 | 65.6 | 60.8 | 60.5 |
Basque | eus_Latn | 55.0 | 57.3 | 60.4 | 41.5 | 44.0 |
Friulian | fur_Latn | 49.9 | 45.7 | 50.0 | 45.4 | 43.8 |
Scottish Gaelic | gla_Latn | 42.6 | 52.4 | 55.7 | 44.5 | 45.9 |
Irish | gle_Latn | 60.5 | 61.3 | 61.6 | 48.9 | 49.3 |
Galician | glg_Latn | 62.5 | 63.3 | 66.2 | 55.7 | 55.9 |
Hebrew | heb_Hebr | 61.3 | 58.1 | 59.2 | 41.1 | 47.0 |
Croatian | hrv_Latn | 61.5 | 60.9 | 62.2 | 50.9 | 51.2 |
Armenian | hye_Armn | 47.5 | 56.3 | 56.5 | 45.7 | 48.7 |
Icelandic | isl_Latn | 53.0 | 55.6 | 59.3 | 41.1 | 41.6 |
Georgian | kat_Geor | 42.7 | 51.3 | 42.4 | 45.3 | 45.9 |
Ligurian | lij_Latn | 43.9 | 36.2 | 40.1 | 34.5 | 35.3 |
Limburgish | lim_Latn | 36.5 | 43.7 | 44.6 | 28.3 | 29.8 |
Lombard | lmo_Latn | 34.1 | 35.3 | 38.8 | 25.3 | 28.4 |
Luxembourgish | ltz_Latn | 55.5 | 59.4 | 60.9 | 51.5 | 49.9 |
Macedonian | mkd_Cyrl | 64.5 | 65.3 | 64.2 | 55.2 | 56.8 |
Occitan | oci_Latn | 64.0 | 67.5 | 67.0 | 51.7 | 46.0 |
Sicilian | scn_Latn | 39.9 | 45.0 | 48.7 | 43.4 | 43.8 |
Somali | som_Latn | 44.2 | 47.6 | 53.0 | 42.7 | 38.6 |
Sardinian | srd_Latn | 50.1 | 43.4 | 46.2 | 40.0 | 48.8 |
Serbian | srp_Cyrl | 63.0 | 64.1 | 65.0 | 35.5 | 46.7 |
Tatar | tat_Cyrl | 43.0 | 53.6 | 54.9 | 21.1 | 33.9 |
Turkmen | tuk_Latn | 42.6 | 55.1 | 54.8 | 22.3 | 25.8 |
Turkish | tur_Latn | 62.8 | 66.1 | 67.1 | 36.8 | 34.5 |
Uyghur | uig_Arab | 37.1 | 38.1 | 36.8 | 30.2 | 31.1 |
Ukrainian | ukr_Cyrl | 60.1 | 60.1 | 62.1 | 48.7 | 49.4 |
Northern Uzbek | uzn_Latn | 12.7 | 59.3 | 60.7 | 30.5 | 44.4 |
Venetian | vec_Latn | 44.6 | 49.5 | 52.0 | 34.7 | 37.1 |
Yiddish | ydd_Hebr | 0.0 | 40.5 | 42.6 | 41.4 | 51.5 |
Average | | 49.2 | 53.9 | 55.6 | 40.8 | 43.6 |
Appendix B Human Evaluation Annotation Guidelines
Introduction
You will be asked to evaluate the quality of machine-translated (MT) sentences by comparing each one directly to its human-written original sentence (the source sentence). You will assign a score, based on how well the translation preserves meaning, fluency, and naturalness. This is what is known as Direct Assessment (DA, Graham et al., 2013). DA elicits human assessments of translation adequacy on an analogue rating scale (0–100), where human assessors are asked to rate how adequately the APE system output expresses the meaning of the human reference translation (Bojar et al., 2017).
In this annotation project, you will be shown 100 samples of source-hypothesis pairs. Your task is to evaluate each translation pair through DA.
Annotation Guidelines
1. Carefully read the sentence pair. Try to understand the intended meaning of the source.
2. Evaluate whether the sentences are parallel or not. Compare the MT sentence with the source. Does the MT output preserve the key meaning of the source sentence?
3. Evaluate whether the target sentence contains fluency mistakes. Is the MT sentence grammatically correct? Are there any strange phrases, broken structure, or missing words?
4. Decide the score based on the scoring scale below.
5. Ensure that you double-check your annotations prior to moving to the next example. Re-read both source and translation. Does the score reflect meaning and fluency? Were you consistent with your previous scores? Adjust the score if needed to maintain fairness and consistency.
Scoring scale
Use the full range of the scale. Do not be afraid to give very low or very high scores when appropriate.
Score | Interpretation |
---|---|
100 | Perfect: grammatically flawless, fluent, and semantically identical to the source. |
85–99 | Excellent: small stylistic or fluency issues; all meaning preserved. |
70–84 | Good: mostly fluent; minor issues in grammar, wording, or slight meaning distortion. |
50–69 | Acceptable: understandable, but multiple issues with grammar, style, or partial meaning loss. |
30–49 | Poor: hard to understand, major meaning lost, broken grammar. |
1–29 | Very poor: barely comprehensible or mostly wrong meaning. |
0 | Incomprehensible: completely unrelated, meaningless, or unreadable. |
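For analysis, it can be convenient to map raw DA scores back onto the bands of the scale above. The following is a minimal sketch of such a lookup; the function name `da_band` is hypothetical, and we assume integer scores.

```python
def da_band(score: int) -> str:
    """Return the interpretation band of the scoring scale for a DA score."""
    if not 0 <= score <= 100:
        raise ValueError(f"DA score must be in [0, 100], got {score}")
    if score == 100:
        return "Perfect"
    if score >= 85:
        return "Excellent"
    if score >= 70:
        return "Good"
    if score >= 50:
        return "Acceptable"
    if score >= 30:
        return "Poor"
    if score >= 1:
        return "Very poor"
    return "Incomprehensible"


if __name__ == "__main__":
    for s in (100, 92, 70, 55, 30, 12, 0):
        print(s, da_band(s))
```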
Appendix C Training Regimes
- Baseline: We employ a shared 32k SentencePiece Kudo and Richardson (2018) vocabulary trained on the synthetic corpus; other settings follow the original Transformer-base recipe. Mini-batch fitting is enabled to optimize memory usage. We validate on perplexity every 2,500 updates and apply early stopping on the development set with a patience of 10.
- OPUS-MT-ft: We fine-tune each model without modifying its original tokenizer; for multilingual models, the appropriate language tags are prefixed at train and test time. Mini-batch fitting is enabled to optimize memory usage. We validate on perplexity every 500 updates and apply early stopping on the development set with a patience of 20.
- NLLB-200-distilled-1.3B-ft: We fine-tune the NLLB-distilled-1.3B model with DeepSpeed on four V100 GPUs in FP16 mixed precision. Training uses a per-GPU batch size of 32 sentences, a maximum sequence length of 128 tokens, the Adam optimiser with a 1 learning rate, and runs for up to four epochs. DeepSpeed ZeRO-1 is used for basic tensor sharding; everything else is left on-GPU. Early stopping is applied on the development set with a patience of 5.
- LLaMA-3.2-3B-Instruct-ft: We adapt the model with LoRA using the Unsloth framework. We load the 4-bit quantized version of the model, attach LoRA adapters, and format the training data with prompts designed to mimic a professional translator's task using Unsloth's template system. Training uses SFTTrainer with FP16 mixed precision, gradient accumulation, and 50k training steps with an effective batch size of 16 utterances.
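The Baseline and OPUS-MT-ft regimes stop training after a fixed number of validations without improvement in perplexity. The sketch below illustrates that patience logic in plain Python. The class name `EarlyStopping` and the perplexity values are hypothetical, and the training toolkits handle this internally, so this is an illustration of the idea, not the code used in the experiments.

```python
class EarlyStopping:
    """Stop when validation perplexity has not improved for `patience` checks."""

    def __init__(self, patience: int) -> None:
        self.patience = patience
        self.best = float("inf")  # lower perplexity is better
        self.stalled = 0          # consecutive validations without improvement

    def update(self, perplexity: float) -> bool:
        """Record one validation result; return True if training should stop."""
        if perplexity < self.best:
            self.best = perplexity
            self.stalled = 0
        else:
            self.stalled += 1
        return self.stalled >= self.patience


# Illustrative values only, with a short patience to keep the example small.
stopper = EarlyStopping(patience=3)
history = [12.0, 10.5, 10.7, 10.6, 10.9, 10.4]
stopped_at = None
for validation_index, ppl in enumerate(history, start=1):
    if stopper.update(ppl):
        stopped_at = validation_index
        break
```

With the settings listed above, a patience of 10 at one validation per 2,500 updates means the Baseline stops 25,000 updates after its best checkpoint, while OPUS-MT-ft (patience 20, one validation per 500 updates) stops 10,000 updates after its best checkpoint.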
Appendix D OPUS-MT Models selected for fine-tuning
We select the best available OPUS-MT model based on the OPUS-MT Dashboard Tiedemann and De Gibert (2023), by looking at the BLEU score on the FLORES200 dataset.
- en-uk: eng-zle/tf-big
- fi-so: mul-mul/tf-big
- fi-uk: fin-zle/tf-big
- so-fi: afa-fiu/tf-base
- uk-fi: zle-fin/tf-big
Appendix E COMET scores for MT training
Model | en-eu | en-gd | en-is | en-ka | en-mk | en-so | en-uk | Params |
---|---|---|---|---|---|---|---|---|
Baseline | 81.51 | 78.04 | 80.16 | 80.72 | 82.24 | 78.15 | 78.89 | 60.6M |
OPUS-MT | 83.27 | 71.30 | 79.69 | 69.09 | 87.34 | 77.06 | 89.02 | 191.6M |
OPUS-MT-ft | 84.15 | 79.30 | 83.21 | 81.60 | 86.19 | 79.34 | 87.61 | |
Δ (ft vs. original) | +0.88 | +8.00 | +3.52 | +12.51 | -1.15 | +2.28 | -1.41 | |
NLLB | 84.55 | 78.73 | 82.06 | 80.49 | 87.45 | 80.06 | 87.21 | 1.3B |
NLLB-ft | 86.84 | 79.43 | 85.36 | 86.94 | 88.74 | 80.90 | 88.14 | |
Δ (ft vs. original) | +2.29 | +0.7 | +3.3 | +6.45 | +1.29 | +0.84 | +0.93 | |
LLaMA | 40.94 | 44.61 | 36.96 | 33.68 | 42.50 | 43.79 | 50.86 | 3B |
LLaMA-ft | 70.00 | 75.92 | 76.63 | 54.53 | 82.09 | 76.67 | 80.80 | |
Δ (ft vs. original) | +29.06 | +31.31 | +39.67 | +20.85 | +39.59 | +32.88 | +29.94 | |
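The gain rows in this table are the differences between each fine-tuned model and its off-the-shelf counterpart. As a sanity check, they can be recomputed from the reported COMET scores; the Python sketch below does this and lists the pairs where fine-tuning lowers the score. The variable and function names are chosen for illustration only.

```python
PAIRS = ["en-eu", "en-gd", "en-is", "en-ka", "en-mk", "en-so", "en-uk"]

# COMET scores copied from the table: model -> (off-the-shelf, fine-tuned).
SCORES = {
    "OPUS-MT": (
        [83.27, 71.30, 79.69, 69.09, 87.34, 77.06, 89.02],
        [84.15, 79.30, 83.21, 81.60, 86.19, 79.34, 87.61],
    ),
    "NLLB": (
        [84.55, 78.73, 82.06, 80.49, 87.45, 80.06, 87.21],
        [86.84, 79.43, 85.36, 86.94, 88.74, 80.90, 88.14],
    ),
    "LLaMA": (
        [40.94, 44.61, 36.96, 33.68, 42.50, 43.79, 50.86],
        [70.00, 75.92, 76.63, 54.53, 82.09, 76.67, 80.80],
    ),
}


def gains(base, tuned):
    """Per-pair COMET difference (fine-tuned minus off-the-shelf), 2 decimals."""
    return [round(t - b, 2) for b, t in zip(base, tuned)]


# model -> {language pair -> gain from fine-tuning}
DELTAS = {
    model: dict(zip(PAIRS, gains(base, tuned)))
    for model, (base, tuned) in SCORES.items()
}

# (model, pair) combinations where fine-tuning on synthetic data hurts COMET.
REGRESSIONS = [
    (model, pair)
    for model, per_pair in DELTAS.items()
    for pair, delta in per_pair.items()
    if delta < 0
]
```

Run on the values above, the only regressions are OPUS-MT on en-mk and en-uk, the two pairs where the off-the-shelf OPUS-MT model already scores above 87 COMET.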
Appendix F Detailed Data Sizes and ChrF Scores for HPLT Comparison
en-eu | en-is | en-mk | en-uk | |
n. sentences | 1,491,873 | 2,694,541 | 3,991,617 | 25,125,019 |