

5.5.2 BLEU scores for low-resource languages

As low-resource languages are a specific topic of interest in the current study, it is justified to review their BLEU scores separately from those of high-resource languages. In the end, the included body of literature had five papers that studied low-resource languages: P3, P5, P23, P24, and P29. The low-resource languages present in the reviewed papers were Finnish, Turkish, and Uzbek. There does not seem to be a clearly defined line between low-resource and high-resource languages, so the categorisation in this study was simply whether one or more authors described the language as low-resource. The lack of a definition also meant that the datasets were of different sizes, which can also be seen in the results.
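
All of the scores reviewed below are corpus-level BLEU scores computed on a held-out test set such as newstest2015. As a point of reference, the sketch below shows how such a score is typically computed with the sacrebleu library; the file names and the use of sacrebleu are illustrative assumptions, not taken from the reviewed papers, which may have used their own BLEU implementations.

```python
# Minimal sketch: corpus-level BLEU for a set of system translations,
# using the sacrebleu library. File names are hypothetical.
import sacrebleu

# One translated sentence per line, aligned with the reference file.
with open("newstest2015.en-fi.hyp", encoding="utf-8") as f:
    hypotheses = [line.strip() for line in f]

with open("newstest2015.en-fi.ref", encoding="utf-8") as f:
    references = [line.strip() for line in f]

# corpus_bleu expects a list of hypothesis strings and a list of
# reference streams (here a single reference per sentence).
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU = {bleu.score:.2f}")
```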

There are some special factors to consider with low-resource languages. One major issue is that with small datasets the model can easily be overfitted, for example by including every unique target word in the vocabulary. In the reviewed articles, the authors usually compensated for the small size of the corpora by using a smaller vocabulary for low-resource language pairs than for other languages. For example, in P5, the vocabulary for English-to-Finnish was 40K tokens while for other languages it was 200K–500K tokens. Similarly, in P29, the English-to-Finnish vocabulary was 10K tokens, while the English-to-German vocabulary was 30K tokens. However, the hidden unit proposed by the authors of P29 showed promising results for alleviating the overfitting problems that are common with small data, meaning that there are other ways to avoid overfitting.
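
The vocabulary sizes above translate into a simple preprocessing step: only the most frequent tokens of the training corpus are kept, and everything else is mapped to an unknown-word symbol. The sketch below illustrates this idea; the 40K cut-off mirrors the English-to-Finnish setting reported for P5, but the code itself is an illustrative assumption rather than the preprocessing used in any of the reviewed papers.

```python
# Minimal sketch: restricting the target-side vocabulary to the K most
# frequent tokens, as is commonly done for low-resource pairs.
from collections import Counter

def build_vocab(sentences, max_size=40_000, unk="<unk>"):
    """Keep only the max_size most frequent tokens; everything else -> <unk>."""
    counts = Counter(tok for sent in sentences for tok in sent.split())
    vocab = {unk: 0}
    for token, _ in counts.most_common(max_size - 1):
        vocab[token] = len(vocab)
    return vocab

def encode(sentence, vocab, unk="<unk>"):
    """Map a sentence to token ids, replacing out-of-vocabulary words."""
    return [vocab.get(tok, vocab[unk]) for tok in sentence.split()]

# Hypothetical toy corpus; a real run would read the Finnish training data.
corpus = ["tämä on esimerkki", "tämä on toinen esimerkki lause"]
vocab = build_vocab(corpus, max_size=5)
print(encode("tämä on uusi lause", vocab))
```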

Table 25 shows the results for English-to-Finnish. All models were tested on the WMT15 test set, making them fairly comparable. Overall, the scores for this language pair are significantly lower than for high-resource languages. The best BLEU score, 10.20, was achieved in P23; however, it is not significantly higher than the other scores, since the overall average for English-to-Finnish translation was 9.69 BLEU. The model in P23 is a standard RNN encoder-decoder with an LSTM hidden unit, with the specialty that an attention score is assigned to each individual dimension of the context vector instead of to the entire vector at a time (a minimal sketch of this per-dimension attention follows Table 25). P29, the second-best model by a small margin, has an RNN with a custom LSTM variant, the PRU, as its hidden unit. P3, which scored 9.23 BLEU, presents a multilingual model with a shared RNN encoder and target-language-specific decoders and attention mechanisms.

Table 25. BLEU scores of English–Finnish translation

Language pair       Article (id)   BLEU (top result)   Test data                      Published
English – Finnish   P3             9.23                newstest2015                   2017
                    P23            10.20               newsdev2015 and newstest2015   2018
                    P29            9.64                newstest2015                   2019
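
The per-dimension attention described for P23 can be contrasted with standard attention as follows: instead of computing one scalar weight per source position, a separate weight is computed for every dimension of every encoder state, and the softmax normalisation is applied per dimension across source positions. The sketch below illustrates that idea with plain NumPy; the variable names and the exact scoring function are illustrative assumptions, not the implementation from P23.

```python
# Minimal sketch: "fine-grained" attention that weights each dimension of the
# encoder states separately, versus standard attention with one weight per
# source position. Shapes and the scoring function are illustrative.
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
src_len, dim = 6, 8                      # source positions, hidden size
H = rng.normal(size=(src_len, dim))      # encoder states h_1..h_L
s = rng.normal(size=(dim,))              # current decoder state

# Standard attention: one scalar score per source position.
scores = H @ s                           # (src_len,)
alpha = softmax(scores, axis=0)          # weights over positions
context_standard = alpha @ H             # (dim,)

# Per-dimension attention: one score per position *and* per dimension,
# normalised over positions separately for every dimension.
scores_fine = H * s                      # (src_len, dim)
alpha_fine = softmax(scores_fine, axis=0)
context_fine = (alpha_fine * H).sum(axis=0)  # (dim,)

print(context_standard.shape, context_fine.shape)
```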

Finnish-to-English translation yielded slightly different results than the opposite direction, as can be seen from Table 26. P5, which did not include the opposite direction, scored best for Finnish-to-English translation with 13.6 BLEU. It featured a traditional attentional RNN encoder-decoder with a GRU hidden unit, but also incorporated a monolingual corpus for Finnish-to-English translation. Experiments in P3 achieved a score of 12.61 BLEU for this direction, which is significantly better than its score of 9.23 for the opposite direction. P29 also scored better in this direction (12.26 as opposed to 9.64 for the opposite direction), but had the lowest overall score for this direction, if only by a small margin to the second best. Overall, the average for Finnish-to-English translation was 12.82 BLEU.

Table 26. BLEU scores of Finnish–English translation

Language pair       Article (id)   BLEU (top result)   Test data      Published
Finnish – English   P5             13.6                newstest2015   2015
                    P3             12.61               newstest2015   2017
                    P29            12.26               newstest2015   2019

Two papers, P3 and P24, featured Turkish-to-English translation, as can be seen from Table 27. P3 scored better, with a BLEU of 20.9 for its multilingual model, but P24 achieved an equally impressive 18.7 BLEU. The model in P24 utilised transfer learning, i.e., first training a model for a high-resource parent language pair and then transferring the learned parameters when training the low-resource language pair model (a minimal sketch of this parent-to-child transfer follows Table 27).

Table 27. BLEU scores of Turkish–English translation

Language pair       Article (id)   BLEU (top result)   Test data     Published
Turkish – English   P24            18.7                -             2016
                    P3             20.9                LDC2014E115   2017
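
The parent-child transfer used in P24 amounts to initialising the low-resource model with the parameters of a model already trained on a high-resource pair, and then continuing training on the small corpus. The sketch below shows this initialisation step in PyTorch; the model class, vocabulary sizes, and the choice of which parameters to copy are illustrative assumptions rather than the exact procedure of P24.

```python
# Minimal sketch: transfer learning for a low-resource pair by initialising
# the child model with a trained parent model's parameters. The Seq2Seq
# class and vocabulary sizes are hypothetical placeholders.
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Tiny stand-in for an attentional RNN encoder-decoder."""
    def __init__(self, src_vocab, tgt_vocab, hidden=256):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, hidden)
        self.tgt_emb = nn.Embedding(tgt_vocab, hidden)
        self.encoder = nn.GRU(hidden, hidden, batch_first=True)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)

# 1) Train (or load) the high-resource parent model, e.g. a French-to-English model.
parent = Seq2Seq(src_vocab=30_000, tgt_vocab=30_000)
# ... parent training on the large corpus would happen here ...

# 2) Build the child model for the low-resource pair, e.g. Uzbek-to-English,
#    and copy every parameter whose shape matches (the source embeddings
#    typically differ because the source vocabulary is different).
child = Seq2Seq(src_vocab=10_000, tgt_vocab=30_000)
parent_state = parent.state_dict()
child_state = child.state_dict()
transferred = {
    name: tensor
    for name, tensor in parent_state.items()
    if name in child_state and child_state[name].shape == tensor.shape
}
child_state.update(transferred)
child.load_state_dict(child_state)

# 3) Continue training `child` on the small low-resource corpus as usual.
print(f"Transferred {len(transferred)} of {len(child_state)} parameter tensors")
```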

The average for Turkish-to-English translation was 19.8 BLEU, which is significantly higher than the 12.82 average for Finnish-to-English. Although different language pairs are not directly comparable, this raises some questions. It is hard to pinpoint why this difference is so large, especially when all of the models present were based on an attentional RNN encoder-decoder architecture. From a linguistic point of view, the difference can be explained simply by the differences between the languages. On another note, in P24 the largest factor is the specialty of the model: the score of the presented model is significantly improved over the baseline system (without a parent model), which scored only 11.4 for Turkish-to-English and 10.7 for Uzbek-to-English. The reason may also lie in the different training and test datasets, as this was established as a cause of significant score differences for the English-to-Japanese pair.

The same two papers that included Turkish-to-English translation, P3 and P24, also featured Uzbek-to-English translation. The BLEU scores for this language direction are shown in Table 28. For this direction, P24 scored better with a BLEU of 16.8, while the model in P3 scored over 4 points lower at 12.33 BLEU. The contrast between the Turkish-to-English and Uzbek-to-English scores in P3 may in part be explained by the fact that the former model was trained with roughly ten times more data than the latter (784.65K for En-Tr as opposed to 73.66K for En-Uz). The average for the Uzbek-to-English pair was 14.57 BLEU.

Table 28. BLEU scores of Uzbek–English translation

Language pair     Article (id)   BLEU (top result)   Test data     Published
Uzbek – English   P24            16.8                -             2016
                  P3             12.33               LDC2014E115   2017