And outperforms Google Translate for the translation of literary works
According to previous studies, GPT models perform as well as standard machine translation systems, e.g., Google Translate.
These studies mostly focused on sentence-level translation: the default approach used in machine translation, which translates sentences one after the other without any context.
Translating paragraphs or entire documents is a very difficult challenge for standard machine translation systems. These systems often have to split the input, or undergo heavy engineering to accept and leverage a longer input.
Yet, intuitively, and following the workflow of human translators, we can expect machine translation systems to perform better with context, e.g., when translating entire documents or paragraphs.
This is where large language models such as GPT models can shine. They can take as input prompts significantly longer than what typical machine translation systems accept.
But it remains to evaluate the following:
- Whether exploiting more context is beneficial for improving GPT’s machine translation quality.
- The performance of GPT models when translating long texts compared to standard machine translation systems.
The evaluation of large language models for translating paragraphs poses several challenges:
- The automatic metrics used for machine translation evaluation are not designed for paragraph-level evaluation.
- The evaluation data must not have been seen during the training of the evaluated systems.
- The evaluation should be conducted on a diverse set of language pairs to get an accurate overview of the large language model’s translation quality.
- Prompts have to be designed to exploit an entire paragraph, i.e., not only individual sentences as done in previous work.
These challenges are all tackled by Karpinska and Iyyer (2023): “Large language models effectively leverage document-level context for literary translation, but critical errors persist”.
In this blog article, I review and comment on their work. We will see how their evaluation of GPT-3.5 shows that “LLMs produce better translations when supplied with paragraph-level context” and can achieve better translation quality than state-of-the-art neural machine translation systems for very diverse language pairs.
The automatic evaluation metrics that are commonly used in machine translation are unsuitable here: their correlation with human judgments is unknown when evaluating paragraph-level translations.
We can’t rely on automatic metrics here.
Human evaluation remains the first choice for an evaluation of high credibility, so the authors of this study mainly relied on an MQM framework (Lommel et al., 2014), asking annotators to:
- Mark translation error spans and categorize them
- Make preference judgments of which of two translations is of higher quality
- Provide free-form justifications for their preference judgments (a sketch of what such an annotation record could look like follows this list).
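To make this annotation scheme more concrete, here is a minimal sketch of how one such annotation record could be represented. The field names and error categories are my own illustration of an MQM-style record, not the exact schema used by the authors.

```python
from dataclasses import dataclass, field

@dataclass
class AnnotationRecord:
    """One annotator's judgment for a pair of candidate translations (illustrative schema)."""
    source_paragraph: str
    translation_a: str                    # e.g., output of system A
    translation_b: str                    # e.g., output of system B
    # Error spans marked in each translation: (start, end, category),
    # e.g., (12, 25, "accuracy/mistranslation") or (40, 47, "fluency/grammar").
    errors_a: list = field(default_factory=list)
    errors_b: list = field(default_factory=list)
    preferred: str = "a"                  # which translation is judged of higher quality
    justification: str = ""               # free-form explanation of the preference

# Hypothetical example record
record = AnnotationRecord(
    source_paragraph="...",
    translation_a="...",
    translation_b="...",
    errors_a=[(12, 25, "accuracy/mistranslation")],
    errors_b=[(0, 8, "fluency/grammar"), (40, 47, "accuracy/omission")],
    preferred="a",
    justification="Translation A preserves the tone of the dialogue; B omits a clause.",
)
```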
For this evaluation, they collected a total of 720 pairs of translated paragraphs for 18 language pairs.
That’s a lot of data! I can’t wait to have a look at the dataset. It will be released on GitHub, here.
For evaluation, this work chose to focus on translating literary works. It may seem like an odd choice since most previous work in machine translation focuses on other genres/domains (news, user-generated texts, …).
Machine translation of literary texts is understudied and extremely difficult, especially with machine translation systems working at the sentence level.
In this type of text, contextual nuances are essential but can’t be captured if the system translates sentences independently. Often, human translators must restructure entire paragraphs to accurately translate them into the target language.
Translation of literary texts is thus intuitively a task where a system taking a document or a paragraph as input would perform better than a system only accepting shorter inputs.
But a major constraint that we have when we evaluate large language models is that the data used for evaluation must be recent. This is important for the credibility of the evaluation. By using recently published data for evaluation, we avoid translating texts that could have been used for training the evaluated model, i.e., we avoid data contamination.
In this work, most of the translations used for evaluation were published after 2021. These particular translations were very probably absent from the training data of GPT-3.5, which was trained on data published before 2022 according to OpenAI.
However, the original texts that are translated are much older (published from 1884 to 2020). They were very likely seen by the systems evaluated in this work (GPT-3.5 and Google Translate).
Also, while it is unlikely that the evaluated systems have seen these specific translations, they may have seen other translations in other languages, or in the same language but published earlier.
The data contamination is limited but does happen. I don’t think there is a better solution to completely prevent it for literary texts. But for other genres, such as news, this is possible.
This is one of the strongest points of this work: the authors evaluated very diverse language pairs.
As source languages, they chose languages from various families: Indo-European (Romance, Germanic, Slavic), Sino-Tibetan, and Japonic. This way, they ensure that the evaluation will be able to identify more precisely the strengths and weaknesses of GPT-3.5 in translating languages with different morphological features and writing systems.
The source languages used for evaluation are English (en), Polish (pl), Russian (ru), Czech (cs), French (fr), German (de), Japanese (ja), and Chinese (zh).
For the target languages, they chose languages to create source-target pairs that are “easy” (similar languages) and “difficult” (dissimilar languages).
For instance, Czech-Polish is an easy language pair since these languages have a lot in common. On the other hand, Japanese-Polish is an extremely difficult language pair since these two languages are from very distant language families with different grammar and writing systems. There is also a very limited number of machine translation studies for this language pair.
The chosen target languages for each source language are English (en), Japanese (ja), and Polish (pl).
One of the most critical steps when evaluating large language models is designing prompts.
There are many possible prompts for machine translation. Ideally, we should extensively evaluate several of them to assess how impactful the choice of the prompt is.
We also have to keep in mind that the conclusions drawn by a scientific work may only be valid for the very particular prompts that were evaluated.
Including many prompts in an evaluation is expensive since we have to run inference with a large language model for each prompt. In practice, it means that we can only select a limited number of prompts to conduct the evaluation.
They used 5-shot in-context learning to translate with GPT-3.5: 5 examples of translations are included in the prompt to indicate more precisely what is expected from GPT-3.5.
The chosen translation examples have a critical impact on the translation quality of a language model. As demonstrated by Vilar et al. (2022), the translation quality of the examples is what matters most.
Concerning the example selection, they wrote:
We manually curate the five demonstrations from literary texts for each of the 18 language pairs, resulting in 90 total demonstration examples. These demonstrations are sourced from novels that are not part of our translation dataset, resulting in potential differences in topic and style […]
This isn’t very detailed. In particular, I don’t know what “curate” involves here. The curation criteria are not provided.
Once chosen, the examples were included in three prompt templates that exploit contexts of different sizes.
Sentence-level Prompt Template
With this template, the sentences of the paragraphs to translate are given to GPT-3.5 one by one. This is how standard sequence-to-sequence neural machine translation systems work.
Original text in [SRC LANG]:
source sentence
Translation into [TRG LANG]:
target sentence
Note: [SRC LANG] and [TRG LANG] denote the source and target languages, respectively.
Sentence-level Translation with Context Prompt Template
The translation is still performed at the sentence level, but the sentences are given with their context to GPT-3.5: what precedes and what follows the sentence in the paragraph are both included in the prompt.
Original text in [SRC LANG]:
source prefix
src sent source suffix
Translation into [TRG LANG]:
target prefix
trg sent
I found this design quite clever but also risky. In my experience, GPT models can easily be confused if we don’t explicitly define the tags. In this case, I wouldn’t be surprised if GPT-3.5 just translated everything, including the tags.
Paragraph-level Prompt Template
The template is the same as the first one, but here entire paragraphs are provided instead of individual sentences (a sketch of how this prompt could be assembled and sent to the API follows the template).
Original text in [SRC LANG]:
source paragraph
Translation into [TRG LANG]:
target paragraph
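To make these templates more concrete, here is a minimal sketch in Python of how the paragraph-level prompt could be assembled with the 5 demonstrations and sent to GPT-3.5. The helper function, the exact formatting, the model name, and the decoding parameters are my assumptions; the paper’s actual implementation may differ.

```python
from openai import OpenAI  # assumes the official OpenAI Python client

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def build_paragraph_prompt(src_lang: str, trg_lang: str,
                           demonstrations: list[tuple[str, str]],
                           paragraph: str) -> str:
    """Assemble the paragraph-level template, prepending the 5 demonstration translations."""
    blocks = []
    for src_par, trg_par in demonstrations:
        blocks.append(
            f"Original text in {src_lang}:\n{src_par}\n"
            f"Translation into {trg_lang}:\n{trg_par}\n"
        )
    # Final block: the paragraph to translate, leaving the translation empty
    # for the model to complete.
    blocks.append(
        f"Original text in {src_lang}:\n{paragraph}\n"
        f"Translation into {trg_lang}:\n"
    )
    return "\n".join(blocks)


# Placeholder demonstrations; the paper uses 5 manually curated literary excerpts
# per language pair.
demos = [(f"source paragraph {i}", f"target paragraph {i}") for i in range(1, 6)]

prompt = build_paragraph_prompt("French", "English", demos, "Paragraphe à traduire...")

response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # the paper evaluated GPT-3.5; the exact model and API may differ
    messages=[{"role": "user", "content": prompt}],
    temperature=0.3,        # assumed decoding setting
)
print(response.choices[0].message.content)
```

The two sentence-level variants would only change what goes into each block: a single sentence, or a sentence surrounded by its preceding and following context.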
Now that we have our prompts, we can use them to evaluate the translation quality of GPT-3.5.
This evaluation mainly aims at answering two questions:
- Are large language models such as GPT-3.5 better at translation when translating entire paragraphs instead of individual sentences?
- How does GPT-3.5 perform compared to Google Translate when translating entire paragraphs?
For this evaluation, the authors mainly rely on human evaluation using the MQM framework.
If you are familiar with my work, you already know how critical I can be when writing about machine translation evaluation.
For this work, the authors evaluated their machine translation systems with very high scientific credibility. If you are looking for an example of a good machine translation evaluation, this is one of them. Note: I also recommend reading “Prompting PaLM for Translation: Assessing Strategies and Performance” (Vilar et al., 2022), which is another good example, as I detailed in my blog article “How Good Is Google PaLM at Translation?”.
They didn’t rely on automatic metrics but still provide metric scores for further analysis. All the details needed to reproduce the scores are also provided. This is extremely rare.
They have even tested the statistical significance of their human evaluation results.
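The paper describes its own significance testing; as a simple illustration of the idea, here is how one could test whether annotators’ preferences between two systems differ from chance with a two-sided sign (binomial) test. The counts below are hypothetical.

```python
from scipy.stats import binomtest

# Hypothetical counts: out of 80 paragraph pairs where annotators expressed a
# preference, 56 preferred the paragraph-level GPT-3.5 translation.
n_judgments = 80
n_prefer_paragraph = 56

# Null hypothesis: both systems are preferred equally often (p = 0.5).
result = binomtest(n_prefer_paragraph, n_judgments, p=0.5)
print(f"p-value = {result.pvalue:.4f}")  # a small p-value suggests the preference is not due to chance
```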
The results:
- GPT-3.5 is better at translating paragraphs than individual sentences
- GPT-3.5 is better than Google Translate
But these results vary across language pairs.
For the German-to-Japanese translation direction, translating individual sentences yields better results. This is the only exception. According to the authors, this is because the data used for this translation direction contains very long sentences.
What’s most surprising to me is that GPT-3.5 is also better than Google Translate when translating individual sentences.
Automatic metrics also yield very similar results: COMET, BLEURT, BERTScore, and COMET-QE all agree that GPT-3.5 is better than Google Translate with any of the three prompt templates.
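For reference, segment-level and system-level COMET scores can be computed with the Unbabel COMET library. A minimal sketch follows; the checkpoint name is one of the publicly available models and not necessarily the version used in the paper.

```python
# pip install unbabel-comet
from comet import download_model, load_from_checkpoint

# Download and load a reference-based COMET model.
model_path = download_model("Unbabel/wmt22-comet-da")
model = load_from_checkpoint(model_path)

data = [
    {
        "src": "source paragraph in the original language",
        "mt": "machine translation to score (e.g., GPT-3.5 or Google Translate output)",
        "ref": "human reference translation",
    },
]

# Returns segment-level scores and a corpus-level system score.
output = model.predict(data, batch_size=8, gpus=1)  # assumes a GPU is available
print(output.scores)        # per-segment scores
print(output.system_score)  # corpus-level score
```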
The paper presents a very extensive analysis of the human evaluation. I won’t discuss it further in this article but invite you to read it. It’s very insightful.
The paper has a “limitations” section (Section 7) where the authors discuss the limits of using GPT models for translation.
The authors note that the translation errors made when translating paragraphs are different from the errors made when translating individual sentences.
When translating paragraphs, GPT-3.5 sometimes skips or forgets part of the content of the paragraph, resulting in an incorrect translation. I also observed similar behavior when playing with ChatGPT for translation.
This problem could be corrected by fine-tuning GPT-3.5 for machine translation. Note: Let’s not forget that the GPT-3.5 model evaluated here has not been fine-tuned for machine translation.
Apart from that, GPT-3.5 still makes some more common types of errors, such as mistranslations and grammatical errors, but far fewer than Google Translate, as shown by the evaluation.
I struggled to find limitations of this work, but there is at least one in my opinion.
The impact of the prompt templates isn’t clear. The particular template chosen for paragraph translation performs better than the template chosen for sentence translation.
But can we conclude from this setting that GPT-3.5 performs better when translating entire paragraphs?
If we change the templates, will we still draw the same conclusion?
We can’t easily answer this question. I expect this limitation to be shared by all future work evaluating language models for machine translation.
Also, this work focuses on translating literary texts. We can’t be sure that this work’s conclusions would apply to other genres. I’m eager to read future work that will address this gap.
This work is a milestone in machine translation.
It shows with very high scientific credibility that a large language model can outperform more standard neural machine translation systems such as Google Translate. It also demonstrates that paragraph-level translation with a large language model yields better translation quality than sentence-level translation.
With this work and the previous study of PaLM’s translation quality, we have more and more evidence that the future of machine translation will be based on large language models.