This article is written by Victoria Firsanova, a philologist and NLP researcher
My passion is to challenge myself. Although my domain in Natural Language Processing is Conversational AI, this year I decided to try my hand at low-resource machine translation. While collecting materials on the methodology and the languages I chose, I discovered how deep the rabbit hole goes, with all the challenges and considerations that come with it. But don’t fret, I discovered some solutions too. Let’s get into it!
- What Makes a Language Low-Resource?
- Challenges of Low-Resource NLP
- Word Representations
- Machine Translation: Handcraft, Augment, Transfer
- Meta-Learning Helps
What Makes a Language Low-Resource?
In linguistic typology, it is common to distinguish well- and under-described languages. Well-described languages usually attract more researchers; there are plenty of grammars and scientific papers describing the rules and structures of such languages. For example, French, English and German are well-described languages.
In contrast, under-described languages lack documentation. There are plenty of reasons for that. For example, endangered languages are hard to describe due to the lack of native speakers. Another cause is when the under-described variety is a dialect of another, more “popular” language. The Bantu languages are one example of under-described languages.
(If you are interested, you can find more about the Bantu languages here.)
In natural language processing, languages are categorized by whether they are high- or low-resource. Low-resource languages lack data that can be used for machine learning or other processing, and high-resource languages are rich in available data.
Under-described languages are low-resource and hard to process, which seems obvious, but some well-described languages can also be low-resource. Researchers and developers often use social media data as food for machine learning models. Consider a well-described language (X) spoken only in one multilingual country (N). Most native speakers of X prefer to use another language (Y) in their everyday life. As a result, the available social media data in X is poor. According to some research, the Belarusian language, spoken in Belarus, Ukraine, Lithuania and Poland, can be called low-resource. Because the Russian language is widespread in Belarus, data in Belarusian is not as accessible as NLP researchers would like. However, the preservation, research and development of tools in Belarusian are crucially important.
Regardless of the reasons, low-resource NLP matters for culture preservation. This field provides access to the latest technologies for those who speak rare (and marvellous!) languages. And, of course, it is challenging. The more data you have, the better your NLP model performs. Or not?
Challenges of Low-Resource NLP
The alpha and omega of machine learning is data processing, and data is the weak link of low-resource NLP. Depending on the available data on a target language, you might have to work with grammars, several social media posts, or a couple of books. Unfortunately, available resources might not fit your tasks or even your skills.
Challenge #1 Implementation Might Require Experts
Rule-based approaches to NLP are not as dependent on the quantity and quality of available data as neural ones. Nevertheless, they require working with linguistic descriptions, which may demand significant manual work by an expert in the target language.
Challenge #2 Specific Tasks Are Questionable
Consider building a specific NLP tool like a medical dialogue system. We might require a dataset with a particular structure – dialogue lines, for example – and relevant vocabulary. Some creative approaches, like synthetic data generation, might help. But the quality of such a non-trivial model is doubtful.
Challenge #3 Getting Access to the Data
Some languages are represented with grammars and corpora only. Getting access to such sources might require some social activity, for example, getting connected with their authors. By the way, getting to know some culture and language enthusiasts is always a good idea.
Challenge #4 Evaluation
Once you have built your model, you have to evaluate it, but which benchmarks should you use? If your model is one of the first for the chosen language, the question remains open. We tend to compare a new model with previous or similar ones. However, in this case, you can become a sort of pioneer.
Challenge #5 Limitations
We have to admit that low-resource NLP models might be limited. The lack of working solutions and available data makes it hard to fine-tune models for downstream tasks. That might limit the range of possible tasks we can solve with low-resource NLP tools.
These are just challenges and challenges can be overcome. Fortunately for us, some solutions already exist.
Word Representations
Word embeddings are a form of text representation in some vector space that makes it possible to automatically distinguish words with closer and more distant meanings by analysing their co-occurrence in context. There are plenty of popular solutions, some of which have become a kind of classic. In the context of low-resource NLP, there are two serious issues with those models. The first problem is that one should train such embeddings on large datasets. The second problem is that most of these solutions were evaluated on high-resource language data, which does not guarantee their efficiency on low-resource tasks.
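The co-occurrence idea can be sketched in a few lines. Below is a minimal toy illustration (not the implementation of any particular classic model such as word2vec or GloVe): we count which words appear together in a tiny invented corpus, then factorise the co-occurrence matrix to get dense vectors and compare them with cosine similarity.

```python
import numpy as np

# Toy corpus; in a real setting this would be millions of sentences,
# which is exactly what low-resource languages lack.
corpus = [
    "cats chase mice",
    "dogs chase cats",
    "mice eat cheese",
    "dogs eat bones",
]

# Vocabulary and a symmetric co-occurrence matrix (window = whole sentence).
vocab = sorted({w for line in corpus for w in line.split()})
index = {w: i for i, w in enumerate(vocab)}
cooc = np.zeros((len(vocab), len(vocab)))
for line in corpus:
    words = line.split()
    for a in words:
        for b in words:
            if a != b:
                cooc[index[a], index[b]] += 1

# A low-rank factorisation of the co-occurrence counts yields dense vectors.
u, s, _ = np.linalg.svd(cooc)
embeddings = u[:, :2] * s[:2]

def similarity(w1, w2):
    """Cosine similarity between the embeddings of two words."""
    v1, v2 = embeddings[index[w1]], embeddings[index[w2]]
    return float(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))

# Words that occur in similar contexts tend to get closer vectors.
print(similarity("cats", "dogs"))
```

With only four sentences the vectors are, of course, meaningless; the point is that the quality of such embeddings depends directly on how much text you can count over.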
In this case, we can prioritise cross-lingual models.
Some languages have more in common than others. These languages might be typologically similar, for example, due to geographical factors. Or they might have common roots and belong to the same language family. Perhaps a model trained on data from a diverse set of languages can learn these commonalities and differences between languages. For example, the LASER (Language-Agnostic Sentence Representations) architecture was trained for 93 languages. The model uses a bidirectional LSTM encoder and byte pair encoding (subword tokenisation). According to the LASER developers, it is a working solution for low-resource NLP.
Machine Translation: Handcraft, Augment, Transfer
I have already mentioned rule-based approaches that are still popular in low-resource NLP but have some serious drawbacks. Handcrafted rule-based machine translation seems to be reliable in low-resource NLP, but it requires a lot of experts, time, linguistic archives and, as a result, money. However, if you have a bi- or multilingual dictionary and linguistic knowledge, you can handcraft grammar and translation rules to build a system.
Data augmentation is another approach that might help in low-resource machine translation. For example, you can augment the data by using the Bible. We can think of the Bible as a multilingual parallel corpus because it contains the same texts translated into many languages. The Biblical texts have a distinctive style, but it is a fine place to start.
When using a high-resource and a low-resource language as source and target languages, we can apply the method introduced by Mengzhou Xia and colleagues. They propose augmenting the data by back-translating from English into the high- and low-resource languages and creating a pseudo-dataset by converting the parallel corpus for English and the high-resource language into a new corpus for English and the low-resource language. We can also refer to other studies that suggest using back-translation and word substitution to synthesise new data for training machine translation models.
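The pseudo-corpus construction can be illustrated with a toy sketch. This is not the authors' actual pipeline: here the high-to-low-resource converter is a simple word-substitution dictionary standing in for a trained back-translation model, and the hypothetical Finnish–Karelian word pairs are for illustration only.

```python
# An English/high-resource parallel corpus (English–Finnish here).
en_hrl_corpus = [
    ("good morning", "hyvää huomenta"),
    ("thank you", "kiitos"),
]

# Hypothetical high-to-low-resource substitutions (Finnish -> "Karelian"),
# a stand-in for a real back-translation model.
hrl_to_lrl = {"hyvää": "hyviä", "huomenta": "huondesta", "kiitos": "passibo"}

def convert(sentence: str) -> str:
    """Convert a high-resource sentence word by word, keeping unknown words."""
    return " ".join(hrl_to_lrl.get(w, w) for w in sentence.split())

# The pseudo English/low-resource corpus reuses the English side untouched.
en_lrl_pseudo = [(en, convert(hrl)) for en, hrl in en_hrl_corpus]
print(en_lrl_pseudo[1])  # ('thank you', 'passibo')
```

The resulting pairs are noisy, which is why such pseudo-data is usually mixed with whatever genuine parallel data exists rather than used alone.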
Finally, I suggest using transfer learning. If we want to build a machine translation system for the Karelian language, a low-resource language closely related to high-resource Finnish, we can train a machine translation model that works for Finnish and English and then retrain or fine-tune it for Karelian and English. The model will use the knowledge gained during training on large-scale Finnish data and transfer it to the Karelian data, which might significantly improve the model performance.
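The benefit of warm-starting from a related task can be shown with a deliberately tiny numeric experiment (a toy regression, not a translation model): pretrain on plenty of data from a related function, then fine-tune on five samples from the target function, versus training on those five samples from scratch.

```python
import numpy as np

rng = np.random.default_rng(0)

def sgd(w, xs, ys, lr=0.05, epochs=100):
    """Plain gradient descent on mean squared error for the model y = w * x."""
    for _ in range(epochs):
        grad = np.mean(2 * (w * xs - ys) * xs)
        w -= lr * grad
    return w

# "High-resource" task: lots of data from a related function, y = 2.0 * x.
x_big = rng.uniform(-1, 1, 1000)
y_big = 2.0 * x_big

# "Low-resource" task: only five samples from the target, y = 2.2 * x.
x_small = rng.uniform(-1, 1, 5)
y_small = 2.2 * x_small

# Pretrain on the big dataset, then briefly fine-tune on the small one.
w_pretrained = sgd(0.0, x_big, y_big)
w_finetuned = sgd(w_pretrained, x_small, y_small, epochs=10)

# The same short training budget from a cold start lands much further
# from the target value 2.2.
w_scratch = sgd(0.0, x_small, y_small, epochs=10)
print(abs(w_finetuned - 2.2) < abs(w_scratch - 2.2))  # True
```

The analogy is loose, but the mechanism is the same one exploited in Finnish-to-Karelian transfer: the pretrained parameters start near a good solution, so a little target data goes a long way.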
Meta-Learning Helps
Meta-learning allows models to learn analogies and patterns from the data and transfer this knowledge to specific tasks. The number of samples for those specific tasks in the training dataset may vary, from few-shot learning to one-shot or even zero-shot learning. One example of such a knowledgeable model is the Generative Pre-trained Transformer (GPT).
Meta-learning allows transferring knowledge to new languages and domains. Applying meta-learning to low-resource NLP might solve the problem of such models’ limitations. So, we can transfer knowledge to particular tasks, like text classification or named entity recognition, or to certain domains, like medicine, and obtain closed-domain or task-specific low-resource models.
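One simple flavour of few-shot transfer can be sketched with nearest-centroid classification in a shared embedding space (the idea behind prototypical networks). The 2-D "embeddings" below are invented for illustration; in practice they would come from a pretrained multilingual encoder such as LASER, so the same classifier could serve sentences in a low-resource language.

```python
import numpy as np

# A few labelled examples per class: the few-shot "support set".
support = {
    "medicine": np.array([[0.9, 0.1], [0.8, 0.2]]),
    "sports":   np.array([[0.1, 0.9], [0.2, 0.8]]),
}

# Each class is represented by the mean of its support embeddings.
centroids = {label: vecs.mean(axis=0) for label, vecs in support.items()}

def classify(embedding):
    """Assign the label of the nearest class centroid."""
    return min(centroids, key=lambda c: np.linalg.norm(embedding - centroids[c]))

# Embedding of a new, unlabelled sentence.
query = np.array([0.85, 0.15])
print(classify(query))  # medicine
```

No gradient steps are needed for the new classes, which is precisely what makes this style of method attractive when labelled data in the target language is scarce.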
Cross-lingual models, like LASER and XLM-RoBERTa (a robust BERT-based model trained on multilingual data), might bring freshness to the low-resource NLP world.
For more resources and information on this, it’s worth looking through the NeurIPS 2021 tutorial on low-resource NLP.