Abstract

End-to-end neural machine translation does not require specialized knowledge of the investigated language pairs to build an effective system. On the other hand, feature engineering has proven vital in other artificial intelligence fields, such as speech recognition and computer vision. Inspired by works in those fields, in this paper, we propose a novel feature-based translation model by modifying the state-of-the-art transformer model. Specifically, the encoder of the modified transformer model takes as input combinations of linguistic features comprising lemma, dependency label, part-of-speech tag, and morphological label instead of source words. The experimental results for the Russian-Vietnamese language pair show that the proposed feature-based transformer model improves over the strongest baseline transformer translation model by an impressive 4.83 BLEU. In addition, analysis of the experiments reveals that human judgment of the translation results strongly confirms machine judgment. Our model could be useful in building translation systems translating from a highly inflectional language into a noninflectional language.

1. Introduction

Neural machine translation (NMT) is an active research field with many newly published works [1–4]. These works study different aspects of NMT in order to improve it. In [1], the authors proposed a single model to translate from multiple source languages to multiple target languages. In [2], the author proposed using adversarial input to train the model. Adversarial input is generated from the original input with a small perturbation. In [3], the authors proposed a mechanism to adapt NMT models to new languages and domains. In [4], the authors proposed enriching the training dataset with predicted sentences. Although these works develop in different directions, all of them are based on end-to-end NMT. End-to-end NMT is a universally applicable translation paradigm. It is complicated from the technical point of view, but very simple from the point of view of linguistics. In contrast to building an effective statistical machine translation system, building an NMT system does not require specialized knowledge of the applied language pair. For all language pairs, regardless of their characteristics, end-to-end NMT takes a sequence of source words from a fixed dictionary, processes it, and then returns a sequence of target words from another fixed dictionary. The effectiveness and simplicity of application have led to the widespread use of NMT [5–7]. NMT has become the dominant translation paradigm. In ideal circumstances, where all words of the languages frequently appear in a large training dataset and the computational capacity to train translation models is unlimited, end-to-end NMT would work perfectly. In practice, such ideal circumstances do not occur, so the performance of end-to-end NMT suffers from the rare-word and out-of-vocabulary problems, which arise in all translation tasks, especially for low-resource language pairs. To make NMT capable of translating rare words and out-of-vocabulary words, in [8–10], the authors proposed novel NMT models representing words as sequences of subwords, which occur more frequently than the words themselves. For example, according to the byte pair encoding method [8], the Russian word “призывают” (meaning: call) is segmented into two subwords “призыва@@” and “ют.” In subword-based NMT, source sentences are processed as sequences of source subwords, and then sequences of target subwords are generated. Generated sequences of target subwords are concatenated to form target sentences, based on the characters “@@”, which indicate that the subwords containing them should be attached to the following subwords. Their experiment results showed that subword-based NMT delivered a substantial improvement over word-based NMT for the high-resource English-German and English-Russian language pairs. The success of subword representation in NMT was further confirmed in the studies [11, 12] for several high-resource language pairs, such as English-Spanish, English-French, and English-Chinese. Recently, a revolutionary NMT model called the transformer [13, 14], with its self-attention mechanism, significantly outperformed the best previously reported translation models. In combination with subword representation, the transformer has established itself as the state-of-the-art translation paradigm.

Although subword-based NMT models are able to work well without considering the linguistic characteristics of languages, we wonder whether linguistic knowledge helps NMT systems to work more efficiently. We are motivated by works [15, 16] in speech recognition, which follows a similar sequence-to-sequence pattern and where the original raw input data in the form of a discrete speech signal over time can be represented as a sequence of features, such as log-Mel frequencies and Mel frequency cepstrum, as well as by our knowledge of the morphologically rich Russian source language and the analytic Vietnamese target language. Therefore, in this work of building a Russian-Vietnamese machine translation system, we experiment with the idea of representing each source word in a sentence as a combination of features: lemma, grammatical role in the sentence, part-of-speech, and morphological features. The decomposition is deployed only on the Russian source side, not on the Vietnamese target side. The idea comes to mind since Russian is a morphologically rich language, while Vietnamese is an analytic language which lacks morphological marking of case, gender, number, and tense. A Russian sentence consists of tokens inflected from lemmas based on their grammatical roles and part-of-speech tags. Inflected tokens are usually called words. By replacing words in the source sentence with a combination of features, we actually increase their appearance frequency in the training dataset and therefore reduce the severity of the rare-word and out-of-vocabulary problem in inference. For example, in the training dataset, we have two Russian words, “пришёл” (meaning: arrived) and “полюблю” (meaning: fall in love), and in the testing dataset, we have two other words, “приду” (meaning: will arrive) and “полюбил” (meaning: fell in love). In this case, end-to-end NMT systems are going to recognize the two words in the testing dataset as unknown. However, there are close linguistic relationships between the words in the training and testing datasets. Both Russian words “пришёл” and “приду” are inflected forms of the same lemma “прийти.” The training Russian word “пришёл” is inflected from the lemma “прийти,” since it is a verb of masculine gender, in singular number and in the past tense, while the testing Russian word “приду” is inflected from the lemma “прийти,” as it is a verb in singular number and in the future tense. A similar relationship is found for the two words “полюблю” and “полюбил.” The training Russian word “полюблю” is inflected from the lemma “полюбить,” since it is a verb in singular number and in the future tense, while the testing Russian word “полюбил” is inflected from the lemma “полюбить,” as it is a verb of masculine gender in singular number and in the past tense. If we decompose all these words into features, then in the testing phase, we will have lemmas and grammatical features which are well known with regard to the training dataset.

In total, this work is dedicated to building a novel transformer-based NMT model taking a sequence of vectors of linguistic features from source words and predicting a sequence of target words.

The rest of this paper is organized as follows. A brief overview of related works is given in the following section. The third section outlines our novel methodology of source-word decomposition for neural machine translation. The fourth section describes the materials and methods used in the work. The fifth section presents and analyzes the experiment results. The final section lists our conclusions from this work.

2. Related Works

This section briefly examines a variety of translation unit representation methodologies used in machine translation systems for several language pairs containing at least one inflectional language, such as Russian, Czech, German, and English.

Before the emergence of NMT, phrase-based SMT used to be very popular. There is a wide range of literature studying phrase-based SMT. Among these studies, the factored phrase-based SMT models [17–19] are the most worthy of mention in our work. In factored phrase-based SMT, linguistic features such as lemma, part-of-speech, and morphological features are integrated into the surface form of the word. Factored phrase-based SMT systems improve translation quality over standard SMT systems for multiple language pairs, such as English-German, English-Spanish, and English-Czech. The approach gained further popularity after it had been implemented in the famous SMT tool called Moses [20]. In [21, 22], the authors continued to apply and develop the approach and achieved good results. Although the approach belongs to a group of obsolete statistical translation paradigms, its success in integrating linguistic features inspires us to take advantage of linguistic information in redefining the translation unit in the modern NMT paradigm.

In recent years, there has been growing interest in integrating linguistic features into NMT architectures. In [23], the authors proposed a novel factored subword-based neural model based on recurrent neural networks that learns source translation unit embeddings, leveraging subword embedding, subword-tag embedding, lemma embedding, part-of-speech embedding, dependency label embedding, and morphological label embedding. They used many different linguistic features in addition to the subword itself to take advantage of the high-resource characteristic of the English-German language pair. They found that the factored subword-based neural model notably improved translation for the high-resource English-German language pair. Our preliminary experiments with the state-of-the-art transformer NMT model confirm their finding. Our subword-based transformer model combining linguistic-feature embeddings with subword embedding outperforms a standard subword-based transformer model for the Russian-Vietnamese language pair. However, we believe that we can further improve the system, considering our context of a low-resource and linguistically distant language pair. Due to the totally different morphology of Russian and Vietnamese, in the training dataset, the number of unique words of highly inflectional Russian on the source side is many times larger than the number of unique words of noninflectional Vietnamese on the target side, which leads to a high probability that a Russian word encountered in the inference phase is unseen in the training dataset. To solve the problem, instead of integration, we suggest using replacement for translation from Russian into Vietnamese. Specifically, we calculate the source translation unit embedding using only linguistic-feature embeddings, without word or subword embedding.

Word representation in [24] bears a close resemblance to our translation unit representation. The authors used a combination of lemma and part-of-speech tag to represent a word in translation for multiple language pairs: English-German, English-Turkish, English-Czech, and English-Latvian. The main difference from our technique lies in the side of translation. They applied their technique on the target side of an NMT system, while we redefine the source-side translation unit. According to their method, a source word is translated into a vector of lemma and part-of-speech tag. Based on that vector, the system predicts a surface form of the target word. Obviously, their approach is geared towards translation into an inflectional language. In our case of translation from Russian into noninflectional Vietnamese, we could not take advantage of their technique.

Recently, in [25], the authors introduced an approach to modeling word formation in transformer-based NMT for the English-German language pair. They segmented both English source words and German target words into sequences of vectors of subwords and linguistic subword tags. They reported an improvement over a standard system. Unfortunately, their approach is language-specific, since they deployed a morphological analyzer for the English-German language pair only. Although their approach is interesting, it is not applicable to our task, as we have not found any similar subword taggers for our Russian-Vietnamese language pair.

3. Source-Word Decomposition for Neural Machine Translation

3.1. Base Transformer Model

Our feature-based transformer model is based on the original transformer model [13], which is the state-of-the-art NMT model. The transformer model has an encoder-decoder architecture. In this work, our novel contribution is the modification of the embedding representation in the encoder; therefore, in the following, we describe that part of the encoder in more detail. Readers interested in the general architecture of the transformer model are referred to the original paper [13].

The encoder of the model maps a sequence of source words $x = (x_1, \dots, x_n)$ from a fixed dictionary $D_x$ into a sequence of individual embeddings of a fixed size $d_{model}$. The process of mapping is as follows. First, the encoder looks up each source word $x_i$ in a dictionary of embeddings $E_x$ and retrieves its embedding vector $e_i$ of a fixed size $d_{model}$. Next, the encoder looks up the position $i$ of $x_i$ in another dictionary of positional embeddings $E_{pos}$ and retrieves a positional embedding $p_i$, where $p_i \in \mathbb{R}^{d_{model}}$. Finally, the encoder adds embedding $p_i$ to $e_i$ weighted by a factor $\sqrt{d_{model}}$, that is, $c_i = \sqrt{d_{model}}\, e_i + p_i$. Applying the mapping process to all source words, the encoder generates a sequence of combined embeddings $c = (c_1, \dots, c_n)$.
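As a rough illustration of this mapping, a minimal PyTorch sketch is given below; the class and variable names are ours and hypothetical, and the actual models in Section 4.2 are built by adapting the cited implementation rather than this code.

import math
import torch
import torch.nn as nn

class SourceEmbedding(nn.Module):
    """Word embedding plus positional embedding, as in the standard transformer encoder."""
    def __init__(self, vocab_size, d_model, max_len=100):
        super().__init__()
        self.d_model = d_model
        self.tok_embedding = nn.Embedding(vocab_size, d_model)  # dictionary of word embeddings
        self.pos_embedding = nn.Embedding(max_len, d_model)     # dictionary of positional embeddings

    def forward(self, src):  # src: [batch, src_len] tensor of word indices
        batch_size, src_len = src.shape
        pos = torch.arange(src_len).unsqueeze(0).repeat(batch_size, 1)  # positions 0..n-1
        # word embedding scaled by sqrt(d_model), then added to the positional embedding
        return self.tok_embedding(src) * math.sqrt(self.d_model) + self.pos_embedding(pos)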

Considering the relationships between words in the sentence with the self-attention mechanism, the encoder transforms the sequence of individual embeddings $c = (c_1, \dots, c_n)$ into a sequence of context-aware continuous representations $z = (z_1, \dots, z_n)$.

Based on the sequence $z$, the decoder of the model generates a sequence of target words $y = (y_1, \dots, y_m)$ from another fixed dictionary $D_y$. The decoder generates one target word at a time, using previously generated target words as additional input. Mathematically, the transformer model can be represented as a composition of functions as follows:

$$z = f_{enc}\bigl(\sqrt{d_{model}}\,E_x(x) + E_{pos}\bigr),$$
$$y_t = f_{dec}\bigl(z,\ \sqrt{d_{model}}\,E_y(y_{<t}) + E_{pos}\bigr).$$

In the above equations, the notations $f_{enc}$ and $f_{dec}$ stand for trainable functions, while the notations $E_x$ and $E_y$ are the dictionaries of trainable embeddings with dimension $d_{model}$.

3.2. Source-Word Decomposition

Unlike the basic transformer model, which takes source words $x_i$ from a fixed dictionary $D_x$ as input, our proposed transformer model takes tuples of linguistic features:
(1) lemma $l_i$ from a fixed dictionary $D_l$,
(2) dependency label $d_i$ from a fixed dictionary $D_d$,
(3) part-of-speech tag $p_i$ from a fixed dictionary $D_p$, and
(4) morphological features $m_i$ from a fixed dictionary $D_m$

in the place of the corresponding source words $x_i$. This design considers the fact that the Russian source word is, in fact, a surface form of a lemma, which is inflected on the basis of its grammatical role and part-of-speech. In other words, a source word can be viewed as a combination of lemma, dependency label, part-of-speech tag, and morphological features.

The grammatical role of a word in a sentence is expressed through the dependency label assigned to the word. Dependency labels are presented in the study [26]. Part-of-speech types and the corresponding tags of words are listed alphabetically in Table 1; they are taken from Version 2 of Universal Dependencies (https://universaldependencies.org/u/pos/index.html) and customized for use with Russian.

Each part-of-speech follows its own morphology rules. For instance, from the lemma “любовь,” which is an inanimate noun of feminine gender, we can generate many surface forms according to the grammar rules for a noun, which has 6 cases (nominative, accusative, genitive, dative, instrumental, and prepositional) and two numbers (singular and plural). A noun in Russian also has gender and animacy features. Each gender (masculine, feminine, and neuter) has its own inflection rules. Similarly, each animacy feature (animate and inanimate) has its own inflection rules. All surface forms inflected from the lemma “любовь” are presented in Table 2.

Replacing source words with a combination of the features, we replace the input dictionary $D_x$ of cardinality $|D_x|$ with the tuple of input dictionaries $(D_l, D_d, D_p, D_m)$ of the summarized cardinality $|D_l| + |D_d| + |D_p| + |D_m|$. From the above analysis, we can see that the summarized cardinality is many times smaller than the cardinality of the dictionary of source words, which are combinations of the features. In the extreme counting, the cardinality of the dictionary of source words as combinations of the features is the product $|D_l| \cdot |D_d| \cdot |D_p| \cdot |D_m|$. The reduced input dictionary actually helps to reduce the severity of the rare-word and out-of-vocabulary problem in inference.

An example of applying source-word decomposition to the Russian word “последние” (meaning: last) in a sentence is given in Table 3. The application results in a vector of linguistic features: “последний” (lemma), “amod” (dependency label: an adjectival modifier of a noun), “ADJ” (part-of-speech tag: adjective), and “Animacy = Inan, Case = Acc, Degree = Pos, Number = Plur” (morphological features: inanimate, accusative case, positive degree of comparison, and plural number).
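For illustration, such feature vectors can be produced automatically with the Stanza toolkit used later in Section 4.2; the following sketch is only an assumed usage example, and the input sentence is illustrative.

import stanza

# Build a Russian pipeline (run stanza.download('ru') once beforehand)
nlp = stanza.Pipeline(lang='ru', processors='tokenize,pos,lemma,depparse')

doc = nlp('За последние годы ситуация изменилась.')  # illustrative sentence containing "последние"
for word in doc.sentences[0].words:
    # each source word is decomposed into (lemma, dependency label, part-of-speech tag, morphology)
    print(word.text, word.lemma, word.deprel, word.upos, word.feats)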

3.3. Feature-Based Transformer Model

In order to decompose source words into tuples of features, we make changes in the encoder of the transformer model so that it takes vectors of linguistic features as inputs in place of source words. The modified encoder requires four sequences of linguistic features: lemmas $l_i$, dependency labels $d_i$, part-of-speech tags $p_i$, and morphological features $m_i$, for $i = 1, \dots, n$. Each linguistic-feature tag is considered as a string in a corresponding dictionary $D_l$, $D_d$, $D_p$, or $D_m$. Trainable embeddings of all linguistic-feature tags are looked up in the corresponding dictionaries by the modified encoder. As proposed in [23], we apply the concatenation operation to the embeddings of all linguistic-feature labels of each source word. The concatenation results in a concatenated embedding corresponding to each source word in the sentence. The positional embedding of each source word is then added to the concatenated embedding to form the final embedding representing the source word. Given a sequence of final embeddings representing the source words in a sentence, the following steps of the modified encoder are essentially the same as in the standard encoder.

Mathematically, the feature-based transformer model can be represented as a composition of functions as follows:

$$z = f_{enc}\bigl([E_l(l);\ E_d(d);\ E_p(p);\ E_m(m)] + E_{pos}\bigr),$$
$$y_t = f_{dec}\bigl(z,\ \sqrt{d_{model}}\,E_y(y_{<t}) + E_{pos}\bigr).$$

In the above equations, the notations $E_f$ for $f \in \{l, d, p, m\}$ are the dictionaries of trainable embeddings. It is worth mentioning that the sum of the dimensions of the embeddings $E_l$, $E_d$, $E_p$, and $E_m$ is equal to $d_{model}$ to make the concatenated embeddings compatible with the dimension of the positional embeddings $E_{pos}$.
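A minimal sketch of the modified embedding layer is given below; the class and argument names are hypothetical, and the default dimensions follow Section 4.2, where they sum to 256.

import torch
import torch.nn as nn

class FeatureEmbedding(nn.Module):
    """Concatenate linguistic-feature embeddings into one d_model-dimensional source embedding."""
    def __init__(self, n_lemma, n_dep, n_pos, n_morph, max_len=100,
                 d_lemma=190, d_dep=22, d_pos=22, d_morph=22):
        super().__init__()
        d_model = d_lemma + d_dep + d_pos + d_morph        # must equal the positional dimension
        self.lemma_emb = nn.Embedding(n_lemma, d_lemma)
        self.dep_emb = nn.Embedding(n_dep, d_dep)
        self.pos_tag_emb = nn.Embedding(n_pos, d_pos)
        self.morph_emb = nn.Embedding(n_morph, d_morph)
        self.pos_emb = nn.Embedding(max_len, d_model)      # positional embedding

    def forward(self, lemma, dep, pos_tag, morph):         # each input: [batch, src_len]
        batch_size, src_len = lemma.shape
        positions = torch.arange(src_len).unsqueeze(0).repeat(batch_size, 1)
        concat = torch.cat([self.lemma_emb(lemma), self.dep_emb(dep),
                            self.pos_tag_emb(pos_tag), self.morph_emb(morph)], dim=-1)
        return concat + self.pos_emb(positions)            # final embedding of each source word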

4. Materials and Methods

4.1. Materials

We created our corpus very much in the same way as indicated in the works [27, 28], which are dedicated to studying another low-resource language pair, Chinese-Vietnamese. First, we picked 33,027 Russian sentences from the News Commentary data (http://www.statmt.org/wmt13/training-parallel-nc-v8.tgz) of the shared task Machine Translation of the ACL 2013 Eighth Workshop on Statistical Machine Translation. Next, we translated the Russian sentences into Vietnamese. The translation was carried out as follows. First, we used the Google Translate service to translate all the Russian sentences into Vietnamese. Then, we corrected the translation results so that they not only reflect the meaning of the Russian source sentences but also sound natural, taking advantage of the fact that we are native speakers of Vietnamese and understand Russian. As a result, we had 33,027 Russian-Vietnamese sentence pairs. We then randomly shuffled the sentence pairs. From the shuffled corpus, we first took out 1,500 sentence pairs to form the testing dataset. After that, we took another 1,500 sentence pairs to form the development dataset. The remaining 30,027 sentence pairs were used as the training dataset. A summary of the datasets is presented in Table 4 [29]. The summary reveals the huge difference in dictionary size between Russian and Vietnamese. The number of unique tokens in the Russian training dataset is over 8.5 times larger than that on the Vietnamese side. The difference can be explained by the fact that Russian is a morphologically rich language, while Vietnamese is a noninflectional language. On the other hand, the summary also highlights the difference in average sentence length between Russian and Vietnamese. The number of tokens per Vietnamese sentence is over 1.5 times larger than that on the Russian side. In other words, to express the same idea, on average, we need to use more Vietnamese words than Russian words.
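The shuffling and splitting described above can be sketched as follows; the file names and the random seed are hypothetical.

import random

with open('corpus.ru', encoding='utf-8') as f_ru, open('corpus.vi', encoding='utf-8') as f_vi:
    pairs = list(zip(f_ru.read().splitlines(), f_vi.read().splitlines()))  # 33,027 sentence pairs

random.seed(0)              # any fixed seed makes the split reproducible
random.shuffle(pairs)

test = pairs[:1500]         # testing dataset
dev = pairs[1500:3000]      # development dataset
train = pairs[3000:]        # remaining 30,027 pairs form the training dataset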

4.2. Methods

To evaluate feature-based NMT, we performed four experiments. In each experiment, we built and assessed an NMT model. The input and output of each model are presented in Table 5. We used the deep learning library PyTorch [30] to build the NMT models with the required input and output by altering an implementation (https://github.com/bentrevett/pytorch-seq2seq) of the state-of-the-art transformer. The source code of the proposed feature-based and baseline NMT models is provided on the GitHub page (https://github.com/ThienCNguyen/Russian-Vietnamse) of the first author.

In the first experiment, we built the baseline W2W model, which takes a sequence of Russian words as input and predicts a sequence of Vietnamese words as output. The W2W model comprises an encoder and a decoder. The encoder has a 256-dimensional embedding layer, 256-dimensional hidden states, three layers each consisting of an 8-head self-attention sublayer and a 512-dimensional feedforward sublayer, and dropout layers with a dropout level of 10%. The decoder has a similar configuration to the encoder. We used the tokenized training and development datasets to train the W2W model. We tokenized the Russian sentences in the training and development datasets to produce the corresponding sequences of Russian words by using space delimiters between Russian words. We tokenized the Vietnamese sentences in the training and development datasets to produce the corresponding sequences of Vietnamese words by using a tool provided in [31]. Using the Adam optimizer with a learning rate of 0.0005, as reported in [32], we trained the W2W model for 20 epochs on the training dataset. Then, we chose the parameters of the model providing the least cross-entropy loss on the development dataset.
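For illustration only, the same hyperparameters can be expressed with PyTorch built-ins as follows; the actual W2W model was built by adapting the referenced pytorch-seq2seq implementation rather than nn.Transformer.

import torch
import torch.nn as nn

# Hyperparameters of the baseline W2W model, written with PyTorch built-ins for illustration
transformer = nn.Transformer(
    d_model=256,             # embedding size and hidden-state size
    nhead=8,                 # 8-head self-attention
    num_encoder_layers=3,    # three encoder layers
    num_decoder_layers=3,    # three decoder layers
    dim_feedforward=512,     # 512-dimensional feedforward layer
    dropout=0.1,             # dropout level = 10%
    batch_first=True,
)
optimizer = torch.optim.Adam(transformer.parameters(), lr=0.0005)  # Adam with learning rate 0.0005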

In the second experiment, we built the S2S model which takes a sequence of Russian subwords as input and predicts a sequence of Vietnamese subwords as output. The S2S model has the same configuration and optimization procedure as the baseline W2W model. To produce sequences of subwords for building the model, we tokenized sentences in the training and development datasets by using a tool provided in [33].

In the third experiment, we built the SnF2S model, which takes a sequence of Russian subwords and their features (subword tag, lemma, dependency label, part-of-speech tag, and morphological label) as input and predicts a sequence of Vietnamese subwords as output. The subword tag is one of four types, B, I, E, and O, corresponding to the beginning part, inside part, ending part, and full word, respectively. The linguistic features of a subword are the same as those of the word containing it, which are generated by the deep learning tool Stanza [34]. The SnF2S model is an improvement on the model proposed in [23]: we substituted the recurrent neural networks with the state-of-the-art transformer. The SnF2S model also has a similar configuration and optimization procedure as the S2S model except for the encoder embedding layer and the dimension of the hidden states. The encoder embedding layer is composed of six embedding sublayers: 352-dimensional subword embedding, 7-dimensional subword-tag embedding, 117-dimensional lemma embedding, 12-dimensional dependency label embedding, 12-dimensional part-of-speech-tag embedding, and 12-dimensional morphological label embedding. We chose the dimensions of the embeddings following the ratio recommended in [23]. We applied 512-dimensional hidden states to make them compatible with the embedding dimension.

In the second and third experiments, with sequences of Vietnamese subwords as output, we applied postprocessing that concatenates subwords to form a sequence of words, in the same way as proposed in [23].
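This postprocessing essentially removes the “@@” continuation markers; a small illustrative helper of our own (not the code of [23]) is shown below.

def merge_subwords(tokens):
    """Concatenate subwords back into words using the '@@' continuation markers."""
    return ' '.join(tokens).replace('@@ ', '')

# Example with the subwords from the Introduction:
print(merge_subwords(['призыва@@', 'ют']))  # -> 'призывают'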

In the fourth experiment, we built the proposed feature-based NMT model called F2W, which takes sequences of features of Russian source words (lemma, dependency label, part-of-speech tag, and morphological label) as input and predicts a sequence of Vietnamese words as output. The F2W model has a similar configuration and optimization procedure as the SnF2S model except for the encoder embedding layer and the dimension of the hidden states. The encoder embedding layer is composed of four embedding sublayers: 190-dimensional lemma embedding, 22-dimensional dependency label embedding, 22-dimensional part-of-speech-tag embedding, and 22-dimensional morphological label embedding. We chose 256 dimensions for the hidden states to be compatible with the embedding dimension.

In all experiments, we used the same assessment procedure for all NMT models. First, we fed the Russian sentences of the testing dataset to the models. Then, we compared the predictions of the NMT models with the reference Vietnamese sentences in the testing dataset in terms of the lowercase BLEU score, which is calculated with the natural language toolkit NLTK [35].
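For reference, the corpus-level lowercase BLEU score can be computed with NLTK roughly as follows; the toy sentences are placeholders for the 1,500 test predictions and references.

from nltk.translate.bleu_score import corpus_bleu

hypotheses = ['Châu Âu vẫn ấn tượng với chuyến thăm của Barack Obama .']  # placeholder predictions
references = ['Châu Âu vẫn ấn tượng với chuyến thăm của Barack Obama .']  # placeholder references

hyps = [h.lower().split() for h in hypotheses]
refs = [[r.lower().split()] for r in references]  # one reference per hypothesis, wrapped in a list

score = corpus_bleu(refs, hyps)                   # corpus-level BLEU in [0, 1]
print(f'BLEU = {100 * score:.2f}')                # reported on the usual 0-100 scale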

5. Results and Analysis

Primary translation results are provided on the GitHub page (https://github.com/ThienCNguyen/Russian-Vietnamse) of the first author. In this section, we analyze the translation results for Russian-Vietnamese. We compare the performance of the proposed feature-based NMT model with the baseline NMT models. We also present human judgment of the translation results.

5.1. Machine Judgment

Figure 1 shows the corpus-level BLEU scores of the translation results on the testing dataset by the NMT models. We can observe that, among the baseline models, the SnF2S model yields the best result. In comparison with the word-based W2W model, the subword-based S2S model improves by 2.54 BLEU. Compared with the subword-based S2S model, the subword-based feature-added SnF2S model provides an improvement of 3.83 BLEU. This result suggests that, to prove the effectiveness of our proposed model, we should compare it with the SnF2S model, the strongest baseline. In comparison with the strongest baseline SnF2S model, our feature-based F2W model outperforms it by an impressive 4.83 BLEU. Nevertheless, at the sentence level, the proposed F2W model does not always prove itself better than the SnF2S model. Among the 1,500 sentences in the testing dataset, the F2W model worsens the translation quality in 41.13% of cases, while it improves the BLEU score in 57.6% of cases. Details of the comparison are presented in Figure 2.

5.2. Human Judgment

In addition to machine judgment, we also applied human judgment to the translation results of the NMT models. We performed a human analysis to obtain a more complete assessment of the translation results. Specifically, we randomly picked 5 cases from the testing dataset. Here, we present the selected cases and the human analysis of the translation results. The description of each case consists of a Russian source sentence, its meaning in English, the Vietnamese reference, the translation results by the NMT models, and their corresponding sentence-level BLEU scores.

Table 6 shows the translation results by the NMT models for a simple source sentence. Although the source sentence is simple, two models, W2W and SnF2S, give wrong translations. Their translations, with the meanings “Europe is still in place of Barack Obama, Barack Obama has gone” and “Europe is still impressed with the tragedy of Barak Obama,” are far from the initial meaning of the source sentence. On the other hand, two NMT models, S2S and the proposed F2W, perform pretty well for this source sentence. The meaning of the translation by S2S is “Europe is still impressed with the impression of the visit to Obama,” which is close to the meaning of the source sentence. The result still has a flaw: the repeated phrase “ấn tượng” (meaning: impression) in the translation result may make it more difficult to catch the meaning. Compared to the other models, the proposed F2W model gives the best translation result. The meaning of its translation is “Europe is still impressed with the visit of Barack Obama,” which bears the closest resemblance to the meaning of the source sentence.

Table 7 shows translation results by NMT models from a more complicated source sentence where the subject has a singular form but plays a plural role. The quality of translation by NMT models is very different in this case. Three baseline models, W2W, S2S, and SnF2S, provide translation results with the wrong meanings “but most books and Stalin’s red is a positive light in light,” “But most of Stalin’s book and press pretend,” and “But most of Stalin’s books and authors were light cakes under the light.” At the same time, our proposed F2W model gives a translation identical to the gold reference.

Table 8 shows the translation results by the NMT models for a complex source sentence where an infinitive clause is used as a subject complement. This long complex sentence is a challenge for NMT models. No translation model gives a good enough result in this case. Furthermore, the quality of the translation results by the NMT models for this example is a perfect reflection of the overall machine judgment of the NMT models. Among the baseline models, the SnF2S model gives the best result. Specifically, it partly translates the key phrase “красных хмеров” (meaning: Khmer Rouge) into “đỏ,” while the other baseline models mistranslate the phrase. Compared to the best baseline SnF2S model, the proposed F2W model also partly translates that key phrase and improves the translation by successfully translating the other key phrase “дипломатических усилий” (meaning: diplomatic efforts).

Table 9 shows the translation results by the NMT models for a sentence where the proposed F2W model slightly worsens the translation quality in terms of the BLEU score. In comparison with the best baseline SnF2S model (62.69 BLEU), the proposed F2W model provides a lower BLEU score (60.29 BLEU). From the human perspective, the meaning of the translation result by the F2W model (meaning: we operate with the private sector, not compete with it) is very close to the meaning of the translation result by the SnF2S model (meaning: we work with the private sector, not competing with it) and the reference itself.

Table 10 shows the translation results by the NMT models for a sentence where the proposed F2W model significantly worsens the translation quality in terms of the BLEU score. In comparison with the best baseline SnF2S model (55.39 BLEU), the proposed F2W model provides a far lower BLEU score (20.69 BLEU). Nevertheless, from the human perspective, the translation result by the F2W model (meaning: from the Persian Gulf, the oil and gas import region in the United States) partially reflects the meaning of the reference, while the SnF2S model mistranslates the Russian source sentence. The meaning of the translation result by the SnF2S model is “a third of oil exported only to the United States.”

Overall, both machine and human judgments demonstrate the superiority of the proposed feature-based transformer model over the other available transformer translation models for translating from Russian into Vietnamese.

6. Conclusions and Perspectives

In this paper, we have successfully integrated linguistic knowledge into the state-of-the-art transformer translation model. We have introduced the feature-based transformer model, which replaces source words with combinations of their features comprising lemma, dependency label, part-of-speech tag, and morphological label. We have empirically compared the proposed model with other baseline models. The experiment results for the Russian-Vietnamese language pair show that our model outperforms the other models by a large margin.

Based on the translation results and our knowledge of the investigated Russian and Vietnamese languages and their relations to other languages, we strongly recommend the feature-based NMT model for building systems translating from highly inflectional synthetic Slavic languages including Russian, Belarusian, Ukrainian, Polish, Bulgarian, Czech, and Serbian into noninflectional analytic languages, such as Vietnamese and Chinese.

Data Availability

The dataset used to support the findings of this study is available from the corresponding author upon request.

Conflicts of Interest

The authors declare that there are no conflicts of interest.