Abstract
Relying on large-scale parallel corpora, neural machine translation has achieved great success on certain language pairs. However, acquiring a high-quality parallel corpus remains one of the main difficulties in machine translation research. To address this problem, this paper proposes unsupervised domain-adaptive neural machine translation, a method that can be trained using only two unrelated monolingual corpora and still obtain good translation results. The paper first measures the matching degree of translation rules by adding relevant topic information to the rules and dynamically calculating the similarity between each translation rule and the document to be translated during decoding. Second, through the joint training of multiple translation tasks, the source language can learn useful semantic and structural information, during its translation into the target language, from the monolingual corpus of a third language that is not parallel to the current two languages. Experimental results show that the proposed method obtains better results than traditional statistical machine translation.
1. Introduction
At present, with the gradual deepening of international exchanges, people’s demand for language translation is increasing day by day [1, 2]. There are many languages in the world, the Internet has become the most convenient platform for obtaining information, and users have an increasingly urgent demand for online translation [3]. Each of the many languages on the Internet carries a great deal of ambiguity and changes all the time, which places higher requirements on translation services [4, 5]. In the prior art, the techniques commonly used to realize automatic machine translation are methods based on neural networks [6, 7] and methods based on statistical machine translation [8, 9].
The former is neural machine translation (NMT); the latter is statistical machine translation (SMT). Iswarya and Radha [10] used an unsupervised method to achieve cross-lingual embeddings and trained a good word-to-word translation model. Building on this work, Imankulova et al. [11] generated a pseudoparallel corpus for training through denoising and back-translation and obtained good experimental results. Lee et al. [12] used a character-level decoder to improve the quality of translation for morphologically rich languages. Morente-Molinera et al. [13] selected granular information of words and characters in the encoder and used multiple attention mechanisms on the decoding side so that information of different granularities collaboratively helps translation. Zhang et al. [14] modelled the similarity between language pairs in the same language family; their encoder was composed of a character-level one-way RNN and a word-level two-way RNN and used a top-down hierarchical attention mechanism to obtain word-level representations first. Park et al. [15] proposed subword regularization, using a unigram language model to generate multiple candidate subword sequences and enriching the encoder input to enhance the robustness of the translation system. Zhao et al. [16] introduced multigranularity BPE representations to obtain an averaged semantic representation of the vocabulary. Zhang et al. [17] argued that the encoder word vector layer, decoder word vector layer, and decoder output layer have different functions, so the choice of BPE granularity should also differ across layers. Zhang et al. [18] used denoising autoencoders and adversarial training to map the two languages into the same latent space and iteratively trained translation models in both directions. Wang et al. [19] first pretrained the word vectors and implemented an unsupervised translation model using autoencoders and back-translation. Dabre et al. [20] argued that earlier unsupervised translation models use a shared encoder to encode the semantic representations of different languages, which can easily lose the individual characteristics of each language and thereby limit translation performance; they therefore proposed that each language use its own encoder and that only the weights of the last few encoder layers and the first few decoder layers be shared.
However, the data-weighting methods in the prior art assign weights to sentences based on their similarity to the in-domain corpus. These existing approaches cannot do without annotated corpora, and they require the original training corpus to be segmented into many small elements, which increases the number of model parameters and complicates the operations; as a result, neural machine translation becomes inefficient and cannot accurately adapt between different domains [21, 22]. To solve these problems, this paper proposes unsupervised domain-adaptive neural machine translation. The matching degree of translation rules is measured by adding relevant topic information to the rules and dynamically calculating the similarity between each translation rule and the document to be translated during decoding. Finally, through the joint training of multiple translation tasks, the source language can learn useful semantic and structural information, during its translation into the target language, from the monolingual corpus of a third language that is not parallel to the current two languages.
2. Machine Translation Related Technologies
2.1. Machine Translation Framework
At this stage, statistical machine translation is divided into the generative noise channel model [23, 24] and the discriminative log-linear model [25, 26]. We assume that the source sentence is s and the target sentence is t.
2.1.1. Noise Channel Model
The noise channel model is proposed based on the coding idea in information theory. In this model, the machine translation task is regarded as an information transmission process in which the target sentence t is transformed into the source sentence s after passing through a noise channel. The process of searching for the t that maximizes the translation probability is as follows:

$\hat{t} = \arg\max_{t} P(t \mid s).$
According to the Bayesian principle, the above formula can be converted to

$\hat{t} = \arg\max_{t} P(s \mid t)\, P(t),$

where P(t) is given by the language model and P(s | t) by the translation model.
The translation model based on the noise channel cannot use more knowledge than the source sentence and target sentence in the translation process, and the importance of the language model and the translation model is fixed and cannot be adjusted according to the actual situation.
2.1.2. Log-Linear Model
The translation system based on the log-linear model decomposes the translation probability into a weighted combination of feature functions:

$P(t \mid s) = \frac{\exp\left(\sum_{n} \lambda_{n} h_{n}(s, t)\right)}{\sum_{t'} \exp\left(\sum_{n} \lambda_{n} h_{n}(s, t')\right)},$

where each $h_{n}(s, t)$ is a feature function and $\lambda_{n}$ is its weight.
The translation system based on the log-linear model is very flexible, and additional descriptive features can be added as needed, such as the number of words contained in a translation candidate or the number of rules used. Figure 1 shows the construction process of a translation system based on the log-linear model. The machine translation system uses three kinds of data: training data, development data, and test data. The language model is trained on large-scale monolingual training data. The translation system obtains bilingual word alignment information from the bilingual parallel training data through machine learning methods, extracts translation rules, and estimates their probabilities. The feature weights are adjusted by minimum error rate training on the development data. Finally, system performance is evaluated by translating the test data with the trained models and tuned weights.
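To make the log-linear combination of features concrete, the following is a minimal sketch in Python. The feature names and values are hypothetical illustrations, not the features used in the paper; only the scoring scheme (weighted sum of feature values per candidate) follows the formula above.

```python
def loglinear_score(features, weights):
    """Combine feature values h_n(s, t) with weights lambda_n (illustrative)."""
    return sum(weights[name] * value for name, value in features.items())

# Hypothetical feature values for one translation candidate t of a source sentence s.
candidate_features = {
    "log_translation_model": -4.2,   # log P(s | t) from the translation model
    "log_language_model": -6.8,      # log P(t) from the target language model
    "word_count": 9.0,               # number of words in the candidate
    "rule_count": 3.0,               # number of rules used to build the candidate
}
weights = {"log_translation_model": 1.0, "log_language_model": 0.7,
           "word_count": 0.1, "rule_count": 0.05}

score = loglinear_score(candidate_features, weights)
print(f"log-linear score: {score:.3f}")
# The decoder would keep the candidate with the highest score; exponentiating and
# normalising over all candidates would give the probability P(t | s).
```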
2.2. Unsupervised Domain Adaptation
Effective feature extraction is a common basic element of various machine learning methods. As shown in Figure 2, suppose the input layer I is a p-dimensional vector and the hidden layer H is a q-dimensional vector. First, construct a p-dimensional output layer and initialize the parameters of the two layers randomly. Given the input I, compute the hidden layer state H and the output layer result O′; then use the difference between I and O′ as the loss for backpropagation to update the parameters of the two layers. The single-hidden-layer neural network constructed in this way can be understood as encoding the input I to obtain the hidden layer H and decoding the hidden layer H to obtain a reconstruction of the input I. If q < p, the parameters obtained by such training compress the input while minimizing the coding loss. If q ≥ p, a regularization factor must be added to the loss function for sparse coding or dimension expansion.
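The following is a minimal sketch of such a single-hidden-layer autoencoder. PyTorch is assumed purely for illustration (the paper does not name a framework), and the dimensions and training data are toy values.

```python
import torch
import torch.nn as nn

# Minimal single-hidden-layer autoencoder sketch: p-dimensional input, q-dimensional
# hidden layer, reconstruction loss between the input and the decoded output.
p, q = 128, 32

class AutoEncoder(nn.Module):
    def __init__(self, p, q):
        super().__init__()
        self.encoder = nn.Linear(p, q)   # encode input I into hidden state H
        self.decoder = nn.Linear(q, p)   # decode H back into a reconstruction O'
    def forward(self, x):
        h = torch.sigmoid(self.encoder(x))
        return self.decoder(h), h

model = AutoEncoder(p, q)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

x = torch.randn(64, p)                   # a toy batch standing in for real features
for _ in range(100):
    recon, h = model(x)
    loss = loss_fn(recon, x)             # difference between input and reconstruction
    # if q >= p, an extra sparsity penalty such as h.abs().mean() would be added here
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```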
In research on domain adaptation, deep learning algorithms mainly learn intermediate representations between input and output, motivated by the observation that such representations can bring better cross-domain machine learning performance. Since deep learning can be trained in an unsupervised way, massive open-domain data can be used to learn topic information representations for a domain. Deep learning is one of the fastest-growing areas of machine learning in recent years; it has made breakthroughs in many natural language processing applications and is a direction worth exploring.
3. Machine Translation Algorithm Based on Unsupervised Domain Adaptation
3.1. Sequence-Dependency Structure
This paper uses Transformer as the basic structure, creates an encoder and a decoder for each language, and shares the parameters of some layers during training. Training tasks are established among English, French, and German at the same time to obtain the translation model. For example, when training the English⟶French and English⟶German tasks, because French and German have similar language structures, useful semantic and structural information can be learned jointly from the different target languages.
The main feature of Transformer is that it does not rely on RNNs or CNNs but uses only the self-attention mechanism to achieve an end-to-end translation model. The self-attention mechanism performs attention calculations between each word in a sentence and all the other words in the sentence; the purpose is to learn the dependencies within the sentence and capture its internal structure. The structure of the Transformer architecture is shown in Figure 3.
The encoder and decoder of Transformer are both multilayer network structures, and both contain M identical layers. In the encoder, each layer contains two sublayers, namely, the self-attention mechanism layer and the feedforward neural network layer. In each layer of the decoder, there are three sublayers: in addition to a masked self-attention mechanism layer and a feedforward neural network layer, there is also a multihead attention mechanism over the encoder output. Residual connections are used between the sublayers and can be expressed by the following formula:

$H_{i} = \mathrm{LayerNorm}\left(H_{i-1} + F_{i}(H_{i-1})\right),$

where $H_{i}$ represents the output of the i-th sublayer and $F_{i}$ represents the function of that sublayer.
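A minimal sketch of one such encoder layer is given below, again assuming PyTorch and illustrative dimensions: a self-attention sublayer and a feedforward sublayer, each wrapped in a residual connection followed by layer normalization as in the formula above.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.self_attn(x, x, x)   # every position attends to all positions
        x = self.norm1(x + attn_out)            # residual connection around self-attention
        x = self.norm2(x + self.ffn(x))         # residual connection around the FFN
        return x

layer = EncoderLayer()
tokens = torch.randn(2, 10, 512)                # (batch, sentence length, d_model)
print(layer(tokens).shape)                      # torch.Size([2, 10, 512])
```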
3.2. Bilingual Single-Task Model
In this paper, S and T are used to represent the sets of sentences in the source language and the target language; Ms and Mt are, respectively, the language models trained on the monolingual data of the source language and the target language; and Ms⟶t and Mt⟶s represent the translation models from the source language to the target language and from the target language to the source language. The single-task model mainly consists of the following three steps:

(1) Initialization: the initialization of the model is roughly divided into two ways. The first method uses word2vec to train the word vectors of the two languages separately and then maps the word vectors of the two languages into the same latent space by learning a transformation matrix; in this way, a bilingual vocabulary with good accuracy can be obtained. The second method uses byte-pair encoding (BPE) to split words into subword units, which reduces the size of the vocabulary while eliminating the “UNK” problem during translation. In addition, compared with the first method, the second method mixes and shuffles the two monolingual corpora to learn the word vector features together, so the source language and the target language can share the same vocabulary.

(2) Language model: in the bilingual single-task model, the denoising autoencoder of the language model minimizes the loss function

$\mathcal{L}_{lm} = \mathbb{E}_{a \in S}\left[\Delta\left(M_{s \to s}(K(a)), a\right)\right] + \mathbb{E}_{b \in T}\left[\Delta\left(M_{t \to t}(K(b)), b\right)\right],$

where $\Delta$ denotes the cross-entropy loss and K(a) and K(b) denote the sentences obtained after adding noise to the existing sentences a and b; the noise is added by exchanging the positions of some words in the sentence or deleting some words. Training the language model essentially takes the noisy sentence K(a) (or K(b)) as the source input sentence and the original sentence as the target input sentence.

(3) Reverse translation: the process of reverse (back-) translation trains pseudoparallel sentence pairs as if they were parallel sentence pairs. The training loss function is

$\mathcal{L}_{bt} = \mathbb{E}_{b \in T}\left[\Delta\left(M_{s \to t}(K'(b)), b\right)\right] + \mathbb{E}_{a \in S}\left[\Delta\left(M_{t \to s}(K'(a)), a\right)\right],$

where K′(b) = Mt⟶s(b) and K′(a) = Ms⟶t(a) are the pseudo translations produced by the current models.
The reverse translation process treats both (K′(b), b) and (K′(a), a) as parallel sentence pairs for training, transforming the unsupervised problem into a supervised one. Repeating steps (2) and (3) constitutes the complete training process of the bilingual single-task model.
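The structural sketch below shows how steps (2) and (3) alternate over the two monolingual corpora. The models and the training step are stand-in placeholders so that the data flow runs end to end; in practice Ms⟶s, Mt⟶t, Ms⟶t, and Mt⟶s are the Transformer models described above, and train_step would perform an actual gradient update.

```python
import random

def add_noise(sentence, drop_prob=0.1, swap_prob=0.1):
    """K(.): randomly drop some words and swap adjacent words."""
    words = [w for w in sentence.split() if random.random() > drop_prob]
    for i in range(len(words) - 1):
        if random.random() < swap_prob:
            words[i], words[i + 1] = words[i + 1], words[i]
    return " ".join(words)

def translate(model, sentence):
    return model(sentence)          # placeholder for beam-search decoding

def train_step(model, source, target):
    pass                            # placeholder for one gradient step on (source, target)

M_ss = lambda s: s                  # stand-in denoising model, source -> source
M_tt = lambda t: t                  # stand-in denoising model, target -> target
M_st = lambda s: s                  # stand-in translation model, source -> target
M_ts = lambda t: t                  # stand-in translation model, target -> source

mono_src = ["the cat sat on the mat"]            # toy source monolingual corpus
mono_tgt = ["le chat est assis sur le tapis"]    # toy target monolingual corpus

for a in mono_src:
    train_step(M_ss, add_noise(a), a)            # (2) denoising language model, source side
    pseudo_b = translate(M_st, a)                # K'(a): pseudo target sentence
    train_step(M_ts, pseudo_b, a)                # (3) back-translation pair (K'(a), a)

for b in mono_tgt:
    train_step(M_tt, add_noise(b), b)            # (2) denoising language model, target side
    pseudo_a = translate(M_ts, b)                # K'(b): pseudo source sentence
    train_step(M_st, pseudo_a, b)                # (3) back-translation pair (K'(b), b)
```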
3.3. Multilanguage and Multitask Model
The multilanguage multitask model is the model obtained by training multiple tasks under the Transformer architecture. Assuming that monolingual corpora of three languages L1, L2, and L3 are currently available, and that they are not parallel to one another, the multitask model includes six training tasks, namely, L1⟶L2, L2⟶L1, L1⟶L3, L3⟶L1, L2⟶L3, and L3⟶L2. Inspired by the research of Yang et al. [27], in order to preserve the semantic structure of each language while learning the useful structural information contained in the other languages, this study establishes an encoder and a decoder for each language but shares the parameters of some layers. The optimization of the parameters λ is shown in the following formula:

$\hat{\lambda} = \arg\max_{\lambda} \sum_{m \in M} \sum_{u=1}^{U} \log P\left(b_{m}^{(u)} \mid a_{m}^{(u)}; \lambda\right),$

where M = {1, 2, 3, 4, 5, 6} is the index set of the translation tasks, U is the number of sentence pairs, and a and b are the source-language and target-language sentences of the current translation task. Such parameter sharing enables different language pairs to learn useful information from the other languages.
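The following short sketch illustrates how the joint objective sums the per-task losses over the six translation directions; the per-task loss function is a placeholder standing in for the usual Transformer cross-entropy of that direction.

```python
import itertools

# Illustrative enumeration of the six tasks among L1, L2, L3 and their joint loss.
languages = ["L1", "L2", "L3"]
tasks = list(itertools.permutations(languages, 2))   # 6 ordered language pairs

def task_loss(src_lang, tgt_lang, batch):
    """Placeholder for -log P(b | a) summed over one batch of direction src->tgt."""
    return 0.0

def joint_loss(batches):
    # Sum the negative log-likelihoods of all M = 6 tasks; the shared layers of the
    # encoders/decoders receive gradients from every direction.
    return sum(task_loss(s, t, batches[(s, t)]) for s, t in tasks)
```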
In order to strengthen the role of the shared latent space, this paper trains a generative adversarial network G that performs a three-way classification task over the outputs of the three encoders corresponding to the three languages. Its role is to predict the language of the sentence currently being encoded, using the cross-entropy loss shown in the following formula:

$\mathcal{L}_{G}(\theta_{G}) = -\,\mathbb{E}_{s'}\left[\log P_{G}\left(L \mid E_{L}(s')\right)\right],$

where $E_{L}(s')$ represents the encoding of the currently encoded sentence s′ produced by the encoder of language L (s′ may come from the source language or the target language), $\theta_{G}$ is the parameter of the generative adversarial network G, and L ∈ {L1, L2, L3}.
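A minimal sketch of this three-way language classifier is shown below, assuming PyTorch and illustrative sizes. The adversarial encoder objective is shown here simply as the negated discriminator loss, which is one common formulation; the paper's exact adversarial update is not specified beyond the cross-entropy above.

```python
import torch
import torch.nn as nn

d_model, n_langs = 512, 3
discriminator = nn.Sequential(nn.Linear(d_model, 256), nn.ReLU(),
                              nn.Linear(256, n_langs))
ce = nn.CrossEntropyLoss()

encoded = torch.randn(8, d_model)            # mean-pooled encoder states of 8 sentences
lang_ids = torch.randint(0, n_langs, (8,))   # true language (L1/L2/L3) of each sentence

d_loss = ce(discriminator(encoded), lang_ids)  # discriminator learns to classify languages
g_loss = -d_loss                               # encoders are trained adversarially, pushing
                                               # the three languages toward one shared space
```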
3.4. Topic Similarity Model
The topic model is a statistical model used to discover abstract topics in the fields of machine learning and natural language processing. The topic similarity model measures the degree of similarity between the topic distribution of a translation rule and that of the text to be translated. In order to calculate this similarity, we need to assign a topic distribution probability to both the source language part and the target language part of the translation rule; this probability characterizes the distribution of the two parts of the rule over each topic.
If s is used to represent the source language part of the translation rule, t the target language part, topic_s the topic set of the source language, and topic_t the topic set of the target language, then for any translation rule there are two rule-to-topic distributions: P(topic_s|s), the topic distribution probability of the source language part of the rule in the source language, and P(topic_t|t), the topic distribution probability of the target language part of the rule in the target language.
In the topic similarity model, the Hellinger distance (HD) can be chosen to calculate the topic similarity between a translation rule and the document to be translated. The HD is a symmetric measure that has been widely used to compare the similarity between two distributions. Given the rule-to-topic distribution P(topic|s) and the document-to-topic distribution P(topic|t), the similarity between the two can be calculated as

$\mathrm{HD}\left(P(\cdot \mid s), P(\cdot \mid t)\right) = \sqrt{\frac{1}{2} \sum_{k} \left(\sqrt{P(\mathrm{topic}_{k} \mid s)} - \sqrt{P(\mathrm{topic}_{k} \mid t)}\right)^{2}}.$
Obviously, by computing the HD between every translation candidate and the text to be translated, the similarity between each candidate and that text can be obtained. A smaller HD represents a greater similarity, so our task is to find the translation with the greatest similarity to the text being translated as the final translation result. With the addition of the topic similarity model, our goal is to select the translation rules most similar to the text being translated, which provides the basis for adaptive translation using topic information.
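A short sketch of the Hellinger distance computation is given below; the two topic distributions are hypothetical four-topic examples, not values from the paper.

```python
import math

def hellinger(p, q):
    """Hellinger distance between two discrete topic distributions."""
    assert len(p) == len(q)
    return math.sqrt(0.5 * sum((math.sqrt(pi) - math.sqrt(qi)) ** 2
                               for pi, qi in zip(p, q)))

# Hypothetical topic distributions of a translation rule and of the document
# to be translated (4 topics; both sum to 1).
rule_topics = [0.70, 0.10, 0.10, 0.10]
doc_topics  = [0.60, 0.20, 0.10, 0.10]
print(hellinger(rule_topics, doc_topics))   # smaller value -> more similar topics
```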
3.5. Machine Translation Model and Process
In phrase-based statistical machine translation, the source language sentence s = {s1, s2, …, sn} is translated using a log-linear model. By comparing the HD between every translation candidate and the text to be translated, the similarity of each candidate can be obtained, and the target translation t = {t1, t2, …, tn} with the greatest similarity is selected as the final translation result:

$\hat{t} = \arg\max_{t} \sum_{n} \lambda_{n} \, \mathrm{Dis}_{n}(s, t),$

where Dis_n(s, t) is the n-th feature function (including the topic similarity feature) and λ_n is the corresponding feature weight.
The machine translation algorithm in this paper includes three stages: training, tuning, and translation. As shown in Figure 4, it is necessary to prepare training data, a monolingual target language corpus, a development set, and a test set.
The training data is a bilingual translation corpus, mostly sentence-aligned. After preprocessing and word alignment, various translation rules are obtained, including the phrase translation table, reordering probabilities, and maximum entropy reordering parameters.
For the monolingual target language corpus, the target language side of the training data can be used, or more monolingual data, mostly at the sentence level, can be added to train the language model. In addition to the translation rules and language models generated during training, the decoder also requires feature weights; the tuning stage selects these feature weights on the development set.
The development set is a collection of source language sentences, each with one or more reference translations in the target language. Tuning on the development set usually uses minimum error rate training: the decoder iterates over the current feature parameters, automatically computes and compares BLEU scores, and then changes the weights and decodes again until the iteration limit is reached or the translation system is stable. This is a multidimensional parameter optimization problem. The decoder then implements the translation process using the translation rules, language model, and feature weights obtained during training.
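The sketch below illustrates only the outer tune-decode-score loop on the development set. It uses a simple random perturbation of the weights as a stand-in; real minimum error rate training instead performs a line search per feature, and the decode and BLEU functions here are placeholders.

```python
import random

def decode(dev_sources, weights):
    """Placeholder: run the decoder on the development set with the given feature weights."""
    return ["..." for _ in dev_sources]

def bleu(hypotheses, references):
    """Placeholder: corpus-level BLEU of the hypotheses against the references."""
    return random.random()

def tune(dev_sources, references, n_features=4, iterations=20):
    best_w = [1.0] * n_features
    best_score = bleu(decode(dev_sources, best_w), references)
    for _ in range(iterations):
        cand = [w + random.uniform(-0.2, 0.2) for w in best_w]   # propose new weights
        score = bleu(decode(dev_sources, cand), references)      # re-decode and re-score
        if score > best_score:                                   # keep weights that improve BLEU
            best_w, best_score = cand, score
    return best_w
```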
Finally, the test set is translated and scored with BLEU to observe the translation performance of the system.
4. Results and Discussion
4.1. Experimental Setup
The experiment selects 10 million monolingual sentences in each of English, German, and French from the WMT2007 to WMT2010 corpora. Adam is used as the optimizer, the dropout rate is set to 0.1, the word embedding dimension is 512, and the maximum sentence length is 175; sentences longer than 175 words are truncated. The number of training steps is 3.5 × 10^5, and the remaining model parameters use the default settings of the Transformer model. In the three-language multitask translation model, the vocabulary of the three languages is shared and the number of BPE operations is set to 85000. The fastText tool is used to train cross-lingual word vectors on the subword-segmented training set.
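For illustration, the snippet below shows how a shared subword vocabulary over mixed English/German/French text could be learned. The paper does not name its BPE toolkit, so SentencePiece is used here purely as an example, the input file name is hypothetical, and the vocab_size setting only roughly corresponds to the 85000 BPE operations reported above.

```python
import sentencepiece as spm

# Train a shared BPE model on a hypothetical mixed EN/DE/FR monolingual file,
# mirroring the shared-vocabulary setting described in the experiments.
spm.SentencePieceTrainer.train(
    input="en_de_fr_mono.txt",      # hypothetical corpus file
    model_prefix="shared_bpe",
    vocab_size=85000,               # roughly corresponds to the chosen BPE setting
    model_type="bpe",
)

sp = spm.SentencePieceProcessor(model_file="shared_bpe.model")
print(sp.encode("machine translation without parallel data", out_type=str))
```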
In the evaluation of phrase-level translation, if a candidate translation result is the same as any one of the standard answers, we consider it correct. In the evaluation of sentence-level translation, the evaluation index is the case-insensitive 4-gram BLEU value, and the bootstrap resampling method is used to test the significance of the evaluation results.
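The sketch below shows one way the sentence-level evaluation could be carried out: case-insensitive BLEU plus a simple paired bootstrap resampling comparison of two systems. The sacrebleu scorer is assumed only as an example; the paper does not name a specific evaluation tool.

```python
import random
import sacrebleu

def corpus_bleu(hyps, refs):
    """Case-insensitive corpus BLEU (4-gram by default in sacrebleu)."""
    return sacrebleu.corpus_bleu([h.lower() for h in hyps],
                                 [[r.lower() for r in refs]]).score

def bootstrap_wins(hyps_a, hyps_b, refs, samples=1000):
    """Fraction of resampled test sets on which system A beats system B."""
    n, wins = len(refs), 0
    for _ in range(samples):
        idx = [random.randrange(n) for _ in range(n)]            # resample sentence indices
        a = corpus_bleu([hyps_a[i] for i in idx], [refs[i] for i in idx])
        b = corpus_bleu([hyps_b[i] for i in idx], [refs[i] for i in idx])
        wins += a > b
    return wins / samples   # values close to 1.0 indicate a significant improvement of A over B
```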
4.2. Performance Comparison between Single-Task Model and Multitask Model
Figure 5 summarizes the translation performance of the single-task model and the multitask model on the test set. It can be seen from Figure 5 that the multitask model improves on four of the translation tasks, but the magnitude of the improvement differs considerably. On the English⟶German and German⟶English tasks, the BLEU improvement is small. On the German⟶French and French⟶German tasks, the performance improves significantly, with the BLEU value increasing by about 2.88 and 3.01 points, respectively. However, on the English⟶French and French⟶English tasks, the translation performance of the multitask model decreases.
In the multitask model of this article, a shared vocabulary is used for multiple languages, so choosing an appropriate vocabulary is particularly important. In this regard, this paper also conducted several experiments for comparative analysis; the results are shown in Table 1. It can be seen from Table 1 that when the number of BPE operations is 85000 or 90000, the experimental results are relatively good, but the BLEU values of the two settings differ little, and for some language pairs the BLEU value with 90000 BPE operations is lower than that with 85000. It is therefore estimated that further increasing the vocabulary size would not bring a significant improvement. Consequently, the final model of this article uses 85000 BPE operations.
In order to compare the training speed, this study counts the parameters that need to be trained in the experiments. In the bilingual single-task translation task, the number of parameters is on the order of 1.3 × 10^8, while in the multilanguage multitask translation model of this article, the total number of parameters is about 1.7 × 10^8. The number of parameters of the multilanguage translation model is thus only about 1.3 times that of the bilingual translation system, which is much smaller than the sum of the parameters of the six tasks trained separately. Compared with the single-task model, the total training time of the multitask model is reduced by approximately half. In order to compare the translation performance and convergence speed of the two models more intuitively, a line graph is used to compare the bilingual single-task model and the multilingual multitask model proposed in this article on the German⟶French and French⟶German tasks, where the translation effect changed the most; the results are shown in Figure 6.
4.3. The Influence of the Number of Retrieved Documents and the Length of the Hidden Layer
We compared the effects of the number of retrieved documents and the length of the hidden layer on the accuracy of the translation model, and the results are shown in Figure 7. We found that, for most results, the best translation accuracy is achieved when the number of retrieved documents is N = 10. This result confirms that the topic similarity obtained by the information retrieval method is very helpful for determining topic information and therefore plays an important role in choosing translation rules. However, when N is large, for example, N = 50, the translation performance drops sharply. This is because, as the number of retrieved documents increases further, topic-irrelevant documents are introduced into the learning of the neural network; such irrelevant documents bring in topic-irrelevant content words, which degrades the performance of neural network learning.
Another important factor is the length L of the hidden layer vector in the neural network. In neural network learning, this parameter is usually adjusted by experience. As can be seen from Figure 7, when L is small, the accuracy of the translation system is relatively high; in fact, for L ≤ 600, the differences in translation performance are very small. However, when L = 1000, translation accuracy is worse than in the other cases. The main reason is that the number of parameters in the neural network becomes so large that it cannot be learned well. When L = 1000, there are a total of 100000 × 1000 parameters between the linear and nonlinear layers of the network. The current training data size is not enough to support training at this parameter scale, so the model is likely to fall into a local optimum and learn an unsatisfactory topic representation.
4.4. Phrase- and Sentence-Level Translation Performance
In the phrase-level translation evaluation, Table 2 shows the accuracy rates of the top five phrase translation candidates. It can be seen from the experimental results that our method and the methods proposed in literature [15], literature [19], and literature [20] are significantly better than the single-task translation model, which proves that our method has a great advantage in acquiring translation knowledge.
In the sentence-level translation evaluation, we tested the translation quality on different types of text and compared it with other algorithms; the experimental results are shown in Figure 8. Although the translation method in this article does not use any pretrained model, its translation results are comparable to those of traditional machine translation based on massive training data. This shows that the translation knowledge obtained by the algorithm in this paper is very effective.
5. Conclusion
Every language has its own characteristics and flexible forms, which makes automatic language processing, including machine translation between languages, a difficult problem, and how to provide users with high-quality translation services has likewise become difficult to solve. Therefore, this article measures the matching degree of translation rules by adding relevant topic information to the rules and dynamically calculating the similarity between each translation rule and the document to be translated during decoding. Then, through the joint training of multiple translation tasks, the source language can learn useful semantic and structural information, during its translation into the target language, from the monolingual corpus of a third language that is not parallel to the current two languages. Finally, simulation experiments prove the effectiveness of the proposed algorithm. Experiments show that the algorithm used in this paper is significantly better than the comparison methods and that using only part of the training data can achieve a better translation effect than using the original training data, improving translation performance while reducing the training and decoding costs of the translation system.
Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.
Conflicts of Interest
The author declares that there are no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This work was supported by the Scientific Research Program Funded by Shaanxi Provincial Education Bureau: Xi’an Tour Text Translation Strategy Research in Terms of Prototype and Model Theory (Program no. 18JK0298).