Abstract

Translation templates are an important source of knowledge in machine translation (MT) systems, and their quality and scale directly influence MT performance. How to obtain high-quality translation templates from corpora efficiently has become a hot topic in recent research. In this paper, a tree-to-string alignment template (TAT) based on syntactic structure is proposed. This template describes the alignment between the source language syntax tree and the target language string. Syntactic structure, a large number of construction tags, and variables are introduced into the template, which enables the syntactic model to handle discontinuous phrases and gives it the ability to generalize. Depending on the decoder, the templates can be used in syntax-based statistical, example-based, and rule-based MT systems. The ATTEBSC algorithm is a basic method for learning translation templates by comparing sentence pairs. It requires sentence pairs to be arranged in a precise comparison structure ahead of time, but there are no strict guidelines on how to do so. In this paper, we propose a method that computes the specific comparison scheme using the longest common subsequence (LCS), uses the normalized LCS distance to screen for highly similar sentences, and then applies the ATTEBSC algorithm to extract templates automatically. Experiments show that this method is simple and effective and that many useful templates can be learned.

1. Introduction

MT methods can be roughly divided into four categories: (1) rule-based MT; (2) example-based MT; (3) statistical MT; and (4) template-based MT (TBMT). Template-based MT is a hybrid strategy that combines empirical methods with rationalist rules and can essentially be seen as a synthesis of the previous three methods. The basic idea is to use bilingual translation templates to translate automatically from the source language to the target language. A translation template can be regarded as a generalized translation instance pair. It usually consists of several constants and variables and comprises a source language template and a target language template, where the target language template is the translation of the source language template under certain conditions. In general, a template can be formed by replacing corresponding parts of a translation instance with variables. Template-based MT systems need a large template library with broad coverage, and building such a library requires efficient, high-quality template extraction algorithms. There are two traditional template extraction methods: (1) algorithms based on grammar analysis, which produce good templates but require grammar analysis; their disadvantage is that the grammar rules of various languages are difficult to enumerate, which makes the approach hard to implement; and (2) algorithms based on sentence comparison, which need no grammar analysis and generate templates simply by comparing sentences in aligned bilingual text, but which are inefficient and produce templates of poor quality [1, 2].
In this paper, we improve the second template extraction method and propose and implement a new strategy for automatically extracting translation templates from an English-Chinese sentence-aligned bilingual library. The basic idea is to improve both the operational efficiency of the traditional algorithm and the accuracy of automatic template extraction through sentence clustering and template variable control. The strategy needs neither a bilingual dictionary nor syntactic analysis [3, 4].

MT refers to the use of computers to translate one natural language into another. It includes human-assisted MT, machine-assisted translation, and fully automatic translation. The system used to complete the MT process is called an MT system. The history of MT research dates back to the 1930s. However, due to the low level of technology at the time, these ideas did not materialize. After the electronic computer appeared in 1946, the idea of automatic language translation by computer was put forward, and in 1949 Warren Weaver formally raised the issue of MT in his memorandum “Translation.” Over the past 60 years, with the rapid development of science and technology, cultural exchanges between different nationalities have become more frequent, the language barrier has become more serious, and the demand for MT has become increasingly urgent. Research on MT has greatly deepened people’s understanding of language, knowledge, and intelligence. Although the current state of MT is still far from people’s expectations and market needs, researchers’ enthusiasm for MT research remains high. The unremitting pursuit of automatic, high-quality MT is one of the ultimate goals of computational linguistics research and an inexhaustible source of motivation for it [5, 6].

There are three main methods used in MT: rule-based method, statistics-based method, and instance-based method. In recent years, MT methods based on templates are also emerging. In all these methods, there are various translation templates, which play an important role in providing translation knowledge and are an indispensable part of the translation system [7, 8].

MT is the attempt to use computers to translate automatically from one natural language into another, or to automate the translation process between human languages completely or partially. An MT system is a computer system that generates a target language translation from the source language with or without human assistance; its core is the automation of the translation process. It follows from this definition that the object of MT research is the natural language that people use to communicate, rather than artificial languages such as computer programming languages. The input can be text or speech, but text is usually given priority, because speech translation involves speech recognition and synthesis, which are relatively independent research fields. MT uses a computer, and its processing is automated [9, 10]. MT is a branch of natural language processing (NLP) that realizes the translation of natural language. MT is also closely related to computational linguistics (CL) and natural language understanding (NLU). NLU is the core content of computational linguistics and the basis of MT. Since NLU is also a branch of artificial intelligence (AI), MT is also an application of AI, which can be regarded as a process of applying human linguistic knowledge to natural language processing [11, 12].

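The remainder of the paper is organized as follows.

Section 2 reviews related work on MT, covering the rule-based approach, the instance-based approach, and the template-based MT method. Section 3 presents example-based MT and the template definitions, Section 4 reports the experimental analysis, and Section 5 concludes the paper.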

2.1. Rule-Based Approach

Before the 1990s, the mainstream method of MT was rule-based MT (RBMT), also known as the traditional MT method. The basic principle of the rule-based approach is to first analyze the source language and generate a syntax tree of the source language. Then, the syntax tree is converted into the syntax tree of the target language by the translation rules. Finally, the translation is generated according to the translation generation algorithm. The transformation rules can be regarded as translation templates with high generalization ability.

The rule-based method, by recognizing and analyzing the phenomena of a certain language, constantly summarizes its rules, imitates the formation of rules of language, generates rules of grammar and semantics, and saves them in the rule base of the system. In the process of translation, the system uses these rules to analyze the source language, get an internal intermediate language, and then through transformation, get a structure of the target language, and finally generate the translation.

Rule-based translation is the most mature and widely used MT technology so far. However, to build a practical rule-based MT system, it is often necessary to build various knowledge bases, describing the lexical, syntactic, and semantic knowledge of source and target languages, and even describing the world knowledge unrelated to language knowledge. However, it is extremely difficult to describe and establish these knowledge bases. First, the knowledge base must be created and maintained by many trained experts. To make matters worse, as the size of the knowledge base continues to grow, it becomes difficult to ensure that newly introduced knowledge does not contradict old knowledge. Therefore, knowledge acquisition becomes the bottleneck of traditional MT methods.

2.2. Instance-Based Approach

In the mid to late 1980s, some researchers proposed corpus-based MT methods. Unlike traditional methods, corpus-based methods collect large-scale bilingual data that are translations of each other and translate on the basis of these data, rather than performing in-depth language analysis. The corpus-based approach has two branches: instance-based MT and statistics-based MT.

Example-based MT (EBMT) was proposed by the famous Japanese scholar Makoto Nagao in 1984. The starting point of example-based MT is to avoid complex syntactic and semantic analysis of the source language as far as possible, so as to reduce the probability of errors. Unlike the rule-based method, the example-based method translates from an existing corpus and does not handle rules directly. The translation knowledge is instead implicit in a parallel instance library, and translation relies on the correspondences between sentences rather than on rule analysis, which reduces the errors that rule analysis introduces. During translation, the instance library is searched for the example sentence most similar to the input sentence, and the input sentence is translated in the same way as that example sentence. When matching succeeds, the example-based MT method has high accuracy and fast translation speed and successfully imitates the process of human translation.
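The retrieval step described above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the method of any particular EBMT system: the toy instance library `EXAMPLES`, the function name `most_similar_example`, and the choice of the `difflib.SequenceMatcher` ratio over word tokens as the similarity measure are all hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical toy instance library of (source sentence, target sentence) pairs.
EXAMPLES = [
    ("I like red apples", "我 喜欢 红 苹果"),
    ("She reads a book", "她 读 一 本 书"),
    ("He bought a car", "他 买 了 一 辆 车"),
]


def most_similar_example(sentence, examples):
    """Return (score, source, target) for the example closest to `sentence`.

    Similarity is the SequenceMatcher ratio over word tokens; this measure is
    an assumption made for illustration only.
    """
    query = sentence.split()
    best = None
    for source, target in examples:
        score = SequenceMatcher(None, query, source.split()).ratio()
        if best is None or score > best[0]:
            best = (score, source, target)
    return best


score, source, target = most_similar_example("She reads a newspaper", EXAMPLES)
print(source, "->", target, f"(similarity {score:.2f})")
```

A real system would then adapt the retrieved target sentence to account for the words that differ between the input and the example.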

The notable feature of example-based translation is its high accuracy on identical or similar texts, and its translation quality improves as the instance library grows. Chapters, paragraphs, and sentences that already exist in the library can be translated directly with high quality, while those that are very similar to entries in the library can also be translated with high quality by analogy, with only slight modification during translation.

The instance-based MT approach also has many problems. For similarity calculation, if similarity is computed at the word class or phrase level, the translation memory itself must first be annotated. It is also very difficult to define a similarity criterion for selecting the most suitable similar sentence. Furthermore, as the translation memory expands, a high-speed query matching technique is required, and redundancy in the translation memory must be kept under control while its size and matching rate grow.

2.3. Template-Based MT Method

Rule-based MT (RBMT) starts from rationalism and is based on rules, which makes it good at capturing universal laws in language. Corpus-based MT (CBMT) starts from empiricism and relies on large-scale real text for translation, which makes it good at handling the various special phenomena of natural language. In view of the advantages of both methods, a hybrid strategy that combines empirical methods with traditional methods based on rationalist rules has become the consensus of many MT researchers. Template-based MT (TBMT) came into being in this context.

Template-based MT improves the accuracy of translation, and template extraction is the key to ensure accurate translation generation. For a TBMT system to have good translation results, the template library must be large enough to ensure that the template library covers a wide range of areas. However, to construct such a large-scale template library by the manual method, the cost is unacceptable, and it is difficult to guarantee the coverage of template library. At present, large or small bilingual corpora are stored in many EBMT systems, and bilingual aligned corpora are very rich in resources [13].

Template-based MT is good at describing language collocation knowledge, translation patterns for special sentences (or sentence fragments), and other language knowledge, which makes it convenient for describing individual, idiosyncratic knowledge in language. However, applying template-based methods in MT systems often requires breaking the principle of “independent analysis and independent generation” in the transformation-based model. Translation templates often complete source language analysis and transformation generation in one stage. The advantage is that local language phenomena can be analyzed and processed in one step, which removes some intermediate links; the disadvantage is that the boundary between source language analysis and transformation generation is blurred, which leads to some conflicts [14].

Translation templates were originally developed on the basis of the instance-based translation method. They further abstract the translation knowledge in instances and group several similar instances into one translation template, thus greatly reducing the scale of the instance library [15]. The definition and extraction methods of translation templates vary with the translation model: in instance-based MT systems, translation templates are an important resource that increases the coverage of instances and improves efficiency without affecting translation quality. Template learning methods can be roughly divided into the following categories:
(1) Clustering and generalization of specific words to form translation templates. This method can improve the coverage of instance translation, but the variables are limited to a few words.
(2) Extraction of translation templates at various syntactic levels from the results of bilingual syntactic analysis and word matching. Such methods require a robust bilingual parser and accurate bilingual correspondences.
(3) Learning by analogy. Analogical learning is the simplest and most successful method, but it also has some problems.
The goal of current research is to define a high-quality translation template and to acquire it with high quality. In the training process of statistical MT, the consistency of alignment determines the reliability of the aligned corpus itself.

Statistical MT also adopts a “translation model, alignment template, word” correspondence. Current statistical MT systems have three core technologies: the translation model, the language model, and the search algorithm. The basic principle is that the translation model constrains the word correspondence between the source language and the target language, and the target language model drives the search process. Many efficient translation systems are syntax-based statistical MT systems, which have evolved from the basic word-based translation model to more complicated models, namely the phrase alignment template-based translation model and the syntax-based translation model, together with their accompanying search algorithms [16].

Template extraction is the key to accurate translation generation in template-based MT. Although the descriptive power of the template-based method has been proved equivalent to that of the method based on production rules, each method has advantages and disadvantages in practical application. For regular linguistic phenomena, the rule method has the advantage: it describes these phenomena effectively and with wide coverage. For irregular linguistic phenomena, the template approach has the advantage of describing them in more detail and thus more accurately than rules [17].

The definition and quality of templates largely determine the performance of MT. Therefore, defining a high-quality translation template and acquiring it with high quality have become the focus of current research. In early MT systems, translation templates were often extracted from corpora by hand. As corpora grow larger, this manual method becomes more difficult and leads to more errors, so research on the automatic acquisition of translation templates from corpora is of great practical significance. Figure 1 shows an example of inconsistent alignment [18].

The main source of alignment inconsistency is discrepancies between and within annotators, which are frequently generated by inconsistent and ambiguous alignment and labeling standards.

3. Example-Based MT

Example-based MT (EBMT) is based on the recognition that previous translations are always reliable and that there is always some information and translation knowledge that can be reused for new translations. This knowledge is derived from our own translation experience, so it is very effective, especially for the same kind of document or sentence.

The main research focus of EBMT is to make use of earlier translation knowledge and results and to adapt the translation effectively according to the differences between the input and the example sentences. A simple EBMT system flow is shown in Figure 2.

The method in this paper is based on the two-level template structure. In the process of definition, we refer to Zhang Jian’s work and make targeted extensions. A complete translation template consists of the following parts: the source language template, the target language template, the corresponding relationship between the slots of the two language templates, some other attributes, and so on. The definition of monolingual template in this paper refers to the representation method of frame knowledge in the field of artificial intelligence. Each monolingual template is composed of frame and slot, in which frame is composed of some fixed words and slot is composed of variables and constraints on it. The definition is as follows.

Definition 1 (Sentence template, ST). A sentence template is a sequence whose elements are either words or slots. In our definition system, a slot mainly represents a subblock, which may contain only one word.

Definition 2 (Chunk template, CT). A chunk (subblock) template is a sequence of words that contains exactly one slot. Because the subblock templates used in this paper are mainly noun blocks, the slot is defined as the final word of the subblock template; if the part of speech of the final word is not a noun, the last noun is taken as the slot instead.
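
The frame-and-slot structure of Definitions 1 and 2 could be represented roughly as below. This is a sketch under assumptions rather than the data structure used in the paper: the class names, the `constraint` field, and the convention that noun tags start with `"N"` (as in Penn Treebank style tags) are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass(frozen=True)
class Slot:
    """A variable position in a template, with an optional constraint."""
    index: int            # position of the slot in the original word sequence
    constraint: str = ""  # hypothetical constraint, e.g. a part-of-speech tag


Element = Union[str, Slot]  # a fixed frame word or a slot


@dataclass
class SentenceTemplate:
    elements: List[Element]

    def slots(self) -> List[Slot]:
        return [e for e in self.elements if isinstance(e, Slot)]


@dataclass
class ChunkTemplate:
    words: List[str]  # the fixed words of the chunk (the frame)
    slot: Slot        # exactly one slot


def chunk_template_from_tagged(tagged: List[Tuple[str, str]]) -> ChunkTemplate:
    """Build a chunk template from (word, pos) pairs.

    The slot is the final word if it is a noun, otherwise the last noun.
    Both cases reduce to "the last noun in the chunk".
    """
    noun_positions = [i for i, (_, pos) in enumerate(tagged) if pos.startswith("N")]
    if not noun_positions:
        raise ValueError("chunk contains no noun to use as the slot")
    k = noun_positions[-1]
    words = [w for i, (w, _) in enumerate(tagged) if i != k]
    return ChunkTemplate(words=words, slot=Slot(index=k, constraint=tagged[k][1]))


ct = chunk_template_from_tagged([("the", "DT"), ("red", "JJ"), ("apple", "NN")])
st = SentenceTemplate(elements=["I", "like", Slot(index=2, constraint="NP")])
```
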
If the phrase translation probability is computed only from the frequency with which a phrase template occurs, the resulting probabilities discriminate poorly between candidates. CMU therefore proposed a formula for the phrase translation probability based on IBM Model 1. This method has a disadvantage: if a word has no appropriate counterpart in the corresponding target language phrase template, the translation probability of the whole phrase template becomes very small, and the many function words and auxiliary words in Chinese phrase templates make this problem severe. To address it, the phrase template is divided into small phrase blocks, so that the original phrase template translation pair can be regarded as a combination of several small translation pairs. In the resulting formula, i is the index of the small phrase template block and k is the index of a word within the i-th block. Finally, the two methods are combined and adjusted by the frequency of occurrence of the phrase template to give the phrase template translation probability.
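
The weakness of a Model 1 style phrase score can be illustrated with a small sketch. The exact formulas are not reproduced here, so the code below assumes the common lexical form (a product over source words of the summed word translation probabilities over the target words, without length normalization or a NULL word); the function name and the probability table `t` are hypothetical.

```python
def model1_phrase_score(src_words, tgt_words, t):
    """Assumed IBM Model 1 style score: for each source word, sum the lexical
    probabilities t[(src, tgt)] over all target words, then multiply the sums.
    Missing table entries count as probability 0.
    """
    score = 1.0
    for f in src_words:
        score *= sum(t.get((f, e), 0.0) for e in tgt_words)
    return score


# Hypothetical lexical translation probabilities.
t = {("红", "red"): 0.8, ("苹果", "apple"): 0.9}

# The Chinese function word "的" has no counterpart in the English phrase,
# so the whole product collapses to zero.
with_function_word = model1_phrase_score(["红", "的", "苹果"], ["red", "apple"], t)
without_function_word = model1_phrase_score(["红", "苹果"], ["red", "apple"], t)
print(with_function_word, without_function_word)
```

Scoring smaller blocks separately, as the text describes, keeps one unmatched function word from dominating the score of the whole phrase template.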
After template extraction, probabilistic features are added so that the templates can be used in syntax-based statistical MT systems, or so that probabilistic information can be used to adjust matching results when the templates are used in instance-based or template-based MT systems. A probability is calculated for each template for this purpose.

To compare two sentences, take English as an example and write the two sentences as $X = x_1 x_2 \ldots x_m$ and $Y = y_1 y_2 \ldots y_n$. A subsequence of $X$ of length $s$ is obtained by deleting $m - s$ characters from $X$. LCS denotes the longest common subsequence of the strings $X$ and $Y$, and the length of the LCS of the prefixes $x_1 \ldots x_i$ and $y_1 \ldots y_j$ is written $c(i, j)$. Its value can be calculated by the following recursive formula:

$$
c(i, j) =
\begin{cases}
0, & i = 0 \text{ or } j = 0, \\
c(i-1, j-1) + 1, & i, j > 0 \text{ and } x_i = y_j, \\
\max\{c(i-1, j),\; c(i, j-1)\}, & i, j > 0 \text{ and } x_i \neq y_j,
\end{cases}
$$

where $0 \le i \le m$, $0 \le j \le n$, and $\max$ denotes the maximum operation.

If $\mathrm{Len}(X)$ denotes the length of the string $X$, a normalized LCS distance between $X$ and $Y$ can also be defined from $c(m, n)$, $\mathrm{Len}(X)$, and $\mathrm{Len}(Y)$. Obviously, $c(m, n)$ represents the best score of $X$ and $Y$ and also indicates that $X$ and $Y$ have been matched. To obtain the matching sequence of $X$ and $Y$ after matching is complete, the matching path must be recorded while the matrix is built; that is, each $c(i, j)$ keeps a pointer back to the entry that gave it its maximum value. The matching path is shown in Figure 3.
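
The LCS recursion and the recovery of the matching path described above can be sketched as follows. The table filling is the standard dynamic program; the normalization used in `normalized_lcs_distance` (one minus the LCS length divided by the longer length) is only one common choice and is an assumption here, as are all function names.

```python
def lcs_table(x, y):
    """Fill the (m+1) x (n+1) table c, where c[i][j] is the LCS length of
    the prefixes x[:i] and y[:j]."""
    m, n = len(x), len(y)
    c = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                c[i][j] = c[i - 1][j - 1] + 1
            else:
                c[i][j] = max(c[i - 1][j], c[i][j - 1])
    return c


def lcs_matches(x, y):
    """Trace the table back from c[m][n] and return the matched positions
    as 0-based (i, j) index pairs, in left-to-right order."""
    c = lcs_table(x, y)
    i, j = len(x), len(y)
    pairs = []
    while i > 0 and j > 0:
        if x[i - 1] == y[j - 1]:
            pairs.append((i - 1, j - 1))
            i -= 1
            j -= 1
        elif c[i - 1][j] >= c[i][j - 1]:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return pairs


def normalized_lcs_distance(x, y):
    """ASSUMED normalization: 1 - LCS(x, y) / max(len(x), len(y))."""
    if not x and not y:
        return 0.0
    return 1.0 - lcs_table(x, y)[len(x)][len(y)] / max(len(x), len(y))


x = "I bought a red car".split()
y = "I bought a blue bike".split()
print(lcs_matches(x, y), normalized_lcs_distance(x, y))
```

Sentence pairs whose distance falls below a chosen threshold would be kept for template learning; the threshold itself is a tuning choice not specified here. The unmatched positions between consecutive matched pairs are the differing parts that later become template variables.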
For a sentence $S$ with word sequence $w_1 w_2 \ldots w_l$, where $l$ is the length of $S$, the probability under an $n$-gram language model can be expressed as follows:

$$
P(S) = \prod_{i=1}^{l+1} P\left(w_i \mid w_{i-n+1} \ldots w_{i-1}\right), \tag{9}
$$

where $w_{-n+2}$ to $w_0$ mark the beginning of the sentence and $w_{l+1}$ marks the end of the sentence. The language model probabilities of the decomposed sentences are all computed according to the definition in Formula (9), and the results are uniformly converted into logarithmic form for convenience of calculation. Under the segmentation method in this paper, the segmentation position $K$ of any clause group must satisfy a set of conditions, from which a recursive definition of the probability score of the clause group language model is obtained. According to the definition in Formula (12), the position with the maximum language model score, and hence the optimal segmentation position of the sentence, can be found.
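
The n-gram sentence probability in logarithmic form can be sketched as below. The padding with sentence-boundary symbols follows the description in the text; the add-one smoothing, the toy corpus, and all function names are assumptions made so that the example runs, not details taken from the paper.

```python
import math
from collections import Counter

BOS, EOS = "<s>", "</s>"  # sentence-boundary symbols


def ngrams(words, n):
    """Pad with n-1 start symbols and one end symbol, then list the n-grams.
    A sentence of length l yields l + 1 n-grams."""
    padded = [BOS] * (n - 1) + list(words) + [EOS]
    return [tuple(padded[i:i + n]) for i in range(len(padded) - n + 1)]


def train_counts(corpus, n):
    """Count n-grams and their (n-1)-word contexts over a tokenized corpus."""
    grams, contexts = Counter(), Counter()
    for words in corpus:
        for g in ngrams(words, n):
            grams[g] += 1
            contexts[g[:-1]] += 1
    return grams, contexts


def sentence_logprob(words, grams, contexts, n, vocab_size):
    """Sum of log conditional probabilities, with add-one smoothing
    (the smoothing scheme is an assumption of this sketch)."""
    total = 0.0
    for g in ngrams(words, n):
        p = (grams[g] + 1) / (contexts[g[:-1]] + vocab_size)
        total += math.log(p)
    return total


corpus = [["a", "b"], ["a", "c"]]  # hypothetical toy corpus
grams, contexts = train_counts(corpus, 2)
vocab_size = 4  # a, b, c and the end symbol
print(sentence_logprob(["a", "b"], grams, contexts, 2, vocab_size))
```

Working with sums of logarithms avoids numerical underflow on long sentences and makes it easy to compare the scores of candidate segmentation positions.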

4. Experimental Analysis

In this experiment, we use Chinese as the source language and English as the target language. Most of the Chinese-English bilingual corpus was provided by the Linguistic Data Consortium, and the experimental data were Chinese News Translation Text Part 1 (LDC2005T06). Detailed statistics are shown in Table 1.

The main task of this experiment is to verify the correctness of phrase template extraction, not the translation ability of the templates. Therefore, the template extraction results are evaluated by manual judgment instead of MT evaluation tools. We filtered out certain items because of errors in Chinese segmentation, syntactic analysis, and word alignment and eventually extracted 437,298 templates. We corrected 100 sentences manually and calculated the accuracy of template extraction.

Because the test corpus and the manually corrected corpus differ in size, test accuracy is not taken into account when the computed probabilities are compared. We compared the results without bias treatment with those with bias treatment, and the results are shown in Table 2.

To verify the performance of the algorithm, we ran it on two corpora of 4000 and 10000 sentence pairs, respectively. In Corpus 1, the average length of English sentences is 64 characters and the average length of Chinese sentences is 24 characters. In Corpus 2, the average length of English sentences is 67 characters and the average length of Chinese sentences is 27 characters. The test results are shown in Table 3. System 1 implements the ATTEBSC algorithm with traditional string comparison, while system 2 implements it with LCS.

As can be seen from Table 3, system 1 generates a large number of useless templates, while system 2 can filter out a large number of pairs with low similarity in the process of searching for matching sequences. Therefore, the accuracy of obtaining templates is greatly improved compared with system 1, and the efficiency of template extraction is also significantly improved.

To make the analysis of the experimental results more intuitive, the results in Table 3 are plotted as a histogram, which is shown in Figure 4.

Because this paper builds a clause-level instance library through sentence segmentation, the final segments should be split at clause boundaries. Long sentences are therefore anchored by English and Chinese punctuation, which has high alignment accuracy and fits the definition of anchor words very well. The punctuation alignment accuracy statistics are given in Table 4.

It can be seen from the algorithm that the mapping between trees is established by traversing the two syntax trees level by level in synchrony and creating a replacement mapping for the nodes that differ. The template is then created according to the mapping results of the syntax trees, as shown in Figure 5.
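
A synchronized level-order traversal of this kind might look like the sketch below. It is an illustration under assumptions, not the paper's algorithm: trees are represented as nested tuples `(label, child, ...)` with plain strings as leaves, two nodes are treated as matching when their labels and number of children agree, and the function names are hypothetical.

```python
from collections import deque


def label(node):
    """Label of a node: the word itself for a leaf, else the first tuple item."""
    return node if isinstance(node, str) else node[0]


def children(node):
    """Children of a node; leaves (plain strings) have none."""
    return [] if isinstance(node, str) else list(node[1:])


def replacement_mapping(tree_a, tree_b):
    """Traverse both trees breadth-first in lockstep and collect the node
    pairs that differ; matching nodes are expanded, differing ones are not."""
    mapping = []
    queue = deque([(tree_a, tree_b)])
    while queue:
        a, b = queue.popleft()
        ca, cb = children(a), children(b)
        if label(a) == label(b) and len(ca) == len(cb):
            queue.extend(zip(ca, cb))
        else:
            mapping.append((a, b))
    return mapping


tree_a = ("S", ("NP", "apples"), ("VP", "are", ("ADJP", "red")))
tree_b = ("S", ("NP", "pears"), ("VP", "are", ("ADJP", "green")))
print(replacement_mapping(tree_a, tree_b))
```

Each pair in the returned mapping marks a position where the two instances differ and where a template variable can be introduced.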

As can be seen from Figure 5, the template obtained after substituting the differing parts is as follows: #1 and #2 can be as #3 and #4. In the syntax tree matching algorithm, the mapping of replacement nodes is obtained after matching, and the translation template is then derived from this mapping and the word alignment results.

Creating a matching template involves two steps: (1) replace the differing parts according to the word alignment result; (2) assign a nonterminal node number to each replacement node. The purpose of assigning a nonterminal node number is to allow iterative matching and the generation of a template graph.
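
The two steps above can be sketched on a single sentence pair as follows. This is a simplified illustration, not the paper's procedure: the aligned spans are supplied by hand rather than produced by a word aligner, spans are assumed not to overlap, and the function name and the example sentences are hypothetical. Variables are written `#1`, `#2`, and so on, in the style of the template shown for Figure 5.

```python
def make_template(src, tgt, aligned_spans):
    """Replace aligned differing spans with numbered variables.

    aligned_spans is a list of ((s_start, s_end), (t_start, t_end)) pairs
    with exclusive end indices. Variables are numbered in source order, and
    the same number is reused on the target side.
    """
    numbered = sorted(aligned_spans, key=lambda pair: pair[0][0])
    src_edits = [(s, f"#{k}") for k, (s, _) in enumerate(numbered, start=1)]
    tgt_edits = [(t, f"#{k}") for k, (_, t) in enumerate(numbered, start=1)]

    src_out, tgt_out = list(src), list(tgt)
    # Apply edits from right to left so that earlier indices stay valid.
    for (start, end), var in sorted(src_edits, reverse=True):
        src_out[start:end] = [var]
    for (start, end), var in sorted(tgt_edits, reverse=True):
        tgt_out[start:end] = [var]
    return src_out, tgt_out


src = ["他", "昨天", "买", "了", "书"]
tgt = ["he", "bought", "books", "yesterday"]
# Hypothetical alignment: 昨天 <-> yesterday, 书 <-> books.
spans = [((1, 2), (3, 4)), ((4, 5), (2, 3))]
src_template, tgt_template = make_template(src, tgt, spans)
print(" ".join(src_template), "|", " ".join(tgt_template))
```

Because the variable numbers are shared between the two sides, the template records the reordering between the source and target languages.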

5. Conclusion

Starting from templates for syntax-based statistical MT, this paper reviews the state of research on the generalization of translation templates and analyzes their characteristics and defects. Because different MT problems call for different template extraction methods and different template characteristics, this paper proposes a general template: a tree-to-string Chinese-English alignment template based on the phrase structure tree. This translation template provides a feasible way to establish a unified, complete, multifunctional, and multipurpose translation template library. The efficiency and performance of the ATTEBSC algorithm are improved by using the LCS algorithm to find the matching sequence of two translation instances and by using the normalized LCS distance to screen for highly similar sentences. Although the LCS-based ATTEBSC algorithm runs efficiently and extracts more useful templates, it still produces some useless templates, so the templates must be selected manually before they are used in an MT system. There are one-to-many and many-to-one relationships between the nonterminals in the target language template and the nonterminals in the source language template, so the template reflects actual natural language phenomena more reasonably than the traditional template definition. The template is further generalized, so that the translation template is more representative, its scope of use is expanded, and its translation ability is enhanced [13-18].

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The author declares that there are no conflicts of interest.