Abstract
The data-driven learning (DDL) model is a relatively new learning method. It not only provides learners with rich, diverse, and authentic language data but also creates an ideal learning environment because of its corpus-based approach to teaching and learning. This paper examines how data-driven learning can be analyzed and applied in college English translation classes and describes the data-driven approach. It first presents the problem of data-driven translation teaching design, then expands on the concepts and learning methods of deep learning, and finally conducts case design and analysis of college English translation teaching. The experimental results show that traditional translation instruction has numerous problems, some caused by students and others by teachers. After the college English translation teaching design was applied, the overall performance of the experimental group was better than that of the control group, indicating that the overall effect of the DDL method is better than that of the traditional method.
1. Introduction
Teachers are aware of the importance of college English reform and of the need to change the style of their current curriculum, and current curriculum requirements strongly encourage web-based and computer-based activities. Modern educational principles are learner-centered and require learners to participate in constructing their own knowledge systems. With computer-assisted language teaching and learning, data-driven learning therefore seems an appropriate approach, one that combines interest, knowledge, and curriculum design and requires the role of teachers to be reassessed.
The ongoing evolution of the information age presents an excellent opportunity for the advancement of education. The use of open data-driven learning in college English translation classes has sparked a new wave of reform in the field. The primary goal of data-driven learning (DDL) is to help students observe a large number of language phenomena and then discover and summarize the patterns in them. It differs from traditional teaching in that it emphasizes students’ autonomy in learning, uses real language as the object of observation, and treats self-discovery and exploration as part of the learning process. Such DDL activities increase attention and grammar awareness, which can help language learning in the long run. Translation teaching based on data-driven learning [1, 2] can help learners develop their ability to learn independently and improve their cognitive and metacognitive skills.
The innovations of this paper are as follows. (1) This paper combines data-driven learning with college English translation teaching and introduces the theory and learning methods of deep learning in detail. (2) Experiments are designed for two translation teaching methods, traditional teaching and DDL teaching. This paper compares the performance of the two methods by evaluating the experimental results and draws the conclusion that DDL is superior to traditional teaching.
2. Related Work
With the rapid development of corpus linguistics in the late 1960s, corpora became familiar to researchers as a potential resource in language teaching. Corpora are increasingly being used in education, with one example being language learning through direct contact with corpora, also known as data-driven learning. In the context of translation education in Chile, Singer presents the implementation of teaching units based on Singer’s data-driven learning (DDL) recommendations [3] within a task-based (TB) framework. According to student and lecturer feedback, the DDL-TB method is appropriate for language teaching in translation education. He also made some additional suggestions for future DDL-TB projects. His data, on the other hand, are quite limited. To identify purpose-specific words taught in data-driven vocabulary learning activities, Otto proposes a three-part approach. He discusses one disadvantage of the system and the time it takes to implement it, as well as two significant benefits. He investigates the role of words in civil engineering discourse and identifies words with no obvious connection to engineering (for example, existing or used), which coaches may overlook. His performance, on the other hand, is not particularly impressive. Student factors such as vocabulary adequacy, strategy use, and functional memory were investigated in relation to successful corpus-based second language vocabulary learning, according to Lee. The participants’ second language vocabulary level and working memory were found to play a role in vocabulary acquisition and maintenance in the study. He did not, however, account for the impact of other factors in the experiment. In a data-driven learning task, Kim analyzes the trajectories of six English as a Foreign Language (EFL) learners to identify synonyms. 
By comparing the six participants, whose trajectories in distinguishing between synonyms differed significantly, the qualitative analysis shows that intermediate learners focus on meaning and find the correct answer without knowing the core meaning, whereas advanced learners focus on structural differences, sometimes testing their previous knowledge against the relevant data. He did not, however, provide any figures for the phenomenon [4]. The Crosthwaite study describes how a dedicated corpus query and data visualisation platform was integrated into a large postgraduate subject dissertation writing programme at a Hong Kong university to track students’ corpus usage. The findings reveal significant interdisciplinary and inter/intrauser trends and variations in corpus users’ use of specific corpus features and query syntax. His research content, on the other hand, is not sufficiently novel [5]. In cognitive radio networks (CRNs) with imperfect perception, Xu et al. investigated the fundamental problem of decentralised secondary users (SUs) performing multichannel perception and access. To deal with large-scale sampled data, they formulated the channel perception and access process as a multiarmed bandit problem (MABP), for which they proposed a big data-driven online algorithm. Their algorithm is logarithmic and asymptotic in finite time, according to theoretical analysis and simulations. Their algorithm, on the other hand, is inefficient [6]. According to Moon and Oh, data-driven learning (DDL) has cognitive and affective benefits. According to their research, learning from negative classroom factors helps students, particularly those with lower levels, improve their grammatical awareness and increase their motivation to learn. However, their method is insufficiently detailed [7].
3. Application Methods in Data-Driven Teaching
3.1. Concept Introduction
3.1.1. Corpus
The word “corpus” is of Latin origin and refers to any body of text in oral or written form. Today, it commonly denotes a large collection of texts representing information from different domains. The texts are stored on a computer and presented to readers and researchers in a readable form.
According to the above definition, corpora can reveal real language phenomena and overcome the limitations of intuition and introspection, both of which are poor guides for language learning. Natural and frequently occurring real language is represented in the corpus, at least in terms of collocation, frequency, prosody, and diction. As a result, a corpus is capable of overcoming subjectivity and one-sidedness. Moreover, in computer-based corpora, the analysis results for concordance lines and frequencies are more reliable, scientific, and accurate. The advantage of a corpus is that it provides a large database of naturally occurring texts that can be analyzed in terms of real textual structures and patterns. Because it is impossible for people to experience all instances of language use, researchers, teachers, and learners need not rely solely on intuition to find explanations that fit the evidence. A corpus in computational linguistics typically consists of millions of words. The benefits of a corpus include a large amount of information, fast retrieval, and accurate retrieval, which offers a new way to solve the problems of “what to teach” and “how to teach” in English teaching [8].
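The frequency and collocation information mentioned above can be illustrated with a minimal Python sketch. The three-sentence sample text and the function names are assumptions made for illustration only; they are not taken from the corpus used in this study, and counting adjacent word pairs is only a rough stand-in for proper collocation analysis (pairs that span a sentence boundary are also counted here).

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def word_frequencies(tokens):
    """Count how often each word occurs."""
    return Counter(tokens)


def bigram_frequencies(tokens):
    """Count adjacent word pairs as a rough stand-in for collocations."""
    return Counter(zip(tokens, tokens[1:]))


# Hypothetical three-sentence corpus, used only for illustration.
sample_corpus = (
    "She made a decision quickly. He made a mistake. "
    "They made a decision after the meeting."
)

tokens = tokenize(sample_corpus)
frequencies = word_frequencies(tokens)
bigrams = bigram_frequencies(tokens)

# The most frequent words and two candidate collocations.
print(frequencies.most_common(3))
print(bigrams[("made", "a")], bigrams[("a", "decision")])
```

With a real corpus of millions of words, the same counts would show which word pairs occur together far more often than intuition alone would suggest.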
3.1.2. Data-Driven Learning
With the rapid development of educational big data, the data-driven approach is being recognized by more and more people. “Data-driven” generally refers to the method of discovering, analyzing, and solving problems through data mining [9] and data analysis. The author searched CNKI using “data-driven” as the keyword, with the search period ending in December 2021. The attention index of the keyword “data-driven” is shown in Figure 1.

For the past fifteen years, researchers have used large-scale corpora to study the languages that are actually used. These corpora significantly improve the quality of teaching reference materials. A new approach to language learning has emerged where students work with “raw” information taken directly from corpora, and it is called DDL, or data-driven learning. Generalization plays an important role in the learning process of DDL. Students need to actively participate in learning activities and outline language rules. A generalization curriculum will be completely foreign to many Chinese students trained in an educational system that emphasizes memory rather than generalization or theory generation.
DDL goes beyond traditional foreign language instruction by combining corpus linguistics research with modern information technology and applying it to second language instruction. According to data-driven learning theory, students actively use the material in the corpus to find meanings and language rules on their own, which helps to improve their comprehensive skills. To discover the grammatical rules of the material being studied, learners must seek, identify, and infer language rules from the context. When students are encouraged to follow this model, they examine authentic language material and come to their own conclusions, and in this manner they can acquire language effectively. In a traditional classroom, by contrast, students learn rules and definitions deductively from the teacher’s explanations and reference books. The inductive method of DDL therefore complements the deductive method currently in use.
DDL involves setting up situations in which students can answer questions about language on their own by studying corpus data, and these language learning situations may vary by purpose. Teachers and students can take the original concordance lines and observe them, not necessarily knowing what they will find, but exploring rules, patterns, and meanings. Alternatively, teachers can carefully select and edit concordance lines, and possibly create material, to reveal particular linguistic features.
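The kind of corpus lines that students observe in such activities can be produced with a simple keyword-in-context (KWIC) routine. The following Python sketch is a minimal illustration under stated assumptions: the function name `kwic`, the context width, and the sample sentences are all hypothetical and are not taken from the teaching materials of this study.

```python
import re


def kwic(text, keyword, width=30):
    """Return one line per hit of `keyword`, with `width` characters of context on each side."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so that the keyword lines up in a column.
        lines.append(f"{left:>{width}} [{match.group(0)}] {right:<{width}}")
    return lines


# Hypothetical sample text, used only for illustration.
sample_text = (
    "The committee will take a decision tomorrow. "
    "Nobody wanted to take the blame for the delay. "
    "Please take your time and take notes during the talk."
)

concordance = kwic(sample_text, "take")
for line in concordance:
    print(line)
```

Aligning the keyword in one column lets learners compare what follows it ("take a decision", "take the blame", "take notes") and infer collocation patterns for themselves.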
3.1.3. Advantages and Disadvantages of Data-Driven Learning
Data-driven learning enables learners to use computer network technology and related software tools, such as search engines, which improves their cognitive processes in modern information-based teaching and is also important for their future development in society. To address the problems that exist in college English translation classes, we therefore change the current teaching situation and adopt advanced teaching methods to meet the high requirements placed on translation, so that students can master language learning skills, improve their autonomous learning ability, and develop self-management and individualized learning.
Corpora are considered beneficial for language teaching, mainly because they provide learners and teachers with a variety of real target language input and with information about the frequency of use of certain language items and the most common word pairs or phrases. The advantages and disadvantages of data-driven learning are shown in Table 1.
3.2. Deep Learning and Shallow Learning
The study of learning based on deep neural network [10–12] is called deep learning [13, 14]. Figure 2 plots two different neural networks.

(a) Simple neural network with only one hidden layer

(b) Deep neural networks with more than two hidden layers
Deep learning is defined here as learning that aims to cultivate higher-order thinking: driven by intrinsic motivation and by positive emotions and attitudes, learners critically construct the fundamentals of the subject.
Shallow learning is a type of learning that is simple and mechanical. Driven by external incentives, trainees passively acquire new knowledge through repeated memorization; they process content passively and do not interact with the knowledge. In order to pass a test or for similar purposes, the trainee studies the content of the test and memorizes it as isolated, unrelated facts. Long-term knowledge retention and flexible application are difficult to achieve. By sorting out the research status at home and abroad, this study compared deep learning and shallow learning in terms of goal level, learning motivation, learning goals, learning behaviour, learning style, knowledge system, reflective state, transfer ability, focus, cognitive results, learning environment, evaluation methods, emotional attitudes, and so on, as shown in Table 2.
It should be noted that emphasizing deep learning does not deny shallow learning, and learning is a continuous process from shallow learning to deep learning [15]. As shown in Figure 3, shallow learning creates a platform for deep learning. The richer the background knowledge of students, the more connections between knowledge can be constructed, which can only be achieved after multiple thinking steps.

3.3. Learning Methods
3.3.1. Logistic Regression
The simplest regression is linear regression, such as $y = ax + b$, which represents the relationship between the independent variable $x$ and the dependent variable $y$. However, the robustness of linear regression is poor for classification, mainly because its sensitivity is the same over the whole real number domain, whereas the output for classification needs to lie in [0, 1]. Logistic regression applies a logistic function on top of linear regression, and it is this function that gives logistic regression its favorable place in the field of machine learning [16].
The hypothesis of the logistic regression model is
$$h_w(x) = g(w^{T} x), \tag{1}$$
where $x$ represents the feature vector, $w$ the parameter vector, and $g$ the logistic function. A commonly used logistic function (and the one mainly used in this paper) is the sigmoid function, whose expression is shown in formula (2):
$$g(z) = \frac{1}{1 + e^{-z}}. \tag{2}$$
The graph of this function is in Figure 4:

Therefore, the hypothesis of the logistic regression model is as shown in formula (3):
$$h_w(x) = \frac{1}{1 + e^{-w^{T} x}}. \tag{3}$$
For the LR model, the classification is based on the following: for a given input variable $x$, the probability of the output variable $y$ being 1 is calculated according to the selected parameters $w$, namely,
$$h_w(x) = P(y = 1 \mid x; w).$$
The ReLU (rectified linear unit) function, $f(x) = \max(0, x)$, is a popular activation function that has gradually replaced the sigmoid function [17]. As shown in Figure 5, it does not tend to saturate as the input $x$ gradually increases.
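The two activation functions discussed in this subsection can be compared numerically with a short Python sketch. It is a minimal illustration, not the implementation used in this study: the function names and the example weights are assumptions, and the sigmoid is written in a numerically stable form so that large negative inputs do not overflow.

```python
import math


def sigmoid(z):
    """Logistic (sigmoid) function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    exp_z = math.exp(z)
    return exp_z / (1.0 + exp_z)


def predict_probability(weights, features):
    """Logistic regression hypothesis: sigmoid of the weighted sum of the features."""
    z = sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


def relu(z):
    """Rectified linear unit: zero for negative inputs, identity otherwise."""
    return max(0.0, z)


# Hypothetical weights and feature vector, used only for illustration.
example_weights = [0.5, -0.25]
example_features = [2.0, 4.0]
print(predict_probability(example_weights, example_features))

# The sigmoid flattens out for large inputs, while ReLU keeps growing.
for value in (-5.0, 0.0, 5.0, 50.0):
    print(value, sigmoid(value), relu(value))
```

For large positive inputs the sigmoid output approaches 1 and its slope approaches 0 (saturation), whereas the ReLU output keeps growing linearly, which is the contrast drawn in the text.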

3.3.2. Support Vector Machine
The support vector machine (SVM) is a learning model developed by many researchers on the basis of statistical learning theory. In recent years, extensive research has made support vector machines one of the most important advances in the field of text classification [18]. SVMs were developed from the optimal separating surface in the linearly separable case. The basic idea is to find a separating hyperplane that satisfies the classification requirements and keeps the points in the training set as far away from it as possible; the distance between the two classes across this hyperplane is called the classification interval (margin). The optimal classification surface requires that the classification line not only correctly separates the two classes (with a training error rate of 0) but also maximizes the classification interval between them, as shown in Figure 6.

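Assume that the training data $(x_1, y_1), \ldots, (x_n, y_n)$, with $y_i \in \{+1, -1\}$, can be separated without error by a hyperplane $(\lambda \cdot x) + b = 0$. The classification hyperplane that is farthest from the sample points of the two classes should obtain the best generalization ability. The optimal hyperplane is determined by the few sample points closest to it (called support vectors). The classification hyperplane with sample interval $\theta$ is
$$(\lambda \cdot x) + b = 0, \quad \|\lambda\| = 1, \qquad y = \begin{cases} +1, & (\lambda \cdot x) + b \ge \theta, \\ -1, & (\lambda \cdot x) + b \le -\theta. \end{cases}$$
To solve the SVM optimization problem, the classification hyperplane should be normalized as follows: let $\theta = 1$, while $\lambda$ and $b$ can be scaled proportionally. The hyperplane is then represented as
$$y_i \left[ (\lambda \cdot x_i) + b \right] \ge 1, \quad i = 1, \ldots, n.$$
The distance from a support vector to the hyperplane is $1 / \|\lambda\|$; therefore, the optimization problem is expressed as
$$\min_{\lambda, b} \ \frac{1}{2} \|\lambda\|^{2} \quad \text{s.t.} \quad y_i \left[ (\lambda \cdot x_i) + b \right] \ge 1, \quad i = 1, \ldots, n.$$
According to the quadratic programming method in optimization theory, the problem is transformed into its Wolfe dual problem, constructing the Lagrange function:
$$L(\lambda, b, \alpha) = \frac{1}{2} \|\lambda\|^{2} - \sum_{i=1}^{n} \alpha_i \left\{ y_i \left[ (\lambda \cdot x_i) + b \right] - 1 \right\}.$$
In the formula, $\alpha_i \ge 0$ is the Lagrange multiplier.
According to the optimization principle,
$$\frac{\partial L}{\partial \lambda} = 0, \qquad \frac{\partial L}{\partial b} = 0.$$
That is
$$\lambda = \sum_{i=1}^{n} \alpha_i y_i x_i, \qquad \sum_{i=1}^{n} \alpha_i y_i = 0.$$
The Wolfe dual problem of the original optimization problem is obtained by substitution:
$$\max_{\alpha} \ W(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) \quad \text{s.t.} \quad \alpha_i \ge 0, \quad \sum_{i=1}^{n} \alpha_i y_i = 0.$$
Its solution is the optimal solution to the initial optimization problem. An optimization algorithm can be used to solve for $\alpha$; the parameter $b$ can be calculated according to the Karush-Kuhn-Tucker condition:
$$\alpha_i \left\{ y_i \left[ (\lambda \cdot x_i) + b \right] - 1 \right\} = 0, \quad i = 1, \ldots, n.$$
So the optimal hyperplane is
$$\sum_{i=1}^{n} \alpha_i y_i (x_i \cdot x) + b = 0.$$
For linearly inseparable classification problems, the input can be mapped to a high-dimensional feature space by a nonlinear function $\varphi$, and linear classification is performed in this space, namely,
$$f(x) = \operatorname{sgn} \left( \sum_{i=1}^{n} \alpha_i y_i \left( \varphi(x_i) \cdot \varphi(x) \right) + b \right).$$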
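The last steps of this derivation, recovering the hyperplane from the Lagrange multipliers and classifying a new point, can be checked on a toy example. The Python sketch below is an illustration under stated assumptions rather than a full SVM solver: the two training points and their multipliers (0.25 each) are hypothetical values worked out by hand for this tiny data set, and all function names are made up for the example.

```python
import math


def dot(u, v):
    """Inner product of two vectors given as lists."""
    return sum(a * b for a, b in zip(u, v))


def weight_vector(alphas, labels, points):
    """Normal vector of the hyperplane: sum of alpha_i * y_i * x_i."""
    dimension = len(points[0])
    return [
        sum(a * y * x[k] for a, y, x in zip(alphas, labels, points))
        for k in range(dimension)
    ]


def bias_from_support_vector(weights, point, label):
    """For a support vector, y * (w.x + b) = 1, so b = y - w.x when y is +1 or -1."""
    return label - dot(weights, point)


def classify(weights, bias, x):
    """Sign of the decision function w.x + b."""
    return 1 if dot(weights, x) + bias >= 0 else -1


# Hypothetical toy data: one point per class, both are support vectors.
points = [[1.0, 1.0], [-1.0, -1.0]]
labels = [1, -1]
alphas = [0.25, 0.25]  # hand-derived multipliers for this toy problem

weights = weight_vector(alphas, labels, points)
bias = bias_from_support_vector(weights, points[0], labels[0])
margin = 1.0 / math.sqrt(dot(weights, weights))

print(weights, bias, margin)
print(classify(weights, bias, [2.0, 3.0]), classify(weights, bias, [-2.0, 0.0]))
```

In practice the multipliers would come from a quadratic programming solver, and for linearly inseparable data the inner product in the decision function would be computed in the mapped feature space.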
4. Experiment and Analysis of Translation Instructional Design Based on Data-Driven Learning
4.1. Questionnaire Survey Results
This research conducted a questionnaire survey on 284 non-English majors, including 168 boys and 116 girls, who came from different majors, including civil engineering, computer, and fashion design. The purpose is to make the data universal and authentic.
Figure 7(a) shows the responses to the question “Are you good at English translation?”, which was included in the questionnaire to investigate the students’ English ability. Of the respondents, 28 (9.86%) said they were good at English translation, 56 (19.72%) relatively good, 90 (31.69%) average, and 110 (38.73%) not good. This shows that few non-English majors are able to translate English smoothly, and students generally have little interest in English learning.

(a) Students’ English translation ability

(b) Problems in the classroom

(c) Own English translation problem
Figure 7(b) shows the responses to the question “What problems do you think exist in the current college English translation class (multiple choices)?” Of the respondents, 84 (29.58%) found the class boring, 206 (72.54%) did not feel that students were the center of the classroom, 42 (14.79%) found the content difficult, 142 (50.00%) felt that the lecture mode lacked novelty, and 184 (64.79%) reported a lack of interaction.
Figure 7(c) shows the responses to the question “What problems do you have in English translation (multiple choices)?” Of the respondents, 156 (54.93%) reported insufficient basic skills in English and Chinese, 192 (67.61%) did not understand some typical differences between English and Chinese, and 178 (62.68%) lacked basic translation strategies.
It can be seen that the college English translation classroom needs to be transformed. A classroom based on data-driven learning, with a parallel corpus constructed for translation teaching, can better serve students, better promote their learning, and provide learners with authentic English.
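The percentages reported in this subsection follow directly from the response counts and the 284 respondents. The short Python sketch below recomputes them as a check; the counts are those given in the text, while the dictionary keys and the function name are shorthand introduced here for illustration.

```python
TOTAL_RESPONDENTS = 284


def percentage(count, total=TOTAL_RESPONDENTS):
    """Share of respondents choosing an option, in percent, rounded to two decimals."""
    return round(100.0 * count / total, 2)


# Single-choice question: the counts sum to the number of respondents.
translation_ability = {"good": 28, "relatively good": 56, "average": 90, "not good": 110}

# Multiple-choice questions: the counts may sum to more than 284.
classroom_problems = {
    "boring": 84,
    "not student-centered": 206,
    "content difficult": 42,
    "lecture mode lacks novelty": 142,
    "lack of interaction": 184,
}
own_problems = {
    "insufficient basic skills": 156,
    "English-Chinese differences": 192,
    "lack of translation strategies": 178,
}

for question in (translation_ability, classroom_problems, own_problems):
    for option, count in question.items():
        print(f"{option}: {count} ({percentage(count):.2f}%)")
```

Because the second and third questions allow multiple choices, their percentages are each computed against all 284 respondents and do not add up to 100%.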
4.2. Instructional Design
Most traditional translation teaching adopts a teacher-centered method of evaluating translation skills. Students learn under the teacher’s guidance and acquire translation knowledge passively, but they cannot actively develop translation skills. In addition, non-English majors have their own professional courses and cannot spend as much time on corpus screening as English majors. According to the survey, DDL has been divided into three steps: asking questions, solving questions, and summarizing [19, 20]. These steps are carried out under the leadership of teachers. To bring the process more in line with modern learning concepts and help students truly master knowledge, we have adjusted the original three steps, trying to create a student-centered, bottom-up teaching mode of discovery and translation. This process is summarized as (teacher) corpus presentation-(student) observation and discussion-(student) reporting results-(teacher) guiding comment-(student) practice consolidation-(student/teacher) comparative evaluation. The design process is shown in Figure 8.

This instructional design transforms traditional teaching into a progressive teaching process. It begins with the introduction of authentic corpus material and predefined problems, followed by collaborative learning, observation, discussion, analysis, and student reporting. Then, under the teacher’s guidance and in-depth explanation, students combine translation theory with translation practice. By analyzing the gap between their actual translations and the core text, students and teachers arrive at a shared understanding of translation knowledge, and students master translation learning skills independently. Throughout the teaching process, it is necessary not only to ensure continuous language input but also to allow students to make fuller use of their language ability [21].
4.3. Teaching Applications
Based on their results in the college English final examination for the first semester of the sophomore year, two second-year classes of non-English majors at the same level in one university were selected as the subjects of this experiment. There were 50 students in each class, so a total of 100 students from various non-English majors participated. The 50 students in the experimental group, who also took the posttest, participated in the questionnaire survey. Their tests and questionnaires were collected, and all questionnaires were valid. First, the posttest data are described and analyzed with SPSS; then, the results of the questionnaire are summarized and discussed. The British National Corpus (BNC), a collection of 100 million words of written and spoken language samples from a wide range of sources, was selected for this experiment.
Each class was treated as a whole group of participants to reduce between-group effects. The two groups were exposed to two different teaching methods in the experiment: class A, the control group, received traditional translation instruction, while class B, the experimental group, received college English translation instruction based on data-driven learning. Both groups were taught by the same teacher to reduce the possibility of the results being influenced by differences in teaching. In the pretest, the students were given 120 words and asked to write down the definition and part of speech of each word. Based on the pretest results, 30 target words were chosen on the criterion that none of the students had responded correctly to them.
The corpus-based materials were distributed to the experimental class three days before the experiment; the students were allowed to discuss them but not to refer to books. On the day of the experiment, the students were asked to share their findings with their classmates, and the teacher offered suggestions or opinions based on those findings. In the control group, the teacher presented the words to the students and explained their meanings, parts of speech, and associations with other words using the traditional method. The participants were then asked to complete sentence-building exercises, and related learning strategies were introduced to assist them. Both methods were used under the same conditions throughout the experiment, reducing the possibility of external interference.
Immediately after the experiment, a posttest was administered to test the effects of the two methods. In addition, all students in the experimental class were asked to complete a questionnaire. In this experiment, the independent variable was the vocabulary teaching method, and the dependent variable was the posttest score. The difference between the two learning methods was therefore assessed with a paired-samples t-test on the test results. For accuracy and efficiency, the posttest results were entered into a computer and analyzed with SPSS. The data analysis results are shown in Table 3.
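The statistic behind a paired-samples t-test can be sketched in a few lines of Python. This is a minimal illustration, not a reproduction of the SPSS analysis: the five score pairs below are made-up numbers chosen for easy arithmetic, not data from this study, and the function name is an assumption.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(scores_first, scores_second):
    """Return the t statistic and degrees of freedom for paired scores."""
    if len(scores_first) != len(scores_second):
        raise ValueError("paired samples must have the same length")
    differences = [b - a for a, b in zip(scores_first, scores_second)]
    n = len(differences)
    # Standard error of the mean difference, using the sample standard deviation.
    standard_error = stdev(differences) / math.sqrt(n)
    return mean(differences) / standard_error, n - 1


# Hypothetical paired scores, used only for illustration (not data from this study).
traditional_scores = [10, 12, 9, 14, 11]
ddl_scores = [13, 15, 10, 18, 12]

t_value, degrees_of_freedom = paired_t_statistic(traditional_scores, ddl_scores)
print(t_value, degrees_of_freedom)
```

The p value is then read from the t distribution with the given degrees of freedom, which statistical packages such as SPSS report directly. A paired test presumes that each pair of scores belongs to the same participant or to a matched unit.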
5. Discussion
First, through a study of the relevant literature, this paper establishes the necessary background knowledge and analyzes how to conduct research on college English translation teaching based on data-driven learning. It then expounds the concept of deep learning and the related algorithms, explores logistic regression and support vector machines as learning methods, and analyzes the applicability of data-driven learning in translation teaching through experiments.
Data-driven learning is a way to increase input and focus on learners. It provides real data that facilitates conscious learning and is the best activity for increasing self-awareness. At the same time, this learning method can not only stimulate students’ autonomous learning but also promote students’ learning ability and improve learning effect [22].
Through experimental analysis, this paper shows that the DDL method is better than the traditional method. Paired-samples t-tests indicated a significant difference between participants’ performance on DDL-treated words and on traditionally treated words. This significant result shows that DDL does play a role in vocabulary learning. Therefore, DDL works better overall than the traditional method.
6. Conclusions
As a new learning method, data-driven learning has been widely used in college English translation teaching. It provides a variety of learning resources for translation teaching and addresses the problems of outdated content and insufficient knowledge in college English translation teaching. This way of learning not only helps students master the study skills of English translation but also helps to cultivate their autonomous learning ability and meet their individual learning needs. Properly used in students’ foreign language learning, DDL has broad prospects and is likely to have a profound impact on language acquisition. Further research is still needed on how DDL correlates with other factors, whether the results can be generalized, and what form DDL should take. The approach also helps, to a certain extent, to improve the comprehensive quality of students that today’s society demands. Vigorously promoting data-driven learning therefore requires front-line teachers to understand the concept of classroom teaching and to combine teaching methods with modern information technology in order to optimize the overall environment for English translation teaching.
Data Availability
The data used to support the findings of this study are included within the article.
Conflicts of Interest
The author does not have any possible conflicts of interest.