Abstract
Molecular dynamics simulation is an effective method for studying the microscopic world at the level of atoms and molecules, and it is widely applied in physics, medicine, and other disciplines. As the volume of data to be studied grows, however, the calculation speed and other performance aspects of molecular dynamics simulation can no longer fully meet the needs of current data analysis and calculation. Parallelizing and optimizing the original method has therefore become a focus of research. This paper uses machine learning (ML) and data mining algorithms, specifically Restricted Boltzmann Machines (RBM) and K-Nearest Neighbors (KNN), to optimize the performance of a molecular dynamics simulation system. A parallel optimization experiment on KNN first establishes the effectiveness of the KNN optimization, and a molecular dynamics simulation optimization experiment is then designed. Comparison with the simulation system before optimization shows that the efficiency ratio of the force calculation running time is highest, at 31.15%, when the number of particles is 4096; that of the motion trajectory equation is highest, at 30.28%, when the number of particles is 512; and that of the equilibrium judgment is highest, at 36.96%, when the number of particles is 256. These results show that the performance of the optimized simulation system has improved, in line with expectations.
1. Introduction
As science and technology advance, research in physics, medicine, chemistry, biology, and related disciplines has become more in-depth, and the demands placed on research methods have risen accordingly. Molecular dynamics simulation is widely used in these fields. It can be used to study the physical or chemical properties of particles, and it even makes it possible to carry out hypothetical experiments that are difficult to complete in reality, which reduces the time and material cost of research. Molecular dynamics simulation has therefore received growing attention in recent years and has developed rapidly across disciplines. This rapid development has brought new challenges: as the scale of research data expands and research problems deepen, molecular dynamics simulation needs to become faster, more accurate, and more convenient. The usefulness of machine learning and data mining algorithms on large-scale data has been confirmed by research in multiple disciplines. With machine learning and data mining, the calculation process can be made more accurate and can also be parallelized, which saves computing time.
Molecular dynamics simulation reproduces the movement of particles according to the formulas of classical mechanics and, as the simulation proceeds, periodically extracts samples from the simulated particle trajectories. These samples are used to determine whether the simulated system is approaching equilibrium. Once the system can be judged to have reached an equilibrium state, its microscopic and macroscopic properties, such as various physical and chemical properties, can be analyzed from the simulation results. The scale of calculation in this process is huge, and the computational complexity depends on the number of particles in the system, which usually starts at tens of thousands. Optimizing the calculation process is therefore a research direction pursued by many researchers. In this paper, the performance of molecular dynamics simulation is optimized through machine learning and data mining algorithms. On the one hand, this expands the application fields of machine learning and data mining and provides a reference for applying these algorithms to optimization problems in other fields. On the other hand, it provides ideas for optimizing the calculation process of molecular dynamics simulation and data support for its application and improvement in practice, which has both theoretical and practical significance.
Starting from the current state of molecular dynamics simulation, this paper combines machine learning and data mining algorithms to optimize the calculation process in response to the new challenges that arise in actual operation. Comparing the experimental data before and after optimization shows that the optimized molecular dynamics simulation performs better, which verifies the effectiveness of the optimization method used in this paper. The innovation of this paper is to apply techniques from the era of big data to the molecular dynamics simulation system by combining machine learning and data mining: the methods from the two families of algorithms that suit molecular dynamics simulation are selected and adapted to the characteristics of the simulation system.
2. Related Work
Molecular dynamics simulation obtains the macroscopic properties of matter through the observation of microscopic particles, so it is widely used in biology, chemistry, physics, and related disciplines. Researchers in these fields have carried out a number of studies with this method. Li et al. investigated the vapor-liquid interfaces of fourth-generation refrigerants using molecular dynamics simulations and provided predictions about their vapor-liquid equilibrium and interface properties derived from the simulations [1]. Liang et al. investigated the effects of defects and temperature on the mechanical properties of hexagonal boron nitride sheets (h-BN) containing randomly distributed defects through molecular dynamics simulations and discussed the reasons for the results [2]. Using molecular dynamics and related types of models to simulate the collective behavior of molecules in condensed phase systems, Afzal et al. provided insights into a detailed molecular-level understanding of the mechanisms behind the dissolution behavior of ASDs [3]. Chaban et al. studied how the ionic volume affects the molecular dynamics properties of double-layer capacitors [4]. These studies demonstrate the practicality of molecular dynamics simulation, but the method still needs to be improved for computations with large particle numbers.
Machine learning and data mining algorithms are algorithm models driven by the development of computer science in recent years. These two families of algorithms are widely used and applicable in many disciplines. Helma et al. used ML to design the action in the game, to extract the relevant features of the game through the recurrent network, and to observe the combined skills in the game action [5]. Mullainathan and Spiess assessed the possibility of applying ML methods in the field of economics through a performance analysis of ML [6]. Zhou et al. argued that the emergence of big data has promoted the development of ML and identified the opportunities and challenges of ML in the field of identification through theoretical analysis [7]. Chaurasia and Pal used data mining technology to predict the risk of breast cancer [8]. Mohammed et al. presented recent advances in developing artificial intelligence and machine learning methods [9]. Tan et al. used data mining technology to analyze the prescription rules of acupuncture and moxibustion treatment for impotence recorded in ancient books and realized the combination of modern methods and traditional culture [10]. From this point of view, ML and data mining are mainly used for the predictive analysis of data, and the two overlap. The main open problem for these algorithms at present is training accuracy.
3. A Brief Introduction to Molecular Dynamics Simulation and Algorithms
3.1. Molecular Dynamics Simulation
Molecular dynamics simulation is a method of molecular motion simulation using multidisciplinary knowledge, mainly through Newtonian mechanics to simulate the motion trajectory of molecules. According to the different states of molecular motion trajectories in the simulation process, relevant samples are extracted, and integral calculation is performed. Other macroscopic properties in the molecular system are further calculated from the integral results [11]. The development disciplines and application areas of molecular dynamics simulation are shown in Figure 1.
![](https://static-preview.hindawi.com/articles/misy/volume-2022/4553446/figures/4553446.fig.001.jpg)
As shown in Figure 1, molecular dynamics simulations can be used in biology, materials, medicine, and other fields. This is related to the method characteristics of molecular dynamics simulation, which can break through conventional experimental methods to obtain the characteristics of molecules and thus obtain the law of occurrence of substances. This is undoubtedly the cornerstone of many disciplines, such as medicine and biology, which have many molecular studies.
The calculation process of molecular dynamics simulation is based mainly on thermodynamic calculations in physics, so the method involves potential energy parameters. In the calculation, the range of the potential is expressed by a cut-off (truncation) radius, and the interaction between the particles in the particle system is evaluated with respect to this radius. In practice, if the distance between two particles is larger than the cut-off radius, the force between them can be ignored [12, 13]. The force between particles is calculated using the nearest mirror method [14], commonly known as the minimum image convention. In addition, the number of particles studied in practice is often huge. In the simulation, the trajectory of every particle of the full system is not calculated; instead, a part of the particles is extracted for simulation through feature selection. This requires the introduction of periodic boundary conditions, which account for the influence of the surrounding environment on the selected particles by replacing that influence with boundary conditions [15]. The nearest mirror method with periodic boundary conditions is shown in Figure 2.
![(a)](https://static-preview.hindawi.com/articles/misy/volume-2022/4553446/figures/4553446.fig.002a.jpg)
(a)
![(b)](https://static-preview.hindawi.com/articles/misy/volume-2022/4553446/figures/4553446.fig.002b.jpg)
(b)
As shown in Figure 2, the nearest mirror method means that, when calculating the interaction force between two particles, the nearest image of the partner particle is used. For example, suppose a system contains particles numbered 1, 2, 3, and 4. When calculating the force exerted on particle 3 by particle 2, the image of particle 2 closest to particle 3 is selected, even if that image lies in a neighboring copy of the system. Likewise, when calculating the force on particle 2 from particle 3, the image of particle 3 closest to particle 2 is selected. Calculating the interaction force in this way eliminates the error caused by boundary effects. The number of particles in the whole system does not change as the particles move, and neither does the overall particle density.
Figure 2 also illustrates the periodic boundary condition. As the particles move, some of them leave the system because of their irregular motion. Whenever this happens, the same number of particles enters the system from the opposite side, so the number of particles in the system remains unchanged. For example, if particle 2 moves out of one system into a neighboring copy, an image of particle 2 enters the system from the opposite side, and this repeats throughout the simulation. In this way, the number of particles in each cell is always conserved, which simplifies the calculation.
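The re-entry rule described above can be sketched in a few lines of Python. This is a minimal illustration, assuming a cubic box whose coordinates run from 0 to `box_length`; the function names and the box convention are hypothetical and are not taken from the simulation system used in this paper.

```python
def wrap_coordinate(x, box_length):
    """Map one coordinate back into [0, box_length) under periodic boundaries."""
    # Python's % returns a non-negative result for a positive divisor,
    # so a particle leaving on one side re-enters on the opposite side.
    return x % box_length


def wrap_position(position, box_length):
    """Apply the periodic wrap to every component of a 3D position."""
    return tuple(wrap_coordinate(c, box_length) for c in position)


# A particle that drifted out of a box of side 10 in x and y is mapped back in.
wrapped = wrap_position((10.5, -0.5, 3.0), 10.0)
```

Because every particle that leaves is mapped to its image inside the box, the particle count of the cell stays constant without any bookkeeping.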
To maintain the accuracy of the results while saving calculation time, an important step in molecular dynamics simulation is choosing the integration step size [16], which determines the sampling frequency of the system. If the step is too small, the system performs computations too frequently and produces an excessive amount of data, which wastes time and also makes it difficult to extract key features, because the state of the particles changes very little between consecutive steps. If the step is too large, changes in the system are missed, which reduces the accuracy of the system. In general, the integration step is chosen as about one-tenth of the shortest period of motion among the degrees of freedom in the system.
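The one-tenth rule can be written down directly. The sketch below assumes the shortest period is known, or can be obtained from the highest angular frequency in the system as T = 2π/ω; both helper names are hypothetical.

```python
import math


def shortest_period_from_frequency(max_angular_frequency):
    """Period of the fastest motion, T = 2*pi/omega (assumed relation)."""
    if max_angular_frequency <= 0:
        raise ValueError("angular frequency must be positive")
    return 2.0 * math.pi / max_angular_frequency


def integration_step(shortest_period, fraction=0.1):
    """Integration step as a fraction (default one-tenth) of the shortest period."""
    if shortest_period <= 0:
        raise ValueError("period must be positive")
    return fraction * shortest_period


# Example: a fastest mode with period 10 (arbitrary time units) gives a step of 1.
dt = integration_step(10.0)
```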
In the calculation process of the simulation, it is usually necessary to define the initial coordinates and velocity of the particles. Therefore, in the initial stage, the particles are placed together in a space according to the crystal structure, and then the simulation operation is performed. The crystal structure and simulation flow are shown in Figure 3.
![(a)](https://static-preview.hindawi.com/articles/misy/volume-2022/4553446/figures/4553446.fig.003a.jpg)
(a)
![(b)](https://static-preview.hindawi.com/articles/misy/volume-2022/4553446/figures/4553446.fig.003b.jpg)
(b)
As shown in Figure 3, each particle in the particle system is arranged according to the crystal structure, and the coordinates of each particle are defined. The state parameters are computed from the relevant formulas to obtain the side length of the simulated system, and the particles are then placed at three-dimensional coordinates according to the crystal structure [17, 18]. The entire simulation process can be roughly divided into three parts: initialization, equilibrium determination, and results. First, the relevant parameters of each particle are given, and the force on each particle is calculated. The particle trajectories are then calculated according to Newtonian mechanics. Finally, it is determined whether the system tends toward equilibrium, after which the final physical and chemical properties can be obtained.
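The initialization step can be sketched as follows. The crystal structure of Figure 3 is not specified in the text, so a simple cubic lattice is assumed here purely for illustration, and the side length is derived from an assumed number density via l = (N/ρ)^(1/3). All names are hypothetical.

```python
def box_side_length(n_particles, number_density):
    """Side length of a cubic box holding n_particles at the given number density."""
    return (n_particles / number_density) ** (1.0 / 3.0)


def simple_cubic_positions(cells_per_side, box_length):
    """Place one particle at the centre of each cell of a simple cubic grid."""
    spacing = box_length / cells_per_side
    return [
        ((i + 0.5) * spacing, (j + 0.5) * spacing, (k + 0.5) * spacing)
        for i in range(cells_per_side)
        for j in range(cells_per_side)
        for k in range(cells_per_side)
    ]


# 8 cells per side gives 8**3 = 512 particles, one of the system sizes studied.
positions = simple_cubic_positions(8, 16.0)
```

Initial velocities would be assigned in the same pass; they are omitted here because the text does not state how they are drawn.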
The calculation of the interaction force between particles needs to take into account the force of the particles. Assuming that there are M particles in the particle system, the number of steps to calculate the force is as
The time complexity of calculating the number of force steps based on this is as
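As a point of reference for how the force step scales, the sketch below assumes that each unordered pair of particles is evaluated exactly once, which gives M(M-1)/2 pair evaluations and quadratic growth in M. This counting convention is an assumption, and the function names are hypothetical.

```python
from itertools import combinations


def count_pair_evaluations(m):
    """Count pair evaluations by enumerating every unordered pair of m particles."""
    return sum(1 for _ in combinations(range(m), 2))


def pair_count_closed_form(m):
    """Closed form M(M-1)/2 for the same count."""
    return m * (m - 1) // 2


# Doubling the particle number roughly quadruples the work: O(M^2) growth.
growth = pair_count_closed_form(512) / pair_count_closed_form(256)
```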
The three-dimensional coordinates of each particle are calculated according to the crystal structure, and the distance between particles is calculated from these coordinates. For two particles a and b, the actual distance between them is
$$r_{ab} = \sqrt{x_{ab}^{2} + y_{ab}^{2} + z_{ab}^{2}} \tag{3}$$
In formula (3), $x_{ab}$ represents the distance between particle a and particle b in the x-axis direction, $y_{ab}$ represents the distance between particle a and particle b in the y-axis direction, and $z_{ab}$ represents the distance between particle a and particle b in the z-axis direction.
The distance between two particles is calculated according to the nearest mirror method. represents the vector distance calculated using this method and the distance in the x-axis direction is calculated using the nearest mirror method as formulas (4) and (5):
In formulas (4) and (5), l represents the side length of the simulation box, and m represents the density of particles. The calculation of the y-axis and z-axis directions is also calculated in this way.
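A common way to implement the nearest mirror method is to subtract, along each axis, the nearest whole number of box lengths from the coordinate difference. The sketch below uses that standard form for a cubic box of side `box_length`; it may differ in notation from formulas (4) and (5), and the function names are hypothetical.

```python
import math


def minimum_image_component(delta, box_length):
    """Shift a coordinate difference to its nearest periodic image."""
    return delta - box_length * round(delta / box_length)


def minimum_image_distance(pos_a, pos_b, box_length):
    """Distance between two particles using the nearest image along each axis."""
    components = [
        minimum_image_component(a - b, box_length)
        for a, b in zip(pos_a, pos_b)
    ]
    return math.sqrt(sum(c * c for c in components))


# Two particles near opposite faces of a box of side 10 are only 1 apart
# once the periodic image is taken into account.
d = minimum_image_distance((0.5, 0.0, 0.0), (9.5, 0.0, 0.0), 10.0)
```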
The actual distance between the particles calculated by the nearest mirror method is compared with the set truncation radius. When the truncation radius is smaller than the distance between the particles, the interaction force between the two particles can be completely ignored; that is, the force at this time is 0. When the truncation radius is greater than the distance between the particles, it means that there is a force between the two particles, and the force needs to be calculated in the calculation. The influence of the force on the potential energy needs to be judged. The condition that the truncation radius needs to meet is as
Assuming that the actual distance between particles is 48, the truncation radius distance is 24. The force calculation formula is as
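To illustrate how the truncation radius enters the force evaluation, the sketch below returns zero beyond the cut-off and evaluates a pair force inside it. The Lennard-Jones form in reduced units is an assumption made only for this example; the text does not name the potential used in the simulated system, and the function name is hypothetical.

```python
def lennard_jones_force(r, cutoff, epsilon=1.0, sigma=1.0):
    """Radial pair force, truncated at the cut-off radius.

    Assumed potential: U(r) = 4*epsilon*((sigma/r)**12 - (sigma/r)**6).
    A positive return value is repulsive, a negative one attractive.
    """
    if r <= 0:
        raise ValueError("distance must be positive")
    if r >= cutoff:
        # Beyond the truncation radius the interaction is ignored.
        return 0.0
    sr6 = (sigma / r) ** 6
    return 24.0 * epsilon * (2.0 * sr6 * sr6 - sr6) / r


# Inside the cut-off the force is evaluated; outside it is exactly zero.
inside = lennard_jones_force(1.0, cutoff=2.5)
outside = lennard_jones_force(3.0, cutoff=2.5)
```

Skipping pairs beyond the cut-off is what makes neighbor search worthwhile: only particles within the truncation radius contribute to the sum.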
First the position and dynamic properties should be set in order to calculate the motion trajectory of the particle. Assuming that the time of each small step of the particle motion is , the approximate values of the distance d, velocity , and acceleration a between particles are as
In formulas (8)–(10), e, f, , and R represent the values in the Taylor series, respectively. The n in the formulas represents the total number of expansions required, which is related to the size of the number of particles.
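In practice, the finite difference method is usually used to calculate the velocity and acceleration. A relatively simple scheme is the leapfrog method, whose update formulas are
$$v\left(t + \tfrac{\delta t}{2}\right) = v\left(t - \tfrac{\delta t}{2}\right) + \frac{F(t)}{m}\,\delta t \tag{11}$$
$$d(t + \delta t) = d(t) + v\left(t + \tfrac{\delta t}{2}\right)\delta t \tag{12}$$
where $\delta t$ is the time of each small step of the particle motion.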
In formulas (11) and (12), m is the mass of the particle, and F represents the resultant force acting on the particle. Its formula is
$$F = -\nabla U \tag{13}$$
In formula (13), U represents the total potential energy in the system.
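The leapfrog scheme can be sketched for a single coordinate as follows. This is the standard textbook form, in which the velocity is stored at half steps and the position at whole steps; the function name and the harmonic test force are assumptions for illustration.

```python
def leapfrog_step(position, half_step_velocity, force_fn, mass, dt):
    """Advance one leapfrog step.

    half_step_velocity is v(t - dt/2); the function returns
    (x(t + dt), v(t + dt/2)).
    """
    acceleration = force_fn(position) / mass
    # Velocity "leaps" over the position by a full step.
    new_half_velocity = half_step_velocity + acceleration * dt
    # Position is then advanced with the updated half-step velocity.
    new_position = position + new_half_velocity * dt
    return new_position, new_half_velocity


# Harmonic restoring force F = -x (assumed test case), unit mass.
x, v = 1.0, 0.0
for _ in range(1000):
    x, v = leapfrog_step(x, v, lambda q: -q, mass=1.0, dt=0.1)
```

Each step needs only one force evaluation, which is why the cost of the trajectory update is dominated by the force calculation discussed above.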
The equilibrium judgment of the system state is based mainly on whether the system energy oscillates within a certain range. In general, the criterion is whether the total energy of the system oscillates within about 10% of $\frac{3}{2}Nk_{B}T$. The time complexity of the equilibrium judgment is as
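One possible reading of this criterion is that every recent energy sample must lie within 10% of 3/2 N k_B T. The sketch below implements that reading; the interpretation, the reduced units (k_B = 1), and the function name are all assumptions.

```python
def is_equilibrated(energy_samples, n_particles, temperature,
                    k_b=1.0, tolerance=0.10):
    """Return True if all sampled energies lie within tolerance of 3/2*N*k_B*T."""
    if not energy_samples:
        raise ValueError("at least one energy sample is required")
    target = 1.5 * n_particles * k_b * temperature
    return all(abs(e - target) <= tolerance * target for e in energy_samples)


# With N = 256 and T = 1 the target is 384, so the accepted band is 384 +/- 38.4.
settled = is_equilibrated([380.0, 390.0, 400.0], n_particles=256, temperature=1.0)
```

The check is a single pass over the samples, so its cost grows linearly with the number of samples examined.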
3.2. Machine Learning and Data Mining
Machine learning is a process through which a system acquires knowledge and skills and discovers regularities from the information it receives. The carrier of this learning process is the computer, whose software system embodies the learning ability. The structural model of machine learning and data mining is shown in Figure 4.
![(a)](https://static-preview.hindawi.com/articles/misy/volume-2022/4553446/figures/4553446.fig.004a.jpg)
(a)
![(b)](https://static-preview.hindawi.com/articles/misy/volume-2022/4553446/figures/4553446.fig.004b.jpg)
(b)
As shown in Figure 4, machine learning mainly provides external information from the environment. The learning unit learns after receiving the external information and incorporates the information into the existing knowledge base or creates a new knowledge base. The system then completes the relevant tasks according to the content in the knowledge base, and at the same time, the system feeds back the information after processing the relevant tasks to the learning unit for further study [19]. Data mining is to first select the target data in the database, preprocess the data to obtain the processed data, and then reflect the laws or patterns through data mining. It will screen out useful rules or patterns from the results of data mining according to certain evaluation criteria and finally express the filtered rules or patterns through visual knowledge [20].
3.3. Restricted Boltzmann Machine (RBM)
RBM is modeled and analyzed through an energy function and is divided into visible layer and hidden layer. It is a kind of learning algorithm that is mainly used to train data through the calculation of probability distribution to solve the problem that the internal representation of human beings is difficult to define. Its network structure is shown in Figure 5.
![](https://static-preview.hindawi.com/articles/misy/volume-2022/4553446/figures/4553446.fig.005.jpg)
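As shown in Figure 5, an RBM consists of a visible layer and a hidden layer. After the data is input at the visible layer, the corresponding result is output through the calculation of the hidden layer. Assuming that the weight between the visible layer unit $v_i$ and the hidden layer unit $h_j$ is $w_{ij}$, the bias of $v_i$ is $a_i$, and the bias of $h_j$ is $b_j$, then the energy function is
$$E(v, h) = -\sum_{i} a_i v_i - \sum_{j} b_j h_j - \sum_{i}\sum_{j} v_i w_{ij} h_j \tag{15}$$
Modeled according to the energy function, the joint probability distribution between the hidden layer and the visible layer is
$$P(v, h) = \frac{1}{G}\, e^{-E(v, h)} \tag{16}$$
In formula (16), $G = \sum_{v}\sum_{h} e^{-E(v, h)}$ represents the partition function.
The marginal distribution of the visible layer is
$$P(v) = \frac{1}{G} \sum_{h} e^{-E(v, h)} \tag{17}$$
The corresponding conditional probability distributions of the visible layer and the hidden layer are, for binary units,
$$P(v_i = 1 \mid h) = R\Big(a_i + \sum_{j} w_{ij} h_j\Big) \tag{18}$$
$$P(h_j = 1 \mid v) = R\Big(b_j + \sum_{i} v_i w_{ij}\Big) \tag{19}$$
The logistic function R is introduced in order to calculate the activation probability of a single node. Its expression is
$$R(x) = \frac{1}{1 + e^{-x}} \tag{20}$$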
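The RBM quantities above can be checked numerically with a small sketch. It assumes the standard binary RBM, with visible biases `a`, hidden biases `b`, and a weight matrix `w[i][j]` linking visible unit i to hidden unit j; these names and the tiny example network are illustrative assumptions.

```python
import math


def logistic(x):
    """Logistic function R(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def energy(v, h, w, a, b):
    """Energy of a joint configuration (v, h) of a binary RBM."""
    visible_term = sum(a_i * v_i for a_i, v_i in zip(a, v))
    hidden_term = sum(b_j * h_j for b_j, h_j in zip(b, h))
    interaction = sum(
        v[i] * w[i][j] * h[j] for i in range(len(v)) for j in range(len(h))
    )
    return -(visible_term + hidden_term + interaction)


def hidden_activation_probs(v, w, b):
    """P(h_j = 1 | v) for every hidden unit."""
    return [
        logistic(b[j] + sum(v[i] * w[i][j] for i in range(len(v))))
        for j in range(len(b))
    ]


def visible_activation_probs(h, w, a):
    """P(v_i = 1 | h) for every visible unit."""
    return [
        logistic(a[i] + sum(w[i][j] * h[j] for j in range(len(h))))
        for i in range(len(a))
    ]


# Tiny 2x2 network with made-up parameters, used only to exercise the functions.
w = [[1.0, -2.0], [0.5, 0.5]]
a = [0.1, 0.2]
b = [0.3, 0.4]
probs = hidden_activation_probs([1, 0], w, b)
```

Because the units within a layer are not connected to each other, each activation probability depends only on the other layer, which is what allows a whole layer to be updated in parallel.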
3.4. K-Nearest Neighbor (KNN)
KNN selects, according to distance, the k known samples closest to a given unknown sample and infers the class of the unknown sample from the classes of these k neighbors. Such methods can handle different samples to be classified by building a different local approximation of the target function for each one [21, 22], so even a complex objective function can be described by relatively simple local approximations. Because KNN determines the class from a limited number of surrounding samples rather than by discriminating class domains, it is also convenient to operate. For sample sets whose class domains intersect or overlap heavily, KNN is more suitable than other methods, which is an advantage of the algorithm. The schematic diagram and flowchart of the KNN algorithm are shown in Figure 6.
![(a)](https://static-preview.hindawi.com/articles/misy/volume-2022/4553446/figures/4553446.fig.006a.jpg)
(a)
![(b)](https://static-preview.hindawi.com/articles/misy/volume-2022/4553446/figures/4553446.fig.006b.jpg)
(b)
As shown in Figure 6, the KNN algorithm selects the nearest samples for data analysis and assigns the most frequent class among them to the sample to be tested.
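The basic procedure can be sketched with the Python standard library. This is a plain, unoptimized KNN using Euclidean distance and majority voting; the function name and the toy data are assumptions for illustration and do not come from the experiments in this paper.

```python
import math
from collections import Counter


def knn_classify(query, samples, labels, k):
    """Classify query by majority vote among its k nearest labelled samples."""
    if k < 1 or k > len(samples):
        raise ValueError("k must be between 1 and the number of samples")
    # Sort all known samples by Euclidean distance to the query.
    by_distance = sorted(
        (math.dist(query, sample), label)
        for sample, label in zip(samples, labels)
    )
    nearest_labels = [label for _, label in by_distance[:k]]
    # The most frequent class among the k neighbours is the prediction.
    return Counter(nearest_labels).most_common(1)[0][0]


# Toy data: two well-separated clusters labelled "A" and "B".
samples = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
labels = ["A", "A", "A", "B", "B", "B"]
prediction = knn_classify((0.5, 0.5), samples, labels, k=3)
```

The sort over all samples is the expensive part, which is why the optimizations discussed later focus on reducing dimensionality and splitting the search.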
4. Molecular Dynamics Simulation Performance Optimization Model Experiment Based on Machine Learning and Data Mining Algorithm
4.1. Molecular Dynamics Model Performance Optimization Method Design
The calculation time is first considered in the optimization of the model, so the calculation program is designed to be parallelized [23]. Since this optimization will use the KNN algorithm, KNN is optimized to realize the parallelization of molecular dynamics simulation in the processor. The direction of improvement is to save time and cost and speed up the search. According to the characteristics of the KNN algorithm, the algorithm is optimized from these two aspects. The first step is to reduce the dimension of the training set, and the second step is to block the big data. In general, the optimization is carried out from two aspects: classification accuracy and classification efficiency. The specific optimization directions are shown in Table 1.
According to Table 1, the classification accuracy of KNN can be improved through weighting. The first approach is feature weighting, in which each attribute in the sample data set is weighted according to its frequency in the sample. The second is to weight each sample according to its calculated distance. Besides classification accuracy, KNN can also be optimized in terms of classification efficiency, mainly by reducing the computation time of the algorithm [24]. Since high-dimensional data prolongs the calculation, the dimension of the samples can be reduced, and the search for neighbor samples can be accelerated during operation. This paper optimizes the algorithm in terms of classification efficiency.
The schematic diagram of the optimized KNN algorithm is shown in Figure 7.
![](https://static-preview.hindawi.com/articles/misy/volume-2022/4553446/figures/4553446.fig.007.jpg)
As shown in Figure 7, the optimized KNN mainly performs parallel computing. The effect verification of the optimized KNN is performed through the data set analysis in the machine learning library and comparing with the model before optimization. First, the classification accuracy at different k values is compared, and the results are shown in Figure 8.
![](https://static-preview.hindawi.com/articles/misy/volume-2022/4553446/figures/4553446.fig.008.jpg)
As shown in Figure 8, under different k values, the classification accuracy before and after optimization does not change. That is to say, although the optimized KNN speeds up the classification, it does not affect the classification accuracy.
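One way to see why blocking and parallelizing the search leaves the accuracy unchanged is that the k global nearest neighbors must each appear among the k nearest neighbors of their own block. The sketch below splits the data into blocks, finds the candidates of each block concurrently, and merges them. It is an illustration under assumptions, not the implementation evaluated in this paper: threads are used only to keep the example short (in CPython they give little speed-up for CPU-bound work, so a process pool or native code would be used in practice), and all names are hypothetical.

```python
import heapq
import math
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def block_candidates(query, block, k):
    """The k nearest (distance, label) pairs within one block of the data."""
    return heapq.nsmallest(
        k, ((math.dist(query, sample), label) for sample, label in block)
    )


def blocked_knn_classify(query, samples, labels, k, n_blocks=4):
    """KNN in which each block of the training data is searched concurrently."""
    data = list(zip(samples, labels))
    if not data:
        raise ValueError("at least one labelled sample is required")
    size = max(1, math.ceil(len(data) / n_blocks))
    blocks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=len(blocks)) as pool:
        partials = pool.map(lambda blk: block_candidates(query, blk, k), blocks)
        # Merging the per-block candidates recovers the global k nearest.
        merged = heapq.nsmallest(k, (c for part in partials for c in part))
    return Counter(label for _, label in merged).most_common(1)[0][0]


samples = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
labels = ["A", "A", "A", "B", "B", "B"]
prediction = blocked_knn_classify((0.5, 0.5), samples, labels, k=3, n_blocks=3)
```

Since the merged candidate set always contains the true k nearest neighbors, the vote is identical to that of the serial algorithm, and only the running time changes.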
Then the running time of the algorithm is compared, and the result is shown in Figure 9.
![](https://static-preview.hindawi.com/articles/misy/volume-2022/4553446/figures/4553446.fig.009.jpg)
As shown in Figure 9, the advantage of the optimized KNN over the KNN algorithm before optimization becomes more obvious as the data set grows. When the data set is 500 M, the optimized KNN runs 116.5 s faster than the version before optimization. Comparing the optimized KNN with the optimized and parallelized KNN, the data in the figure also show that the running time after parallelization is greatly shortened.
The feasibility of the KNN parallelization method is proved through the optimization experiment analysis of the KNN algorithm. When the efficiency of classification increases, although the running time is greatly reduced, it has no effect on the accuracy of the results. In this paper, the molecular dynamics simulation will be optimized according to the optimized KNN algorithm and RBM. Firstly, the probability distribution of the cells in the particle system is calculated by RBM. The three-dimensional coordinates of each particle cell are obtained, and the distance is calculated. KNN is used to classify the particles in this process.
The CPU load while the two models run also reflects the gap between the model that does not use the KNN algorithm and the model that uses it. The CPU load of the two models during operation is shown in Figure 10.
![(a)](https://static-preview.hindawi.com/articles/misy/volume-2022/4553446/figures/4553446.fig.0010a.jpg)
(a)
![(b)](https://static-preview.hindawi.com/articles/misy/volume-2022/4553446/figures/4553446.fig.0010b.jpg)
(b)
Comparing the CPU load in the two cases shows that, without the KNN algorithm, the load is relatively high, reaching 76%, whereas with the KNN algorithm it is only 62%. Adopting the KNN algorithm therefore lowers the CPU load by 14 percentage points and reduces the burden on the computer, which also indicates that the model using the KNN algorithm is better optimized.
4.2. Molecular Dynamics Simulation Optimization Experiment
This section reports experiments on the optimization model based on ML and data mining described above. The experimental indicators follow the key steps of the molecular dynamics simulation introduced earlier, because its performance is determined mainly by the calculation of the interaction force between particles, the solution of the motion trajectory equation, and the equilibrium judgment. The experiments analyze these three aspects for systems with different particle numbers.
The experiment is divided into two groups. Group A is the original molecular dynamics simulation, and group B is the simulation optimized by ML and data mining. The results of the two groups are compared to verify whether the performance has been improved. Each indicator is measured by its calculation time.
The time results of the force calculation are shown in Table 2.
It can be seen from the results in Table 2 that the calculation time of the interaction force between particles in the optimized simulation system is significantly reduced, indicating that the added algorithm makes the program run more efficiently.
The time results of solving the motion trajectory equation are shown in Table 3.
From the results in Table 3, the optimized molecular dynamics simulation significantly reduces the time required to solve the particle’s trajectory equation during the calculation process.
The running time of the approach to equilibrium judgment is shown in Table 4.
As shown in Table 4, in the case of different particle numbers, the judgment time of tending to equilibrium is not long compared with the previous two indicators, but the running time of the optimized simulation system is greatly reduced.
To compare the performance before and after optimization more clearly, the running-time efficiency ratio is calculated. The efficiency ratio is the reduction in running time (the running time before optimization minus the running time after optimization) divided by the running time before optimization, so a positive value means that the optimized system is faster. The results are shown in Figure 11.
![](https://static-preview.hindawi.com/articles/misy/volume-2022/4553446/figures/4553446.fig.0011.jpg)
As shown in Figure 11, the efficiency ratios before and after optimization are all positive. In the force calculation, the optimization effect is best when the number of particles is 4096, where the efficiency ratio reaches 31.15%, and the lowest efficiency ratio, 10.52%, occurs when the number of particles is 8192. In the solution of the motion trajectory equation, the efficiency ratio is highest, at 30.28%, when the number of particles is 512 and lowest, at 8.03%, when the number of particles is 8192. In the equilibrium judgment, the efficiency ratio is highest, at 36.96%, when the number of particles is 256 and lowest, at 8.12%, when the number of particles is 4096. These data show that the optimized model improves all three main performance indicators of molecular dynamics simulation. The size of the improvement varies with the number of particles, which may be related to the properties of these three calculations and the methods used to perform them [25]. Overall, the optimization scheme in this paper is feasible.
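The efficiency ratio used in this comparison can be computed as follows. The sketch assumes the ratio is the running time saved divided by the running time before optimization, expressed in percent; the function name and the example timings are hypothetical and are not measurements from this paper.

```python
def efficiency_ratio(time_before, time_after):
    """Percentage of running time saved by the optimization."""
    if time_before <= 0:
        raise ValueError("running time before optimization must be positive")
    return (time_before - time_after) / time_before * 100.0


# Made-up timings: a run that drops from 10.0 s to 7.5 s saves 25% of the time.
ratio = efficiency_ratio(10.0, 7.5)
```

A ratio of zero means no change, and a negative ratio would indicate that the optimized system is slower than the original.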
5. Conclusions
Starting from the current state of molecular dynamics simulation, this paper proposes that the simulation method can be optimized in terms of calculation time. Based on a theoretical analysis of molecular dynamics simulation and the widespread use of machine learning and data mining algorithms in computing, these algorithms are applied to optimize the performance of molecular dynamics simulation. The parallelization and optimization design of the KNN algorithm confirms that KNN can be parallelized, and the optimized KNN algorithm and RBM are then applied in molecular dynamics simulation. The optimization model is designed and analyzed experimentally, with three test indicators: the calculation time of the interaction force between particles, the calculation time of the motion trajectory equation, and the running time of the equilibrium judgment. Comparing the results of these three indicators before and after optimization shows that the optimized simulation system reduces the running time in all three. The efficiency ratio curves drawn from the experimental results show that the efficiency ratio varies with the number of particles. For the force calculation, the highest efficiency ratio is 31.15%, reached when the number of particles is 4096. For the solution of the motion trajectory equation, the highest efficiency ratio is 30.28%, reached when the number of particles is 512. For the equilibrium judgment, the highest efficiency ratio is 36.96%, reached when the number of particles is 256.
The best optimization effect for each of the three indicators thus occurs at a different particle number, which may be related to differences in how the three quantities are calculated. Overall, the performance optimization of molecular dynamics simulation based on machine learning and data mining techniques in this paper is effective.
Data Availability
Data sharing is not applicable to this article as no new data were created or analyzed in this study.
Conflicts of Interest
The authors state that this article has no conflicts of interest.
Acknowledgments
This work was supported by Higher Vocational School Program for Key Teachers from Department of Education of Henan Province, China (2019GZGG042).