Abstract
Predicting the remaining useful life (RUL) of a battery is critical to ensure the safe management of its manufacture and operation. In this study, a comprehensive investigation of the effect of data partitioning methods on RUL prediction was performed. To confirm the generality and transferability, charge–discharge data for cells with cathode materials of different chemical compositions were adopted from previous research, including lithium iron phosphate, lithium nickel cobalt aluminum oxide, and lithium nickel cobalt manganese oxide cells. Among the partitioning procedures, the method of adding data predicted by the surrogate model to the training set exhibited the best accuracy, with an average mean absolute error (MAE) of 47 cycles. In contrast, the sliding BOX method, which used only a certain number of cycles before the test set as the training set, exhibited the worst MAE of 60 cycles. In conclusion, the proposed data partitioning methods could be implemented to predict the RUL of batteries to develop next-generation cathode materials with improved performance and stability, shorten the quality assessment time, and achieve stable predictive maintenance.
1. Introduction
Lithium-ion batteries (LIBs) have numerous advantages, including high energy and power densities and a low self-discharge rate. For this reason, LIBs have been implemented in various fields, including electric vehicles, smartphones, and energy storage systems [1–3]. However, the repeated charging and discharging of LIBs reduces the available battery capacity and increases the impedance, which significantly degrades battery performance. This degradation behavior varies depending on the charging/discharging rate, temperature, and voltage conditions [4, 5]. Thus, assessing the health condition of a battery (its state of health (SOH) and remaining useful life (RUL)) is a fundamental task in the field of battery research to achieve successful commercialization. The SOH indicates the degree of degradation and the available capacity of the battery. This value is initially 100% for a freshly manufactured battery but gradually decreases with repeated charge–discharge cycles. Normally, an SOH of 80% is taken as the end of life of a battery, and the RUL indicates how many cycles remain until that point is reached. Usually, a battery-cycling test is conducted to determine the possible aging behavior during the manufacture of LIBs. However, an inspection to identify potential degradation requires repeated full charging and discharging cycles, which takes a significant amount of time (note that hundreds of cycles must be performed). Therefore, many studies have focused on SOH and RUL estimation and prediction to overcome this obstacle using various computational and analytical approaches [6–9].
Many approaches have been suggested for estimating the SOH and RUL, which can be divided into two categories. The first is to describe the charge and discharge cycling behavior using a battery model, i.e., model-based methods. Two representative approaches of this type are the equivalent circuit model and the electrochemical model. The equivalent circuit model (ECM) [10, 11] is widely used to describe the operation of batteries. An ECM can derive the battery status and parameters using a small number of inputs, including the open-circuit voltage and ohmic resistance. Therefore, this type of model is simple, and the derivation of battery parameters and state estimation can be achieved easily and quickly. For example, to simulate battery dynamics, Eddahech et al. [12] constructed an ECM using the internal resistance, state of charge (SOC), and temperature based on electrochemical impedance spectroscopy (EIS) measurements, and the RUL of the battery was estimated. In another paper, EIS estimation was conducted based on a fractional-order equivalent circuit model (FOM) that can be implemented online. A regression model was constructed based on the estimated EIS spectrum, and RUL prediction followed. As a result, the FOM method showed better prediction accuracy than conventional integer-order models [13]. However, the simplicity of an ECM makes it difficult to obtain more comprehensive battery parameters. In addition, an ECM has the disadvantage that the battery operating conditions need to be designed considering simple safety constraints on the maximum and minimum voltage and current. However, these simple constraints cannot prevent the safety problems of LIBs caused by degradation [14].
Electrochemical models, the other class of model-based methods, describe the internal chemical and physical reactions of the battery cell and require information on the actual electrode material and electrolyte configuration [15–18]. This type of model is widely used in research on the SOC, SOH, and RUL because it provides a variety of complex parameters compared with an ECM. For example, a simple electrochemical model was introduced to accurately estimate the SOH of a battery, where the battery's physical parameters were generated from the model parameters and used for the SOH estimation [19]. An enhanced single-particle model- (eSPM-) based electrochemical technique was developed to estimate the RUL of LMO-NMC battery cells under different operating conditions [20]. A new mechanism for predicting the RUL of lead-acid batteries was developed through the combination of an electrochemical model and a particle filtering framework; this methodology is also applicable to lithium-ion batteries and fuel cells [21]. In addition, an electrochemical-model-based particle filter framework for lithium-ion batteries was developed, which exhibited excellent RUL-prediction accuracy for aged LFP-graphite-based LIB cells [22].
The second method, which is rapidly emerging and replacing the previous approaches, is the data-driven approach, which has various advantages, including the ability to estimate the battery condition without predefined modeling. With the recent development of machine learning and deep learning, this method has been extensively applied to predict the SOH and RUL using simple features such as the current, voltage, temperature, and time from battery cycling databases [23–31]. For example, a machine-learning pipeline was designed to predict the SOH of 179 battery cells, with a root mean squared error (RMSE) of 0.45% [26]. A model that effectively performed RUL prediction by combining empirical mode decomposition and the autoregressive integrated moving average model was also developed [27]. An RUL prediction model was constructed by combining the sliding-window gray model and the linear optimization resampling particle filter (LOPRF), and it was found to be more effective in RUL prediction than a standard particle filter or a basic LOPRF model [32]. In addition, the LSTM and GRP models were employed simultaneously for RUL prediction, and the combined approach outperformed the individual models [33].
As previously discussed, data-driven methods are widely used to estimate and predict battery conditions. However, the data partitioning required before applying such methods has not been fully investigated. Configuring the training set for the construction of the prediction model is one of the most critical steps and determines the model accuracy and applicability. Therefore, many studies have examined how the prediction accuracy improves with the training data configuration. For example, data augmentation and supplementary variables have been shown to increase prediction accuracy and reduce computational cost [34]. In the field of battery RUL prediction, improved accuracy has been achieved through the integration of physics-based modeling and machine learning, and the feasibility, advantages, and limitations of various integrated architectures have been discussed [35]. As an example, Thelen et al. developed a hybrid model that combines an RUL prediction model with a model for correcting RUL prediction errors, leading to an improvement in RUL prediction accuracy [36]. Based on previous research, it is important to note that both the battery compositions and the data partitioning methods employed vary widely. In this regard, a systematic analysis of how the prediction accuracy of data-driven approaches is affected by the choice of data partitioning method is critical, since the training database configuration is a core component determining the prediction performance. With such an analysis, an appropriate partitioning method can be selected according to the cell configuration and data availability when developing next-generation cathode materials. Therefore, we collected cycling data for various battery compositions and introduced several partitioning methods. Subsequently, using a machine-learning model, the changes in predictive performance after applying these partitioning methods were compared by predicting the RULs of battery cells of various compositions. This analysis clearly shows which data partitioning method is effective when conducting RUL prediction for batteries of various compositions.
2. Methods
In this study, we used battery-cycling data for LFP, NCA, and NMC cells provided by Sandia National Laboratories [37]. The data were obtained under various temperature conditions, charging rates, and discharge rates, as listed in Table S1 (Supporting Information (SI)). The RUL was calculated by defining the end-of-life cycle as the point at which the SOH reported in the corresponding experiment reached 80%, and data for 4 LFP cells, 12 NCA cells, and 11 NMC cells were extracted.
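As an illustration of this labeling step, the minimal sketch below derives per-cycle RUL values from an SOH trace. It assumes a pandas DataFrame with one row per cycle and "cycle" and "soh" columns (illustrative names, not those of the Sandia release) and that the SOH is stored as a fraction of the initial capacity.

```python
import pandas as pd

def add_rul_label(cell_df: pd.DataFrame,
                  soh_col: str = "soh",
                  eol_threshold: float = 0.80) -> pd.DataFrame:
    """Attach an RUL label to every cycle of one cell.

    The end-of-life (EOL) cycle is taken as the first cycle at which the
    SOH falls to or below the threshold (80% here); the RUL of each
    earlier cycle is the number of cycles remaining until EOL.
    """
    df = cell_df.sort_values("cycle").reset_index(drop=True)
    reached = (df[soh_col] <= eol_threshold).to_numpy().nonzero()[0]
    if reached.size == 0:
        raise ValueError("Cell never reaches the EOL threshold in this record.")
    eol_cycle = df["cycle"].iloc[reached[0]]
    labeled = df[df["cycle"] <= eol_cycle].copy()
    labeled["rul"] = eol_cycle - labeled["cycle"]
    return labeled
```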
Several partitioning methods used in recently published papers were compared. They can generally be divided as follows: (a) some cell cycling data are used for the training set, while other cell cycling data are used for the test set [26]. This method has the disadvantage of requiring significant cycling data to obtain an accurate prediction. (b) The initial data of one cell cycle are used as a training set, and the remaining cell cycle data are used as a test set [23]–[25]. (c) The data from a certain number of cycles before the cycles to be predicted are used as a training set for predicting the capacity [38]. Methods (b) and (c) have the disadvantage that the initial cycling data must be obtained first and are then used to train the model again for application to a new cell. In addition to these three methods, we considered two other possible partitioning methods. Thus, we compared five partitioning methods, which are described in Figure 1. (i)Random method (RND): RND integrates all battery charging/discharging cycling data into one data frame regardless of battery cell type. Then, randomly extract training and test sets using one integrated cycling data. This work randomly selected 80% of integrated cycling data and used them as a training set, and the remaining 20% data were used as a test set(ii)-Separate method (SEP): unlike RND, SEP extracts battery charging/discharging cycling data without integrating it. For example, if we have charge/discharge cycling data for 10 battery cells, the data of some battery cells is used as a training set and the rest as a test set. In this paper, 75% data of battery cell were used as a training set, and the remaining 25% of battery cells were used as a test(iii)-Initial method (INI): INI uses the data of one battery cell after the charge/discharge cycle. The training set is constructed using the initial % cycle data, and the remaining cycle data (100%-%) are used as test data(iv)-Slide BOX (BOX): like INI, BOX uses charge/discharge cycling data for one battery cell. Training set consists with the initial % data from the cycle data of the battery cell. The range of data to be used for training is then changed and predicted without increasing the data for the training set. For example, in the case of a battery cell tested for 100 cycles, data from 0 to 20 cycles are trained and the remaining cycles (21 to 100 cycles) are predicted. After that, the range of the training set is changed to the range of 10 to 30 cycles of data and predict 31 to 100 cycles(v)-Predicted data adding method (ADD): ADD also uses charge/discharge cycling data for one battery cell, similar to INI and BOX. The RUL for the next cycle is predicted using the initial % cycle data in the training set. The ML model is then trained again by adding the predicted RUL value to the training set, and the RUL for the next cycle is predicted again
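The following minimal sketches illustrate the RND, SEP, INI, and BOX splits, assuming the cycling data are held in a pandas DataFrame with "cell_id" and "cycle" columns (illustrative names, not those of the original pipeline); the iterative ADD scheme is sketched later in Section 3.5.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def rnd_split(data: pd.DataFrame, test_size=0.2, seed=0):
    """RND: pool every cell's cycles and split them at random (80/20 here)."""
    return train_test_split(data, test_size=test_size, random_state=seed)

def sep_split(data: pd.DataFrame, test_cells):
    """SEP: whole cells go to either the training set or the test set."""
    is_test = data["cell_id"].isin(test_cells)
    return data[~is_test], data[is_test]

def ini_split(cell: pd.DataFrame, train_frac=0.2):
    """INI: the first x% of one cell's cycles train; the rest are tested."""
    cell = cell.sort_values("cycle")
    n_train = int(len(cell) * train_frac)
    return cell.iloc[:n_train], cell.iloc[n_train:]

def box_split(cell: pd.DataFrame, start_frac=0.1, width_frac=0.2):
    """BOX: a fixed-width window of cycles trains; all later cycles are tested."""
    cell = cell.sort_values("cycle")
    lo = int(len(cell) * start_frac)
    hi = lo + int(len(cell) * width_frac)
    return cell.iloc[lo:hi], cell.iloc[hi:]
```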

3. Results and Discussions
To develop an ML model for predicting the RUL of a battery and compare the performances of various data partitioning methods, this study was conducted based on the framework shown in Figure 2. First, we obtained battery charge/discharge data for various compositions from Sandia National Laboratories [37]: charge and discharge data for 4 LFP, 12 NCA, and 11 NMC batteries. Feature engineering was then performed to process the cycling data for use as inputs to the ML model. We adopted the feature generation method that has previously been used to predict battery SOH [26]. The features were generated from data such as the voltage, current, and time recorded while charging and discharging a battery. Among these, features related to constant current–constant voltage (CC–CV) charging, which is the standard charging protocol for LIBs, were mainly used. Additionally, we calculated and added features representing the mean, maximum, minimum, and standard deviation of the temperature during charge/discharge operation from the Sandia National Laboratories cell data. Finally, the RUL label of the battery was derived from the SOH of each cycle. We set the threshold for calculating the RUL at an SOH of 80%, and the number of cycles remaining from each cycle to that threshold was taken as the RUL of the battery. Because the battery RUL was calculated in cycle units, cycle-related features could not be used in the training set and were therefore excluded. A more detailed explanation of the feature generation and a list of the generated features are presented in Figure S1 and Table S2 (SI). We checked the correlation between each of the created features and the RUL, and the results are shown in Table S3 (SI). The highest correlation was obtained for the feature related to the constant-current charge time, and the features related to the lagged cumulative discharge showed the lowest correlation.
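As a hedged sketch of this feature-engineering step, the snippet below computes the temperature statistics and an approximate constant-current charge time for a single cycle. The column names ("temperature_c", "current_a", "voltage_v", "time_s") and the 4.2 V CV cutoff are illustrative assumptions, not values from the original pipeline.

```python
import pandas as pd

def temperature_features(cycle_raw: pd.DataFrame) -> dict:
    """Summary statistics of the temperature trace of one charge/discharge cycle."""
    t = cycle_raw["temperature_c"]  # illustrative column name
    return {"temp_mean": t.mean(), "temp_max": t.max(),
            "temp_min": t.min(), "temp_std": t.std()}

def cc_charge_time(cycle_raw: pd.DataFrame, v_cutoff: float = 4.2) -> float:
    """Approximate the constant-current charge time as the time from the start
    of charging until the terminal voltage first reaches the CV cutoff
    (the cutoff value and column names are assumptions)."""
    charging = cycle_raw[cycle_raw["current_a"] > 0]
    at_cv = charging[charging["voltage_v"] >= v_cutoff]
    if at_cv.empty:  # cycle ended before reaching the CV phase
        return charging["time_s"].max() - charging["time_s"].min()
    return at_cv["time_s"].iloc[0] - charging["time_s"].iloc[0]
```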

First, we compared the performances of various ML algorithms using PyCaret to select an accurate model for RUL prediction [39]. The predictive accuracy of the ML algorithms was evaluated using the RND method: all the cycle data of the 27 battery datasets (LFP, NCA, and NMC) were split randomly into training and validation sets at a 7 : 3 ratio, and cross-validation was performed five times. As shown in Table S4 (SI), when trained using the randomly divided dataset, the tree-based models that are traditionally used in the ML field show high accuracy. The tree models show similar accuracies, with the extra trees (ET) model showing the highest accuracy, with a mean absolute error (MAE) of 2 cycles. The linear models show lower accuracy than the tree models. (It is noted that in the case of the INI, SEP, and ADD methods, the linear models exhibit good performance because the RUL normally decreases in an approximately linear fashion with the charge/discharge time and cycle number.) Therefore, we chose ET and least absolute shrinkage and selection operator (LASSO) regression because they showed the highest prediction accuracies among the tree-based and linear models, respectively. In addition, recent related studies have often used support vector machines (linear SVR) for RUL prediction; thus, this model was also employed [40, 41]. We conducted hyperparameter optimization of the selected LASSO model using PyCaret and of the linear SVR using a grid search. Finally, the ET model was used with its default settings because it already showed very high prediction accuracy; thus, further hyperparameter tuning was not necessary. The factors considered when tuning the hyperparameters of each model are listed in Table S5 (SI).
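A minimal scikit-learn sketch of this model comparison and of the linear SVR grid search is given below; it assumes a prebuilt feature matrix X and RUL label vector y, and the searched grid is illustrative rather than the one reported in Table S5.

```python
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import LinearSVR

# Cross-validated MAE for the three retained models (X, y assumed built above).
models = {
    "ET": ExtraTreesRegressor(random_state=0),
    "LASSO": Lasso(),
    "LinearSVR": LinearSVR(max_iter=10000),
}
for name, model in models.items():
    mae = -cross_val_score(model, X, y, cv=5,
                           scoring="neg_mean_absolute_error").mean()
    print(f"{name}: cross-validated MAE = {mae:.1f} cycles")

# Grid search for the linear SVR (the grid shown is illustrative only).
svr_search = GridSearchCV(
    LinearSVR(max_iter=10000),
    param_grid={"C": [0.1, 1, 10], "epsilon": [0.0, 0.1, 1.0]},
    scoring="neg_mean_absolute_error", cv=5)
svr_search.fit(X, y)
```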
3.1. Random (RND) Method
We compared five combinations to determine how the prediction results relate to the battery composition. In the first case, the training used all the battery cell data, regardless of composition. The second case involved training using the NCA and NMC cell data, which have similar compositions. In the remaining three cases, the data for each composition (LFP, NCA, or NMC) were used alone. Training and test sets were randomly extracted from each of the five combinations at a ratio of 8 : 2. To confirm the robustness of the surrogate model, we randomly changed the configuration of the training and test sets using ten different random states. Cross-validation was then performed through these ten predictions, and the prediction accuracy was obtained by averaging the results. The prediction accuracy is shown in Figure 3(a) and Table S6 (SI) (RMSE values are provided in Figure S2). First, it was confirmed that RND showed high prediction accuracy overall. Figure 3(b) shows that the predictions were very similar to the actual RUL, except for a few outliers. The highest accuracy was observed when predicting values for the NCA cells, with an average MAE of 1.4 cycles. The lowest prediction accuracy was obtained for the LFP and NMC cells, with an MAE of 3.2 cycles. When the prediction was performed without dividing the data according to battery composition, combining the NCA and NMC cells produced an MAE of 2.8 cycles, and using all of the data together produced an MAE of 3.1 cycles. The MAE values show that lower prediction accuracy was obtained when using the LFP or NMC cell data alone, with higher prediction accuracy obtained when using the NCA cell data alone. However, in the case of the LFP data, the average number of cycles to reach the RUL threshold (approximately 3,500 cycles) was much greater than that for the NCA or NMC data (approximately 300 cycles). Therefore, considering the magnitude of the MAE relative to the total cycle count, it cannot be concluded that the LFP prediction was less accurate than those for the other cells. For further clarification, the mean absolute percentage error (MAPE) was calculated, as shown in Table S6 (SI): the MAPE for the LFP cells was 0.0049%, compared with 0.025% for NCA and 0.0276% for NMC, indicating that the LFP cells remain well predictable on a relative basis. The RND method, which showed such high prediction accuracy, is a very general partitioning method and is easily accessible. However, it has the problem that the full charge/discharge test of the battery must be performed before predictions can be made. Therefore, there will not be many cases in which the RND method can be used in practical applications.
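The repeated-random-state evaluation described above can be sketched as follows, again assuming a feature matrix X and RUL labels y for the chosen combination of cells.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

def rnd_average_mae(X, y, n_states=10, test_size=0.2):
    """Average test MAE of the RND scheme over ten random 80/20 splits."""
    maes = []
    for seed in range(n_states):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, random_state=seed)
        model = ExtraTreesRegressor(random_state=seed).fit(X_tr, y_tr)
        maes.append(mean_absolute_error(y_te, model.predict(X_te)))
    return float(np.mean(maes))
```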

3.2. Separate (SEP) Method
To overcome the shortcomings of the RND method described above, we proceeded with prediction using SEP. In the case of SEP, the data for some cells are added to the training set, and the data for the remaining cells are used as the test set. For example, in the case of the NCA cells, the data for 9 of the 12 cells were used for the training set, and the data for the remaining 3 cells were used for the test set. A 3 : 1 ratio was used for the training and test sets by dividing the data for the 4 LFP, 12 NCA, and 11 NMC cells. The RUL prediction was conducted using the ET model, which showed the highest accuracy during model selection.
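A minimal sketch of such a cell-level split is shown below, using scikit-learn's GroupShuffleSplit; the variables X, y, and groups (a per-row Series of cell IDs) are assumptions carried over from the earlier sketches.

```python
from sklearn.model_selection import GroupShuffleSplit

# SEP sketch: whole cells are assigned to one split or the other, so no
# cycle of a test cell is ever seen during training.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=groups))
X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]
```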
The prediction results are shown in Figures 4(a) and 4(d). The highest prediction accuracy was achieved for the NCA cells, with an MAE of 86 cycles. When trained using the LFP data alone, the MAE was 111 cycles. The lowest prediction accuracy appeared when the prediction model was constructed by combining the data for all the cell compositions, with an MAE of 180 cycles. Hence, in general, constructing surrogate models by training on the LFP, NCA, and NMC cells separately gave better prediction accuracy. However, in the case of the NMC cells, the prediction improved when the model was trained in combination with the NCA cells: when predicting using the NMC data alone, the MAE was 145 cycles, whereas using the NCA and NMC data together gave an MAE of 130 cycles, an improvement of 15 cycles. This could be due to the expansion of the training set with cycling data for the NCA and NMC cells, which have similar chemistries.

We then investigated the cause of the relatively low prediction accuracy of the SEP model when tested under various conditions, such as different cell temperatures and charge/discharge rates. To do so, we created prediction models using the data for only one cell of each composition as the test set and the remaining data as the training set. For example, in the case of the NCA cells, the datasets for 11 of the 12 cells were used for the training set, with the data for the remaining cell used for the test set. The results are shown in Figures 4(b) and 4(e) and Table S7 (SI). (Please note that because data for only four LFP cells were available, this result was already obtained in Figure 4(a).) The NCA cells showed an MAE of 88 cycles, a level similar to the MAE in Figure 4(a). Figure 4(e) shows the prediction results for the NCA cell with the highest prediction accuracy. The NMC cells showed a value of 137 cycles, which was slightly lower than that shown in Figure 4(a). However, the model that used the NCA and NMC cell data together showed an MAE of 110 cycles, an improved prediction accuracy. The NCA and NMC results listed in Table S7 (SI) show that, in the case of NCA, most of the cells had errors below an MAE of 100 cycles; three cells had MAE values of 100 cycles or more. To clarify the potential reasons for this prediction error, the distributions of the individual features were compared. Notably, the distribution of the discharge capacity for NCA cells 1, 2, and 6, shown in Figure S3 (SI), differs from that of the other cells. For example, the discharge capacity for NCA cells 1 and 2 is concentrated at lower values than for the other cells. In addition, for NCA cell 3, the discharge capacity is generally larger than for the others. In this respect, the prediction accuracy for those cells could be worse because the model must extrapolate to an untrained region. In the case of NMC, more cells exceeded an MAE of 100 cycles than for NCA. This is because the 11 NMC cells were tested under the seven battery cycling test conditions specified in Table S1 (SI), two more conditions than for the 12 NCA cells, so the number of cells tested under the same conditions was small.
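For completeness, a minimal scikit-learn sketch of this per-cell (leave-one-cell-out) evaluation is given below, again assuming the X, y, and groups objects from the earlier sketches, with groups being a pandas Series of per-row cell IDs.

```python
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import LeaveOneGroupOut

# Each cell is held out in turn as the test set; the model trains on the rest.
per_cell_mae = {}
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=groups):
    model = ExtraTreesRegressor(random_state=0)
    model.fit(X.iloc[train_idx], y.iloc[train_idx])
    held_out_cell = groups.iloc[test_idx].iloc[0]
    per_cell_mae[held_out_cell] = mean_absolute_error(
        y.iloc[test_idx], model.predict(X.iloc[test_idx]))
```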
These results show that significant errors occurred when making predictions for specific cells. Thus, we attempted to improve the prediction accuracy of the SEP model by adding the cells with MAE values of more than 100 cycles to the training set; that is, the cells that showed low prediction accuracy in the previous scheme were fixed to the training set, and the remaining data were used as the test set. The results are shown in Figures 4(c) and 4(f). The NCA cells showed a very large improvement in accuracy compared with the previous two schemes: the MAE and MAPE for the NCA cells were 36 cycles and 0.823%, respectively, 50 cycles lower than the MAE shown in Figure 4(a). The model with the highest prediction accuracy for NCA is shown in Figure 4(f). For the models in which predictions were made using the NMC cells alone and in combination with the NCA data, no significant improvement in prediction accuracy was observed, with MAE values of 139 and 160 cycles, respectively. In the case of the NCA cell data, only three cells had MAE values exceeding 100 cycles, so adding them to the training set improved the prediction accuracy; in the case of NMC, however, it was difficult to obtain high accuracy because 7 of the 11 cells had MAE values exceeding 100 cycles.
In summary, we compared the differences between LFP, NCA, and NMC when making predictions for each chemistry and combination. LFP showed better prediction accuracy than NMC, even though it had the longest cycle life. Similar to the RND method, NCA showed the best prediction accuracy. In particular, in the case of NCA, when the cell data with poor prediction accuracy were fixed to the training set, the prediction accuracy improved significantly. In the case of NMC, the prediction accuracy was low because the cycling tests were performed under more varied charging/discharging conditions than for the other battery chemistries. To overcome this, a slight improvement in prediction accuracy could be obtained by training using the NCA and NMC data together, because these have similar chemistries. These results indicate that SEP requires a large amount of data for cells with a chemistry, and with charging/discharging conditions, similar to those of the cell to be predicted. Therefore, a broad database is required to use the SEP method.
3.3. Initial (INI) Method
Among the recently used partitioning methods, INI is useful for overcoming the large database requirement of SEP when predicting the RUL of a target cell. It uses the initial portion of each cell's data as the training set and the remaining data as the test set. We compared the prediction performances of three algorithms (ET, LASSO, and linear SVR). As shown in Figure 5, the ET model exhibited a very large MAE of 1,567 cycles when 10% of the LFP cell data was used as the training set. This motivated an examination of the prediction results shown in Figure S4 (SI). When using INI with the ET model and the initial 20% of the data as the training set, the remaining 80% of the cycles were not accurately predicted. In other words, the tree-based model could not predict values outside the range covered by the training data, so RUL values not included in the training set were unpredictable. Therefore, the ET model was not suitable for the INI method. Of the remaining two algorithms, the linear SVR model achieved the highest prediction accuracy on average.
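The INI evaluation for a single cell can be sketched as below; cell_X and cell_y are the per-cycle features and RUL labels of one cell, ordered by cycle (assumed names). The commented calls illustrate why the extrapolation limitation of the tree ensemble matters here.

```python
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.svm import LinearSVR

def ini_mae(cell_X, cell_y, model, train_frac=0.2):
    """INI evaluation for one cell: train on the first 20% of its cycles
    (in cycle order) and report the MAE on the remaining 80%."""
    n_train = int(len(cell_X) * train_frac)
    model.fit(cell_X.iloc[:n_train], cell_y.iloc[:n_train])
    pred = model.predict(cell_X.iloc[n_train:])
    return mean_absolute_error(cell_y.iloc[n_train:], pred)

# A tree ensemble cannot predict RUL values outside its trained range,
# so its INI error is far larger than that of the linear SVR:
# ini_mae(cell_X, cell_y, ExtraTreesRegressor(random_state=0))
# ini_mae(cell_X, cell_y, LinearSVR(max_iter=10000))
```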

The LFP cells showed a high MAE of 440 cycles when the ML models were trained using the initial 10% of the data, but a significantly lower MAE of 104 cycles was obtained when training using the initial 20% of the data. Likewise, the NCA and NMC cells showed higher prediction accuracy as the amount of initial data in the training set increased. The MAE for NCA was 123 cycles when the initial 10% of the data was used for training but improved to 87 cycles when the initial 20% was used. Similarly, the NMC model showed an MAE of 162 cycles with the initial 10% of the data and 100 cycles with 20%. Compared with the SEP results described above, where LFP and NMC showed MAE values of 111 and 145 cycles, respectively, the MAE values with INI improved to 104 and 100 cycles, respectively, when 20% of the data were used for training. In the case of NCA, however, a decrease in predictive accuracy was observed in both MAE and MAPE: the INI method using 20% of the data gave an MAE of 87 cycles and a MAPE of 2.845%, larger than the MAE of 36 cycles and MAPE of 0.823% from the SEP method. Nevertheless, the INI method shows reasonable prediction accuracy, even though it uses a much smaller training set than SEP.
We confirmed that INI showed high accuracy despite using less data than RND and SEP. The INI method made predictions using only the initial data of one battery in the training set, and better accuracies were obtained for LFP and NMC than with SEP. When the initial 20% of the data was used for training, LFP and NMC had MAE values of 104 and 100 cycles, respectively, and NCA had a value of 87 cycles, which was 51 cycles worse than that achieved with SEP.
3.4. Sliding BOX (BOX) Method
We used BOX to determine whether the prediction accuracy changes when predicting the RUL using only the data close to the prediction target cycles. BOX is a method that uses a fixed fraction of the cycles immediately preceding the test region of a cell as the training set and the remaining data as the test set. We conducted the RUL prediction using the linear SVR model, which had the highest predictive accuracy with INI, and the results are shown in Figures 6(a) and 6(b). We used 20% of the data from each cell as the training window and then predicted the test set as the window (BOX) moved.
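A minimal sketch of this sliding evaluation is shown below, using the same assumed cell_X and cell_y objects as before; the 10% step size is an illustrative assumption.

```python
from sklearn.metrics import mean_absolute_error
from sklearn.svm import LinearSVR

def box_maes(cell_X, cell_y, width_frac=0.2, step_frac=0.1):
    """BOX sketch: slide a fixed-width training window (20% of one cell's
    cycles) forward in steps and predict all cycles after the window."""
    n = len(cell_X)
    width, step = int(n * width_frac), max(1, int(n * step_frac))
    maes = []
    for lo in range(0, n - width, step):
        hi = lo + width
        model = LinearSVR(max_iter=10000).fit(cell_X.iloc[lo:hi], cell_y.iloc[lo:hi])
        pred = model.predict(cell_X.iloc[hi:])
        maes.append(mean_absolute_error(cell_y.iloc[hi:], pred))
    return maes
```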

The results showed that when predicting using the initial 20% of the data, the accuracy was consistent with that obtained when using the initial 20% of the data with INI. Overall, as the test set shrank, the MAE also decreased; however, in the case of the LFP cells, the MAE increased when predicting the second half (50%) of the data. Likewise, for NCA and NMC, the accuracy increased as the starting and ending cycles of the BOX moved back. Compared with INI, the overall accuracy was low: except for the last 90%–100% RUL region of LFP, where INI had an MAE of 32 cycles and BOX had an MAE of 22 cycles, the accuracy of the BOX model was lower on average for the rest of the cells. This was attributed to the smaller amount of training data used compared with the INI method. We used the BOX model in anticipation of improved RUL prediction accuracy in certain regions but confirmed that it did not contribute significantly to improving the accuracy.
3.5. Predicted Data Adding (ADD) Method
We found that the prediction accuracy of INI was higher than that of BOX because more of the initial data were used. Therefore, we implemented the ADD method, which was expected to further improve the predictions. In this method, the RUL of the next cycle is predicted immediately after the initial fraction of the data is used as the training set, and the predicted RUL value is then added back into the training set for the next prediction. As with the BOX method, the predictions were made using linear SVR, and the results are shown in Figures 6(c) and 6(d). The prediction accuracy was clearly higher when a greater amount of initial data was used. Similar to the INI results, for the LFP cells the prediction accuracy dropped slightly when 50%–70% of the data was used in the training set; otherwise, the prediction accuracy increased with the amount of initial data. When the initial 20% of the data was used, the MAE values for LFP, NCA, and NMC were 103, 57, and 105 cycles, respectively. In terms of the overall average, reusing the predicted RUL for training slightly improved the prediction accuracy of the ADD model, but the effect was not significant. In addition, because the ADD model is retrained every time a cycle is predicted, the time required for prediction was very long compared with the other partitioning methods described above.
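The recursive ADD loop can be sketched as follows, again on the assumed cell_X and cell_y objects; note that the value appended to the training labels is the model's own prediction, not the ground truth, which is also what makes the scheme comparatively slow.

```python
import numpy as np
import pandas as pd
from sklearn.svm import LinearSVR

def add_predict(cell_X, cell_y, train_frac=0.2):
    """ADD sketch: train on the initial 20% of one cell's cycles, predict the
    next cycle, append the predicted RUL to the training set, retrain, and
    repeat until every remaining cycle has been predicted."""
    n_train = int(len(cell_X) * train_frac)
    X_train = cell_X.iloc[:n_train].copy()
    y_train = list(cell_y.iloc[:n_train])
    predictions = []
    for i in range(n_train, len(cell_X)):
        model = LinearSVR(max_iter=10000).fit(X_train, y_train)
        next_rul = float(model.predict(cell_X.iloc[[i]])[0])
        predictions.append(next_rul)
        X_train = pd.concat([X_train, cell_X.iloc[[i]]])
        y_train.append(next_rul)  # the prediction, not the ground truth
    return np.array(predictions)
```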
3.6. Performance Comparison between Partitioning Methods
We summarize the results in Tables 1 and 2 and Figures S5–S7 to compare the INI, BOX, and ADD models, which showed similar tendencies among the five partitioning methods described above. These three methods showed similar levels of prediction accuracy, with ADD showing the highest accuracy on average, at an MAE of 46.8 cycles; INI followed with 46.9 cycles, and BOX showed a somewhat higher MAE of 59.9 cycles. The MAPE values confirm the same trend: ADD and INI show similar MAPE values, and BOX has slightly lower prediction accuracy. In the case of LFP, ADD showed an MAE of 54.2 cycles, slightly below the 55.8 cycles of INI, whereas for NCA and NMC, ADD was slightly higher than the MAE values of 46.5 and 38.2 cycles obtained with INI, respectively. This was because the predicted RUL values added to the training set with ADD were more accurate for LFP, whose cycle count is much longer than that of either NMC or NCA. Therefore, if time and battery data are abundant, the ADD model is preferable on average; however, it has the disadvantage of requiring more time than the other methods. If little battery charging and discharging data are available, a slightly lower accuracy can still be obtained using the BOX model.
Table 2 summarizes the average prediction accuracy of each partitioning method implemented in this study. Although RND shows the highest prediction accuracy, as previously described, its actual usage would not be practical in many cases because the entire cycling dataset of a cell must be available before it is randomly divided and predicted. In the case of SEP, the prediction accuracy is lower than that of the other models for all compositions except the NCA cells. However, because this method can utilize most of the available data, it can be inferred that its prediction accuracy could be improved by adding more data and using more representative features. In addition, if a sufficient amount of data is accumulated, the SEP method could be applied to an electric vehicle or an energy storage system to predict in advance when the battery should be replaced. Finally, the INI, BOX, and ADD models can be used when accumulated data do not exist. The ADD model can be used if enough time is available to improve the accuracy, and the BOX model can be used if the accumulated data are damaged or the available data are limited. All five methods have advantages and disadvantages in terms of prediction accuracy and time; thus, an appropriate method should be chosen depending on the conditions and situation.
4. Conclusions
We compared various partitioning methods for predicting the RUL of a battery. Five partitioning methods were compared, along with their advantages and performances. In conclusion, the RND method showed the highest accuracy among the partitioning methods compared, but it can only be used under limited conditions. Among the remaining methods, the highest prediction accuracy was found with ADD, which had an MAE of 47 cycles. However, in the case of the ADD method, the time consumed was very high compared with the other methods because the predicted RUL values were used for retraining. INI, a similar method, used only the initial database and showed almost the same level of prediction accuracy as ADD. BOX showed lower prediction accuracy than the INI and ADD methods but had the advantage of using the least data. Finally, SEP showed low prediction accuracy even though more data were used than in the other methods; however, for NCA, a significant improvement in prediction accuracy was obtained when the cell data that were not well predicted were deliberately included in the training set. The ADD and INI methods demonstrated high prediction capability regardless of the battery cell type. Therefore, we believe that these methods will be applicable to other Ni-rich or Li-rich battery cells. To further improve the prediction accuracy, adding physicochemical properties would be beneficial because they directly reflect the characteristics of the various types of battery cells.
Data Availability
The database used in this study is open-source, and relevant information can be found in the references.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This research was supported by the MSIT (Ministry of Science and ICT), Korea, under the High-Potential Individuals Global Training Program (2021-0-01539) supervised by the IITP (Institute for Information & Communications Technology Planning & Evaluation) and by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (Nos. 2020R1F1A1066519 and 2022R1F1A1074339).
Supplementary Materials
Supporting Information: battery cycling conditions, feature list, feature correlation, and machine-learning model prediction score and results. (Supplementary Materials)