Abstract

To address the shortcomings of traditional recommendation algorithms when dealing with large-scale music data, such as low accuracy and poor real-time performance, a personalized recommendation algorithm based on the Spark platform is proposed. A K-means clustering model between users and music is constructed, and an AFSA (artificial fish swarm algorithm) is used to optimize the initial K-means centroids to improve the clustering effect. Based on the user-user and user-music-attribute rating relationships, a collaborative filtering algorithm is applied to calculate the correlation between users and achieve accurate recommendations. Finally, the performance of the designed recommendation model is validated by deploying it on the Spark platform with the Yahoo Music dataset and an online music platform dataset. The experimental results show that the improved AFSA can optimize the K-means clustering centroids and yields good clustering results. Combined with the fast distributed computation across multiple Spark nodes, the recommendation accuracy outperforms that of traditional recommendation algorithms; in particular, when dealing with large-scale music data, both recommendation accuracy and real-time performance are higher, meeting the current demand for personalized music recommendation.

1. Introduction

With the continuous development of network and information technology, music works have grown explosively and gradually formed a vast library of musical works. Personalized recommendation has become a problem that powerful music platforms need to solve [1, 2]. A recommendation service can filter music data from the library according to the user's interests and behavior and intelligently recommend corresponding music works while meeting the user's personalized needs. A sound recommendation system can save users' time and energy, improve user loyalty, attract new users, and make music platforms more competitive [3–5]. However, because music recommendation must satisfy many conditions, traditional single-algorithm recommendation suffers from low accuracy, poor real-time performance, and heavy resource consumption, and it cannot meet the current demand for personalized recommendation over massive music data.

Based on this, some scholars have proposed improvements and optimizations of the original algorithm models, such as collaborative filtering, content-based, and model-based approaches, and have achieved certain results. Lampropoulou et al. [6] proposed an SVM-based music recommendation system that retrieves music resources of the same type through content retrieval and applies a collaborative filtering algorithm to achieve personalized music recommendation; however, its recommendation capability is unsatisfactory when dealing with large-scale, high-dimensional data. Tian et al. [7] proposed a hybrid LX recommendation algorithm integrating logistic regression and XG-Boost (eXtreme Gradient Boosting) and verified the effectiveness of the LX model on a real music dataset, with experiments showing high recommendation accuracy; however, the high spatial complexity of the XG-Boost presorting process makes the implementation more complicated and computationally expensive, so its real-time recommendation capability is limited. Another approach uses K-means clustering to find users with similar attributes and pLSA (probabilistic latent semantic analysis) to obtain the types of music users often listen to; the combined algorithm achieves accurate recommendations with high accuracy and efficiency [8].

Although the above recommendation algorithms can accomplish music recommendation reasonably well, there is still considerable room for improvement. Therefore, this paper proposes an AFSA-optimized K-means clustering music recommendation method built on the Spark platform. The improved AFSA is used to solve the global optimum-seeking and local extremum problems of the K-means algorithm and improve the accuracy of multicategory clustering. A collaborative filtering algorithm is also used to handle the similarity between users, and the clustering effect and recommendation efficiency are improved by the parallel computing capability of the multiple nodes of the Spark platform.

2. Spark Platform Introduction

Apache Spark is a distributed computing engine specialized in processing large-scale data, with the advantages of fast data processing, strong stability and scalability, and support for multinode collaboration; it can perform stream processing, interactive query, and batch processing in real time [9, 10]. As shown in Figure 1, Spark adopts a master-slave node management model, and the nodes have the same computing capability in terms of functional structure. The master node is responsible for controlling the executor and driver processes to complete resource management and task scheduling, and an executor is a process of an application running on a work node.

The Spark platform uses a DAG (directed acyclic graph) job scheduling scheme. The dependencies between different data are specified by building a model of data relationships in RDDs (resilient distributed datasets). To facilitate iterative computation over the data, the results of the distributed analysis are usually stored in memory. An RDD is a read-only, partitioned, and immutable dataset. Since Spark performs most of its data computation with the RDD computation model, the model exposes rich operations and the user does not need to focus on the underlying scheduling details [11]. This fundamentally avoids the low computing efficiency caused by multiple iterations in the clustering process, dramatically improves the effect of data clustering and recommendation, and meets the demand for real-time, intelligent music recommendation.
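As a concrete illustration of this execution model, the following minimal PySpark sketch (not taken from the paper; the file name and record layout are assumptions) parses a rating file into an RDD once, caches it in executor memory, and reuses the cached data across several actions, which is the property that makes iterative clustering jobs cheap on Spark.

```python
# Minimal PySpark RDD sketch: parse once, cache in memory, reuse across actions.
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-cache-demo")

# Hypothetical input: one "userId,itemId,rating" record per line.
ratings = (sc.textFile("ratings.csv")
             .map(lambda line: line.split(","))
             .map(lambda f: (f[0], f[1], float(f[2])))
             .cache())                                    # kept in memory after the first action

n_ratings = ratings.count()                               # triggers the read and parse
n_users = ratings.map(lambda r: r[0]).distinct().count()  # served from the cached RDD
avg_score = ratings.map(lambda r: r[2]).mean()            # no re-read from disk
```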

3. K-Means Clustering

K-means clustering is one of the commonly used algorithms in data mining. The algorithm uses the distance between data sample points to measure similarity: the closer the distance, the higher the similarity of the data. Consider a dataset C = {c_i, i = 1, 2, …, m} with m sample points, where each sample is described by an N-dimensional feature vector [12]. The K-means algorithm aims to assign all data samples in the dataset to K clusters D = {d_k, k = 1, 2, …, K} according to the principle of similarity. The clustering error is evaluated using the sum of squared errors, which is computed repeatedly until it is minimized [13]. The algorithm flow is shown in Figure 2.

There are several methods to calculate the distance between data sample points; the K-means algorithm mainly uses the Euclidean distance to measure the similarity between data samples. The Euclidean distance between two data samples $c_i$ and $c_j$ in dataset $C$ can be expressed as [14]

$$\operatorname{dist}(c_i, c_j) = \sqrt{\sum_{n=1}^{N} \left( c_{in} - c_{jn} \right)^2}.$$

The mean of the data samples within cluster $d_k$ is calculated and taken as the centroid $h_k$ of the cluster:

$$h_k = \frac{1}{\left| d_k \right|} \sum_{c_i \in d_k} c_i.$$

The clustering criterion is the sum of squared errors $G$ over all data samples:

$$G = \sum_{k=1}^{K} \sum_{c_i \in d_k} \left\| c_i - h_k \right\|^2.$$
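To make the three formulas above concrete, the following NumPy sketch (illustrative only, not the paper's implementation) performs one K-means pass: Euclidean distances drive the assignment, each new centroid $h_k$ is the mean of its cluster, and $G$ is the resulting sum of squared errors.

```python
import numpy as np

def kmeans_step(C, H):
    """One K-means pass. C: (m, N) samples; H: (K, N) current centroids."""
    # Euclidean distance from every sample to every centroid
    dist = np.linalg.norm(C[:, None, :] - H[None, :, :], axis=2)
    labels = dist.argmin(axis=1)                      # assign each sample to its nearest centroid
    # new centroid h_k = mean of the samples in cluster d_k (keep old centroid if empty)
    H_new = np.vstack([C[labels == k].mean(axis=0) if np.any(labels == k) else H[k]
                       for k in range(H.shape[0])])
    G = ((C - H_new[labels]) ** 2).sum()              # sum of squared errors
    return H_new, labels, G
```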

The implementation of the K-means algorithm is relatively simple. Still, the number of clusters depends on manual experience setting, which takes a long time when dealing with large-scale data such as music and video, and the accuracy of clustering is not ideal. Moreover, the initial clustering centroid of the algorithm is selected randomly, which is prone to the situation that the clustering results vary greatly. If a relatively isolated point is selected as the initial clustering centroid, it will directly affect the accuracy and efficiency of clustering [15, 16]. Considering the actual application requirements, the K-means algorithm needs to be improved to enhance the clustering accuracy and efficiency.

4. Improved Artificial Fish Swarm Algorithm

4.1. Artificial Fish Swarm Algorithm

AFSA is a bionic optimization algorithm proposed by studying the intelligent behavior of fish schools. Based on the environmental stimulus information received by the fish, artificial fish are constructed, and their feeding, gathering, and tailing behaviors are simulated to accomplish automatic optimization over the solution space [17, 18]. In this paper, AFSA is used to select the initial cluster centroids for K-means clustering. The data samples to be clustered act as artificial fish; when a fish in the school finds food, the other individuals adjust their positions according to their own state and the food position, finally obtaining the global optimal value, which is taken as the initial cluster center point. The principle of artificial fish optimum seeking is shown in Figure 3.

Let $Y_i(t)$ be the spatial position of artificial fish $i$ at time $t$, and let $\Delta Y_i(t+1)$ denote the change in the spatial position of fish $i$ at time $t+1$ compared with time $t$. The random movement behavior of artificial fish $Y_i$ can then be expressed as

$$Y_i(t+1) = Y_i(t) + \text{Rand}() \cdot \text{Step},$$

where $\text{Rand}()$ is a random function and $\text{Step}$ denotes the moving step length of the artificial fish. A specific state $Y_j$ within the visual range of artificial fish $Y_i$ is then randomly selected as

$$Y_j = Y_i(t) + \text{Rand}() \cdot \text{Visual}.$$

Calculate the objective function values $Z_i$ and $Z_j$ for $Y_i$ and $Y_j$. If $Z_j$ is better than $Z_i$, then $Y_i$ advances one step in the direction of $Y_j$:

$$Y_i(t+1) = Y_i(t) + \text{Rand}() \cdot \text{Step} \cdot \frac{Y_j - Y_i(t)}{\left\| Y_j - Y_i(t) \right\|}.$$

If $Z_j$ is worse than $Z_i$, fish $Y_i$ reselects another state $Y_j$ in its field of view and determines whether it can advance. After Try-number repeated attempts, if no state satisfying the advance condition is found, the random behavior is executed.

Let the food concentration function over spatial coordinates be $f(y)$, the maximum food concentration be $f_{\max}$, and the food concentration at the coordinate $Y_i$ of the artificial fish be $f(Y_i)$. Based on the maximum of the food concentration $f(y)$ within the field of view of the fish, determine whether the food location $Y_o$ is within the visible range; if it is visible, perform the tailing (trailing) behavior:

$$Y_i(t+1) = Y_i(t) + \text{Rand}() \cdot \text{Step} \cdot \frac{Y_o - Y_i(t)}{\left\| Y_o - Y_i(t) \right\|}.$$
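The following Python sketch summarizes the foraging step described above under simple assumptions (a food-concentration function f to be maximized; Visual, Step, and Try-number as scalar parameters); it is an illustration, not the paper's code.

```python
import numpy as np

def forage(Y_i, f, visual, step, try_number, rng):
    """One foraging move of an artificial fish located at Y_i."""
    for _ in range(try_number):
        # randomly select a state Y_j inside the field of view
        Y_j = Y_i + visual * rng.uniform(-1.0, 1.0, size=Y_i.shape)
        if f(Y_j) > f(Y_i):                                   # Z_j better than Z_i
            d = (Y_j - Y_i) / (np.linalg.norm(Y_j - Y_i) + 1e-12)
            return Y_i + step * rng.random() * d              # advance one step toward Y_j
    # all Try-number attempts failed: execute the random behavior
    return Y_i + step * rng.uniform(-1.0, 1.0, size=Y_i.shape)
```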

4.2. Optimization of Artificial Fish Swarm Algorithm
4.2.1. Optimization of Fish School Behavior

In the foraging behavior of AFSA, $Y_j$ is any state within the current field of view of the artificial fish, and the artificial fish performs random behavior when the selected state cannot satisfy the forward condition. The existence of such random behavior lowers the accuracy of the foraging results; in the later stage of convergence, the artificial fish tend to search back and forth around the global extremum, which leads to invalid computation. At the same time, when the artificial fish fails in gathering and tail-chasing behavior, it performs foraging behavior, producing many invalid forward moves, which lengthens the convergence time and increases the iterative computation. Therefore, this paper optimizes the fish swarm behavior of the original algorithm in the following two aspects to further improve the algorithm's performance.

(1) When the foraging behavior fails, the artificial fish $Y_i$ moves one step forward in the direction of the relatively better state recorded on the bulletin board. The optimized expression for the foraging behavior is

$$\Delta Y_i(t+1) = \text{Rand}() \cdot \text{Step} \cdot \frac{Y_{\text{better}} - Y_i(t)}{\left\| Y_{\text{better}} - Y_i(t) \right\|},$$

where $Y_{\text{better}}$ is the relatively better state in the bulletin board record and $\Delta Y_i(t+1)$ is the resulting increment of the fish's state. The optimized foraging behavior discards the random advance of the artificial fish and provides a better direction for its movement, which effectively keeps the artificial fish from falling into local optima and further improves the accuracy of the search results.

(2) When the gathering behavior or tail-chasing behavior fails, the artificial fish no longer performs the foraging operation but executes the random swimming behavior; the optimized random moving behavior is defined as

$$Y_i(t+1) = Y_i(t) + \text{Rand}() \cdot \text{Step}.$$

By optimizing the forward rule after the failure of the gathering and tail-chasing behaviors, the invalid moves of the artificial fish are reduced, which effectively shortens the algorithm's running time and fundamentally reduces the computational complexity.
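A hedged sketch of the two modifications, reusing the assumptions of the foraging sketch above (bulletin_best stands for the relatively better state recorded on the bulletin board):

```python
import numpy as np

def forage_optimized(Y_i, f, visual, step, try_number, bulletin_best, rng):
    """Modification (1): on foraging failure, step toward the bulletin-board state."""
    for _ in range(try_number):
        Y_j = Y_i + visual * rng.uniform(-1.0, 1.0, size=Y_i.shape)
        if f(Y_j) > f(Y_i):
            d = (Y_j - Y_i) / (np.linalg.norm(Y_j - Y_i) + 1e-12)
            return Y_i + step * rng.random() * d
    d = (bulletin_best - Y_i) / (np.linalg.norm(bulletin_best - Y_i) + 1e-12)
    return Y_i + step * rng.random() * d          # no blind random advance anymore

def on_gather_or_follow_failure(Y_i, step, rng):
    """Modification (2): random swimming instead of another foraging attempt."""
    return Y_i + step * rng.uniform(-1.0, 1.0, size=Y_i.shape)
```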

4.2.2. Adaptive Field of View and Step Size

The two parameters of the artificial fish, the field of view and the step length, determine the current search range and the subsequent moving speed. When an artificial fish is searching far away from the extreme point, it should have a larger field of view and step length to improve the search speed; when it searches in the area near the extreme point, the field of view and step length should be small to ensure search accuracy. Because the field of view and step length are globally fixed values in the original algorithm, it is easy to fall into blind search [19, 20]. In this paper, adaptive values for both are obtained by introducing a decay factor $\alpha$, so that the adaptive step length and adaptive decaying field of view can be expressed as

$$\text{Step}(t+1) = \alpha \cdot \text{Step}(t), \qquad \text{Visual}(t+1) = \alpha \cdot \text{Visual}(t),$$

where $\alpha$ is the attenuation factor that constrains the step length, the field of view, and the forward movement of the artificial fish as the search advances, with $0 < \alpha < 1$.

The iterative process of AFSA is a continuous process of the artificial fish gathering toward the global extremum. By introducing the decay factor, the fish can dynamically adjust the field of view and step size according to external environmental information during the search. In the early stage of the algorithm, the artificial fish use a larger field of view and step size to conduct a coarse search over a wide range near the optimal value, speeding up the individual search and making the algorithm converge quickly. After several iterations, the artificial fish gradually approach the extreme value; at the end of the iteration, the field of view and step size are adjusted adaptively by the attenuation factor to enhance the local search ability and reduce the oscillation range, so the artificial fish can quickly find the global extreme point with high accuracy. Especially for multidimensional data, the optimum-seeking advantage brought by the decay factor is more prominent.
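The schedule can be sketched as follows; the exact definition of the attenuation factor is not reproduced here, so an exponential decay toward a small floor is assumed purely for illustration.

```python
import numpy as np

def adaptive_params(visual0, step0, t, t_max, floor=0.1):
    """Shrink Visual and Step as iteration t approaches t_max (assumed decay form)."""
    alpha = floor + (1.0 - floor) * np.exp(-30.0 * (t / t_max) ** 3)   # decay factor in (floor, 1]
    return alpha * visual0, alpha * step0

# Early iterations (t << t_max): alpha is close to 1 -> wide view, long steps, coarse search.
# Late iterations (t -> t_max): alpha approaches the floor -> fine local search, less oscillation.
```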

4.3. User-Based Collaborative Filtering

After clustering by the AFSA-optimized K-means algorithm, the user rating matrix over all resources and services to be recommended can be obtained. The collaborative filtering algorithm finds users with common rated items in this rating matrix and calculates their similarity to complete the personalized music recommendation. There are many kinds of similarity measures, but PPMCC (Pearson product-moment correlation coefficient) compensates for the differences between dimensions, and its similarity calculation considers the rating profile as a whole, so it can perform a multidimensional examination of the same user [21]. Therefore, this paper uses PPMCC to calculate the similarity between users. The expression is

$$W(u, v) = \frac{\sum_{c \in I(u, v)} \left( r_{uc} - \bar{r}_u \right)\left( r_{vc} - \bar{r}_v \right)}{\sqrt{\sum_{c \in I(u, v)} \left( r_{uc} - \bar{r}_u \right)^2} \sqrt{\sum_{c \in I(u, v)} \left( r_{vc} - \bar{r}_v \right)^2}},$$

where $I(u, v)$ represents the set of items jointly rated by users $u$ and $v$; $r_{uc}$ and $r_{vc}$ represent the ratings of item $c$ by users $u$ and $v$, respectively; and $\bar{r}_u$ and $\bar{r}_v$ represent the average ratings of the calculated items by users $u$ and $v$, respectively.

Table 1 shows the user-user similarity matrix W calculated by the PPMCC.
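A minimal Python sketch of this step (assuming ratings are held as a dict of {user: {item: score}}) computes the PPMCC over co-rated items and fills a user-user similarity matrix of the kind shown in Table 1:

```python
import numpy as np

def pearson_sim(ratings_u, ratings_v):
    """PPMCC between two users over their jointly rated items."""
    common = sorted(set(ratings_u) & set(ratings_v))           # I(u, v)
    if len(common) < 2:
        return 0.0
    ru = np.array([ratings_u[i] for i in common], dtype=float)
    rv = np.array([ratings_v[i] for i in common], dtype=float)
    du, dv = ru - ru.mean(), rv - rv.mean()                     # deviations from each user's mean
    denom = np.sqrt((du ** 2).sum() * (dv ** 2).sum())
    return float((du * dv).sum() / denom) if denom else 0.0

def similarity_matrix(ratings):
    """ratings: {user: {item: score}} -> W[u][v] for every pair of distinct users."""
    users = list(ratings)
    return {u: {v: pearson_sim(ratings[u], ratings[v]) for v in users if v != u}
            for u in users}
```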

4.4. Music Recommendation Process Based on Spark

The recommendation process based on AFSA-optimized K-means clustering mining is shown in Figure 4. First, a requirement analysis is performed for the personalized music recommendation task, and the Spark platform deploys distributed nodes of a suitable scale to build the K-means clustering model. The AFSA is applied to optimize the initial K-means centroids and obtain the clustering results. Then, according to the rating relationships between users and between users and music attributes, the collaborative filtering algorithm calculates the similarity between users, and finally the personalized recommendation of music resources is completed.
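The overall flow can be outlined with the sketch below; it reuses the illustrative helpers introduced earlier (kmeans_step, pearson_sim), treats afsa_initial_centroids as a hypothetical stand-in for the improved AFSA optimizer, and collects a small sample to the driver for brevity, whereas the actual method distributes this work across the Spark worker nodes.

```python
from pyspark import SparkContext
import numpy as np

sc = SparkContext("local[*]", "music-recommender")

# 1. One rating/attribute vector per user (hypothetical file layout).
user_vectors = (sc.textFile("user_music_matrix.csv")
                  .map(lambda line: np.array([float(x) for x in line.split(",")]))
                  .cache())

# 2. AFSA supplies the initial centroids; a few K-means passes refine them.
C = np.array(user_vectors.collect())          # small sample pulled to the driver for this sketch
H = afsa_initial_centroids(C, k=20)           # hypothetical AFSA optimizer (Section 4)
for _ in range(20):
    H, labels, sse = kmeans_step(C, H)        # illustrative helper from Section 3

# 3. Collaborative filtering inside each cluster: PPMCC similarities between co-clustered
#    users (pearson_sim from Section 4.3) drive the final top-N music recommendations.
```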

5. Experiment

5.1. Experimental Dataset and Clustering Center K Values

To verify the performance of the AFSA-optimized K-means clustering personalized music recommendation algorithm on the Spark platform, a public dataset and a proprietary dataset are tested separately. The public dataset is Yahoo Music, a classical dataset for testing recommendation systems that can well validate the recommendation performance of clustering mining. The dataset contains large-scale user and music resource data, which is sufficient to validate the recommendation system's performance. The experimental dataset is shown in Table 2. The Spark platform contains one master node and nine work nodes with the same hardware configuration (CPU: Intel i5-11400; memory: Kingston 3600 MHz 8 GB × 2; system: IOP 4.2, with Spark 1.6.1).

The clustering center K value is sensitive to the music rating matrix and has a significant impact on the stability of collaborative filtering recommendations. Therefore, this paper varies the K value and uses the RMSE (root mean squared error) as an evaluation index to verify the recommendation accuracy under different K values [22]. The expression of RMSE is

$$\text{RMSE} = \sqrt{\frac{1}{N} \sum_{b=1}^{N} \left( p_b - q_b \right)^2},$$

where $p_b$ denotes the predicted score, $q_b$ denotes the actual score, and $N$ represents the number of items. Smaller RMSE values indicate better algorithm performance.
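The metric is straightforward to compute; a small NumPy version (illustrative) is:

```python
import numpy as np

def rmse(predicted, actual):
    """Root mean squared error between predicted and actual scores."""
    p, q = np.asarray(predicted, dtype=float), np.asarray(actual, dtype=float)
    return float(np.sqrt(((p - q) ** 2).mean()))

# Example: rmse([3.5, 4.0, 2.0], [4.0, 4.0, 1.0]) ≈ 0.645
```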

Data from the Yahoo Music dataset were randomly sampled to form three sample sets of different sizes, YM-Data1 (13.82 MB), YM-Data2 (43.39 MB), and YM-Data3 (140.24 MB); experiments were then conducted on each of the three samples to obtain the recommendation-accuracy RMSE values under different K values, as shown in Figure 5.

As shown in Figure 5, the RMSE value first decreases and then increases as the K value increases, and the RMSE value is optimal for the three sample sets YM-Data1, YM-Data2, and YM-Data3 when the K value is 20, 20, and 22, respectively. As K increases further, the RMSE values gradually become larger and the recommendation stability worsens. Therefore, in the subsequent training process, the range of K values is set to [20, 22].

5.2. Performance of the Recommendation Algorithm
5.2.1. Public Datasets

The K value is set to 20, and the designed AFSA-optimized K-means algorithm is used for clustering mining. The recommendation model is trained on the Yahoo Music sample sets, with the master node and all work nodes involved in the computation during training. The recommendation performance on the three sample sets is shown in Table 3.

As shown in Table 3, the difference in sample set size has little influence on the recommendation time and accuracy, and the method shows good real-time recommendation ability because the data are loaded into the distributed memory of the Spark cluster and the transformation iterations complete quickly, especially when the sample data size is small.

5.2.2. Online Music Dataset

To further validate the recommendation performance of this method, a dataset from an online music platform is used for testing. First, four sample sets of different sizes are built from the music platform data: KW-Data1 (10.75 MB), KW-Data2 (474.84 MB), KW-Data3 (1.62 GB), and KW-Data4 (6.65 GB). Then, cluster mining is applied to each data sample to obtain the ratings between all users and music resource attributes. Finally, the recommendation results are obtained using the collaborative filtering algorithm. KW-Data2 is taken as an example to visualize its clustering results and facilitate observation of the clustering effect; the result is shown in Figure 6.

To make the results comparable, the algorithm's performance was evaluated with the number of clustering centers K set to 18, 20, and 22. The results obtained are shown in Table 4.

As can be seen from Table 4, the accuracy rate stays above 90% when the K values selected by the optimization algorithm are used, and the recommendation accuracy improves by 20.3%–21.7% compared with randomly selected K values. Since the Spark platform uses multinode computing, the recommendation time increases gradually with the sample capacity under the different K values, but no rapid growth occurs, which is consistent with expectations.

5.3. Spark Platform Acceleration Performance Testing

The higher the speedup ratio, the better the parallelization performance of the platform. The speedup ratio is expressed as

$$S_\rho = \frac{T_1}{T_\rho},$$

where $T_1$ denotes the recommendation time on a single node and $T_\rho$ denotes the time used for multinode parallel computation on the Spark platform. The speedup performance for different numbers of nodes is shown in Table 5.
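For reference, the ratio is computed directly from measured wall-clock times; the numbers below are made up purely for illustration.

```python
def speedup(t_single_node, t_parallel):
    """Speedup ratio S = T1 / T_rho."""
    return t_single_node / t_parallel

print(speedup(600.0, 95.0))   # e.g. 600 s on one node vs. 95 s on the cluster -> ~6.3x
```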

From Table 5, we can see that the greater the number of worker nodes, the more significant the speedup effect, owing to the excellent parallelization capability of the Spark platform; moreover, the larger the sample set, the greater the impact of the number of nodes on the speedup ratio.

5.4. Performance Comparison of Different Recommendation Algorithms

To ensure the comparability of the algorithms and further verify the performance of the K-means clustering optimized by the improved artificial fish swarm algorithm, five hundred samples were randomly selected from each online music platform sample set to construct a new selection set, KW-Data5, on which the SVM classification algorithm proposed in [6], the XG-Boost algorithm presented in [7], and the algorithm in this paper were tested. The recommendation performance of the three algorithms is shown in Figure 7.

As can be seen from Figure 7, the recommendation accuracy of each algorithm increases with the number of iterations and then gradually stabilizes. The recommendation accuracy of this method reaches 96.23%, which is 6.77% and 15.32% higher than that of XG-Boost and SVM, respectively, so its recommendation performance is superior. In terms of convergence speed, the method in this paper stabilizes after 245 iterations, SVM stabilizes after 230 iterations, and the XG-Boost algorithm completes convergence after 295 iterations, which is the slowest. Therefore, based on the comprehensive comparison of recommendation performance, it can be concluded that the improved AFSA-optimized K-means clustering method has good recommendation ability, better balances accuracy and convergence speed, and is suitable for large-scale personalized music recommendation.

6. Conclusion

To address the shortcomings of traditional music recommendation algorithms, the improved AFSA is applied to solve the global optimum-seeking and local extremum problems of the K-means algorithm, thereby improving the effect of multicategory clustering; personalized music recommendation is completed by the collaborative filtering algorithm; and the speed of large-scale data processing is improved by using the parallel computing capability of the Spark distributed platform. The experimental results show that the AFSA-optimized K-means clustering mining algorithm has good recommendation capability, with recommendation accuracy improved by 6.77% and 15.32% compared with the commonly used XG-Boost and SVM methods, respectively; the algorithm also iterates quickly enough to meet real-time recommendation requirements. It can effectively improve the efficiency of personalized music recommendation and can serve as a reference for music, video, and similar fields.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The author declares that there are no conflicts of interest.