Abstract
Human musical life can be traced back to ancient times. The music of human society is rich and varied, which makes efficient and accurate classification difficult; classification has become a daunting task. On this basis, this paper studies a deep learning method for music classification. We design the structure of the music signal classification channel and investigate fully connected neural networks for music in order to design an appropriate network model. According to different music sequence measurements, a feature sequence mechanism with feedback optimization is also investigated. The class probabilities of the candidate genres are computed with the softmax activation function, and the cross-entropy loss value is obtained. Finally, the Adam optimization algorithm is used to optimize the proposed network model, and an independent adaptive learning rate is designed that adjusts the network parameters using first- and second-order moment estimates of the computed gradients. The experimental results show that the proposed method can effectively increase the accuracy of music classification and is helpful for music channel classification. Moreover, we also observed that the number of neurons in the network has a significant impact on the training and testing errors.
1. Introduction
The creation and performance of early popular music were mostly commercial and took place in cities and towns, which distinguished it from folk music with its strong rural color. At the same time, popular music lacks the standardization and stability of art music [1]; in its early days it was, in many cases, transmitted orally. Therefore, some say that popular music is distinct from both art music and folk music: it generally refers to music that is easy to understand, relaxed and lively, easy to spread, and has a large audience. Some people simply call particular music of this kind "popular music" [2]. Genre is an important label for describing music, and genre tags play an important part in identifying and organizing digital music resources [3]. Identifying and classifying genres within the enormous volume of available music has therefore become increasingly daunting. Facing such a huge catalogue, relying on manual annotation for classification consumes significant computational cost, resources, and time; moreover, manual approaches cannot meet the needs of an era shaped by big data, the Internet of Things, and people's growing interest in music. Music classification has therefore gradually become a research hotspot.
At present, scholars in related fields have carried out theoretical research on the classification of music genres. For example, the authors in [4] proposed an engine system for classifying genres that replaces hand-crafted features with a new model. The model can also recommend music based on vocals extracted from online music. Their experimental results show that the method is not only efficient but can also effectively modulate speech pitch and construct separation masks based on recurrent neural networks; voice signals mixed with music can be screened out and removed. A music pitch classification method based on an RNN model can improve the time trajectory of speech and music pitch values and determine whether an unknown continuous pitch sequence belongs to speech or music. This method achieves significant classification performance without sacrificing speech-noise separation performance. Nevertheless, the previously mentioned approaches still have some shortcomings, such as low classification precision, poor effectiveness, and long computation times.
In order to address the above shortcomings, a deep learning-based classification method for music genres is proposed in this paper. First, data preprocessing is used to filter the music signals. Then, music genre features are extracted using a fully connected neural network structure. Finally, an attention mechanism is used to design the music genre classification network model. The genre classification performance of the suggested method is better than that of other approaches: it effectively improves classification accuracy and significantly shortens classification time. The main contributions are as follows:
(i) We study the design of the music signal classification channel and design a fully connected neural network for music.
(ii) According to different music sequence measurements, a feature sequence mechanism with feedback optimization is studied.
(iii) The class probabilities of the different genres are computed with the softmax function, and the cross-entropy loss value is obtained.
(iv) Finally, the Adam optimization algorithm is used to optimize the network model, and an independent adaptive learning rate is designed.
The remainder of the paper is organized as follows. In Section 2, we briefly discuss the basic theory of deep learning, covering neural networks, the back-propagation (BP) algorithm, and activation functions. In Section 3, the fundamentals of music signal analysis are illustrated. In Section 4, we discuss the classification of music genres, describe feature extraction, and propose a neural network model. Experimental results and discussion are presented in Section 5. Finally, Section 6 summarizes the paper and presents directions for future research.
2. Basic Theory of Deep Learning
Deep learning is a branch of machine learning that deals with learning algorithms based on deep neural networks. Deep learning methods developed from artificial neural networks (ANNs), which are among the most commonly used and representative model structures in the field of machine learning. A deep neural network (DNN) is formed from the interconnection of many neurons and weights and may have many hidden layers and neurons [5]. Deep learning can learn higher-level feature representations from complex and large samples.
2.1. Neural Networks
Deep learning developed from artificial neural networks, which in turn are abstracted from the structure of biological neural networks. In such a network, information is transmitted and activated through the interconnections between basic units known as neurons, imitating the process of information transmission between biological neurons [6]. Several neurons are connected with each other in such a way that communication occurs among them [7]. The basic structure of a neuron is shown in Figure 1.

In Figure 1, the inputs are the input signals, and each arrow starting from an input represents a connection. Each connection carries a particular weight. After the input signals pass through these connections, they are weighted and summed to obtain an intermediate value (the usual pre-activation output of a hidden neuron). Finally, this value passes through a nonlinear function to produce the output. This nonlinear function is called the activation function and is used to tune the behavior of the network [8]. The process from neuron input to output can be described mathematically as follows:
In formula (1), the bias term of the neuron is added to the weighted sum. Multiple neurons with the same inputs form a hidden layer. The output of one layer of neurons is used as the input of the next layer, and the basic neural network is formed according to this connection pattern. The input of a neuron can come either from the input signal or from the outputs of other neurons [9]. The structure of the fully connected neural network is shown in Figure 2.
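The weighted-sum-plus-activation step of formula (1) can be sketched in a few lines. This is a minimal illustration, not the paper's implementation; the function name and the choice of a sigmoid activation are ours.

```python
import math

def neuron_output(inputs, weights, bias):
    """Weighted sum of inputs plus bias, passed through a sigmoid activation."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias  # linear combination
    return 1.0 / (1.0 + math.exp(-z))                       # sigmoid activation

# Example: three inputs, three weights, one bias term.
y = neuron_output([0.5, -1.0, 2.0], [0.4, 0.3, 0.1], bias=0.2)
```

Any of the activation functions discussed in Section 2.3 could replace the sigmoid here.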

From bottom to top, as shown in Figure 2, the input layer takes inputs, passing through several neuron layers, and the output layer creates the output. The network structure, in Figure 2, has only one hidden layer, and this type of neural network is also called a single hidden layer feedforward neural network. In deep learning, multiple hidden layers can also be set, and each hidden layer is set with a different number of neurons according to the actual situation to improve the learning capability. The connection weight matrix of each layer and the previous layer is multiplied by the output value of the neuron of the previous layer, and the bias term of this layer is added to obtain a linear output. Subsequently, the obtained linear output then passes through the activation function of this layer performing nonlinear transformation to get the output of this layer of neurons [10]. The process of neurons in each layer from receiving input to calculating output can be described by a calculation formula as follows:
In formula (2), the linear output vector of the neurons in a given layer is calculated from the output vector of the neurons in the previous layer, the connection weight matrix of the layer, and the bias term of the layer. The nonlinear output vector of the layer is then obtained by passing this linear output through the layer's activation function.
Let us again refer to the basic architecture of the neural network shown in Figure 2, starting from the input layer and moving in the direction from input to output. According to the above process, a series of linear and activation operations is carried out on the input vector using the connection weight matrix and bias term of each layer [11]. These quantities are calculated layer by layer until the target prediction is obtained at the output layer. This process is called forward propagation.
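The layer-by-layer forward propagation just described can be sketched as follows. This is an illustrative toy (the layer sizes, random weights, and function names are ours), showing the repeated "linear output, then activation" pattern of formula (2).

```python
import numpy as np

def forward(x, layers):
    """Propagate an input vector through a list of (W, b, activation) layers."""
    a = x
    for W, b, act in layers:
        z = W @ a + b   # linear output: weights times previous layer's output, plus bias
        a = act(z)      # nonlinear output via the layer's activation function
    return a

relu = lambda z: np.maximum(z, 0.0)
identity = lambda z: z

# Toy network: 3 inputs -> 4 hidden neurons (ReLU) -> 2 outputs.
rng = np.random.default_rng(0)
layers = [(rng.standard_normal((4, 3)), np.zeros(4), relu),
          (rng.standard_normal((2, 4)), np.zeros(2), identity)]
y = forward(np.array([1.0, -0.5, 2.0]), layers)
```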
2.2. Back-Propagation (BP) Algorithm
The input layer, hidden layer, and output layer are the three components that make up the front end, middle, and end of the BP neural network. It is assumed that x0 = −1; the beginning of the imported input is the input vector, whose formula is x = (x1, x2, …, xi, …, xn)T; the middle of the neural network is the hidden layer, which will slow down training. The output vector is the result of the generated data, and its formula is y = (y1, y2, …, yi, …, yn)T. y0 = −1 can be provided as an additional assumption. The algorithm is a part of a unique programme, and, right now, one of the most cutting-edge fields is neural network. The result of combining the two is BP neural network. The topology of the BP neural network is shown in Figure 3. This research employs the modified BP neural network model to evaluate music classification, which can successfully eliminate the difficulties of instability and slow convergence of the classic model and can comprehensively improve the accuracy of the evaluation findings [12]. Topological structure of BP neural network model is shown in Figure 4.

In this first step, we calculate the error of the output layer according to the error loss function and then transfer it layer by layer to the middle layers in some form and update the parameters of each layer [13, 14]. Through continuous iteration, the error of loss function calculation is minimized and the parameters converge. The back-propagation algorithm adopts the gradient descent method, as illustrated in equation (3), to update the parameters:
In formula (3), the learning rate scales the gradients of the error loss function with respect to the connection weights and the bias terms. It can be seen that the key to the back-propagation algorithm is finding the gradient of the error loss function with respect to the parameters [15]. The calculation proceeds in the following steps.
Step 1: Calculate the loss error from the target prediction and the expected output of the output layer using the following equation. In formula (4), the loss error is obtained by applying the loss function to the target prediction vector and the target expectation vector of the output layer.
Step 2: Calculate the error term of the output layer according to the loss error using the following equation.
Step 3: Calculate the error term of each neuron in a hidden layer according to the chain rule, as illustrated in the following equation. It can be seen from formula (6) that the error term of a layer is affected by the error term of the layer after it; in other words, the error propagates backward through the network layer by layer.
Step 4: Calculate the gradients of each layer's connection weights and bias terms according to the error terms using the following equation. As can be seen from formula (7), the gradient of the current layer's connection weights depends on the error term of the current layer's neurons and the output of the previous layer's neurons, while the gradient of the current layer's bias term depends only on the error term of the current layer's neurons. Substituting these results into formula (3) completes the parameter update for each round of training.
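The four steps above can be sketched for the simplest possible case, a single sigmoid layer trained with squared-error loss and gradient descent. This is a toy illustration under our own choice of loss and activation, not the paper's network; the variable names are ours.

```python
import numpy as np

def backprop_step(x, y_true, W, b, lr=0.1):
    """One forward/backward pass for a single sigmoid layer with squared-error loss."""
    z = W @ x + b                            # forward pass: linear output
    y = 1.0 / (1.0 + np.exp(-z))             # forward pass: sigmoid activation
    loss = 0.5 * np.sum((y - y_true) ** 2)   # Step 1: loss error
    delta = (y - y_true) * y * (1.0 - y)     # Steps 2-3: output-layer error term
    dW = np.outer(delta, x)                  # Step 4: gradient w.r.t. weights
    db = delta                               # Step 4: gradient w.r.t. bias
    return W - lr * dW, b - lr * db, loss    # formula (3): gradient descent update

x = np.array([0.5, -0.2])
y_true = np.array([1.0])
W, b = np.zeros((1, 2)), np.zeros(1)
losses = []
for _ in range(50):
    W, b, loss = backprop_step(x, y_true, W, b)
    losses.append(loss)
```

Iterating the update drives the loss down, which is exactly the convergence behavior the text describes.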
2.3. Activation Functions
The activation function introduces nonlinearity, turning the neural network into a nonlinear model and giving it the ability to solve linearly inseparable problems [16]. Various activation functions are used in neural networks, and one can be replaced with another in order to boost the accuracy of the model. Well-known and widely used activation functions include the tanh function, the ReLU (Rectified Linear Unit) function, the sigmoid function, and the softmax function. Among these, the softmax function is often used in classification tasks [12, 17]. An appropriate activation function is selected according to the needs of the task and the characteristics of the network layer. The three activation function curves are illustrated in Figure 3.
In the following, we give a brief description and the mathematical form of each activation function. In later sections, we will demonstrate that these functions affect the network's accuracy and prediction outcomes.
(1) tanh: the tanh function is the hyperbolic tangent, which maps variables to values in the range (−1, 1). However, the tanh function suffers from gradient saturation; that is, its derivative at both extremes is almost zero. This easily causes the vanishing-gradient problem during back-propagation, making training of the network model very slow or difficult to converge. The function's mathematical expression is given in the following equation:
(2) Sigmoid: the sigmoid curve is similar to the tanh curve, and the gradient-vanishing problem is also prone to occur. The function's mathematical expression is given in the following equation:
(3) ReLU: the ReLU function is a linear rectification function and a nonsaturating activation function, which avoids the vanishing gradient caused by derivatives tending to zero. The ReLU function truncates negative values to 0. It is easy to differentiate and can speed up the convergence of the network model [18]. Its mathematical expression is given by the following equation:
(4) Softmax: the softmax function is generally used in the output layer of a neural network to complete the classification task. In multiclass classification, the softmax function takes the raw outputs, computes new outputs, and maps the value range to (0, 1). In this way, the output of the neural network becomes a probability distribution over the target labels. The function's mathematical expression is illustrated in the following equation:
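The four functions above have standard definitions, which can be written out directly (a minimal sketch; the max-subtraction trick in softmax is a common numerical-stability convention, not something stated in the text):

```python
import numpy as np

def tanh(z):
    """Hyperbolic tangent: saturates toward -1 and 1 at the extremes."""
    return np.tanh(z)

def sigmoid(z):
    """Maps to (0, 1); like tanh, prone to gradient saturation."""
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    """Truncates negative values to 0."""
    return np.maximum(z, 0.0)

def softmax(z):
    """Maps a score vector to a probability distribution over classes."""
    e = np.exp(z - np.max(z))  # subtract max for numerical stability
    return e / e.sum()

probs = softmax(np.array([2.0, 1.0, 0.1]))
```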
3. Fundamentals of Music Signal Analysis
3.1. Overview of Music Genres
Since the emergence of human beings, music has developed along with human evolution. Under the influence of different periods, regions, nationalities, and cultures, music gradually formed unique characteristics in musical thought, creative principles, artistic personality, and means and techniques of expression, and music types with different styles appeared. These types can be called music genres. Popular music genres include classical, jazz, blues, hip-hop, rock, country, pop, and metal [19, 20]. There is no strict standard for the classification of music genres, which is subjective; music works of the same genre share similar artistic styles.
3.2. Music Features
The features and characteristics of the music genre can be divided into three different types: (i) time domain characteristics, (ii) frequency domain characteristics, and (iii) cepstrum domain characteristics.
3.2.1. Time Domain Characteristics
Time domain features include the zero crossing rate (ZCR) and the short-time energy (STE). These features can be extracted directly from the waveform of the original signal; the processing is simple and requires little mathematical computation, so they are widely used in music classification research [20, 21]. The two common time domain features are described in detail below:
(1) Short-time energy: Short-time energy is the sum of the energy within a small window, reflecting how the music signal varies over a period of time. It is generally used to detect silence in a piece of music, to perform endpoint detection, and to identify the beginning, transition, or end of a music signal [22]. The calculation formula for the short-time energy is given by the following equation. In formula (12), a window function appears; the window functions most often used to calculate short-time energy include the rectangular window and an improved raised-cosine window, the Hamming window [23]. The calculation formula for the window function is given by the following equation, in which the parameter represents the length of the window.
(2) Short-time zero crossing rate: If adjacent speech signal samples carry opposite algebraic signs, a zero crossing is considered to have occurred. The zero crossing rate directly reflects the amount of high-frequency content in the music signal. The short-time zero crossing rate is commonly used to detect silent frames in time domain speech analysis. The calculation is given by the following equation, in which the discrete speech signal is passed through a special function that extracts the algebraic sign of each sample. The definition of this sign function is given by the following equation:
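Both time domain features can be computed directly from a frame of samples. The sketch below uses a synthetic 440 Hz tone and a silent frame as example inputs (the sample rate and frame length are our illustrative choices):

```python
import numpy as np

def short_time_energy(frame):
    """Sum of squared samples within one frame (formula (12) with a rectangular window)."""
    return float(np.sum(frame ** 2))

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose algebraic signs differ."""
    signs = np.sign(frame)
    return float(np.mean(signs[:-1] != signs[1:]))

t = np.arange(400) / 8000.0              # 50 ms at an assumed 8 kHz sample rate
tone = np.sin(2 * np.pi * 440 * t)       # a 440 Hz tone: nonzero energy and ZCR
silence = np.zeros(400)                  # a silent frame: zero energy
```

As the text notes, a silent passage shows near-zero energy, which is what makes STE useful for endpoint detection.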
3.2.2. Frequency Domain Characteristics
Common frequency domain features include the spectral centroid (SC), spectral energy (SE), spectral bandwidth (SB), and spectral flux (SF). The description and calculation formulas of several common frequency domain features are listed below.
(1) Spectral centroid (SC): The spectral centroid is a commonly used measure whose value reflects the frequency content of the music signal: the larger the value, the more high-frequency components, and vice versa. The calculation formula is as follows:
(2) Spectral energy (SE): This frequency domain feature characterizes the frequency domain energy of one frame of the music signal. The calculation formula for the spectral energy is as follows:
(3) Spectral flux (SF): The spectral flux is a dynamic feature that describes how the spectrum of the music signal changes. It is the sum of the squared differences between the spectra of all adjacent frames of a discrete frequency domain music signal. The calculation formula is given as follows:
In the three formulas above, the Fourier transform of each frame of the signal appears, together with the maximum and minimum frequencies of the piece of music in the frequency domain.
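The spectral centroid, in particular, can be sketched as a magnitude-weighted mean frequency. The test signals (pure low- and high-frequency tones at an assumed 8 kHz sample rate) are our own illustration; a higher-pitched signal should yield a larger centroid, matching the text's description.

```python
import numpy as np

def spectral_centroid(frame, sr):
    """Magnitude-weighted mean frequency of one frame's spectrum."""
    mag = np.abs(np.fft.rfft(frame))                 # magnitude spectrum of the frame
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)  # frequency of each FFT bin
    return float(np.sum(freqs * mag) / np.sum(mag))

sr = 8000
t = np.arange(1024) / sr
low = np.sin(2 * np.pi * 200 * t)    # low-frequency tone
high = np.sin(2 * np.pi * 2000 * t)  # high-frequency tone
```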
3.2.3. Cepstrum Domain Characteristics
The music signal is transformed into the frequency domain through the Fourier transform, and frequency domain characteristics are obtained through mathematical analysis, as discussed in the previous sections. If we then take the logarithm of the music spectrum and perform the inverse Fourier transform, the frequency domain signal is converted into the cepstrum domain, from which cepstrum domain characteristics are obtained [24, 25]. The most common cepstrum domain features and related formulas are listed below:
(1) Mel frequency cepstral coefficients (MFCCs): These are among the most commonly used cepstral domain features and represent audio signals well. The Mel frequency cepstrum transforms a nonlinear frequency relationship into an approximately linear one. MFCCs are computed through preemphasis, framing, windowing, the fast Fourier transform, and taking the absolute value or squared magnitude. The result is passed through a triangular band-pass Mel filter bank; the logarithm of each filter's output energy is taken, and a DCT is performed to obtain the dynamic Mel frequency cepstral coefficients [26]. The relationship between the Mel frequency and the linear frequency is given by the following equation:
(2) Linear prediction cepstral coefficients: Combining the two principles of linear prediction and the cepstrum, the all-pole model function is defined as illustrated in the following equation [27]. In formula (20), the prediction coefficients and the prediction order appear. Assuming the impulse response of the original, unpreprocessed music signal and the corresponding system function, the cepstrum is obtained by taking the logarithm of the system function first and then performing the inverse transformation. The calculation process is given by the following equation:
4. Classification of Music Genres
In the deep learning-based music genre classification method, music genre characteristics are extracted by preprocessing the music signals, and the genre classification neural network model is designed according to the fully connected neural network structure. Based on the characteristic sequence of the input music genre, an attention mechanism is investigated, and the classification network of this article is designed using the attention mechanism to realize the classification of music genres.
4.1. Music Signal Preprocessing
Preprocessing the music signal is a very important stage in the music genre classification method: it makes the subsequently extracted features more effective, and less useful signal content and noise can be removed to improve the prediction outcomes and accuracy. The following steps were carried out to preprocess the music signals.
(1) Preemphasis: In order to improve the high-frequency resolution of the music signal [28] and to perform spectrum analysis over the entire frequency band, preemphasis is introduced. Preemphasis is generally implemented with a first-order digital filter before feature parameter extraction. The transfer function of the filter is given by the following equation. In formula (22), the preemphasis factor is, in general, a decimal close to 1. Given the value of a sample of the music genre signal at a particular time, the output after the preemphasis stage is as given by the following equation:
(2) Framing: In order to transition smoothly between two frames of the signal and to ensure that no information is lost, the framing stage needs an overlap of 1/3 to 1/2 of the frame length between adjacent frames. This overlapping fragment is called the frame shift. The theoretical formula for the number of frames in a music signal segment is given by the following equation, in which the total length of the music signal, the frame length, the total number of frames, and the frame shift appear.
(3) Windowing: After framing all music genre segments, in order to increase the continuity between frames, reduce edge effects, and reduce spectrum leakage, it is essential to apply a window to the framed music signal.
The window functions commonly used in audio signal processing include (i) the Hamming window, (ii) the rectangular window, and (iii) the Hanning window, defined as follows. All three window functions have low-pass characteristics, and their main performance is determined by the attenuation of the first side lobe and the width of the main lobe. Since the boundary of the Hamming window is smooth, its first side lobe attenuation is the strongest, which effectively avoids the phenomenon of spectral leakage [29]. Consequently, this paper selects the Hamming window as the window function.
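The preprocessing pipeline of preemphasis, framing, and Hamming windowing can be sketched as follows. This assumes the usual first-order preemphasis form y[n] = x[n] − α·x[n−1] with α close to 1, as the text describes; the frame length and hop are illustrative choices.

```python
import numpy as np

def preemphasis(x, alpha=0.97):
    """First-order preemphasis filter: y[n] = x[n] - alpha * x[n-1], alpha near 1."""
    return np.append(x[0], x[1:] - alpha * x[:-1])

def frame_signal(x, frame_len, hop):
    """Split a signal into overlapping frames and apply a Hamming window to each."""
    n_frames = 1 + (len(x) - frame_len) // hop
    window = np.hamming(frame_len)
    return np.stack([x[i * hop : i * hop + frame_len] * window
                     for i in range(n_frames)])

x = np.random.default_rng(1).standard_normal(4000)
frames = frame_signal(preemphasis(x), frame_len=400, hop=200)  # 1/2-frame overlap
```

With a 400-sample frame and a 200-sample hop, adjacent frames overlap by half a frame, within the 1/3 to 1/2 range the text recommends.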
4.2. Music Feature Extraction
After preprocessing the signal of each music genre, the genre characteristic, namely, the MFCC, is extracted. The specific steps for extracting MFCC parameters from music genre signals are as follows:
(1) Apply the FFT to every preprocessed frame of the music genre signal to acquire its frequency spectrum.
(2) Take the squared modulus of the FFT spectrum computed in the previous step to acquire the discrete power spectrum of each music signal.
(3) Pass the power spectrum through a set of Mel filters for filtering, using the following equation:
(4) Finally, calculate the natural logarithm to acquire the MFCC parameters for every music genre signal, using the following equation:
The frequency range of the music signal spans from a few hertz to several kilohertz, and the mapping between linear and Mel frequency varies slowly. Accordingly, the MFCC parameters extracted from each frame of the music genre signal in this paper are 12-dimensional.
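The Mel filter bank in step (3) is built from frequencies spaced uniformly on the Mel scale. The sketch below uses the common convention mel = 2595·log10(1 + f/700); the paper's own formula is not reproduced in the text, so these constants are an assumption, and the filter count and frequency range are illustrative.

```python
import numpy as np

def hz_to_mel(f):
    """Common Mel-scale mapping: mel = 2595 * log10(1 + f / 700)."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filter_edges(n_filters, f_low, f_high):
    """Edge/center frequencies of a triangular filter bank, equally spaced in mel."""
    mels = np.linspace(hz_to_mel(f_low), hz_to_mel(f_high), n_filters + 2)
    return mel_to_hz(mels)

edges = mel_filter_edges(12, 0.0, 4000.0)  # 12 filters, matching the 12-dim MFCCs
```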
4.3. Design of Network Model for Music Genre Classification
The neural network learning process is listed in Figure 5(a). According to the neural network structure, the design and research of music classification model is shown in Figure 5(b) [15, 16].

The input layer processes the music signal through preemphasis, framing, and windowing and extracts the music genre features. The features of the extracted genre sequence are then learned: the influence on the current time state is calculated from both the future and the past directions. The resulting feature representation, combined with contextual semantic information, is fed into the attention mechanism network. The attention network learns from the input feature representation and produces the corresponding attention probability distribution [14]. It then multiplies each attention probability by its corresponding feature vector and finally obtains the music genre feature vector representation. The attention process is given as follows:
The formula above assigns an attention score to the feature vector at each time step of the feature representation. In the next phase, the softmax activation function is applied, as given by equation (28), to turn these scores into attention probabilities, as given by the following equation:
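The score-softmax-weighted-sum pattern described above can be sketched as follows. This is a generic dot-product attention pooling illustration with our own variable names and a random query vector; it is not the paper's exact scoring function, which is not fully reproduced in the text.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def attention_pool(features, query):
    """Score each time step, softmax the scores, return the weighted feature sum."""
    scores = features @ query      # attention score per time step
    alpha = softmax(scores)        # attention probability distribution
    return alpha @ features, alpha # probability-weighted combination of features

rng = np.random.default_rng(2)
H = rng.standard_normal((6, 4))    # 6 time steps, 4-dimensional features
q = rng.standard_normal(4)         # query vector (assumed, for illustration)
pooled, alpha = attention_pool(H, q)
```

The pooled vector plays the role of the genre feature representation that is passed on to the output layer.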
The output layer of the network model is defined as follows by calculating the cross-entropy loss function:
In the above formula, the loss is averaged over the number of samples: for each input sample, the output predicted value of the network model is compared against the target expected value. The classification of music genres is then calculated using the following equation:
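The cross-entropy loss computed at the output layer can be sketched as follows (a minimal version for one-hot targets; the clipping constant is a standard numerical safeguard, not part of the paper's formula):

```python
import numpy as np

def cross_entropy(y_pred, y_true, eps=1e-12):
    """Mean cross-entropy between predicted probabilities and one-hot targets."""
    y_pred = np.clip(y_pred, eps, 1.0)  # avoid log(0)
    return float(-np.mean(np.sum(y_true * np.log(y_pred), axis=1)))

# A confident correct prediction should incur lower loss than a wrong one.
good = np.array([[0.9, 0.05, 0.05]])
bad = np.array([[0.1, 0.8, 0.1]])
target = np.array([[1.0, 0.0, 0.0]])
```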
Through the steps described above, the classification of music genres is realized.
5. Experimental Analysis
5.1. Experimental Environment and Datasets
In order to verify the effectiveness of the deep learning-based music genre classification method, MATLAB 2016a was used to extract the features of the music signals, and a fully connected neural network was built in Python on top of the Theano library. Model training uses the Adam optimization method as the gradient descent optimization algorithm; the learning rate is set to 0.001, and training runs for 200 rounds. All experiments are carried out and verified on the GTZAN dataset, which contains a total of 1000 audio files covering 10 music genres, with 100 samples per genre. The experiments were carried out several times, and the reported results are averaged over multiple runs. Nonrepetitive random sampling is adopted, and 80% of each music genre dataset is selected. The distribution of the number of music genres in each category of the training set and validation set is shown in Table 1.
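The Adam update used for training (first- and second-moment estimates of the gradient with bias correction, giving an adaptive per-parameter step) can be sketched as follows. The hyperparameter defaults are the common Adam conventions; the toy quadratic objective is ours, and it uses a larger learning rate than the paper's 0.001 so the example converges in a few hundred steps.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update with bias-corrected first and second moment estimates."""
    m = beta1 * m + (1 - beta1) * grad        # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * grad ** 2   # second-moment (variance) estimate
    m_hat = m / (1 - beta1 ** t)              # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy check: minimize f(theta) = theta^2, whose gradient is 2 * theta.
theta = np.array([5.0])
m = v = np.zeros(1)
for t in range(1, 501):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t)
```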
5.2. Classification Evaluation Index
We performed the music genre classification experiments on five different music genre classes: rock, metal, country, classical, and blues. This is a multiclass task, and the categories are relatively balanced. The accuracy over the sample population is expressed as follows:

In the above formula, the denominator is the total number of samples in the population.
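The accuracy measure is simply the fraction of correctly labeled samples; a minimal sketch (the example labels are invented for illustration):

```python
def accuracy(y_pred, y_true):
    """Fraction of samples whose predicted label matches the true label."""
    correct = sum(p == t for p, t in zip(y_pred, y_true))
    return correct / len(y_true)

# 3 of 4 hypothetical predictions are correct.
acc = accuracy(["rock", "metal", "blues", "rock"],
               ["rock", "metal", "blues", "country"])
```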
5.3. Music Genre Classification Effect
After the music genre classification network model is trained with the proposed method, its classification performance is evaluated on the validation set. The results and the predicted confusion matrix for the 5 classes are shown in Table 2.
Analyzing the results in Table 2, we conclude that metal, classical, and blues music are all classified into their correct categories, with accuracy rates of 94.94 percent, 92.50 percent, and 95.00 percent, respectively. Rock and country music, however, are sometimes mislabeled: some country music can serve as an accompaniment to country dancing, and some rock music is mistakenly categorized as country music, so the two genres are often confused. The distinction between rock and metal is also somewhat error-prone, possibly because both emphasize rhythm and are similar in style. In general, the proposed method effectively classifies the music of the above five genres and achieves good genre classification performance.
The total number of neurons in the BP neural network has a significant impact on the training and test errors. As shown in Table 3, the training error continues to decrease as the number of neurons increases, indicating a clear correlation between them. After the analysis, we concluded that 7 neurons is the optimal setting for our experimental setup.
5.4. Classification Accuracy of Music Genres
The evaluation outcomes and a comparative study of the classification accuracy of various music genre classification approaches are presented in Figure 6.

We can observe from Figure 6 that, under different validation sets, the classification accuracy of the method in [4] is 73% and that of the method in [30] is 82%, while the average music genre classification accuracy of the proposed method is 91%. Compared with the methods in [4] and [30], the accuracy of the proposed music genre classification method is therefore significantly higher.
5.5. Music Genre Classification Time
The evaluation results in terms of classification time, comparing the proposed approach with other music genre classification techniques, are presented in Figure 7.

We can observe from Figure 7 that, as the number of verification sets increases, the classification time of all techniques also increases. The deep learning-based technique proposed in this paper nevertheless offers benefits in terms of the accuracy, precision, and efficiency of music classification.
6. Conclusions and Future Work
In this paper, a prediction method based on a deep learning algorithm was proposed, which improves the correctness, precision, and efficiency of music classification. The experimental outcomes demonstrated that the proposed method can effectively improve the accuracy of music classification and is helpful for music channel classification. Its genre classification accuracy is high, and it effectively shortens the genre classification time, yielding a better overall classification effect. However, because the scope of this research does not extend to the subject of finite elements, the proposed method has some limitations. In particular, in the process of extracting music genre features, this paper ignores the accompaniment information of the music: the main melody of the same piece, accompanied by different music, may present different genres and styles.
In subsequent research, we will consider combining the main melody and accompaniment of the music when extracting features to further improve classification accuracy. Moreover, more advanced deep learning models should be considered to improve the accuracy of the predictions. In learning algorithms, training is one of the activities that takes significant time and can degrade the performance of the whole system. Therefore, we will consider dividing the training and prediction phases over an edge-cloud architecture, so that training happens in the remote cloud, which usually has abundant resources, while the prediction part runs on the edge, which will essentially reduce the processing and response time of the system.
Data Availability
The data used to support the findings of this study are available from the author upon request.
Conflicts of Interest
The author declares that he has no conflicts of interest.