Abstract

With the rapid development of Internet information technology, e-commerce platforms in the era of the network economy have changed greatly, triggering a shift in marketing models. Innovative research on marketing models can help small and medium-sized e-commerce companies transform and develop, and therefore has both practical significance and theoretical value. Sales prediction is a key part of evaluating innovative marketing models, because a reasonable marketing plan depends on an accurate prediction of future sales. Therefore, a big data-driven e-commerce sales forecasting method is proposed. First, for 1703 real e-commerce companies, a large amount of data that affects sales is selected, including sales records, product information, and product reviews. A knowledge graph is then used to preprocess the data samples to produce a sample set containing concepts, entities, and relationships. Next, the knowledge graph K-modes clustering model is established. By fixing the membership matrix and the cluster centre matrix in turn, the objective function is repeatedly minimised to obtain the cluster centres. Finally, sales prediction is achieved based on the clustering results. The experimental results show that the proposed clustering model obtains better cluster purity, NMI, and F-value than the original K-modes algorithm. The proposed model also reaches a sales prediction accuracy of about 0.94 and offers a reference for e-commerce enterprises of different scales when formulating innovative marketing models.

1. Introduction

Of all the innovative modes of business management, the innovation of the marketing model is among the most important. Finding the right marketing model is fundamental to a company’s survival. Most scholars see the marketing model as a synergistic system consisting of several elements. The ultimate goal of marketing is to gain competitive advantage and achieve profitability. The marketing system therefore has to specify a holistic solution to achieve this goal [1–4]. The information technology revolution has provided space and impetus for marketing model innovation. In the current era of the network economy, network and mobile communication technologies are changing the way people produce and live in an all-round way. Marketing model innovation must therefore be coupled with information technology to be effective. The Internet economy is developing rapidly, and big data-driven innovation models have become a popular research direction [5–8]. Many companies now use big data technology for knowledge mining. At the same time, off-site resources are fully shared through the network, which shortens distances in time and space. Real-time knowledge mining driven by big data keeps innovation continuous and uninterrupted.

The China Internet Information Network Centre (CIINC) has published the 41st Statistical Report on the Development of the Internet in China [9–11]. According to this report, by December 2020 the number of online consumers in China had reached 772 million, with 40.74 million new consumers. Compared to 2019, the number of online consumers increased by 2.6%. The Internet penetration rate reached 55.8%, as shown in Figure 1. According to the China Electronic Commerce Research Centre, e-tailing transactions reached 3.1 trillion yuan in 2020, up 34.8% year on year from 2.3 trillion yuan in 2019, as shown in Figure 2.

In recent years, the number of small e-commerce companies engaged in marketing via the Internet has been increasing year on year. E-commerce platforms such as Alibaba, Amazon, and eBay are home to a huge number of small e-commerce companies. These small e-commerce companies, like traditional businesses, face the problem of marketing model innovation. Among all the innovations, the innovation of the marketing model is fundamental to the business. There are three broad categories of marketing models [12, 13]: value creation models, ecosystem models, and profitability models. The three types of theories discuss the connotation of the marketing model from different perspectives. A marketing model is an ecosystem consisting of many elements. This study considers the core element of the marketing model to be the profitability model. Profitability is a necessary condition for the existence of a market player. A business cannot sustain its basic survival without profit. For the many small e-commerce companies on e-commerce platforms, profit model innovation is the main concern.

The inadequate information management systems of small and medium-sized e-commerce businesses make it more difficult for them to innovate their profit models. In addition, small and medium-sized e-commerce enterprises face huge risks in the process of marketing reform because they cannot afford the larger capital costs. As a result, small and medium-sized e-commerce enterprises can only rely on forecast data of future sales to monitor their own business risks in real time [14, 15]. Sales forecasting can greatly improve the flow of capital for small and medium-sized e-commerce enterprises, allowing them to use these funds to develop broader sales channels or to cope with turbulent changes in the market environment. The accuracy of sales forecasting is therefore relevant to all aspects of an e-commerce business and will have a direct impact on its profit and loss, its next steps in financing, and its survival.

There are many models for sales forecasting, such as linear regression models, time series exponential smoothing forecasting models, logistic regression models, convolutional neural networks, and clustering mining [16–18]. In the preparation stage of each of these forecasting models, a sufficient amount of historical data has to be collected in order to predict the results with high accuracy. Therefore, these prediction models are all big data-driven models. Clustering analysis is an important application technique in data mining and is widely used in practical problems. Due to the development of information technology, consumer business data have become very large. With the continuous accumulation of data and the emergence of new business behaviours, it becomes difficult to classify data based on a priori experience. Therefore, Erenko et al. [19] proposed the application of cluster analysis to business marketing problems. Kingsland et al. [20] proposed the use of cluster analysis for data mining of consumer behaviour to give valuable guidance for future business decisions. Elizabeth et al. [21] proposed a cluster analysis-based model for predicting sales risk on e-commerce platforms, in which the types of e-commerce financial risks and the causes of their formation are analysed from the perspective of microfinance companies.

Overall, clustering analysis can uncover new patterns in large amounts of data, which makes it a powerful tool for data processing problems in the marketing field. By applying clustering algorithms, new patterns can be obtained that are not influenced by previous experience, thus allowing a more comprehensive exploitation of the information contained in the data. One of the most widely applied clustering methods is the classical K-means algorithm [22]. However, the K-means algorithm is only applicable to datasets with numerical (ordered) attributes, because it relies on cluster means. To solve the clustering problem for unordered categorical attributes, Huang et al. [23] proposed the K-modes algorithm based on the classical K-means algorithm. Under the framework of the original K-modes algorithm, Saha et al. [24] proposed the genetic fuzzy K-modes algorithm, which improved the accuracy of data mining to a certain extent.

The aim of this study is to use the K-modes algorithm to mine the historical data of small and medium-sized e-commerce businesses to achieve an accurate forecast of future sales, thus providing data support for e-commerce businesses of different sizes when developing innovative marketing models. In the sales forecast of goods, the demand for products in the e-commerce industry is unstable as people’s hobbies, consumption habits, and other factors are changing all the time. This phenomenon leads to no clear pattern in sales volume trends. In addition, sales are influenced by various external social factors such as climate and fashion trends. These changes can often affect the predicted outcome of a product, so it is essential to extract the key attributes of the data before clustering analysis.

The main innovations and contributions of this paper include the following:
(1) The K-modes algorithm in cluster analysis is introduced into the field of e-commerce marketing forecasting research, thus providing data support for e-commerce enterprises of different sizes in developing innovative marketing models.
(2) To further improve the clustering accuracy, the knowledge graph technique [25, 26] is used to extract the key attributes of the historical sales data before the data are clustered using the K-modes algorithm. After the knowledge graph analysis, the K-modes algorithm obtains a higher clustering accuracy.

The rest of the paper is organized as follows. In Section 2, the knowledge graph is studied in detail, while Section 3 provides the proposed knowledge graph K-modes clustering algorithm. In Section 4, the KGK-modes based e-commerce sales volume prediction model is studied in detail, while Section 5 provides experimental results and analysis. Finally, the paper is concluded in Section 6.

2. Knowledge Graph

For the marketing model of the e-commerce industry, accurate clustering of relevant sales data samples is more difficult due to the variability of the merchandise. This is because people’s preferences, consumption habits, and other factors change all the time when forecasting sales of products. This phenomenon leads to no clear pattern in sales volume trends. Sales are also influenced by external social factors such as weather, trends, and so on. These changes ultimately lead to highly unstable forecasting results for commodities. In addition, if the key attributes of the sample are not extracted correctly, this may lead to incorrect clustering results. The above analysis shows that if the core attributes contained in the data sample can be identified, the scope of the clustering can be further reduced, thus effectively improving the accuracy of the clustering. Therefore, before using the K-modes algorithm to cluster the historical sales data, this paper uses the knowledge graph technique to extract the key attributes of the historical sales data.

The knowledge graph uses quadruples to represent knowledge [27], mainly containing concept, entity, relation, and attribute. The structure of the knowledge graph is shown in Figure 3.

Suppose the set of all knowledge elements in knowledge domain d is $K_d$:

$$K_d = \{k_1, k_2, \ldots, k_n\},$$

where $k_i$ represents the i-th knowledge element. Each knowledge element contains the concept knowledge $c_i$, entity knowledge $e_i$, relation knowledge $r_i$, and attribute knowledge $a_i$.

The sets of concepts, entities, and relationships within knowledge domain d are denoted as $C_d$, $E_d$, and $R_d$, respectively:

$$C_d = \{c_1, \ldots, c_{n_c}\}, \quad E_d = \{e_1, \ldots, e_{n_e}\}, \quad R_d = \{r_1, \ldots, r_{n_r}\},$$

where $n_c$, $n_e$, and $n_r$ represent the total number of concepts, entities, and relations, respectively.

The set of attributes corresponding to each concept is $A_c$:

$$A_c = \{a_1, a_2, \ldots, a_{n_a}\},$$

where $n_a$ indicates the total number of attributes.

First, the complex commodity data are classified into knowledge sets. Next, knowledge unit parsing is performed. Finally, the knowledge elements and graphs contained in the knowledge units are extracted [28]. The knowledge graph is obtained by layer-by-layer analysis, where the scale structure of knowledge is shown in Figure 4.
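The quadruple representation described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `KnowledgeElement`, the helper `build_sets`, and the sample records are hypothetical and are not taken from the original study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KnowledgeElement:
    """One knowledge element: (concept, entity, relation, attribute)."""
    concept: str
    entity: str
    relation: str
    attribute: str


def build_sets(elements):
    """Collect the concept, entity and relation sets of a knowledge domain,
    plus the set of attributes attached to each concept."""
    concepts = {e.concept for e in elements}
    entities = {e.entity for e in elements}
    relations = {e.relation for e in elements}
    attrs_by_concept = {}
    for e in elements:
        attrs_by_concept.setdefault(e.concept, set()).add(e.attribute)
    return concepts, entities, relations, attrs_by_concept


# Hypothetical records, for illustration only.
elements = [
    KnowledgeElement("product", "sku_001", "sold_by", "price"),
    KnowledgeElement("product", "sku_001", "reviewed_in", "rating"),
    KnowledgeElement("shop", "shop_17", "sells", "category"),
]
concepts, entities, relations, attrs_by_concept = build_sets(elements)
print(sorted(concepts), sorted(attrs_by_concept["product"]))
```

Grouping attributes by concept mirrors the per-concept attribute set defined above; a real knowledge graph would normally be stored in a graph database rather than in in-memory sets.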

3. The Proposed Knowledge Graph K-Modes Clustering Algorithm

3.1. Principle of the K-Modes Algorithm

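The K-modes algorithm is a partitional clustering algorithm used to cluster data with categorical attributes [29–31]. The basic idea of the K-modes algorithm is the same as that of the classical K-means algorithm, but it introduces a different distance metric and a different centroid selection method. Suppose the set of sample points with categorical attributes is $X = \{x_1, x_2, \ldots, x_n\}$. Each sample point $x_i$ contains $m$ attributes. The samples are divided into $k$ clusters $C_1, C_2, \ldots, C_k$; then, the minimisation objective function of K-modes clustering is

$$F(W, Z) = \sum_{l=1}^{k} \sum_{i=1}^{n} w_{il}\, d(x_i, z_l), \qquad w_{il} \in \{0, 1\}, \quad \sum_{l=1}^{k} w_{il} = 1,$$

where $W = [w_{il}]$ is the $n \times k$ binary membership matrix and $Z = \{z_1, z_2, \ldots, z_k\}$ is a matrix containing the $k$ centroids [32].

$d(x_i, z_l)$ is the distance from a sample point to a centroid. In the K-modes algorithm, the distance is calculated as

$$d(x_i, z_l) = \sum_{j=1}^{m} \delta(x_{ij}, z_{lj}), \qquad \delta(x_{ij}, z_{lj}) = \begin{cases} 0, & x_{ij} = z_{lj}, \\ 1, & x_{ij} \neq z_{lj}. \end{cases}$$

The minimisation of the objective function is transformed into two subproblems. Let $\hat{W}$ and $\hat{Z}$ be the current optimal solutions. With $Z = \hat{Z}$ fixed, $F(W, \hat{Z})$ is minimised over $W$; with $W = \hat{W}$ fixed, $F(\hat{W}, Z)$ is minimised over $Z$. After each solution, the pair ($\hat{W}$ and $\hat{Z}$) is updated and the result of this update is saved to the database in order to continue solving for the minimum of $F(W, Z)$.

When $W = \hat{W}$ is fixed, set $z_{lj}$ to the most frequent attribute value of the j-th component in the cluster $C_l$:

$$z_{lj} = \arg\max_{a} \left| \{ x_i \in C_l : x_{ij} = a \} \right|,$$

where $a$ ranges over the attribute values of the j-th component in the cluster $C_l$. Solve for the distance values in turn until the minimum value of $F(W, Z)$ is found, thus obtaining the centroids and the class assigned to each centroid.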
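The two alternating subproblems can be illustrated with a minimal Python sketch. It assumes the standard K-modes formulation (simple matching distance, per-attribute most frequent value as the centre); the function names and the toy samples are hypothetical and do not come from the original experiments.

```python
from collections import Counter


def matching_distance(x, z):
    """Simple matching (Hamming) dissimilarity: number of differing attributes."""
    return sum(a != b for a, b in zip(x, z))


def assign(samples, modes):
    """Fix the centres and assign each sample to its nearest centre
    (this is the update of the membership matrix)."""
    return [min(range(len(modes)), key=lambda l: matching_distance(x, modes[l]))
            for x in samples]


def update_modes(samples, labels, modes):
    """Fix the memberships and set each centre to the most frequent value
    of every attribute within its cluster."""
    new_modes = []
    for l, old in enumerate(modes):
        members = [x for x, lab in zip(samples, labels) if lab == l]
        if not members:  # keep the previous centre for an empty cluster
            new_modes.append(old)
            continue
        new_modes.append(tuple(Counter(col).most_common(1)[0][0]
                               for col in zip(*members)))
    return new_modes


def objective(samples, labels, modes):
    """Sum of distances between each sample and the centre of its cluster."""
    return sum(matching_distance(x, modes[lab]) for x, lab in zip(samples, labels))


# Toy categorical samples, for illustration only.
samples = [("a", "x"), ("a", "y"), ("b", "z"), ("b", "z")]
modes = [("a", "x"), ("b", "z")]
labels = assign(samples, modes)
modes = update_modes(samples, labels, modes)
print(labels, modes, objective(samples, labels, modes))
```

Each call to `assign` or `update_modes` can only keep the objective value the same or lower it, which is why alternating the two steps converges.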

The original K-modes algorithm takes a simple matching similarity measure based on the Hamming distance [33]. When the values of an attribute of two sample points are equal, their similarity on that attribute is 1; otherwise, the similarity is 0. This distance metric method may randomly assign less similar objects when assigning sample points, resulting in weaker intra-cluster similarity [34]. Thus, the original K-modes algorithm has limitations when mining sales-related data of small and medium-sized e-commerce companies. This is because there is no obvious pattern in the trend of sales volume changes. In addition, the sales volume of a product is influenced by various external social factors, such as climate, fashion trends, and so on. This problem causes the original K-modes algorithm to rely heavily on a priori experience. Therefore, in order to further improve the clustering accuracy, this paper uses the knowledge graph technique to extract the key attributes of historical sales data before using the K-modes algorithm for clustering analysis.

3.2. The Proposed Knowledge Graph K-Modes (KGK-Modes) Algorithm

First, the proposed KGK-modes algorithm analyses the data related to e-commerce sales and generates a new sample set containing the knowledge graph quadruples. K-modes clustering is then performed on the new sample set. The number of cluster centres k is determined, and k samples are randomly selected from the sample set as cluster centres, thus forming the initial set of cluster centres $Z$. A suitable membership matrix $W$ is then found so that the objective function $F(W, Z)$ is minimised. During the updating process, updating stops if the stopping condition is satisfied; otherwise, $W$ and $Z$ continue to be updated. The steps of the KGK-modes algorithm implementation are shown in Figure 5.
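The overall procedure can be sketched as follows. This is a hedged illustration, not the authors' implementation: the knowledge graph step is reduced to a hypothetical `select_key_attributes` helper that simply keeps a given list of key attributes, the stopping rule (stop when the objective no longer decreases) is an assumption, and the records are invented.

```python
import random
from collections import Counter


def dist(x, z):
    """Simple matching distance between two categorical tuples."""
    return sum(a != b for a, b in zip(x, z))


def select_key_attributes(records, key_attrs):
    """Stand-in for the knowledge graph step: keep only the key attributes."""
    return [tuple(r[a] for a in key_attrs) for r in records]


def kmodes(samples, k, max_iter=100, seed=0):
    """Alternate assignment and centre updates until the cost stops decreasing
    (assumed stopping rule)."""
    rng = random.Random(seed)
    modes = rng.sample(samples, k)  # k random samples as initial centres
    prev_cost = None
    for _ in range(max_iter):
        labels = [min(range(k), key=lambda l: dist(x, modes[l])) for x in samples]
        for l in range(k):
            members = [x for x, lab in zip(samples, labels) if lab == l]
            if members:
                modes[l] = tuple(Counter(col).most_common(1)[0][0]
                                 for col in zip(*members))
        cost = sum(dist(x, modes[lab]) for x, lab in zip(samples, labels))
        if prev_cost is not None and cost >= prev_cost:
            break
        prev_cost = cost
    return labels, modes, cost


# Hypothetical shop records, for illustration only.
records = [
    {"category": "clothing", "rating": "high", "season": "summer"},
    {"category": "clothing", "rating": "high", "season": "winter"},
    {"category": "food", "rating": "low", "season": "summer"},
    {"category": "food", "rating": "low", "season": "winter"},
    {"category": "food", "rating": "high", "season": "summer"},
]
samples = select_key_attributes(records, ["category", "rating"])
labels, modes, cost = kmodes(samples, k=2)
print(labels, modes, cost)
```

Because the initial centres are drawn at random, different seeds can produce different partitions; running the procedure several times and keeping the lowest cost is a common safeguard.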

4. Big Data-Driven KGK-Modes-Based E-Commerce Sales Volume Prediction Model

4.1. Source of Data and Preprocessing

The data for this study are the daily order volume, sales, number of customers, and number of reviews of 2,700 e-commerce companies on the Taobao website (https://www.taobao.com/). The time frame is from August 3, 2019, to April 30, 2020. Because the data had to be collected in a way that protects customer privacy, the dataset was filtered using desensitisation rules. As a result, the data deviate from the real e-commerce business data. However, this deviation does not affect the exploration and research of this solution. First, the three required tables (order, review, and product information) were merged into one table in the SQL database, keyed on the e-commerce number. Then, the data were cleaned. Missing values in a field were replaced by the median or mean value of the column. E-commerce numbers with only one or two items were removed directly. After this hierarchical screening, 1703 e-commerce companies with at least seven types of products were finally selected.
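The merge, imputation, and filtering steps can be sketched with the Python standard library. The sketch below is an assumption-laden illustration: the table layout (dictionaries keyed by e-commerce number), the column `n_product_types`, and the use of the median alone for imputation are hypothetical choices, not details confirmed by the study.

```python
from statistics import median


def merge_tables(orders, reviews, products):
    """Join three per-shop tables (dicts keyed by shop id) into one record per shop."""
    shop_ids = set(orders) & set(reviews) & set(products)
    return {sid: {**orders[sid], **reviews[sid], **products[sid]}
            for sid in sorted(shop_ids)}


def fill_missing_with_median(table, column):
    """Replace None in a column with the median of the known values."""
    known = [row[column] for row in table.values() if row[column] is not None]
    fill = median(known)
    for row in table.values():
        if row[column] is None:
            row[column] = fill


def keep_shops_with_enough_products(table, min_types=7):
    """Keep only shops that sell at least `min_types` product types."""
    return {sid: row for sid, row in table.items()
            if row["n_product_types"] >= min_types}


# Hypothetical toy tables, for illustration only.
orders = {1: {"ord_cnt": 10}, 2: {"ord_cnt": None}, 3: {"ord_cnt": 30}}
reviews = {1: {"good_num": 5}, 2: {"good_num": 7}, 3: {"good_num": 9}}
products = {1: {"n_product_types": 12}, 2: {"n_product_types": 8},
            3: {"n_product_types": 2}}

merged = merge_tables(orders, reviews, products)
fill_missing_with_median(merged, "ord_cnt")
kept = keep_shops_with_enough_products(merged)
print(kept)
```

The median is less sensitive to extreme values than the mean, which matters for skewed quantities such as order counts; the mean could be substituted for roughly symmetric columns.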

Missing, duplicate, and incorrect data in the datasets related to e-commerce sales volume were corrected using statistical methods. Records with the same attribute values in the database are called duplicate records. Two records or two variables that are identical are combined into one record. Data merged from multiple data tables may have semantic conflicts. Inconsistent data can be transformed into consistent data by analysing the links between the data.

4.2. Sales Forecasting Using KGK-Modes Clustering

In this paper, the proposed KGK-modes algorithm is used to cluster the data of 1703 e-commerce companies, so as to complete the sales forecast. Eight variables are first set in the data of these e-commerce companies: sale_amt (sales), offer_amt (offer amount), offer_cnt (number of offers), rtn_cnt (number of returned orders), rtn_amt (amount of returned orders), ord_cnt (number of orders), bad_num (number of negative comments), and good_num (number of positive comments). Based on these 8 variables, the total data of the 1703 e-commerce companies were calculated. The purpose of clustering with all variables is to group e-commerce companies of the same type together, so that sales can be forecast for each type. The total daily data of the e-commerce companies are shown in Table 1.

Before clustering, we first checked the data by computing descriptive statistics [35], such as the mean and standard deviation. The mean and variance were found to be very small, so the data were not standardized.
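A data check of this kind can be sketched as below. The helper names and the toy values are hypothetical; the z-score function is included only to show what standardization would look like if the check had indicated that it was needed.

```python
from statistics import mean, pstdev


def describe(rows, columns):
    """Return (mean, population standard deviation) for each column."""
    stats = {}
    for c in columns:
        values = [r[c] for r in rows]
        stats[c] = (mean(values), pstdev(values))
    return stats


def standardise(values):
    """Z-score standardization; returns zeros for a constant column."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s if s else 0.0 for v in values]


# Hypothetical values, for illustration only.
rows = [{"sale_amt": 1.0, "ord_cnt": 2.0}, {"sale_amt": 3.0, "ord_cnt": 2.0}]
stats = describe(rows, ["sale_amt", "ord_cnt"])
print(stats)
print(standardise([1.0, 3.0]))
```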

After the 1703 e-commerce companies were analysed with the KGK-modes clustering algorithm, they could be roughly divided into three categories. The first category consists of the e-commerce companies with the largest variety of goods and a favorable rate of over 99%. In the first category, each company has more than 30 kinds of commodities. The number of first-category e-commerce companies is 39, accounting for 2.3% of the total. The second category includes e-commerce companies with 10∼30 commodity types and a 93%∼98% favorable rate. The number of second-category e-commerce companies is 1,129, accounting for 66.3% of the total. The third category includes e-commerce companies with fewer than 10 kinds of goods and a favorable rate of less than 93%. The number of third-category e-commerce companies is 535, accounting for 31.4% of the total.
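The category shares quoted above follow directly from the cluster sizes, as this short check shows (the dictionary keys are illustrative labels only):

```python
# Cluster sizes reported in the text.
counts = {"category 1": 39, "category 2": 1129, "category 3": 535}

total = sum(counts.values())
shares = {name: round(100 * n / total, 1) for name, n in counts.items()}
print(total, shares)
```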

5. Experimental Results and Analysis

In order to verify the performance of KGK-modes in e-commerce sales forecasting, standard dataset tests and real case tests were conducted. The experimental hardware environment was a desktop computer with a 64-bit Windows 10 operating system, Intel 7 CPU, 8 GB RAM, and GTX3060 graphics card, and the software used for the experiments was MATLAB R2018b. Firstly, the effect of the knowledge graph on K-modes clustering was verified on commonly used machine learning datasets. Secondly, the clustering performance of commonly used clustering algorithms was compared with that of the proposed KGK-modes algorithm. Finally, the effectiveness of the proposed KGK-modes algorithm was analysed on a dataset of historical sales of 1703 e-commerce companies. The main clustering evaluation metrics [36] were purity, normalised mutual information (NMI), and F-value (F). The commonly used machine learning data come from the public UCI datasets and the Sogou Lab news dataset, which are shown in Tables 2 and 3, respectively.
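For reference, the three evaluation metrics can be computed as in the sketch below. The exact definitions used in the experiments are not given in the text, so this assumes common ones: purity as the fraction of samples in the majority class of their cluster, NMI normalised by the geometric mean of the two entropies, and the class-weighted best-match F-measure. The function names are hypothetical.

```python
from collections import Counter
from math import log, sqrt


def purity(pred, true):
    """Fraction of samples that belong to the majority class of their cluster."""
    best = Counter()
    for (c, _t), n in Counter(zip(pred, true)).items():
        best[c] = max(best[c], n)
    return sum(best.values()) / len(true)


def nmi(pred, true):
    """Mutual information divided by sqrt(H(pred) * H(true)) (assumed normalisation)."""
    n = len(true)
    pc, pt, joint = Counter(pred), Counter(true), Counter(zip(pred, true))
    mi = sum(v / n * log(v * n / (pc[c] * pt[t])) for (c, t), v in joint.items())
    hc = -sum(v / n * log(v / n) for v in pc.values())
    ht = -sum(v / n * log(v / n) for v in pt.values())
    return mi / sqrt(hc * ht) if hc > 0 and ht > 0 else 0.0


def f_value(pred, true):
    """Class-size-weighted best F-measure over clusters (assumed definition)."""
    n = len(true)
    pc, pt, joint = Counter(pred), Counter(true), Counter(zip(pred, true))
    total = 0.0
    for t, nt in pt.items():
        best = 0.0
        for c, ncl in pc.items():
            hit = joint[(c, t)]
            if hit:
                p, r = hit / ncl, hit / nt
                best = max(best, 2 * p * r / (p + r))
        total += nt / n * best
    return total


true = ["a", "a", "b", "b"]
print(purity([0, 0, 1, 1], true), nmi([0, 0, 1, 1], true), f_value([0, 0, 1, 1], true))
print(purity([0, 0, 0, 0], true), nmi([0, 0, 0, 0], true), f_value([0, 0, 0, 0], true))
```

All three metrics lie in [0, 1], with higher values indicating closer agreement between the clustering and the reference classes.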

5.1. The Influence of the Knowledge Graph on K-Modes
5.1.1. Clustering Performance on the UCI Dataset

To verify the effect of the knowledge graph on K-modes, the UCI dataset was tested using the K-modes and KGK-modes algorithms, respectively, and the results are shown in Table 4.

It can be seen that the KGK-modes algorithm performs better on all four datasets. Comparing across datasets, the K-modes algorithm has its highest clustering purity, 0.8061, on the Seeds dataset and its lowest, 0.7662, on the Flowers dataset. Both algorithms obtained their best performance on the Seeds dataset and their worst clustering on the Flowers dataset. In terms of NMI and F, the K-modes algorithm also showed better clustering performance after the knowledge graph analysis. This is because, after the knowledge graph analysis, the data samples are accurately delineated in terms of concepts, entities, and relationships. This delineation helps to determine the sample categories to a certain extent, thus reducing the difficulty of the subsequent K-modes clustering.

KGK-modes showed good performance in terms of purity, NMI, and F-value in clustering the four UCI datasets. To analyse the stability of clustering purity, the RMSE of the clustering purity [37] was tested. A random sample of 1000 from the UCI dataset was tested for clustering, and the results are shown in Figure 6.

The RMSE of both the K-modes and KGK-modes algorithms gradually decreased as the clustering time increased. In comparison, it was found that the RMSE of the clustering purity obtained by KGK-modes decreased more rapidly. Eventually, the RMSE of KGK-modes converges to 0.5, while that of K-modes converges to 0.75.

In addition, the clustering times of the K-modes algorithm and the KGK-modes algorithm were further compared on the UCI dataset, and the statistical results are shown in Table 5.

It can be seen that the clustering time of the K-modes algorithm and KGK-modes algorithm on the UCI dataset is directly related to the sample size. The Iris dataset, with the largest sample size, required the longest clustering time, while the Wine dataset, with the smallest sample size, required the shortest. The comparison revealed that the knowledge graph analysis consumed some time, and therefore the clustering time for KGK-modes was longer than that for K-modes, but the difference between the two was small.

5.1.2. Clustering Performance on News Datasets

To further validate the effect of the knowledge graph on the K-modes algorithm, the news dataset was tested using the K-modes and KGK-modes algorithms, respectively, and the results are shown in Table 6.

It can be seen that as with the UCI dataset, after the knowledge graph analysis, the K-modes algorithm showed better NMI and F performance. The analysis of the stability of the clustering purity on the news dataset is given in the following, and the results are shown in Figure 7.

It can be seen that the RMSE of the K-modes and KGK-modes algorithms decreases significantly as the number of clustering iterations increases. The RMSE of the clustering purity obtained by KGK-modes decreases faster than that of the K-modes algorithm and eventually converges to about 0.4, while that of K-modes converges to about 0.5.

Next, a comparison of clustering time performance was performed. The clustering times of the K-modes and KGK-modes algorithms on the four news datasets are shown in Table 7.

For news sets with the same sample size, the KGK-modes clustering time was slightly longer than the K-modes clustering time. A comprehensive analysis of the above results shows that the KGK-modes algorithm performs better on the news dataset than on the UCI dataset, mainly because the news dataset has significantly fewer feature dimensions than the UCI dataset, and therefore the data clustering effect is more significant.

5.2. Sales Forecast

To verify the effectiveness of the KGK-modes algorithm in e-commerce sales prediction, the sales-related datasets (containing 8 variables) of 1703 e-commerce businesses were tested using the K-means, K-medoids, CNN [38], and KGK-modes algorithms, respectively. 90% of the dataset was used as the training set, and the remaining 10% was used as the test set. The eight variables were used as inputs, and each of the four prediction models was fitted in turn. The loss function was the mean squared error, the learning rate was 0.01, the maximum number of iterations was 1000, and the resampling rate was 50%.
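The 90/10 split and the mean squared error loss can be sketched as follows. The shuffling seed and helper names are hypothetical, and integer indices stand in for the 1703 company records.

```python
import random


def train_test_split(items, test_ratio=0.1, seed=0):
    """Shuffle a copy of the items and hold out `test_ratio` of them for testing."""
    rng = random.Random(seed)
    shuffled = items[:]
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]


def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between true and predicted values."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)


shops = list(range(1703))  # placeholder for the 1703 company records
train, test = train_test_split(shops)
print(len(train), len(test))
print(mean_squared_error([1.0, 2.0], [1.0, 4.0]))
```

With 1703 records, a 10% hold-out rounds to 170 test records and leaves 1533 for training.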

Taking the first category of e-commerce as an example, the KGK-modes prediction model was iterated repeatedly and reached its optimum at 367 iterations, as shown in Figure 8. This indicates that the error has been minimised when the number of iterations reaches 367, and the predicted values of the model obtained by the KGK-modes algorithm are most accurate at this point. The results in Figure 9 show that, in the first category of e-commerce, the variables ranked in descending order of their impact on sales are ord_cnt, offer_cnt, bad_num, offer_amt, rtn_cnt, and rtn_amt.

The fitted model was applied to the test set, and the error was compared between the true and predicted values of the test set, and the results are shown in Figure 10.

It can be seen that the KGK-modes algorithm has the highest sales prediction accuracy, at about 0.94. The CNN algorithm has the second highest sales prediction accuracy, at about 0.91. The K-means algorithm has the worst sales prediction accuracy, at about 0.81. In terms of running time, the K-means and K-medoids algorithms are the most efficient, reaching stability at around 60 s, whereas the KGK-modes and CNN algorithms took 65 s and 70 s, respectively, to reach stability. On balance, the KGK-modes algorithm had the highest sales prediction accuracy and the K-means algorithm had the best runtime.

5.3. Adjustment Strategies for Innovative Marketing Models

For the first category of e-commerce, the main influencing factors for sales are ord_cnt, offer_cnt, and bad_num. The higher the order volume, the higher the sales amount. For the second category of e-commerce, the main influencing factors of sales are ord_cnt, rtn_amt, bad_num, and offer_amt. For the third category of e-commerce, the main influencing factors of sales are ord_cnt, bad_num, and good_num. Therefore, third-category e-commerce companies, which have the smallest range of products, should focus their marketing model innovation on increasing positive comments. The number of bad reviews has a particularly large impact on third-category e-commerce, so reasonable marketing strategies should be prepared in advance to reduce bad reviews. This is because the first thing consumers look at when they enter a third-category shop is the reviews of the goods. It is recommended to offer economical and affordable goods from the customer’s point of view and to improve logistics management.

After a comprehensive analysis of the characteristics of these three types of e-commerce, it is found that ord_cnt is the most important influencing factor on e-commerce sales. Therefore, ord_cnt can be used directly for a simple forecast of changes in sales. The role of marketing activities such as appropriate price reductions is more obvious for the second category of e-commerce. For the third category of e-commerce, the impact of bad reviews is much greater than that of good reviews. This is because, when shopping online, people refer to other people’s reviews of a product before buying it. If a shop has more bad reviews, the consumer’s purchase intention is reduced. For e-commerce companies of different scales, the most important sales influencing factors are different. In the future, e-commerce companies of different sizes need to adjust their innovative marketing models according to the priority of the factors affecting sales in order to maximise profit.
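A simple forecast of sales from ord_cnt alone could look like the least-squares sketch below. This is an illustrative assumption rather than the method used in the study: the linear form, the function names, and the toy figures are all hypothetical.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def predict(slope, intercept, x):
    """Forecast sales for a given order count."""
    return slope * x + intercept


# Hypothetical daily figures, for illustration only.
ord_cnt = [10, 20, 30]
sale_amt = [105.0, 205.0, 305.0]

slope, intercept = fit_line(ord_cnt, sale_amt)
print(slope, intercept, predict(slope, intercept, 40))
```

A single-variable fit like this ignores the effects of reviews and returns discussed above, so it is best read as a quick baseline for each category rather than a replacement for the clustering-based model.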

6. Conclusion

In this paper, a big data-driven KGK-modes clustering algorithm is proposed and applied to e-commerce sales volume prediction, so that innovative marketing models can be adjusted reasonably. A knowledge graph was used to preprocess the data samples to generate a sample set containing concepts, entities, and relationships so that the key attributes of the historical sales data could be extracted. After the knowledge graph analysis, the K-modes algorithm was able to achieve better clustering performance. The KGK-modes clustering model was used to achieve e-commerce sales prediction. The experimental results show that the KGK-modes clustering model has high sales prediction accuracy. The proposed method has certain reference value for e-commerce enterprises of different scales to formulate innovative marketing models. Subsequent research will address the efficiency of clustering, and the Spark parallel platform will be considered as a way to reduce the running time of KGK-modes clustering. In addition, the objective function of KGK-modes clustering will be optimized in order to improve the efficiency of the clustering algorithm.

Data Availability

The experimental data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This study was supported by the Humanities and Social Sciences Projects of Universities in Jiangxi Province (Research on the Compounding Mechanism of Changes in College Students’ Political Trust (issue no. SZZX21019); Practical Reflection and Strategy Optimization Research on Rural Cultural Construction from the Perspective of the Peasant (issue no. SH19104)).