Abstract
Excessive or insufficient business hall resources may result in unreasonable resource allocation, adversely affecting the value of an entity business hall. Therefore, proper characteristic parameters are the key factors for analyzing the business hall, and they strongly affect the final analysis results. In this study, a characteristic analysis method for the economic operation of a business hall is developed and the feature engineering is established. Because of its simplicity and versatility, the $k$-means algorithm has been widely used since it was first proposed around 50 years ago. However, the classical $k$-means algorithm has poor stability and accuracy. In particular, it is difficult to achieve a suitable balance between the centroid initialization and the clustering number $k$. We propose a new initialization (LSH-$k$-means) algorithm for $k$-means clustering. This algorithm is mainly based on locality-sensitive hashing (LSH) as an index for computing the initial cluster centroids, and it reduces the range of the clustering number. Furthermore, an empirical study is conducted. According to the load intensity and time change of the business hall, an index system reflecting the optimization analysis of the business hall is established, and the LSH-$k$-means algorithm is used to analyze the economic operation of the business hall. The results of the empirical study show that the LSH-$k$-means clustering method outperforms the direct prediction method, provides expected analysis results as well as decision optimization recommendations for the business hall, and serves as a basis for the optimal layout of the business hall.
1. Introduction
An entity business hall is where a company directly conducts specific business activities, such as commodity trade, business handling, and service. However, owing to rapid urbanization and economic development, unreasonable resource allocation is becoming increasingly prevalent. For example, the number of entity business halls is excessive in some places and insufficient in others. Hence, the deployment of new commercial outlets (halls) or resource allocation optimization for existing retail outlets often needs to be performed manually. Therefore, how to evaluate the efficiency of business halls has emerged as a major concern for many enterprises.
To this end, many researchers have attempted to overcome the disadvantages of human judgment, which is highly subjective. Brandeau and Chiu [1] considered the transportation cost and the distance between the warehouse and the customer and used a gradient-like algorithm to study the location issue. Wang et al. [2] used nearest-neighbor clustering and Ripley's $K$ function [3] to analyze the layout of commercial outlets and suggested that business type, land price, and traffic accessibility are the critical factors. Gerard [4] analyzed the service needs and waiting demand of customers for bank halls and attempted to shorten the perceived waiting time of customers on the basis of the customers’ business types, thereby improving customer satisfaction. Anderson et al. [5] used the queuing model to optimize the queuing service system of banks. They determined the optimal number of service windows by acquiring and presenting a large amount of data. Lin et al. [6] studied the relationship between retail stores and street centrality and pointed out that besides the transport network, which has a strong impact on the retailer’s location, the street centrality influences the type of retail store. Kang [7] analyzed the changes in warehouses from central urban areas to the urban periphery over time and studied the main factors affecting the warehouse location. Hui [8] used data mining to establish the channel analysis model for an electricity business hall and optimized the resource allocation. Based on the statistics of customer queuing time, business processing time, customer satisfaction, and so on, Yan et al. [9] established an intelligent access platform for the business data and improved the service efficiency. However, there is no unified standard for the business hall index system.
Clustering is a key technique in data mining, and its applications include pattern recognition [10, 11], image processing [12], and recommendation [13]. Clustering aims to partition data into different categories based on a measure of similarity. The $k$-means algorithm is widely used owing to its simplicity and effectiveness. However, different parameter settings and the random selection of the initial clustering centers make the classical $k$-means algorithm unstable.
The classical clustering algorithm involves two problems. The first problem is to classify a given dataset on the basis of the prespecified cluster number $k$; hence, the problem of determining the “correct cluster number” has attracted considerable interest. Although several methods have been developed for estimating the number of data clusters [14–17], it is difficult to use them in practical applications. Therefore, determining the correct number of clusters has long been an important research topic in cluster analysis. The second problem is to determine the initial clustering centers, which have a significant impact on the clustering effect. Studies conducted thus far have explored several initialization methods for the $k$-means algorithm. For example, the $k$-means++ algorithm [18] has been proposed to avoid this issue. This algorithm randomly selects the first centroid, and each subsequent centroid is sampled with probability proportional to its squared distance from the nearest centroid already selected, so that the centroids tend to lie far away from one another. However, random selection is still widely used in practice [19]. Erisoglu et al. [20] proposed an incremental approach for computing the initial clustering centers. In this approach, the reduced dataset is partitioned until the number of clusters equals the predefined number of clusters. However, the number of clusters must be known in advance. The compressed $k$-means (CKM) algorithm [21] is initialized by locality-sensitive hashing (LSH) [22], and the distance is calculated using the Hamming distance between binary codes. The LSH link [23] can rapidly find a nearby cluster to be connected through the LSH algorithm. David et al. [24] proposed a new LSH scheme adapted to the distance for approximate nearest neighbors (ANN) search in high-dimensional spaces.
In summary, there is no unified standard for the index system of business halls at present. Therefore, we establish an index system for analyzing the efficiency of a business hall. To address the initialization sensitivity of $k$-means as well as the difficulty in determining the number of clusters, we initialize the $k$-means centroids on the basis of LSH. Accordingly, we implement the relevant algorithms and present the optimal allocation scheme for the business hall.
The main contributions of this study are as follows:
(1) According to the average waiting time, ticketing time, and business type of a business hall, we analyze the average load rate of the business hall and use the relevant characteristic variables to describe the load of the business hall. Finally, we propose a general business hall index system.
(2) By combining the characteristics of $k$-means and LSH, we propose a new initialization (LSH-$k$-means) algorithm for $k$-means clustering. Given the relevant index variables as input, the model outputs the load classification of each business hall, which supports the optimization of the business hall distribution.
(3) The results of our empirical analysis verify the validity of the proposed LSH-$k$-means approach. Thus, LSH-$k$-means can be efficiently used for the operational analysis of a business hall.
The remainder of this paper is organized as follows: Section 2 introduces the required preliminaries, definitions, and models. Section 3 describes the proposed initialization methodology. Section 4 presents, compares, and discusses the experimental results. Finally, Section 5 concludes the paper.
2. Preliminaries
2.1. $k$-means Algorithm
The notations used in this paper are defined in Table 1. The $k$-means [25] method is the most well-known clustering method because of its simplicity. It has been identified as one of the top 10 algorithms in data mining [26]. Given a dataset $X = \{x_1, x_2, \ldots, x_n\}$, $k$-means aims to partition it into $k$ different clusters $C = \{C_1, C_2, \ldots, C_k\}$, where $k$ is a predefined number. The objective of the $k$-means clustering algorithm is to minimize the sum of squared errors (SSE) [27] over all clusters. The SSE is defined as follows:

$$\mathrm{SSE} = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2, \tag{1}$$

where $\mu_i$ denotes the $i$-th cluster centroid, which is computed as the mean of points in $C_i$, and $x$ is the data object in the $i$-th cluster:

$$\mu_i = \frac{1}{|C_i|} \sum_{x \in C_i} x, \tag{2}$$

where $|C_i|$ denotes the number of data points in the $i$-th cluster.
To solve equation (1), an expectation–maximization (EM)-like optimization method is adopted by updating either the cluster assignments $C$ or the centroids $\mu$ while fixing the other [28]. In general, the clustering procedure involves three steps: (1) initialize the cluster centroids; (2) assign each sample to its closest centroid; and (3) recompute the cluster centroids with the assignments produced in Step 2 and go back to Step 2 until convergence. This is known as the Lloyd iteration procedure [29]. Such an iterative optimization approach has several drawbacks. First, it is sensitive to the initialization, and poorly initialized centroids $\mu$ may lead to an inferior result. Many methods have been proposed to obtain a stable solution, including the $k$-means++ algorithm [18]. Second, finding the optimal solution to $k$-means is an NP-hard problem. Some variants of $k$-means have been proposed, such as various parametric $k$-means, including fuzzy $k$-means [30, 31]. Third, $k$-means requires the entire dataset to be observed and therefore cannot handle newly arriving data. The complexity is $O(tnkd)$, where $t$, $n$, $k$, and $d$ denote the number of iterations, size of the dataset, number of clusters, and dimensionality, respectively. This complexity is considerably higher than that of other well-known clustering algorithms such as DBSCAN [32] and mean shift [33].
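The three steps of the Lloyd iteration can be sketched in plain Python as follows. This is a minimal illustration rather than the implementation used in the study: the names `lloyd_kmeans`, `assign`, `sq_dist`, and `mean` are hypothetical helpers introduced here, the initial centroids are assumed to be supplied by the caller, and an empty cluster simply keeps its previous centroid.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Coordinate-wise arithmetic mean of a non-empty list of points."""
    return tuple(sum(col) / len(points) for col in zip(*points))


def assign(data, centroids):
    """Step 2: index of the closest centroid for every sample."""
    return [min(range(len(centroids)), key=lambda j: sq_dist(x, centroids[j]))
            for x in data]


def lloyd_kmeans(data, init_centroids, max_iter=100):
    """Run Lloyd iterations starting from the given centroids (Step 1)."""
    centroids = [tuple(c) for c in init_centroids]
    for _ in range(max_iter):
        labels = assign(data, centroids)
        # Step 3: recompute each centroid as the mean of its members.
        updated = []
        for j, old in enumerate(centroids):
            members = [x for x, lab in zip(data, labels) if lab == j]
            updated.append(mean(members) if members else old)
        if updated == centroids:  # assignments can no longer change
            break
        centroids = updated
    labels = assign(data, centroids)
    sse = sum(sq_dist(x, centroids[lab]) for x, lab in zip(data, labels))
    return centroids, labels, sse


# Toy data: two well-separated pairs of points.
toy = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
toy_centroids, toy_labels, toy_sse = lloyd_kmeans(toy, [toy[0], toy[2]])
```

Because the result depends on `init_centroids`, running the same function from different starting points is a simple way to observe the initialization sensitivity discussed above.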
2.2. LSH
LSH is a well-known solution for the approximate nearest neighbor problem in high-dimensional spaces. LSH was first introduced for the Hamming metric by Indyk and Motwani [34]. Data points are assigned to individual hash buckets by each hash function. The idea of LSH is that closer data points are mapped to the same hash bucket with high probability. LSH has been shown to be effective even for high-dimensional data, both theoretically and experimentally [35]. Let $H = \{h : S \to U\}$ be a family of hash functions. Each hash function must satisfy the LSH property: $\Pr_{h \in H}[h(x) = h(y)] = \mathrm{sim}(x, y)$, where $\mathrm{sim}(x, y)$ is the similarity between $x$ and $y$. These hash functions must meet the following two conditions:
(1) If $d(x, y) \le d_1$, then $\Pr_{h \in H}[h(x) = h(y)] \ge p_1$
(2) If $d(x, y) \ge d_2$, then $\Pr_{h \in H}[h(x) = h(y)] \le p_2$
where $d(x, y)$ represents the distance measure between $x$ and $y$, $d_1 < d_2$, and $p_1 > p_2$. The definition implies that if $x$ and $y$ are close to each other, they are hashed into the same bucket with a high probability of at least $p_1$; if they are far from each other, they are hashed into the same bucket with a low probability of at most $p_2$. A $(d_1, d_2, p_1, p_2)$-sensitive family of hash functions is useful when the collision probabilities $p_1$, $p_2$ satisfy $p_1 > p_2$. Figure 1 shows an example of hashing key space.
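To make the bucket idea concrete, the sketch below builds one hash table from random-projection hash functions of the form $h(x) = \lfloor (a \cdot x + b) / w \rfloor$, a commonly used LSH family for the Euclidean distance. The text above does not specify which hash family is used, so this choice, the bucket width `width`, the number of hash functions, and all function names are assumptions made only for illustration.

```python
import math
import random
from collections import defaultdict


def make_projection_hash(dim, width, rng):
    """Return h(x) = floor((a . x + b) / width) with Gaussian a and uniform b."""
    a = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    b = rng.uniform(0.0, width)

    def h(x):
        projection = sum(ai * xi for ai, xi in zip(a, x))
        return math.floor((projection + b) / width)

    return h


def bucket_key(x, hashes):
    """Concatenate several hash values into one bucket identifier."""
    return tuple(h(x) for h in hashes)


def build_hash_table(points, hashes):
    """Map every bucket identifier to the indices of the points it holds."""
    table = defaultdict(list)
    for idx, x in enumerate(points):
        table[bucket_key(x, hashes)].append(idx)
    return table


rng = random.Random(0)  # fixed seed so the example is repeatable
hashes = [make_projection_hash(dim=2, width=4.0, rng=rng) for _ in range(3)]
points = [(1.0, 2.0), (1.0, 2.0), (50.0, -40.0)]
table = build_hash_table(points, hashes)
# Identical points always share a bucket; distant points usually do not.
same_bucket = table[bucket_key(points[0], hashes)]
```

Concatenating more hash functions makes buckets smaller (lowering $p_2$ faster than $p_1$), whereas using several independent tables raises the chance that true neighbors meet in at least one of them.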

3. Proposed LSH-Based Initialization Algorithm
The proposed framework involves three steps: (1) An index system for the efficiency analysis of a business hall is established in Section 3.1. (2) To overcome the problems of poor stability and low accuracy of the classical algorithm, a boosted $k$-means algorithm based on LSH initialization is proposed in Section 3.2. (3) The $k$-means algorithm is implemented to obtain the clustering results. The details of these three steps are illustrated in Figure 2.

3.1. Establishment of the Index System
Through the load analysis of the business hall, we can determine the high and low loads and optimize the business hall. The average utilization rate of each business hall is analyzed according to multiple indicators (including average waiting time, ticketing time for business, and business type). Thus, we can use the relevant characteristic variables to describe the load of the business hall. By applying the clustering algorithms, we can obtain the load categories of different business halls, which provides a basis for planning the locations of the business halls. First, the following two essential features are extracted: the maximum load of the business hall ($M$) and the ratio of the actual daily load to the maximum load ($r$).
(1) The maximum load of the business hall is given by

$$M = \frac{N\,T}{\sum_{i=1}^{n} p_i t_i}, \tag{3}$$

where $p_i$ is the proportion of a specific business, $t_i$ is the average time for the clerk to handle the business, $n$ is the number of business types, $T$ is the working time of the clerk, and $N$ is the number of clerks in the business hall. The variables are taken from the peak period. This value represents the maximum business volume that a business hall can withstand during the peak period. The peak period can be obtained by measuring the historical data of each business hall.
(2) The ratio of the actual daily load to the maximum load is given by

$$r = \frac{L}{M}, \tag{4}$$

where $L$ represents the actual daily load of the business hall and $M$ is the maximum load in one day.
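As a small worked example of the two basic features, the sketch below assumes that the maximum load equals the total clerk time available in the peak period divided by the share-weighted mean handling time, and that the daily ratio is the actual load divided by that maximum. Both formulas, the function names, and the numbers are illustrative assumptions, not values from the study.

```python
def max_load(shares, handle_times, work_time, clerks):
    """Assumed capacity: clerks * work_time / sum(share_i * handle_time_i)."""
    if abs(sum(shares) - 1.0) > 1e-9:
        raise ValueError("business shares must sum to 1")
    mean_handle_time = sum(p * t for p, t in zip(shares, handle_times))
    return clerks * work_time / mean_handle_time


def daily_load_ratio(actual_load, maximum_load):
    """Ratio of the actual daily load to the maximum load."""
    return actual_load / maximum_load


# Hypothetical hall: two business types with equal shares, handled in
# 4 and 6 minutes on average, by 2 clerks working 480 minutes each.
capacity = max_load([0.5, 0.5], [4.0, 6.0], work_time=480.0, clerks=2)
ratio = daily_load_ratio(144.0, capacity)
```

With these inputs the mean handling time is 5 minutes, so the assumed capacity is 960 / 5 = 192 transactions, and a day with 144 transactions corresponds to a ratio of 0.75.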
By combining the essential characteristics of the business hall and based on the analysis of historical data, we can obtain the calculation indicators of the business hall to prepare for the subsequent model input. Therefore, the feature engineering for the business hall efficiency analysis is established, and the critical indexes extracted are as follows:
(1) The ratio of the average load to the maximum load is given by

$$\bar{r} = \frac{1}{D} \sum_{d=1}^{D} \frac{L_d}{M}, \tag{5}$$

where $D$ is the number of days, $L_d$ is the actual load on day $d$, and $M$ is the maximum load of the business hall. This index denotes the ratio of the average actual load to the maximum load over some time.
(2) The actual load trend is given by

$$\hat{y} = a + b x, \qquad b = \frac{m \sum_{j=1}^{m} x_j y_j - \sum_{j=1}^{m} x_j \sum_{j=1}^{m} y_j}{m \sum_{j=1}^{m} x_j^2 - \left( \sum_{j=1}^{m} x_j \right)^2}, \tag{6}$$

where $\hat{y}$ denotes the fitted actual load curve, $a$ is a constant, $b$ is the regression coefficient, $x$ is the time (the independent variable), $y_j$ is the load observed at time $x_j$, and $m$ is the number of statistical data. This index indicates whether the load trend of the business hall will be rising, flat, or declining for some time. A fitting curve can be used to characterize the load trend of the business hall over some time, and the slope represents the trend state. Our method includes a commercial center, residential center, new urban area, and other factors.
(3) The proportion of high-value business is given by

$$P_h = \frac{V_h}{V}, \tag{7}$$

where $V_h$ is the high-value business volume and $V$ is the total business volume. Thus, this index denotes the proportion of high-value business to total business in the peak period.
(4) The high-frequency load is given by

$$F_h = \left| \{ d : L_d > \theta_h \} \right|, \tag{8}$$

where $\theta_h$ is a high threshold and $F_h$ represents the number of times the load exceeds $\theta_h$ within a period. Furthermore, $\theta_h$ can be obtained by statistical analysis of the historical data of the business hall.
(5) The low-frequency load is given by

$$F_l = \left| \{ d : L_d < \theta_l \} \right|, \tag{9}$$

where $\theta_l$ is a low threshold and $F_l$ denotes the number of times the load is less than $\theta_l$ within a period. Furthermore, $\theta_l$ can be obtained by statistical analysis of the historical data of the business hall.
(6) The latest high-load interval is given by

$$\Delta_h = t_c - t_h, \tag{10}$$

where $t_c$ represents the current time, $\theta_h$ denotes a high threshold, and $t_h$ refers to the latest time when the load is greater than $\theta_h$. Furthermore, $\Delta_h$ denotes the interval from $t_h$ to $t_c$.
(7) The latest low-load interval is given by

$$\Delta_l = t_c - t_l, \tag{11}$$

where $t_c$ represents the current time, $\theta_l$ denotes a low threshold, and $t_l$ refers to the latest time when the load is less than $\theta_l$. Furthermore, $\Delta_l$ denotes the interval from $t_l$ to $t_c$.
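The seven indexes can be computed from a daily load series with a few lines of Python. The sketch below is only one plausible reading of the definitions above: it assumes one load value per day, takes the trend to be the least-squares slope against the day number, and measures the latest high-load and low-load intervals in days counted back from the last observation. All function names and sample numbers are hypothetical.

```python
def mean_load_ratio(daily_loads, maximum_load):
    """Average actual load divided by the maximum load."""
    return sum(daily_loads) / (len(daily_loads) * maximum_load)


def load_trend_slope(daily_loads):
    """Least-squares slope of load against day number 1..m (needs m >= 2)."""
    m = len(daily_loads)
    days = range(1, m + 1)
    sx, sy = sum(days), sum(daily_loads)
    sxy = sum(x * y for x, y in zip(days, daily_loads))
    sxx = sum(x * x for x in days)
    return (m * sxy - sx * sy) / (m * sxx - sx * sx)


def high_value_share(high_value_volume, total_volume):
    """Proportion of high-value business in the total business volume."""
    return high_value_volume / total_volume


def count_above(daily_loads, threshold):
    """High-frequency load: number of days whose load exceeds the threshold."""
    return sum(1 for load in daily_loads if load > threshold)


def count_below(daily_loads, threshold):
    """Low-frequency load: number of days whose load is under the threshold."""
    return sum(1 for load in daily_loads if load < threshold)


def latest_interval(daily_loads, condition):
    """Days elapsed since `condition` last held, or None if it never held."""
    for age, load in enumerate(reversed(daily_loads)):
        if condition(load):
            return age
    return None


loads = [10.0, 20.0, 30.0, 40.0]  # hypothetical daily loads
features = {
    "mean_ratio": mean_load_ratio(loads, maximum_load=50.0),
    "trend": load_trend_slope(loads),
    "high_value": high_value_share(30.0, 120.0),
    "high_freq": count_above(loads, threshold=25.0),
    "low_freq": count_below(loads, threshold=15.0),
    "since_high": latest_interval(loads, lambda v: v > 25.0),
    "since_low": latest_interval(loads, lambda v: v < 15.0),
}
```

A positive `trend` corresponds to a rising load, a value near zero to a flat load, and a negative value to a declining load; where to draw those boundaries is left to the analyst.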
3.2. LSH-$k$-Means
The main purpose of clustering is to divide data into clusters in which objects in the same cluster are close to one another, whereas objects in different clusters are far from one another. Two factors affect the quality of $k$-means clustering: before applying the algorithm, we need to specify the number of clusters $k$ and select the initial cluster centroids. Selecting appropriate initial cluster centroids can improve the quality of clustering. To this end, a critical study was conducted by Vassilvitskii et al. [18, 36]. If the initial cluster centroids are selected carefully, the $k$-means algorithm converges to a better local optimal solution. Furthermore, careful selection of the initial cluster centroids makes the $k$-means iteration converge faster [18]. However, to make the initial centroids adapt to the data distribution, it is necessary to scan the data for $k$ rounds. Although the number of scanning rounds in [36] has been reduced to a small value, the additional computing cost is still inevitable. Our algorithm exploits LSH. The algorithm minimizes the path by adding the nearest neighbor, and LSH can effectively search for the nearest group features in the path. The average time complexity of the hash-based search is $O(1)$. LSH scans the data records and finds the nearest points; the average values are computed after the nearest points are classified as a category. Algorithm 1 describes the process of obtaining the initialization centroids in our proposed LSH-$k$-means scheme. The main steps are as follows:
(1) Suppose that we have a set of points $X = \{x_1, x_2, \ldots, x_n\}$ obtained via the index system in Section 3.1. We use LSH to index the feature vectors extracted from the dataset to reduce the search time for the nearest neighbor of each query. This is based on the hash mapping function, hash functions, and hash table [37]. Constructing an effective LSH index structure for approximate nearest neighbor search depends on the number of hash tables and the number of bits of the hash codes.
(2) To facilitate the statistics of nonclustered data points, in Algorithm 1, we copy a dataset $D$ from $X$. Randomly select one data point $c_i$ from $D$ as the centroid. Then, $c_i$ is merged into the set $C_i$ and removed from the dataset $D$, where $C_i$ is the $i$-th cluster. After obtaining the point $c_i$, query the corresponding bucket number $b$ according to the hash table in Step 1 and take out the data in bucket number $b$. Calculate the similarity or distance between $c_i$ and the data points in the bucket and return the nearest neighbor data $x'$.
(3) Take a data point $x'$ from $D$ whose distance to $c_i$ does not exceed the maximum distance $\varepsilon$. Merge $x'$ into $C_i$, that is, $C_i = C_i \cup \{x'\}$, and remove it from the dataset $D$.
(4) Repeat Step 3 until the other data point in reaches a certain threshold; the threshold can be computed as follows:
(5) Repeat Steps 2-3 until the length of the dataset $D$ is less than the threshold . As shown in Algorithm 1, .
(6) The arithmetic mean values for the final $k$ sets of samples are computed; then, we can obtain the clustering centers for all the categories in this way:

$$\mu_i = \frac{1}{|C_i|} \sum_{x \in C_i} x, \quad i = 1, 2, \ldots, k. \tag{13}$$

Therefore, based on the aforementioned steps, we will have two algorithms to choose from: “best” movement and “fast” movement [38]. For the “best” movement, we can use the centers from equation (13) as the initial clustering centers and the value of $k$ obtained in Algorithm 1 as the input $k$ of the classical $k$-means, and run the algorithm; the result is the final result. For “fast” movement, the divided categories can be regarded as approximate clustering results and directly used as the classification results. Because the initial clustering centers are determined and the initial categories are obtained, the result of the algorithm is more stable and accurate, and it requires a relatively short running time.
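The steps above can be sketched as follows. This is a simplified reading, not a reproduction of Algorithm 1: the threshold formula of Step 4 is not available here, so the sketch absorbs every unassigned bucket neighbor within `max_dist` of the seed, stops when fewer than `min_size` points remain, and discards groups smaller than `min_size`. Those rules, the parameter names, and `grid_key` (a deterministic quantizer standing in for a real LSH key function) are all assumptions.

```python
import random
from collections import defaultdict


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Coordinate-wise arithmetic mean of a non-empty list of points."""
    return tuple(sum(col) / len(points) for col in zip(*points))


def grid_key(x, cell=10.0):
    """Stand-in bucket key: quantize each coordinate (hypothetical helper)."""
    return tuple(int(v // cell) for v in x)


def lsh_seed_centroids(data, bucket_key, max_dist, min_size, rng):
    # Step 1: index every point by its bucket.
    table = defaultdict(list)
    for idx, x in enumerate(data):
        table[bucket_key(x)].append(idx)

    remaining = set(range(len(data)))  # working copy of the dataset
    groups = []
    while len(remaining) >= min_size:
        # Step 2: choose a random unassigned point as the provisional centre.
        seed = rng.choice(sorted(remaining))
        remaining.discard(seed)
        group = [seed]
        # Step 3: absorb unassigned points of the same bucket within max_dist.
        for idx in table[bucket_key(data[seed])]:
            if idx in remaining and sq_dist(data[idx], data[seed]) <= max_dist ** 2:
                group.append(idx)
                remaining.discard(idx)
        # Assumption: groups below min_size are treated as noise and dropped.
        if len(group) >= min_size:
            groups.append(group)
    # Step 6: each initial centroid is the arithmetic mean of its group.
    centroids = [mean([data[i] for i in g]) for g in groups]
    return centroids, groups


sample = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0),
          (20.0, 20.0), (21.0, 20.0), (20.0, 21.0)]
init_centroids, init_groups = lsh_seed_centroids(
    sample, grid_key, max_dist=3.0, min_size=2, rng=random.Random(0))
```

The number of returned groups plays the role of the estimated $k$. Under the “best” movement the returned centroids would seed a classical $k$-means run, whereas under the “fast” movement the groups themselves would be used as the clustering result.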
4. Experimental Results
First, we use the UCI datasets [39] (https://archive-beta.ics.uci.edu/ml/datasets) to verify the performance of the proposed algorithm, and we state the verification criteria. In addition, we use the Mall-Customers dataset (https://www.kaggle.com/shwetabh123/mall-customers) to examine the value range of the number of clusters produced by the proposed LSH-$k$-means model. Our experimental results demonstrate the effectiveness and superiority of the proposed LSH-$k$-means. Then, we apply it to the actual business hall dataset and present an example to optimize the business hall operation.
4.1. Experimental Design
To verify the aforementioned points and evaluate the effectiveness of the proposed LSH-$k$-means model, numerous experiments were conducted on the UCI datasets, which consist of Balance, Wine, Breast, Diabet, Iris, Hayes-roth, Tic-tac-toe, and Bupa. We followed the experiments conducted in a previous study [40]. We briefly review the existing baselines as follows:
(1) $k$-means [25] is the classical $k$-means algorithm.
(2) Enhanced $k$-means [38] enhances the classical $k$-means algorithm. The initial cluster centers are determined in advance instead of being selected randomly.
(3) The AC algorithm [41] for clustering treats each sample as a pattern; by computing the similarity between patterns, the more similar patterns are grouped into one class, and the less similar patterns are classified into different classes. The difference between two patterns in AC clustering is usually measured by a distance function, such as the Euclidean distance or Hamming distance. In the experiment, the AC algorithm is implemented by the KnowledgeMiner Software [41].
There are 200 samples in the Mall-Customers dataset. It includes gender, customer ID, age, annual income, and expenditure scores, and it is used to gain insights from the data and to group the customers according to their behaviors. The elbow method [42] is a well-known method for determining the optimal value of $k$. As shown in Figure 3(a), the optimum number of clusters of the Mall-Customers dataset is 5. According to Algorithm 1, we set the minimum number and the maximum distance . Owing to the small amount of data, we set the number of buckets to 1. After 10 LSH-based initializations, we get the value of $k$ between . Figure 3(b) shows the results of LSH-$k$-means clustering. The black dots represent the centroids.

Figure 3: (a) Elbow method on the Mall-Customers dataset. (b) Results of LSH-$k$-means clustering.
There are 525 samples in the Balance dataset. For the classical $k$-means algorithm, the number of clustered samples that match the real categories is 271, and the matching rate is 51.62%. The corresponding values of the LSH-$k$-means algorithm are 288 and 54.87%, respectively. Similarly, the results of the other UCI datasets are listed in Table 2. To determine whether there are significant differences between algorithms, we use the Wilcoxon signed-rank test [43], which is a nonparametric statistical test. The Wilcoxon test has been widely used in many fields, especially in algorithm comparison and analysis [40]. It is expressed as follows:

$$R^{+} = \sum_{d_i > 0} \operatorname{rank}(d_i) + \frac{1}{2} \sum_{d_i = 0} \operatorname{rank}(d_i), \qquad R^{-} = \sum_{d_i < 0} \operatorname{rank}(d_i) + \frac{1}{2} \sum_{d_i = 0} \operatorname{rank}(d_i),$$

where $d_i$ is the difference in clustering performance between the two algorithms on the $i$-th dataset, and the absolute values of the differences are ranked in ascending order. If several differences have the same absolute value, each receives the average of their ranks. $R^{+}$ is the sum of ranks for the datasets on which the first algorithm is better than the other, and $R^{-}$ is the sum of ranks for the opposite case.
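The rank sums used by the Wilcoxon signed-rank test can be computed as in the sketch below. It follows the common convention of ranking the absolute differences with average ranks for ties and splitting the ranks of zero differences evenly between the two sums; whether the study handled zeros in exactly this way is an assumption. The scores in the example are made up for illustration and are not the values of Table 2.

```python
def wilcoxon_rank_sums(scores_a, scores_b):
    """Return (R_plus, R_minus, T) for paired scores of two algorithms."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    start = 0
    while start < len(order):
        # Find the run of equal absolute differences beginning at `start`.
        end = start
        while (end + 1 < len(order)
               and abs(diffs[order[end + 1]]) == abs(diffs[order[start]])):
            end += 1
        average_rank = (start + end) / 2 + 1  # ranks are 1-based
        for pos in range(start, end + 1):
            ranks[order[pos]] = average_rank
        start = end + 1
    zero_share = 0.5 * sum(r for r, d in zip(ranks, diffs) if d == 0)
    r_plus = sum(r for r, d in zip(ranks, diffs) if d > 0) + zero_share
    r_minus = sum(r for r, d in zip(ranks, diffs) if d < 0) + zero_share
    return r_plus, r_minus, min(r_plus, r_minus)


# Hypothetical matching counts of two algorithms on four datasets.
r_plus, r_minus, t_stat = wilcoxon_rank_sums([5, 3, 8, 6], [4, 5, 5, 6])
```

The smaller rank sum `t_stat` is then compared with the tabulated critical value for the given number of datasets; the difference is judged significant when `t_stat` does not exceed that value.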
The calculations for the eight aforementioned datasets are presented below.
Let ; we get . According to the critical value table of the Wilcoxon test, we can judge that the difference between algorithms is significant under the condition . Furthermore, as shown in Table 2, the LSH-based $k$-means performs better than the enhanced $k$-means on five datasets; thus, in terms of quantity, the LSH-based $k$-means algorithm outperforms the enhanced $k$-means algorithm. Therefore, we can judge that the improvement achieved by the LSH-based $k$-means is significant.
In addition to comparison with actual categories, we further distinguish the clustering effects of $k$-means clustering and the AC algorithm. A tight and separative indicator is used to evaluate the clustering results [44], which is defined as follows: where , , and denote the cluster centers, is any sample in the dataset, is the number of clusters, and is the sample set. The Xie–Beni (XB) index [45, 46] is based on intracluster and intercluster distances; it is formulated in terms of the cluster compactness and separation between the clusters. We use the XB index for the evaluation of the cluster effects, and it is defined as follows:

$$\mathrm{XB} = \frac{\frac{1}{n} \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - c_i \rVert^2}{\min_{i \neq j} \lVert c_i - c_j \rVert^2},$$

where $n$ is the number of samples, $C_i$ is the $i$-th cluster, and $c_i$ is its center, so that $\mathrm{XB}$ is the ratio of the average distance between data objects and their corresponding clustering centers to the minimum distance of the cluster centers. The smaller the value of $\mathrm{XB}$, the higher is the clustering quality. The results are summarized in Table 3.
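For a hard (non-fuzzy) partition, the XB index reduces to the mean squared distance of the samples to their own centers divided by the smallest squared distance between two centers. The sketch below implements this crisp form; using squared Euclidean distances follows the usual definition of the index and is an assumption about the exact variant applied in the experiments.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def xie_beni(data, labels, centers):
    """Crisp Xie-Beni index: compactness divided by separation (needs k >= 2)."""
    compactness = sum(sq_dist(x, centers[lab])
                      for x, lab in zip(data, labels)) / len(data)
    separation = min(sq_dist(centers[i], centers[j])
                     for i in range(len(centers))
                     for j in range(i + 1, len(centers)))
    return compactness / separation


# Two tight clusters whose centers are 10 units apart.
xb_data = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
xb_labels = [0, 0, 1, 1]
xb_centers = [(0.0, 1.0), (10.0, 1.0)]
xb_value = xie_beni(xb_data, xb_labels, xb_centers)
```

In this toy case every sample lies at squared distance 1 from its center and the centers are at squared distance 100, so the index is 1 / 100 = 0.01; a partition that mixes the two groups would yield a larger value.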
From the XB value calculated in Table 3, we can conclude that the difference between the algorithms is significant. The XB value of the AC algorithm is the largest, while that of the LSH-based $k$-means algorithm is the smallest, which implies that the LSH-$k$-means algorithm outperforms the other algorithms in the experiment. Thus, the experimental results verify the effectiveness and superiority of the proposed method. Therefore, it can finally be applied to the empirical analysis. In the next section, we describe the application of LSH-$k$-means to business hall analysis.
4.2. Business Hall Analysis
In reality, business hall resource allocation may be unreasonable. For example, some business halls may be busy, while others may be idle. This may be caused by overlapping user coverage in different business halls, unreasonable location of the business halls, and a large proportion of low-value businesses. In this section, we experimentally verify the effectiveness of our index system and analyze the results of the proposed LSH-$k$-means model.
4.2.1. Business Hall Clustering
Once the index system is established as described in Section 3.1, we obtain the characteristic information of the business hall. After data preprocessing, the number of clusters is determined subjectively by considering the load intensity and time change information for the business hall. The load intensity can be categorized into high, normal, and low, and the load trend can be categorized into rising, flat, and declining. Thus, a nine-square grid map (Figure 4(b)) can be obtained. At the same time, by referring to the knowledge of field experts, the number of clusters can be defined as 9 for the subjective clustering methods. After the number of clusters is determined, the extracted feature indicators are taken as the input, and the clustering model is implemented. For the LSH-$k$-means algorithm, the distance parameter was selected as the Euclidean distance, the maximum number of iterations was set to 500, the number of seeds was set to 10, and the number of clusters was set to 9. Then, the outcomes were obtained, as shown in Table 4 and Figure 4(a).

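Figure 4: (a) Clustering outcomes of the LSH-$k$-means algorithm. (b) Nine-square grid map of load intensity and load trend.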
Meanwhile, unlike the subjective clustering methods, which require a predetermined number of clusters, the AC algorithm determined the clustering number automatically on the basis of the similarity between the samples. Here, the similarity was set at 95%, and the algorithm was implemented at the same time. The result was exactly consistent with that of the LSH-$k$-means algorithm. The details are presented in Table 5. For example, the first and sixth samples, the second sample, and the third sample were clustered into the same category.
The final classification results obtained from the model can provide the load grades and decision-making suggestions, which can serve as a basis for site planning optimization of the business halls. In addition, the results of the two algorithms were consistent, which indicates that the LSH-$k$-means algorithm is effective and that a stable result was obtained. Accordingly, further optimization action can be implemented.
4.2.2. Optimization Analysis
As shown in Figure 5(a), the 16th and 17th business halls are both in Class . This category indicates that the current load is low, and the load trend remains unchanged, implying that the business hall resources are redundant in this area. The business halls in this class are idle, and their sites may be unreasonable. Therefore, the merger of business halls, relocation, and reduction of resource input in this area should be considered.

Figure 5: Optimization analysis of the business halls, panels (a) and (b).
By contrast, for Class (the red part of Figure 5(b)), it can be seen that the current load and load trend are both high, which implies that the business volumes of the business halls are large and that the load is still on the rise. Currently, the first and sixth business halls belong to this category. The future trend is still likely to be growing, and the business volumes will keep increasing. Therefore, this area is where more business hall resources need to be input, and the optimal site planning of business halls should be considered accordingly. We can define the objective function of the optimal site for the input business hall resources as follows: where is the quantitative value of factors that affect the rationality of business hall location, is the number of factors, is the distance function, is the weight, and is the target point to be solved. Thus, according to the objective function and relevant coordinate information of key units in the area, we can compute the optimal planning location of the business hall using the optimization algorithm. Here, the optimal location of the business hall was computed as , and the optimal solution was 527.0368. The details are shown in Figure 5.
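One way to illustrate such a site-selection problem is to minimize a weighted sum of Euclidean distances from the candidate location to the key units in the area. The sketch below makes exactly that assumption about the objective and solves it with Weiszfeld's iteration, a standard method for the weighted geometric median; the study only refers to “the optimization algorithm”, so both the objective form and the solver are stand-ins, and the coordinates and weights are invented.

```python
import math


def weighted_cost(p, sites, weights):
    """Assumed objective: sum of weight_i * distance(p, site_i)."""
    return sum(w * math.dist(p, s) for s, w in zip(sites, weights))


def weighted_mean(sites, weights):
    """Weighted centroid, used as the starting point."""
    total = sum(weights)
    dims = range(len(sites[0]))
    return tuple(sum(w * s[k] for s, w in zip(sites, weights)) / total
                 for k in dims)


def weiszfeld(sites, weights, iters=200, eps=1e-12):
    """Approximate the point that minimizes weighted_cost."""
    p = weighted_mean(sites, weights)
    for _ in range(iters):
        # Each site pulls the estimate with strength weight / distance;
        # eps guards against division by zero when p sits on a site.
        pull = [w / max(math.dist(p, s), eps) for s, w in zip(sites, weights)]
        total_pull = sum(pull)
        p = tuple(sum(c * s[k] for s, c in zip(sites, pull)) / total_pull
                  for k in range(len(p)))
    return p


# Hypothetical key units (for example, residential and commercial centers).
units = [(0.0, 0.0), (4.0, 0.0), (0.0, 3.0)]
unit_weights = [1.0, 1.0, 1.0]
best_site = weiszfeld(units, unit_weights)
best_cost = weighted_cost(best_site, units, unit_weights)
```

Raising the weight of one unit pulls the computed site toward it, which is how factors such as population density or business demand could be expressed in this simplified setting.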
The 9th and 12th business halls belong to Class , which shows that the current load and load trend are both normal and that the status is stable. Therefore, the business halls in this class are not the current focus of optimization. The other classes are similar to this category and are likewise not the current focus. The main objects are Class and Class , that is, excessive or insufficient business hall resources are mainly concentrated in these two classes, which are the focus of our optimization analysis.
5. Conclusion
Excessive or insufficient business hall resources may result in unreasonable resource allocation, which adversely affects the value of an entity business hall. Therefore, proper characteristic parameters are the key factors for analyzing the business hall, which strongly affect the final analysis results. According to the time change and load trend, multiple variables such as average load rate, actual load trend, and high-frequency load are extracted as the characteristic indexes of the business hall. In this study, a characteristic analysis method for the economic operation of a business hall was developed, and the specific calculation process was presented; accordingly, the feature engineering was established. Moreover, based on the load intensity and time change information of business halls, we built an index system and performed further optimization analysis. The key characteristic indicators extracted were the average waiting time, ticket handling time, and business type, and a model for evaluating business hall efficiency was established. The model obtained the load grading of each business hall by the relevant variable input, which provided a basis for optimal site planning of the business halls.
An empirical study showed that the LSH-$k$-means clustering method outperforms the direct prediction method, provides expected analysis results and decision optimization suggestions for business halls, and serves as a basis for the optimal layout of business halls. In addition, by considering the load intensity and time change information, the cluster number was determined according to the characteristic analysis results, with a certain theoretical and practical significance. In the future, we will explore and develop a general method to automatically determine the parameters and use it in practical applications.
Data Availability
The data used to support the findings of this study are included within the article.
Conflicts of Interest
The authors declare that they have no conflicts of interest.