Abstract
These days, with the increasingly widespread employment of sensors, particularly those attached to vehicles, the collection of spatial data is becoming easier and more accurate. As a result, many relevant areas, such as spatial crowdsourcing, are gaining ever more attention. A typical spatial crowdsourcing scenario involves an employer publishing a task and some workers helping to accomplish it. However, most previous studies have considered only the spatial information of workers and tasks, while ignoring individual variations among workers. In this paper, we consider the Software Development Team Formation (SDTF) problem, which aims to assemble a team of workers whose abilities satisfy the requirements of the task. After showing that the problem is NP-hard, we propose three greedy algorithms and a multiple-phase algorithm to approximately solve the problem. Extensive experiments are conducted on synthetic and real datasets, and the results verify the effectiveness and efficiency of our algorithms.
1. Introduction
These days, with the development of sensors (especially vehicle sensors and mobile sensors) [1–3], it is increasingly simple to acquire spatial and temporal information [4, 5].
Many studies based on vehicle sensor data have been conducted in recent years [6–8]. As a result, many applications now provide services based on users’ real-time spatial information, and these are becoming ever more popular. Among these applications, some focus on crowdsourcing services that use spatial information. These applications usually require some workers to help an employer accomplish a task. For example, Uber (https://www.uber.com) organizes drivers and provides users with a convenient taxi service, whereas Meituan (http://www.meituan.com) provides a credible and fast food-delivery service. This area, called spatial crowdsourcing, is attracting significant attention.
The task assignment problem is one of the fundamental concerns in spatial crowdsourcing. For example, real-time taxi-calling platforms, such as Uber and Didi Chuxing [9], always need to assign each taxi-calling task to a suitable taxi (i.e., a crowd worker). An incorrect assignment may cause taxis to be dispatched to far-away places, which results in a slow response time and losses for the platform. Many studies on the task assignment problem have been published in recent years [10–12]. However, most of them only consider the spatial information of tasks and workers, while ignoring the individual variations among workers. That is, different people may excel or struggle with different tasks, and tasks also carry certain requirements for which some workers may be inadequate.
Take Figure 1 as an example. Suppose a website development task requires coders skilled in .NET, SQL, and HTML to assemble at the location of the origin, and there are three coders available (whose skills are presented in Figure 1). Although one coder is located closer to the origin than the other two, hiring that coder will not help finish the task. In other words, it is necessary to further consider individual variations among different workers and the special requirements of tasks.

As in [13, 14], each worker is associated with a set of skills representing their strengths. Tasks are also associated with a set of skills representing their special requirements.
R-trees [15] are a classical index structure for multidimensional data. Derived from the B-tree, an R-tree stores its data in leaf nodes, and all leaves are located at the same level of the tree. Every internal node contains between $m$ and $M$ child entries, and every leaf node contains between $m$ and $M$ data entries, where $M$ is usually related to the size of disk pages, and $m$ is predefined such that $m \le M/2$. The tree is structured such that the bounding region of a node overlaps as little as possible with those of other nodes. Using an R-tree, we can dynamically insert, update, and delete entries, and rapidly search for all entries located in a given rectangle.
The objective of our problem consists of two parts. First, workers need to move to the location of the task but receive no reward for this movement. In consideration of the workers, we attempt to reduce this unpaid moving distance. Second, the employer wishes to spend the minimum amount necessary to accomplish the task. In consideration of employers, we attempt to obtain a team at the lowest cost, on condition that the skill requirement is satisfied. As the problem definition in Section 2 shows, the objective of our work contains not only the distance between the task and workers, but also the total price of the team.
Contributions. In summary, our contributions are as follows:
(i) We propose a new Software Development Team Formation (SDTF) problem and prove that it is NP-hard.
(ii) Three greedy algorithms are provided to solve the SDTF problem.
(iii) We employ a multiphase algorithm based on R-trees.
(iv) We verify the effectiveness and efficiency of the proposed algorithms through extensive experiments on synthetic and real datasets.
Compared with our previous work [16], we propose a novel multiple-phase algorithm by using the index structure of R-trees. Additional experiments are also conducted on synthetic and real datasets.
The remainder of this paper is organized as follows. In Section 2, the problem is formally defined and proved to be NP-hard. In Section 3, three greedy algorithms are provided to solve the SDTF problem. In Section 4, we propose a multiple-phase algorithm based on R-trees. Extensive experiments on synthetic and real datasets are described in Section 5. Previous work related to our problem is presented in Section 6, and the conclusions of this study are presented in Section 7.
2. Problem Statement
First, we introduce the two basic concepts of a task and a coder. We then formally define the Software Development Team Formation (SDTF) problem.
Definition 1 (task). A task $t$ is defined as $t = \langle S_t, l_t \rangle$, where $S_t$ is the set of skills that are indispensable to complete the software development task $t$, and $l_t$ is the location specified to meet up and discuss task $t$, which, for example, can be described by longitude and latitude.
Similar to the definition of a task, a coder is formally defined as follows.
Definition 2 (coder). A coder $c$ is defined as $c = \langle S_c, l_c, p_c \rangle$, where $S_c$ is the set of skills mastered by coder $c$, $l_c$ is the location of coder $c$, described similarly to that of a task, and $p_c$ is the price of coder $c$.
Briefly, a team of coders is feasible for a task if the coders in the team can collaboratively accomplish the task.
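Definition 3 (feasible team). A team $T$ is defined as a set of coders $T = \{c_1, c_2, \ldots, c_{|T|}\}$. $T$ is a feasible team for task $t$ if $S_t \subseteq \bigcup_{c \in T} S_c$, where $S_t$ is the skill set of task $t$ and $S_c$ is the skill set of coder $c$.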
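Example 4. Suppose that we have a task $t$ concerning website development, with skill set $S_t$, and a universal set of coders $C$. The skill set $S_c$ of every coder $c \in C$ is listed in Table 1. A team $T \subseteq C$ is a feasible team when the union of its members' skill sets, $\bigcup_{c \in T} S_c$, is a superset of $S_t$.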
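As a minimal sketch of Definition 3, the feasibility test is a coverage check on the union of the team's skill sets. The `Coder` class, the function name `is_feasible`, and the sample coders are illustrative assumptions and are not taken from Table 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Coder:
    name: str
    skills: frozenset   # skills mastered by the coder
    location: tuple     # (x, y) coordinates
    price: float


def is_feasible(team, task_skills):
    """Return True if the union of the team's skill sets covers task_skills."""
    covered = set()
    for coder in team:
        covered |= coder.skills
    return set(task_skills) <= covered


# Hypothetical data for illustration only.
task_skills = {".NET", "SQL", "HTML"}
c1 = Coder("c1", frozenset({"HTML", "CSS"}), (1.0, 1.0), 300.0)
c2 = Coder("c2", frozenset({".NET", "SQL"}), (3.0, 4.0), 500.0)

print(is_feasible([c1], task_skills))      # c1 alone lacks .NET and SQL
print(is_feasible([c1, c2], task_skills))  # together they cover all three skills
```

Skills outside the task's skill set (here CSS) do not affect feasibility.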
We finally define our problem as follows.
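Definition 5 (SDTF problem). Given a task $t$ with a set of skills $S_t$ and a location $l_t$, and a universal set of coders $C$, where each coder $c \in C$ has skill set $S_c$, location $l_c$, and price $p_c$, we wish to find a team $T \subseteq C$ satisfying $S_t \subseteq \bigcup_{c \in T} S_c$ that minimizes
$$\mathrm{Cost}(T) = \alpha \cdot \max_{c \in T} d(c, t) + (1 - \alpha) \cdot \sum_{c \in T} p_c,$$
where $d(c, t)$ represents the distance between the location of coder $c$ and task $t$, and $\alpha \in [0, 1]$ is a parameter to adjust the weight of distance and price.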
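The objective can be illustrated with a short sketch. It assumes Euclidean distance between planar coordinates and an objective of the form "weighted maximum distance plus weighted total price", with the weight `alpha` on the distance part; the `Coder` class, the name `team_cost`, and the sample values are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Coder:
    name: str
    skills: frozenset
    location: tuple   # (x, y) coordinates
    price: float


def team_cost(team, task_location, alpha):
    """Weighted sum of the maximum coder-task distance and the total price."""
    if not team:
        return 0.0
    max_dist = max(math.dist(c.location, task_location) for c in team)
    total_price = sum(c.price for c in team)
    return alpha * max_dist + (1 - alpha) * total_price


# Hypothetical team: distances 5 and 10 from the origin, prices 10 and 20.
team = [
    Coder("c1", frozenset({"SQL"}), (3.0, 4.0), 10.0),
    Coder("c2", frozenset({"HTML"}), (6.0, 8.0), 20.0),
]
cost = team_cost(team, (0.0, 0.0), alpha=0.5)  # 0.5 * 10 + 0.5 * 30 = 20.0
```

Only the farthest coder contributes to the distance part, so adding a coder who is closer than the current farthest member changes only the price part.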
Theorem 6. The SDTF problem is NP-hard.
Proof. We prove the theorem by reducing the weighted set cover problem to a special case of the SDTF problem. An instance of the weighted set cover problem consists of a universe $X$ and a family $\mathcal{F} = \{F_1, F_2, \ldots, F_n\}$, where $F_i \subseteq X$ for $i = 1, \ldots, n$. Each $F_i$ is associated with a positive value $w_i$, which can be viewed as the weight of $F_i$. The weighted set cover optimization problem aims to find a subfamily $\mathcal{F}'$ of $\mathcal{F}$ satisfying $\bigcup_{F_i \in \mathcal{F}'} F_i = X$ and minimizing $\sum_{F_i \in \mathcal{F}'} w_i$.
We consider a special case of the SDTF problem in which the task and coders are located in the same position, and the skill set of the task is the universal set of all skills. To reduce the weighted set cover problem to the special-case SDTF problem, we let each element in $X$ correspond to a skill in the task's skill set $S_t$, each set $F_i \in \mathcal{F}$ correspond to a coder $c_i$ whose skill set is $F_i$, and the weight $w_i$ of $F_i$ correspond to the price $p_{c_i}$ of $c_i$. As the task and all coders are at the same location, for every team $T$, $\max_{c \in T} d(c, t) = 0$, and for any $\alpha < 1$ we need only minimize $\sum_{c \in T} p_c$. Hence, a subfamily is an optimal solution to the weighted set cover instance if and only if the corresponding team is an optimal solution to the special-case SDTF instance, and we can obtain the special-case SDTF instance from the weighted set cover instance in polynomial time. Because the weighted set cover problem is NP-hard, the general case of the SDTF problem is NP-hard.
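The construction used in the proof can be sketched in a few lines. The function name `set_cover_to_sdtf`, the `Coder` class, and the sample instance are illustrative assumptions; the sketch only shows that the mapping is a direct, polynomial-time translation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Coder:
    name: str
    skills: frozenset
    location: tuple
    price: float


def set_cover_to_sdtf(universe, subsets, weights):
    """Map a weighted set cover instance to a special-case SDTF instance.

    Every element becomes a required skill, every subset becomes a coder
    whose price is the subset's weight, and all locations coincide.
    """
    origin = (0.0, 0.0)
    task_skills = frozenset(universe)
    coders = [
        Coder(f"c{i + 1}", frozenset(subset), origin, float(weight))
        for i, (subset, weight) in enumerate(zip(subsets, weights))
    ]
    return task_skills, origin, coders


# Hypothetical weighted set cover instance.
universe = {1, 2, 3, 4}
subsets = [{1, 2}, {3, 4}, {1, 2, 3, 4}]
weights = [2.0, 3.0, 6.0]
task_skills, task_location, coders = set_cover_to_sdtf(universe, subsets, weights)
```

Because every coder sits at the task location, the distance part of any team's cost is zero, so the team price equals the weight of the corresponding cover.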
3. Greedy Solutions for SDTF
In this section, we present three greedy algorithms to solve the SDTF problem. The first two algorithms greedily choose the nearest/cheapest coder who can cover at least one uncovered skill. Because they only consider optimizing part of the objective function, the solution is sometimes not good enough. Thus we propose a third greedy algorithm that considers both price and distance when choosing a new coder.
3.1. Price First-SDTF Greedy Algorithm
The idea of the first greedy algorithm is to repeatedly add the cheapest coder to the team until the team is feasible. The whole procedure of this price first- (PF-) SDTF is illustrated in Algorithm 1. We assume that there exists at least one feasible team.
Algorithm 1: PF-SDTF.
Considering that skills not in the skill set of the task contribute nothing to the accomplishment of the task, the term “cheapest coder” must be treated carefully. Here, we define the Average Price on Uncovered Intersecting Skills (APUIS) to describe how a coder contributes to the price part of the objective function:
$$\mathrm{APUIS}(c) = \frac{p_c}{|S_c \cap U_t|}, \tag{1}$$
where $p_c$ and $S_c$ are the price and the skill set of coder $c$, and $U_t$ is the uncovered skill set of task $t$. We can see that APUIS describes how a coder influences the price part of the objective function if we add him/her to the final team. Choosing a coder with a lower APUIS means we can satisfy the skill requirement with a lower total price. Note that when there is no intersection between the skill set of the worker and the uncovered skill set, APUIS is infinite. Because we greedily choose the worker with the lowest APUIS, we omit this special case in (1).
In line (1) of Algorithm 1, we initialize an empty team $T$. In lines (2)–(5), while $T$ is not feasible, we find a coder $c$ who can cover at least one uncovered skill of task $t$ and has the lowest APUIS value, add $c$ to team $T$, and update the uncovered skill set. Ties are broken by distance first, then arbitrarily. In line (6), we return the resulting feasible team $T$.
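The procedure described above can be sketched as follows. This is an illustrative reading of the prose, not the original pseudocode: it assumes APUIS is the coder's price divided by the number of still-uncovered task skills he/she covers, uses Euclidean distance for tie-breaking, and all names and sample data are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Coder:
    name: str
    skills: frozenset
    location: tuple
    price: float


def pf_sdtf(coders, task_skills, task_location):
    """Greedy price-first team formation: lowest APUIS first."""
    uncovered = set(task_skills)
    team = []
    while uncovered:
        # Only coders covering at least one uncovered skill have finite APUIS.
        candidates = [c for c in coders if c.skills & uncovered]
        if not candidates:
            raise ValueError("no feasible team exists")
        best = min(
            candidates,
            key=lambda c: (
                c.price / len(c.skills & uncovered),      # APUIS
                math.dist(c.location, task_location),     # tie-break by distance
            ),
        )
        team.append(best)
        uncovered -= best.skills
    return team


# Hypothetical instance.
coders = [
    Coder("c1", frozenset({"A", "B"}), (1.0, 0.0), 100.0),
    Coder("c2", frozenset({"C"}), (2.0, 0.0), 30.0),
    Coder("c3", frozenset({"A", "B", "C"}), (3.0, 0.0), 200.0),
]
team = pf_sdtf(coders, {"A", "B", "C"}, (0.0, 0.0))
print([c.name for c in team])
```

In this instance c2 is picked first (APUIS 30), after which c1 (APUIS 50) beats c3 (APUIS 100 on the two remaining skills).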
3.2. Distance First-SDTF Greedy Algorithm
The idea of distance first- (DF-) SDTF is to repeatedly add the nearest coder to the team until the team is feasible. The framework of DF-SDTF is similar to that of PF-SDTF. In each iteration, we find the nearest coder who can cover at least one uncovered skill of task $t$; that is,
$$c^{*} = \arg\min_{c \in C,\; S_c \cap U_t \neq \emptyset} d(c, t), \tag{2}$$
where $S_c$ is the skill set of coder $c$, $d(c, t)$ is the distance between coder $c$ and task $t$, and $U_t$ is the uncovered skill set of task $t$. The whole procedure of DF-SDTF is illustrated in Algorithm 2. We assume that there exists at least one feasible team.
Algorithm 2: DF-SDTF.
In line (1), we initialize an empty team $T$. In lines (2)–(5), while $T$ is not feasible, we find the nearest coder $c$ who can cover at least one uncovered skill of task $t$, add $c$ to team $T$, and update the uncovered skill set. Ties are broken by price first, then arbitrarily. In line (6), we return the resulting feasible team $T$.
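A sketch of the distance-first procedure, under the same caveats as before: Euclidean distance is assumed, and the class, function name, and sample coders are hypothetical rather than taken from Algorithm 2.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Coder:
    name: str
    skills: frozenset
    location: tuple
    price: float


def df_sdtf(coders, task_skills, task_location):
    """Greedy distance-first team formation: nearest useful coder first."""
    uncovered = set(task_skills)
    team = []
    while uncovered:
        candidates = [c for c in coders if c.skills & uncovered]
        if not candidates:
            raise ValueError("no feasible team exists")
        # Nearest coder first; ties are broken by price.
        best = min(
            candidates,
            key=lambda c: (math.dist(c.location, task_location), c.price),
        )
        team.append(best)
        uncovered -= best.skills
    return team


# Hypothetical instance: c3 alone would be feasible but is the farthest.
coders = [
    Coder("c1", frozenset({"A"}), (1.0, 0.0), 50.0),
    Coder("c2", frozenset({"B", "C"}), (2.0, 0.0), 80.0),
    Coder("c3", frozenset({"A", "B", "C"}), (5.0, 0.0), 60.0),
]
team = df_sdtf(coders, {"A", "B", "C"}, (0.0, 0.0))
print([c.name for c in team])
```

The example shows the weakness discussed in the text: the greedy choice ignores price, so it hires two coders for a total of 130 even though the single coder c3 costs 60.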
3.3. Distance Price-SDTF Greedy Algorithm
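The aforementioned two greedy algorithms are not always effective, because they only try to optimize part of the objective function. To optimize both distance and price at every iteration, we design a utility function $u$. Given a task $t$, current team $T$, and coder $c$, the definition of $u$ is
$$u(t, T, c) = \frac{|S_c \cap U_t|}{\alpha \cdot \Delta d(T, c) + (1 - \alpha) \cdot p_c}, \tag{3}$$
where $U_t$ is the uncovered skill set of task $t$, and $\Delta d(T, c)$ represents the increment in the maximum distance if $c$ is added to team $T$, that is, $\Delta d(T, c) = d(c, t) - \max_{c' \in T} d(c', t)$ if $d(c, t) > \max_{c' \in T} d(c', t)$, and $\Delta d(T, c) = 0$ otherwise (the maximum over an empty team is taken as 0). In fact, the value of the denominator in (3) is the increment in the objective function.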
Using this utility function, we have a third greedy algorithm, Distance Price- (DP-) SDTF. The whole procedure of DP-SDTF is illustrated in Algorithm 3. We assume that there exists at least one feasible team.
Algorithm 3: DP-SDTF.
In line (1), we initialize an empty team $T$. In lines (2)–(5), while $T$ is not feasible, we find the coder $c$ who gives the highest utility, add $c$ to team $T$, and update the uncovered skill set. Ties are broken by distance first, then arbitrarily. In line (6), we return the resulting feasible team $T$.
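The following sketch illustrates one possible reading of this procedure. It assumes the utility is the number of newly covered skills divided by the increment in the objective (weighted increase of the maximum distance plus weighted price), which is an assumption about the exact form of the utility; names and data are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Coder:
    name: str
    skills: frozenset
    location: tuple
    price: float


def dp_sdtf(coders, task_skills, task_location, alpha):
    """Greedy team formation using a combined distance-price utility."""
    uncovered = set(task_skills)
    team = []
    current_max = 0.0  # maximum coder-task distance within the current team
    while uncovered:
        best, best_key = None, None
        for c in coders:
            gain = len(c.skills & uncovered)
            if gain == 0:
                continue
            dist = math.dist(c.location, task_location)
            # Increment in the objective if c joins the team (assumed form).
            increment = alpha * max(0.0, dist - current_max) + (1 - alpha) * c.price
            utility = gain / increment if increment > 0 else math.inf
            key = (utility, -dist)  # highest utility, then nearest
            if best_key is None or key > best_key:
                best, best_key = c, key
        if best is None:
            raise ValueError("no feasible team exists")
        team.append(best)
        uncovered -= best.skills
        current_max = max(current_max, math.dist(best.location, task_location))
    return team


# Hypothetical instance.
coders = [
    Coder("c1", frozenset({"A", "B"}), (3.0, 4.0), 10.0),
    Coder("c2", frozenset({"C"}), (0.0, 1.0), 4.0),
    Coder("c3", frozenset({"A", "B", "C"}), (6.0, 8.0), 40.0),
]
team = dp_sdtf(coders, {"A", "B", "C"}, (0.0, 0.0), alpha=0.5)
print([c.name for c in team])
```

With these values, c2 has the highest utility in the first iteration (1/2.5) and c1 in the second (2/7), so the expensive and distant c3 is never selected.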
4. Multiple-Phase R-Tree Algorithm
In this section, we introduce an algorithm based on the R-tree data structure. Considering that some previous work has applied R-trees in Nearest Neighbor (NN) searching [17, 18], a naïve idea is to use an R-tree to accelerate the NN search in the DF-SDTF algorithm proposed in Section 3.2. However, this simple use of R-trees can only accelerate the search speed and does not help optimize the final cost. As our experiments will show, the DF-SDTF algorithm performs worse than the DP-SDTF algorithm proposed in Section 3.3. The above situation requires us to find an algorithm that is both efficient and effective in solving the SDTF problem.
Our algorithm derives from an intuitive observation: if we query all nodes located in the square whose centroid is at the location of the task and whose side length is $s$, the Euclidean distance between the task and the nodes in the result set will be at most $\frac{\sqrt{2}}{2} s$. This property provides a practical tool for the distance part of our objective function. By choosing a square with a suitable side length, we obtain a set of candidate coders who are close to the location of the task. The price part of the objective function can also be optimized if we employ a proper strategy to choose the next coder from the candidate coder set.
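The distance bound for a square query can be verified with a one-line derivation. The notation below is an assumption made for illustration: the task is at $(x_t, y_t)$, a queried node is at $(x, y)$, the side length of the square is $s$, and distance is Euclidean.

```latex
\[
|x - x_t| \le \tfrac{s}{2}, \quad |y - y_t| \le \tfrac{s}{2}
\;\Longrightarrow\;
\sqrt{(x - x_t)^2 + (y - y_t)^2}
\;\le\; \sqrt{\tfrac{s^2}{4} + \tfrac{s^2}{4}}
\;=\; \tfrac{\sqrt{2}}{2}\, s .
\]
```

The bound is attained only at the four corners of the square, so most candidates returned by the query are strictly closer to the task.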
Based on the above observation, we propose the Multiple-Phase R-tree (MPR) algorithm. The main idea of our algorithm is as follows.
(1) Initialize a new R-tree and insert all coders into the tree.
(2) In each phase, obtain a candidate set of coders by querying all nodes located in the square whose centroid is at the location of the task.
(3) Sort all coders in the candidate set in ascending order of APUIS. For each coder, add him/her to the final team if his/her skills can cover at least one uncovered skill in the task.
(4) If the team is not feasible, return to step (2) and use a square with longer sides.
In detail, we generate the list of side lengths by uniformly dividing the maximum distance between the coders and the task. Given a parameter $k$ denoting the number of phases, we first scan the whole set of coders and calculate the maximum distance between the coders and the task, $d_{\max} = \max_{c \in C} d(c, t)$. Then, we iteratively start a phase by using a square whose side length grows by a fixed step from one phase to the next, until we obtain a feasible team.
The pseudocode of our MPR algorithm is shown in Algorithm 4. First, we initialize the team and find the maximum distance between the coders and the task in lines (1)-(2). Then, we calculate the step size of the sides between two phases in line (3). In each phase (iteration in lines (5)–(9)), we first query all nodes located in the square whose centroid is at the location of the task and whose side length is that of the current phase. We then iteratively add coders with the minimum APUIS (lines (7)–(9)). As before, ties are broken by distance first, then arbitrarily. In line (10), we return the resulting feasible team.
Algorithm 4: MPR.
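The phases described above can be sketched as follows. Two simplifications are assumptions of this sketch rather than part of the original algorithm: a linear scan stands in for the R-tree range query, and the half-side of the square in phase $i$ is taken to be $i \cdot d_{\max}/k$ so that the last phase contains every coder. All names and sample data are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Coder:
    name: str
    skills: frozenset
    location: tuple
    price: float


def mpr(coders, task_skills, task_location, k):
    """Multiple-phase team formation over growing square queries."""
    uncovered = set(task_skills)
    team = []
    if not uncovered:
        return team
    tx, ty = task_location
    d_max = max(math.dist(c.location, task_location) for c in coders)
    for phase in range(1, k + 1):
        # Assumed schedule: half-side grows uniformly up to d_max.
        half = d_max if phase == k else d_max * phase / k
        # Linear scan standing in for an R-tree range query.
        candidates = [
            c for c in coders
            if abs(c.location[0] - tx) <= half
            and abs(c.location[1] - ty) <= half
            and c.skills & uncovered
        ]
        # Ascending APUIS (computed at phase start), ties broken by distance.
        candidates.sort(key=lambda c: (
            c.price / len(c.skills & uncovered),
            math.dist(c.location, task_location),
        ))
        for c in candidates:
            if c.skills & uncovered:
                team.append(c)
                uncovered -= c.skills
            if not uncovered:
                return team
    raise ValueError("no feasible team exists")


# Hypothetical instance: c3 is cheap but far away.
coders = [
    Coder("c1", frozenset({"A", "B"}), (1.0, 1.0), 100.0),
    Coder("c2", frozenset({"C"}), (2.0, 0.0), 30.0),
    Coder("c3", frozenset({"A", "B", "C"}), (8.0, 6.0), 20.0),
]
team = mpr(coders, {"A", "B", "C"}, (0.0, 0.0), k=2)
print([c.name for c in team])
```

In the first phase only c1 and c2 fall inside the square, and they already form a feasible team, so the distant coder c3 is never examined. This is the pruning effect that the R-tree query provides in the full algorithm.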
5. Evaluation
We applied our four algorithms to synthetic and real datasets. The algorithms were implemented in C++, and the experiments were performed on a machine with an Intel i7-4710MQ 2.50 GHz 4-core CPU and 8 GB of memory.
5.1. Datasets
We use real and synthetic datasets to evaluate our algorithms. The real dataset is taken from CSTO (http://www.csto.com/) and includes 2033 active coders. In the CSTO dataset, each task is associated with a set of skills needed to complete a software development task, and each coder is associated with a set of skills and an average price that can be deduced from the historical data. As few coders have associated price information (because many coders have not completed any tasks), we analyze the price distribution using only the coders with price information. Except for some expensive coders, the price of a coder is uniformly distributed in the range 0–5000 and is unrelated to the number of mastered skills. As the CSTO data are not associated with location information, we generate coordinates for each coder according to a uniform distribution.
For the synthetic data, based on our observations of the real dataset, we generate the price of each coder following a uniform distribution. We assume that each coder has 5–25 skills, which is common in practice. The distance from each coder to the task is generated according to a uniform distribution. The statistics and configuration of the synthetic data are given in Table 2, where the default settings are marked in bold font.
5.2. Number of Phases in MPR Algorithm
In the MPR algorithm, we introduce a new parameter $k$ representing the total number of phases. Before conducting experiments on the synthetic and real data, we determined an appropriate value of $k$ to ensure better performance of the MPR algorithm. We first generate a synthetic dataset with the default settings to preexamine how $k$ affects the performance of the MPR algorithm. The results are shown in Figure 2 for $k$ from 5 to 100. According to these results, we fix the value of $k$ used in all subsequent MPR experiments.

5.3. Experiments on Synthetic Datasets
The experimental results using the synthetic data are shown in Figures 3 and 4. In this section, we measure the effectiveness and efficiency of these four algorithms and analyze how various parameters affect the results given by each algorithm.

Figure 3: Cost of the four algorithms on synthetic data. (a) Cost of varying $\alpha$; (b)–(d) cost of varying the other three parameters.
Figure 4: Running time of the four algorithms on synthetic data, panels (a)–(d).
Effectiveness of Proposed Algorithms. Figure 3 shows the effectiveness of our four algorithms. The DP-SDTF and MPR algorithms offer similar performance and outperform both DF-SDTF and PF-SDTF.
Efficiency of Proposed Algorithms. Figure 4 shows the efficiency of our four algorithms. Although DP-SDTF and MPR have similar costs, MPR is faster than DP-SDTF, because the R-tree prunes coders that are irrelevant to the query and accelerates the query process. We can also observe how the skill-coverage constraint affects the running time of the four algorithms. Although PF-SDTF, DF-SDTF, and DP-SDTF all use a greedy strategy and have similar structures, DF-SDTF consumes more time than PF-SDTF and DP-SDTF. This is because DF-SDTF only considers distance and ignores how many uncovered skills a coder contributes. As a result, DF-SDTF needs more coders to make the team feasible, resulting in more iterations than the PF-SDTF and DP-SDTF algorithms.
Effect of $\alpha$. Figure 3(a) shows the effect of varying $\alpha$. As $\alpha$ varies from 0.1 to 0.9, the cost of DP-SDTF decreases smoothly, indicating that the price term contributes more to the cost than the distance term. Because the DF-SDTF (PF-SDTF) algorithm only considers distance (price), when $\alpha$ is high (low), its performance is similar to that of the DP-SDTF and MPR algorithms. However, as $\alpha$ decreases (increases), the performance of DF-SDTF (PF-SDTF) becomes worse.
Effect of the Other Parameters. The effect of varying the remaining three parameters is illustrated in Figures 3(b), 3(c), and 3(d). Because the default setting of $\alpha$ is 0.5, finding a good team requires distance and price to be considered simultaneously. We can observe that the DP-SDTF and MPR algorithms perform better, with DF-SDTF and PF-SDTF costing 3 to 4 times more.
5.4. Experiments on the Real Dataset
The experimental results using the real dataset are shown in Figure 5, where Figures 5(a) and 5(b) each show the effect of varying one parameter. Varying one of them produces an effect similar to that on the synthetic dataset; when the other is varied, the costs of the four algorithms oscillate, probably because of the structure of the CSTO dataset. Unlike in the experiments on synthetic data, the MPR algorithm performs worse than DP-SDTF but still outperforms DF-SDTF and PF-SDTF. This is probably because, in real datasets, different skills may make different contributions, leading to a gap between the results on synthetic data and real data.

Figure 5: Cost of the four algorithms on the real dataset, panels (a) and (b).
Comparison with the Exact Result. Because the SDTF problem is NP-hard, we only conduct small-scale experiments to compare the output of our DP-SDTF and MPR algorithms with the exact solution. In this setting, the coders are randomly chosen from the real dataset. The experimental results are shown in Figure 6. We can observe that the performance of DP-SDTF is similar to that of the exact algorithm, but the cost of the MPR algorithm is 1.25 to 1.5 times the exact minimum cost.
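For small instances, an exact baseline can be obtained by enumerating every subset of coders. The sketch below is one such brute-force baseline, not necessarily the exact algorithm used in the experiments; it assumes Euclidean distance and the weighted "maximum distance plus total price" objective, and all names and data are hypothetical.

```python
import math
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Coder:
    name: str
    skills: frozenset
    location: tuple
    price: float


def exact_sdtf(coders, task_skills, task_location, alpha):
    """Enumerate all non-empty teams and return the cheapest feasible one."""
    required = set(task_skills)
    best_team, best_cost = None, math.inf
    for size in range(1, len(coders) + 1):
        for team in combinations(coders, size):
            covered = set().union(*(c.skills for c in team))
            if not required <= covered:
                continue  # team does not cover every required skill
            max_dist = max(math.dist(c.location, task_location) for c in team)
            cost = alpha * max_dist + (1 - alpha) * sum(c.price for c in team)
            if cost < best_cost:
                best_team, best_cost = team, cost
    return best_team, best_cost


# Hypothetical instance.
coders = [
    Coder("c1", frozenset({"A", "B"}), (3.0, 4.0), 10.0),
    Coder("c2", frozenset({"C"}), (0.0, 1.0), 4.0),
    Coder("c3", frozenset({"A", "B", "C"}), (6.0, 8.0), 40.0),
]
best_team, best_cost = exact_sdtf(coders, {"A", "B", "C"}, (0.0, 0.0), alpha=0.5)
print(sorted(c.name for c in best_team), best_cost)
```

The enumeration visits all $2^n - 1$ non-empty subsets, which is why such a baseline is practical only for small numbers of coders.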

Conclusion. From the extensive experiments conducted on both real and synthetic data to validate our four algorithms, we found that the DF-SDTF (PF-SDTF) algorithm, which focuses on the distance (price) part of the objective function, performs better with larger (smaller) values of $\alpha$. The DP-SDTF algorithm gives the best performance among the four algorithms discussed here because it considers both parts of the objective function. The fourth algorithm, MPR, accelerates the query process with little increase in the cost, which makes it more applicable in practice.
6. Related Work
The SDTF problem tackled in this paper covers the domains of Team Formation and Spatial Crowdsourcing. On the one hand, the SDTF problem can be simplified to the task assignment problem if we ignore the skill constraint. On the other hand, its most distinctive requirement is that the skills of a team must cover the skills of the task, as in team formation. Previous work related to these two domains is introduced in the following subsections.
6.1. Team Formation
The team formation problem was first proposed in [19]. The problem requires a team of workers such that (1) the team's skills satisfy the requirement of the task and (2) the overall communication cost is minimized. The NP-hardness of this problem is also proved in that paper. The problem has been extended by associating each worker with a capacity [20], which is the maximum number of tasks that can be assigned to the worker. To solve the capacitated team formation problem, two approximation algorithms with proven guarantees were proposed. Unlike [19, 20], which only include a single task, the team formation problem has also been considered with multiple tasks and workers in both offline and online scenarios [21]. While the above-mentioned studies attempt to optimize the overall communication cost, the workload can be balanced among workers by treating the communication cost as a restrictive constraint [22]. As the above shows, most studies on team formation focus on skill satisfaction in communication graphs, while ignoring the influence of spatial information.
6.2. Spatial Crowdsourcing
The problem studied in this paper is an extension of the task assignment problem in spatial crowdsourcing, known as the server-assigned task assignment problem [10, 11], in which workers cannot reject the assigned tasks. Recently, task assignment in real-time spatial crowdsourcing has also been studied under the online algorithmic model [12, 23]. Based on the original task assignment problem, [24, 25] study the conflict-aware task assignment problem, in which tasks may conflict with each other and thus cannot be assigned to the same worker. In addition, [26] not only considers spatiotemporal conflicts of tasks but also schedules the plan by which each worker completes tasks. Furthermore, Kazemi et al. propose the quality-based task assignment problem [27], which utilizes majority voting techniques to guarantee the quality of task assignment results [28–30].
Although [13, 14] integrate the task assignment problem and team formation problem and propose a two-level-based framework to solve the problem, there are two main differences between [13, 14] and our work: (1) there is no capacity constraint in our work, which means that there are more candidates in the search space; (2) the objective of our work considers both the distance between the task and workers and the overall cost, whereas [13, 14] only attempt to minimize the overall cost.
7. Conclusion
With the development of sensors, particularly vehicle sensors and mobile sensors, spatial crowdsourcing is gaining ever more attention. In this paper, we propose a novel spatial crowdsourcing problem called Software Development Team Formation (SDTF). We prove that SDTF is NP-hard and design three greedy algorithms and an index-based algorithm to solve the SDTF problem. The first two greedy algorithms, DF-SDTF and PF-SDTF, only consider part of the optimization objective, and the performance is therefore below expectations. To overcome the shortcomings of these two algorithms, we design a third greedy algorithm, called DP-SDTF, which considers both parts of the optimization goal. In addition, we develop a multiple-phase algorithm based on R-trees called MPR. The MPR algorithm can accelerate the query process with little increase in cost. We conduct extensive experiments to evaluate the performance of our algorithms. The results show that our DP-SDTF algorithm achieves similar performance to the exact algorithm.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
Acknowledgments
This work is supported in part by National Grand Fundamental Research 973 Program of China under Grant 2015CB358700, NSFC Grant no. 71531001, and SKLSDE Open Program SKLSDE-2016ZX-13.