Abstract
To meet the technical requirements for the intelligent development of auxiliary combat systems, we construct a visual intelligent test platform. A near-real military scene dataset based on physical rendering is built, containing 11,000 remote sensing images captured by a virtual camera under different illumination, weather, camera shooting angle, and scene scale conditions. In addition, we add a natural style transfer module that generates multiple environmental variants of a single unmodeled military scene image. We conduct experiments to evaluate the stability of several UAV remote sensing image object detection algorithms. Based on the quality and speed scores of the tested algorithms, adaptability scores under different conditions are calculated. Furthermore, we propose a comprehensive evaluation index system for military remote sensing object detection based on a hierarchical model. We envision that our comprehensive benchmark will support the evaluation of algorithm capability for military object detection tasks and the improvement of algorithm training.
1. Introduction
In military applications, compared with intelligent object recognition tests in real scenes, artificial near-real virtual scenes offer significant advantages in cost, fidelity, repeatability, and controllability. First, to meet the technical requirements of the intelligent development of auxiliary combat systems, we constructed a visual intelligent test platform. Unlike a real scene, the virtual platform can arbitrarily transform the position and scale of objects in the scene while preserving constraints such as the details of the target. Second, inspired by GAN-based style transfer approaches [1–4], we proposed an algorithm built on CycleGAN [5] whose generator renders natural military scene images under different weather from the same camera view, so as to provide task-specific datasets for other military missions and to improve the generalization ability of intelligent algorithms. (In real military missions, images of a selected scene under different weather can be generated in real time by a generator trained on images collected with our intelligent platform; we only briefly describe the algorithm and do not present the dataset contents in detail.)
Images can be generated by the natural phenomenon generation module, which is based on physical rendering in Unreal Engine 4. For example, clouds are rendered with an image-sequence-based ray-casting algorithm in the three-dimensional scene [6–11]. We obtain large-scale images of the virtual scene under different environments from the physics engine. From the perspective of the style and content representations of deep convolutional neural networks, natural style generation for a single military scene image was then realized: the generator of our style transfer algorithm can render an unmodeled single real military scene image under new environmental conditions.
Based on the intelligent platform, we can measure and evaluate an intelligent detection algorithm under multiple conditions, including different illumination, environment, scale, and angle. The algorithm to be evaluated was applied to different scene images, in which the light intensity was divided into bright light, moderate light, and weak light, and the weather conditions included sunny, rainy, snowy, and foggy. We also used different camera views, namely distant, middle, and close views, to simulate image scales, and the shooting angle was divided into three levels. The mAP and FPS scores produced by a nonadaptive evaluation system can only be measured on a single fixed dataset. On the near-real virtual scene platform, in contrast, we can measure image detection methods in terms of illumination, environment, angle, and scale adaptability to obtain performance and speed scores under different conditions. We completed semantic annotation of the basic scenes and used these annotations together with context conversion technology to generate a large number of near-real training images. Compared with existing 2D image datasets [12] and 3D indoor scenes [13–15], the 3D environment of this project is more complex and more realistic.
For intelligent algorithms on different missions, after scoring on all four indicators, we added a comprehensive evaluation model based on the analytic hierarchy process [16]. First, through questionnaire surveys and the investigation of a large number of application scenarios of military target monitoring, the importance comparisons among the four evaluation indicators were obtained, transforming qualitative judgments that are difficult to quantify into operable pairwise importance comparisons. The evaluation indexes were compared pairwise to build the judgment matrix, and a consistency test was carried out. The weights of the judgment matrix were then obtained by the arithmetic mean method, the geometric mean method, and the eigenvalue method. The obtained illumination, environment, scale, and angle adaptability scores were substituted into the formulas of the illumination index, the environment index, the scale index, and the angle index, respectively, to calculate the index values. Finally, combining these values with the weights of the judgment matrix, the comprehensive evaluation score of the algorithm was obtained.
2. Related Work
Visual intelligent recognition datasets such as ImageNet [17], MSCOCO [18], PASCAL VOC [19], KITTI [20], and Oxford RobotCar [21] have promoted the development of visual intelligent technologies such as face detection and recognition, pedestrian tracking, gait recognition, visual question answering, and scene understanding. HiEve [22] played an important role in multiobject tracking, pose estimation and tracking, and action recognition. CrowdHuman [23] focused on challenging crowded and complex events, whereas previous datasets were mostly associated with normal or relatively simple scenarios. Cityscapes [24] addressed the semantic understanding of urban street scenes, aiming to evaluate visual algorithms on semantic segmentation tasks so as to better support object detection in urban scenes. However, due to the limitations of filming equipment and scenes, there is still a gap between datasets collected from the real environment and the requirements of simulating military application-related scenes. Gaidon et al. [25] recorded a video dataset in a virtual environment to extend KITTI for analyzing the multiobject tracking capability of algorithms. Qiao et al. [26] constructed a virtual supermarket environment and collected images containing objects of different scales to analyze the robustness of object detection algorithms to scale changes. Gupta et al. [27] evaluated their proposed visual navigation algorithm by measuring the task completion rate and average search path in a highly realistic virtual environment. Xia et al. [28] proposed DOTA, a large-scale dataset for object detection in aerial images, whose rotated and polygonal labels are very helpful for identifying categories such as ships; however, the amount of remote sensing data remains limited. At the same time, some AI research teams have opened a large number of virtual environments for AI training and testing, such as AirSim, OpenAI, and DeepMind, striving for the commanding heights of AI research and testing platforms. On the other hand, although datasets constructed from virtual scenes perform excellently, their construction cost is high, so how to transfer the style of a single image to obtain images under different environments is also worth exploring. Zhu et al. [5] proposed the CycleGAN model to realize conversion between unpaired sample data, which can be applied to image style transfer tasks.
Remote sensing object detection is extremely challenging due to the strong restrictions imposed by scene illumination, camera shooting angle, and the scale of the captured images. Xu et al. [29] restored low-illumination images through decomposition and enhancement, which greatly improved object detection under illumination changes. For dense, small objects against complex backgrounds, Long et al. [30] combined a variety of traditional and deep learning methods to detect small, partially occluded, out-of-view, and dark-background objects with very good results. Wang et al. [31] proposed MS-VANS, which combines visual attention, a tailored loss function, and data augmentation to address the multiscale problem in detection, targeting object rotation, scaling, and background clutter. Hendrycks and Dietterich [32] introduced a benchmark to evaluate the robustness of recognition models to common image corruptions. Michaelis et al. [33] also proposed a benchmark examining the robustness of existing object detection algorithms in autonomous driving tasks. For comprehensive evaluation indicators, Saaty [16] proposed the analytic hierarchy process (AHP), which combines qualitative and quantitative analysis for decisions involving multiple plans or goals.
3. Near-Real Military Visual Intelligent Platform
In the whole framework of a visual task, the visual intelligent test platform plays a decisive role and is of great significance for improving the algorithm evaluation system and constructing the visual task system. How to evaluate visual detection algorithms is an important problem. To evaluate the performance of existing algorithms scientifically, we build a remote sensing image detection algorithm test platform on the established near-real virtual scene dataset. The performance of an object detection model can be measured from four aspects: illumination, environment, angle, and scale adaptability. Combined with the analytic hierarchy process (AHP), a comprehensive index evaluation model is proposed.
3.1. Establishment of the Dataset
To comprehensively test the recognition and tracking performance of an intelligent algorithm in different natural scenes, it is necessary to carry out real-time scene style conversion in the constructed near-real scenes to improve the generalization ability of the recognition algorithm. Natural phenomenon generation based on physical rendering requires the complete spatial information of the 3D scene, so it is difficult to apply to a single military scene image. To solve this problem, we adopt a multiway natural phenomenon scene generation route [34–37]. For 3D near-real scenes that have been fully modeled, physical particle rendering based on UE4 is used to complete scene conversion of natural phenomena such as rainfall, snowfall, and fog. For a single military scene image, images of various natural phenomena generated by the physics engine are used as the training set, and a style transfer model for natural scene phenomena is trained with deep learning. The flow chart is shown in Figure 1.

3.1.1. Natural Scene Generation Module Based on Physical Rendering
Unreal Engine 4 provides a powerful particle system, which can create a variety of complex visual effects. The physics rendering engine based on UE4 includes two rendering methods, particle rendering and volume rendering, which are applied to rain and snow rendering and to cloud rendering, respectively. (i) Rain and Snow Rendering Based on the Particle System. Let an $n$-dimensional vector represent a single particle, and define a mapping from each particle to the set of positive integers so that every particle obtains an index $i$. Let $s_i(t)$ denote the state of particle $i$ at moment $t$. The finite set of particle states is defined as a particle system, expressed as $P(t)=\{s_1(t), s_2(t), \ldots, s_m(t)\}$, in which $P(t)$ is the state of the particle system at moment $t$ and $m$ is the number of particles. The state of the particle system at the initial moment $t_0$ is expressed as $P(t_0)=\{s_1(t_0), s_2(t_0), \ldots, s_m(t_0)\}$
Based on the above particle system description, by adjusting the shape and appearance, motion speed, and life cycle of the particle system, natural phenomena such as rainfall and snowfall can be generated in near-real scenes. (ii) Cloud Rendering Based on Volume Rendering. A volume rendering algorithm is needed to create realistic and random cloud weather in the virtual scene. Commonly used volume rendering algorithms include the ray-casting algorithm, the shear-warp algorithm, and the frequency-domain volume rendering algorithm. The ray-casting algorithm scans along rays, which matches human visual intuition, and it can easily be ported to the GPU. Therefore, we use the ray-casting algorithm to simulate volumetric clouds and haze
The ray-casting algorithm is a direct volume rendering algorithm based on an image sequence. From each pixel of the image plane, a ray is cast through the image sequence along a fixed direction, usually the viewing direction. Along the ray, the image sequence is sampled to obtain color information, and the color values are accumulated according to a light absorption model until the ray has passed through the whole image sequence. The accumulated color value is the color of the corresponding pixel in the rendered image. The renderings based on the 3D engine are shown in Figure 2.
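To make the accumulation step concrete, the following Python sketch illustrates front-to-back compositing along a single ray under a simple absorption-emission model. The scalar density volume, nearest-neighbor sampling, fixed white emission color, and early-termination threshold are illustrative assumptions, not the engine's actual GPU implementation.

```python
import numpy as np

def cast_ray(volume, origin, direction, step=1.0, n_steps=256):
    """Accumulate color and opacity along one ray through a scalar density volume
    using front-to-back compositing under an absorption-emission model."""
    pos = np.asarray(origin, dtype=float)
    direction = np.asarray(direction, dtype=float)
    color, alpha = np.zeros(3), 0.0
    for _ in range(n_steps):
        idx = np.round(pos).astype(int)
        if np.any(idx < 0) or np.any(idx >= np.array(volume.shape)):
            break                                     # ray has left the volume
        density = volume[tuple(idx)]                  # nearest-neighbor sample
        sample_alpha = 1.0 - np.exp(-density * step)  # absorption over one step
        sample_color = np.array([1.0, 1.0, 1.0])      # assumed white cloud emission
        color += (1.0 - alpha) * sample_alpha * sample_color
        alpha += (1.0 - alpha) * sample_alpha
        if alpha > 0.99:                              # early ray termination
            break
        pos += direction * step
    return color, alpha
```

Repeating this accumulation for every pixel of the image plane yields the rendered cloud image.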

3.1.2. Natural Scene Generation Module Based on CycleGAN
The generation method based on the physics engine can quickly and accurately transform the natural environment of a fully modeled near-real scene, and the generated images can also serve as training samples for the natural environment generation system for single military scene images. Based on the large-scale training dataset generated by the physics engine, this project, from the perspective of the style and content representations of deep convolutional neural networks, improves the generator of the CycleGAN algorithm and realizes an updated generator. (i) Our Generator. In this part, an improved generator composed of an encoder, a decoder, and a feature transfer module is designed. The structure of the feature transfer module is changed, the image features are interpolated, the semantic attributes of the image are modified, and the learning of image features is realized. Our baseline is the CycleGAN model, whose generator is composed of residual blocks and sampling convolution layers. Because the residual network is easy to train and has a strong capability for feature extraction, this structure works well for image generation tasks. The network consists of two generators and two discriminators, with the rainy-day images as domain $X$ and the sunny-day images as domain $Y$. Mappings between the two domains are learned through a cycle consistency process, so the model can be trained with only the two sets of unpaired images. (ii) Feature Transfer Module. The generator of the original CycleGAN is composed of two downsampling convolution layers, nine residual modules, and two upsampling convolution layers. In the improved model, the generators of the two domains are integrated, and a deep feature transfer structure is added while the downsampling layers, the upsampling layers, and the residual modules remain unchanged. With the 9 residual modules and the network depth unchanged, the two downsampling convolution layers with a stride of 2 and the first three residual modules together form the encoder, and the last three residual modules together with the two upsampling convolution layers with a stride of 1/2 form the decoder. The middle three residual modules are used as feature transfer modules from domain $X$ to domain $Y$ and from domain $Y$ to domain $X$, respectively. The improved model was trained on a military scene dataset containing sunny-day and rainy-day images. The model used the Adam optimizer with a batch size of 1 and an initial learning rate of 0.0002; 200 epochs were trained, and the number of iterations per epoch equals the number of images of a single category in the training set.
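A minimal PyTorch sketch of this layer arrangement is given below. The split into a shared encoder/decoder with direction-specific feature transfer blocks follows the description above, while the channel widths, instance normalization, 7×7 boundary convolutions, and use of transposed convolutions for the stride-1/2 upsampling are assumptions carried over from the original CycleGAN generator rather than details stated in the text.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.InstanceNorm2d(ch), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.InstanceNorm2d(ch))

    def forward(self, x):
        return x + self.block(x)

class SharedGenerator(nn.Module):
    """Encoder (two stride-2 convs + 3 residual blocks), one 3-block feature
    transfer module per direction, decoder (3 residual blocks + two upsampling convs)."""
    def __init__(self, ch=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, ch, 7, padding=3), nn.InstanceNorm2d(ch), nn.ReLU(inplace=True),
            nn.Conv2d(ch, 2 * ch, 3, stride=2, padding=1), nn.InstanceNorm2d(2 * ch), nn.ReLU(inplace=True),
            nn.Conv2d(2 * ch, 4 * ch, 3, stride=2, padding=1), nn.InstanceNorm2d(4 * ch), nn.ReLU(inplace=True),
            *[ResidualBlock(4 * ch) for _ in range(3)])
        # middle residual blocks act as direction-specific feature transfer modules
        self.transfer_x2y = nn.Sequential(*[ResidualBlock(4 * ch) for _ in range(3)])
        self.transfer_y2x = nn.Sequential(*[ResidualBlock(4 * ch) for _ in range(3)])
        self.decoder = nn.Sequential(
            *[ResidualBlock(4 * ch) for _ in range(3)],
            nn.ConvTranspose2d(4 * ch, 2 * ch, 3, stride=2, padding=1, output_padding=1),
            nn.InstanceNorm2d(2 * ch), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(2 * ch, ch, 3, stride=2, padding=1, output_padding=1),
            nn.InstanceNorm2d(ch), nn.ReLU(inplace=True),
            nn.Conv2d(ch, 3, 7, padding=3), nn.Tanh())

    def forward(self, img, direction="x2y"):
        feat = self.encoder(img)
        feat = self.transfer_x2y(feat) if direction == "x2y" else self.transfer_y2x(feat)
        return self.decoder(feat)
```

Training would then follow the usual CycleGAN recipe with adversarial and cycle consistency losses, using the Adam optimizer (learning rate 0.0002, batch size 1) for 200 epochs as described above.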
3.2. Evaluation Index System of UAV Remote Sensing Image Detection Algorithm
3.2.1. Indicator Overview
Aerial images are characterized by scale diversity, a particular viewing perspective, small targets, arbitrary orientations, and high background complexity. The algorithm test platform is built on the near-real virtual scene dataset and contains scoring modules for illumination adaptability, environmental adaptability, scale adaptability, and angle adaptability. Through this platform, the evaluation scores of an algorithm model on the illumination index, environment index, scale index, and angle index can be obtained. (i) Illumination Index. Because of the low flying altitude of a micro-UAV and the constant changes of illumination and shooting angle during its flight, it is unreasonable to apply a radiometric correction coefficient obtained from a single reference object to all images. This makes the illumination of consecutive images uneven, resulting in inconsistent saturation, brightness, and other properties. The illumination index mainly measures the performance of the algorithm model under different illumination intensities. (ii) Environmental Index. Due to meteorological conditions or air pollution in some areas, heavy haze seriously reduces the quality of UAV image acquisition and affects subsequent image analysis and understanding. The environmental index mainly evaluates the performance of the algorithm model in different environmental scenes, that is, the level of test accuracy. (iii) Scale Index. The shooting height of aerial remote sensing images ranges from hundreds of meters to tens of thousands of meters, so the apparent size of similar ground targets varies greatly; military equipment such as fighter jets and ships ranges from tens of meters to hundreds of meters in size. The scale index mainly evaluates the performance of the algorithm model in scenes of different scales. (iv) Angle Index. Most remote sensing images are taken from a top-down view at high altitude, whereas most datasets commonly used by deep learning models contain images from a ground-level perspective. In practical application scenarios, the shooting angle of the same target is variable, and a detection model trained only on conventional datasets performs worse when applied to UAV remote sensing detection in military scenes. The angle index mainly evaluates the performance of the algorithm model at different angles.
Each index is composed of a four-dimensional vector, and the calculation steps are as follows. The model to be evaluated is entered into the test platform and applied, respectively, to different lighting, environment, angle, and scale conditions: lighting 1/environment 1/angle 1/scale 1, lighting 2/environment 2/angle 2/scale 2, …, lighting $n$/environment $n$/angle $n$/scale $n$, where $n$ is the number of levels of the corresponding condition. Taking the illumination index as an example, the detection quality score and speed score of a single frame image are obtained under illumination level $j$, and then the mean and standard deviation of the quality and speed scores over all illumination levels are calculated. The mean value represents the performance of the algorithm, and the standard deviation represents the stability of the algorithm on that index. For example, compared with the traditional KITTI dataset, our dataset is more flexible to acquire (see Figure 3).

3.2.2. Quality Score and Speed Score
Assume that the quality score of the model to be evaluated under the $j$-th light/environment/scale/angle condition is $q_{i,j}$, which is the mAP value, and that the corresponding speed score is $v_{i,j}$, which is the FPS value, where $i$ indexes the condition type and $n_i$ denotes the number of tested conditions of type $i$. The mean and standard deviation of the quality scores and of the speed scores of the model to be evaluated are computed over all conditions of each type:
$$\mu_{q,i}=\frac{1}{n_i}\sum_{j=1}^{n_i} q_{i,j},\qquad \sigma_{q,i}=\sqrt{\frac{1}{n_i}\sum_{j=1}^{n_i}\left(q_{i,j}-\mu_{q,i}\right)^{2}},$$
$$\mu_{v,i}=\frac{1}{n_i}\sum_{j=1}^{n_i} v_{i,j},\qquad \sigma_{v,i}=\sqrt{\frac{1}{n_i}\sum_{j=1}^{n_i}\left(v_{i,j}-\mu_{v,i}\right)^{2}}. \qquad (1)$$
In $q_{i,j}$, the type index $i$ can be 1, 2, 3, and 4: $q_{1,j}$, $q_{2,j}$, $q_{3,j}$, and $q_{4,j}$ represent the mAP score of an algorithm under illumination $j$, environment $j$, scale $j$, and angle $j$, respectively. Likewise, in $v_{i,j}$, $i$ can be 1, 2, 3, and 4: $v_{1,j}$, $v_{2,j}$, $v_{3,j}$, and $v_{4,j}$ represent the FPS score of an algorithm under illumination $j$, environment $j$, scale $j$, and angle $j$, respectively.
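As a concrete illustration of Equation (1), the following NumPy sketch computes the per-type means and standard deviations and assembles them into the four-dimensional index vectors described in Section 3.2.1; the per-condition mAP and FPS numbers are purely hypothetical placeholders for values measured on the platform.

```python
import numpy as np

# Hypothetical per-condition results for one algorithm: for each condition type,
# a list of (mAP, FPS) measurements, one entry per tested level of that condition.
results = {
    "illumination": [(0.62, 21.3), (0.58, 21.1), (0.49, 20.8)],
    "environment":  [(0.60, 21.0), (0.44, 20.5), (0.41, 20.2), (0.47, 20.7), (0.52, 20.9)],
    "scale":        [(0.55, 21.2), (0.50, 21.0), (0.38, 20.6)],
    "angle":        [(0.57, 21.1), (0.56, 21.0), (0.53, 20.9)],
}

index_vectors = {}
for condition_type, scores in results.items():
    q = np.array([s[0] for s in scores])  # quality scores q_{i,j} (mAP)
    v = np.array([s[1] for s in scores])  # speed scores v_{i,j} (FPS)
    # four-dimensional index vector (mean mAP, std mAP, mean FPS, std FPS)
    index_vectors[condition_type] = (q.mean(), q.std(), v.mean(), v.std())

for name, (mq, sq, mv, sv) in index_vectors.items():
    print(f"{name}: mAP {mq:.3f} ± {sq:.3f}, FPS {mv:.2f} ± {sv:.2f}")
```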
3.2.3. Adaptability Score of Each Index
The calculation process of the illumination index, environmental index, scale index, and angle index is shown in Figure 4. The algorithm to be evaluated is tested to obtain the illumination index $T_1$ of the model. The scoring formulas for the environmental index, scale index, and angle index are the same as above, yielding the environmental index $T_2$, the scale index $T_3$, and the angle index $T_4$, respectively.

The expression of the algorithm index is shown in Equation (2):
$$T_i=\left(\mu_{q,i},\ \sigma_{q,i},\ \mu_{v,i},\ \sigma_{v,i}\right),\quad i=1,2,3,4. \qquad (2)$$
In $T_i$, $i$ can be 1, 2, 3, and 4: $T_1$, $T_2$, $T_3$, and $T_4$ represent the illumination index, environmental index, scale index, and angle index, respectively. Correspondingly, $n_1$, $n_2$, $n_3$, and $n_4$ in Equation (1) represent the number of different lights, environments, scales, and angles tested by the algorithm.
The scores of illumination adaptability, environment adaptability, scale adaptability, and angle adaptability are calculated from the components of the corresponding index vectors by Equation (3).
In $S_i$, $i$ can be 1, 2, 3, and 4: $S_1$, $S_2$, $S_3$, and $S_4$ represent the illumination adaptability score, environmental adaptability score, scale adaptability score, and angle adaptability score, respectively.
3.3. Comprehensive Evaluation Index Formulation Based on Hierarchical Model
The comprehensive evaluation index model is constructed based on the hierarchical structure model.
3.3.1. Construction of Hierarchical Structure Model
We utilize the analytic hierarchy process (AHP) to analyze the weights of the four indicators. An evaluation target layer, an evaluation criteria layer, and an evaluation scheme layer are used to build a three-layer structure model. The top-level evaluation target layer reflects the purpose of the decision; in our evaluation model, the weights of illumination, environment, scale, and angle are obtained in order to compute the final score.
The lowest evaluation scheme layer contains the alternative schemes for decision-making; seven test algorithms are selected, namely, Faster RCNN [38], RetinaNet [39], ATSS [40], FoveaBox [41], GFocal Loss [42], PAFPN [43], and RepPoints [44], representing seven schemes. The middle layer contains the criteria that need to be considered in the evaluation process, namely, the four evaluation indexes: illumination index, environment index, scale index, and angle index. We solve the problem of the relative weights of the evaluation criteria layer with respect to the evaluation target layer. For each algorithm in the evaluation scheme layer, the final score is calculated according to the importance weights of the four indicators, so as to identify the optimal scheme among the seven.
3.3.2. Construct the Judgment Matrix
We construct the judgment matrix using the pairwise comparison method. Because factors with different properties are difficult to compare directly, different understandings of the four indicators in the evaluation criteria layer lead to inconsistent standards and hence evaluation errors. Through repeated discussion and expert consultation, a judgment table is constructed according to the scale table of the pairwise comparison matrix. The scale table is shown in Table 1.
3.3.3. Hierarchy Single Sorting and Consistency Test
The judgment matrix contains subjective evaluation, and the weight of each index in the criteria layer is obtained from the judgment matrix; a consistency test must therefore be carried out, and only if the test is passed can the obtained weights be used. The consistency index is calculated as shown in Equation (4):
$$\mathrm{CI}=\frac{\lambda_{\max}-n}{n-1}, \qquad (4)$$
where $\lambda_{\max}$ is the largest eigenvalue of the judgment matrix and $n$ is its dimension.
The average random consistency index RI is determined by the matrix dimension $n$; the RI values are shown in Table 2.
The consistency ratio is calculated as shown in Equation (5):
$$\mathrm{CR}=\frac{\mathrm{CI}}{\mathrm{RI}}. \qquad (5)$$
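The following Python sketch gathers the three weighting methods mentioned in the introduction (arithmetic mean, geometric mean, eigenvalue) and the CI/CR test of Equations (4) and (5); the RI table follows Saaty's standard values, and the rest is a non-authoritative implementation sketch rather than the exact procedure used by the authors.

```python
import numpy as np

RI = {1: 0.0, 2: 0.0, 3: 0.58, 4: 0.89, 5: 1.12, 6: 1.24, 7: 1.32, 8: 1.41, 9: 1.45}

def ahp(judgment):
    """Weights of a pairwise judgment matrix by three methods, plus the CI/CR test."""
    A = np.asarray(judgment, dtype=float)
    n = A.shape[0]
    w_arith = np.mean(A / A.sum(axis=0), axis=1)      # arithmetic mean of normalized columns
    w_geom = np.prod(A, axis=1) ** (1.0 / n)
    w_geom /= w_geom.sum()                            # geometric mean of rows, normalized
    eigvals, eigvecs = np.linalg.eig(A)
    k = np.argmax(eigvals.real)
    w_eig = np.abs(eigvecs[:, k].real)
    w_eig /= w_eig.sum()                              # normalized principal eigenvector
    lam_max = eigvals.real[k]
    ci = (lam_max - n) / (n - 1)                      # Equation (4)
    cr = ci / RI[n] if RI[n] > 0 else 0.0             # Equation (5)
    return w_arith, w_geom, w_eig, ci, cr
```

When the judgment matrix is close to consistent (CR < 0.1), the three weight vectors agree closely.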
4. Result
4.1. The Subindex of Each Algorithm
On the algorithm test platform, we evaluate the performance of each algorithm model from several perspectives, namely, the illumination index, environment index, scale index, and angle index. For the illumination index, light simulation and data sampling are performed under normal light, low light, and weak light. For the environmental index, environmental simulation and data sampling are tested under five scenes: constant-light close range, thin fog, thick fog, rain, and snow. For the angle index, simulation and data sampling are performed for three shooting angles (angle 1, angle 2, and angle 3). For the scale index, simulation and data sampling are performed for distant, medium, and near views. The accuracy scores and the per-index means and variances of the tested algorithms (Faster RCNN, RetinaNet, ATSS, FoveaBox, GFocal Loss, PAFPN, and RepPoints) are shown in Tables 3 and 4. The adaptability scores are shown in Table 5.
To illustrate this more intuitively, Figure 5 shows the advantages of each algorithm on the four indicators.

It can be seen from the line chart that the angle adaptability scores of the tested algorithms are relatively high, which shows that all of the object detection algorithms cope well with practical tasks involving various angles. On the other hand, the adaptability of the algorithms to scenes with various scales is slightly worse, so those scores are not high. In general, FoveaBox has the best comprehensive performance, while RetinaNet performs the worst.
4.2. Comprehensive Evaluation Index
For different application scenarios, we set the judgment matrix differently. In some regions, climate conditions differ from those of ordinary areas due to variable meteorological conditions, air pollution, and other factors, and conventional UAV remote sensing missions are susceptible to adverse and changeable weather. Some applications are special, such as remote-sensing-based urban physical examination, which limits factors such as UAV flight height and camera focal length. In other scenes, the targets are densely arranged at multiple angles, and the UAV is strongly constrained in shooting angle during monitoring.
4.2.1. Strong Illumination Index
Small and micro aerial vehicles usually fly below the clouds. On a cloudy day, the remote sensing pictures taken by the aircraft are not blocked by the clouds but are affected by sharp changes in illumination intensity, resulting in uneven light and shade. Even in a sunny environment without cloud interference, the illumination changes constantly; although this change is hardly visible to the eye, it affects the gray values of the image. In addition, the UAV is limited by its flight height and the focal length of the camera, so the area covered by a single image is small, and multiple images need to be stitched and fused into a panoramic image; local chromatic aberration caused by illumination deviation is then a difficult problem. For all-weather detection tasks in changeable environments and for applications focused on extracting geometric information of geographic space, illumination adaptability is in high demand.
4.2.2. Strong Environmental Index
The UAV is small and lightweight, has stricter requirements for takeoff and landing sites, and is highly sensitive to the working environment, such as terrain, landform, and weather conditions. The performance of the same detection algorithm varies greatly under different weather conditions, and in extreme conditions the camera cannot shoot at all. For optical sensors, when clouds, fog, or water vapor interfere with the signal path, the remote sensing image taken by the camera is partially obscured. Applications with large weather variation and insufficient sunlight therefore place high demands on environmental adaptability.
4.2.3. Strong Scale Index
Many objects in the scene are small and of varying sizes: small-scale objects occupy a smaller range of pixels in the image, while large-scale objects occupy a larger range. Common remote sensing target monitoring methods have difficulty identifying target objects at different scales and generalize poorly. In a convolutional neural network, deeper features have larger receptive fields and richer semantic information, but geometric detail is lost as the resolution decreases, whereas shallow features have small receptive fields and rich geometric information. Application scenes with large changes in target scale and density, a large image field of view, and complex backgrounds therefore require strong scale adaptability.
4.2.4. Strong Angle Index
Objects in the scene are generally rotated arbitrarily at multiple angles and densely arranged, so the usual approach of capturing only front-facing views is not sufficient. A single receptive field does not handle this object variability well: the receptive field of a typical network is fixed in size, has no orientation, and is computed only along the horizontal direction, whereas the vertical projection in the monitoring scene leads to targets appearing at different angles. In moving object monitoring, target tracking, antidetection, and similar application scenarios, the angle index is relatively important. Tasks with large target rotation changes, meaning that targets of the same type appear at many different angles, require higher angle adaptability.
4.2.5. The Military Object Detection Task
The multichannel style generation and conversion system for natural phenomena constructed in this project provides users with the four seasonal variations of spring, summer, autumn, and winter, as well as 24-hour day-night changes. Users can combine any natural phenomena, such as the four seasons, day and night, rain and snow, and cloud and fog, as required. The rendering of the scene in the four seasons is shown in Figure 6.

The military scene is located in a mountainous area with a changeable environment, so the UAV detection task places a high adaptability demand on the environment. The shooting altitude is low, and the lighting changes are not drastic. Moreover, as the UAV focuses on detecting objects with many angle variations, such as weapons and vehicles, the adaptability requirement for angle changes is also high. In the reconnaissance missions of small unmanned aerial vehicles, the flying height is relatively fixed, and the scale of the photos taken is not strict. For this task, the judgment matrix was selected after expert review and a questionnaire survey, and the weight vector obtained from it by the geometric mean method is [0.14371197, 0.57484787, 0.0630452, 0.21839496]. In the final comprehensive evaluation stage, the evaluator can set the judgment matrix for different object detection tasks. However, this judgment matrix must be supported by a large number of questionnaires or reviewed by experts in the field and must pass the consistency test.
We set $n$ to 4 and calculated CI to be 0.017271 for this matrix; CR was then calculated as 0.019406. The criterion for passing the consistency test is that the consistency ratio CR is less than 0.1: if this inequality holds, the consistency of the judgment matrix is acceptable; otherwise, the judgment matrix should be modified. The comprehensive evaluation scores of the seven algorithms involved in the evaluation are shown in Table 6.
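For illustration, the sketch below applies the AHP computation to a judgment matrix reconstructed from the reported results: its geometric-mean weights reproduce the published weight vector, and its CI/CR approximately match the reported 0.017271 and 0.019406, under the assumption that rows and columns follow the (illumination, environment, scale, angle) order used throughout the paper. The matrix itself, the sample adaptability scores, and the weighted-sum aggregation of the comprehensive score are illustrative assumptions, not values taken verbatim from the text.

```python
import numpy as np

# Judgment matrix over (illumination, environment, scale, angle), reconstructed so that its
# geometric-mean weights match the reported vector; treat it as an illustrative assumption.
M = np.array([[1.0, 1/4, 3.0, 1/2],
              [4.0, 1.0, 8.0, 3.0],
              [1/3, 1/8, 1.0, 1/3],
              [2.0, 1/3, 3.0, 1.0]])

w = np.prod(M, axis=1) ** 0.25
w /= w.sum()                        # ~[0.1437, 0.5748, 0.0630, 0.2184]
lam_max = np.mean((M @ w) / w)      # approximation of the largest eigenvalue
CI = (lam_max - 4) / 3              # ~0.017, Equation (4)
CR = CI / 0.89                      # RI = 0.89 for n = 4 -> ~0.019 < 0.1, Equation (5)

# Hypothetical adaptability scores (illumination, environment, scale, angle) for one algorithm;
# the comprehensive score is assumed to be their weighted sum, the standard AHP aggregation.
S = np.array([0.61, 0.47, 0.39, 0.66])
print("weights:", np.round(w, 4), "CR:", round(float(CR), 4))
print("comprehensive score:", round(float(w @ S), 4))
```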
5. Summary
We propose a novel method to compute the adaptability scores of various algorithms under different conditions in UAV detection tasks and formulate a new benchmark. A limitation is that the adaptability score of an algorithm under different conditions depends on comparison with other algorithms: the more algorithms are tested, the more accurate the results. Secondly, the hierarchical analysis method is largely based on qualitative analysis. The importance assigned to each indicator in the judgment matrix is subjective, which can cause conflicts in multiperson decisions. Although this article uses expert decision-making to summarize opinions and obtain the weights, the process is long and cumbersome. To handle remote sensing object detection tasks with subtle environmental differences under a given index, the judgment matrix must be formulated again. The generalization ability of this method is therefore limited, and it cannot be flexibly applied to complex remote sensing object detection scenes in subdivided areas.
Data Availability
No data were used to support this study.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This work is supported by the National Natural Science Foundation of China under Grant No. 61971383 and the CETC funding.