Abstract
Glass reflection and refraction cause missing and distorted object feature data, degrading the accuracy of object detection. To address these problems, this paper proposes a glass-refraction distortion object detection algorithm based on abstract features. The number of parameters is reduced by introducing skip connections and dilation modules with different dilation rates. The abstract feature information of the object is extracted via a binary cross-entropy loss. Meanwhile, a loss function reduces the abstract feature distance between the object domain and the source domain, which improves detection accuracy under glass interference. To verify the effectiveness of the algorithm, the GRI dataset is produced and made public on GitHub. Compared with the state-of-the-art Deep Face, VGG Face, TBE-CNN, DA-GAN, PEN-3D, and LMZMPM, our algorithm achieves the highest average detection accuracy of 92.57% with only 5.13 M parameters.
1. Introduction
The widespread use of glass curtain walls in architectural design makes buildings more attractive and improves interior lighting. At the same time, the use of dark glass increases the difficulty of collecting interior information from the outside. In actual military operations, it is often necessary to collect and identify information on enemy personnel inside a building through glass walls or dark glass, but tinted glass filters or adds color information to the object. Varying degrees of glass bending and uneven thickness change the refractive index across different regions of the glass, distorting and deforming the collected object and shifting its position information [1]. Moreover, reflected images [2, 3] and stains on the glass surface obscure object information, leaving parts of the acquired image missing or blurred, which makes accurate identification of enemy objects under such circumstances even more difficult.
While collecting object information, the algorithm must avoid alerting enemy personnel, so it needs good real-time performance with fast search and object identification capabilities. Effective model training is an important prerequisite for detection algorithms: the larger the number of samples and the stronger the training model, the better the detection [4]. However, during detection through translucent glass, uncontrollable factors such as uneven glass thickness, bending, scattering, and reflection produce distortion, deformation, and blurred or missing feature information that is difficult to predict in the collected sample images, making them hard to label accurately or to use for training. Without accurate and sufficient training sample data, effective model training is difficult, and the accuracy and robustness of detection for strike objects cannot be guaranteed.
In view of the above problems, this paper proposes a method for detecting objects through windows. First, the ResNet structure is improved with skip connections and dilation blocks to reduce the number of network parameters. Second, the abstract features of the object are extracted with a binary cross-entropy loss; these features ignore some image details and are highly distinguishable. Finally, the relative position relationships of the part features are used to correct the distorted object image, which improves the accuracy of object detection. The specific process is shown in Figure 1.

In this paper, our main contributions include the following:
(1) An improved deep residual neural network. The gradient vanishing problem is effectively reduced by introducing skip connections, and dilation modules with different dilation rates are introduced to reduce the number of trainable parameters.
(2) A distortion object detection method. The key feature information of the object is abstractly extracted with a binary cross-entropy loss, and the relative distance relationships of the features are used to correct image distortion caused by glass refraction, increasing the accuracy of object identification and localization.
(3) The proposed algorithm was evaluated under glass refraction, glass reflection, and glass surface stain interference on the GRI dataset and achieved good performance.
2. Related Work
During through-window object detection, inconsistent refractive indices cause unpredictable distortions in the object image and random changes in the relative positional relationships of the object features [5]. Because the refractive index of each part of the glass cannot be accurately measured, it is difficult for an algorithm to repair the distorted and deformed images by refractive index inversion [6, 7], which complicates the detection process. At the same time, stains or water mist on the glass surface and the use of tinted glass make the captured images blurrier. Factors such as reflective contamination on the glass surface and object occlusion also make the identification process more difficult.
To address the glass reflection problem during detection, Zhang et al. [3] proposed to use an edge-aware cascade network to remove reflections, and thus reduce the effect of glass reflections when taking pictures through glass. To reduce the impact of glass reflections on the quality of acquired images, a multi-image-based depth estimation method using convolutional neural networks (CNN) was proposed in [2]. After first classifying the background and reflected images in the image, the reflected images are then removed, and finally, the removed background edges are regenerated using a generative adversarial network (GAN). Ref. [8, 9] proposed regularization methods based on wavelet transform and a network based on feature sharing strategy for the removal of reflective information from images, respectively.
In terms of reducing the effect of image distortion on object detection, Ref. [10] proposed combining different networks to restore the linear geometry of the face, thereby reducing the effect of image distortion on detection. In [11], an automatic correction method for omnidirectional image distortion based on a unified learning model (OIDC-Net) was proposed, which uses an attention mechanism with heterogeneous distortion coefficient estimation to correct omnidirectional images. These methods perform well on their corresponding distorted images, but those distortions exhibit some regularity and can be adjusted with corresponding coefficients. For distortion caused by glass refraction, the irregularity of the glass and its unknown parameters make correction more difficult, and no corresponding research literature has been found.
With the rapid development of deep learning, detection ability has greatly improved, and high real-time performance is also required of such algorithms [12]. Qi et al. [13] used spatial relationships between local patches and facial features, which improved the accuracy and robustness of tracking objects under pose change and occlusion. Jian et al. [14] combined color contrast features with orientation and background features to generate robust saliency maps, making saliency detection more effective and efficient. Cai et al. [15] proposed an underwater distortion target recognition network that compensates for missing salient features with spatial semantic information, effectively improving the accuracy of recognizing underwater distorted targets. Qian et al. [16] proposed the Distortion Generation Network, which automatically generates numerous fish-eye images from only a few samples, and the Adversarial Distortion Generation Network, which mines hard examples via adversarial training; these hard examples help detectors become more robust to severely distorted objects in fish-eye images.
In summary, there has been some research on object detection and anti-interference, but little on through-window object detection. Glass refraction and other factors distort and blur the object, which still affects the accuracy of object detection and localization through glass. In this paper, skip connections and a dilation module are added to the network to reduce the number of parameters. To address distortion and feature blurring in distorted images, the relative distance relationships of object features are corrected via a domain contrast loss, which increases the accuracy of object detection in distorted images.
3. Proposed Method
3.1. Network Model
To reduce the number of network parameters and improve operational efficiency, the optimized FSSNet [17] is used as the backbone of the distortion object detection network. An encoder network re-encodes the acquired translucent images into abstract features, reducing the interference of factors such as color, brightness, and blur during detection. The decoder then re-decodes the abstract features, combined with the relative position information of the features, to reduce the position shifts and image distortion caused by the inhomogeneous refractive index.
Let the input image be a 3-channel image. The convolutional layer applies its convolution kernels and is followed by batch normalization and a rectified linear unit (ReLU), i.e., Conv-BN-ReLU, producing a feature map. In parallel, a max-pooling operation on the input yields a 3-channel feature map. The two feature maps are fused to obtain a 16-channel feature map. A skip connection is also added between the input and output so that the model can extract features better, as shown in Figure 2.

To extract feature information under different fields of view, the convolution filters are configured as a convolution module with different dilation rates, i.e., a dilation module. The dilation module allows the network to obtain a larger field of view with fewer layers, with a different dilation rate assigned to each branch. Dilated convolution thereby expands the effective receptive field. As shown in Figure 3, the module is defined as a set of convolutions followed by batch normalization and an activation function; the algorithm in this paper uses the parametric rectified linear unit (PReLU) [18] as the activation function.
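As a rough illustration of how stacked dilated convolutions grow the field of view with few layers, the sketch below computes the effective receptive field of a stack of 3×3 convolutions with stride 1. The dilation rates used here are illustrative examples only; the paper's exact settings are not given in this text.

```python
def receptive_field(kernel_size, dilations):
    """Effective receptive field of stacked dilated convolutions (stride 1)."""
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d  # each layer adds (k - 1) * d pixels of view
    return rf

# Hypothetical example: three 3x3 layers, plain vs. dilated.
print(receptive_field(3, [1, 1, 1]))  # 7
print(receptive_field(3, [1, 2, 4]))  # 15
```

With the same three layers, raising the dilation rates more than doubles the receptive field at no extra parameter cost, which is the motivation for the dilation module.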

3.2. Distortion Image Correction Detection
3.2.1. Abstract Feature Extraction (Encoder)
Currently, most algorithms extract features from every pixel of the original image and use the extracted features for subsequent object recognition or detection. However, the human eye does not depend entirely on the information in every pixel during the detection process. For example, accurate detection can still be achieved when the human eye sees a sketch or a distorted image of the object. In this paper, the feature extraction process is optimized based on this observation.
Let D_s and D_t denote the source and object domains, respectively, with corresponding samples x_s ∈ D_s and x_t ∈ D_t, where n is the number of samples. Object feature extraction maps a sample x to a two-dimensional label space Y, which can be expressed as f: x → Y, where f is the mapping function.
Since the input image is distorted, the feature extraction process must not only extract the object features but also compute and store the relative position information of the features. This relative position information is compared with the data trained on the source domain, and the feature positions of the input image are corrected, so that the information in the extracted feature space is less affected by interference such as glass refraction and reflection. Let E be the encoder that extracts the abstract features of the object. From the binary cross-entropy loss, it is obtained that

ε(E) = −(1/n) Σ_{i=1}^{n} [ y_i log E(x_i) + (1 − y_i) log(1 − E(x_i)) ],

where ε(E) is the empirical error of E on the source domain labels. To improve the accuracy of the feature mapping function f, the following optimal mapping function can be found by reducing the error:

f* = arg min_f ε(f).
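The binary cross-entropy loss used here can be sketched as a minimal NumPy function (a standard-form sketch, not the paper's implementation; encoder outputs are assumed to be probabilities in [0, 1]):

```python
import numpy as np

def binary_cross_entropy(p, y, eps=1e-12):
    """Mean binary cross-entropy between predicted probabilities p and labels y."""
    p = np.clip(np.asarray(p, float), eps, 1.0 - eps)  # guard against log(0)
    y = np.asarray(y, float)
    return float(np.mean(-(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))))
```

For instance, `binary_cross_entropy([0.9, 0.1], [1, 0])` evaluates to −ln(0.9) ≈ 0.105; the loss shrinks toward 0 as the predictions approach the labels, which is exactly the error being minimized above.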
As shown in Figure 4, each feature of the input image is recorded in the 2D plane by the feature mapping function f, which simplifies the subsequent correction of distorted features.

3.2.2. Distorted Image Object Detection (Decoder)
The algorithm in this paper requires not only the extraction of features but also the determination of their relative position relationships. Each feature of the input image is mapped into the 2D plane by the feature mapping function f, and the relative distance relationship of the features is calculated, which can be expressed as follows:

d(i, j) = ‖p_i − p_j‖,

where d(i, j) is the distance between feature point i and feature point j, and ‖·‖ is the distance calculation. The distances between features are stored during feature extraction, i.e., the relative position relationship of each feature is calculated.
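The stored relative positions can be sketched as a pairwise Euclidean distance matrix over the 2D feature points (an illustrative helper, assuming features are represented as (x, y) coordinates as in the mapping described above):

```python
import numpy as np

def pairwise_distances(points):
    """Matrix of Euclidean distances d(i, j) between 2D feature points."""
    pts = np.asarray(points, float)
    diff = pts[:, None, :] - pts[None, :, :]   # (n, n, 2) pairwise differences
    return np.sqrt((diff ** 2).sum(axis=-1))   # (n, n) distance matrix
```

The matrix is symmetric with a zero diagonal; comparing it against the matrix learned from undistorted source-domain samples is what allows shifted feature positions to be corrected.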
Let q denote the learned training feature, and construct positive sample pairs (q, k_s⁺) and (q, k_t⁺) for a batch of samples x_s in the source domain D_s and a distorted image sample x_t in the object domain D_t. The domain contrast loss from the source domain to the abstract feature space is as follows:

L_s = −log [ exp(q · k_s⁺ / τ) / Σ_{i=0}^{K} exp(q · k_i / τ) ],

where q is the feature from the last convolution layer of the sample image and τ is the temperature parameter. Similarly, the contrast loss from the object domain to the abstract feature space is as follows:

L_t = −log [ exp(q · k_t⁺ / τ) / Σ_{i=0}^{K} exp(q · k_i / τ) ].
The autonomous correction of distorted images is achieved by reducing the contrast loss from the object domain to the abstract feature space; that is, the distance between the relative position relationships of the object domain and the abstract features of the source domain is reduced. The specific formulation is expressed as follows:

min_E L_t.
At the same time, region-specific identification of the object is achieved by using the locations of the minima of the contrast loss between the source and object domains.
When the window-acquired image is distorted, it can be corrected and repaired according to the relative position relationships of the features, reducing the influence of the glass's unknown refractive index and the irregularity of the distorted shape on image correction. In the decoding process, the abstract features are restored while the relative positional relationships of the features are reapplied. By adjusting the distances between feature points, distorted and deformed images are repaired, and the object feature positions before the refraction offset are accurately located; the specific process is shown in Figure 5.

3.3. Model Training
Because the refractive index of the window glass through which acquisition takes place is non-uniform and its specific values cannot be measured, the captured image undergoes unpredictable distortion. The model is therefore better suited to unsupervised training. Let X_s denote the source domain image information captured through the window and Y_s the corresponding labels. Assuming X_t is the unlabeled object domain image set, the feature information of the source domain images is extracted autonomously by the distortion object detection network, which can be expressed as follows:

F = E(X_s), F ∈ 𝓕,

where E is the encoder and 𝓕 is the feature space; the distortion object detection network is trained to learn the encoder E using a deep residual neural network.
To ensure that the network has good convergence and effective learning ability, let x be an anchor sample, from which a positive sample x⁺ can be obtained, as well as a negative sample x⁻ corresponding to x.
The encoder is updated by increasing the similarity s(E(x_i), E(x_i⁺)) between each anchor and its positive sample while decreasing the similarity s(E(x_i), E(x_i⁻)) to its negative sample, where x_i is the i-th input and s(·, ·) is a metric function measuring the similarity between samples; the encoder is trained using this similarity information.
Using the vector inner product to calculate the similarity of two samples, the loss function can be expressed as follows:

L = −log [ exp(q · k⁺ / τ) / Σ_{i=0}^{K} exp(q · k_i / τ) ],

where τ is the temperature parameter, and among the K + 1 samples there are 1 positive sample and K negative samples. The purpose of learning is to make the features of the anchor more similar to those of the positive sample and less similar to those of the negative samples. This makes the abstract features extracted by the model more representative, enabling the algorithm to perform the through-window detection task better.
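A minimal sketch of an inner-product contrastive loss of this kind (one positive, K negatives, temperature-scaled softmax). The function name and interface are illustrative, not the paper's implementation:

```python
import numpy as np

def info_nce_loss(q, k_pos, k_negs, tau=0.07):
    """Contrastive loss with 1 positive and K negatives,
    similarity computed as the vector inner product."""
    q, k_pos = np.asarray(q, float), np.asarray(k_pos, float)
    sims = [q @ k_pos] + [q @ np.asarray(k, float) for k in k_negs]
    logits = np.array(sims) / tau        # temperature-scaled similarities
    logits -= logits.max()               # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return float(-np.log(probs[0]))      # positive sits at index 0
```

When the anchor feature aligns with the positive key and is orthogonal to the negatives the loss is near zero; when it aligns with a negative instead, the loss is large, which drives the encoder toward representative abstract features. The temperature value 0.07 is a common default, not a value stated in the paper.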
4. Experimental Results and Analysis
4.1. Dataset and Evaluation Metrics
4.1.1. Dataset
To verify the effectiveness of the proposed method, the self-built Glass Refraction Image (GRI) dataset is used in the simulation process; it is publicly available on GitHub (https://github.com/Robotics-Institute-HIST/Dataset/tree/master/Glass%20Refraction%20Image%20dataset). The GRI dataset contains 1267 original images with corresponding label files and is 1.60 GB in size. The images contain a variety of interfering factors, such as glass surface stains, water stains, reflections, refractions, and occlusions, occurring to varying degrees, so the dataset can be used to evaluate the performance of distortion object detection algorithms.
4.1.2. Implementation Details
In this paper, stochastic gradient descent (SGD) is used as the optimizer. The initial learning rate is 0.01, and the momentum and weight decay are 0.9 and 0.0005, respectively. The batch size is 64. The entire training process runs for 60,000 iterations, and the learning rate decays with a decay rate of 0.1 at 48,000 and 54,000 iterations. The initial network weights follow a normal distribution with a mean of 0 and a variance determined by the convolutional kernel size and the number of convolutional kernels.
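The step-decay schedule described above (base rate 0.01, multiplied by 0.1 at the 48,000- and 54,000-iteration milestones) can be sketched as:

```python
def learning_rate(iteration, base_lr=0.01, decay=0.1, milestones=(48000, 54000)):
    """Step-decay schedule: lr is multiplied by `decay` at each milestone passed."""
    lr = base_lr
    for m in milestones:
        if iteration >= m:
            lr *= decay
    return lr
```

So the rate is 0.01 for the first 48,000 iterations, 0.001 until 54,000, and 0.0001 for the remainder of the 60,000-iteration run.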
4.1.3. Evaluation Metrics
In this paper, accuracy, confidence, and IoU are used as the evaluation criteria of the distortion object detection algorithm:

Accuracy = TP / (TP + FP) × 100%,

where TP is the number of faces correctly detected by the algorithm, FP is the number of faces incorrectly detected, and Accuracy is the percentage of correctly detected faces in the corresponding dataset.
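As defined here (correct detections over all detections), the accuracy criterion can be sketched as:

```python
def detection_accuracy(tp, fp):
    """Percentage of correct detections: TP / (TP + FP) * 100."""
    return 100.0 * tp / (tp + fp)
```

For example, 925 correct detections against 75 incorrect ones gives an accuracy of 92.5%.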
The confidence degree is the probability that indicates the detected object belongs to a certain category.
To accurately measure whether the algorithm detects the position of the real object, the IoU criterion is also used to evaluate the detection results; the deviation of the algorithm's labeled position from the real position is determined by the magnitude of the IoU:

IoU = A_o / A_u,

where A_o is the area of overlap and A_u is the area of union of the predicted and ground-truth boxes.
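A minimal IoU computation for axis-aligned boxes given as (x1, y1, x2, y2), matching the overlap-over-union definition above:

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)           # overlap area (0 if disjoint)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)            # union = sum minus overlap
```

Identical boxes yield an IoU of 1.0, disjoint boxes 0.0; the IoU values of about 0.8-0.9 reported in the tables therefore indicate predicted boxes that cover most of the ground-truth region.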
4.2. Analysis of Results
In this section, the effectiveness of the proposed algorithm for the distortion object detection task is verified by means of experimental comparative analysis. The algorithms in this paper are compared with existing excellent object detection algorithms, such as Deep Face [19], VGG Face [20], TBE-CNN [21], DA-GAN [22], PEN-3D [23], and LMZMPM [24]. The environment in which the algorithm runs: the CPU is Intel CORE i9 10900K and the graphics card is RTX 3090 VENTUS 3X 24 G.
4.2.1. Simulation of the Algorithm in This Paper
As shown in Figure 6, the image processing procedure of the algorithm is simulated. To increase the feature information of the generated image and reduce the influence of glass interference on object detection, the original image is reconstructed with only a grayscale output. In the first convolutional layer's reconstruction result, the general outline appears but the effect is not yet satisfactory; the reconstruction is then refined through the second, third, and fourth layers. The final output object image differs little from the original, and the object information is clearer. Compared with the images in Figure 6, the reconstructed image reduces the effect of ordinary and tinted glass on the object image. In the reconstruction of the last row of glass refraction images, the algorithm corrects the key feature positions of the object: the distortion of object features caused by glass refraction is improved, and the output image is clearer than the original without obvious distortion. These results show that the proposed algorithm processes distortion images satisfactorily.

As shown in Figure 7, we test the detection ability of the proposed algorithm against Deep Face, VGG Face, TBE-CNN, DA-GAN, PEN-3D, and LMZMPM. The position of the detected object is marked in a different color for each algorithm, with the detection label and confidence shown at the top left of the bounding box. For transparent, black, and yellow glass images, the confidence levels of all algorithms are relatively stable and high. The confidence of each algorithm decreases for images with water droplet interference, stain interference, and glass refraction. Among the comparison algorithms, Deep Face has the highest confidence of 0.969 for the through-glass image and the lowest of 0.851 for the glass stain interference image, a maximum confidence range of 0.118. The confidence of LMZMPM decreases from a maximum of 0.998 to a minimum of 0.987, the smallest variation at 0.011. The confidences of VGG Face, TBE-CNN, DA-GAN, and PEN-3D vary over ranges of 0.102, 0.018, 0.036, and 0.036, respectively. The confidence of the proposed algorithm varies over a range of 0.011, with the highest confidence of 0.999 for the through-glass image and the lowest of 0.988 for the stain-interference image; excluding the stain-interference images, its lowest confidence is 0.996, a variation range of only 0.003. At the same time, the proposed algorithm marks special areas while detecting the object, which benefits subsequent military applications.

4.2.2. Detection Results on the GRI Dataset
To verify the detection capability of the algorithm on distorted images, the GRI dataset is divided into three types of images: glass reflection, glass refraction, and specular interference, as shown in Figures 8 to 10.



When detecting objects through a window, glass reflection reduces the effective information in the captured image, which significantly affects detection accuracy. The detection results of different algorithms are shown in Figure 8 and Table 1. In Figure 8, each column shows reflected interference under a different condition: normal, bright light, low light, occlusion, and blur. LMZMPM has the highest detection accuracy of 95.85%, with an IoU of 0.87 for position prediction, when detecting normal images. The proposed algorithm achieves 95.57% on this type, 0.28% lower than LMZMPM, but its IoU of 0.92 is 0.05 higher. DA-GAN and LMZMPM have the highest detection accuracies of 93.59% and 93.35% for the bright light and occlusion types, respectively. However, our algorithm achieves the highest detection accuracy and IoU for the low light and blur types, with accuracies of 94.92% and 92.24% and IoUs of 0.91 and 0.88, respectively. In the average column, the proposed algorithm achieves the highest detection accuracy and IoU of 93.88% and 0.88. Although its results under the occlusion type need further improvement, our algorithm has the best overall performance. These data demonstrate its excellent detection performance and anti-interference ability for glass reflection images.
During the detection of distorted images caused by glass refraction, the object is detected, but deviations occur in position prediction. For this situation, the results of glass refraction interference detection are compared and analyzed. In Figure 9, the refractive interference is classified into six cases: partial distortion, facial distortion, regular distortion, irregular distortion, color filter, and occlusion. As shown in Figure 9(b), the facial information of the male subject is distorted by glass refraction; Deep Face, PEN-3D, and LMZMPM detected the facial information uniformly under the various conditions, without accounting for the distortion the glass imposes on the object information. As shown in Figure 9(d), the female subject's face is stretched laterally by glass refraction; Deep Face and TBE-CNN detected the image information accurately but with large errors in position prediction. The proposed algorithm performs relatively stably under the various interference conditions, with small position prediction errors, as shown in Table 2.
The detection results for glass refraction deformation images are shown in Table 2. The proposed algorithm simultaneously achieves the highest accuracy of 93.17% and IoU of 0.91 under the partial distortion condition. For facial distortion, it still achieves the highest detection rate, with IoU comparable to that of PEN-3D. Under the color filter and occlusion types, LMZMPM has the best accuracy, 0.83% and 0.21% higher than our algorithm's 92.62% and 92.83%, respectively, but our algorithm achieves the highest IoUs of 0.82 and 0.80, so its predicted bounding boxes are more consistent with the ground truth. In addition, our method has the best overall performance under the regular and irregular distortion types. The proposed algorithm shows strong robustness and practicality across the distortion types above.
In practical application scenarios, besides the interference of specular refraction and reflection, the glass surface is likely to carry stains or attached objects that introduce considerable noise. In this experiment, we consider four cases: object occlusion, water droplets, stains, and glass blurring. The experimental results are shown in Figure 10. In the results for the female subject in Figure 10(a), the bounding boxes predicted by Deep Face and VGG Face differ greatly. The data in Table 3 show that TBE-CNN labeled the object location most accurately when the object was obscured, with the maximum IoU of 0.90, but its detection accuracy was only 90.71%; in this case, the detection accuracy of the proposed algorithm is 94.31%, 3.6% higher than that of TBE-CNN. When the glass is stained, LMZMPM has the highest detection accuracy of 91.83%, 0.12% higher than our algorithm, but our algorithm attains the highest IoU of 0.90. On average, the proposed algorithm has the highest detection accuracy of 92.68% and the highest IoU of 0.90.
When detecting indoor objects through windows, the objects are often at a distance, so simulating long-distance objects is more appropriate. In Figure 11, object detection is performed under ambient interference, object occlusion, and side-face conditions, with each type of image subdivided into dark environment, bright environment, and multi-person conditions. The specific detection accuracies of each algorithm are shown in Table 4.

The DA-GAN has the highest IoU of 0.91 in the case of environmental interference, but its detection accuracy is only 85.49%. Under the same condition, the highest detection accuracy of this paper’s algorithm is 93.42%, and the IoU is 0.91, which is 0.01 lower than the IoU of the DA-GAN. In the case of object occlusion, the algorithm in this paper has the highest detection accuracy and IoU of 91.15% and 0.91, respectively. The highest detection accuracy of the LMZMPM is 94.77% in the detection of long-distance side face images, and the detection accuracy of the algorithm in this paper is 94.49%, which is 0.28% lower than that of the LMZMPM. But the IoU is 0.01 higher than the LMZMPM. Among the average data of the three cases, the detection accuracy and IoU of the algorithm in this paper are the highest at 93.02% and 0.91, respectively, which are 0.84% and 0.01 higher than the LMZMPM, and 8.15% and 0.03 higher than the DA-GAN.
The parameters and detection results of the different algorithms are shown in Table 5. The proposed algorithm has the highest average detection accuracy and IoU of 92.57% and 0.88, respectively, with 5.13 M parameters and 1.44 G FLOPs. PEN-3D has the smallest number of parameters among these algorithms, 0.67 M fewer than ours, but its detection accuracy and IoU are only 84.06% and 0.81, and its FLOPs of 2.55 G are much higher than ours. LMZMPM, which performs better under a few interference conditions, has an average detection accuracy and IoU of 91.16% and 0.85, slightly lower than the corresponding figures for the proposed algorithm, but its 59.48 M parameters far exceed ours. This analysis shows that the proposed algorithm has fewer parameters and performs better when detecting through windows. Comparing Ours and Ours-1 reveals that reconstruction of distorted images significantly improves the algorithm's performance.
5. Conclusion
The simulation is validated on the glass reflection, glass refraction, and glass surface contaminant interference images of the GRI dataset, and detection under different light intensities, glass colors, object occlusion, and image blurring is also considered. Across multiple simulations, the average detection accuracy of the proposed algorithm under different conditions is 92.57% with an IoU of 0.88, which is 1.41% and 0.03 higher than LMZMPM, and 4.57% and 0.03 higher than DA-GAN. These data show that the algorithm has excellent detection and position labeling ability for through-window images. However, object detection on occluded images still needs improvement so that the algorithm can show excellent detection and localization ability across all image types.
Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this article.
Acknowledgments
This paper was supported by National Key Research and Development Project (2019YFB1311002).