Abstract

As virtual reality technology advances, 3D environment design and modeling have garnered increasing attention, with applications spanning networked virtual environments, urban planning, industrial design, and manufacturing, among other fields. However, existing 3D modeling methods suffer from large reconstruction errors, limiting their practicality in many domains, particularly environmental design. To enhance 3D reconstruction accuracy, this study proposes a digital image processing approach that combines binocular camera calibration and stereo correction with an optimized and improved convolutional neural network (CNN) algorithm. Using the refined stereo-matching algorithm, a 3D reconstruction model was developed to improve 3D environment design and reconstruction accuracy and to optimize the 3D reconstruction effect. An experiment on the ShapeNet dataset demonstrated that the evaluation indices—Chamfer distance (CD), Earth mover’s distance (EMD), and intersection over union—of the model constructed in this study outperformed those of alternative methods. After incorporating the CNN module in the ablation experiment, CD and EMD improved by an average of 0.1 and 0.06, respectively, validating that the proposed CNN module effectively enhances point cloud reconstruction accuracy. Upon adding the CNN module, the CD index and EMD index on the dataset improved by an average of 0.34 and 0.54, respectively, indicating that the proposed CNN module exhibits strong predictive capability for point cloud coordinates. Furthermore, the model demonstrates good generalization performance.

1. Introduction

With the development of Internet technology, image processing has become an important means of acquiring and exploiting information. It allows information to be obtained easily and used to construct different technical models, and improving the effect of image processing by computer is an important part of realizing informatization. As the demand for information technology grows across society, image engineering is playing an increasingly important role in contemporary science and technology.

With the development of virtual reality technology, 3D environment design and modeling have received increasing attention and have been applied in networked virtual environments, urban planning, industrial design, manufacturing, and other fields [1]. However, existing 3D modeling methods suffer from large reconstruction errors, which limits their practicality in many fields, especially environmental design [2]. Moreover, 3D modeling for environmental design demands a high degree of 3D fidelity, whereas the restoration accuracy of reconstruction methods based on a single perspective is limited [3]. With the progress of technology, 3D modeling methods based on dual-view multidimensional data have gradually become mainstream [4]. Within the multidirectional 3D modeling framework, environment modeling based on texture mapping can achieve 3D restoration to a certain extent [5]. To further improve modeling accuracy, learning-based 3D reconstruction methods have been widely studied.

There are many methods and theories for image-based 3D reconstruction. Among them, structure from motion (SfM) is one of the most widely used classical methods [6]. SfM recovers the 3D coordinates of feature points successfully matched between images, forming a 3D point cloud. However, the feature point information contained in an image is relatively sparse [7]; therefore, the point cloud computed by SfM is sparse and the accuracy of the reconstructed model is low. The multiview stereo (MVS) [8] method can compute a dense 3D point cloud of the scene from multiple view images of the object. Patch-based MVS [9] takes the sparse point cloud reconstructed by SfM as input, then iteratively applies a point cloud expansion and filtering strategy based on image surface neighborhood information, and finally reconstructs a dense point cloud. Wang et al. [10] took the sparse reconstruction model and camera poses obtained by SfM as input and used depth map fusion to recover dense point clouds. A learning-based MVS method is presented in literature [11]. The depth map fusion methods used in literatures [12, 13] are also effective in restoring high-precision dense point clouds of the scene. Literature [14] proposed a point cloud-based 3D model reconstruction method, which achieved better reconstruction accuracy by defining loss functions such as Chamfer distance and spatial distance. Literature [15] classifies inliers and outliers based on fused features and proposes a point cloud sampling optimization strategy, which allows for a more detailed reconstruction of the point cloud. To effectively restore the occluded regions of a single view of an object, literature [16] combines a 3D encoder–decoder structure with a generative adversarial network; the detailed 3D structure of the object is reconstructed from a single view, and good experimental results are obtained on a synthesized dataset.

To improve the accuracy of single-view 3D object reconstruction, this paper proposes a fusion of digital image processing technology and a convolutional neural network (CNN) algorithm, in which the CNN is optimized and improved. Through the improved stereo-matching algorithm, a 3D reconstruction model is constructed to improve 3D environment design and reconstruction accuracy and to optimize the 3D reconstruction effect. Experiments on the ShapeNet dataset [17] show that the evaluation indexes of Chamfer distance (CD), Earth mover’s distance (EMD), and intersection over union (IoU) of the model constructed in this paper are superior to those of other traditional methods. The ablation experiments also verify that the CNN module proposed in this paper effectively improves point cloud reconstruction accuracy and predicts point cloud coordinates well, and that the generalization performance of the proposed model is also good.

2. State of the Art

2.1. Structure and Principle of CNNs

CNN is a representative algorithm in deep learning [18]. It is a deep feed-forward neural network with local connections and weight sharing. A CNN continuously extracts features through multiple convolution kernels to realize tasks such as image classification and natural language processing. The CNN used here consists of an input layer, convolution layer, pooling layer, flattening layer, forgetting (dropout) layer, and fully connected (FC) layer. Its structure is shown in Figure 1.
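
For illustration, a minimal PyTorch sketch of this layer stack is given below; the channel counts, kernel size, and dropout rate are assumptions for the example, not the configuration used in this paper.

```python
import torch
import torch.nn as nn

# A minimal sketch of the layer stack described above (input -> convolution ->
# pooling -> flattening -> dropout/"forgetting" -> fully connected output).
# The sizes are illustrative assumptions, not the paper's configuration.
class SimpleCNN(nn.Module):
    def __init__(self, num_classes: int = 1):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),   # convolution layer: feature extraction
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),                              # pooling layer: downsample, fewer parameters
        )
        self.flatten = nn.Flatten()                       # flattening layer: 2D feature maps -> 1D vector
        self.dropout = nn.Dropout(p=0.5)                  # "forgetting" (dropout) layer: regularization
        self.fc = nn.Linear(16 * 112 * 112, num_classes)  # fully connected layer: classification

    def forward(self, x):
        x = self.features(x)
        x = self.flatten(x)
        x = self.dropout(x)
        return torch.sigmoid(self.fc(x))                  # Sigmoid outputs the class probability

# Example: a single 224 x 224 RGB image
probs = SimpleCNN()(torch.randn(1, 3, 224, 224))
```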

The convolution layer mainly realizes feature extraction of the data. Each convolution kernel in the convolutional layer slides over the input data and performs a dot product with the data at each position; the output is a feature map. The convolution operation can be expressed as shown in Formula (1):

In the above formula, g represents the weight and h represents the bias.
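
Formula (1) is not reproduced in this text; consistent with the stated definitions (g as the kernel weights, h as the bias), a standard discrete 2D convolution would read as follows, which may differ in notation from the original formula.

```latex
% Standard discrete 2D convolution (reconstruction for reference):
% g: convolution-kernel weights, h: bias, x: input, y: output feature map.
y_{i,j} = \sum_{m}\sum_{n} g_{m,n}\, x_{i+m,\, j+n} + h
```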

The pooling layer replaces the network output in a region with the region’s overall characteristics. This reduces the number of network parameters and the amount of computation, helping to avoid overfitting.

The flattening layer converts 2D feature maps into a 1D vector.

The forgetting (dropout) layer temporarily hides some weight values through a parameter setting to alleviate overfitting, achieving a regularization effect to a certain extent.

The FC layer completes the classification task: it outputs the data, produces the classification result, and uses the Sigmoid function to output the classification probability value. The function formula is shown in Formula (2):

In Formula (2), s represents the output of the preceding layer of the model.
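
With s the output of the preceding layer, the standard Sigmoid function of Formula (2) is:

```latex
\mathrm{Sigmoid}(s) = \frac{1}{1 + e^{-s}}
```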

2.2. Digital Image Processing Technology

Digital image processing technology is widely used to meet the practical needs of environmental design, and stereo imaging technology in particular is developing rapidly. This paper studies the principle of 3D environment design based on stereo imaging technology. Digital image processing technology can effectively model 3D scenes and improve the realism of environmental design. The principle of the method is shown in Figure 2.

The 3D coordinates of scenes in different coordinate systems can be extracted by triangular projection. On this basis, this paper uses a stereo projection-matching algorithm to assign coordinates to the pixel points of the 3D scene. Considering that 3D reconstruction from 2D images introduces stereo distortion, this paper applies stereo compensation to the extracted image depth information on top of traditional 3D modeling, finally realizing the reconstruction of a highly faithful 3D scene. The schematic diagram of the nonparallel bidirectional stereoscopic imaging 3D modeling method is shown in Figure 3.

In Figure 3, U is projected stereoscopically in two coordinate systems, O1 and O2, and its projection points in the two projection planes are U1 and U2, respectively. Let In represent the true coordinates of U, and let I1 and Ir represent the observed coordinates of U1 and U2 in the coordinate systems with origins O1 and O2, respectively. The corresponding relationship is shown in Formula (3), where z1, zr, n1, and nr are the parameters of the stereoscopic projection transformation between the two observed coordinate systems and the real 3D coordinate system. Transforming Formula (3) yields Formula (4), where K and T are stereoscopic projection transformation parameter matrices, defined as shown in Formula (5).

The stereo projection transformation parameters differ from point to point. Stereo matching is therefore a nonlinear optimization that determines the optimal stereo projection transformation parameter matrix.
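
For illustration, the sketch below recovers the 3D coordinates of U from its two projections by triangulation. The projection matrices P1 and P2, the image-plane coordinates, and the baseline are placeholder assumptions, not the paper's K and T parameters.

```python
import numpy as np
import cv2

# Two placeholder 3x4 camera matrices: O1 at the origin, O2 shifted along the baseline.
# These stand in for the calibrated stereo projection parameters of the paper.
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-0.1], [0.0], [0.0]])])

# Matched image-plane projections U1 and U2 of the same scene point U (2 x N arrays).
u1 = np.array([[0.32], [0.18]])
u2 = np.array([[0.27], [0.18]])

X_h = cv2.triangulatePoints(P1, P2, u1, u2)   # homogeneous 4x1 result
X = (X_h[:3] / X_h[3]).ravel()                # recovered 3D coordinates of U
print(X)
```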

3. Methodology

The algorithm in this paper combines digital image processing technology with an improved CNN algorithm, from which the stereo-matching model is constructed.

In the experimental part, the reconstruction accuracy of the data is measured and the 3D reconstruction effects of different models are analyzed. Through analysis and verification, the model with higher reconstruction accuracy and a better 3D reconstruction effect is selected. With this model, the precision of 3D reconstruction can be improved, thereby optimizing 3D environment design.

3.1. Stereo-Matching Algorithm Based on Improved CNN

Deformable CNN is a deep learning model for image processing that adaptively adjusts the shape of the convolution kernel to better capture nonlinear features in images. By introducing deformable convolution, the algorithm is able to more accurately capture the subtle differences in the surface of an object in a stereoscopic image, which improves the accuracy and detail representation of point cloud reconstruction.

The stereo-matching algorithm based on deformable convolution is composed of feature extraction, matching cost space, cost postprocessing, parallax/residual regression, and parallax optimization modules. The design structure of the stereo-matching algorithm is shown in Figure 4.

The feature extraction module is an encoder–decoder that introduces a 2D deformable convolution hourglass in the encoding stage. The matching cost space is constructed by the correlation operation of DispNetC to form a 3D cost space. In the cost postprocessing module, 3D deformable convolutions with a residual structure are used to regularize the matching cost space. The parallax regression module adopts the soft argmin method proposed by GC-Net; its expression is shown in Formula (6), where the output is the predicted parallax value, d represents the candidate parallax value, Dmax represents the maximum candidate parallax, σ indicates the softmax function, and cd indicates the matching cost.
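
A minimal sketch of the soft argmin regression of Formula (6), assuming a matching-cost volume of shape (batch, Dmax, height, width):

```python
import torch
import torch.nn.functional as F

# Soft argmin in the spirit of GC-Net: the predicted parallax is the expectation
# of the candidate parallaxes d weighted by softmax(-c_d), where c_d is the
# matching cost. Tensor shapes are illustrative assumptions.
def soft_argmin(cost_volume: torch.Tensor) -> torch.Tensor:
    d_max = cost_volume.shape[1]
    prob = F.softmax(-cost_volume, dim=1)                         # sigma(-c_d)
    disparities = torch.arange(
        d_max, dtype=prob.dtype, device=prob.device
    ).view(1, d_max, 1, 1)
    return (prob * disparities).sum(dim=1)                        # (B, H, W) predicted parallax

disp = soft_argmin(torch.randn(2, 48, 32, 64))
```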

The parallax optimization module is a spatial propagation network [19]. The network can extract the similarity matrix of the image and optimize the predicted parallax value.

The algorithm is divided into three stages that produce parallax maps of increasing precision.

In the first stage, the feature extraction module extracts the feature map F1 at 1/16 resolution; the candidate parallax value therefore ranges from 0 to 1/16 Dmax. After parallax regression and optimization, the parallax map of the first stage is obtained by up-sampling and multiplying by 16.

In the second stage, the range of the candidate residual d is set to −2 to 2. Using the parallax map from Stage 1, the right feature map F2 at 1/8 resolution is warped to obtain a new feature map, which forms the matching cost space together with the left feature map. The regressed residuals are added to the parallax map of Stage 1, and the result is then optimized to obtain the parallax map of the second stage.

The third stage follows the same procedure as the second stage.
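
The warping step shared by the second and third stages can be sketched as follows; the tensor shapes and the use of grid_sample are illustrative assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

# The right feature map is warped toward the left view with the parallax predicted
# at the previous stage, so that only a small residual in [-2, 2] remains to regress.
def warp_right_to_left(feat_right: torch.Tensor, disparity: torch.Tensor) -> torch.Tensor:
    b, _, h, w = feat_right.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    xs = xs.unsqueeze(0).float() - disparity            # shift sampling positions by the parallax
    ys = ys.unsqueeze(0).float().expand_as(xs)
    grid = torch.stack([2 * xs / (w - 1) - 1,            # normalize to [-1, 1] for grid_sample
                        2 * ys / (h - 1) - 1], dim=-1)
    return F.grid_sample(feat_right, grid, align_corners=True)

warped = warp_right_to_left(torch.randn(1, 32, 60, 80), torch.full((1, 60, 80), 4.0))
```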

3.2. Deformable Convolution

An ordinary convolution consists of two steps: (1) a regular grid R is used for sampling on the input feature map i; (2) the sampled values are multiplied by the weights m and summed. For example, R = {(−1, −1), (−1, 0),…,(0, 1), (1, 1)} represents a 3 × 3 grid with a dilation rate of 1. For each position u0 on the output feature map y, the expression is shown in Formula (7), where ut represents every position belonging to R. In the deformable convolution, an offset is added to R, transforming Formula (7) into Formula (8):

Now, the sampling is performed at the regular positions plus the offsets. Because the offset is generally fractional, Formula (8) needs to be implemented by bilinear interpolation, whose expression is shown in Formula (9):

In the above formula, u represents an arbitrary position, and in Formula (8) it is the offset sampling position; v represents each integer position in the feature map i. The bilinear interpolation kernel has two dimensions and can be separated into two 1D kernels; its expression is shown in Formula (10).

Figure 5 shows a 2D deformable convolution with a convolution kernel size of 3 × 3. The offset values are obtained by applying an additional convolution layer to the same feature map; the kernel size and dilation rate of this convolution are the same as those of the current deformable convolution kernel. 2N is the number of channels of this offset convolution, corresponding to N 2D offsets. 3D deformable convolution is a generalization of 2D deformable convolution: the principle is the same as in two dimensions, with one more dimension added to the convolution.
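
A minimal sketch of this offset-prediction scheme using torchvision's deformable convolution; the channel sizes are illustrative assumptions and this is not the paper's exact feature-extraction module.

```python
import torch
from torchvision.ops import DeformConv2d

# A plain convolution predicts 2N offsets (N = 3 x 3 sampling positions),
# which then displace the sampling grid of the deformable convolution.
in_ch, out_ch, k = 32, 32, 3
offset_conv = torch.nn.Conv2d(in_ch, 2 * k * k, kernel_size=k, padding=1)  # 2N offset channels
deform_conv = DeformConv2d(in_ch, out_ch, kernel_size=k, padding=1)

x = torch.randn(1, in_ch, 48, 64)
offsets = offset_conv(x)     # (1, 2N, H, W): one (dx, dy) pair per sampling position
y = deform_conv(x, offsets)  # sampling at the regular positions plus the learned offsets
```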

3.3. Space Propagation

The spatial propagation network structure, shown in Figure 6, is used to optimize the regressed parallax map. It consists mainly of a differentiable linear propagation module and a deep CNN model that learns the similarity matrix. Linear propagation in the spatial propagation network scans the matrix row by row or column by column in four fixed directions: left to right, right to left, top to bottom, and bottom to top. The following description uses the left-to-right direction; the other directions follow the same principle.

First, assume two 2D maps, I and B, both of size t × t, where I is the map before spatial propagation and B is the map after spatial propagation; xn and bn denote their respective nth columns, both of size t × 1. Linear propagation is performed from left to right between two adjacent columns using the t × t linear transformation matrix Mn, as shown in Formula (11), in which the t × t identity matrix also appears and the initial condition is given by the first column. Dn is a diagonal matrix whose xth entry is the sum of row x of Mn, as shown in Formula (12):

Therefore, the matrix is updated recursively column by column: for each column, bn is the preceding column bn−1 multiplied by the matrix Mn and combined with xn, which is a linear operation.

When the recursion is complete, the matrix form of Formula (11) is shown in Formula (13), where G represents a triangular transformation matrix of size T × T (T = t²), the propagated vector has dimension T × 1, and the parameter λn = XDn has size t × t.

The deep CNN module is mainly used to output the similarity matrix A; linear propagation is then carried out to obtain Hq. The algorithm mainly uses the deep CNN and linear propagation modules to learn H from the left image to guide the optimization of the regressed parallax map.
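
A simplified sketch of the left-to-right linear propagation is given below, collapsing the per-column transformation matrix Mn to a per-pixel affinity for illustration; the actual spatial propagation network [19] learns richer per-column transformations.

```python
import torch

# Each column b_n is a linear combination of the input column x_n and the
# previously propagated column b_{n-1}, weighted by a learned affinity in (0, 1).
def propagate_left_to_right(x: torch.Tensor, affinity: torch.Tensor) -> torch.Tensor:
    # x, affinity: (B, C, H, W)
    b = torch.zeros_like(x)
    b[..., 0] = x[..., 0]                      # initial condition: first column
    for n in range(1, x.shape[-1]):
        b[..., n] = (1 - affinity[..., n]) * x[..., n] + affinity[..., n] * b[..., n - 1]
    return b

x = torch.rand(1, 1, 8, 8)
out = propagate_left_to_right(x, torch.sigmoid(torch.randn(1, 1, 8, 8)))
```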

3.4. Loss Function

In order to predict the position of the point cloud, EMD, CD, a symmetric loss, and an equidistant prior loss are used as loss functions for model training. The specific definitions are as follows (an illustrative implementation of the CD and EMD terms is sketched after this list):
(1) EMD

EMD is defined as the minimum sum of the distances between each element u in the set S1 and the corresponding elements in the set San; its expression is shown in Formula (14), where S1 stands for the reconstructed point cloud, San stands for the ground truth (GT) point cloud, and σ is the bijective mapping between the two sets.
(2) CD

CD is used to measure the distance between two sets of point clouds and is formally defined as Formula (15):

The first term represents the sum of the minimum distances from each point in S1 to San, and the second term represents the sum of the minimum distances from each point in San to S1.
(3) Equidistant prior loss

Let S1 be the reconstructed point cloud, let s be any point in S1, and consider its xth adjacent point. After Gaussian filtering, the position of s changes accordingly; taking the x coordinate as an example, this is shown in Formulae (16) and (17).

The equidistant prior loss is defined as shown in Formula (18), where S1 is the initial point cloud and the filtered set is the point cloud after Gaussian filtering. Introducing the equidistant prior loss function makes adjacent points stay close to each other.
(4) Symmetric loss

In order to maintain the symmetry of the point cloud model during the deformation process, a symmetric loss function for the point cloud is introduced; its expression is shown in Formula (19).

In the above formula, M (S1) is the specular reflection transformation.
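
As referenced above, a minimal sketch of the CD and EMD terms is given below, assuming both point clouds have 2,048 points and using a Hungarian assignment for the EMD bijection σ; the exact normalization of Formulae (14) and (15) may differ.

```python
import torch
from scipy.optimize import linear_sum_assignment

def chamfer_distance(s1: torch.Tensor, s_gt: torch.Tensor) -> torch.Tensor:
    # s1, s_gt: (N, 3). Sum of nearest-neighbor distances in both directions.
    d = torch.cdist(s1, s_gt)                        # (N, N) pairwise distances
    return d.min(dim=1).values.sum() + d.min(dim=0).values.sum()

def earth_mover_distance(s1: torch.Tensor, s_gt: torch.Tensor) -> torch.Tensor:
    # Optimal bijection sigma between the two sets, then sum of matched distances.
    d = torch.cdist(s1, s_gt)
    rows, cols = linear_sum_assignment(d.detach().cpu().numpy())
    return d[torch.as_tensor(rows), torch.as_tensor(cols)].sum()

pred, gt = torch.rand(2048, 3), torch.rand(2048, 3)
print(chamfer_distance(pred, gt), earth_mover_distance(pred, gt))
```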

4. Result Analysis and Discussion

4.1. Experimental Setup

In all experiments, the model input is an RGB color image, and the output is a 3D point cloud with 2,048 vertices. To train the graph-convolutional network end-to-end, the Adam optimizer is used, with the learning rate initialized to 5 × 10−5. The model is trained for 50 epochs with a batch size of 32. All experiments are implemented on NVIDIA GeForce GTX 1080 Ti GPUs using the open-source machine learning framework PyTorch.
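
The training configuration can be summarized in the following sketch; the network, data, and loss are placeholders, and only the optimizer settings mirror the setup described above.

```python
import torch

# Placeholder reconstruction network; only Adam, lr 5e-5, 50 epochs, batch 32
# reflect the paper's setup.
model = torch.nn.Linear(128, 2048 * 3)
optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)

for epoch in range(50):
    for _ in range(4):                               # placeholder batches of size 32
        images = torch.randn(32, 128)                # stands in for encoded RGB inputs
        points = model(images).view(32, 2048, 3)     # predicted 2,048-vertex point cloud
        loss = points.pow(2).mean()                  # placeholder loss (CD/EMD in the paper)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```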

4.2. Experimental Data and Evaluation Criteria

In order to evaluate the reconstruction performance of the proposed algorithm, the ShapeNet synthetic dataset, the ModelNet dataset, and the Pix3D [20, 21] real-scene dataset were used for the experiments. ShapeNet contains a total of 51,300 3D models in 13 categories, and the ModelNet dataset contains about 17,210 3D models in about 50 different categories. Partially occluded or truncated data are excluded, and the training set and test set are randomly divided at a ratio of 4 : 1. The same preprocessing is applied to the Pix3D dataset: the mask information is used to remove the useless background, the object is moved to the image center, and the image is finally scaled or cropped to 224 × 224 as the network input. In this paper, IoU, CD, and EMD are used as indicators to measure the experimental results. IoU represents the intersection over union between the 3D voxel shape reconstructed by the network and the ground-truth voxel shape; the same voxel generation method as literature [14] is adopted. CD and EMD represent the difference between two point clouds. The GT point cloud is sampled to generate a point cloud model with 2,048 vertices and compared with the point cloud reconstructed in this paper.
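
For reference, the voxel IoU metric can be sketched as follows, with the voxelization itself (resolution and thresholding) assumed to follow literature [14].

```python
import torch

# Intersection over union between the occupancy grids of the reconstructed shape
# and the ground-truth shape; grid resolution here is an illustrative assumption.
def voxel_iou(pred_occ: torch.Tensor, gt_occ: torch.Tensor) -> float:
    # pred_occ, gt_occ: boolean occupancy grids of identical shape, e.g. (32, 32, 32)
    inter = (pred_occ & gt_occ).sum().item()
    union = (pred_occ | gt_occ).sum().item()
    return inter / union if union > 0 else 1.0

iou = voxel_iou(torch.rand(32, 32, 32) > 0.5, torch.rand(32, 32, 32) > 0.5)
```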

Figure 7 verifies the robustness of the loss function design strategy proposed in this paper. Figure 7(a) compares the effect of the loss function on different training sets. On the three different training sets, the training loss generally keeps a downward trend: it decreases rapidly in the first 25 epochs and stabilizes after the 40th epoch, showing that the method in this paper is highly robust. Figure 7(b) further shows the convergence of the loss function during the point cloud deformation process of the CNN. It can be seen from Figure 7(b) that the CNN converges well in the deformation stage, indicating that the model achieves a good 3D reconstruction effect.

4.3. Quantitative Comparison of Experimental Results

In order to quantitatively analyze the differences between the proposed method and other methods, Tables 1 and 2 compare the reconstruction accuracy on the ShapeNet and ModelNet datasets. The evaluation indexes are scaled by a factor of 100 and compared with the methods of literatures [14, 22, 23]. In terms of the CD evaluation index, the method in this paper achieves higher reconstruction accuracy in 13 categories, such as airplane. Similarly, in terms of the EMD evaluation index, the method in this paper is superior to the other methods in all categories. The average reconstruction accuracy in terms of CD and EMD is higher than that of the other methods.

Further, we compare the differences between the proposed method and literatures [22, 23] in terms of IoU for different categories. As can be seen from Table 3, the IoU of the proposed method is higher in eight categories, such as airplane, while literature [22] is higher in the sofa and speaker categories.

Literature [23] achieves the best performance in the car and phone categories under 5-view reconstruction. Overall, on the ShapeNet dataset, the average IoU of the proposed method is improved by 9.16% over literature [23] under five views and by 7.63% over literature [22]. On the ModelNet dataset, the average IoU of the proposed method is improved by 11.11% over literature [23] under five views and by 9.22% over literature [22].

4.4. Comparison of Ablation Data

(1) CNN module ablation experiment comparison

In this paper, the CNN module is used to adjust the 3D reconstructed point cloud model of the stereo-matching algorithm. In order to verify the effectiveness of this method, the CNN module is replaced by a common FC layer, and the model is trained and tested. CD and EMD are used to measure the quality of the generated point cloud, and the test results are shown in Table 4.

As can be seen from Table 4, after the CNN module is added, CD and EMD show a certain improvement on most datasets and only slight declines on a few. CD improves by 0.1 on average and EMD by 0.07 on average. For the CD indicator, the chair dataset improves by 0.34; for the EMD indicator, the monitor dataset improves by 0.44. It can be seen that introducing the CNN module effectively improves the accuracy of point cloud reconstruction.

The performance of the stereo-matching algorithm is further verified by experiments in which the model is trained and tested on the bench, monitor, and phone datasets. As shown in Table 5, after the CNN module is added, the evaluation indexes on the different datasets all improve: the CD index improves by 0.36 on average and the EMD index by 0.53 on average. This shows that the CNN module predicts point cloud coordinates well.

(2) Loss function ablation experiment comparison

In order to verify the effectiveness of the loss functions adopted in this paper, different combinations of loss functions are selected and the model is retrained. The tests are based on the bench, rifle, and vessel datasets, and the results are shown in Table 6. It can be seen from Table 6 that after all loss functions are adopted, CD performs better than under the other two strategies, and the improvement holds across different datasets, indicating good generalization performance of the model.

4.5. Comparison of 3D Modeling

In order to test the effectiveness of the algorithm, the restoration fidelity of the proposed model and of different algorithm models is compared, as shown in Figure 8. A lotus flower is chosen as the subject for the natural-environment reconstruction experiment. The algorithm in this paper and the methods of literatures [24, 25] are used to reconstruct a 3D model of the same lotus flower from the collected sample data. The reconstructed models are shown in Figure 8(b).

According to Figure 8(c), comparing the image models reconstructed by the three algorithms shows that the model reconstructed by the proposed algorithm is clearer: the distortion of both the stem and the petals is small. After texture mapping, the image restoration fidelity is higher and feature point recognition is more accurate.

In order to verify the distortion of the reconstructed images, the PSNR values of the red dog images produced by the above three methods are compared; the comparison results are shown in Figure 9. An image with a higher PSNR value has lower distortion, which indicates higher image restoration quality.
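
For reference, PSNR can be computed as follows, assuming images normalized to [0, 1]:

```python
import torch

# Higher PSNR means lower distortion (MAX = 1 for normalized images).
def psnr(img: torch.Tensor, ref: torch.Tensor) -> float:
    mse = torch.mean((img - ref) ** 2)
    return float(10 * torch.log10(1.0 / mse))

value = psnr(torch.rand(3, 224, 224), torch.rand(3, 224, 224))
```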

5. Conclusion

In this study, we combine the binocular camera calibration and stereo correction of digital image processing technology with a CNN to optimize and improve the 3D reconstruction method, constructing a 3D reconstruction model based on a stereo-matching algorithm. In the experimental portion, we measure the reconstruction accuracy of the data and analyze the 3D reconstruction effects of different models. Experiments demonstrate that the proposed method achieves higher reconstruction accuracy in 13 categories, such as airplane. Regarding the EMD evaluation index, the proposed method outperforms the other methods in all categories, and its average CD and EMD reconstruction accuracy is better than that of the other methods. The proposed algorithm also performs well in terms of average IoU. After incorporating the CNN module in the ablation experiment, CD and EMD improved by an average of 0.1 and 0.06, respectively, validating that the proposed CNN module effectively enhances point cloud reconstruction accuracy. Upon adding the CNN module, the CD index and EMD index on the dataset improved by an average of 0.34 and 0.54, respectively, indicating that the proposed CNN module has strong predictive capabilities for point cloud coordinates. Furthermore, the model demonstrates good generalization performance.

Despite the significant improvement in 3D reconstruction accuracy achieved by the proposed method, it has some limitations and areas that need to be explored further. For example, (1) the CNN may be sensitive to input variations such as lighting conditions, object orientation, and occlusion, and the robustness of the method to these variables needs further investigation. Techniques to improve robustness to noise, uncertainty, and occlusion will be explored in the future to enhance performance in real-world scenarios. (2) The paper provides an overview of stereo-matching algorithms based on deformable CNNs, but the complexity and computational cost of the algorithms are not discussed in detail; the practical feasibility of the method in real-time or resource-limited situations needs to be elaborated, together with case studies in specific application scenarios. In the future, real-world scenarios such as industrial automation, robot navigation, urban planning, and industrial design will be selected for practical application, and the performance of the algorithm will be tested in these scenarios.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.