Abstract

In multiview stereo (MVS) vision, it is difficult to estimate accurate depth in textureless and occluded regions. To address this problem, several MVS studies employ the matching cost volume (MCV) approach to refine the cost in these regions. Usually, the matching costs in large textureless image regions are not reliable. In addition, if an occluded region is also textureless, the matching cost contains significant error. The goal of the proposed MVS method is to reconstruct accurate depth maps in both large-textureless and occluded-textureless regions by iteratively updating the erroneous disparity. The erroneous disparity in a textureless region is updated using the 3D disparity plane of the region in the inverse-depth space. Then, the surface consensus is computed and used to drive two processes: surface consensus refinement and matching cost update. Through the iterative update of the 3D inverse-depth plane, the surface consensus, and the matching cost, the performance of depth reconstruction in large-textureless and occluded-textureless regions is greatly improved. The performance of the proposed method is analyzed using the Middlebury multiview stereo dataset and compared with several stereo vision methods.

1. Introduction

In many multiview stereo (MVS) schemes, the matching cost volume (MCV) is commonly employed to generate and search the disparity (inverse-depth) space image for the optimal disparity. The semiglobal matching (SGM) method [1], one of the best-known stereo matching methods, employs a dynamic programming algorithm for cost aggregation of the matching cost volume. The Soft3D method [2] employs the plane sweep stereo scheme [3] to generate multiview MCVs and iteratively updates the cost volumes using the visibility of each pixel from every view. Recently, several deep neural networks have addressed the multiview stereo problem by employing the cost volume scheme with 3D convolution layers. For example, GCNet [4], DPSNet [5], and PVSNet [6] generate MCVs between the reference and matching views and apply 3D convolutions to refine the cost volumes for accurate disparity image acquisition in either a feature-learning or an end-to-end network.

The MCV is utilized mainly in two ways: depth continuity preservation and global cost optimization. In SGM, a dynamic programming algorithm is employed for depth continuity preservation, and multidirectional cost summation is used for global cost optimization. The dynamic programming algorithm in SGM uses the disparity of the neighboring pixel in the smoothness-preserving path of the cost aggregation. The 8 or 16 cost aggregation volumes are then integrated into the final cost volume. Finally, the optimal depth is determined in the Winner-Take-All (WTA) manner.
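
As a rough illustration of this aggregation, the following sketch runs one left-to-right SGM pass over a single scanline cost slice. This is a minimal sketch, not the full algorithm: the function name and penalty values (p1, p2) are our own choices, and a complete SGM would sum 8 or 16 such directional passes before the WTA step.

```python
import numpy as np

def sgm_path_1d(cost, p1=1.0, p2=4.0):
    """Aggregate a (W, D) cost slice along one left-to-right path, SGM-style:
    each pixel adds the cheapest smoothness-penalized predecessor cost."""
    W, D = cost.shape
    L = np.zeros_like(cost)
    L[0] = cost[0]
    for x in range(1, W):
        prev = L[x - 1]
        m = prev.min()
        cand = np.minimum.reduce([
            prev,                        # same disparity, no penalty
            np.roll(prev, 1) + p1,       # disparity changed by +1, small penalty
            np.roll(prev, -1) + p1,      # disparity changed by -1, small penalty
            np.full(D, m + p2),          # arbitrary jump, large penalty
        ])
        # np.roll wraps around at the ends; recompute the border candidates
        cand[0] = min(prev[0], prev[1] + p1, m + p2)
        cand[-1] = min(prev[-1], prev[-2] + p1, m + p2)
        L[x] = cost[x] + cand - m        # subtract m to keep values bounded
    return L
```

After all directional passes are summed, the WTA disparity at each pixel is the index with the smallest aggregated cost.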

In MVS, a common technique of MCV generation is the plane sweep stereo (PSS) method. Given a reference view, an MCV is generated by applying the matching criteria between every pixel in the reference view and the corresponding pixels in each matching view using predefined inverse-depth (disparity) planes. Each reference view generates MCVs with respect to the other matching views and integrates all MCVs into the final MCV for disparity determination in the WTA manner. The MCV integration uses a measurement of surface consensus: the integration weight of a voxel with high surface consensus is increased, while the weight of a voxel with low consensus is decreased. In the deep neural network approaches, a similar kind of surface consensus is learned automatically as the weights of the convolution filters.

In Soft3D, the multiview MCVs are integrated with visibility volumes for depth continuity preservation, and an iterative cost update scheme is employed for global cost optimization. Each voxel in a visibility volume represents the probability that the voxel is visible in the reference view with respect to a matching view. Thus, integrating the matching costs between the reference and matching views with the visibility volumes preserves the continuity of the visible surfaces.

As described in the above paragraphs, obtaining an accurate MCV is an important task for accurate disparity map generation and 3D reconstruction. However, in real environments, many inherent situations cause uncertainty in cost computation. One ambiguous situation is cost computation in textureless regions, caused by textureless objects, light reflection, color confusion, or object occlusion. Because this ambiguity in the MCV stems from the inherent stereo matching problem, it cannot be removed in the initial cost computation. Instead, it is necessary to refine or correct the erroneous costs to obtain optimal matching costs.

In this paper, we propose an MVS vision method that can correct the erroneous matching cost volumes caused by wide-textureless and occluded-textureless regions. The proposed idea is based on updating the precomputed PSS matching costs in an iterative cost optimization process. In one of our previous studies, we introduced an MVS method called Enhanced Soft 3D Reconstruction (EnSoft3D) [7]. The EnSoft3D method estimates the object surface consensus and uses it to update the matching cost in an iterative optimization process. After several iterations, the accuracy of the matching costs is improved, and the refined optimal disparity maps can be obtained. The performance of EnSoft3D is determined by the accuracy of the surface consensus. However, if the textureless region in an image is wide or occluded, the surface consensus can be computed inaccurately.

In a real situation, the matching cost of some parts of wide-textureless regions can be refined by a larger number of iterations. However, this risks converging to an inaccurate surface due to local minima. To obtain accurate object surface consensus and disparity maps, the proposed method focuses on refining the surface consensus by updating the matching cost of both wide-textureless and occluded-textureless regions. The contributions of the proposed MVS method are as follows:

(1) Iterative inverse-depth plane update in wide-textureless or occluded-textureless regions: the surface consensus in textured regions or small textureless regions can be sufficiently refined by the conventional matching cost update process. However, the surface consensus in wide-textureless or occluded-textureless regions cannot be updated because the cost values over a large image area are incorrect. Thus, we add an inverse-depth plane update process to the EnSoft3D scheme to correct the erroneous costs in the textureless regions.

(2) Iterative surface consensus refinement: the surface consensus in a textureless area is refined by the estimated 3D plane. The refined surface consensus is used to generate more accurate disparity maps. Using the refined disparity map, the 3D inverse-depth plane is in turn estimated more closely to the real object surface. This iterative update process improves the overall performance of matching cost computation and disparity map generation.

Figure 1 shows a comparison of the proposed method with well-known cost volume refinement methods: SGM, PSS, Soft3D, and EnSoft3D. The test image is from the Middlebury 2006 dataset [8] and contains large textureless areas. In the textureless areas, the proposed method reconstructs very accurate disparity compared with the other methods. In addition, the proposed method reconstructs the depth discontinuity and occlusion areas very accurately.

Figure 2 shows the flow diagram of the proposed 3D reconstruction method. The proposed method is implemented and performed based on the EnSoft3D scheme. It means that the accuracy of both surface consensus and matching cost, especially in the textureless region, is iteratively refined. The image segments in the textureless regions are used for 3D disparity plane fitting. The plane information is used to refine the surface consensus and matching cost. After several iterations of the proposed cost refinement process, an accurate disparity of both wide-textureless and occluded-textureless regions is obtained, as shown in Figure 2.

In the rest of the paper, more details of the proposed method are described as follows: Section 2 reviews related work. Section 3 explains the matching cost update process. Section 4 introduces the proposed method for refining the surface consensus. The experimental results are shown in Section 5.

2. Related Work

A representative disparity estimation method in stereo matching is the PatchMatch- (PM-) based method [9]. The PM-based method estimates the disparity using three processes: random matching, propagation, and random search. The random matching process estimates the initial disparity by randomly detecting corresponding pixels between the reference and neighboring images. The initial disparity is then propagated to neighboring pixels iteratively, and the estimated disparity is refined using a random search process. The PM-based method has the advantage that its computation cost is lower than that of cost volume methods, because the disparity is estimated in image space based on random computation. Therefore, the PM method is often used for estimating high-resolution disparity maps. However, like other matching cost-based methods, it also has difficulty estimating accurate disparity in textureless and occluded regions. To mitigate this problem, most algorithms perform 2D filtering as the final step of disparity map estimation.
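
The three PM processes can be sketched on a 1-D scanline with a precomputed per-pixel cost table. This is a toy illustration under our own assumptions (the function name, iteration schedule, and search-window shrinkage are ours), not the implementation of [9]:

```python
import numpy as np

def patchmatch_1d(cost, iters=4, seed=0):
    """Toy PatchMatch-style disparity estimation on a (W, D) cost table:
    random initialization, then alternating forward/backward propagation
    and random search with a shrinking window."""
    rng = np.random.default_rng(seed)
    W, D = cost.shape
    disp = rng.integers(0, D, size=W)               # random matching (init)
    for it in range(iters):
        order = range(W) if it % 2 == 0 else range(W - 1, -1, -1)
        step = 1 if it % 2 == 0 else -1
        for x in order:
            # propagation: adopt the scan-order neighbor's disparity if cheaper
            nx = x - step
            if 0 <= nx < W and cost[x, disp[nx]] < cost[x, disp[x]]:
                disp[x] = disp[nx]
            # random search: try one random candidate in a shrinking window
            radius = max(1, D // (2 ** (it + 1)))
            cand = int(np.clip(disp[x] + rng.integers(-radius, radius + 1),
                               0, D - 1))
            if cost[x, cand] < cost[x, disp[x]]:
                disp[x] = cand
    return disp
```

Because every update requires a strictly lower cost, each pixel's matching cost is nonincreasing over the iterations, which is the property that makes the scheme converge cheaply without building a full cost volume.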

For 3D reconstruction from the disparity image, the object boundary is important information. Therefore, an edge-preserving filter is commonly used, such as the bilateral filter [10], guided filter [11], or weighted median filter [12]. However, if the window size becomes large, the accuracy of object shape reconstruction is reduced. Additionally, if there is a significant disparity error within the filtering range, the error can propagate to neighboring pixels.

The cost volume method usually refines the cost to improve the quality of the disparity image, instead of refining the disparity image directly. If the matching cost is refined accurately, then the correct disparity is determined in the WTA manner. The matching cost forms a 3D volume; thus, it requires more computation time. Nevertheless, it has the advantage of being able to refine the object shape in detail. Hosni et al. [13] propose a stereo matching algorithm that refines the cost volume using a guided filter. It can obtain an edge-preserving disparity map; however, the refinement performance is limited in textureless regions. Liu et al. [14] propose a matching cost refinement method using an adaptive guided filter to refine the disparity in textureless regions. However, if the textureless region is large, the refinement performance decreases. In addition, the foreground fattening problem remains in the occluded and textureless regions.

The Soft3D method refines the disparity in the occluded regions using view visibility. In this method, the accuracy of view synthesis is decided by depth accuracy, especially in the occluded region. The occluded region usually appears in the object boundary; thus, the pixel color is distinguishable. However, if the depth information of this region is incorrect, incorrect color can be synthesized. This is the reason why the Soft3D algorithm focuses on improving the disparity of occluded regions.

The MCV is also commonly employed in learning-based stereo matching methods. GCNet is the first end-to-end network that follows the standard two-view stereo matching scheme. From the rectified stereo image input, features are computed using multiple 2D convolution layers, as in other typical feature extraction networks. The two stereo feature maps are then concatenated at each disparity plane. Thus, the final MCV is a 4D volume (height × width × disparity × feature), which needs a large memory space on the GPU. The 4D cost volume is then regularized by 3D convolution layers. After several 3D convolutions, each voxel in the regularized MCV holds a probability (matching pixel similarity) value so that the final disparity can be decided in the WTA manner (soft argmax).
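
The concatenation step that produces the 4D volume can be sketched with plain arrays standing in for CNN feature maps. This is a shape-level sketch under our assumptions (arbitrary feature arrays, rightward shift for positive disparity, zero padding at the border), not GCNet's actual implementation:

```python
import numpy as np

def concat_cost_volume(feat_l, feat_r, max_disp):
    """Build a GCNet-style 4D cost volume of shape (D, H, W, 2F) by
    concatenating left features with right features shifted by each
    candidate disparity d. feat_l, feat_r: (H, W, F) feature maps."""
    H, W, F = feat_l.shape
    vol = np.zeros((max_disp, H, W, 2 * F), dtype=feat_l.dtype)
    for d in range(max_disp):
        shifted = np.zeros_like(feat_r)
        if d == 0:
            shifted[:] = feat_r
        else:
            shifted[:, d:, :] = feat_r[:, :-d, :]   # shift by d, zero-pad border
        vol[d] = np.concatenate([feat_l, shifted], axis=-1)
    return vol
```

The quadratic growth of this volume with disparity range and feature width is why the text notes the large GPU memory requirement.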

The baseline of GCNet has been extended to other learning-based multiview stereo matching methods. Among them, DPSNet and PVSNet are end-to-end multiview stereo matching networks based on PSS. Each reference view computes 4D MCVs with respect to the other matching views, as in GCNet, and each two-view MCV is integrated and regularized. As in the hand-crafted MCV approaches, the matching costs are optimized within the network. In particular, PVSNet integrates the two-view MCVs using the visibility information of each view, similar to Soft3D. As a result, PVSNet produces high-quality depth information.

For accurate 3D reconstruction and view synthesis, our previous method, EnSoft3D, updates the matching cost while maintaining the performance of occluded region refinement. However, as explained in Section 1, a simple increase in the number of iterations is not an effective solution to refine the textureless area. In this paper, the proposed method can refine the matching cost of wide-textureless and occluded-textureless regions without a large number of iterations.

3. Cost Volume Refinement

The cost volume refinement process consists of two main parts: surface consensus refinement and PSS matching cost update. The surface consensus of a reference view is represented as a 3D volumetric space (inverse-depth planes) and computed using the disparity maps of all the other views. Then, the PSS matching cost of the reference view is updated using the surface consensus. The refinement process runs iteratively, and the final disparity map is acquired by the refined cost volume of the reference view.

The cost volume refinement is applied to all views simultaneously, as shown in Figure 2, so that the updated matching cost in each view can improve the accuracy of the surface consensus of all views. More details of the cost volume refinement are given in the following subsections.

3.1. Stereo Matching Cost Volume

The matching cost volume (MCV) is a 3D space that consists of voxel elements; each voxel represents the cost of matching between a pixel in the reference image and another pixel in the matching image. Therefore, the volume has three indices: (x, y) for the image pixel location and d for the disparity range. The cost of matching is the result of matching criteria such as Normalized Cross-Correlation (NCC) [15], Sum of Squared Differences (SSD), Sum of Absolute Differences (SAD) [16], Census [17], and Mutual Information [18]. In two-view stereo matching, a single MCV can contain all matching costs between the two images. In multiview stereo matching, a reference view usually has multiple MCVs with respect to the other matching views. Therefore, N-view stereo matching, for example, needs N(N − 1) initial cost volumes.

Figure 3 shows an example of a matching cost volume between a reference view and the i-th neighbor image in multiview stereo. The volume consists of voxels; each voxel (x, y, d) represents the matching cost between pixel (x, y) in the reference image and the corresponding pixel in the i-th neighbor view. All voxels along the disparity axis at pixel (x, y) contain the matching costs of the pixel over the disparity indices d; therefore, the optimal disparity is usually decided as the index d with the smallest cost.
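
The volume structure and the WTA decision can be sketched as follows. For brevity, the sketch uses a per-pixel absolute difference on single-channel images in place of the windowed criteria (NCC, SAD, Census, ...) listed above; the function names are our own:

```python
import numpy as np

def ad_cost_volume(ref, nbr, max_disp):
    """Per-pixel absolute-difference cost volume vol[y, x, d] between a
    reference image and a neighbor image shifted by disparity d.
    Entries without a valid match (x < d) stay at infinity."""
    H, W = ref.shape
    vol = np.full((H, W, max_disp), np.inf)
    for d in range(max_disp):
        vol[:, d:, d] = np.abs(ref[:, d:] - nbr[:, :W - d])
    return vol

def wta_disparity(vol):
    """Winner-Take-All: pick, per pixel, the disparity with minimum cost."""
    return vol.argmin(axis=2)
```

In the multiview case, one such volume is built per matching view and the N − 1 volumes of a reference view are integrated before the WTA step.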

3.2. Surface Consensus

As shown in Figure 2, the first step of the proposed method is generating the PSS cost volumes from the multiview stereo images. The initial PSS matching costs are later updated iteratively based on the surface consensus. The backbone of the proposed method is adopted from a 3D reconstruction pipeline called Soft3D.

The surface consensus volume consists of probability values that define the likelihood of an object surface on each pixel ray in the discrete disparity space. In Soft3D, it is used for computing the view visibility. The consensus volume of a reference view is computed from the disparity images between the reference and the neighboring views. The computation process is as follows. In each reference view, two vote volumes, the Vote Value and the Vote Confidence, are generated using Equations (1) and (2). These volumes consist of binary values, which represent hard surface and hard visibility information in the disparity space.

The vote volumes are built from the disparity map of each view. Then, the consensus volume of the reference view is computed using the vote volumes of all views, where each voxel of the reference view is projected to the corresponding voxel of the other views using the camera parameters. The consensus volume is filtered using a guided filter at a low computational cost. Using the filtered consensus, the view visibility is computed.

The surface consensus is represented by a 3D volumetric space. If a voxel on the ray of a pixel is an object surface point, its consensus value is close to 1. Therefore, the surface consensus of each pixel can be determined from the maximum consensus value on the pixel ray. However, the consensus volume is represented in the discrete disparity space, not in a continuous space. To update the matching cost and refine the disparity map more precisely, quadratic interpolation is performed to compute the surface consensus at subpixel resolution.
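
The quadratic (parabolic) subpixel step can be sketched as follows: fit a parabola through the consensus samples at three consecutive disparities d − 1, d, d + 1 around the discrete peak and take its vertex. The function name is ours; this is the standard three-point formulation, which we assume matches the interpolation used here:

```python
def quadratic_peak(c_m, c_0, c_p):
    """Sub-pixel offset of the vertex of a parabola fit through three
    consecutive samples at disparities d-1, d, d+1 (values c_m, c_0, c_p).
    The refined disparity is d + offset, with offset in (-0.5, 0.5)."""
    denom = c_m - 2.0 * c_0 + c_p
    if denom == 0.0:
        return 0.0          # flat neighborhood: keep the integer disparity
    return 0.5 * (c_m - c_p) / denom
```

A symmetric peak (c_m equal to c_p) yields a zero offset, while a peak leaning toward d + 1 yields a positive offset, as expected.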

In an ideal case, each image pixel should have only one surface consensus peak on its pixel ray. However, for some pixels in a foreground object region, the consensus can be very high on both the foreground and background surfaces due to erroneous neighboring views. This causes a problem in disparity reconstruction: if a background surface is selected instead of the foreground surface due to wrong consensus values, the disparity on the foreground surface is updated with the wrong background disparity. To prevent this situation, the surface consensus should be determined in the visible area. We multiply a binary flag of the view visibility with the consensus to reject the background consensus, as shown in Equation (6).

3.3. Update of PSS Matching Cost

The disparity of each pixel is determined by the lowest matching cost on its pixel ray. However, if a pixel belongs to a textureless or occluded region, the matching costs on the pixel ray can have wrong values, which causes an inaccurate disparity update. To solve this problem, the proposed method updates the precomputed PSS matching cost using the surface consensus. To minimize the matching cost at the corresponding surface, an inverse Gaussian kernel based on the surface consensus is computed and applied to the PSS matching cost volume between the reference view and each neighboring view; the kernel width determines the range of neighboring costs that are updated simultaneously within the disparity range.
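
One plausible form of this update along a single pixel ray is sketched below. The function name, the strength parameter, and the exact shape of the multiplicative inverse-Gaussian term are our assumptions for illustration; the paper's actual update equation may differ in its parameterization:

```python
import numpy as np

def update_cost(cost, consensus, sigma=2.0, strength=0.5):
    """Hypothetical inverse-Gaussian cost update along one pixel ray.
    cost, consensus: 1-D arrays over the disparity range. Costs near the
    consensus peak d* are pulled down by a Gaussian bump whose width
    sigma controls how many neighboring disparities are updated together."""
    d = np.arange(cost.size)
    d_star = int(consensus.argmax())
    bump = np.exp(-((d - d_star) ** 2) / (2.0 * sigma ** 2))
    # inverse Gaussian: the cost at the consensus peak is reduced the most
    return cost * (1.0 - strength * consensus[d_star] * bump)
```

With a confident consensus peak, the updated minimum cost moves to the peak disparity, so the subsequent WTA step selects it.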

Typically, the matching costs in a textureless region are incorrectly computed because the pixel colors are too uniform to distinguish from each other. To correctly update the textureless region, the color variance is used to compute a weight value, as shown in the equations below.

Here, the kernel computes the color variance over a local window, normalized by the number of pixels within the kernel and a normalization parameter. To prevent over-update, the maximum update value is limited. The weight is finally computed in Equation (12): if the variance is less than a threshold, a constant weight of 0.02 is applied.

Figure 4 shows an example of the matching cost before and after the update. At the beginning of the iterations, the disparity value at the lowest matching cost is inaccurate. Nevertheless, the surface consensus is refined by the color consistency with neighboring views. After several iterations using the consensus, the disparity value with the lowest matching cost is correctly refined. Thus, the final disparity can be determined in the WTA manner.

4. Surface Consensus Refinement in the Textureless Region

As explained in Section 1, updating the matching costs in wide-textureless or occluded-textureless regions is prone to converging to local minima because the surface consensus of all views may have wrong values. The main cause of this problem is the surface consensus in the wide-textureless or occluded-textureless region itself. Most pixels in these regions have inaccurate surface consensus due to disparity noise, and this causes inaccuracy in the view visibility computation. This means that if the accuracy of the consensus is improved, the visibility can be refined as well. Therefore, we aim to improve the accuracy of the surface consensus.

In real images, wide-textureless regions mostly consist of planar objects, such as a floor or wall. Therefore, we propose to refine the surface consensus by simultaneously refining the object planes in the wide-textureless and occluded-textureless regions. When the matching cost is updated by the refined surface consensus, the correct disparity can be computed in the textureless region. Using the refined disparity, we can in turn estimate the object plane more accurately. Multiple iterations of the above processes reduce the error of the view visibility and reconstruct very accurate depth maps.

The proposed method performs independently in each view except for the consensus refinement. Therefore, we describe the proposed method based on the reference view.

4.1. Textureless Region Detection

As explained in Section 3, the surface consensus in textured regions is sufficiently accurate. However, in textureless regions, the consensus values are highly erroneous, which results in inaccurate depth reconstruction in both textured and textureless regions. Even though the erroneous values appear in the textureless region, the errors can propagate to the textured region after several iterations of the conventional depth reconstruction pipeline. Therefore, it is necessary to distinguish textured and textureless areas in the color images to accurately update the consensus in both areas.

Textureless regions are detected using a simple graph-based segmentation [19] and the color variance. An important parameter for determining the textureless region is the minimum segment size. If the minimum segment size is small, a wide-textureless area is divided into several small areas that have different inverse-depth plane orientations. In this case, the accuracy of the inverse-depth plane fitting decreases because only a small number of pixels are used for plane fitting. In our experiments, the minimum segment size is set empirically to a value that yields reasonable performance in various datasets.

The color variance of pixels in a wide-textureless area is close to 0 because these pixels have very similar colors. Therefore, a wide-textureless segment is decided by its average variance. In Section 3.2, the normalized color variance of all pixels is computed using Equations (9) and (10). Using these equations, the average variance of each segment is computed as follows:

In the equation, S_i represents the i-th segment and |S_i| is the number of pixels in S_i. As shown in Equation (14), a wide-textureless segment is decided when its average variance is lower than a threshold value. The threshold depends on the input images; however, reasonable performance is obtained with a threshold of 0.2.
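
The segment-level decision can be sketched as follows; the function name is ours, and the per-pixel normalized variance map and segment labels are assumed to come from the variance computation and graph-based segmentation described above:

```python
import numpy as np

def textureless_segments(var_map, labels, tau=0.2):
    """Flag each segment whose mean normalized color variance falls below
    the threshold tau (0.2 in the text). var_map: per-pixel normalized
    variance; labels: integer segment id per pixel. Returns {id: bool}."""
    flags = {}
    for s in np.unique(labels):
        flags[int(s)] = float(var_map[labels == s].mean()) < tau
    return flags
```

Segments flagged as textureless are routed to the plane-fitting update, while the remaining segments keep the original surface consensus update.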

Figure 5 shows an example of wide-textureless region detection. After all textureless segments are determined, an approximate inverse-depth plane is estimated in each detected segment region. Using the inverse-depth plane, the consensus in the textureless region is recalculated and refined. Meanwhile, the black-colored regions in Figure 5 (normalized variance map) represent textured regions, and the original surface consensus described in Section 3.2 is used for their update process.

4.2. 3D Plane in the Matching Cost Volume

In Euclidean space, a 3D plane is represented by a linear equation with independent variables along three orthogonal axes. If a 3D point p lies on the plane, the plane is represented as n · p = ρ, where n is the normal vector of the plane and ρ is the projection distance from the coordinate origin to the plane. Similarly, a 3D plane in the disparity space can be represented by replacing the depth with the disparity d and the metric space with the pixel space (x, y). Thus, the plane equation can be written in the form d = ax + by + c.

Figure 6 shows an example of a disparity plane, that is, the 3D plane of a textureless segment in Figure 5. The 3D plane equation corresponding to the segment can be fitted with the valid disparity values in the area.

We use RANSAC to fit the 3D plane in each segmented area. Once the plane in the inverse-depth space is defined, the disparity of any pixel (x, y) in the segment can be computed from the plane coefficients as d = ax + by + c.

Using Equation (17), an approximate inverse-depth plane of a textureless region can be estimated.

The 3D inverse-depth plane of each textureless region is estimated using the disparity values of the region. Usually, in such a textureless region, the disparity of the boundary pixels is computed more accurately than that of the center-area pixels because the boundary pixels have more color variance against the neighboring textured pixels. The disparity errors in the center region can be considered speckle noise. To remove this error before plane fitting, a speckle filter is applied to the disparity image.
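
The RANSAC plane fit described above can be sketched as follows. This is a generic RANSAC under our own assumptions (function name, iteration count, inlier threshold, and least-squares refit are our choices), applied to the valid, speckle-filtered disparities of one segment:

```python
import numpy as np

def fit_disparity_plane(xs, ys, ds, iters=200, thresh=1.0, seed=0):
    """RANSAC fit of a disparity plane d = a*x + b*y + c to the valid
    disparities of one textureless segment; returns (a, b, c)."""
    rng = np.random.default_rng(seed)
    pts = np.column_stack([xs, ys, np.ones_like(xs, dtype=float)])
    best, best_inl = None, -1
    for _ in range(iters):
        idx = rng.choice(len(ds), size=3, replace=False)
        try:
            abc = np.linalg.solve(pts[idx], ds[idx])   # plane through 3 samples
        except np.linalg.LinAlgError:
            continue                                    # degenerate (collinear) sample
        inl = np.abs(pts @ abc - ds) < thresh
        if inl.sum() > best_inl:
            best_inl, best = int(inl.sum()), inl
    if best is None:                                    # fallback: use all points
        best = np.ones(len(ds), dtype=bool)
    # least-squares refit on the best inlier set
    abc, *_ = np.linalg.lstsq(pts[best], ds[best], rcond=None)
    return abc
```

Because the erroneous center-area disparities behave as outliers, the consensus-maximizing sample tends to come from the more reliable boundary pixels, which is what makes RANSAC a good fit for this step.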

4.3. Surface Consensus Refinement

The surface consensus is refined by fusing all disparity images of multiple views. If the disparity values of an object surface point estimated from the multiple views are consistent, then the consensus is high. However, in the textureless region, the disparity values from the multiple views could be inconsistent, which results in inaccurate and low surface consensus.

The main idea of the proposed method is to iteratively update the matching cost by improving the accuracy of the surface consensus in wide-textureless and occluded-textureless regions. To complement the matching cost update scheme of the EnSoft3D method, we add another iterative update process that refines the 3D inverse-depth plane in the textureless area, as described in Section 4.2.

The proposed surface consensus refinement process in the textureless region is summarized as follows. First, the 3D plane in a textureless region is estimated using the disparity values of the region. Second, using the plane model, the disparity in the textureless region is interpolated; the interpolated inverse-depth plane is then used to refine the surface consensus, simultaneously with the textured regions. Third, the soft visibility is recomputed from the refined surface consensus, and this visibility updates the PSS matching cost again. Fourth, using the updated PSS cost, the disparity image is updated, so the accuracy of the inverse-depth plane in the textureless region becomes higher than in the previous step. By iteratively running these update steps, the inverse-depth plane can be estimated very accurately, which results in more accurate depth reconstruction.

Figure 7 shows an example of the refined surface consensus and disparity map at each iteration step with and without the inverse-depth plane update. Higher surface consensus is shown as brighter values in the surface consensus map. At the initial step, the surface consensus in the textureless region is spread across the entire depth range due to the incorrect disparity. Some surface consensus in textureless regions can be refined by the original EnSoft3D method; however, its accuracy is low and many iterations are required. This inaccurate consensus can affect the depth reconstruction not only in the textureless region but also in the textured region due to the propagation of the surface consensus in the PSS cost aggregation. On the contrary, our method computes a more accurate surface consensus with only a few iterations, which enhances the performance of the view visibility computation and the matching cost update. As shown in the bottom image of Figure 7, the surface consensus in the occluded-textureless regions is also refined sufficiently with only a few iterations.

An example of the matching cost updated by the refined inverse-depth plane and surface consensus is shown in Figure 8. The wide-textureless region has a low color variance; thus, its matching cost is intensively updated by the weight in Equation (8). After the iterative update processes of the proposed system, accurate disparity maps can be obtained from the matching cost volume in both wide-textureless and occluded-textureless regions.

5. Experiments and Evaluations

The proposed multiview stereo vision method reconstructs very accurate disparity maps even when there are wide-textureless or occluded-textureless regions in the images. The proposed method has two advantages over other MVS methods. First, the depth of each image pixel can be accurately estimated by iteratively updating the PSS cost using many overlapping neighboring views. Second, the surface consensus can be computed more accurately. The surface consensus is a probability value that expresses the disparity consistency of the multiview stereo images. For this reason, if the input views have larger overlap areas, the consensus probability of the scene can be computed more accurately.

In order to show the best performance of the proposed method, we use the narrow-baseline multiview stereo images from the Middlebury 2006 dataset, which is widely used for stereo vision evaluation. Each test set consists of seven color images and the transformation parameters between views. Among the dataset, we use the scenes that contain wide-textureless and occluded-textureless regions. The proposed method provides the best performance at the center view of the input views, because the consensus of both left and right occlusions can be computed from an equal number of neighboring views with respect to the center. However, the benchmark provides the ground truth at view 1 or view 5, and most comparison algorithms use view 1 for their evaluation. Therefore, in our experiments, we also evaluate our results using view 1.

The proposed algorithm is implemented in C++, and the computing environment is an Intel i9-10900K CPU with 64 GB RAM.

5.1. Comparison with Cost Volume-Based Methods

The first evaluation compares the performance of each process of the proposed method. The representative processes are as follows: initial disparity map estimation using PSS, occlusion-aware disparity estimation, matching cost update, and surface consensus refinement. All processes are performed in the cost volume space, which consists of the inverse-depth planes. As with other PSS algorithms, increasing the number of inverse-depth planes increases the disparity accuracy but requires more processing time. After many experiments, we decided to use 80 inverse-depth planes because this yields satisfactory performance in both accuracy and processing time. Another important parameter is the number of iterations in the cost volume refinement process. As with the number of planes, more iterations yield more accurate disparity maps at the cost of longer processing time.

Figure 9 shows an example of disparity map generation and processing time according to the number of iterations. The red pixels in the images are bad pixels at an error threshold of one pixel. The processing time increases proportionally with the number of iterations; however, the refinement efficiency saturates after about 10 iterations. Considering the processing time and disparity accuracy, we use 10 iterations in our cost update and disparity refinement.

Figure 10 shows a comparison of disparity map generation with several methods that employ the cost volume-based scheme: PSS, Soft3D, and EnSoft3D. The blue and red boxes mark typical examples of regions that are both occluded and textureless. In regions with only occlusion or narrow-textureless objects, the inaccurate disparity computed by the PSS algorithm alone is sufficiently refined by the view visibility in Soft3D and the matching cost update in EnSoft3D, respectively. However, these methods are limited in refining the disparity in wide-textureless or occluded-textureless regions. This kind of textureless region is very difficult to refine using the previous methods because there are serious errors in the initial matching cost from the PSS algorithm, as shown in the bottom image of Figure 10. In contrast, the proposed algorithm can reconstruct disparity maps very accurately even in the wide-textureless and occluded-textureless regions.

The ratio of bad pixels in Figure 10 (Lampshade1) is shown in Table 1, using a one-pixel error threshold. The error measured with the Nonocc mask covers only the non-occluded surface area, whereas the All mask covers both the surface and the occluded area. When only occlusion-aware disparity estimation is performed (Soft3D), the ratio of bad pixels exceeds 10% under both masks. In contrast, the proposed method estimates more accurate disparity (Nonocc: 1.33%, All: 1.91%) than PSS (Nonocc: 10.5%, All: 10.6%) and EnSoft3D (Nonocc: 1.95%, All: 2.75%). Because the improvement over EnSoft3D is larger under the All mask (0.84%) than under the Nonocc mask (0.62%), both wide-textureless and occluded-textureless regions are improved simultaneously by the proposed method.
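The bad-pixel ratio reported above can be computed as follows. This is a generic sketch of the standard Middlebury-style metric; the function and array names are illustrative and not taken from the authors' code.

```python
import numpy as np

def bad_pixel_ratio(disp_est, disp_gt, mask, threshold=1.0):
    """Percentage of evaluated pixels whose disparity error exceeds the
    threshold. `mask` selects the evaluated pixels: a Nonocc mask covers
    only the non-occluded surface, an All mask also includes occlusions.
    """
    err = np.abs(disp_est - disp_gt)
    bad = (err > threshold) & mask
    return 100.0 * bad.sum() / mask.sum()

# Toy example: one of four evaluated pixels is off by 2 disparity levels.
est = np.array([[1.0, 2.0], [3.0, 4.0]])
gt = np.array([[1.0, 2.0], [3.0, 6.0]])
all_mask = np.ones((2, 2), dtype=bool)
ratio = bad_pixel_ratio(est, gt, all_mask, threshold=1.0)  # 25.0
```

Evaluating the same disparity map under both masks separates surface errors from errors caused by occlusion, as in Table 1.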

5.2. Evaluation with General Stereo Matching Methods

In this section, the proposed method is compared with additional stereo matching algorithms. We use the Middlebury 2006 dataset, whose scenes (Lampshade1, Lampshade2, Midd1, Midd2, Monopoly, and Plastic) contain both wide-textureless and occluded-textureless regions. In previous algorithms, a typical way of refining the matching cost is to apply a 2D filter that reduces disparity errors in textureless regions. Therefore, the results of two cost filtering algorithms (Hosni et al. and Liu et al.) are compared, as shown in Figure 11 and Table 2. Compared with the Hosni and Soft3D algorithms, the Liu and EnSoft3D algorithms perform better in textureless regions. However, some parts of the wide-textureless regions remain inaccurate, as shown in Figure 11. In addition, the disparity along object boundaries becomes blurred due to disparity errors in the occluded-textureless regions.
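The 2D cost filtering idea can be illustrated with the simplest local aggregation, a box filter applied to each disparity slice of the cost volume. This is only a sketch of the family of methods compared here; guided-filter approaches such as Hosni et al. replace the box kernel with an edge-preserving one, which this example does not implement.

```python
import numpy as np

def box_filter_cost_slice(cost_slice, radius):
    """Average each pixel's matching cost over a (2r+1)x(2r+1) window
    using an integral image, i.e. plain box-filter cost aggregation.
    """
    r = radius
    s = 2 * r + 1
    h, w = cost_slice.shape
    pad = np.pad(cost_slice, r + 1, mode="edge").astype(np.float64)
    ii = pad.cumsum(axis=0).cumsum(axis=1)  # integral image of padded slice
    # Window sum via four integral-image lookups per output pixel.
    win = (ii[s:s + h, s:s + w] - ii[:h, s:s + w]
           - ii[s:s + h, :w] + ii[:h, :w])
    return win / (s * s)
```

Filtering every slice spreads reliable costs from textured pixels into small textureless neighborhoods, which is why such filters help on narrow textureless regions but not on wide ones.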

In our results, the disparity of wide-textureless regions is accurately refined by the iterative update of the disparity planes of those regions, and the matching cost in the occluded-textureless regions is refined in turn by the reduced error in the wide-textureless regions. The quantitative error analysis is shown in Table 2. To evaluate the refinement performance in both wide-textureless and occluded-textureless regions, the error is computed with the All mask and a one-pixel threshold. The table shows that the proposed method performs best on all test images. In particular, compared with EnSoft3D, the bad pixel ratio is greatly reduced on the Midd1, Midd2, and Monopoly images. This confirms that the proposed method improves cost volume refinement in both wide-textureless and occluded-textureless regions.

Various PatchMatch- (PM-) based algorithms are also compared. In [20], Li et al. evaluate their method against several PM methods: PatchMatch Belief Propagation (PMBP) [21], sped-up PMBP (SPM-BP) [22], Graph Cut-based continuous stereo matching using Locally Shared Labels (GCLSL) [23], and PatchMatch-based Superpixel Cut (PMSC) [24]. We cite Figure 13 and Table 6 in [20] for comparison with the proposed method.

Figure 12 and Table 3 show the comparison results. The evaluation uses quarter-resolution images, the Nonocc mask, and a one-pixel error threshold. Because the Nonocc mask is used, some parts of the occluded-textureless regions are excluded from the bad pixel evaluation. Nevertheless, the proposed method shows the best performance on average.

6. Conclusions

In this paper, we introduced a novel multiview stereo matching method for estimating accurate disparity in wide-textureless and occluded-textureless regions. In cost volume-based stereo matching, refining the incorrect disparity in these regions is difficult using matching cost computation alone.

To solve this problem, we proposed a multiview stereo matching scheme with a new cost update process based on the inverse depth plane of each textureless region. The textureless regions are detected by a segmentation method, and the inverse depth plane of each textureless segment is estimated. The surface consensus in the textureless regions is then fitted by the inverse depth plane. After a few iterations of the proposed cost update process, the matching cost error in the textureless regions is minimized, yielding very accurate surface consensus and depth map reconstruction.
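The plane estimation step can be sketched as a least-squares fit of d = a·x + b·y + c in inverse-depth (disparity) space over a segment's pixels. This illustrates the idea only; the function name is hypothetical and the authors' actual estimator (and any robust weighting it may use) is not reproduced here.

```python
import numpy as np

def fit_inverse_depth_plane(xs, ys, inv_depths):
    """Least-squares fit of a 3D plane d = a*x + b*y + c in inverse-depth
    space to the pixels of one textureless segment. The fitted plane can
    then replace erroneous disparities inside the segment.
    """
    A = np.column_stack([xs, ys, np.ones_like(xs, dtype=np.float64)])
    coeffs, *_ = np.linalg.lstsq(A, inv_depths, rcond=None)
    return coeffs  # (a, b, c)

# Toy segment whose true plane is d = 0.1*x - 0.2*y + 5.
xs = np.array([0.0, 1.0, 2.0, 0.0, 1.0, 2.0])
ys = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
d = 0.1 * xs - 0.2 * ys + 5.0
a, b, c = fit_inverse_depth_plane(xs, ys, d)
```

Fitting in inverse-depth space rather than depth space keeps the model linear in the cost volume's plane index, so a planar surface maps to a linear ramp across the sweep planes.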

The performance of the proposed method was evaluated on the Middlebury 2006 dataset. Compared with various stereo matching methods, the proposed method reconstructs high-quality disparity maps in both wide-textureless and occluded-textureless regions. A limitation of the proposed method is that its memory requirement grows with the number of input images and the image resolution, and its computation cost grows proportionally as well.

Data Availability

No data were used to support this study.

Conflicts of Interest

The authors declare that there is no conflict of interest regarding the publication of this paper.

Acknowledgments

This work has been supported partly by the National Research Foundation of Korea (NRF) (No. 2021R1A2C2009722) grant funded by the Korea Government (MSIT), partly by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (No. 2021R1A6A1A03043144), and partly by BK21 FOUR project (4199990113966).