Abstract

This paper designs a three-dimensional display system for intangible cultural heritage based on generative adversarial networks. The system's functions are realized through four modules: an input module, a data processing module, a 3D model generation module, and a model output module. Two 3D model reconstruction methods realize the transformation from 2D images to 3D models. In low-resolution Nuo surface 3D construction, multiresidual dense blocks are introduced and applied to the depth-image super-resolution network. Experimental comparisons show that the quadratic-optimization multi-fusion 3D construction model proposed in this paper achieves a considerable improvement, raising reconstruction accuracy by about 6.3%. In high-resolution 3D construction of the Nuo surface, a generative adversarial network is used, improving the generator, discriminator, and loss function of the original SRGAN model. Experimental results show that this method generates super-resolution images with more realistic and natural depth maps; in addition, when used for high-resolution 3D Nuo surface construction, it generates 3D voxel Nuo surfaces with more detail.

1. Introduction

At present, both the development of intangible cultural heritage projects and the protection of intangible cultural heritage need improvement. Modern production and ways of life have changed greatly, and traditional handicrafts and artworks have gradually been replaced by industrial products, creating difficulties for intangible cultural heritage projects and for the livelihoods of their inheritors. The biggest problem facing intangible cultural heritage is the shortage of inheritors. On the one hand, many intangible cultural heritage skills are complicated and difficult to learn; on the other hand, much of this heritage is transmitted orally, with no concrete learning carrier. To counter the decline of national culture caused by the loss of intangible cultural heritage, it is urgent to use modern three-dimensional construction technology to reproduce it.

Traditional 3D construction technology uses 3D reconstruction software to build the target model, and the result is highly accurate, but the software requires trained professionals and detailed measurements of the modeled objects. Alternatively, multi-view geometric methods can be used for 3D construction [1, 2], fusing information from multiple views of the target to complete its 3D reconstruction. However, these methods are ill-suited to the three-dimensional reconstruction and display of intangible cultural heritage. Therefore, how to complete high-quality 3D construction from a single RGB image quickly is particularly important. As deep learning research has deepened, deep learning models trained on CAD databases have begun to be used for 3D reconstruction from a single image. In 2015, Wu et al. [3] proposed the 3D ShapeNets model, which uses deep convolutional belief networks to learn the joint distribution of 3D voxels in a data-driven manner. Their model learns the distribution of complex 3D shapes across different object classes from raw CAD data, performing joint object recognition and shape reconstruction from 2.5D depth maps. In 2016, Choy et al. [4] proposed a new architecture, the 3D recurrent reconstruction neural network, which avoids the prior-matching problem of traditional methods by learning a mapping from object images to 3D shapes on a large synthetic data set; it takes one or more images of a real object from arbitrary viewpoints as input and outputs 3D voxels. In addition, taking advantage of the powerful image-generation ability of generative adversarial networks [5-8], more and more scholars have begun to study GANs for 3D model reconstruction. Early generative adversarial networks use a game-like approach to generate images from random noise and refine them to approximate real images. Inspired by this, Wu et al. [9] extended the 2D generative adversarial network to 3D in 2016. Gadelha et al. [10] proposed projective generative adversarial networks, in which the generator network estimates the 3D shape and viewpoint, a projection module renders a binary projection image, and a discriminator network judges whether it is real or fake. Riegler et al. [11] proposed a learning-based depth fusion method, designing a 3D convolution model for octree network fusion; it takes one or more depth maps as input and estimates the 3D space partition from coarse to fine, increasing the reconstruction resolution to 128 × 128 × 128 in three steps. The GAN-based idea uses the discriminator to compare the reconstructed high-resolution image with the ground-truth image, which makes the generated high-resolution image closer to the real image as a whole, richer in detail, and more consistent with human visual perception.

Using generative-adversarial-network-based 3D reconstruction technology for intangible cultural heritage, this paper takes the Guangxi Nuo surface as the research object and describes in detail the design of a 3D display system for intangible cultural heritage. After establishing an intangible cultural heritage data set, a multifeature fusion method is used for low-resolution voxel 3D modeling, and a high-resolution 3D model of the intangible cultural heritage is built with a generative adversarial network.

2. Relevant Knowledge

2.1. Convolutional Neural Networks

A simple convolutional neural network is mainly composed of three types of network layers: convolutional layers with nonlinear activation functions, pooling layers, and fully connected layers. The basic structure is shown in Figure 1, where the small white box represents the size of the convolution kernel. The kernel size can be any value smaller than the input image; the larger the kernel, the larger the receptive field and hence the more image information each output sees, yielding richer features. However, an excessively large convolution kernel dramatically increases model computation, which is not conducive to increasing network depth. In Figure 1, the input layer takes the pixels of the image as feature nodes. During forward propagation, the convolution kernel slides along the width and height of the image, computing the dot product between the kernel and the input at each position. The response of the input to the kernel at every spatial position is called a feature map. To give the network nonlinear fitting ability, a nonlinear activation function is usually applied in the convolutional layer. Since the growth of feature maps after convolution leads to too many network parameters, a pooling layer is often used to reduce the spatial dimension of the feature maps. Finally, outputting the result through a fully connected layer can accomplish tasks such as handwritten digit recognition.

2.1.1. Convolutional Layer

In different image processing tasks, the input image can be convolved to extract different features for processing, such as edges and contours, by choosing different convolution kernels. It is well known that increasing the depth of a neural network improves model performance more than increasing its width. Because of local connectivity and weight sharing, a convolutional layer greatly reduces the number of parameters the network must learn, which also makes it feasible to design deeper networks for more complex tasks. In a convolutional neural network, for an input image $X$ and kernel $K$, the convolution is defined as follows:

$$(X * K)(i, j) = \sum_{m}\sum_{n} X(i + m, j + n)\, K(m, n).$$

Among them, $K$ is the convolution kernel, $X$ and $K$ are both two-dimensional matrices, $(i, j)$ denotes two-dimensional matrix coordinates, and $K(m, n)$ is the weight at position $(m, n)$ of the kernel. Figure 2 is a schematic diagram of a simple convolution. In Figure 2, the convolution kernel computes a value over 9 pixels of the input image at a time and then slides one unit to the right or down. After the kernel has traversed the entire input image, a 2 × 2 output feature map is obtained.
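To make the sliding-window computation concrete, the following minimal NumPy sketch implements the valid cross-correlation used by convolutional layers; the function name conv2d and the test input are illustrative only, not part of the system described in this paper.

```python
import numpy as np

def conv2d(X, K, stride=1):
    """Valid cross-correlation of image X with kernel K, as used in CNN layers."""
    kh, kw = K.shape
    oh = (X.shape[0] - kh) // stride + 1
    ow = (X.shape[1] - kw) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # Dot product between the kernel and the current image patch.
            patch = X[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * K)
    return out

# As in Figure 2: a 4 x 4 input with a 3 x 3 kernel and stride 1
# produces a 2 x 2 output feature map.
X = np.arange(16, dtype=float).reshape(4, 4)
K = np.ones((3, 3)) / 9.0  # simple averaging kernel
print(conv2d(X, K).shape)  # (2, 2)
```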

2.1.2. Pooling Layer

The pooling layer imitates the human visual system by reducing dimensionality and abstracting the input. It retains the main features of the image while reducing the model's parameters, which also reduces overfitting to a certain extent and improves generalization. In addition, the pooling layer makes the model attend more to the presence of a feature in the image than to its exact location.

The max-pooling layer is one of the most commonly used nonlinear pooling operations. It divides the input image pixels into multiple subregions and takes the maximum pixel value of each subregion as the output. Suppose the filter size is $k \times k$, the stride is $s$, and the input is an $n \times n$ matrix $x$; then, the max-pooling layer is calculated as follows:

$$y_{i,j} = \max_{0 \le u,\, v < k} x_{i \cdot s + u,\; j \cdot s + v}.$$

The pooling operation is shown in Figure 3. In Figure 3, max pooling is computed using a filter of size 2 × 2 with a stride of 2. Comparing the calculation results with the input image, it can be seen that the data are reduced by 75% after the max-pooling layer.
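The 75% reduction can be verified directly. The sketch below (NumPy only, with illustrative names and values) applies a 2 × 2 max-pooling window with stride 2 to a 4 × 4 input.

```python
import numpy as np

def max_pool2d(X, k=2, s=2):
    """Max pooling with a k x k window and stride s."""
    oh, ow = (X.shape[0] - k) // s + 1, (X.shape[1] - k) // s + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = X[i * s:i * s + k, j * s:j * s + k].max()
    return out

X = np.array([[1., 3., 2., 4.],
              [5., 6., 7., 8.],
              [3., 2., 1., 0.],
              [1., 2., 3., 4.]])
Y = max_pool2d(X)        # 2 x 2 window, stride 2, as in Figure 3
print(Y)                 # [[6. 8.] [3. 4.]]
print(Y.size / X.size)   # 0.25 -> 75% of the data is discarded
```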

2.1.3. Fully Connected Layer

Each fully connected layer is composed of many neurons, and each neuron is connected to every neuron in the previous layer to integrate the previously extracted features. The essence of a fully connected layer is to transform one feature space into another. To improve network performance, the fully connected layer uses a nonlinear activation function. Because of this dense connectivity, fully connected layers can dramatically increase the number of model parameters; therefore, a typical convolutional neural network uses only one or a few fully connected layers after the convolutional and pooling layers.

2.1.4. Activation Function

In neural networks, activation functions usually refer to functions that can achieve nonlinear mapping. The main function of the activation function is to add nonlinear factors, so that the neural network can approximate any nonlinear function and solve the defect of insufficient expression ability of the linear model. Several commonly used nonlinear activation functions are as follows: sigmoid function, tanh function, ReLU function and its variants, and Swish function.

(1) Sigmoid Function. The sigmoid function is a common activation function, and its mathematical expression is as follows:

$$\sigma(x) = \frac{1}{1 + e^{-x}}.$$

The sigmoid function produces a smooth output bounded in the range 0-1; it is close to linear for inputs near 0 and nonlinear far from 0, tending to 1 for large positive inputs and to 0 for large negative inputs. Precisely because of these saturation regions, the gradient at both ends of the function is almost 0, causing vanishing gradients. In addition, because the sigmoid output is not zero-centered, convergence is prone to oscillation during backpropagation.

(2) ReLU Function. The ReLU function outputs 0 below the threshold and is linear above it:

$$f(x) = \max(0, x).$$

Experimental results show that the ReLU function converges about 6 times faster than the tanh function [12]. However, during training, an excessively large gradient may push a ReLU unit's weights to a position where it is never activated again, and setting the learning rate too high may leave most neurons in the network inactive. To address this dying-neuron problem caused by ReLU's zero response to negative inputs, a series of variants has been proposed, such as leaky ReLU [13]. When $x < 0$, the input is multiplied by 0.01 to form the output:

$$f(x) = \begin{cases} x, & x \ge 0, \\ 0.01x, & x < 0. \end{cases}$$
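For reference, the three activation functions discussed above take only a few lines of NumPy; the sample input values below are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # For x < 0 the input is scaled by alpha instead of being zeroed,
    # so the gradient never dies completely on the negative side.
    return np.where(x >= 0, x, alpha * x)

x = np.array([-100.0, -1.0, 0.0, 1.0, 100.0])
print(sigmoid(x))     # saturates toward 0 and 1 at the extremes
print(relu(x))        # [  0.   0.   0.   1. 100.]
print(leaky_relu(x))  # [-1.    -0.01   0.     1.   100.  ]
```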

2.2. Generative Adversarial Networks

Generative adversarial networks (GANs) were first proposed by Goodfellow et al. in 2014 [14]. Unlike a traditional convolutional neural network, a GAN trains two neural networks that play a game against each other. A GAN consists of a generator and a discriminator, and the basic model is shown in Figure 4. First, the generator network produces the target data from the input data. Then, the discriminator network, having learned from the real data, judges whether the data from the generator are real or fake. To generate more realistic target data that deceive the discriminator, the generator must continuously improve its generative ability. The two networks optimize each other through this ongoing confrontation until the whole system reaches a Nash equilibrium: as the discriminator approaches its optimal solution, the generator's loss approaches its minimum.

Currently, GANs are commonly used in image generation and 3D object reconstruction tasks. The standard GAN objective function can be described as follows:

$$\min_{G}\max_{D} V(D, G) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\left[\log D(x)\right] + \mathbb{E}_{z \sim p_{z}(z)}\left[\log\left(1 - D(G(z))\right)\right].$$
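A minimal TensorFlow 2 sketch of one adversarial training step under this objective is shown below; `generator` and `discriminator` stand for arbitrary tf.keras models with logit outputs and are placeholders, not the networks used in this system.

```python
import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)
g_opt = tf.keras.optimizers.Adam(1e-4)
d_opt = tf.keras.optimizers.Adam(1e-4)

@tf.function
def train_step(generator, discriminator, real_images, noise_dim=100):
    noise = tf.random.normal([tf.shape(real_images)[0], noise_dim])
    with tf.GradientTape() as g_tape, tf.GradientTape() as d_tape:
        fake_images = generator(noise, training=True)
        real_logits = discriminator(real_images, training=True)
        fake_logits = discriminator(fake_images, training=True)
        # Discriminator: classify real data as 1 and generated data as 0.
        d_loss = bce(tf.ones_like(real_logits), real_logits) + \
                 bce(tf.zeros_like(fake_logits), fake_logits)
        # Generator (non-saturating form): make generated data look real.
        g_loss = bce(tf.ones_like(fake_logits), fake_logits)
    d_grads = d_tape.gradient(d_loss, discriminator.trainable_variables)
    g_grads = g_tape.gradient(g_loss, generator.trainable_variables)
    d_opt.apply_gradients(zip(d_grads, discriminator.trainable_variables))
    g_opt.apply_gradients(zip(g_grads, generator.trainable_variables))
    return g_loss, d_loss
```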

2.3. 3D Construction Based on Voxel Representation

At present, three-dimensional construction models for images generally adopt an encoding-decoding architecture. The encoding stage can use different encoders, and likewise the decoding stage can use different neural-network-based decoders. In the decoder stage, a voxel-based 3D decoder decodes the feature vector to generate a 3D shape represented by voxels. Because a regular 3D decoder must convolve over every voxel position in the 3D volume, computation time and memory grow cubically as the output resolution increases; on the other hand, the method is robust to its input, enabling voxel-based 3D construction to reconstruct shapes of arbitrary topology. Deep learning is a form of representation learning, that is, it generates usable representations from data. In the encoding-decoding architecture, the encoder learns to extract the main low-dimensional representation, and the decoder generates the required high-dimensional data from it. Voxel-based 3D construction models can be divided into direct-representation decoding models, intermediate-representation decoding models, and other decoding models. This paper uses the direct-representation decoding model; the specific process is shown in Figure 5.

The network consists of a 2D encoder and a 3D decoder. In general, the encoder is a 2D convolutional neural network, while the decoder is a 3D deconvolutional network. In the encoding stage, the 2D encoder compresses the input 2D image into a low-dimensional latent vector. Compressing the input from a high-dimensional to a low-dimensional space preserves the main features of the image while significantly reducing the model parameters so that the model fits in GPU memory. In the decoding stage, the decoder decodes the latent vector to generate a 3D shape. In short, the 3D construction process of this architecture is to design a 2D encoder that extracts image features and then use a conventional voxel decoder to output the 3D shape.
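The sketch below illustrates this encoder-decoder pipeline in tf.keras with placeholder layer sizes (the channel counts and latent dimension are assumptions, not the exact architecture of Figure 5): a 2D convolutional encoder maps a 128 × 128 RGB image to a latent vector, and a 3D transposed-convolution decoder expands it into a 64 × 64 × 64 occupancy grid.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_encoder(latent_dim=128):
    return tf.keras.Sequential([
        layers.Input((128, 128, 3)),
        layers.Conv2D(32, 4, strides=2, padding="same", activation="relu"),   # 64x64
        layers.Conv2D(64, 4, strides=2, padding="same", activation="relu"),   # 32x32
        layers.Conv2D(128, 4, strides=2, padding="same", activation="relu"),  # 16x16
        layers.Flatten(),
        layers.Dense(latent_dim),  # low-dimensional latent vector
    ])

def build_decoder(latent_dim=128):
    return tf.keras.Sequential([
        layers.Input((latent_dim,)),
        layers.Dense(8 * 8 * 8 * 64, activation="relu"),
        layers.Reshape((8, 8, 8, 64)),
        layers.Conv3DTranspose(32, 4, strides=2, padding="same", activation="relu"),    # 16^3
        layers.Conv3DTranspose(16, 4, strides=2, padding="same", activation="relu"),    # 32^3
        layers.Conv3DTranspose(1, 4, strides=2, padding="same", activation="sigmoid"),  # 64^3 occupancy
    ])

voxels = build_decoder()(build_encoder()(tf.zeros((1, 128, 128, 3))))
print(voxels.shape)  # (1, 64, 64, 64, 1)
```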

3. System Design

To realize the 3D display system for intangible cultural heritage based on generative adversarial networks, this section proceeds according to standard software engineering development steps. First, a detailed requirements analysis is carried out for the 3D display system; then, according to the system design principles, the functional structure of the system is designed and analyzed; finally, each functional module is designed in detail.

3.1. System Design Requirements

This paper aims to complete the three-dimensional display of intangible cultural heritage and contribute to its inheritance and protection. In particular, it has the following significance.

(1) Digital Protection of Intangible Cultural Heritage. The artistic features of intangible cultural heritage are preserved in three-dimensional digital form, realizing its digital protection.

(2) Design Strategy for Cultural Heritage Display Applications. Current research on client display applications offers many theoretical supports and research methods, but these are usually generic models. This study narrows the scope and attempts to summarize design strategies specific to cultural heritage display applications, providing reasonable and scientific design ideas for this type of application.

(3) Case Design of a Three-Dimensional Display of Intangible Cultural Heritage. For educational purposes, a sample intangible cultural heritage display application is implemented, aiming for a better user experience: users gain a systematic and comprehensive understanding of intangible cultural heritage and a stronger interest in traditional culture, which supports better protection of traditional culture and improves cultural transmission.

3.2. System Design Principles

When designing the three-dimensional display system for intangible cultural heritage, we follow these principles in system development.

The first is the abstract principle. Abstraction refers to the simplification of complex phenomena, which must be simplified to the extent that it is convenient for people to analyze and understand.

The second is the encapsulation principle, a rule developers must follow when building the system structure. Each individual program component is encapsulated into a single, independent module, and when a module is defined, its internal processing logic is exposed as little as possible. Each module must be able to be developed and tested independently, with the final complete program assembled from a series of sub-module programs. The encapsulation principle can greatly improve the modifiability, testability, and portability of the system.

The third is the principle of independence between modules. Independence between modules means that any relatively independent subsystem is implemented by an independent module, and the connections between that module and other modules are kept as simple as possible. Two standards are commonly used in industry to measure the independence between modules: cohesion and coupling. Cohesion measures how closely the elements within a module are related to each other. Coupling measures the interdependence between modules, which depends mainly on the type and complexity of the interface information each module provides and on the way the module is called.

3.3. System Structure Design

The key to realizing this generative-adversarial-network-based display application lies in the patch-based pixel generation in the generator network, the multiscale image discrimination in the discriminator network, and the design of the final loss function. The overall structure of the application is shown in Figure 6.

3.3.1. Interface Layer

The interface layer is mainly used for interaction between the user and the system, and all functions should be kept as simple as possible. The application interface mainly comprises various controls and view modules, such as buttons for selecting input, views for displaying content, a generate button, and a text prompt view. The interface directly displays the original input and the generated result to the user, compares the synthesis effect, and shows the corresponding indicator descriptions. A simple and elegant interface design gives users a good user experience.

3.3.2. Logic Layer

The logic layer is the core of the entire application system, realizing the core functions and connecting the upper and lower layers. Its main functions are as follows: reading local images or collecting data from Web pages, since data collection is essential for the generative model; calling other open-source projects to produce the RGB and depth image sets; feeding RGB and depth images into the trained generative model to obtain the target 3D model, which is the core of the application; and finally displaying the generated 3D model to the user.

3.3.3. Data Layer

The data layer is mainly used to store and read data, including image storage, algorithm storage, and model storage.

3.4. System Function Design

This section mainly focuses on the module design of this application, including the overall module design of this application and the detailed design of each module. The functional structure diagram of the system is shown in Figure 7.

According to the system implementation framework, this application is divided into four modules: the input module, data processing module, 3D model generation module, and model output module. This section focuses on the data processing module and the 3D model generation module.

3.4.1. Image Set Production Module

In the task of 3D reconstruction from images, one key to exploiting the learning ability of deep learning is to establish a relevant data set for training; the data set is the basis for 3D construction of the entire model. The production process of the data set in this paper mainly follows the MVD method [15]. At present, there is no public 3D Nuo mask model of the Maonan people at home or abroad, so the only way to solve the data set problem was to collect scans in the field with a 3D scanner. A total of 36 Maonan Nuo masks were scanned at the Huanjiang Maonan Autonomous County Museum in Hechi, and a 3D voxel library and a 2D image database were created from the scans.
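As an illustration of how a scan can populate the 3D voxel library, the following hypothetical sketch quantizes an (N, 3) point cloud from the scanner into a binary 64³ occupancy grid; the MVD production pipeline itself [15] involves additional steps not shown here, and all names are placeholders.

```python
import numpy as np

def voxelize(points, resolution=64):
    """Quantize an (N, 3) point cloud into a binary occupancy grid."""
    points = np.asarray(points, dtype=float)
    mins, maxs = points.min(axis=0), points.max(axis=0)
    scale = (resolution - 1) / (maxs - mins).max()   # uniform scale keeps aspect ratio
    idx = np.floor((points - mins) * scale).astype(int)
    idx = np.clip(idx, 0, resolution - 1)
    grid = np.zeros((resolution, resolution, resolution), dtype=np.uint8)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = 1
    return grid

points = np.random.rand(10000, 3)  # stand-in for a real scan
print(voxelize(points).sum())      # number of occupied voxels
```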

3.4.2. Low-Resolution Reconstruction Module

This module adopts a multifeature fusion method: it first generates a 3D model as the initial reconstruction result and then optimizes it with a 3D encoding-decoding network to generate a low-resolution Nuo surface voxel 3D model.

(1) 3D Model Construction.

(2) Loss Function. To train the model to generate better 3D shapes, a suitable loss function is required. The loss of the model is defined as the loss between the reconstructed 3D object and the real 3D object, computed voxel-wise as a binary cross-entropy:

$$L = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log \hat{y}_i + (1 - y_i)\log\left(1 - \hat{y}_i\right)\right],$$

where $y_i$ is the ground-truth occupancy of the $i$-th voxel, $\hat{y}_i$ is the predicted occupancy probability, and $N$ is the total number of voxels.

(3) Training Method. This paper uses a 128 × 128 RGB image as input, and the network finally outputs a voxelized 3D shape at a resolution of 64 × 64 × 64. The hardware platform is a Dell desktop with an NVIDIA GTX 1080 GPU with 8 GB of memory; the software platform is TensorFlow on Ubuntu 16.04, with GPU acceleration used to train the network. To fit in GPU memory, the training batch size of the single-shot network is 32. Training uses the Adam optimizer with β1 = 0.5 and β2 = 0.9, the initial learning rate is set to 1 × 10−4, and the network is trained for 300 epochs and then fine-tuned for another 50 epochs. For the quadratic-optimization multi-fusion reconstruction model, the pretrained model first generates a 3D voxel surface at a resolution of 32 × 32 × 32; then, with the network parameters fixed, the result is sent to the 3D encoding-decoding network for secondary optimization, yielding a 3D Nuo surface at a resolution of 64 × 64 × 64. Other training parameters are the same as for the other networks.
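A condensed tf.keras training loop matching the stated settings (Adam with β1 = 0.5, β2 = 0.9, learning rate 1 × 10−4, batch size 32) might look as follows; `model` and `dataset` are placeholders, and the voxel-wise cross-entropy is the assumed reconstruction loss from above.

```python
import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4, beta_1=0.5, beta_2=0.9)
voxel_bce = tf.keras.losses.BinaryCrossentropy()

def train(model, dataset, epochs=300):
    """dataset: tf.data.Dataset of (rgb_image, voxel_grid) pairs."""
    for epoch in range(epochs):
        for images, voxels in dataset.batch(32):
            with tf.GradientTape() as tape:
                pred = model(images, training=True)  # predicted 64^3 occupancy grid
                loss = voxel_bce(voxels, pred)       # voxel-wise reconstruction loss
            grads = tape.gradient(loss, model.trainable_variables)
            optimizer.apply_gradients(zip(grads, model.trainable_variables))
```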

3.4.3. High-Resolution Reconstruction Module

This module uses a generative adversarial network to obtain a high-resolution Nuo surface voxel 3D model. Image super-resolution based on a generative adversarial network can reconstruct more natural super-resolution images, mainly because the two networks adopt the idea of confrontation, that is, two opposing sides playing a game against each other. First, the low-resolution images are reconstructed into super-resolution images by the generator network. Then, the discriminator judges whether the input super-resolution image is real or fake. The judgment is fed back to optimize both the generator and the discriminator. Through this repetition, the final generator and discriminator complete their self-optimization in mutual confrontation. Early SRGANs were implemented on the basis of the standard generative adversarial network (SGAN). The discriminator loss and generator loss of this network are as follows:

$$L_D = -\mathbb{E}_{I^{HR}}\left[\log D(I^{HR})\right] - \mathbb{E}_{I^{LR}}\left[\log\left(1 - D(G(I^{LR}))\right)\right],$$

$$L_G = -\mathbb{E}_{I^{LR}}\left[\log D(G(I^{LR}))\right],$$

where $I^{LR}$ and $I^{HR}$ denote the low-resolution input and the real high-resolution image, respectively.

The binary cross-entropy loss adopted by the SGAN model makes training unstable, and balancing the training of the generator and the discriminator is a problem. To this end, some works try to improve SGAN with other loss functions [16], for example, the relativistic GAN loss. Inspired by these works, we also improve the network structure and loss function of SRGAN. Next, we introduce the improved generative adversarial network image super-resolution model.

(1) Improved Generative Adversarial Network Super-Resolution Model. The improved generator network is shown in Figure 8. First, in the feature extraction module (FEM), we replace the feature extraction block (FEB), substituting a multiresidual dense block for the residual dense block of the SRGAN model. The improved multiresidual dense block reuses more features in the image feature extraction stage and allows a deeper network. After the image features are extracted, two sets of convolution and sub-pixel convolution operations generate a fourfold super-resolution image. A multiresidual dense block with a batch normalization layer is used.

The improved discriminator network is shown in Figure 9. Recent experiments [17] show that strided convolutional layers in the discriminator reduce the resolution of the generated feature maps, which ultimately causes the super-resolution images generated by the model to lose detail. Therefore, we adopt the same improvement and set the stride to 1 for all strided convolutions. To make the model fit in memory during training, we change the number of output feature maps of each layer to the values shown in Figure 9. The other structures of the discriminator remain unchanged from SRGAN.

(2) Loss Function. The loss function of the generator consists of three parts: the content loss $L_1$, the perceptual loss $L_{VGG}$, and the relativistic average adversarial loss $L_{Ra\_G}$. It can be expressed as follows:

$$L_G = \alpha L_1 + \beta L_{VGG} + \gamma L_{Ra\_G}.$$

Among them, α, β, and γ are coefficients that balance the overall loss, taken as 1, $10^{-3}$, and $2 \times 10^{-6}$, respectively, in the experiments. $L_{VGG}$ computes the loss between super-resolution and high-resolution images in feature space. To ensure that the reconstructed image is structurally similar to the real image, we use a pixel-based content loss, which computes the 1-norm distance between the generated super-resolution image $G(I^{LR})$ and the true high-resolution image $I^{HR}$:

$$L_1 = \mathbb{E}\left[\left\lVert G(I^{LR}) - I^{HR} \right\rVert_1\right].$$
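Putting the three terms together, a tf.keras sketch of the composite generator loss is given below. The pairing of α, β, and γ with the terms follows the order in which the terms are listed above and is an assumption, as is the choice of VGG19 layer for the perceptual features.

```python
import tensorflow as tf

# Fixed VGG19 feature extractor for the perceptual loss (the layer choice
# is illustrative; inputs are assumed to be RGB batches in VGG's expected range).
vgg = tf.keras.applications.VGG19(include_top=False, weights="imagenet")
feat = tf.keras.Model(vgg.input, vgg.get_layer("block5_conv4").output)

def generator_loss(sr, hr, l_ra_g, alpha=1.0, beta=1e-3, gamma=2e-6):
    l1 = tf.reduce_mean(tf.abs(sr - hr))                    # pixel content loss (1-norm)
    l_vgg = tf.reduce_mean(tf.square(feat(sr) - feat(hr)))  # perceptual (VGG feature) loss
    return alpha * l1 + beta * l_vgg + gamma * l_ra_g       # weighted composite loss
```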

In SGAN training, the generator's loss function does not involve real data samples, which means the optimization of the generator is guided entirely by the discriminator. In the relativistic GAN, the generator's loss uses real data samples as a reference. Furthermore, SGAN forces the model to recognize real samples as real and fake samples as fake, whereas the relativistic GAN mixes real and fake samples and asks the discriminator to judge which is relatively more realistic. This makes GAN training more stable to a certain extent and also converges faster. The relativistic generator and discriminator loss functions are expressed as follows:

$$L_{Ra\_G} = -\mathbb{E}_{x_r}\left[\log\left(1 - D_{Ra}(x_r, x_f)\right)\right] - \mathbb{E}_{x_f}\left[\log D_{Ra}(x_f, x_r)\right],$$

$$L_{Ra\_D} = -\mathbb{E}_{x_r}\left[\log D_{Ra}(x_r, x_f)\right] - \mathbb{E}_{x_f}\left[\log\left(1 - D_{Ra}(x_f, x_r)\right)\right],$$

where $D_{Ra}(x_r, x_f) = \sigma\left(C(x_r) - \mathbb{E}_{x_f}[C(x_f)]\right)$, $C(\cdot)$ is the discriminator output before the sigmoid $\sigma$, and $x_r$ and $x_f$ denote real and generated samples.
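The relativistic average losses can be computed from the raw discriminator outputs as in the following sketch; `c_real` and `c_fake` denote $C(x_r)$ and $C(x_f)$ for a batch, and the function names are illustrative.

```python
import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)

def ra_losses(c_real, c_fake):
    """c_real, c_fake: pre-sigmoid discriminator outputs on real/generated batches."""
    # D_Ra(x_r, x_f) = sigma(C(x_r) - E[C(x_f)]) and symmetrically for fakes.
    d_real = c_real - tf.reduce_mean(c_fake)
    d_fake = c_fake - tf.reduce_mean(c_real)
    # Discriminator: real samples should look more real than fakes on average.
    l_ra_d = bce(tf.ones_like(d_real), d_real) + bce(tf.zeros_like(d_fake), d_fake)
    # Generator: the symmetric objective, so real data enter the generator loss.
    l_ra_g = bce(tf.ones_like(d_fake), d_fake) + bce(tf.zeros_like(d_real), d_real)
    return l_ra_g, l_ra_d
```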

4. System Implementation and Testing

4.1. System Development Environment

The training and generation of this application run on a server; the development environment configuration is shown in Table 1. The core algorithm requires a large amount of video memory, so 8 Tesla P100 GPUs were used for training, and training took 524 hours. Since the system model needs to be trained for a long time, all displayed samples come from models trained in advance.

4.2. System Application Test
4.2.1. Low-Resolution Nuo Surface Voxel 3D Construction Based on Multifeature Fusion

The training details of the traditional 3D reconstruction model, the multi-fusion 3D reconstruction model, and the quadratic-optimization multi-fusion 3D reconstruction model are set as described in Section 3.4.2. Comparing the parameter sizes of different models indicates how much GPU memory each requires, which determines to a certain extent whether a model can meet the hardware conditions. In addition, the training time of different 3D reconstruction models is an important indicator of performance. Table 2 lists the single-iteration time and the parameter sizes of the models. As can be seen from Table 2, the number of parameters of the improved models differs little from that of the traditional 3D reconstruction model; the multi-fusion model has 0.1 M more parameters. However, the multi-fusion 3D reconstruction model greatly improves the single-iteration time, running roughly twice as fast as the traditional model. The iteration time and parameter count of the quadratic-optimization multi-fusion model include those of the pretrained multi-fusion model. Within a single network training, the quadratic-optimization multi-fusion model does not differ much from the other models in parameters, but its iteration time is longer. In the experiments, the traditional 3D reconstruction model takes 10 hours to train, while the improved multi-fusion model takes only a few hours. Using 3D deconvolution to generate a higher-resolution 3D object from a lower-resolution one requires more computation time, whereas rearranging multiple low-resolution 3D objects into a higher-resolution object saves computational overhead.

The rendered images in the test set are directly used as input to the low-resolution Nuo surface 3D construction model, which outputs a 3D Nuo surface at a resolution of 64 × 64 × 64. The 3D reconstruction results of the different models are shown in Figure 10: Figure 10(b) is the low-resolution 3D shape generated by the traditional 3D reconstruction model, Figure 10(c) by the multifeature fusion 3D reconstruction model, and Figure 10(d) by the quadratic-optimization multifeature fusion 3D reconstruction model. As can be seen from Figure 10, the overall reconstruction results in Figures 10(b), 10(c), and 10(d) do not differ much. Because of the low reconstruction resolution, these results cannot show fine details; only rough contours are reconstructed. Comparing the reconstruction results of the different models with the real 3D structure, there is little difference in overall appearance; in terms of detail, however, the improved models proposed in this paper achieve better visual effects than the traditional 3D construction model. The experimental results show that the proposed 3D construction model can preliminarily generate the corresponding 3D shape from a single Nuo surface image.

IoU is used as the evaluation metric to measure the quality of the generative model. IoU measures the degree of overlap (intersection over union) between the reconstructed model and the real model and takes a value from 0 to 1; the higher the IoU, the closer the reconstructed model is to the real model. The IoU was computed on the test results of each model as follows:

$$IoU = \frac{\left|V_{pred} \cap V_{gt}\right|}{\left|V_{pred} \cup V_{gt}\right|},$$

where $V_{pred}$ is the set of occupied voxels in the reconstruction and $V_{gt}$ is the set of occupied voxels in the ground truth.
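On binary voxel grids, the metric reduces to a few NumPy operations, as in the sketch below; the threshold value and array names are illustrative.

```python
import numpy as np

def voxel_iou(pred, gt, t=0.5):
    """pred: occupancy probabilities; gt: binary ground-truth grid."""
    p = pred > t                       # binarize the prediction at threshold t
    g = gt.astype(bool)
    intersection = np.logical_and(p, g).sum()
    union = np.logical_or(p, g).sum()
    return intersection / union if union > 0 else 1.0

pred = np.random.rand(64, 64, 64)   # stand-in for a reconstructed Nuo surface
gt = np.random.rand(64, 64, 64) > 0.5
print(voxel_iou(pred, gt))          # value in [0, 1]; higher is better
```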

The results are shown in Table 3. The improved multifeature fusion 3D reconstruction models outperform the traditional 3D reconstruction model. From the perspective of reconstruction accuracy, the quadratic-optimization multi-fusion 3D construction model proposed in this paper achieves a considerable improvement, increasing reconstruction accuracy by about 6.3%. The comparative experiments therefore demonstrate the superiority of the proposed model in 3D reconstruction of the Nuo surface.

4.2.2. 3D Construction of High-Resolution Nuo Surface Voxels Based on Generative Adversarial Networks

This module takes the low-resolution results of the previous module as input. The number of training epochs is set to 100, and the other network parameters are the same as above. The generator is pretrained with the pixel-based loss; pretraining ensures that the generator produces images of reasonable visual quality from the start, so the discriminator does not have to judge irrelevant generated images early on and can instead focus on the details of the reconstructed image. The comparison results of the GAN-based depth-map super-resolution experiment are shown in Figure 11. Visually, the GAN-based image super-resolution method does not look different from the multiresidual dense method, probably because the reconstructed image is a depth map, whose local differences are subtle and not obvious. However, the PI scores show that the improved GAN-based method achieves lower perceptual scores, indicating that GAN-based methods generate more realistic depth images overall. It must also be noted that in the finer details there is still a large gap between the GAN-based results and the real depth image.

Figure 12 compares the improved high-resolution 3D Nuo surface reconstruction results of this section with those of the quadratic-optimization multi-fusion and multiresidual dense high-resolution methods described in Section 3. As can be seen in Figure 12, the GAN-based approach looks similar to the multiresidual dense approach overall, but in the local details of the generated Nuo surface depth map, the GAN-based method reconstructs more detail. Note also that the multiresidual dense depth-map super-resolution method uses a mean-square-error loss, which yields depth maps that are smooth overall, so the reconstructed 3D Nuo surface is smooth as a whole. The surface generated by the GAN method is comparatively rough, mainly because the super-resolution depth maps generated by the GAN method are prone to image artifacts while producing more realistic surface detail.

5. Conclusion

The main research content of this paper is the design of a three-dimensional display system for intangible cultural heritage based on generative adversarial networks. The system's functions are realized through four modules: an input module, a data processing module, a 3D model generation module, and a model output module. Through two 3D model reconstruction methods, the three-dimensional display of intangible cultural heritage is realized.

This paper takes the Nuo surface as the research object and tests the system's functions. In low-resolution 3D construction of the Nuo surface, the reconstructed surface is relatively rough because of the low resolution, making it difficult to show details of the reconstructed 3D Nuo surface. Multiresidual dense blocks are therefore introduced into the depth-image super-resolution network, together with a perceptual loss function, combining the advantages of residual blocks and dense blocks to make the output depth image more realistic. The experimental comparisons show that the proposed multiresidual dense high-resolution Nuo surface voxel 3D construction model achieves better performance, and the quadratic-optimization multi-fusion 3D construction model achieves a considerable improvement, raising reconstruction accuracy by about 6.3%.

Inspired by the SRGAN model's ability to generate images with more high-frequency detail, we also use generative adversarial networks for high-resolution Nuo surface voxel 3D construction. To further improve GAN-based image super-resolution, we improve the generator, discriminator, and loss function of the original SRGAN. Experimental results show that this method generates more realistic and natural super-resolution depth maps and, when used for high-resolution 3D Nuo surface construction, also generates 3D voxel Nuo surfaces with more detail.

Data Availability

The data set can be obtained from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

The authors thank the Major Project of the National Social Science Foundation of China, "Integration and Research of Literature, Images, and Cultural Relics of Chinese Ancient Wind Instruments" (no. 17ZDA245).