Abstract
The origins of anime can be traced back to the earliest days of human civilization. Nowadays, anime is a record of life as well as a popular form of entertainment and a source of ideals and faith for many individuals. People of all ages and backgrounds enjoy anime. In the opinion of most people, anime is not simply a form of amusement and pleasure; it can also express deeper meanings, transmit other cultures, and inspire individuals to pursue their aspirations. Image style migration based on convolutional neural networks has become a central research direction in recent years, and approaches to style migration have evolved accordingly. However, there are still few studies on anime-style migration. In this paper, we propose a deep learning-based solution to the problem of anime-style migration. Experiments on a relevant database show that our proposed method is effective and accurate and has commercial and academic significance.
1. Introduction
In computer graphics, image style migration generally refers to applying a specific artistic style (usually a class of paintings) to an image, generating a new image that preserves the semantic content of the original while adopting the target style. Deep learning has greatly accelerated the development of image style migration. The definition of image art style is imprecise and subjective, and before deep learning was applied to image style migration, most solutions required manual modeling of an art style, which resulted in high labor costs and limited application scenarios [1–3]. With the introduction of deep learning techniques, researchers in related disciplines have paid more attention to neural style migration. Researchers have been freed from the tedious modeling work required for specific art styles, which yields substantial economic gains while bringing color to daily life. As a popular form of art, anime has a wide range of uses, from advertising to children's education. Like many other forms of art, many renowned anime visuals are drawn from real-life images. Because current deep learning-based anime-style migration methods still do not produce very good results, image migration methods for the anime style have high research value. Figure 1 depicts a typical picture style transfer schematic [4, 5].

Figure 1: (a) real image; (b) watercolor painting; (c) pencil drawing; (d) crayon painting.
Deep learning-based image processing has recently received increased attention, including super-resolution reconstruction, image restoration, colorization of black-and-white images, and the recently very popular AI face-swapping. In recent years, there has also been an increase in image stylization research, whose goal is to transform a regular image into an artistic-style painting. The concept of image stylization migration arose from the use of convolutional neural networks for texture synthesis of image features [6]. It was discovered that the feature maps extracted by convolutional neural networks can represent both the stylistic and the content features of an image; a feature map captures the deep features of an image. A Convolutional Neural Network (CNN) receives the image as input and reconstructs it with convolutional layers while extracting features of different dimensions. A Gram matrix then computes the style features and reproduces the texture of the various style maps, and fusion and reconstruction of the content map and style map produce the final result: a painting that combines the image's content with the target artistic style.
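To make the Gram-matrix idea above concrete, the following sketch, intended as an illustration rather than this paper's exact pipeline, extracts a VGG19 feature map and computes its Gram matrix; the layer name block3_conv3 and the input size are arbitrary choices for the example.

```python
import tensorflow as tf

# Pretrained VGG19 (ImageNet) without the classifier head; we expose one
# intermediate feature map. The layer choice is illustrative only.
vgg = tf.keras.applications.VGG19(include_top=False, weights="imagenet")
feature_extractor = tf.keras.Model(
    inputs=vgg.input, outputs=vgg.get_layer("block3_conv3").output)

def gram_matrix(features):
    """Channel-to-channel correlations of a feature map of shape (B, H, W, C)."""
    b, h, w, c = tf.unstack(tf.shape(features))
    flat = tf.reshape(features, (b, h * w, c))        # flatten spatial positions
    gram = tf.matmul(flat, flat, transpose_a=True)    # (B, C, C)
    return gram / tf.cast(h * w * c, tf.float32)      # normalize by map size

# Toy usage on a random "image" in [0, 255].
image = tf.random.uniform((1, 256, 256, 3)) * 255.0
features = feature_extractor(tf.keras.applications.vgg19.preprocess_input(image))
print(gram_matrix(features).shape)  # (1, 256, 256): correlations between 256 channels
```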
Recently, stylized image migration has been introduced in various areas, including video broadcasting [7–10] and movie special effects [11, 12], and has been eagerly sought out and adored by young people on social media platforms. Although stylized image migration has benefited several sectors, it is currently applied mainly to oil-painting-like images with pronounced textures, and line-based images such as animation and drawings are seldom addressed. Anime plays a major role in our lives [13]; for example, legendary Japanese series such as Black Butler and Chibi Maruko-chan have been an irreplaceable part of the hearts of the generation born in the 1990s, and Disney's and Marvel's animation still enjoys the affection of countless fans worldwide [14]. Many individuals dream of the other worlds created by anime. Anime can help calm the mood, boost creativity, help people discover their values and beliefs, and let them see optimism for the future. Given the above, we believe it is still important to enhance current research and apply neural art algorithms to anime-style rendering [15].
With the advancement of computer information technology, computer vision has begun to play a role in creating entertaining applications for everyday life. In recent years, style migration, which combines computer vision and artificial intelligence, has become a primary direction of computer technology development and a vital means of commercial computer applications. Style migration has many applications and, in its broadest definition [16], is a technique that employs algorithms to render a picture in another specified style while preserving the picture's content. In essence, style migration belongs to the image-to-image category of conventional computer vision tasks, and animation style migration is the most challenging part of the whole style migration problem. Because 3D animation is the dominant direction in China according to the national industrial plan [17], there is a scarcity of artists in the 2D animation industry. However, the 2D animation sector will continue to hold a sizable market share in the near future. Traditional 2D production requires artists to draw lines, colors, and split shots by hand, which are the most time-consuming tasks in the animation production cycle. Research on style migration with generative adversarial networks therefore gives the domestic 2D animation sector a chance to compete for this market.
The advancement of civilization [18], the rapid expansion of the economy, and the pursuit of a richer spiritual life have produced today's developed animation industry, which takes animation and cartoons as its principal carriers. The United States is a long-established anime powerhouse with a highly concentrated market, and many recent Hollywood blockbusters are based on superb anime from the 1970s and 1980s. The domestic anime market was worth 160 billion dollars in 2019 and 175 billion dollars in 2020 and is projected to reach 250 billion dollars by 2025 [19]. It is clear that anime-style migration will yield significant economic gains if it is invested in the animation industry's development. Digital image algorithms, very deep convolutional networks (VGG) [20], and generative adversarial networks (GAN) are the key methods employed in the development of style migration. We decided to use GAN in our work because the former two require paired training sets, which means an artist must create the corresponding anime drawings by hand based on real photos, and matched data sets of the same anime style are difficult to find at the moment. GAN's unsupervised technique allows for many different styles [21], shorter work cycles, and lower trial-and-error costs. Deep learning can translate photographs directly into anime styles, which can significantly reduce the cost of anime production. According to the regional distribution of published papers, GAN-related style migration papers are concentrated mainly in China, Hong Kong, Singapore, and Korea, whereas many new ideas are proposed by Hollywood special effects artists and animation artists in the West, particularly in the United States. It is obvious that GAN-based style migration research [22] offers a wide variety of application scenarios and eases the production of cinematic animation. Beyond that, image animation offers a wide range of entertainment applications. Camera-based entertainment algorithms, from changing looks to changing styles, have been embraced by the public in everyday life and have a large audience. Jitterbug Racer, for example, is becoming increasingly significant in people's everyday lives and enjoyment; the feature that captures a short video and modifies the aesthetic of the shot is based on generative adversarial networks [23].
Nowadays, anime and animation are growing increasingly popular across the world, and many well-known anime works are based on real-world locations. However, anime requires significant artistic abilities that take a long time to develop. Good software that transforms pictures of real-world scenes into anime visuals would be a valuable tool for artists building their own styles. Furthermore, such an approach could be integrated into picture editing software to transform ordinary photographs into anime-style images. Based on the above analysis, in this paper, we propose a new method for anime-style rendering. We use convolutional networks for anime-style migration and, through experiments, arrive at the method best suited to the anime style. The proposed algorithm generates smooth, color-blocked anime images from blurred pseudo-real images while adding an identity consistency loss to maintain the character's identity characteristics.
2. Related Work
2.1. Image Style Transfer
People desire to enjoy both the riches of worldly life and the completeness of the spiritual world, which leads them to admire great paintings and to seek spiritual sustenance more frequently. In modern life, people increasingly use stylized painting software to create photographs that resemble the stylistic effect of great paintings, so image stylization is becoming ever more worthy of attention in the pursuit of artistic beauty. Movies with distinctive oil-painting-style themes are also becoming more popular, making the study of motion picture stylization essential to image stylization. Thanks to advances in graphics processors, deep learning based on convolutional neural networks has seen a second spring in recent years, and compared with traditional image art stylization methods that rely on rendering and filtering, deep learning-based image stylization algorithms adapt better to various image processing tasks. Images can be divided into realistic and nonrealistic images: realistic images depict real-world objects, whereas nonrealistic images are computer-generated images with a certain nonrealistic style, produced by what is known as nonrealistic rendering. Style migration is the process of applying a style to a real image, and image style artistry is closely related to it, combining existing artwork with nonrealistic rendering to create a range of creative effects in real images. When computers first appeared, most research aimed to display realistic scenes, also known as realistic rendering. Now that such realistic depiction can be easily captured, there is growing interest in letting computers imitate painters' brushstrokes or create more esthetic forms of drawing. Once such drawing-style algorithms exist, a painting that would take a painter more than a month can be completed by a computer in only a few minutes, and the computer can replicate or mix multiple artists' styles to produce esthetic images that do not overlap. Realistic images depict realistic scenarios and do not address the demands of all sectors and users [24]. Because many esthetic expressions carry powerful visual effect and cultural significance, they can be used in various industries such as advertising design, film production, animation, and game rendering. Nonrealistic images are also required in specialist domains such as engineering and industrial design, where they can aid simulations that would otherwise be time-consuming and complex. The study of image art stylization is significant because it allows computers to emulate human art production: if an artist's style can be replicated by a machine, the computer attains a higher level of intelligence and learning capacity, allowing it to create a vast number of nonrealistic images with greater creative expression and psychological impact.
The major focus of our study is the oil painting style [25], which is the result of thousands of years of civilization, and such works convey information with more depth than real photos. Furthermore, because there is significant demand for creative pictures among modern people, an increasing number of mobile applications produce artistic images. These artistic images generate new works with aesthetic merit in addition to providing visual delight to readers and viewers. Nonrealistic image research has a wide range of applications and development potential and is used in a variety of industries, including cultural and entertainment media, computer animation, industry, and filmmaking. Examples include (1) beauty shots and other beautifying special effects in cameras; (2) applications in the advertising, film and television, and animation sectors that generate hand-painted effects on screen and add artistry and enjoyment to productions; (3) images of tumors obtained using CT or MRI; and (4) industrial design simulation.
Figure 2 depicts a schematic of cartoon picture style transfer. Gatys et al. [1] proposed a texture model based on the feature space of a convolutional neural network, in which texture is represented by the interrelationships between the feature maps in each layer of the network, so that object information becomes increasingly clear while the extracted texture increasingly captures the stylistic content features of natural images. The original image is fed into the neural network, texture statistics are derived from the features extracted in different convolutional layers using Gram matrix calculations, a white noise image is fed into the same network, and the texture loss functions at different layers are minimized for texture synthesis. The authors then use this texture synthesis method to perform image style migration in an oil painting style and fuse it with the extracted texture image to create a final picture containing diverse oil painting art styles. Gong et al. [26] address artistic representation of pictures and films with a feed-forward convolutional network that is hundreds of times faster; it creates many samples of the same texture at variable sizes and transfers the artistic style from one image to another. Another style conversion method's main component is a block matching-based operation that constructs target activations in a fixed layer, given a style and a content image, a process known as "Style-swap": patches of the content activations are replaced with the closest patches of the style activations, and the operation is performed at only one layer. Real-time arbitrary style transfer with adaptive instance normalization aligns the mean and variance of the content features with the mean and variance of the style features, and the affine parameters are adapted according to the style image. The novelty is that the mean and variance of a style map are computed directly as it passes through the convolutional neural network, eliminating the need to store them. In another setting, two images A and B serve as each other's content map and style map, and the conversion must produce images that keep their original content in the other's style. The two images must be highly similar, or a VGG network is used for feature extraction, where the high-level convolutions extract the texture of the image. After feature extraction with VGG19, the coarse-grained feature maps created by the top convolutional layers should look very similar, i.e., be almost identical; if so, the top-layer feature map of A can be deconvolved to reconstruct the content and then fused with the features of image B to produce the final picture BA, and vice versa for AB. Furthermore, high-definition image style migration is concerned with style migration between two photos and emphasizes detail and clarity: the style input is a high-quality photo, the result can, for example, change a scene from day to night, the conversion goes from high-definition photo to high-definition photo, and the style image is no longer an artistic painting.
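As a minimal sketch of the adaptive instance normalization idea mentioned above (not the cited works' exact implementation), the snippet below shifts the per-channel mean and variance of content features to match those of the style features; the channels-last layout and toy shapes are assumptions for illustration.

```python
import tensorflow as tf

def adaptive_instance_norm(content_feat, style_feat, eps=1e-5):
    """Align each channel's mean/variance of the content features with the
    corresponding statistics of the style features. Layout: (B, H, W, C)."""
    c_mean, c_var = tf.nn.moments(content_feat, axes=[1, 2], keepdims=True)
    s_mean, s_var = tf.nn.moments(style_feat, axes=[1, 2], keepdims=True)
    normalized = (content_feat - c_mean) / tf.sqrt(c_var + eps)
    return normalized * tf.sqrt(s_var + eps) + s_mean

# Toy usage with random feature maps that share the channel count.
content = tf.random.normal((1, 64, 64, 256))
style = tf.random.normal((1, 32, 32, 256))
print(adaptive_instance_norm(content, style).shape)  # (1, 64, 64, 256)
```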
To prevent distortion of the resulting pictures, a set of photorealistic regularization loss functions is added to the loss function during the optimization phase. Semantic segmentation is used to transfer the grass's style features to the lawn, the sky's style features to the sky, and so on, so that the resulting image is more realistic because mismatched style transfer between semantic regions is avoided. Thanks to the continuous activity in image processing in recent years, people have ever more research results and ideas in this area, and image stylization, image recognition, and video stylization have all seen recent advances. Deep learning has been very popular in recent years, especially in the domain of image style migration, and has offered new ideas and better outcomes for image processing. People's living standards are rising, and so is their pursuit of art. Deep learning-based image stylization algorithms are becoming increasingly interesting, and there is a good chance that deep learning and related algorithms will deliver even better image stylization results in the future.

3. Method
In general, anime requires tidy, clear borders and should avoid large numbers of uneven color blocks. However, existing approaches tend to generate a checkerboard look and many uneven color blocks. To solve these issues, this section improves on previous approaches through network modification, parameter selection, and construction of the training dataset, eventually arriving at a method suitable for anime-style transfer.
3.1. Proposed Method
Figure 3 depicts the overall network design. The network structure consists of an image conversion network and a loss function computation network. During training, the conversion network $f_W$ converts the training image $x$ into the output image $\hat{y} = f_W(x)$. Each loss function computes a scalar value that measures the difference between the output image and the content and style images, and the two loss functions are averaged for back propagation to update the conversion network's parameters $W$.

The perceptual loss is produced by the loss network, and the parameters of the conversion network are updated by back-propagating this loss; using different style maps, the same framework can be trained into multiple style models. The loss computation network computes two loss functions, a content loss and a style loss, and thereby measures the semantic and feature differences between the content image, the style image, and the output picture. VGG networks are widely used in face recognition, image classification, and other applications because they perceive semantic image information well, with VGG16 and VGG19 being the most common variants. In this paper, we use a VGG model pretrained for image classification to obtain strong perceptual ability and semantic differences, which works better in the image reconstruction process. Content loss: let $\phi_j(x)$ denote the feature map of the $j$-th convolutional layer of the loss network for input $x$, with dimensions $C_j \times H_j \times W_j$ (number of channels, height, and width), let $\hat{y}$ be the generated image, and let $y_c$ be the input content image. The content loss of Equation (2) is the normalized squared distance between the feature maps,

$\ell_{content}^{j}(\hat{y}, y_c) = \frac{1}{C_j H_j W_j} \left\| \phi_j(\hat{y}) - \phi_j(y_c) \right\|_2^2 .$

Style loss: because the content loss makes the reconstructed image keep the original content, the style map is expected to match the original texture, lines, and other elements. For the $j$-th layer with feature map shape $C_j \times H_j \times W_j$, we first compute the Gram matrix of the style image features and of the output image features; the Gram matrix represents the correlation between different channels of the image. The correlation is computed as in Equation (3), using the VGG-19 model pretrained on the ImageNet dataset:

$G_j(x)_{c,c'} = \frac{1}{C_j H_j W_j} \sum_{h=1}^{H_j} \sum_{w=1}^{W_j} \phi_j(x)_{h,w,c}\, \phi_j(x)_{h,w,c'} .$

Here $\phi_j(x)$ provides a $C_j$-dimensional feature at each spatial position, so $G_j(x)$ is proportional to the uncentered covariance of these features: each grid point can be regarded as an independent sample, and pairwise covariance information between feature channels (for example, a channel responding to yellow and a channel responding to square shapes) can be obtained. The style reconstruction loss is then the squared Frobenius norm of the difference between the Gram matrices of the output picture and the target style image, as indicated in Equations (3) and (4),

$\ell_{style}^{j}(\hat{y}, y_s) = \left\| G_j(\hat{y}) - G_j(y_s) \right\|_F^2 .$
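Below is a minimal sketch of how the content loss of Equation (2) and the Gram-based style loss of Equations (3) and (4) could be computed with a pretrained VGG19 in TensorFlow. The Keras layer names are only an approximate mapping of the relu layers reported in Section 4.1, and the helper perceptual_losses is a hypothetical name rather than this paper's exact code.

```python
import tensorflow as tf

vgg = tf.keras.applications.VGG19(include_top=False, weights="imagenet")
vgg.trainable = False

# Approximate Keras equivalents of the relu layers listed in Section 4.1.
CONTENT_LAYER = "block4_conv2"
STYLE_LAYERS = ["block1_conv2", "block2_conv2", "block3_conv3",
                "block4_conv3", "block5_conv3"]
outputs = [vgg.get_layer(name).output for name in [CONTENT_LAYER] + STYLE_LAYERS]
loss_net = tf.keras.Model(vgg.input, outputs)

def gram(f):
    # Normalized channel-correlation (Gram) matrix of a (B, H, W, C) feature map.
    b, h, w, c = tf.unstack(tf.shape(f))
    flat = tf.reshape(f, (b, h * w, c))
    return tf.matmul(flat, flat, transpose_a=True) / tf.cast(h * w * c, tf.float32)

def perceptual_losses(generated, content_img, style_img):
    """Content loss: normalized squared feature distance at CONTENT_LAYER.
    Style loss: squared Frobenius norm of Gram differences over STYLE_LAYERS."""
    gen_f, con_f, sty_f = (
        loss_net(tf.keras.applications.vgg19.preprocess_input(x))
        for x in (generated, content_img, style_img))
    content_loss = tf.reduce_mean(tf.square(gen_f[0] - con_f[0]))
    style_loss = tf.add_n([tf.reduce_sum(tf.square(gram(g) - gram(s)))
                           for g, s in zip(gen_f[1:], sty_f[1:])])
    return content_loss, style_loss
```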
3.2. Image Conversion Network
The main body of the image conversion network is made up of five residual blocks. In the original study, the conversion network has three convolutional layers, five residual blocks, and three deconvolution layers. In this paper, all three original deconvolution layers are removed, and interpolation followed by convolution is used for upsampling; this substitution largely prevents the creation of checkerboard patterns. The spatial batch normalization that the original work applies after every nonlinear convolutional layer except the output layer is no longer employed. Finally, the output layer uses a Tanh activation function, scaled so that the pixels of the resulting image lie between 0 and 255, which also applies when a black-and-white anime map is required.
The input and output of the whole image conversion network are color images of a fixed training size; images larger or smaller than this are automatically resized and transformed after input. Because the image conversion network uses only convolutional layers and has no fully connected layer, the input can be an image of any resolution at test time. First, there are three convolutional layers: the first uses a convolutional kernel with a stride of 1, while the following two use convolutional kernels with a stride of 2. We perform downsampling with these strided convolutions and replace all of the original pooling layers, because a larger stride can also shrink the feature map; enlarging the receptive field in this way also improves network performance. Next come the five connected residual blocks, and the final three layers, originally deconvolutions, are replaced by interpolation followed by convolution; the network outputs two results, the original RGB image and a grayscale map.
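The sketch below illustrates, under assumed kernel sizes and filter counts, the kind of conversion network described in this subsection: strided convolutions for downsampling, five residual blocks, interpolation followed by convolution instead of deconvolution for upsampling, and a scaled Tanh output. For simplicity it emits only the RGB output rather than the RGB and grayscale pair.

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters):
    y = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    return layers.add([x, y])

def build_transform_net():
    """Illustrative conversion network: downsample by strided convolution,
    five residual blocks, upsample by interpolation + convolution (no
    deconvolution, to avoid checkerboard artifacts), Tanh output in [0, 255]."""
    inp = layers.Input(shape=(None, None, 3))
    x = layers.Conv2D(32, 9, padding="same", activation="relu")(inp)
    x = layers.Conv2D(64, 3, strides=2, padding="same", activation="relu")(x)
    x = layers.Conv2D(128, 3, strides=2, padding="same", activation="relu")(x)
    for _ in range(5):
        x = residual_block(x, 128)
    for filters in (64, 32):
        x = layers.UpSampling2D(interpolation="bilinear")(x)                  # resize first ...
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)   # ... then convolve
    x = layers.Conv2D(3, 9, padding="same", activation="tanh")(x)
    out = layers.Lambda(lambda t: (t + 1.0) * 127.5)(x)                       # map [-1, 1] to [0, 255]
    return tf.keras.Model(inp, out)

transform_net = build_transform_net()
```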
4. Experimentation and Evaluation
4.1. Data Set Composition
In this paper, we mainly use the Microsoft COCO dataset and a standard black-and-white anime dataset for the experiments. The COCO dataset includes more than 80,000 real images, and the anime database contains more than 1,000 black-and-white anime images. The batch size is set to 4, and each training image is resized to a fixed resolution. On the Microsoft COCO training dataset, 40,000 iterations are performed over two epochs, and on the anime dataset, 40,000 iterations are performed over 160 epochs. We used the Adam optimizer with a learning rate of 0.0001. The content reconstruction loss is computed at the relu4_2 layer, and the style reconstruction losses at the relu1_2, relu2_2, relu3_3, relu4_3, and relu5_3 layers of the VGG loss network, for all experiments. We use TensorFlow and cuDNN, and training takes roughly 12 hours on a single GPU.
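A sketch of the training step under the settings reported above (Adam, learning rate 0.0001, batch size 4) is shown below; it reuses the transform_net and perceptual_losses sketches from Section 3, and the loss weights are illustrative assumptions since they are not stated here.

```python
import tensorflow as tf

# Settings reported in Section 4.1; the loss weights are assumed for illustration.
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4)
BATCH_SIZE = 4
CONTENT_WEIGHT, STYLE_WEIGHT = 1.0, 10.0

@tf.function
def train_step(content_batch, style_batch):
    with tf.GradientTape() as tape:
        generated = transform_net(content_batch, training=True)
        # perceptual_losses() is the content/style loss sketch from Section 3.1.
        c_loss, s_loss = perceptual_losses(generated, content_batch, style_batch)
        total_loss = CONTENT_WEIGHT * c_loss + STYLE_WEIGHT * s_loss
    grads = tape.gradient(total_loss, transform_net.trainable_variables)
    optimizer.apply_gradients(zip(grads, transform_net.trainable_variables))
    return total_loss
```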
4.2. Experimental Setup and Evaluation Metrics
We conducted the experiments with the software and hardware settings shown in Table 1.
The quality of image style migration results is ultimately judged by individual ratings and esthetic preference, because there is no standard objective evaluation method for image style migration. We therefore employ a questionnaire covering four distinct viewpoints and analyze the collected responses. The experimental findings were also reviewed by comparing them with the results of other approaches and with results obtained under different parameters and data sets.
4.3. The Effect of Existing Methods
As shown in Figure 4, we compare our method with those of Gatys et al. [1] and Pei et al. [2], referred to as comparison method 1 and comparison method 2, respectively. Figure 4(a) is the input content map, Figure 4(b) is the style map, Figure 4(c) is the result of comparison method 1 [1], and Figure 4(d) is the result of comparison method 2 [2]. The result images show that both methods produce uneven color blocks. In Figure 4(d), the lines are retained more clearly, but the blank areas of the content map inside the red box still acquire the style texture, the texture of the face is denser, and a few black color blocks also appear. Therefore, in this paper, we propose an improved anime-style migration network. First, the anime-style network essentially retains the network structure of comparison method 2 while modifying the transformation network structure to lessen the appearance of grid-like checkerboard texture. The appropriate texture synthesis settings are then determined by experimentation, and a standard anime training data set is created. The texture synthesis settings and the training set used for the network are also updated so that the background and facial textures in the results are weakened, better matching the characteristics of the anime style.

Figure 4: (a) content image; (b) style image; (c) comparison method 1; (d) comparison method 2.
4.4. Comparison of Results
Figure 5 depicts the experimental results on four distinct content maps, with Figure 5(a) showing the style map, Figure 5(b) the content map, Figure 5(c) the result of comparison method 1, Figure 5(d) the result of comparison method 2, and Figure 5(e) the result of our proposed method. Figure 5 shows that the result of comparison method 1 has no obvious lines, exhibits irregular color blocks, and is cluttered overall, while the result of comparison method 2 has more background texture and a cluttered background. Our proposed method removes the background and the extra textures on the face while retaining clear boundary lines of the original image content.

Figure 5: (a) style map; (b) content map; (c) comparison method 1; (d) comparison method 2; (e) proposed method.
4.5. Objective Experimental Comparison Results
FID is a popular assessment metric: a lower FID indicates that the distribution of the machine-generated images is closer to that of real images, implying better image quality and diversity. In this paper, FID is evaluated on the entire test set to quantify the similarity between the generated anime images and real anime images. The FID between the test set in the anime domain and the training set in the anime domain is called the target value, and the FID between the training set in the photo domain and the training set in the anime domain is called the initial value, i.e., the distance between the source and target domains. Table 2 reports the objective experimental results, showing that the images produced by the method in this paper are closer to the real image distribution.
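For reference, the FID used above can be computed from Inception feature vectors of the real and generated sets as in the sketch below; extracting the Inception activations is assumed to be done separately, and the random arrays are stand-ins only.

```python
import numpy as np
from scipy import linalg

def frechet_inception_distance(real_feats, gen_feats):
    """FID between two feature sets of shape (N, D):
    ||mu_r - mu_g||^2 + Tr(C_r + C_g - 2 (C_r C_g)^(1/2))."""
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_g = np.cov(gen_feats, rowvar=False)
    covmean = linalg.sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):          # numerical noise can yield tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))

# Toy usage with random stand-in features (real use: Inception pool activations).
rng = np.random.default_rng(0)
print(frechet_inception_distance(rng.normal(size=(500, 64)),
                                 rng.normal(size=(500, 64)) + 0.1))
```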
Table 3 shows how the quality of the reconstructed content picture varies with the number of iterations. At 1,000 iterations, the network can already roughly reproduce the original image, and the reconstruction improves as training progresses.
The proposed model, shown in Figure 6, outperforms the previous models and generalizes well, yielding an average accuracy score of 0.96. As seen in Figure 6, the baseline models are unable to learn enough features from the training data compared with our proposed model, resulting in lower scores on all assessment criteria. The figure also shows the training and validation accuracy recorded from the training logs and displayed with TensorBoard.

5. Conclusion
Style migration algorithms are becoming increasingly popular. In this paper, we proposed a deep learning-based anime-style migration algorithm that can create anime-style photos with clear lines and simple background textures, and we compared the results with those of other approaches under varied settings and data sets. For evaluation in real-world scenarios, we designed a survey questionnaire aimed primarily at young women aged 16 to 30, who are more familiar with anime. The findings from the 354 responses collected revealed that the majority of respondents preferred the anime migration pictures produced by the proposed method, and the outcomes were comparatively better.
Data Availability
The data sets used during the current study are available from the corresponding author on reasonable request.
Conflicts of Interest
The author declares that he has no conflict of interest.