Abstract

Artistic portrait drawing (APDrawing) generation has seen progress in recent years. However, due to the natural scarcity and high artistry of such drawings, it is difficult to collect large-scale labeled and paired data or to divide drawing styles into several well-recognized categories. Existing works suffer from limited labeled data and a naive manual division of drawing styles according to the corresponding artists. They cannot adapt to actual situations, in which, for example, a single artist might have multiple drawing styles and APDrawings from different artists might share similar styles. In this paper, we propose to use unlabeled and unpaired data and perform the task in an unsupervised manner. Without a manual division of drawing styles, we take each portrait drawing as a unique style and introduce self-supervised feature learning to learn free styles for unlabeled portrait drawings. Besides, we devise a style bank and a decoupled cycle structure to address the two main considerations of the task: generation quality and style control. Extensive experiments show that our model is more adaptable to different style inputs than state-of-the-art methods.

1. Introduction

In recent years, studies on neural style transfer [13] have flourished. Researchers are no longer satisfied with a single form of image expression and have begun to consider more complex and diverse image translation relations [4–16]. For example, CycleGAN [17] transfers a color photograph into a Monet painting. Neural style transfer [18] treats style transfer as a texture transfer problem, transferring the texture of a style image onto a content image. Based on the same texture-style modeling assumption, many other methods [19, 20] have been developed for neural style transfer.

However, artistic portrait drawing (APDrawing) generation is not a texture-based style transfer task. APDrawing differs in style from other portrait paintings in terms of strokes, color composition, etc. APDrawing generation aims to transform a face photo into a highly abstract artistic portrait drawing while preserving the characteristics of the human face. Since APDrawings contain little texture information, previous texture-based style transfer methods are not suitable for the APDrawing generation task.

Two recent methods [21, 22] have been specifically proposed for the APDrawing generation task. APDrawingGAN [21] first constructed an APDrawing dataset and developed a hierarchical structure and a distance transform loss. However, it requires paired data, and the drawing style is thereby limited to a single one. The method of [22] further improved the generation quality and increased the number of generation styles to three. It proposed an asymmetric cycle structure, including a truncation loss and a relaxed forward cycle consistency. However, it naively divided drawing styles according to the corresponding artists, which does not conform to actual situations. As shown in Figure 1, the first artist uses parallel lines to draw shadows, while the second artist often uses continuous thick lines and large dark regions, which is sometimes close to the drawing style of the third artist. The similarities or even sameness in styles between drawings of different artists, together with the differences between drawings of the same artist, make it inappropriate to simply divide drawing styles by artist.

Meanwhile, there exists no publicly available dataset of APDrawings with obviously different styles, and constructing one relies heavily on manual crawling and filtering. It is time-consuming and costly to collect large-scale labeled and paired training data. In addition, compared with other art paintings, APDrawings are relatively rare. The shortage of training data leads to a great need for a new unsupervised paradigm.

In this paper, we propose two realistic assumptions for the APDrawing generation task, i.e., there is only access to unlabeled and unpaired data, and any manually explicit style definition should be discarded. The first assumption enables us to naturally avoid the cost and difficulty of annotating large-scale labeled and paired data. Based on the second assumption, we treat each APDrawing as a single style instead of relying on manual division. Specifically, contrastive self-supervised learning is introduced to learn styles for all input APDrawings, which ensures the low-cost acquisition of large-scale training data. It pulls each APDrawing and its augmented images close together and pushes different APDrawings apart. In this way, our style feature extractor can explore latent relations among the input data.

Intuitively, without a clear division of drawing styles, the self-supervised style features are accompanied by irrelevant information, and it is not easy to embed such unconstrained style features into our generation process while preserving generation quality and controlling drawing styles as expected. Accordingly, we build up a style bank for all these styles. As a set of representative styles for style groups, the style bank can also be viewed as a way to reduce the dimension of the style feature space and stabilize the training process. Furthermore, we propose a decoupled cycle structure with two streams to guarantee both the generation of vivid APDrawings and the generation of free styles.

In summary, the main contributions of our work are listed as follows:
(i) We propose to use unlabeled and unpaired data for the APDrawing generation task, which frees us from excessive dependence on labeled data.
(ii) Without a naive manual division of drawing styles, we treat each APDrawing as a single style and introduce contrastive self-supervised learning to learn style features for them. This enables us to generate APDrawings with free styles for different style inputs.
(iii) We propose a style bank to update the original style features and a decoupled cycle structure, which guarantee the stability and robustness of training with a set of unsupervised style features.

2. Related Work

2.1. Deep Learning in APDrawing

With the help of deep learning techniques, great strides have been made towards more powerful and adequate artificial intelligence for many vision tasks. Drawing-related applications, such as line drawing colorization [23] and artistic shadow creation [24], also benefit from deep learning, which can produce more creative and richer paintings with less human effort. Zhang et al. [25] proposed a deep learning framework for user-guided line art flat filling. It included a split filling mechanism to directly estimate the result colors and influence areas of scribbles. Im2Pencil [26] translated photos to pencil drawings with a two-branch framework that learned separate filters for outline and shading generation, respectively. It can generate pencil drawings with style control.

2.2. Image-to-Image Translation

Our method also takes advantage of deep learning techniques for image generation. Efforts on image-to-image translation usually fall into two categories: domain-level translation and instance-level translation. It was first proposed in [27] to use a nonparametric texture model to learn the translation function from a single training image pair. More recent methods mainly focus on the translation function between two domains, each defined by a set of images. Many of them resort to conditional generative adversarial networks (cGANs) to synthesize images. Pix2Pix [28] was built on cGAN and used paired data between domains to learn the translation function. For many tasks, paired data are not available. To overcome this limitation, the cycle consistency constraint was proposed in CycleGAN [17] and DualGAN [2]. This constraint enforces that the two mappings, from domain A to B and from B to A, when applied consecutively to an image, revert the image back to itself; it regularizes training by reconstructing an original image from its translated image. StarGAN [29, 30] and ComboGAN [31] were then proposed to extend image-to-image translation from two domains to multiple domains based on cycle consistency. Another line of methods [1, 32] works not at the image level but at the feature level. They assume a shared latent space, but MUNIT [32] postulates that only part of the latent space should be shared rather than the full latent space proposed in UNIT [1].
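Concretely, with mappings $G: A \to B$ and $F: B \to A$, the cycle consistency constraint of CycleGAN is commonly written as the following reconstruction penalty (a standard form, included here for reference):

$$\mathcal{L}_{cyc}(G, F) = \mathbb{E}_{a \sim A}\big[\| F(G(a)) - a \|_1\big] + \mathbb{E}_{b \sim B}\big[\| G(F(b)) - b \|_1\big].$$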

However, these methods either cannot generate images of different styles, or are not suitable for our APDrawing generation task because of the lack of ability to describe facial features in detail.

2.3. Neural Style Transfer

Neural style transfer is closely related to image-to-image translation; it aims at preserving the content of an image while transferring the style of another image. Classic neural style transfer usually refers to example-guided style transfer, while image-to-image translation mainly refers to domain-based image translation. Neural style transfer was first proposed in image style transfer [18], which introduced a CNN to reproduce famous painting styles on natural images. It penalized differences between the high-level CNN features of the generated image and the content image, and used the Gram matrix statistics of CNN features to measure the style similarity.
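For reference, a standard form of this Gram-based style objective builds, at each chosen layer $l$, the Gram matrix of the reshaped feature map $\Phi^l \in \mathbb{R}^{C_l \times H_l W_l}$ and compares it between the generated image $\hat{y}$ and the style image $y_s$ (normalization constants omitted):

$$\mathcal{G}^{l}_{ij} = \sum_{k} \Phi^{l}_{ik} \Phi^{l}_{jk}, \qquad \mathcal{L}_{Gram} = \sum_{l} w_l \,\big\| \mathcal{G}^{l}(\hat{y}) - \mathcal{G}^{l}(y_s) \big\|_F^2.$$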

Many follow-up studies [4, 33–38] were conducted to either improve or extend the method. Some of these methods [20, 39] addressed the slow optimization process by training feed-forward neural networks. Different from these methods, the method of [19] replaced the Gram-based style modeling with a Markov random field (MRF) regularizer. Combining deep convolutional neural networks (DCNNs) with MRF-based texture synthesis can be applied to both photographic and nonphotorealistic synthesis tasks.

However, most methods model style as texture, which is not suitable for our task. With little texture information available, APDrawing generation requires a high degree of abstraction and, at the same time, completeness of some facial details in strokes.

3. Method

3.1. Overview

Our method is proposed for the APDrawing generation task, transferring the style of an APDrawing in domain $Y$ to an input photo in domain $X$. The APDrawings and the photos are denoted as $Y = \{y_i\}_{i=1}^{N_Y}$, where $y_i$ is the $i$-th APDrawing, and $X = \{x_j\}_{j=1}^{N_X}$, where $x_j$ is the $j$-th photo, respectively. $N_Y$ and $N_X$ are the numbers of APDrawings and photos in our training set.

As illustrated in Figure 2, the training process can be divided into two phases, i.e., extracting style features for unlabeled APDrawings and generating APDrawings with the desired styles. The first training phase uses unlabeled and unpaired APDrawings and introduces contrastive self-supervised learning to learn styles for them. The style bank is built up, and the style features are updated to their similarities with the style bank. The second phase uses a set of generators and discriminators. The generator $G$ and an inverse generator $F$ are included to generate vivid APDrawings from input photos and style features, and to transform APDrawings back to input photos without edge information loss, respectively. The discriminators consist of $D_Y$ and $D_X$ to guarantee the discrimination between generated fake images and real images in domains $Y$ and $X$.

Next, we will introduce details of our proposed method from the following aspects: (1) unsupervised style feature extraction and (2) unsupervised portrait generation.

3.2. Unsupervised Style Feature Extraction

Previous methods [21, 22] for the APDrawing generation task are either limited to a single style drawn by one artist or rely on a predefined division of drawing styles based on the corresponding artists. In fact, it is hard to divide the drawing styles of APDrawings into several specific categories. Meanwhile, there is a lack of public APDrawing datasets with several different drawing styles, and it is quite costly to collect such a large-scale labeled one. A new paradigm is required for the task, due to the high artistry of drawing styles and the scarcity of labeled data. It should be able to use unlabeled data and adapt to various drawing styles.

We introduce contrastive self-supervised learning to train our style extractor on unlabeled and unpaired APDrawings. It is capable of adopting self-defined pseudolabels as supervision and utilizing the learned style features in the subsequent APDrawing generation phase. As a discriminative approach, contrastive self-supervised learning aims at grouping similar samples closer and separating diverse samples far from each other, as shown in Figure 2. Specifically, the VGG19 network is adopted as our feature extractor, denoted as $E$. We pull augmented versions of the same sample close to each other while pushing away style features from different samples. The loss function is formulated as follows:

$$\mathcal{L}_{con}(y_i) = -\log \frac{\exp\big(\mathrm{sim}(E(y_i), E(\tilde{y}_i))/\tau\big)}{\exp\big(\mathrm{sim}(E(y_i), E(\tilde{y}_i))/\tau\big) + \sum_{k=1}^{M} \exp\big(\mathrm{sim}(E(y_i), E(y_k))/\tau\big)},$$

where $\tilde{y}_i$ is an augmented version of $y_i$, $\tau$ is a temperature coefficient, $M$ represents the number of negative samples of $y_i$ in a minibatch, and $\mathrm{sim}(\cdot, \cdot)$ measures the cosine similarity between two input vectors. In order to keep the style invariance of these APDrawings, we choose data augmentation methods including cropping, resizing, horizontal flipping, and rotation. In a minibatch, the original APDrawing, the transformed version of the original APDrawing, and the aligned version of the APDrawing together with its transformed version are positive samples to each other. In this self-supervised way, our feature extractor explores the underlying data structure of these APDrawings.
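As a concrete illustration of this phase, the following PyTorch sketch pairs style-preserving augmentations with an NT-Xent-style contrastive loss over a minibatch; the crop size, rotation range, temperature, and the exact truncation point of VGG19 are illustrative choices rather than the exact released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models, transforms

# Style-preserving augmentations (cropping, resizing, flipping, rotation).
augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(10),
    transforms.ToTensor(),
])

class StyleExtractor(nn.Module):
    """VGG19 truncated after the ReLU of its 13th conv layer (torchvision index 29)."""
    def __init__(self):
        super().__init__()
        self.backbone = models.vgg19(weights="IMAGENET1K_V1").features[:30]

    def forward(self, x):
        feat = self.backbone(x)           # (B, C, H, W)
        return feat.mean(dim=(2, 3))      # global average pool -> (B, C) style vector

def contrastive_loss(z_a, z_b, tau=0.1):
    """NT-Xent-style loss: z_a[i] and z_b[i] are two views of the same APDrawing;
    all other drawings in the minibatch serve as negatives."""
    z_a, z_b = F.normalize(z_a, dim=1), F.normalize(z_b, dim=1)
    logits = z_a @ z_b.t() / tau                          # cosine similarities / temperature
    targets = torch.arange(z_a.size(0), device=z_a.device)  # positives on the diagonal
    return F.cross_entropy(logits, targets)
```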

Considering the high dimensionality of the feature space and the scarcity of data, the model might not be able to learn a stable and robust mapping from APDrawings to style features. We therefore build up a style bank $B = \{b_k\}_{k=1}^{K}$, where $K$ is the bank size. It is obtained by clustering the style features of the unlabeled APDrawings with the K-means algorithm. On the one hand, it reduces the dimension of the style feature space and stabilizes the training process to obtain robust style features for APDrawings. On the other hand, the style bank can be viewed as a set of representative styles for style groups, which helps alleviate the negative impact of style-irrelevant information brought by contrastive self-supervised learning. Finally, the updated style feature of $y_i$ is computed as the cosine similarities between the style bank and the original style feature, which is written as follows:

$$s_i = \big[\mathrm{sim}(E(y_i), b_1), \mathrm{sim}(E(y_i), b_2), \ldots, \mathrm{sim}(E(y_i), b_K)\big],$$

where $s_i$ is a $K$-dimensional vector and $B$ represents the style bank. All updated style features make up the set $S = \{s_i\}_{i=1}^{N_Y}$.
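A minimal sketch of the style bank construction and the cosine-similarity update, assuming scikit-learn's KMeans over the extracted style vectors; the bank size of 10 follows the setting in Section 4.2, while the remaining details are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_style_bank(style_feats, bank_size=10):
    """Cluster all APDrawing style features; the cluster centers form the style bank."""
    kmeans = KMeans(n_clusters=bank_size, n_init=10, random_state=0)
    kmeans.fit(style_feats)                    # style_feats: (N, C) array of style vectors
    return kmeans.cluster_centers_             # (bank_size, C)

def update_style_feature(feat, bank):
    """Replace a raw style feature by its cosine similarities to the bank entries."""
    feat = feat / (np.linalg.norm(feat) + 1e-8)
    bank = bank / (np.linalg.norm(bank, axis=1, keepdims=True) + 1e-8)
    return bank @ feat                         # (bank_size,) similarity vector s_i
```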

3.3. Unsupervised Portrait Generation

With the updated style features as input, the portrait generation phase aims at transferring the style defined by an input style image to a face photo. During the generation process, two considerations need to be guaranteed: generation quality and style control. Generation quality requires generating a vivid portrait that preserves the facial features and is hard to distinguish from real drawings. Style control requires keeping the drawing style of the output consistent with the style input. The total loss function can be summarized in the following form:

$$\mathcal{L}_{total} = \mathcal{L}_{adv} + \lambda_{cyc}\,\mathcal{L}_{cyc} + \lambda_{trunc}\,\mathcal{L}_{trunc} + \lambda_{style}\,\mathcal{L}_{style},$$

where the $\lambda$ terms balance the adversarial, cycle consistency, truncation, and style losses introduced below.

3.3.1. Generation of Vivid Portraits

There are two generators, $G$ and $F$, using the architecture of an autoencoder with residual blocks. The discriminator set $D_Y$ is based on PatchGAN [28]. It involves a global discriminator, which avoids information loss in the holistic characteristics, and a set of local discriminators for the fine details in facial regions. $G$ and $D_Y$ are used in the generation from input photos to APDrawings, while $F$ and $D_X$ are optimized for the opposite direction. They are trained with the adversarial loss, the asymmetric cycle consistency loss, and the truncation loss [22]. The adversarial loss for the photo-to-drawing direction is formulated as follows:

$$\mathcal{L}_{adv}(G, D_Y) = \mathbb{E}_{y \sim Y}\big[\log D_Y(y)\big] + \mathbb{E}_{x \sim X,\, s \sim S}\big[\log\big(1 - D_Y(G(x, s))\big)\big],$$

while the relaxed forward cycle consistency constrains the reconstructed photo $F(G(x, s))$ to stay close to the input photo $x$, and the truncation loss follows the asymmetric cycle mapping of [22].
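The sketch below illustrates one generator update for the photo-to-drawing direction in a simplified form: a non-saturating adversarial term and an L1 forward cycle stand in for the full objective, the truncation term of [22] is omitted, and the interfaces of $G$, $F$ (here `G_inv`), and $D_Y$ as well as the cycle weight are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def generator_step(G, G_inv, D_Y, photo, style, lambda_cyc=10.0):
    """One photo -> drawing update: adversarial term plus relaxed forward cycle.
    G(photo, style) produces a drawing; G_inv maps drawings back to photos."""
    fake_drawing = G(photo, style)
    logits = D_Y(fake_drawing)
    # Fool the drawing discriminator (non-saturating GAN loss).
    adv = F.binary_cross_entropy_with_logits(logits, torch.ones_like(logits))
    # The reconstructed photo should stay close to the input photo.
    cyc = F.l1_loss(G_inv(fake_drawing), photo)
    return adv + lambda_cyc * cyc
```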

In other style transfer tasks, color information is quite important for style modeling, while it is irrelevant to our task. We propose to transform an input image to a gray-scale one before sending it to the generation network, as sketched below. This enforces the network to focus on line strokes and shadow usage and brings balance to the training of the second cycle. Besides, we decouple the two cycles by sending different pairs of inputs, including APDrawings, photos, and style features; the richer combinations of input pairs are more conducive to generation.
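The gray-scale preprocessing can be sketched with torchvision transforms; replicating the single gray channel to three channels (an illustrative choice) keeps the generator input interface unchanged.

```python
from torchvision import transforms

# Strip color before the generators so the networks attend to line strokes
# and shading rather than hue; three output channels preserve the interface.
to_gray = transforms.Compose([
    transforms.Grayscale(num_output_channels=3),
    transforms.ToTensor(),
])
```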

3.3.2. Generation of Portraits for Specific Styles

The style classification network, denoted as $C$, shares the first few blocks with the global discriminator of $D_Y$. In order to achieve the purpose of style control, the style loss is formulated as follows:

$$\mathcal{L}_{style} = \mathbb{E}_{y \sim Y}\big[\ell\big(C(y), s_y\big)\big] + \mathbb{E}_{x \sim X,\, s \sim S}\big[\ell\big(C(G(x, s)), s\big)\big],$$

where $s_y$ denotes the updated style feature of the real APDrawing $y$ computed in the first phase and $\ell(\cdot, \cdot)$ measures the distance between the predicted and the target style features.

For a real APDrawing $y$, $C$ outputs a predicted style feature that should get close to its style feature $s_y$ computed in the first style feature extraction phase. For a generated APDrawing $G(x, s)$, its target style feature is specified by the input style feature $s$. The style loss guides $C$ to produce style features close to the real feature distribution and guides $G$ to generate APDrawings with the desired styles.
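A minimal sketch of this style-control term; the L1 distance and the interface of $C$ are illustrative assumptions rather than the exact implementation.

```python
import torch.nn.functional as F

def style_control_loss(C, G, real_drawing, real_style, photo, input_style):
    """Pull C's prediction on a real drawing toward its precomputed style feature,
    and pull the generated drawing's predicted style toward the requested style."""
    loss_real = F.l1_loss(C(real_drawing), real_style)
    loss_fake = F.l1_loss(C(G(photo, input_style)), input_style)
    return loss_real + loss_fake
```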

4. Experiments

4.1. Datasets

Although the training set of APDrawings used in the method [22] has not been released, we have collected a similar number of APDrawings to train our method. Due to the lack of public datasets with multiple styles, the APDrawings are crawled from the Internet to construct an APDrawing dataset of 641 images, named APDrawingCrawl. It consists of 116 APDrawings by the artist Charles Burns, 45 APDrawings by the artist Yann Legendre, 89 APDrawings by the artist Kathryn Rathke, 233 APDrawings from vectorportral.com, and another 158 APDrawings without tagged artist or source information. The 641 APDrawings and 1,000 face images from CELEBA-HQ [40] form our training set. The testing set consists of 200 face photos, mainly from CELEBA-HQ.

All training images, including APDrawings and face photos, are resized to the same resolution. For training the local discriminators in $D_Y$, the training images are aligned using facial landmarks, and face parsing is performed with the BiSeNet model [41].

4.2. Implementation Details
4.2.1. Extraction of Style Features

The CNNs for image feature extraction consist of three Conv-BatchNorm-ReLU blocks with two 1/2-scale downsampling operations. For the transformer, the feature dimension is 256, and both the encoder and the decoder have 3 layers. We use the ReLU output of the 13th convolutional layer of VGG19 as the style feature. We set the initial learning rate to 0.001 for the first 10 epochs and decay it by a factor of 10 after the 10th and 25th epochs. The total number of training epochs is 30. The batch size is fixed to 16, and the bank size is 10.
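Under this schedule (a 10x decay after the 10th and 25th epochs over 30 epochs in total), the setup can be sketched with PyTorch's MultiStepLR; the extractor below is only a placeholder module and the optimizer choice is an illustrative assumption.

```python
import torch
import torch.nn as nn

extractor = nn.Sequential(nn.Conv2d(3, 64, 3), nn.ReLU())  # placeholder for the style extractor
optimizer = torch.optim.Adam(extractor.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[10, 25], gamma=0.1)

for epoch in range(30):                 # 30 training epochs in total
    # ... one epoch of contrastive training with batch size 16 ...
    scheduler.step()                    # learning rate decays by 10x at the milestones
```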

4.2.2. Generation of APDrawings

The hyperparameters in equation (5) are scheduled according to the current epoch and the total number of training epochs. The learning rate is set to 1.5e−5 for the first 100 epochs and is linearly decayed to 0 over the next 100 epochs.

4.3. User Study

To evaluate the effectiveness of our method, we conduct a user study comparing with CycleGAN [17] and the method of [22]. We randomly sample 30 face photos from our test set and transform them to three different styles: given a style example image, every 10 photos are transformed to that style, and there are 3 style example images in total. There are 50 participants in the user study, and each of them is shown all 30 groups of images, resulting in 1,500 votes. Each participant is given an input photo, a style example image, and the drawings generated by the three methods at the same time. The voting criteria are image integrity, image quality, style preservation, and facial characteristic similarity.

As shown in Table 1, our method achieves performance comparable to the state-of-the-art method [22], which is in line with expectations. Our method aims at solving the problem of APDrawing generation with unlabeled and unpaired training data, preserving both the input style and the content of the face photo. The method [21] requires paired data, and its drawing style is thereby limited to a single one. The method [22] further increases the number of generation styles to three by collecting more labeled data. However, both suffer from limited labeled data and a naive manual division of drawing styles according to the corresponding artists, which does not conform to actual situations.

Besides, it is genuinely hard for unsupervised methods to obtain clearly better generation quality than supervised methods without restricted classification categories and abundant labeled data for each class. Therefore, we emphasize the superiority of our method in its higher flexibility and scalability for different style inputs and its lower dependence on abundant labeled data.

4.4. Qualitative Results

We conduct qualitative analysis on both the style feature extraction phase and the APDrawing generation phase to demonstrate the effectiveness of our method.

4.4.1. Style Feature Extraction

We use K-means to divide the style images into five clusters and visualize the learned style features with t-SNE. As shown in Figure 3, our style extractor is able to learn well-separated features. However, the boundaries between different clusters remain unclear in places, which is expected and can be explained. From the very beginning, we do not intend to manually divide the styles, which is why the style bank is introduced: the style clustering is only used to generate the style bank, and the unseparated samples demonstrate the inapplicability of simple divisions. Due to the high dimensionality of the feature space and the scarcity of data, it is hard for the model to learn robust style features. With the dimensionality constraint brought by the style bank, the learning process can be stabilized.
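The visualization can be reproduced with a short scikit-learn sketch; the cluster count of five follows the text, while the t-SNE perplexity and plotting details are illustrative choices.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

def visualize_styles(style_feats, n_clusters=5):
    """Cluster the learned style features and project them to 2-D with t-SNE."""
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(style_feats)
    points = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(style_feats)
    plt.scatter(points[:, 0], points[:, 1], c=labels, cmap="tab10", s=12)
    plt.title("t-SNE of learned APDrawing style features")
    plt.savefig("style_tsne.png", dpi=200)
```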

As shown in Figure 4, we present the portrait drawings nearest to the cluster centers (the style bank) of all APDrawing styles. These drawings can be viewed as prototype style images for each style in the style bank. Judging from the line strokes and shadow usage of these APDrawings, our method is able to distinguish various drawing styles, and the prototypes indeed come from diverse artists, i.e., Yann Legendre, vectorportral.com, Kathryn Rathke, Charles Burns, and an unknown artist.

4.4.2. APDrawing Generation

We compare our method with the method of Yi et al. [22] designed for the same APDrawing task, two example-guided neural style transfer methods, image style transfer [18] and linear style transfer [5], and an unpaired image-to-image translation method, ComboGAN [31].

As shown in Figure 5, we make a comparison with Yi et al. [22] to demonstrate the limitation of relying on a manual, artist-based division of styles. We list three input styles and the corresponding generation results. The first two style input images are from the same artist, Kathryn Rathke. According to the style division strategy defined in Yi et al. [22], the outputs transferred from style1 and style2 should be the same, while those from style2 and style3 should be obviously different. However, we can easily distinguish different drawing techniques between the style1 and style2 inputs: the style1 input image is distinguished by a large area of gray, thick shadow on the face, while the style2 input image is not. Meanwhile, the style2 and style3 inputs share a similar usage of dark regions. The manual style division by artist is apparently not suitable in this situation, resulting in confusion in the generated results of Yi et al. [22]. Compared to their method, ours is more flexible in generating APDrawings with the desired drawing styles.

As shown in Figure 6, we make a comparison with the two example-guided methods and the state-of-the-art method of Yi et al. [22]. It can be easily seen that the two neural style transfer methods either fail to capture the differences in input styles or cannot generate images of acceptable quality. Yi et al. [22] can generate drawings with three distinct styles, but our method can generate drawings according to different input styles, with higher flexibility. Our first generation result uses parallel lines to draw shadows, the second tends to use clean lines, and the third uses large dark regions.

As shown in Figure 7, we make a comparison with an image-to-image translation method and the state-of-the-art method. ComboGAN fails to generate vivid APDrawings of good quality; for example, there are stripes and undesired gray color on the face. Yi et al. [22] can generate discriminative results for three fixed styles. Despite the lack of label information, our model can still capture the differences between styles: the generated images are consistent within a single style while differing from images in other styles.

4.5. Ablation Study

We conduct ablation studies as follows: (1) generation without using the style bank and (2) generation without gray-scale inputs, i.e., using the original color inputs.

As shown in Figure 8(c), only the center areas have visible traces of the generated drawing, and the hair disappears, whereas Figure 8(e) shows finer details in the cap and the hair. Without the style bank, our method suffers from the negative influence of the redundant information accompanying the style features, which may leave blank regions outside a fixed area or cause other undesired artifacts in the generated drawings. The style bank is introduced to eliminate the interference of style-irrelevant factors and keep the network focused on the key factors related to styles, such as line strokes. The model with the style bank guarantees the details and completeness of the drawings.

As shown in Figure 8(d), the color changes on the face lead to extra or erroneous lines. With color inputs, the results become more sensitive to areas with color changes, which is not desired in the APDrawing task. Gray-scale inputs are therefore introduced to avoid being affected by task-irrelevant color information.

5. Conclusion

In this paper, we propose to perform the APDrawing generation task in an unsupervised manner, which avoids both the difficulty of collecting large-scale labeled data and the irrationality of dividing drawing styles into a few specific categories. We introduce contrastive self-supervised learning to learn free styles of APDrawings by treating each drawing as a single style. The style bank and the corresponding decoupled cycle structure guarantee the generation quality and style control of the output APDrawings. Experiments show the flexibility and scalability of our method in generating APDrawings with different styles, which is more adaptable than other state-of-the-art methods. In future work, we will investigate how to improve the realism and faithfulness of results for low-quality original photos, such as those with blurry textures, which currently cause noisy and messy lines and fail to preserve fine details.

Data Availability

Previously reported (CELEBA-HQ) data were used to support this study and are available at https://doi.org/10.1007/978-3-030-01261-8_20. These prior studies (and datasets) are cited at relevant places within the text as references [41].

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China under Grants 62171057, 62101064, 62201072, 62071067, and 62001054, in part by the Ministry of Education and China Mobile Joint Fund (MCM20200202), and in part by the Beijing University of Posts and Telecommunications-China Mobile Research Institute Joint Innovation Center.