Abstract

Image advertising is widely used by companies to promote their products and increase brand awareness. With the constant development of image generation techniques, automatic compositing of advertisement images has also been widely studied. However, existing algorithms cannot synthesise consistent-looking advertisement images for a given product. The key challenge is to stitch a given product into a scene that matches its style while preserving its appearance. To solve this problem, this paper proposes a new two-stage automatic advertisement image generation model, the Advertisement Synthesis Network (ASNet), which explores a two-stage generation framework to synthesise consistent-looking product advertisement images. Specifically, ASNet first generates a preliminary target product scene using Pre-Synthesis, then extracts scene features using the Pseudo-Target Object Encoder (PTOE) and real target features using the Real Target Object Encoder (RTOE). Finally, we inject the acquired features into a pretrained diffusion model and reconstruct the target product within the preliminarily generated scene. Extensive experiments show that our method outperforms other methods on all three performance metrics related to the quality of the synthesised images. In addition, we conducted a preliminary study on the effect of synthetic advertisement images on real consumers’ purchase intention and brand perception. The results show that advertisement images synthesised by the proposed model have a positive impact on both.

1. Introduction

As the economy has evolved, the commodity market has moved from competing on products to competing on brands. A company’s products alone no longer determine its competitive advantage; branding and marketing have instead become essential for standing out amid product homogenisation. Advertising is therefore one of the most effective and necessary marketing tools [1].

It has been found that vivid and intuitive pictures can positively influence consumers’ purchasing decisions [2]. A well-designed product promotional image can showcase a product’s characteristics through different scenes, inspiring consumers to buy and changing their attitudes towards and impressions of the product. Most scholars therefore agree that the visual design of product promotional images directly affects advertisement effectiveness [3]. However, designing and producing such images requires a large investment of time and money, while enterprises always want to produce attractive advertisements at low cost to stimulate purchases.

To reduce the labour cost of creating these advertisements, technologies that support automatic advertisement synthesis have received considerable attention, including the automatic assembly of graphical elements using aesthetic principles [4] and the simultaneous creation of a series of banners for different display sizes [5]. However, both of these methods generate advertisements by simply splicing together elements such as product images, text, and brand logos, and the resulting images are not attractive enough to consumers.

Fortunately, with the diffusion model [6], built on a hierarchy of denoising autoencoders, current image synthesis techniques [7, 8] have achieved impressive results. They not only make it possible to synthesise highly creative and artistic advertisement images but also greatly reduce the design cost of such images, bringing a revolutionary change to the advertisement design industry.

More specifically, suppose we receive an order from a fruit company that wants us to create several product advertisements to promote the cherries and watermelons they sell. All we need to do is to write a reasonable prompt text and feed it into the Stable Diffusion (SD) model for image synthesis [9], and we can easily obtain a series of vivid images of fruit products (as shown in Figure 1).
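For illustration, a minimal sketch of such a text-to-image step using the open-source diffusers library is given below; the checkpoint name, prompt, and sampling settings are our own assumptions rather than settings used in the paper.

```python
# Minimal text-to-image sketch with Hugging Face diffusers.
# The checkpoint and prompt are illustrative assumptions, not the paper's settings.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = ("fresh cherries and a sliced watermelon on a wooden table, "
          "bright studio lighting, commercial product photography")
image = pipe(prompt, num_inference_steps=50, guidance_scale=7.5).images[0]
image.save("fruit_ad.png")
```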

However, existing image synthesis algorithms can only be applied to the production of advertisement images for generic target products (e.g., various types of fruit) and cannot directly generate advertisement images for a specific target product (e.g., a specific brand of sports shoes). For example, suppose we need to make a promotional image for a Converse sports shoe; directly writing prompts and feeding them to the SD model can only generate an advertisement image whose product merely resembles the target object (as shown in Figure 2). Obviously, such a product advertisement image cannot be used for product promotion and publicity. Although there is a strong need, this topic has not been well explored by previous researchers.

Therefore, we propose the Advertisement Synthesis Network (ASNet) in this paper to solve this challenge. Different from previous methods, ASNet is capable of generating consistent-looking, high-quality product advertisement images for the input target object in a zero-shot manner. Here, consistent-looking means that the appearance details of the target object are fully preserved when ASNet generates advertisement images, which is ASNet’s biggest advantage.

To achieve this, we utilise a two-stage generation structure in ASNet. Specifically, we first generate a pseudo-product advertisement image using the SD-based Pre-Synthesis model. The product shown in this pseudo-product advertisement image has appearance characteristics similar to the target product.

Then, we use PTOE to extract scene features and RTOE to extract real target features. Finally, we inject these features into the pretrained diffusion model, where they interact, and reconstruct the real target product within the scene of the pseudo-product advertisement image.
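The flow of these two stages can be summarised in a short sketch; all callables below are hypothetical stand-ins for the components described above, not the actual implementation.

```python
# High-level sketch of the two-stage ASNet pipeline (hypothetical interfaces).
from typing import Callable
from PIL import Image

def asnet_generate(
    prompt_text: str,
    target_product_image: Image.Image,
    pre_synthesis: Callable,          # stage 1: SD text-to-image model
    ptoe: Callable,                   # Pseudo-Target Object Encoder (scene features)
    rtoe: Callable,                   # Real Target Object Encoder (object detail features)
    diffusion_reconstruct: Callable,  # conditioned diffusion model + decoder
) -> Image.Image:
    """Two-stage generation: pre-synthesise a scene, then re-inject the real product."""
    pseudo_ad = pre_synthesis(prompt_text)          # stage 1: pseudo advertisement
    scene_feats = ptoe(pseudo_ad)                   # scene features from the pseudo ad
    object_feats = rtoe(target_product_image)       # detail features from the real product
    return diffusion_reconstruct(scene_feats, object_feats)
```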

In sum, our work makes the following contributions:

(1) We propose a novel Advertisement Synthesis Network for the automatic generation of advertisement images for a given product. ASNet is a two-stage, end-to-end model that takes prompt text and target object images as inputs and synthesises consistent-looking product advertisement images. To the best of our knowledge, ASNet is the first fully automated advertisement image generation model requiring no manual intervention.

(2) Compared with state-of-the-art image generation models, we obtain superior advertisement image synthesis results on test data. We believe that the two-stage generation protocol used in this paper breaks the paradigm of existing advertisement image synthesis methods and can provide a generic solution idea for similar tasks.

2. Related Work

2.1. Generative Models for Image Synthesis

Image generation has long been among the most challenging tasks in computer vision. Early Generative Adversarial Networks (GANs) [10, 11] can sample and generate high-resolution images, but they are difficult to optimise [12–14] and struggle to capture the complete data distribution [15]. In contrast, Variational Autoencoders (VAEs) [16] and flow-based models [17, 18] are easier to optimise [19–21], but the quality of the images they generate is lower than that of GAN-based models.

Recently, diffusion models (DMs) [6] have achieved state-of-the-art synthesis results on image data and beyond [22, 23] by decomposing the image formation process into a sequential application of denoising autoencoders. The subsequently proposed latent diffusion models (LDMs) [9] achieve a new state of the art for image inpainting and highly competitive performance on various tasks, including unconditional image generation, semantic scene synthesis, and super-resolution, while significantly reducing computational requirements compared with pixel-based DMs.

Diffusion model-based image generation methods have shown great promise, beating GAN-based methods in generation diversity, and have brought unprecedented changes to image synthesis.

2.2. Advertisement Image Synthesis Model

Traditional methods for automatic advertisement image generation typically use graphical design strategies that are driven by design rules or structured data.

O’Donovan et al. [24] proposed that an energy function can be constructed by assembling various heuristic visual cues and design principles for optimising single-page layouts, and extended it into an interactive tool for the automatic generation of advertisement images. Yang et al. [24] proposed a system that generates visual-text presentation layouts for advertisement images, in which colours are automatically determined with the help of a colour harmony model and a colour tone model, and theme colours are defined by the designer. Liu et al. [25] introduced Luban, an intelligent banner production tool that can automatically synthesise banners for different commodities.

With the recent success of deep-learning-based GAN and SD models for image generation [26, 27], they have been widely studied in the field of advertisement synthesis. However, although these methods can generate realistic and natural-looking images, they are still rarely used for the automatic generation of advertisement images because it is difficult to find suitable data pairs for supervised learning. To address this problem, You et al. [28] created a dataset containing 13,280 advertisement images with rich annotations, including the outline and colour of the elements and the category and target of the advertisement, and constructed a new probabilistic model to guide the synthesis of advertisement style. The aim is to use a data-driven approach to capture the relationships between individual design attributes and elements in an advertisement image and to automatically compose the input elements into an advertisement image with a specified style.

3. Method

3.1. Overview of the ASNet

The Advertisement Synthesis Network (ASNet) pipeline is shown in Figure 3. It is capable of generating high-quality and creative advertisement images after inputting specific target products and well-designed prompt texts. Unlike traditional methods, the ASNet proposed in this paper is able to synthesise advertisement images that match the appearance of the target product in an end-to-end manner without manual intervention.

Our core idea in building ASNet is to first generate a preliminary target product scene using Pre-Synthesis, then extract representative scene features using the Pseudo-Target Object Encoder (PTOE) and real target object features using the Real Target Object Encoder (RTOE). Finally, these features are injected into the pretrained diffusion model and recombined within the initially generated target product scene.

3.2. Pre-Synthesis

The Pre-Synthesis (PS) module is built on the Stable Diffusion model and is used to initially generate advertisement image scenes for the target product in a text-to-image manner. Mathematically, PS is described as

$$\hat{I}_{pre} = f^{PS}_{\theta}(c_{text}),$$

where $f^{PS}_{\theta}$ denotes the PS with network parameters $\theta$, $c_{text}$ is the prompt text, and $\hat{I}_{pre}$ is the initially generated advertisement image. It is worth noting that the initially generated advertisement image does not have the appearance of the target product; we only use its generated advertisement scene for the secondary generation.

3.3. Target Object Encoder

The Target Object Encoder (TOE) module is shown in Figure 3. It extracts rich feature details and scene information from the input images for the secondary synthesis of target product advertisement images. TOE consists of PTOE and RTOE: PTOE extracts scene features from the pseudo-advertisement image, and RTOE extracts detailed features of the real product image.

The network architecture of RTOE consists of a self-supervised model, DINOv2 [29], for feature extraction, followed by a single linear layer $T(\cdot)$ for fine-tuning.

The input to the RTOE module is a target product image without a background, $I_{obj}$. Product images without backgrounds help RTOE obtain cleaner and less ambiguous features in the feature extraction phase. After receiving the real target product image, RTOE encodes it and fine-tunes the encoded features to finally obtain a spatially aligned feature $F_{real}$, which is mathematically described as

$$F_{real} = T\big(\mathrm{DINO}(I_{obj})\big).$$
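As a rough illustration of such an encoder, the sketch below pairs a frozen DINOv2 backbone (loaded via torch.hub) with a single trainable linear layer; the model variant, feature dimension, and patch-token output key are assumptions on our part.

```python
# Sketch of an RTOE-style encoder: frozen DINOv2 features + one trainable linear layer.
# Backbone variant and feature dimension (ViT-B/14, 768) are illustrative assumptions.
import torch
import torch.nn as nn

class RTOE(nn.Module):
    def __init__(self, out_dim: int = 768):
        super().__init__()
        # Frozen self-supervised backbone (DINOv2 via torch.hub).
        self.backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14")
        for p in self.backbone.parameters():
            p.requires_grad = False
        # Single linear layer T(.) used for fine-tuning / alignment.
        self.proj = nn.Linear(768, out_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: background-free product image, (B, 3, H, W) with H, W divisible by 14.
        feats = self.backbone.forward_features(x)["x_norm_patchtokens"]  # (B, N, 768)
        return self.proj(feats)                                          # aligned features F_real
```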

However, the feature information of the real target product image alone is not enough to generate an advertisement image. We also need additional guidance to supply the scene information. Therefore, we construct a PTOE to extract scene information from the pre-generated pseudo-advertisement image $\hat{I}_{pre}$ using a ControlNet-style [30] network that produces detailed feature information at hierarchical resolutions. This process is expressed as

$$F_{scene} = \mathrm{PTOE}(\hat{I}_{pre}),$$

where $F_{scene}$ denotes the scene features extracted from the pseudo-advertisement image.
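The PTOE can likewise be sketched as a ControlNet-style encoder: a small convolutional pyramid over the pseudo-advertisement image whose outputs pass through zero-initialised convolutions, so that the injected scene features have no effect at the start of training. Channel counts and depths below are illustrative assumptions, not the paper's architecture.

```python
# Sketch of a ControlNet-style PTOE emitting scene features at several resolutions.
import torch
import torch.nn as nn

def zero_conv(channels: int) -> nn.Conv2d:
    # Zero-initialised 1x1 conv: contributes nothing until it is trained.
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv

class PTOE(nn.Module):
    def __init__(self, channels=(64, 128, 256, 512)):
        super().__init__()
        self.stem = nn.Conv2d(3, channels[0], 3, padding=1)
        self.stages, self.outs = nn.ModuleList(), nn.ModuleList()
        in_ch = channels[0]
        for ch in channels:
            self.stages.append(nn.Sequential(
                nn.Conv2d(in_ch, ch, 3, stride=2, padding=1), nn.SiLU(),
                nn.Conv2d(ch, ch, 3, padding=1), nn.SiLU(),
            ))
            self.outs.append(zero_conv(ch))
            in_ch = ch

    def forward(self, pseudo_ad: torch.Tensor):
        # Returns a list of scene feature maps at hierarchical resolutions.
        x, feats = self.stem(pseudo_ad), []
        for stage, out in zip(self.stages, self.outs):
            x = stage(x)
            feats.append(out(x))
        return feats
```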

3.4. Feature Injection

After obtaining $F_{scene}$ and $F_{real}$, we stitch them together to synthesise an advertisement image of the real target product. We inject them into a pretrained text-to-image diffusion model, where the image is sampled by the UNet and projected into the latent space of the Stable Diffusion model to guide the image synthesis.

We denote the sampling (denoising) function of the UNet model as $\epsilon_{\theta}$; it starts denoising from an initial latent noise $z_T \sim \mathcal{N}(0, I)$, takes $F_{scene}$ and $F_{real}$ as conditions to generate the new image latent $z_0$, and uses the decoder $\mathcal{D}$ to produce the real target product advertisement image that we ultimately need:

$$z_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(z_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_{\theta}(z_t, t, F_{scene}, F_{real})\right) + \sigma_t \epsilon, \qquad I_{adv} = \mathcal{D}(z_0),$$

where $t$ is the diffusion time step and $\alpha_t$ and $\sigma_t$ are denoising hyperparameters.
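A minimal sketch of this conditioned sampling loop, in DDPM-style notation and assuming the UNet, noise-schedule tensors, and decoder are given externally:

```python
# Sketch of conditioned DDPM-style sampling; schedule tensors are 1-D torch tensors.
import torch

@torch.no_grad()
def sample_advert(eps_theta, decoder, scene_feats, object_feats,
                  alphas, alphas_cumprod, sigmas, latent_shape, device="cuda"):
    z = torch.randn(latent_shape, device=device)          # z_T: initial latent noise
    for t in reversed(range(len(alphas))):
        t_batch = torch.full((latent_shape[0],), t, device=device, dtype=torch.long)
        # Predict the noise conditioned on scene and real-object features.
        eps = eps_theta(z, t_batch, scene_feats, object_feats)
        # Standard DDPM update towards z_{t-1}.
        z = (z - (1 - alphas[t]) / torch.sqrt(1 - alphas_cumprod[t]) * eps) \
            / torch.sqrt(alphas[t])
        if t > 0:
            z = z + sigmas[t] * torch.randn_like(z)
    return decoder(z)                                      # final advertisement image
```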

3.5. Loss Function

We employ the mean square error to construct a loss function that facilitates the training of the network:

$$\mathcal{L} = \mathbb{E}_{z_0,\, \epsilon \sim \mathcal{N}(0, I),\, t}\left[\left\|\epsilon - \epsilon_{\theta}(z_t, t, F_{scene}, F_{real})\right\|_2^2\right].$$
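A corresponding sketch of this objective, assuming the standard noise-prediction (epsilon-prediction) formulation used by Stable Diffusion; the schedule tensor and encoders are assumed to be given.

```python
# Sketch of the MSE training objective on the predicted noise.
import torch
import torch.nn.functional as F

def diffusion_mse_loss(eps_theta, z0, scene_feats, object_feats, alphas_cumprod):
    b = z0.shape[0]
    t = torch.randint(0, len(alphas_cumprod), (b,), device=z0.device)
    noise = torch.randn_like(z0)
    ac = alphas_cumprod[t].view(b, 1, 1, 1)
    z_t = torch.sqrt(ac) * z0 + torch.sqrt(1 - ac) * noise     # forward diffusion
    pred = eps_theta(z_t, t, scene_feats, object_feats)        # conditioned prediction
    return F.mse_loss(pred, noise)                             # mean square error
```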

4. Experiments

4.1. Network Model Parameter Setting and Evaluation Metrics
4.1.1. Implementation Details and Hyperparameters

The models covered in this paper were implemented using the PyTorch framework, and the models were trained and tested using four GeForce RTX 3090Ti GPUs. During training, we processed the image resolution to 512 × 512. We set the initial learning rate to and the optimiser to Adam.
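For reference, a minimal sketch of such a training configuration; the learning-rate value and the stand-in module below are placeholders of our own, since the paper does not state the exact rate.

```python
# Illustrative training configuration only; learning rate and module are placeholders.
import torch
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize((512, 512)),   # images are processed at 512 x 512
    transforms.ToTensor(),
])

trainable = torch.nn.Linear(768, 768)                          # placeholder for ASNet's trainable parts
optimizer = torch.optim.Adam(trainable.parameters(), lr=1e-4)  # placeholder learning rate
```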

4.1.2. Training Dataset Construction

Our proposed ASNet is a two-stage model that first generates a suitable advertisement scene and then stitches the obtained scene with the target product. The process must ensure that the target product remains consistent-looking, so the ideal training data for ASNet are image pairs of “the same object in different scenes,” but such pairs cannot be constructed directly from existing datasets. To solve this problem, we use video datasets to capture different frames containing the same object. In detail, we select two adjacent frames from a video and extract the mask of the foreground object. We obtain the target object from the earlier frame via its foreground mask; from the later frame, we obtain the remaining background image by masking out the foreground object. Through this series of operations we acquire the target object and the scene image, and the original frames serve as the ground truth of the data pairs, as sketched below. The list of raw videos used to extract the image pairs is shown in Table 1; it encompasses many kinds of scenes, which helps improve the generalisation ability of the model.
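A small sketch of this pair-construction step, assuming per-frame foreground masks are available; the array shapes and helper name are our assumptions.

```python
# Sketch of building "same object, different scene" training triples from two
# adjacent video frames with foreground masks.
import numpy as np

def make_training_pair(frame_prev: np.ndarray, frame_next: np.ndarray,
                       mask_prev: np.ndarray, mask_next: np.ndarray):
    """frames: (H, W, 3) uint8; masks: (H, W) boolean foreground masks."""
    # Target object: foreground of the earlier frame, background removed.
    target_object = frame_prev * mask_prev[..., None]
    # Scene image: the later frame with its foreground object masked out.
    scene = frame_next * (~mask_next)[..., None]
    # The untouched later frame serves as the ground truth of the pair.
    ground_truth = frame_next
    return target_object, scene, ground_truth
```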

4.1.3. Baseline

To the best of our knowledge, this paper is the first to propose an end-to-end approach for generating an advertisement image for a specific product, so there are no directly comparable algorithms. We therefore used three approximate approaches for the comparison experiments. (1) Advertisement image synthesis for the target product using the reference-image mode of Midjourney [37]: this approach takes a background-free image of the target product and a set of prompt texts as input and combines them to generate an advertisement image with the characteristics of the target product. (2) Combining the text-to-image and image-to-image modes of the Stable Diffusion model to synthesise an advertisement image of the target product. (3) DALL·E 3, a powerful image synthesis model that can generate images with a high degree of consistency and coherence.

Specifically, for the second baseline, we first use the text-to-image mode of Stable Diffusion to generate an advertisement image. Then, we feed this advertisement image together with the background-free image of the target product into the image-to-image mode to finally synthesise the advertisement image of the target product (a sketch is given below).
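One possible reading of this baseline, sketched with the diffusers text-to-image and image-to-image pipelines; the checkpoint, prompt, placement, and strength values are illustrative assumptions, not the paper's settings.

```python
# Sketch of the SD baseline: text-to-image scene, rough composite of the
# background-free product, then an image-to-image pass to blend the composite.
import torch
from diffusers import StableDiffusionPipeline, StableDiffusionImg2ImgPipeline
from PIL import Image

model_id = "runwayml/stable-diffusion-v1-5"
txt2img = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")
img2img = StableDiffusionImg2ImgPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")

prompt = "a pair of canvas sneakers on a city street, advertising photography"
scene_ad = txt2img(prompt).images[0]                      # step 1: text-to-image scene

product = Image.open("product_no_background.png").convert("RGBA").resize((256, 256))
composite = scene_ad.convert("RGBA")
composite.paste(product, (128, 192), mask=product)        # rough, illustrative placement
composite = composite.convert("RGB")

baseline_ad = img2img(prompt=prompt, image=composite,     # step 2: image-to-image blend
                      strength=0.5).images[0]
baseline_ad.save("baseline_ad.png")
```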

4.1.4. Evaluation Metrics

We observe in Figure 4 that the proposed model is capable of synthesising complex, realistic images. In general, traditional performance metrics such as FID [38] can be used to evaluate the quality of the images generated by a model. However, FID scores do not always agree with actual human judgement [39]. To better measure the generative capacity of our system, we introduce systematic human evaluations to quantitatively evaluate the model. These cover three performance metrics: photorealism [40], caption similarity [41], and sample diversity [39].

For the photorealism metric, users are asked to score the advertisement images synthesised by the different methods, and images that look more realistic should receive higher scores. For caption similarity, users score the advertisements against the corresponding caption, and images that match the caption better receive higher scores.

Similarly, for sample diversity, users are asked to score the diversity of the four synthetic advertisement images generated by the different models, with more diverse advertisement images receiving higher scores.

4.2. Experiment Data

The ASNet model proposed in this paper generates corresponding advertisement images for a given product image. In the process, ASNet first generates pseudo-advertisement images using the prompt text corresponding to the target product. After that, we input the background-free target product image to correct the information and finally synthesise an advertisement image that is consistent with the target product.

In order to demonstrate more intuitively the practical application effect of the proposed algorithm in this paper, we randomly selected background-free images of four typical commercial products (shown in Figure 5) and designed corresponding prompt texts for them as the basic input data in the experiment (shown in Table 2).

4.3. Experiment Result and Analysis
4.3.1. Quantitative Analysis

The main goal of our work is to synthesise end-to-end advertising images of the target products. In order to verify the validity of the work in this paper, we tested the effect of advertisement image synthesis on four different target products.

Table 3 shows the results of the systematic human evaluations, in which the values of the evaluation metrics photorealism, caption similarity, and sample diversity obtained by the proposed ASNet are higher than those of the other algorithms.

4.3.2. Qualitative Analysis

Figure 4 shows the visualisation results comparing our method with the other methods. From visual inspection of the experimental results, the advertisement images synthesised by each method are clear, reasonable, and aesthetically pleasing.

However, when we compare the target product image with each synthesised advertisement image one by one, we can clearly see that neither the images synthesised by Stable Diffusion nor those by Midjourney are consistent with the shape and texture details of the target product image. After careful comparison, we find that the advertisement images synthesised by the proposed algorithm are both consistent-looking and consistent in detail.

In terms of consistent-looking, we can clearly observe in the first row of Figure 4 that the Converse sneaker advertisement synthesised by the proposed method is essentially consistent with the target product image in both product appearance and colour texture. By contrast, in the advertisement image synthesised by Stable Diffusion, although the colour of the synthesised sneakers is similar to that of the target product image, their appearance is very different. Further, the Midjourney image, although similar to the target product image in shape, differs noticeably in colour and in the original placement of the sneakers.

Similarly, consider the fourth row in Figure 4, where we need to generate an advertisement image from the target product image of Apple mobile phones. The original target product image contains two overlapping phones, one showing the back and the other the front. Our proposed algorithm synthesises a mobile phone advertisement image with a high degree of appearance similarity and a product pose consistent with the target product image. In contrast, the mobile phone advertisement images synthesised by the other two algorithms could not maintain appearance consistency with the target product image, and some of the generated advertisement images were completely inconsistent with it. These two sets of comparisons fully demonstrate the superior performance of the proposed algorithm in terms of appearance consistency.

For consistent detail, we can see from the second and third rows of Figure 4 that the advertisement images generated by the proposed algorithm present product details better. For example, comparing the Chanel perfume in the second row with its target product image and the synthetic advertisement image, our proposed algorithm effectively maintains the consistency of the trademark information, while the advertisement images generated by the other two algorithms either lose the trademark information or generate unrelated trademark information. The Coca-Cola advertisement image in the third row shows the same problem: our proposed algorithm keeps the trademark information consistent, whereas the other two algorithms partially misrender the Coca-Cola logo.

4.3.3. Robustness and Generalisation Experiments

In order to verify the robustness and generalisation ability of ASNet, we chose unconventional product categories and low-quality product images as inputs to the model. As shown in the first row of Figure 6, we chose the universal charger for mobile phone batteries, a product that is almost nonexistent now, as the research object. Around 2000, mobile phone batteries were still removable, so universal chargers were widely used. However, with the integration of mobile phones, batteries are no longer removable, so universal chargers are unlikely to appear in the recent datasets used for training. Observing Figure 6, we find that this type of unconventional product does not affect the performance of ASNet, and our model still preserves details well.

Unfortunately, when we look at the second row of Figure 6, we find that if a low-quality product image is chosen as the input to the model, the resulting advertisement image is very disappointing and has no practical value. The reason is that low-quality product images provide extremely limited feature information, which the model cannot understand. Thus, the generated advertisement image shown in the second row of Figure 6 resembles the target product image in some features but is completely inconsistent with it overall.

5. User Purchase Intention Study

To further clarify the impact of ASNet-generated advertising images on real consumers, we measured their effect on real consumers’ purchase intention and brand perception through a simple questionnaire.

We conducted the experiment through a street questionnaire. A total of 100 volunteers were recruited, and their participation was voluntary. Each participant was shown, in random order, advertisement images generated by the different models, along with the original image of the target product. After viewing the advertisement images, they rated their willingness to buy and their brand perception on a scale of 1–10 (the higher the value, the stronger the desire to buy or the better the perception of the brand).

During the questionnaire survey, we collected the basic personal information of the subjects, which included gender, age, education level, and so on. After eliminating 29 invalid questionnaires, there were 71 valid questionnaires, and the specific information of these 71 people is shown in Table 4.

The impact of the advertisement images generated by various models on consumers’ purchase intention and brand perception was explored through a questionnaire survey. As shown in Table 5, the advertisement images generated by our proposed ASNet are more likely to have a positive impact on consumers’ purchase intention and brand perception. Combined with the results in Table 3, we can reasonably speculate that this result is due to the fact that the advertisement images generated by ASNet maintain the structural and detailed integrity of the reference target object very well, so they are more realistic.

6. Limitations and Future Work

The ASNet proposed in this paper is built on the SD model, whose forward and reverse diffusion processes are based on Markov chains. It can recover real data accurately and preserves image details well, so it can generate realistic and attractive advertisement images. However, it also has certain defects and limitations. For example, when the quality of the input target object image is low, ASNet cannot maintain the consistent look of the target object well, because ASNet inherits the characteristics of the SD model and will repair the unknown parts on its own when it cannot accurately identify the detailed features of the target object. Future work should consider how to solve this problem and improve the generalisation ability and robustness of the model.

In addition, although ASNet can generate advertisement images end-to-end, it still requires professionals to craft prompts describing the synthesis scene according to the product characteristics. In future work, prompts could be generated automatically from product descriptions using language models, which would further improve the degree of automation of ASNet.

7. Conclusions

In this paper, we present a new Advertisement Synthesis Network model for advertisement image synthesis of targeted products. To the best of our knowledge, this is the first end-to-end automatic advertisement image synthesis model that can transform a simple target product image into a well-designed and aesthetically pleasing product advertisement image through a two-stage generation approach. The Advertisement Synthesis Network is likely to dramatically reduce the cost of advertisement design and revolutionise the advertisement design industry. At the same time, the two-stage generation solution used in this paper can provide a generic solution idea for similar tasks.

Data Availability

The datasets generated and analysed in this study are still in the research phase and are not publicly available but can be obtained from the corresponding author on reasonable request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Authors’ Contributions

QW designed the study, built the method, implemented the software, and wrote the paper. PZ contributed to the programming.