Abstract

The encoder-decoder network is usually applied in image captioning to automatically generate descriptive text for a picture. A web user interface (Web UI) is a special type of image and is usually described by HTML (Hypertext Markup Language). Consequently, it becomes possible to use an encoder-decoder network to generate the corresponding code from a screenshot of a Web UI. The basic structure of the decoder is an RNN, LSTM, GRU, or other recurrent neural network. However, this kind of decoder needs a long training time, which increases the time complexity of both training and prediction. Moreover, HTML is a typically structured language for describing a Web UI, and its word sequences hardly exhibit the temporal characteristics and complex context of natural language. To resolve these problems efficiently, a rapid combined model (RCM) is designed in this paper. The basic structure of the RCM is an encoder-decoder network; the encoder includes a word embedding matrix and a visual model. The word embedding matrix uses fully connected units. Compared with LSTM, the accuracy of the word embedding matrix is basically unchanged, but the training and prediction speed are significantly improved. In the visual model, the pretrained InceptionV3 network is used to generate the image vector, which not only improves the quality of Web UI screenshot recognition but also significantly reduces the training time of the RCM. In the decoder, the word embedding vector and the image vector are combined and input into the prediction model for word prediction.

1. Introduction

Building a Web UI in HTML and CSS (Cascading Style Sheets) is rather time-consuming for front-end web application developers. Partially automating this process would save considerable human effort.

Scholars have made many efforts to automatically generate program code. For instance, Gaunt et al. [1] studied machine learning formulations of inductive program synthesis and proposed TerpreT, a model composed of a specification of a program representation and an interpreter that describes how programs map inputs to outputs. Kalyan et al. [2] introduced a hybrid synthesis technique, neural-guided deductive search, which combines the best of symbolic logic techniques and statistical models. Bunel et al. [3] presented an adaptive neural-compilation framework that addresses the problem of efficient program learning by considering correctness only on a target input distribution. Balog et al. [4] proposed DeepCoder, which combines traditional search techniques with statistical prediction to generate computer code sequences. Solar-Lezama et al. [5] described the sketching approach to program synthesis, which leaves the low-level details of the implementation to an automated synthesis procedure. Ellis et al. [6] introduced a well-trained model that can convert hand drawings into code sequences. Riedel et al. [7] presented a differentiable interpreter for the programming language Forth that enables programmers to write program sketches with slots that can be filled with learnable behavior.

The encoder-decoder network first encodes the input into a latent vector and then converts it into the target output, so it is widely applied in machine translation [8, 9]. When the input sentences in machine translation are replaced by images, the network can translate these images into sentences [10–12]. Considering that the encoder-decoder network has made great progress in automatically describing image content [13, 14], scholars have begun to link computer vision with programming languages. Wu et al. [15] proposed a model that can interpret objects in certain scenes. A recent example is pix2code [16], which combines a CNN and an LSTM to encode the input image and the original code sequence, and then uses an LSTM to decode the code. Zhu et al. [17] introduced an attention mechanism into the extraction of input image features; the extracted block features then guide the LSTM network in the decoder to generate the code sequence. The overall structure of that model is still an encoder-decoder network similar to pix2code: code sequence features and image features are used as input, and an LSTM network predicts the code sequence. However, an obvious shortcoming of this type of network is its high time complexity, mainly because the model is entirely composed of CNN and LSTM units [16, 17].

To resolve these problems, a fast end-to-end model is introduced to generate Web UI code. The model can not only learn the word embedding matrix of Web UI code but also automatically generate code sequences from Web UI images. Experimental results show that the model achieves higher generation accuracy on a public Web UI dataset while also improving training speed.

This paper is divided into four parts. First, the research background and the latest progress in automatic code generation are introduced, and the advantages and disadvantages of common models are analyzed, which leads to the RCM model proposed in this paper. The second part introduces the overall structure of the RCM for generating code from the web user interface (Web UI) and then discusses the word embedding matrix model, visual model, and prediction model in detail. The third part introduces the training experiments for the RCM model and tests the RCM from two aspects: single-word prediction and whole-code generation. The fourth part concludes the paper and discusses future work.

2. Design of RCM

This section introduces the RCM model; the experimental data are presented in the next section.

2.1. Structure of RCM

The overall structure of the RCM is still an encoder-decoder network, but the encoder consists of a word embedding matrix and a visual model, whose outputs are concatenated to form the output of the encoder. The word embedding matrix converts the code describing the Web UI into a vector sequence, and the visual model is responsible for converting the screenshot of the Web UI into a corresponding high-dimensional vector. The decoder uses the results of the encoder to predict the next word. The whole structure of the RCM is shown in Figure 1.

Firstly, the word embedding matrix $E$ converts the index vector $c_t^i$ of the context word sequence into the word vector $x_t^i$ ($i$ denotes the sample number and $t$ the current time step); meanwhile, the image is input into the visual model to obtain the image feature vector $v^i$. Secondly, $x_t^i$ and $v^i$ are concatenated to obtain a feature vector $q_t^i$ representing sample $i$. Finally, $q_t^i$ is input into the prediction model, and a Softmax network produces the index vector of the next word, $\hat{y}_t^i$. Before the next time step starts, $\hat{y}_t^i$ is appended to the context to obtain the new index vector $c_{t+1}^i$. The calculation process of the RCM is shown in equations (1)–(5):

$$x_t^i = E\,c_t^i, \tag{1}$$
$$v^i = \mathrm{CNN}(I^i), \tag{2}$$
$$q_t^i = [x_t^i; v^i], \tag{3}$$
$$\hat{y}_t^i = \mathrm{Softmax}(f(q_t^i)), \tag{4}$$
$$c_{t+1}^i = [c_t^i(2{:}T);\, \hat{y}_t^i], \tag{5}$$

where $I^i$ is the Web UI screenshot of sample $i$, $f(\cdot)$ denotes the prediction model, and $T$ is the context length.
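To make this data flow concrete, the following is a minimal sketch of one RCM decoding step, assuming Keras (TensorFlow 2.x); the single hidden layer standing in for the full prediction model and the layer names are illustrative assumptions.

```python
# Minimal sketch of one RCM step: embed context, concatenate with the
# image vector, predict the next DSL word. Sizes marked "from the paper"
# are stated in the text; everything else is an assumption.
from tensorflow.keras import layers, Model

VOCAB = 19        # number of DSL words (from the paper)
CONTEXT = 48      # context length (from the paper)
EMBED = 64        # word vector size (from the paper)
IMG_DIM = 2048    # pooled InceptionV3 feature size (from the paper)

context_in = layers.Input(shape=(CONTEXT,), name="dsl_context")   # c_t^i
image_vec = layers.Input(shape=(IMG_DIM,), name="image_vector")   # v^i

# Word embedding matrix E of shape (19, 64); flatten the embedded context.
x = layers.Embedding(VOCAB, EMBED)(context_in)
x = layers.Flatten()(x)                                           # x_t^i

q = layers.Concatenate()([x, image_vec])                          # q_t^i

# Placeholder for the prediction model, followed by Softmax over 19 words.
h = layers.Dense(512, activation="relu")(q)
y = layers.Dense(VOCAB, activation="softmax")(h)                  # y_hat_t^i

rcm_step = Model([context_in, image_vec], y)
```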

2.2. Learning of Word Embedding Matrix

Since a word embedding model can learn a distributed representation of each word [18], and this information can be used to calculate the occurrence probability of a given sentence, the word embedding technique is commonly applied in NLP tasks. In order to obtain feature vectors for the word sequences in Web UI code, a word embedding matrix is used to store the word vector of each word in the Web UI vocabulary.

A DSL (domain-specific language) is usually used in a specialized field and is more restrictive than general-purpose computer languages. After converting the Web UI code sequence into a DSL word sequence, the number of DSL words is 19, whereas the number of tags in HTML5 is 118. Therefore, modeling with the DSL limits the complexity of the model and reduces its search space.
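For illustration, a short fragment in the style of the pix2code web DSL looks like the following; the token names follow the public pix2code dataset, which defines the exact 19-word vocabulary:

```
header {
  btn-active, btn-inactive
}
row {
  single {
    small-title, text, btn-green
  }
}
```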

The size of the word embedding matrix $E$ is set to (19, 64). In order to learn $E$, it is necessary to build a DSL language model for the Web UI. Considering that the Web UI usually adopts a nested structure, and in order to verify whether its DSL sequences exhibit obvious dynamic temporal characteristics like natural languages [9], the sequence of word embedding vectors is processed separately by an FC layer, an LSTM layer [19], a CNN layer [20], and a BiLSTM (bidirectional LSTM) layer [21], and then the Softmax activation function is used to predict the next word. The structure of the DSL language model for the Web UI is shown in Figure 2.

Firstly, the DSL context at the $t$-th time step in the $i$-th sample is converted into the index vector $c_t^i$. Secondly, the word embedding matrix $E$ is used to convert $c_t^i$ into the word embedding vector $x_t^i$. Finally, $x_t^i$ is input into FC (fully connected), LSTM (long short-term memory), BiLSTM (bidirectional LSTM), and CNN (convolutional neural network) units for separate training and testing. During training, the same word embedding matrix is shared across the word sequences of all samples.
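This comparison setup can be sketched as follows, assuming Keras; the unit sizes (128 recurrent units, kernel width 3, 512 FC units) are assumptions, since the paper fixes only the embedding size.

```python
# Sketch of the DSL language model with interchangeable sequence units,
# mirroring Figure 2: shared embedding -> one of {FC, LSTM, BiLSTM, CNN}
# -> Softmax next-word prediction.
from tensorflow.keras import layers, Model

def build_lm(unit="fc", vocab=19, context=48, embed=64):
    inp = layers.Input(shape=(context,))
    x = layers.Embedding(vocab, embed)(inp)  # shared word embedding matrix E
    if unit == "fc":
        x = layers.Flatten()(x)
        x = layers.Dense(512, activation="relu")(x)
    elif unit == "lstm":
        x = layers.LSTM(128)(x)
    elif unit == "bilstm":
        x = layers.Bidirectional(layers.LSTM(128))(x)
    elif unit == "cnn":
        x = layers.Conv1D(128, 3, activation="relu")(x)
        x = layers.GlobalMaxPooling1D()(x)
    out = layers.Dense(vocab, activation="softmax")(x)  # next DSL word
    return Model(inp, out)
```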

Multiclass cross-entropy is applied to calculate the loss value of a single sample $x$, as shown in the following equation:

$$L(x) = -\sum_{i=1}^{M} y_i \log \hat{y}_i. \tag{6}$$

In this equation, $M$ is the length of the DSL word list and $x$ is the input sample. $\hat{y}_i$ is the $i$-th component of the prediction result, that is, the predicted probability that the next word is the $i$-th word, and $y_i$ is the true probability that the next word is the $i$-th word.

Batch $B$ contains $N$ samples, so the loss function of batch $B$ is the average of the loss values of all samples in the batch, as shown in the following equation:

$$L(B) = \frac{1}{N} \sum_{n=1}^{N} L(x_n). \tag{7}$$
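As a worked illustration of equations (6) and (7), a minimal NumPy sketch of the per-sample and batch losses might look as follows; the small epsilon guard against log(0) is an implementation detail, not from the paper.

```python
import numpy as np

def sample_loss(y_true, y_pred):
    # Equation (6): L(x) = -sum_i y_i * log(y_hat_i) over the M = 19 words.
    return -np.sum(y_true * np.log(y_pred + 1e-12))

def batch_loss(Y_true, Y_pred):
    # Equation (7): average of the per-sample losses over the N samples.
    return np.mean([sample_loss(t, p) for t, p in zip(Y_true, Y_pred)])
```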

FC, LSTM, BiLSTM, and CNN units are used to separately construct language models, which are trained on the same dataset (the details of the dataset are given in Section 3). After training for 10 epochs, the model with the highest accuracy on the validation set is taken as the best model, and its accuracy on the test set is then calculated. The experimental results are shown in Table 1.

It can be concluded from the experimental results that the FC layer is relatively stable and insensitive to batch size: its accuracy on the validation and test data stays at about 91.6%. The LSTM layer is strongly affected by batch size; the larger the batch, the lower its accuracy. When the batch size decreases, the accuracy increases, but the training time grows rapidly as the batch size is reduced; in particular, the BiLSTM layer takes about twice the training time of the ordinary LSTM layer. With a batch size of 512, the CNN layer has poor accuracy, but when the batch size is reduced, its accuracy on the validation and test sets increases.

The above analysis shows that, although LSTM and BiLSTM have achieved good results in NLP, they perform only moderately when applied to the DSL sequences of the Web UI, and CNN does not perform as well as an ordinary FC layer on these sequences. It can therefore be judged that the temporal characteristics of the DSL describing the Web UI are not obvious. The LSTM and BiLSTM layers offer no clear advantage in Web UI code prediction tasks, and their computation time is several times that of FC layers.

Consequently, the FC layer is chosen to build the DSL language model, trained for 100 epochs with a batch size of 512. The loss function is the cross-entropy defined in equations (6) and (7). The accuracy and loss of model training are shown in Figure 3.

The DSL language model has 199,251 parameters; training takes about 8 minutes, the minimum loss is 0.19178, and the maximum accuracy is 91.824%.

2.3. Design of Visual Model

In recent years, CNNs have been used to extract feature vectors from images and can accurately classify images or recognize objects [22, 23]. In image captioning research, a common method is to use a CNN to extract feature vectors from images and recurrent neural network units to generate the captions [24–26].

Considering that CNNs have achieved remarkable results in the field of image recognition [27, 28], convolutional networks are adopted in the visual model of the RCM. In order to improve the training and prediction speed of the model, the pretrained InceptionV3 model [29] is adopted as the visual model of the RCM, with some adjustments to the original InceptionV3 model to better handle the DSL prediction task.

The visual model in the RCM discards the top classification layers of InceptionV3, uses only the underlying convolutional layers to process the Web UI image, and adds a global average pooling layer to generate the corresponding image feature vector $v^i$ of size (1, 2048). The final visual model has a total of 21,802,784 parameters, which can be updated during RCM training if computing power allows. However, under the experimental conditions of this work, all parameters of the visual model are frozen in order to improve the training speed of the entire model.
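A minimal sketch of this visual model, assuming Keras and its bundled InceptionV3 weights; the 299 × 299 input size is InceptionV3's default and is an assumption here, since the paper does not state the input resolution.

```python
# Pretrained InceptionV3 without its top classification layers, plus a
# global average pooling layer; all weights frozen as described above.
from tensorflow.keras.applications import InceptionV3
from tensorflow.keras import layers, Model

base = InceptionV3(include_top=False, weights="imagenet")
base.trainable = False  # all visual-model parameters are locked

inp = layers.Input(shape=(299, 299, 3))   # InceptionV3's default input size
x = base(inp, training=False)
v = layers.GlobalAveragePooling2D()(x)    # 2048-dimensional feature vector v^i

visual_model = Model(inp, v)
```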

2.4. Design of Prediction Model

The experimental results in Section 2.2 show that, in the DSL sequence prediction task of the Web UI, the FC unit greatly reduces training time compared with the LSTM and CNN units while maintaining an accuracy above 90%, so the FCB (fully connected block) is designed as the foundation of the RCM prediction model. The structure of the FCB is shown in Figure 4.

The FCB consists of a fully connected layer, a batch normalization layer [30], and a dropout layer [31]. The specific calculation process is shown in equations (8)–(11). The variables and notation are shown in Table 2.

The prediction model in RCM consists of several FCBs connected in series, and its structure is shown in Figure 5.

The feature vector $q_t^i$ of sample $i$ is composed of the word embedding vector $x_t^i$ and the image feature vector $v^i$. The prediction model uses $q_t^i$ as input to predict the probability of the next word. The FC layers of the first 9 FCBs in the prediction model each contain 512 units, the FC layer of the last FCB contains 19 units, and the final Softmax layer calculates the prediction probabilities of the 19 DSL words. The DSL word with the maximum prediction probability is taken as the prediction result. The entire prediction model has 4,743,379 trainable parameters.
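A sketch of the FCB and the prediction model, assuming Keras; the ReLU activation and the dropout rate of 0.25 are assumptions, since the exact choices are fixed by equations (8)–(11) and Table 2.

```python
# FCB: FC layer -> batch normalization [30] -> activation -> dropout [31];
# the prediction model chains 9 FCBs of 512 units, one FCB of 19 units,
# and a final Softmax over the 19 DSL words.
from tensorflow.keras import layers, Model

def fcb(x, units, drop=0.25):
    x = layers.Dense(units)(x)
    x = layers.BatchNormalization()(x)
    x = layers.Activation("relu")(x)
    return layers.Dropout(drop)(x)

def build_predictor(feat_dim, vocab=19):
    q = layers.Input(shape=(feat_dim,))   # feature vector q_t^i
    h = q
    for _ in range(9):                    # first 9 FCBs with 512 units each
        h = fcb(h, 512)
    h = fcb(h, vocab)                     # last FCB with 19 units
    y = layers.Softmax()(h)               # probabilities of the 19 DSL words
    return Model(q, y)
```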

3. Experiments

In order to verify the effect of the RCM, the Web UI dataset from pix2code [16] is used to train and test the RCM. The pix2code dataset contains 1,750 Web UI instances; each instance contains a Web UI code sequence and an image. The format of the code sequence and the corresponding image is shown in Figure 6.

Before the experiments, a total of 167,958 samples with a context length of 48 are obtained from the instances in the pix2code dataset; the vocabulary of these samples comprises the 19 DSL words. Then, 24,108 samples are extracted to form the test set, with the rest forming the training set, from which 20% of the samples are randomly selected as the validation set during training. Each sample contains a DSL sequence of length 48, a predicted word, and an image in PNG format.
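How the 48-word context samples might be cut from each instance can be sketched as follows; the padding token and the left-padding strategy are assumptions, since the paper does not describe them.

```python
# Slide a 48-token window over each instance's DSL token sequence; each
# sample pairs a context window with the next word to predict.
def make_samples(tokens, context_len=48, pad="<PAD>"):
    samples = []
    padded = [pad] * context_len + tokens  # left-pad so early words get a context
    for t in range(len(tokens)):
        context = padded[t:t + context_len]  # 48-token DSL context
        target = tokens[t]                   # next word to predict
        samples.append((context, target))
    return samples
```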

3.1. Training

Due to the limited computing power of the experimental computer, only the word embedding matrix and the parameters of the prediction model are trained; the parameters of the visual model do not participate in training. During training, the batch size is set to 512, the number of epochs to 200, and the learning rate to 0.00001; the Adam gradient descent algorithm [32] is used to update the model parameters, and equations (6) and (7) are used to evaluate the model loss. The accuracy and loss curves of the model during the training phase are shown in Figure 7.
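This training configuration can be summarized in a short sketch, assuming Keras; `rcm` and the data arguments are hypothetical placeholders for the model and the prepared arrays.

```python
# Compile and fit the RCM with the hyperparameters stated above.
from tensorflow.keras.optimizers import Adam

def train_rcm(rcm, contexts, image_vectors, next_words):
    rcm.compile(
        optimizer=Adam(learning_rate=0.00001),  # learning rate from the paper
        loss="categorical_crossentropy",        # equations (6) and (7)
        metrics=["accuracy"],
    )
    return rcm.fit(
        [contexts, image_vectors], next_words,
        batch_size=512, epochs=200,             # settings from the paper
        validation_split=0.2,                   # 20% of samples for validation
    )
```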

The RCM has 4,942,630 parameters in total (language model plus prediction model), about 1/20 of the parameter count of the pix2code model [16]. Training runs on the author's desktop computer (Ubuntu 18, AMD Ryzen 5 1400 quad-core processor × 8, GeForce GTX 1660) and takes 29 minutes in total (8 minutes for the language model and 21 minutes for the prediction model); the highest accuracy is 97.685% and the minimum loss is 0.04805.

3.2. Tests of One-Word Prediction and Code Generation

Using the trained RCM for one-word prediction on the test set, the accuracy is 96.493% and the loss is 0.08356. The multiclass micro-average ROC curve is shown in Figure 8; its AUC value is 0.99984, so the classification performance of the model is quite good.

In order to test the code generation ability of the RCM, Web UI images in the test set are input into the RCM to generate code sequences (using the greedy search algorithm). The resulting average error rate is 8.89%. It should be noted that when the length of the generated sequence does not match the length of the real sequence, the length difference is also counted as errors. This 8.89% error rate is much lower than the 12.14% error rate of pix2code on the Web UI test set [16]. This result further shows that the RCM retains an advantage in generating code sequences while significantly improving training speed.
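A sketch of the greedy search used for code generation; the `<START>`/`<END>` token names and the helper models `visual_model` and `rcm_step` are assumptions carried over from the earlier sketches.

```python
# Greedy DSL generation: repeatedly pick the highest-probability next word
# and slide it into the 48-word context until <END> or a length cap.
import numpy as np

def generate_dsl(image, visual_model, rcm_step, word2idx, idx2word,
                 context_len=48, max_len=200):
    v = visual_model.predict(image[np.newaxis])        # image vector v^i
    context = [word2idx["<START>"]] * context_len      # assumed start token
    words = []
    for _ in range(max_len):
        probs = rcm_step.predict([np.array([context]), v])[0]
        idx = int(np.argmax(probs))                    # greedy choice
        word = idx2word[idx]
        if word == "<END>":                            # assumed end token
            break
        words.append(word)
        context = context[1:] + [idx]                  # slide the window
    return words
```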

A Web UI image from the test set (shown in Figure 9; its original DSL code is shown in Figure 10) is selected as input, and the RCM is used to generate a DSL code sequence (shown in Figure 11). After the DSL code is converted to HTML code, the resulting Web UI can be viewed in a browser (shown in Figure 12).

According to the generation results for the above image, the DSL and HTML code sequences generated by the RCM from Web UI images are quite accurate, although there are cases in which the colors of some controls are inconsistent. In addition, it should be noted that the texts of controls such as buttons and DIV layers are not included in the DSL dictionary; they are randomly generated by the program.

4. Conclusion

With the continuous development of deep learning, and especially the outstanding performance of CNNs in image recognition, it has become possible to apply encoder-decoder models to convert a graphical user interface into the corresponding code sequence. A web user interface is usually described by HTML (Hypertext Markup Language), and the basic structure of the decoder is an RNN, LSTM, GRU, or other recurrent neural network. However, this kind of decoder requires a long training time, which increases the time and space complexity of training and prediction. HTML is a typically structured language for describing the Web UI, but its word sequences hardly exhibit temporal characteristics, and it is difficult to define complex logical functions in HTML. In order to solve these problems, a fast end-to-end model is introduced to generate Web UI code. The model can not only learn the word embedding matrix of Web UI code but also automatically generate code sequences from Web UI images.

This model greatly improves training speed while maintaining high prediction accuracy and a low generation error rate. A word embedding matrix is constructed to represent each DSL word, which expresses the relationships between DSL words more richly than one-hot vectors. At the same time, the experiments on initializing the word embedding matrix also show that the code sequence of the Web UI is structure-oriented rather than sequential. Therefore, the RNN, LSTM, and BiLSTM networks commonly used in NLP are not used in the prediction model; instead, the FC network is adopted to improve the training speed of the prediction model.

The visual model is built on the pretrained InceptionV3 model, which not only decreases the complexity of the entire model but also greatly reduces the parameter search space. The prediction model is built with the FCB as its basic unit and uses both the context feature vector and the image feature vector to make predictions. It achieves not only relatively high prediction accuracy but also a low generation error rate and, above all, greatly improved training speed.

Experimental results show that the model achieves higher generation accuracy on the public Web UI dataset while also improving training speed.

In follow-up work, real HTML web page code and the corresponding images will be collected from the Internet with the help of crawler technology, and the word embedding matrix in the RCM will be redesigned so that it can handle HTML and CSS code more effectively.

Data Availability

The data used to support the findings of this study are included within the article.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This research was funded by the Hebei IoT Monitoring Engineering Technology Research Center (No. 3142018055), the Key Research Program of Hebei Province (No. 19270318D), the Langfang Science and Technology Research and Development Plan Project (No. 2018011041), and the Self-selected Project of North China Institute of Science and Technology (No. HZXKT2020012).