Abstract
Image text description is a multimodal data processing problem in computing that involves research tasks from both computer vision and natural language processing. At present, research on the image text description task focuses mainly on methods based on deep learning. The work in this paper addresses the imprecise description of visual words and nonvisual words in generated image descriptions. An adaptive attention double-layer LSTM (long short-term memory) model based on the encoding-decoding framework is proposed. Compared with the adaptive attention algorithm built on the same encoding-decoding framework, the proposed model improves the BLEU-1 evaluation index by 1.21%, METEOR by 0.75%, and CIDEr by 0.55%, while its BLEU-4 and ROUGE-L scores are slightly lower than those of the original model, with only a small gap. Although the proposed model does not surpass the original model on every performance indicator, it describes visual words and nonvisual words more accurately in actual image text description.
1. Introduction
With the rapid development of the Internet in recent years, massive amounts of information are uploaded every day to large Internet platforms such as Google, Facebook, TikTok, and Sina Weibo. How to process the resulting petabyte-scale text, image, video, and audio data has become a key research issue in the era of big data. Different data formats call for different research methods: for natural scene text [1], there are tasks such as text recognition [2], text detection [3], automatic summarization, and natural language processing; for image data, there are computer vision tasks such as image semantic segmentation, target recognition [4], and salient target detection; for audio data, there are speech recognition, music information retrieval [5], environmental sound recognition [6], and so on. However, the methods mentioned above are generally designed for single-modality data. Multimodal data processing problems that span several computer fields have now been proposed, such as image text description [7], video text description [8], and autonomous driving [9], which combine research tasks from computer vision and natural language processing. The image text description task is to automatically generate sentences that describe the content of an image in natural language. Image text description is a cornerstone of multimodal data processing and also has considerable practical value. It is a cross-cutting research topic spanning computer vision and natural language processing: the input is an image, and the output is a text description of the image content, that is, a conversion between two different types of information. Compared with other image tasks, image text description requires a deeper understanding of the image; it must not only accurately identify each object in the image but also describe the relationships among multiple objects. Enabling computers to quickly analyze image data and derive image-based semantics is therefore of great significance. For example, for the visually impaired, external image information can be converted into spoken text to help them better understand their surroundings; in video surveillance, automatic understanding of abnormal objects in video can be used to monitor abnormal highway conditions and issue real-time warnings; and in autonomous driving, the driving environment around the vehicle can be understood in real time.
Compared with a human's ability to recognize the content of an image and reason about the relationships among that content, a computer's ability to understand image content is still far from sufficient. With the growing demand for image understanding, enabling computers to understand image content has become a research hotspot in artificial intelligence and machine learning. For image description, generating correct natural language to describe the image content is very challenging, because it requires not only classifying the objects in the image but also analyzing object attributes and the correlations between objects. In addition, the sentences generated from this content must follow correct grammatical structure and be semantically correct, which means that a text generation model is needed in addition to visual understanding. These factors have led to slow progress in image text description, but the rise of deep learning in recent years has brought new opportunities to this field.
In recent years, thanks to the rapid development of deep neural networks, researchers have proposed image description generation models based on deep learning architectures. Inspired by machine translation, a method that cascades a DCNN (Deep Convolutional Neural Network) with an RNN (Recurrent Neural Network) was proposed, known as the encoding-decoding framework. Under this framework, the encoding stage extracts the visual features of the target image with the DCNN, and the decoding stage generates descriptive sentences from those visual features with the RNN.
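As a rough illustration of this encoding-decoding framework, the following sketch wires a pretrained CNN encoder to an LSTM decoder with a Keras-style API. The vocabulary size, embedding dimension, and hidden dimension are hypothetical placeholders, and this is not the model proposed later in this paper.

```python
import tensorflow as tf

VOCAB_SIZE, EMBED_DIM, HIDDEN_DIM = 10000, 256, 512  # hypothetical sizes

# Encoding: a convolutional network maps the image to a global visual feature.
encoder = tf.keras.applications.ResNet50(include_top=False, pooling="avg",
                                          input_shape=(224, 224, 3))
image = tf.keras.Input(shape=(224, 224, 3))
visual = encoder(image)                                   # (batch, 2048)

# Decoding: an RNN generates the description word by word, conditioned on the
# visual feature through its initial hidden and memory states.
words = tf.keras.Input(shape=(None,), dtype="int32")      # previously generated words
embedded = tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM)(words)
init_h = tf.keras.layers.Dense(HIDDEN_DIM, activation="tanh")(visual)
init_c = tf.keras.layers.Dense(HIDDEN_DIM, activation="tanh")(visual)
hidden = tf.keras.layers.LSTM(HIDDEN_DIM, return_sequences=True)(
    embedded, initial_state=[init_h, init_c])
logits = tf.keras.layers.Dense(VOCAB_SIZE)(hidden)        # per-step next-word scores

captioner = tf.keras.Model([image, words], logits)
```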
While the basic encoding-decoding framework made initial progress in image description, researchers drew on the use of the attention mechanism in machine translation and proposed attention-based encoding-decoding algorithms for description sentence generation. These algorithms introduce the attention mechanism in the decoding stage, combine the word generated at the previous moment with local visual information of the image, dynamically focus on different regions of the image as the language module generates each word, and thereby generate description sentences. In 2015, Xu et al. [10] first applied the attention mechanism to the image description task, including a hard attention mechanism that uses the visual information of only a single local region and a soft attention mechanism that uses the visual information of all local regions of the image, and analyzed and discussed the different results of the two mechanisms on the task. In 2016, Yang et al. [11] proposed an image description algorithm based on review-based attention, which can perform multistep review and enhance the stability of the model. In 2017, Pedersoli et al. [12] proposed an image description method based on region attention, which associates image regions with the RNN's dynamically generated words; at each time step of the RNN, the next word and the image region associated with that word are predicted to generate description sentences. In the same year, Lu et al. [13] proposed an adaptive attention model based on a visual sentinel. The words in a description sentence are divided into visual words and nonvisual words: the former depend on entities in the image, and the latter depend on the language model. During word generation, the visual sentinel is used to decide whether the information for generating a word comes from the image's visual information or from the language model. In 2018, Lu et al. [14] proposed NBT, a deep learning model that combines template-filled sentences. The NBT model uses Fast R-CNN to extract image features to predict words; if a word is a visual word, it is directly filled into the sentence template, which effectively alleviates the dullness of traditional template-filled sentences. In 2019, Xinwei et al. [14] proposed a new visually dense semantic attention network model, VD-SAN (Visual-Densely Semantic Attention Network), which considers the relationship between the image attribute prediction task and the image feature extraction task. In the same year, Huang et al. [15] proposed the AoA (Attention on Attention) module, which extends the conventional attention mechanism by adding an additional attention step to achieve better results in image text description. Zhou et al. [16] proposed the POS-SCAN (Part-Of-Speech) model in the same year; POS-SCAN applies weak supervision on top of the SCAN model to train the attention module, which effectively improves the quality of image text descriptions. In 2020, Liu et al. [17] proposed a Chinese image description model, NICVATP2L, which uses a visual attention mechanism to reduce the semantic deviation between Chinese and English image descriptions and uses a semantic model to improve the accuracy of the text descriptions. Pan et al. [18] proposed the X-Linear attention mechanism for high-order feature interaction; this model uses bilinear fusion to mine second-order to higher-order feature interaction information between different modalities and enhance cross-modal content understanding.
In the same year, Guo et al. [19] proposed the NG-SAN model, which improves the self-attention model with two new components: normalized self-attention (NSA) and geometry-aware self-attention (GSA). These make up for the main limitation that the transformer model cannot represent the geometric structure of the input objects. In 2020, Sammani et al. [20] proposed ETN, a vision-language training model that can iteratively polish the generated description and effectively improve the readability of description sentences. The above research shows that an encoding-decoding framework with an attention mechanism can attend to different regions of the image when generating words and has achieved good results.
At this stage, the image text description task has achieved encouraging results, but some problems remain. When evaluating generated image descriptions, it is necessary not only to compare evaluation indicators but also to assess the grammar, context, diversity, and completeness of the sentences. When describing image content, current image description algorithms are not accurate in describing the visual words (such as "man" and "boat") and nonvisual words (such as "is" and "the") in sentences, so the generated text descriptions contain grammatical errors and have low readability. When describing images of complex scenes, the main content of the image is not described accurately and the main subject may be described incorrectly, making the generated description differ considerably from the original image content, so further research is needed.
1.1. Paper Structure
In the section Improved Adaptive Attention Double-Layer LSTM Model, we introduce the research motivation of the paper, the requirements that led us to propose a double-layer LSTM model based on adaptive attention, and the model structure, algorithm flow, and mathematical formulation of the proposed model.
In the section Experiment and Result Analysis, we introduce the experimental parameters and environment settings and present the comparison between the experimental results and mainstream image description models on various indicators.
2. Improved Adaptive Attention Double-Layer LSTM Model
2.1. Research Motivation
The image text description task is essentially a problem of integrating computer vision and natural language processing. The input is an image, and the output is a text description of the image content, that is, a conversion between two different kinds of information. Compared with other image tasks, it requires a deeper understanding of the image: it must not only accurately identify each object in the image but also describe the relationships among multiple objects. At present, most solutions to this problem are based on an attention mechanism within the encoding-decoding framework; adding the attention mechanism to the framework can improve the readability of the sentence description. Most image description algorithms that use the encoding-decoding framework rely on a single-layer LSTM model to process image features and generate text at the same time. In the decoding process, the single-layer LSTM must capture visual attention words and also describe the relationships between visual words, which makes nonvisual words rely more on semantic information than on visual information. During text generation, nonvisual words then reduce the effectiveness of the visual information, resulting in inaccurate expression of visual words and nonvisual words. To solve this problem, this paper proposes an adaptive attention double-layer LSTM model based on the original adaptive attention mechanism algorithm. Figure 1 shows the LSTM cell structure.
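To make the cell structure in Figure 1 concrete, here is a minimal NumPy sketch of a single LSTM step under a stacked-parameter layout; the weight layout and names are illustrative assumptions rather than the implementation used in this paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell_step(x_t, h_prev, m_prev, W, U, b):
    """One LSTM time step.

    W: (4H, D) input weights, U: (4H, H) recurrent weights, b: (4H,) biases,
    stacked in the order input gate, forget gate, output gate, candidate."""
    H = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b
    i = sigmoid(z[0:H])          # input gate
    f = sigmoid(z[H:2 * H])      # forget gate
    o = sigmoid(z[2 * H:3 * H])  # output gate
    g = np.tanh(z[3 * H:4 * H])  # candidate memory content
    m_t = f * m_prev + i * g     # memory (cell) state update
    h_t = o * np.tanh(m_t)       # hidden state output
    return h_t, m_t
```

For example, with an input dimension D = 4 and H = 3 hidden units, calling `lstm_cell_step` with randomly initialized parameters advances the hidden and memory states by one step.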

2.2. Adaptive Attention Double-Layer LSTM Model
In the original attention-based encoding-decoding framework, a single-layer LSTM model is generally used to process both the image context information and the image description task, which easily leads to insufficient expression accuracy of visual words and nonvisual words in the generated text descriptions. Therefore, this paper designs a double-layer LSTM based on the adaptive attention mechanism (AAM-DLSTM).
AAM-DLSTM uses ResNet-50 (Residual Network) to extract the global features and spatial local features of the image. The input of LSTM-1 is the global image feature $v^g$, the previous context content vector $c_{t-1}$, and the word vector $w_t$, and its output is the hidden layer state $h_t^1$ of LSTM-1 and the visual sentinel $s_t$. The input of LSTM-2 is the context content vector $c_t$ at the current moment and the hidden layer state $h_t^1$ of LSTM-1, and its output is the hidden layer state $h_t^2$ of LSTM-2; the text description of the image is generated over multiple time steps. The pseudo-code of the AAM-DLSTM algorithm is shown in Algorithm 1.
Algorithm 1: Pseudo-code of the AAM-DLSTM decoding procedure.
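As a complement to Algorithm 1, the following is a hedged TensorFlow sketch of one AAM-DLSTM decoding time step that follows the data flow just described; the layer names, dimensions, and exact attention parameterization are illustrative assumptions rather than the authors' implementation.

```python
import tensorflow as tf

HIDDEN, VOCAB = 512, 10000  # illustrative sizes, not the paper's settings

class AAMDLSTMStep(tf.keras.layers.Layer):
    """One decoding step of the two-layer adaptive-attention decoder (sketch)."""
    def __init__(self):
        super().__init__()
        self.lstm1 = tf.keras.layers.LSTMCell(HIDDEN)   # visual information module
        self.lstm2 = tf.keras.layers.LSTMCell(HIDDEN)   # text generation module
        self.gate_x = tf.keras.layers.Dense(HIDDEN, use_bias=False)
        self.gate_h = tf.keras.layers.Dense(HIDDEN, use_bias=False)
        self.proj_v = tf.keras.layers.Dense(HIDDEN)     # project local features
        self.att_feat = tf.keras.layers.Dense(HIDDEN)
        self.att_hidden = tf.keras.layers.Dense(HIDDEN)
        self.att_score = tf.keras.layers.Dense(1)
        self.classifier = tf.keras.layers.Dense(VOCAB)

    def call(self, w_t, v_global, V_local, c_prev, state1, state2):
        # LSTM-1 input: word vector + global feature + previous context vector.
        x1 = tf.concat([w_t, v_global, c_prev], axis=-1)
        h_prev1 = state1[0]
        h1, state1 = self.lstm1(x1, state1)
        # Visual sentinel: gate on the input and previous hidden state, applied
        # to the current memory state of LSTM-1.
        g_t = tf.sigmoid(self.gate_x(x1) + self.gate_h(h_prev1))
        s_t = g_t * tf.tanh(state1[1])

        # Adaptive attention: softmax over projected local features plus the
        # sentinel gives the context content vector for the current moment.
        cand = tf.concat([self.proj_v(V_local), s_t[:, None, :]], axis=1)
        scores = self.att_score(tf.tanh(self.att_feat(cand)
                                        + self.att_hidden(h1)[:, None, :]))
        alpha = tf.nn.softmax(scores, axis=1)
        c_t = tf.reduce_sum(alpha * cand, axis=1)

        # LSTM-2 input: current context vector + LSTM-1 hidden state.
        x2 = tf.concat([c_t, h1], axis=-1)
        h2, state2 = self.lstm2(x2, state2)
        p_t = tf.nn.softmax(self.classifier(h2))         # word probability distribution
        return p_t, c_t, state1, state2
```

In a full model, this step would be unrolled over time, starting from zero context and LSTM states, until the <end> token or the maximum sequence length is reached.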
As shown in Figure 2, the original image is first input (box A in Figure 2), and the global feature vector $v^g$ and local feature vectors $V = \{v_1, \dots, v_k\}$ of the image are extracted through ResNet (box B in Figure 2). The spatial local features are sent to the adaptive attention module to obtain the context content vector $c_t$ at the current moment (box C in Figure 2). Taking into account the influence of the word generated at the previous moment, the word vector of that word is sent to the visual information module LSTM-1 to obtain dependency information (box D in Figure 2). $y_t$ is a word in the generated image description $\{y_1, \dots, y_{LS}\}$, where $LS$ indicates the length of the generated description sentence, $y_1$ indicates the start position, and $y_{LS}$ indicates the end. The word vector can be expressed as $w_t = E\,y_t$, where $E$ is the word vector (embedding) matrix and $y_t$ is represented as a one-hot vector. The input of the visual information module at the current moment is composed of the word vector $w_t$, the global feature vector $v^g$, and the context content vector $c_{t-1}$, that is, $x_t^1 = [w_t;\, v^g;\, c_{t-1}]$. According to the LSTM principle introduced in the previous section, the hidden layer vector $h_t^1$ and the memory layer vector $m_t^1$ of LSTM-1 at time $t$ are updated as shown in the following formula:

$$h_t^1,\; m_t^1 = \mathrm{LSTM}_1\left(x_t^1,\; h_{t-1}^1,\; m_{t-1}^1\right)$$
Among them, $h_t^1$ and $m_t^1$, respectively, represent the state of the hidden layer and the state of the memory layer of a unit in LSTM-1 at time $t$. In order to obtain the source of information on which nonvisual words are generated, LSTM-1 also outputs a visual sentinel. The sentinel vector $s_t$ (shown on the right side of box D in Figure 2) mainly depends on the memory layer state of LSTM-1 at the current moment, and it is calculated as shown in the following formula:

$$g_t = \sigma\left(W_x x_t^1 + W_h h_{t-1}^1\right), \qquad s_t = g_t \odot \tanh\left(m_t^1\right)$$
Among them, $W_x$ and $W_h$ are weight parameters obtained from model training, $\sigma$ is the sigmoid function, $\odot$ is the element-wise (Hadamard) product, and $\tanh$ is the hyperbolic tangent function. LSTM-2 mainly relies on the visual sentinel vector $s_t$ to generate nonvisual words (box G in Figure 2), so that the readability of the generated text can be effectively improved. Using the same update principle as LSTM-1, the hidden layer state $h_t^2$ and memory layer state $m_t^2$ of LSTM-2 at time $t$ are expressed as shown in the following formula:

$$h_t^2,\; m_t^2 = \mathrm{LSTM}_2\left(x_t^2,\; h_{t-1}^2,\; m_{t-1}^2\right)$$
In formula (4), $x_t^2$ is the input vector of the text generation module LSTM-2, which is composed of the context content vector $c_t$ and the hidden layer state $h_t^1$ of LSTM-1 (box F in Figure 2), that is, $x_t^2 = [c_t;\, h_t^1]$. Similarly, $h_{t-1}^2$ and $m_{t-1}^2$, respectively, represent the state of the hidden layer and the state of the memory layer of a unit in LSTM-2 at the previous moment $t-1$. For the context content vector $c_t$ (box E in Figure 2), its mathematical expression is shown in the following formula:

$$c_t = \sum_{i=1}^{k} \alpha_{t,i}\, v_i + \alpha_{t,k+1}\, s_t$$
In the formula, $\alpha_t$ represents the attention weight distribution at the current moment (with $\alpha_{t,k+1}$ the weight assigned to the sentinel), and $s_t$ is the visual sentinel. The attention weight distribution $\alpha_t$ is specifically computed as shown in the following formula:

$$\alpha_t = \mathrm{softmax}\left(w_h^{\top} \tanh\left(W_v\,[V;\, s_t] + W_g\, h_t^1\right)\right)$$
Among them, softmax is the normalized exponential function; $\tanh$ is the hyperbolic tangent function; $w_h$, $W_v$, and $W_g$ are weight parameter matrices obtained from model training. According to the updated current state $h_t^2$, the predicted probability distribution of the word generated at the current moment is calculated (box G in the figure) as shown in the following formula:

$$p_t = \mathrm{softmax}\left(W_p\, h_t^2\right)$$
Among them, $W_p$ is the weight parameter of the probability distribution calculation. The goal of the decoder optimization is to maximize the sum of the log-likelihoods of all training samples, as shown in the following formula:

$$\theta^{*} = \arg\max_{\theta} \sum_{(I,\, S)} \log p\left(S \mid I;\, \theta\right)$$
Among them, $\theta$ is the set of model training parameters, $\theta^{*}$ is the parameter set after model optimization, $I$ is an image in the input training set, and $S$ is the sentence annotated for the training image. The loss function of the decoder minimizes the cross-entropy (CE) loss, as shown in the following formula:

$$L_{CE} = -\sum_{t=1}^{LS} \log p\left(y_t \mid y_{1:t-1},\, I\right)$$
In the formula, $LS$ is the length of the sentence. When the generated sentence reaches the maximum generated sequence length, decoding is stopped, and the complete image description sentence is finally obtained (box H in the figure).
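For reference, below is a minimal NumPy sketch of the cross-entropy objective over one generated sequence, assuming per-step word distributions $p_t$ as above; it is a toy illustration, not the training code used in the paper.

```python
import numpy as np

def caption_cross_entropy(step_probs, target_ids):
    """Negative log-likelihood of a reference caption.

    step_probs: (T, vocab) array of predicted word distributions p_t,
    target_ids: (T,) indices of the ground-truth words y_t."""
    eps = 1e-12  # numerical safety for the logarithm
    picked = step_probs[np.arange(len(target_ids)), target_ids]
    return -np.sum(np.log(picked + eps))  # minimizing this maximizes log-likelihood
```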
3. Experiment and Result Analysis
3.1. Experimental Parameters and Environment Settings
In the experiment, we use the MSCOCO database, which originates from the image recognition and image description challenge held by Microsoft in 2014. The MSCOCO 2014 database contains images together with various manual annotations; for the image description problem, each image includes 5 sentences of text description. For the MSCOCO data set, the 123287 images are divided into a training set of 113287, a validation set of 5000, and a test set of 5000. When annotating text, <start> marks the beginning of an image description, <end> marks the end, and <UNK> marks unrecognized word groups in the image description. For text annotation, one-hot vector coding is used to generate a lower-dimensional matrix. A GeForce GTX 1080 Ti is used throughout the experiment; the video memory of the card is 11 GB, and the memory of the server is 32 GB. The software environment is TensorFlow, an open-source Python framework. The image size is 224 × 224, the number of iterations is 10000, and the vector dimension of the LSTM module is set to 1024. Adam is used to optimize the model, and the training learning rate is 0.005. In training, the batch size is set to 32; to avoid overfitting, dropout is set to 0.2 in the neural network. For the LSTM model parameters, the input dimension is set to 120, the hidden layer dimension is set to 1024, and the number of layers is set to 2. The parameter settings are shown in Table 1.
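The settings above can be mirrored in a small configuration block. The following sketch assumes a model built elsewhere and uses TensorFlow's Adam optimizer and cross-entropy loss, so it is a hedged illustration of the hyperparameters rather than the authors' training script.

```python
import tensorflow as tf

# Hyperparameters taken from the settings described above (Table 1).
IMAGE_SIZE = (224, 224)
BATCH_SIZE = 32
ITERATIONS = 10000
EMBED_DIM = 120        # LSTM input (word embedding) dimension
HIDDEN_DIM = 1024      # LSTM hidden layer dimension
NUM_LSTM_LAYERS = 2
DROPOUT_RATE = 0.2
LEARNING_RATE = 0.005

optimizer = tf.keras.optimizers.Adam(learning_rate=LEARNING_RATE)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
```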
4. Analysis and Comparison of Results
4.1. Comparison Test of Convolution Network
ResNet models that perform well in image classification tasks include ResNet-34, ResNet-50, and ResNet-101. In general, the more layers, the better the model and its effect, but a higher number of layers also leads to higher hardware requirements and longer training time. The experiment uses ResNet models pretrained on the ImageNet data set and selects the BLEU-1 and BLEU-4 evaluation indicators to evaluate the experimental results. The experimental results are shown in Table 2.
The BLEU-1 and BLEU-4 indicators evaluate the quality of the generated text; the larger the value, the better the generated effect. According to the experimental data in Table 2, the BLEU-1 and BLEU-4 results of ResNet-50, ResNet-101, and ResNet-152 are not much different, while the ResNet-101 and ResNet-152 models are more complex, place higher demands on the hardware, and take longer to train. Considering training cost, time consumption, and other factors, this experiment selects ResNet-50 as the image feature extractor. The comparison of model training time is shown in Table 3.
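For illustration, the two kinds of visual features AAM-DLSTM needs (a global vector and a grid of spatial local features) can be obtained from a pretrained ResNet-50 roughly as follows; the pooling and reshape choices are assumptions for this sketch, not necessarily the layers used in the paper.

```python
import tensorflow as tf

base = tf.keras.applications.ResNet50(include_top=False, weights="imagenet",
                                      input_shape=(224, 224, 3))
image = tf.keras.Input(shape=(224, 224, 3))
feature_map = base(image)                                            # (7, 7, 2048) grid
local_features = tf.keras.layers.Reshape((49, 2048))(feature_map)    # k = 49 regions
global_feature = tf.keras.layers.GlobalAveragePooling2D()(feature_map)  # (2048,)

feature_extractor = tf.keras.Model(image, [global_feature, local_features])
```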
4.2. Comparative Experiment of Classical Image Description Algorithms
From the perspective of model structure, in order to verify that modifying the model structure affects the accuracy of the image text description, the experiment sets up three variants of the AAM-DLSTM model for ablation comparison. The models used for comparison are the adaptive attention model, the AAM-TLSTM model, and the AAM-DLSTM-A model. To ensure a fair comparison, the experimental environment is kept consistent across the different variant models.
Adaptive Attention Model: the first model compared in this experiment is the adaptive attention model, which contains only one LSTM module. Compared with the AAM-DLSTM model, the original single LSTM replaces both the LSTM-1 visual information module and the LSTM-2 text generation module in AAM-DLSTM, so formula (3) is changed to formula (10) and formula (4) is changed to formula (11):

$$x_t = [w_t;\; v^g;\; c_t], \qquad h_t,\; m_t = \mathrm{LSTM}\left(x_t,\; h_{t-1},\; m_{t-1}\right)$$

where $h_t$ represents the hidden layer state of the LSTM module at the current moment, $m_t$ represents the memory layer state of the LSTM module at the current moment, $x_t$ represents the input of the single-layer LSTM module, $h_{t-1}$ represents the hidden layer state of the LSTM module at the previous moment, $m_{t-1}$ represents the memory layer state of the LSTM module at the previous moment, $w_t$ represents the word vector formed from the word generated at the previous moment, $v^g$ represents the global image feature vector, and $c_t$ represents the context content vector at the current moment.
AAM-TLSTM Model: the second model compared in this experiment is the AAM-TLSTM model. This model adds another LSTM layer on top of the AAM-DLSTM model to verify the influence of the number of LSTM layers on the effect of the image text description algorithm. The LSTM module added by AAM-TLSTM on the basis of AAM-DLSTM is named LSTM-3, and formulas (12) and (13) are added after formula (4):

$$x_t^3 = [c_t;\; h_t^2], \qquad h_t^3,\; m_t^3 = \mathrm{LSTM}_3\left(x_t^3,\; h_{t-1}^3,\; m_{t-1}^3\right)$$

where $h_t^3$ represents the hidden layer state of the LSTM-3 module, $m_t^3$ represents the memory layer state of the LSTM-3 module, $x_t^3$ is the input of the LSTM-3 module, $c_t$ is the context content vector, and $h_t^2$ is the hidden layer state of the LSTM-2 module. The corresponding algorithm framework is shown in Figure 3.

AAM-DLSTM-A: the third model compared in this experiment is the AAM-DLSTM-A model. This model mainly considers the influence of the input of the text generation module LSTM-2 on the final complete sentence when the word at time $t$ is generated. Compared with the AAM-DLSTM model, AAM-DLSTM-A modifies the input of the original LSTM-2 text generation module, and the modified expression is shown in the following formula:

$$x_t^2 = c_t$$

where $x_t^2$ represents the input of LSTM-2 and $c_t$ represents the context content vector at the current moment. The meaning of this formula is that text generation is affected only by the context content vector at the current moment. The algorithm framework is shown in Figure 4; the input of LSTM-2 is only the context content vector $c_t$.

Hard attention and soft attention were the first attention mechanisms proposed for image text description tasks, in 2015. Hard attention mainly uses reinforcement learning methods: instead of taking the hidden representations of all encoded regions as input, a stochastic process samples a single region as the input at each step. Soft attention computes deterministic scores over the encoded regions; the attention values form a probability distribution over the regions, and the corresponding word is predicted from the attention-weighted combination.
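The difference can be illustrated with a toy NumPy example: soft attention takes the expectation over all regions, while hard attention samples a single region from the same distribution (the scores here are random placeholders for what a learned scoring network would produce).

```python
import numpy as np

rng = np.random.default_rng(0)
V = rng.normal(size=(5, 8))                      # k = 5 local feature vectors, dim 8
scores = rng.normal(size=5)                      # relevance scores for this time step
alpha = np.exp(scores) / np.exp(scores).sum()    # attention probability distribution

soft_context = alpha @ V                         # soft: weighted sum over all regions
hard_region = rng.choice(len(alpha), p=alpha)    # hard: sample one region index
hard_context = V[hard_region]
```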
VD-SAN is a new type of dense semantic attention network proposed in 2019. This model mainly considers the correlation between image attributes and image feature extraction tasks and has achieved good results in image text description tasks.
AoANet was proposed in 2019. In the AoANet model, a module called AoA (Attention on Attention) is introduced; this module extends the conventional attention mechanism by adding an additional attention step, achieving a better effect in image text description.
NBT is an image description method using "slots" proposed in 2018. NBT uses a neural network model to extract sentence templates and classifies the word generated at the current moment. When a word is classified as a visual word, it is generated directly from the features of the corresponding image region; when it is classified as a text word, it is generated from the text lexicon. To a certain extent, this solves the problem that traditional template-filled sentences are dull.
ETN is a vision-language training model proposed in 2020, which can iteratively polish the generated description and effectively improve the readability of the description sentence.
In order to verify the effect of the AAM-DLSTM algorithm in the task of image text description, this experiment compares the AAM-DLSTM algorithm with multiple public algorithm models and uses multiple objective quantitative scoring methods to evaluate the test results. The scoring table is shown in Table 4.
BLEU (Bilingual Evaluation Understudy) is an auxiliary tool for bilingual translation quality evaluation. It analyzes the degree of co-occurrence of N-grams between the candidate translation and the reference translation. BLEU-1 and BLEU-4 correspond to N values of 1 and 4, respectively.
ROUGE-L is calculated based on recall and is an evaluation standard for automatic summarization tasks. L refers to the longest common subsequence; ROUGE-L is computed from the longest common subsequence between the machine-generated sentence C and the reference sentence S.
METEOR is a metric based on the weighted harmonic mean of single-word precision and single-word recall, and it aims to address some of the inherent flaws of the BLEU standard.
CIDEr combines BLEU with a vector space model. It regards each sentence as a document and then calculates the cosine similarity of TF-IDF vectors to obtain the similarity between the candidate sentence and the reference sentence.
SPICE encodes the objects, attributes, and relationships in the caption using a graph-based semantic representation.
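As an illustration of how such scores are computed, the snippet below evaluates BLEU-1 and BLEU-4 for a toy candidate sentence against two made-up references using NLTK; a full evaluation would typically use the official MSCOCO caption evaluation toolkit.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [["a", "man", "is", "riding", "a", "horse"],
              ["a", "person", "rides", "a", "brown", "horse"]]
candidate = ["a", "man", "rides", "a", "horse"]

smooth = SmoothingFunction().method1
bleu1 = sentence_bleu(references, candidate, weights=(1, 0, 0, 0))
bleu4 = sentence_bleu(references, candidate,
                      weights=(0.25, 0.25, 0.25, 0.25), smoothing_function=smooth)
print(f"BLEU-1: {bleu1:.3f}  BLEU-4: {bleu4:.3f}")
```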
Comparative experiments were carried out on the three AAM-DLSTM model variants mentioned above, and the experimental results are shown in Table 4. From the experimental results, it can be seen that increasing the number of LSTM layers can improve the effect of image text description. When the number of LSTM layers is 2, the trained model achieves good results on most evaluation indicators. When the number of LSTM layers is 1, the input information of the single-layer LSTM module in the adaptive attention model is reduced, which degrades the image text description. When the number of LSTM layers is 3, the input of LSTM-2 in the AAM-TLSTM model contains the hidden layer state of LSTM-1, and the input of LSTM-3 contains the hidden layer state of LSTM-2; during training, LSTM-3 has more reference information in its input when generating words, which enlarges the image region attended to at the current moment but reduces the accuracy of the generated words, so the description accuracy of the AAM-TLSTM model is lower than that of the AAM-DLSTM model. It can be seen that, in this experiment, stacking more layers does not linearly improve the accuracy of the model. When the number of LSTM layers is 2, the information input to the text generation module is richer than that of the single-layer LSTM module and less redundant than that of the three-layer model, so the accuracy of the generated image text description is higher. Comparing the data of the AAM-DLSTM model and the AAM-DLSTM-A model verifies that the image description effect is better when the input of the text generation module LSTM-2 is composed of the context content vector and the hidden layer state of LSTM-1. When the input of LSTM-2 is only the context content vector, LSTM-2 cannot take the hidden layer state of the LSTM-1 visual information module into account; when generating words, it relies only on the context content vector for prediction, which depends strongly on the spatial local features of the image, so the generated words tend to ignore the global features of the image, that is, the generated text description does not match the context in the image. Compared with soft attention, hard attention, NBT, and VD-SAN, the AAM-DLSTM model improves some indicators. Compared with the adaptive attention model with the unchanged structure, it improves BLEU-1 by 1.21%, METEOR by 0.75%, CIDEr by 0.55%, and SPICE by 1.96%, while its BLEU-4 and ROUGE-L indicators are not much different from those of adaptive attention. Compared with NBT, AAM-DLSTM is only slightly worse on BLEU-1, and the other indicators are better than those of the NBT model, which verifies that the AAM-DLSTM model can improve the accuracy of image text description. However, compared with the AoANet and ETN models, AAM-DLSTM's improvement in image text description is limited.
4.3. Comparison of Experimental Results
Although the evaluation indicators in the previous section can reflect the effect of the generated description sentences to a certain extent, they cannot reflect the context and accuracy of the actual image text description sentences. In the experiment, the adaptive attention model, the AAM-DLSTM model, and the ETN model are selected as comparison objects. The experiment uses images from the MSCOCO 2014 test set for a comparative experiment and shows the description effect (as shown in Figure 5).

Figure 5: (a)–(d) Example test images from MSCOCO 2014 with descriptions generated by the compared models.
As shown in Figure 5(a), the adaptive attention model uses the word "play" when describing the nonvisual content, while the AAM-DLSTM model only describes the action of the stick and ignores some details; the ETN model describes the main subject of the image more accurately, with more accurate descriptions of the visual word "baseball player" and the nonvisual word "swing." In Figure 5(b), the adaptive attention model describes the dog playing on the green but does not describe the playing object or the frisbee in its mouth; the AAM-DLSTM model does not capture the dog's playing movements and only describes the dog holding the frisbee on the grass; ETN describes both the visual word "frisbee" and the nonvisual word "play" very well. In Figure 5(c), the adaptive attention model is less accurate when describing the gathering of people, while the AAM-DLSTM model and ETN can effectively identify the crowd; the ETN model is better than AAM-DLSTM at capturing details and can describe the distribution of the people well. In Figure 5(d), the adaptive attention model ignores some visual words in the image, such as the padded jacket and the sled, and the AAM-DLSTM model has similar problems in its description. The above examples show that the AAM-DLSTM model is more accurate than the adaptive attention model and that the generated image descriptions are more in line with the context, but it is not as good as the ETN model, which can describe the details of the image more clearly. In summary, when generating text descriptions for images in the test set, the AAM-DLSTM model improves image description accuracy compared with the adaptive attention model and describes visual and nonvisual words better, but its description of details in the image is not as good as that of the ETN model. This also confirms that AAM-DLSTM does not describe some details in the image clearly and points out the optimization direction for subsequent research.
5. Conclusions
This paper first pointed out that the existing adaptive attention image description algorithm has a poor description effect, discussed the influence of different model structures on the image description effect, and proposed a two-layer LSTM image description algorithm based on the adaptive attention mechanism, AAM-DLSTM. AAM-DLSTM uses a two-layer LSTM model on top of the encoding-decoding framework; the two-layer model includes the visual information module LSTM-1, which captures the visual words in the image at the current moment, and the text generation module LSTM-2, which receives the context content vector and generates the description sentence. Comparative experiments on the MSCOCO 2014 data set verify that, compared with some classic image description algorithms, AAM-DLSTM improves some evaluation indicators and generates text descriptions that are more in line with the image context.
Data Availability
The (experimental) data used to support the findings of this study are included within the article.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This study was supported by the National Natural Science Foundation of China, Research on Adaptive Integrated Evolutionary Algorithm for Multi-Root Solving of Complex Nonlinear Equations (62076225).