Abstract

With the rapid development of short video, sports marketing has diversified and accurately detecting marketing videos has become more difficult. Identifying key images in a video is the core of such detection; once they are found, further analysis can effectively identify sports marketing videos. This paper studies video key image detection based on deep neural networks to address the problem that the boundaries of key images are unclear and hard to recognize in multiscene recognition. First, a key image detection model based on an attention feedback network is proposed, and ablation experiments are conducted on the easy test set of DAVSOD. The experimental results show that the proposed model performs better in both quantitative evaluation and visual quality and can accurately capture the overall shape of salient objects. A hybrid loss function is then introduced to sharpen the boundaries of key images, and the experiments show that the resulting model outperforms or is comparable to current state-of-the-art video salient object detection models in both quantitative evaluation and visual quality.

1. Introduction

Vision is the main channel through which humans receive information from the outside world; according to research in neuroscience, about 10^8 to 10^9 bytes of data enter the human eye every second [1]. The visual system copes with this flood of data because of the selective role of the visual attention mechanism, which allows it to ignore irrelevant information and attend to relevant information, much like separating the wheat from the chaff. In an Internet era where the amount of data is exploding, how to extract the information people care about from massive data while saving labor and material resources has attracted wide attention. Introducing attention mechanisms into data processing tasks and prioritizing the allocation of processing resources to the more critical information can therefore improve the efficiency of information processing [2–6].

In 1998, Itti et al. [7] proposed the first computational model of visual saliency, building on Koch et al.'s theory, the classical feature integration theory of cognitive psychology [8], and the guided search model [9]. The algorithm contains three main steps: first, three primary visual features, color, luminance, and orientation, are extracted, and the corresponding saliency features are computed at multiple scales using center-surround contrast (saliency feature extraction); second, the feature maps are normalized and then combined (feature fusion); finally, the salient targets in the image are labeled using a winner-take-all (WTA) mechanism. The algorithm has had a significant impact on subsequent research on computational models of visual saliency in computer vision; in particular, mainstream saliency detection algorithms followed a similar framework until deep learning techniques came into large-scale use.
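
As a rough illustration of the center-surround contrast step (a simplified difference-of-Gaussians stand-in, not the exact multiscale pyramid of [7]), the sketch below computes contrast for a single feature channel; the scale pairs and function name are illustrative assumptions.

import numpy as np
from scipy.ndimage import gaussian_filter

def center_surround_contrast(feature_map, scales=((1, 4), (2, 8), (4, 16))):
    # Difference-of-Gaussians approximation of center-surround contrast for one
    # feature channel (e.g., luminance): a finely blurred "center" minus a coarsely
    # blurred "surround", accumulated over several scale pairs.
    feature_map = feature_map.astype(float)
    saliency = np.zeros_like(feature_map)
    for sigma_center, sigma_surround in scales:
        center = gaussian_filter(feature_map, sigma_center)
        surround = gaussian_filter(feature_map, sigma_surround)
        saliency += np.abs(center - surround)
    # Normalize to [0, 1] so the map can be fused with the other feature channels.
    return (saliency - saliency.min()) / (saliency.max() - saliency.min() + 1e-8)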

Early image salient object detection models [10] were mainly bottom-up, relying on different low-level visual features such as color and edges. Since salient object detection is closely related to the eye fixation prediction task and both model the human visual attention mechanism, early salient object detection models also borrowed basic theories of human visual attention, including the classical contrast assumption and the center-surround assumption. For example, both assumptions were used by Liu et al. [11] and Achanta et al. [12], and a similar assumption was adopted by Cheng et al. [10], who considered color contrast information at both local and global scales; that algorithm was concise and straightforward and received wide attention from the academic community. In addition, Yan et al. [13] proposed to build appearance-consistent image representations at multiple scales by over-segmenting the image at different scales, and to extract and fuse salient features across scales to obtain the final salient object detection result. Visual center bias is another commonly used hypothesis derived from human attention mechanisms [13]; it is based on the observation that the human visual system tends to assign higher attention weights to the center of a scene. A later popular hypothesis is the background prior, proposed by Wei et al. [14] in 2012. Unlike the center-surround hypothesis and the visual center bias hypothesis, which attempt to define "what is more likely to be the salient region," the background prior attempts to define "what is more likely to be the background," based on the observation that in most scenes the regions near the image borders have a higher probability of belonging to the background. It can be regarded as a further development of the visual center bias. Before the large-scale application of deep learning, the background prior was the most effective assumption in saliency detection, and most high-performing models [15–19] were built on it. These works focused on further improving the reliability of the background prior and on applying more advanced one-class classifiers: since the background prior effectively provides samples of one class (the background), the problem can be treated as one-class classification given only that class of samples.

With the great success of deep learning in image classification, research on salient object detection has gradually shifted to deep learning-based models. Early deep models used learned features as a more effective saliency representation and trained classification networks containing fully connected layers. Lee et al. [20] used deep features as high-level information and Gabor filter responses and color histograms as low-level features, fusing saliency information at different levels for saliency prediction. These models achieve better performance but have drawbacks: the fully connected classification layers introduce a large number of parameters and lose spatial information, and the need to classify every superpixel or object proposal as salient or non-salient makes the algorithms computationally expensive.

With the rise of fully convolutional neural networks, recent deep learning-based salient object detection work has used or adapted fully convolutional networks for pixel-level saliency prediction. Some work [21], inspired by pixel-level semantic segmentation, proposed fusing features from different neural network layers for salient object detection. Because the shallower layers of a deep network retain more fine-grained low-level visual features while the deeper layers extract higher-level semantic features, fusing features from different layers preserves the original low-level spatial information while also capturing high-level semantics. The main research focus of current deep learning-based salient object detection is to explore more efficient network structures that retain more spatial detail. Wang et al. [22] proposed the ASNet model, which detects visually salient objects guided by a visual attention prior. The model treats visual attention as a high-level understanding of the whole scene, learned in the higher network layers, and regards salient object detection as a finer-grained, object-level saliency task for which visual attention provides top-down guidance. ASNet is built on stacked convolutional long short-term memory networks, whose recurrent structure can iteratively refine the saliency detection results. This work deepens the understanding of the visual attention mechanism and reveals the correlation between salient object detection and eye fixation detection. Overall, deep learning-based salient object detection models achieve much better performance than traditional models [23–26].

In response to this research status, this paper investigates video salient object detection based on deep neural networks, with the goal of extracting richer spatial saliency information and better capturing the overall shape of salient objects. An attention feedback network-based video salient object detection model is proposed. To obtain clearer boundaries, a new hybrid loss function is further introduced on top of this model.

2. Deep Neural Networks

2.1. Recurrent Neural Network

When people read or watch a video, they perceive and understand the current content based on the text or images they have already observed; they do not completely forget what came before and start from a blank slate. Traditional feedforward neural networks cannot predict saliency in later frames based on the salient object regions of previous frames. Recurrent neural networks give the network memory, and their structure is shown in Figure 1. Assuming {Xt}, t = 0, …, T, is a sequence of inputs over (T + 1) time steps and {Ht} is the corresponding sequence of network outputs, the network N receives at time step t not only Xt but also the hidden state from time step t − 1; that is, the network processes the current input with reference to its previous memory.
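
As a minimal illustration of this recurrence (a sketch, not the network used in this paper), the hidden state below is updated from both the current input and the previous hidden state; the dimensions and initialization are arbitrary.

import numpy as np

def rnn_step(x_t, h_prev, W_x, W_h, b):
    # One recurrent step: the new hidden state depends on the current input x_t
    # and on the previous hidden state h_prev, which carries the network's memory.
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

# Process a short sequence: each output implicitly depends on all earlier inputs
# because the hidden state is passed from step to step.
rng = np.random.default_rng(0)
input_dim, hidden_dim, steps = 8, 16, 5
W_x = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
W_h = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
b = np.zeros(hidden_dim)
h = np.zeros(hidden_dim)
for _ in range(steps):
    x_t = rng.normal(size=input_dim)
    h = rnn_step(x_t, h, W_x, W_h, b)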

However, when the video sequence is long, the gap between the current frame and the related frame may be large, and the RNN may lose its memory of distant frames because of problems such as vanishing gradients. To address this long-term dependence problem, Hochreiter et al. [27] proposed the long short-term memory (LSTM) network, shown in Figure 2, whose three stages are the forgetting stage, the cell-state update stage, and the output stage.

Each of the three stages contains a sigmoid layer that maps its input into the range [0, 1] and then, through an element-wise multiplication, selectively keeps useful information and forgets useless information.

The forgetting stage selectively keeps useful information and forgets useless information. The current input xt is concatenated with the hidden state ht−1 of the previous moment and denoted Jt, where ⊕ denotes the concatenation operation:

Jt = xt ⊕ ht−1

A sigmoid layer then maps Jt into [0, 1] to obtain the forget gate ft, where Wf and bf denote the weights and bias vector of that layer and σ denotes the sigmoid operation:

ft = σ(Wf·Jt + bf)

An element-wise multiplication (⊙) is then performed with the previous cell state Ct−1, selectively keeping useful information and forgetting useless information; the resulting intermediate cell state is ft ⊙ Ct−1.

The cell-state update stage lets the cell state selectively absorb relevant information from Jt. Jt passes through a sigmoid layer to generate the input gate it:

it = σ(Wi·Jt + bi)

The feature obtained by passing Jt through a tanh layer, multiplied element-wise by it, is the information to be added to the cell state; adding it to the intermediate state obtained in the forgetting stage yields the new cell state:

C̃t = tanh(Wc·Jt + bc), Ct = ft ⊙ Ct−1 + it ⊙ C̃t

The output stage controls what information is output at the current moment. Jt is fed into a sigmoid layer to obtain the output gate Ot:

Ot = σ(Wo·Jt + bo)

Ot is multiplied element-wise by the feature obtained by passing the current cell state Ct through a tanh layer, yielding the output at the current moment:

Ht = Ot ⊙ tanh(Ct)
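
To make the three stages concrete, the following is a minimal numpy sketch of one LSTM step under the standard formulation, with Jt as the concatenation of xt and ht−1; the helper names and weight shapes are illustrative rather than taken from any particular implementation.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W_f, b_f, W_i, b_i, W_c, b_c, W_o, b_o):
    # Jt: concatenation of the current input and the previous hidden state.
    j_t = np.concatenate([x_t, h_prev])
    # Forgetting stage: the forget gate ft filters the previous cell state.
    f_t = sigmoid(W_f @ j_t + b_f)
    # Update stage: the input gate it decides how much candidate information to absorb.
    i_t = sigmoid(W_i @ j_t + b_i)
    c_tilde = np.tanh(W_c @ j_t + b_c)
    c_t = f_t * c_prev + i_t * c_tilde
    # Output stage: the output gate Ot controls what part of the cell state is emitted.
    o_t = sigmoid(W_o @ j_t + b_o)
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t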

2.2. Loss Function

When performing pixel-level salient object detection, the task can be viewed as a binary classification problem, where pixels belonging to the salient object are labeled 1 and pixels belonging to the background are labeled 0. Let yi denote the label of sample xi (the desired output) and ŷi denote the predicted probability that yi = 1 given xi.

Accordingly, 1 − ŷi denotes the predicted probability that yi = 0 given sample xi.

The probability of yi given xi can be written as P(yi | xi); from the maximum-likelihood perspective, it takes the form

P(yi | xi) = ŷi^yi · (1 − ŷi)^(1 − yi)

Considering the two cases yi = 0 and yi = 1 and taking the logarithm, and since a smaller loss is preferable while the logarithm of a probability is nonpositive, the negative is taken, giving the binary cross-entropy loss:

L = −[yi·log ŷi + (1 − yi)·log(1 − ŷi)]
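
A minimal numpy sketch of this loss applied to a whole saliency map, assuming the prediction is a per-pixel probability map; the clipping constant only guards against log(0).

import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    # y_true: 0/1 salient-object mask; y_pred: predicted probability per pixel.
    y_pred = np.clip(y_pred, eps, 1.0 - eps)  # avoid log(0)
    return float(np.mean(-(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))))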

2.3. Feedback Network

To reduce the loss of necessary visual saliency information caused by repeated stride and pooling operations and to learn richer static saliency information, AFNet is used as the backbone of the static saliency module. In Figure 3, Stimuli denotes the input image frames; the encoding and decoding networks each consist of the five convolutional blocks of VGG16 (denoted Ei and Di, respectively, i ∈ {1, 2, 3, 4, 5}), and the information transfer between corresponding convolutional blocks is controlled by the attention feedback module.

3. Design of Deep Neural Network

3.1. Feedback Network Detection Model

The NHM model is proposed to capture richer spatial saliency information and thus better capture the overall shape of key images. NHM uses the attention feedback network as the backbone of its static saliency module to reduce the loss of visual saliency information caused by scale-space issues and to guide the correct fusion of multiscale features from coarse to fine. The multiscale feature maps extracted from the five decoding blocks of the attention feedback network are then fused and fed to a pyramid dilated convolution module to retain more spatial saliency information. Temporal saliency information is then captured by a saliency-shift-aware convolutional long short-term memory network that models the transfer of attention, and finally the model parameters are optimized by iteratively reducing the value of the loss function. The algorithm thus has three parts: extraction of multiscale spatial features, integration of spatiotemporal saliency information, and loss minimization.

To mitigate negative effects such as the loss of visual information caused by the scale-space problem, the backbone of the static saliency detection module consists of AFNet connected to a PDC module. AFNet is a fully convolutional encoder-decoder network: its encoding and decoding paths each consist of five convolutional blocks, denoted Ei and Di with i ∈ {1, 2, 3, 4, 5}, and each encoder block passes its saliency information through the attention feedback module in AFNet to the corresponding decoder block. The feedback module uses a two-step iterative learning scheme, with time steps denoted t ∈ {1, 2}; by simulating a feedback mechanism that multiplies a ternary map pixel by pixel with the obtained feature map, it helps correct inaccurate predictions produced by the earlier pass and thus capture the overall shape of the key object. For global spatial saliency detection, AFNet uses a global perception module to overcome the problem that fully connected operations ignore local information and generate redundant data. A multiscale segmentation strategy divides the feature map into 4, 16, and 36 parts, which are then stacked and reorganized for a global convolution operation, making full use of both global and intraregional saliency information.
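
As a rough illustration of the multiscale segmentation idea (not AFNet's actual global perception module), the sketch below divides a feature map into 2 × 2, 4 × 4, and 6 × 6 grids, i.e., 4, 16, and 36 regions, and pools each region into a compact descriptor; the PyTorch calls are standard, but the module structure is an assumption.

import torch
import torch.nn.functional as F

def multiscale_region_descriptors(features, grids=(2, 4, 6)):
    # Divide an (N, C, H, W) feature map into 2x2, 4x4, and 6x6 grids of regions
    # (4, 16, and 36 parts), pool each region, and concatenate the region descriptors.
    descriptors = []
    for g in grids:
        pooled = F.adaptive_avg_pool2d(features, output_size=(g, g))  # (N, C, g, g)
        descriptors.append(pooled.flatten(start_dim=2))               # (N, C, g*g)
    return torch.cat(descriptors, dim=2)                              # (N, C, 4+16+36)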

Detecting key images in dynamic scenes directly with an image key object detection model raises several problems. First, image-based key object detection can only capture spatial differences such as color contrast, orientation contrast, and brightness contrast, whereas in dynamic scenes the temporal factor is usually an important cue for saliency. Second, detecting each frame individually, without reference to the saliency information contained in previous frames, can be highly incoherent, because the target and background may differ markedly in appearance across frames, leading to inconsistent detection results between frames. Finally, video content often contains significant redundancy, since consecutive frames must be similar enough to provide a smooth viewing experience, and simply ignoring this redundancy leads to higher computational cost. VSOD therefore needs to consider both temporal and spatial saliency information, so a dynamic saliency detection module is used to integrate them. To better simulate the perceptual behavior of the human visual system, temporal saliency information is learned and the process of attention transfer is captured; this paper uses SSLSTM as the dynamic saliency detection module, combining the strong spatiotemporal feature extraction capability of ConvLSTM with an attention transfer mechanism.
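
Since SSLSTM builds on ConvLSTM, the following is a minimal ConvLSTM cell sketch in PyTorch; the attention-transfer mechanism of SSLSTM is omitted, and the channel sizes, kernel size, and class name are illustrative assumptions.

import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    # Minimal convolutional LSTM cell: the gates are computed with convolutions, so the
    # hidden state keeps its spatial layout, which suits per-pixel saliency maps.
    def __init__(self, in_channels, hidden_channels, kernel_size=3):
        super().__init__()
        self.gates = nn.Conv2d(in_channels + hidden_channels, 4 * hidden_channels,
                               kernel_size, padding=kernel_size // 2)

    def forward(self, x_t, state):
        h_prev, c_prev = state
        gates = self.gates(torch.cat([x_t, h_prev], dim=1))
        i_t, f_t, o_t, g_t = torch.chunk(gates, 4, dim=1)
        i_t, f_t, o_t = torch.sigmoid(i_t), torch.sigmoid(f_t), torch.sigmoid(o_t)
        c_t = f_t * c_prev + i_t * torch.tanh(g_t)
        h_t = o_t * torch.tanh(c_t)
        return h_t, c_t

# Usage: feed per-frame spatial saliency features through the cell frame by frame.
cell = ConvLSTMCell(in_channels=64, hidden_channels=32)
h = torch.zeros(1, 32, 56, 56)
c = torch.zeros(1, 32, 56, 56)
for x_t in torch.randn(5, 1, 64, 56, 56):   # 5 frames of features
    h, c = cell(x_t, (h, c))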

Deep neural networks are optimized by iteratively minimizing the loss function. The loss function measures the difference between the value predicted by the model and the true value, and the weights of the network are updated by gradient descent.

The meaning of each symbol is shown in Table 1. Because video salient object detection datasets contain relatively few human eye fixation annotations, lt is used to indicate whether a dataset contains fixation annotations; when it does not, the loss function omits the lt·At term, and that error is not back-propagated.

3.2. Loss Function Design

A novel hybrid loss function is proposed based on the boundary enhancement loss. It consists of the loss La of the predicted attention-perception feature map, the loss of the final key object prediction, and the loss of the final predicted target boundary, where ω1 and ω2 are the weighting parameters that control the object-level loss and the object-boundary loss, respectively, and ω1 : ω2 is set to 1 : 10 to emphasize learning of the target boundary.

Part of the training data does not contain human eye fixation annotations, so the loss La of the predicted attention-perception feature map is split into two cases: when fixation annotations are available, La is computed against the human eye fixation annotations; otherwise, La is computed against the salient object annotations. St denotes the final key object prediction result, and Mt denotes the object-level annotation of the key object; the object-level loss is then computed between St and Mt.
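
A minimal numpy sketch of one way these pieces could be combined, assuming each component is a pixel-wise binary cross-entropy and that the total loss is La plus the ω-weighted object-level and boundary losses; the function and argument names (including the has_fixation flag standing in for lt) are illustrative, not the paper's implementation.

import numpy as np

def bce(y_true, y_pred, eps=1e-7):
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return float(np.mean(-(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))))

def hybrid_loss(pred_attention, pred_object, pred_boundary,
                gt_fixation, gt_object, gt_boundary,
                has_fixation=True, w1=1.0, w2=10.0):
    # La: loss of the predicted attention-perception feature map. When the clip has no
    # human eye fixation annotation, La is computed against the salient-object mask instead.
    gt_att = gt_fixation if has_fixation else gt_object
    loss_attention = bce(gt_att, pred_attention)
    # Object-level loss between the final prediction St and the annotation Mt.
    loss_object = bce(gt_object, pred_object)
    # Boundary loss between the predicted and ground-truth object boundaries.
    loss_boundary = bce(gt_boundary, pred_boundary)
    # w1 : w2 = 1 : 10 emphasizes learning of the target boundary.
    return loss_attention + w1 * loss_object + w2 * loss_boundary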

The average pooling operation can be used to extract smooth boundaries: to extract the boundary B(X) of an image X, take the absolute value of the difference between X and its average-pooled version, B(X) = |X − AvgPool(X)|. The final predicted target boundary loss is then computed between the boundary of the prediction and the boundary of the ground-truth annotation.
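
A short PyTorch sketch of this boundary operator; the 3 × 3 pooling window is an assumption, and the function name is illustrative.

import torch
import torch.nn.functional as F

def extract_boundary(saliency_map, kernel_size=3):
    # B(X) = |X - AvgPool(X)|: average pooling smooths the map, and the absolute
    # difference from the original highlights the object boundary.
    # saliency_map: (N, 1, H, W) tensor of per-pixel probabilities.
    smoothed = F.avg_pool2d(saliency_map, kernel_size, stride=1, padding=kernel_size // 2)
    return torch.abs(saliency_map - smoothed)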

On the basis of NHM, this hybrid loss function for capturing clear boundaries is added. The loss is built on the boundary enhancement loss and is composed of the losses of the attention-perception feature map predicted by the model, the final key image prediction, and the predicted key image boundary. The resulting model is denoted LNSM.

4. Experiments and Results

4.1. Experimental Design

The experiments were run on an Nvidia GTX 1080 Ti GPU. They were implemented in Python on the Caffe deep learning framework, and Matlab was used for the quantitative evaluation. The training sets of DAVIS, DAVSOD, and FBMS and the validation set of DAVSOD were used to train the proposed model. The weights of the network were initialized from the AFNet model, one video was processed per batch, and the number of time steps processed by the ConvLSTM layer was set to 3. The training proceeded as follows: first, the static saliency module was pretrained with a base learning rate of 10^−9; then, the entire model was trained with the learning rate of the dynamic saliency module set to 10^−8 and that of the static saliency module set to 10^−10; finally, the static module weights were fixed and the dynamic module was fine-tuned with a learning rate of 10^−10. Training the LNSM model took 32 hours and 64 k iterations.
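
For reference, the staged schedule above can be summarized as a small configuration sketch; the stage and key names are illustrative, and only the learning rates, batch handling, and time-step count come from the text.

# Staged training schedule described above, written as a plain configuration sketch.
training_stages = [
    {"stage": "pretrain_static_saliency_module", "learning_rate": {"static": 1e-9}},
    {"stage": "train_full_model",                "learning_rate": {"dynamic": 1e-8, "static": 1e-10}},
    {"stage": "finetune_dynamic_module",         "learning_rate": {"dynamic": 1e-10}, "frozen": ["static"]},
]
convlstm_time_steps = 3   # number of time steps processed by the ConvLSTM layer
videos_per_batch = 1      # one video processed per batch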

4.2. Comparison with Other Models

In this paper, the proposed LNSM is compared with three advanced video salient object detection models, MBNM, PDBM, and SSAV, on datasets created specifically for the VSOD task (the entire ViSal and UVSD datasets, the test set of VOS, and the easy test set of DAVSOD); the quantitative results are shown in Table 1. Table 1 shows that the three metrics of the proposed model are better than those of the other models on the DAVSOD and ViSal datasets. In particular, on the easy test set of DAVSOD, the pixel-level F-measure, the mean absolute error, and the structure measure, which captures overall structural differences, improve on SSAV by 0.06, 0.03, and 0.064, respectively; competitive performance is also achieved on the other datasets. Moreover, ViSal was the first test benchmark designed specifically for video key object detection, and the DAVSOD dataset accounts for the shift and selectivity of visual attention in its annotations, so it represents the real attention behavior of the human visual system in dynamic scenes; these two datasets are highly representative. The experimental results show that the LNSM model performs well both on datasets created specifically for VSOD and on the DAVSOD dataset, which labels key images according to human eye fixations.

5. Conclusion

This paper studies key image detection based on deep neural networks to support the detection of sports marketing videos. For multiscene detection, a feedback network-based video key image detection model and a hybrid loss function are proposed to solve the key image detection problem. The proposed LNSM model is compared, in both quantitative evaluation and visualization, with three state-of-the-art models on six representative datasets. The quantitative results show that LNSM outperforms the other advanced models in all three evaluation metrics on the DAVSOD and ViSal datasets and achieves performance comparable to other models on the remaining widely used datasets.

Data Availability

The dataset can be accessed upon request.

Conflicts of Interest

The authors declare that there are no conflicts of interest.