Video captioning with stacked attention and semantic hard pull

PeerJ Computer Science

Introduction

The incredible success in the image captioning domain has led researchers to explore similar avenues such as video captioning. Video captioning is the process of describing a video with a complete and coherent caption using Natural Language Processing. The core mechanism of video captioning is based on the sequence-to-sequence architecture (Gers, Schmidhuber & Cummins, 2000). In video captioning models, the encoder encodes the visual stream and the decoder generates the caption. Such models are capable of retaining both the spatial and the temporal information that is essential for generating semantically correct video captions. This requires the video to be split into a sequence of frames. The model uses these frames as input and generates a series of meaningful words in the form of a caption as output. Figure 1 shows an example of a video captioning task.


Figure 1: Video captioning task.

Video captioning has many applications, for example, human-machine interaction, aid for people with visual impairments, video indexing, information retrieval, and fast video retrieval. Unlike image captioning, where only spatial information is required to generate captions, video captioning requires a mechanism that combines spatial information with temporal information to store both the higher-level and the lower-level features needed to generate semantically sensible captions. Although there is good work in this field, there is still plenty of room for investigation. One of the main opportunities is improving the ability of models to extract high-level features from videos to generate more meaningful captions. This paper primarily focuses on this aspect.

In this paper, we propose a novel architecture based on the seq2seq model proposed by Venugopalan et al. (2015). It builds on this work following the guidelines laid out by preceding literature and aims to show a possible direction for future research. The goal of our model is to encode a video (presented as a sequence of images) in order to extract information from it and to decode the encoded data to generate a sentence (presented as a sequence of words). On the encoder side, along with the bi-directional LSTM layers, our model uses a combination of two novel methods: a variation of dual attention (Nam, Ha & Kim, 2017), namely Stacked Attention, and a novel information extraction method, namely Spatial Hard Pull. The Stacked Attention network prioritizes the objects in the video layer by layer. To counter the loss of nearly identical information across nearby frames in the LSTM layers, we introduce the Spatial Hard Pull layer. On the decoder side, we employ a sequential decoder with a single-layer LSTM and a fully connected layer to generate a word from a given context produced by the encoder.

Most text generation architectures use BLEU (Papineni et al., 2002) as the scoring metric. However, due to its inability to consider recall, a few variations, including ROUGE (Lin, 2004) and METEOR (Banerjee & Lavie, 2005), have been introduced. Although these automatic scoring metrics have been modified in different ways to give more meaningful results, they have their shortcomings (Kilickaya et al., 2016; Aafaq et al., 2019b). On top of that, to the best of our knowledge, no scoring metric designed solely for video captioning is available. Some relevant works (Xu et al., 2017; Pei et al., 2019) have used human evaluation. To get a better understanding of the captioning capability of our model, we perform a qualitative analysis based on human evaluation and propose our own metric for video captioning, the "Semantic Sensibility Score" or "SS Score" for short.

Related Works

For the past few decades, much work has been conducted on analysing videos to extract different forms of information, such as, sports-feature summary ( Shih, 2017 ; Ekin, Tekalp & Mehrotra, 2003 ; Ekin & Tekalp, 2003 ; Li & Sezan, 2001 ), medical video analysis ( Quellec et al., 2017 ), video finger-print ( Oostveen, Kalker & Haitsma, 2002 ) and other high-level features ( Chang et al., 2005 ; Divakaran, Sun & Ito, 2003 ; Kantorov & Laptev, 2014 ). These high-level feature extraction mechanisms heavily relied on analyzing each frame separately and therefore, could not retain the sequential information. When the use of memory retaining cells like LSTM ( Gers, Schmidhuber & Cummins, 2000 ) became computationally possible, models were only then capable of storing meaningful temporal information for complex tasks like caption generation ( Venugopalan et al., 2015 ). Previously, caption generation was mostly treated with template based learning approaches ( Kojima, Tamura & Fukunaga, 2002 ; Xu et al., 2015 ) or other adaptations of statistical machine translation approach ( Rohrbach et al., 2013 ).

Sequence-to-sequence architecture for video captioning

A video is a sequence of frames and the output of a video captioning model is a sequence of words, so video captioning can be classified as a sequence-to-sequence (seq2seq) task. Sutskever, Vinyals & Le (2014) introduce the seq2seq architecture, in which the encoder encodes an input sentence and the decoder generates a translated sentence. After the remarkable results of the seq2seq architecture on different seq2seq tasks (Shao et al., 2017; Weiss et al., 2017), it is only intuitive to leverage this architecture in video captioning works such as Venugopalan et al. (2015). In recent years, different variations of the base seq2seq architecture have been widely used, e.g., hierarchical approaches (Baraldi, Grana & Cucchiara, 2017; Wang et al., 2018a; Shih, 2017), variations of GANs (Yang et al., 2018), and boundary-aware encoder approaches (Shih, 2017; Baraldi, Grana & Cucchiara, 2017).

Attention in sequence-to-sequence tasks

In earlier seq2seq literature (Venugopalan et al., 2015; Pan et al., 2017; Sutskever, Vinyals & Le, 2014), the decoder cells generate the next word from the context of the preceding word and the fixed output of the encoder. As a result, the overall context of the encoded information was often lost and the generated output became highly dependent on the last hidden cell state. The introduction of the attention mechanism (Vaswani et al., 2017) paved the way to solve this problem. The attention mechanism enables the model to store the context from the start to the end of the sequence, allowing it to focus on certain input elements at each stage of output sequence generation (Bahdanau, Cho & Bengio, 2014; Luong, Pham & Manning, 2015). Luong, Pham & Manning (2015) proposed a combined global-local attention mechanism for translation models: the global attention scheme attends to the whole input at a time, while the local attention scheme attends to a part of the input at a given time. Work on video captioning was enhanced by these ideas. Bin et al. (2018) describe a bidirectional LSTM model with attention for producing a better global contextual representation as well as prolonging the longevity of all the contexts to be recognized. Gao et al. (2017) build a hierarchical decoder with a fused GRU; their network combines a hierarchical GRU based on semantic information, a GRU with semantic-temporal attention, and a multi-modal decoder. Ballas et al. (2016) proposed to leverage the frame spatial topology by introducing an approach to learn spatio-temporal features in videos from intermediate visual representations using GRUs. Similarly, several other variations of attention exist, including multi-faceted attention (Long, Gan & De Melo, 2018) and multi-context fusion attention (Wang et al., 2018a). All of these papers use one attention at a time, which limits the information available to the respective models. Nam, Ha & Kim (2017) introduce a mechanism to use multiple attentions; with their dual attention mechanism, they retain visual and textual information simultaneously. Zhang et al. (2020) achieved commendable scores by proposing an object relational graph (ORG) based encoder that captures more detailed interaction features and by designing a teacher-recommended learning (TRL) method to integrate abundant linguistic knowledge into the captioning model.

Methodology

As shown in Fig. 2 , this paper proposes a novel architecture that uses a combination of stacked-attention (see Fig. 3 ) and spatial-hard-pull on top of a base video-to-text architecture to generate captions from video sequences. This paper refers to this architecture as Semantically Sensible Video Captioning (SSVC).


Figure 2: Proposed model with stacked attention and spatial hard pull.


Figure 3: Diagram of stacked attention.

Data pre-processing and representation

The primary input of the model is a video sequence. The data pre-processor converts a video clip into a usable video sequence of 15 frames before passing it to the actual model. Each converted video sequence contains 15 frames separated by equal time gaps. The primary output of the model is a sequence of words, which are stacked to produce the required caption.
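The paper does not include the pre-processing code, but the sampling scheme it describes (15 equally spaced frames per clip) can be sketched as follows; the use of OpenCV and the 256 × 256 resize are assumptions made only for illustration.

```python
import cv2
import numpy as np

def sample_frames(video_path, num_frames=15, size=(256, 256)):
    """Sample `num_frames` equally spaced frames from a video clip."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Indices spread evenly across the whole clip (equal time gaps).
    indices = np.linspace(0, max(total - 1, 0), num_frames).astype(int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(cv2.resize(frame, size))
    cap.release()
    return np.stack(frames)  # shape: (num_frames, 256, 256, 3)
```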

Visual feature extraction

A video is nothing but a sequence of frames, where each frame is a 2D image with n channels. In sequential architectures, either the frames are directly passed into ConvLSTM (Xingjian et al., 2015) layer(s), or the frames are individually passed through a convolutional block and then into LSTM (Gers, Schmidhuber & Cummins, 2000) layer(s). Due to our computational limitations, our model uses the latter option. Like "Sequence to Sequence - Video to Text" (Venugopalan et al., 2015), our model uses a pre-trained VGG16 model (Simonyan & Zisserman, 2014) and extracts the fc7 layer's output. This CNN converts each (256 × 256 × 3) frame into a (1 × 4096) vector. These vectors are the primary inputs of our model.
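As a rough sketch of this step, the fc7 features can be pulled from Keras' pre-trained VGG16, whose second fully connected layer is exposed under the name "fc2"; the resize to 224 × 224 (VGG16's stock input size) and the exact pipeline details are assumptions, not the authors' code.

```python
import tensorflow as tf
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input

# Keras exposes the classic VGG16 fc7 layer under the name "fc2".
base = VGG16(weights="imagenet", include_top=True)
fc7_extractor = tf.keras.Model(inputs=base.input,
                               outputs=base.get_layer("fc2").output)

def frames_to_features(frames):
    """frames: (15, H, W, 3) uint8 array -> (15, 4096) feature matrix."""
    frames = tf.image.resize(frames, (224, 224))        # VGG16's stock input size
    frames = preprocess_input(tf.cast(frames, tf.float32))
    return fc7_extractor.predict(frames, verbose=0)      # one 4096-d vector per frame
```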

Textual feature representation

Each video sequence has multiple corresponding captions, and each caption has a variable number of words. In our model, to create captions of equal length, all captions are padded with "pad" markers. The "pad" markers create uniformity in the data structure and do not change the output, as they are omitted during the conversion of tokenized words to complete sentences. A start marker and an end marker denote the start and end of each caption. The entire text data is tokenized, and each word is represented by a one-hot vector of shape (1 × uniquewordcount). So, a caption with m words is represented by a matrix of shape (m × 1 × uniquewordcount). Instead of using these one-hot vectors directly, our model embeds each word into a vector of shape (1 × embeddingdimension) with a pre-trained embedding layer. The embedded vectors are positioned in the vector space according to the semantic relationships of the corresponding words.
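A minimal sketch of this text pipeline, assuming the Keras tokenizer and a pre-built GloVe weight matrix (left here as a commented placeholder), might look like the following; integer word ids are used in place of explicit one-hot vectors, which the embedding layer treats equivalently.

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Embedding

captions = ["start a man is playing a guitar end",
            "start a dog runs in the park end"]        # toy captions with start/end markers

tokenizer = Tokenizer(oov_token="unk")                 # maps each word to an integer id
tokenizer.fit_on_texts(captions)
sequences = tokenizer.texts_to_sequences(captions)
padded = pad_sequences(sequences, padding="post")      # zero-padding plays the role of "pad"

vocab_size = len(tokenizer.word_index) + 1
embedding = Embedding(input_dim=vocab_size,
                      output_dim=100,                  # GloVe 100-d vectors
                      # weights=[glove_matrix],        # pre-built GloVe matrix (placeholder)
                      trainable=False)
word_vectors = embedding(padded)                       # (2, max_len, 100) embedded captions
```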

Base architecture

Like most sequence-to-sequence models, our base architecture consists of a sequential encoder and a sequential decoder. The encoder converts the sequential input vectors into contexts, and the decoder converts those contexts into captions. This work proposes an encoder with two LSTM layers and stacked attention. The mechanism that stacks attention and the mechanism that pulls spatial information from the input vectors are the two novel concepts in this paper and are discussed in detail in later sections. The purpose of the hard-pull layer is to bring superior extraction capabilities to the model: since the rest of the model relies on time-series information, the hard-pull layer is necessary for combining information from separate frames and extracting general information. The purpose of stacking attention layers is to attain higher-quality temporal information retrieval.

Multi-layered sequential encoder

The proposed method uses a time-distributed fully connected layer followed by two consecutive bi-directional LSTM layers. The fully connected layer works on each frame separately, and its output then moves to the LSTM layers. In sequence-to-sequence literature, it is common to use a stacked LSTM for the encoder. Our intuition is that the two layers capture separate information from the video sequence, and Fig. 4 shows that having two layers yields the best performance. The output of the encoder is converted into a context. In relevant literature, this context is mostly generated using a single attention layer. This is where this paper proposes a novel concept: with the mechanism described in later sections, our model generates a spatio-temporal context.
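A possible Keras realization of this encoder is sketched below; the 512 units of the time-distributed layer are an assumption, while the 256-unit bi-directional LSTMs match the training setup reported later in the paper.

```python
from tensorflow.keras import layers, Model

frame_features = layers.Input(shape=(15, 4096))        # 15 fc7 vectors per video sequence

# Frame-wise fully connected layer (512 units assumed), then two bi-directional LSTMs.
td = layers.TimeDistributed(layers.Dense(512, activation="relu"))(frame_features)
enc1 = layers.Bidirectional(layers.LSTM(256, return_sequences=True))(td)
enc2 = layers.Bidirectional(layers.LSTM(256, return_sequences=True))(enc1)

# enc1 and enc2 are attended to separately by the stacked attention, while `td`
# is additionally tapped by the Spatial Hard Pull layer described later.
encoder = Model(frame_features, [td, enc1, enc2])
```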

Figure 4: Comparing Stacked Attention with variations in encoder attention architecture.
(A) BLEU1, (B) BLEU2, (C) BLEU3, (D) BLEU4, (E) SS-Score.

Single-layered sequential decoder

The proposed decoder uses a single-layer LSTM followed by a fully connected layer to generate a word from a given context. In relevant literature, many models use stacked decoders, and most of these papers suggest that each decoder layer handles separate information. Our model, by contrast, uses a single layer: our experimental results show that a stacked decoder does not improve the result much for our architecture. Therefore, instead of stacking decoder layers, we increased the number of decoder cells. Specifically, we used twice as many cells in the decoder as in the encoder, which gave the best output during experimentation.
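The following sketch illustrates one decoding step under these choices (a single 512-unit LSTM followed by a softmax over the vocabulary); the context size and the placeholder vocabulary size are assumptions.

```python
from tensorflow.keras import layers, Model

vocab_size = 10_000                                    # caption vocabulary size (placeholder)

prev_word = layers.Input(shape=(1, 100))               # embedded previous / teacher-forced word
context = layers.Input(shape=(1, 1024))                # encoder context (size assumed)
h_in = layers.Input(shape=(512,))                      # carried LSTM hidden state
c_in = layers.Input(shape=(512,))                      # carried LSTM cell state

dec_in = layers.Concatenate(axis=-1)([prev_word, context])
dec_out, h, c = layers.LSTM(512, return_state=True)(dec_in, initial_state=[h_in, c_in])
word_probs = layers.Dense(vocab_size, activation="softmax")(dec_out)

decoder_step = Model([prev_word, context, h_in, c_in], [word_probs, h, c])
```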

Training and inference behaviour

To mark the start of a caption and to distinguish the mark from the real caption, a “start” token is used at the beginning. The decoder uses this token as a reference to generate the first true word of the caption. Figure 2 represents this as “first word”. During inference, each subsequent word is generated with the previously generated word as reference. The sequentially generated words together form the desired caption. The loop terminates upon receiving the “end” marker.

During training, if each iteration of the generation loop uses the previously generated word, then one wrong generation can derail the entire remaining caption, making the error calculation process vulnerable. To overcome this, like most seq2seq papers, we use the teacher forcing mechanism (Lamb et al., 2016). This method uses words from the original caption as the reference for generating the next word during the training loop. Therefore, the generation of each word is independent of previously generated words. Figure 2 illustrates this difference in training-time and testing-time behaviour. During training, the "Teacher Forced Word" is the word from the reference caption for that iteration.
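A simplified generation loop showing the switch between teacher forcing and free-running inference is given below; `decoder_step` refers to the single-step decoder sketched above, while `embed`, `START_ID`, and `END_ID` are hypothetical helpers standing in for the embedding lookup and the tokenizer's marker ids.

```python
import numpy as np

START_ID, END_ID = 1, 2                                # ids of the "start"/"end" markers (assumed)

def generate_caption(context, embed, target_ids=None, max_len=20):
    """Greedy decoding; passing `target_ids` enables teacher forcing during training."""
    word_id = START_ID
    h = c = np.zeros((1, 512), dtype="float32")
    generated = []
    for t in range(max_len):
        probs, h, c = decoder_step([embed(word_id), context, h, c])
        next_id = int(np.argmax(probs))
        generated.append(next_id)
        if target_ids is not None:
            word_id = target_ids[t]                    # training: reference word (teacher forcing)
        else:
            word_id = next_id                          # inference: previously generated word
            if word_id == END_ID:
                break
    return generated
```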

Proposed context generation architecture

The paper proposes two novel methods, both of which show promising signs of progress in the field of video captioning.

Stacked attention

Attention creates an importance map over the individual vectors of a sequence. In text-to-text, i.e., translation models, this map provides valuable information about which word or phrase on the input side correlates most strongly with which words and phrases in the output. In video captioning, however, attention plays a different role. For a particular word, instead of determining which frame (from the original video) or frames to emphasize, the stacked attention emphasizes objects. This paper uses a stacked LSTM. Like other relevant literature (Venugopalan et al., 2015; Song et al., 2017), this paper expects separate layers to carry separate information. So, if each layer carries separate information, it is only intuitive to generate a separate attention for each layer. Our architecture stacks the separately generated attentions and connects them with a fully connected layer with tanh activation. The output of this layer determines whether to put more emphasis on the object or on the action.

$$f_{attn}(h, ss) = a_s\big(W_2\,\tanh(W_1[h, ss] + b_1) + b_2\big) \tag{1}$$

$$c_{attn} = \mathrm{dot}\big(h, f_{attn}(h, ss)\big) \tag{2}$$

$$c_{st} = a_{relu}\big(W_{st}[c_{attn_1}, c_{attn_2}, \ldots, c_{attn_n}] + b_{st}\big) \tag{3}$$

where,

  • $h$ = encoder output for one layer
  • $ss$ = decoder state, repeated to match $h$'s dimension
  • $a_s(x) = \dfrac{\exp(x - \max(x))}{\sum_{atten} \exp(x - \max(x))}$ (a numerically stable softmax over the attention scores)
  • $n$ = number of attention layers to be stacked
  • $c_{attn}$ = context for a single attention
  • $\mathrm{dot}()$ represents the scalar dot product
  • $c_{st}$ = stacked context for $n$ encoder layers.

Equation (1) is the attention function. Equation (2) uses the output of this function to generate the attention context for one layer. Equation (3) combines the attention contexts of several layers to generate the desired spatio-temporal context, which the paper also refers to as the "stacked context". Figure 3 corresponds to these equations. In SSVC, we use n = 2, where n is the number of attention layers in the stacked attention.

The stacked attention mechanism generates the spatio-temporal context for the input video sequence. All the low-level context required to generate the next word is made available by this novel context generation mechanism.
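The sketch below is one plausible TensorFlow implementation of Eqs. (1)-(3): an additive attention per encoder layer whose contexts are concatenated and fused by a fully connected layer. Layer sizes and the exact fusion details are assumptions rather than the authors' implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers

class SingleAttention(layers.Layer):
    """Additive attention over one encoder layer's outputs (Eqs. 1-2)."""
    def __init__(self, units=256):
        super().__init__()
        self.W1 = layers.Dense(units)                  # applied to [h, ss]
        self.W2 = layers.Dense(1)                      # scalar score per frame
    def call(self, h, ss):
        # h: (batch, frames, dim); ss: decoder state repeated to (batch, frames, dim)
        scores = self.W2(tf.nn.tanh(self.W1(tf.concat([h, ss], axis=-1))))
        weights = tf.nn.softmax(scores, axis=1)        # a_s in Eq. (1)
        return tf.reduce_sum(weights * h, axis=1)      # c_attn in Eq. (2)

class StackedAttention(layers.Layer):
    """Concatenates the per-layer contexts and fuses them (Eq. 3), with n = 2 in SSVC."""
    def __init__(self, units=256, out_dim=512, n=2):
        super().__init__()
        self.attns = [SingleAttention(units) for _ in range(n)]
        self.W_st = layers.Dense(out_dim, activation="relu")
    def call(self, enc_layers, ss):
        contexts = [attn(h, ss) for attn, h in zip(self.attns, enc_layers)]
        return self.W_st(tf.concat(contexts, axis=-1))  # c_st
```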

Spatial Hard Pull

Amaresh & Chitrakala (2019) mention that most successful image and video captioning models mainly learn to map low-level visual features to sentences; they do not focus on high-level semantic video concepts such as actions and objects. By low-level features, they mean object shapes and their presence in the video. High-level features refer to proper object classification with position in the video and the context in which the object appears. On the other hand, our analysis of previous architectures shows that nearly identical information is often found in nearby frames of a video, yet passing the frames through the LSTM layers does not extract any additional value from this near-identical information. So, we devised a method to hard-pull the output of the time-distributed layer and use it to add high-level visual information to the context. This method enables us to extract meaningful high-level features, such as objects and their relative positions in the individual frames.

This method extracts information from all frames simultaneously and does not consider sequential information. As the layer pulls spatial information from sparsely located frames, this paper names it the "Spatial Hard Pull" layer. It can be compared to a skip connection, but unlike other skip connections, it skips a recurrent layer and contributes directly to the context. The number of output units of the fully connected (FC) layer of the spatial-hard-pull layer determines how much effect the sparse layer has on the context. Figure 5 shows the performance improvement at lower unit counts due to the SHP layer and the fall of scores at higher unit counts due to high variance.
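One way the Spatial Hard Pull could be realized is sketched below: the time-distributed features bypass the LSTMs, are flattened across frames, and pass through an FC layer whose unit count (e.g., 45) controls the layer's influence on the context. The flatten-then-dense aggregation is an assumption made for illustration.

```python
import tensorflow as tf
from tensorflow.keras import layers

class SpatialHardPull(layers.Layer):
    """Pulls the time-distributed frame features past the recurrent layers and
    compresses them with an FC layer whose unit count controls its influence."""
    def __init__(self, shp_units=45):                  # 45 units performed best (Fig. 5)
        super().__init__()
        self.flatten = layers.Flatten()
        self.fc = layers.Dense(shp_units, activation="relu")
    def call(self, td_features):
        # td_features: (batch, frames, dim) output of the time-distributed layer
        return self.fc(self.flatten(td_features))      # spatial context, no temporal order

# The stacked (temporal) context and this hard-pulled (spatial) context can then be
# concatenated to form the full context that is fed to the decoder.
```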

Figure 5: Evaluating model performance with varied hard-pull units.
(A) BLEU1, (B) BLEU2, (C) BLEU3, (D) BLEU4, (E) SS-Score.

Proposed Scoring Metric

No automatic scoring metric has been designed yet for the sole purpose of video captioning. The existing metrics that have been built for other purposes, like neural machine translation, image captioning, etc., are used for evaluating video captioning models. For quantitative analysis, we use the BLEU scoring metric ( Papineni et al., 2002 ). Although these metrics serve similar purposes, according to Aafaq et al. (2019b) , they fall short in generating “meaningful” scores for video captioning.

BLEU is a precision-based metric mainly designed to evaluate text at a corpus level; it can be calculated against a single reference sentence or against a corpus of sentences (Brownlee, 2019). Though the BLEU metric is widely used, Post (2018), Callison-Burch, Osborne & Koehn (2006) and Graham (2015) demonstrate its inadequacy in generating meaningful scores for tasks like video captioning. A video may have multiple contexts, so it is difficult for machines to accurately measure the merit of a generated caption, as there is no single right answer. Therefore, for video captioning, it is more challenging to generate meaningful scores that reflect the captioning capability of the model. As a result, human evaluation is an important part of judging the effectiveness of a captioning model. In fact, Figs. 6-10 show that a higher BLEU score is not necessarily a good reflection of captioning capability, whereas our proposed human evaluation method portrays the model's performance better than the BLEU scores.

Figure 6: In (A) (Source: https://www.youtube.com/watch?v=6t0BpjwYKco&t=230s) and (B) (Source: https://www.youtube.com/watch?v=j2Dhf-xFUxU&t=20s), our model extracts the action part correctly and gets a decent score on both SS and BLEU. In (C) (Source: https://www.youtube.com/watch?v=uxEhH6MPH28&t=29s), the output is perfect and the SS Score is high; however, BLEU4 is 0.

Figure 7: In (A) (Source: https://www.youtube.com/watch?v=VahnQw2gTQY&t=298s) and (B) (Source: https://www.youtube.com/watch?v=YS1mzzhmWWA&t=9s), our model extracts only the action part correctly. The generated caption gets a mediocre score on both SS and BLEU.

Figure 8: In (A) (Source: https://www.youtube.com/watch?v=R2DvpPTfl-E&t=20s) and (B) (Source: https://www.youtube.com/watch?v=1hPxGmTGarM&t=9s), the generated caption is completely wrong with respect to the actions, but BLEU1 gives a very high score. On the contrary, the SS Score heavily penalizes them.

Figure 9: In the figure, the black car is driving away while being chased by a police car (Source: https://www.youtube.com/watch?v=3opDcpPxllE&t=50s). Our SSVC model only predicts the driving part. Thus, the generated caption only partially captures the original idea. However, BLEU evaluates it with a very high score, whereas the SS Score evaluates it accordingly.

Figure 10: In (A) (Source: https://www.youtube.com/watch?v=D1tTBncIsm8&t=841s), (B) (Source: https://www.youtube.com/watch?v=Cv5LsqKUXc&t=71s), and (C) (Source: https://www.youtube.com/watch?v=2FLsMPsywRc&t=45s), the generated caption is completely wrong, but BLEU1 gives a very high score whereas the SS Score gives zero. Therefore, BLEU performs poorly here.

Semantic sensibility(SS) score evaluation

To get a better understanding of the captioning capability of our model, we perform a qualitative analysis based on human evaluation, similar to Graham, Awad & Smeaton (2018), Xu et al. (2017) and Pei et al. (2019). We propose a human evaluation metric, namely the "Semantic Sensibility" score, for video captioning. It evaluates sentences at a contextual level, against the corresponding videos, based on both recall and precision. It takes three factors into consideration: the grammatical structure of the predicted sentences, the detection of the most important element (subject or object) in the videos, and whether the captions give an exact or synonymous analogy to the action of the videos to describe the overall context.

It is to be noted that, for the latter two factors, we take into consideration both the recall and precision values according to their general definitions. For recall, we evaluate these factors from our predicted captions and match them with the corresponding video samples. Similarly, for precision, we judge these factors from the video samples and match them with the corresponding predicted captions. Following such comparisons, each variable is assigned a boolean value of 1 or 0 based on human judgment. The significance of the variables and how their values are assigned are elaborated below:

$S_{grammar}$

$$S_{grammar} = \begin{cases} 1, & \text{if grammatically correct} \\ 0, & \text{otherwise} \end{cases} \tag{4}$$

$S_{grammar}$ evaluates the grammatical correctness of the generated caption without considering the video.

$S_{element}$

$$S_{element} = \frac{\dfrac{1}{R}\sum_{i=1}^{R} S_{element\text{-}recall_i} + \dfrac{1}{P}\sum_{i=1}^{P} S_{element\text{-}precision_i}}{2} \tag{5}$$

where,

  • R = number of prominent objects in video

  • P = number of prominent objects in caption

While $S_{action}$ evaluates the action-similarity between the predicted caption and its corresponding video, $S_{element}$ evaluates the object-similarity. For each object in the caption, the corresponding $S_{element\text{-}precision_i}$ receives a boolean score, and for each major object in the video, the corresponding $S_{element\text{-}recall_i}$ receives a boolean score. The average recall and average precision are combined to obtain $S_{element}$.

$S_{action}$

$$S_{action} = \frac{S_{action\text{-}recall} + S_{action\text{-}precision}}{2} \tag{6}$$

$S_{action}$ evaluates the ability to describe the action-similarity between the predicted caption and its corresponding video. $S_{action\text{-}recall}$ and $S_{action\text{-}precision}$ separately receive a boolean score (1 for correct, 0 for incorrect) for action recall and action precision, respectively. Action recall determines whether the generated caption successfully captures the most prominent action of the video segment. Similarly, action precision determines whether the action mentioned in the generated caption is actually present in the video.

SS score calculation

Combining Eqs. (4), (5) and (6), the equation for the SS Score is obtained:

$$SS\ Score = \frac{1}{N}\sum_{n=1}^{N} S_{grammar} \cdot \frac{S_{element} + S_{action}}{2} \tag{7}$$

During this research work, the SS Score was calculated by 4 final-year undergraduate students studying at the Department of Computer Science and Engineering at the Islamic University of Technology. They are all Machine Learning researchers and are fluent English speakers. Each caption was separately scored by at least two annotators to make the scoring consistent and accurate.
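The aggregation of the human judgments into the final score follows Eq. (7) directly; a small sketch is given below, assuming the per-factor recall and precision values (already averaged over objects for the element factor) have been collected from the annotators.

```python
def ss_score(samples):
    """samples: one dict per caption with human-judged values in [0, 1].
    'element_recall' / 'element_precision' are assumed to already be the per-object
    averages (the 1/R and 1/P sums of Eq. 5)."""
    total = 0.0
    for s in samples:
        s_element = (s["element_recall"] + s["element_precision"]) / 2   # Eq. (5)
        s_action = (s["action_recall"] + s["action_precision"]) / 2      # Eq. (6)
        total += s["grammar"] * (s_element + s_action) / 2               # Eq. (7)
    return total / len(samples)

# Example: one perfectly captioned sample and one grammatically broken one.
print(ss_score([
    {"grammar": 1, "element_recall": 1.0, "element_precision": 1.0,
     "action_recall": 1, "action_precision": 1},
    {"grammar": 0, "element_recall": 0.5, "element_precision": 1.0,
     "action_recall": 1, "action_precision": 0},
]))  # -> 0.5
```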

Results

Dataset and experimental setup

Our experiments primarily centre on comparing our novel model with commonly used architectures for video captioning that provide state-of-the-art results, such as simple attention (Gao et al., 2017; Wu et al., 2018), modifications of the attention mechanism (Yang et al., 2018; Yan et al., 2019; Zhang et al., 2019), and variations of visual feature extraction techniques (Aafaq et al., 2019a; Wang et al., 2018b). We conducted the experiments under an identical computational environment: Framework: TensorFlow 2.0; Platform: Google Cloud Platform virtual machine with an 8-core processor and 30 GB RAM; GPU: none. We used the Microsoft Research Video Description (MSVD) dataset (Chen & Dolan, 2011). It contains 1,970 video snippets with 40 English captions per video (Chen, Li & Hu, 2020). Following previous works (Venugopalan et al., 2015; Pan et al., 2016), we split the dataset into training, validation, and test sets with 1,200, 100, and 670 snippets, respectively. To create a data-sequence, frames from a video are taken at a fixed temporal distance; we used 15 frames for each data-sequence. After creating the data-sequences, we had almost 65,000 samples in our dataset. Although the final dataset contains a large number of samples, there are only 1,200 distinct trainable videos, which is not a large number; a larger dataset would benefit training.

For the pre-trained embedding layer, we used 'glove.6B.100d' (Pennington, Socher & Manning, 2014). Due to the lack of a GPU, we used 256 LSTM units in each encoder layer and 512 LSTM units in the decoder network, and trained each experimental model for 40 epochs. To analyse the importance of the Spatial Hard Pull layer, we also varied the Spatial Hard Pull FC units across 0, 45, and 60.

One of the most prominent benchmarks for video data is TRECVid. In particular, the 2018 TRECVid challenge (Awad et al., 2018), which included video captioning, video information retrieval, activity detection, etc., could be an excellent benchmark for our work. However, due to limitations such as insufficient computational resources, rigid data pre-processing imposed by memory constraints, and the inability to train on a bigger dataset, we could not evaluate our novel model against global benchmarks like TRECVid. Moreover, some of the benchmark models use multiple features as input, whereas we use only a single 2D CNN feature, as we wanted to make an extensive study of the capability of 2D CNNs for video captioning. So, we implemented some of the fundamental concepts used in most state-of-the-art works in our experimental setup with a single-input 2D CNN feature, and performed an ablation study to make a qualitative and quantitative analysis of our model. The performance of our two proposed novelties shows potential for improvement.

We used the BLEU score as one of the two main scoring criteria. To calculate BLEU on a dataset with multiple ground-truth captions, we used the corpus BLEU calculation method (Brownlee, 2019); the BLEU scores reported throughout this paper therefore indicate the corpus BLEU score. Our proposed architecture, SSVC, with 45 hard-pull units and 2-layer stacked attention, achieves BLEU1: 0.7072, BLEU2: 0.5193, BLEU3: 0.3961, and BLEU4: 0.1886 after 40 epochs of training with the best combination of hyper-parameters. For generating the SS Score, we considered 50 randomly selected videos from the test set and obtained an SS Score of 0.34 for the SSVC model.
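For reference, corpus BLEU over multiple ground-truth captions can be computed with NLTK as sketched below; the toy captions are illustrative only and do not come from the MSVD dataset.

```python
from nltk.translate.bleu_score import corpus_bleu

# For each test video: the list of its tokenized reference captions, and one hypothesis.
references = [[["a", "man", "is", "playing", "a", "guitar"],
               ["a", "person", "plays", "guitar"]]]
hypotheses = [["a", "man", "is", "playing", "guitar"]]

bleu1 = corpus_bleu(references, hypotheses, weights=(1.0, 0, 0, 0))
bleu4 = corpus_bleu(references, hypotheses, weights=(0.25, 0.25, 0.25, 0.25))
print(round(bleu1, 4), round(bleu4, 4))
```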

Ablation study of stacked attention

  • No attention: Many previous works (Long, Gan & De Melo, 2018; Nam, Ha & Kim, 2017) mention that captioning models perform better with some form of attention mechanism. Thus, in this paper, we do not compare models with and without attention.

  • Non-stacked (or single) attention: In relevant literature, though the use of attention is very common, the use of stacked attention is quite infrequent. Nam, Ha & Kim (2017) show the use of stacked (or dual) attention and the performance improvements possible through it. In Fig. 4, the comparison between single and stacked attention indicates that dual attention has a clear edge over single attention.

  • Triple attention: Since dual attention improves performance over single attention, it is natural to test a triple attention as well. Figure 4 shows that triple attention under-performs compared to all other variants.

Considering our limitations, our stacked attention gives satisfactory results for both BLEU and SS Score compared to the commonly used attention methods under the same experimental setup. The graphs in Fig. 4 likewise suggest that our stacked attention improves on existing methods thanks to improved overall temporal information. Moreover, the two-layer LSTM encoder performs much better than a single- or triple-layer encoder. Combining these two observations, we conclude that our dual-layer encoder LSTM with stacked attention has the capability to improve corresponding architectures.

Ablation study of spatial hard pull

To boost captioning capability, some state-of-the-art works such as Pan et al. (2017) emphasize the importance of retrieving additional visual information. We implemented the same fundamental idea in our model with the Spatial Hard Pull. To show the effectiveness of the Spatial Hard Pull (SHP), we kept our stacked attention fixed and varied the SHP FC units across 0, 45, and 60. Figure 5 shows that as the number of SHP FC units increases from 0 to 45, both BLEU and SS Score improve, and then gradually fall from 45 to 60. The improvement in the lower range indicates that the SHP layer is indeed improving the model; the fall of scores in the upper range occurs because the model starts to show high variance. Hence, it is evident from this analysis that using an SHP layer yields better results than not using one.

Discussion

By performing various trials on a fixed experimental setting, we analysed the spatio-temporal behaviour of a video captioning model. After observing that a single-layer encoder LSTM causes more repetitive predictions, we used double- and triple-layer LSTM encoders to encode the visual information into a better sequence. This led us to propose our novel stacked attention mechanism with a double encoder layer, which performs best among all the LSTM-layer variations we tried. The intuition behind this mechanism is that, as our model separately attends to each encoder layer, it generates a better overall temporal context for decoding the video sequence and deciding whether to give more priority to the object or the action. The addition of Spatial Hard Pull to this model bolsters its ability to identify and map high-level semantic visual information. The results also indicate that adding excess SHP units drastically degrades the performance of the model; hence, a balance must be maintained while increasing the SHP units so that the model does not over-fit. Both of these key components contributed to improving the overall performance of our novel architecture, which is built upon the fundamental concepts of state-of-the-art models.

Although the model performed well in the qualitative and quantitative analysis, our proposed SS scoring method provides greater insight for analysing video captioning models. The automatic metrics, although useful, cannot interpret the videos correctly. In our experimental results, we see a steep rise in the BLEU score in Figs. 4 and 5 at early epochs even though the predicted captions are not up to the mark. This points to the limitations of the BLEU score in judging captions with a meaningful score. The SS Score accounts for these limitations and captures the semantic relationship between the context of the videos and the generated language, reflecting a model's ability to translate video understanding into language in its truest sense. Hence, we can more reliably evaluate the captioning capability of our Stacked Attention with Spatial Hard Pull mechanism and better understand the acceptability of our novel model's performance.

Conclusion and Future Work

Video captioning is a complex task. This paper shows how stacking the attention layers of a multi-layer encoder produces a more semantically accurate context. Complementing it, the Sparse Sequential Join introduced in this paper captures higher-level features with greater efficiency.

Due to our computational limitations, our experiments use custom pre-processing and constrained training environment. We also use a single feature as input unlike most of the state-of-the-art models. Therefore, the scores we obtained in our experiments are not comparable to global benchmarks. In future, we hope to perform similar experiments with industry standard pre-processing with multiple features as input.

The paper also introduces the novel SS Score. This deterministic scoring metric has shown great promise in measuring the semantic sensibility of a generated video caption. However, since it is a human evaluation metric, it relies heavily on human understanding and therefore requires considerable manual effort. For the grammar score, Naber (2003)'s "A Rule-Based Style and Grammar Checker" technique could be used, which would partially automate the SS scoring method.

Supplemental Information

SS score (semantically sensible score) for different variations of SSVC model to analyze the effectiveness of semantic hard pull and stacked attention

Each sheet corresponds to a different model architecture. The name of each sheet denotes the architecture details. In each sheet, each row corresponds to a particular sample from the validation set, and each column represents the corresponding epoch. The values in each cell are the corresponding SS scores.

DOI: 10.7717/peerj-cs.664/supp-1