Abstract

Continuous sign-language videos convey dynamic signs in sentence form. A series of frames depicts a single sign or phrase, and most of these frames are noninformational and hardly affect sign recognition. By removing them from the frameset, a recognition algorithm needs to process only a minimal number of frames per sign, which reduces the time and space complexity of such systems. The proposed algorithm addresses the challenge of identifying small-motion frames, such as tapping, stroking, and caressing, as keyframes in continuous sign-language videos with a high reduction ratio and accuracy. Unlike previous studies, the proposed method maintains the continuity of sign motion instead of isolating signs, and it supports scalability and stability across datasets. The algorithm measures angular displacements between adjacent frames to identify potential keyframes, and noninformational frames are then discarded using a sequence check technique. Phoenix14, a German continuous sign-language benchmark dataset, is reduced by 74.9% with an accuracy of 83.1%, and the American Sign Language (ASL) How2Sign dataset is reduced by 76.9% with 84.2% accuracy. A low word error rate (WER) is also achieved on the Phoenix14 dataset.

1. Introduction

Sign language, a visual language, is used by the majority of deaf and hard-of-hearing people. Both static and dynamic gestures are used to represent words and phrases in sign language. Continuous sign language (CSL) is a collection of sign expressions that can be expressed as a sequence of motions in both space and time. The continuous sign language recognition and translation (CSLRT) task aims to bridge the gap between sign and spoken language by recognizing a series of continuous gestures and translating them into natural language expressions. One sign sentence can contain 100–250 frames (approximately 9 words), depending on the frame rate of the recording device. Not all of these frames are required for sign interpretation. Transition frames and noninformative frames can be removed from the frameset, leaving about 1–5 keyframes per word. Keyframes are the most informative frames in CSL, containing extensive sign gesture and motion information. This reduction lowers storage and execution overheads, and with a proper keyframe set, neural models can extract spatial and temporal features more precisely. With applications in fields such as action detection [1], video summarization [2], educational video summarization [3], video segmentation [4, 5], and video copyright protection [6], keyframe extraction from videos is one of the thoroughly investigated topics that keep the scientific community interested. Keyframe extraction from sign language videos is considered challenging because there are no indicators of the start and end frames of signs in the video, and small motions that may be part of a sign must still be identified. Most of the existing keyframe extraction methods are not suitable for sign language video representation, as they do not meet requirements such as multielement identification, minor motion detection, continuity, scalability, and stability.

1.1. Keyframe

Keyframe extraction can be defined as follows: if a video $F$ is represented as a set of frames, $F = \{f_1, f_2, \ldots, f_n\}$, where $n$ is the total number of frames in $F$.

The keyframe set is then represented by $K$ such that $K \subseteq F$, and $K$ abstractly represents the original video with a frameset of length $m$, where $m < n$.

The following is a representation of the keyframe extraction algorithm.

Let $\Phi$ be a keyframe extraction algorithm; then the keyframe elements can be defined as $K = \Phi(F) = \{k_1, k_2, \ldots, k_m\}$, where $K$ is the reduced video.

The efficiency of the algorithm depends on the reduction rate attained and the accuracy with which the sign can be recognised. The reduction rate can be expressed as $R = \dfrac{n - m}{n} \times 100$.

Let $K = \{k_1, k_2, \ldots, k_m\}$ be a keyframe sign sequence and $G = \{g_1, g_2, \ldots, g_r\}$ be the ground-truth sign frames of a video of $r$ frames; then the accuracy is the proportion of frames correctly recognised against the ground truth, $A = \dfrac{n(K \cap G)}{n(G)} \times 100$, where $n(\cdot)$ gives the total number of frames in a set. The concept of keyframe extraction is depicted in Figures 1 and 2. The sign language representation of the word “LIEB” from the Phoenix14 dataset [7] is illustrated in Figure 1. The keyframes for the same word can be thought of as in Figure 2(a), with two frames capturing the gesture structure and motion. A graphical depiction of the word “LIEB” based on the SignWriting [8] technique, which is used for evaluation, is shown in Figure 2(b). It is noteworthy that the word, originally 14 frames long, can be reduced to a two-frame representation that acts as the keyframe set and properly communicates the sign word concept.
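
To make the two metrics concrete, the following minimal Python sketch computes the reduction rate and the accuracy from frame index sets. The helper names and the set-intersection reading of the accuracy are illustrative assumptions, not taken verbatim from the paper.

def reduction_rate(n_original: int, n_keyframes: int) -> float:
    """Percentage of frames removed from the original video."""
    return 100.0 * (n_original - n_keyframes) / n_original

def keyframe_accuracy(keyframes: set[int], ground_truth: set[int]) -> float:
    """Proportion of ground-truth sign frames retained in the keyframe set."""
    return 100.0 * len(keyframes & ground_truth) / len(ground_truth)

# Example: a 176-frame sentence reduced to 48 keyframes (as in Figure 4).
print(reduction_rate(176, 48))                         # about 72.7
print(keyframe_accuracy({3, 4, 10}, {3, 4, 5, 10}))    # 75.0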

The lack of distinct word breaks and the continuous gesture transitions make keyframe extraction from CSL videos challenging. The gesture position, orientation, and direction of movement must be considered when finding keyframes and when eliminating noninformational frames. Minor and substantial variations in hand forms, motions, positions, nonmanual elements, context, and signer speed all pose challenges to the keyframe extraction process.

Noninformational frames and informational frames are the two types of frames in CSL videos. The proposed method aims to find a set of keyframes that accurately and efficiently captures the maximum sign information from a continuous sign-language video with a good reduction rate. The orientation information contained in sign words must also be preserved for a sign to be correctly recognized. By integrating all keyframes, an abstract of a specific sign can be obtained. The motivation for using continuous sign-language keyframes is strong, since they reduce the processing time and storage requirements of representation learning models and other related computer vision tasks.

The keyframe extraction paradigms used in video-related computer vision tasks include cluster methods [9, 10], motion energy-based methods [11], sequence methods [12–16], and machine learning methods [17, 18]. Sequential approaches and machine learning methods are the most widely accepted techniques for keyframe extraction from continuous sign-language videos.

Keyframe extraction strategies employed in motion analysis, video summarisation, or compression cannot be applied directly to CSL videos. The spatial, temporal, and directional characteristics of gesture frames in CSL must be evaluated to determine whether they are informative. Certain signs differ solely in the direction of motion of the sign elements, so the direction of motion is an important piece of information for fully interpreting the link between movements and hence the gesture. This is the first time the concept of gesture orientation has been examined in the keyframe extraction task.

A majority of current sign language keyframe extraction research focuses on dynamic gesture videos (word level), with a few attempts on continuous sign language (sentence level) using the hand as the region of interest [19, 20], leaving the nonmanual elements unresolved. A combination of image entropy and density clustering is used to obtain the keyframes of a hand gesture video in [21]; minor motions and motion directions cannot be taken into account by this method due to its static threshold value, so it is ineffective for CSL videos. The work in [22] identifies significant frames using a gradient-based keyframe extraction technique and treats each gesture as a separate, isolated gesture; the direction of motion, continuity, and minute motions are left unresolved. Most sequential approaches use static thresholds, as in [20, 23], which makes it difficult to record small, repetitive movements. Specifically, tapping or rubbing does not propagate information over successive frames, so static thresholds cannot distinguish movements between such frames. Solutions based on threshold values such as entropy or sampling do not address scalability or signer independence [14, 24].

This work handles these sign gestures effectively and consistently across a large dataset, which had not been studied before. The proposed work offers a simple and efficient approach for extracting successive keyframes from CSL videos, which can be fed into a CSLR system for speedy decisions, while addressing the hurdles and flaws of earlier works. The contributions of this work are as follows:
(1) This study proposes a new approach for choosing keyframes from continuous sign videos, which significantly reduces computation overhead in the time and space dimensions.
(2) An angular displacement metric is used to evaluate the motion between frames.
(3) The keyframe selection decision is based on the whole frame; thus, all sign elements are considered.
(4) A sequence check metric and a frame pixel difference with an adaptive threshold are used to reduce the candidate keyframe set.
(5) To analyse and visualise the suggested technique, this work utilises the sign representation method SignWriting [8].
(6) WER is calculated in conjunction with existing sign language recognition systems to analyse the performance of the reduced dataset.

The remaining sections are organized as follows. Section 1.2 reviews keyframe extraction techniques used in sign language recognition systems. The proposed FSC2 (frame sequence count check) keyframe extraction algorithm is described in Section 2. The experimental results are presented in Section 3, and an ablation study in Section 4. Lastly, Section 5 summarises the proposed work and offers some suggestions for further research.

1.2. Related Work

This section discusses the keyframe extraction techniques employed in prior research on sign language recognition tasks.

Keyframe extraction utilising time-varying parameter detection was proposed in [25]. The authors used statistical analysis of variables such as position, posture, orientation, and motion to detect discontinuities in frames, considering only the major motion elements. In [26], extraneous motions such as preparation motion and unnecessary movement between sign phrases were removed using fuzzy partitioning and state automata. For filtering uninformative frames, the authors of [27, 28] employed a gradient-based keyframe extraction method. In [29], the authors randomly sampled 10–50 keyframes from each video and translated the sign video representations directly into spoken language. A method for extracting keyframes from a trajectory density curve using a sliding window is proposed in [19]. In [30], an online low-rank approximation of sign videos is employed to choose keyframes. A method for locating video frames representing single signs in a one-hand finger alphabet is provided in [20], which uses a combination of object tracking and visual attention. In [31], the angular and distance metrics of a 3D trajectory skeleton are used for keyframe detection.

The ARSS approach for optimal sampling and alignment of RGB and depth input is proposed in [32], and a relatively complete keyframe set of the video is acquired. In [33], a new sampling approach called keyframe-centred clip (KCC) sampling was proposed, with the goal of selecting a specific number of frames to describe the entire sign language video; in comparison to other sampling methods, KCC achieves greater recognition performance. To improve KCC sampling, a method termed optimised keyframe-centred clip (OptimKCC) sampling was proposed in [14], which optimises KCC sampling using the DTW distance. In all of the preceding studies, signs are treated as isolated.

The authors of [34] proposed two types of distances, interkeyframe distances and model set distances; the sum of the distances to other keyframes and the average distance from the model set are used to pick the set of keyframes K. In [35], Zernike moments were used to detect keyframes in a dynamic gesture video clip; a keyframe is one in which the Zernike moment (ZM) difference between neighbouring frames is greater than a threshold (set to 50). In [36], a random sampling method is applied. A sequence technique based on statistics of elements such as colour, picture difference, and weighted frames is proposed in [13] to detect keyframes from dynamic sign-language videos. Edge detection and the discrete wavelet transform are used in [37] to extract keyframes. A hybrid clustering approach is provided in [38], where two sets of keyframes are obtained; the spliced original keyframe image represents the spatial dimension feature, and the optical flow keyframe image represents the time dimension feature. The authors of [24] proposed the median of entropy of mean frames (MME) approach for keyframe extraction, which uses the mean of consecutive k frames of video data with a sliding window of size k/2 to select the frame that satisfies the median entropy value.

The methodology used in [39] considers multiple evaluation factors to select critical frames from raw videos: for creating high-quality video clips, essential frames are chosen based on hand height, hand movements, and frame blurriness levels. In [40], hand coordinates were used as the parameter for sampling keyframes. In [41], the authors proposed a clip summary approach to choose the important video clips, and in [42], DTW is used for keyframe extraction.

In comparison to other computer vision tasks that use keyframe extraction, such as video summarisation and compression, there are few works on keyframe extraction from sign language videos, and it remains a challenging research subject. The majority of the work focuses on word-level or small-phrase extraction, which falls under isolated signing. Owing to its complexity, there is very little literature on continuous sign-language videos. A continuous sign-language sentence stream can have over 250 frames, with a few keyframes functioning as representative frames and the rest as transitional or noninformational frames. Because of the small variation between consecutive frames and the long length of the input, the demands on modelling the temporal sequence of signs at the sentence level are rather stringent.

The majority of early techniques used threshold settings that vary with the dataset, which reduces stability and scalability. Repetitive signs and signs with little momentum are disregarded, which results in information loss. Most early research treats the principal hand structure as a single region of interest, retrieved using a segmentation method, in order to condense the gesture space. In addition, the beginning and ending frames of each sign phrase were manually chosen, and continuous signs were transformed into isolated frames to control the motion. When designing an algorithm for continuous sign video, which heavily relies on continuous data, such restrictions must be minimized.

This work proposes a keyframe extraction algorithm for handling the significant difficulty of keyframe extraction in CSL videos based on the difference in angular displacement of pixels between frames and a sequence check metric.

2. Proposed Method

2.1. Frame Sequence Count Check (FSC2) Keyframe Extraction Algorithm

The FSC2 keyframe extraction algorithm is designed in simple, statistical steps to keep it lightweight and effective. The proposed FSC2 keyframe extraction architecture is shown in Figure 3. The FSC2 algorithm has three phases of execution: motion analysis, wrapper, and reduction. Motion analysis uses the Gunnar Farnebäck optical flow algorithm [43] to obtain optical flow data between adjacent frames. These data are fed to the wrapper, where the mean of the angular displacement difference obtained from the optical flow data is calculated. The selector then arranges the frames into two boxes depending on this mean value and weighs them, forming the candidate keyframe set. The sequencer receives these frames, counts how many of them occur in a consecutive run, and updates the weights according to the sequence. Inside the reducer, the frames are sequence checked and then reduced using the S-reduction algorithm. S-reduction counts the length of each run: for a run of three, if the middle frame has a positive mean value, it is kept and the other frames are discarded; otherwise, the middle frame is discarded. For a run of two, if one frame is from the box with a negative mean value, it is rejected; otherwise, both are kept. If the run is longer than three, the mean pixel difference is used as the threshold for further reduction. The output is a collection of keyframes that forms the abstract of the signs in the CSL video.

The FSC2 keyframe extraction algorithm evaluates a second-order frame difference: the two-frame optical flow (Gunnar Farnebäck) serves as the first-order difference, and the difference between two successive optical flows serves as the second-order difference. The algorithm thus relates three subsequent frames when analysing motion, which aids in capturing minute interframe motions.
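
A minimal sketch of this three-frame motion analysis is given below, assuming OpenCV's dense Farnebäck implementation; the parameter values passed to calcOpticalFlowFarneback and the helper names are illustrative choices, not the exact settings used in this work.

import cv2
import numpy as np

def angular_displacement(prev_gray: np.ndarray, next_gray: np.ndarray) -> np.ndarray:
    """Per-pixel flow angle between two consecutive grayscale frames (first order)."""
    flow = cv2.calcOpticalFlowFarneback(
        prev_gray, next_gray, None,
        pyr_scale=0.5, levels=3, winsize=15,
        iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
    _, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])  # polarisation of flow vectors
    return ang

def mean_angular_difference(f1: np.ndarray, f2: np.ndarray, f3: np.ndarray) -> float:
    """Mean angular-displacement difference over three successive frames (second order)."""
    diff = angular_displacement(f2, f3) - angular_displacement(f1, f2)
    return float(diff.mean())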

2.2. Motion Analysis

Given two consecutive frames, an optical flow algorithm calculates the motion of each pixel in the frame. The Gunnar Farnebäck optical flow method was employed in this study to determine the optical flow information between two successive sign frames.

2.2.1. Gunnar Farnebäck Optical Flow

The Gunnar Farnebäck method [43] is a two-frame motion estimation algorithm developed to produce dense optical flow results. The algorithm can be broken down into four steps. In the first step, the local neighbourhood of each image is approximated by quadratic polynomials. In the second step, these quadratic polynomials are used to construct a new signal under a global displacement. The next step equates the quadratic polynomials to calculate the global displacement. The coefficients are then estimated using a weighted least squares fit over the pixel neighbourhood.

The Gunnar Farnebäck two-frame method was chosen for this study because it can examine each individual pixel displacement between subsequent frames, and because sign language frames carry a lot of small motion embedded in neighbouring frames.

The mathematical representation of the algorithm is as follows.

The image intensity in the neighbourhood of pixel location $x$ in the first frame can be modelled with a quadratic function, $f_1(x) = x^{T}A_1x + b_1^{T}x + c_1$, where $A_1$ is a symmetric matrix, $b_1$ is a vector, and $c_1$ is a scalar. The coefficients are obtained by fitting a weighted least squares estimate to the intensity values in the neighbourhood. For the second frame with a global displacement $d$,

$f_2(x) = f_1(x - d) = x^{T}A_2x + b_2^{T}x + c_2$.

On expanding and substituting, $A_2 = A_1$, $b_2 = b_1 - 2A_1d$, and $c_2 = d^{T}A_1d - b_1^{T}d + c_1$. The displacement equation becomes

$d = -\tfrac{1}{2}A_1^{-1}(b_2 - b_1)$.

Further reading can be found in the paper [43].

2.3. Wrapper

The Gunnar Farnebäck optical flow algorithm generates an optical flow vector for each pixel between two adjacent frames. The flow vectors are converted to polar form, from which the angular displacement is calculated. The next step is to determine the difference in angular displacement between adjacent pairs of flow data, which corresponds to the angular displacement across three successive frames, as shown in equation (8). The parameter used for first-level candidate keyframe selection, here denoted $\mu$, is then derived as the mean of the angular displacement differences of the flow data, as represented in equation (9). This process discards a small number of frames.

Thus, the wrapper selects candidate keyframes that may become part of the keyframe set. The selector and sequencer are the two components that determine the rating of the frames. The selector checks the $\mu$ value, distributes the frames into the appropriate boxes, and assigns each frame a weight accordingly. Let $w_p$ be the weight assigned to frames in the box $B_p$ with $\mu > 0$ and $w_n$ be the weight assigned to frames in the other box $B_n$. This work gives greater priority to frames in $B_p$, i.e., $w_p > w_n$.

The sequencer uses these weighted frames to determine the sequence check, i.e., the number of frames that follow each other, and divides them into three boxes designated S1, S2, and S3, with score values 2, 3, and 4, respectively. Frames with sequence count two are kept in S1, frames with sequence count three are kept in S2, and frames with sequence count greater than three are kept in S3. Single frames without any adjacent frames are discarded in this step, as any abrupt change in motion is considered uninformational.

For example, consider a box containing a set of weighted frames. An isolated frame is discarded, frames in a run longer than three are put in the S3 box (for all counts greater than three, the score is set to 4), frames in a run of three are placed in the S2 box, and frames in a run of two go to the S1 box. The weight of each frame in each box is then updated in accordance with equation (9).

These weighted frames make up the candidate keyframes. In this way, the wrapper performs an initial reduction of the frames. The frames are then combined and sent to the reduction procedure.
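
The selector and sequencer logic can be sketched as follows. The concrete weights, the run detection via grouping by box, and the function names are assumptions made for illustration, not the paper's exact implementation.

from itertools import groupby

W_POS, W_NEG = 2.0, 1.0      # frames with a positive mean value get the higher weight
SEQ_SCORE = {2: 2, 3: 3}     # runs longer than three default to score 4

def wrapper(mu_values):
    """Return candidate keyframes as (frame index, weight, sequence score) tuples."""
    # Selector: box each frame by the sign of its mean angular-displacement difference.
    boxed = [(i, W_POS if mu > 0 else W_NEG) for i, mu in enumerate(mu_values)]
    candidates = []
    # Sequencer: count runs of consecutive frames that fell into the same box.
    for _, run in groupby(boxed, key=lambda frame: frame[1]):
        run = list(run)
        if len(run) < 2:     # isolated frames are discarded
            continue
        score = SEQ_SCORE.get(len(run), 4)
        candidates.extend((i, w, score) for i, w in run)
    return candidates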

2.4. Reduction

Upon receiving the candidate keyframes, the reduction unit starts the reduction process. S-reduction and P-reduction are performed based on the sequence count and the pixel difference, respectively. The approach is based on the assumption that a significant number of informational frames is kept in the box $B_p$ with $\mu > 0$.

2.4.1. S-Reduction

S-reduction, or sequence-check reduction, involves two types of decisions. The first step in determining potential keyframes is to count the length of each continuous frame run in the candidate set. For a sequence count of two, if one of the frames is from $B_n$, that frame is discarded; otherwise, both frames are kept. For a sequence count of three, if the middle frame is from $B_p$, the outer frames are discarded; otherwise, the middle frame is discarded. Runs with a sequence count greater than three are sent for P-reduction.
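
The S-reduction decision rules can be sketched as follows, assuming each candidate run is given as a list of (frame index, positive-box flag) pairs; this data layout is an assumption made for illustration.

def s_reduce(run):
    """Apply the sequence-count rules; return (kept frames, needs P-reduction)."""
    if len(run) == 2:
        # Discard any frame that came from the negative-value box; keep the rest.
        return [f for f in run if f[1]], False
    if len(run) == 3:
        middle = run[1]
        # Keep only the middle frame when it is from the positive-value box,
        # otherwise keep the two outer frames.
        return ([middle] if middle[1] else [run[0], run[2]]), False
    return run, True   # sequence count > 3: hand the run over to P-reduction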

2.4.2. P-Reduction

A keyframe is chosen by comparing pixel differences between successive frames against an adaptive threshold, taken as the mean pixel difference of the current run. The final output is the keyframe set that represents the sign video abstractly. The algorithmic representation of FSC2 keyframe extraction is given in Algorithm 1. It takes in the frame sequence of the sign video and outputs the keyframe set K; in the listing, f_i denotes the ith frame and each frame carries the weight assigned by the wrapper.

Input: F: set of all frames in a CSL video
Output: K: set of all keyframes
foreach frame f_i in F do
 flow ← Gunnar_FarnebäckOpticalFlow (f_i, f_i+1)
 foreach element in flow do
  θ_i ← polarize (element)  //angular displacement
 end
 Δθ_i ← θ_i+1 - θ_i  //angular displacement difference
 CK ← Wrapper (Δθ_i)
 K ← Reduction (CK)
end
Function Wrapper (Δθ_i):
 μ_i ← mean (Δθ_i)  //selector
 if μ_i > 0, then place f_i in box B_p with weight w_p;
 else place f_i in box B_n with weight w_n;
 //sequencer
 foreach box B in {B_p, B_n} do
  find sequence count s of consecutive frames in B
  if (s = 2), then move frames to S1 and set score to 2;
  else if (s = 3), then move frames to S2 and set score to 3;
  else if (s > 3), then move frames to S3 and set score to 4;
 end
 return CK  //weighted candidate keyframes
End Function
Function Reduction (CK):
 //find the sequence count, s
 //S-reduction
 if s = 2, then
  if a frame is from B_n, then discard it;
 end
 if s = 3, then
  if the middle frame is from B_p, then discard the outer frames;
  else discard the middle frame;
 end
 if s > 3, then
  //P-reduction
  foreach frame f in the sequence do
   pd ← pixel difference between f and its successor
  end
  T ← mean (pd)  //adaptive threshold
  if pd > T, then keep f in K;
 end
 return K
End Function
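
As a minimal illustration of the P-reduction step of Algorithm 1, the sketch below uses the mean pixel difference of a long run as the adaptive threshold. The use of grayscale frame arrays and OpenCV's absdiff as the pixel-difference measure are assumptions, not the paper's exact implementation.

import cv2
import numpy as np

def p_reduce(frames):
    """frames: list of grayscale frame arrays forming one candidate run longer than three."""
    # Pixel difference between each frame and its successor.
    diffs = [float(cv2.absdiff(a, b).mean()) for a, b in zip(frames, frames[1:])]
    threshold = float(np.mean(diffs))   # adaptive, per-sequence threshold
    # Keep a frame when the change leading into it exceeds the mean difference.
    return [frames[i + 1] for i, d in enumerate(diffs) if d > threshold]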

The number of keyframes for a given sign is chosen by the FSC2 algorithm without reference to any specific parameter: each sign's motion dictates how many keyframes the algorithm selects for it. For small signs, one or two keyframes are chosen; if a sign involves a lot of motion, the algorithm selects more keyframes to identify it.

3. Experimental Results and Analysis

Two datasets were used to test the FSC2 keyframe extraction algorithm: the RWTH-PHOENIX-Weather 2014 dataset [7] and the How2Sign dataset [44]. RWTH-PHOENIX-Weather 2014 contains German sign language weather broadcasts captured at 210 × 260 pixels per frame at 25 frames per second; extracting keyframes precisely from this dataset is an important research goal, since it serves as the baseline for most current sign language research. How2Sign is a multimodal and multiview continuous American Sign Language dataset with more than 80 hours of sign language videos recorded in parallel by 11 signers. The backgrounds of both datasets are static. Three sentences of varying length and signer are taken from the datasets for analysis and visualization; Table 1 details the sentences used for evaluation and analysis. Two sentences are from the Phoenix14 dataset and one is from the How2Sign dataset.

Figure 4 demonstrates the output obtained for the 176-frame recording “LIEB ZUSCHAUER ABEND WINTER GESTERN loc-NORD SCHOTTLAND loc-REGION UEBERSCHWEMMUNG AMERIKA IX” together with the corresponding SignWriting notation for each word. The proposed approach reduces the frameset from 176 frames to 48 frames, and the figure shows that all informational frames are effectively captured while the directional information is preserved. The sign for the word “LIEB” is well captured as a rubbing gesture, as notated in SignWriting. The “WINTER” gesture, a modest forward and backward motion of both hands, is also captured well with a low frame count.

3.1. Analysis of the μ Value

The μ value is derived from the difference between two consecutive angular displacement maps obtained from the Gunnar Farnebäck optical flow algorithm. Figure 5 shows a trace of μ over the ground-truth frames of the original videos for the sentences in Table 1. Sentences with varying word lengths, signers, and finger signs are taken at random from the datasets, and the ground truth is estimated manually. The ground-truth mapping chart illustrates that most signs appear at μ > 0; thus, by prioritizing μ > 0, this study is able to identify the informational frames of signs. As can be seen from Figure 6, the μ value can capture both small and large displacements in a sign, benefiting the wrapper and reduction algorithms.

3.2. Experimental Setup

The procedure is divided into two parts. The main contribution is the generation of keyframes from continuous sign-language videos, which was run on an AMD Ryzen 5000 series CPU system; the algorithm was implemented in Python 3.10. Google Colab is used for training and testing in the second task, which involves sign language recognition.

3.3. Performance Analysis

Three metrics have been used to evaluate the effectiveness of the proposed algorithm:
(1) Reduction rate (R)
(2) Accuracy (A)
(3) Word error rate (WER), sketched below
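
For reference, WER is the standard edit-distance metric reported in Table 4; a minimal sketch of its computation over gloss sequences follows. The example glosses shown are illustrative only.

def wer(reference: list[str], hypothesis: list[str]) -> float:
    """(substitutions + deletions + insertions) / reference length."""
    r, h = len(reference), len(hypothesis)
    d = [[0] * (h + 1) for _ in range(r + 1)]
    for i in range(r + 1):
        d[i][0] = i
    for j in range(h + 1):
        d[0][j] = j
    for i in range(1, r + 1):
        for j in range(1, h + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[r][h] / r

# Example: one substitution and one deletion over a five-gloss reference gives WER 0.4.
print(wer("LIEB ZUSCHAUER ABEND WINTER GESTERN".split(),
          "LIEB ZUSCHAUER MORGEN WINTER".split()))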

Table 2 shows the reduction rate and accuracy obtained on the two datasets with different keyframe extraction methods. As the values imply, FSC2 performs well on both datasets, capturing the majority of significant frames while eliminating unimportant ones.

Figure 6 presents the accuracy chart for different sentences: the keyframes obtained from the FSC2 keyframe extraction algorithm are traced against the ground-truth sign frames. A Venn diagram of the same data is plotted in Figure 7 to demonstrate the reduction and accuracy rates. Table 3 demonstrates that the approach is scalable and stable by providing the representation across different sentences, signers, and sentence lengths.

From the figures, it is evident that the FSC2 keyframe extraction algorithm efficiently captures almost all the major and minor gestures in the continuous sign video. WER is evaluated by feeding the reduced frameset to two recognition systems. This work chooses SAN [45] and VAC [46] as recognition systems, and the obtained results are shown in Table 4. SAN [45] is a transformer-based architecture; with some data augmentation, the network attains a better WER when trained with keyframes. VAC [46] uses an iterative training scheme on a CNN framework. Both SAN and VAC are trained and tested with datasets obtained from methods such as pixel difference, the gradient-based approach, Zernike moments, and the FSC2 algorithm. The outcomes demonstrate that the proposed algorithm efficiently collects informational frames while eliminating transitional frames, which benefits both global and local receptive fields. Figure 8 shows the percentage variance of the WER values obtained after keyframe extraction, based on Table 4. The findings indicate that, compared to the previous methods, FSC2 keyframe extraction reduces WER more successfully; relative to the baseline, the proposed algorithm lowers the WER more than the other algorithms.

3.4. Computational Complexity

The computational complexity of neural network models is commonly assessed using training time complexity, run time complexity, and space complexity. Using the FSC2 keyframe extraction algorithm, the computational complexity is reduced from being proportional to the original number of frames to being proportional to m, the new size of the dataset.

3.4.1. Space Complexity

Space complexity can be assessed by the amount of space required to store the model input.

Worst-case space complexity = O(n), where n is the total number of frames without reduction; average or best-case space complexity = O(m), where m is the total number of keyframes after reduction.

Table 2 shows that the algorithm can reach a reduction rate of approximately 75%. As a result, the space requirement is decreased from n to m frames, a reduction of about 75% and a cost-effective solution.

3.4.2. Time Complexity

Time complexity can be estimated as the train time complexity or run time complexity, when the keyframe set is fed as input to the neural network model.

Worst-case time complexity = O(n), where n is the total number of frames without reduction; average or best-case time complexity = O(m), where m is the total number of keyframes after reduction.

Because the number of input frames is 75% lower than in the original dataset, training and run time complexity can be reduced by up to 75%, allowing the network to extract and learn features faster.

3.5. Scalability and Stability

Scalability refers to the capacity of a keyframe extraction method to run on various kinds of datasets captured under a variety of circumstances and yield exact results. The scalability of keyframe extraction techniques can be impacted by variables including data independence, signer independence, and phrase or word length. The FSC2 keyframe extraction algorithm can be used to reduce any sign language dataset, regardless of the type of sign language or the frame rate of the video, and is thus data independent and scalable. Table 2 shows the reduction rate and accuracy obtained on the two datasets. Both the design statistics and the absence of a static threshold value lend credence to this benefit. On four distinct signs executed by three distinct signers, the algorithm offers the best and most precise reduction, as shown in Table 3. The signs “MORGEN,” “GESTERN,” “LIEB,” and “KNIFE” are taken into consideration for analysis, and it is found that the signs are accurately reduced when performed by various signers. The algorithm is therefore accurate and stable, regardless of the signer and language. The average sentence length over the 5670 videos in the Phoenix14 dataset is less than 9 words. The How2Sign dataset includes finger signing as well as long and short sentences. Without using any additional parameters, the FSC2 keyframe extraction algorithm extracts keyframes for all categories accurately and efficiently.

3.6. Comparison of the FSC2 Keyframe Extraction Algorithm across Various Approaches

Table 5 compares the FSC2 keyframe extraction method with a few benchmark keyframe extraction techniques. Stability (S) examines the ability to extract keyframes consistently for the same sign performed by different signers. The static threshold (T) parameter checks whether any static threshold value is used for keyframe selection. Scalability measures data independence (DI), i.e., whether an algorithm can be applied to different datasets without any changes, and is tested to ensure that it can work with multisigner, multilanguage, and variable-length data. The direction or continuity (C) of a sign is also an important element in sign recognition, so keyframes must also contain directional information. By retaining transition frames, continuous signs are prevented from becoming isolated signs. A qualitative analysis of the keyframe extraction algorithms can be found in Table 5; the analysis shows that the proposed algorithm successfully meets the abovementioned qualities when extracting keyframes from CSL videos.

4. Ablation Study

4.1. Changing the Criteria

The main notion of FSC2 is that keyframes can be identified at μ > 0. A study was carried out using the opposite criterion. Compared to the original notion, the obtained result is less precise: there was inadequate similarity between the ground truth and the keyframes. Figure 9 shows the keyframe count m, reduction rate R, and accuracy A obtained with the two criteria when applied to three sentences. The keyframe count and reduction rate are depicted by bars, while the accuracy is represented by a line. Figure 9 shows that the choice of μ > 0 gives better results.

4.2. Motion Analysis Using the Lucas–Kanade Method

The Gunnar Farnebäck (GF) algorithm is replaced by the Lucas–Kanade (LK) method, and the results demonstrate that the GF algorithm is superior because it computes dense motion between two successive frames and thus captures all motions in the signs. Figure 10 shows the performance of both optical flow algorithms when applied to three sentences. Accuracy is represented by the line chart, and it is clear that Gunnar Farnebäck gives a better value, as it can capture the small motions between two frames.

4.3. Changing the Sequence Value

For keyframe extraction, the FSC2 algorithm examines three sequence values, i.e., 2, 3, and more than 3, to capture both long and short signs. The sequence values were altered in various orders in order to capture the important frames, but the outcome fell short of the standard set by FSC2 in both reduction rate and accuracy.

5. Conclusion

The proposed FSC2 keyframe extraction method is developed to extract keyframes from continuous sign videos. The extraction process successfully retains every informational frame while achieving a high reduction rate. This enables researchers to complete CSL-related tasks in less time, with less sophisticated computational hardware, and with less storage. In contrast to previous works, the algorithm extracts gesture information from videos while maintaining factors such as continuity and motion direction. Despite the computationally expensive nature of optical flow techniques, FSC2 keyframe extraction is efficient for both long and short sign sequences in terms of accuracy and stability. The algorithm design is kept simple, using statistical methods on optical flow data that run on basic hardware. The results showed that the proposed strategy produced highly competitive outcomes when compared with state-of-the-art approaches. Thus, the algorithm addresses six major problems related to keyframe extraction from CSL videos: stability, scalability, preservation of direction information, detection of small and repeated movements in a sign, low information loss with high accuracy, and a good reduction rate. An evaluation of the algorithm's performance is conducted on existing recognition systems to ensure that it performs the task efficiently. All datasets included in this study have static backgrounds; the angular displacement and optical flow data are affected by background object movement, so the motion estimate employed in the FSC2 approach cannot precisely isolate the sign when the background is changing, and the proposed algorithm performs worse than it does on static data. Additional investigation of real-time sign language with varying backgrounds is necessary.

Data Availability

The data used to support the findings of this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare that they have no conflicts of interest.