Abstract

Anomaly detection and behavioral recognition are key research areas widely used to improve human safety. However, with the extensive use of surveillance systems and the substantial increase in the volume of recorded scenes, conventional manual categorization of anomalous events has become a difficult task. As a result, machine learning researchers require smart surveillance systems that detect anomalies automatically. This research introduces a robust system for predicting pedestrian anomalies. First, we acquire crowd data as input from two benchmark datasets (Avenue and ADOC). Then, several denoising steps (frame conversion, background subtraction, and RGB-to-binary image conversion) are applied to the unfiltered data. Second, texton segmentation is performed to identify human subjects in the denoised data. Third, Gaussian smoothing and crowd clustering are used to analyze the multiple subjects in the acquired data for further estimation. Next, feature extraction yields multiple abstract cues from the data; this bag of features includes periodic motion, shape autocorrelation, and motion direction flow. The extracted features are then mapped into a single vector so that data optimization and mining techniques can be applied, and an associate-based mining approach performs optimized feature selection. Finally, the resultant vector is fed to a k-ary tree hashing classifier to track normal and abnormal activities in crowded pedestrian scenes.

1. Introduction

Regarding human video-based surveillance systems, the visual interpretation of abnormal behaviors is a field of crucial significance. Anomalies in camera footage comprise illegal acts or potentially harmful conditions that pose a risk to the public. Vast numbers of video surveillance devices have been installed in outdoor spaces around the world. However, the majority of these devices passively gather video without any analytic functionality. Considering the huge amount of data created by the cameras every second, it is infeasible for humans to comprehend this massive database of video sequences. How to enhance visual surveillance methods for the prediction, estimation, and identification of abnormal events such as pedestrian incidents, road accidents, crimes, robberies, violent conflicts, and unauthorized intrusions has become a major research area in domains such as smart homes and state security. Consequently, researchers require computer vision techniques to detect irregularities in videos automatically.

Due to human cognitive limits, such as decreasing attention after a certain period and constraints in spatial cognition, evaluating huge datasets can be challenging [1]. To overcome these limitations, visual surveillance technology has gained importance in various fields, including security, healthcare, and education [2]. These technologies enable the efficient extraction of meaningful information from enormous volumes of video sequences and the automatic monitoring of aberrant circumstances, hence lowering the amount of manual labor necessary for these activities [3]. In the real world, various domains are suited to this technology, such as smart management systems, the educational sector, smart healthcare and monitoring, data and information encryption methods [4], and surveillance of public places. Although there are a variety of approaches for designing intelligent video surveillance, the fundamental strategy is to characterize regular events and recognize those that do not fit the learned framework [5]. Researchers prefer this strategy because the characterization of an anomaly differs depending on context, i.e., scenarios regarded as anomalous in one sequence may be considered normal at another moment, and because of the difficulty of enumerating all anomalous behavior variations.

In earlier systems, a significant amount of research was put into creating “hand-crafted” features that effectively reflect actual activities. Object trajectories of typical activities were retrieved by characterizing each object’s motion with traditional heuristic methods [6]; objects whose tracks diverged from previously learned patterns were then regarded as anomalies. Such trajectory-based approaches were usually found impracticable for evaluating crowded scenes and were superseded by methods based on global rectangular regions, which use mid-level cues obtained from 2D video frames or neighboring 3D multimedia frames, such as temporal and spatial shading [7], histograms of optical flow, mixtures of dynamic patterns, and motion characteristics [8]. These classifier-based approaches have a key drawback: they are hard to adjust to the wide range of detection and segmentation conditions present in various scenarios.

Previously, efforts were made to apply supervised learning to anomaly prediction in video data, following the remarkable results of supervised learning on video processing tasks such as classification, identification, object detection, and motion detection [9]. Furthermore, researchers [10] initiated the use of convolutional neural networks for abnormal activity identification due to their widespread acceptance. Most of these techniques start by extracting features using machine learning and optimization algorithms, followed by training classification models, such as a one-class classifier. However, these features were not developed and customized for the entire problem, and such feature maps are not optimal [11–13].

Our research proposes a robust and smart approach to the pedestrian anomaly prediction (PAP) framework. We utilize human-crowded video-based data as input for the proposed method. Initially, we perform different preprocessing steps including denoising, frame conversion, background subtraction, and RGB-to-binary image conversion. After this, we identify the humans in the given input data using texton-based segmentation. Then, we apply Gaussian smoothing and crowd clustering while analyzing the crowd for further calculations and results. Once humans are detected and the crowd analysis is settled, the next step is to extract various features. Therefore, we extract a bag of features involving three types: periodic motion, motion direction flow, and shape autocorrelation. After extracting the features, we map them into a single vector and apply data optimization and mining techniques. For this, we adopt the associate-based mining technique, which gives us the optimized vector for further classification. Finally, we apply the K-ary tree hashing algorithm to find normal and abnormal activities in pedestrian crowd-based videos. Our main contributions are as follows:
(1) To predict indoor and outdoor pedestrian behavior in crowd-based scenes, we propose a robust approach.
(2) We introduce a bag of features in which we extract various cues such as periodic motion, motion direction flow, and shape autocorrelation to predict normal and abnormal activities.
(3) For data reduction, optimization, and mining, we apply an associate-based mining technique, and for classification, we apply a K-ary tree hashing algorithm over two benchmark datasets.
(4) We compared the performance of our proposed PAP system with other state-of-the-art classifiers (such as XGBoost and SVM) on two benchmark datasets (Avenue and ADOC). The results show that the proposed PAP system significantly outperforms the other classifiers.

The remainder of our work is structured as follows: Section 2 presents the background study of the proposed work. Section 3 introduces the primary approaches applied in the PAP methodology, covering denoising, human and crowd analysis, feature extraction (bag of features), data optimization, and classification. The experimental evaluation and results are presented in Section 4. Finally, Section 5 discusses the PAP system and Section 6 summarizes the conclusions drawn and the insights gained.

2. Related Work

We briefly discuss the essential requirements that culminate in dramatically varied methodologies for pedestrian video abnormal action identification. The variety of potential anomalous occurrences is the primary obstacle to the anomaly estimation problem. Numerous researchers address this challenge by explicitly describing abnormalities and essential characteristics that can be used efficiently for image classification, with gesture trajectory being the most popular. These investigations seek to establish sequences of object trajectories based on regular occurrences [14]. The process consists of four major phases: object recognition, monitoring, trajectory-based semantic segmentation, and identification [15]. The benefits of procedures in this segment include their straightforward adoption and rapid implementation. Several saliency approaches have attained state-of-the-art results [16]. Rather than specifying and predicting distinct anomaly properties, other researchers view a given series of images as a sequence of smaller 3D regions. Typically, a collection of subsequent frames is integrated along the spatiotemporal direction and then separated into same-size 3D regions, using a window sliding over the current structure [17]. During the inference phase, every 3D patch obtained from untrusted sources is represented as a combination of a training dataset’s episodes, and the reconstruction error serves as the criterion for validating the conclusion. In addition to space partitioning, gating factor analysis is also used to construct 3D sections [18], while other scholars learn the association among regions according to their frequency or pattern-matching identification [19].

Sparse coding presupposes that all instances may be estimated as a linear combination of several elements of learned dictionaries and has been frequently employed in anomaly detection [4]. In particular, sparse coding-based anomaly detection generates a vocabulary under a feature space requirement during training and employs the reconstruction error to identify unusual frames (i.e., abnormal activities) during detection. Another primary concern is operational effectiveness, which complicates practical applications such as surveillance tape evaluation. For instance, the authors of [4] suggest an interactive sparse coding strategy to combat this disadvantage. The representation in [20] abandons the feature space requirement and learns many small translations to encapsulate texture features at various scales. In addition to computational effectiveness, these conventional works have also been plagued by the constraints of their deeply embedded algorithms.
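To make the reconstruction criterion concrete, the following is a minimal sketch (our own illustration, not any cited implementation): it scores a feature vector by its residual against a learned dictionary. A true sparse coder would add an L1 sparsity penalty; plain least squares is used here only to keep the sketch small.

```python
import numpy as np

def reconstruction_score(x, D):
    """Anomaly score of a feature vector x under a dictionary D whose
    columns are learned atoms: the residual norm of the best
    least-squares reconstruction (lower = more 'normal')."""
    coef, *_ = np.linalg.lstsq(D, x, rcond=None)
    return float(np.linalg.norm(x - D @ coef))
```

Normal samples lie close to the span of the dictionary and score near zero; anomalous ones leave a large residual.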

Recently, specific approaches have utilized LISTA [21] for anomaly identification. Although these techniques benefit from swift inference and concurrently learn dictionaries and sparse representations thanks to their machine learning implementation, they still suffer from limitations comparable to the conventional solutions [22]. Generally, such methods use a nonadaptive generating technique and do not consider existing data while optimizing, so the resulting approach may be inadequate and yield lower productivity [23]. For example, sparse or large data typically necessitate a per-dimension patching strategy to conserve sufficient processing resources. In addition, optimal control evaluations [24] have demonstrated that adding available information improves the generalization capability of optimization techniques.

In [25], researchers developed a movement filter that uses temporal as well as spatial data among successive video frames to recover from erroneous detections. A deep convolutional neural network (DCNN) is the foundation of their architecture for estimating crowds in low-population videos; Visual Geometry Group (VGG16), Zeiler and Fergus (ZF), and VGGM networks are utilized in the context of a region-based DCNN for object tracking. In [26], researchers describe a two-stage head identification approach that uses a fully convolutional network (FCN) to produce scale-aware proposals, followed by a convolutional neural network (CNN) that separates each proposal into two groups, i.e., head and backdrop. Experimental findings indicated that utilizing the scale-aware proposals generated by the FCN improves both the recall frequency and the mean average precision (mAP). In [27], researchers propose a solution that utilizes a sophisticated model functioning as a head detector and takes into account the scale fluctuations of heads in videos. The strategy is based on the premise that the head is the most conspicuous body feature in sporting venues where vast crowds congregate. To address the issue of varying scales, they produce scale-aware head proposals based on an estimated density. A convolutional neural network (CNN) is then fed these proposals and returns a response matrix comprising the presence percentages of humans seen across picture dimensions. The authors of [28] present an end-to-end methodology for scale-aware head identification that can accommodate a wide assortment of scales. By simulating a collection of convolutional neural networks with distinct receptive fields, they establish that scale changes may be accounted for. Several scale-specific detectors are incorporated into a single network whose settings are optimized from end to end.

A multiscale histogram of optical flow (MHOF) is obtained to characterize activity in [29], and anomalies are identified based on a sparse reconstruction cost (SRC). The authors in [30] revised the SRC classification algorithm to include sparse dictionary learning and 3D contour characteristics to characterize an activity. Furthermore, Del Giorno et al. [31] proposed identifying variations in video footage by distinguishing images from prior images. As an enhancement of the bag of video words (BOV) technique, Javan Roshtkhari and Levine [32] proposed a probability distribution function for encoding spatiotemporal arrangements of video volumes using gradient characteristics. They identified anomalous behavior with the suggested statistical tests by merging statistical properties, including HOG and HOF, to represent the activities. Colque et al. [33] introduced a new dynamical feature selection method, histograms of optical flow orientation, magnitude, and entropy (HOFME), by adding mobility and stochastic variables to HOF. Despite their robustness when contending with complicated environments, spatial cuboid-based approaches may fail to perceive long-term actions such as loitering, because loitering is tied to a person’s long-term mobility rather than their relatively local cuboid mobility.

Most prediction-based anomaly detection methods utilize a few prior frames to anticipate the target frame. Compared to probability estimation methodologies, such methods are more logical and sensible in learning, and the systems are straightforward to configure. In contrast to the frame modelling techniques [34], the frame estimation techniques evaluate the anomaly’s appearance, orientation, and activity. Furthermore, the benefits of frame prediction strategies outweigh those of frame generation systems. P-GAN, a representative effort, predicted future images using U-Net as the GAN generator [35]. The obtained frame was evaluated with an intensity loss, a gradient loss, an optical flow loss, and the adversarial loss of the GAN discriminator [36]. The overall loss of the entire network was developed from the aforementioned indicators. Lastly, anomaly detection utilized the discrepancy between an image and its estimate given by the learning algorithm [37]. The model’s structure is uncomplicated, but its effectiveness is exceptional, and the anomaly prediction model maintains accurate anomaly identification at an adequate level.

All of the above methods have limitations; to deal with them, we propose a robust approach. Initially, we acquire the data, convert it into frames, and denoise it; after this, we apply texton-based segmentation and crowd clustering. The next step is to extract periodic motion, motion direction flow, and shape autocorrelation features from the image sequences, optimize them via associate-based data mining, and classify normal and abnormal behavior using the k-ary tree hashing algorithm. We apply this methodology to two publicly available datasets, achieve significant improvements, and finally compare it with existing state-of-the-art methods.

3. Materials and Methods

This section describes our proposed PAP approach, which comprises multiple subparts. Initially, we collect RGB data via surveillance and multifunctional cameras in video and image format. After receiving the input, we apply preprocessing steps such as video-to-frame conversion, frame scaling, denoising, and RGB-to-binary conversion. The next phase detects humans, distinguishing them from non-human objects, clusters the crowd, and analyzes the data. The bag-of-features extraction technique is then applied, and three robust features are extracted: periodic motion, motion direction flow, and shape autocorrelation. These are optimized via associate-based data mining, and normal and abnormal behavior is classified utilizing the k-ary tree hashing algorithm. Figure 1 presents a comprehensive graphical summary of the proposed strategy.

3.1. Preprocessing

In this section, we discuss the preprocessing for the proposed method. Initially, background subtraction is performed using change detection and connected-components-based approaches. We use a connected-component labeling method to segment the human silhouette and identify skin pixels as salient sections. Upon obtaining the skin components, we segment the human contour using histogram-oriented thresholding. Using Otsu’s technique, the threshold t* in equation (1) is selected to maximize the between-class variance of the intensity histogram:

σ_B²(t) = ω₀(t) ω₁(t) [μ₀(t) − μ₁(t)]², t* = argmax_t σ_B²(t), (1)

where ω₀(t) and ω₁(t) are the class weights, t* is the threshold proposed by Otsu’s method, and μ₀(t) and μ₁(t) are the class means, whose separation peaks at the location of the maximum skin frequency in the extracted histogram. This method is applied to every grayscale sector of the given image, which is binarized as follows:

B(x, y) = 1 if I(x, y) > t*, and B(x, y) = 0 otherwise. (2)
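As an illustration of the thresholding step, the following is a minimal NumPy sketch of Otsu's method (a stand-in for the pipeline described above, not the authors' code): it picks the threshold that maximizes the between-class variance of the histogram and binarizes the frame with it.

```python
import numpy as np

def otsu_threshold(gray):
    """Otsu's method: choose the threshold that maximizes the
    between-class variance of a uint8 image's histogram."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    prob = hist / hist.sum()
    omega = np.cumsum(prob)                # cumulative class weight w0(t)
    mu = np.cumsum(prob * np.arange(256))  # cumulative mean
    denom = omega * (1.0 - omega)
    denom[denom == 0] = np.nan             # ignore degenerate splits
    sigma_b2 = (mu[-1] * omega - mu) ** 2 / denom
    return int(np.nanargmax(sigma_b2))

def binarize(gray, t):
    """Foreground mask: pixels strictly brighter than the threshold."""
    return (gray > t).astype(np.uint8)
```

On a bimodal frame (background vs. skin/foreground intensities), the returned threshold falls between the two modes.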

Figure 2 presents example background subtraction results over a given set of images from the Avenue benchmark dataset.

3.2. Human Detection and Crowd Analysis

In this section, human recognition and crowd analysis are discussed. After extracting the human contour, a mask is placed on the human figure to identify the human body shape. The frame encompasses all human regions, with red representing humans. Once human validation is accomplished in the crowd-based datasets, the next step is to detect normal and deviant human behavior through crowd grouping and evaluation. Initially, Gaussian smoothing [38] is used to support human recognition. Gaussian smoothing softens an image by weighted averaging; the standard deviation of the Gaussian determines the degree of smoothing. The Gaussian provides an “evaluation model” of a given pixel’s surroundings, with the weight increasing toward the central pixel. Figure 3 shows the results of human detection and crowd analysis.
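A minimal sketch of the smoothing step (illustrative only; the kernel size and sigma are assumptions, and a production pipeline would use a library convolution):

```python
import numpy as np

def gaussian_kernel(size=5, sigma=1.0):
    """Normalized 2-D Gaussian kernel; weights decay with distance
    from the central pixel and sum to one."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx**2 + yy**2) / (2.0 * sigma**2))
    return k / k.sum()

def smooth(img, size=5, sigma=1.0):
    """Convolve an image with the Gaussian kernel (zero-padded borders)."""
    k = gaussian_kernel(size, sigma)
    pad = size // 2
    padded = np.pad(img.astype(float), pad)
    out = np.empty(img.shape, dtype=float)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = np.sum(padded[i:i + size, j:j + size] * k)
    return out
```

Because the kernel is normalized, flat regions are preserved while noisy pixels are pulled toward their neighborhood average.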

3.3. Bag of Features

In this step, we discuss the bag of features and the approaches used to extract these features over the benchmark datasets. We utilized three robust feature extraction frameworks: periodic motion, motion direction flow, and shape autocorrelation. Algorithm 1 outlines the complete feature extraction procedure.

Input: Frame_data
Output: Feature_vect
Data_F ← GetData_F()
Data_size_F ← GetData_F_size()
Procedure PAP (Video, Images)
 Feature_vect ← []
 Denoise_Input_Data ← Preprocessing (Win, Median)
 Sampled_Data ← Sample (Denoise_Input_Data)
 while exit state is invalid do
  Features ← ExtractFeatures (Sampled_Data)
  Feature_vect ← Feature_vect ∪ Features
 end while
Return Feature_vect

3.4. Periodic Motion

This spatial feature detects human motion over body segments. The region of attention is the component of the human body that triggers repeating patterns; such a region is identified using a fundamental examination of the human anatomy, and a bounding box displays the area of focus. Equation (3) gives the mathematical formula for periodic motion:

P(k) = α sin(ωk + t), (3)

where P(k) represents the periodic motion and α sin(ωk + t) models the human motion that recurs in any specified sequence of images (see Figure 4).
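As a sketch of how such a periodic cue can be measured (an assumption about the implementation, not the authors' code), the dominant oscillation frequency of a limb trajectory of the form α sin(ωk + t) can be read off its magnitude spectrum:

```python
import numpy as np

def dominant_frequency(signal, fs=1.0):
    """Dominant nonzero frequency (Hz) of a 1-D motion signal,
    taken from the magnitude spectrum after removing the mean."""
    spectrum = np.abs(np.fft.rfft(signal - signal.mean()))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    return freqs[np.argmax(spectrum)]

# Synthetic limb trajectory alpha*sin(omega*k + t), sampled at 30 fps;
# the 1.5 Hz gait-like oscillation is an assumed example value.
fs, alpha, f0 = 30.0, 2.0, 1.5
k = np.arange(300) / fs
trajectory = alpha * np.sin(2 * np.pi * f0 * k)
```

Applied to a tracked body segment, a stable dominant frequency indicates periodic motion such as walking; its absence flags aperiodic movement.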

3.5. Motion Direction Flow

For the motion flow features, a color scheme separation model is implemented for the provided data images. With it, we identify the motion direction and colorize it. We can then collect these indicators and convert them into a feature vector for use in later estimation and computation. These motion flow properties are formulated as follows:

M_f = F(C_RGB, I, V), (4)

where M_f defines the motion flow vector, C_RGB denotes the detected RGB attributes, and I and V are the input image and frame data. Figure 5 illustrates the result of the direction flow features.
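One simple way to estimate a direction-of-motion indicator between consecutive frames (a hedged stand-in for the color-coded flow model described above, not the authors' implementation) is phase correlation, which recovers the dominant translation and hence a flow angle:

```python
import numpy as np

def motion_direction(frame1, frame2):
    """Dominant translation from frame1 to frame2 via phase correlation;
    returns (dy, dx) and the corresponding flow angle in degrees."""
    cross = np.conj(np.fft.fft2(frame1)) * np.fft.fft2(frame2)
    cross /= np.abs(cross) + 1e-12         # keep phase only
    corr = np.abs(np.fft.ifft2(cross))
    dy, dx = np.unravel_index(np.argmax(corr), corr.shape)
    h, w = frame1.shape
    if dy > h // 2:                        # wrap into signed range
        dy -= h
    if dx > w // 2:
        dx -= w
    angle = np.degrees(np.arctan2(dy, dx)) % 360.0
    return dy, dx, angle
```

The angle can then be mapped to a hue for visualization, in the spirit of the color-coded direction maps shown in Figure 5.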

3.6. Shape Autocorrelation

We applied a virtualization methodology to the detected human body as part of the shape-based autocorrelation feature. We consider particular positions as centers and construct a 5 × 5-pixel window around each center, from image I_t to image I_{t+1}, and replicate this procedure for every discovered human body. After obtaining a 5 × 5 patch for all critical human body areas, the autocorrelation is calculated. The median m is determined as described in the following equation:

m = median(I_t), (5)

where m is the median and I_t is the given image sequence at time t, while the autocorrelation over time is represented as

R = Σ_t (I_t − m)(I_{t+1} − m) / σ², (6)

where σ² is the total variance of the given data and R is the correlogram index value.
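The patch-and-correlate procedure above can be sketched as follows (an illustrative implementation under our own assumptions about window handling, not the authors' code):

```python
import numpy as np

def window_5x5(img, cy, cx):
    """5x5 patch centered on a detected body keypoint (cy, cx)."""
    return img[cy - 2:cy + 3, cx - 2:cx + 3]

def patch_autocorrelation(patch, lag=1):
    """Normalized lag-1 autocorrelation of the flattened patch,
    centered on its median (as in the text) and divided by the
    total variance, giving a correlogram-style index."""
    x = patch.ravel().astype(float)
    x = x - np.median(x)
    var = np.sum(x * x)
    if var == 0.0:
        return 0.0
    return float(np.sum(x[:-lag] * x[lag:]) / var)
```

Smooth body-shape patches yield high autocorrelation, while noisy or non-shape regions score low, which is what makes the cue discriminative.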

3.7. Data Optimization: Associate-Based Data Mining

The association rule-based feature mining approach enables us to select the most distinctive characteristics by eliminating redundant and discordant elements from the obtained sample, which can otherwise reduce the accuracy and reliability of motion activity identification and pedestrian anomaly prediction. This is a bottom-up method that begins with an empty feature set and gradually adds new features depending on the selection criterion of an optimization technique. This can reduce the variance, resulting in greater precision. Various sectors, including surveillance equipment, smart healthcare systems, and computer vision-dependent intelligent systems, frequently employ the association rule-based feature mining methodology.

In the current infrastructure, a Bhattacharyya distance-based feature minimization method is applied for numerous event-based classifications. It calculates the separation rating D_B(x, b) between the x and b regions and then validates it. This methodology enables the reduction of the significant aspects of the domain material, but the appropriate outcome for pedestrian anomaly prediction depends on the optimization method of the feature mining technique:

D_B(x, b) = (1/8)(μ_x − μ_b)ᵀ Σ⁻¹(μ_x − μ_b) + (1/2) ln(|Σ| / √(|Σ_x||Σ_b|)), Σ = (Σ_x + Σ_b)/2, (7)

where D_B is the optimized feature separation, μ_x and μ_b are the mean index values, and Σ_x and Σ_b are the class covariances for the statistics of the normal and abnormal classes. The optimized predicted solution is calculated as follows:

An identification assessment methodology is presented for predicting pedestrian behavior on the given datasets, gathering features that are anticipated to decrease prediction errors and increase interclass accuracy across the feature information. For both the Avenue and ADOC datasets, periodic motion, motion direction flow, and shape autocorrelation-based features are extracted. Figure 6 presents the feature optimization results for the Avenue and ADOC datasets.
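A minimal sketch of the Bhattacharyya separation used for feature ranking (an illustrative implementation under the usual Gaussian-class assumption, not the authors' code):

```python
import numpy as np

def bhattacharyya_distance(mu1, cov1, mu2, cov2):
    """Bhattacharyya distance between two Gaussian classes:
    D = 1/8 (m1-m2)^T S^-1 (m1-m2) + 1/2 ln(|S| / sqrt(|C1||C2|)),
    where S is the average of the two class covariances."""
    mu1, mu2 = np.asarray(mu1, float), np.asarray(mu2, float)
    cov1 = np.atleast_2d(np.asarray(cov1, float))
    cov2 = np.atleast_2d(np.asarray(cov2, float))
    S = 0.5 * (cov1 + cov2)
    diff = mu1 - mu2
    term1 = 0.125 * diff @ np.linalg.solve(S, diff)
    term2 = 0.5 * np.log(np.linalg.det(S) /
                         np.sqrt(np.linalg.det(cov1) * np.linalg.det(cov2)))
    return float(term1 + term2)
```

Features whose normal-class and abnormal-class statistics yield a large distance are kept; near-zero distances mark redundant or discordant features to drop.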

3.8. Classification: K-Ary Tree Hashing

The K-ary tree hashing technique utilizes the structure of a rooted tree in which each node has at most K children. For the recognition and classification approach, a min-hashing step is applied first, which is considered the prestep of K-ary tree hashing. This approach is based on the property that the similarity between two subsets can be estimated by hashing: a set of hash-value estimators is applied to each subset, and the min-hashing operator keeps the minimum hash value of the subset.

To produce a permutation index, we have

Here, the random values used by the hash functions are drawn for the dataset. The K-ary tree hashing technique uses two strategies to discover the optimal result (see Figure 7): a naive approach for determining the frequency of neighboring nodes and min hashing to fix the size of the resulting statistic. The naive method is described in Algorithm 2.

Require: L, K_i
Ensure: T(i)
% N is the neighbor set, L is the data, and T is the size-fixing approach %
(1) temp ← sort (L(K_i))
(2) j ← min (j, |K_i|)
(3) T(i) ← [i, index (temp (1 : j))]
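As background for the min-hashing prestep, the following is a small illustrative sketch (our own, using linear hash functions as an assumption): the fraction of matching signature slots estimates the Jaccard similarity of two sets.

```python
import numpy as np

def minhash_signature(items, num_hashes=64, seed=0):
    """MinHash signature of an integer set: for each random hash
    h_i(x) = (a_i * x + b_i) mod p, keep the minimum over the set."""
    rng = np.random.default_rng(seed)
    p = 2_147_483_647                     # large prime modulus
    a = rng.integers(1, p, size=num_hashes, dtype=np.int64)
    b = rng.integers(0, p, size=num_hashes, dtype=np.int64)
    x = np.fromiter(items, dtype=np.int64)
    return ((a[:, None] * x[None, :] + b[:, None]) % p).min(axis=1)

def minhash_similarity(sig1, sig2):
    """Fraction of matching slots: an estimate of Jaccard similarity."""
    return float(np.mean(sig1 == sig2))
```

K-ary tree hashing extends this idea by recursively hashing each node together with its K children, so that similar subtrees receive similar signatures.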

4. Experimental Results and Analysis

4.1. Validation Methods

Numerous validation approaches have been studied in the existing literature to assess the performance of a proposed system, including train-test holdout validation, leave-one-out, and k-fold cross-validation. However, subject overlap occurs when these methods are applied to datasets with several instances per subject. Therefore, to prevent the challenge of subject overlap noted in the literature, leave-one-subject-out (LOSO) cross-validation is the most applicable approach, and we adopted it for our proposed system. In LOSO, one subject is set aside for testing while the system is trained on the data of the remaining subjects. The identical procedure is repeated for all subjects, and finally the mean recognition rate over all subjects is determined.
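The LOSO loop above can be sketched as follows (illustrative only; the `train_fn`/`predict_fn` hooks are hypothetical placeholders for any classifier):

```python
import numpy as np

def loso_accuracy(X, y, subjects, train_fn, predict_fn):
    """Leave-one-subject-out CV: hold out every sample of one subject,
    train on the remaining subjects, score, and average over subjects."""
    scores = []
    for s in np.unique(subjects):
        test = subjects == s
        model = train_fn(X[~test], y[~test])
        pred = predict_fn(model, X[test])
        scores.append(np.mean(pred == y[test]))
    return float(np.mean(scores))
```

Because every sample of the held-out subject is excluded from training, the score reflects generalization to unseen subjects rather than memorized identities.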

4.2. Evaluation Metrics

Different evaluation metrics, such as accuracy, precision, recall, and F1 score, have been employed to assess the performance of the recommended systems. In our context, LOSO CV accuracy measures how often pedestrian anomalies in the benchmark datasets are recognized correctly. Precision measures how well a proposed system predicts a certain class, recall indicates how often a proposed system recognizes a particular category, and the F1 score combines precision and recall into a single number.
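For reference, these metrics follow directly from the confusion-matrix counts; a minimal sketch:

```python
def prf_scores(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts
    (tp/fp/fn/tn = true/false positives and negatives)."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1
```

The zero guards keep the metrics defined when a class is never predicted or never present.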

4.3. Dataset Descriptions

The Avenue dataset [30] contains 16 training and 21 testing videos. The videos were recorded on the central avenue of the Chinese University of Hong Kong (CUHK) campus and comprise 30,652 frames in total. The training videos capture normal conditions, while the challenging testing clips involve both conventional and abnormal actions. The dataset poses a few additional difficulties, including slight camera shake and varied object scales, along with several samples of unusual events such as strange actions, wrong-direction movement, and abnormal objects.

The second dataset in our study is ADOC [39], a 24-hour collection of surveillance videos that includes 25 distinct event types and a total of 721 instances. It is the largest collection available with unique event annotations for security monitoring. The data were acquired by a surveillance camera installed on a large university campus; it overlooks a walkway connecting different facilities and depicts the hectic everyday activity of students, instructors, and staff.

4.4. Hardware Platform

MATLAB (R2022a) and Google Colab were used for all analysis and experimentation. The main machine was a notebook with an Intel(R) Core(TM) i7-8665U CPU at 1.90 GHz, 16 GB of RAM, and 64-bit Windows 11 Pro.

4.5. Experimental Evaluation and Results

All experiments were performed on two challenging benchmark datasets, Avenue and ADOC. The K-ary tree hashing algorithm is then used to distinguish between normal and abnormal activities in pedestrian crowd scenes. Table 1 presents the confusion matrix for the Avenue dataset, highlighting an anomaly detection rate (ADR) of 88.3% and an error rate (ER) of 11.6%, whereas Table 2 presents the confusion matrix of the ADOC dataset with an ADR of 89.3% and an ER of 10.6%.

In the next step (see Figure 8), we compare the performance of our proposed PAP system with other state-of-the-art classifiers on two benchmark datasets. The results have revealed that the proposed PAP system significantly outperformed other classifiers.

Table 3 presents the comparison of the proposed PAP method performance with other state-of-the-art systems using the Avenue and ADOC databases, respectively.

5. Discussion

The proposed PAP system is designed to predict the anomalous behavior of pedestrians using k-ary tree hashing and associate-based feature mining. The research study is based on denoising, human detection, bag-of-features extraction, cue optimization, and anomaly detection steps. The proposed system is more accurate than the current state-of-the-art methods. However, real-time application presents possible challenges, as it may involve processing overhead. Object occlusion is another issue that can make it harder to recognize pedestrians in crowded areas: persons are obstructed from view by other individuals or objects, which makes abnormal behavior more difficult to spot.

Because human behavior is very diverse and complicated, it is challenging to construct anomaly prediction algorithms that precisely capture all conceivable actions. A considerable amount of training data is typically required for anomaly prediction algorithms to detect anomalies accurately. Yet, it can be difficult to acquire enough data to train the system in crowded environments, particularly for rare or unexpected behavior.

6. Conclusions

Humans are able to find what they are searching for efficiently because the human visual system is attuned to distinctive visual descriptors that make subjects of interest prominent. We introduce a method to replicate this behavior via a bag of features comprising periodic motion, shape autocorrelation, and motion direction flow. In addition, we optimized these features to acknowledge the presence of pedestrians. The resulting pedestrian anomaly predictor is therefore an efficient and robust top-down saliency system.

The experimental results indicate that our PAP detector attains state-of-the-art performance on two benchmark datasets, Avenue and ADOC. Moreover, in both benchmark datasets, we determined that it outperformed all other current state-of-the-art methods evaluated. Based on these findings, it is promising to continue to explore more features impacted by human visual systems to detect more complex scenarios in various settings, including security, transportation, and emergency care.

Data Availability

The data are publicly available as described in the main text.

Conflicts of Interest

The authors declare that there are no conflicts of interest.

Acknowledgments

The research was cofunded by the European Union within the REFRESH project-Research Excellence for Region Sustainability and High-tech Industries ID No. CZ.10.03.01/00/22_003/0000048 of the European Just Transition Fund and by the Ministry of Education, Youth and Sports of the Czech Republic (MEYS CZ) through the e-INFRA CZ project (ID: 90254) and also by the MEYS CZ within the project SGS ID No. SP 7/2023 conducted by VSB-Technical University of Ostrava.