Abstract

Human activity recognition (HAR) is a time series classification problem that is difficult to solve. Traditional approaches require signal processing techniques and domain expertise to engineer features from raw data and fit a machine learning model that predicts a person's movement. This work demonstrates how a hybrid deep learning model can be used to recognize human activity. Deep learning methodologies, namely convolutional neural networks and recurrent neural networks, extract the features and achieve the classification goal. The proposed model is evaluated on the Wireless Sensor Data Mining (WISDM) dataset to predict human activity. The model's performance has been assessed using the confusion matrix, accuracy, training loss, and testing loss. The model achieves greater than 96% accuracy, superior to other state-of-the-art algorithms in this field.

1. Introduction

Human activity recognition (HAR) is concerned with recognizing people's daily routines using sensor-based time-series information [1]. Sensors, the Internet of Things (IoT), cloud computing, and edge computing have all advanced in the last decade. Because sensors are affordable and can be easily incorporated or implanted in both portable and nonportable devices, HAR research has shifted to sensor technologies [2]. Using IoT wearable devices with sensors, it is possible to swiftly capture a variety of body movements for human activity detection [3]. Due to the rapid growth of low-cost smartphones and smartwatches, wearable inertial measurement units (IMUs), which consist of accelerometers and gyroscopes that can easily identify and pinpoint human movements, have become widely available in recent years. The fact that HAR based on wearable sensors can be used to identify and document numerous daily activities such as eating, drinking, and brushing teeth, as well as to detect sleep disorders, demonstrates the importance of HAR.

Actions in HAR are divided into two categories: simple and complex [4]. Brushing one's teeth or dribbling a ball are examples of complex human activities, which involve two separate actions: the simple human activity itself and the transitional action. In machine learning, the HAR challenge is a multivariate time series classification problem addressed through supervised learning [5]. Previous activity recognition research has combined traditional algorithms such as SVM and random forest with newer deep learning approaches such as ANN, CNN, and RNN [6]. With conventional algorithms, feature engineering and feature extraction require a lot of manual labor and time. Deep learning approaches are better suited to categorizing human actions, however, because they can learn features automatically from data [7].

Deep learning models can accurately classify 18 daily complex and specific human activities from the WISDM dataset into three categories: (a) ambulation-oriented activities, such as jogging and other ambulatory movements; (b) nonambulation-oriented activities, such as driving and other motorized activities; and (c) general hand-oriented activities, such as writing and typing [8]. Aside from that, data gathered from low-level time series sensors, such as the accelerometers and gyroscopes found in smartwatches and smartphones, can be utilised to study hand-oriented behaviours like eating chips and spaghetti [9].

A one-dimensional convolutional neural network (CNN) makes it possible to extract local features [10]. An alternative approach, the gated recurrent unit (GRU), takes advantage of neural networks with built-in memory and the ability to retrieve previously stored information. With the help of CNN and GRU, complex human activities can be recognized with great precision. This hybrid model was compared to baseline models such as InceptionTime and DeepConvLSTM, which were built using AutoML based on the McFly open-source Python module [11].

In this paper, a hybrid deep learning model is implemented by combining a convolutional neural network and a recurrent neural network for analysing human activities. Six types of body activities are recognized using the proposed model.

The work is presented in six sections. The first section introduces the work. Section 2 provides a literature review of human activity recognition algorithms and methodologies. The methodology and framework for recognizing human activities are laid out in Section 3. Section 4 compares the performance metrics of the various models. Section 5 presents the findings, conclusions, and future directions. The last section covers the references.

2. Literature Survey

In the past, methods based on machine learning were used to detect human activities. Decision trees are favored over more sophisticated models because of their ease of interpretation and low processing costs [12]. SVMs and k-nearest neighbor (KNN) are two of the most common HAR classifiers that use sensor data [13]. Particle swarm optimization by Tharwat et al. [14] produced an improved k-NN classifier that outperformed standard k-nearest neighbor classifiers when applied to accelerometer data. k-nearest neighbor, an instance-based learning (IBL) technique, is computationally demanding since it compares an incoming instance with every training example. Consequently, it is able to rapidly react to new information and discard outdated data. For six different indoor activities, SVMs were implemented that identified them accurately without requiring a lot of processing power or memory.

Classifiers can be combined, as in the bagging and boosting ensemble meta-algorithms, to increase the final classification quality. Ensemble algorithms, on the other hand, require a larger number of base-level algorithms to be trained and evaluated. Diverse studies have documented deep learning algorithms' current state-of-the-art performance. For the Opportunity dataset, Ordonez and Roggen [15] implemented DeepConvLSTM with a combination of convolutional and recurrent layers to achieve 95.8%, whereas Hammerla et al. [16] used a bidirectional LSTM with two parallel recurrent layers that reach into the "future" and the "past." Both datasets require wearable sensors to be collected, and the use of multiple sensors makes the technology more intrusive. Human action signals were employed in a categorization model established by Sikder et al. [17], which used the frequency and power parameters of the collected signals. Ronao and Cho trained a four-layer CNN using the fast Fourier transform (FFT) of each input channel [18]. In contrast, Ignatov [19] used a shallow CNN with raw sensor data input and 40 statistical features to achieve a 95.2 percent accuracy rate. A real-time solution would require more time and computing power, since noise filters were applied to the accelerometer and gyroscope signals before sampling the dataset in fixed-width sliding windows with 50% overlap (128 readings/window).

The WISDM (wireless sensor data mining) lab developed two datasets: WISDM and Actitracker. Both datasets capture six physical activities with triaxial accelerometers; four activities are shared (walking, jogging, sitting, and standing), and two are unique to each dataset (upstairs and downstairs in the WISDM dataset, and stairs and lying down in the Actitracker dataset) [20]. The outcomes of these implementations are uneven because there is no preselection of datasets for training and testing.

All previous attempts on the WISDM dataset have produced solutions that are dependent on the user, with the exception of the work of Kolosnjaji and Eckert [21]. Using a leave-one-out testing technique, Kolosnjaji and Eckert [21] applied random forest and dropout classifiers on top of hand-crafted features to achieve 83.46 percent and 85.36 percent scores, using a cross-validation procedure repeated seven times with five participants excluded from the testing dataset and the remainder included in the training dataset. Manzi et al. [22] make use of depth camera data to identify human activity. This technique uses a machine learning system to distinguish human activities. In contrast to earlier methodologies, the action is represented using a variable number of clusters obtained independently from the activity instances. The CAD-60 and TST datasets were used to train a multiclass SVM, which was then used to generate these models. SOM optimization was used to prepare the SVM on both datasets. Depending on the input sequence and activity, the number of clusters can fluctuate, resulting in dynamically created clusters.

Milenkoski et al. [23] trained an LSTM network to identify raw accelerometer data windows, and their accuracy was only 88.6 percent based on an 80/20 split. Pienaar and Malekian [24] used a random 80/20 split to train an LSTM and an RNN and achieved a 93.8 percent accuracy rate.

With the help of Microsoft Kinect data and Minkowski distances between joint data, another group of researchers developed a pose descriptor based on differential quantity encoders, efficiently capturing the information of a human joint's posture in a frame sequence [25]. When matching the descriptor, they used the k-nearest neighbor method; their approach was nonparametric and offered low-latency recognition.

Siirtola and Röning [26] sampled their custom dataset at a frequency of 40 Hz, whereas the WISDM and Actitracker datasets were sampled at a frequency of 20 Hz. According to the authors, 98% and 99% of the walking activity's power was confined below 10 Hz and 15 Hz, respectively. Higher frequencies accounted for only 5 percent of the total signal power. It was also shown that when a mobile device was carried at the hip, the dominant frequencies were lower than 18 Hz. Lower sample rates can therefore be used to save power and storage space. The ideal window size has been the subject of much research; most implementations employ fixed window sizes between roughly 1.28 s and 10.7 s, while Ignatov et al. tested their implementation over a range of window sizes from 1 to 10.6 s.

3. Proposed Work

In this section, the proposed work is explained: the wireless sensor data mining (WISDM) dataset is used to perform human activity recognition. The features are extracted first using a CNN, and classification is then performed using an LSTM model [27]. Figure 1 displays the flowchart of the proposed work; the following subsections discuss the proposed work in detail.

3.1. Dataset

In this work, the wireless sensor data mining (WISDM) dataset is used for the detection of human activities [28]. The collection contains 1,098,207 samples of diverse physical activities (sampled at 20 Hz). The dataset can be found on the WISDM website at https://www.cis.fordham.edu/wisdm/dataset.php. Six activities are included in the data collection: walking, jogging, upstairs, downstairs, sitting, and standing.
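For reference, a minimal sketch of loading the raw WISDM file with pandas is shown below. The file name, the (user, activity, timestamp, x, y, z) column layout, and the trailing semicolon on the z column are assumptions based on the public raw-data release rather than details given in this paper.

```python
import pandas as pd

COLUMNS = ["user", "activity", "timestamp", "x", "y", "z"]

def load_wisdm(path="WISDM_ar_v1.1_raw.txt"):
    # A few lines in the raw file are malformed, so they are skipped.
    df = pd.read_csv(path, header=None, names=COLUMNS, on_bad_lines="skip")
    # The z values end with a trailing ';' in the raw file (assumption).
    df["z"] = pd.to_numeric(df["z"].astype(str).str.rstrip(";"), errors="coerce")
    return df.dropna()

df = load_wisdm()
print(df["activity"].value_counts())  # six classes: Walking, Jogging, ...
```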

3.2. Data Preprocessing

In this section, data preprocessing approaches are discussed. First, feature extraction is performed; then, classification is done using a deep learning model.

3.2.1. Feature Extraction

An example will help us comprehend the notion of a feature [29]. As soon as a photo is taken, the next step is to identify it; to do so naively, the user must keep vast numbers of photos that take up a lot of storage space and measure each photo pixel by pixel. We must therefore perform feature extraction in order to store the image compactly. Feature extraction reduces the image's dimensionality. In computer vision, a "feature" is an important notion: it refers to a salient element or piece of information, such as edges or other artefacts.

Zernike and Fourier descriptors are two examples of extraction algorithms [30]. To put it another way, descriptors are numbers assigned to describe a particular kind of shape. A few common ones are:

(i) Area: the number of pixels inside the shape's region
(ii) Perimeter: the number of pixels that fall on the shape's boundary
(iii) Elongation: fit the shape into its smallest enclosing rectangle, then compute the ratio of the rectangle's height to its width
(iv) Rectangularity: how much of the object's surface area is contained within that rectangular shape
(v) Orientation: the shape's general orientation

A minimal code sketch of these descriptors follows.
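Assuming a grayscale image with a single dominant shape, the descriptors above might be computed with OpenCV as sketched here; the threshold value and the choice of the largest contour are illustrative assumptions, not part of the original method.

```python
import cv2

def shape_descriptors(gray):
    _, binary = cv2.threshold(gray, 127, 255, cv2.THRESH_BINARY)
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    c = max(contours, key=cv2.contourArea)          # largest shape in the image
    area = cv2.contourArea(c)                       # (i) pixels in the region
    perimeter = cv2.arcLength(c, True)              # (ii) boundary length
    (_, _), (w, h), angle = cv2.minAreaRect(c)      # smallest enclosing rectangle
    elongation = max(w, h) / max(min(w, h), 1e-6)   # (iii) elongation
    rectangularity = area / max(w * h, 1e-6)        # (iv) fill of the rectangle
    return area, perimeter, elongation, rectangularity, angle  # (v) orientation
```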

Recognition of human activity is accomplished in two stages:

(1) Collecting the training set and performing convolutional neural network feature extraction. Figure 2 shows the feature extraction approach.

(i) Training Dataset. Changes to the dataset are recorded in preparation for the training phase. The data comprises five distinct gestures, with a total of one hundred and fifty different possible outcomes. As a result, the system is better able to deal with a wide range of gestures consistently, which aids comprehension of the operation under diverse conditions.

(ii) Feature Extraction. All training images are processed with the Hu moment set approach, and the results are saved in a file for each training image. Descriptor values and classification levels for each image are stored in a matrix in the file.

(iii) Normalization. The measured and processed features are represented as a matrix in which each row represents an image and each column a distinct feature (attribute). The data in each column should therefore be uniform in scale and unrelated to the other columns. The largest value of a particular feature is selected and used to divide that entire column, so the resulting values lie between 0 and 1. This greatly enhances the classification process and reduces the likelihood of bias, since each attribute is given equal weight.

(iv) Convolutional Neural Network. A convolutional neural network (CNN) is a type of neural network designed to analyze multidimensional data, such as images and time series. Feature extraction and weight computation are included in the training process. Because they use the convolution operator, these networks are referred to as convolutional networks. The fundamental advantage of CNNs is the ability to extract features automatically [2]. Figure 3 shows how the input data is first sent to a feature extraction network, after which the derived features are sent to a classifier network. Convolutional and pooling layer pairs make up the feature extraction network. The convolutional layer convolves the input data with a bank of digital filters, and the pooling layer serves as a dimensionality reduction layer. Several parameters must be adjusted during backpropagation in order to reduce the number of connections in the neural network architecture. A sketch of the normalization step and the feature-extraction network follows this list.
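The sketch below illustrates the column-wise normalization of item (iii) and a small convolution/pooling feature-extraction stack as in item (iv); the filter counts, kernel sizes, and input window shape are illustrative assumptions, since the paper does not specify them.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

def normalize_columns(features):
    """Divide each column (attribute) of the feature matrix by its largest
    absolute value, so nonnegative features end up between 0 and 1."""
    features = np.asarray(features, dtype=np.float32)
    return features / np.abs(features).max(axis=0, keepdims=True)

# Convolution + pooling pairs form the feature-extraction network.
feature_extractor = tf.keras.Sequential([
    layers.Conv1D(64, kernel_size=3, activation="relu",
                  input_shape=(200, 3)),   # 200-sample window, 3 axes (x, y, z)
    layers.MaxPooling1D(pool_size=2),      # pooling reduces dimensionality
    layers.Conv1D(64, kernel_size=3, activation="relu"),
    layers.MaxPooling1D(pool_size=2),
])
```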

3.2.2. Classification

Classification involves the following basic steps:

(i) Implementation of the deep learning model using a recurrent neural network

Long short-term memory (LSTM) networks are an improvement over recurrent neural networks (RNNs). When the gradient vanishes or explodes, LSTM memory blocks used in place of plain RNN units can fix the problem. Unlike an RNN, an LSTM adds a cell state to store long-term states, which is its main difference. An LSTM network can thus keep track of past data and link it with data collected at the current time. The LSTM has three gates: an input gate, a "forget" gate, and an output gate. The input gate relates to the current input, while the "forget" gate refers to the previous input. The internal construction of the LSTM is shown in Figure 4. The LSTM network layer in this configuration has a total of 512 network units. Next, we employ two blocks of densely connected layers, each of which has 1024 network units and is activated by rectified linear units (ReLUs). Following each ReLU activation, we employ dropout regularisation with a value of 0.8. During the training phase, dropout regularisation prevents complicated coadaptations of the fully connected layers by ignoring randomly picked neurons, which prevents overfitting. Last but not least, a third fully connected layer with a softmax activation function obtains predictions for the next action, because we want to choose the activities that are most likely to occur in a given sequence.
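The layer sizes above translate into the following Keras sketch. The input shape (a 200-sample window with three accelerometer axes), the optimizer, and the loss are assumptions; only the unit counts, the dropout rate, and the activations are stated in the text.

```python
import tensorflow as tf
from tensorflow.keras import layers

NUM_CLASSES = 6  # downstairs, jogging, sitting, standing, upstairs, walking

model = tf.keras.Sequential([
    # LSTM layer with 512 units; the cell state carries long-term context.
    layers.LSTM(512, input_shape=(200, 3)),
    # Two densely connected blocks of 1024 ReLU units, each followed
    # by dropout of 0.8 that ignores randomly picked neurons.
    layers.Dense(1024, activation="relu"),
    layers.Dropout(0.8),
    layers.Dense(1024, activation="relu"),
    layers.Dropout(0.8),
    # Final fully connected layer with softmax to pick the most likely activity.
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
```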

Figure 4 displays the architecture of the recurrent neural network, in which x represents the inputs and y represents the output class; the inputs are the extracted features.

(ii) Recognition

The following steps are involved in this procedure:

(a) Hu moments are used first to determine the properties of the test image
(b) These characteristics are then compared with those selected from the training images

Figure 5 shows the process of recognition of images through a machine learning approach. In order to feed the network with such temporal dependencies, a sliding time window is used to extract separate data segments. The window width and the step size can both be adjusted and optimised for better accuracy. Each time step is associated with an activity label, so for each segment, the most frequently appearing label is chosen. Here, the time segment or window width is chosen to be 200, and the time step is chosen to be 100.
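A minimal numpy sketch of this segmentation, reusing the column names assumed in Section 3.1, could look as follows.

```python
import numpy as np

def segment(df, width=200, step=100):
    """Slide a window of `width` samples over the signal in steps of
    `step`, labelling each segment with its most frequent activity."""
    data = df[["x", "y", "z"]].to_numpy()
    acts = df["activity"].to_numpy()
    segments, labels = [], []
    for start in range(0, len(df) - width, step):
        segments.append(data[start:start + width])
        vals, counts = np.unique(acts[start:start + width], return_counts=True)
        labels.append(vals[counts.argmax()])  # most frequently appearing label
    return np.stack(segments), np.array(labels)
```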

4. Results

The hybrid deep learning model has been applied to the wireless sensor data mining dataset to perform human activity recognition. The classifier achieves an accuracy of about 95%, though it might presumably be slightly improved by decreasing the step size of the sliding window. The following graphs show the train/test error and accuracy for each epoch and the final confusion matrix (normalised so that each row sums to one). The proposed work is implemented in the Python programming language. The performance metrics used for the evaluation of the model are the confusion matrix, accuracy, and loss. The number of epochs used for training the model is 5000. The confusion matrix is formed over the six activity classes: downstairs, jogging, sitting, standing, upstairs, and walking.
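For reference, the accuracy and the row-normalised confusion matrix described here can be computed with scikit-learn as sketched below; the label arrays are placeholders standing in for the test labels and model predictions.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Placeholder labels; in practice y_pred = model.predict(X_test).argmax(axis=1).
y_true = np.array([0, 1, 2, 3, 4, 5, 0, 1])
y_pred = np.array([0, 1, 2, 3, 4, 5, 0, 2])

print(accuracy_score(y_true, y_pred))
# Row-normalised confusion matrix, so that each row sums to one.
print(confusion_matrix(y_true, y_pred, normalize="true"))
```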

Table 1 shows the confusion matrix generated by the hybrid deep learning model. Other parameters used for the evaluation of the model are the accuracy and the error.

Figure 6 represents the confusion matrix graphically over the six activity classes.

Figure 7 shows the training and testing accuracy and the training and testing loss with respect to the training epoch. The proposed work has achieved accuracy greater than 96%.

5. Conclusion

Because of its positive effects on our health and well-being, the ability to recognize our activities is increasingly in demand. It is now an essential instrument in the fight against obesity and in the care of the elderly. Human activity recognition is based on using sensors to understand the body's gestures and motions and derive the corresponding human activity or action. Activity recognition systems can automate or simplify many ordinary human operations. Depending on the circumstances, human activity recognition systems may or may not be supervised. In this work, human activity has been recognized using a hybrid deep learning model, i.e., a combination of a convolutional neural network and a recurrent neural network. The parameters used for the evaluation of the model are the confusion matrix, accuracy, and errors. The results show that the proposed model outperforms the existing models. In the future, the work can be improved by using a capsule network, in which multiple convolutional neural networks perform the activity recognition with different numbers of dropout layers, max pooling layers, and activation functions.

Data Availability

The dataset was collected from the website https://www.cis.fordham.edu/wisdm/dataset.php and is mainly available for research purposes.

Conflicts of Interest

No potential conflict of interest was reported by the authors.

Acknowledgments

The Deanship of Scientific Research (DSR) at King Abdulaziz University, Jeddah, Saudi Arabia, funded this project under grant no. RG-26-166-42. The authors therefore acknowledge with thanks DSR's technical and financial support.