Research Notes - Wireless Perception Part 1 (Survey of Human Behavior Recognition Based on WiFi CSI)

Human behavior recognition survey based on WiFi CSI

Research Status of Human Pose Recognition:

      1. Many existing systems require individuals to wear devices with motion sensors, such as gyroscopes and accelerometers. This severely limits their applicability, since people cannot be expected to wear such devices all the time.

      2. Camera-based systems can perform passive activity recognition, but they raise privacy concerns and have other significant limitations.

Passive monitoring systems based on wireless signals avoid both of these problems.

Knowledge points related to wireless perception:

       Received Signal Strength (RSS): When a person stands between a WiFi device and an access point (AP), the signal is attenuated, so a different RSS is observed. RSS is simple to use and easy to measure. However, it does not capture the true signal changes caused by a person's movement, because RSS is not a stable metric even when nothing in the environment is changing.

WiFi system hardware modification: To use data other than RSS, the WiFi system must be modified. The Universal Software Radio Peripheral (USRP) is a software-defined radio platform that serves as modified WiFi hardware. Two such techniques are:
• Frequency-modulated continuous wave (FMCW) radar, which estimates distance from the signal's time of flight.
• Measuring the Doppler shift that human motion imposes on orthogonal frequency-division multiplexing (OFDM) signals.
The Doppler shift depends on how fast the length of the reflected path changes. The target's motion, and together with ranging its position, can therefore be estimated.
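To make the Doppler relation concrete, here is a small worked computation. It assumes the standard relation f_D = -(1/λ)·dL/dt, where L is the length of the reflected path and λ = c/f_c. For a target moving straight toward a co-located transmitter and receiver, dL/dt = -2v. The 5 GHz carrier and the 0.5 m/s walking speed are illustrative assumptions, not values from the source.

```python
SPEED_OF_LIGHT = 299_792_458.0  # m/s


def doppler_shift_hz(path_rate_mps: float, carrier_hz: float) -> float:
    """Doppler shift of a reflected signal.

    path_rate_mps is dL/dt, the rate of change of the reflected path length
    (negative when the path gets shorter, i.e. the reflector approaches).
    """
    wavelength = SPEED_OF_LIGHT / carrier_hz
    return -path_rate_mps / wavelength


# Illustrative assumption: 5 GHz carrier, person walking toward a co-located
# Tx/Rx at 0.5 m/s, so the round-trip path shrinks at 2 * 0.5 = 1.0 m/s.
shift = doppler_shift_hz(-1.0, 5e9)
print(f"{shift:.2f} Hz")  # 16.68 Hz
```

Shifts of only tens of hertz, on a gigahertz carrier, are why these systems analyze very low frequencies over windows of many samples.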

Doppler extraction: To extract Doppler information, WiSee computes a frequency-time Doppler profile. It applies a Fast Fourier Transform (FFT) to the samples in a half-second window, shifts the window by 5 milliseconds, and repeats. This technique is known as the short-time Fourier transform (STFT), and other systems use it as well.
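A minimal NumPy sketch of this sliding-window FFT follows. It assumes complex baseband samples, so that positive and negative Doppler shifts can be told apart. It also adds a Hann taper to each window, which the description above does not mention. The 0.5 s window and 5 ms hop follow the WiSee description; the function name `doppler_profile` is hypothetical.

```python
import numpy as np


def doppler_profile(samples: np.ndarray, fs: float,
                    window_s: float = 0.5, hop_s: float = 0.005) -> np.ndarray:
    """Short-time Fourier transform: FFT magnitudes over a sliding window.

    Returns shape (n_windows, n_bins); bins are fftshift-ed so frequencies
    run from -fs/2 to +fs/2 and both Doppler signs are visible.
    """
    win = int(round(window_s * fs))
    hop = int(round(hop_s * fs))
    taper = np.hanning(win)  # reduces spectral leakage (an added assumption)
    n_windows = 1 + (len(samples) - win) // hop
    frames = np.stack([samples[i * hop: i * hop + win] * taper
                       for i in range(n_windows)])
    spectrum = np.fft.fftshift(np.fft.fft(frames, axis=1), axes=1)
    return np.abs(spectrum)


fs = 1000.0
t = np.arange(2000) / fs
x = np.exp(2j * np.pi * 100 * t)  # synthetic +100 Hz Doppler tone
prof = doppler_profile(x, fs)
freqs = np.fft.fftshift(np.fft.fftfreq(int(0.5 * fs), d=1 / fs))
print(prof.shape, freqs[prof[0].argmax()])  # (301, 500) 100.0
```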

Segmentation: The next step is to segment the STFT data to distinguish different patterns. For example, a gesture may consist of one segment with both positive and negative Doppler shifts, or of two or more such segments. Segments are detected by measuring energy over short durations. A segment starts when the energy rises more than 3 dB above the noise level, and it ends when the energy falls back below that threshold.
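The 3 dB rule above can be sketched as a small state machine over per-slot energies. The sketch assumes that the Doppler energy of each short time slot and the noise level are already available in linear power units; how WiSee estimates its noise floor is not covered here. `find_segments` is a hypothetical name.

```python
import numpy as np


def find_segments(energy: np.ndarray, noise_level: float,
                  threshold_db: float = 3.0) -> list[tuple[int, int]]:
    """Return (start, end) slot indices where energy is threshold_db above noise.

    `end` is exclusive. Energy values are linear power, one per time slot.
    """
    above = 10 * np.log10(energy / noise_level) > threshold_db
    segments, start = [], None
    for i, active in enumerate(above):
        if active and start is None:
            start = i                    # energy rose above threshold: segment begins
        elif not active and start is not None:
            segments.append((start, i))  # energy fell below threshold: segment ends
            start = None
    if start is not None:                # segment still open at the end of the data
        segments.append((start, len(above)))
    return segments


energy = np.array([1.0, 1.1, 3.0, 4.0, 2.5, 1.2, 1.0, 5.0, 5.0, 1.0])
print(find_segments(energy, noise_level=1.0))  # [(2, 5), (7, 9)]
```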

Classification: The idea behind classification is simple. Each segment falls into one of three categories: only positive Doppler shift, only negative Doppler shift, or both positive and negative shifts. Each category is assigned a number, so each gesture is represented by a sequence of numbers. Classification then compares the observed sequence with the sequences obtained during training.
WiSee also claims that it can detect multiple moving targets and identify their activities. The idea is that each moving target's reflection can be treated as the signal of a separate wireless transmitter. Applying the principle of a multiple-input multiple-output (MIMO) receiver, the reflections produced by different people moving in the area can then be separated. The problem is to find the weight matrix that maximizes the Doppler energy of each segment when it is applied to the Doppler energy of each antenna's segments. An iterative algorithm is used for this.
In contrast to systems such as WiSee that require specialized USRP software-defined radios, several efforts have used commercial WiFi APs without modifying the WiFi system. To represent dynamic changes in the environment caused by human motion, these systems have recently employed other metrics, such as channel state information (CSI), which is described in more detail below.
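The WiSee classification step described above (three segment types, then sequence comparison) can be sketched as follows. The numeric codes, the `min_fraction` rule for deciding whether a Doppler sign is present, and exact-match lookup are all simplifying assumptions for illustration. The MIMO weight-matrix iteration is not sketched.

```python
from typing import Optional

import numpy as np

# Hypothetical numeric codes for the three segment types.
POSITIVE_ONLY, NEGATIVE_ONLY, BOTH = 1, 2, 3


def encode_segment(doppler_energy: np.ndarray, freqs: np.ndarray,
                   min_fraction: float = 0.2) -> int:
    """Map one segment's Doppler energy per frequency bin to a code.

    A Doppler sign counts as present when it holds at least `min_fraction`
    of the segment's non-zero-Doppler energy (an assumed rule).
    """
    pos = doppler_energy[freqs > 0].sum()
    neg = doppler_energy[freqs < 0].sum()
    total = pos + neg
    if pos >= min_fraction * total and neg >= min_fraction * total:
        return BOTH
    return POSITIVE_ONLY if pos > neg else NEGATIVE_ONLY


def classify_gesture(codes: list[int],
                     templates: dict[str, list[int]]) -> Optional[str]:
    """Return the gesture whose training code sequence matches exactly."""
    for name, template in templates.items():
        if template == codes:
            return name
    return None
```

For example, a gesture trained as `[POSITIVE_ONLY, NEGATIVE_ONLY]` is recognized when consecutive segments encode to those two codes. A practical system would likely need tolerance for missed or extra segments.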

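Limitations of the WiFi system:

        1. Carrier frequency offset (CFO): the transmitter and receiver oscillators are not synchronized, which adds a time-varying phase offset to every CSI measurement. As a result, the phase shift caused by body movement cannot be observed directly in the CSI.

        2. Sampling frequency offset (SFO): the analog-to-digital converter (ADC) clocks of the transmitter and receiver differ. The resulting phase error changes with the subcarrier index, so each subcarrier suffers a different error.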

The movement of people and objects changes the multipath characteristics of the wireless channel, so the estimated channel has different amplitudes and phases. Figure 2a shows the CSI amplitude of one subcarrier on all antennas while a person walks and sits between the WiFi transmitter and receiver. The person is stationary for the first 400 packets and then begins to walk or sit. While the person is not moving, the CSI amplitude is relatively stable on all antennas. Once the activity starts, the CSI changes drastically. In this experiment, the CSI varies for longer during walking than during sitting, because the person remains still after sitting down. As mentioned earlier, CFO and SFO heavily distort the received phase, as can be seen in Fig. 2b. Phase sanitization techniques can remove the effect of these phase errors, and Fig. 2c shows the resulting aligned phase.
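One common sanitization approach removes the phase error's dependence on the subcarrier index with a straight-line fit. This is an assumption, since the text does not name the method behind Fig. 2c. The SFO contributes a slope across subcarriers, and common offsets such as the CFO contribute an intercept. The sketch below unwraps the phase across subcarriers and subtracts the least-squares line. Note that it also removes any genuine linear trend in the true phase. `sanitize_phase` is a hypothetical name.

```python
import numpy as np


def sanitize_phase(raw_phase: np.ndarray, subcarrier_idx: np.ndarray) -> np.ndarray:
    """Remove the linear-in-subcarrier phase error from one CSI snapshot.

    raw_phase: measured phase per subcarrier, wrapped to (-pi, pi].
    subcarrier_idx: the subcarrier index k of each entry.
    Fits a * k + b to the unwrapped phase (slope ~ SFO-like term,
    offset ~ CFO-like common term) and subtracts it.
    """
    unwrapped = np.unwrap(raw_phase)
    slope, offset = np.polyfit(subcarrier_idx, unwrapped, 1)
    return unwrapped - (slope * subcarrier_idx + offset)
```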

Behavior recognition based on Wi-Fi CSI:

       In this section, we summarize techniques that use commercial WiFi NICs. Fig. 3 shows the general structure of an activity recognition system based on WiFi CSI.

Histogram-based techniques:

       One such technique is E-Eyes, which stores CSI histograms as fingerprints in a database. During testing, an activity is identified by comparing the measured CSI histogram with the database and finding the closest match. The preprocessing steps are low-pass filtering and Modulation and Coding Scheme (MCS) index filtering. Low-pass filtering removes high-frequency noise that is unlikely to be caused by human motion, and MCS index filtering reduces unstable wireless channel variations. The histogram technique performs well at low computational cost. However, it may not perform well in different environments, because histograms are sensitive to environmental changes.
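A minimal sketch of histogram fingerprinting follows: normalize a histogram of CSI amplitudes, then return the stored activity with the nearest histogram. The L1 distance, the bin edges, and the synthetic Gaussian "activities" are illustrative assumptions and not necessarily what E-Eyes uses. The function names are hypothetical.

```python
import numpy as np


def amplitude_histogram(csi_amp: np.ndarray, bins: np.ndarray) -> np.ndarray:
    """Normalized histogram of CSI amplitudes (all packets and subcarriers)."""
    counts, _ = np.histogram(csi_amp.ravel(), bins=bins)
    return counts / counts.sum()


def match_activity(query: np.ndarray, database: dict[str, np.ndarray],
                   bins: np.ndarray) -> str:
    """Return the activity whose stored fingerprint is closest to the query's."""
    q = amplitude_histogram(query, bins)
    # L1 distance between histograms: an illustrative choice of metric.
    distances = {name: np.abs(q - fp).sum() for name, fp in database.items()}
    return min(distances, key=distances.get)


# Synthetic example: fingerprints built from (packets x subcarriers) amplitudes.
rng = np.random.default_rng(0)
bins = np.linspace(0.0, 40.0, 41)
database = {
    "sitting": amplitude_histogram(rng.normal(10, 1, (500, 30)), bins),
    "walking": amplitude_histogram(rng.normal(25, 5, (500, 30)), bins),
}
print(match_activity(rng.normal(24, 5, (200, 30)), database, bins))  # walking
```

Because the fingerprint is just an amplitude distribution, anything that shifts the amplitudes, such as moving furniture, changes the match. This illustrates the environmental sensitivity mentioned above.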

        Other techniques have recently been proposed, such as WiHear [3], CARM [9], and [14]. WiHear uses a directional antenna to capture CSI changes caused by mouth motion. WiHear works well, but it can only monitor spoken words. In [14], the authors use advanced feature extraction and machine learning techniques to recognize what is typed on a keyboard. This idea is similar to CARM [9], which is described in more detail below.

CSI denoising:

        CSI is noisy and may not show distinct signatures for different activities. The noise must therefore be filtered out first, and features are then extracted for classification with machine learning techniques. Noise can be filtered in different ways, for example with a Butterworth low-pass filter [9]. However, high-bandwidth CSI contains burst and impulse noise, so a low-pass filter cannot produce a smooth CSI stream [9].
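A sketch of Butterworth low-pass filtering of CSI amplitude streams follows, using SciPy's `butter` and `filtfilt` (zero-phase, applied forward and backward). The order 4 and the 50 Hz cutoff are assumptions, not values from [9]. The filter suppresses steady high-frequency noise but only smears out short impulses rather than removing them, which is the limitation noted above. `lowpass_csi` is a hypothetical name.

```python
import numpy as np
from scipy.signal import butter, filtfilt


def lowpass_csi(csi_amp: np.ndarray, fs: float, cutoff_hz: float,
                order: int = 4) -> np.ndarray:
    """Zero-phase Butterworth low-pass filter along the time axis (axis 0)."""
    b, a = butter(order, cutoff_hz, btype="low", fs=fs)
    return filtfilt(b, a, csi_amp, axis=0)


fs = 1000.0                                        # 1 kHz packet rate
t = np.arange(2000) / fs
clean = np.sin(2 * np.pi * 5 * t)                  # slow, motion-like component
noisy = clean + 0.5 * np.sin(2 * np.pi * 200 * t)  # high-frequency interference
smoothed = lowpass_csi(noisy, fs, cutoff_hz=50.0)
```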
Research shows that better techniques exist for this purpose, such as principal component analysis (PCA) denoising [9]. PCA is a dimensionality reduction technique for high-dimensional data. It relies on the idea that most of a signal's information is concentrated in a few components. The CARM algorithm first discards the first principal component to reduce noise, then uses the next five principal components for feature extraction. Removing the first principal component does not lose the information produced by dynamic reflections from the moving object, since that information is also captured in the other principal components. After PCA denoising, features are extracted from the CSI data for classification, as discussed below.
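The "drop the first component, keep the next five" step can be sketched with a plain SVD-based PCA. This assumes the CSI amplitudes are arranged as packets × streams, for example 90 streams for 3 antennas × 30 subcarriers. Any further preprocessing CARM applies, such as splitting the stream into chunks, is not reproduced. `pca_denoise` is a hypothetical name.

```python
import numpy as np


def pca_denoise(csi_amp: np.ndarray, n_keep: int = 5) -> np.ndarray:
    """Project CSI streams onto principal components 2 .. n_keep + 1.

    csi_amp: shape (n_packets, n_streams). Returns (n_packets, n_keep).
    """
    centered = csi_amp - csi_amp.mean(axis=0)
    # Rows of vt are principal directions, sorted by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[1:1 + n_keep].T  # skip vt[0], the first component


rng = np.random.default_rng(0)
components = pca_denoise(rng.normal(size=(1000, 90)))
print(components.shape)  # (1000, 5)
```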

Feature extraction:

       One way to extract features from a signal is to transform it into another domain, such as the frequency domain. The Fast Fourier Transform (FFT), an efficient implementation of the Discrete Fourier Transform, can be used for this purpose. A window of a fixed number of CSI samples is chosen, and the FFT is applied to each segment as the window slides along the signal. This technique, the short-time Fourier transform (STFT), shows how the frequency content of a signal changes over time. It has been applied to radar signals to detect torso and leg motion [8]. Fig. 4 shows the STFT (spectrogram) of CSI for different activities, collected at a 1 kHz sampling rate. As Fig. 4 shows, activities involving more intense motion, such as walking and running, have high energy at higher frequencies in the spectrogram. In [3,9,14], features are extracted from CSI as a function of time using the discrete wavelet transform (DWT). DWT provides high time resolution for high-frequency (fast) components and high frequency resolution for low-frequency (slow) components. Each level of the wavelet transform represents a frequency range: lower levels contain higher frequencies, and higher levels contain lower frequencies. The advantages of DWT over STFT stated in [9] are:
• DWT provides a good balance between time and frequency resolution.
• DWT reduces the size of the data, which makes it well suited to machine learning algorithms.
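To illustrate the level structure of the DWT, here is a multi-level Haar transform in plain NumPy. The Haar wavelet is an assumption, and CARM may use a different wavelet. At a 1 kHz sampling rate, detail level 1 covers roughly 250 to 500 Hz, level 2 roughly 125 to 250 Hz, and so on, halving at each level. The energy of each detail level then serves as a rough measure of motion in that speed range. The function names are hypothetical.

```python
import numpy as np


def haar_dwt(signal: np.ndarray, levels: int) -> list[np.ndarray]:
    """Multi-level orthonormal Haar DWT.

    Returns [d1, ..., d_levels, a_levels]: detail coefficients from the finest
    (highest-frequency) level to the coarsest, then the final approximation.
    """
    approx = np.asarray(signal, dtype=float)
    if len(approx) % (2 ** levels) != 0:
        raise ValueError("signal length must be divisible by 2**levels")
    details = []
    for _ in range(levels):
        even, odd = approx[0::2], approx[1::2]
        details.append((even - odd) / np.sqrt(2))  # upper half of the current band
        approx = (even + odd) / np.sqrt(2)         # lower half, decomposed further
    return details + [approx]


def level_energies(signal: np.ndarray, levels: int) -> np.ndarray:
    """Energy of each detail level, finest level first."""
    return np.array([np.sum(d ** 2) for d in haar_dwt(signal, levels)[:-1]])


fast = np.tile([1.0, -1.0], 512)  # fastest possible oscillation
print(level_energies(fast, 5))    # all energy lands in level 1
```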
In CARM, each of the five principal components that remain after the first is removed is decomposed with a 12-level DWT. The DWT results of the five components are then averaged. Every 200 milliseconds, CARM extracts a 27-dimensional feature vector consisting of three sets of features:
• The energy of each wavelet level, representing the intensity of motion at different speeds.
• The difference in each level's energy between consecutive 200 ms intervals.
• Torso and leg velocities estimated using Doppler radar techniques [8].
These features are the input to the classification algorithm described below.
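The first two feature sets above can be assembled from a table of per-interval wavelet-level energies, as in the sketch below. It assumes those energies are already computed, one row per 200 ms interval. With a 12-level DWT this part yields 24 values. The velocity features of [8], which complete CARM's vector, are not sketched. `carm_like_features` is a hypothetical name.

```python
import numpy as np


def carm_like_features(level_energy: np.ndarray) -> np.ndarray:
    """Concatenate per-level energies with their change since the last interval.

    level_energy: shape (n_intervals, n_levels), one row per 200 ms interval.
    Returns shape (n_intervals, 2 * n_levels); the first interval's
    differences are zero because there is no earlier interval.
    """
    diffs = np.diff(level_energy, axis=0, prepend=level_energy[:1])
    return np.hstack([level_energy, diffs])


energies = np.array([[1.0, 2.0], [3.0, 5.0], [4.0, 4.0]])
print(carm_like_features(energies))
```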


Machine Learning for Classification:

       Different machine learning techniques can perform multi-class classification on the extracted features. Popular classifiers include logistic regression (logit) models, support vector machines (SVMs), hidden Markov models (HMMs), and deep learning. Since activity data is sequential, CARM uses HMMs and shows that they give satisfactory results.
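HMM-based classification typically trains one model per activity and assigns a new sequence to the model with the highest likelihood. The sketch below shows that decision rule with a scaled forward algorithm. To keep it short, it assumes discrete observations (for example, quantized feature vectors); real systems often use Gaussian emissions. Training (e.g. Baum-Welch) is not shown, and the two toy models are made up for illustration.

```python
import numpy as np


def log_likelihood(obs: list[int], start: np.ndarray, trans: np.ndarray,
                   emit: np.ndarray) -> float:
    """Scaled forward algorithm: log P(obs | HMM) for discrete observations.

    start: (n_states,), trans: (n_states, n_states) with rows summing to 1,
    emit: (n_states, n_symbols); obs is a sequence of symbol indices.
    """
    alpha = start * emit[:, obs[0]]
    total = 0.0
    for t in range(1, len(obs)):
        scale = alpha.sum()              # rescale to avoid numerical underflow
        total += np.log(scale)
        alpha = (alpha / scale) @ trans * emit[:, obs[t]]
    return total + np.log(alpha.sum())


def classify_activity(obs: list[int], models: dict[str, tuple]) -> str:
    """Pick the activity whose HMM gives the sequence the highest likelihood."""
    return max(models, key=lambda name: log_likelihood(obs, *models[name]))


start = np.array([0.5, 0.5])
trans = np.array([[0.9, 0.1], [0.1, 0.9]])
models = {
    "walk": (start, trans, np.array([[0.9, 0.1], [0.8, 0.2]])),
    "sit": (start, trans, np.array([[0.1, 0.9], [0.2, 0.8]])),
}
print(classify_activity([0, 0, 0, 0], models))  # walk
```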
Activity Recognition Using Deep Learning: Activity recognition is somewhat similar to speech recognition, where HMMs have traditionally been used for classification. However, deep recurrent neural networks (RNNs) have been considered an alternative to HMMs. Training an RNN is difficult because of the vanishing and exploding gradient problems. However, [15] showed that the long short-term memory (LSTM) extension of RNNs achieves the best speech recognition accuracy reported to date. We therefore propose using LSTMs for activity recognition instead of traditional machine learning techniques such as HMMs, without a separate feature extraction step like CARM's. LSTMs offer two advantages. First, they extract features automatically, so no preprocessing of the data is required. Second, they retain temporal state information about the activity, which helps distinguish activities such as "lie down" and "fall". Since "lie down" contains both "sit"-like and "fall"-like motion, the LSTM's memory helps recognize these activities.

       In this section, we implement different methods along with our proposed method and show the performance of each method.
Measurement Setup: We conduct experiments in an indoor office, with the Tx and Rx 3 meters apart in line of sight. The Rx uses a commercial Intel 5300 NIC with a sampling rate of 1 kHz. In each 20-second recording, a person performs one activity in the line-of-sight area and remains still at the beginning and end. We also record video of each trial so that the data can be labeled. Our dataset covers 6 people and 6 activities ("lie down", "fall", "walk", "run", "sit down", and "stand up"), with 20 trials of each.
Evaluating Machine Learning Techniques: We applied PCA to the CSI amplitudes and then used the STFT to extract frequency-domain features every 100 ms. We use only the first 25 of the 128 FFT frequency bins, because most of the activity energy lies at low frequencies. This also keeps the feature vectors from being sparse.
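At 1 kHz, a 128-point FFT has a bin spacing of 1000/128 ≈ 7.8 Hz, so keeping bins 0 to 24 retains roughly 0 to 187.5 Hz. The sketch below extracts those low bins from one PCA-projected stream. It assumes that the window length equals the 128-point FFT size and that "every 100 ms" means a 100-sample hop at 1 kHz. `low_band_stft` is a hypothetical name.

```python
import numpy as np


def low_band_stft(stream: np.ndarray, fs: float = 1000.0, n_fft: int = 128,
                  hop: int = 100, n_keep: int = 25) -> tuple[np.ndarray, np.ndarray]:
    """STFT magnitudes of one real-valued CSI stream, keeping low bins only.

    A new n_fft-sample window starts every `hop` samples. Returns
    (features of shape (n_frames, n_keep), bin frequencies in Hz).
    """
    n_frames = 1 + (len(stream) - n_fft) // hop
    frames = np.stack([stream[i * hop: i * hop + n_fft] for i in range(n_frames)])
    spectrum = np.abs(np.fft.rfft(frames, axis=1))[:, :n_keep]
    freqs = np.arange(n_keep) * fs / n_fft
    return spectrum, freqs


feats, bin_freqs = low_band_stft(np.random.default_rng(0).normal(size=2000))
print(feats.shape, bin_freqs[-1])  # (19, 25) 187.5
```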

       First, we use a random forest with 100 trees for activity classification. To give each feature vector enough information about the activity, the truncated STFT outputs are stacked into one feature vector every 2 seconds, which gives each feature vector a length of 1000. We also implemented other techniques, such as support vector machines, logit models, and decision trees, but the random forest outperformed them.
        Table 1a shows the confusion matrix of the random forest. Good performance is obtained for some activities, but not for "lie down", "sit down", and "stand up".
        Next, we train a hidden Markov model on the features extracted with the STFT method, using a MATLAB toolbox. CARM also uses HMMs, but it extracts features with DWT and the technique in [8]. Table 1b shows the results. Accuracy increases compared with the random forest, although training requires more computation time. The HMM performs well, especially for "walk" and "run", but it sometimes misclassifies "stand up", "sit down", and "lie down".
        We used TensorFlow in Python to evaluate the LSTM. The input feature vector is the raw CSI amplitude data, a 90-dimensional vector (3 antennas × 30 subcarriers).
Unlike the traditional methods, the LSTM approach uses neither PCA nor STFT; it extracts features directly from the CSI. We use a single hidden layer with 200 hidden units. To minimize the cross-entropy loss, we use stochastic gradient descent (SGD) with a batch size of 200 and a learning rate of 10^-4.
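To show how an LSTM carries state across CSI packets, here is a NumPy forward pass for one layer with the sizes given above: 90 inputs, 200 hidden units, and 6 activity classes. This is a sketch, not the authors' TensorFlow code. The weights are random and untrained, the gate ordering in the stacked matrices is an assumption, and SGD training on cross-entropy is not shown. `lstm_classify` and `init_params` are hypothetical names.

```python
import numpy as np


def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-x))


def lstm_classify(seq: np.ndarray, params: dict[str, np.ndarray]) -> np.ndarray:
    """Run one LSTM layer over seq (T, n_in); return class probabilities.

    W (4H, n_in), U (4H, H), b (4H,) stack the input, forget, cell and
    output gates; W_out (n_classes, H) and b_out map the last h to logits.
    """
    hidden = params["U"].shape[1]
    h, c = np.zeros(hidden), np.zeros(hidden)
    for x_t in seq:                                   # one CSI packet per step
        z = params["W"] @ x_t + params["U"] @ h + params["b"]
        i, f, g, o = np.split(z, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)  # keep part of old memory, add new
        h = sigmoid(o) * np.tanh(c)                   # expose gated memory as output
    logits = params["W_out"] @ h + params["b_out"]
    exp = np.exp(logits - logits.max())               # numerically stable softmax
    return exp / exp.sum()


def init_params(n_in: int = 90, hidden: int = 200, n_classes: int = 6,
                seed: int = 0) -> dict[str, np.ndarray]:
    """Small random weights, for illustration only (untrained)."""
    rng = np.random.default_rng(seed)
    return {"W": rng.normal(0, 0.1, (4 * hidden, n_in)),
            "U": rng.normal(0, 0.1, (4 * hidden, hidden)),
            "b": np.zeros(4 * hidden),
            "W_out": rng.normal(0, 0.1, (n_classes, hidden)),
            "b_out": np.zeros(n_classes)}


csi_window = np.random.default_rng(1).normal(size=(50, 90))  # 50 synthetic packets
probs = lstm_classify(csi_window, init_params())
```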
Table 1c shows our results: all activities achieve an accuracy above 75%. One disadvantage of this LSTM approach is that it takes longer to train than HMMs. However, deep learning packages such as TensorFlow can use GPUs to speed up training, and once the LSTM is trained, testing is very fast.

      Impact of Environmental Changes on Performance: CSI characteristics differ across environments and across people.
There are different techniques to reduce the environmental impact [9]. For example, after PCA, the first component mainly contains CSI information due to stationary objects [9].
Discarding the first principal component therefore retains mainly the information generated by moving objects, and this yields relatively similar features in different environments. Other techniques, such as the STFT and DWT, represent the speeds at which the multipath lengths change, which relate to the speeds of different parts of the body. The same activity produces very different raw CSI in different environments. However, because the changes in signal reflections are similar, STFT or DWT features give similar signatures across environments and people [9].
       Impact of Wi-Fi sampling rate on performance: For CSI to show significant changes due to movement, the packet (sampling) rate must be high enough, close to 1 kHz, to capture fast activities. At a sampling rate of around 50 Hz, we observe a severe drop in classification performance. On the other hand, a higher packet rate produces more samples, which increases the computation needed for denoising and feature extraction. Beyond some point, a higher rate also stops helping, since human movement in indoor areas has limited speed. Choosing an appropriate sampling rate (about 1 kHz) therefore balances computational cost and accuracy.
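A back-of-the-envelope check of why the packet rate matters: sampling at rate f_s captures frequencies only up to f_s/2 (Nyquist). A Doppler shift f_D corresponds to a reflected-path length changing at f_D·λ m/s. The sketch below assumes a 5 GHz carrier, which the text does not state, and gives only a rough guideline. `max_path_rate` is a hypothetical name.

```python
SPEED_OF_LIGHT = 299_792_458.0  # m/s


def max_path_rate(packet_rate_hz: float, carrier_hz: float) -> float:
    """Fastest reflected-path length change (m/s) whose Doppler is not aliased.

    The highest unaliased frequency is packet_rate_hz / 2 (Nyquist), and a
    Doppler shift f_D corresponds to a path-length rate of f_D * wavelength.
    """
    wavelength = SPEED_OF_LIGHT / carrier_hz
    return (packet_rate_hz / 2) * wavelength


for rate in (50, 1000):
    print(rate, round(max_path_rate(rate, 5e9), 2))
# 50 1.5
# 1000 29.98
```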
Using CSI phase information: Because of errors such as CFO and SFO, WiFi CSI phase information is rarely used for activity recognition in the literature. However, subtracting the phases measured on adjacent antennas cancels the CFO and SFO, since antennas on the same NIC share its oscillator and sampling clock. The phase difference is related to the angle of arrival (AoA), although it is ambiguous up to an integer number of cycles. A change in the target's position can change the AoA and thus the phase difference. When the motion is faster and larger, the body scatters the signal more randomly, so the AoA and the phase difference change faster. It may therefore be helpful to combine phase difference and amplitude for feature extraction before applying a classification algorithm. Due to space limitations, this is left for further research.
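The cancellation argument can be checked numerically. If both antennas see the same per-packet CFO/SFO phase term, the phase of csi_a · conj(csi_b) depends only on the phase difference caused by the path geometry. The AoA conversion assumes half-wavelength antenna spacing, where Δφ = π·sin θ and no integer-cycle ambiguity arises. The antenna spacing in the experiments is not given, so this is an assumption, as is the synthetic data. The function names are hypothetical.

```python
import numpy as np


def antenna_phase_difference(csi_a: np.ndarray, csi_b: np.ndarray) -> np.ndarray:
    """Phase of csi_a relative to csi_b, wrapped to (-pi, pi].

    A phase error common to both antennas (shared oscillator and sampling
    clock) cancels in the product csi_a * conj(csi_b).
    """
    return np.angle(csi_a * np.conj(csi_b))


def aoa_from_phase_difference(dphi: np.ndarray) -> np.ndarray:
    """Angle of arrival in degrees, assuming half-wavelength antenna spacing.

    For spacing d the phase difference is 2*pi*d*sin(theta)/wavelength;
    with d = wavelength / 2 this reduces to pi*sin(theta).
    """
    return np.degrees(np.arcsin(np.clip(dphi / np.pi, -1.0, 1.0)))


rng = np.random.default_rng(0)
common = rng.uniform(-np.pi, np.pi, 100)      # per-packet offset shared by both antennas
true_dphi = np.pi * np.sin(np.radians(30.0))  # signal arriving from 30 degrees
csi_a = 2.0 * np.exp(1j * (common + true_dphi))
csi_b = 1.5 * np.exp(1j * common)
dphi = antenna_phase_difference(csi_a, csi_b)
print(np.round(aoa_from_phase_difference(dphi[:3]), 3))  # [30. 30. 30.]
```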

Multi-user activity recognition:

       While most activity recognition techniques have been tested with a single user, a more interesting and challenging problem is the presence of multiple people in the environment. The solution proposed in [2] uses the idea of MIMO receivers to separate the signals reflected by two different moving objects. Multiple receivers may also help distinguish the activities of multiple users, and some multi-speaker recognition techniques may be applicable to activity recognition. This remains an interesting open question.
Conclusions and Future Work:

This work has surveyed recent advances in human activity recognition using WiFi channels. The literature in this area shows great promise for accurate recognition in indoor environments. Numerical experiments show that deep learning techniques such as LSTM recurrent networks can achieve higher accuracy than methods such as HMMs. Open problems for future research include:
• how to use CSI phase and amplitude information together;
• how to make the system robust in different dynamic environments;
• how to recognize the activities of multiple users.

Original article: A Survey on Behavior Recognition Using WiFi Channel State Information | IEEE Journals & Magazine | IEEE Xplore

GitHub source code: Hirokazu-Narui/LSTM_wifi_activity_recognition

Writing this took effort; please credit the source when reprinting.

Origin blog.csdn.net/h1998040218/article/details/128679168