Table of contents
Motivation: Why did the authors want to address this problem?
- Previous Wi-Fi-based 3D human pose estimation has the following flaws:
- Only suitable for gestures at fixed positions [1]
- Only allow execution of predefined activities [2]
Contribution: What did the authors accomplish in this paper (innovative points)?
Challenge
- Unlike USRP or FMCW radar systems, channel state information (CSI) data derived from off-the-shelf Wi-Fi devices do not provide any spatial information about the human body (how to understand "spatial information"? AoA, AoD, and the like)
- How to make a human pose estimation system independent of the environment in which it operates?
- How to model the complex relationship between 2D AoA spectrum and human 3D skeleton
Solution
- Derive the 2D AoA spectrum from nonlinearly spaced antennas, and combine the spatial diversity of the transmit antennas with the frequency diversity of Wi-Fi OFDM subcarriers to improve the spatial resolution of the 2D AoA spectrum, so that signals reflected from different parts of the human body can be distinguished
- Subtract the 2D AoA spectrum of a static environment from the spectrum extracted while one or more users are performing activities
- The 2D AoA spectrum is used as input, and the 3D human pose is inferred with a CNN plus LSTM: the CNN extracts spatial features, the LSTM extracts temporal features
Precision
- GoPose achieves an accuracy of about 4.5 cm in various scenarios, including tracking activities in the dark and NLoS scenarios (is the accuracy metric MPJPE? It should be)
Planning: How do they get the job done?
Overall structure
WiFi Probing: collect CSI data and denoise it with a linear fitting method
Data Processing: first, combine space diversity and frequency diversity (described later) to improve the resolution of the 2D AoA spectrum so that signals reflected from different body parts can be distinguished; then filter out the static signals reflected from the indoor environment; finally, combine the 2D AoA spectra of multiple packets as the network input
3D Pose Construction: a CNN captures the spatial features of human body parts, while an LSTM captures the temporal features of the motion
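The paper only says a "linear fitting method" is used for denoising in the probing step; a common CSI phase-sanitization approach along those lines is to fit a line to the unwrapped phase across subcarriers and subtract it (function name and the least-squares fit are our assumptions, a sketch, not the paper's exact method):

```python
import numpy as np

def sanitize_phase(csi):
    """Remove the linear phase offset (caused by timing/frequency
    offsets) across OFDM subcarriers via least-squares line fitting.
    csi: complex array of shape (n_subcarriers,)."""
    phase = np.unwrap(np.angle(csi))
    idx = np.arange(len(phase))
    # Fit phase ~ a*idx + b and subtract the fitted line
    a, b = np.polyfit(idx, phase, 1)
    clean_phase = phase - (a * idx + b)
    # Keep the original amplitude, replace only the phase
    return np.abs(csi) * np.exp(1j * clean_phase)
```

After this step the residual phase carries no linear trend across subcarriers, while the amplitude is untouched.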
Improving 2D AoA resolution via space diversity and frequency diversity
1D AoA estimation is not elaborated here; it uses the MUSIC algorithm
2D AoA Estimation:
An L-shaped antenna array at the receiver is used to derive the azimuth $\varphi$ and elevation $\theta$ of the incoming signal; see Section 3.3 of the paper for the formulas.
Although 2D AoA can provide the approximate position of the human body in 2D space, it cannot distinguish signals reflected from different parts of the body, such as the signal from the torso (i.e., signal $k_2$) and the signal from the leg (i.e., signal $k_3$). This is because the hardware limitations of commodity WiFi lead to very low resolution of the 2D AoA spectrum. To overcome this limitation, the authors further combine the spatial diversity of the transmitter (AoD) and the frequency diversity of WiFi OFDM subcarriers (ToF) to improve the resolution of the 2D AoA spectrum. Space diversity among the three transmit antennas introduces a phase shift that depends on the angle of departure (AoD), while frequency diversity across OFDM subcarriers introduces a phase shift that depends on the time of flight (ToF). Therefore, the resolution of the 2D AoA spectrum can be significantly improved by jointly estimating 2D AoA, AoD, and ToF:
$$
\mathbf{a}'(\varphi,\theta,\tau)=\left[1,\ldots,\Omega_{\tau}^{V-1},\ \Phi_{(\varphi,\theta)},\ldots,\Omega_{\tau}^{V-1}\Phi_{(\varphi,\theta)},\ \ldots,\ \Phi_{(\varphi,\theta)}^{R-1},\ldots,\Omega_{\tau}^{V-1}\Phi_{(\varphi,\theta)}^{R-1}\right]^{T}
$$

$$
\mathbf{a}(\varphi,\theta,\omega,\tau)=\left[\mathbf{a}'(\varphi,\theta,\tau),\ \Gamma_{\omega}\,\mathbf{a}'(\varphi,\theta,\tau),\ \ldots,\ \Gamma_{\omega}^{S-1}\,\mathbf{a}'(\varphi,\theta,\tau)\right]^{T}
$$

$$
P(\varphi,\theta,\omega,\tau)_{\text{Improve}}=\frac{1}{\mathbf{a}^{H}(\varphi,\theta,\omega,\tau)\,\mathbf{E}_{N}\mathbf{E}_{N}^{H}\,\mathbf{a}(\varphi,\theta,\omega,\tau)}
$$
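The improved spectrum generalizes the standard MUSIC pseudospectrum $P = 1/(\mathbf{a}^{H}\mathbf{E}_{N}\mathbf{E}_{N}^{H}\mathbf{a})$ to a joint search over $(\varphi,\theta,\omega,\tau)$. As a minimal sketch of the underlying idea, here is 1D MUSIC for a uniform linear array (the geometry, half-wavelength spacing, and function names are our assumptions, not the paper's L-shaped setup):

```python
import numpy as np

def music_spectrum(snapshots, n_sources, angles_deg, d=0.5):
    """MUSIC pseudospectrum for a uniform linear array.
    snapshots: complex array of shape (n_antennas, n_samples)
    d: antenna spacing in wavelengths."""
    M = snapshots.shape[0]
    # Sample covariance matrix of the received signal
    R = snapshots @ snapshots.conj().T / snapshots.shape[1]
    eigvals, eigvecs = np.linalg.eigh(R)   # eigenvalues ascending
    En = eigvecs[:, : M - n_sources]       # noise subspace
    spectrum = []
    for ang in np.deg2rad(angles_deg):
        # Steering vector for a plane wave arriving at angle `ang`
        a = np.exp(-2j * np.pi * d * np.arange(M) * np.cos(ang))
        denom = a.conj() @ En @ En.conj().T @ a
        spectrum.append(1.0 / np.abs(denom))
    return np.array(spectrum)
```

Steering vectors orthogonal to the noise subspace (i.e., true arrival angles) make the denominator tiny, producing sharp spectrum peaks; the paper's version enlarges the steering vector with the AoD term $\Gamma_\omega$ and ToF term $\Omega_\tau$ to sharpen those peaks further.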
where $\varphi$ is the azimuth, $\theta$ the elevation, $\omega$ the AoD, and $\tau$ the ToF.
Static environment removal
Since the 2D AoA spectrum provides spatial information about multipath signals, this information can be exploited to remove the LoS signal and signals reflected from the static environment, enabling environment-independent 3D pose estimation. Concretely, the 2D AoA spectrum of the static environment is subtracted from the 2D AoA spectrum recorded while one or more users are performing activities.
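A minimal sketch of this subtraction, assuming the static profile is an average over spectra captured while the room is empty and that negative residuals are clipped to zero (both are our assumptions; the paper only states the subtraction):

```python
import numpy as np

def static_profile(idle_specs):
    """Average the 2D AoA spectra collected while the room is empty.
    idle_specs: array of shape (n_packets, n_azimuth, n_elevation)."""
    return np.mean(idle_specs, axis=0)

def remove_static(active_spec, static_spec):
    """Subtract the static-environment spectrum from the spectrum
    captured during activity, clipping negative values to zero."""
    return np.maximum(active_spec - static_spec, 0.0)
```

The residual spectrum then ideally contains only the energy reflected from the moving body, which is what makes the pipeline environment-independent.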
Combining multiple packets:
A 2D AoA spectrum derived from a single WiFi packet captures only a small fraction of the body motion, so a sequence of packets (100 packets) is fed to the neural network to estimate the body pose.
Neural network
The azimuth and elevation ranges are set to [0, 180] degrees with a resolution of 1 degree, yielding a spectrum of size 180×180. The system uses 4 receivers to capture the user's motion from different angles, and the spectra of the four receivers are concatenated into a tensor of size 180×180×4. In addition, multiple spectra must be combined to capture the full body motion, so 100 packets from each receiver are concatenated to form a 180×180×400 matrix as the input.
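The input assembly above can be sketched as follows; the channel ordering (100 consecutive packets per receiver, receivers concatenated) is our assumption, since only the final 180×180×400 shape is specified:

```python
import numpy as np

def build_input(spectra):
    """Assemble the network input tensor.
    spectra: array of shape (n_receivers, n_packets, n_az, n_el),
    e.g. (4, 100, 180, 180) per the paper's setup.
    Returns an (n_az, n_el, n_receivers * n_packets) tensor,
    e.g. (180, 180, 400)."""
    n_rx, n_pkt, n_az, n_el = spectra.shape
    # Move (rx, pkt) to the trailing axes, then flatten them into
    # channels: channel index = rx * n_pkt + pkt (C-order reshape)
    return spectra.transpose(2, 3, 0, 1).reshape(n_az, n_el, n_rx * n_pkt)
```

With 1000 packets per second, 100 packets correspond to a 0.1 s window of motion per network input.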
In the network, a CNN captures the spatial features of human body parts, and an LSTM captures the temporal features of the motion
Loss function:
$$
L_{P}=\frac{1}{T}\sum_{t=1}^{T}\frac{1}{N}\sum_{i=1}^{N}\left\|\bar{p}_{t}^{i}-p_{t}^{i}\right\|_{2},\qquad
L_{H}=\frac{1}{T}\sum_{t=1}^{T}\frac{1}{N}\sum_{i=1}^{N}\left\|\bar{p}_{t}^{i}-p_{t}^{i}\right\|_{H},\qquad
L=Q_{P}\cdot L_{P}+Q_{H}\cdot L_{H}
$$
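A sketch of this loss in NumPy, reading $\|\cdot\|_{H}$ as a Huber-style penalty on the per-joint error (an assumption on our part; the weights $Q_P$, $Q_H$ and the threshold `delta` are placeholders):

```python
import numpy as np

def pose_loss(pred, target, q_p=1.0, q_h=1.0, delta=1.0):
    """Combined pose loss L = Q_P * L_P + Q_H * L_H.
    pred, target: arrays of shape (T, N, 3) = (frames, joints, xyz).
    L_P is the mean per-joint Euclidean error; L_H applies a
    Huber-style penalty to the same per-joint distances."""
    dist = np.linalg.norm(pred - target, axis=-1)  # (T, N) L2 errors
    l_p = dist.mean()
    # Huber: quadratic for small errors, linear beyond `delta`
    l_h = np.where(dist <= delta,
                   0.5 * dist**2,
                   delta * (dist - 0.5 * delta)).mean()
    return q_p * l_p + q_h * l_h
```

The averaging over $T$ frames and $N$ joints matches the double sum in the formula; the $L_P$ term alone is the MPJPE-style metric questioned in the precision note above.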
Justification: What experiments are used to verify their work?
Experimental configuration
One transmitter and four receivers; 3 antennas at the transmitter and 3 antennas at each receiver (L-shaped placement)
Packet rate: 1000 Hz
Kinect 2.0 records the ground truth (can absolute pose be recorded??)
Data from 10 individuals
Experiment site
The default distance between transceivers in the living room (4 m × 4 m), dining room (3.6 m × 3.6 m), and bedroom (4 m × 3.8 m) is 2.5 meters
Evaluation metric
The joint localization error is used as the evaluation metric, defined as the Euclidean distance between the predicted joint position and the ground truth. Note that 14 keypoints/joints are evaluated (with or without alignment?)
Overall performance
① NLoS condition: shows that a deep learning model trained under LoS conditions can be applied to NLoS scenarios without retraining
② Impact of environmental changes: train the system with data collected in one environment (e.g., the living room or dining room), then evaluate its performance in a different environment (e.g., the bedroom)
③ Impact of the distance between transceivers
④ Impact of the packet rate
⑤ Different users: 7 people for training, 1 person for validation, 2 people for testing
⑥ Impact of multiple users: the experiment only collected data from 2 people, which is of limited use
Own opinion
- Requires 4 receivers, which is too many
- Does this count as absolute pose estimation? It should still be relative to the root joint
References
[1] Towards 3D Human Pose Construction Using WiFi
[2] Winect: 3D Human Pose Tracking for Free-form Activity Using Commodity WiFi