Literature reading: GoPose: 3D Human Pose Estimation Using WiFi

Motivation: Why did the authors want to address this problem?

  • Previous Wi-Fi-based 3D human pose estimation has the following limitations:
    • Only works for gestures performed at fixed positions [1]
    • Only supports a set of predefined activities [2]

Contribution: What did the authors accomplish in this paper (innovative points)?

  • challenge

    • Unlike USRP or FMCW radar, channel state information (CSI) from off-the-shelf Wi-Fi devices does not directly provide spatial information about the human body (spatial information here means quantities such as AoA, AoD, and ToF)
    • How to make a human pose estimation system independent of the environment in which it operates?
    • How to model the complex relationship between the 2D AoA spectrum and the 3D human skeleton
  • solution

    • Derive a 2D AoA spectrum from non-linearly spaced antennas, and combine the spatial diversity of the transmit antennas with the frequency diversity of Wi-Fi OFDM subcarriers to improve the spatial resolution of the 2D AoA spectrum, so that signals reflected from different parts of the human body can be distinguished
    • Subtract the 2D AoA spectrum of a static environment from the spectrum extracted while one or more users are performing activities
    • The 2D AoA spectrum is used as input, and the 3D human pose is inferred with a CNN-LSTM network: the CNN extracts spatial features and the LSTM extracts temporal features
  • precision

    • GoPose achieves an accuracy of about 4.5 cm across various scenarios, including tracking activities in the dark and NLoS scenarios (reader's note: the accuracy metric is presumably MPJPE)

Planning: How do they get the job done?

  • Overall structure

    WiFi Probing: collect CSI data and denoise it with a linear-fitting method
    Data Processing: first, combine spatial diversity and frequency diversity (described later) to improve the resolution of the 2D AoA spectrum so that signals reflected from different parts of the human body can be distinguished; then filter out signals reflected from the static indoor environment; finally, combine the 2D AoA spectra of multiple packets as the input to the network
    3D Pose Construction: a CNN captures the spatial features of human body parts, while an LSTM models the temporal features of the motion

  • Improved resolution, space diversity and frequency diversity for 2D AoA

    1D AoA estimation is not elaborated here; it uses the standard MUSIC algorithm.

    2D AoA Estimation:
    An L-shaped antenna array at the receiver is used to derive the azimuth $\varphi$ and elevation angle $\theta$ of the incoming signal; see Section 3.3 of the paper for the formula details.
    Although 2D AoA can provide the approximate position of the human body in 2D space, it cannot distinguish signals reflected from different parts of the body, such as signals from the torso (i.e., signal $k_2$) or from the leg (i.e., signal $k_3$). This is because hardware limitations of commodity WiFi lead to very low resolution in the 2D AoA spectrum. To overcome this limitation, the authors further combine the spatial diversity of the transmitter (AoD) and the frequency diversity of WiFi OFDM subcarriers (ToF) to improve the resolution of the 2D AoA spectrum.

    Spatial diversity among the three transmit antennas introduces a phase shift that depends on the angle of departure (AoD), while the frequency diversity of the OFDM subcarriers introduces a phase shift that depends on the time of flight (ToF). Therefore, the resolution of the 2D AoA spectrum can be significantly improved by jointly estimating 2D AoA, AoD, and ToF:
$$
\mathbf{a}^{\prime}(\varphi, \theta, \tau)=\left[1, \ldots, \Omega_{\tau}^{V-1},\; \Phi_{(\varphi, \theta)}, \ldots, \Omega_{\tau}^{V-1} \Phi_{(\varphi, \theta)},\; \ldots,\; \Phi_{(\varphi, \theta)}^{R-1}, \ldots, \Omega_{\tau}^{V-1} \Phi_{(\varphi, \theta)}^{R-1}\right]^{T}
$$

$$
\mathbf{a}(\varphi, \theta, \omega, \tau)=\left[\mathbf{a}^{\prime}(\varphi, \theta, \tau),\; \Gamma_{\omega} \mathbf{a}^{\prime}(\varphi, \theta, \tau),\; \ldots,\; \Gamma_{\omega}^{S-1} \mathbf{a}^{\prime}(\varphi, \theta, \tau)\right]^{T}
$$

$$
P(\varphi, \theta, \omega, \tau)_{\text{Improve}}=\frac{1}{\mathbf{a}^{H}(\varphi, \theta, \omega, \tau)\, \mathbf{E}_{N} \mathbf{E}_{N}^{H}\, \mathbf{a}(\varphi, \theta, \omega, \tau)}
$$
    where $\varphi$ is the azimuth, $\theta$ the elevation angle, $\omega$ the AoD, and $\tau$ the ToF.
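The improved estimator above is a MUSIC pseudospectrum evaluated over an enlarged steering vector. For intuition, here is a minimal 1D MUSIC AoA sketch for a uniform linear array (numpy; the array geometry and the toy signal are illustrative, not from the paper):

```python
import numpy as np

def music_aoa(snapshots, n_sources, d=0.5):
    """1D MUSIC pseudospectrum over azimuth for a uniform linear array.

    snapshots : (n_antennas, n_samples) complex received signal
    d         : antenna spacing in wavelengths
    """
    n_ant = snapshots.shape[0]
    R = snapshots @ snapshots.conj().T / snapshots.shape[1]   # sample covariance
    _, eigvecs = np.linalg.eigh(R)                 # eigenvalues in ascending order
    En = eigvecs[:, : n_ant - n_sources]           # noise subspace
    angles = np.linspace(0.0, 180.0, 181)          # 1-degree grid
    k = np.arange(n_ant)
    P = np.empty(angles.size)
    for i, theta in enumerate(angles):
        a = np.exp(-2j * np.pi * d * k * np.cos(np.radians(theta)))
        # Pseudospectrum 1 / (a^H E_N E_N^H a): peaks where a is orthogonal
        # to the noise subspace, i.e. at the true arrival angles
        P[i] = 1.0 / np.abs(a.conj() @ En @ En.conj().T @ a)
    return angles, P

# Toy scene: one path arriving at 60 degrees, 8 antennas, light noise
rng = np.random.default_rng(0)
n_ant, theta0 = 8, 60.0
a0 = np.exp(-2j * np.pi * 0.5 * np.arange(n_ant) * np.cos(np.radians(theta0)))
s = rng.standard_normal(200) + 1j * rng.standard_normal(200)
noise = 0.01 * (rng.standard_normal((n_ant, 200)) + 1j * rng.standard_normal((n_ant, 200)))
angles, P = music_aoa(np.outer(a0, s) + noise, n_sources=1)
est = angles[np.argmax(P)]   # spectrum peaks near 60 degrees
```

GoPose's joint 2D AoA/AoD/ToF estimator follows the same pattern, only with the larger steering vector $\mathbf{a}(\varphi, \theta, \omega, \tau)$ searched over a 4D grid.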

  • Static environment removal

    Since the 2D AoA spectrum provides spatial information of multipath signals, we can exploit this information to remove LoS signals and signals reflected from static environments for environment-independent 3D pose estimation. The specific method is to subtract the 2D AoA spectrum of the static environment from the 2D AoA spectrum of human activities.
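The subtraction step fits in a few lines of numpy. Clipping negative residuals to zero is my assumption (the notes only say "subtract"); the grid size matches the 180 × 180 spectra used later:

```python
import numpy as np

def remove_static(spectrum_active, spectrum_static):
    """Subtract the static-environment 2D AoA spectrum from an activity spectrum.

    Both inputs are (azimuth, elevation) power grids, e.g. shape (180, 180).
    Negative residuals are clipped so only human-reflected energy remains.
    """
    return np.clip(spectrum_active - spectrum_static, 0.0, None)

# Toy example: static reflector at bin (40, 90), human reflection at (100, 70)
static = np.zeros((180, 180))
static[40, 90] = 5.0
active = static.copy()
active[100, 70] = 3.0
human = remove_static(active, static)   # only the (100, 70) peak survives
```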

  • Combine multiple packets:

    A 2D AoA spectrum derived from a single WiFi packet can only capture a small fraction of body motion, so a sequence of packets (100 packets) is fed as input to the neural network to estimate body pose:

  • Neural Networks

    The azimuth and elevation ranges are set to [0, 180] degrees with a resolution of 1 degree, giving a spectrum of size 180 × 180. The system uses 4 receivers to capture the user's motion from different angles, and the spectra of the four receivers are concatenated into a tensor of size 180 × 180 × 4. In addition, multiple spectra must be combined to capture full-body motion, so 100 packets from each receiver are concatenated to form a 180 × 180 × 400 tensor as the network input.
    In the network, the CNN captures the spatial features of human body parts, and the LSTM models the temporal features.
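Assembling the network input from per-packet, per-receiver spectra is a pure reshaping exercise; a sketch follows (numpy; the exact channel ordering of packets vs. receivers is my assumption, since the notes only state the final 180 × 180 × 400 size):

```python
import numpy as np

def build_network_input(spectra):
    """Stack per-packet, per-receiver 2D AoA spectra into the network input.

    spectra : (n_packets, n_receivers, 180, 180), e.g. 100 packets x 4 receivers
    Returns a (180, 180, n_packets * n_receivers) tensor, i.e. 180 x 180 x 400.
    """
    n_pkt, n_rx, h, w = spectra.shape
    # Spatial dims first, then flatten (packet, receiver) into the channel axis
    return spectra.transpose(2, 3, 0, 1).reshape(h, w, n_pkt * n_rx)

x = build_network_input(np.zeros((100, 4, 180, 180)))   # shape (180, 180, 400)
```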
    Loss function:
$$
L_{P}=\frac{1}{T} \sum_{t=1}^{T} \frac{1}{N} \sum_{i=1}^{N}\left\|\bar{p}_{t}^{i}-p_{t}^{i}\right\|_{2}, \qquad
L_{H}=\frac{1}{T} \sum_{t=1}^{T} \frac{1}{N} \sum_{i=1}^{N}\left\|\bar{p}_{t}^{i}-p_{t}^{i}\right\|_{H}, \qquad
L=Q_{P} \cdot L_{P}+Q_{H} \cdot L_{H}
$$
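The first term $L_P$ is simply the mean per-joint Euclidean distance between predicted poses $\bar{p}$ and ground truth $p$ (the same quantity later used as the joint localization error). A numpy sketch of that term; $L_H$ uses a norm defined in the paper, so it is omitted here:

```python
import numpy as np

def pose_loss_lp(pred, gt):
    """L_P: mean per-joint Euclidean distance over a pose sequence.

    pred, gt : (T, N, 3) arrays of T frames, N joints, 3D coordinates (meters)
    """
    return np.linalg.norm(pred - gt, axis=-1).mean()

# Toy check: every joint off by 3 cm along one axis -> L_P = 0.03 m
gt = np.zeros((10, 14, 3))
pred = gt.copy()
pred[..., 0] += 0.03
lp = pose_loss_lp(pred, gt)
```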

Justification: What experiments are used to verify the results of their work

  • Experimental configuration

    One transmitter and four receivers; 3 antennas at the transmitter and 3 antennas at each receiver (L-shaped placement)
    Packet rate: 1000 Hz
    Kinect 2.0 records the ground truth (reader's note: can absolute pose be recorded this way?)
    Data from 10 individuals

  • Experiment site

    The default distance between transceivers is 2.5 meters, in a living room (4 × 4 m), a dining room (3.6 × 3.6 m), and a bedroom (4 × 3.8 m)

  • Evaluation Index

    The joint localization error is used as the evaluation metric, defined as the Euclidean distance between the predicted joint position and the ground truth. 14 keypoints/joints are evaluated (reader's note: is the pose root-aligned before evaluation?)

  • overall performance

    ① NLoS conditions: shows that the deep learning model trained under LoS conditions can be applied to NLoS scenarios without retraining
    ② Impact of environment changes: the system is trained on data collected in one environment (e.g., living room or dining room) and then evaluated in a different environment (e.g., bedroom)
    ③ Impact of the distance between transceivers
    ④ Impact of the packet rate
    ⑤ Different users: 7 people for training, 1 for validation, 2 for testing
    ⑥ Impact of multiple users: the verification experiment only collected data from 2 people, which is of limited value

Own opinion

  • Requires 4 receivers, which is a lot
  • Does this count as absolute pose estimation? It should still be relative to the root joint

References

[1] Towards 3D Human Pose Construction Using WiFi
[2] Winect: 3D Human Pose Tracking for Free-form Activity Using Commodity WiFi

Origin blog.csdn.net/qq_42980908/article/details/125833105