Introduction to Human-Computer Interaction Papers: EarBuddy, Enabling On-Face Interaction via Wireless Earbuds

     This paper was published by a Tsinghua University team at CHI in April 2020, entitled "EarBuddy: Enabling On-Face Interaction via Wireless Earbuds", i.e., interacting on the face through wireless earbuds. The researchers propose EarBuddy, a real-time system that uses the microphone in commercially available wireless earbuds to detect tap and swipe gestures near the face and ears.
     In this paper, a total of three studies were carried out. The researchers established a comprehensive design space and designed 27 gestures on the side of the face and around the ears. Because users cannot realistically remember all 27 gestures, and some gestures are not easily detected by the earbud microphone, they conducted a user study to narrow the gesture set down to 8 gestures based on user preference and microphone detectability. In a second user study, the researchers collected a full dataset of these gestures in both quiet and noisy environments, used the data to train a shallow neural network as a binary classifier for gesture detection, and used a deep DenseNet to classify the gestures. Finally, a real-time implementation of EarBuddy was built with these models, and a third user study evaluated its usability.


    

1. EarBuddy System Design

Gestures are recognized in two steps. First, a gesture detector determines whether a gesture is present. If a gesture is detected, a classifier then recognizes which gesture it is. Figure 2 illustrates the overall pipeline of the system.

Gesture detection

Detection uses a 180 ms sliding window with a step size of 40 ms. At each step, 20 MFCCs are extracted from the window and fed into a binary neural network classifier. The classifier outputs 1 whenever the window contains audio belonging to a gesture, and 0 otherwise.
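As a rough illustration of this detection stage, a minimal Python sketch (not the authors' code; the 16 kHz sample rate, the use of librosa, and the placeholder `binary_clf` callable are my assumptions):

```python
import numpy as np
import librosa

SR = 16_000                  # assumed sample rate; the paper does not state it here
WIN = int(0.180 * SR)        # 180 ms sliding window
STEP = int(0.040 * SR)       # 40 ms step

def detection_stream(audio, binary_clf):
    """Return one 0/1 decision per 40 ms step: 1 = gesture audio, 0 = background."""
    outputs = []
    for start in range(0, len(audio) - WIN + 1, STEP):
        window = audio[start:start + WIN]
        # 20 MFCCs summarizing the 180 ms window (averaged over time for simplicity)
        mfcc = librosa.feature.mfcc(y=window, sr=SR, n_mfcc=20).mean(axis=1)
        outputs.append(int(binary_clf(mfcc)))   # binary_clf stands in for the shallow network
    return outputs
```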

(An FCNN, as the name suggests, is a neural network composed entirely of convolutional layers. It differs from a classic CNN in that all of the CNN's fully connected layers are replaced with convolutional layers.

An FCNN can classify an image at the pixel level, which solves the problem of semantic image segmentation. It can accept input images of any size: a deconvolution (transposed convolution) layer upsamples the feature map of the last convolutional layer back to the size of the input image, so that a prediction can be produced for every pixel while preserving the spatial information of the original input, and the upsampled feature map is then classified pixel by pixel.)
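For reference, a toy fully convolutional network along these lines might look like the sketch below (my own illustration, unrelated to EarBuddy's detector; layer sizes are arbitrary):

```python
import torch
import torch.nn as nn

class TinyFCN(nn.Module):
    """Minimal fully convolutional sketch: only conv layers (no fully connected layers),
    so any input size is accepted; a transposed convolution upsamples the final
    feature map back to the input resolution for per-pixel class scores."""
    def __init__(self, num_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                            # halves the spatial size
            nn.Conv2d(16, num_classes, kernel_size=1),  # 1x1 conv instead of an FC layer
        )
        self.upsample = nn.ConvTranspose2d(num_classes, num_classes,
                                           kernel_size=2, stride=2)  # back to input size

    def forward(self, x):
        return self.upsample(self.features(x))   # per-pixel class scores

scores = TinyFCN()(torch.randn(1, 3, 64, 96))
print(scores.shape)   # torch.Size([1, 2, 64, 96])
```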

(In speech recognition and speaker recognition, the most commonly used speech features are Mel-frequency cepstral coefficients (MFCCs).

The MFCC extraction process includes preprocessing, fast Fourier transform, Mel filter bank, logarithmic operation, discrete cosine transform, dynamic feature extraction and other steps.

Preprocessing: the audio signal is framed and windowed. The audio signal itself is non-stationary, mainly because of short-term variation caused by the unpredictable movement of the vocal organs, so it cannot be processed and analyzed directly as a whole. However, the vocal organs change state much more slowly than the sound wave vibrates, so a stable signal can be obtained by processing frame by frame, assuming the signal is stationary over a short interval; a frame length of 20-40 ms is typical.


Fast Fourier transform: transforms each frame from the time domain to the frequency domain.

Mel filter bank: the energy spectrum is passed through a set of Mel-scale triangular filters. This has three effects:

(1) The triangles are dense at low frequencies and sparse at high frequencies, mimicking the human ear's higher resolution at low frequencies;

(2) It smooths the spectrum, suppresses the effect of harmonics, and highlights the formants of the original voice;

(3) The sequence produced by the Fourier transform is very long; summing the energy under each triangle greatly reduces the amount of data;

Logarithmic operations: this step takes the absolute value and then the logarithm. Taking the absolute value keeps only the magnitude and ignores the phase, because phase information contributes little in speech recognition.

The logarithm is applied because human perception of loudness is roughly logarithmic, so the log operation approximates how we hear.

Discrete cosine transform: a discrete cosine transform is applied to the log filter-bank outputs; it separates the pitch (excitation) information from the vocal-tract information, yielding 12-dimensional MFCC features.

Dynamic feature extraction: the standard cepstral MFCC parameters only reflect the static characteristics of the speech. The dynamic characteristics can be described by differencing these static features, e.g., the trajectory of the MFCCs over time, and in practice adding this trajectory information improves recognition. Therefore, the frames before and after the current frame are used to compute first- and second-order differences of the MFCCs, and first- and second-order differences of the frame energy are computed as well, giving a final 39-dimensional MFCC feature vector.)
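This pipeline maps naturally onto librosa; a sketch under my own assumptions (the file name, the 25 ms / 10 ms framing, and the 13 static coefficients including an energy-like 0th coefficient are not taken from the paper):

```python
import numpy as np
import librosa

y, sr = librosa.load("gesture_clip.wav", sr=None)   # placeholder file name

# 13 static coefficients (12 MFCCs plus an energy-like 0th coefficient),
# computed on 25 ms frames with a 10 ms hop (typical ASR settings)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=int(0.025 * sr), hop_length=int(0.010 * sr))
delta1 = librosa.feature.delta(mfcc, order=1)   # first-order differences
delta2 = librosa.feature.delta(mfcc, order=2)   # second-order differences

features = np.vstack([mfcc, delta1, delta2])    # 39 coefficients per frame
print(features.shape)
```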


Almost all gestures span more than three detection steps (>120 ms), so a gesture should cause the classifier to output several 1s in a row; however, temporal variation in the data and noise can make the classifier's serial output noisy. EarBuddy addresses this with a majority-voting scheme in which adjacent runs of 1s are merged if they are separated by only one or two 0s. A gesture is considered present whenever there are 3 or more consecutive 1s, corresponding to a minimum gesture duration of 120 ms. Whenever a gesture is detected, EarBuddy takes a 1.2-second raw audio clip centered on the run of 1s (long enough to cover more than 99% of gestures) and feeds it into the gesture classifier.
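A minimal sketch of this merging rule (my own implementation of the described logic, not the authors' code):

```python
def detect_gestures(bits, max_gap=2, min_run=3):
    """Merge runs of 1s separated by at most `max_gap` zeros, then keep runs of at
    least `min_run` steps (3 x 40 ms = 120 ms). Returns (start, end) step indices."""
    runs, i = [], 0
    while i < len(bits):
        if bits[i] == 1:
            j, end, gap = i, i, 0
            while j < len(bits):
                if bits[j] == 1:
                    end, gap = j, 0
                elif gap < max_gap:
                    gap += 1
                else:
                    break
                j += 1
            if end - i + 1 >= min_run:
                runs.append((i, end))
            i = end + 1
        else:
            i += 1
    return runs

# e.g. detect_gestures([0, 1, 1, 0, 1, 0, 0, 0, 1]) -> [(1, 4)]
```

A 1.2 s clip centered on the midpoint of each returned run would then be cut from the raw audio and passed to the classifier.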

For classification, the audio is converted to a mel-spectrogram: a short-time Fourier transform with a 180 ms window and a 5.36 ms step produces a linear spectrogram about 224 frames long, which is then mapped onto 224 mel bands. When EarBuddy converts the audio signal to a mel-spectrogram in this way, the one-dimensional audio signal becomes a two-dimensional image.
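In librosa terms, the conversion might look like this (a sketch; the 16 kHz sample rate, the placeholder file name, and the exact librosa parameters are assumptions used to reproduce the roughly 224 x 224 shape described above):

```python
import numpy as np
import librosa

SR = 16_000                                   # assumed sample rate
clip, _ = librosa.load("clip.wav", sr=SR, duration=1.2)   # placeholder 1.2 s clip

S = librosa.feature.melspectrogram(
    y=clip, sr=SR,
    n_fft=int(0.180 * SR),                    # 180 ms STFT window
    hop_length=int(0.00536 * SR),             # ~5.36 ms step -> roughly 224 frames per 1.2 s
    n_mels=224)                               # 224 mel bands
S_db = librosa.power_to_db(S, ref=np.max)     # log-scaled 2-D "image" fed to the CNN
print(S_db.shape)                             # approximately (224, 224)
```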


Mel spectrogram: a spectrogram computed on the Mel scale. A spectrum describes how a signal's energy is distributed over frequency, but subjective experiments have shown that the human ear is more sensitive to differences between low frequencies than to equally sized differences between high frequencies: given two frequencies in the low band and two in the high band separated by the same amount, people distinguish the former more easily. So two pairs of frequencies that are equally far apart in hertz are not necessarily equally far apart to the human ear.

The idea is therefore to rescale the frequency axis so that two pairs of frequencies that are equally far apart on the new scale also sound equally far apart to the human ear; this new scale is the Mel scale.

The Mel scale was proposed around 1930 and is still widely used today.
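For reference, the commonly used HTK-style formula for this mapping (a small illustration, not from the paper):

```python
import numpy as np

def hz_to_mel(f_hz):
    """Standard HTK-style mel mapping: equal steps in mel are intended to
    sound like equal pitch steps to a human listener."""
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

# The same 100 Hz gap is large in mel at low frequency, small at high frequency
print(hz_to_mel(200) - hz_to_mel(100))    # ~133 mel
print(hz_to_mel(4100) - hz_to_mel(4000))  # ~24 mel
```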


The researchers explored transfer learning with pre-trained vision models such as VGG16 [58], ResNet [22] and DenseNet [25]. They observed that the pre-trained DenseNet had an advantage: it is a deep, densely connected network with relatively few parameters, and it yielded the best accuracy on their data.

 

The DenseNet in the paper consists of one convolutional layer, four dense blocks, and transition layers in between.

I then looked up DenseNet. (It is based on the idea that convolutional networks can be trained deeper, more accurately and more efficiently if they contain shorter connections between the layers close to the input and those close to the output. The dense convolutional network (DenseNet) was proposed jointly by Cornell University, Tsinghua University and Facebook AI Research (FAIR), and the paper won the best paper award at CVPR 2017. Each layer is connected to every other layer in a feed-forward fashion: for each layer, the feature maps of all preceding layers are used as its input, and its own feature maps are used as input to all subsequent layers. DenseNets have several compelling advantages: they alleviate the vanishing-gradient problem, strengthen feature propagation, encourage feature reuse, and substantially reduce the number of parameters.)

As shown in the figure, within a dense block each layer receives the feature maps of all preceding layers. The function H_l(·) above denotes a nonlinear transformation (a composite operation consisting of BN (batch normalization), ReLU, pooling and convolution).

A 1×1 convolution followed by 2×2 average pooling serves as the transition layer between two consecutive dense blocks; it connects adjacent dense blocks and reduces the feature-map size.

The feature maps within a dense block all have the same spatial size, so they can be concatenated along the channel dimension.

(Why BN: as the network gets deeper, the distribution of each layer's activations gradually drifts toward the ends of the activation function's output range (its saturation region), and if this continues the gradients vanish. BN pulls the layer's activation distribution back toward a standard normal distribution, so that values fall in the range where the activation function is sensitive to its input; a small change in the input then produces a larger change in the loss, which keeps gradients large, avoids vanishing gradients, and also speeds up convergence.)
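To make the structure concrete, here is a simplified PyTorch sketch of one dense layer and one transition layer (my own illustration; torchvision's real DenseNet additionally uses a 1x1 bottleneck and channel compression):

```python
import torch
import torch.nn as nn

class DenseLayer(nn.Module):
    """One H_l(.) unit: BN -> ReLU -> 3x3 conv, with the output concatenated
    to the input along the channel dimension (feature reuse)."""
    def __init__(self, in_channels, growth_rate=32):
        super().__init__()
        self.bn = nn.BatchNorm2d(in_channels)
        self.conv = nn.Conv2d(in_channels, growth_rate, kernel_size=3,
                              padding=1, bias=False)

    def forward(self, x):
        new_features = self.conv(torch.relu(self.bn(x)))
        return torch.cat([x, new_features], dim=1)

class Transition(nn.Module):
    """Between dense blocks: 1x1 conv to reduce channels, then 2x2 average
    pooling to halve the spatial size of the feature maps."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False)
        self.pool = nn.AvgPool2d(kernel_size=2, stride=2)

    def forward(self, x):
        return self.pool(self.conv(x))
```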

After pre-training, this architecture is modified by replacing the last fully connected layer with two fully connected layers, with a dropout layer and a ReLU activation in between. The output layer has to be changed because the pre-trained DenseNet produces 1000 output classes for its original image dataset, whereas EarBuddy needs far fewer (one per gesture). Finally, the modified pre-trained network is trained on the gesture dataset to produce the final classification model used by EarBuddy.

(Dropout: randomly deactivate a fraction of the units during each training pass, which improves the network's ability to generalize and reduces overfitting.)
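A sketch of this modification with torchvision (DenseNet-121 and the hidden size of 256 are my assumptions; the paper only specifies two fully connected layers with dropout and ReLU in between):

```python
import torch.nn as nn
from torchvision import models

NUM_GESTURES = 8   # one output class per gesture

backbone = models.densenet121(weights="IMAGENET1K_V1")   # pretrained backbone
in_features = backbone.classifier.in_features            # 1024 for DenseNet-121

# Replace the original 1000-way fully connected layer with two FC layers,
# with ReLU and dropout in between
backbone.classifier = nn.Sequential(
    nn.Linear(in_features, 256),
    nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(256, NUM_GESTURES),
)
```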


In the interaction design, a total of seven regions that can be used for interaction are identified in the figure. The location of the gesture on the face and the movement of the finger are the two dimensions that define the researchers' design space. As shown in the figure, taking all feasible combinations of these two dimensions yields 27 gestures in total, 14 of them tap-based and 13 swipe-based.


The researchers wanted to narrow the set of 27 gestures down to a subset that could be performed naturally, memorized quickly, and classified reliably, so a study was conducted to identify an optimal gesture set.

Sixteen participants (8 males, 8 females, age = 21.3 ±0.9) were recruited. The study was conducted in a quiet room with an ambient noise level of approximately 35‑40 dB. Each participant performed all 27 gestures 3 times with the right hand.

After performing each gesture 3 times, participants rated it on three criteria (1: strongly disagree to 7: strongly agree):

• Simplicity: "The gesture is easy to perform with precision."

• Social acceptability: "The gesture can be performed without social concern." (i.e., socially acceptable and not conspicuous)

• Fatigue: "This gesture makes me feel tired." (Note: the Likert scores for this item are reversed for analysis)

The following criteria were used to select the best gestures:

  1. SNR. The signal-to-noise ratio (SNR) was calculated for each sample, and gestures with an average SNR below 5 dB were removed. This eliminated eight gestures, many of them bottom-up or complex swipe gestures.
  2. Signal similarity. Dynamic time warping (DTW) [54] was applied to the raw data to compute the signal similarity between gesture pairs (a minimal DTW sketch follows this list). Gesture pairs whose total distance fell in the bottom 25% were removed, as they were the most likely to be confused during classification.
  3. Design consistency. Previous work has shown that single-tap and double-tap gestures usually appear together in a design space, i.e., if an interface supports a single-tap gesture, it usually also supports the corresponding double-tap gesture. Thus, for every single-tap gesture removed up to this point, the corresponding double-tap gesture was also removed, and vice versa.
  4. Preference. The users' subjective ratings decided among the remaining gestures.
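A minimal DTW distance function, as referenced in criterion 2 above (a plain O(n*m) sketch over 1-D signals, not the paper's implementation):

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic-time-warping distance between two 1-D signals."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# Pairwise similarity over recorded gesture samples (placeholder arrays):
# distances = {(g1, g2): dtw_distance(samples[g1], samples[g2]) for g1, g2 in pairs}
```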

In the end, there are 8 gestures left, including 6 tap gestures and 2 swipe gestures.

A second study was then conducted to evaluate EarBuddy's detection and classification accuracy. It was conducted in two sessions: one in a quiet environment and one in a noisy environment.

During data collection, participants performed each gesture 10 times per round, for 5 rounds in each of the two sessions, generating 100 examples of each gesture per participant (10 examples/round × 5 rounds/session × 2 sessions). To verify EarBuddy's detection accuracy, participants were asked to time each gesture to a countdown on the laptop screen: the timer counts down for 2 seconds, and the participant then has another 2 seconds to complete the gesture. Audio was recorded during these 4 seconds so that the recordings contain audio both with and without gestures.


To test the feasibility of EarBuddy, the researchers used the data collected in this study to train two models: one for segmenting the audio (i.e., gesture detection) and one for recognizing the gesture in each clip (i.e., gesture classification).

They used SGD (stochastic gradient descent) as the optimizer, with a momentum of 0.9 to speed up convergence and a weight-decay regularization parameter of 0.0001 [30] to prevent overfitting. The learning rate was updated by combining a warm-up phase [17] with cosine annealing.

(Cosine annealing: when gradient descent approaches a minimum of the loss function, the learning rate should shrink so that the model can get as close to that point as possible, so the learning rate needs to be decayed. As its argument increases, the cosine first decreases slowly, then quickly, then slowly again, which is why the cosine function is often used to decay the learning rate.

Learning-rate warm-up: because the model's weights are randomly initialized at the start of training, choosing a large learning rate immediately may make the model oscillate. With warm-up, the learning rate is kept small for the first few epochs so that the model gradually stabilizes; once it is more stable, training switches to the preset learning rate. This speeds up convergence and gives a better final model.)

The learning rate starts at 0.01, then ramps up to 0.1 over 20 epochs, and then decays with a cosine curve over the next 400 epochs.

Such a learning-rate schedule combines fast convergence (a large learning rate at the start) with robust convergence (a small learning rate at the end).
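In PyTorch, such a schedule might be set up roughly as follows (a sketch; the scheduler classes used here are one of several ways to implement warm-up plus cosine decay, and the stand-in model is hypothetical):

```python
import torch
from torch.optim.lr_scheduler import LinearLR, CosineAnnealingLR, SequentialLR

model = torch.nn.Linear(224 * 224, 8)          # stand-in for the modified DenseNet
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9,      # speeds up convergence
                            weight_decay=1e-4) # regularization against overfitting

# Warm up from 0.01 to 0.1 over the first 20 epochs, then cosine-decay over 400 epochs
warmup = LinearLR(optimizer, start_factor=0.01 / 0.1, total_iters=20)
cosine = CosineAnnealingLR(optimizer, T_max=400)
scheduler = SequentialLR(optimizer, schedulers=[warmup, cosine], milestones=[20])

# Call scheduler.step() once per epoch, after the optimizer updates for that epoch
```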


The figure above shows the confusion matrix for the eight gestures using the best model from Table 2. The three double-tap gestures had the highest accuracy (97.3%), followed by the three single-tap gestures (94.4%) and the two swipe gestures (93.1%).


Figure: Results of the evaluation study. Top: time to complete the task. Bottom: subjective ratings for the three settings.


In general, EarBuddy can serve as a convenient input method. However, it is not well suited to repeated, continuous interaction such as text entry or interface scrolling.

Users generally preferred tap gestures to swipe gestures. Compared with swipe gestures, tap gestures had similar mean simplicity scores but higher social-acceptability (4.6 vs. 3.9) and fatigue scores (4.8 vs. 3.7). Furthermore, simple swipe gestures were preferred over complex ones, as the latter were considered less socially acceptable (2.6) and more tiring (3.0).
