[Computer Vision|Speech Separation] Looking to listen in noisy environments: a speaker-independent "audio-visual model" for speech separation

This series of blog posts is notes on deep learning/computer vision papers; please indicate the source when reprinting.

Title: Looking to Listen at the Cocktail Party: A Speaker-Independent Audio-Visual Model for Speech Separation

Link: Looking to listen at the cocktail party: a speaker-independent audio-visual model for speech separation: ACM Transactions on Graphics: Vol 37, No 4

Translator's Note: The "cocktail party" in the title refers to the "cocktail party effect", a term commonly used in auditory science that describes the remarkable human ability to process complex acoustic environments. Consider a busy cocktail party: people can focus on a specific conversation or voice while ignoring other background noise. This is what is usually called "selective auditory attention" or the "cocktail party effect".

Authorization statement:

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee, provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be respected. For all other uses, contact the owner/author(s).

© 2018 Copyright by owner/author.

0730-0301/2018/8-ART112

https://doi.org/10.1145/3197517.3201357

Abstract

We propose a joint "audio-visual model" for isolating a single speech signal from a mixture of sounds such as other speakers and background noise. Solving this task using only audio as input is extremely challenging, and such a solution cannot associate the separated speech signal with the corresponding speaker in the video.

In this paper, we propose a deep network-based model that incorporates both visual and auditory signals to solve this task.

Visual features are used to "focus" the audio on the desired speaker in the scene to improve the quality of speech separation. To train our joint audio-visual model, we introduce AVSpeech, a new dataset consisting of thousands of hours of video clips from across the web.

Requiring only that the user specify the face of the person in the video whose speech they want to isolate, we demonstrate that our method works both on classic speech separation tasks and in real-world situations involving heated interviews, noisy bars, and screaming children.

In the case of mixed speech, our method significantly outperforms state-of-the-art audio-only speech separation.

Furthermore, our model is speaker-independent (trained once, applicable to any speaker), and produces results that outperform recent speaker-dependent audio-visual speech separation methods (which require training a separate model for each speaker of interest).

Additional keywords and phrases

Audio-Visual, Source Separation, Speech Enhancement, Deep Learning, Convolutional Neural Networks (CNN), Bidirectional Long Short-Term Memory (BLSTM)

ACM Reference Format

Ariel Ephrat, Inbar Mosseri, Oran Lang, Tali Dekel, Kevin Wilson, Avinatan Hassidim, William T. Freeman, and Michael Rubinstein. 2018. Looking to Listen at the Cocktail Party: A Speaker-Independent Audio-Visual Model for Speech Separation. ACM Trans. Graph. 37, 4, Article 112 (August 2018), 11 pages. https://doi.org/10.1145/3197517.3201357

1 Introduction

In noisy environments, humans have the amazing ability to focus their auditory attention on a single sound source, while reducing ("muting") all other sounds and noises. How the nervous system accomplishes this feat, known as the cocktail party effect [Cherry 1953], remains unclear.

However, research has shown that observing a speaker's face can enhance a person's ability to resolve perceptual ambiguities in noisy environments [Golumbic et al. 2013; Ma et al. 2009]. In this paper, we implement a computational expression of this capability.

The first author did this work at Google as an intern

Automatic speech separation, the separation of an input audio signal into its individual speech sources, has been extensively studied in the audio processing literature. Since the problem is inherently ill-posed, prior knowledge or special microphone configurations are required for a reasonable solution [McDermott 2009].

Furthermore, audio-only speech separation suffers from a fundamental problem, the label permutation problem [Hershey et al. 2016]: there is no easy way to associate each separated audio source with its corresponding speaker in the video [Hershey et al. 2016; Yu et al. 2017].

In this work, we propose a joint audio and visual approach for "focusing" on a specific speaker in a video. The input video can then be re-rendered so that the audio associated with a specific person is enhanced while all other sounds are suppressed (Figure 1).

Figure 1: We propose a model for isolating and enhancing speaker-specific speech in videos. (a) The input is a video (frames + audio track) in which one or more people are speaking, and the speech of interest is disturbed by other speakers and/or background noise. (b) Audio and visual features are extracted and fed into a joint audio-visual speech separation model. The output is the decomposition of the input audio track into clean speech tracks, one track (c) for each detected person in the video. This allows us to synthesize videos in which a specific person's voice is enhanced while all other sounds are suppressed.

Our model is trained using thousands of hours of video clips from our new dataset, AVSpeech. The "Stand-Up" video (stand-up comedy, similar to Chinese "crosstalk") in (a) is provided by Team Coco.

Specifically, we design and train a neural-network-based model that takes as input a recorded mixture of voices, together with tight crops of the detected faces in each frame of the video, and separates the mixture into an independent audio stream for each detected speaker.

The model leverages visual information both to improve the quality of source separation (compared to results using audio alone) and to associate the separated speech tracks with the speakers visible in the video. All the user needs to do is specify which faces in the video they want to isolate the speech of.

To train our model, we collected 290,000 high-quality lecture, TED talk, and tutorial videos from YouTube, and then automatically extracted from these videos about 4,700 hours of video clips with visible speakers and clean speech (no interfering sounds) (Figure 2).

Figure 2: AVSpeech dataset: First, we collect 290,000 high-quality online public speech and lecture videos (a). From these videos, we extract clips with clean speech (e.g. no mixed music, audience voices, or other speakers) and with the speaker visible in the frames (see Section 3 and Figure 3 for processing details). This results in 4,700 hours of video clips, each of a single person talking with no background interference (b). The data cover a wide variety of people, languages, and face poses, with distributions shown in (c) (age and head angle estimated using automatic classifiers; languages based on YouTube metadata). For a detailed list of video sources in the dataset, please refer to the project webpage.

We call this new dataset AVSpeech. With this dataset, we then generated a training set for a "synthetic cocktail party" — mixing videos of faces containing clean speech with audio tracks of other speakers and background noise.

We demonstrate the advantage of our method over recent speech separation methods in two ways.

  • We demonstrate superior results compared to state-of-the-art audio-only methods on mixtures of pure speech.
  • We demonstrate the ability of our model to produce enhanced sound streams in real-world scenarios, on mixtures containing overlapping speech and background noise.

In summary, our paper provides two main contributions:

  1. An audio-visual speech separation model that outperforms audio-only and audio-visual models in classical speech separation tasks and applies to challenging natural scenarios. To the best of our knowledge, our paper is the first to propose a model for speaker-independent audio-visual speech separation.
  2. A new large-scale audio-visual dataset, AVSpeech, carefully collected and processed, contains video clips where audible sounds belong to a single visible person in the video and are free of audio background distractions. This dataset enables us to achieve state-of-the-art results in speech separation and potentially leads to further studies by the research community.

Our dataset, input and output videos, and other supplementary material are all available on the project webpage: http://looking-to-listen.github.io/.

2 Related Work

We briefly review related work in the fields of speech separation and audio-visual signal processing.

Speech Separation : Speech separation is a fundamental problem in audio processing and has been the subject of extensive research in recent decades.

  • Wang and Chen [2017] provided a comprehensive overview of recent deep learning-based audio-only methods for speech denoising [Erdogan et al., 2015; Weninger et al., 2015] and speech separation tasks.

  • Two approaches to address the aforementioned **label permutation problem** have recently emerged for speaker-independent multi-speaker separation in the mono case.

    • Hershey et al. [2016] proposed a method called " deep clustering ", in which discriminatively trained speech embeddings are used to cluster and separate different speech sources.

    • Hershey et al. [2016] also introduced the idea of ​​a permutation-free or permutation-invariant loss function, but they did not find it to work very well. Isik et al. [2016] and Yu et al. [2017] subsequently proposed a method that successfully uses a permutation invariant loss function to train deep neural networks.

  • The advantages of our method over audio-only methods are threefold:

    • We show that our audio-visual model achieves higher quality separation results than state-of-the-art audio-only models.

    • Our method performs well in the presence of multiple speakers mixed with background noise, a problem that, to our knowledge, has not been addressed satisfactorily by an audio-only approach.

    • We jointly address two speech processing problems: speech separation and associating speech signals with their corresponding faces, which until now have been treated independently [Hoover et al., 2017; Hu et al., 2015; Monaci, 2011].

Audio-Visual Signal Processing : Multimodal fusion of auditory and visual signals using neural networks to solve various speech-related problems is attracting increasing interest.

  • These include:

    • Audio-visual speech recognition [Feng et al., 2017; Mroueh et al., 2015; Ngiam et al., 2011]

    • Speech or text prediction from silent videos (lip reading) [Chung et al., 2016; Ephrat et al., 2017]

    • Unsupervised learning of language from visual and speech signals [Harwath et al., 2016].

    These methods exploit the natural synchronization relationship between simultaneously recorded visual and auditory signals.

  • Audio-visual (AV) methods have also been used for:

    • Speech separation and enhancement [Hershey et al., 2004; Hershey and Casey, 2002; Khan, 2016; Rivet et al., 2014].

    • Casanovas et al. [2010] use sparse representations for AV source separation, but are limited by relying on only active regions to learn source features and assuming that all audio sources are visible on the screen.

    • Recent approaches use neural networks to perform this task.

      • Hou et al. [2018] propose a multi-task CNN based model that outputs a denoised speech spectrogram along with a reconstruction of the input mouth region.

      • Gabbay et al. [2017] trained a speech enhancement model on videos where other speech samples of the target speaker were used as background noise, a scheme they called "noise-invariant training". In parallel work, Gabbay et al. [2018] used a video-to-audio synthesis approach to filter noisy audio.

    • The main limitation of these AV speech separation methods is that they are speaker-specific, meaning that a dedicated model must be trained separately for each speaker. While specific design choices in these works limit their applicability to the speaker-dependent setting, we conjecture that the main reason speaker-independent AV models have not been extensively studied to date is the lack of a sufficiently large and diverse dataset with which to train such models. This is exactly what the dataset we construct and make available in this work provides.

  • To the best of our knowledge, our paper is the first to address speaker-independent AV speech separation. Our model is able to isolate and enhance speakers it has never seen before, speaking languages that are not in the training set. Furthermore, our work is unique in that we demonstrate high-quality speech separation on real-world examples, in settings not covered by previous audio-only and audio-visual speech separation work.

  • A number of independent and concurrent works have recently emerged that address the problem of audio-visual sound source separation using deep neural networks.

    • [Owens and Efros 2018] trained a network to predict whether audio and visual streams are temporally aligned. The learned features extracted from this self-supervised model are then used to condition an on-screen and off-screen speaker source separation model.

    • Afouras et al. [2018] perform speech enhancement by using a network to predict the magnitude and phase of the denoised speech spectrogram.

    • Zhao et al. [2018] and Gao et al. [2018] address the closely related problem of separating the sounds of objects visible on screen (e.g. musical instruments).

Audio-Visual Datasets : Most existing AV datasets contain videos involving only a few subjects and speaking words from a limited vocabulary.

  • For example,

    • The CUAVE dataset [Patterson et al., 2002] contains 36 subjects who each say the digits 0 to 9 five times, for a total of 180 examples for each digit.

    • Another example is the Mandarin Sentences dataset introduced by Hou et al. [2018], which contains video recordings of 320 Mandarin sentences spoken by a native speaker. Each sentence contains 10 Chinese characters in which the phonemes are equally distributed.

    • The TCD-TIMIT dataset [Harte and Gillen, 2015] includes 60 volunteer speakers with approximately 200 videos per speaker. The speakers read a variety of sentences from the TIMIT dataset [S Garofolo et al., 1992] and recorded using a camera facing forward and at a 30-degree angle.

    For comparison with previous work, we evaluate our results on these three datasets.

  • More recently, Chung et al. [2016] introduced a large-scale lip-reading sentence (LRS) dataset, which includes words from a variety of different speakers and a larger vocabulary. However, not only is this dataset not publicly available, but speech in LRS videos is not guaranteed to be clean, which is crucial for training speech separation and augmentation models.

3 AVSpeech Dataset

We introduce a new large-scale audio-visual dataset containing speech clips free of interfering background signals. The clips vary in length between 3 and 10 seconds, and in each clip the only visible face in the video and the voice in the audio belong to the same speaker. In total, the dataset contains approximately 4,700 hours of video clips with approximately 150,000 different speakers, spanning a wide variety of people, languages, and face poses. Figure 2 shows some representative frames, audio waveforms, and dataset statistics.

We took an automatic approach to collecting the dataset, because it is important to be able to assemble such a large corpus without relying on extensive human feedback. Our dataset creation pipeline collected clips from approximately 290,000 YouTube videos, including lectures (such as TED talks) and tutorial videos. In such channels, most videos contain only a single speaker, and the video and audio are usually of high quality.

Dataset creation process . Our dataset collection process has two main stages, as shown in Figure 3.

Figure 3: Video and audio processing for dataset creation: (a) We use face detection and tracking to extract candidate speech segments from videos, rejecting frames in which faces are blurred or not sufficiently frontal. (b) We discard segments containing noisy speech by estimating the speech signal-to-noise ratio (see Section 3). The plot is intended to demonstrate the accuracy of our speech SNR estimator (and thus reflect the quality of the dataset). We compare the true speech SNR to the predicted SNR of mixtures generated by synthesizing clean speech and non-speech noise at known SNR levels. Predicted SNR values (in decibels) are averaged over 60 generated mixtures in each SNR bin; error bars represent one standard deviation. We discard segments with predicted speech SNR below 17 dB (marked by the dashed gray line in the figure).

  • First, we used the speaker tracking method of Hoover et al. [2017] to detect segments in videos where people are actively speaking and their faces are visible. Face frames are discarded from a segment if they are blurry, poorly lit, or in an extreme pose. If more than 15% of the face frames in a segment are missing, the entire segment is discarded. At this stage, we used classifiers from the Google Cloud Vision API, which we also used to compute the statistics in Figure 2.
  • The second step in building the dataset is to refine the speech segments so that they contain only clean, interference-free speech. This is a critical component, since these segments serve as ground truth at training time. We automate this refinement step by estimating the speech signal-to-noise ratio (the logarithmic ratio between the main speech signal and the rest of the audio signal) for each segment.

    • We use a pretrained audio-only speech denoising network to predict the speech SNR for a given passage by using the denoised output as an estimate of the clean signal . The architecture of this network is identical to that of the audio-only speech enhancement baseline implemented in Section 5, trained on speech data from LibriVox public domain audio books.

    • We discard segments whose estimated speech SNR is below a certain threshold. This threshold was set empirically using synthetic mixtures of clean speech and non-speech interfering noise at known SNR levels. These synthetic mixtures are fed into the denoising network, and the estimated (denoised) speech SNR is compared to the ground-truth SNR (see Figure 3(b)).

  • We find that at low SNR, the estimated speech SNR is, on average, very accurate and can thus be considered a good predictor of the true noise level. However, at higher SNRs (i.e., for segments in which the original speech signal has little interference), the accuracy of this estimator decreases because the noise signal becomes weak. The threshold at which this degradation occurs is about 17 dB, as shown in Figure 3(b). We listened to a random sample of 100 clips that passed this filter and found that none contained significant background noise. We provide sample video clips from the dataset in the supplementary material. A minimal sketch of this SNR-based filtering is shown below.
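The following is a minimal sketch of the SNR-based filtering step described above, assuming a pretrained audio-only denoiser is available as a callable `denoise`; the function names and threshold handling are illustrative, not the paper's actual implementation.

```python
import numpy as np

def estimate_speech_snr(mixture, denoise):
    """Estimate the speech SNR (in dB) of an audio segment.

    `denoise` is assumed to be a pretrained audio-only denoising model that
    maps a noisy waveform to an estimate of its clean speech component.
    """
    speech_est = denoise(mixture)         # estimate of the clean speech signal
    noise_est = mixture - speech_est      # everything the denoiser removed
    eps = 1e-10
    return 10.0 * np.log10((np.mean(speech_est ** 2) + eps) /
                           (np.mean(noise_est ** 2) + eps))

def keep_segment(mixture, denoise, threshold_db=17.0):
    """Keep a candidate segment only if its estimated speech SNR exceeds ~17 dB."""
    return estimate_speech_snr(mixture, denoise) > threshold_db
```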

4 Audio-Visual Speech Separation Model


At a high level, our model consists of a multi-stream architecture that takes as input a visual stream of detected faces together with noisy audio, and outputs complex spectral masks, one for each detected face (see Figure 4).

Figure 4: The multi-stream neural network architecture of our model: the visual streams take as input thumbnails of faces detected in each frame of the video, while the audio stream takes as input the video's audio track, which contains a mixture of speech and background noise. The visual streams use a pretrained face recognition model to extract a face embedding for each thumbnail, and then use a dilated convolutional neural network to learn visual features. The audio stream first computes the short-time Fourier transform (STFT) of the input signal to obtain a spectrogram, and then uses a similar dilated convolutional neural network to learn an audio representation. A joint audio-visual representation is then created by concatenating the learned visual and audio features, and is further processed by a bidirectional LSTM and three fully connected layers. The network outputs a complex spectral mask for each speaker, which is multiplied with the noisy input and converted back to a waveform to obtain an isolated speech signal for each speaker.

The noisy input spectrogram is then multiplied by the masks, resulting in an individual speech signal for each speaker while suppressing all other interfering signals.

4.1 Video and Audio Representation

Input features. Our model accepts both visual and auditory features as input.

  • For video clips with multiple speakers, we use an off-the-shelf face detector (e.g. the Google Cloud Vision API) to find faces in each frame (75 face thumbnails per speaker in total, assuming 3-second clips at a frame rate of 25 FPS).

    • We extract a face embedding for each detected face thumbnail using a pretrained face recognition model. We use the lowest spatially invariant layer in the network, similar to what Cole et al. [2016] used to synthesize faces. The rationale for this is that these embeddings preserve the information needed to recognize millions of faces while removing irrelevant variations between images, such as lighting.

    • Indeed, recent work has also shown that facial expressions can be recovered from these embeddings [Rudd et al., 2016]. We also tried using the raw pixels of the face image, but this did not result in a performance improvement.

  • As for audio features, we compute the short-time Fourier transform (STFT) of a 3-second audio clip. Each time-frequency (TF) bin contains the real and imaginary parts of a complex number, which we take as input. We do **power-law compression** to prevent loud audio from drowning out soft audio. The same processing applies to the noisy signal and the clean reference signal.

  • At inference time, our separation model can be applied to arbitrarily long video segments. When the faces of multiple speakers are detected in a frame, our model can accept multiple face streams as input; we discuss this below.

Output. The output of our model is a multiplicative spectral mask that describes the time-frequency relationship between the clean speech and the background interference.

  • In previous studies [Wang and Chen 2017; Wang et al. 2014], multiplicative masks were observed to be more effective than other options, such as directly predicting spectral magnitudes or directly predicting time-domain waveforms. There exist many mask-based training objectives in the source separation literature [Wang and Chen 2017], we tried two of them: ratio mask (RM) and complex ratio mask (cRM).

    • The ideal ratio mask (RM) is defined as the magnitude ratio between the clean spectrum and the noise spectrum , and it is normalized between 0 and 1.

      • When using a ratio mask, we perform a point-wise multiplication of the predicted ratio mask and the magnitude of the noise spectrum, and then perform an inverse short-time Fourier transform (ISTFT) together with the noise raw phase to obtain the denoised waveform [Wang and Chen 2017].
    • A complex ideal ratio mask is defined as the ratio between the complex clean spectrogram and the complex noisy spectrogram. A complex ideal ratio mask has a real part and an imaginary part, which are estimated separately in the real domain. The real and imaginary parts of a complex mask typically lie between -1 and 1; however, we use sigmoid compression to constrain these mask values to between 0 and 1 [Wang et al. 2016].

      • When masking is performed using a complex ideal ratio mask, the denoised waveform is obtained by complex multiplication of the predicted complex mask and the noisy spectrogram, followed by an inverse short-time Fourier transform (ISTFT) of the result (a minimal sketch of this masking step appears after this list).
  • Given multiple streams of detected speaker faces as input, the network outputs a separate mask for each speaker and background distractors. In most experiments, we use cRM because we found that the quality of speech output using cRM is significantly better than that of RM. Please refer to Table 6 for a quantitative comparison of the two methods.

    Table 6: Ablation study: We investigate the contribution of different parts of our model on the task of separating a mixture of two clean speakers. Signal-to-distortion ratio (SDR) correlates well with noise suppression, while ViSQOL indicates the level of speech quality (see Section A in the Appendix for details).
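As a concrete illustration of the masking described above, here is a minimal sketch of the ideal complex ratio mask and of applying a predicted complex mask to a noisy spectrogram, assuming librosa is available for the ISTFT. The helper names are illustrative, and the sigmoid compression of mask values used during training is omitted for clarity.

```python
import numpy as np
import librosa

def ideal_complex_ratio_mask(clean_stft, noisy_stft, eps=1e-8):
    """Ideal cRM: the complex ratio between the clean and noisy spectrograms."""
    return clean_stft / (noisy_stft + eps)

def apply_complex_mask(mask_real, mask_imag, noisy_stft,
                       hop_length=160, win_length=400):
    """Complex-multiply a predicted mask with the noisy STFT and invert to a waveform.

    mask_real / mask_imag are assumed to already be in their natural range
    (the sigmoid compression used for training is not shown here).
    """
    mask = mask_real + 1j * mask_imag
    enhanced_stft = mask * noisy_stft          # element-wise complex product
    return librosa.istft(enhanced_stft, hop_length=hop_length,
                         win_length=win_length, window="hann")
```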

4.2 Network Architecture

Figure 4 provides a high-level overview of the individual modules in our network, which we now describe in detail.

Audio and visual streams .

  • The audio stream of our model consists of dilated convolutional layers, with parameters listed in Table 1.

    Table 1: The dilated convolutional layers that make up the audio stream of our model.

  • The visual stream of our model processes the input face embeddings (see Section 4.1) and consists of the dilated convolutions detailed in Table 2. Note that the "spatial" convolutions and dilations in the visual stream are performed over the time axis (not over the 1024-dimensional face embedding channel).

    Table 2: The dilated convolutional layers that make up the visual flow of our model.

  • To compensate for the sampling-rate difference between the audio and video signals, we upsample the output of the visual stream to match the spectrogram's sampling rate (100 Hz). This is done using simple nearest-neighbor interpolation along the temporal dimension of each visual feature (a small sketch follows).
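A small sketch of this temporal upsampling, assuming NumPy; the exact alignment to the spectrogram frame count is an implementation detail.

```python
import numpy as np

def upsample_visual_features(visual_feats, video_fps=25, spec_rate=100):
    """Nearest-neighbor upsampling of visual features along the time axis.

    visual_feats: array of shape (num_video_frames, feature_dim), e.g. (75, 1024)
    for a 3-second clip at 25 FPS. Each visual frame is repeated 100/25 = 4 times
    to match the 100 Hz spectrogram rate. In practice the result may need to be
    trimmed to the exact number of spectrogram frames (e.g. 298), which depends
    on the STFT framing convention.
    """
    factor = spec_rate // video_fps
    return np.repeat(visual_feats, factor, axis=0)
```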

Audio visual fusion (AV fusion) .

  • The audio and visual streams are merged by concatenating the feature maps of each stream together,

  • It is then fed into a BLSTM (bidirectional long short-term memory network), followed by three fully connected layers.

  • The final output consists of a complex mask (two channels, real and imaginary) for each input speaker.

  • The corresponding output spectrograms are obtained by complex multiplication of the spectrogram of the noisy input with the output masks.

  • The squared error (L2 loss) between the clean and enhanced spectrogram after power compression is used as the loss function for training the network.

  • The final output waveform is obtained by an inverse short-time Fourier transform (ISTFT), as described in Section 4.1 (a minimal sketch of this fusion stage follows this list).
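The following is a minimal, illustrative Keras sketch of this fusion stage (concatenation, BLSTM, three fully connected layers, and per-speaker complex masks). The layer sizes and feature dimensions are assumptions made for illustration; the paper's exact layer parameters are given in its Tables 1 and 2, which are not reproduced here.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Illustrative dimensions (assumed): T spectrogram frames, Da learned audio
# features per frame, Dv learned visual features per frame (after upsampling
# to the spectrogram rate), N visible speakers, F frequency bins.
T, Da, Dv, N, F = 298, 8 * 257, 256, 2, 257

audio_feats = layers.Input(shape=(T, Da), name="audio_features")
visual_feats = [layers.Input(shape=(T, Dv), name=f"visual_features_{i}")
                for i in range(N)]

# AV fusion: concatenate the learned audio and visual features at each time step.
fused = layers.Concatenate(axis=-1)([audio_feats] + visual_feats)

# Bidirectional LSTM followed by three fully connected layers (sizes assumed).
x = layers.Bidirectional(layers.LSTM(400, return_sequences=True))(fused)
for _ in range(3):
    x = layers.Dense(600, activation="relu")(x)

# One complex mask per speaker: F frequency bins x 2 channels (real, imaginary),
# constrained to [0, 1] by a sigmoid, as described in Section 4.1.
masks = layers.Dense(N * F * 2, activation="sigmoid")(x)
masks = layers.Reshape((T, F, 2, N))(masks)

model = tf.keras.Model(inputs=[audio_feats] + visual_feats, outputs=masks)

def separation_loss(clean_compressed, enhanced_compressed):
    """L2 loss between the power-law-compressed clean spectrogram and the
    enhanced spectrogram (predicted mask applied to the noisy input)."""
    return tf.reduce_mean(tf.square(clean_compressed - enhanced_compressed))
```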

Multiple speakers .

Our model supports isolating multiple visible speakers in a video, where each speaker is represented by a separate visual stream, as shown in Figure 4.

  • We train a separate dedicated model for each number of visible speakers. For example,

    • A model with one visual stream corresponds to one visible speaker
    • A model with two visual streams corresponds to two visible speakers

    etc.

  • All visual streams share the same weights in their convolutional layers. In this case, the learned features of each visual stream are concatenated with the learned audio features before being fed into the BLSTM.

  • It is worth noting that in practice a model that takes a single visual stream as input can be used for the general case where the number of speakers is unknown or a dedicated multi-speaker model cannot be used.

4.3 Implementation Details

Our network is implemented in TensorFlow, which includes operations for waveform and STFT manipulation.

  • The ReLU activation function follows all network layers except the last layer (mask), which uses sigmoid .

  • Batch normalization is performed after all convolutional layers [ Ioffe and Szegedy 2015].

  • We did not use dropout, because we train on a large amount of data and did not observe overfitting.

  • We use a batch size of 6 samples,

  • and use the Adam optimizer for 5 million training steps (batches),

  • The learning rate is $3\cdot10^{-5}$, which is halved every 1.8 million steps.

All audio data is resampled to 16 kHz, and stereo audio is converted to mono by taking only the left channel. The STFT is computed using a Hann window of length 25 ms, a hop length of 10 ms, and an FFT size of 512, resulting in input audio features of size $257\times298\times2$. We apply power-law compression $A^{0.3}$ (i.e. $p=0.3$, where $A$ is the input/output audio spectrogram).
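A minimal sketch of this audio front end with the stated parameters, assuming librosa; note that the exact number of time frames depends on the STFT framing/padding convention.

```python
import numpy as np
import librosa

SR = 16000           # all audio resampled to 16 kHz
N_FFT = 512          # 512-point FFT -> 257 frequency bins
WIN_LENGTH = 400     # 25 ms Hann window at 16 kHz
HOP_LENGTH = 160     # 10 ms hop
P = 0.3              # power-law compression exponent

def audio_to_features(waveform):
    """Mono 16 kHz waveform -> compressed real/imaginary STFT features."""
    stft = librosa.stft(waveform, n_fft=N_FFT, hop_length=HOP_LENGTH,
                        win_length=WIN_LENGTH, window="hann", center=False)
    feats = np.stack([stft.real, stft.imag], axis=-1)     # (257, frames, 2)
    return np.sign(feats) * (np.abs(feats) ** P)          # power-law compression

# Example: a 3-second clip (48,000 samples) yields roughly 257 x 298 x 2 features;
# the exact frame count depends on the library's framing convention.
features = audio_to_features(np.random.randn(3 * SR).astype(np.float32))
print(features.shape)
```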

We resample the face embeddings of all videos to 25 frames per second (FPS) before training and inference, either by dropping or duplicating the embeddings. This results in an input visual stream consisting of 75 face embeddings. Face detection, alignment and quality assessment were performed using the tools described by Cole et al. [2016] . When a missing frame is encountered in a particular sample, we use a zero vector instead of the face embedding.

5 Experiments and Results

We test our method under various conditions and compare the results quantitatively and qualitatively with state-of-the-art audio-only (AO) and audio-visual (AV) speech separation and enhancement methods.

Comparison with Audio-Only.

  • There are currently no state-of-the-art audio-only speech enhancement/separation systems publicly available, and there are relatively few publicly available datasets for training and evaluating audio-only speech enhancement.

  • While there is an extensive literature on blind source separation of audio signals [Comon and Jutten 2010], most of these techniques require multiple audio channels (multiple microphones) and thus are not suitable for our task.

For these reasons, we implemented an audio-only baseline speech enhancement model with an architecture similar to that of our audio stream (Figure 4, with the visual stream removed). When trained and evaluated on the CHiME-2 dataset, which is widely used in speech enhancement work [Vincent et al. 2013], this model performs comparably to the state-of-the-art monaural result of 14.75 dB.

Therefore, our audio-only enhancement model serves as a near-state-of-the-art baseline.

To compare our separation results with state-of-the-art audio-only models, we implement the permutation-invariant training method introduced by Yu et al. [2017].

  • Note that speech separation using this method requires prior knowledge of the number of sources present in the recording, and manual assignment of each output channel to its corresponding speaker's face (our AV method does this automatically).

We use these AO methods in all synthetic experiments in Section 5.1, and use them for qualitative comparison on real videos in Section 5.2.

Comparison with Recent Audio-Visual Methods.

  • Since existing audio-visual speech separation and enhancement methods are speaker-specific, we cannot easily compare with them in our experiments on synthetic mixtures (Section 5.1), nor run them on natural videos (Section 5.2).

  • However, we show quantitative comparisons with these methods on existing datasets by running our models on videos from those papers. We discuss this comparison in more detail in Section 5.3.

  • Furthermore, we present qualitative comparisons in the appendix material.

5.1 Quantitative Analysis of Synthesized Mixed Speech

We generated data for several different monophonic speech separation tasks. Each task requires its own unique mix of speech and non-speech background noise configurations. We describe below the generative process for each variant of the training data, and the associated models for each task, which are trained from scratch.

  • In all cases, clean speech clips and corresponding face images are from our AVSpeech (AVS) dataset.

  • Non-speech background noise comes from AudioSet [Gemmeke et al. 2017], a large-scale dataset consisting of manually annotated clips from YouTube videos.

The separated speech quality was evaluated using the Signal-to-Distortion Ratio (SDR) improvement in the BSS Eval toolbox [Vincent et al. 2006], which is a commonly used metric for evaluating speech separation quality (see Section A in the Appendix for details).

We extracted 3-second non-overlapping segments from our dataset (e.g., a 10-second segment would generate three 3-second segments). We generated 1.5 million synthetic mixed speeches for all models and experiments. For each experiment, 90% of the generated data is used as the training set and the remaining 10% is used as the testing set. We did not use any validation set as no parameter tuning or early stopping was performed.

One speaker + noise (1S+Noise).

This is a classic speech enhancement task, where the training data is generated by linearly combining unnormalized clean speech and AudioSet noise (a minimal sketch of this mixing follows the list below):

$$Mix_i = AVS_j + 0.3 \cdot AudioSet_k$$

where:

  • $AVS_j$ is an utterance from AVS
  • $AudioSet_k$ is a segment from AudioSet, whose magnitude is multiplied by 0.3
  • $Mix_i$ is a sample in the synthetic mixed-speech dataset
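A minimal sketch of this mixing procedure (NumPy; the function and variable names are illustrative). The other training configurations described below are built analogously.

```python
import numpy as np

def make_1s_noise_mixture(avs_clip, audioset_clip, noise_weight=0.3):
    """1S+Noise: a clean AVSpeech utterance plus 0.3x an AudioSet noise clip.

    Both inputs are 1-D waveforms of equal length; no normalization is applied,
    matching the linear (unnormalized) mixing described above.
    """
    return avs_clip + noise_weight * audioset_clip

# The other training configurations are built analogously, e.g.
#   2S clean:   avs_j + avs_k
#   2S + Noise: avs_j + avs_k + 0.3 * audioset_l
#   3S clean:   avs_j + avs_k + avs_l
```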

Our audio-only model performs very well in this case, since the characteristic frequencies of noise are usually well separated from those of speech. Our audio-visual (AV) model performs comparably to the audio-only (AO) baseline, both achieving 16 dB SDR (see the first column of Table 3).

Table 3: Quantitative analysis and comparison to audio-only speech separation and enhancement: quality improvement (in SDR, see Section A in the Appendix for details) as a function of the number of input visual streams, for different network configurations. The first row (audio-only) is our implementation of a state-of-the-art speech separation model and is shown as a baseline.

Two clean speakers (2S clean).

The dataset for this two-speaker separation scenario was generated by mixing clean speech from two different speakers in our AVS dataset:

$$Mix_i = AVS_j + AVS_k$$

where:

  • $AVS_j$ and $AVS_k$ are clean speech samples from different source videos in the dataset

  • $Mix_i$ is a sample in the synthetic mixed-speech dataset

In addition to our AO baseline model, we train two different AV models on this task:

  1. (i) A model that accepts only one visual stream as input and outputs only its corresponding denoised signal.

    In this case, at inference time, each speaker's denoised signal is obtained by making two forward passes through the network (one for each speaker). Averaging the SDR results of this model yields a 1.3dB improvement over our AO baseline model (second column of Table 3).

  2. (ii) Simultaneously accepts visual information from two speakers as input in the form of two separate visual streams (as described in Section 4).

    In this case, the output consists of two masks, one for each speaker, and only one forward pass is required for inference. Using this model gets an additional 0.4dB boost, for a total of 10.3dB SDR improvement. Intuitively, jointly processing the two visual streams provides more information to the network and imposes more constraints on the separation task, leading to improved results.

Figure 5 shows the improved SDR for this task based on the input SDR, including the audio-only baseline model and our two-speaker audio-visual model.

Figure 5: Input SDR vs. Improved Output SDR : This is a scatterplot showing the separation performance (SDR improvement) in the task of separating two clean speakers (2S clean) as a function of the original (noisy) SDR. Each point corresponds to a single 3-second audio-visual sample in the test set.

Two speakers + noise (2S+Noise).

Here we consider the task of isolating the voice of one speaker from a mixture of two speakers and non-speech background noise. To our knowledge, this audio-visual task has not been addressed before. The training data is obtained by mixing the clean speech of two different speakers (generated as in the 2S clean task) with background noise from AudioSet:

$$Mix_i = AVS_j + AVS_k + 0.3 \cdot AudioSet_l$$

In this case, we train the AO network with three outputs, one for each speaker and one for the background noise.

Furthermore, we trained models with two different configurations,

  • One that receives a visual stream as input

    • The single-visual-stream AV model has the same configuration as model (i) in the previous experiment.
  • The other receives two visual streams as input

    • The two-visual-stream AV model outputs three signals, one for each speaker and one for the background noise.

As shown in Table 3 (third column), the one-visual-stream AV model gains 0.1 dB and the two-visual-stream AV model gains 0.5 dB in SDR over the audio-only baseline, resulting in an overall SDR improvement of 10.6 dB.

Figure 6 shows the inferred mask and output spectrogram for a sample segment from this task, along with its noisy input and true spectrogram.

Figure 6: Examples of input and output audio: the top row shows the audio spectrogram of a segment of our training data involving two speakers and background noise (a), and the ground-truth, separated spectrograms for each speaker (b,c). In the bottom row, we show our results: our method's estimated masks for the segment, superimposed on the spectrogram with a different color for each speaker (d), and the corresponding output spectrograms for each speaker (e,f).

Three clean speakers (3S clean).

The dataset for this task is made by mixing clean speech from three different speakers:

$$Mix_i = AVS_j + AVS_k + AVS_l$$
Similar to the previous tasks, we train AV models that receive one, two, and three visual streams as input, and output one, two, and three signals, respectively.

We find that even with a single visual stream, the AV model outperforms the AO model by 0.5 dB. The two-visual-stream configuration gives the same improvement over the AO model, while using three visual streams leads to a gain of 1.4 dB, bringing the overall SDR improvement to 10 dB (fourth column of Table 3).

Same-gender separation.

Many previous speech separation methods perform poorly when trying to separate speech mixtures containing speech of the same gender [Delfarah and Wang 2017; Hershey et al. 2016].

Table 4 shows our separation quality according to different gender combinations.

Table 4: **Same-gender separation.** The results in this table are from the 2S clean experiment and show that our method is robust when separating speech from same-gender mixtures.

Interestingly, our model performs best (by a small margin) on female-female mixtures, but also performs well on the other combinations, showing that it is robust to gender.

5.2 Speech Separation in the Real World

To demonstrate the speech separation capabilities of our model in real-world scenarios, we tested it on a variety of videos containing heated debates and interviews, noisy bars, and screaming children (see Figure 7).

Figure 7: Speech separation in the wild : shows representative frames from natural videos applying our method in various real-world scenarios. All videos and results can be found in the appendix material. "Undisputed Interview" video courtesy of Fox Sports.

In each scene, we trained the model using a number of visual input streams that matched the number of visible speakers in the video.

  • For example, for a video with two visible speakers, we use a two-speaker model.

We perform separation with a single forward pass per video, which our model supports because the network architecture does not enforce a specific temporal duration.

  • This avoids the need to post-process and integrate results on shorter segments of the video.

Since these examples do not have clean reference audio, these results and their comparison to other methods are evaluated qualitatively; they are presented in our appendix material.

It is worth noting that our method does not support real-time processing, and currently our speech enhancement is more suitable for the video post-processing stage.

  • The synthetic video "Double Brady" in our appendix material highlights our model's use of visual information, since in this case the speech cannot be separated using the characteristic speech frequencies in the audio alone.

  • In the "Noisy Bar" scenario, our method shows some limitations in separating speech from a low SNR mix. In this case, the background noise is almost completely suppressed, but the output speech quality is significantly reduced.

    • Sun et al. [2017] observed that this limitation stems from using mask-based methods for separation, and in this case, directly predicting the denoised spectrogram may help to overcome this issue.
    • In the classic speech enhancement case, i.e., one speaker and non-speech background noise, our AV model achieves results similar to our strong AO baseline model. We suspect this is because the characteristic frequencies of noise are usually clearly separated from those of speech, so adding visual information does not provide additional discriminative power.

5.3 Comparison with previous audio-visual speech separation and enhancement work

Our evaluation would not be complete without comparing our results with those of previous audio-visual speech separation and enhancement work.

Table 5 contains comparisons on three different audio-visual datasets (Mandarin, TCD-TIMIT and CUAVE, see Section 2), using the evaluation protocols and metrics described in the respective papers.

Table 5: Comparison with existing audio-visual speech separation work: We compare our speech separation and enhancement results on several datasets with those of previous work, using the evaluation protocols and objective scores reported in the original papers. It is important to note that previous methods are speaker-dependent, whereas our results are obtained using a general, speaker-independent model.

The reported objective quality scores are PESQ [Rix et al. 2001], STOI [Taal et al. 2010] and SDR [Vincent et al. 2006] in the BSS eval toolkit. The qualitative results of these comparisons are available on our project page.

It is important to note that these previous methods required training a dedicated (speaker-dependent) model for each speaker in their datasets individually, whereas our evaluation on their data was performed using our general, speaker-independent model trained on the AVSpeech dataset. Although our model has never encountered these specific speakers, its results are significantly better than those reported in the original papers, indicating its strong generalization ability.

5.4 Applied to video transcription

While this paper focuses on speech separation and enhancement, our method can also be used for automatic speech recognition (ASR) and video transcription.

To test this concept, we performed the following qualitative experiment. We uploaded our speech separation results for the "Stand-Up" video to YouTube and compared the subtitles produced by YouTube's automatic captioning with those generated for the corresponding parts of the original video with mixed speech. For the original "Stand-Up" video, the ASR system was unable to generate any subtitles for parts of the mixed-speech segment, and where captions were produced, speech from both speakers was included, resulting in sentences that are difficult to read.

However, the resulting subtitles are significantly more accurate for our separated speech results. We present the fully captioned video in the appendix material.

5.5 Additional Analysis

We also performed extensive experiments to better understand the behavior of the model and the impact of its different components on the results.

Ablation study

To better understand the contribution of different parts of our model, we performed ablation experiments on the task of separating speech from a mixture of two clean speakers (2S clean). In addition to ablating several of the network modules (visual and audio streams, BLSTM and FC layers), we also investigate higher-level variations, such as a different output mask (magnitude), the effect of reducing the learned visual features to a single scalar per time step, and a different fusion method (early fusion).

  • In the early fusion model, we do not have separate visual and audio streams, but instead combine the two modalities at the input. This is achieved by:

    1. reducing the dimension of each visual embedding to match the spectrogram dimension at each time step, using two fully connected layers, and
    2. stacking the visual features as a third spectrogram "channel", which is then processed jointly throughout the model.
  • Table 6 shows the results of our ablation experiments. The table includes evaluations using SDR and ViSQOL [Hines et al., 2015], an objective measure designed to approximate the Mean Opinion Score (MOS) of speech quality by human listeners. ViSQOL scores are computed on a random 2000 sample subset of our test data. We find that SDR is closely related to the amount of residual noise in the separated audio, while ViSQOL better characterizes the quality of the output speech. See Part A of the Appendix for more details on these scores. “Oracle” RMs and cRMs are masks acquired as described in Section 4.1, using ground truth real-valued and complex-valued spectrograms, respectively.

The most interesting findings of this study are the reduction in MOS when using a real-valued magnitude mask instead of a complex mask, and the unexpected effectiveness of compressing the visual information into one scalar per time step, as described below.

Bottleneck features

Translator's Note: It is called a bottleneck because this layer narrows the representation, like the neck of a bottle.

In our ablation analysis, we find that a network that compresses the visual information into a single scalar at each time step ("Bottleneck (cRM)") performs almost as well as our full model ("Full model (cRM)"), with only a 0.5 dB difference. The latter uses 64 scalars per time step.

How does the model utilize the visual signal?

Our model uses face embeddings as input visual representations (Section 4.1). We want to understand the information captured in these high-level features and determine which regions in the model's input frames are used to separate speech.

To this end, we follow a protocol similar to [Zeiler and Fergus 2014; Zhou et al. 2014] for visual network receptive field visualization. We extend this protocol from 2D images to 3D (space-time) videos.

More specifically, we apply a spatio-temporal occluder (an 11 px × 11 px × 200 ms patch) in a sliding-window fashion. For each occluder position, we feed the occluded video into our model and compare the resulting speech separation result $S_{occ}$ with the result $S_{orig}$ for the original (unoccluded) video.

To quantify the difference between the network outputs, we use SNR, treating the result without occlusion as the "signal". That is, for each spatio-temporal occluder, we compute:

$$E = 10\cdot\log\left(\frac{S_{orig}^2}{(S_{occ}-S_{orig})^2}\right) \tag{1}$$

Repeating this process for all spatio-temporal occluders in the video results in a heatmap for each frame. For visualization, we normalize the heatmap by the maximum SNR of the video:

$$\tilde{E} = E_{max} - E$$

In $\tilde{E}$, a large value corresponds to an occluder that has a greater influence on the speech separation result (a minimal sketch of this procedure follows).
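A minimal sketch of this occlusion analysis, assuming a callable `separate(video, audio)` that returns the separated waveform for the target speaker; the patch sizes and the zero-valued occluder are illustrative choices, not the paper's exact settings.

```python
import numpy as np

def occlusion_effect_db(s_orig, s_occ, eps=1e-10):
    """Eq. (1): SNR in dB, treating the unoccluded separation result as the 'signal'."""
    return 10.0 * np.log10(np.sum(s_orig ** 2) /
                           (np.sum((s_occ - s_orig) ** 2) + eps))

def occlusion_heatmap(video, audio, separate, patch=11, frames_per_patch=5):
    """Slide a space-time occluder over the video and measure its effect on separation.

    `video` has shape (T, H, W, C). At 25 FPS, a 200 ms temporal extent is
    roughly 5 frames. The occluder value (zero) is an implementation choice.
    """
    s_orig = separate(video, audio)
    T, H, W, _ = video.shape
    heat = np.zeros((T // frames_per_patch, H // patch, W // patch))
    for ti in range(heat.shape[0]):
        for yi in range(heat.shape[1]):
            for xi in range(heat.shape[2]):
                occluded = video.copy()
                occluded[ti*frames_per_patch:(ti+1)*frames_per_patch,
                         yi*patch:(yi+1)*patch,
                         xi*patch:(xi+1)*patch] = 0
                s_occ = separate(occluded, audio)
                heat[ti, yi, xi] = occlusion_effect_db(s_orig, s_occ)
    # Normalize: larger value = occluder with greater influence on the result.
    return heat.max() - heat
```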

In Figure 8, we show heatmap results for representative frames from several videos (full heatmap videos are available on our project page). As expected, the largest contribution within the face region comes mainly from the area around the mouth; however, the visualization also shows that other regions, such as the eyes and cheeks, contribute to some extent.

Figure 8: **How does the model utilize the visual signal?** We show heatmaps superimposed on representative input frames from several videos, visualizing the contribution (in decibels, see text) of different regions to our speech separation results, ranging from blue (low contribution) to red (high contribution).

Effect of missing visual information

We further test the contribution of the visual information to the model by gradually removing visual embeddings. Specifically, we first run the model using the full 3-second video for evaluation, giving the speech separation quality with full visual information. We then progressively discard the embeddings at both ends of the segment and re-evaluate the separation quality for visual durations of 2 s, 1 s, 0.5 s and 0.2 s.

The results are shown in Figure 9. Interestingly, when discarding up to 2/3 of the visual embeddings in a segment, the speech separation quality drops by only 0.8 dB on average. This shows that the model is robust to missing visual information, which can occur in real-world scenarios due to head motion or occlusion (see the sketch after Figure 9's caption).

Figure 9: Effect of missing visual information: This figure shows the effect of the duration of visual information on the output SDR improvement in the two-clean-speaker (2S clean) scenario. We test by gradually zeroing out the input face embeddings from both ends of the sample. The results show that even a small number of visual frames is sufficient for high-quality separation.
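A minimal sketch of the embedding-masking procedure used in this test, assuming NumPy; missing frames are replaced with zero vectors, consistent with Section 4.3.

```python
import numpy as np

def mask_visual_embeddings(face_embs, keep_seconds, fps=25):
    """Zero out face embeddings at both ends of a clip, keeping a centered window.

    face_embs: (num_frames, embedding_dim), e.g. (75, 1024) for a 3 s clip at 25 FPS.
    keep_seconds: duration of visual information to retain (e.g. 2, 1, 0.5, 0.2 s).
    """
    masked = np.zeros_like(face_embs)
    keep_frames = min(int(round(keep_seconds * fps)), face_embs.shape[0])
    start = (face_embs.shape[0] - keep_frames) // 2
    masked[start:start + keep_frames] = face_embs[start:start + keep_frames]
    return masked
```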

Conclusion

We propose a novel audio-visual neural network model for single-channel, speaker-independent speech separation. Our model performs well in several challenging scenarios, including multi-speaker mixes with background noise. To train the model, we created a new audio-visual dataset consisting of thousands of hours of video clips of visible speakers and clean speech collected from the web. Our model achieves state-of-the-art results on speech separation and shows potential applications in video captioning and speech recognition. We also conduct extensive experiments analyzing the behavior and effectiveness of our model and its individual components. Overall, our method represents an important advance in audio-visual speech separation and enhancement.

Acknowledgments

We would like to thank Yossi Matias and Google Research Israel for their support of this project, and John Hershey for his valuable input. We would also like to thank Arkady Ziefman for his help with figure design and video editing, and Rachel Soh for helping us license the video content in the results.

References

  1. T. Afouras, J. S. Chung, and A. Zisserman. 2018. The Conversation: Deep Audio-Visual Speech Enhancement. arXiv:1804.04121.
  2. Anna Llagostera Casanovas, Gianluca Monaci, Pierre Vandergheynst, and Rémi Gribonval. 2010. "Blind Audio-Video Source Separation Based on Sparse Redundant Representations". IEEE Transactions on Multimedia 12, 5 (2010), 358–371.
  3. E. Colin Cherry. 1953. Some Experiments on the Recognition of Speech, with One and with Two Ears. Journal of the Acoustical Society of America 25, 5 (1953), 975–979.
  4. Joon Son Chung, Andrew W. Senior, Oriol Vinyals, and Andrew Zisserman. 2016. Lip-reading sentences in the wild. CoRR abs/1611.05358 (2016).
  5. Forrester Cole, David Belanger, Dilip Krishnan, Aaron Sarna, Inbar Mosseri, and William T Freeman. 2016. Synthesizing Normalized Faces from Facial Identity Features. CVPR'17.
  6. Pierre Comon and Christian Jutten. 2010. Handbook of Blind Source Separation: Independent Component Analysis and Applications. Academic Press.
  7. Masood Delfarah and DeLiang Wang. 2017. Characterization of Masking-based Monophonic Speech Separation in Reverberant Environments. IEEE/ACM Transactions on Audio, Speech, and Language Processing 25 (2017), 1085–1094.
  8. Ariel Ephrat, Tavi Halperin, and Shmuel Peleg. 2017. Improved Speech Reconstruction from Silent Video. ICCV 2017 Computer Vision Workshop.
  9. Hakan Erdogan, John R. Hershey, Shinji Watanabe, and Jonathan Le Roux. 2015. Phase Sensitive and Enhanced Speech Separation with Deep Recurrent Neural Networks. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) (2015).
  10. Weijiang Feng, Naiyang Guan, Yuan Li, Xiang Zhang, and Zhigang Luo. 2017. Audio-Visual Speech Recognition with Multimodal Recurrent Neural Networks. 2017 International Joint Conference on Neural Networks (IJCNN). IEEE, 681–688.
  11. Aviv Gabbay, Ariel Ephrat, Tavi Halperin, and Shmuel Peleg. 2018. Seeing Through Noise: Speaker Separation and Enhancement Using Visually-Derived Speech. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) (2018) .
  12. Aviv Gabbay, Asaph Shamir, and Shmuel Peleg. 2017. Visual Speech Enhancement Using Noise-Invariant Training. arXiv preprint arXiv:1711.08789 (2017).
  13. R. Gao, R. Feris, and K. Grauman. 2018. Learning to Separate Object Sounds by Watching Unlabeled Videos. arXiv preprint arXiv:1804.01665 (2018).
  14. Jort F. Gemmeke, Daniel P. W. Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R. Channing Moore, Manoj Plakal, and Marvin Ritter. 2017. Audio Set: An Ontology and Human-Labeled Dataset for Audio Events. In 2017 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP).
  15. Elana Zion Golumbic, Gregory B. Cogan, Charles E. Schroeder, and David Poeppel. 2013. Visual Input Enhances Selective Speech Envelope Tracking in Auditory Cortex at a "Cocktail Party". The Journal of Neuroscience 33, 4 (2013), 1417–26.
  16. Naomi Harte and Eoin Gillen. 2015. TCD-TIMIT: An Audio-Visual Corpus of Continuous Speech. IEEE Transactions on Multimedia 17, 5 (2015), 603–615.
  17. David F. Harwath, Antonio Torralba, and James R. Glass. 2016. Unsupervised Learning of Spoken Language with Visual Context. In NIPS.
  18. John Hershey, Hagai Attias, Nebojsa Jojic, and Trausti Kristjansson. 2004. Audio-Visual Graphical Models for Speech Processing. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP).
  19. John R Hershey and Michael Casey. 2002. Audio-Visual Sound Separation Using Hidden Markov Models. Advances in Neural Information Processing Systems. 1173–1180.
  20. John R. Hershey, Zhuo Chen, Jonathan Le Roux, and Shinji Watanabe. 2016. Deep Clustering: Discriminative Embeddings for Segmentation and Separation. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) (2016), 31–35 .
  21. Andrew Hines, Eoin Gillen, Damien Kelly, Jan Skoglund, Anil C. Kokaram, and Naomi Harte. 2015. ViSQOLAudio, an objective audio quality metric for low bitrate codecs. Journal of the Acoustical Society of America, Vol. 137, No. 6 (2015 ), EL449–55.
  22. Andrew Hines and Naomi Harte. 2012. Speech Intelligibility Prediction Using a Neurogram Similarity Index Measure. Speech Communication 54, 2 (2012), 306–320. DOI: http://dx.doi.org/10.1016/j.specom.2011.09.004
  23. Ken Hoover, Sourish Chaudhuri, Caroline Pantofaru, Malcolm Slaney, and Ian Sturdy. 2017. Putting a Face to the Voice: Fusing Audio and Visual Signals Across a Video to Determine Speakers. CoRR abs/1706.00079 (2017).
  24. Jen-Cheng Hou, Syu-Siang Wang, Ying-Hui Lai, Jen-Chun Lin, Yu Tsao, Hsiu-Wen Chang, and Hsin-Min Wang. 2018. Audio-Visual Speech Enhancement Using Multimodal Deep Convolutional Neural Networks. IEEE Transactions on Emerging Topics in Computational Intelligence 2, 2 (2018), 117–128.
  25. Yongtao Hu, Jimmy SJ Ren, Jingwen Dai, Chang Yuan, Li Xu, and Wenping Wang. 2015. Deep Multimodal Speaker Naming. In Proceedings of the 23rd ACM International Conference on Multimedia. ACM, 1107–1110.
  26. Sergey Ioffe and Christian Szegedy. 2015. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. International Conference on Machine Learning.
  27. Yusuf Isik, Jonathan Le Roux, Zhuo Chen, Shinji Watanabe, and John R Hershey. 2016. Single-Channel Multi-Speaker Separation Using Deep Clustering. Interspeech (2016), 545–549.
  28. Faheem Khan. 2016. Audio-Visual Speaker Separation. Ph.D. Dissertation. University of East Anglia.
  29. Wei Ji Ma, Xiang Zhou, Lars A. Ross, John J. Foxe, and Lucas C. Parra. 2009. Lexical Recognition Aided by Bayesian Interpretation of High-Dimensional Feature Spaces in Moderate Noise. PLoS ONE Volume 4 (2009), 233–252.
  30. Josh H McDermott. 2009. The Cocktail Party Problem. Current Biology 19 No. 22 (2009), R1024–R1027.
  31. Gianluca Monaci. 2011. Development of Visual Speaker Localization for Real-Time Audio. Signal Processing Conference, 19th Europe 2011. IEEE, 1055–1059.
  32. Youssef Mroueh, Etienne Marcheret, and Vaibhava Goel. 2015. Deep Multimodal Learning for Audio-Visual Speech Recognition. In 2015 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). IEEE, 2130–2134.
  33. Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee, and Andrew Y. Ng. 2011. Multimodal Deep Learning. In ICML.
  34. Andrew Owens and Alexei A Efros. 2018. Audio-Visual Scene Analysis Using Self-Supervised Multisensory Features. (2018).
  35. Eric K. Patterson, Sabri Gurbuz, Zekeriya Tufekci, and John N. Gowdy. 2002. "Mobile Speakers, Speaker-Independent Features Study and Baseline Results on the CUAVE Multimodal Speech Corpus". Eurasian Journal of Advanced Signal Processing Volume 2002 (2002), 1189–1201.
  36. Jie Pu, Yannis Panagakis, Stavros Petridis, and Maja Pantic. 2017. Audio-Visual Object Localization and Separation Using Low Rank and Sparsity. In 2017 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). IEEE, 2901–2905.
  37. Bertrand Rivet, Wenwu Wang, Syed M. Naqvi, and Jonathon A. Chambers. 2014. Audio-Visual Speaker Separation: An Overview of Key Methods. IEEE Journal of Signal Processing 31 (2014), 125–134.
  38. Antony W Rix, John G Beerends, Michael P Hollier, and Andries P Hekstra. 2001. Perceptual Evaluation of Speech Quality (PESQ) - A New Approach for Speech Quality Assessment of Telephone Networks and Codecs. Acoustics , Speech and Signal Processing" 2001 International Conference (ICASSP'01). IEEE, 749–752.
  39. Ethan M Rudd, Manuel Günther, and Terrance E Boult. 2016. Moon: A Mixed-Objective Optimization Network for Recognizing Facial Attributes. European Conference on Computer Vision. Springer, 19–35.
  40. JS Garofolo, Lori Lamel, WM Fisher, Jonathan Fiscus, D S. Pallett, N L. Dahlgren, and V Zue. 1992. The TIMIT Speech Corpus. (1992).
  41. Lei Sun, Jun Du, Li-Rong Dai, and Chin-Hui Lee. 2017. LSTM-RNN-Based Multi-Objective Deep Learning Speech Enhancement. at the HSCMA.
  42. Cees H Taal, Richard C Hendriks, Richard Heusdens, and Jesper Jensen. 2010. Short-term objective intelligibility measures for time-frequency-weighted noisy speech. In 2010 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 4214–4217.
  43. Emmanuel Vincent, Jon Barker, Shinji Watanabe, Jonathan Le Roux, Francesco Nesta, and Marco Matassoni. 2013. Second "The Bell" Speech Separation and Recognition Challenge: Datasets, Tasks, and Baselines. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 126–130.
  44. E. Vincent, R. Gribonval, and C. Fevotte. 2006. Performance Measurements for Blind Audio Source Separation. Transactions in Audio, Speech, and Language Processing, Vol. 14, No. 4 (2006), 1462–1469.
  45. DeLiang Wang and Jitong Chen. 2017. Supervised Speech Separation Based on Deep Learning: A Survey. CoRR abs/1708.07524 (2017).
  46. Yuxuan Wang, Arun Narayanan, and DeLiang Wang. 2014. Training Objectives for Supervised Speech Separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP) 22 No. 12 (2014), 1849–1858.
  47. Ziteng Wang, Xiaofei Wang, Xu Li, Qiang Fu, and Yonghong Yan. 2016. Oracle Performance Investigation of Ideal Masks. At IWAENC.
  48. Felix Weninger, Hakan Erdogan, Shinji Watanabe, Emmanuel Vincent, Jonathan Le Roux, John R. Hershey, and Björn W. Schuller. 2015. Speech Enhancement Using LSTM Recurrent Neural Networks and Its Application to Noise Robust ASR . At LVA/ICA.
  49. Dong Yu, Morten Kolbæk, Zheng-Hua Tan, and Jesper Jensen. 2017. Permutation-invariant training of deep models for speaker-independent multi-speaker speech separation. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) (2017), 241–245.
  50. Matthew D Zeiler and Rob Fergus. 2014. Visualizing and Understanding Convolutional Networks. In European Computer Vision Conference. Springer, 818–833.
  51. Hang Zhao, Chuang Gan, Andrew Rouditchenko, Carl Vondrick, Josh McDermott, and Antonio Torralba. 2018. The Sound of Pixels. (2018).
  52. Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. 2014. Occurrence Object Detectors in Deep Scene CNNs. arXiv preprint arXiv:1412.6856 (2014).

REFERENCES

  1. T. Afouras, J. S. Chung, and A. Zisserman. 2018. The Conversation: Deep Audio-Visual Speech Enhancement. In arXiv:1804.04121.
  2. Anna Llagostera Casanovas, Gianluca Monaci, Pierre Vandergheynst, and Rémi Gribonval. 2010. Blind audiovisual source separation based on sparse redundant representations. IEEE Transactions on Multimedia 12, 5 (2010), 358–371.
  3. E Colin Cherry. 1953. Some experiments on the recognition of speech, with one and with two ears. The Journal of the acoustical society of America 25, 5 (1953), 975–979.
  4. Joon Son Chung, Andrew W. Senior, Oriol Vinyals, and Andrew Zisserman. 2016. Lip Reading Sentences in the Wild. CoRR abs/1611.05358 (2016).
  5. Forrester Cole, David Belanger, Dilip Krishnan, Aaron Sarna, Inbar Mosseri, and William T Freeman. 2016. Synthesizing normalized faces from facial identity features. In CVPR’17.
  6. Pierre Comon and Christian Jutten. 2010. Handbook of Blind Source Separation: Independent component analysis and applications. Academic press.
  7. Masood Delfarah and DeLiang Wang. 2017. Features for Masking-Based Monaural Speech Separation in Reverberant Conditions. IEEE/ACM Transactions on Audio, Speech, and Language Processing 25 (2017), 1085–1094.
  8. Ariel Ephrat, Tavi Halperin, and Shmuel Peleg. 2017. Improved Speech Reconstruction from Silent Video. In ICCV 2017 Workshop on Computer Vision for Audio-Visual Media.
  9. Hakan Erdogan, John R. Hershey, Shinji Watanabe, and Jonathan Le Roux. 2015. Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2015).
  10. Weijiang Feng, Naiyang Guan, Yuan Li, Xiang Zhang, and Zhigang Luo. 2017. Audio-visual speech recognition with multimodal recurrent neural networks. In Neural Networks (IJCNN), 2017 International Joint Conference on. IEEE, 681–688.
  11. Aviv Gabbay, Ariel Ephrat, Tavi Halperin, and Shmuel Peleg. 2018. Seeing Through Noise: Speaker Separation and Enhancement using Visually-derived Speech. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2018).
  12. Aviv Gabbay, Asaph Shamir, and Shmuel Peleg. 2017. Visual Speech Enhancement using Noise-Invariant Training. arXiv preprint arXiv:1711.08789 (2017).
  13. R. Gao, R. Feris, and K. Grauman. 2018. Learning to Separate Object Sounds by Watching Unlabeled Video. arXiv preprint arXiv:1804.01665 (2018).
  14. Jort F. Gemmeke, Daniel P. W. Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R. Channing Moore, Manoj Plakal, and Marvin Ritter. 2017. Audio Set: An ontology and human-labeled dataset for audio events. In Proc. IEEE ICASSP 2017.
  15. Elana Zion Golumbic, Gregory B Cogan, Charles E. Schroeder, and David Poeppel. 2013. Visual input enhances selective speech envelope tracking in auditory cortex at a “cocktail party”. The Journal of Neuroscience: the official journal of the Society for Neuroscience 33, 4 (2013), 1417–26.
  16. Naomi Harte and Eoin Gillen. 2015. TCD-TIMIT: An audio-visual corpus of continuous speech. IEEE Transactions on Multimedia 17, 5 (2015), 603–615.
  17. David F. Harwath, Antonio Torralba, and James R. Glass. 2016. Unsupervised Learning of Spoken Language with Visual Context. In NIPS.
  18. John Hershey, Hagai Attias, Nebojsa Jojic, and Trausti Kristjansson. 2004. Audio-visual graphical models for speech processing. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
  19. John R Hershey and Michael Casey. 2002. Audio-visual sound separation via hidden Markov models. In Advances in Neural Information Processing Systems. 1173–1180.
  20. John R. Hershey, Zhuo Chen, Jonathan Le Roux, and Shinji Watanabe. 2016. Deep clustering: Discriminative embeddings for segmentation and separation. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2016), 31–35.
  21. Andrew Hines, Eoin Gillen, Damien Kelly, Jan Skoglund, Anil C. Kokaram, and Naomi Harte. 2015. ViSQOLAudio: An objective audio quality metric for low bitrate codecs. The Journal of the Acoustical Society of America 137, 6 (2015), EL449–55.
  22. Andrew Hines and Naomi Harte. 2012. Speech Intelligibility Prediction Using a Neurogram Similarity Index Measure. Speech Commun. 54, 2 (Feb. 2012), 306–320. DOI: http://dx.doi.org/10.1016/j.specom.2011.09.004
  23. Ken Hoover, Sourish Chaudhuri, Caroline Pantofaru, Malcolm Slaney, and Ian Sturdy. 2017. Putting a Face to the Voice: Fusing Audio and Visual Signals Across a Video to Determine Speakers. CoRR abs/1706.00079 (2017).
  24. Jen-Cheng Hou, Syu-Siang Wang, Ying-Hui Lai, Jen-Chun Lin, Yu Tsao, Hsiu-Wen Chang, and Hsin-Min Wang. 2018. Audio-Visual Speech Enhancement Using Multi-modal Deep Convolutional Neural Networks. IEEE Transactions on Emerging Topics in Computational Intelligence 2, 2 (2018), 117–128.
  25. Yongtao Hu, Jimmy SJ Ren, Jingwen Dai, Chang Yuan, Li Xu, and Wenping Wang. 2015. Deep multimodal speaker naming. In Proceedings of the 23rd ACM international conference on Multimedia. ACM, 1107–1110.
  26. Sergey Ioffe and Christian Szegedy. 2015. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In ICML.
  27. Yusuf Isik, Jonathan Le Roux, Zhuo Chen, Shinji Watanabe, and John R Hershey. 2016. Single-Channel Multi-Speaker Separation Using Deep Clustering. Interspeech (2016), 545–549.
  28. Faheem Khan. 2016. Audio-visual speaker separation. Ph.D. Dissertation. University of East Anglia.
  29. Wei Ji Ma, Xiang Zhou, Lars A. Ross, John J. Foxe, and Lucas C. Parra. 2009. Lip-Reading Aids Word Recognition Most in Moderate Noise: A Bayesian Explanation Using High-Dimensional Feature Space. PLoS ONE 4 (2009), 233–252.
  30. Josh H McDermott. 2009. The cocktail party problem. Current Biology 19, 22 (2009), R1024–R1027.
  31. Gianluca Monaci. 2011. Towards real-time audiovisual speaker localization. In Signal Processing Conference, 2011 19th European. IEEE, 1055–1059.
  32. Youssef Mroueh, Etienne Marcheret, and Vaibhava Goel. 2015. Deep multimodal learning for audio-visual speech recognition. In Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on. IEEE, 2130–2134.
  33. Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee, and Andrew Y. Ng. 2011. Multimodal Deep Learning. In ICML.
  34. Andrew Owens and Alexei A Efros. 2018. Audio-Visual Scene Analysis with Self-Supervised Multisensory Features. (2018).
  35. Eric K. Patterson, Sabri Gurbuz, Zekeriya Tufekci, and John N. Gowdy. 2002. Moving-Talker, Speaker-Independent Feature Study, and Baseline Results Using the CUAVE Multimodal Speech Corpus. EURASIP J. Adv. Sig. Proc. 2002 (2002), 1189–1201.
  36. Jie Pu, Yannis Panagakis, Stavros Petridis, and Maja Pantic. 2017. Audio-visual object localization and separation using low-rank and sparsity. In Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on. IEEE, 2901–2905.
  37. Bertrand Rivet, Wenwu Wang, Syed M. Naqvi, and Jonathon A. Chambers. 2014. Audio-visual Speech Source Separation: An overview of key methodologies. IEEE Signal Processing Magazine 31 (2014), 125–134.
  38. Antony W Rix, John G Beerends, Michael P Hollier, and Andries P Hekstra. 2001. Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs. In Acoustics, Speech, and Signal Processing, 2001. Proceedings.(ICASSP’01). 2001 IEEE International Conference on, Vol. 2. IEEE, 749–752.
  39. Ethan M Rudd, Manuel Günther, and Terrance E Boult. 2016. Moon: A mixed objective optimization network for the recognition of facial attributes. In European Conference on Computer Vision. Springer, 19–35.
  40. J S Garofolo, Lori Lamel, W M Fisher, Jonathan Fiscus, D S. Pallett, N L. Dahlgren, and V Zue. 1992. TIMIT Acoustic-phonetic Continuous Speech Corpus. (Nov. 1992).
  41. Lei Sun, Jun Du, Li-Rong Dai, and Chin-Hui Lee. 2017. Multiple-target deep learning for LSTM-RNN based speech enhancement. In HSCMA.
  42. Cees H Taal, Richard C Hendriks, Richard Heusdens, and Jesper Jensen. 2010. A short-time objective intelligibility measure for time-frequency weighted noisy speech. In Acoustics Speech and Signal Processing (ICASSP), 2010 IEEE International Conference on. IEEE, 4214–4217.
  43. Emmanuel Vincent, Jon Barker, Shinji Watanabe, Jonathan Le Roux, Francesco Nesta, and Marco Matassoni. 2013. The second ’chime’ speech separation and recognition challenge: Datasets, tasks and baselines. 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (2013), 126–130.
  44. E. Vincent, R. Gribonval, and C. Fevotte. 2006. Performance Measurement in Blind Audio Source Separation. Trans. Audio, Speech and Lang. Proc. 14, 4 (2006), 1462–1469.
  45. DeLiang Wang and Jitong Chen. 2017. Supervised Speech Separation Based on Deep Learning: An Overview. CoRR abs/1708.07524 (2017).
  46. Yuxuan Wang, Arun Narayanan, and DeLiang Wang. 2014. On training targets for supervised speech separation. IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP) 22, 12 (2014), 1849–1858.
  47. Ziteng Wang, Xiaofei Wang, Xu Li, Qiang Fu, and Yonghong Yan. 2016. Oracle performance investigation of the ideal masks. In IWAENC.
  48. Felix Weninger, Hakan Erdogan, Shinji Watanabe, Emmanuel Vincent, Jonathan Le Roux, John R. Hershey, and Björn W. Schuller. 2015. Speech Enhancement with LSTM Recurrent Neural Networks and its Application to Noise-Robust ASR. In LVA/ICA.
  49. Dong Yu, Morten Kolbæk, Zheng-Hua Tan, and Jesper Jensen. 2017. Permutation invariant training of deep models for speaker-independent multi-talker speech separation. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2017), 241–245.
  50. Matthew D Zeiler and Rob Fergus. 2014. Visualizing and understanding convolutional networks. In European conference on computer vision. Springer, 818–833.
  51. Hang Zhao, Chuang Gan, Andrew Rouditchenko, Carl Vondrick, Josh McDermott, and Antonio Torralba. 2018. The Sound of Pixels. (2018).
  52. Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. 2014. Object detectors emerge in deep scene CNNs. arXiv preprint arXiv:1412.6856 (2014).

A Objective measures for assessing separation quality

A.1 Signal-to-Distortion Ratio (SDR)

The Signal-to-Distortion Ratio (SDR), introduced by Vincent et al. [2006], is one of a family of metrics for evaluating blind audio source separation (BASS) algorithms when the original source signals are available as ground truth. These metrics decompose each estimated source signal into a true-source component (s_target) and error terms corresponding to interference (e_interf), additive noise (e_noise), and artifacts introduced by the separation algorithm (e_artif).

SDR is the most general score and is often used to report the performance of speech separation algorithms. It is measured in decibels (dB) and is defined as follows:
$$\mathrm{SDR} := 10 \cdot \log_{10}\left(\frac{\|s_{\mathrm{target}}\|^{2}}{\|e_{\mathrm{interf}} + e_{\mathrm{noise}} + e_{\mathrm{artif}}\|^{2}}\right) \tag{2}$$
We refer the reader to the original paper for details on the decomposition of the signal into its component parts. We found a good correlation between this metric and the amount of residual noise after separation.
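For intuition, here is a minimal NumPy sketch of Eq. (2). It is not the full BSS Eval decomposition or the authors' evaluation code: the estimate is simply projected onto the ground-truth source to obtain s_target, and the interference, noise, and artifact terms are lumped into a single residual. The function name `sdr` and the toy signals are illustrative assumptions.

```python
import numpy as np

def sdr(reference: np.ndarray, estimate: np.ndarray) -> float:
    """Signal-to-Distortion Ratio in dB for a single estimated source (Eq. 2, simplified)."""
    reference = reference.astype(np.float64)
    estimate = estimate.astype(np.float64)
    # s_target: orthogonal projection of the estimate onto the ground-truth source.
    s_target = (np.dot(estimate, reference) / np.dot(reference, reference)) * reference
    # Residual = e_interf + e_noise + e_artif, lumped together in this sketch.
    residual = estimate - s_target
    return 10.0 * np.log10(np.sum(s_target ** 2) / np.sum(residual ** 2))

# Toy usage: a clean tone as "ground truth", the same tone plus noise as the separated output.
t = np.linspace(0.0, 1.0, 16000, endpoint=False)
clean = np.sin(2 * np.pi * 440.0 * t)
noisy_estimate = clean + 0.1 * np.random.randn(t.size)
print(f"SDR: {sdr(clean, noisy_estimate):.2f} dB")
```

Published SDR numbers are typically computed with a full BSS Eval implementation (e.g., the `mir_eval` package), which also separates out the interference and artifact terms to report SIR and SAR.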

A.2 Virtual Speech Quality Objective Listener (ViSQOL)

The Virtual Speech Quality Objective Listener (ViSQOL) is an objective speech quality model proposed by Hines et al. [2015]. The metric models human perception of speech quality using a spectro-temporal similarity measure between a reference (r) and a degraded (d) speech signal, and is based on the Neurogram Similarity Index Measure (NSIM) [Hines and Harte 2012]. NSIM is defined as follows:
$$\mathrm{NSIM}(r, d) = \frac{2\mu_{r}\mu_{d} + C_{1}}{\mu_{r}^{2} + \mu_{d}^{2} + C_{1}} \cdot \frac{\sigma_{rd} + C_{2}}{\sigma_{r}\sigma_{d} + C_{2}} \tag{3}$$
Here, μ_r and μ_d are the means of the reference and degraded signals, σ_r and σ_d their standard deviations, and σ_rd their cross-covariance, all computed over spectrogram patches. In ViSQOL, NSIM is evaluated between patches of the reference spectrogram and the corresponding patches of the degraded spectrogram; the per-patch scores are then aggregated and mapped to a Mean Opinion Score (MOS) between 1 and 5.
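As a concrete illustration, the sketch below evaluates the NSIM of Eq. (3) on a single pair of spectrogram patches. It is not the ViSQOL implementation; the constants C1 and C2, the patch size, and the omission of the final aggregation into a MOS value are placeholder assumptions.

```python
import numpy as np

def nsim(r: np.ndarray, d: np.ndarray, c1: float = 0.01, c2: float = 0.03) -> float:
    """NSIM between equally shaped reference/degraded spectrogram patches (Eq. 3)."""
    r = r.astype(np.float64).ravel()
    d = d.astype(np.float64).ravel()
    mu_r, mu_d = r.mean(), d.mean()
    sigma_r, sigma_d = r.std(), d.std()
    sigma_rd = np.mean((r - mu_r) * (d - mu_d))  # cross-covariance of the two patches
    intensity = (2.0 * mu_r * mu_d + c1) / (mu_r ** 2 + mu_d ** 2 + c1)
    structure = (sigma_rd + c2) / (sigma_r * sigma_d + c2)
    return intensity * structure

# Identical patches score ~1.0; added distortion drives the score down.
patch = np.abs(np.random.randn(32, 30))
print(nsim(patch, patch))
print(nsim(patch, patch + 0.5 * np.random.randn(*patch.shape)))
```

A full ViSQOL score would compute NSIM over many aligned patches and then map the aggregated similarity onto the 1–5 MOS scale.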


  1. https://cloud.google.com/vision/ ↩︎

  2. Such mixtures model well the types of disturbances in our dataset, which typically involve a single speaker being disturbed by non-speech sounds such as audience applause or opening music. ↩︎

  3. https://support.google.com/youtube/answer/6373554?hl=en ↩︎

  4. We use a length of 200 ms to cover the typical phoneme duration range: 30-200 ms. ↩︎

  5. We refer readers to the supplementary material to verify that our speech separation results on the non-occluded videos, which we treat as "correct" in this example, are indeed accurate. ↩︎
