Chrome Extension: Song Recognition with NetEase Cloud Music

Image source: Chrome Plugin - Cloud Music Listening to Songs

Author of this article: Konggo

When you are watching videos in your browser, have you ever come across a piece of background music that stirs something in you, yet you don't know its name? Normally the only option is to pick up your phone and run a song-recognition app, but a browser extension solves the problem far more conveniently: there is no need to fish out your phone, no need to play the audio out loud and disturb others, and no ambient noise to make recognition harder.

If this sounds like you, try the "Cloud Music Song Recognition" Chrome extension produced by NetEase Cloud Music, which can also add recognized songs straight to your favorites. You can preview the extension in action on its official page.

Background

At present, most of the song-recognition extensions in the Chrome Web Store are made overseas; domestic products are rare, and support for Chinese music is poor. Since Cloud Music already has this recognition capability, we hoped to bring the feature to every corner and convey the beautiful power of music. Meanwhile, most extensions on the market are still based on Manifest V2 (which, compared with Manifest V3, is weaker in security, performance, and privacy), and they perform fingerprint extraction on the server, which increases both server-side computation and network transfer. So, is there a way to implement the feature on the Manifest V3 protocol while moving audio fingerprint extraction to the front end?

The new manifest protocol for Chrome extensions

The focus of this article is not how to implement a browser extension itself; if you are unfamiliar with extension development, refer to Google's official developer documentation.

Note in particular that Manifest V2 (MV2) is being deprecated: starting in 2022 the Chrome Web Store gradually stopped accepting updates to MV2 extensions, and in 2023 they gradually stop running. Everything in this article is implemented on Manifest V3 (MV3), which is more secure, performs better, and protects privacy more strongly.

The protocol upgrade also changes how some features must be implemented. Because of MV3's stricter security restrictions, some flexible MV2-era techniques (for example, executing remote code via unsafe mechanisms such as eval or new Function(...)) are no longer available, and this creates real difficulties for a song-recognition extension.

The core ways the MV3 protocol affects the extension implementation:

  • The Background Page is replaced by a Service Worker, which means Web APIs that require a page context can no longer be used in the background.
  • Remotely hosted code is no longer supported and code cannot be loaded dynamically; all executable code must be packaged into the extension itself.
  • The content security policy has been tightened and no longer allows executing unsafe code directly, so WASM initialization functions cannot run directly.
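For reference, a minimal MV3 manifest sketch showing the Service Worker background entry (the name, version, and file names here are illustrative, not the extension's actual values):

```json
{
  "manifest_version": 3,
  "name": "Cloud Music Song Recognition",
  "version": "1.0.0",
  "background": {
    "service_worker": "background.js"
  },
  "permissions": ["tabCapture"]
}
```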

Implementing song recognition

Song-recognition technology is fairly mature. The overall idea is to sample the digital audio, extract an audio fingerprint from the samples, and finally match that fingerprint against a database; the song with the highest matching score is taken as the recognized song.

Capturing audio in a browser extension

Recording a web page's audio from an extension is actually quite simple: the chrome.tabCapture API is all that is needed to capture a tab's audio. We then need to resample the captured stream so that the rules used to compute the fingerprint hash stay consistent with the data in the database.
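As a minimal sketch (not the extension's actual code), capturing the current tab's audio might look like this; the function and callback names are illustrative:

```javascript
// Start capturing the active tab's audio with chrome.tabCapture and hand
// the resulting MediaStream to a callback for further processing.
// Note: chrome.tabCapture.capture must be called in response to a user
// gesture, e.g. clicking the extension's action button.
function startTabAudioCapture(onStream) {
  chrome.tabCapture.capture({ audio: true, video: false }, (stream) => {
    if (chrome.runtime.lastError || !stream) {
      console.error("tabCapture failed:", chrome.runtime.lastError);
      return;
    }
    onStream(stream); // e.g. feed the stream into an AudioContext for sampling
  });
}
```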

The captured stream can then be transcoded and sampled. There are generally three approaches:

  • createScriptProcessor: the simplest approach to audio processing, but it has been marked deprecated in the W3C standard and is not recommended.
  • MediaRecorder: transcoding can also be done through the MediaRecorder API, but it offers no way to do fine-grained processing.
  • AudioWorkletNode: the replacement for createScriptProcessor. It avoids the pressure that synchronous, main-thread processing puts on the main thread and allows sample-level audio signal processing, so this is the approach chosen here for audio sampling.

Sampling audio with AudioWorkletNode, including controlling the sampling duration:

  1. Register the module. The module here is loaded from a file; PitchProcessor.js corresponds to a file in the extension's root directory:
// Create an AudioContext at the sample rate expected by the fingerprint database
const audio_ctx = new window.AudioContext({
  sampleRate: 8000,
});
await audio_ctx.audioWorklet.addModule("PitchProcessor.js");
  2. Create the AudioWorkletNode, which mainly receives the data posted back from the Web Audio thread via port.onmessage, so that the data can be processed on the main thread:
class PitchNode extends AudioWorkletNode {
  // Handle an uncaught exception thrown in the PitchProcessor.
  onprocessorerror(err) {
    console.log(
      `An error from AudioWorkletProcessor.process() occurred: ${err}`
    );
  }

  init(callback) {
    this.callback = callback;
    this.port.onmessage = (event) => this.onmessage(event.data);
  }

  onmessage(event) {
    if (event.type === 'getData') {
      if (this.callback) {
        this.callback(event.result);
      }
    }
  }
}

const node = new PitchNode(audio_ctx, "PitchProcessor");
  3. Implement AudioWorkletProcessor.process, i.e. the contents of PitchProcessor.js:
process(inputs, outputs) {
  // Take the first channel of the first input for signal collection
  const inputChannels = inputs[0];
  const inputSamples = inputChannels[0];
  if (this.samples.length < 48000) {
    this.samples = concatFloat32Array(this.samples, inputSamples);
  } else {
    // Buffer is full: hand the samples to the main thread and start over
    this.port.postMessage({ type: 'getData', result: this.samples });
    this.samples = new Float32Array(0);
  }
  return true; // keep the processor alive
}

The first channel of the first input is used to collect the digital signal; once the collected samples reach the defined length (48,000 here), the main thread is notified to run recognition on the signal.
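The concatFloat32Array helper used in the process method above is not shown in the snippet; a straightforward hypothetical implementation might be:

```javascript
// Hypothetical implementation of the concatFloat32Array helper referenced
// in PitchProcessor.js: concatenates two Float32Arrays into a new one.
function concatFloat32Array(a, b) {
  const out = new Float32Array(a.length + b.length);
  out.set(a, 0);
  out.set(b, a.length);
  return out;
}
```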

The process method also lends itself to many interesting experiments, such as basic white-noise generation.

Audio fingerprint extraction

Once the audio signal has been captured, the next step is to extract a fingerprint from it. What we have is really just a block of binary data; it must be put through a Fourier transform to obtain a frequency-domain representation of its features. The actual fingerprint extraction follows a well-defined but complex algorithm. Common approaches include: 1) fingerprints based on per-band energy; 2) landmark-based fingerprints; 3) neural-network-based fingerprints. Readers interested in the algorithms can consult the relevant papers, for example "A Highly Robust Audio Fingerprinting System". The computation has real performance requirements, and running it on WebAssembly yields better CPU performance. Nowadays C, C++, and Rust can all be conveniently compiled to WebAssembly bytecode, so we will not expand on that here.
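To make the band-energy idea concrete, here is a toy sketch (my own illustration, not the extension's algorithm, which runs in WASM): compute DFT magnitudes for one frame, sum them into coarse bands, and emit one bit per adjacent-band energy comparison. Real systems use overlapping frames, log-spaced bands, inter-frame differences, and an FFT rather than a naive DFT.

```javascript
// Toy per-frame band-energy fingerprint in the spirit of Haitsma & Kalker.
function frameFingerprint(samples, numBands = 8) {
  const N = samples.length;
  const bandEnergy = new Array(numBands).fill(0);
  for (let k = 0; k < N / 2; k++) {
    // Naive DFT magnitude for bin k (O(N^2) overall; use an FFT in practice)
    let re = 0, im = 0;
    for (let n = 0; n < N; n++) {
      const phi = (-2 * Math.PI * k * n) / N;
      re += samples[n] * Math.cos(phi);
      im += samples[n] * Math.sin(phi);
    }
    // Accumulate the bin's energy into its coarse frequency band
    const band = Math.floor((k * numBands) / (N / 2));
    bandEnergy[band] += re * re + im * im;
  }
  // One bit per adjacent-band comparison -> a (numBands-1)-bit sub-fingerprint
  let bits = 0;
  for (let b = 0; b < numBands - 1; b++) {
    bits = (bits << 1) | (bandEnergy[b] > bandEnergy[b + 1] ? 1 : 0);
  }
  return bits;
}
```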

Next, when you try to initialize the WASM module inside the extension, you will most likely hit the following exception:

Refused to compile or instantiate WebAssembly module because 'wasm-eval' is not an allowed source of script in the following Content Security Policy directive: "script-src 'self' 'unsafe-inline' 'unsafe-eval' ...

This is because WebAssembly use must follow a strict CSP. In MV2 this could be resolved by appending "content_security_policy": "script-src 'self' 'unsafe-eval';" to the manifest. In MV3, with its tighter privacy and security restrictions, this blunt workaround is no longer allowed: the only values permitted for script-src, object-src, and worker-src are:

  • self
  • none
  • localhost

That is, directives such as unsafe-eval can no longer be declared, so simply running WASM in an extension page is no longer feasible. Does this mean we have reached a dead end? There are always more solutions than problems. Examining the documentation carefully, I found the following statement:

CSP modifications for sandbox have no such new restrictions. — Chrome extension developer documentation

In other words, this security restriction does not exist in sandbox mode. The extension can define a sandboxed page, which cannot access the web/chrome APIs but can run the so-called "unsafe" mechanisms such as eval, new Function, and WebAssembly.instantiate. Therefore, the sandboxed page can be used to load and run the WASM module and return the computed result to the main page. The overall fingerprint-extraction flow then becomes as shown below:
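A sandboxed page is declared in the extension's manifest; a minimal sketch (the file name is illustrative):

```json
{
  "sandbox": {
    "pages": ["sandbox.html"]
  }
}
```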

As for how the main page and the sandboxed page exchange data: the main page can embed the sandboxed page in an iframe and use the iframe's contentWindow to communicate with the main window. The data flow is as follows:
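A sketch of that main-page/sandbox messaging (the function names, message types, and page names here are illustrative assumptions, not the extension's actual code):

```javascript
// Main page: post captured samples into the sandboxed iframe.
function sendSamplesToSandbox(iframe, samples) {
  iframe.contentWindow.postMessage({ type: "fingerprint", samples }, "*");
}

// Inside sandbox.html: run the "unsafe" WASM computation, reply to the sender.
function installSandboxHandler(computeFingerprint) {
  window.addEventListener("message", (event) => {
    if (event.data && event.data.type === "fingerprint") {
      const result = computeFingerprint(event.data.samples);
      event.source.postMessage({ type: "fingerprintResult", result }, "*");
    }
  });
}
```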
At this point the basic audio capture and fingerprint extraction are complete; what remains is matching the fingerprint against the database.

Feature matching

After the fingerprint is extracted, the next step is audio retrieval against the fingerprint database. The database can be implemented as a hash table whose entries record, for each fingerprint, the music IDs and the times at which that fingerprint occurs in them. Looking up the extracted fingerprints in this table yields the matching songs. This is of course only the basic process; the concrete algorithmic optimizations differ greatly between systems and, copyright aside, directly determine the efficiency and accuracy of each match. The implementation in this extension still favors efficiency.
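The hash-table lookup described above can be sketched as follows. This is a toy illustration with an assumed data layout (a Map from sub-fingerprint to (songId, time) pairs), not the actual Cloud Music database; matching votes for a consistent time offset per song, in the style of classic fingerprint retrieval:

```javascript
// index: Map<fingerprint, Array<{songId, time}>>
// queryFps: array of sub-fingerprints, indexed by their time in the query
function matchFingerprints(index, queryFps) {
  const votes = new Map(); // "songId:offset" -> vote count
  queryFps.forEach((fp, queryTime) => {
    for (const { songId, time } of index.get(fp) || []) {
      // Matches from the true song agree on the same time offset
      const key = `${songId}:${time - queryTime}`;
      votes.set(key, (votes.get(key) || 0) + 1);
    }
  });
  let best = null, bestCount = 0;
  for (const [key, count] of votes) {
    if (count > bestCount) {
      bestCount = count;
      best = Number(key.split(":")[0]);
    }
  }
  return best; // songId with the most offset-consistent matches, or null
}
```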

Written at the end

The above roughly describes the overall process of building a song-recognition extension on WebAssembly and MV3. Although extensions are flexible and convenient, Google is well aware of the security and privacy problems they can bring and has pushed a large-scale migration. The MV3 protocol is better for privacy and security, but it also constrains how many features can be implemented; after 2023, a large number of extensions will no longer be usable.

The song-recognition extension's completed features include audio recognition and adding songs to your red-heart playlist, and more features will be added over time. I hope this small tool proves helpful.


This article was published by the NetEase Cloud Music technical team; reprinting in any form without authorization is prohibited. We recruit for technical positions all year round. If you are considering a change and happen to like Cloud Music, contact us at grp.music-fe(at)corp.netease.com!


Source: juejin.im/post/7094083239702659109