Elasticsearch: Music Information Retrieval Using Vector Search

By Alex Salgado

Welcome to the future of music information retrieval, where machine learning, vector databases, and audio data analysis come together to open exciting new possibilities! If you are interested in the field of music data analysis, or are just passionate about how technology is revolutionizing the music industry, then this guide is for you.

Here, we'll take you on a journey to search music data using vector search methods. Since more than 80% of the world's data is unstructured, it's good to know how to deal with different types of data besides text.

If you want to follow along with the code as you read, visit the files on GitHub listed at the end of this article. We use the following command to clone the code:

git clone https://github.com/liu-xiao-guo/music-search

Architecture

Imagine being able to hum the tune of a song you're trying to recall and having that song appear on the screen. With the necessary effort and data-model adjustments, that's exactly what we will do today.

To achieve our result, we will create a schema that looks like this:

Embeddings play the main role here: we will use the audio embeddings generated by the model as the search keys in vector search.

Install

If you have not installed your own Elasticsearch and Kibana yet, please refer to my previous articles to install them:

When installing, please follow the installation guide for Elastic Stack 8.x. In the exercises below, I will use the latest Elastic Stack 8.9.0 for the demonstration.

How to generate audio embeddings

At the heart of generating embeddings are models that are trained on millions of examples to provide more relevant and accurate results. For audio, these models can be trained on large amounts of audio data. The output of such a model is a dense numerical representation of the audio (i.e., an audio embedding). This high-dimensional vector captures key features of an audio clip, allowing similarity computation and efficient search in the embedding space.

For this work, we will use librosa (an open-source Python package) to generate audio embeddings. This usually involves extracting meaningful features from audio files, such as Mel-frequency cepstral coefficients (MFCCs), chroma features, and Mel-scaled spectrogram features. So, how do we implement audio search with Elasticsearch®?

Step 1: Create an index to store audio data

First, we need to create an index in Elasticsearch before populating the vector database with music data. For simplicity, we will run Python code in a Jupyter notebook.

1.1 Create our audio dataset index

Now that we've established the connection details, let's create an index for storing audio information. We use Jupyter Notebook to open the elastic_music_search.ipynb file.

!pip install elasticsearch
!pip install Config

Above we install the necessary Python libraries. For details on connecting to Elasticsearch from Python, see "Elasticsearch: Everything you need to know about using Elasticsearch in Python - 8.x". We modify the file simple.cfg in the downloaded code:

simple.cfg

ES_PASSWORD: "p1k6cT4a4bF+pFYf37Xx"
ES_FINGERPRINT: "633bf7f6e4bf264e6a05d488af3c686b858fa63592dc83999a0d77f7e9fe5940"

The ES_PASSWORD above is the password displayed when Elasticsearch started for the first time, and the value of ES_FINGERPRINT is the fingerprint of http_ca.crt, which is also shown when Elasticsearch starts up for the first time. If you can no longer find this output, you can refer to the article "Elasticsearch: Everything you need to know about using Elasticsearch in Python - 8.x" to learn how to obtain it. Another relatively simple method is to look in the config/kibana.yml file.
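
If you still have the http_ca.crt file, you can also compute the fingerprint yourself. Below is a minimal sketch; the certificate path is only an example and depends on where your Elasticsearch installation keeps the file:

import ssl
import hashlib

# Read the CA certificate (adjust the path to your own installation).
with open("/Users/liuxg/elastic/elasticsearch-8.9.0/config/certs/http_ca.crt", "rb") as f:
    pem = f.read().decode()

# Convert PEM to DER and print the SHA-256 fingerprint expected by the Python client.
der = ssl.PEM_cert_to_DER_cert(pem)
print(hashlib.sha256(der).hexdigest())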

# Connect to Elasticsearch
from elasticsearch import Elasticsearch
from config import Config
 
with open('simple.cfg') as f:
    cfg = Config(f)
 
print(cfg['ES_FINGERPRINT'])
print(cfg['ES_PASSWORD'])
 
es = Elasticsearch(
    'https://localhost:9200',
    ssl_assert_fingerprint = cfg['ES_FINGERPRINT'],
    basic_auth=('elastic', cfg['ES_PASSWORD'])
)
 
es.info()

If the code above runs and es.info() returns the cluster information, our Python code has connected to Elasticsearch successfully.

Next, we create an index called my-audio-index:

index_name = "my-audio-index"

if es.indices.exists(index=index_name):
    print("The index already exists, going to remove it")
    es.options(ignore_status=404).indices.delete(index=index_name)

# Specify index configuration
mappings = {
    "_source": {
      "excludes": ["audio-embedding"]
    },
    "properties": {
      "audio-embedding": {
        "type": "dense_vector",
        "dims": 2048,
        "index": True,
        "similarity": "cosine"
      },
      "path": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "timestamp": {
        "type": "date"
      },
      "title": {
        "type": "text"
      },
      "genre": {
        "type": "text"
      }
    }
}

# Create index
if not es.indices.exists(index=index_name):
    index_creation = es.options(ignore_status=400).indices.create(index=index_name, mappings = mappings)
    print("index created: ", index_creation)
else:
    print("Index  already exists.")

The provided Python code creates an index with a specific configuration using the Elasticsearch Python client. The purpose of this index is to provide a structure that allows search operations on dense vector fields, which are often used to store vector representations or embeddings of certain entities (such as audio files in this case).

The mappings object defines the mapping properties for this index, including the audio-embedding, path, timestamp, and title fields. The audio-embedding field is specified as a dense_vector type with 2048 dimensions and is indexed with cosine similarity, which determines how the distance between vectors is computed during search. The path field stores the path to the audio file. Note that to accommodate an embedding dimension of 2048, you need to be using Elasticsearch 8.8.0 or later.

The script then checks to see if the index exists in the Elasticsearch instance. If the index does not exist, it will create a new one using the specified configuration. This type of indexing configuration can be used in scenarios such as audio search, where audio files are converted into vector representations for indexing and subsequent similarity-based retrieval.
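
As an optional sanity check (not part of the original notebook), we can fetch the mapping back from Elasticsearch and confirm that the dense_vector field was created as expected:

# Optional check: retrieve the mapping and inspect the audio-embedding field.
mapping = es.indices.get_mapping(index=index_name)
print(mapping[index_name]["mappings"]["properties"]["audio-embedding"])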

Step 2: Populate Elasticsearch with audio data

At the end of this step, you will have an index populated with audio data that serves as our data store. Before we can search, we first need to populate the database.

2.1 Select audio data to ingest

Many audio datasets have specific goals. For our example, I'll be utilizing the files generated on the Google MusicLM page, specifically from the text and melody conditioning sections. Put the *.wav audio files in a specific directory - in this example I chose /Users/liuxg/python/music-search/dataset.

$ pwd
/Users/liuxg/python/music-search/dataset
$ ls
a-cappella-chorus.wav                        bella_ciao_tribal-drums-and-flute.wav
bella_ciao_a-cappella-chorus.wav             mozart_symphony25_electronic-synth-lead.wav
bella_ciao_electronic-synth-lead.wav         mozart_symphony25_guitar-solo.wav
bella_ciao_guitar-solo.wav                   mozart_symphony25_jazz-with-saxophone.wav
bella_ciao_humming.wav                       mozart_symphony25_opera-singer.wav
bella_ciao_jazz-with-saxophone.wav           mozart_symphony25_piano-solo.wav
bella_ciao_opera-singer.wav                  mozart_symphony25_prompt.wav
bella_ciao_piano-solo.wav                    mozart_symphony25_string-quartet.wav
bella_ciao_string-quartet.wav                mozart_symphony25_tribal-drums-and-flute.wav

import os

def list_audio_files(directory):
    # The list to store the names of .wav files
    audio_files = []

    # Check if the path exists
    if os.path.exists(directory):
        # Walk the directory
        for root, dirs, files in os.walk(directory):
            for file in files:
                # Check if the file is a .wav file
                if file.endswith('.wav'):
                    # Extract the filename from the path
                    filename = os.path.splitext(file)[0]
                    print(filename)

                    # Add the file to the list
                    audio_files.append(file)
    else:
        print(f"The directory '{directory}' does not exist.")

    # Return the list of .wav files
    return audio_files

# Use the function
audio_path = "/Users/liuxg/python/music-search/dataset"
audio_files = list_audio_files(audio_path)

The code defines a function called list_audio_files that takes a directory as an argument. The purpose of this function is to iterate through the provided directory and its subdirectories, looking for audio files with the extension ".wav". If you need to support .mp3 files as well, you need to modify this function (see the sketch below).
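
For illustration only, a small variant of the function (a hypothetical helper named list_audio_files_multi, not part of the original code) could accept several extensions:

import os

def list_audio_files_multi(directory, extensions=('.wav', '.mp3')):
    # Like list_audio_files, but accepts any of the given audio extensions.
    audio_files = []

    if not os.path.exists(directory):
        print(f"The directory '{directory}' does not exist.")
        return audio_files

    for root, dirs, files in os.walk(directory):
        for file in files:
            # Case-insensitive check against every allowed extension
            if file.lower().endswith(extensions):
                audio_files.append(file)

    return audio_files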

2.2 The Power of Vector Search Embeddings

This step is where the magic happens. Vector similarity search is a mechanism for storing, retrieving, and searching vectors based on their similarity to a given query, and it is commonly used in applications such as image retrieval, natural language processing, and recommender systems. The concept has become widespread with the rise of deep learning and the use of embeddings to represent data. Essentially, embeddings are vector representations of high-dimensional data.

The basic idea is to represent data items (e.g. images, documents, user profiles) as vectors in a high-dimensional space. Then, use a distance metric such as cosine similarity or Euclidean distance to measure the similarity between vectors and return the most similar vectors as search results. While text embeddings are extracted using linguistic features, audio embeddings are often generated using spectrograms or other audio signal features.

The process of creating embeddings for text and audio data involves converting the data into vectors using feature extraction or embedding techniques, and then indexing these vectors in a vector search database.
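
As a toy illustration (not part of the search pipeline), here is what the two metrics mentioned above look like for a pair of small vectors standing in for embeddings:

import numpy as np

# Two tiny example vectors standing in for embeddings.
a = np.array([0.9, 0.1, 0.3])
b = np.array([0.8, 0.2, 0.4])

# Cosine similarity: 1.0 means the vectors point in the same direction.
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Euclidean distance: smaller values mean the vectors are closer in the embedding space.
euclidean = np.linalg.norm(a - b)

print(cosine, euclidean)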

2.2.1 Extracting audio features

The next step involves analyzing our audio files and extracting meaningful features. This step is crucial as it helps the machine learning model understand and learn from our audio data.

The process of extracting features from a spectrogram is a crucial step in the context of machine learning for audio signal processing. A spectrogram is a visual representation of the frequency content of an audio signal over time. The characteristics identified in this case cover three specific types:

  • Mel-frequency cepstral coefficients (MFCCs): MFCCs are coefficients that capture the spectral characteristics of an audio signal in a manner closely related to human auditory perception.
  • Chroma features: Chroma features represent the 12 different pitch classes of the musical octave and are particularly useful in music-related tasks.
  • Spectral contrast: Spectral contrast focuses on the perceived brightness of different frequency bands in an audio signal.

By analyzing and comparing the effectiveness of these feature sets on real-world audio files, researchers and practitioners can gain insight into their applicability to various audio-based machine learning applications, such as audio classification and analysis.

  • First, we need to convert the audio file into a format suitable for analysis. Libraries such as librosa in Python can help with this conversion, turning audio files into spectrograms (a minimal sketch follows after this list).
  • Next, we will extract features from these spectrograms.
  • We will then save these features and send them as input to the machine learning model.
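
As a sketch only, this is roughly what extracting such features with librosa might look like; it is not the pipeline we use below (we generate embeddings with panns_inference instead), and the file path is just one example from the dataset directory:

import librosa

# Example file from the dataset directory used in this article.
audio_file = "/Users/liuxg/python/music-search/dataset/bella_ciao_humming.wav"

# Load the waveform; librosa resamples to 22050 Hz by default.
y, sr = librosa.load(audio_file)

# Mel-frequency cepstral coefficients (MFCCs)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

# Chroma features (12 pitch classes)
chroma = librosa.feature.chroma_stft(y=y, sr=sr)

# Spectral contrast
contrast = librosa.feature.spectral_contrast(y=y, sr=sr)

print(mfcc.shape, chroma.shape, contrast.shape)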

We use panns_inference, a Python library designed for audio tagging and sound event detection tasks. The models used in this library are trained with PANNs (large-scale pretrained audio neural networks), an approach to audio pattern recognition.

!pip install -qU panns-inference librosa
from panns_inference import AudioTagging

# load the default model into the gpu.
model = AudioTagging(checkpoint_path=None, device='cuda') # change device to cpu if a gpu is not available

Note: It may take several minutes to download the PANNs inference model.

import numpy as np
import librosa

# Function to normalize a vector. Normalizing a vector means adjusting the values measured in different scales to a common scale.
def normalize(v):
   # np.linalg.norm computes the vector's norm (magnitude), i.e. the length of the vector.
   norm = np.linalg.norm(v)
   if norm == 0:
        return v

   # Return the normalized vector.
   return v / norm

# Function to get an embedding of an audio file. An embedding is a reduced-dimensionality representation of the file.
def get_embedding(audio_file):

  # Load the audio file using librosa's load function, which returns an audio time series and its corresponding sample rate.
  a, _ = librosa.load(audio_file, sr=44100)

  # Reshape the audio time series to have an extra dimension, which is required by the model's inference function.
  query_audio = a[None, :]

  # Perform inference on the reshaped audio using the model. This returns an embedding of the audio.
  _, emb = model.inference(query_audio)

  # Normalize the embedding. This scales the embedding to have a length (magnitude) of 1, while maintaining its direction.
  normalized_v = normalize(emb[0])

  # Return the normalized embedding. With unit-length vectors, cosine similarity and dot product produce the same ranking in Elasticsearch.
  return normalized_v
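
Before indexing anything, it can be worth confirming that the embedding size matches the "dims": 2048 we declared in the mapping (the PANNs model used here outputs 2048-dimensional embeddings). A quick optional check, using one file from the dataset as an example:

# Quick check that one embedding has the dimensionality declared in the index mapping.
sample_emb = get_embedding("/Users/liuxg/python/music-search/dataset/bella_ciao_humming.wav")
print(sample_emb.shape)  # expected: (2048,)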

2.3 Insert audio data into Elasticsearch

We now have everything we need to insert audio data into our Elasticsearch index.

from datetime import datetime

#Storing Songs in Elasticsearch with Vector Embeddings:
def store_in_elasticsearch(song, embedding, path, index_name, genre, vec_field):
  body = {
      'audio-embedding' : embedding,
      'title': song,
      'timestamp': datetime.now(),
      'path' : path,
      'genre' : genre

  }

  es.index(index=index_name, document=body)
  print ("stored...",song, embedding, path, genre, index_name)

# Initialize a list of genres for the test (each entry is matched against the filenames)
genre_lst = ['jazz', 'opera', 'piano', 'prompt', 'humming', 'string', 'cappella', 'electronic', 'guitar']

for filename in audio_files:
  audio_file = audio_path + "/" + filename

  emb = get_embedding(audio_file)

  song = filename.lower()

  # Pick the first genre from the list that appears in the song's filename, or fall back to "generic"
  genre = next((g for g in genre_lst if g in song), "generic")

  store_in_elasticsearch(song, emb, audio_file, index_name, genre, 2 )

2.4 Visualizing the results in Kibana

At this point, we can check the indexing using the audio data embedded in the audio embedding dense vector field. Kibana® Dev Tools, especially the console functionality, is a powerful interface for interacting with Elasticsearch clusters. It provides a way to send RESTful commands directly to Elasticsearch and view the results in a user-friendly format.

One thing to note is that we excluded the audio-embedding field from _source (this is defined in the mappings via "excludes"). Because the embedding is relatively large, excluding it saves storage space.
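
If you prefer to check from the notebook instead of Kibana, a minimal sketch along these lines retrieves one document and shows that audio-embedding is indeed absent from _source:

# Fetch a single indexed document to confirm the ingestion worked.
resp = es.search(index=index_name, size=1)
print("total docs:", resp['hits']['total']['value'])
print(resp['hits']['hits'][0]['_source'])  # no 'audio-embedding' key, since it is excluded from _source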

Step 3: Search by music

You can now perform a vector similarity search using the generated embeddings. When you feed the system an input song, it converts the song into an embedding, searches the database for similar embeddings, and returns songs with similar characteristics.

# Define a function to query audio vector in Elasticsearch
def query_audio_vector(es, emb, field_key, index_name):
    # Initialize the query structure
    # It's a bool filter query that checks if the field exists
    query = {
        "bool": {
            "filter": [{
                "exists": {
                    "field": field_key
                }
            }]
        }
    }

    # KNN search parameters
    # field is the name of the field to perform the search on
    # k is the number of nearest neighbors to find
    # num_candidates is the number of candidates to consider (more means slower but potentially more accurate results)
    # query_vector is the vector to find nearest neighbors for
    # boost is the multiplier for scores (higher means this match is considered more important)
    knn = {
        "field": field_key,
        "k": 2,
        "num_candidates": 100,
        "query_vector": emb,
        "boost": 100
    }

    # The fields to retrieve from the matching documents
    fields = ["title", "path", "genre", "body_content", "url"]

    # The name of the index to search
    index = index_name

    # Perform the search
    # index is the name of the index to search
    # query is the query to use to find matching documents
    # knn is the parameters for KNN search
    # fields is the fields to retrieve from the matching documents
    # size is the maximum number of matches to return
    # source is whether to include the source document in the results
    resp = es.search(index=index,
                     query=query,
                     knn=knn,
                     fields=fields,
                     size=5,
                     source=False)

    # Return the search results
    return resp

Let's start with the fun part!

3.1 Select the music to search

In the code below, we select a piece of music directly from the dataset directory and play it in Jupyter using the Audio widget.

# Import necessary modules for audio display from IPython
from IPython.display import Audio, display

# Provide the path of the audio file
my_audio = "/Users/liuxg/python/music-search/dataset/bella_ciao_humming.wav"

# Display the audio file in the notebook
Audio(my_audio)

You can play the music by clicking the "Play" button.

3.2 Search music

Now, let's run a piece of code to search Elasticsearch for the music in my_audio. We will use only the audio file for our search.

audio_file = "/Users/liuxg/python/music-search/dataset/bella_ciao_humming.wav"
# Generate the embedding vector from the provided audio file
# 'get_embedding' is the function defined earlier that converts the audio file into a numerical vector
emb = get_embedding(audio_file)

# Query the Elasticsearch instance 'es' with the embedding vector 'emb', field key 'audio-embedding',
# and index name 'my-audio-index'
# 'query_audio_vector' is a function that performs a search in Elasticsearch using a vector embedding.
# 'tolist()' method is used to convert numpy array to python list if 'emb' is a numpy array.
resp = query_audio_vector(es, emb.tolist(), "audio-embedding", "my-audio-index")
resp['hits']

Elasticsearch will return all music similar to your query song:

NUM_MUSIC = 5  # example value

for i in range(NUM_MUSIC):
    path = resp['hits']['hits'][i]['fields']['path'][0]
    print(path)
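
As a small optional variation of the loop above, you could also print each hit's relevance score and title to see how close the matches are:

# Print score, title, and path for each hit to inspect the ranking.
for hit in resp['hits']['hits']:
    print(round(hit['_score'], 3), hit['fields']['title'][0], hit['fields']['path'][0])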

Some code to help play the result:

Audio("/Users/liuxg/python/music-search/dataset/bella_ciao_opera-singer.wav")

Now, you can check the result by clicking the "Play" button.

3.3 Analysis results

So, can I deploy this code in production and sell my app? No. Like any other machine learning model, PANNs (pretrained audio neural networks) require more data and additional fine-tuning to be effectively applied to real-world scenarios.

This is evident from the embedding visualizations for our 18 song samples, which can lead to false positives with the kNN approach. However, data engineers still face a significant challenge: identifying the best model for query-by-humming. This is a fascinating intersection of machine learning and auditory cognition, requiring rigorous research and innovative problem solving.

3.4 Use a UI to improve the POC (optional)

After a little modification, I copied and pasted the entire code into Streamlit. Streamlit is a Python library that simplifies the process of creating interactive web applications for data science and machine learning projects. It allows novices to easily convert data scripts into shareable web applications without extensive web development knowledge.
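
The app itself is not reproduced here, but a minimal Streamlit sketch of the same idea might look roughly like the following. It reuses get_embedding, query_audio_vector, and the es client from the notebook; the file name app.py and all UI details are assumptions, not the author's actual code:

# app.py - minimal sketch; assumes get_embedding, query_audio_vector, and es are available from the notebook code
import tempfile
import streamlit as st

st.title("Music search with Elasticsearch")

# Let the user upload a short audio clip (for example, a hummed melody).
uploaded = st.file_uploader("Upload a .wav clip", type=["wav"])

if uploaded is not None:
    audio_bytes = uploaded.read()
    st.audio(audio_bytes, format="audio/wav")

    # Write the upload to a temporary file so the PANNs model can read it from disk.
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
        tmp.write(audio_bytes)
        tmp_path = tmp.name

    # Embed the clip and query Elasticsearch, exactly as in the notebook.
    emb = get_embedding(tmp_path)
    resp = query_audio_vector(es, emb.tolist(), "audio-embedding", "my-audio-index")

    st.subheader("Closest matches")
    for hit in resp['hits']['hits']:
        st.write(hit['fields']['title'][0])
        st.audio(hit['fields']['path'][0])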

The result is this app:

A window into the future of audio search

We have successfully implemented a music search system in Python using Elasticsearch vectors. This is a starting point for the field of audio search and may inspire more innovative concepts by leveraging this architectural approach. By changing the model, different applications can be developed. Also, porting inference to Elasticsearch may improve performance. Visit Elastic's machine learning page to learn more.

This demonstrates the great potential and adaptability of this technique for various search applications beyond text.

All the code can be found in a single file, elastic-music_search.ipynb, on GitHub.

Original article: Searching by music: Leveraging vector search for audio information retrieval | Elastic Blog

Origin: blog.csdn.net/UbuntuTouch/article/details/132576443