How to build an image-to-image search tool using CLIP and Pinecone


In this article, you'll learn hands-on why image-to-image search is a powerful tool for finding similar images in a vector database.

Table of contents

  1. Image-to-image search

  2. CLIP and Pinecone: a brief introduction

  3. Building an image-to-image search engine

  4. Test time: Lord of the Rings

  5. Wait, what if there are 1 million images?

1. Image-to-image search

What does image-to-image search mean?

In a traditional image search engine, typically you use text queries to find images, and the search engine returns results based on keywords related to those images. In image-to-image search, on the other hand, you use an image as the starting point for your query, and the system retrieves images that are visually similar to the query image.

Imagine you have a painting, say a beautiful picture of a sunset. Now you want to find other paintings that look similar, but you can't describe what you're looking for in words. Instead, you show the computer your painting, and it looks through all the paintings it knows about and finds the ones that are most similar, even if they have different names or descriptions. That's image-to-image search in a nutshell.

What can I do with this search tool?

Image-to-image search engines open up a range of exciting possibilities:

  • Find specific data - Search for images containing specific objects you want the model to learn to recognize.

  • Error analysis - When the model misclassifies an object, search for visually similar images it also fails on.

  • Model debugging - Surface other images that contain attributes or defects that cause undesired model behavior.

2. CLIP and Pinecone: a brief introduction

Indexing stage in image-to-image search

The image above shows the steps for indexing an image dataset in a vector database.

  • Step 1: Collect an image dataset (raw/unlabeled images are fine).

  • Step 2: Use CLIP [1], an embedding model, to extract a high-dimensional vector representation of each image that captures its semantic and perceptual features.

  • Step 3: Index these embeddings in a vector database such as Pinecone.


Query phase: Retrieve the most similar image for a given query

At query time, pass a sample image through the same CLIP encoder to get its embedding. Perform a vector similarity search to efficiently find the top k nearest database image vectors. The image with the highest cosine similarity score to the query embedding is returned as the most similar search result.
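Before wiring up Pinecone, it helps to see what a vector similarity search actually computes. Below is a minimal NumPy sketch (not from the original article) of scoring a query embedding against a small set of database embeddings with cosine similarity; Pinecone performs the same kind of search, but at scale and backed by an index.

import numpy as np

def cosine_similarity(query, database):
    """Cosine similarity between one query vector and each row of `database`."""
    query = query / np.linalg.norm(query)
    database = database / np.linalg.norm(database, axis=1, keepdims=True)
    return database @ query

# Toy data: 5 "database" vectors and 1 query vector of dimension 512 (CLIP ViT-B/32 size)
rng = np.random.default_rng(0)
db_embeddings = rng.normal(size=(5, 512))
query_embedding = rng.normal(size=512)

scores = cosine_similarity(query_embedding, db_embeddings)
top_k = np.argsort(scores)[::-1][:3]  # indices of the 3 most similar vectors
print(top_k, scores[top_k])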

3. Building an image-to-image search engine

3.1 Dataset - Lord of the Rings

We used Google Search to collect images related to the keyword "Lord of the Rings movie scenes". The scraping code below builds a function that retrieves up to 100 image URLs for a given query.

import requests, lxml, re, json, urllib.request
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.114 Safari/537.36"
}

params = {
    "q": "the lord of the rings film scenes", # search query
    "tbm": "isch",                # image results
    "hl": "en",                   # language of the search
    "gl": "us",                   # country where search comes from
    "ijn": "0"                    # page number
}

html = requests.get("https://www.google.com/search", params=params, headers=headers, timeout=30)
soup = BeautifulSoup(html.text, "lxml")

def get_images():
    """
    https://kodlogs.com/34776/json-decoder-jsondecodeerror-expecting-property-name-enclosed-in-double-quotes
    if you try to json.loads() without json.dumps() it will throw an error:
    "Expecting property name enclosed in double quotes"
    """
    google_images = []

    all_script_tags = soup.select("script")

    # https://regex101.com/r/48UZhY/4
    matched_images_data = "".join(re.findall(r"AF_initDataCallback\(([^<]+)\);", str(all_script_tags)))

    matched_images_data_fix = json.dumps(matched_images_data)
    matched_images_data_json = json.loads(matched_images_data_fix)

    # https://regex101.com/r/VPz7f2/1
    matched_google_image_data = re.findall(r'\"b-GRID_STATE0\"(.*)sideChannel:\s?{}}', matched_images_data_json)

    # https://regex101.com/r/NnRg27/1
    matched_google_images_thumbnails = ", ".join(
        re.findall(r'\[\"(https\:\/\/encrypted-tbn0\.gstatic\.com\/images\?.*?)\",\d+,\d+\]',
                   str(matched_google_image_data))).split(", ")

    thumbnails = [
        bytes(bytes(thumbnail, "ascii").decode("unicode-escape"), "ascii").decode("unicode-escape") for thumbnail in matched_google_images_thumbnails
    ]

    # removing previously matched thumbnails for easier full resolution image matches.
    removed_matched_google_images_thumbnails = re.sub(
        r'\[\"(https\:\/\/encrypted-tbn0\.gstatic\.com\/images\?.*?)\",\d+,\d+\]', "", str(matched_google_image_data))

    # https://regex101.com/r/fXjfb1/4
    # https://stackoverflow.com/a/19821774/15164646
    matched_google_full_resolution_images = re.findall(r"(?:'|,),\[\"(https:|http.*?)\",\d+,\d+\]", removed_matched_google_images_thumbnails)

    full_res_images = [
        bytes(bytes(img, "ascii").decode("unicode-escape"), "ascii").decode("unicode-escape") for img in matched_google_full_resolution_images
    ]

    return full_res_images
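The embedding step in the next section expects a list of image URLs named list_image_urls. A minimal way to obtain it from the scraper above (note that this scraper depends on Google's current HTML, which changes often):

list_image_urls = get_images()
print(f"Retrieved {len(list_image_urls)} candidate image URLs")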

3.2 Use CLIP to obtain embedding vectors

Extract all embeddings of our image set.
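The function below relies on a CLIP model, processor, device, and a few helpers (check_valid_URL, get_single_image_embedding, and later get_image) that the article does not show. Here is a minimal sketch of how they could look, assuming CLIP ViT-B/32 from Hugging Face transformers (its 512-dimensional image features match the index dimension reported later):

import requests
import torch
from io import BytesIO
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def check_valid_URL(url):
    """Return True if the URL responds and serves an image content type."""
    try:
        response = requests.head(url, timeout=10, allow_redirects=True)
        return response.ok and response.headers.get("Content-Type", "").startswith("image")
    except requests.RequestException:
        return False

def get_single_image_embedding(image, processor, model, device):
    """Return the CLIP image embedding as a (1, 512) NumPy array."""
    inputs = processor(images=image, return_tensors="pt").to(device)
    with torch.no_grad():
        features = model.get_image_features(**inputs)
    return features.cpu().numpy()

def get_image(url):
    """Download an image URL into a PIL image (used again in the query step)."""
    response = requests.get(url, timeout=30)
    return Image.open(BytesIO(response.content)).convert("RGB")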

import requests
from io import BytesIO
from PIL import Image

def get_all_image_embeddings_from_urls(dataset, processor, model, device, num_images=100):
    embeddings = []

    # Limit the number of images to process
    dataset = dataset[:num_images]
    working_urls = []

    for image_url in dataset:
        if check_valid_URL(image_url):
            try:
                # Download the image
                response = requests.get(image_url)
                image = Image.open(BytesIO(response.content)).convert("RGB")
                # Get the embedding for the image
                embedding = get_single_image_embedding(image, processor, model, device)
                embeddings.append(embedding)
                working_urls.append(image_url)
            except Exception as e:
                print(f"Error processing image from {image_url}: {e}")
        else:
            print(f"Invalid or inaccessible image URL: {image_url}")

    return embeddings, working_urls
LOR_embeddings, valid_urls = get_all_image_embeddings_from_urls(list_image_urls, processor, model, device, num_images=100)
# Invalid or inaccessible image URL: https://blog.frame.io/wp-content/uploads/2021/12/lotr-forced-perspective-cart-bilbo-gandalf.jpg
# Invalid or inaccessible image URL: https://www.cineworld.co.uk/static/dam/jcr:9389da12-c1ea-4ef6-9861-d55723e4270e/Screenshot%202020-08-07%20at%2008.48.49.png
# Invalid or inaccessible image URL: https://upload.wikimedia.org/wikipedia/en/3/30/Ringwraithpic.JPG

97 out of 100 URLs contain valid images.

3.3 Storing our embeddings in Pinecone

To store our embeddings in Pinecone [2], we first need to create a Pinecone account. After that, create an index named "image-to-image".

import pinecone

pinecone.init(
   api_key = "YOUR-API-KEY",
   environment="gcp-starter"  # find next to API key in console
)

my_index_name = "image-to-image"
vector_dim = LOR_embeddings[0].shape[1]

if my_index_name not in pinecone.list_indexes():
  print("Index not present")

# Connect to the index
my_index = pinecone.Index(index_name = my_index_name)
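The snippet above only checks whether the index exists. If "image-to-image" has not been created in the Pinecone console yet, it can also be created from code before connecting to it; a sketch using the same pre-v3 pinecone-client API, assuming cosine similarity as the metric:

if my_index_name not in pinecone.list_indexes():
    pinecone.create_index(
        name=my_index_name,
        dimension=vector_dim,  # 512 for CLIP ViT-B/32 embeddings
        metric="cosine",
    )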

Create a function to store data in the Pinecone index.

def create_data_to_upsert_from_urls(dataset, embeddings, num_images):
  metadata = []
  image_IDs = []
  for index in range(num_images):
    metadata.append({
        'ID': index,
        'image': dataset[index]
    })
    image_IDs.append(str(index))
  image_embeddings = [arr.tolist() for arr in embeddings]
  data_to_upsert = list(zip(image_IDs, image_embeddings, metadata))
  return data_to_upsert

Run the function above, then upsert the vectors into the index:

LOR_data_to_upsert = create_data_to_upsert_from_urls(valid_urls, 
                                LOR_embeddings, len(valid_urls))


my_index.upsert(vectors = LOR_data_to_upsert)
# {'upserted_count': 97}


my_index.describe_index_stats()
# {'dimension': 512,
# 'index_fullness': 0.00097,
# 'namespaces': {'': {'vector_count': 97}},
# 'total_vector_count': 97}
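A single upsert call works fine for 97 vectors, but larger datasets should be written in batches to keep each request small. A minimal sketch (the batch size of 100 is a common choice, not a value from the article):

def upsert_in_batches(index, data, batch_size=100):
    """Upsert (id, vector, metadata) tuples in chunks to avoid oversized requests."""
    for start in range(0, len(data), batch_size):
        index.upsert(vectors=data[start:start + batch_size])

upsert_in_batches(my_index, LOR_data_to_upsert)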

3.4 Testing our image-to-image search tool

import random

# For a random image
n = random.randint(0, len(valid_urls) - 1)
print(f"Sample image with index {n} in {valid_urls[n]}")
# Sample image with index 47 in
# https://www.intofilm.org/intofilm-production/scaledcropped/870x489https%3A/s3-eu-west-1.amazonaws.com/images.cdn.filmclub.org/film__3930-the-lord-of-the-rings-the-fellowship-of-the-ring--hi_res-a207bd11.jpg/film__3930-the-lord-of-the-rings-the-fellowship-of-the-ring--hi_res-a207bd11.jpg


Sample image used in the query (can be found at the above URL) 

# 1. Get the image from url
LOR_image_query = get_image(valid_urls[n])
# 2. Obtain embeddings (via CLIP) for the given image
LOR_query_embedding = get_single_image_embedding(LOR_image_query, processor, model, device).tolist()
# 3. Search on Vector DB index for similar images to "LOR_query_embedding"
LOR_results = my_index.query(LOR_query_embedding, top_k=3, include_metadata=True)
# 4. See the results
plot_top_matches_seaborn(LOR_results)


Results showing the similarity score for each match

As shown, our image-to-image search tool found images similar to the given sample image; as expected, ID 47 (the query image itself, which is part of the indexed set) has the highest similarity score.
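The plotting helper plot_top_matches_seaborn is not defined in the article. A minimal sketch, assuming the Pinecone response exposes a matches list whose entries carry an id, a score, and the image URL we stored in metadata:

import matplotlib.pyplot as plt
import requests
import seaborn as sns
from io import BytesIO
from PIL import Image

def plot_top_matches_seaborn(results, figsize=(15, 5)):
    """Show the top-k matched images side by side, with IDs and similarity scores as titles."""
    sns.set_style("white")
    matches = results["matches"]
    fig, axes = plt.subplots(1, len(matches), figsize=figsize)
    for ax, match in zip(axes, matches):
        response = requests.get(match["metadata"]["image"], timeout=30)
        ax.imshow(Image.open(BytesIO(response.content)).convert("RGB"))
        ax.set_title(f"ID {match['id']} - score {match['score']:.3f}")
        ax.axis("off")
    plt.tight_layout()
    plt.show()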

5. Wait, what if there are 1 million images?

As you may have realized, it's fun to build an image-to-image search tool by querying some images from Google Search. But what if you actually have a dataset with over 1 million images?

In this case, you might be more inclined to build a system rather than a tool. Building a scalable system is no easy task, and there are costs involved (e.g. storage, maintenance, writing the actual code). For these cases, at Tenyks we built an optimal image-to-image search engine that can perform multi-modal queries in seconds, even if you have 1 million images or more!

