Object localization using CNN-based localizers

Introduction

Object localization is the task of precisely identifying and locating objects of interest in images. It plays a vital role in computer vision applications, enabling tasks such as object detection, tracking, and segmentation. With CNN-based localizers, object localization amounts to training a convolutional neural network to predict the coordinates of a bounding box surrounding an object.

The localization pipeline usually consists of two steps: a backbone CNN extracts image features, and a regression head predicts the bounding box coordinates.

Learning objectives

  • Learn the basics of convolutional neural networks (CNNs).

  • Understand how CNN architectures are adapted for localization models.

  • Implement a localizer architecture on top of a pre-trained CNN model.

Table of contents

  • Introduction

  • Convolutional Neural Networks (CNNs)

  • CNN-based localizer architecture

  • Understanding the model architecture

  • Training the localizer

    • Import the necessary libraries

    • Build components

    • Build the model

    • Download the dataset

    • Generate batches of data

    • Load and create datasets

    • Loss functions and performance metrics

    • Optimizer and learning rate scheduler

    • Training loop

    • Prediction

    • Output

  • Conclusion

Convolutional Neural Networks (CNNs)


Convolutional Neural Networks (CNNs) are a class of deep learning models for image analysis.

Their architecture consists of an input layer that receives image data, followed by convolutional layers that use convolutional filters to learn and extract features. The activation function introduces non-linearity, while the pooling layer reduces the spatial dimension. The last fully connected layer makes the final prediction.
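
As a concrete illustration of this layer stack, here is a minimal Keras sketch (the 32×32 input size, filter counts, and 10-class output are illustrative assumptions, not the localizer built later in this article):

from tensorflow.keras import layers, models

toy_cnn = models.Sequential([
    layers.Conv2D(16, 3, activation='relu', input_shape=(32, 32, 3)),  # convolutional filters + non-linearity
    layers.MaxPooling2D(2),                                            # pooling reduces the spatial dimensions
    layers.Conv2D(32, 3, activation='relu'),
    layers.MaxPooling2D(2),
    layers.Flatten(),
    layers.Dense(10, activation='softmax')                             # fully connected layer makes the prediction
], name='toy_cnn')

toy_cnn.summary()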

CNNs learn hierarchical features, starting from low-level features such as edges, and gradually progressing to complex and abstract features such as shapes and object combinations.

During the training phase of a CNN, the network learns to automatically identify and extract features at different levels. Initial layers capture low-level features such as edges, corners, and textures, while deeper layers learn more complex and abstract features such as shapes, object parts, and object combinations. The hierarchical structure of CNNs enables them to learn representations that are increasingly insensitive to changes in translation, scaling, rotation, and other image transformations.

"Increasingly insensitive representations" means that, as the network gets deeper, the learned feature representations become more stable and invariant to image transformations: the network can still effectively extract and identify the important features even when the image is translated, scaled, or rotated.

CNN-based localizer architecture

The CNN-based localizer model for object localization consists of 3 components:

1. CNN backbone network

Choose a standard CNN architecture (e.g., ResNet-18, ResNet-50, VGG) pre-trained on the ImageNet classification task and fine-tune it. The backbone can be augmented with additional convolutional layers to further reduce the feature map size.

2. Vectorizer

The output of the CNN backbone is a 3D tensor, but the final output of the localizer is a 1D vector with four values corresponding to the bounding box coordinates. To convert the 3D tensor into a vector, we use a vectorizer such as global average pooling, or a Flatten layer as an alternative.

3. Regression head

We build a fully connected regression head specifically for this task. The feature vector obtained from the backbone is fed into the regression head, whose final layer has 4 nodes corresponding to (x1, y1, x2, y2) or any other equivalent bounding box representation.

Understanding the model architecture

[Figure: a typical CNN-based localizer architecture (backbone, vectorizer, and regression head)]

The figure shows a common CNN-based localizer model architecture. In short, the CNN backbone receives RGB images and then generates feature maps. We then use a flattening layer or a global average pooling layer to form a 1D feature vector. The fully connected regression head receives feature vectors and gives predictions.

When a Flatten layer is used to convert the feature maps from the CNN backbone into a vector, the network requires a fixed input image size. When an adaptive layer such as GAP (Global Average Pooling) is used instead, no resizing of the input image is required.
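
As a quick illustration of that difference, the sketch below (with arbitrary feature-map sizes chosen for the example) shows that GAP produces the same vector length regardless of the spatial size of the feature map, while Flatten does not:

import tensorflow as tf
from tensorflow.keras import layers

# Two feature maps with different spatial sizes but the same channel depth.
fmap_small = tf.random.normal((1, 5, 5, 1024))
fmap_large = tf.random.normal((1, 10, 10, 1024))

gap = layers.GlobalAveragePooling2D()
flat = layers.Flatten()

print(gap(fmap_small).shape, gap(fmap_large).shape)    # (1, 1024) and (1, 1024)
print(flat(fmap_small).shape, flat(fmap_large).shape)  # (1, 25600) and (1, 102400)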

Training the localizer

Import the necessary libraries

import ast
import math
import os

import cv2
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf

from functools import partial

from tensorflow.data import Dataset
from tensorflow.keras.applications import ResNet50
from tensorflow.keras import layers, losses, models, optimizers, utils

Build components

The architecture takes an input image of size 300×300 with 3 color channels.

  • The backbone processes images and extracts high-level features.

  • The vectorizer then computes a fixed-length vector representation of these features.

  • Finally, the regression head takes this vector and performs regression, outputting a 4-dimensional vector as the final prediction.

IMG_SHAPE = (300, 300)

backbone = models.Sequential([
    ResNet50(include_top=False,
             weights='imagenet',
             input_shape=IMG_SHAPE + (3,)),
    layers.Conv2D(1024, 3, 2, activation='relu'),  # extra strided conv layer reduces the feature map size
], name='backbone')

vectorizer = layers.GlobalAveragePooling2D(name='GAP_vectorizer')

regression_head = models.Sequential([
    layers.Dense(512, activation='relu'),
    layers.Dense(4)
], name='regression_head')

Build the model

We define the complete model by combining the previously defined components: backbone, vectorizer, and regression head.

bbox_regressor = models.Sequential([
    backbone,
    vectorizer,
    regression_head
])

bbox_regressor.summary()

utils.plot_model(bbox_regressor, "localizer.png", show_shapes=True)
[Figure: localizer.png, the model diagram produced by plot_model]

Download the dataset

We work with the Selfie dataset, which contains 46,836 selfie images. We use Haar cascades to generate face bounding boxes, and a CSV file is provided that contains image paths and bounding box coordinates for approximately 22K images.

The dataset is available at:

https://www.crcv.ucf.edu/data/Selfie/Selfie-dataset.tar.gz
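
If you prefer to fetch the archive from a script, a minimal sketch is shown below (assuming the URL above is reachable and the archive extracts into the Selfie-dataset/ folder used later):

import tarfile
import urllib.request

url = 'https://www.crcv.ucf.edu/data/Selfie/Selfie-dataset.tar.gz'
archive_path = 'Selfie-dataset.tar.gz'

urllib.request.urlretrieve(url, archive_path)  # download the archive
with tarfile.open(archive_path, 'r:gz') as tar:
    tar.extractall('.')                        # expected to create the Selfie-dataset/ directory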

Generate batches of data

The DataGenerator class is responsible for loading and preprocessing the data for the localization task.

  • It takes as input a directory of images and a CSV file with image paths and bounding box information (an example row is shown after this list).

  • The class divides the data into training and testing subsets based on the provided split fraction.

  • During generation, the class preprocesses each image by resizing it, converting the color channels, and normalizing the pixel values.

  • The bounding box coordinates are normalized accordingly, using the original image dimensions.

The generator yields a preprocessed image and the corresponding bounding box for each data sample.
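
The class below unpacks each CSV row into a relative image path and a bounding-box string, which it parses with ast.literal_eval. A row is therefore expected to look roughly like the hypothetical example below (the column names and file name are illustrative; only the two-column layout and the value formats matter to the generator):

image_path,bbox
images/selfie_0001.jpg,"[34, 51, 210, 298]"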

class DataGenerator(object):
    def __init__(self, img_dir, _csv_path, train_max=0.8, test_min=0.9, target_shape=(300, 300)):
        # Store each constructor argument (except self and underscore-prefixed names) as an attribute.
        for k, v in locals().items():
            if k != "self" and not k.startswith("_"):
                setattr(self, k, v)
        
        self.df = pd.read_csv(_csv_path)
        
    def __len__(self):
        return len(self.df)
        
    def generate(self, phase):
        assert phase in [None, 'train', 'test']
        _df = self.divide_data(phase)

        for rel_img_path, bbox in _df.values:
            img, bbox = self.preprocess_data(rel_img_path, bbox)
            img = tf.constant(img, dtype=tf.float32)
            bbox = tf.constant(bbox, dtype=tf.float32)
            yield img, bbox

    def preprocess_data(self, rel_img_path, bbox):
        # The bounding box is stored in the CSV as a string such as "[x1, y1, x2, y2]".
        bbox = np.array(ast.literal_eval(bbox))

        img_path = os.path.join(self.img_dir, rel_img_path)

        img = cv2.imread(img_path)
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
        _h, _w, _ = img.shape
        img = cv2.resize(img, self.target_shape)
        img = img.astype(np.float32) / 127.0 - 1  # scale pixel values to roughly [-1, 1]

        # Normalize the box coordinates by the original image size so they lie in [0, 1].
        bbox = bbox / np.array([_w, _h, _w, _h])

        return img, bbox

    def divide_data(self, phase):
        train_max = int(self.train_max * len(self.df))
        
        _df = None
        
        if phase is None:
            _df = self.df
        elif phase == 'train':
            _df = self.df.iloc[:train_max, :].sample(frac=1)
        else:
            _df = self.df.iloc[train_max:, :]
            
        return _df

Load and create datasets

We use the DataGenerator class together with TensorFlow's Dataset API to create the training and test datasets.

  • The training dataset is generated by calling the "generate" method of the DataGenerator instance with the "train" phase.

  • The test dataset is generated the same way with the "test" phase.

  • Both datasets are shuffled and batched with a batch size of 16.

The resulting train_dataset and test_dataset are TensorFlow Dataset objects ready for further processing or training the model.

IMG_DIR = 'Selfie-dataset/images'
CSV_PATH = '3-lv1-8-4-selfies_dataset.csv'

BATCH_SIZE = 16

dataset_generator = DataGenerator(IMG_DIR, CSV_PATH)
# Number of training samples; matches DataGenerator's default train_max=0.8 split
# and is used later only to print training progress.
train_max = int(len(dataset_generator) * 0.8)

train_dataset = Dataset.from_generator(
    partial(dataset_generator.generate, phase='train'),
    output_types=(tf.float32, tf.float32),
    output_shapes=(IMG_SHAPE + (3,), (4,)))

train_dataset = train_dataset.shuffle(buffer_size=2 * BATCH_SIZE).batch(BATCH_SIZE)

test_dataset = Dataset.from_generator(
    partial(dataset_generator.generate, phase='test'),
    output_types=(tf.float32, tf.float32),
    output_shapes=(IMG_SHAPE + (3,), (4,)))

test_dataset = test_dataset.shuffle(buffer_size=2 * BATCH_SIZE).batch(BATCH_SIZE)
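
Before training, it is worth pulling a single batch to confirm the shapes (a quick sanity check, assuming the dataset paths above are correct):

sample_imgs, sample_boxes = next(iter(train_dataset))
print(sample_imgs.shape)   # (16, 300, 300, 3) -- a batch of normalized images
print(sample_boxes.shape)  # (16, 4)           -- a batch of normalized (x1, y1, x2, y2) boxes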

Loss functions and performance metrics

Several regression loss functions are available for training bounding box localizers. Regression losses such as MSE and smooth L1 are used just as in other regression tasks, applied between the ground-truth bounding box vector and the predicted bounding box vector.
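
For reference, if you wanted smooth L1 rather than MSE, the built-in Huber loss in Keras is the closest equivalent (a sketch only; the training code below uses MSE):

# Huber behaves like L2 for small errors and like L1 for large ones; delta controls the switch point.
smooth_l1 = losses.Huber(delta=1.0)
example_loss = smooth_l1([[0.10, 0.10, 0.50, 0.50]], [[0.12, 0.08, 0.55, 0.46]])
print(float(example_loss))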

Intersection over union (IoU) is a commonly used performance metric in bounding box regression.

IoU is computed as the area of intersection divided by the area of union of the predicted and ground-truth boxes.

The code below defines a set of helper functions for computing intersection over union (IoU) and evaluating model predictions: cal_IoU computes the IoU between two boxes, evaluate returns both the loss and the IoU, and the evaluation criterion is assigned to a variable.

def cal_IoU(b1, b2):
    zero = tf.convert_to_tensor(0., b1.dtype)

    b1_x1, b1_y1, b1_x2, b1_y2 = tf.unstack(b1, 4, axis=-1)
    b2_x1, b2_y1, b2_x2, b2_y2 = tf.unstack(b2, 4, axis=-1)
    
    b1_width = tf.maximum(zero, b1_x2 - b1_x1)
    b1_height = tf.maximum(zero, b1_y2 - b1_y1)
    b2_width = tf.maximum(zero, b2_x2 - b2_x1)
    b2_height = tf.maximum(zero, b2_y2 - b2_y1)
    
    b1_area = b1_width * b1_height
    b2_area = b2_width * b2_height

    intersect_x1 = tf.maximum(b1_x1, b2_x1)
    intersect_y1 = tf.maximum(b1_y1, b2_y1)
  
    intersect_y2 = tf.minimum(b1_y2, b2_y2)
    intersect_x2 = tf.minimum(b1_x2, b2_x2)

    intersect_width = tf.maximum(zero, intersect_x2 - intersect_x1)
    intersect_height = tf.maximum(zero, intersect_y2 - intersect_y1)
    
    intersect_area = intersect_width * intersect_height

    union_area = b1_area + b2_area - intersect_area
    iou = tf.math.divide_no_nan(intersect_area, union_area)
    return iou


def calculate_iou(y_true, y_pred):
    y_pred = tf.convert_to_tensor(y_pred)
    y_pred = tf.cast(y_pred, tf.float32)
    y_true = tf.cast(y_true, y_pred.dtype)
    iou = cal_IoU(y_pred, y_true)
    return iou


def evaluate(actual, pred):
    iou = calculate_iou(actual, pred)
    loss = losses.MSE(actual, pred)
    return loss, iou

criterion = evaluate
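
A quick spot check of cal_IoU on boxes whose overlap is easy to verify by hand:

# Identical boxes -> IoU of 1.0
print(cal_IoU(tf.constant([0., 0., 1., 1.]), tf.constant([0., 0., 1., 1.])).numpy())

# Two 2x2 boxes overlapping in a 1x1 square -> 1 / (4 + 4 - 1) ≈ 0.1429
print(cal_IoU(tf.constant([0., 0., 2., 2.]), tf.constant([1., 1., 3., 3.])).numpy())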

Optimizer and learning rate scheduler

We schedule the learning rate using an exponentially decaying learning rate and optimize using the Adam optimizer.

EPOCHS = 10
LEARNING_RATE = 0.0003

lr_scheduler = optimizers.schedules.ExponentialDecay(LEARNING_RATE, 3600, 0.8)
optimizer = optimizers.Adam(learning_rate=lr_scheduler)

os.makedirs('checkpoints', exist_ok=True)
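
ExponentialDecay as configured above multiplies the learning rate by 0.8 every 3600 optimizer steps, decaying continuously since staircase defaults to False. A quick check of the schedule values:

print(float(lr_scheduler(0)))     # 0.0003   -- initial learning rate
print(float(lr_scheduler(3600)))  # 0.00024  -- 0.0003 * 0.8 after one decay period
print(float(lr_scheduler(7200)))  # 0.000192 -- 0.0003 * 0.8 ** 2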

Training loop

We implement a training loop that runs for the specified number of epochs.

  • Within each epoch, the loop iterates over batches of the training dataset.

  • It performs forward propagation to obtain predicted bounding box coordinates, calculates loss and IoU values, applies backpropagation to update the model's weights, and records training metrics.

  • After each epoch, the average training loss and IoU are calculated.

The model is saved at the end of each epoch.

for epoch in range(EPOCHS):
    train_losses, train_ious = np.array([]), np.array([])

    for step, (inputs, labels) in enumerate(train_dataset):
      
        with tf.GradientTape() as tape:
            preds = bbox_regressor(inputs, training=True)
            loss, iou = criterion(labels, preds)

        grads = tape.gradient(loss, bbox_regressor.trainable_weights)
        optimizer.apply_gradients(zip(grads, bbox_regressor.trainable_weights))
        
        loss_value = tf.math.reduce_mean(loss).numpy()
        train_losses = np.hstack([train_losses, loss_value])
        
        iou_value = tf.math.reduce_mean(iou).numpy()
        train_ious = np.hstack([train_ious, iou_value])

        print('\rStep %d/%d -- Training loss: %f' % (step + 1, math.ceil(train_max / BATCH_SIZE), loss_value), end='')


    tr_lss, tr_iou = np.mean(train_losses), np.mean(train_ious)
    
    print('Epoch %d/%d -- Train loss: %f -- Train average IoU: %f' % (epoch + 1, EPOCHS, tr_lss, tr_iou))
    print()
    
    save_path = './checkpoints/checkpoint%d.h5' % epoch
    bbox_regressor.save(save_path)

Prediction

We visualize the bounding boxes predicted by the bbox regressor for some of the test-set images by drawing them on the images.

for inputs, labels in test_dataset:
    bbox_preds = bbox_regressor(inputs, training=False).numpy()
    # target_shape * 2 == (300, 300, 300, 300): scale the normalized (x1, y1, x2, y2)
    # predictions back to pixel coordinates of the resized images.
    bbox_preds = (bbox_preds * (dataset_generator.target_shape * 2)).astype(int)
    # Undo the [-1, 1] normalization to recover displayable uint8 images.
    imgs = (127 * (inputs + 1)).numpy().astype(np.uint8)
    for idx, img in enumerate(imgs):
        x1, y1, x2, y2 = bbox_preds[idx]
        img = cv2.rectangle(img, (x1, y1), (x2, y2), (255, 0, 0), 4)
        plt.imshow(img)
        plt.show()
    break

Output

[Figure: predicted bounding boxes drawn on sample test images]

Conclusion

In summary, CNN-based localizers help advance computer vision applications, especially in object localization tasks. The article highlights the importance of CNNs in image analysis and explains a two-step pipeline, including a backbone CNN for feature extraction and a regression head for predicting bounding box coordinates.

With advances in deep learning techniques, larger datasets, and the integration of other modalities, the future of object localization holds great potential to have a significant impact on the industry and change visual perception and understanding.

Key takeaways

  • CNN-based localizers are critical to advancing computer vision applications, exploiting the ability of CNNs to learn hierarchical features from images.

  • The two-step pipeline consists of a feature extraction backbone CNN and a regression head, often used in CNN-based localizers for accurate object localization.

  • With advances in deep learning, larger datasets, and integration of other modalities, the future of object localization is promising, with major implications for industries such as autonomous driving, robotics, surveillance, and healthcare.

☆ END ☆

