Step-by-step single object detection using PyTorch

1. Description

        In the object detection task, we want to find the location of an object in an image. We can search for one type of object (single object detection, as shown in this tutorial) or multiple objects (multiple object detection). Typically, we use a bounding box to define the location of an object. There are several ways to represent bounding boxes:

  1. Upper-left corner with width and height — [x0, y0, w, h], where x0 is the left side of the box, y0 is the top of the box, and w and h are the width and height of the box, respectively.
  2. Top-left and bottom-right points — [x0, y0, x1, y1], where x0 is the left side of the box, y0 is the top of the box, x1 is the right side of the box, and y1 is the bottom of the box.
  3. Center point with width and height — [xc, yc, w, h], where xc is the x-coordinate of the box center, yc is the y-coordinate of the box center, and w and h are the box's width and height, respectively.
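
As a quick illustration of how these representations relate, here is a minimal sketch (hypothetical helper functions, not used in the rest of the tutorial) that converts a center-format box to the other two formats:

def center_to_corners(box):
  # [xc, yc, w, h] -> [x0, y0, x1, y1] (top-left and bottom-right corners)
  xc, yc, w, h = box
  return [xc - w / 2, yc - h / 2, xc + w / 2, yc + h / 2]

def center_to_topleft_wh(box):
  # [xc, yc, w, h] -> [x0, y0, w, h] (top-left corner plus width and height)
  xc, yc, w, h = box
  return [xc - w / 2, yc - h / 2, w, h]

print(center_to_corners([100, 100, 50, 30]))     # [75.0, 85.0, 125.0, 115.0]
print(center_to_topleft_wh([100, 100, 50, 30]))  # [75.0, 85.0, 50, 30]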

Photo by Indiana Barriopedro from Pexels, edited by the author.

        In this tutorial, we will focus on finding the center of the fovea in medical eye images from the iChallenge-AMD competition.

2. Get data

        We will use images of the eyes of patients with age-related macular degeneration (AMD).

Eye images from the AMD dataset

        There are two main sources for the data. The first is the iChallenge-AMD website, https://amd.grand-challenge.org/. You first need to register for the competition, and then you can download the data. The second way does not require registration: download from https://ai.baidu.com/broad/download. There you need to download "[training] images and AMD labels" for the images and "[training] discs and fovea annotations" for the Excel file with the labels.

        After downloading and extracting the data, you should have a folder Training400 containing the subfolders AMD (89 images) and Non-AMD (311 images), and an Excel file Fovea_location.xlsx containing the fovea center position for each image.
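
For reference, after extraction the data should be laid out roughly like this (folder names as used in the code later in this tutorial):

Training400/
  AMD/                  # 89 images, file names starting with 'A'
  Non-AMD/              # 311 images, file names starting with 'N'
  Fovea_location.xlsx   # fovea center coordinates for every image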

3. Explore data

Let's first load the Excel file using Pandas.

from pathlib import Path
import pandas as pd

path_to_parent_dir = Path('.')
path_to_labels_file = path_to_parent_dir / 'Training400' / 'Fovea_location.xlsx'
labels_df = pd.read_excel(path_to_labels_file, index_col='ID')

print('Head')
print(labels_df.head())  # show the first 5 rows of the Excel file
print('\nTail')
print(labels_df.tail())  # show the last 5 rows of the Excel file

Printed head and tail of the labels DataFrame

We see that the table consists of four columns:

  • ID — we will use this as an index into the dataframe
  • imgName — the name of the image. Note that the names of AMD images start with A, while the names of non-AMD images start with N.
  • Fovea_X — x-coordinate of the fovea centroid in the image
  • Fovea_Y — y-coordinate of the fovea centroid in the image

We can plot the fovea centroids to get an idea of the distribution of fovea locations.

%matplotlib inline
# the magic command above is only needed if using Jupyter notebook or Colab
import seaborn as sns
import matplotlib.pyplot as plt

plt.rcParams['figure.figsize'] = (10, 6)
amd_or_non_amd = ['AMD' if name.startswith('A') else 'Non-AMD' for name in labels_df.imgName]
sns.scatterplot(x='Fovea_X', y='Fovea_Y', hue=amd_or_non_amd, data=labels_df, alpha=0.7)

We can see two main groups of foveal locations, but more importantly, for some images the label for the foveal centroid is (0, 0). It would be better to remove these images from the data frame.

labels_df = labels_df[(labels_df[['Fovea_X', 'Fovea_Y']] != 0).all(axis=1)]
amd_or_non_amd = ['AMD' if name.startswith('A') else 'Non-AMD' for name in labels_df.imgName]

Now we want to look at a random sample of images and mark the center of the fovea. To do this, let's define a function to load an image with labels, and another function to draw a bounding box around the fovea based on the labels.

import numpy as np
from PIL import Image, ImageDraw

def load_image_with_label(labels_df, id):
  image_name = labels_df.loc[id, 'imgName']
  data_type = 'AMD' if image_name.startswith('A') else 'Non-AMD'
  image_path = path_to_parent_dir / 'Training400' / data_type / image_name
  image = Image.open(image_path)
  label = (labels_df.loc[id, 'Fovea_X'], labels_df.loc[id, 'Fovea_Y'])
  return image, label

def show_image_with_bounding_box(image, label, w_h_bbox=(50, 50), thickness=2):
  w, h = w_h_bbox
  c_x, c_y = label
  image = image.copy()
  ImageDraw.Draw(image).rectangle(((c_x-w//2, c_y-h//2), (c_x+w//2, c_y+h//2)), outline='green', width=thickness)
  plt.imshow(image)

We randomly sample six images and show them.

rng = np.random.default_rng(42)  # create Generator object with seed 42
n_rows = 2  # number of rows in the image subplot
n_cols = 3  # number of cols in the image subplot
indexes = rng.choice(labels_df.index, n_rows * n_cols)

for ii, id in enumerate(indexes, 1):
  image, label = load_image_with_label(labels_df, id)
  plt.subplot(n_rows, n_cols, ii)
  show_image_with_bounding_box(image, label, (250, 250), 20)
  plt.title(labels_df.loc[id, 'imgName'])

The first thing to notice from the images above is that the images come in different sizes. To understand the distribution of image sizes, we collect the height and width of every image in the dataset.

heights = []
widths = []
for image_name, data_type in zip(labels_df['imgName'], amd_or_non_amd):
  image_path = path_to_parent_dir / 'Training400' / data_type / image_name
  w, h = Image.open(image_path).size  # PIL's size is (width, height)
  heights.append(h)
  widths.append(w)

sns.histplot(x=heights, hue=amd_or_non_amd)
sns.histplot(x=widths, hue=amd_or_non_amd)

4. Data Augmentation and Transformation

        Data augmentation is a very important step that lets us expand the dataset (especially when we have a small dataset, as in our case) and make the network more robust. We also apply transformations to make the network's input consistent: in our case, we resize the images so that they all have the same dimensions.

        In addition to augmenting and transforming the images, we also need to take care of the labels. For example, if we flip the image vertically, the fovea centroid gets new coordinates that we must update accordingly. Since the transformations have to update both the image and its label, we'll write the transformation classes ourselves.

import torch
import torchvision.transforms.functional as tf

class Resize:
  '''Resize the image and convert the label
     to the new shape of the image'''
  def __init__(self, new_size=(256, 256)):
    self.new_width = new_size[0]
    self.new_height = new_size[1]

  def __call__(self, image_label_sample):
    image = image_label_sample[0]
    label = image_label_sample[1]
    c_x, c_y = label
    original_width, original_height = image.size
    image_new = tf.resize(image, (self.new_width, self.new_height))
    c_x_new = c_x * self.new_width /original_width
    c_y_new = c_y * self.new_height / original_height
    return image_new, (c_x_new, c_y_new)


class RandomHorizontalFlip:
  '''Horizontal flip the image with probability p.
     Adjust the label accordingly'''
  def __init__(self, p=0.5):
    if not 0 <= p <= 1:
      raise ValueError(f'Variable p is a probability, should be float between 0 to 1')
    self.p = p  # float between 0 to 1 represents the probability of flipping

  def __call__(self, image_label_sample):
    image = image_label_sample[0]
    label = image_label_sample[1]
    w, h = image.size
    c_x, c_y = label
    if np.random.random() < self.p:
      image = tf.hflip(image)
      label = w - c_x, c_y
    return image, label


class RandomVerticalFlip:
  '''Vertically flip the image with probability p.
    Adjust the label accordingly'''
  def __init__(self, p=0.5):
    if not 0 <= p <= 1:
      raise ValueError(f'Variable p is a probability, should be float between 0 to 1')
    self.p = p  # float between 0 to 1 represents the probability of flipping

  def __call__(self, image_label_sample):
    image = image_label_sample[0]
    label = image_label_sample[1]
    w, h = image.size
    c_x, c_y = label
    if np.random.random() < self.p:
      image = tf.vflip(image)
      label = c_x, h - c_y
    return image, label


class RandomTranslation:
  '''Translate the image by a random amount within a range of values.
     Translate the label accordingly'''
  def __init__(self, max_translation=(0.2, 0.2)):
    if (not 0 <= max_translation[0] <= 1) or (not 0 <= max_translation[1] <= 1):
      raise ValueError(f'Variable max_translation should be float between 0 to 1')
    self.max_translation_x = max_translation[0]
    self.max_translation_y = max_translation[1]

  def __call__(self, image_label_sample):
    image = image_label_sample[0]
    label = image_label_sample[1]
    w, h = image.size
    c_x, c_y = label
    x_translate = int(np.random.uniform(-self.max_translation_x, self.max_translation_x) * w)
    y_translate = int(np.random.uniform(-self.max_translation_y, self.max_translation_y) * h)
    image = tf.affine(image, translate=(x_translate, y_translate), angle=0, scale=1, shear=0)
    label = c_x + x_translate, c_y + y_translate
    return image, label


class ImageAdjustment:
  '''Change the brightness and contrast of the image and apply Gamma correction.
     No need to change the label.'''
  def __init__(self, p=0.5, brightness_factor=0.8, contrast_factor=0.8, gamma_factor=0.4):
    if not 0 <= p <= 1:
      raise ValueError(f'Variable p is a probability, should be float between 0 to 1')
    self.p = p
    self.brightness_factor = brightness_factor
    self.contrast_factor = contrast_factor
    self.gamma_factor = gamma_factor

  def __call__(self, image_label_sample):
    image = image_label_sample[0]
    label = image_label_sample[1]

    if np.random.random() < self.p:
      brightness_factor = 1 + np.random.uniform(-self.brightness_factor, self.brightness_factor)
      image = tf.adjust_brightness(image, brightness_factor)

    if np.random.random() < self.p:
      contrast_factor = 1 + np.random.uniform(-self.contrast_factor, self.contrast_factor)
      image = tf.adjust_contrast(image, contrast_factor)

    if np.random.random() < self.p:
      gamma_factor = 1 + np.random.uniform(-self.gamma_factor, self.gamma_factor)
      image = tf.adjust_gamma(image, gamma_factor)

    return image, label

class ToTensor:
  '''Convert the image to a PyTorch tensor with
     the channel as the first dimension and values
     between 0 and 1. Also convert the label to a tensor
     with values between 0 and 1'''
  def __init__(self, scale_label=True):
    self.scale_label = scale_label

  def __call__(self, image_label_sample):
    image = image_label_sample[0]
    label = image_label_sample[1]
    w, h = image.size
    c_x, c_y = label

    image = tf.to_tensor(image)

    if self.scale_label:
      label = c_x/w, c_y/h
    label = torch.tensor(label, dtype=torch.float32)

    return image, label


class ToPILImage:
  '''Convert a tensor image to a PIL Image.
     Also convert the label to a tuple with
     values in image (pixel) units'''
  def __init__(self, unscale_label=True):
    self.unscale_label = unscale_label

  def __call__(self, image_label_sample):
    image = image_label_sample[0]
    label = image_label_sample[1].tolist()

    image = tf.to_pil_image(image)
    w, h = image.size

    if self.unscale_label:
      c_x, c_y = label
      label = c_x*w, c_y*h

    return image, label

Let's try out the new transformations. We create an object for each transformation class and chain them together using torchvision's Compose. We then apply the full transformation to an image and its label.

from torchvision.transforms import Compose
image, label = load_image_with_label(labels_df, 1)
transformation = Compose([Resize(), RandomHorizontalFlip(), RandomVerticalFlip(), RandomTranslation(), ImageAdjustment(), ToTensor()])
new_image, new_label = transformation((image, label))
print(f'new_im type {new_image.dtype}, shape = {new_image.shape}')
print(f'{new_label=}')

# new_im type torch.float32, shape = torch.Size([3, 256, 256])
# new_label=tensor([0.6231, 0.3447])

We got the result as expected. We also want to convert the new tensor to a PIL image, and convert the label back to image coordinates so we can display it using our show method.

new_image, new_label = ToPILImage()((new_image, new_label))
show_image_with_bounding_box(new_image, new_label)

5. Make Dataset and Data Loader

        To load data into our model, we first need to build a custom dataset class (a subclass of the PyTorch Dataset class). To do this, we need to implement three methods:

  • __init__() — construct and initialize the dataset object
  • __getitem__() — handle how we index an image and its label from the dataset
  • __len__() — return the length of the dataset
import torch
from torch.utils.data import Dataset, DataLoader

device = 'cuda' if torch.cuda.is_available() else 'cpu'

class AMDDataset(Dataset):
  def __init__(self, data_path, labels_df, transformation):
    self.data_path = Path(data_path)
    self.labels_df = labels_df.reset_index(drop=True)
    self.transformation = transformation

  def __getitem__(self, index):
    image_name = self.labels_df.loc[index, 'imgName']
    image_path = self.data_path / ('AMD' if image_name.startswith('A') else 'Non-AMD') / image_name
    image = Image.open(image_path)
    label = self.labels_df.loc[index, ['Fovea_X','Fovea_Y']].values.astype(float)
    image, label = self.transformation((image, label))
    return image.to(device), label.to(device)

  def __len__(self):
    return len(self.labels_df)

Before actually creating the dataset objects, we need to split the data into training and validation sets. We use scikit-learn's train_test_split to split labels_df into a training data frame and a validation data frame.

from sklearn.model_selection import train_test_split
labels_df_train, labels_df_val = train_test_split(labels_df, test_size=0.2, shuffle=True, random_state=42)

train_transformation = Compose([Resize(), RandomHorizontalFlip(), RandomVerticalFlip(), RandomTranslation(), ImageAdjustment(), ToTensor()])
val_transformation = Compose([Resize(), ToTensor()])

train_dataset = AMDDataset('Training400', labels_df_train, train_transformation)
val_dataset = AMDDataset('Training400', labels_df_val, val_transformation)

We can inspect our dataset object by displaying a sample image.

image, label = train_dataset[0]
show_image_with_bounding_box(*(ToPILImage()((image, label))))

image, label = val_dataset[0]
show_image_with_bounding_box(*(ToPILImage()((image, label))))

The next step is to define a data loader, one for the training dataset and one for the validation dataset.

train_dataloader = DataLoader(train_dataset, batch_size=8)
val_dataloader = DataLoader(val_dataset, batch_size=16)

We don't have to shuffle in the DataLoader because we already shuffle the data when we split it into training and validation datasets. Now let's look at a batch and see if the results are as expected.

image_batch, labels_batch = next(iter(train_dataloader))
print(image_batch.shape, image_batch.dtype)
print(labels_batch, labels_batch.dtype)

# torch.Size([8, 3, 256, 256]) torch.float32
# tensor([[0.4965, 0.3782],
#        [0.6202, 0.6245],
#         [0.5637, 0.4887],
#         [0.5114, 0.4908],
#         [0.3087, 0.4657],
#         [0.5330, 0.5309],
#         [0.6800, 0.6544],
#         [0.5828, 0.4034]], device='cuda:0') torch.float32

6. Build a model

        We want to build a model that takes a resized RGB image and returns two values for the x and y coordinates. We will use residual blocks with skip connections, similar to ResNet. We start by defining the basic ResBlock.

import torch.nn as nn

class ResBlock(nn.Module):
  def __init__(self, in_channels, out_channels):
    super().__init__()
    self.base1 = nn.Sequential(
        nn.Conv2d(in_channels, in_channels, kernel_size=3, padding='same'),
        nn.BatchNorm2d(in_channels),
        nn.ReLU(True) 
    )
    self.base2 = nn.Sequential(
        nn.Conv2d(in_channels, out_channels, kernel_size=3, padding='same'),
        nn.BatchNorm2d(out_channels),
        nn.ReLU(True)
    )

  def forward(self, x):
    x = self.base1(x) + x
    x = self.base2(x)
    return x

This block has two steps. The first applies a convolutional layer followed by batch normalization and ReLU, and then adds the original input to the result (the skip connection). The second step again applies a convolutional layer, batch normalization, and ReLU, but this time changes the number of channels. Now we are ready to build the model.

class FoveaNet(nn.Module):
  def __init__(self, in_channels, first_output_channels):
    super().__init__()
    self.model = nn.Sequential(
        ResBlock(in_channels, first_output_channels),
        nn.MaxPool2d(2),
        ResBlock(first_output_channels, 2 * first_output_channels),
        nn.MaxPool2d(2),
        ResBlock(2 * first_output_channels, 4 * first_output_channels),
        nn.MaxPool2d(2),
        ResBlock(4 * first_output_channels, 8 * first_output_channels),
        nn.MaxPool2d(2),
        nn.Conv2d(8 * first_output_channels, 16 * first_output_channels, kernel_size=3),
        nn.MaxPool2d(2),
        nn.Flatten(),
        nn.Linear(7 * 7 * 16 * first_output_channels, 2)
    )
  
  def forward(self, x):
    return self.model(x)
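
A quick check of the final spatial size, assuming the 256x256 inputs used throughout this tutorial:

# ResBlocks use padding='same', so only the pooling layers and the last convolution change the spatial size:
# 256 -> MaxPool2d(2) -> 128 -> 64 -> 32 -> 16 (four ResBlock + pool stages)
# 16 -> Conv2d(kernel_size=3, no padding) -> 14 -> MaxPool2d(2) -> 7
# so Flatten produces 7 * 7 * 16 * first_output_channels features, matching the final Linear layer.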

We can get a better look at our model using the torchinfo package.

! pip install torchinfo -q  # install torchinfo
from torchinfo import summary
net = FoveaNet(3, 16)

summary(model=net, 
        input_size=(8, 3, 256, 256), # (batch_size, color_channels, height, width)
        col_names=["input_size", "output_size", "num_params"],
        col_width=20,
        row_settings=["var_names"]
)

7. Loss and optimizer

        We first define the loss function using smooth L1 loss. With the default settings, this loss behaves like L2 (squared error) when the absolute difference is less than 1, and like L1 otherwise.

loss_func = nn.SmoothL1Loss()
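
As a quick numerical check (with the default beta of 1.0), the element-wise loss is 0.5 * x^2 for |x| < 1 and |x| - 0.5 otherwise:

per_element_loss = nn.SmoothL1Loss(reduction='none')
print(per_element_loss(torch.tensor([0.3, 2.0]), torch.zeros(2)))  # tensor([0.0450, 1.5000])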

        For the optimizer, we will use Adam.

optimizer = torch.optim.Adam(net.parameters(), lr=1e-4)

        As a performance metric, we use the "Intersection over Union" metric (IoU). This metric calculates the ratio between the intersection of two bounding boxes and their union.

        First, we need to define a function that takes the centroids as input and returns bounding boxes of the form [x0, y0, x1, y1]. The default box width and height of 50/256 correspond to a 50-pixel box in a 256-pixel image, expressed in the normalized (0 to 1) coordinates our labels use.

def centroid_to_bbox(centroids, w=50/256, h=50/256):
  x0_y0 = centroids - torch.tensor([w/2, h/2]).to(device)
  x1_y1 = centroids + torch.tensor([w/2, h/2]).to(device)
  return torch.cat([x0_y0, x1_y1], dim=1)

        and a function to calculate the IoU of a batch of labels

from torchvision.ops import box_iou
def iou_batch(output_labels, target_labels):
  output_bbox = centroid_to_bbox(output_labels)
  target_bbox = centroid_to_bbox(target_labels)
  return torch.trace(box_iou(output_bbox, target_bbox)).item()
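
A quick sanity check of box_iou with hypothetical boxes: a box compared with itself gives IoU 1, and two disjoint boxes give IoU 0.

box_a = torch.tensor([[0.2, 0.2, 0.4, 0.4]])
box_b = torch.tensor([[0.6, 0.6, 0.8, 0.8]])
print(box_iou(box_a, box_a))  # tensor([[1.]])
print(box_iou(box_a, box_b))  # tensor([[0.]])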

Next, we define a function that computes the loss and the IoU metric for a batch and, when an optimizer is passed, also performs an optimization step.

def batch_loss(loss_func, output, target, optimizer=None):
  loss = loss_func(output, target)
  with torch.no_grad():
    iou_metric = iou_batch(output, target)
  if optimizer is not None:
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
  return loss.item(), iou_metric

8. Train the model

        In this step, we will train the model to find the fovea. We first define a helper function that performs a training or validation step: it iterates over all the data in the dataloader, uses our batch_loss function from before to get the loss (and, during training, to update the weights), and keeps track of the loss and IoU metrics.

def train_val_step(dataloader, model, loss_func, optimizer=None):
  if optimizer is not None:
    model.train()
  else:
    model.eval()

  running_loss = 0
  running_iou = 0
  
  for image_batch, label_batch in dataloader:
    output_labels = model(image_batch)
    loss_value, iou_metric_value = batch_loss(loss_func, output_labels, label_batch, optimizer)
    running_loss += loss_value
    running_iou += iou_metric_value
  
  return running_loss/len(dataloader.dataset), running_iou/len(dataloader.dataset)

We now have everything we need for training. We define two dictionaries to track the loss and IoU metrics on the training and validation sets after each epoch, and we save the model weights that give the best validation loss. Note that we now create the loss with reduction="sum", so dividing the accumulated loss by the dataset size in train_val_step gives the average per-sample loss.

num_epoch = 100
loss_tracking = {'train': [], 'val': []}
iou_tracking = {'train': [], 'val': []}
best_loss = float('inf')

model = FoveaNet(3, 16).to(device)
loss_func = nn.SmoothL1Loss(reduction="sum")
optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)


for epoch in range(num_epoch):
  print(f'Epoch {epoch+1}/{num_epoch}')

  training_loss, training_iou = train_val_step(train_dataloader, model, loss_func, optimizer)
  loss_tracking['train'].append(training_loss)
  iou_tracking['train'].append(training_iou)

  with torch.inference_mode():
    val_loss, val_iou = train_val_step(val_dataloader, model, loss_func, None)
    loss_tracking['val'].append(val_loss)
    iou_tracking['val'].append(val_iou)
    if val_loss < best_loss:
      print('Saving best model')
      torch.save(model.state_dict(), 'best_model.pt')
      best_loss = val_loss
  
  print(f'Training loss: {training_loss:.6}, IoU: {training_iou:.2}')
  print(f'Validation loss: {val_loss:.6}, IoU: {val_iou:.2}')

Let's plot the average loss and average IoU per epoch as a function of epoch.

plt.plot(range(1, num_epoch+1), loss_tracking['train'], label='train')
plt.plot(range(1, num_epoch+1), loss_tracking['val'], label='validation')
plt.xlabel('epoch')
plt.ylabel('loss')
plt.legend()

plt.plot(range(1, num_epoch+1), iou_tracking['train'], label='train')
plt.plot(range(1, num_epoch+1), iou_tracking['val'], label='validation')
plt.xlabel('epoch')
plt.ylabel('iou')
plt.legend()

Finally, we want to look at some images to see how close the model's predictions are to the true coordinates of the fovea. To do this, we define a new function based on the previous show_image_with_bounding_box function, but this time we draw bounding boxes for both the prediction (green) and the target (red).

def show_image_with_2_bounding_box(image, label, target_label, w_h_bbox=(50, 50), thickness=2):
  w, h = w_h_bbox
  c_x , c_y = label
  c_x_target , c_y_target = target_label
  image = image.copy()
  ImageDraw.Draw(image).rectangle(((c_x-w//2, c_y-h//2), (c_x+w//2, c_y+h//2)), outline='green', width=thickness)
  ImageDraw.Draw(image).rectangle(((c_x_target-w//2, c_y_target-h//2), (c_x_target+w//2, c_y_target+h//2)), outline='red', width=thickness)
  plt.imshow(image)

Now we load the best model, make predictions on a sample of validation images, and look at the results.

model.load_state_dict(torch.load('best_model.pt'))
model.eval()
rng = np.random.default_rng(0)  # create Generator object with seed 0
n_rows = 2  # number of rows in the image subplot
n_cols = 3  # number of cols in the image subplot
indexes = rng.choice(range(len(val_dataset)), n_rows * n_cols, replace=False)

for ii, id in enumerate(indexes, 1):
  image, label = val_dataset[id]
  output = model(image.unsqueeze(0))
  iou = iou_batch(output, label.unsqueeze(0))
  _, label = ToPILImage()((image, label))
  image, output = ToPILImage()((image, output.squeeze()))
  plt.subplot(n_rows, n_cols, ii)
  show_image_with_2_bounding_box(image, output, label)
  plt.title(f'{iou:.2f}')

9. Conclusion

        In this tutorial, we covered all the main steps required to build a network for a single object detection task. We first explored the data, cleaned and arranged it, then built data augmentation and transformation classes along with Dataset and DataLoader objects, and finally built and trained the model. We got relatively good results, and you are welcome to try to improve the performance by changing the model's hyperparameters and architecture.
