Training a YOLOv8 instance segmentation model

YOLOv8 was launched on January 10, 2023. At the time of writing, it is the latest model in the YOLO family, covering classification, detection, and segmentation tasks, and it improves on previous YOLO versions in both accuracy and inference speed.

7420b61ab36ce7912c5d2dae4bcdefc7.png

YOLOv8 compared to other YOLO models (from ultralytics)

The ultralytics team has done a great job making this model easier to use than any previous YOLO model: you don't even need to clone the git repository anymore!

Create image dataset

In this article, I walk through a very simple example showing how to train YOLOv8 on your own data, specifically for a segmentation task. The dataset is small and "easy to learn" for the model, so only a few seconds of training on a regular CPU are needed to obtain satisfactory results.

We will create a dataset of white circles with a black background. The size of the circles will vary. We will train a model to segment circles within images.

The dataset looks like this:

3d7c9a47d927486394a7b7a7061fa15e.png

The dataset is generated using the following code:

import numpy as np
from PIL import Image
from skimage import draw
import random
from pathlib import Path

def create_image(path, img_size, min_radius):
    path.parent.mkdir(parents=True, exist_ok=True)

    # Black square image
    arr = np.zeros((img_size, img_size), dtype=np.uint8)

    # Random circle center, kept far enough from the border to fit min_radius
    center_x = random.randint(min_radius, img_size - min_radius)
    center_y = random.randint(min_radius, img_size - min_radius)

    # Largest radius that still keeps the circle fully inside the image
    max_radius = min(center_x, center_y, img_size - center_x, img_size - center_y)
    radius = random.randint(min_radius, max_radius)

    # Fill the circle (an ellipse with equal radii) with white
    row_indxs, column_idxs = draw.ellipse(center_x, center_y, radius, radius, shape=arr.shape)
    arr[row_indxs, column_idxs] = 255

    im = Image.fromarray(arr)
    im.save(path)

def create_images(data_root_path, train_num, val_num, test_num, img_size=640, min_radius=10):
    data_root_path = Path(data_root_path)

    for i in range(train_num):
        create_image(data_root_path / 'train' / 'images' / f'img_{i}.png', img_size, min_radius)

    for i in range(val_num):
        create_image(data_root_path / 'val' / 'images' / f'img_{i}.png', img_size, min_radius)

    for i in range(test_num):
        create_image(data_root_path / 'test' / 'images' / f'img_{i}.png', img_size, min_radius)

create_images('datasets', train_num=120, val_num=40, test_num=40, img_size=120, min_radius=10)

Create labels

Now that we have a dataset of images, we need to create labels for the images. Normally we would need to do this manually, but since the dataset we created is so simple, it's easy to write code to generate the labels:

from rasterio import features

def create_label(image_path, label_path):
    arr = np.asarray(Image.open(image_path))

    # Extract the polygon surrounding the white region.
    # There may be a better way to do it, but this is what I have found so far.
    cords = list(features.shapes(arr, mask=(arr > 0)))[0][0]['coordinates'][0]

    # Class id 0, followed by the polygon coordinates normalized to [0, 1]
    label_line = '0 ' + ' '.join([f'{int(cord[0])/arr.shape[0]} {int(cord[1])/arr.shape[1]}' for cord in cords])

    label_path.parent.mkdir(parents=True, exist_ok=True)
    with label_path.open('w') as f:
        f.write(label_line)

for images_dir_path in [Path(f'datasets/{x}/images') for x in ['train', 'val', 'test']]:
    for img_path in images_dir_path.iterdir():
        label_path = img_path.parent.parent / 'labels' / f'{img_path.stem}.txt'
        create_label(img_path, label_path)

The following is an example of a label file's content:

0 0.0767 0.08433 0.1417 0.08433 0.1417 0.0917 0.15843 0.0917 0.15843 0.1 0.1766 0.1 0.1766 0.10844 0.175 0.10844 0.175 0.1177 0.18432 0.1177 0.18432 0.14333 0.1918 0.14333 0.1918 0.20844 0.18432 0.20844 0.18432 0.225 0.175 0.225 0.175 0.24334 0.1766 0.24334 0.1766 0.2417 0.15843 0.2417 0.15843 0.25 0.1417 0.25 0.1417 0.25846 0.0767 0.25846 0.0767 0.25 0.05 0.25 0.05 0.2417 0.04174 0.2417 0.04174 0.24334 0.04333 0.24334 0.04333 0.225 0.025 0.225 0.025 0.20844 0.01766 0.20844 0.01766 0.14333 0.025 0.14333 0.025 0.1177 0.04333 0.1177 0.04333 0.10844 0.04174 0.10844 0.04174 0.1 0.05 0.1 0.05 0.0917 0.0767 0.0917 0.0767 0.08433

This label corresponds to the following image:

ba7c0b33d6877c3babd1f1c6e666e219.png

The label content is just a single line of text. There is only one object (a circle) in each image, and each object is represented by one line in the file. If an image contains multiple objects, you should create one line per labeled object.

The first 0 indicates the class of the label. Since we only have one class (circle), it is always 0. If you have multiple classes in your data, you should map each class to a number (0, 1, 2...) and use that number in the label files.

All other numbers represent the coordinates of the bounding polygon of the labeled object. Each pair of numbers is one polygon vertex, and the coordinates are relative to the size of the image, i.e. normalized to a 1x1 image. For example, if a point is (15, 75) and the image size is 120x120, the normalized point is (15/120, 75/120) = (0.125, 0.625).

When working with image libraries, it is often hard to keep the coordinate conventions straight. To be clear, for YOLO the X coordinate goes from left to right and the Y coordinate goes from top to bottom.
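
To make the label format concrete, here is a minimal sketch (my own illustration, not the code used above) that turns a hypothetical polygon given in pixel coordinates into a YOLO segmentation label line for a 120x120 image:

# Illustrative sketch: convert a polygon in pixel coordinates (x, y) into a
# YOLO segmentation label line. The points and the 120x120 size are hypothetical.
img_w, img_h = 120, 120
polygon_px = [(15, 75), (30, 90), (45, 75)]

normalized = [(x / img_w, y / img_h) for x, y in polygon_px]
label_line = '0 ' + ' '.join(f'{x:.4f} {y:.4f}' for x, y in normalized)
print(label_line)  # 0 0.1250 0.6250 0.2500 0.7500 0.3750 0.6250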

YAML configuration

Now we have images and labels. Now we need to create a YAML file with the dataset configuration:

yaml_content = f'''
train: train/images
val: val/images
test: test/images

names: ['circle']
'''

with Path('data.yaml').open('w') as f:
    f.write(yaml_content)

Note that if you have more object classes, you need to add them to the names list in the same order as they are numbered in the label files. The first class is 0, the second is 1, and so on...
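
For example, if the dataset also contained a hypothetical second class called 'square', the names entry would look like this (0 = circle, 1 = square):

names: ['circle', 'square']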

Dataset file structure

Let's use the Linux tree command to see the file structure we created:

tree .
data.yaml
datasets/
├── test
│   ├── images
│   │   ├── img_0.png
│   │   ├── img_1.png
│   │   ├── img_2.png
│   │   ├── ...
│   └── labels
│       ├── img_0.txt
│       ├── img_1.txt
│       ├── img_2.txt
│       ├── ...
├── train
│   ├── images
│   │   ├── img_0.png
│   │   ├── img_1.png
│   │   ├── img_2.png
│   │   ├── ...
│   └── labels
│       ├── img_0.txt
│       ├── img_1.txt
│       ├── img_2.txt
│       ├── ...
└── val
    ├── images
    │   ├── img_0.png
    │   ├── img_1.png
    │   ├── img_2.png
    │   ├── ...
    └── labels
        ├── img_0.txt
        ├── img_1.txt
        ├── img_2.txt
        ├── ...

Training the model

Now that we have the images and labels, we can start training the model. First let's install the package:

pip install ultralytics==8.0.38

The ultralytics library updates very quickly and sometimes breaks the API, so I prefer to pin a single version. The code below relies on version 8.0.38 (the latest version as I write this). If you upgrade to a newer version, some code adaptations may be required to make it work properly.

Then start training:

from ultralytics import YOLO

model = YOLO("yolov8n-seg.pt")

results = model.train(
        batch=8,
        device="cpu",
        data="data.yaml",
        epochs=7,
        imgsz=120,
    )

To keep this article simple, I used the nano model (yolov8n-seg) and trained it on the CPU for just 7 epochs. On my laptop, training takes only a few seconds.

For more information about the parameters used to train the model, you can check the ultralytics documentation.
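
As an illustration only (these are commonly used ultralytics training arguments, but the values are assumptions rather than what was run above), a slightly larger model with a few extra options might look like this:

# Hedged sketch: a larger segmentation model plus a few common extra arguments.
# The values are illustrative and not tuned for this dataset.
from ultralytics import YOLO

model = YOLO("yolov8s-seg.pt")   # "small" variant instead of nano
results = model.train(
        batch=8,
        device="cpu",
        data="data.yaml",
        epochs=20,
        imgsz=120,
        patience=5,      # stop early if validation metrics stop improving
        workers=2,       # number of dataloader workers
        name="circles",  # results go to runs/segment/circles
    )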

Understand the results

Once training is complete, you will see a line similar to the following at the end of the output:

Results saved to runs/segment/train60

Let's look at some of the results found here:

Validation labels

from IPython.display import Image as show_image

show_image(filename="runs/segment/train60/val_batch0_labels.jpg")
7a5600fe0bace57ee237a1b20d1f8648.png

Here we can see the ground-truth labels for part of the validation set. These should line up almost perfectly. If you find that the labels don't cover the objects well, it's likely that your annotations are incorrect.

Predicted validation labels

show_image(filename="runs/segment/train60/val_batch0_pred.jpg")"runs/segment/train60/val_batch0_pred.jpg")
369d7b382b86fda98023f8b0f6f62f0c.png

Here we can see the predictions made by the trained model on a portion of the validation set (the same portion shown above). This can give you a feel for the model's performance. Note that in order to create this image, a confidence threshold has to be chosen; the threshold used here is 0.5, which is not always optimal (we will discuss this later).

Precision curve

In order to understand this and the following charts, you need to be familiar with the concepts of precision and recall. Here's a good explanation of how they work.

show_image(filename="runs/segment/train60/MaskP_curve.png")"runs/segment/train60/MaskP_curve.png")
7f00b5e4e072f5df31b1ec616ded4986.png

Each object detected by the model has a certain confidence level. Typically, if you want to be as sure as possible when saying "this is a circle", you will only keep detections with high confidence values (a high confidence threshold). Of course, there's a trade-off: you might miss some "circles". On the other hand, if you want to "catch" as many "circles" as possible, and are willing to accept some detections that are not really "circles", you will use a low confidence threshold.

The chart above (and the chart below) can help you decide which confidence threshold to use. In our case, we can see that for thresholds larger than 0.128 we obtain 100% precision, meaning all predicted objects are correct.

Note that since we are actually doing a segmentation task, there is another important threshold to pay attention to: the IoU (intersection over union). If you are not familiar with it, you can read about it here. For this graph, an IoU threshold of 0.5 was used.
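
As a quick refresher (a sketch of my own, not taken from the chart), the IoU of two binary masks is simply the area of their overlap divided by the area of their union:

# IoU of two binary masks (boolean NumPy arrays of the same shape)
def mask_iou(mask_a, mask_b):
    intersection = (mask_a & mask_b).sum()
    union = (mask_a | mask_b).sum()
    return intersection / union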

Recall curve

show_image(filename="runs/segment/train60/MaskR_curve.png")"runs/segment/train60/MaskR_curve.png")
f8b7aefa8360d6843363946a0477bcf8.png

Here you can see the recall chart: recall decreases as the confidence threshold increases. In other words, the higher the threshold, the fewer "circles" you "catch".

Here you can see why using a confidence threshold of 0.5 is a bad idea in this case. At a threshold of 0.5 you get roughly 90% recall. However, in the precision curve we saw that for thresholds larger than 0.128 we get 100% precision, so there is no need to go up to 0.5: we can safely use a threshold of 0.128 and get both 100% precision and almost 100% recall :)
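
If you want to check the metrics at that threshold, one option (a sketch, assuming val() accepts conf and iou overrides as in ultralytics 8.0.x) is to re-run validation with it:

# Sketch: re-run validation with the chosen confidence threshold.
# Assumes val() accepts conf/iou overrides, as in ultralytics 8.0.x.
from ultralytics import YOLO

trained = YOLO('runs/segment/train60/weights/best.pt')
metrics = trained.val(data='data.yaml', conf=0.128, iou=0.5)
print(metrics)  # precision/recall and mAP summaries for boxes and masks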

Precision-Recall Curve

Here's a good explanation of precision-recall curves.

https://medium.com/@douglaspsteen/precision-recall-curves-d32e5b290248

show_image(filename="runs/segment/train60/MaskPR_curve.png")"runs/segment/train60/MaskPR_curve.png")
0907838f5135710007db40cfc835e3d6.png

We can clearly see the conclusion drawn before: with this model we can achieve almost 100% precision and 100% recall.

The disadvantage of this chart is that it doesn't show which threshold to use, which is why we still need the charts above.

Losses over time

show_image(filename="runs/segment/train60/results.png")"runs/segment/train60/results.png")
84e7e76dc3123c3d34dc3ff2ccbb01ed.png

Here you can see how different losses change during training and how they perform on the validation set after each epoch.

There is a lot to say about losses and the conclusions that can be drawn from these charts; however, that is beyond the scope of this article. I just wanted to point out that this is where you can find that information :)

Use the trained model

The model itself can also be found in the results directory. Here's how to use the model on a new image:

my_model = YOLO('runs/segment/train60/weights/best.pt')

results = list(my_model('datasets/test/images/img_5.png', conf=0.128))

result = results[0]

The call returns a list with one result per input image; since we passed a single image here, we simply take the first (and only) item. All objects detected in that image are then available through this result object.

You can see that I'm passing here the best confidence threshold value we found earlier (0.128).

There are two ways to get the actual location of a detected object in an image. Choosing the right method depends on what you plan to do with the results. I will show both methods.

result.masks.segments
[array([[    0.10156,     0.34375],
        [    0.09375,     0.35156],
        [    0.09375,     0.35937],
        [   0.078125,       0.375],
        [   0.070312,       0.375],
        [     0.0625,     0.38281],
        [    0.38281,     0.71094],
        [    0.39062,     0.71094],
        [    0.39844,     0.70312],
        [    0.39844,     0.69531],
        [    0.41406,     0.67969],
        [    0.42187,     0.67969],
        [    0.44531,     0.46875],
        [    0.42969,     0.45312],
        [    0.42969,     0.41406],
        [    0.42187,     0.40625],
        [    0.41406,     0.40625],
        [    0.39844,     0.39062],
        [    0.39844,     0.38281],
        [    0.39062,       0.375],
        [    0.38281,       0.375],
        [    0.35156,     0.34375]], dtype=float32)]

This returns the bounding polygon of the object, in a format similar to the label data we created earlier.
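
Since these coordinates are normalized like the labels, you can scale them back to pixels if you need the polygon in image coordinates (a small sketch, assuming the original image size of 120x120 is known):

# Sketch: scale the normalized polygon back to pixel coordinates.
# Assumes the original image size (120x120) is known.
segment = result.masks.segments[0]       # (N, 2) array of normalized (x, y)
img_w, img_h = 120, 120
polygon_px = segment * [img_w, img_h]    # broadcasted multiply -> pixel coordinates
print(polygon_px[:3])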

The second method:

result.masks.masks
tensor([[[0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.],
         ...,
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.]]])

This returns a tensor of shape (1, 128, 128) representing all the pixels of the image the model operates on (our 120x120 inputs are resized to 128x128, the nearest multiple of the model's stride of 32). Pixels belonging to the object receive 1 and background pixels receive 0.

Let's see what the mask looks like:

import torchvision.transforms as T

T.ToPILImage()(result.masks.masks).show()
b401d61664dcc6c0bcdeb3e5e702f6da.png

This is the original image:

7721e0b358f0a178f6003c310c8b8b86.png

Although not perfect, it is good enough for many applications, and the IoU is definitely higher than 0.5.
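
If you want to put a number on that claim, a rough sketch (my own assumption: treating the original image itself, thresholded and resized to the mask resolution, as the ground-truth mask) could compute the IoU directly:

# Rough sketch: IoU between the predicted mask and a ground-truth mask built by
# thresholding the original image and resizing it to the mask resolution.
import numpy as np
from PIL import Image

pred = result.masks.masks[0].cpu().numpy() > 0.5             # predicted binary mask
gt_img = Image.open('datasets/test/images/img_5.png').resize(pred.shape[::-1])
gt = np.asarray(gt_img) > 0                                   # ground-truth binary mask

print('IoU:', (pred & gt).sum() / (pred | gt).sum())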


Overall, the new ultralytics library is much easier to use than previous YOLO versions, especially for segmentation, which is now a first-class task. You can also find YOLOv5 in the new ultralytics package, so if you don't want to use the newest YOLO version, you can continue to use the well-known YOLOv5:

323ec4ba7a25711efe36e1dc6a866af3.png

There are some topics not covered here, such as the different loss functions used by the model, the architectural changes made to create YOLOv8, and so on.


Origin blog.csdn.net/woshicver/article/details/135121190