Remote sensing image segmentation system: fused spatial pyramid pooling (FocalModulation) improved YOLOv8

1. Research background and significance


Remote sensing image segmentation is an important research direction in the field of remote sensing technology. Its goal is to effectively segment and identify different ground objects or categories of ground objects in remote sensing images. With the continuous development of remote sensing technology and the large-scale acquisition of remote sensing image data, remote sensing image segmentation has broad application prospects in agriculture, urban planning, environmental monitoring and other fields.

However, due to the particularity of remote sensing images, such as high image resolution, complex lighting conditions, and diverse types of ground objects, traditional image segmentation methods face some challenges when processing remote sensing images. Therefore, proposing an efficient and accurate remote sensing image segmentation system is of great significance for realizing automated processing of remote sensing images.

In recent years, deep learning technology has achieved remarkable results in the field of image segmentation. Among them, the image segmentation method based on Convolutional Neural Network (CNN) has been widely used. However, the traditional CNN method has some problems when processing remote sensing images, such as poor detection of small targets and limited ability to extract detailed information in remote sensing images.

In order to solve these problems, researchers have proposed an improved remote sensing image segmentation system: YOLOv8 improved with fused spatial pyramid pooling (FocalModulation). This system combines spatial pyramid pooling with the YOLOv8 model and performs multi-scale feature extraction and pooling on images to improve the accuracy and robustness of remote sensing image segmentation.

Specifically, the remote sensing image segmentation system that improves YOLOv8 by integrating spatial pyramid pooling involves the following key steps:

First, target detection is performed on remote sensing images through the YOLOv8 model to obtain candidate target areas in the image.

Then, spatial pyramid pooling technology is used to perform multi-scale feature extraction and pooling operations on the candidate target areas to obtain feature maps of different scales.

Next, the feature maps of different scales are fused to obtain the fused feature map.

Finally, the fused feature map is used for target segmentation, producing segmentation results for the different ground objects or categories in the remote sensing image.
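The multi-scale pooling in the second step is typically realised with an SPP/SPPF-style block. Below is a minimal PyTorch sketch of SPPF as used in YOLOv5/YOLOv8; plain Conv2d layers stand in for the library's Conv block (convolution + BN + SiLU) to keep the example self-contained, so this is an illustration rather than the project's exact module:

import torch
import torch.nn as nn

class SPPF(nn.Module):
    """Spatial Pyramid Pooling - Fast: three cascaded max-pools emulate pooling at 5/9/13 kernel sizes."""
    def __init__(self, c1, c2, k=5):
        super().__init__()
        c_ = c1 // 2
        self.cv1 = nn.Conv2d(c1, c_, 1, 1)          # reduce channels
        self.cv2 = nn.Conv2d(c_ * 4, c2, 1, 1)      # fuse the concatenated multi-scale features
        self.m = nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)

    def forward(self, x):
        x = self.cv1(x)
        y1 = self.m(x)
        y2 = self.m(y1)
        return self.cv2(torch.cat((x, y1, y2, self.m(y2)), 1))

The block keeps the spatial resolution of the feature map and only changes the channel dimension, so it can be dropped into the backbone without altering the rest of the network.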

The remote sensing image segmentation system that integrates spatial pyramid pooling to improve YOLOv8 has the following advantages:

First, by integrating spatial pyramid pooling technology, feature information of images can be extracted at different scales, thereby improving the accuracy of remote sensing image segmentation.

Secondly, using the YOLOv8 model for target detection can effectively detect small targets in remote sensing images and improve the detection capability of the system.

In addition, the remote sensing image segmentation system that integrates spatial pyramid pooling to improve YOLOv8 is also highly robust and can perform image segmentation stably under complex lighting conditions and diverse ground object categories.

In summary, the remote sensing image segmentation system that integrates spatial pyramid pooling to improve YOLOv8 is of great significance in improving the accuracy and robustness of remote sensing image segmentation. The research results of this system will provide effective technical support for the automated processing of remote sensing images and promote the application of remote sensing technology in agriculture, urban planning, environmental monitoring and other fields.

2. Picture demonstration

(Figures: demonstration screenshots of the segmentation system.)

3. Video demonstration

Remote sensing image segmentation system: fused spatial pyramid pooling (FocalModulation) improved YOLOv8 (Bilibili video)

4. Collection, labeling and organization of data sets

Image collection

First, we need to collect the required images. This can be done in different ways, for example by using the existing public dataset YGDatasets.


eiseg is a graphical image annotation tool that supports COCO and YOLO formats. The following are the steps to use eiseg to annotate images into COCO format:

(1) Download and install eiseg.
(2) Open eiseg and select "Open Dir" to choose your image directory.
(3) Set a label name for your target object.
(4) Draw a rectangular box on the image and select the corresponding label.
(5) Save the annotation information; this generates a JSON file with the same name as the image in the image directory.
(6) Repeat this process until all images are labeled.

Since YOLO uses txt-format annotations, we need to convert the JSON annotations into YOLO format. This can be done with various conversion tools or scripts.

Here is a simple way to do it with a Python script that reads the JSON annotation files and converts them to the txt format required by YOLO.

import contextlib
import glob
import json
import os
from collections import defaultdict
from pathlib import Path

import cv2
import numpy as np
import pandas as pd
from PIL import Image
from tqdm import tqdm

from utils import *  # project helpers: make_dirs, exif_size, split_files, split_rows_simple, write_data_data, coco91_to_coco80_class, ...


# Convert INFOLKS JSON file into YOLO-format labels ----------------------------
def convert_infolks_json(name, files, img_path):
    # Create folders
    path = make_dirs()

    # Import json
    data = []
    for file in glob.glob(files):
        with open(file) as f:
            jdata = json.load(f)
            jdata['json_file'] = file
            data.append(jdata)

    # Write images and shapes
    name = path + os.sep + name
    file_id, file_name, wh, cat = [], [], [], []
    for x in tqdm(data, desc='Files and Shapes'):
        f = glob.glob(img_path + Path(x['json_file']).stem + '.*')[0]
        file_name.append(f)
        wh.append(exif_size(Image.open(f)))  # (width, height)
        cat.extend(a['classTitle'].lower() for a in x['output']['objects'])  # categories

        # filename
        with open(name + '.txt', 'a') as file:
            file.write('%s\n' % f)

    # Write *.names file
    names = sorted(np.unique(cat))
    # names.pop(names.index('Missing product'))  # remove
    with open(name + '.names', 'a') as file:
        [file.write('%s\n' % a) for a in names]

    # Write labels file
    for i, x in enumerate(tqdm(data, desc='Annotations')):
        label_name = Path(file_name[i]).stem + '.txt'

        with open(path + '/labels/' + label_name, 'a') as file:
            for a in x['output']['objects']:
                # if a['classTitle'] == 'Missing product':
                #    continue  # skip

                category_id = names.index(a['classTitle'].lower())

                # The INFOLKS bounding box format is [x-min, y-min, x-max, y-max]
                box = np.array(a['points']['exterior'], dtype=np.float32).ravel()
                box[[0, 2]] /= wh[i][0]  # normalize x by width
                box[[1, 3]] /= wh[i][1]  # normalize y by height
                box = [box[[0, 2]].mean(), box[[1, 3]].mean(), box[2] - box[0], box[3] - box[1]]  # xywh
                if (box[2] > 0.) and (box[3] > 0.):  # if w > 0 and h > 0
                    file.write('%g %.6f %.6f %.6f %.6f\n' % (category_id, *box))

    # Split data into train, test, and validate files
    split_files(name, file_name)
    write_data_data(name + '.data', nc=len(names))
    print(f'Done. Output saved to {os.getcwd() + os.sep + path}')


# Convert vott JSON file into YOLO-format labels -------------------------------
def convert_vott_json(name, files, img_path):
    # Create folders
    path = make_dirs()
    name = path + os.sep + name

    # Import json
    data = []
    for file in glob.glob(files):
        with open(file) as f:
            jdata = json.load(f)
            jdata['json_file'] = file
            data.append(jdata)

    # Get all categories
    file_name, wh, cat = [], [], []
    for i, x in enumerate(tqdm(data, desc='Files and Shapes')):
        with contextlib.suppress(Exception):
            cat.extend(a['tags'][0] for a in x['regions'])  # categories

    # Write *.names file
    names = sorted(pd.unique(cat))
    with open(name + '.names', 'a') as file:
        [file.write('%s\n' % a) for a in names]

    # Write labels file
    n1, n2 = 0, 0
    missing_images = []
    for i, x in enumerate(tqdm(data, desc='Annotations')):

        f = glob.glob(img_path + x['asset']['name'] + '.jpg')
        if len(f):
            f = f[0]
            file_name.append(f)
            wh = exif_size(Image.open(f))  # (width, height)

            n1 += 1
            if (len(f) > 0) and (wh[0] > 0) and (wh[1] > 0):
                n2 += 1

                # append filename to list
                with open(name + '.txt', 'a') as file:
                    file.write('%s\n' % f)

                # write labelsfile
                label_name = Path(f).stem + '.txt'
                with open(path + '/labels/' + label_name, 'a') as file:
                    for a in x['regions']:
                        category_id = names.index(a['tags'][0])

                        # The INFOLKS bounding box format is [x-min, y-min, x-max, y-max]
                        box = a['boundingBox']
                        box = np.array([box['left'], box['top'], box['width'], box['height']]).ravel()
                        box[[0, 2]] /= wh[0]  # normalize x by width
                        box[[1, 3]] /= wh[1]  # normalize y by height
                        box = [box[0] + box[2] / 2, box[1] + box[3] / 2, box[2], box[3]]  # xywh

                        if (box[2] > 0.) and (box[3] > 0.):  # if w > 0 and h > 0
                            file.write('%g %.6f %.6f %.6f %.6f\n' % (category_id, *box))
        else:
            missing_images.append(x['asset']['name'])

    print('Attempted %g json imports, found %g images, imported %g annotations successfully' % (i, n1, n2))
    if len(missing_images):
        print('WARNING, missing images:', missing_images)

    # Split data into train, test, and validate files
    split_files(name, file_name)
    print(f'Done. Output saved to {os.getcwd() + os.sep + path}')


# Convert ath JSON file into YOLO-format labels --------------------------------
def convert_ath_json(json_dir):  # dir contains json annotations and images
    # Create folders
    dir = make_dirs()  # output directory

    jsons = []
    for dirpath, dirnames, filenames in os.walk(json_dir):
        jsons.extend(
            os.path.join(dirpath, filename)
            for filename in [
                f for f in filenames if f.lower().endswith('.json')
            ]
        )

    # Import json
    n1, n2, n3 = 0, 0, 0
    missing_images, file_name = [], []
    for json_file in sorted(jsons):
        with open(json_file) as f:
            data = json.load(f)

        # # Get classes
        # try:
        #     classes = list(data['_via_attributes']['region']['class']['options'].values())  # classes
        # except:
        #     classes = list(data['_via_attributes']['region']['Class']['options'].values())  # classes

        # # Write *.names file
        # names = pd.unique(classes)  # preserves sort order
        # with open(dir + 'data.names', 'w') as f:
        #     [f.write('%s\n' % a) for a in names]

        # Write labels file
        for x in tqdm(data['_via_img_metadata'].values(), desc=f'Processing {json_file}'):
            image_file = str(Path(json_file).parent / x['filename'])
            f = glob.glob(image_file)  # image file
            if len(f):
                f = f[0]
                file_name.append(f)
                wh = exif_size(Image.open(f))  # (width, height)

                n1 += 1  # all images
                if len(f) > 0 and wh[0] > 0 and wh[1] > 0:
                    label_file = dir + 'labels/' + Path(f).stem + '.txt'

                    nlabels = 0
                    try:
                        with open(label_file, 'a') as file:  # write labelsfile
                            # try:
                            #     category_id = int(a['region_attributes']['class'])
                            # except:
                            #     category_id = int(a['region_attributes']['Class'])
                            category_id = 0  # single-class

                            for a in x['regions']:
                                # bounding box format is [x-min, y-min, x-max, y-max]
                                box = a['shape_attributes']
                                box = np.array([box['x'], box['y'], box['width'], box['height']],
                                               dtype=np.float32).ravel()
                                box[[0, 2]] /= wh[0]  # normalize x by width
                                box[[1, 3]] /= wh[1]  # normalize y by height
                                box = [box[0] + box[2] / 2, box[1] + box[3] / 2, box[2],
                                       box[3]]  # xywh (left-top to center x-y)

                                if box[2] > 0. and box[3] > 0.:  # if w > 0 and h > 0
                                    file.write('%g %.6f %.6f %.6f %.6f\n' % (category_id, *box))
                                    n3 += 1
                                    nlabels += 1

                        if nlabels == 0:  # remove non-labelled images from dataset
                            os.system(f'rm {label_file}')
                            # print('no labels for %s' % f)
                            continue  # next file

                        # write image
                        img_size = 4096  # resize to maximum
                        img = cv2.imread(f)  # BGR
                        assert img is not None, 'Image Not Found ' + f
                        r = img_size / max(img.shape)  # size ratio
                        if r < 1:  # downsize if necessary
                            h, w, _ = img.shape
                            img = cv2.resize(img, (int(w * r), int(h * r)), interpolation=cv2.INTER_AREA)

                        ifile = dir + 'images/' + Path(f).name
                        if cv2.imwrite(ifile, img):  # if success append image to list
                            with open(dir + 'data.txt', 'a') as file:
                                file.write('%s\n' % ifile)
                            n2 += 1  # correct images

                    except Exception:
                        os.system(f'rm {label_file}')
                        print(f'problem with {f}')

            else:
                missing_images.append(image_file)

    nm = len(missing_images)  # number missing
    print('\nFound %g JSONs with %g labels over %g images. Found %g images, labelled %g images successfully' %
          (len(jsons), n3, n1, n1 - nm, n2))
    if len(missing_images):
        print('WARNING, missing images:', missing_images)

    # Write *.names file
    names = ['knife']  # preserves sort order
    with open(dir + 'data.names', 'w') as f:
        [f.write('%s\n' % a) for a in names]

    # Split data into train, test, and validate files
    split_rows_simple(dir + 'data.txt')
    write_data_data(dir + 'data.data', nc=1)
    print(f'Done. Output saved to {Path(dir).absolute()}')


def convert_coco_json(json_dir='../coco/annotations/', use_segments=False, cls91to80=False):
    save_dir = make_dirs()  # output directory
    coco80 = coco91_to_coco80_class()

    # Import json
    for json_file in sorted(Path(json_dir).resolve().glob('*.json')):
        fn = Path(save_dir) / 'labels' / json_file.stem.replace('instances_', '')  # folder name
        fn.mkdir()
        with open(json_file) as f:
            data = json.load(f)

        # Create image dict
        images = {'%g' % x['id']: x for x in data['images']}
        # Create image-annotations dict
        imgToAnns = defaultdict(list)
        for ann in data['annotations']:
            imgToAnns[ann['image_id']].append(ann)

        # Write labels file
        for img_id, anns in tqdm(imgToAnns.items(), desc=f'Annotations {json_file}'):
            img = images['%g' % img_id]
            h, w, f = img['height'], img['width'], img['file_name']

            bboxes = []
            segments = []
            for ann in anns:
                if ann['iscrowd']:
                    continue
                # The COCO box format is [top left x, top left y, width, height]
                box = np.array(ann['bbox'], dtype=np.float64)
                box[:2] += box[2:] / 2  # xy top-left corner to center
                box[[0, 2]] /= w  # normalize x
                box[[1, 3]] /= h  # normalize y
                if box[2] <= 0 or box[3] <= 0:  # skip if w <= 0 or h <= 0
                    continue

                cls = coco80[ann['category_id'] - 1] if cls91to80 else ann['category_id'] - 1  # class
                box = [cls] + box.tolist()
                if box not in bboxes:
                    bboxes.append(box)
                # Segments
                if use_segments:
                    if len(ann['segmentation']) > 1:
                        s = merge_multi_segment(ann['segmentation'])
                        s = (np.concatenate(s, axis=0) / np.array([w, h])).reshape(-1).tolist()
                    else:
                        s = [j for i in ann['segmentation'] for j in i]  # all segments concatenated
                        s = (np.array(s).reshape(-1, 2) / np.array([w, h])).reshape(-1).tolist()
                    s = [cls] + s
                    if s not in segments:
                        segments.append(s)

            # Write
            with open((fn / f).with_suffix('.txt'), 'a') as file:
                for i in range(len(bboxes)):
                    line = *(segments[i] if use_segments else bboxes[i]),  # cls, box or segments
                    file.write(('%g ' * len(line)).rstrip() % line + '\n')


def min_index(arr1, arr2):
    """Find a pair of indexes with the shortest distance. 
    Args:
        arr1: (N, 2).
        arr2: (M, 2).
    Return:
        a pair of indexes(tuple).
    """
    dis = ((arr1[:, None, :] - arr2[None, :, :]) ** 2).sum(-1)
    return np.unravel_index(np.argmin(dis, axis=None), dis.shape)


def merge_multi_segment(segments):
    """Merge multi segments to one list.
    Find the coordinates with min distance between each segment,
    then connect these coordinates with one thin line to merge all 
    segments into one.

    Args:
        segments(List(List)): original segmentations in coco's json file.
            like [segmentation1, segmentation2,...], 
            each segmentation is a list of coordinates.
    """
    s = []
    segments = [np.array(i).reshape(-1, 2) for i in segments]
    idx_list = [[] for _ in range(len(segments))]

    # record the indexes with min distance between each segment
    for i in range(1, len(segments)):
        idx1, idx2 = min_index(segments[i - 1], segments[i])
        idx_list[i - 1].append(idx1)
        idx_list[i].append(idx2)

    # use two round to connect all the segments
    for k in range(2):
        # forward connection
        if k == 0:
            for i, idx in enumerate(idx_list):
                # middle segments have two indexes
                # reverse the index of middle segments
                if len(idx) == 2 and idx[0] > idx[1]:
                    idx = idx[::-1]
                    segments[i] = segments[i][::-1, :]

                segments[i] = np.roll(segments[i], -idx[0], axis=0)
                segments[i] = np.concatenate([segments[i], segments[i][:1]])
                # deal with the first segment and the last one
                if i in [0, len(idx_list) - 1]:
                    s.append(segments[i])
                else:
                    idx = [0, idx[1] - idx[0]]
                    s.append(segments[i][idx[0]:idx[1] + 1])

        else:
            for i in range(len(idx_list) - 1, -1, -1):
                if i not in [0, len(idx_list) - 1]:
                    idx = idx_list[i]
                    nidx = abs(idx[1] - idx[0])
                    s.append(segments[i][nidx:])
    return s


def delete_dsstore(path='../datasets'):
    # Delete apple .DS_store files
    from pathlib import Path
    files = list(Path(path).rglob('.DS_store'))
    print(files)
    for f in files:
        f.unlink()


if __name__ == '__main__':
    source = 'COCO'

    if source == 'COCO':
        convert_coco_json('./annotations',  # directory with *.json
                          use_segments=True,
                          cls91to80=True)

    elif source == 'infolks':  # Infolks https://infolks.info/
        convert_infolks_json(name='out',
                             files='../data/sm4/json/*.json',
                             img_path='../data/sm4/images/')

    elif source == 'vott':  # VoTT https://github.com/microsoft/VoTT
        convert_vott_json(name='data',
                          files='../../Downloads/athena_day/20190715/*.json',
                          img_path='../../Downloads/athena_day/20190715/')  # images folder

    elif source == 'ath':  # ath format
        convert_ath_json(json_dir='../../Downloads/athena/')  # images folder

    # zip results
    # os.system('zip -r ../coco.zip ../coco')


Organize data folder structure

We need to organize the dataset into the following structure:

-----datasets
	-----coco128-seg
	   |-----images
	   |   |-----train
	   |   |-----valid
	   |   |-----test
	   |
	   |-----labels
	   |   |-----train
	   |   |-----valid
	   |   |-----test
	   |
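A small helper like the following can create this directory tree; the relative path is an assumption and should match your project layout:

from pathlib import Path

# Create the empty coco128-seg directory tree shown above (path relative to the project root)
for sub in ('images', 'labels'):
    for split in ('train', 'valid', 'test'):
        Path('datasets/coco128-seg', sub, split).mkdir(parents=True, exist_ok=True)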

Model training
 Epoch   gpu_mem       box       obj       cls    labels  img_size
 1/200     20.8G   0.01576   0.01955  0.007536        22      1280: 100%|██████████| 849/849 [14:42<00:00,  1.04s/it]
           Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95: 100%|██████████| 213/213 [01:14<00:00,  2.87it/s]
             all       3395      17314      0.994      0.957      0.0957      0.0843

 Epoch   gpu_mem       box       obj       cls    labels  img_size
 2/200     20.8G   0.01578   0.01923  0.007006        22      1280: 100%|██████████| 849/849 [14:44<00:00,  1.04s/it]
           Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95: 100%|██████████| 213/213 [01:12<00:00,  2.95it/s]
             all       3395      17314      0.996      0.956      0.0957      0.0845

 Epoch   gpu_mem       box       obj       cls    labels  img_size
 3/200     20.8G   0.01561    0.0191  0.006895        27      1280: 100%|██████████| 849/849 [10:56<00:00,  1.29it/s]
           Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95: 100%|███████   | 187/213 [00:52<00:00,  4.04it/s]
             all       3395      17314      0.996      0.957      0.0957      0.0845
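For reference, a run like the one logged above can be launched through the Ultralytics Python API. The model configuration, dataset YAML and batch size below are placeholders for illustration, not the exact settings used here:

from ultralytics import YOLO

# 'yolov8n-seg.yaml' stands in for the improved (FocalModulation) model configuration,
# and 'coco128-seg.yaml' for the dataset file prepared above.
model = YOLO('yolov8n-seg.yaml')
model.train(data='coco128-seg.yaml', epochs=200, imgsz=1280, batch=16)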

5. Core code explanation

5.2 predict.py

The code after encapsulating it into a class is as follows:

from ultralytics.engine.predictor import BasePredictor
from ultralytics.engine.results import Results
from ultralytics.utils import ops

class DetectionPredictor(BasePredictor):
    def postprocess(self, preds, img, orig_imgs):
        preds = ops.non_max_suppression(preds,
                                        self.args.conf,
                                        self.args.iou,
                                        agnostic=self.args.agnostic_nms,
                                        max_det=self.args.max_det,
                                        classes=self.args.classes)

        if not isinstance(orig_imgs, list):
            orig_imgs = ops.convert_torch2numpy_batch(orig_imgs)

        results = []
        for i, pred in enumerate(preds):
            orig_img = orig_imgs[i]
            pred[:, :4] = ops.scale_boxes(img.shape[2:], pred[:, :4], orig_img.shape)
            img_path = self.batch[0][i]
            results.append(Results(orig_img, path=img_path, names=self.model.names, boxes=pred))
        return results

This file, predict.py, defines a DetectionPredictor class for making predictions with a detection model. The class inherits from BasePredictor and is responsible for post-processing the raw predictions.

The code uses the Ultralytics YOLO library, an object detection toolkit that provides functions and classes for handling prediction results.

In the DetectionPredictor class, the postprocess method post-processes the predictions. It first applies ops.non_max_suppression to remove overlapping bounding boxes, and then builds a list of Results objects from the original images and the filtered predictions. Each Results object contains the original image, the image path, the class names and the bounding box information.

The file also contains sample code showing how to use the DetectionPredictor class: a pre-trained YOLOv8 model and a set of input images are passed in, and prediction is run by calling the predict_cli method.

Overall, this file is a utility class for object detection prediction, providing post-processing of the prediction results.
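The sample usage mentioned above typically follows the standard Ultralytics pattern; a hedged sketch (the weight file and image source are placeholders) might look like this:

if __name__ == '__main__':
    # placeholders: substitute your trained weights and an image directory
    args = dict(model='yolov8n.pt', source='datasets/images/test')
    predictor = DetectionPredictor(overrides=args)
    predictor.predict_cli()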

5.3 train.py
from copy import copy

import numpy as np

from ultralytics.data import build_dataloader, build_yolo_dataset
from ultralytics.engine.trainer import BaseTrainer
from ultralytics.models import yolo
from ultralytics.nn.tasks import DetectionModel
from ultralytics.utils import LOGGER, RANK
from ultralytics.utils.plotting import plot_images, plot_labels, plot_results
from ultralytics.utils.torch_utils import de_parallel, torch_distributed_zero_first

class DetectionTrainer(BaseTrainer):
    def build_dataset(self, img_path, mode='train', batch=None):
        gs = max(int(de_parallel(self.model).stride.max() if self.model else 0), 32)
        return build_yolo_dataset(self.args, img_path, batch, self.data, mode=mode, rect=mode == 'val', stride=gs)

    def get_dataloader(self, dataset_path, batch_size=16, rank=0, mode='train'):
        assert mode in ['train', 'val']
        with torch_distributed_zero_first(rank):
            dataset = self.build_dataset(dataset_path, mode, batch_size)
        shuffle = mode == 'train'
        if getattr(dataset, 'rect', False) and shuffle:
            LOGGER.warning("WARNING ⚠️ 'rect=True' is incompatible with DataLoader shuffle, setting shuffle=False")
            shuffle = False
        workers = 0
        return build_dataloader(dataset, batch_size, workers, shuffle, rank)

    def preprocess_batch(self, batch):
        batch['img'] = batch['img'].to(self.device, non_blocking=True).float() / 255
        return batch

    def set_model_attributes(self):
        self.model.nc = self.data['nc']
        self.model.names = self.data['names']
        self.model.args = self.args

    def get_model(self, cfg=None, weights=None, verbose=True):
        model = DetectionModel(cfg, nc=self.data['nc'], verbose=verbose and RANK == -1)
        if weights:
            model.load(weights)
        return model

    def get_validator(self):
        self.loss_names = 'box_loss', 'cls_loss', 'dfl_loss'
        return yolo.detect.DetectionValidator(self.test_loader, save_dir=self.save_dir, args=copy(self.args))

    def label_loss_items(self, loss_items=None, prefix='train'):
        keys = [f'{prefix}/{x}' for x in self.loss_names]
        if loss_items is not None:
            loss_items = [round(float(x), 5) for x in loss_items]
            return dict(zip(keys, loss_items))
        else:
            return keys

    def progress_string(self):
        return ('\n' + '%11s' *
                (4 + len(self.loss_names))) % ('Epoch', 'GPU_mem', *self.loss_names, 'Instances', 'Size')

    def plot_training_samples(self, batch, ni):
        plot_images(images=batch['img'],
                    batch_idx=batch['batch_idx'],
                    cls=batch['cls'].squeeze(-1),
                    bboxes=batch['bboxes'],
                    paths=batch['im_file'],
                    fname=self.save_dir / f'train_batch{ni}.jpg',
                    on_plot=self.on_plot)

    def plot_metrics(self):
        plot_results(file=self.csv, on_plot=self.on_plot)

    def plot_training_labels(self):
        boxes = np.concatenate([lb['bboxes'] for lb in self.train_loader.dataset.labels], 0)
        cls = np.concatenate([lb['cls'] for lb in self.train_loader.dataset.labels], 0)
        plot_labels(boxes, cls.squeeze(), names=self.data['names'], save_dir=self.save_dir, on_plot=self.on_plot)

This file implements the training of an object detection model using the Ultralytics YOLO library, which provides the functionality to train and evaluate YOLO models.

The file defines a class named DetectionTrainer, which inherits from BaseTrainer and is used to train detection models. It provides methods for building the dataset, building the data loader, preprocessing batches, setting model attributes, and so on.

In the __main__ block, some parameters are first defined, including the model file path, the data file path and the number of training epochs. A DetectionTrainer object is then created and its train method is called to start training.

Overall, this file implements a training pipeline for the object detection model on top of the functions provided by the Ultralytics YOLO library.
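A hedged sketch of that __main__ block (the model, data and epoch values are placeholders):

if __name__ == '__main__':
    # placeholders: substitute the improved model configuration, your dataset YAML and the epoch count
    args = dict(model='yolov8n.pt', data='coco128.yaml', epochs=100)
    trainer = DetectionTrainer(overrides=args)
    trainer.train()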

5.5 backbone\convnextv2.py
import torch
import torch.nn as nn
import torch.nn.functional as F
from timm.models.layers import trunc_normal_, DropPath

class LayerNorm(nn.Module):
    def __init__(self, normalized_shape, eps=1e-6, data_format="channels_last"):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(normalized_shape))
        self.bias = nn.Parameter(torch.zeros(normalized_shape))
        self.eps = eps
        self.data_format = data_format
        if self.data_format not in ["channels_last", "channels_first"]:
            raise NotImplementedError 
        self.normalized_shape = (normalized_shape, )
    
    def forward(self, x):
        if self.data_format == "channels_last":
            return F.layer_norm(x, self.normalized_shape, self.weight, self.bias, self.eps)
        elif self.data_format == "channels_first":
            u = x.mean(1, keepdim=True)
            s = (x - u).pow(2).mean(1, keepdim=True)
            x = (x - u) / torch.sqrt(s + self.eps)
            x = self.weight[:, None, None] * x + self.bias[:, None, None]
            return x

class GRN(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.gamma = nn.Parameter(torch.zeros(1, 1, 1, dim))
        self.beta = nn.Parameter(torch.zeros(1, 1, 1, dim))

    def forward(self, x):
        Gx = torch.norm(x, p=2, dim=(1,2), keepdim=True)
        Nx = Gx / (Gx.mean(dim=-1, keepdim=True) + 1e-6)
        return self.gamma * (x * Nx) + self.beta + x

class Block(nn.Module):
    def __init__(self, dim, drop_path=0.):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        self.norm = LayerNorm(dim, eps=1e-6)
        self.pwconv1 = nn.Linear(dim, 4 * dim)
        self.act = nn.GELU()
        self.grn = GRN(4 * dim)
        self.pwconv2 = nn.Linear(4 * dim, dim)
        self.drop_path = DropPath(drop_path) if drop_path > 0. else nn.Identity()

    def forward(self, x):
        input = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)
        x = self.norm(x)
        x = self.pwconv1(x)
        x = self.act(x)
        x = self.grn(x)
        x = self.pwconv2(x)
        x = x.permute(0, 3, 1, 2)

        x = input + self.drop_path(x)
        return x

class ConvNeXtV2(nn.Module):
    def __init__(self, in_chans=3, num_classes=1000, 
                 depths=[3, 3, 9, 3], dims=[96, 192, 384, 768], 
                 drop_path_rate=0., head_init_scale=1.
                 ):
        super().__init__()
        self.depths = depths
        self.downsample_layers = nn.ModuleList()
        stem = nn.Sequential(
            nn.Conv2d(in_chans, dims[0], kernel_size=4, stride=4),
            LayerNorm(dims[0], eps=1e-6, data_format="channels_first")
        )
        self.downsample_layers.append(stem)
        for i in range(3):
            downsample_layer = nn.Sequential(
                    LayerNorm(dims[i], eps=1e-6, data_format="channels_first"),
                    nn.Conv2d(dims[i], dims[i+1], kernel_size=2, stride=2),
            )
            self.downsample_layers.append(downsample_layer)

        self.stages = nn.ModuleList()
        dp_rates=[x.item() for x in torch.linspace(0, drop_path_rate, sum(depths))] 
        cur = 0
        for i in range(4):
            stage = nn.Sequential(
                *[Block(dim=dims[i], drop_path=dp_rates[cur + j]) for j in range(depths[i])]
            )
            self.stages.append(stage)
            cur += depths[i]

        self.norm = nn.LayerNorm(dims[-1], eps=1e-6)
        self.head = nn.Linear(dims[-1], num_classes)

        self.apply(self._init_weights)
        self.channel = [i.size(1) for i in self.forward(torch.randn(1, 3, 640, 640))]

    def _init_weights(self, m):
        if isinstance(m, (nn.Conv2d, nn.Linear)):
            trunc_normal_(m.weight, std=.02)
            nn.init.constant_(m.bias, 0)

    def forward(self, x):
        res = []
        for i in range(4):
            x = self.downsample_layers[i](x)
            x = self.stages[i](x)
            res.append(x)
        return res

This program file is a PyTorch module that implements the ConvNeXt V2 model. ConvNeXt V2 is a convolutional neural network model for image classification tasks.

This program file contains the following classes and functions:

  1. LayerNorm class: Implements the LayerNorm layer that supports two data formats (channels_last and channels_first).

  2. GRN class: Implements the global response normalization (GRN) layer.

  3. Block class: implements the basic blocks of the ConvNeXtV2 model.

  4. ConvNeXtV2 class: Implements the ConvNeXt V2 model.

  5. update_weight function: used to update the weight of the model.

  6. convnextv2_atto function: Create a ConvNeXtV2 model instance and use atto configuration.

  7. convnextv2_femto function: Create a ConvNeXtV2 model instance and use femto configuration.

  8. convnextv2_pico function: Create a ConvNeXtV2 model instance and use pico configuration.

  9. convnextv2_nano function: Create a ConvNeXtV2 model instance using nano configuration.

  10. convnextv2_tiny function: Create a ConvNeXtV2 model instance using tiny configuration.

  11. convnextv2_base function: Create a ConvNeXtV2 model instance and use base configuration.

  12. convnextv2_large function: Create a ConvNeXtV2 model instance using large configuration.

  13. convnextv2_huge function: Create a ConvNeXtV2 model instance and use huge configuration.

The program file also contains some auxiliary functions and initialization functions.

The ConvNeXt V2 model is a deep convolutional neural network model with multiple residual blocks for image classification tasks. It uses some special layers and techniques, such as LayerNorm, GRN and DropPath, etc., to improve the performance and effect of the model. Different configurations control the depth and width of the model to suit different tasks and data sets.
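The factory functions and update_weight listed above are not shown in the excerpt; a sketch of how they are typically written, building on the ConvNeXtV2 class shown above, is given below (the atto depths/dims follow the published ConvNeXt V2 configurations, while the checkpoint key handling is an assumption):

import torch


def update_weight(model_dict, weight_dict):
    # Keep only pretrained weights whose name and shape match the current model
    matched = {k: v for k, v in weight_dict.items()
               if k in model_dict and model_dict[k].shape == v.shape}
    model_dict.update(matched)
    return model_dict


def convnextv2_atto(weights='', **kwargs):
    # Smallest configuration; the other variants differ only in depths/dims
    model = ConvNeXtV2(depths=[2, 2, 6, 2], dims=[40, 80, 160, 320], **kwargs)
    if weights:
        ckpt = torch.load(weights)['model']
        model.load_state_dict(update_weight(model.state_dict(), ckpt))
    return model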

5.6 backbone\CSwomTramsformer.py
import torch
import torch.nn as nn
from timm.models.layers import trunc_normal_

# PatchEmbed and CSWinBlock are assumed to be defined earlier in this file (the excerpt shows only the top-level model class)

class CSWinTransformer(nn.Module):
    def __init__(self, img_size=224, patch_size=4, in_chans=3, num_classes=1000, embed_dim=96, depths=[2, 2, 6, 2], num_heads=[3, 6, 12, 24], mlp_ratio=4., qkv_bias=True, qk_scale=None, drop_rate=0., attn_drop_rate=0., drop_path_rate=0., norm_layer=nn.LayerNorm):
        super().__init__()
        self.num_classes = num_classes
        self.depths = depths
        self.num_features = self.embed_dim = embed_dim

        self.patch_embed = PatchEmbed(
            img_size=img_size, patch_size=patch_size, in_chans=in_chans, embed_dim=embed_dim)
        self.pos_drop = nn.Dropout(p=drop_rate)

        dpr = [x.item() for x in torch.linspace(0, drop_path_rate, sum(depths))]  # stochastic depth decay rule
        self.blocks = nn.ModuleList([
            CSWinBlock(
                dim=embed_dim, reso=img_size // patch_size, num_heads=num_heads[i], mlp_ratio=mlp_ratio,
                qkv_bias=qkv_bias, qk_scale=qk_scale, drop=drop_rate, attn_drop=attn_drop_rate,
                drop_path=dpr[sum(depths[:i]):sum(depths[:i + 1])], norm_layer=norm_layer,
                last_stage=(i == len(depths) - 1))
            for i in range(len(depths))])

        self.norm = norm_layer(embed_dim)

        self.head = nn.Linear(embed_dim, num_classes) if num_classes > 0 else nn.Identity()

        trunc_normal_(self.head.weight, std=0.02)
        self.apply(self._init_weights)

    def _init_weights(self, m):
        if isinstance(m, nn.Linear):
            trunc_normal_(m.weight, std=.02)
            if isinstance(m, nn.Linear) and m.bias is not None:
                nn.init.constant_(m.bias, 0)
        elif isinstance(m, nn.LayerNorm):
            nn.init.constant_(m.bias, 0)
            nn.init.constant_(m.weight, 1.0)

    def get_classifier(self):
        return self.head

    def reset_classifier(self, num_classes, global_pool=''):
        self.num_classes = num_classes
        self.head = nn.Linear(self.embed_dim, num_classes) if num_classes > 0 else nn.Identity()

    def forward_features(self, x):
        x = self.patch_embed(x)
        x = self.pos_drop(x)

        for blk in self.blocks:
            x = blk(x)

        x = self.norm(x)
        return x

    def forward(self, x):
        x = self.forward_features(x)
        x = self.head(x[:, 0])
        return x

This file implements the CSWin Transformer model. CSWin Transformer is a model for image classification tasks that uses the CSWin (Cross-Shaped Window) self-attention structure to process image data. The model consists of multiple CSWinBlocks, each containing a LePEAttention module and an MLP module: the LePEAttention module computes the relationships between features at different locations in the image, and the MLP module applies a nonlinear transformation to the features. The file also contains auxiliary functions for converting images into window-form feature representations and converting them back to image form. Finally, the model includes a Merge_Block module that halves the size of the feature map between stages.
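For completeness, the Merge_Block mentioned above works roughly as in the official CSWin implementation: it reshapes the token sequence back to a 2D map, halves the spatial resolution with a stride-2 convolution, and flattens it again. A sketch:

import numpy as np
import torch
import torch.nn as nn


class Merge_Block(nn.Module):
    """Downsample between CSWin stages: halves H and W and increases the channel dimension."""
    def __init__(self, dim, dim_out, norm_layer=nn.LayerNorm):
        super().__init__()
        self.conv = nn.Conv2d(dim, dim_out, 3, 2, 1)
        self.norm = norm_layer(dim_out)

    def forward(self, x):                                    # x: (B, H*W, C) token sequence
        B, HW, C = x.shape
        H = W = int(np.sqrt(HW))
        x = x.transpose(-2, -1).contiguous().view(B, C, H, W)
        x = self.conv(x)                                      # spatial size halved
        B, C, H, W = x.shape
        x = x.view(B, C, -1).transpose(-2, -1).contiguous()
        return self.norm(x)

For example, Merge_Block(96, 192)(torch.randn(2, 56 * 56, 96)) yields a tensor of shape (2, 28 * 28, 192).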

6. Overall structure of the system

The following is a breakdown of the functions of each file:

File path | Function
export.py | Export the model to files in different formats, such as CoreML, TensorRT, TensorFlow SavedModel, etc.
predict.py | Use the model for object detection predictions
train.py | Train the object detection model
ui.py | Graphical user interface for object detection and image segmentation tasks
backbone\convnextv2.py | Definition and configuration of the ConvNeXtV2 model
backbone\CSwomTramsformer.py | Definition and configuration of the CSWin Transformer model
backbone\EfficientFormerV2.py | Definition and configuration of the EfficientFormerV2 model
backbone\efficientViT.py | Definition and configuration of the efficientViT model
backbone\fasternet.py | Definition and configuration of the Fasternet model
backbone\lsknet.py | Definition and configuration of the lsknet model
backbone\repvit.py | Definition and configuration of the RepViT model
backbone\revcol.py | Definition and configuration of the RevCol model
backbone\SwinTransformer.py | Definition and configuration of the SwinTransformer model
backbone\VanillaNet.py | Definition and configuration of the VanillaNet model
extra_modules\orepa.py | Definition and configuration of the OREPA module
extra_modules\rep_block.py | Definition and configuration of the rep_block module
extra_modules\RFAConv.py | Definition and configuration of the RFAConv module
extra_modules\__init__.py | Initialization file for the extra_modules package
extra_modules\ops_dcnv3\setup.py | Installation script for the ops_dcnv3 module
extra_modules\ops_dcnv3\test.py | Test script for the ops_dcnv3 module
extra_modules\ops_dcnv3\functions\dcnv3_func.py | Function definitions for the ops_dcnv3 module
extra_modules\ops_dcnv3\functions\__init__.py | Initialization file for the ops_dcnv3 functions package
extra_modules\ops_dcnv3\modules\dcnv3.py | Module definitions for ops_dcnv3
extra_modules\ops_dcnv3\modules\__init__.py | Initialization file for the ops_dcnv3 modules package
models\common.py | Common model definitions and functions
models\experimental.py | Experimental model definitions and functions
models\tf.py | TensorFlow model definitions and functions
models\yolo.py | YOLO model definitions and functions
models\__init__.py | Initialization file for the models package
segment\predict.py | Use the model for image segmentation predictions
segment\train.py | Train the image segmentation model

7.YOLOv8 Introduction

YOLO (You Only Look Once) is a popular object detection and image segmentation model developed by Joseph Redmon and Ali Farhadi at the University of Washington. YOLO was launched in 2015 and quickly became popular due to its high speed and accuracy.

YOLOv2, released in 2016, improved the original model by incorporating batch normalization, anchor boxes, and dimension clustering.
YOLOv3, launched in 2018, further enhanced the performance of the model using a more efficient backbone network, multiple anchors, and spatial pyramid pooling.
YOLOv4, released in 2020, introduced innovations such as Mosaic data augmentation, a new anchor-free detection head, and a new loss function.
YOLOv5 further improved the model's performance and added new features such as hyperparameter optimization, integrated experiment tracking, and automatic export to popular export formats.
YOLOv6 was open sourced by Meituan in 2022 and is currently used in many of the company's autonomous delivery robots.
YOLOv7 adds additional tasks on the COCO keypoint dataset, such as pose estimation.
YOLOv8 is the latest version of YOLO launched by Ultralytics. As a cutting-edge, state-of-the-art (SOTA) model, YOLOv8 builds on the success of previous versions by introducing new features and improvements to enhance performance, flexibility and efficiency. YOLOv8 supports a full range of visual AI tasks, including detection, segmentation, pose estimation, tracking and classification. This versatility allows users to leverage YOLOv8's capabilities across different applications and domains.
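As a quick illustration of this versatility, running a pretrained YOLOv8 segmentation model takes only a few lines with the Ultralytics API (the weights and image path below are placeholders):

from ultralytics import YOLO

model = YOLO('yolov8n-seg.pt')                    # smallest official segmentation weights
results = model('remote_sensing_sample.jpg')      # placeholder image path
for r in results:
    print(f'{len(r.boxes)} objects detected, masks available: {r.masks is not None}')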

New features and available models of YOLOv8

Ultralytics did not name the open-source library YOLOv8 directly; instead it simply used the word ultralytics, because it positions the library as an algorithm framework rather than a specific algorithm, with extensibility as one of its main features. The intention is that the library can be used not only for the YOLO series but also for non-YOLO models and for tasks such as classification, segmentation and pose estimation. In summary, the two main advantages of the ultralytics open-source library are:

It integrates many current SOTA techniques in one place;
It will support other YOLO versions, and algorithms beyond YOLO, in the future.

Ultralytics has released a new repository for the YOLO models, built as a unified framework for training object detection, instance segmentation and image classification models.

New SOTA models are provided, including P5 (640) and P6 (1280) resolution object detection networks and a YOLACT-based instance segmentation model. As with YOLOv5, models of different sizes (N/S/M/L/X) are provided via scaling factors to meet the needs of different scenarios.
The backbone and Neck probably draw on the ELAN design idea of YOLOv7, replacing the C3 structure of YOLOv5 with the C2f structure (sketched below), which has richer gradient flow, and adjusting the channel numbers for models of different scales. This is a careful fine-tuning of the model structure rather than applying one set of parameters to all models, which noticeably improves performance; however, operations such as Split in the C2f module are less friendly to certain hardware deployments than before.
Compared with YOLOv5, the Head has changed substantially: it now uses the mainstream decoupled-head structure, separating the classification and regression branches, and it switches from anchor-based to anchor-free detection.
For the loss computation, the TaskAlignedAssigner positive-sample assignment strategy is adopted, and Distribution Focal Loss is introduced.
For training-time data augmentation, the practice from YOLOX of disabling Mosaic augmentation during the last 10 epochs is adopted, which effectively improves accuracy.
YOLOv8 also supports multiple export formats efficiently and flexibly, and the model can run on both CPU and GPU. For each of the detection, segmentation and classification tasks there are five YOLOv8 models; YOLOv8 Nano (YOLOv8n) is the fastest and smallest, while YOLOv8 Extra Large (YOLOv8x) is the most accurate but slowest of them all.
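For reference, the C2f block mentioned above looks roughly as follows. This mirrors the Ultralytics implementation (Conv and Bottleneck are imported from ultralytics.nn.modules, where Conv is convolution + BN + SiLU) and is shown here only for illustration:

import torch
import torch.nn as nn
from ultralytics.nn.modules import Conv, Bottleneck


class C2f(nn.Module):
    """CSP bottleneck with two convolutions and richer gradient flow."""
    def __init__(self, c1, c2, n=1, shortcut=False, g=1, e=0.5):
        super().__init__()
        self.c = int(c2 * e)                              # hidden channels
        self.cv1 = Conv(c1, 2 * self.c, 1, 1)
        self.cv2 = Conv((2 + n) * self.c, c2, 1)
        self.m = nn.ModuleList(
            Bottleneck(self.c, self.c, shortcut, g, k=((3, 3), (3, 3)), e=1.0) for _ in range(n))

    def forward(self, x):
        y = list(self.cv1(x).chunk(2, 1))                 # split into two branches
        y.extend(m(y[-1]) for m in self.m)                # keep the output of every bottleneck
        return self.cv2(torch.cat(y, 1))                  # concatenate all branches and fuse

The key difference from C3 is that the output of every Bottleneck is kept and concatenated, which is what gives the richer gradient flow.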


8.Basic principles of FocalModulation model

According to the referenced blog, the basic principle of Focal Modulation Networks (FocalNets) is to replace the self-attention module with a focal modulation mechanism that captures long-range dependencies and contextual information in the image. The figure below compares self-attention and focal modulation.

(Figure: comparison of self-attention and focal modulation.)

Self-attention requires complex query-key interactions and query-value aggregations between each query token and the other tokens to compute attention scores and capture context. Focal modulation, on the other hand, first aggregates spatial context at different granularities into modulators, and then injects these modulators into the query tokens in a query-dependent manner. Focal modulation therefore simplifies the interaction and aggregation operations, making them more lightweight. In the figure, the self-attention part uses red dashed lines for query-key interactions and yellow dashed lines for query-value aggregation, while the focal modulation part uses blue for modulator aggregation and yellow for query-modulator interactions.

The FocalModulation model is implemented through the following steps:

  1. Focal contextualization: stack depth-wise convolutional layers to encode visual context over different ranges.

  2. Gated aggregation: Selectively aggregate contextual information into modulators for each query token through a gating mechanism.

  3. Element-wise affine transformation: The aggregated modulator is injected into each query token via an affine transformation.

Let us introduce these three mechanisms in turn.

Focal contextualization

Focal contextualization is a component of focal modulation. It uses a series of depth-wise convolutional layers to encode visual context information at different scales. These layers capture visual features from near to far, allowing the network to understand image content at different levels. In this way, the network can aggregate contextual information while remaining sensitive to local details, and its awareness of global structure is enhanced.

(Figure: context aggregation in self-attention versus focal modulation.)

This figure compares the mechanisms of self-attention (SA) and focal modulation in detail, and in particular shows the context aggregation process in focal modulation. The diagram on the left shows how self-attention generates its output through interactions between keys (k) and queries (q) and a subsequent aggregation. The middle and right diagrams illustrate how focal modulation replaces this with hierarchical context aggregation and gated aggregation. In focal modulation, the input first passes through a lightweight linear layer, then through hierarchical contextualization modules and a gating mechanism that selectively aggregates information, and finally through a modulator that interacts with the query (q) to produce the output.

Gated aggregation

"Gated Aggregation" in Focal Modulation Networks (FocalNets) is one of the key components. This process involves using gating mechanisms to selectively aggregate contextual information. Here is a detailed analysis of this process:

  1. What is a gating mechanism?
    Gating mechanisms are often used in deep learning to control the flow of information. It is often used to decide which information should be passed on and which should be blocked. In recurrent neural networks (RNN), specifically in long short-term memory networks (LSTM) and gated recurrent units (GRU), gating mechanisms are used to regulate the flow of information in time series data.

  2. The purpose of gated aggregation
    In FocalNets, the purpose of gated aggregation is to selectively aggregate contextual information for each query token (i.e., the data unit under processing). This means that the network is able to decide which specific contextual information is important for the query token currently being processed, thus focusing on those that are most relevant.

  3. How to implement gated aggregation?
    Implementing gated aggregation may involve a series of computational steps, including:

Compute contextual information: this may involve encoding different regions of the input image using depth-wise convolutional layers (as described above) to capture visual context from local to global.
Gating operation: This step involves a decision-making process to decide which contextual information is relevant based on the characteristics of the current query token. This may be achieved through a learned weight (gate) that determines the importance of different contextual information.
Information aggregation: Finally, context information is selectively aggregated into a modulator based on the results of the gating operation. This modulator is then used to adjust or "modulate" the representation of the query token.
  4. Benefits of gated aggregation
    Through gated aggregation, FocalNets can focus more effectively on the information that is most critical for the current task. This improves model efficiency and performance because it reduces the processing of unnecessary information while enhancing the focus on key features. In vision tasks, this can mean better object detection and image classification performance, especially in complex or changing visual environments.

Summary: Gated aggregation is a core component of FocalNets, which improves the efficiency and performance of the network by selectively focusing on important contextual information.

Element-wise affine transformation

The third key component in Focal Modulation Networks (FocalNets) is element-wise affine transformation, which involves injecting modulators obtained by gated aggregation into each query token. Here's a detailed breakdown of the process:

  1. The basic concept of affine transformation:
    An affine transformation is a linear transformation used to perform operations such as scaling, rotation, translation, and shearing on data. In deep learning, an element-wise affine transformation usually refers to a linear transformation applied to each element, which can be described as y = ax + b, where x is the input, y is the output, and a and b are the transformation parameters.

  2. The role of element-wise affine transformation:
    In FocalNets, the role of element-wise affine transformation is to inject aggregated modulator information into each query token. This step is important to integrate contextual information and the original characteristics of the query token. In this way, the contextual information contained by the modulator can directly influence the representation of the query token.

  3. Perform an affine transformation:
    When performing this step, the aggregated modulator performs an element-wise affine transformation on each query token. In practice, this might mean applying the corresponding weight (a) and bias (b) in the modulator to each feature of the query token. In this way, each element in the modulator directly corresponds to a feature of the query token, and adjusting these features changes its expression.

  4. Effect of affine transformation:
    Through element-by-element affine transformation, the model is able to more finely tune the characteristics of each query token, enhancing or suppressing certain features based on contextual information. This fine adjustment mechanism allows the network to better adapt to complex visual scenes and improve the ability to capture details, thereby improving the model's performance in various visual tasks, such as target detection and image classification.

Summary: Element-wise affine transformation enables the model to leverage contextual information to effectively adjust query tokens, enhancing the model's ability to capture and express key visual features.
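To make the three steps concrete, here is a minimal PyTorch sketch of a focal modulation layer, simplified from the FocalNets idea described above. The focal_window and focal_level values are illustrative, and details such as dropout and optional normalization of the modulator are omitted, so this is a sketch rather than the exact implementation used in the project:

import torch
import torch.nn as nn

class FocalModulation(nn.Module):
    """Minimal sketch of a focal modulation layer (simplified from the FocalNets idea)."""
    def __init__(self, dim, focal_window=3, focal_level=2, focal_factor=2):
        super().__init__()
        self.focal_level = focal_level
        # one linear layer produces the query, the context and (focal_level + 1) gates
        self.f = nn.Linear(dim, 2 * dim + (focal_level + 1))
        self.h = nn.Conv2d(dim, dim, kernel_size=1)        # modulator projection
        self.proj = nn.Linear(dim, dim)                    # output projection
        self.act = nn.GELU()
        # 1) focal contextualization: stacked depth-wise convolutions with growing kernels
        self.focal_layers = nn.ModuleList()
        for k in range(focal_level):
            ks = focal_factor * k + focal_window
            self.focal_layers.append(nn.Sequential(
                nn.Conv2d(dim, dim, kernel_size=ks, padding=ks // 2, groups=dim, bias=False),
                nn.GELU()))

    def forward(self, x):                                  # x: (B, H, W, C)
        C = x.shape[-1]
        x = self.f(x).permute(0, 3, 1, 2).contiguous()     # to (B, 2C + L + 1, H, W)
        q, ctx, gates = torch.split(x, (C, C, self.focal_level + 1), dim=1)

        # 2) gated aggregation of hierarchical context into a single modulator
        ctx_all = 0
        for level, layer in enumerate(self.focal_layers):
            ctx = layer(ctx)
            ctx_all = ctx_all + ctx * gates[:, level:level + 1]
        ctx_global = self.act(ctx.mean(dim=(2, 3), keepdim=True))
        ctx_all = ctx_all + ctx_global * gates[:, self.focal_level:]

        # 3) element-wise modulation of the query, then output projection
        modulator = self.h(ctx_all)
        out = (q * modulator).permute(0, 2, 3, 1).contiguous()
        return self.proj(out)

A quick shape check: FocalModulation(64)(torch.randn(1, 32, 32, 64)) returns a tensor of the same shape, with each query channel scaled element-wise by its aggregated modulator.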

9. Visual analysis of training results

Evaluation index

Training losses: train/box_loss, train/seg_loss, train/obj_loss, train/cls_loss
Box (B) metrics: metrics/precision(B), metrics/recall(B), metrics/mAP_0.5(B), metrics/mAP_0.5:0.95(B)
Mask (M) metrics: metrics/precision(M), metrics/recall(M), metrics/mAP_0.5(M), metrics/mAP_0.5:0.95(M)
Validation losses: val/box_loss, val/seg_loss, val/obj_loss, val/cls_loss
Learning rates: x/lr0, x/lr1, x/lr2

Visualization of training results

To analyze this data, we can create visualizations that track the progress of these metrics and losses over time. We will focus on the following key aspects:

Loss over Epochs: Observe how the model's training and validation losses decrease over time.
Precision and Recall: Evaluate the model's performance for bounding boxes (B) and masks (M).
mAP (Mean Average Precision): Evaluates the overall performance of the model in detecting objects under different thresholds.
Learning rate changes: Understand how the learning rate changes over time.
Let's start by visualizing these aspects.

import matplotlib.pyplot as plt
import pandas as pd

# Load the training log; the path is an assumption, adjust it to your runs directory
data = pd.read_csv('runs/train/exp/results.csv')
data.columns = data.columns.str.strip()  # column names in results.csv are padded with spaces

# Setting up the plots
fig, axes = plt.subplots(nrows=3, ncols=2, figsize=(15, 15))
fig.tight_layout(pad=6.0)

# Plotting Training and Validation Losses
axes[0, 0].plot(data['epoch'], data['train/box_loss'], label='Train Box Loss')
axes[0, 0].plot(data['epoch'], data['train/seg_loss'], label='Train Segmentation Loss')
axes[0, 0].plot(data['epoch'], data['train/obj_loss'], label='Train Object Loss')
axes[0, 0].plot(data['epoch'], data['train/cls_loss'], label='Train Class Loss')
axes[0, 0].plot(data['epoch'], data['val/box_loss'], label='Validation Box Loss', linestyle='dashed')
axes[0, 0].plot(data['epoch'], data['val/seg_loss'], label='Validation Segmentation Loss', linestyle='dashed')
axes[0, 0].plot(data['epoch'], data['val/obj_loss'], label='Validation Object Loss', linestyle='dashed')
axes[0, 0].plot(data['epoch'], data['val/cls_loss'], label='Validation Class Loss', linestyle='dashed')
axes[0, 0].set_title('Training & Validation Losses over Epochs')
axes[0, 0].set_xlabel('Epoch')
axes[0, 0].set_ylabel('Loss')
axes[0, 0].legend()

# Plotting Precision and Recall for bounding boxes (B)
axes[0, 1].plot(data['epoch'], data['metrics/precision(B)'], label='Precision (Box)')
axes[0, 1].plot(data['epoch'], data['metrics/recall(B)'], label='Recall (Box)')
axes[0, 1].set_title('Precision & Recall for Bounding Boxes')
axes[0, 1].set_xlabel('Epoch')
axes[0, 1].set_ylabel('Metric Value')
axes[0, 1].legend()

# Plotting Precision and Recall for masks (M)
axes[1, 0].plot(data['epoch'], data['metrics/precision(M)'], label='Precision (Mask)')
axes[1, 0].plot(data['epoch'], data['metrics/recall(M)'], label='Recall (Mask)')
axes[1, 0].set_title('Precision & Recall for Masks')
axes[1, 0].set_xlabel('Epoch')
axes[1, 0].set_ylabel('Metric Value')
axes[1, 0].legend()

# Plotting mAP for boxes and masks
axes[1, 1].plot(data['epoch'], data['metrics/mAP_0.5(B)'], label='mAP_0.5 (Box)')
axes[1, 1].plot(data['epoch'], data['metrics/mAP_0.5:0.95(B)'], label='mAP_0.5:0.95 (Box)')
axes[1, 1].plot(data['epoch'], data['metrics/mAP_0.5(M)'], label='mAP_0.5 (Mask)', linestyle='dashed')
axes[1, 1].plot(data['epoch'], data['metrics/mAP_0.5:0.95(M)'], label='mAP_0.5:0.95 (Mask)', linestyle='dashed')
axes[1, 1].set_title('mAP for Bounding Boxes and Masks')
axes[1, 1].set_xlabel('Epoch')
axes[1, 1].set_ylabel('mAP Value')
axes[1, 1].legend()

# Plotting Learning Rates
axes[2, 0].plot(data['epoch'], data['x/lr0'], label='Learning Rate 0')
axes[2, 0].plot(data['epoch'], data['x/lr1'], label='Learning Rate 1')
axes[2, 0].plot(data['epoch'], data['x/lr2'], label='Learning Rate 2')
axes[2, 0].set_title('Learning Rates over Epochs')
axes[2, 0].set_xlabel('Epoch')
axes[2, 0].set_ylabel('Learning Rate')
axes[2, 0].legend()

# Adjusting layout for better visualization
plt.subplots_adjust(top=0.92, bottom=0.08, left=0.10, right=0.95, hspace=0.25, wspace=0.35)

# Show plot
plt.show()

(Figure: training curves generated by the code above.)

Result analysis

Training loss: The training losses (box loss, segmentation loss, object loss, and classification loss) generally show a downward trend, indicating that the model is learning effectively over the course of training.

Validation loss: Validation loss follows a similar trend as training loss. This indicates that the model is not significantly overfitting the training data.

Bounding Box (B) Metrics: Precision and recall of bounding boxes show different trends. High precision indicates that the model correctly identified most bounding boxes, while recall indicates its ability to detect all relevant cases. A trade-off between these two metrics can be observed.

Bounding Box (B) mAP: The mAP (Mean Average Precision) of the bounding boxes at different IoU (Intersection over Union) thresholds (0.5 and 0.5:0.95) shows the model's accuracy in detecting objects with bounding boxes. The mAP at 0.5:0.95 is particularly informative because it is a more stringent metric, requiring the model to remain accurate across a range of IoU thresholds.

Mask (M) Metric: Similar to bounding boxes, the precision and recall of masks are critical to understanding the model’s segmentation performance.

Mask (M) mAP: The mAP of the masks further indicates how well the model segments objects, with higher values indicating better performance.

Learning rate: The learning rate graph shows how the learning rate adjusts over time. These adjustments are critical for efficient training, allowing the model to learn quickly initially and then refine its learning as it converges.

This comprehensive analysis provides a detailed understanding of the model's performance in different aspects.

10. System integration

The figure below shows the complete package: source code, dataset, environment deployment video tutorial and custom UI interface.

(Figure: system interface screenshot.)

Reference blog "Remote Sensing Image Segmentation System: Fusion of Spatial Pyramid Pooling (FocalModulation) to Improve YOLOv8"


Source: https://blog.csdn.net/cheng2333333/article/details/135369242