1. Research background and significance
Project reference: AAAI (Association for the Advancement of Artificial Intelligence)
Remote sensing image segmentation is an important research direction in the field of remote sensing technology. Its goal is to effectively segment and identify different features or types of features in remote sensing images. With the continuous development of remote sensing technology and the large-scale acquisition of remote sensing image data, remote sensing image segmentation has broad application prospects in agriculture, urban planning, environmental monitoring and other fields.
However, due to the particularity of remote sensing images, such as high image resolution, complex lighting conditions, and diverse types of ground objects, traditional image segmentation methods face some challenges when processing remote sensing images. Therefore, proposing an efficient and accurate remote sensing image segmentation system is of great significance for realizing automated processing of remote sensing images.
In recent years, deep learning technology has achieved remarkable results in the field of image segmentation. Among them, the image segmentation method based on Convolutional Neural Network (CNN) has been widely used. However, the traditional CNN method has some problems when processing remote sensing images, such as poor detection of small targets and limited ability to extract detailed information in remote sensing images.
To address these problems, researchers have proposed an improved remote sensing image segmentation system: YOLOv8 improved by fusing spatial pyramid pooling (Focal Modulation). The system combines spatial pyramid pooling with the YOLOv8 model and performs multi-scale feature extraction and pooling on the image to improve the accuracy and robustness of remote sensing image segmentation.
Specifically, the system includes the following key steps:
First, target detection is performed on remote sensing images through the YOLOv8 model to obtain candidate target areas in the image.
Then, spatial pyramid pooling technology is used to perform multi-scale feature extraction and pooling operations on the candidate target areas to obtain feature maps of different scales.
Next, the feature maps of different scales are fused to obtain the fused feature map.
Finally, the fused feature map is used for target segmentation to obtain segmentation results of different features or categories of features in remote sensing images.
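The multi-scale pooling step can be illustrated with a small sketch. The code below is a pure-Python illustration of classic spatial pyramid pooling on a single-channel feature map; it is not the system's actual implementation, which would operate on tensors inside the network. The function name `spatial_pyramid_pool` and the pyramid levels are assumptions chosen for illustration.

```python
from typing import List, Sequence


def spatial_pyramid_pool(fmap: List[List[float]], levels: Sequence[int] = (1, 2, 4)) -> List[float]:
    """Max-pool a 2D feature map over an n x n grid for each pyramid level.

    Returns one fixed-length vector (sum of n*n over all levels),
    regardless of the input height and width.
    """
    h, w = len(fmap), len(fmap[0])
    pooled = []
    for n in levels:
        for by in range(n):
            # Row range of this bin: floor for the start, ceil for the end,
            # so the bins together cover the whole map.
            y0 = (by * h) // n
            y1 = max(y0 + 1, -(-((by + 1) * h) // n))
            for bx in range(n):
                x0 = (bx * w) // n
                x1 = max(x0 + 1, -(-((bx + 1) * w) // n))
                pooled.append(max(fmap[y][x] for y in range(y0, y1) for x in range(x0, x1)))
    return pooled


fmap = [[1, 2, 3, 4],
        [5, 6, 7, 8],
        [9, 10, 11, 12],
        [13, 14, 15, 16]]
print(spatial_pyramid_pool(fmap, levels=(1, 2)))  # [16, 6, 8, 14, 16]
```

The first value is the global maximum (level 1) and the next four are the maxima of the 2 x 2 grid cells (level 2). Concatenating the levels is the simplest form of the fusion step: coarse and fine context end up in one vector whose length does not depend on the input size.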
The system has the following advantages:
First, spatial pyramid pooling extracts image features at several scales, which improves the accuracy of remote sensing image segmentation.
Second, using the YOLOv8 model for target detection helps detect small targets in remote sensing images and improves the detection capability of the system.
In addition, the system is robust: it can segment images stably under complex lighting conditions and with diverse ground object categories.
In summary, fusing spatial pyramid pooling into YOLOv8 is significant for improving the accuracy and robustness of remote sensing image segmentation. The research results of this system will provide technical support for the automated processing of remote sensing images and promote the application of remote sensing technology in agriculture, urban planning, environmental monitoring and other fields.
2. Picture demonstration
3. Video demonstration
4. Collection, labeling and organization of data sets
Collection of pictures
First, we need to collect the required images. This can be done in different ways, for example by using an existing public dataset such as YGDatasets.
EISeg is a graphical image annotation tool that supports the COCO and YOLO formats. The steps for annotating images in COCO format with EISeg are:
(1) Download and install EISeg.
(2) Open EISeg and select "Open Dir" to choose your picture directory.
(3) Set a label name for your target object.
(4) Draw a rectangular frame on the picture and select the corresponding label.
(5) Save the annotation information, which generates a JSON file with the same name as the image in the image directory.
(6) Repeat this process until all pictures are labeled.
Since YOLO uses txt-format annotations, we need to convert the JSON annotations to YOLO format. This can be done with various conversion tools or scripts.
Here is a simple way to do it with a Python script that reads the JSON files and converts them to the txt format required by YOLO.
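import contextlib
import glob
import json
import os
from collections import defaultdict
from pathlib import Path
import cv2
import numpy as np
import pandas as pd
from PIL import Image
from tqdm import tqdm
from utils import *  # project helpers used below, e.g. make_dirs, exif_size, split_files, write_data_data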
# Convert INFOLKS JSON file into YOLO-format labels ----------------------------
def convert_infolks_json(name, files, img_path):
    # Create folders
    path = make_dirs()
    # Import json
    data = []
    for file in glob.glob(files):
        with open(file) as f:
            jdata = json.load(f)
            jdata['json_file'] = file
            data.append(jdata)
    # Write images and shapes
    name = path + os.sep + name
    file_id, file_name, wh, cat = [], [], [], []
    for x in tqdm(data, desc='Files and Shapes'):
        f = glob.glob(img_path + Path(x['json_file']).stem + '.*')[0]
        file_name.append(f)
        wh.append(exif_size(Image.open(f)))  # (width, height)
        cat.extend(a['classTitle'].lower() for a in x['output']['objects'])  # categories
        # filename
        with open(name + '.txt', 'a') as file:
            file.write('%s\n' % f)
    # Write *.names file
    names = sorted(np.unique(cat))
    # names.pop(names.index('Missing product'))  # remove
    with open(name + '.names', 'a') as file:
        [file.write('%s\n' % a) for a in names]
    # Write labels file
    for i, x in enumerate(tqdm(data, desc='Annotations')):
        label_name = Path(file_name[i]).stem + '.txt'
        with open(path + '/labels/' + label_name, 'a') as file:
            for a in x['output']['objects']:
                # if a['classTitle'] == 'Missing product':
                #     continue  # skip
                category_id = names.index(a['classTitle'].lower())
                # The INFOLKS bounding box format is [x-min, y-min, x-max, y-max]
                box = np.array(a['points']['exterior'], dtype=np.float32).ravel()
                box[[0, 2]] /= wh[i][0]  # normalize x by width
                box[[1, 3]] /= wh[i][1]  # normalize y by height
                box = [box[[0, 2]].mean(), box[[1, 3]].mean(), box[2] - box[0], box[3] - box[1]]  # xywh
                if (box[2] > 0.) and (box[3] > 0.):  # if w > 0 and h > 0
                    file.write('%g %.6f %.6f %.6f %.6f\n' % (category_id, *box))
    # Split data into train, test, and validate files
    split_files(name, file_name)
    write_data_data(name + '.data', nc=len(names))
    print(f'Done. Output saved to {os.getcwd() + os.sep + path}')
# Convert vott JSON file into YOLO-format labels -------------------------------
def convert_vott_json(name, files, img_path):
    # Create folders
    path = make_dirs()
    name = path + os.sep + name
    # Import json
    data = []
    for file in glob.glob(files):
        with open(file) as f:
            jdata = json.load(f)
            jdata['json_file'] = file
            data.append(jdata)
    # Get all categories
    file_name, wh, cat = [], [], []
    for i, x in enumerate(tqdm(data, desc='Files and Shapes')):
        with contextlib.suppress(Exception):
            cat.extend(a['tags'][0] for a in x['regions'])  # categories
    # Write *.names file
    names = sorted(pd.unique(cat))
    with open(name + '.names', 'a') as file:
        [file.write('%s\n' % a) for a in names]
    # Write labels file
    n1, n2 = 0, 0
    missing_images = []
    for i, x in enumerate(tqdm(data, desc='Annotations')):
        f = glob.glob(img_path + x['asset']['name'] + '.jpg')
        if len(f):
            f = f[0]
            file_name.append(f)
            wh = exif_size(Image.open(f))  # (width, height)
            n1 += 1
            if (len(f) > 0) and (wh[0] > 0) and (wh[1] > 0):
                n2 += 1
                # append filename to list
                with open(name + '.txt', 'a') as file:
                    file.write('%s\n' % f)
                # write labelsfile
                label_name = Path(f).stem + '.txt'
                with open(path + '/labels/' + label_name, 'a') as file:
                    for a in x['regions']:
                        category_id = names.index(a['tags'][0])
                        # The VoTT bounding box is given as left, top, width, height
                        box = a['boundingBox']
                        # float dtype so the in-place division below also works for integer pixel values
                        box = np.array([box['left'], box['top'], box['width'], box['height']], dtype=np.float64).ravel()
                        box[[0, 2]] /= wh[0]  # normalize x by width
                        box[[1, 3]] /= wh[1]  # normalize y by height
                        box = [box[0] + box[2] / 2, box[1] + box[3] / 2, box[2], box[3]]  # xywh
                        if (box[2] > 0.) and (box[3] > 0.):  # if w > 0 and h > 0
                            file.write('%g %.6f %.6f %.6f %.6f\n' % (category_id, *box))
        else:
            missing_images.append(x['asset']['name'])
    print('Attempted %g json imports, found %g images, imported %g annotations successfully' % (len(data), n1, n2))
    if len(missing_images):
        print('WARNING, missing images:', missing_images)
    # Split data into train, test, and validate files
    split_files(name, file_name)
    print(f'Done. Output saved to {os.getcwd() + os.sep + path}')
# Convert ath JSON file into YOLO-format labels --------------------------------
def convert_ath_json(json_dir):  # dir contains json annotations and images
    # Create folders
    dir = make_dirs()  # output directory
    jsons = []
    for dirpath, dirnames, filenames in os.walk(json_dir):
        jsons.extend(
            os.path.join(dirpath, filename)
            for filename in [
                f for f in filenames if f.lower().endswith('.json')
            ]
        )
    # Import json
    n1, n2, n3 = 0, 0, 0
    missing_images, file_name = [], []
    for json_file in sorted(jsons):
        with open(json_file) as f:
            data = json.load(f)
        # # Get classes
        # try:
        #     classes = list(data['_via_attributes']['region']['class']['options'].values())  # classes
        # except:
        #     classes = list(data['_via_attributes']['region']['Class']['options'].values())  # classes
        # # Write *.names file
        # names = pd.unique(classes)  # preserves sort order
        # with open(dir + 'data.names', 'w') as f:
        #     [f.write('%s\n' % a) for a in names]
        # Write labels file
        for x in tqdm(data['_via_img_metadata'].values(), desc=f'Processing {json_file}'):
            image_file = str(Path(json_file).parent / x['filename'])
            f = glob.glob(image_file)  # image file
            if len(f):
                f = f[0]
                file_name.append(f)
                wh = exif_size(Image.open(f))  # (width, height)
                n1 += 1  # all images
                if len(f) > 0 and wh[0] > 0 and wh[1] > 0:
                    label_file = dir + 'labels/' + Path(f).stem + '.txt'
                    nlabels = 0
                    try:
                        with open(label_file, 'a') as file:  # write labelsfile
                            # try:
                            #     category_id = int(a['region_attributes']['class'])
                            # except:
                            #     category_id = int(a['region_attributes']['Class'])
                            category_id = 0  # single-class
                            for a in x['regions']:
                                # shape_attributes gives the box as x, y (top-left corner), width, height
                                box = a['shape_attributes']
                                box = np.array([box['x'], box['y'], box['width'], box['height']],
                                               dtype=np.float32).ravel()
                                box[[0, 2]] /= wh[0]  # normalize x by width
                                box[[1, 3]] /= wh[1]  # normalize y by height
                                box = [box[0] + box[2] / 2, box[1] + box[3] / 2, box[2],
                                       box[3]]  # xywh (left-top to center x-y)
                                if box[2] > 0. and box[3] > 0.:  # if w > 0 and h > 0
                                    file.write('%g %.6f %.6f %.6f %.6f\n' % (category_id, *box))
                                    n3 += 1
                                    nlabels += 1
                        if nlabels == 0:  # remove non-labelled images from dataset
                            os.system(f'rm {label_file}')
                            # print('no labels for %s' % f)
                            continue  # next file
                        # write image
                        img_size = 4096  # resize to maximum
                        img = cv2.imread(f)  # BGR
                        assert img is not None, 'Image Not Found ' + f
                        r = img_size / max(img.shape)  # size ratio
                        if r < 1:  # downsize if necessary
                            h, w, _ = img.shape
                            img = cv2.resize(img, (int(w * r), int(h * r)), interpolation=cv2.INTER_AREA)
                        ifile = dir + 'images/' + Path(f).name
                        if cv2.imwrite(ifile, img):  # if success append image to list
                            with open(dir + 'data.txt', 'a') as file:
                                file.write('%s\n' % ifile)
                            n2 += 1  # correct images
                    except Exception:
                        os.system(f'rm {label_file}')
                        print(f'problem with {f}')
            else:
                missing_images.append(image_file)
    nm = len(missing_images)  # number missing
    print('\nFound %g JSONs with %g labels over %g images. Found %g images, labelled %g images successfully' %
          (len(jsons), n3, n1, n1 - nm, n2))
    if len(missing_images):
        print('WARNING, missing images:', missing_images)
    # Write *.names file
    names = ['knife']  # preserves sort order
    with open(dir + 'data.names', 'w') as f:
        [f.write('%s\n' % a) for a in names]
    # Split data into train, test, and validate files
    split_rows_simple(dir + 'data.txt')
    write_data_data(dir + 'data.data', nc=1)
    print(f'Done. Output saved to {Path(dir).absolute()}')
def convert_coco_json(json_dir='../coco/annotations/', use_segments=False, cls91to80=False):
    save_dir = make_dirs()  # output directory
    coco80 = coco91_to_coco80_class()
    # Import json
    for json_file in sorted(Path(json_dir).resolve().glob('*.json')):
        fn = Path(save_dir) / 'labels' / json_file.stem.replace('instances_', '')  # folder name
        fn.mkdir()
        with open(json_file) as f:
            data = json.load(f)
        # Create image dict
        images = {'%g' % x['id']: x for x in data['images']}
        # Create image-annotations dict
        imgToAnns = defaultdict(list)
        for ann in data['annotations']:
            imgToAnns[ann['image_id']].append(ann)
        # Write labels file
        for img_id, anns in tqdm(imgToAnns.items(), desc=f'Annotations {json_file}'):
            img = images['%g' % img_id]
            h, w, f = img['height'], img['width'], img['file_name']
            bboxes = []
            segments = []
            for ann in anns:
                if ann['iscrowd']:
                    continue
                # The COCO box format is [top left x, top left y, width, height]
                box = np.array(ann['bbox'], dtype=np.float64)
                box[:2] += box[2:] / 2  # xy top-left corner to center
                box[[0, 2]] /= w  # normalize x
                box[[1, 3]] /= h  # normalize y
                if box[2] <= 0 or box[3] <= 0:  # skip if w <= 0 or h <= 0
                    continue
                cls = coco80[ann['category_id'] - 1] if cls91to80 else ann['category_id'] - 1  # class
                box = [cls] + box.tolist()
                if box not in bboxes:
                    bboxes.append(box)
                # Segments
                if use_segments:
                    if len(ann['segmentation']) > 1:
                        s = merge_multi_segment(ann['segmentation'])
                        s = (np.concatenate(s, axis=0) / np.array([w, h])).reshape(-1).tolist()
                    else:
                        s = [j for i in ann['segmentation'] for j in i]  # all segments concatenated
                        s = (np.array(s).reshape(-1, 2) / np.array([w, h])).reshape(-1).tolist()
                    s = [cls] + s
                    if s not in segments:
                        segments.append(s)
            # Write
            with open((fn / f).with_suffix('.txt'), 'a') as file:
                for i in range(len(bboxes)):
                    line = *(segments[i] if use_segments else bboxes[i]),  # cls, box or segments
                    file.write(('%g ' * len(line)).rstrip() % line + '\n')
def min_index(arr1, arr2):
    """Find a pair of indexes with the shortest distance.
    Args:
        arr1: (N, 2).
        arr2: (M, 2).
    Return:
        a pair of indexes(tuple).
    """
    dis = ((arr1[:, None, :] - arr2[None, :, :]) ** 2).sum(-1)
    return np.unravel_index(np.argmin(dis, axis=None), dis.shape)
def merge_multi_segment(segments):
    """Merge multi segments to one list.
    Find the coordinates with min distance between each segment,
    then connect these coordinates with one thin line to merge all
    segments into one.
    Args:
        segments(List(List)): original segmentations in coco's json file.
            like [segmentation1, segmentation2,...],
            each segmentation is a list of coordinates.
    """
    s = []
    segments = [np.array(i).reshape(-1, 2) for i in segments]
    idx_list = [[] for _ in range(len(segments))]
    # record the indexes with min distance between each segment
    for i in range(1, len(segments)):
        idx1, idx2 = min_index(segments[i - 1], segments[i])
        idx_list[i - 1].append(idx1)
        idx_list[i].append(idx2)
    # use two round to connect all the segments
    for k in range(2):
        # forward connection
        if k == 0:
            for i, idx in enumerate(idx_list):
                # middle segments have two indexes
                # reverse the index of middle segments
                if len(idx) == 2 and idx[0] > idx[1]:
                    idx = idx[::-1]
                    segments[i] = segments[i][::-1, :]
                segments[i] = np.roll(segments[i], -idx[0], axis=0)
                segments[i] = np.concatenate([segments[i], segments[i][:1]])
                # deal with the first segment and the last one
                if i in [0, len(idx_list) - 1]:
                    s.append(segments[i])
                else:
                    idx = [0, idx[1] - idx[0]]
                    s.append(segments[i][idx[0]:idx[1] + 1])
        else:
            for i in range(len(idx_list) - 1, -1, -1):
                if i not in [0, len(idx_list) - 1]:
                    idx = idx_list[i]
                    nidx = abs(idx[1] - idx[0])
                    s.append(segments[i][nidx:])
    return s
def delete_dsstore(path='../datasets'):
    # Delete Apple .DS_Store files (macOS writes the name with a capital S)
    from pathlib import Path
    files = list(Path(path).rglob('.DS_Store'))
    print(files)
    for f in files:
        f.unlink()
if __name__ == '__main__':
    source = 'COCO'
    if source == 'COCO':
        convert_coco_json('./annotations',  # directory with *.json
                          use_segments=True,
                          cls91to80=True)
    elif source == 'infolks':  # Infolks https://infolks.info/
        convert_infolks_json(name='out',
                             files='../data/sm4/json/*.json',
                             img_path='../data/sm4/images/')
    elif source == 'vott':  # VoTT https://github.com/microsoft/VoTT
        convert_vott_json(name='data',
                          files='../../Downloads/athena_day/20190715/*.json',
                          img_path='../../Downloads/athena_day/20190715/')  # images folder
    elif source == 'ath':  # ath format
        convert_ath_json(json_dir='../../Downloads/athena/')  # images folder
    # zip results
    # os.system('zip -r ../coco.zip ../coco')
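To make the core of the conversion concrete, here is a minimal worked sketch of the box arithmetic used in the COCO branch of the script: a pixel box given as top-left corner plus size becomes a normalized center-based box. The helper names `coco_box_to_yolo` and `yolo_label_line` are hypothetical and are not part of the script above; with `use_segments=True` the script writes polygon points instead, which this sketch does not cover.

```python
from typing import Sequence, Tuple


def coco_box_to_yolo(box: Sequence[float], img_w: int, img_h: int) -> Tuple[float, float, float, float]:
    """Convert a COCO box [x_min, y_min, width, height] in pixels to a
    YOLO box (x_center, y_center, width, height) normalized to the range 0-1."""
    x_min, y_min, w, h = box
    return ((x_min + w / 2) / img_w, (y_min + h / 2) / img_h, w / img_w, h / img_h)


def yolo_label_line(class_id: int, box: Sequence[float], img_w: int, img_h: int) -> str:
    """Format one line of a YOLO *.txt label file: class id followed by the box."""
    xc, yc, w, h = coco_box_to_yolo(box, img_w, img_h)
    return '%d %.6f %.6f %.6f %.6f' % (class_id, xc, yc, w, h)


# A 50 x 80 pixel box whose top-left corner is at (100, 200) in a 1000 x 800 image
print(yolo_label_line(0, (100, 200, 50, 80), img_w=1000, img_h=800))
# 0 0.125000 0.300000 0.050000 0.100000
```

Each image gets one txt file with one such line per object, which is why the script opens a label file named after the image stem.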
Organize data folder structure
We need to organize the dataset into the following structure:
-----datasets
-----coco128-seg
|-----images
| |-----train
| |-----valid
| |-----test
|
|-----labels
| |-----train
| |-----valid
| |-----test
|
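The layout above can be created with a few lines of Python. This is a minimal sketch; the function name `make_dataset_dirs` is an assumption for illustration, and the example writes into a temporary directory so that it has no side effects (pass `'datasets'` as the root to build the real tree).

```python
import tempfile
from pathlib import Path
from typing import List


def make_dataset_dirs(root: str, name: str = 'coco128-seg') -> List[Path]:
    """Create images/ and labels/ folders, each with train, valid and test subfolders."""
    created = []
    for kind in ('images', 'labels'):
        for split in ('train', 'valid', 'test'):
            d = Path(root) / name / kind / split
            d.mkdir(parents=True, exist_ok=True)  # safe to call again on an existing tree
            created.append(d)
    return created


with tempfile.TemporaryDirectory() as tmp:
    for d in make_dataset_dirs(tmp):
        print(d.relative_to(tmp))
```

Images go under `images/<split>` and the converted txt labels under the matching `labels/<split>` folder, with the same file stem for each image and label pair.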
Model training
Epoch gpu_mem box obj cls labels img_size
1/200 20.8G 0.01576 0.01955 0.007536 22 1280: 100%|██████████| 849/849 [14:42<00:00, 1.04s/it]
Class Images Labels P R mAP@.5 mAP@.5:.95: 100%|██████████| 213/213 [01:14<00:00, 2.87it/s]
all 3395 17314 0.994 0.957 0.0957 0.0843
Epoch gpu_mem box obj cls labels img_size
2/200 20.8G 0.01578 0.01923 0.007006 22 1280: 100%|██████████| 849/849 [14:44<00:00, 1.04s/it]
Class Images Labels P R mAP@.5 mAP@.5:.95: 100%|██████████| 213/213 [01:12<00:00, 2.95it/s]
all 3395 17314 0.996 0.956 0.0957 0.0845
Epoch gpu_mem box obj cls labels img_size
3/200 20.8G 0.01561 0.0191 0.006895 27 1280: 100%|██████████| 849/849 [10:56<00:00, 1.29it/s]
Class Images Labels P R mAP@.5 mAP@.5:.95: 100%|███████ | 187/213 [00:52<00:00, 4.04it/s]
all 3395 17314 0.996 0.957 0.0957 0.0845
5. Core code explanation
5.2 predict.py
The code after encapsulating it into a class is as follows:
from ultralytics.engine.predictor import BasePredictor
from ultralytics.engine.results import Results
from ultralytics.utils import ops
class DetectionPredictor(BasePredictor):
    def postprocess(self, preds, img, orig_imgs):
        # Remove low-confidence and overlapping boxes
        preds = ops.non_max_suppression(preds,
                                        self.args.conf,
                                        self.args.iou,
                                        agnostic=self.args.agnostic_nms,
                                        max_det=self.args.max_det,
                                        classes=self.args.classes)
        if not isinstance(orig_imgs, list):
            orig_imgs = ops.convert_torch2numpy_batch(orig_imgs)
        results = []
        for i, pred in enumerate(preds):
            orig_img = orig_imgs[i]
            # Rescale boxes from the network input size to the original image size
            pred[:, :4] = ops.scale_boxes(img.shape[2:], pred[:, :4], orig_img.shape)
            img_path = self.batch[0][i]
            results.append(Results(orig_img, path=img_path, names=self.model.names, boxes=pred))
        return results
predict.py defines the DetectionPredictor class, which makes predictions with a detection model. The class inherits from BasePredictor and post-processes the raw prediction results.
The code uses the Ultralytics YOLO library, an object detection toolkit that provides functions and classes for processing prediction results.
In the DetectionPredictor class, the postprocess method first calls ops.non_max_suppression to remove overlapping bounding boxes from the predictions. It then rescales the remaining boxes from the network input size to the original image size and builds a list of Results objects. Each Results object contains the original image, the image path, the category names and the bounding box information.
The file also contains sample code showing how to use the DetectionPredictor class: it takes a pre-trained YOLOv8 model and a set of input images and makes predictions by calling the predict_cli method.
In general, this file is a tool class for target detection and prediction that post-processes the prediction results.
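The idea behind non-maximum suppression can be shown with a short pure-Python sketch. This is a simplified, class-agnostic greedy version written for illustration only; it is not the implementation behind `ops.non_max_suppression`, and the names `iou` and `nms` as well as the threshold values are assumptions.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) corner format


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two corner-format boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: List[Box], scores: List[float], conf_thres: float = 0.25, iou_thres: float = 0.45) -> List[int]:
    """Return the indices of the boxes kept by greedy NMS, highest score first."""
    # Drop low-confidence boxes, then visit the rest in descending score order
    order = sorted((i for i, s in enumerate(scores) if s >= conf_thres),
                   key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        # Keep a box only if it does not overlap too much with any box already kept
        if all(iou(boxes[i], boxes[j]) <= iou_thres for j in keep):
            keep.append(i)
    return keep


boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140), (0, 0, 5, 5)]
scores = [0.9, 0.8, 0.7, 0.1]
print(nms(boxes, scores))  # [0, 2]
```

Box 1 is suppressed because it overlaps box 0 with an IoU of about 0.82, and box 3 is discarded by the confidence threshold before overlap is even checked.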
5.3 train.py
from copy import copy
import numpy as np
from ultralytics.data import build_dataloader, build_yolo_dataset
from ultralytics.engine.trainer import BaseTrainer
from ultralytics.models import yolo
from ultralytics.nn.tasks import DetectionModel
from ultralytics.utils import LOGGER, RANK
from ultralytics.utils.plotting import plot_images, plot_labels, plot_results
from ultralytics.utils.torch_utils import de_parallel, torch_distributed_zero_first
class DetectionTrainer(BaseTrainer):
    def build_dataset(self, img_path, mode='train', batch=None):
        gs = max(int(de_parallel(self.model).stride.max() if self.model else 0), 32)
        return build_yolo_dataset(self.args, img_path, batch, self.data, mode=mode, rect=mode == 'val', stride=gs)
    def get_dataloader(self, dataset_path, batch_size=16, rank=0, mode='train'):
        assert mode in ['train', 'val']
        with torch_distributed_zero_first(rank):
            dataset = self.build_dataset(dataset_path, mode, batch_size)
        shuffle = mode == 'train'
        if getattr(dataset, 'rect', False) and shuffle:
            LOGGER.warning("WARNING ⚠️ 'rect=True' is incompatible with DataLoader shuffle, setting shuffle=False")
            shuffle = False
        workers = 0
        return build_dataloader(dataset, batch_size, workers, shuffle, rank)
    def preprocess_batch(self, batch):
        # Move images to the training device and scale pixel values to 0-1
        batch['img'] = batch['img'].to(self.device, non_blocking=True).float() / 255
        return batch
    def set_model_attributes(self):
        self.model.nc = self.data['nc']
        self.model.names = self.data['names']
        self.model.args = self.args
    def get_model(self, cfg=None, weights=None, verbose=True):
        model = DetectionModel(cfg, nc=self.data['nc'], verbose=verbose and RANK == -1)
        if weights:
            model.load(weights)
        return model
    def get_validator(self):
        self.loss_names = 'box_loss', 'cls_loss', 'dfl_loss'
        return yolo.detect.DetectionValidator(self.test_loader, save_dir=self.save_dir, args=copy(self.args))
    def label_loss_items(self, loss_items=None, prefix='train'):
        keys = [f'{prefix}/{x}' for x in self.loss_names]
        if loss_items is not None:
            loss_items = [round(float(x), 5) for x in loss_items]
            return dict(zip(keys, loss_items))
        else:
            return keys
    def progress_string(self):
        return ('\n' + '%11s' *
                (4 + len(self.loss_names))) % ('Epoch', 'GPU_mem', *self.loss_names, 'Instances', 'Size')
    def plot_training_samples(self, batch, ni):
        plot_images(images=batch['img'],
                    batch_idx=batch['batch_idx'],
                    cls=batch['cls'].squeeze(-1),
                    bboxes=batch['bboxes'],
                    paths=batch['im_file'],
                    fname=self.save_dir / f'train_batch{ni}.jpg',
                    on_plot=self.on_plot)
    def plot_metrics(self):
        plot_results(file=self.csv, on_plot=self.on_plot)
    def plot_training_labels(self):
        boxes = np.concatenate([lb['bboxes'] for lb in self.train_loader.dataset.labels], 0)
        cls = np.concatenate([lb['cls'] for lb in self.train_loader.dataset.labels], 0)
        plot_labels(boxes, cls.squeeze(), names=self.data['names'], save_dir=self.save_dir, on_plot=self.on_plot)
This file trains an object detection model. It uses the Ultralytics YOLO library, which provides functionality to train and evaluate YOLO models.
The file defines a class named DetectionTrainer, which inherits from the BaseTrainer class and is used to train a detection model. The class provides methods for building datasets, building data loaders, preprocessing data, setting model attributes, and so on.
In the __main__ block, some parameters are first defined, including the model file path, the data file path and the number of training epochs. A DetectionTrainer object is then created and its train method is called to start training.
Overall, this file implements the training process for the target detection model using the functions provided by the Ultralytics YOLO library.
5.5 backbone\convnextv2.py
import torch
import torch.nn as nn
import torch.nn.functional as F
from timm.models.layers import trunc_normal_, DropPath
class LayerNorm(nn.Module):
    def __init__(self, normalized_shape, eps=1e-6, data_format="channels_last"):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(normalized_shape))
        self.bias = nn.Parameter(torch.zeros(normalized_shape))
        self.eps = eps
        self.data_format = data_format
        if self.data_format not in ["channels_last", "channels_first"]:
            raise NotImplementedError
        self.normalized_shape = (normalized_shape, )
    def forward(self, x):
        if self.data_format == "channels_last":
            return F.layer_norm(x, self.normalized_shape, self.weight, self.bias, self.eps)
        elif self.data_format == "channels_first":
            u = x.mean(1, keepdim=True)
            s = (x - u).pow(2).mean(1, keepdim=True)
            x = (x - u) / torch.sqrt(s + self.eps)
            x = self.weight[:, None, None] * x + self.bias[:, None, None]
            return x
class GRN(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.gamma = nn.Parameter(torch.zeros(1, 1, 1, dim))
        self.beta = nn.Parameter(torch.zeros(1, 1, 1, dim))
    def forward(self, x):
        Gx = torch.norm(x, p=2, dim=(1,2), keepdim=True)
        Nx = Gx / (Gx.mean(dim=-1, keepdim=True) + 1e-6)
        return self.gamma * (x * Nx) + self.beta + x
class Block(nn.Module):
    def __init__(self, dim, drop_path=0.):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        self.norm = LayerNorm(dim, eps=1e-6)
        self.pwconv1 = nn.Linear(dim, 4 * dim)
        self.act = nn.GELU()
        self.grn = GRN(4 * dim)
        self.pwconv2 = nn.Linear(4 * dim, dim)
        self.drop_path = DropPath(drop_path) if drop_path > 0. else nn.Identity()
    def forward(self, x):
        input = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)  # (N, C, H, W) -> (N, H, W, C)
        x = self.norm(x)
        x = self.pwconv1(x)
        x = self.act(x)
        x = self.grn(x)
        x = self.pwconv2(x)
        x = x.permute(0, 3, 1, 2)  # (N, H, W, C) -> (N, C, H, W)
        x = input + self.drop_path(x)
        return x
class ConvNeXtV2(nn.Module):
    def __init__(self, in_chans=3, num_classes=1000,
                 depths=[3, 3, 9, 3], dims=[96, 192, 384, 768],
                 drop_path_rate=0., head_init_scale=1.
                 ):
        super().__init__()
        self.depths = depths
        self.downsample_layers = nn.ModuleList()
        stem = nn.Sequential(
            nn.Conv2d(in_chans, dims[0], kernel_size=4, stride=4),
            LayerNorm(dims[0], eps=1e-6, data_format="channels_first")
        )
        self.downsample_layers.append(stem)
        for i in range(3):
            downsample_layer = nn.Sequential(
                LayerNorm(dims[i], eps=1e-6, data_format="channels_first"),
                nn.Conv2d(dims[i], dims[i+1], kernel_size=2, stride=2),
            )
            self.downsample_layers.append(downsample_layer)
        self.stages = nn.ModuleList()
        dp_rates=[x.item() for x in torch.linspace(0, drop_path_rate, sum(depths))]
        cur = 0
        for i in range(4):
            stage = nn.Sequential(
                *[Block(dim=dims[i], drop_path=dp_rates[cur + j]) for j in range(depths[i])]
            )
            self.stages.append(stage)
            cur += depths[i]
        self.norm = nn.LayerNorm(dims[-1], eps=1e-6)
        self.head = nn.Linear(dims[-1], num_classes)
        self.apply(self._init_weights)
        self.channel = [i.size(1) for i in self.forward(torch.randn(1, 3, 640, 640))]
    def _init_weights(self, m):
        if isinstance(m, (nn.Conv2d, nn.Linear)):
            trunc_normal_(m.weight, std=.02)
            nn.init.constant_(m.bias, 0)
    def forward(self, x):
        # Return the feature map of every stage so the model can serve as a backbone
        res = []
        for i in range(4):
            x = self.downsample_layers[i](x)
            x = self.stages[i](x)
            res.append(x)
        return res
This program file is a PyTorch module that implements the ConvNeXt V2 model. ConvNeXt V2 is a convolutional neural network model for image classification tasks.
This file contains the following classes and functions:
- LayerNorm class: implements a LayerNorm layer that supports two data formats (channels_last and channels_first).
- GRN class: implements the global response normalization (GRN) layer.
- Block class: implements the basic block of the ConvNeXtV2 model.
- ConvNeXtV2 class: implements the ConvNeXt V2 model.
- update_weight function: used to update the weights of the model.
- convnextv2_atto function: creates a ConvNeXtV2 model instance using the atto configuration.
- convnextv2_femto function: creates a ConvNeXtV2 model instance using the femto configuration.
- convnextv2_pico function: creates a ConvNeXtV2 model instance using the pico configuration.
- convnextv2_nano function: creates a ConvNeXtV2 model instance using the nano configuration.
- convnextv2_tiny function: creates a ConvNeXtV2 model instance using the tiny configuration.
- convnextv2_base function: creates a ConvNeXtV2 model instance using the base configuration.
- convnextv2_large function: creates a ConvNeXtV2 model instance using the large configuration.
- convnextv2_huge function: creates a ConvNeXtV2 model instance using the huge configuration.
The program file also contains some auxiliary functions and initialization functions.
The ConvNeXt V2 model is a deep convolutional neural network model with multiple residual blocks for image classification tasks. It uses some special layers and techniques, such as LayerNorm, GRN and DropPath, etc., to improve the performance and effect of the model. Different configurations control the depth and width of the model to suit different tasks and data sets.
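The GRN layer in the code above can be written out as equations. The notation is an assumption introduced here: $X \in \mathbb{R}^{H \times W \times C}$ is one channels-last feature map, $\gamma_c$ and $\beta_c$ are the learnable per-channel parameters `gamma` and `beta`, and $\epsilon = 10^{-6}$.

```latex
G(X)_c = \lVert X_{:,:,c} \rVert_2 = \sqrt{\sum_{h=1}^{H} \sum_{w=1}^{W} X_{h,w,c}^{2}}
\qquad
N(X)_c = \frac{G(X)_c}{\frac{1}{C} \sum_{c'=1}^{C} G(X)_{c'} + \epsilon}
\qquad
Y_{h,w,c} = \gamma_c \, X_{h,w,c} \, N(X)_c + \beta_c + X_{h,w,c}
```

The first term is the `torch.norm` over the two spatial dimensions, the second divides each channel's norm by the mean norm across channels, and the third is the returned value. Because `gamma` and `beta` are initialized to zero, the layer starts out as the identity mapping.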
5.6 backbone\CSwomTramsformer.py
import torch
import torch.nn as nn
from timm.models.layers import trunc_normal_
# PatchEmbed and CSWinBlock are defined elsewhere in the same file (not shown in this excerpt)
class CSWinTransformer(nn.Module):
    def __init__(self, img_size=224, patch_size=4, in_chans=3, num_classes=1000, embed_dim=96, depths=[2, 2, 6, 2], num_heads=[3, 6, 12, 24], mlp_ratio=4., qkv_bias=True, qk_scale=None, drop_rate=0., attn_drop_rate=0., drop_path_rate=0., norm_layer=nn.LayerNorm):
        super().__init__()
        self.num_classes = num_classes
        self.depths = depths
        self.num_features = self.embed_dim = embed_dim
        self.patch_embed = PatchEmbed(
            img_size=img_size, patch_size=patch_size, in_chans=in_chans, embed_dim=embed_dim)
        self.pos_drop = nn.Dropout(p=drop_rate)
        dpr = [x.item() for x in torch.linspace(0, drop_path_rate, sum(depths))]  # stochastic depth decay rule
        self.blocks = nn.ModuleList([
            CSWinBlock(
                dim=embed_dim, reso=img_size // patch_size, num_heads=num_heads[i], mlp_ratio=mlp_ratio,
                qkv_bias=qkv_bias, qk_scale=qk_scale, drop=drop_rate, attn_drop=attn_drop_rate,
                drop_path=dpr[sum(depths[:i]):sum(depths[:i + 1])], norm_layer=norm_layer,
                last_stage=(i == len(depths) - 1))
            for i in range(len(depths))])
        self.norm = norm_layer(embed_dim)
        self.head = nn.Linear(embed_dim, num_classes) if num_classes > 0 else nn.Identity()
        trunc_normal_(self.head.weight, std=0.02)
        self.apply(self._init_weights)
    def _init_weights(self, m):
        if isinstance(m, nn.Linear):
            trunc_normal_(m.weight, std=.02)
            if isinstance(m, nn.Linear) and m.bias is not None:
                nn.init.constant_(m.bias, 0)
        elif isinstance(m, nn.LayerNorm):
            nn.init.constant_(m.bias, 0)
            nn.init.constant_(m.weight, 1.0)
    def get_classifier(self):
        return self.head
    def reset_classifier(self, num_classes, global_pool=''):
        self.num_classes = num_classes
        self.head = nn.Linear(self.embed_dim, num_classes) if num_classes > 0 else nn.Identity()
    def forward_features(self, x):
        x = self.patch_embed(x)
        x = self.pos_drop(x)
        for blk in self.blocks:
            x = blk(x)
        x = self.norm(x)
        return x
    def forward(self, x):
        x = self.forward_features(x)
        x = self.head(x[:, 0])
        return x
This file is an implementation of the CSWin Transformer model. CSWin Transformer is a model for image classification tasks that processes image data with the CSWin (Cross-Shaped Window) self-attention structure. The model consists of multiple CSWinBlocks, each containing a LePEAttention module and an MLP module. The LePEAttention module computes the relationships between features at different locations in the image, and the MLP module applies a nonlinear transformation to the features. The file also contains auxiliary functions for converting images into a window-form feature representation and converting that representation back to image form. Finally, the model includes a Merge_Block module that halves the size of the feature map.
6. Overall structure of the system
The following is a breakdown of the functions of each file:
File path | Function |
---|---|
export.py | Export the model to files in different formats, such as CoreML, TensorRT, TensorFlow SavedModel, etc. |
predict.py | Using models for object detection predictions |
train.py | Training object detection model |
ui.py | Graphical user interface for using models for object detection and image segmentation tasks |
backbone\convnextv2.py | Definition and configuration of ConvNeXtV2 model |
backbone\CSwomTramsformer.py | CSwomTramsformer model definition and configuration |
backbone\EfficientFormerV2.py | Definition and configuration of EfficientFormerV2 model |
backbone\efficientViT.py | Definition and configuration of efficientViT model |
backbone\fasternet.py | Fasternet model definition and configuration |
backbone\lsknet.py | Definition and configuration of lsknet model |
backbone\repvit.py | Repvit model definition and configuration |
backbone\revcol.py | Revcol model definition and configuration |
backbone\SwinTransformer.py | Definition and configuration of SwinTransformer model |
backbone\VanillaNet.py | VanillaNet model definition and configuration |
extra_modules\orepa.py | Definition and configuration of orepa module |
extra_modules\rep_block.py | Definition and configuration of rep_block module |
extra_modules\RFAConv.py | Definition and configuration of RFAConv module |
extra_modules_init_.py | Initialization file for extra_modules module |
extra_modules\ops_dcnv3\setup.py | Installation script of ops_dcnv3 module |
extra_modules\ops_dcnv3\test.py | Test script for ops_dcnv3 module |
extra_modules\ops_dcnv3\functions\dcnv3_func.py | Function definition of ops_dcnv3 module |
extra_modules\ops_dcnv3\functions_init_.py | Initialization file of ops_dcnv3 module |
extra_modules\ops_dcnv3\modules\dcnv3.py | Model definition of ops_dcnv3 module |
extra_modules\ops_dcnv3\modules_init_.py | Initialization file of ops_dcnv3 module |
models\common.py | Common model definitions and functions |
models\experimental.py | Experimental model definitions and functions |
models\tf.py | TensorFlow model definitions and functions |
models\yolo.py | YOLO model definition and functions |
models_init_.py | Initialization file of models module |
segment\predict.py | Using models for image segmentation prediction |
segment\train.py | Train image segmentation model |
7. YOLOv8 Introduction
YOLO (You Only Look Once) is a popular object detection and image segmentation model developed by Joseph Redmon and Ali Farhadi at the University of Washington. YOLO was launched in 2015 and quickly became popular due to its high speed and accuracy.
YOLOv2, released in 2016, improved the original model by incorporating batch normalization, anchor boxes, and dimension clustering.
YOLOv3, launched in 2018, further enhanced the performance of the model using a more efficient backbone network, multiple anchors, and spatial pyramid pooling.
YOLOv4, released in 2020, introduced innovations such as Mosaic data augmentation, a new anchor-free detection head, and a new loss function.
YOLOv5 further improved the performance of the model and added new features such as hyperparameter optimization, integrated experiment tracking, and automatic export to popular export formats.
YOLOv6 was open sourced by Meituan in 2022 and is currently used in many of the company's autonomous delivery robots.
YOLOv7 added further tasks, such as pose estimation on the COCO keypoints dataset.
YOLOv8 is the latest version of YOLO released by Ultralytics. As a state-of-the-art (SOTA) model, YOLOv8 builds on the success of previous versions and introduces new features and improvements that enhance performance, flexibility, and efficiency. YOLOv8 supports a full range of vision AI tasks, including detection, segmentation, pose estimation, tracking, and classification. This versatility lets users apply YOLOv8 across different applications and domains.
New features and available models of YOLOv8
Ultralytics did not name the open-source library YOLOv8; it uses the name ultralytics instead, because the library is positioned as an algorithm framework rather than a single algorithm. Its main feature is extensibility: the library is intended to support not only the YOLO series but also non-YOLO models and tasks such as classification, segmentation, and pose estimation. In summary, the two main advantages of the ultralytics open-source library are:
It integrates many current SOTA techniques in one place.
It is planned to support other YOLO versions and non-YOLO algorithms in the future.
Ultralytics has released a new repository for the YOLO model. It is built as a unified framework for training object detection, instance segmentation and image classification models.
A new SOTA model is provided, including P5 640 and P6 1280 resolution object detection networks and a YOLACT-based instance segmentation model. As in YOLOv5, models of different sizes (N/S/M/L/X) are provided through scaling factors to meet the needs of different scenarios.
The backbone network and neck appear to follow the YOLOv7 ELAN design idea: the C3 structure of YOLOv5 is replaced by the C2f structure, which has a richer gradient flow, and the channel numbers are adjusted for each model scale. The structure is tuned per model rather than applying one set of parameters to all models, which greatly improves performance. However, operations such as Split in the C2f module are less friendly to deployment on some hardware than before.
Compared with YOLOv5, the head has changed substantially. It now uses the mainstream decoupled-head structure, which separates the classification and detection heads, and it switches from Anchor-Based to Anchor-Free.
For loss calculation, the TaskAlignedAssigner positive sample assignment strategy is adopted, and Distribution Focal Loss is introduced.
For data augmentation during training, the YOLOX practice of turning off Mosaic augmentation in the last 10 epochs is adopted, which can effectively improve accuracy.
YOLOv8 also supports multiple export formats, and the model can run on both CPU and GPU. Each category of YOLOv8 models (detection, segmentation, and classification) comes in five sizes. YOLOv8 Nano (YOLOv8n) is the fastest and smallest, while YOLOv8 Extra Large (YOLOv8x) is the most accurate but slowest of them all.
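To make the anchor-free head and Distribution Focal Loss more concrete, here is a minimal NumPy sketch of how a box could be decoded when each side is predicted as a distribution over discrete distance bins. The bin count of 16, the helper names `dfl_decode` and `dist2bbox`, and the toy logits are assumptions for illustration, not code taken from the ultralytics library.

```python
import numpy as np

def dfl_decode(logits):
    """Turn per-side bin logits of shape (4, reg_max) into 4 expected distances."""
    z = logits - logits.max(axis=-1, keepdims=True)  # stabilise the softmax
    p = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    bins = np.arange(logits.shape[-1])
    return (p * bins).sum(axis=-1)  # expectation over bin indices

def dist2bbox(anchor_xy, ltrb):
    """Anchor point plus (left, top, right, bottom) distances -> (x1, y1, x2, y2)."""
    x, y = anchor_xy
    left, top, right, bottom = ltrb
    return np.array([x - left, y - top, x + right, y + bottom])

REG_MAX = 16  # assumed number of bins per box side
logits = np.zeros((4, REG_MAX))
# Toy prediction: almost all probability mass on bins 2, 3, 4 and 5.
for side, peak in enumerate([2, 3, 4, 5]):
    logits[side, peak] = 50.0

distances = dfl_decode(logits)
box = dist2bbox((10.0, 10.0), distances)
```

No anchor box shapes are involved: the box is defined only by a grid point and four distances, which is what Anchor-Free means in this context. In a real model the distances would be expressed in grid units and scaled by the stride of the feature level.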
8. Basic principles of the FocalModulation model
According to the referenced blog, the basic principle of Focal Modulation Networks (FocalNets) is to replace the self-attention module with a focal modulation mechanism that captures long-range dependencies and contextual information in the image. The figure below compares self-attention and focal modulation.
Self-attention requires complex query-key interactions and query-value aggregations between each query token and the other tokens to compute attention scores and capture context. Focal modulation, on the other hand, first aggregates spatial context into modulators at different granularities and then injects these modulators into the query tokens in a query-dependent manner. This simplifies the interaction and aggregation operations and makes them more lightweight. In the figure, the self-attention part uses red dashed lines for query-key interactions and yellow dashed lines for query-value aggregation, while the focal modulation part uses blue for modulator aggregation and yellow for query-modulator interactions.
The FocalModulation model is implemented through the following steps:
- Focal contextualization: A stack of depth-wise convolutional layers encodes visual context at different ranges.
- Gated aggregation: A gating mechanism selectively aggregates contextual information into a modulator for each query token.
- Element-wise affine transformation: The aggregated modulator is injected into each query token via an affine transformation.
These three mechanisms are introduced in turn below.
Focal contextualization
Focal contextualization is a component of focal modulation. It uses a series of depth-wise convolutional layers to encode visual context at different scales. These layers capture visual features from near to far, allowing the network to understand image content at different levels. In this way, the network stays sensitive to local details while aggregating contextual information, and it gains a better awareness of global structure.
This figure compares the mechanisms of self-attention (SA) and focal modulation in detail, and in particular shows the context aggregation process in focal modulation. The diagram on the left shows how self-attention generates its output through the interaction between key (k) and query (q), followed by aggregation. The middle and right diagrams illustrate how focal modulation replaces self-attention with hierarchical context aggregation and gated aggregation. In focal modulation, the input is first processed by a lightweight linear layer, then passed through hierarchical contextualization and a gating mechanism that selectively aggregates information, and finally the resulting modulator interacts with the query (q) to generate the output.
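The hierarchical contextualization described above can be sketched in NumPy as follows. This is a simplified illustration under stated assumptions: a single image in (H, W, C) layout, random kernels, kernel sizes 3, 5 and 7, and a GELU activation after each depth-wise convolution. The function names are hypothetical.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def depthwise_conv(x, kernel):
    """x: (H, W, C); kernel: (k, k, C) with odd k. Each channel is filtered
    independently, with zero padding and stride 1."""
    k = kernel.shape[0]
    p = k // 2
    h, w, _ = x.shape
    xp = np.pad(x, ((p, p), (p, p), (0, 0)))
    out = np.zeros(x.shape)
    for i in range(k):
        for j in range(k):
            out += xp[i:i + h, j:j + w, :] * kernel[i, j, :]
    return out

def focal_contextualization(x, kernels):
    """Return one context map per focal level plus a global context map."""
    levels = []
    ctx = x
    for kernel in kernels:  # each level sees a larger neighbourhood
        ctx = gelu(depthwise_conv(ctx, kernel))
        levels.append(ctx)
    global_ctx = gelu(ctx.mean(axis=(0, 1), keepdims=True))  # whole-image summary
    levels.append(np.broadcast_to(global_ctx, x.shape))
    return levels

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8, 4))
kernels = [0.1 * rng.standard_normal((k, k, 4)) for k in (3, 5, 7)]
levels = focal_contextualization(x, kernels)
```

Because each level is computed from the previous one, the effective receptive field grows from level to level, which is the "near to far" behaviour the text describes.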
Gated aggregation
Gated aggregation is one of the key components of Focal Modulation Networks (FocalNets). It uses a gating mechanism to selectively aggregate contextual information. The process is analyzed in detail below:
1. What is a gating mechanism?
Gating mechanisms are widely used in deep learning to control the flow of information: they decide which information is passed on and which is blocked. In recurrent neural networks (RNNs), specifically long short-term memory networks (LSTM) and gated recurrent units (GRU), gates regulate the flow of information through time-series data.
2. The purpose of gated aggregation
In FocalNets, the purpose of gated aggregation is to selectively aggregate contextual information for each query token (that is, the data unit being processed). The network can thus decide which contextual information matters for the current query token and focus on the most relevant parts.
3. How to implement gated aggregation?
Implementing gated aggregation involves a series of computational steps:
Compute contextual information: Encode different regions of the input image with depth-wise convolutional layers to capture visual context from local to global.
Gating operation: Decide which contextual information is relevant based on the features of the current query token. This is achieved through learned weights (gates) that determine the importance of each piece of contextual information.
Information aggregation: Selectively aggregate the context information into a modulator according to the gate values. This modulator is then used to adjust, or "modulate", the representation of the query token.
4. Benefits of gated aggregation
Through gated aggregation, FocalNets can focus more effectively on the information most critical to the current task. This improves model efficiency and performance because it reduces the processing of unnecessary information while strengthening the focus on key features. In vision tasks, this can mean better object detection and image classification performance, especially in complex or changing visual environments.
Summary: Gated aggregation is a core component of FocalNets that improves the efficiency and performance of the network by selectively focusing on important contextual information.
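The three implementation steps above can be summarized in a short NumPy sketch. It assumes the context maps for each level have already been computed, and that the gates come from a linear projection of the input features with one gate value per level at every position. The projection matrix `w_gate` and the function name are hypothetical.

```python
import numpy as np

def gated_aggregation(levels, gates):
    """levels: list of L context maps, each (H, W, C).
    gates: (H, W, L), one weight per level and spatial position.
    Returns the aggregated context of shape (H, W, C)."""
    out = np.zeros(levels[0].shape)
    for idx, ctx in enumerate(levels):
        out += ctx * gates[..., idx:idx + 1]  # broadcast the gate over channels
    return out

rng = np.random.default_rng(0)
h, w, c, num_levels = 4, 4, 3, 3
x = rng.standard_normal((h, w, c))
levels = [rng.standard_normal((h, w, c)) for _ in range(num_levels)]

# Gates predicted from the token features themselves (assumed linear projection).
w_gate = rng.standard_normal((c, num_levels))
gates = x @ w_gate  # (H, W, L)
aggregated = gated_aggregation(levels, gates)

# Sanity check: a gate that selects only level 1 returns exactly that level.
one_hot = np.zeros((h, w, num_levels))
one_hot[..., 1] = 1.0
selected = gated_aggregation(levels, one_hot)
```

Since the gates depend on the features at each position, two tokens in the same image can weight local and global context differently.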
Element-wise affine transformation
The third key component of Focal Modulation Networks (FocalNets) is the element-wise affine transformation, which injects the modulator obtained by gated aggregation into each query token. The process is broken down below:
1. The basic concept of affine transformation:
An affine transformation is a linear transformation followed by a translation; it can express operations such as scaling, rotation, shearing, and translation. In deep learning, an element-wise affine transformation applies a scale and a shift to each element independently. It can be described as y = ax + b, where x is the input, y is the output, and a and b are the transformation parameters.
2. The role of element-wise affine transformation:
In FocalNets, the element-wise affine transformation injects the aggregated modulator information into each query token. This step integrates the contextual information with the original features of the query token, so that the context carried by the modulator directly influences the token's representation.
3. Perform an affine transformation:
In this step, the aggregated modulator is applied element-wise to each query token. In practice, this means applying the corresponding weight (a) and bias (b) from the modulator to each feature of the query token. Each element of the modulator corresponds directly to one feature of the query token, and adjusting that feature changes the token's representation.
4. Effect of affine transformation:
Through the element-wise affine transformation, the model can fine-tune the features of each query token, enhancing or suppressing individual features based on contextual information. This lets the network adapt better to complex visual scenes and capture more detail, which improves performance in visual tasks such as object detection and image classification.
Summary: Element-wise affine transformation enables the model to use contextual information to adjust query tokens effectively, enhancing its ability to capture and express key visual features.
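A minimal NumPy sketch of this final step is given below. It treats the modulator as the per-element scale a of y = ax + b and sets the shift b to zero, which matches the multiplicative form of focal modulation as far as this sketch assumes; the linear map `w_h` that turns the aggregated context into the modulator and both function names are hypothetical.

```python
import numpy as np

def elementwise_affine(x, a, b):
    """y = a * x + b, applied independently to every element."""
    return a * x + b

def focal_modulate(q, ctx_all, w_h):
    """q: query features (H, W, C); ctx_all: aggregated context (H, W, C);
    w_h: (C, C) linear map that produces the modulator."""
    modulator = ctx_all @ w_h
    # Assumption: the modulator only scales the query, so the shift is zero.
    return elementwise_affine(q, modulator, 0.0)

rng = np.random.default_rng(0)
q = rng.standard_normal((4, 4, 3))

# A context of all ones with an identity map leaves the query unchanged.
unchanged = focal_modulate(q, np.ones((4, 4, 3)), np.eye(3))
# Doubling the context doubles every feature of every query token.
doubled = focal_modulate(q, 2.0 * np.ones((4, 4, 3)), np.eye(3))
```

Modulator values above 1 amplify a feature and values near 0 suppress it, which is how context "enhances or suppresses" features of the query token.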
9. Visual analysis of training results
Evaluation metrics
Training loss: train/box_loss, train/seg_loss, train/obj_loss, train/cls_loss
Bounding box metrics (B): metrics/precision(B), metrics/recall(B), metrics/mAP_0.5(B), metrics/mAP_0.5:0.95(B)
Mask metrics (M): metrics/precision(M), metrics/recall(M), metrics/mAP_0.5(M), metrics/mAP_0.5:0.95(M)
Validation loss: val/box_loss, val/seg_loss, val/obj_loss, val/cls_loss
Learning rate: x/lr0, x/lr1, x/lr2
Visualization of training results
To analyze this data, we can create visualizations that track the progress of these metrics and losses over time. We will focus on the following key aspects:
Loss over Epochs: Observe how the model's training and validation losses decrease over time.
Precision and Recall: Evaluate the model's performance on bounding boxes (B) and segmentation masks (M).
mAP (Mean Average Precision): Evaluate the overall performance of the model at different IoU thresholds.
Learning rate changes: Understand how the learning rate changes over time.
Let's start by visualizing these aspects.
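import matplotlib.pyplot as plt
import pandas as pd
# Load the training log; 'results.csv' is an assumed path, adjust it to your run directory
data = pd.read_csv('results.csv')
data.columns = data.columns.str.strip()  # column names in the log may be padded with spaces
# Setting up the plots
fig, axes = plt.subplots(nrows=3, ncols=2, figsize=(15, 15))
fig.tight_layout(pad=6.0)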
# Plotting Training and Validation Losses
axes[0, 0].plot(data['epoch'], data['train/box_loss'], label='Train Box Loss')
axes[0, 0].plot(data['epoch'], data['train/seg_loss'], label='Train Segmentation Loss')
axes[0, 0].plot(data['epoch'], data['train/obj_loss'], label='Train Object Loss')
axes[0, 0].plot(data['epoch'], data['train/cls_loss'], label='Train Class Loss')
axes[0, 0].plot(data['epoch'], data['val/box_loss'], label='Validation Box Loss', linestyle='dashed')
axes[0, 0].plot(data['epoch'], data['val/seg_loss'], label='Validation Segmentation Loss', linestyle='dashed')
axes[0, 0].plot(data['epoch'], data['val/obj_loss'], label='Validation Object Loss', linestyle='dashed')
axes[0, 0].plot(data['epoch'], data['val/cls_loss'], label='Validation Class Loss', linestyle='dashed')
axes[0, 0].set_title('Training & Validation Losses over Epochs')
axes[0, 0].set_xlabel('Epoch')
axes[0, 0].set_ylabel('Loss')
axes[0, 0].legend()
# Plotting Precision and Recall for bounding boxes (B)
axes[0, 1].plot(data['epoch'], data['metrics/precision(B)'], label='Precision (Box)')
axes[0, 1].plot(data['epoch'], data['metrics/recall(B)'], label='Recall (Box)')
axes[0, 1].set_title('Precision & Recall for Bounding Boxes')
axes[0, 1].set_xlabel('Epoch')
axes[0, 1].set_ylabel('Metric Value')
axes[0, 1].legend()
# Plotting Precision and Recall for masks (M)
axes[1, 0].plot(data['epoch'], data['metrics/precision(M)'], label='Precision (Mask)')
axes[1, 0].plot(data['epoch'], data['metrics/recall(M)'], label='Recall (Mask)')
axes[1, 0].set_title('Precision & Recall for Masks')
axes[1, 0].set_xlabel('Epoch')
axes[1, 0].set_ylabel('Metric Value')
axes[1, 0].legend()
# Plotting mAP for bounding boxes and masks
axes[1, 1].plot(data['epoch'], data['metrics/mAP_0.5(B)'], label='mAP_0.5 (Box)')
axes[1, 1].plot(data['epoch'], data['metrics/mAP_0.5:0.95(B)'], label='mAP_0.5:0.95 (Box)')
axes[1, 1].plot(data['epoch'], data['metrics/mAP_0.5(M)'], label='mAP_0.5 (Mask)', linestyle='dashed')
axes[1, 1].plot(data['epoch'], data['metrics/mAP_0.5:0.95(M)'], label='mAP_0.5:0.95 (Mask)', linestyle='dashed')
axes[1, 1].set_title('mAP for Bounding Boxes and Masks')
axes[1, 1].set_xlabel('Epoch')
axes[1, 1].set_ylabel('mAP Value')
axes[1, 1].legend()
# Plotting Learning Rates
axes[2, 0].plot(data['epoch'], data['x/lr0'], label='Learning Rate 0')
axes[2, 0].plot(data['epoch'], data['x/lr1'], label='Learning Rate 1')
axes[2, 0].plot(data['epoch'], data['x/lr2'], label='Learning Rate 2')
axes[2, 0].set_title('Learning Rates over Epochs')
axes[2, 0].set_xlabel('Epoch')
axes[2, 0].set_ylabel('Learning Rate')
axes[2, 0].legend()
# Hide the unused sixth panel and adjust the layout for better visualization
axes[2, 1].axis('off')
plt.subplots_adjust(top=0.92, bottom=0.08, left=0.10, right=0.95, hspace=0.25, wspace=0.35)
# Show plot
plt.show()
Result analysis
Training loss: The training losses (box loss, segmentation loss, object loss, and classification loss) generally trend downward, indicating that the model is learning effectively as training progresses.
Validation loss: Validation loss follows a similar trend to training loss. This indicates that the model is not significantly overfitting the training data.
Bounding Box (B) Metrics: Precision and recall of bounding boxes show different trends. High precision indicates that most predicted bounding boxes are correct, while high recall indicates that the model finds most of the ground-truth objects. A trade-off between these two metrics can be observed.
Bounding Box (B) mAP: The mAP (Mean Average Precision) of bounding boxes at different IoU (Intersection over Union) thresholds (0.5 and 0.5:0.95) shows the model's accuracy in detecting objects with bounding boxes. The mAP at 0.5:0.95 is particularly important because it is a more stringent metric: it averages over IoU thresholds from 0.5 to 0.95, so the model must stay accurate as the overlap requirement tightens.
Mask (M) Metric: Similar to bounding boxes, the precision and recall of masks are critical to understanding the model’s segmentation performance.
Mask (M) mAP: The mAP of the mask further indicates how well the model performs at segmenting objects, with higher values indicating better performance.
Learning rate: The learning rate graph shows how the learning rate adjusts over time. These adjustments are critical for efficient training, allowing the model to learn quickly initially and then refine its learning as it converges.
This comprehensive analysis provides a detailed understanding of the model's performance in different aspects.
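To illustrate why mAP at 0.5:0.95 is stricter than mAP at 0.5, the following sketch computes the IoU of one predicted box against its ground truth and counts how many of the ten thresholds it satisfies. The box coordinates are made-up illustrative values, not results from this experiment, and the helper names are hypothetical. Mask metrics follow the same idea with pixel overlap in place of box overlap.

```python
def box_iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# The ten thresholds 0.50, 0.55, ..., 0.95 averaged by mAP_0.5:0.95.
IOU_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]

def thresholds_passed(iou):
    """Number of thresholds at which the prediction counts as a match."""
    return sum(iou >= t for t in IOU_THRESHOLDS)

ground_truth = (0.0, 0.0, 10.0, 10.0)  # illustrative values only
prediction = (0.0, 0.0, 9.0, 8.0)

iou = box_iou(ground_truth, prediction)
passed = thresholds_passed(iou)
```

In this toy case the prediction is a match for mAP_0.5 but only contributes at half of the thresholds used by mAP_0.5:0.95, so a loosely fitting box is penalized by the stricter metric.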
10. System integration
The picture below shows the integrated system [complete source code, dataset, environment deployment video tutorial, and custom UI interface].
Reference blog: "Remote Sensing Image Segmentation System: Fusion of Spatial Pyramid Pooling (FocalModulation) to Improve YOLOv8"