Technology can improve safety. In this post, based on SSD, we develop and build a system for detecting and recognizing pedestrian safety behaviors and postures in supermarket escalator scenarios.

In high-traffic scenes such as shopping malls and supermarkets, it is often reported that pedestrians fall and get injured on escalators. With the rapid development and popularization of AI technology, more and more venues such as malls, supermarkets, and subways have begun to install dedicated safety detection and early-warning systems. The core working principle is real-time computation by an AI model over the camera's image and video stream: by detecting and identifying behaviors on the escalator in real time, the system can quickly warn about and respond to dangerous behaviors, avoiding the serious consequences that might otherwise follow.

The main purpose of this article is to develop and build a pedestrian safety behavior detection and recognition system for the supermarket escalator scene, and to explore and analyze the feasibility of improving safety assurance with AI technology. First, let's look at an example of the effect:

The object detection model SSD (Single Shot MultiBox Detector) is an end-to-end detector that predicts both the location and the category of targets in a single forward pass. SSD combines the regression idea with an anchor-box mechanism, eliminating the candidate-region generation and subsequent pixel or feature resampling stages of two-stage algorithms and encapsulating all computation in one network, which makes it easy to train and fast. It discretizes the output space of bounding boxes into a set of default boxes, generated on feature maps at different levels with different aspect ratios. At prediction time, the network scores the likelihood of each category in each default box and adjusts the box so that it tightly surrounds the target. Because predictions are made on multiple feature maps with different resolutions, the network can handle objects of various sizes.
The SSD network can be divided into two parts: feature extraction and detection-box generation. The base network used for feature extraction is borrowed from a classification network: SSD uses VGG-16, keeps its first five stages, and converts the FC6 and FC7 layers into two convolutional layers; since this modification changes the receptive field, dilated convolution is used. The model also adds three extra convolutional layers and an average pooling layer. The convolutional layers appended after the truncated base network have gradually shrinking feature-map sizes, enabling prediction at multiple scales. The multi-scale feature maps are conv4_3, conv7, conv8_2, conv9_2, conv10_2, and conv11_2, six scales in total. On each added feature layer, SSD uses small convolution kernels to predict a set of bounding-box offsets; the prediction head likewise applies small convolution kernels on the feature maps to predict object-category confidences and bounding-box coordinates directly. Because prediction is performed at six different scales, each with anchor boxes of different aspect ratios, detection accuracy improves; and since the whole algorithm is trained end to end, it also has a clear advantage in detection speed.
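To make the prediction mechanism concrete, here is a minimal PyTorch sketch of the small-kernel prediction heads described above. The channel count, feature-map size, class count, and number of default boxes per location are illustrative assumptions, not the exact values used in this project.

import torch
import torch.nn as nn

num_classes = 21    # e.g. 20 classes + background (illustrative assumption)
num_defaults = 6    # default boxes per feature-map location (assumption)

# a conv4_3-like feature map: batch 1, 512 channels, 38x38 spatial size
feature_map = torch.randn(1, 512, 38, 38)

# one small 3x3 conv head for class scores, one for box-coordinate offsets
cls_head = nn.Conv2d(512, num_defaults * num_classes, kernel_size=3, padding=1)
loc_head = nn.Conv2d(512, num_defaults * 4, kernel_size=3, padding=1)

cls_scores = cls_head(feature_map)   # (1, 6*21, 38, 38)
loc_offsets = loc_head(feature_map)  # (1, 6*4, 38, 38)
print(cls_scores.shape, loc_offsets.shape)

In the full model, one such pair of heads runs on each of the six feature maps, which is what lets SSD cover objects of different sizes in a single pass.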

The construction principle of the SSD algorithm is as follows:

Extracting features: SSD first uses a convolutional neural network (CNN) such as VGG or ResNet to extract features from the input image. The resulting feature maps contain semantic information at different levels, which helps the model detect targets of different sizes and categories.

Multi-scale detection: SSD applies a series of convolutional and pooling layers to feature maps at different levels in order to detect objects at different scales. This multi-scale detection enables the model to better adapt to objects of different sizes.

Predict bounding boxes and categories: on each feature map, SSD uses convolutional layers to predict bounding-box locations and object categories. For each anchor-box position and scale, SSD predicts the matching target bounding box and the corresponding class probabilities.

Matching strategy: SSD determines which predictions are valid by computing the IoU (intersection over union) between predicted bounding boxes and ground-truth bounding boxes, and uses a loss function for optimization.
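As an illustration of this matching step, here is a minimal sketch of IoU computation and threshold-based matching, assuming boxes in (x1, y1, x2, y2) format; the 0.5 threshold follows the SSD paper.

def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / float(area_a + area_b - inter)

def match(defaults, ground_truths, threshold=0.5):
    """Return (default_idx, gt_idx) pairs whose IoU exceeds the threshold."""
    return [(i, j) for i, d in enumerate(defaults)
                   for j, g in enumerate(ground_truths)
                   if iou(d, g) > threshold]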

Advantages of SSD include:

Efficiency: SSD completes detection in a single forward pass, making it fast.
Multi-scale detection: SSD effectively detects targets of different sizes, meeting the needs of multi-scale detection.
Simplicity: SSD completes detection with a single model, reducing overall complexity.
Disadvantages of SSD include:

Lower localization accuracy: SSD's accuracy can be limited when localizing small targets.
Duplicate detections: the anchor boxes preset in SSD can produce multiple overlapping detections for the same object, which requires additional post-processing to resolve.
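That additional post-processing is typically non-maximum suppression (NMS). A minimal sketch, reusing the iou() helper from the matching sketch above:

def nms(boxes, scores, iou_threshold=0.45):
    """Greedily keep the highest-scoring box and drop boxes overlapping it."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order
                 if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep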
You can read the original paper, "SSD: Single Shot MultiBox Detector" (arXiv:1512.02325), for further details.

The official project repository is here, as shown below:

The project provides three different backbone networks. Here we use MobileNetV3, as shown below:

"""
Creates a MobileNetV3 Model as defined in:
Andrew Howard, Mark Sandler, Grace Chu, Liang-Chieh Chen, Bo Chen, Mingxing Tan, Weijun Wang, Yukun Zhu, Ruoming Pang, Vijay Vasudevan, Quoc V. Le, Hartwig Adam. (2019).
Searching for MobileNetV3
arXiv preprint arXiv:1905.02244.
@ Credit from https://github.com/d-li14/mobilenetv3.pytorch
@ Modified by Chakkrit Termritthikun (https://github.com/chakkritte)
"""
 
import torch.nn as nn
import math
 
from ssd.modeling import registry
from ssd.utils.model_zoo import load_state_dict_from_url
 
model_urls = {
    'mobilenet_v3': 'https://github.com/d-li14/mobilenetv3.pytorch/raw/master/pretrained/mobilenetv3-large-1cd25616.pth',
}
 
 
def _make_divisible(v, divisor, min_value=None):
    """
    This function is taken from the original tf repo.
    It ensures that all layers have a channel number that is divisible by 8
    It can be seen here:
    https://github.com/tensorflow/models/blob/master/research/slim/nets/mobilenet/mobilenet.py
    :param v:
    :param divisor:
    :param min_value:
    :return:
    """
    if min_value is None:
        min_value = divisor
    new_v = max(min_value, int(v + divisor / 2) // divisor * divisor)
    # Make sure that round down does not go down by more than 10%.
    if new_v < 0.9 * v:
        new_v += divisor
    return new_v
 
 
class h_sigmoid(nn.Module):
    def __init__(self, inplace=True):
        super(h_sigmoid, self).__init__()
        self.relu = nn.ReLU6(inplace=inplace)
 
    def forward(self, x):
        return self.relu(x + 3) / 6
 
 
class h_swish(nn.Module):
    def __init__(self, inplace=True):
        super(h_swish, self).__init__()
        self.sigmoid = h_sigmoid(inplace=inplace)
 
    def forward(self, x):
        return x * self.sigmoid(x)
 
 
class SELayer(nn.Module):
    def __init__(self, channel, reduction=4):
        super(SELayer, self).__init__()
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Linear(channel, _make_divisible(channel // reduction, 8)),
            nn.ReLU(inplace=True),
            nn.Linear(_make_divisible(channel // reduction, 8), channel),
            h_sigmoid()
        )
 
    def forward(self, x):
        b, c, _, _ = x.size()
        y = self.avg_pool(x).view(b, c)
        y = self.fc(y).view(b, c, 1, 1)
        return x * y
 
 
def conv_3x3_bn(inp, oup, stride):
    return nn.Sequential(
        nn.Conv2d(inp, oup, 3, stride, 1, bias=False),
        nn.BatchNorm2d(oup),
        h_swish()
    )
 
 
def conv_1x1_bn(inp, oup):
    return nn.Sequential(
        nn.Conv2d(inp, oup, 1, 1, 0, bias=False),
        nn.BatchNorm2d(oup),
        h_swish()
    )
 
 
class InvertedResidual(nn.Module):
    def __init__(self, inp, hidden_dim, oup, kernel_size, stride, use_se, use_hs):
        super(InvertedResidual, self).__init__()
        assert stride in [1, 2]
 
        self.identity = stride == 1 and inp == oup
 
        if inp == hidden_dim:
            self.conv = nn.Sequential(
                # dw
                nn.Conv2d(hidden_dim, hidden_dim, kernel_size, stride, (kernel_size - 1) // 2, groups=hidden_dim, bias=False),
                nn.BatchNorm2d(hidden_dim),
                h_swish() if use_hs else nn.ReLU(inplace=True),
                # Squeeze-and-Excite
                SELayer(hidden_dim) if use_se else nn.Identity(),
                # pw-linear
                nn.Conv2d(hidden_dim, oup, 1, 1, 0, bias=False),
                nn.BatchNorm2d(oup),
            )
        else:
            self.conv = nn.Sequential(
                # pw
                nn.Conv2d(inp, hidden_dim, 1, 1, 0, bias=False),
                nn.BatchNorm2d(hidden_dim),
                h_swish() if use_hs else nn.ReLU(inplace=True),
                # dw
                nn.Conv2d(hidden_dim, hidden_dim, kernel_size, stride, (kernel_size - 1) // 2, groups=hidden_dim, bias=False),
                nn.BatchNorm2d(hidden_dim),
                # Squeeze-and-Excite
                SELayer(hidden_dim) if use_se else nn.Identity(),
                h_swish() if use_hs else nn.ReLU(inplace=True),
                # pw-linear
                nn.Conv2d(hidden_dim, oup, 1, 1, 0, bias=False),
                nn.BatchNorm2d(oup),
            )
 
    def forward(self, x):
        if self.identity:
            return x + self.conv(x)
        else:
            return self.conv(x)
 
 
class MobileNetV3(nn.Module):
    def __init__(self, mode='large', num_classes=1000, width_mult=1.):
        super(MobileNetV3, self).__init__()
        # setting of inverted residual blocks
        self.cfgs = [
            # k, t, c, SE, HS, s
            [3, 1, 16, 0, 0, 1],
            [3, 4, 24, 0, 0, 2],
            [3, 3, 24, 0, 0, 1],
            [5, 3, 40, 1, 0, 2],
            [5, 3, 40, 1, 0, 1],
            [5, 3, 40, 1, 0, 1],
            [3, 6, 80, 0, 1, 2],
            [3, 2.5, 80, 0, 1, 1],
            [3, 2.3, 80, 0, 1, 1],
            [3, 2.3, 80, 0, 1, 1],
            [3, 6, 112, 1, 1, 1],
            [3, 6, 112, 1, 1, 1],
            [5, 6, 160, 1, 1, 2],
            [5, 6, 160, 1, 1, 1],
            [5, 6, 160, 1, 1, 1]]
 
        assert mode in ['large', 'small']
 
        # building first layer
        input_channel = _make_divisible(16 * width_mult, 8)
 
        layers = [conv_3x3_bn(3, input_channel, 2)]
        # building inverted residual blocks
        block = InvertedResidual
        for k, t, c, use_se, use_hs, s in self.cfgs:
            output_channel = _make_divisible(c * width_mult, 8)
            exp_size = _make_divisible(input_channel * t, 8)
            layers.append(block(input_channel, exp_size, output_channel, k, s, use_se, use_hs))
            input_channel = output_channel
        # building last several layers
        layers.append(conv_1x1_bn(input_channel, exp_size))
        self.features = nn.Sequential(*layers)
        self.extras = nn.ModuleList([
            InvertedResidual(960, _make_divisible(960 * 0.2, 8), 512, 3, 2, True, True),
            InvertedResidual(512, _make_divisible(512 * 0.25, 8), 256, 3, 2, True, True),
            InvertedResidual(256, _make_divisible(256 * 0.5, 8), 256, 3, 2, True, True),
            InvertedResidual(256, _make_divisible(256 * 0.25, 8), 64, 3, 2, True, True),
        ])
 
        self.reset_parameters()
 
    def forward(self, x):
        features = []
        # stride-16 feature map from the truncated base network (112 channels)
        for i in range(13):
            x = self.features[i](x)
        features.append(x)

        # stride-32 feature map after the final 1x1 conv (960 channels)
        for i in range(13, len(self.features)):
            x = self.features[i](x)
        features.append(x)

        # four extra blocks, each halving the resolution, give the remaining scales
        for i in range(len(self.extras)):
            x = self.extras[i](x)
            features.append(x)

        # six feature maps in total, one per SSD prediction scale
        return tuple(features)
 
    def reset_parameters(self):
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                n = m.kernel_size[0] * m.kernel_size[1] * m.out_channels
                m.weight.data.normal_(0, math.sqrt(2. / n))
                if m.bias is not None:
                    m.bias.data.zero_()
            elif isinstance(m, nn.BatchNorm2d):
                m.weight.data.fill_(1)
                m.bias.data.zero_()
            elif isinstance(m, nn.Linear):
                n = m.weight.size(1)
                m.weight.data.normal_(0, 0.01)
                m.bias.data.zero_()
 
 
@registry.BACKBONES.register('mobilenet_v3')
def mobilenet_v3(cfg, pretrained=True):
    model = MobileNetV3()
    if pretrained:
        model.load_state_dict(load_state_dict_from_url(model_urls['mobilenet_v3']), strict=False)
    return model
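As a quick sanity check of the backbone above, here is a minimal sketch that runs it on a dummy input and prints the six feature-map shapes. The 300x300 input size matches the common SSD300 setting, which is an assumption here; the snippet reuses the MobileNetV3 class defined above.

import torch

model = MobileNetV3()
dummy = torch.randn(1, 3, 300, 300)
features = model(dummy)
for i, f in enumerate(features):
    print(i, f.shape)   # six feature maps at decreasing spatial resolutions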

Follow the README to carry out the model development process on your own dataset; I will not go into the details here, as the previous article introduces them in more depth.

After training completes, we obtain a weight file that can be used for inference. A visual inference example is shown below:

The detection results are formatted and stored as follows:

{
	"shake": [
		[
			0.6235582828521729,
			[
				409,
				837,
				1019,
				1054
			]
		]
	]
}

Here the key is the detected behavior class, the float is the confidence score, and the four integers are the bounding-box coordinates. This format is convenient for subsequent analysis and for use by back-end business systems. If you are interested, you can also try it yourself!
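As a minimal sketch of how a back-end might consume this format (the file name result.json and the (x1, y1, x2, y2) coordinate order are assumptions for illustration):

import json

with open("result.json", encoding="utf-8") as f:
    results = json.load(f)

# each entry maps a behavior class to a list of [confidence, box] pairs
for behavior, detections in results.items():
    for confidence, (x1, y1, x2, y2) in detections:
        print(f"{behavior}: conf={confidence:.2f}, box=({x1}, {y1}, {x2}, {y2})")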

Origin: blog.csdn.net/Together_CZ/article/details/134892776