Understanding the deep residual shrinkage network in 10 minutes

The deep residual network (ResNet) won the Best Paper Award at the 2016 IEEE Conference on Computer Vision and Pattern Recognition, and at the time of writing it has been cited 38,295 times on Google Scholar.

The deep residual shrinkage network is an improved version of the deep residual network. It is, in effect, a deep residual network that integrates an attention mechanism and the soft thresholding function.

To a certain extent, the working principle of the deep residual shrinkage network can be understood as follows: the attention mechanism notices unimportant features, and the soft thresholding function sets them to zero; or, put the other way round, the attention mechanism notices important features and retains them. This strengthens the ability of the deep neural network to extract useful features from noisy signals.

1. Why propose the deep residual shrinkage network?

First, when samples are classified, they inevitably contain some noise, such as Gaussian noise, pink noise or Laplace noise. More broadly, a sample is likely to contain information that is irrelevant to the current classification task, and this information can also be regarded as noise. Such noise may adversely affect the classification result. (Soft thresholding is a key step in many signal denoising algorithms.)

For example, when people chat at the roadside, their speech may be mixed with vehicle horns, the sound of wheels and so on. When speech recognition is performed on such a signal, the result will inevitably be affected by the horns and the wheel noise. From a deep learning perspective, the features corresponding to the horns and the wheels should be removed inside the deep neural network so that they do not affect speech recognition.

Second, even within the same sample set, the amount of noise often differs from sample to sample. (This has something in common with attention mechanisms: taking an image data set as an example, the location of the target object may differ from image to image, and an attention mechanism can focus on the location of the target object in each image.)

For example, suppose we train a cat-versus-dog classifier on five images labelled "dog". The first image may contain a dog and a mouse, the second a dog and a goose, the third a dog and a chicken, the fourth a dog and a donkey, and the fifth a dog and a duck. When we train the classifier, it will inevitably be disturbed by the irrelevant mice, geese, chickens, donkeys and ducks, which lowers the classification accuracy. If we can notice these irrelevant objects and remove their corresponding features, it may be possible to improve the accuracy of the cat-versus-dog classifier.

2. Soft thresholding is a key step in many signal denoising algorithms

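Soft thresholding is a key step in many signal denoising algorithms: features whose absolute value is smaller than a certain threshold are set to zero, and features whose absolute value is larger than the threshold are shrunk toward zero. Writing x for the input, y for the output and τ for the threshold, it can be implemented by the following formula:

$$
y = \begin{cases} x - \tau, & x > \tau \\ 0, & -\tau \le x \le \tau \\ x + \tau, & x < -\tau \end{cases}
$$

The derivative of the soft thresholding output with respect to its input is:

$$
\frac{\partial y}{\partial x} = \begin{cases} 1, & x > \tau \\ 0, & -\tau \le x \le \tau \\ 1, & x < -\tau \end{cases}
$$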

As the above shows, the derivative of soft thresholding is either 1 or 0. This property is the same as that of the ReLU activation function. Soft thresholding can therefore also reduce the risk of vanishing and exploding gradients in deep learning algorithms.
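The function and its derivative can be sketched in a few lines of NumPy. This is only an illustration: the names `soft_threshold`, `soft_threshold_grad` and `tau`, as well as the input values, are chosen here and do not come from the programs later in this article.

```python
import numpy as np

def soft_threshold(x, tau):
    """Shrink x toward zero by tau; entries with |x| <= tau become 0."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def soft_threshold_grad(x, tau):
    """Derivative of the output with respect to x: 1 where |x| > tau, else 0."""
    return (np.abs(x) > tau).astype(x.dtype)

x = np.array([-3.0, -0.5, 0.0, 0.8, 2.0])
tau = 1.0  # hypothetical threshold, chosen only for illustration

y = soft_threshold(x, tau)
g = soft_threshold_grad(x, tau)
print(y)
print(g)
```

Entries with a small absolute value are zeroed and their gradient is 0, while the remaining entries move toward zero by `tau` and keep a gradient of 1.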

In the soft thresholding function, the threshold must meet two conditions: first, it must be a positive number; second, it must not exceed the maximum absolute value of the input signal, otherwise the output will be all zeros.

In addition, the threshold should ideally meet a third condition: each sample should have its own independent threshold, determined by its own noise content.

This is because the noise content often differs from sample to sample. For example, it is common that, within the same sample set, sample A contains less noise and sample B contains more noise. When soft thresholding is applied in a denoising algorithm, sample A should then use a smaller threshold and sample B a larger one. In deep neural networks, although the features and thresholds no longer have a clear physical meaning, the basic principle still holds: each sample should have its own independent threshold, determined by its own noise content.
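A minimal sketch of what "one threshold per sample" means in practice is given below. The two samples and the threshold values in `taus` are made up for illustration; in the deep residual shrinkage network the thresholds are learned rather than set by hand.

```python
import numpy as np

def soft_threshold(x, tau):
    """Shrink x toward zero by tau; entries with |x| <= tau become 0."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

# Two hypothetical samples (rows) with four features each.
batch = np.array([[0.2, -1.5, 3.0, -0.1],
                  [1.0, -4.0, 0.5, 2.5]])

# One threshold per sample, shaped (2, 1) so that it broadcasts over the features.
taus = np.array([[0.5],
                 [2.0]])

denoised = soft_threshold(batch, taus)
print(denoised)

# A threshold above the largest absolute value removes the whole sample.
too_large = soft_threshold(batch[0], 5.0)
print(too_large)
```

The last call illustrates the second condition on the threshold: once it exceeds the largest absolute value of the input, every output is zero.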

3. Attention mechanisms

Attention mechanisms are relatively easy to understand in the field of computer vision. An animal's visual system can quickly scan the whole field of view, find the target object, and then focus attention on it to extract more detail while suppressing irrelevant information. For details, please refer to articles on attention mechanisms.

The Squeeze-and-Excitation Network (SENet) is a relatively recent attention mechanism in deep learning. In different samples, the contribution of different feature channels to the classification task often differs. SENet uses a small sub-network to obtain a set of weights, and then multiplies these weights with the features of the respective channels to adjust the magnitude of each channel's features. This process can be regarded as applying different amounts of attention to the individual feature channels.

In this way, each sample has its own independent set of weights. In other words, the weights of any two samples are generally different. In SENet, the path used to obtain the weights is "global pooling → fully connected layer → ReLU function → fully connected layer → Sigmoid function".
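The weight path can be sketched in plain NumPy as follows. This is a simplified illustration, not SENet's reference implementation: the function name `se_weights`, the layer sizes and the random parameters (which stand in for learned weights) are all assumptions made here.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def se_weights(feature_map, w1, b1, w2, b2):
    """feature_map: (batch, height, width, channels) -> weights of shape (batch, channels)."""
    squeezed = feature_map.mean(axis=(1, 2))       # global average pooling
    hidden = np.maximum(squeezed @ w1 + b1, 0.0)   # fully connected layer + ReLU
    return sigmoid(hidden @ w2 + b2)               # fully connected layer + Sigmoid

rng = np.random.default_rng(0)
batch, height, width, channels, hidden_units = 2, 4, 4, 8, 2
feature_map = rng.normal(size=(batch, height, width, channels))

# Random parameters stand in for weights that would normally be learned.
w1 = rng.normal(size=(channels, hidden_units))
b1 = np.zeros(hidden_units)
w2 = rng.normal(size=(hidden_units, channels))
b2 = np.zeros(channels)

weights = se_weights(feature_map, w1, b1, w2, b2)
rescaled = feature_map * weights[:, None, None, :]  # scale every channel of every sample
print(weights.shape, rescaled.shape)
```

Because the weights are computed from each sample's own feature map, every sample is rescaled with its own set of channel weights.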

4. Soft thresholding under a deep attention mechanism

The deep residual shrinkage network borrows the sub-network structure of SENet described above to implement soft thresholding under a deep attention mechanism. Through this sub-network (shown in a blue box in the original figure), a set of thresholds can be learned and used to soft-threshold each feature channel.

In this sub-network, the absolute values of all features in the input feature map are computed first. Global average pooling then yields a feature, denoted A. On the other path, the feature map after global average pooling is fed into a small fully connected network. This network uses a Sigmoid function as its last layer to normalize the output to between 0 and 1, giving a coefficient denoted α. The final threshold can be expressed as α × A. The threshold is therefore a number between 0 and 1 multiplied by the mean of the absolute values of the feature map. This guarantees that the threshold is positive and also not too large.

Moreover, different samples have different thresholds. To some extent, this can therefore be understood as a special attention mechanism: it notices features that are irrelevant to the current task and sets them to zero through soft thresholding; or, put the other way round, it notices features that are relevant to the current task and retains them.

Finally, stacking a certain number of these basic modules together with a convolutional layer, batch normalization, an activation function, global average pooling and a fully connected output layer gives the complete deep residual shrinkage network.
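The threshold computation α × A can be sketched in NumPy as below. This is a simplified illustration under stated assumptions: the helper names `channel_thresholds` and `shrink`, the layer sizes and the random parameters are invented here, and batch normalization inside the small fully connected network is left out for brevity.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_thresholds(feature_map, w1, b1, w2, b2):
    """Return thresholds alpha * A with shape (batch, 1, 1, channels)."""
    a = np.abs(feature_map).mean(axis=(1, 2))   # A: global average of |x| per channel
    hidden = np.maximum(a @ w1 + b1, 0.0)       # fully connected layer + ReLU
    alpha = sigmoid(hidden @ w2 + b2)           # fully connected layer + Sigmoid
    return (alpha * a)[:, None, None, :]

def shrink(feature_map, thresholds):
    """Channel-wise soft thresholding."""
    return np.sign(feature_map) * np.maximum(np.abs(feature_map) - thresholds, 0.0)

rng = np.random.default_rng(0)
feature_map = rng.normal(size=(2, 4, 4, 8))     # (batch, height, width, channels)

# Random parameters stand in for weights that would normally be learned.
w1 = rng.normal(size=(8, 8))
b1 = np.zeros(8)
w2 = rng.normal(size=(8, 8))
b2 = np.zeros(8)

thresholds = channel_thresholds(feature_map, w1, b1, w2, b2)
output = shrink(feature_map, thresholds)
print(thresholds.shape, output.shape)
```

Since α lies between 0 and 1 and A is a mean of absolute values, each threshold is positive and cannot exceed the mean absolute value of its channel, so the output of a channel is never forced to all zeros.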

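5. The deep residual shrinkage network may have broader generality

The deep residual shrinkage network is in fact a general feature learning method. This is because, in many feature learning tasks, the samples contain more or less noise as well as irrelevant information, and this noise and irrelevant information may affect the quality of feature learning. For example:

In image classification, if an image also contains many other objects, these objects can be understood as "noise". The deep residual shrinkage network may be able to notice this "noise" through the attention mechanism and then set the corresponding features to zero through soft thresholding, which could improve image classification accuracy.

In speech recognition, in a noisy environment such as a conversation at the roadside or in a factory workshop, the deep residual shrinkage network may improve recognition accuracy, or at least offers an approach for improving it.

6. A brief introduction to the Keras and TFLearn programs

This program uses image classification as an example and builds a small deep residual shrinkage network; the hyperparameters have not been optimized. To pursue higher accuracy, you can increase the depth, increase the number of training iterations and tune the hyperparameters as appropriate. The Keras program is as follows: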

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Sat Dec 28 23:24:05 2019
Implemented using TensorFlow 1.0.1 and Keras 2.2.1
 
M. Zhao, S. Zhong, X. Fu, et al., Deep Residual Shrinkage Networks for Fault Diagnosis, 
IEEE Transactions on Industrial Informatics, 2019, DOI: 10.1109/TII.2019.2943898
@author: super_9527
"""

from __future__ import print_function
import keras
import numpy as np
from keras.datasets import mnist
from keras.layers import Dense, Conv2D, BatchNormalization, Activation
from keras.layers import AveragePooling2D, Input, GlobalAveragePooling2D
from keras.optimizers import Adam
from keras.regularizers import l2
from keras import backend as K
from keras.models import Model
from keras.layers.core import Lambda
K.set_learning_phase(1)

# Input image dimensions
img_rows, img_cols = 28, 28

# The data, split between train and test sets
(x_train, y_train), (x_test, y_test) = mnist.load_data()

if K.image_data_format() == 'channels_first':
    x_train = x_train.reshape(x_train.shape[0], 1, img_rows, img_cols)
    x_test = x_test.reshape(x_test.shape[0], 1, img_rows, img_cols)
    input_shape = (1, img_rows, img_cols)
else:
    x_train = x_train.reshape(x_train.shape[0], img_rows, img_cols, 1)
    x_test = x_test.reshape(x_test.shape[0], img_rows, img_cols, 1)
    input_shape = (img_rows, img_cols, 1)

# Noised data
# Draw noise with the reshaped array's own shape so it matches either data format
x_train = x_train.astype('float32') / 255. + 0.5*np.random.random(x_train.shape)
x_test = x_test.astype('float32') / 255. + 0.5*np.random.random(x_test.shape)
print('x_train shape:', x_train.shape)
print(x_train.shape[0], 'train samples')
print(x_test.shape[0], 'test samples')

# convert class vectors to binary class matrices
y_train = keras.utils.to_categorical(y_train, 10)
y_test = keras.utils.to_categorical(y_test, 10)


def abs_backend(inputs):
    return K.abs(inputs)

def expand_dim_backend(inputs):
    return K.expand_dims(K.expand_dims(inputs,1),1)

def sign_backend(inputs):
    return K.sign(inputs)

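def pad_backend(inputs, in_channels, out_channels):
    pad_dim = (out_channels - in_channels)//2
    # spatial_3d_padding expects a 5D tensor: add a trailing axis, pad the channel axis, remove it again
    inputs = K.expand_dims(inputs, -1)
    inputs = K.spatial_3d_padding(inputs, padding=((0,0),(0,0),(pad_dim,pad_dim)), data_format='channels_last')
    return K.squeeze(inputs, -1)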

# Residual shrinkage block
def residual_shrinkage_block(incoming, nb_blocks, out_channels, downsample=False,
                             downsample_strides=2):
    
    residual = incoming
    in_channels = incoming.get_shape().as_list()[-1]
    
    for i in range(nb_blocks):
        
        identity = residual
        
        if not downsample:
            downsample_strides = 1
        
        residual = BatchNormalization()(residual)
        residual = Activation('relu')(residual)
        residual = Conv2D(out_channels, 3, strides=(downsample_strides, downsample_strides), 
                          padding='same', kernel_initializer='he_normal', 
                          kernel_regularizer=l2(1e-4))(residual)
        
        residual = BatchNormalization()(residual)
        residual = Activation('relu')(residual)
        residual = Conv2D(out_channels, 3, padding='same', kernel_initializer='he_normal', 
                          kernel_regularizer=l2(1e-4))(residual)
        
        # Calculate global means
        residual_abs = Lambda(abs_backend)(residual)
        abs_mean = GlobalAveragePooling2D()(residual_abs)
        
        # Calculate scaling coefficients
        scales = Dense(out_channels, activation=None, kernel_initializer='he_normal', 
                       kernel_regularizer=l2(1e-4))(abs_mean)
        scales = BatchNormalization()(scales)
        scales = Activation('relu')(scales)
        scales = Dense(out_channels, activation='sigmoid', kernel_regularizer=l2(1e-4))(scales)
        scales = Lambda(expand_dim_backend)(scales)
        
        # Calculate thresholds
        thres = keras.layers.multiply([abs_mean, scales])
        
        # Soft thresholding
        sub = keras.layers.subtract([residual_abs, thres])
        zeros = keras.layers.subtract([sub, sub])
        n_sub = keras.layers.maximum([sub, zeros])
        residual = keras.layers.multiply([Lambda(sign_backend)(residual), n_sub])
        
        # Downsampling (it is important to use a pool_size of (1, 1))
        if downsample_strides > 1:
            identity = AveragePooling2D(pool_size=(1,1), strides=(2,2))(identity)
            
        # Zero padding to match channels (it is important to use zero padding rather than 1-by-1 convolution)
        if in_channels != out_channels:
            # Extra arguments must be passed to the wrapped function through `arguments`
            identity = Lambda(pad_backend, arguments={'in_channels': in_channels, 'out_channels': out_channels})(identity)
            in_channels = out_channels
        
        residual = keras.layers.add([residual, identity])
    
    return residual


# define and train a model
inputs = Input(shape=input_shape)
net = Conv2D(8, 3, padding='same', kernel_initializer='he_normal', kernel_regularizer=l2(1e-4))(inputs)
net = residual_shrinkage_block(net, 1, 8, downsample=True)
net = BatchNormalization()(net)
net = Activation('relu')(net)
net = GlobalAveragePooling2D()(net)
outputs = Dense(10, activation='softmax', kernel_initializer='he_normal', kernel_regularizer=l2(1e-4))(net)
model = Model(inputs=inputs, outputs=outputs)
model.compile(loss='categorical_crossentropy', optimizer=Adam(), metrics=['accuracy'])
model.fit(x_train, y_train, batch_size=100, epochs=5, verbose=1, validation_data=(x_test, y_test))

# get results
K.set_learning_phase(0)
DRSN_train_score = model.evaluate(x_train, y_train, batch_size=100, verbose=0)
print('Train loss:', DRSN_train_score[0])
print('Train accuracy:', DRSN_train_score[1])
DRSN_test_score = model.evaluate(x_test, y_test, batch_size=100, verbose=0)
print('Test loss:', DRSN_test_score[0])
print('Test accuracy:', DRSN_test_score[1])

The TFLearn program is as follows:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Mon Dec 23 21:23:09 2019
Implemented using TensorFlow 1.0 and TFLearn 0.3.2
 
M. Zhao, S. Zhong, X. Fu, B. Tang, M. Pecht, Deep Residual Shrinkage Networks for Fault Diagnosis, 
IEEE Transactions on Industrial Informatics, 2019, DOI: 10.1109/TII.2019.2943898
 
@author: super_9527
"""
  
from __future__ import division, print_function, absolute_import
  
import tflearn
import numpy as np
import tensorflow as tf
from tflearn.layers.conv import conv_2d
  
# Data loading
from tflearn.datasets import cifar10
(X, Y), (testX, testY) = cifar10.load_data()
  
# Add noise
X = X + np.random.random((50000, 32, 32, 3))*0.1
testX = testX + np.random.random((10000, 32, 32, 3))*0.1
  
# Transform labels to one-hot format
Y = tflearn.data_utils.to_categorical(Y,10)
testY = tflearn.data_utils.to_categorical(testY,10)
  
def residual_shrinkage_block(incoming, nb_blocks, out_channels, downsample=False,
                   downsample_strides=2, activation='relu', batch_norm=True,
                   bias=True, weights_init='variance_scaling',
                   bias_init='zeros', regularizer='L2', weight_decay=0.0001,
                   trainable=True, restore=True, reuse=False, scope=None,
                   name="ResidualBlock"):
      
    # residual shrinkage blocks with channel-wise thresholds
  
    residual = incoming
    in_channels = incoming.get_shape().as_list()[-1]
  
    # Variable Scope fix for older TF
    try:
        vscope = tf.variable_scope(scope, default_name=name, values=[incoming],
                                   reuse=reuse)
    except Exception:
        vscope = tf.variable_op_scope([incoming], scope, name, reuse=reuse)
  
    with vscope as scope:
        name = scope.name #TODO
  
        for i in range(nb_blocks):
  
            identity = residual
  
            if not downsample:
                downsample_strides = 1
  
            if batch_norm:
                residual = tflearn.batch_normalization(residual)
            residual = tflearn.activation(residual, activation)
            residual = conv_2d(residual, out_channels, 3,
                             downsample_strides, 'same', 'linear',
                             bias, weights_init, bias_init,
                             regularizer, weight_decay, trainable,
                             restore)
  
            if batch_norm:
                residual = tflearn.batch_normalization(residual)
            residual = tflearn.activation(residual, activation)
            residual = conv_2d(residual, out_channels, 3, 1, 'same',
                             'linear', bias, weights_init,
                             bias_init, regularizer, weight_decay,
                             trainable, restore)
              
            # get thresholds and apply thresholding
            abs_mean = tf.reduce_mean(tf.reduce_mean(tf.abs(residual),axis=2,keep_dims=True),axis=1,keep_dims=True)
            scales = tflearn.fully_connected(abs_mean, out_channels//4, activation='linear',regularizer='L2',weight_decay=0.0001,weights_init='variance_scaling')
            scales = tflearn.batch_normalization(scales)
            scales = tflearn.activation(scales, 'relu')
            scales = tflearn.fully_connected(scales, out_channels, activation='linear',regularizer='L2',weight_decay=0.0001,weights_init='variance_scaling')
            scales = tf.expand_dims(tf.expand_dims(scales,axis=1),axis=1)
            thres = tf.multiply(abs_mean,tflearn.activations.sigmoid(scales))
            # soft thresholding
            residual = tf.multiply(tf.sign(residual), tf.maximum(tf.abs(residual)-thres,0))
              
  
            # Downsampling
            if downsample_strides > 1:
                identity = tflearn.avg_pool_2d(identity, 1,
                                               downsample_strides)
  
            # Projection to new dimension
            if in_channels != out_channels:
                if (out_channels - in_channels) % 2 == 0:
                    ch = (out_channels - in_channels)//2
                    identity = tf.pad(identity,
                                      [[0, 0], [0, 0], [0, 0], [ch, ch]])
                else:
                    ch = (out_channels - in_channels)//2
                    identity = tf.pad(identity,
                                      [[0, 0], [0, 0], [0, 0], [ch, ch+1]])
                in_channels = out_channels
  
            residual = residual + identity
  
    return residual
  
  
# Real-time data preprocessing
img_prep = tflearn.ImagePreprocessing()
img_prep.add_featurewise_zero_center(per_channel=True)
  
# Real-time data augmentation
img_aug = tflearn.ImageAugmentation()
img_aug.add_random_flip_leftright()
img_aug.add_random_crop([32, 32], padding=4)
  
# Build a Deep Residual Shrinkage Network with 3 blocks
net = tflearn.input_data(shape=[None, 32, 32, 3],
                         data_preprocessing=img_prep,
                         data_augmentation=img_aug)
net = tflearn.conv_2d(net, 16, 3, regularizer='L2', weight_decay=0.0001)
net = residual_shrinkage_block(net, 1, 16)
net = residual_shrinkage_block(net, 1, 32, downsample=True)
net = residual_shrinkage_block(net, 1, 32, downsample=True)
net = tflearn.batch_normalization(net)
net = tflearn.activation(net, 'relu')
net = tflearn.global_avg_pool(net)
# Regression
net = tflearn.fully_connected(net, 10, activation='softmax')
mom = tflearn.Momentum(0.1, lr_decay=0.1, decay_step=20000, staircase=True)
net = tflearn.regression(net, optimizer=mom, loss='categorical_crossentropy')
# Training
model = tflearn.DNN(net, checkpoint_path='model_cifar10',
                    max_checkpoints=10, tensorboard_verbose=0,
                    clip_gradients=0.)
  
model.fit(X, Y, n_epoch=100, snapshot_epoch=False, snapshot_step=500,
          show_metric=True, batch_size=100, shuffle=True, run_id='model_cifar10')
  
training_acc = model.evaluate(X, Y)[0]
validation_acc = model.evaluate(testX, testY)[0]

 

Paper link

M. Zhao, S. Zhong, X. Fu, et al., Deep residual shrinkage networks for fault diagnosis, IEEE Transactions on Industrial Informatics, DOI: 10.1109/TII.2019.2943898

https://ieeexplore.ieee.org/document/8850096

 

Origin www.cnblogs.com/uizhi/p/12239690.html