Face recognition and neural style transfer

This article is based on the reference material, and on that basis gives a brief introduction to face recognition and neural style transfer.


Face recognition

Model building

       It is not enough for face recognition to compare only surface-level features. Here, FaceNet is used for feature extraction. Since FaceNet requires a large amount of data and takes a long time to train, we follow common practice in applied deep learning and load weights that others have already trained. To be honest, I would not know how to train it from scratch anyway.

       Network information: the network takes 96x96 RGB images as input. With m images, the input data dimensions are (m, 3, 96, 96) and the output is (m, 128), i.e. one 128-dimensional vector per image.

       Anyway, let's load the model first. I have to say the encapsulation is really good; otherwise I would not know how to load it...

from keras.models import Sequential
from keras.layers import Conv2D, ZeroPadding2D, Activation, Input, concatenate
from keras.models import Model
from keras.layers.normalization import BatchNormalization
from keras.layers.pooling import MaxPooling2D, AveragePooling2D
from keras.layers.merge import Concatenate
from keras.layers.core import Lambda, Flatten, Dense
from keras.initializers import glorot_uniform
from keras.engine.topology import Layer
from keras import backend as K

#---------- For plotting model details, optional ----------#
from IPython.display import SVG
from keras.utils.vis_utils import model_to_dot
from keras.utils import plot_model
#------------------------------------------------#

K.set_image_data_format('channels_first')

import time
import cv2
import os
import numpy as np
from numpy import genfromtxt
import pandas as pd
import tensorflow as tf
import fr_utils
from inception_blocks_v2 import *

%matplotlib inline
%reload_ext autoreload
%autoreload 2


# Load the model
FRmodel = faceRecoModel(input_shape=(3, 96, 96))    # this model is defined in inception_blocks_v2

# Print the total number of parameters in the model
print("Number of parameters: " + str(FRmodel.count_params()))

# Number of parameters: 3743280
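
As a quick sanity check of the input and output shapes described above (a small sketch, assuming FRmodel has been built as in the code above):

print(FRmodel.input_shape)     # (None, 3, 96, 96), channels_first
print(FRmodel.output_shape)    # (None, 128)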

The network structure is quite complex and unlike anything I had seen before. The summary below is truncated (several layers in the middle are omitted), but you can see that the final output is a 128-dimensional vector, which is the deep feature extracted from the photo.

FRmodel.summary()    # the final output is a (None, 128) vector
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
input_1 (InputLayer)            (None, 3, 96, 96)    0                                            
__________________________________________________________________________________________________
zero_padding2d_1 (ZeroPadding2D (None, 3, 102, 102)  0           input_1[0][0]                    
__________________________________________________________________________________________________
conv1 (Conv2D)                  (None, 64, 48, 48)   9472        zero_padding2d_1[0][0]           
__________________________________________________________________________________________________
bn1 (BatchNormalization)        (None, 64, 48, 48)   256         conv1[0][0]                      
__________________________________________________________________________________________________
activation_1 (Activation)       (None, 64, 48, 48)   0           bn1[0][0]                        
__________________________________________________________________________________________________
zero_padding2d_2 (ZeroPadding2D (None, 64, 50, 50)   0           activation_1[0][0]               
__________________________________________________________________________________________________
max_pooling2d_1 (MaxPooling2D)  (None, 64, 24, 24)   0           zero_padding2d_2[0][0]           
__________________________________________________________________________________________________
conv2 (Conv2D)                  (None, 64, 24, 24)   4160        max_pooling2d_1[0][0]            
__________________________________________________________________________________________________
bn2 (BatchNormalization)        (None, 64, 24, 24)   256         conv2[0][0]                      
__________________________________________________________________________________________________
activation_2 (Activation)       (None, 64, 24, 24)   0           bn2[0][0]                        
__________________________________________________________________________________________________


.......
__________________________________________________________________________________________________
concatenate_7 (Concatenate)     (None, 736, 3, 3)    0           activation_35[0][0]              
                                                                 zero_padding2d_23[0][0]          
                                                                 activation_37[0][0]              
__________________________________________________________________________________________________
average_pooling2d_4 (AveragePoo (None, 736, 1, 1)    0           concatenate_7[0][0]              
__________________________________________________________________________________________________
flatten_1 (Flatten)             (None, 736)          0           average_pooling2d_4[0][0]        
__________________________________________________________________________________________________
dense_layer (Dense)             (None, 128)          94336       flatten_1[0][0]                  
__________________________________________________________________________________________________
lambda_1 (Lambda)               (None, 128)          0           dense_layer[0][0]                
==================================================================================================
Total params: 3,743,280
Trainable params: 3,733,968
Non-trainable params: 9,312

Triplet loss

  • Two images of the same person are encoded very similarly.

  • Images of two different people are encoded very differently.

       So for training we take two different encodings of the same person (anchor and positive) and one encoding of another person (negative). The triplet loss pulls the two encodings of the same person closer together and pushes the encodings of different people apart.

       The code uses the squared L2 norm to measure the encoding distance, i.e. the sum of squared differences. To make sure the distance between encodings of the same person is smaller than the distance between encodings of different people, a margin alpha is added, and the result is then clipped at 0. This treatment is very similar to the hinge loss used for SVMs.
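
For reference, the triplet loss implemented below can be written as (the standard formulation, which matches the code):

$$\mathcal{J} = \sum_{i} \max\Big( \big\|f(A^{(i)}) - f(P^{(i)})\big\|_2^2 \;-\; \big\|f(A^{(i)}) - f(N^{(i)})\big\|_2^2 + \alpha,\; 0 \Big)$$

where A, P and N are the anchor, positive and negative images and f is the FaceNet encoding.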

def triplet_loss(y_true, y_pred, alpha = 0.2):
    # y_pred packs the encodings of the anchor, positive and negative images
    anchor, positive, negative = y_pred[0], y_pred[1], y_pred[2]

    # squared L2 distance between the anchor and the positive / negative encodings
    pos_dist = tf.reduce_sum(tf.square(tf.subtract(anchor, positive)), axis=-1)
    neg_dist = tf.reduce_sum(tf.square(tf.subtract(anchor, negative)), axis=-1)

    # add the margin alpha, clip at 0 and sum over the batch
    basic_loss = tf.add(tf.subtract(pos_dist, neg_dist), alpha)
    loss = tf.reduce_sum(tf.maximum(basic_loss, 0))
    return loss

Now load the weights. What was loaded before was only the model architecture; here we assign the pretrained values to its parameters.

# Start time
start_time = time.clock()

# Compile the model
FRmodel.compile(optimizer = 'adam', loss = triplet_loss, metrics = ['accuracy'])

# Load the pretrained weights
fr_utils.load_weights_from_FaceNet(FRmodel)    # loads the weights layer by layer, matched by name

# End time
end_time = time.clock()

# Elapsed time
minium = end_time - start_time

print("Executed in: " + str(int(minium / 60)) + " min " + str(int(minium % 60)) + " sec")

 Model application

Build a database that stores each person's name and the corresponding encoding. When someone swipes their ID, look up their stored encoding and compare it with the encoding of the freshly captured photo.

database = {}
database["danielle"] = fr_utils.img_to_encoding("images/danielle.png", FRmodel)
database["younes"] = fr_utils.img_to_encoding("images/younes.jpg", FRmodel)
database["tian"] = fr_utils.img_to_encoding("images/tian.jpg", FRmodel)
database["andrew"] = fr_utils.img_to_encoding("images/andrew.jpg", FRmodel)
database["kian"] = fr_utils.img_to_encoding("images/kian.jpg", FRmodel)
database["dan"] = fr_utils.img_to_encoding("images/dan.jpg", FRmodel)
database["sebastiano"] = fr_utils.img_to_encoding("images/sebastiano.jpg", FRmodel)
database["bertrand"] = fr_utils.img_to_encoding("images/bertrand.jpg", FRmodel)
database["kevin"] = fr_utils.img_to_encoding("images/kevin.jpg", FRmodel)
database["felix"] = fr_utils.img_to_encoding("images/felix.jpg", FRmodel)
database["benoit"] = fr_utils.img_to_encoding("images/benoit.jpg", FRmodel)
database["arnaud"] = fr_utils.img_to_encoding("images/arnaud.jpg", FRmodel)

During verification, the triplet loss is not used to measure similarity; the L2 norm of the encoding difference is used directly.

def verify(image_path, identity, database, model):
    # encode the captured photo and compare it with the stored encoding for this identity
    encoding = fr_utils.img_to_encoding(image_path, model)
    dist = np.linalg.norm(encoding - database[identity])   # plain L2 distance, not the triplet loss

    if dist < 0.7:
        print("Welcome home, " + str(identity) + "!")
        is_door_open = True
    else:
        print("Verification failed: you do not match " + str(identity) + "!")
        is_door_open = False
    return dist, is_door_open

The verification above uses an ID to look up a stored encoding. The recognition below does not require an ID: an encoding is computed directly from the captured photo, and the database is traversed to find the closest match.

def who_is_it(image_path, database, model):
    # encode the captured photo, then search the database for the closest encoding
    encoding = fr_utils.img_to_encoding(image_path, model)
    min_dist = 100
    for (name, db_enc) in database.items():
        dist = np.linalg.norm(encoding - db_enc)
        if dist < min_dist:
            min_dist = dist
            identity = name
    if min_dist > 0.7:
        print("Sorry, your information is not in the database.")

    else:
        print("Name: " + str(identity) + "  distance: " + str(min_dist))

    return min_dist, identity
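
A minimal usage sketch of the two functions (the image path "images/camera_0.jpg" is an assumption, standing in for a freshly captured photo):

# Verification: the visitor claims to be "younes"
dist, door_open = verify("images/camera_0.jpg", "younes", database, FRmodel)

# Recognition: no claimed identity, search the whole database
min_dist, identity = who_is_it("images/camera_0.jpg", database, FRmodel)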

Neural style transfer

       This part uses TensorFlow, which I have not really worked with in my previous posts, so I am not fully confident in the details. My understanding is consistent with that of another blog, so I attach additional references.

       This part uses transfer learning, which means using someone else's pretrained network to implement your own idea. The network used here is VGG-19. Again, I am not deeply familiar with VGG-19, but its structure is clear from the reference figure; the fully connected layers at the end are not needed here.

       We feed two images into the network, one as the content and one as the style, and extract the corresponding features. Note that the VGG-19 network here takes a 400x300, 3-channel image as input.

       In simple terms, here is how neural style transfer works. In the diagram there are three pictures: left, middle and right. The left one is the style image, the right one is the content image, and the middle one is the generated image.

       Left: style features are extracted by the convolution operations; five layers are used in total. Each layer's feature matrix then needs to be converted into a style matrix.

       Middle: we pre-generate an image (the content image with random noise added) and pass it through the same convolutions, which again yields features at those five layers. By operating on these features to reduce their distance to the style features, the style of the middle image becomes similar to that of the left one.

       Right: the content features extracted by different convolutional layers differ. We take the features of the generated image and of the content image at the same chosen layer (any layer can be used; the code below uses conv4_2, but the second or third layer would also work) and shorten the distance between them, so that the content of the generated image stays very similar to the content image.

Implementation details

Calculate content loss

        The content cost is relatively simple to compute: take the difference between the generated image's features and the content image's features, square it, and sum everything up.

        One note here: the code flattens each three-dimensional feature volume of shape (n_H, n_W, n_C) into a two-dimensional matrix for simpler calculation. For the subtraction to be well defined, the content features and the generated-image features must be taken from the same layer, so that they convert to matrices of the same dimensions.
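
Written out, the content cost implemented below is (matching the 1/(4 n_H n_W n_C) factor in the code):

$$J_{content}(C, G) = \frac{1}{4\, n_H n_W n_C} \sum_{\text{all entries}} \big(a^{(C)} - a^{(G)}\big)^2$$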

def compute_content_cost(a_C, a_G):
    # Compute the content cost from the activations of the chosen layer
    m, n_H, n_W, n_C = a_G.get_shape().as_list()
    # unroll the (n_H, n_W, n_C) volumes into (n_C, n_H*n_W) matrices
    a_C_unrolled = tf.transpose(tf.reshape(a_C, [n_H*n_W, n_C]))
    a_G_unrolled = tf.transpose(tf.reshape(a_G, [n_H*n_W, n_C]))
    J_content = 1/(4*n_C*n_H*n_W)*tf.reduce_sum(tf.square(tf.subtract(a_C_unrolled, a_G_unrolled)))
    return J_content

Compute style loss

       The style matrix used here is the "Gram matrix":

 The matrix A here is a two-dimensional matrix of shape (n_C, n_H * n_W), obtained by unrolling each layer's feature volume in the same way as for the content cost above.
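
Concretely, the Gram matrix computed below is

$$G = A A^{T}, \qquad G_{ij} = \sum_{k} A_{ik} A_{jk},$$

so each entry measures how strongly channels i and j activate together.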

def gram_matrix(A):
    """
    Compute the style (Gram) matrix of A

    Arguments:
        A -- matrix of shape (n_C, n_H * n_W)

    Returns:
        GA -- Gram matrix of A, of shape (n_C, n_C)

    """
    GA = tf.matmul(A, A, transpose_b = True)

    return GA

With the style (Gram) matrix, we can calculate the style cost for a single layer.
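
The per-layer style cost implemented below is (matching the normalization factor in the code):

$$J^{[l]}_{style}(S, G) = \frac{1}{4\, n_C^2\, (n_H n_W)^2} \sum_{i=1}^{n_C} \sum_{j=1}^{n_C} \big(G^{(S)}_{ij} - G^{(G)}_{ij}\big)^2$$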

def compute_layer_style_cost(a_S, a_G):
    """
    Compute the style cost for a single hidden layer

    Arguments:
        a_S -- tensor of shape (1, n_H, n_W, n_C), activations representing the style of image S at the hidden layer
        a_G -- tensor of shape (1, n_H, n_W, n_C), activations representing the style of image G at the hidden layer

    Returns:
        J_style_layer -- a real number, the per-layer style cost defined above

    """
    m, n_H, n_W, n_C = a_G.get_shape().as_list()

    # unroll the activations into (n_C, n_H*n_W) matrices
    a_S = tf.transpose(tf.reshape(a_S, [n_H*n_W, n_C]))
    a_G = tf.transpose(tf.reshape(a_G, [n_H*n_W, n_C]))

    # Step 3: compute the Gram matrices of S and G
    GS = gram_matrix(a_S)
    GG = gram_matrix(a_G)

    # Step 4: compute the style cost for this layer
    #J_style_layer = (1/(4 * np.square(n_C) * np.square(n_H * n_W))) * (tf.reduce_sum(tf.square(tf.subtract(GS, GG))))
    J_style_layer = 1/(4*n_C*n_C*n_H*n_H*n_W*n_W)*tf.reduce_sum(tf.square(tf.subtract(GS, GG)))

    return J_style_layer

Since we use five style layers, we loop over them, weight each layer's style cost with a coefficient, and sum them up.
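
The overall style cost is then the weighted sum over the selected layers:

$$J_{style}(S, G) = \sum_{l} \lambda^{[l]}\, J^{[l]}_{style}(S, G)$$

where each λ^[l] is the coefficient listed in STYLE_LAYERS below (0.2 for every layer here).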

STYLE_LAYERS = [
    ('conv1_1', 0.2),
    ('conv2_1', 0.2),
    ('conv3_1', 0.2),
    ('conv4_1', 0.2),
    ('conv5_1', 0.2)]

def compute_style_cost(model, STYLE_LAYERS):
    """
    Compute the overall style cost over the selected layers

    Arguments:
        model -- the loaded tensorflow model
        STYLE_LAYERS -- a list of tuples containing:
                        - the names of the layers we want to extract style from
                        - a coefficient (coeff) for each layer
    Returns:
        J_style -- tensor, a real number, the weighted sum of the per-layer style costs.

    """
    # Initialize the overall style cost
    J_style = 0

    for layer_name, coeff in STYLE_LAYERS:

        # select the output tensor of the currently selected layer
        out = model[layer_name]

        # run the session to set a_S to the activations of the chosen hidden layer
        a_S = sess.run(out)

        # set a_G to the hidden-layer activations from the same layer; a_G references model[layer_name]
        # and has not been evaluated yet. Later we assign image G as the model input, so when the
        # session is run these will be the activations obtained with G as the input.
        a_G = out

        # compute the style cost of the current layer
        J_style_layer = compute_layer_style_cost(a_S, a_G)

        # add it to the overall style cost, weighted by the coefficient
        J_style += coeff * J_style_layer

    return J_style

Optimization formula

def total_cost(J_content, J_style, alpha = 10, beta = 40):
    """
    Compute the total cost

    Arguments:
        J_content -- output of the content cost function
        J_style -- output of the style cost function
        alpha -- hyperparameter, weight of the content cost
        beta -- hyperparameter, weight of the style cost

    Returns:
        J -- the total cost

    """

    J = alpha * J_content + beta * J_style

    return J

The value of J is what we optimize. Note that, once the content and style activations have been captured, J no longer depends on the style image, the content image, or the network weights as free variables; the only variable left is the generated image. The gradient descent over the whole network therefore differentiates J with respect to this image alone.
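
As a small illustration of this point (a sketch only, TF1-style, assuming the cost J and the model['input'] variable built in the code further below), the gradient of the total cost is taken with respect to the generated image alone:

grads = tf.gradients(J, [model['input']])   # gradients flow only to the input image; the VGG weights are constants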


Hands-on practice

       Since I am still not very familiar with what TensorFlow is doing under the hood, and the principles and ideas have already been explained above, I only include the code and the results here.

       Finally, there is one more observation: the content cost increases rather than decreases during training. This is because the initial generated image is the content image with noise added, so the content cost is low at the start and rises as the image evolves toward the style. My guess is that if the initial image were purely random, the content cost would go down instead (a small sketch of this idea follows the code below).

Iteration 0,  total cost: 24287758000.0  content cost: 8792.098  style cost: 607191740.0
Iteration 20,  total cost: 4335552500.0  content cost: 26040.2  style cost: 108382296.0
Iteration 40,  total cost: 1796130700.0  content cost: 28367.377  style cost: 44896176.0
Iteration 60,  total cost: 974274600.0  content cost: 29780.506  style cost: 24349420.0
Iteration 80,  total cost: 656141400.0  content cost: 30443.324  style cost: 16395924.0
Iteration 100,  total cost: 496794700.0  content cost: 30877.447  style cost: 12412148.0
Iteration 120,  total cost: 400211620.0  content cost: 31226.652  style cost: 9997484.0
Iteration 140,  total cost: 333818020.0  content cost: 31505.57  style cost: 8337574.5
Iteration 160,  total cost: 285234000.0  content cost: 31755.535  style cost: 7122910.5
Iteration 180,  total cost: 247696830.0  content cost: 31973.885  style cost: 6184427.0
Executed in: 0 min 13 sec

# Imports used in this part
import time
import scipy.misc
import tensorflow as tf
from matplotlib.pyplot import imshow
import nst_utils

tf.reset_default_graph()

# Create an interactive session
sess = tf.InteractiveSession()

# Load the content image and normalize it
content_image = scipy.misc.imread("images/resize1.jpg")
content_image = nst_utils.reshape_and_normalize_image(content_image)

# Load the style image and normalize it
style_image = scipy.misc.imread("images/resize5.jpg")
style_image = nst_utils.reshape_and_normalize_image(style_image)

# Initialize the generated image by adding random noise to the content image
generated_image = nst_utils.generate_noise_image(content_image)
imshow(generated_image[0])

# Load VGG-19
model = nst_utils.load_vgg_model("pretrained-model/imagenet-vgg-verydeep-19.mat")
# Feed the content image as the input of the VGG model
sess.run(model['input'].assign(content_image))

# Use conv4_2 as the content layer
out = model['conv4_2']

# Set a_C to the activations of conv4_2 for the content image
a_C = sess.run(out)

a_G = out

# Compute the content cost
J_content = compute_content_cost(a_C, a_G)


# Feed the style image as the input
sess.run(model['input'].assign(style_image))


# Compute the style cost
J_style = compute_style_cost(model, STYLE_LAYERS)

J = total_cost(J_content, J_style)

optimizer = tf.train.AdamOptimizer(2.0)
train_step = optimizer.minimize(J)
def model_nn(sess, input_image, num_iterations = 200, is_print_info = True,
             is_plot = True, is_save_process_image = True, save_last_image_to = "output/generated_image.jpg"):
    # Initialize global variables
    sess.run(tf.global_variables_initializer())

    # Assign the noisy input image as the model input
    sess.run(model["input"].assign(input_image))

    for i in range(num_iterations):
        # Run one minimization step
        sess.run(train_step)

        # Retrieve the generated image currently stored in the model input
        generated_image = sess.run(model["input"])

        if is_print_info and i % 20 == 0:
            Jt, Jc, Js = sess.run([J, J_content, J_style])
            print("Iteration " + str(i) + "," +
                  "  total cost: " + str(Jt) +
                  "  content cost: " + str(Jc) +
                  "  style cost: " + str(Js))
        if is_save_process_image:
            nst_utils.save_image("output/" + str(i) + ".png", generated_image)

    nst_utils.save_image(save_last_image_to, generated_image)

    return generated_image
# Start time
start_time = time.clock()

# Non-GPU version, about 25-30 min
#generated_image = model_nn(sess, generated_image)


# With a GPU, about 1-2 min
with tf.device("/gpu:0"):
    generated_image = model_nn(sess, generated_image)

# End time
end_time = time.clock()

# Elapsed time
minium = end_time - start_time

print("Executed in: " + str(int(minium / 60)) + " min " + str(int(minium % 60)) + " sec")


Origin blog.csdn.net/qq_41828351/article/details/90516217