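1. The Wide&Deep Model
Google proposed the Wide&Deep model in 2016. As its name suggests, it is a hybrid model made up of a single-layer Wide part and a multi-layer Deep part. The Wide part gives the model "memorization" ability, while the Deep part gives it "generalization" ability.
The Wide&Deep paper is available at: https://arxiv.org/pdf/1606.07792v1.pdf
Key characteristic of Wide&Deep: it combines the strengths of logistic regression and deep neural networks, so the model can quickly process and memorize a large number of historical behavior features while retaining strong expressive power.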
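For reference, the joint prediction can be written as below. The notation follows the paper linked above rather than the code later in this article: $\phi(\mathbf{x})$ denotes the cross-product feature transformations used by the Wide part, $a^{(l_f)}$ the final hidden-layer activations of the Deep part, $b$ the bias, and $\sigma$ the sigmoid function.

```latex
P(Y = 1 \mid \mathbf{x}) = \sigma\left( \mathbf{w}_{wide}^{T} \, [\mathbf{x}, \phi(\mathbf{x})] + \mathbf{w}_{deep}^{T} \, a^{(l_f)} + b \right)
```

The two parts contribute to a single logit, which is why they can be trained jointly against one loss.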
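Advantages of the Wide&Deep model:
* Simple and effective. The structure is simple and easy to understand, and it performs well. It is still widely used in industry, which also attests to its effectiveness.
* Novel structure. Unlike earlier approaches that connect a linear model and a DNN in series, it connects them in parallel, balancing Memorization and Generalization.
Disadvantages of the Wide&Deep model:
* Feature engineering on the Wide side is still unavoidable.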
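2. Algorithm Principles
2.1 The Wide part: "memorization"
The Wide part is a generalized linear model, comparable to logistic regression.
"Memorization" can be understood as the model's ability to directly learn and exploit the co-occurrence frequency of items or features in historical data.
Simple models such as logistic regression and collaborative filtering have strong memorization. They behave like rule-based recommendation of the form "if the user clicked A, recommend B", which amounts to the model directly memorizing the distribution of the historical data.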
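A minimal sketch of how a linear model can memorize such a rule, assuming binary one-hot inputs and one hand-built crossed feature. The helper names (`cross_index`, `wide_score`), the item counts, and the weight value are hypothetical and only illustrate the idea; they are not taken from the implementation later in this article.

```python
import math

def cross_index(a, b, num_b):
    """Map the category pair (a, b) to one slot of a crossed one-hot feature."""
    return a * num_b + b

def wide_score(active, weights, bias=0.0):
    """Logistic-regression score for a sparse binary input.

    `active` lists the indices of the features that are 1, so the dot
    product reduces to summing the weights at those indices.
    """
    logit = bias + sum(weights[i] for i in active)
    return 1.0 / (1.0 + math.exp(-logit))

# Hypothetical setup: 3 "clicked" items crossed with 3 "candidate" items.
num_items = 3
weights = [0.0] * (num_items * num_items)
# A large weight on the crossed slot (clicked=0, candidate=1) encodes
# the rule "if the user clicked item 0, recommend item 1".
weights[cross_index(0, 1, num_items)] = 2.0

seen_pair = wide_score([cross_index(0, 1, num_items)], weights)
unseen_pair = wide_score([cross_index(2, 1, num_items)], weights)
print(seen_pair, unseen_pair)
```

The pair that carries a learned weight scores roughly 0.88, while the pair without one stays at 0.5: the linear model reproduces what it has seen but has nothing to say about unseen combinations.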
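2.2 The Deep part: "generalization"
The Deep part is a classic DNN.
"Generalization" can be understood as the model's ability to propagate feature correlations and to uncover how sparse, or even never-before-seen, rare features relate to the final label.
For example, matrix factorization generalizes better than collaborative filtering because it introduces latent vectors.
A DNN, by freely combining features across multiple layers, can mine latent patterns in the data. Both approaches let sparse data gain support from the rest of the data, so the model can produce stable recommendations. This is the "generalization" that simple models lack.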
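The idea can be sketched in NumPy: a sparse ID is turned into a dense vector by an embedding matrix, and a small feed-forward network then combines the dense values. All sizes, the random weights, and the helper names (`one_hot`, `mlp_forward`) are assumptions chosen for illustration, not the article's model.

```python
import numpy as np

rng = np.random.default_rng(0)
num_ids, vec_dim = 5, 3            # hypothetical vocabulary size and embedding width
embedding = rng.normal(size=(num_ids, vec_dim))

def one_hot(index, size):
    """Return a vector of zeros with a single 1.0 at `index`."""
    v = np.zeros(size)
    v[index] = 1.0
    return v

# Multiplying a one-hot row by the embedding matrix selects one row of it,
# so a sparse ID becomes a short dense vector.
dense = one_hot(2, num_ids) @ embedding
assert np.allclose(dense, embedding[2])

def mlp_forward(x, layers):
    """Apply (W, b) pairs with ReLU between layers; the last layer stays linear."""
    for k, (w, b) in enumerate(layers):
        x = x @ w + b
        if k < len(layers) - 1:
            x = np.maximum(x, 0.0)
    return x

layers = [(rng.normal(size=(vec_dim, 4)), np.zeros(4)),
          (rng.normal(size=(4, 1)), np.zeros(1))]
logit = mlp_forward(dense, layers)
print(logit.shape)
```

Because IDs with similar behavior can end up with nearby embedding vectors, the network can score combinations it never saw together, which is the effect described above.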
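3. Code Implementation
The data used is MovieLens 100K. To keep things simple and focus on demonstrating the Wide&Deep implementation, only uid and itemId are used as input features, with rating as the label.
Dataset
u.item: movie information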
movie id | movie title | release date | video release date |IMDb URL |unknown | Action | Adventure | Animation |Children's | Comedy | Crime |Documentary | Drama | Fantasy |Film-Noir | Horror | Musical | Mystery |Romance | Sci-Fi |Thriller | War | Western
1|Toy Story (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)|0|0|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0
2|GoldenEye (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?GoldenEye%20(1995)|0|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0
3|Four Rooms (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Four%20Rooms%20(1995)|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0
u.user: user information
user id | age | gender | occupation | zip code
1|24|M|technician|85711
2|53|F|other|94043
3|23|M|writer|32067
ua.base: training set
ua.test: test set
user id | item id | rating | timestamp
1 1 5 874965758
1 2 3 876893171
1 3 4 878542960
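Data processing
uid and itemId are one-hot encoded and rating is used as the output label. Ratings range from 1 to 5: a rating greater than 3 maps to 1 (the user is interested), and a rating of 3 or lower maps to 0 (the user is not interested).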
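As a small illustration of this step, the sketch below applies the same binarization and one-hot encoding to the three sample ua.base rows listed earlier. It builds the frame inline instead of reading the files, and the variable names are illustrative rather than those of the loader that follows.

```python
import pandas as pd

# The three sample rows of ua.base (timestamp omitted).
ratings = pd.DataFrame({
    "uid": [1, 1, 1],
    "item_id": [1, 2, 3],
    "rating": [5, 3, 4],
})

# Ratings above 3 become 1 (interested); ratings of 3 or below become 0.
ratings["label"] = (ratings["rating"] > 3).astype(int)

# One-hot encode the IDs; dtype=float gives 0.0/1.0 values instead of booleans.
uid_onehot = pd.get_dummies(ratings["uid"], prefix="uid", dtype=float)
item_onehot = pd.get_dummies(ratings["item_id"], prefix="item_id", dtype=float)
features = pd.concat([uid_onehot, item_onehot], axis=1)

print(ratings["label"].tolist())
print(features.columns.tolist())
```

Only IDs present in the frame get a column here; the loader below instead builds the one-hot columns from u.user and u.item so that the training and test sets share the same feature layout.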
# Load the data
import pandas as pd

def loadData():
    # User info (only uid is used)
    userInfo = pd.read_csv('../data/u.user', sep='|', names=['uid', 'age', 'gender', 'occupation', 'zip code'])
    uid_ = userInfo['uid']
    # dtype='uint8' keeps the one-hot columns as compact 0/1 integers on any pandas version
    userId_dum = pd.get_dummies(userInfo['uid'], prefix='uid_', dtype='uint8')
    userId_dum['uid'] = uid_
    # Item info (only item_id is used)
    header = ['item_id', 'title', 'release_date', 'video_release_date', 'IMDb_URL', 'unknown', 'Action', 'Adventure', 'Animation', 'Children',
              'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy', 'Film-Noir', 'Horror', 'Musical', 'Mystery', 'Romance', 'Sci-Fi',
              'Thriller', 'War', 'Western']
    ItemInfo = pd.read_csv('../data/u.item', sep='|', names=header, encoding="ISO-8859-1")
    ItemInfo = ItemInfo.drop(columns=['title', 'release_date', 'video_release_date', 'IMDb_URL', 'unknown'])
    item_id_ = ItemInfo['item_id']
    item_Id_dum = pd.get_dummies(ItemInfo['item_id'], prefix='item_id_', dtype='uint8')
    item_Id_dum['item_id'] = item_id_
    # Training data
    trainData = pd.read_csv('../data/ua.base', sep='\t', names=['uid', 'item_id', 'rating', 'time'])
    trainData = trainData.drop(columns=['time'])
    # Binarize the rating: > 3 becomes 1, otherwise 0
    trainData['rating'] = trainData.rating.apply(lambda x: 1 if int(x) > 3 else 0)
    Y_train = pd.get_dummies(trainData['rating'], prefix='y_', dtype='uint8')
    # Attach the one-hot user and item columns to every rating row
    X_train = pd.merge(trainData, userId_dum, how='left', on='uid')
    X_train = pd.merge(X_train, item_Id_dum, how='left', on='item_id')
    X_train = X_train.drop(columns=['uid', 'item_id', 'rating'])
    # Test data
    testData = pd.read_csv('../data/ua.test', sep='\t', names=['uid', 'item_id', 'rating', 'time'])
    testData = testData.drop(columns=['time'])
    testData['rating'] = testData.rating.apply(lambda x: 1 if int(x) > 3 else 0)
    Y_test = pd.get_dummies(testData['rating'], prefix='y_', dtype='uint8')
    X_test = pd.merge(testData, userId_dum, how='left', on='uid')
    X_test = pd.merge(X_test, item_Id_dum, how='left', on='item_id')
    X_test = X_test.drop(columns=['uid', 'item_id', 'rating'])
    # Fields: uid (user info) and itemId (item info)
    userField = ['uid']
    itemField = ['itemId']
    field = userField + itemField
    # Length of each field's one-hot block
    userFieldLen = [len(uid_)]
    itemFieldLen = [len(item_id_)]
    field_len = userFieldLen + itemFieldLen
    # Cumulative column offsets: field i spans columns field_arange[i] to field_arange[i+1]
    field_arange = [0]
    for field_n in range(len(field)):
        field_arange.append(field_arange[field_n] + field_len[field_n])
    return X_train.values, Y_train.values, X_test.values, Y_test.values, field_arange, len(field)
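The field_arange list returned by loadData holds the column offset at which each field's one-hot block starts, plus the total width as its last entry. The sketch below computes the same kind of offsets with `itertools.accumulate`; the helper name `field_offsets` and the field sizes are hypothetical.

```python
from itertools import accumulate

def field_offsets(field_sizes):
    """Return cumulative column offsets, starting at 0, for concatenated one-hot fields."""
    return [0] + list(accumulate(field_sizes))

# Hypothetical sizes: a 4-value field followed by a 6-value field.
offsets = field_offsets([4, 6])
# Field i occupies the half-open column range [offsets[i], offsets[i + 1]).
slices = [(offsets[i], offsets[i + 1]) for i in range(len(offsets) - 1)]
print(offsets, slices)
```

These ranges are what the model uses to slice each field out of the input matrix before applying that field's embedding.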
Wide&Deep model
import tensorflow as tf  # written against the TensorFlow 1.x graph API (tf.placeholder, tf.Session)

class WideOrDeep:
    def __init__(self, vec_dim, linear_lr, dnn_lr, l1_reg, feature_length, field_arange, field_len, dnn_layers):
        self.vec_dim = vec_dim
        self.linear_lr = linear_lr
        self.dnn_lr = dnn_lr
        self.l1_reg = l1_reg
        self.feature_length = feature_length
        self.field_arange = field_arange
        self.field_len = field_len
        self.dnn_layers = dnn_layers

    def add_input(self):
        self.X = tf.placeholder(tf.float32, name='input_x', shape=[None, self.feature_length])
        self.Y = tf.placeholder(tf.float32, shape=[None, 2], name='input_y')

    # Build the forward computation
    def inference(self):
        # Wide part: a linear model over the raw one-hot features
        with tf.variable_scope('linear_part'):
            b = tf.get_variable(name='b', shape=[2], dtype=tf.float32)
            linear_w = tf.get_variable(shape=[self.feature_length, 2], dtype=tf.float32,
                                       name='linear_w')
            self.wide = tf.matmul(self.X, linear_w) + b
        # Deep part: per-field embeddings followed by fully connected layers
        with tf.variable_scope('dnn_layer'):
            Embedding = [tf.get_variable(name='Embedding_%d'%i, shape=[self.field_arange[i+1]-self.field_arange[i], self.vec_dim], dtype=tf.float32) for i in range(self.field_len)]
            # Slice each field's one-hot block out of X and project it to vec_dim
            Embedding_layer = tf.concat([tf.matmul(tf.slice(self.X,[0,self.field_arange[i]],[-1,self.field_arange[i+1]-self.field_arange[i]]), Embedding[i]) for i in range(self.field_len)], axis=1)
            x = Embedding_layer
            in_num = self.field_len * self.vec_dim
            for i in range(len(self.dnn_layers)):
                out_num = self.dnn_layers[i]
                w = tf.get_variable(name='w_%d'%i, shape=[in_num, out_num], dtype=tf.float32)
                b = tf.get_variable(name='b_%d'%i, shape=[out_num], dtype=tf.float32)
                x = tf.matmul(x, w) + b
                if i == len(self.dnn_layers) - 1:
                    # Last layer (must have 2 units): add the wide logits to the deep logits.
                    # The layer bias is already part of x, so it is not added a second time.
                    self.y_out = x + self.wide
                else:
                    x = tf.nn.relu(x)
                in_num = out_num

    def add_loss(self):
        self.loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(labels=self.Y, logits=self.y_out))

    # Compute accuracy
    def add_accuracy(self):
        self.correct_prediction = tf.equal(tf.cast(tf.argmax(self.y_out,1), tf.float32), tf.cast(tf.argmax(self.Y,1), tf.float32))
        self.accuracy = tf.reduce_mean(tf.cast(self.correct_prediction, tf.float32))
        # self.auc_value = tf.metrics.auc(tf.argmax(self.y_out,1),tf.argmax(self.Y,1), curve='ROC')

    # Training ops
    def train(self):
        # Restrict each optimizer to its own variables: FTRL for the wide part, Adam for the deep part
        wide_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope='linear_part')
        deep_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope='dnn_layer')
        wide_opt = tf.train.FtrlOptimizer(learning_rate=self.linear_lr, l1_regularization_strength=self.l1_reg)
        wide_opt_train = wide_opt.minimize(loss=self.loss, var_list=wide_vars)
        deep_opt = tf.train.AdamOptimizer(learning_rate=self.dnn_lr)
        deep_opt_train = deep_opt.minimize(loss=self.loss, var_list=deep_vars)
        self.train_op = tf.group(wide_opt_train, deep_opt_train)

    # Build the graph
    def build_graph(self):
        self.add_input()
        self.inference()
        self.add_loss()
        self.add_accuracy()
        self.train()
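To make the loss concrete, here is a NumPy sketch of the element-wise formula that the TensorFlow documentation gives for `tf.nn.sigmoid_cross_entropy_with_logits`, applied to the sum of wide and deep logits. The logit values are made up for illustration, and the function is a stand-in for the TensorFlow op, not a replacement for it.

```python
import numpy as np

def sigmoid_cross_entropy(logits, labels):
    """Element-wise loss: max(x, 0) - x * z + log(1 + exp(-|x|))."""
    x = np.asarray(logits, dtype=float)
    z = np.asarray(labels, dtype=float)
    return np.maximum(x, 0.0) - x * z + np.log1p(np.exp(-np.abs(x)))

# Hypothetical logits from the two branches for one sample with two output units.
wide_logits = np.array([[0.5, -0.5]])
deep_logits = np.array([[-1.0, 2.0]])
y_out = wide_logits + deep_logits          # joint logits, summed before the sigmoid
labels = np.array([[0.0, 1.0]])

loss = sigmoid_cross_entropy(y_out, labels).mean()
print(loss)
```

Because the loss is computed on the summed logits, its gradient reaches both branches, which is what allows the two parts to be trained jointly.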
Training and testing
import numpy as np
from sklearn import metrics

def train_model(sess, model, X_train, Y_train, batch_size, epochs=100):
    # Number of mini-batches per epoch
    num = len(X_train) // batch_size + 1
    for step in range(epochs):
        print("epochs{0}:".format(step+1))
        for i in range(num):
            # Draw a random mini-batch (np.random.choice samples with replacement)
            index = np.random.choice(len(X_train), batch_size)
            batch_x = X_train[index]
            batch_y = Y_train[index]
            feed_dict = {
                model.X: batch_x,
                model.Y: batch_y}
            sess.run(model.train_op, feed_dict=feed_dict)
            # Report metrics on the current mini-batch every 100 iterations
            if (i+1) % 100 == 0:
                loss, accuracy, y_out = sess.run([model.loss, model.accuracy, model.y_out], feed_dict=feed_dict)
                # With two label columns, roc_auc_score averages the AUC over both columns
                auc = metrics.roc_auc_score(batch_y, y_out)
                print("Iteration {0}: with minibatch training loss = {1} accuracy = {2} auc={3}"
                      .format(i+1, loss, accuracy, auc))

def test_model(sess, model, X_test, Y_test):
    # Evaluate on the whole test set in a single pass
    loss, y_out, accuracy = sess.run([model.loss, model.y_out, model.accuracy], feed_dict={
        model.X: X_test, model.Y: Y_test})
    print("loss={0} accuracy={1} auc={2}".format(loss, accuracy, metrics.roc_auc_score(Y_test, y_out)))
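Note that train_model draws every batch with np.random.choice, which samples with replacement, so one "epoch" does not necessarily visit each training row exactly once. If that matters, one alternative is to shuffle the indices once per epoch and walk through them in order. The helper name `iterate_minibatches` is hypothetical and is not used by the code above.

```python
import numpy as np

def iterate_minibatches(num_samples, batch_size, rng):
    """Yield index arrays that together cover every sample exactly once."""
    order = rng.permutation(num_samples)
    for start in range(0, num_samples, batch_size):
        yield order[start:start + batch_size]

rng = np.random.default_rng(0)
batches = list(iterate_minibatches(num_samples=10, batch_size=4, rng=rng))
print([len(b) for b in batches])
```

Each yielded array can be used like `index` in train_model, for example `X_train[batch]` and `Y_train[batch]`; the last batch is shorter whenever the sample count is not a multiple of the batch size.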
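Complete code
The complete script consists of the imports below, the loadData, WideOrDeep, train_model and test_model definitions from the previous sections, and the entry point that follows.
import tensorflow as tf  # TensorFlow 1.x graph API
import numpy as np
import pandas as pd
from sklearn import metrics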
if __name__ == '__main__':
    X_train, Y_train, X_test, Y_test, field_arange, field_len = loadData()
    linear_lr = 0.001
    dnn_lr = 0.001
    l1_reg = 0.5
    batch_size = 128
    vec_dim = 10
    feature_length = X_train.shape[1]
    dnn_layers = [256, 128, 2]
    model = WideOrDeep(vec_dim, linear_lr, dnn_lr, l1_reg, feature_length, field_arange, field_len, dnn_layers)
    model.build_graph()
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        print('start training...')
        train_model(sess, model, X_train, Y_train, batch_size, epochs=10)
        print('start testing...')
        test_model(sess, model, X_test, Y_test)