TensorFlow: processing feature columns with tf.feature_column, with a model implementation demo

Note: as of TensorFlow 2.13, tf.feature_column is deprecated and not recommended for new code; use tf.keras.utils.FeatureSpace or the Keras preprocessing layers instead.

tf.feature_column.bucketized_column  |  TensorFlow v2.13.0 

tf.keras.utils.FeatureSpace  |  TensorFlow v2.13.0 

--------- The following is an API usage example for the legacy tf.feature_column module.

Feature columns are usually used when performing feature engineering on structured data. Feature columns are generally not used for image or text data.

1. Feature column usage

Using feature columns, you can convert categorical features into one-hot encoded features, bucketize continuous features, and generate crossed features from multiple features.
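As a rough illustration of the first two conversions, the sketch below reproduces one-hot encoding and bucketing in plain Python, without TensorFlow. The helper names (one_hot, encode_category, bucketize) are hypothetical and only mimic the idea. The bucket edges follow the left-inclusive convention documented for bucketized_column, so boundaries [18, 25] give the buckets (-inf, 18), [18, 25) and [25, +inf).

```python
from bisect import bisect_right

def one_hot(index, depth):
    """Return a list of length `depth` with a 1.0 at position `index`."""
    vec = [0.0] * depth
    vec[index] = 1.0
    return vec

def encode_category(value, vocabulary):
    """One-hot encode `value` against a fixed vocabulary list.

    Values outside the vocabulary map to an all-zero vector; this is an
    assumption made for the sketch.
    """
    if value in vocabulary:
        return one_hot(vocabulary.index(value), len(vocabulary))
    return [0.0] * len(vocabulary)

def bucketize(value, boundaries):
    """One-hot encode the bucket that a numeric `value` falls into."""
    # bisect_right puts a value equal to a boundary into the bucket
    # that starts at that boundary (left-inclusive buckets).
    bucket = bisect_right(boundaries, value)
    return one_hot(bucket, len(boundaries) + 1)

print(encode_category("female", ["male", "female"]))  # [0.0, 1.0]
print(bucketize(18, [18, 25, 30]))                    # [0.0, 1.0, 0.0, 0.0]
```

Note that n boundaries always produce n + 1 buckets, which is why the ten age boundaries used later in this post yield an 11-wide feature.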

To create a feature column, call a function from the tf.feature_column module. The module has nine commonly used functions. Each of them returns either a Categorical-Column or a Dense-Column object, except bucketized_column, which inherits from both classes.

Note: Every Categorical Column must eventually be converted into a Dense Column, through indicator_column or embedding_column, before it can be passed into the model!

Detailed explanation:

numeric_column: numeric column, the most commonly used.
bucketized_column: bucketized column, generated from a numeric column. One numeric column is split at the given boundaries into several one-hot encoded features.
categorical_column_with_vocabulary_list: categorical vocabulary column, one-hot encoded, with the vocabulary given as a list.
categorical_column_with_vocabulary_file: categorical vocabulary column, with the vocabulary read from a file.
categorical_column_with_hash_bucket: hash column, used when the number of integer ids or vocabulary entries is large.
indicator_column: indicator column, generated from a Categorical Column, one-hot encoded.
embedding_column: embedding column, generated from a Categorical Column. The embedding vectors are parameters that must be learned. The recommended embedding dimension is the 4th root of the number of categories.
crossed_column: crossed column, which can be built from any categorical column except categorical_column_with_hash_bucket.
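The hashing, crossing and embedding-size ideas in this list can be sketched in plain Python. The function names below are hypothetical, and md5 is used only to get a hash that is stable across runs; TensorFlow uses its own fingerprint function, so the actual bucket ids will differ. Treat this as an illustration of the idea, not of the library's implementation.

```python
import hashlib

def suggest_embedding_dim(num_categories):
    """Rule of thumb from the text: about the 4th root of the category count."""
    return max(1, round(num_categories ** 0.25))

def hash_bucket(value, num_buckets):
    """Map any value to a bucket id in the range [0, num_buckets)."""
    digest = hashlib.md5(str(value).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets

def cross(values, num_buckets):
    """Hash a combination of categorical values into a single bucket id."""
    joined = "_X_".join(str(v) for v in values)
    return hash_bucket(joined, num_buckets)

print(suggest_embedding_dim(32))  # 2
print(hash_bucket("C85", 32))     # an id between 0 and 31
print(cross([3, 1], 15))          # an id between 0 and 14
```

With 32 hash buckets the rule of thumb gives a dimension of 2, which matches the cabin embedding used in the case below. Different values can collide in the same bucket, which is the price of not storing a vocabulary.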

2. Feature column use cases

The following is a complete case of using feature columns to solve the Titanic survival problem.
 

import datetime
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import tensorflow as tf
from tensorflow.keras import layers,models


# Print a timestamped log message
def printlog(info):
    nowtime = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')
    print("\n"+"=========="*8 + "%s"%nowtime)
    print(info+'...\n')
    
#======================================================
#               1. Build the data pipeline
#======================================================
printlog("step1: prepare dataset...")
dftrain_raw = pd.read_csv("./data/titanic/train.csv")
dftest_raw = pd.read_csv("./data/titanic/test.csv")
dfraw = pd.concat([dftrain_raw,dftest_raw])


def prepare_dfdata(dfraw):
    dfdata = dfraw.copy()
    dfdata.columns = [x.lower() for x in dfdata.columns]
    dfdata = dfdata.rename(columns={'survived':'label'})
    dfdata = dfdata.drop(['passengerid','name'],axis = 1)
    for col,dtype in dict(dfdata.dtypes).items():
        # Check whether the column contains missing values
        if dfdata[col].hasnans:
            # Add an indicator column that flags the missing entries
            dfdata[col + '_nan'] = pd.isna(dfdata[col]).astype('int32')
            # Fill: mean for numeric columns, empty string otherwise.
            # (The np.object / np.str / np.unicode aliases are gone from
            # recent NumPy releases, so the dtype is tested via pandas.)
            if pd.api.types.is_numeric_dtype(dtype):
                dfdata[col] = dfdata[col].fillna(dfdata[col].mean())
            else:
                dfdata[col] = dfdata[col].fillna('')
    return dfdata


dfdata = prepare_dfdata(dfraw)
dftrain = dfdata.iloc[0:len(dftrain_raw),:]
dftest = dfdata.iloc[len(dftrain_raw):,:]


# Build a tf.data.Dataset from a DataFrame
def df_to_dataset(df, shuffle=True, batch_size=32):
    dfdata = df.copy()
    if 'label' not in dfdata.columns:
        ds = tf.data.Dataset.from_tensor_slices(dfdata.to_dict(orient='list'))
    else:
        labels = dfdata.pop('label')
        ds = tf.data.Dataset.from_tensor_slices((dfdata.to_dict(orient='list'),labels))
    if shuffle:
        ds = ds.shuffle(buffer_size=len(dfdata))
    ds = ds.batch(batch_size)
    return ds

ds_train = df_to_dataset(dftrain)
ds_test = df_to_dataset(dftest)


#======================================================
#               2. Define the feature columns
#======================================================
printlog("step2: make feature columns...")
feature_columns = []

# Numeric columns
for col in ['age','fare','parch','sibsp']+[c for c in dfdata.columns if c.endswith('_nan')]:
    feature_columns.append(tf.feature_column.numeric_column(col))

# Bucketized column
age = tf.feature_column.numeric_column('age')
age_buckets = tf.feature_column.bucketized_column(age,boundaries=[18,25,30,35,40,45,50,55,60,65])
feature_columns.append(age_buckets)

# Categorical columns
# Note: every Categorical Column must be converted into a Dense Column
# (here via indicator_column) before it can be passed into the model!
sex=tf.feature_column.indicator_column(tf.feature_column.categorical_column_with_vocabulary_list(
    key='sex',vocabulary_list=["male", "female"]))
feature_columns.append(sex)

pclass=tf.feature_column.indicator_column(tf.feature_column.categorical_column_with_vocabulary_list(
    key='pclass',vocabulary_list=[1,2,3]))
feature_columns.append(pclass)

ticket=tf.feature_column.indicator_column(tf.feature_column.categorical_column_with_hash_bucket('ticket', 3))
feature_columns.append(ticket)

# The Titanic embarkation ports are S (Southampton), C (Cherbourg) and Q (Queenstown)
embarked=tf.feature_column.indicator_column(tf.feature_column.categorical_column_with_vocabulary_list(key='embarked',vocabulary_list=['S','C','Q']))
feature_columns.append(embarked)

# Embedding column
cabin=tf.feature_column.embedding_column(tf.feature_column.categorical_column_with_hash_bucket('cabin',32), 2)
feature_columns.append(cabin)

# Crossed column
pclass_cate=tf.feature_column.categorical_column_with_vocabulary_list(key='pclass', vocabulary_list=[1,2,3])
crossed_feature=tf.feature_column.indicator_column(tf.feature_column.crossed_column(
    [age_buckets, pclass_cate], hash_bucket_size=15))
feature_columns.append(crossed_feature)


#======================================================
#               3. Define the model
#======================================================
printlog("step3: define model...")

tf.keras.backend.clear_session()


model = tf.keras.Sequential([
    # Pass the feature columns to tf.keras.layers.DenseFeatures!
    layers.DenseFeatures(feature_columns), 
    layers.Dense(64, activation='relu'),
    layers.Dense(64, activation='relu'),
    layers.Dense(1, activation='sigmoid')
])


#======================================================
#               4. Train the model
#======================================================
printlog("step4: train model...")

model.compile(optimizer='adam',loss='binary_crossentropy',metrics=['accuracy'])
history = model.fit(ds_train,validation_data=ds_test,epochs=10)


#======================================================
#               5. Evaluate the model
#======================================================
printlog("step5: eval model...")

model.summary()

# Jupyter-only magics; remove these two lines when running as a plain script
%matplotlib inline
%config InlineBackend.figure_format = 'svg'

def plot_metric(history, metric):
    train_metrics = history.history[metric]
    val_metrics = history.history['val_'+metric]
    epochs = range(1, len(train_metrics) + 1)
    plt.plot(epochs, train_metrics, 'bo--')
    plt.plot(epochs, val_metrics, 'ro-')
    plt.title('Training and validation '+ metric)
    plt.xlabel("Epochs")
    plt.ylabel(metric)
    plt.legend(["train_"+metric, 'val_'+metric])
    plt.show()
    
plot_metric(history,"accuracy")

Model summary:

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
dense_features (DenseFeature multiple                  64
_________________________________________________________________
dense (Dense)                multiple                  3008
_________________________________________________________________
dense_1 (Dense)              multiple                  4160
_________________________________________________________________
dense_2 (Dense)              multiple                  65
=================================================================
Total params: 7,297
Trainable params: 7,297
Non-trainable params: 0
_________________________________________________________________
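
The parameter counts in the summary can be checked by hand. The sketch below adds up the width of each feature column defined earlier and applies the usual inputs * units + units formula for a Dense layer. It assumes that exactly three columns (age, cabin and embarked) contain missing values and therefore produce *_nan indicator columns; that count is inferred from the summary, not stated in the text.

```python
# Width that each feature column contributes to the DenseFeatures output.
feature_widths = {
    "numeric (age, fare, parch, sibsp)": 4,
    "missing-value indicators (assumed: 3)": 3,
    "age_buckets (10 boundaries -> 11 buckets)": 11,
    "sex indicator": 2,
    "pclass indicator": 3,
    "ticket hash indicator": 3,
    "embarked indicator": 3,
    "cabin embedding": 2,
    "age x pclass cross indicator": 15,
}
input_dim = sum(feature_widths.values())

def dense_params(inputs, units):
    """Weights plus biases of a fully connected layer."""
    return inputs * units + units

# The only trainable part of DenseFeatures is the cabin embedding table:
# 32 hash buckets x 2 embedding dimensions.
embedding_params = 32 * 2

layer_params = [
    embedding_params,
    dense_params(input_dim, 64),
    dense_params(64, 64),
    dense_params(64, 1),
]

print(input_dim)          # 46
print(layer_params)       # [64, 3008, 4160, 65]
print(sum(layer_params))  # 7297
```

The indicator and numeric columns add width to the input but no trainable parameters, which is why dense_features reports only the 64 embedding weights.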

Accuracy curve:

Origin blog.csdn.net/eylier/article/details/131956215