[tensorflow] Continuous input + discrete input neural network model training code

  Check out the three models in this series:

  - [tensorflow] Continuous input linear regression model training code
  - [tensorflow] Continuous input neural network model training code
  - [tensorflow] Continuous input + discrete input neural network model training code
  
  
  

The problem of transforming discrete input

  There are generally several ways to handle discrete input:

  1. If it is a number, it can be fed to the model directly, or normalized to [0, 1] first.

However, a discrete number often represents an entity, such as an id, and feeding it to the model as a plain number is then inappropriate. Moreover, discrete data is often not numeric at all; more often it consists of strings.

  2. If it is a string, it can be converted to a one-hot encoding, but the resulting vectors are extremely sparse: zeros typically account for more than 90% of the entries.

  3. So at this point you need an Embedding. Before using an Embedding, you need to construct a dictionary.

The Embedding layer is configured by (input_dim, output_dim, input_length): input_dim (vocab_size) is the size of the vocabulary, output_dim (dim) is the number of dimensions of the vector each word is mapped to, and input_length is the length of the input sequence. The input must consist of integers in [0, vocab_size - 1], so we need to convert the discrete input into numbers, and for that we need to construct a dictionary.
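  As a concrete illustration, here is a minimal sketch (assuming a vocabulary of 1000 entries, not the full dictionary built below): each id, which one-hot encoding would turn into a 1000-dimensional vector that is almost all zeros, is mapped by the Embedding layer to a dense 8-dimensional vector:

import numpy as np
from keras import Sequential
from keras.layers import Embedding

model = Sequential()
# input_dim = vocabulary size, output_dim = embedding dimension,
# input_length = number of ids per sample
model.add(Embedding(input_dim=1000, output_dim=8, input_length=1))

ids = np.array([[3], [42], [999]])  # integer ids in [0, vocab_size - 1]
vectors = model.predict(ids)
print(vectors.shape)  # (3, 1, 8): 3 samples, 1 id each, 8-dimensional vectors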

Constructing the dictionary

  In the first step, create a discrete dataset:

import numpy as np

random_numbers = np.random.randint(low=1, high=1000000, size=10000)

# array([781702, 805689, 194619, ..., 268855, 114390, 963977])

  In the second step, extract the dictionary from the discrete data by saving it to a file:

np.savetxt('voc.txt', random_numbers, fmt='%d')
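  Note that np.random.randint can produce duplicate values, so voc.txt may contain repeated lines; the get_vocab function below simply lets later lines overwrite earlier ones. If you prefer each value to appear exactly once, one possible variant (not in the original code) is to deduplicate before saving:

# Hypothetical variant: keep only unique values in the vocabulary file
np.savetxt('voc.txt', np.unique(random_numbers), fmt='%d')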

Create discrete data, convert it to dictionary indices, and create continuous data

# Load the dictionary
def get_vocab(path):
    vocab_dict = {}

    with open(path, 'r', encoding='utf-8') as file:
        for index, line in enumerate(file):
            word = line.strip()
            vocab_dict[word] = index

    print(f"\n===Dictionary length==={len(vocab_dict)}===\n")

    return vocab_dict

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

def get_data():
    # Set a random seed so that results are reproducible (optional)
    np.random.seed(0)

    # Generate random continuous data
    data = np.random.rand(10000, 10)

    # Standardize the data
    scaler = StandardScaler()
    data = scaler.fit_transform(data)

    random_numbers = np.random.randint(low=1, high=1000000, size=10000)
    np.savetxt('voc.txt', random_numbers, fmt='%d')
    vocab_dict = get_vocab('voc.txt')

    # Map each value to its dictionary index and reshape to (n_samples, 1)
    # so each sample is a length-1 sequence for the Embedding layer
    discrete = [vocab_dict[str(i)] for i in random_numbers]
    discrete = np.array(discrete).reshape(-1, 1)

    # Generate random target values
    target = np.random.rand(10000, 1)

    return train_test_split(data, target, discrete, test_size=0.1, random_state=42)


data_train, data_val, target_train, target_val, discrete_train, discrete_val = get_data()

  get_vocab function:
  This function loads the dictionary from the file at the given path. It reads the file line by line, using the word on each line as a dictionary key and its line number as the corresponding value, and finally returns the resulting dictionary object.

  The path parameter is the path of the dictionary file.
  Inside the function, open is used to open the file and read it line by line.
  For each line, strip removes unwanted characters such as the trailing newline, and the result is used as a dictionary key.
  The line number (i.e. the index value) is used as the corresponding value, and the key-value pair is added to the dictionary.
  Finally, the completed dictionary object is returned.
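  For example, assuming voc.txt was written as above, looking up the string form of a value returns its line index:

vocab_dict = get_vocab('voc.txt')
# keys are the string form of each value, values are 0-based line numbers
print(vocab_dict[str(random_numbers[0])])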

  get_data function:
  This function generates random data and uses the dictionary to map randomly generated integers to discrete indices. It proceeds as follows:

  First, np.random.rand generates a random data matrix data of shape (10000, 10).
  Next, a StandardScaler standardizes the data to zero mean and unit standard deviation.
  Then, np.random.randint generates a random integer array random_numbers of length 10000 with values between 1 and 1000000.
  np.savetxt saves random_numbers to the text file voc.txt, one integer per line.
  get_vocab loads the dictionary file voc.txt into the vocab_dict dictionary.
  Using the dictionary, the integers in random_numbers are mapped to their corresponding discrete indices and stored in the discrete array.
  Finally, np.random.rand generates a random target array target of shape (10000, 1).

  The function returns the training and validation splits: data_train, data_val, target_train, target_val, discrete_train and discrete_val. These are used for the subsequent model training and validation.
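  A quick sanity check of the returned splits (a sketch, assuming get_data has been run as above; with test_size=0.1, the 10000 samples split into 9000 for training and 1000 for validation):

print(data_train.shape)      # (9000, 10)
print(discrete_train.shape)  # (9000, 1)
print(target_train.shape)    # (9000, 1)
print(data_val.shape)        # (1000, 10)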

Create a discrete input + continuous input model

def create_mlp(dim, regress=False):
    model = Sequential()
    model.add(Dense(64, input_dim=dim, activation="relu"))
    model.add(Dense(64, activation="relu"))
    # check to see if the regression node should be added
    if regress:
        model.add(Dense(1, activation="linear"))
    # return our model
    return model


def create_emb(dim, regress=False):
    model = Sequential()
    # input_dim is the vocabulary size; 100000 here, matching the full code below
    model.add(Embedding(input_length=dim, output_dim=8, input_dim=100000))
    model.add(LSTM(128))
    model.add(Dense(64, activation="relu"))
    # check to see if the regression node should be added
    if regress:
        model.add(Dense(1, activation="linear"))
    # return our model
    return model

mlp = create_mlp(10, regress=False)
emb = create_emb(1, regress=False)

combined = concatenate([mlp.output, emb.output])

z = Dense(2, activation="relu")(combined)
z = Dense(1, activation="linear")(z)

model = Model(inputs=[mlp.input, emb.input], outputs=z)

model.summary()

  This code defines two functions, create_mlp and create_emb, which create an MLP (multilayer perceptron) model and an Embedding-LSTM model respectively, then combines them into a joint model.

  create_mlp function:
  This function creates an MLP model. An MLP is a feedforward neural network consisting of multiple fully connected layers. The parameter dim is the input dimension, and regress indicates whether this is a regression task.

  Create a Sequential model object.
  Add a fully connected layer with 64 neurons, input dimension dim, and ReLU activation.
  Add a second fully connected layer with 64 neurons and ReLU activation.
  If regress is True, add an output layer with 1 neuron and a linear activation function (for regression tasks).
  Return the constructed MLP model object.

  create_emb function:
  This function creates a model containing an Embedding layer and an LSTM. Embedding is a technique for mapping sequences of discrete integers to low-dimensional continuous vectors, while an LSTM is a long short-term memory network.

  Create a Sequential model object.
  Add an Embedding layer with input length dim, output dimension 8, and input dimension 100000 (the vocabulary size).
  Add an LSTM layer with 128 neurons.
  Add a fully connected layer with 64 neurons and ReLU activation.
  If regress is True, add an output layer with 1 neuron and a linear activation function (for regression tasks).
  Return the constructed Embedding-LSTM model object.
  The code that follows merges the outputs of the two models with the concatenate function and then builds a new model, model, whose inputs are the inputs of the MLP model and of the Embedding-LSTM model and whose output is the merged result.

  Add a fully connected layer with 2 neurons and ReLU activation on top of the merged output.
  Add an output layer with 1 neuron and a linear activation function.
  Use the Model function to define a new model object, whose inputs are the inputs of the MLP model and of the Embedding-LSTM model and whose output is the final result.

  Print the model's summary, including each layer's name, output shape, and number of parameters.
  With the steps above, you can create a joint model combining an MLP and an Embedding-LSTM and print its summary, including each layer's configuration and parameter count.
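  The same joint architecture can also be written directly with the Keras functional API, which makes the two inputs explicit. This is a minimal sketch, not the original author's code, assuming the same dimensions as above (10 continuous features, 1 discrete id per sample, vocabulary size 100000):

from keras import Input, Model
from keras.layers import Dense, Embedding, LSTM, concatenate

cont_in = Input(shape=(10,))                # continuous branch: 10 features
disc_in = Input(shape=(1,), dtype="int32")  # discrete branch: 1 id per sample

# Continuous branch: two fully connected layers
x = Dense(64, activation="relu")(cont_in)
x = Dense(64, activation="relu")(x)

# Discrete branch: embedding lookup, then LSTM, then a dense layer
e = Embedding(input_dim=100000, output_dim=8)(disc_in)  # (batch, 1, 8)
e = LSTM(128)(e)
e = Dense(64, activation="relu")(e)

# Merge the two branches and regress to a single output
z = concatenate([x, e])
z = Dense(2, activation="relu")(z)
z = Dense(1, activation="linear")(z)

model = Model(inputs=[cont_in, disc_in], outputs=z)
model.summary()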

Training output

  The model structure is as follows:

[screenshot in the original post: model.summary() output]

  The output during model training is as follows:

[screenshot in the original post: per-epoch training logs]

Full code - copy and run

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import tensorflow as tf
import numpy as np
from keras import Model, Sequential
from keras.layers import Dense, concatenate, Embedding, LSTM

# Load the dictionary
def get_vocab(path):
    vocab_dict = {}

    with open(path, 'r', encoding='utf-8') as file:
        for index, line in enumerate(file):
            word = line.strip()
            vocab_dict[word] = index

    print(f"\n===Dictionary length==={len(vocab_dict)}===\n")

    return vocab_dict

def get_data():
    # Set a random seed so that results are reproducible (optional)
    np.random.seed(0)

    # Generate random continuous data
    data = np.random.rand(10000, 10)

    # Standardize the data
    scaler = StandardScaler()
    data = scaler.fit_transform(data)

    random_numbers = np.random.randint(low=1, high=1000000, size=10000)
    np.savetxt('voc.txt', random_numbers, fmt='%d')
    vocab_dict = get_vocab('voc.txt')

    # Map each value to its dictionary index and reshape to (n_samples, 1)
    # so each sample is a length-1 sequence for the Embedding layer
    discrete = [vocab_dict[str(i)] for i in random_numbers]
    discrete = np.array(discrete).reshape(-1, 1)

    # Generate random target values
    target = np.random.rand(10000, 1)

    return train_test_split(data, target, discrete, test_size=0.1, random_state=42)


data_train, data_val, target_train, target_val, discrete_train, discrete_val = get_data()

# Number of training epochs
train_epochs = 10
# Learning rate
learning_rate = 0.0001
# Batch size
batch_size = 200


def create_mlp(dim, regress=False):
    model = Sequential()
    model.add(Dense(64, input_dim=dim, activation="relu"))
    model.add(Dense(64, activation="relu"))
    # check to see if the regression node should be added
    if regress:
        model.add(Dense(1, activation="linear"))
    # return our model
    return model


def create_emb(dim, regress=False):
    model = Sequential()
    model.add(Embedding(input_length=dim, output_dim=8, input_dim=100000))
    model.add(LSTM(128))
    model.add(Dense(64, activation="relu"))
    # check to see if the regression node should be added
    if regress:
        model.add(Dense(1, activation="linear"))
    # return our model
    return model

mlp = create_mlp(10, regress=False)
emb = create_emb(1, regress=False)

combined = concatenate([mlp.output, emb.output])

z = Dense(2, activation="relu")(combined)
z = Dense(1, activation="linear")(z)

model = Model(inputs=[mlp.input, emb.input], outputs=z)

model.summary()

# Mean-squared-error loss with a plain gradient-descent (SGD) optimizer
model.compile(loss="mse", optimizer=tf.keras.optimizers.SGD(learning_rate=learning_rate))

history = model.fit([data_train, discrete_train], target_train, epochs=train_epochs, batch_size=batch_size,
                    validation_data=([data_val, discrete_val], target_val))
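  After training, predictions take the same two-part input, and history.history holds the loss curves. A short usage sketch:

# Predict on the first 5 validation samples (continuous and discrete parts together)
preds = model.predict([data_val[:5], discrete_val[:5]])
print(preds)

# Per-epoch training and validation loss
print(history.history['loss'])
print(history.history['val_loss'])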

Source: blog.csdn.net/qq_43592352/article/details/131259151