[Recommendation System] Feature Processing

  • "Data and features determine the upper limit of the model, and the model algorithm approaches this upper limit." The essence of features is an engineering activity, the purpose is to maximize the extraction of features from raw data for use by algorithm models. In the actual process of building a recommendation system, there are not many features that can be directly used in model algorithms. Whether useful features can be mined from the original data will directly determine the quality of the recommendation system. The general processing flow for features is feature acquisition, feature cleaning, feature processing and feature monitoring, of which the core part is feature processing.
  • Since the features in the original data usually cannot be used directly in the algorithm model, they need to be put into the model after feature transformation and feature selection. Feature transformation includes various transformations of original features to better express the inherent laws of the original data and facilitate model algorithm training, while feature selection selects and extracts features useful for model expression, hoping to build a more flexible and simpler model.

Feature processing method

Since the data source contains different types of variables, different variables are often processed differently.

Numerical feature processing method one: dimensionless processing

Dimensionless processing converts data with different scales onto the same scale. Common methods include standardization and interval scaling. Standardization assumes that the feature values follow a normal distribution; after standardization they are transformed into a standard normal distribution. Interval scaling uses boundary-value information to scale the range of a feature to a specific interval, such as [0, 1].

Standardization

After standardization, each feature dimension has mean 0 and variance 1; this is also called Z-score normalization. The calculation is: subtract the mean from the feature value, then divide by the standard deviation.

x' = \frac{x - \bar{x}}{S}

import numpy as np
from sklearn import preprocessing

x = np.array([[1., -1., 2.],
              [2., 0., 0.],
              [0., 1., -1.]])

# Z-score standardization: each column is scaled to zero mean and unit variance
x_scaled = preprocessing.scale(x)
x_scaled

Interval scaling (min-max normalization) linearly maps the original data onto the [0, 1] interval:

x' = \frac{x-Min}{Max - Min}

import numpy as np
from sklearn import preprocessing

x = np.array([[1., -1., 2.],
              [2., 0., 0.],
              [0., 1., -1.]])

# Min-max scaling: each column is linearly mapped onto [0, 1]
x_max_min_scaled = preprocessing.MinMaxScaler().fit_transform(x)
x_max_min_scaled

Normalization (scaling each sample to unit norm)

The norm parameter supports 'l1', 'l2', and 'max'.

import numpy as np
from sklearn import preprocessing

x = np.array([[1., -1., 2.],
              [2., 0., 0.],
              [0., 1., -1.]])

# Scale each row (sample) to unit L2 norm
x_normalize = preprocessing.normalize(x, norm='l2')
x_normalize

Numerical feature processing method two: nonlinear transformation

Applying nonlinear transformations to features increases model complexity. Commonly used transformations are based on polynomials, exponential functions, and logarithmic functions.

Distributions are generally more stable after a logarithmic transformation. A log transform handles the situation where the variance of the dependent variable grows as the independent variable grows, and it can turn some nonlinear relationships into linear ones so that linear models, such as linear SVMs, can learn them. For linearly inseparable data, a kernel function can instead be used to map the low-dimensional data into a higher-dimensional space, where the data becomes linearly separable.
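A minimal sketch of the log transform; the click-count values here are made up, and np.log1p (log(1 + x)) is used as a common choice for non-negative, long-tailed features:

import numpy as np

# A made-up long-tailed feature, e.g. item click counts
clicks = np.array([1., 3., 10., 200., 5000.])

# log1p = log(1 + x); it compresses the long tail and maps 0 to 0
clicks_log = np.log1p(clicks)
print(clicks_log)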

Numerical feature processing method three: discretization

  • Sometimes numerical features need to be discretized according to the business and the meaning they carry. Discretization has the following benefits:
  • Discretized features are robust to abnormal data. For example, with a feature "age > 30 → 1, otherwise 0", an outlier such as "age 100" no longer disturbs the model the way the raw value would;
  • After discretization, features can be crossed with one another, and the inner-product computations on the resulting binary features are fast; this introduces nonlinearity, improves expressive power, and the results are easy to store and extend;
  • After discretization the model is more stable. For example, if age is discretized and 20 to 30 is one interval, a user does not become a completely different person just because they are one year older. Samples near the interval boundaries, however, behave in exactly the opposite way, so how the intervals are chosen matters a great deal (see the age-binning sketch after this list). Depending on whether label information is used, discretization can be divided into unsupervised and supervised discretization.
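A minimal sketch of the age example above, using pandas.cut with hand-picked, hypothetical interval boundaries:

import pandas as pd

# Hypothetical user ages, including an outlier
ages = pd.Series([18, 22, 25, 29, 31, 45, 100])

# Natural intervals chosen by hand; the outlier 100 simply falls into the last bin
age_bins = pd.cut(ages, bins=[0, 20, 30, 40, 120],
                  labels=['<=20', '21-30', '31-40', '40+'])
print(age_bins.value_counts())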

Unsupervised discretization:

  • Unsupervised discretization methods usually bin the feature values and can be divided into the equal-width discretization method and the equal-frequency discretization method. Equal-width discretization derives a fixed width from the number of bins, so that every bin covers an interval of the same width. Equal-frequency binning ensures that every bin contains the same number of data points. After equal-width or equal-frequency division, every value in a bin can be replaced with the bin's median or mean to complete the discretization. Both methods require the number of intervals to be specified. Equal-width discretization is sensitive to outliers and tends to distribute the feature values unevenly across the bins, which weakens the feature's decision-making power. Equal-frequency discretization avoids this problem, but it may put identical feature values with the same label into different bins, which also reduces decision-making power.
  • The discretization method based on cluster analysis is also unsupervised. It has two steps: first, the values of a feature are grouped into clusters by a clustering algorithm (such as K-means), taking the distribution of the values and the proximity of data points into account; then the resulting clusters are post-processed, either with a top-down splitting strategy or a bottom-up merging strategy. The splitting strategy further splits each initial cluster into several sub-clusters, while the merging strategy repeatedly merges adjacent clusters. Cluster-based discretization usually also requires the user to specify the number of clusters, which determines the number of intervals produced.
  • For real data, the discretization can also be adjusted according to business rules, using natural intervals where they exist (a sketch of the three unsupervised strategies follows this list).
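A minimal sketch of the three unsupervised strategies using scikit-learn's KBinsDiscretizer on made-up data ('uniform' corresponds to equal-width, 'quantile' to equal-frequency, and 'kmeans' to cluster-based binning):

import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Made-up one-dimensional feature with a long tail
x = np.array([[1.], [2.], [3.], [5.], [8.], [13.], [50.], [100.]])

for strategy in ['uniform', 'quantile', 'kmeans']:
    est = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy=strategy)
    codes = est.fit_transform(x).ravel()
    print(strategy, codes)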

Supervised discretization:

  • Supervised discretization methods come in more forms and processing variants than unsupervised ones.
  • The more commonly used methods are the entropy-based discretization method and the chi-square-based discretization method.
  • Since using entropy to split continuous features when building a decision tree works well in practice, the idea was extended to general feature discretization by repeatedly splitting intervals until a stopping condition is met, which gave rise to entropy-based discretization methods. Entropy is one of the most commonly used discretization measures. Entropy-based discretization uses class-distribution information to compute and select split points and is a supervised, top-down splitting technique. ID3 and C4.5 are two commonly used decision-tree algorithms that use entropy as their measurement criterion, and discretizing features with these methods is almost identical to building a decision tree.
  • Building on this idea, the MDLP method (Minimum Description Length Principle) was developed. MDLP assumes that a breakpoint is a class boundary, which yields many small intervals in which the instances share the same class label; the MDLP criterion is then applied to judge which candidate boundaries qualify as cut points and which do not, in which case the adjacent intervals are merged. In this way the necessary breakpoints are selected and the whole data set is discretized.

The following uses the discretization package in R to apply the MDLP method to R's built-in iris data set.

# MDLP feature discretization
library(discretization)
data(iris)
mdlp(iris)$Disc.data
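For intuition, here is a minimal Python sketch (separate from the MDLP code above) that scores a single candidate cut point by the information gain of the resulting class split; the data and function names are illustrative only:

import numpy as np

def entropy(labels):
    # Shannon entropy of a label array
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def split_information_gain(values, labels, cut):
    # Information gain of splitting the feature at `cut`
    left, right = labels[values <= cut], labels[values > cut]
    n = len(labels)
    cond = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - cond

# Made-up feature values and binary labels
values = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
labels = np.array([0, 0, 0, 1, 1, 1])
print(split_information_gain(values, labels, cut=3.0))  # high gain: the cut separates the classes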

Chi-square based discretization method:

Unlike the entropy-based method, chi-square-based discretization takes a bottom-up strategy. At the start, every distinct value in the data range forms its own interval; the best adjacent intervals to merge are then found recursively and combined into larger intervals. The chi-square statistic is used to test the dependence between an interval and the class labels when deciding which adjacent intervals to merge. The most commonly used chi-square-based discretization method is ChiMerge.

  • First, every distinct value of the numerical feature is treated as its own interval. The chi-square statistic is computed for every pair of adjacent intervals and compared with a threshold determined by a given confidence level; adjacent intervals whose chi-square value falls below the threshold are merged, because a low chi-square statistic indicates that the two adjacent intervals have similar class distributions, and intervals with similar class distributions should become one interval. The merging proceeds recursively until no chi-square statistic is below the threshold, that is, until no adjacent intervals can be merged any more, at which point the discretization terminates and the final result is obtained (a sketch follows).
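A simplified sketch of one ChiMerge-style merge pass, assuming two classes and made-up interval counts; a full implementation would start from one interval per distinct value and repeat until every adjacent pair exceeds the threshold:

import numpy as np

def chi2_adjacent(counts_a, counts_b):
    # Chi-square statistic for two adjacent intervals;
    # counts_* are per-class frequency arrays of equal length.
    observed = np.array([counts_a, counts_b], dtype=float)
    row = observed.sum(axis=1, keepdims=True)
    col = observed.sum(axis=0, keepdims=True)
    expected = row * col / observed.sum()
    # Cells with zero expected count contribute nothing
    mask = expected > 0
    return float(((observed - expected) ** 2 / np.where(mask, expected, 1))[mask].sum())

# Made-up intervals: each entry is the class-frequency vector [class 0, class 1]
intervals = [np.array([4, 0]), np.array([3, 1]), np.array([0, 5])]

threshold = 2.706  # chi-square critical value at 90% confidence, 1 degree of freedom

# One merge pass: merge the adjacent pair with the smallest chi-square if it is below the threshold
chis = [chi2_adjacent(intervals[i], intervals[i + 1]) for i in range(len(intervals) - 1)]
best = int(np.argmin(chis))
if chis[best] < threshold:
    merged = intervals[best] + intervals[best + 1]
    intervals = intervals[:best] + [merged] + intervals[best + 2:]
print(chis, intervals)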

Discrete feature processing method one: One-Hot encoding

In real recommendation systems many features are categorical, and One-Hot encoding is usually used to encode them. If a feature has m possible values, One-Hot encoding turns it into m mutually exclusive binary features. One-Hot encoding maps the values of a discrete feature into Euclidean space, with each value corresponding to a point, which makes similarity computations in learning algorithms convenient; the result can be stored sparsely, which reduces storage, and it also expands the feature space to some extent.

import numpy as np
from sklearn import preprocessing

one_hot_enc = preprocessing.OneHotEncoder()
# The three columns have 2, 3 and 4 distinct categories, so the encoding has 2 + 3 + 4 = 9 dimensions
one_hot_enc.fit([[1, 1, 2], [0, 1, 0], [0, 2, 1], [1, 0, 3]])
after_one_hot = one_hot_enc.transform([[0, 1, 3]]).toarray()
print(after_one_hot)
# [[1. 0. 0. 1. 0. 0. 0. 0. 1.]]

Discrete feature processing method two: feature hashing

The goal of feature hashing is to compress the original high-dimensional feature vector into a lower-dimensional one without losing the expressive power of the original features. It is a fast and space-saving feature vectorization method. Recommendation systems contain many ID-type features (embeddings can of course also be used, but hashing is more resource-efficient). Feature hashing avoids generating extremely sparse data, but it may cause collisions.

Collisions may reduce the accuracy of the results, or occasionally improve it; generally another function is used to resolve collisions. The general description is: design a function v = h(x) that converts a d-dimensional vector x = (x(1), x(2), ..., x(d)) into a new m-dimensional vector v, where m can be greater than or less than d. The usual approach is to use a hash function to map x(1) to v(h(1)), ..., and x(d) to v(h(d)). A hash function maps arbitrary input to an integer output in a fixed range. The code below illustrates this: the program converts a sentence into a fixed-dimensional vector. ID-type features can be handled in the same way, with each word corresponding to an ID.

def hashing_vectorizer(s, N):
    # Map a sentence to an N-dimensional count vector via the hashing trick
    x = [0 for i in range(N)]
    for f in s.split():
        h = hash(f)       # note: Python 3 salts hash() per process, so the exact
        x[h % N] += 1     # bucket assignment varies between runs
    return x

print(hashing_vectorizer('make a hash feature', 3))
# e.g. [2, 2, 0] (the exact buckets depend on the hash seed)
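For a reproducible alternative, scikit-learn's FeatureHasher uses a fixed (signed) hash function; a minimal sketch, with n_features=8 as an arbitrary choice:

from sklearn.feature_extraction import FeatureHasher

# Hash string tokens into an 8-dimensional vector; alternate_sign=False keeps raw counts
hasher = FeatureHasher(n_features=8, input_type='string', alternate_sign=False)
hashed = hasher.transform(['make a hash feature'.split()])
print(hashed.toarray())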

Discrete feature processing method three: temporal feature processing

Recommendation systems usually contain many time-related features, and how effectively they are mined greatly affects recommendation quality. The usual approach is to process them according to the business logic and business goal. Christ, M., et al. proposed a scheme for hierarchical processing of time features, as shown in Figure 5.7. It includes time-window statistical features (maximum, minimum, mean, quantiles) and uses correlation with the label to select features. Below is a brief example of using Python's tsfresh package to extract features from the Robot Execution Failures data set.

from tsfresh.examples.robot_execution_failures import download_robot_execution_failures, \
    load_robot_execution_failures

# Download and load the example time-series data set
download_robot_execution_failures()
timeseries, y = load_robot_execution_failures()

from tsfresh import extract_features
# Extract a large set of time-window statistical features per id
extracted_features = extract_features(timeseries, column_id="id",
                                      column_sort="time")

from tsfresh import select_features
from tsfresh.utilities.dataframe_functions import impute
# Replace NaN/inf values, then keep only the features relevant to the label y
impute(extracted_features)
features_filtered = select_features(extracted_features, y)
