These are brief notes on common feature engineering techniques; conventional processing steps are used as-is and are not introduced in detail.
I. Numerical features
1. Preprocessing
2. Discrete (categorical) value processing
LabelEncoder / map / one-hot encoding / get_dummies
Feature binarization
Polynomial features (often used together with an SVM model)
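A minimal sketch of these encoders on a toy column (the column name and values are invented here for illustration):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, PolynomialFeatures

# Hypothetical toy categorical column
df = pd.DataFrame({'color': ['red', 'green', 'blue', 'green']})

# LabelEncoder: map each category to an integer code (classes sorted alphabetically)
le = LabelEncoder()
df['color_code'] = le.fit_transform(df['color'])

# map: the same idea, but with an explicit dictionary
df['color_mapped'] = df['color'].map({'blue': 0, 'green': 1, 'red': 2})

# One-hot encoding with pandas get_dummies: one binary column per category
dummies = pd.get_dummies(df['color'], prefix='color')

# Polynomial features: power and interaction terms, often fed to SVM/linear models
poly = PolynomialFeatures(degree=2, include_bias=False)
expanded = poly.fit_transform([[2, 3]])   # columns: x1, x2, x1^2, x1*x2, x2^2
```

Integer codes suit tree models; one-hot columns avoid implying an order for linear models and SVMs.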
3. Discretization of continuous features
Binning
Quantile-based segmentation
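Both kinds of discretization can be sketched with pandas (the age values here are made up):

```python
import pandas as pd

ages = pd.Series([5, 17, 25, 40, 62, 81])

# Equal-width binning: split the value range into intervals of equal width
width_bins = pd.cut(ages, bins=3, labels=['young', 'middle', 'old'])

# Quantile binning: each bin receives roughly the same number of samples
quantile_bins = pd.qcut(ages, q=3, labels=['low', 'mid', 'high'])
```

Equal-width bins can end up nearly empty when the data is skewed; quantile bins stay balanced by construction.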
4. Logarithmic transformation (to approximate a normal distribution)
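A small sketch of the log transform on a right-skewed toy array:

```python
import numpy as np

# Right-skewed values such as incomes or counts
x = np.array([0, 9, 99, 999, 9999], dtype=float)

# log1p = log(1 + x): safe at zero and compresses the long tail,
# pulling the distribution closer to normal
x_log = np.log1p(x)
```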
II. Date features
import numpy as np
import pandas as pd

ts_objs = np.array([pd.Timestamp(item) for item in np.array(df.Time)])
df['TS_obj'] = ts_objs
This converts each value to a standard Timestamp format, e.g. '2015-03-08 10:30:00.360000+0000'.
apply can then be used to extract many date features:
df['Year'] = df['TS_obj'].apply(lambda d: d.year)
df['Month'] = df['TS_obj'].apply(lambda d: d.month)
df['Day'] = df['TS_obj'].apply(lambda d: d.day)
df['DayOfWeek'] = df['TS_obj'].apply(lambda d: d.dayofweek)
df['DayName'] = df['TS_obj'].apply(lambda d: d.day_name())  # weekday_name was removed in newer pandas
df['DayOfYear'] = df['TS_obj'].apply(lambda d: d.dayofyear)
df['WeekOfYear'] = df['TS_obj'].apply(lambda d: d.isocalendar()[1])  # Timestamp.weekofyear was removed in pandas 2.x
df['Quarter'] = df['TS_obj'].apply(lambda d: d.quarter)
III. Text features
Text dataset
1. Basic preprocessing
Corpus preprocessing: one document or sentence per line; tokenize each document or sentence into space-separated words (English text is already space-separated and needs no segmentation, while Chinese text requires a word segmentation tool — common ones include StanfordNLP, ICTCLAS, Ansj, FudanNLP, HanLP, and jieba), remove stop words, and store the result as an array. Assume the corpus has 5 documents and 11 distinct words after stop-word removal.
2. Bag-of-words model (counts word frequencies; a sparse 5 × 11 matrix)
from sklearn.feature_extraction.text import CountVectorizer
3. N-gram model (captures local word order; the matrix becomes even sparser)
bv = CountVectorizer(ngram_range=(2,2))
4. TF-IDF model (term frequencies weighted by inverse document frequency; 5 × 11)
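A minimal TF-IDF sketch on the same kind of toy corpus:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ['the sky is blue',
          'the sun is bright',
          'the sun in the sky is bright']

# TF-IDF: raw term counts down-weighted by how common each term is
# across documents; rows are L2-normalized by default
tv = TfidfVectorizer()
tfidf = tv.fit_transform(corpus)
```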
5. Similarity features (5 × 5)
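The document-to-document similarity matrix can be sketched as cosine similarity over TF-IDF vectors (a common choice; with n documents it is n × n, matching the 5 × 5 in the text):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = ['the sky is blue',
          'the sun is bright',
          'the sun in the sky is bright']

tfidf = TfidfVectorizer().fit_transform(corpus)
# Pairwise cosine similarity between documents: an n_docs x n_docs matrix
sim = cosine_similarity(tfidf)
```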
6. LDA model
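A sketch of LDA document-topic features with scikit-learn (topic count of 2 is an arbitrary toy choice here):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

corpus = ['the sky is blue',
          'the sun is bright',
          'the sun in the sky is bright']

bow = CountVectorizer().fit_transform(corpus)

# LDA represents each document as a distribution over k latent topics
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(bow)     # shape: (n_docs, n_topics)
```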
7. Word-embedding model (document representation of dimension 5 × embedding size)
from gensim.models import word2vec
For more details on text feature processing, see my earlier blog post:
https://blog.csdn.net/weixin_41814051/article/details/104393633