Feature Engineering (3): Processing Pipelines for Different Types of Data

Some routine feature engineering steps are very simple and I already use them regularly, so they are not covered in detail here.

I. Numerical features

1. Preprocessing

2. Discrete value processing

LabelEncoder / map / one-hot encoding / get_dummies
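As a minimal sketch of these encoding options, assuming a hypothetical `color` column (the data, column names and the `color_codes` mapping are illustrative and not from the original dataset):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data; the column name and values are assumptions.
df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# LabelEncoder assigns integer codes following the sorted order of the labels.
le = LabelEncoder()
df["color_le"] = le.fit_transform(df["color"])

# map applies a hand-written dictionary, which lets you choose the codes yourself.
color_codes = {"blue": 0, "green": 1, "red": 2}
df["color_map"] = df["color"].map(color_codes)

# get_dummies produces one-hot columns, one per category.
one_hot = pd.get_dummies(df["color"], prefix="color")
```

Integer codes imply an ordering, so for unordered categories the one-hot form is usually the safer input for linear models.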

Binarization of features

Polynomial features (used together with an SVM model)
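Binarization and polynomial features can be sketched with scikit-learn as follows. The array `X` is made-up data, and the threshold and degree are assumptions chosen for illustration:

```python
import numpy as np
from sklearn.preprocessing import Binarizer, PolynomialFeatures

# Hypothetical counts, e.g. how often each user triggered two events.
X = np.array([[0.0, 3.0],
              [2.0, 0.0],
              [5.0, 1.0]])

# Binarizer maps values above the threshold to 1 and everything else to 0.
X_bin = Binarizer(threshold=0.0).fit_transform(X)

# Degree-2 polynomial features: x1, x2, x1^2, x1*x2, x2^2 (bias column omitted).
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)
```

The interaction term `x1*x2` is what lets an otherwise linear model pick up relationships between two features.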

3. Discretization of continuous features

Binning

Quantile-based binning
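The two binning styles can be sketched with pandas. The `ages` values, bin edges and labels below are assumptions for illustration only:

```python
import pandas as pd

# Hypothetical ages; not taken from the original dataset.
ages = pd.Series([5, 17, 25, 38, 52, 70])

# Fixed binning: the edges are chosen by hand; intervals are right-closed by default.
age_bin = pd.cut(ages, bins=[0, 18, 40, 100], labels=["young", "adult", "senior"])

# Quantile binning: the edges are chosen so each bin holds about the same number of rows.
age_q = pd.qcut(ages, q=3, labels=["low", "mid", "high"])
```

Fixed edges are easy to explain, while quantile edges adapt to skewed data and avoid nearly empty bins.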

4. Logarithmic transformation (brings the distribution closer to normal)
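A minimal sketch of the log transform, assuming a hypothetical right-skewed `income` array. `np.log1p` is used here instead of a plain log so that zero values stay defined; that choice is mine, not the original post's:

```python
import numpy as np

# Hypothetical right-skewed values spanning several orders of magnitude.
income = np.array([0.0, 9.0, 99.0, 999.0, 9999.0])

# log1p computes log(1 + x), so x = 0 maps to 0 instead of -inf.
income_log = np.log1p(income)

# expm1 is the inverse, useful for mapping predictions back to the original scale.
restored = np.expm1(income_log)
```

The transform compresses the long right tail, which is why the result tends to look closer to a normal distribution.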

II. Date features

ts_objs = np.array([pd.Timestamp(item) for item in np.array(df.Time)])
df['TS_obj'] = ts_objs  # store the parsed timestamps so the lines below can use them

A timestamp in standard format looks like '2015-03-08 10:30:00.360000+00:00'.

The attributes of each Timestamp object can then be used to extract more features:

df['Year'] = df['TS_obj'].apply(lambda d: d.year)
df['Month'] = df['TS_obj'].apply(lambda d: d.month)
df['Day'] = df['TS_obj'].apply(lambda d: d.day)
df['DayOfWeek'] = df['TS_obj'].apply(lambda d: d.dayofweek)
df['DayName'] = df['TS_obj'].apply(lambda d: d.day_name())  # weekday_name was removed in pandas 1.0
df['DayOfYear'] = df['TS_obj'].apply(lambda d: d.dayofyear)
df['WeekOfYear'] = df['TS_obj'].apply(lambda d: d.weekofyear)
df['Quarter'] = df['TS_obj'].apply(lambda d: d.quarter)
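The same kind of extraction can also be written with the vectorized `.dt` accessor instead of `apply`. This is a sketch under assumptions: the two `Time` strings are made up, and the `Hour` and `IsWeekend` columns are illustrative extras that the original post does not define.

```python
import pandas as pd

# Hypothetical timestamps in the same string format as above.
df = pd.DataFrame({"Time": ["2015-03-08 10:30:00.360000+00:00",
                            "2017-07-13 15:45:05.755000+00:00"]})

# Parse the whole column at once into timezone-aware timestamps.
df["TS_obj"] = pd.to_datetime(df["Time"], utc=True)

# .dt exposes the same attributes as a single Timestamp, applied column-wide.
df["Hour"] = df["TS_obj"].dt.hour
df["DayName"] = df["TS_obj"].dt.day_name()
df["IsWeekend"] = df["TS_obj"].dt.dayofweek >= 5  # Monday is 0, so 5 and 6 are the weekend
```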

III. Text features

Text dataset

1. Basic preprocessing

Corpus preprocessing: each line is one document or sentence. Split each document or sentence into words separated by spaces, remove stop words, and store the result as an array. English text needs no word segmentation because its words are already separated by spaces. A Chinese corpus needs a word segmentation tool; common ones include Stanford NLP, ICTCLAS, Ansj, FudanNLP, HanLP and jieba. In the examples here, assume the corpus has 5 documents and 11 distinct words remain after stop-word removal.
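For an English corpus, the preprocessing described above can be sketched in a few lines. The two sentences, the tiny `stop_words` set and the `normalize` helper are all hypothetical; a real project would use a full stop-word list and, for Chinese, a segmentation tool first.

```python
import re

# Hypothetical corpus and stop-word list, for illustration only.
corpus = [
    "The sky is blue and beautiful.",
    "The quick brown fox jumps over the lazy dog.",
]
stop_words = {"the", "is", "and", "over"}

def normalize(doc):
    """Lowercase, strip non-letter characters, split on spaces, drop stop words."""
    doc = re.sub(r"[^a-z\s]", "", doc.lower())
    tokens = [tok for tok in doc.split() if tok not in stop_words]
    return " ".join(tokens)

norm_corpus = [normalize(doc) for doc in corpus]
```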

2. Bag-of-words model (considers word frequency; sparse 5 × 11 matrix)

from sklearn.feature_extraction.text import CountVectorizer
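A minimal sketch of the bag-of-words step. The `norm_corpus` below is a made-up, already preprocessed corpus, constructed so that it has 5 documents and 11 distinct words to match the sizes mentioned above:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical preprocessed corpus: 5 documents, 11 distinct words.
norm_corpus = [
    "sky blue beautiful",
    "love blue beautiful sky",
    "quick brown fox jumps lazy dog",
    "brown fox quick blue dog lazy",
    "sky blue sky beautiful today",
]

cv = CountVectorizer()
cv_matrix = cv.fit_transform(norm_corpus)  # scipy sparse matrix of word counts

# Column order of the matrix, recovered from the fitted vocabulary.
vocab = sorted(cv.vocabulary_, key=cv.vocabulary_.get)
counts = cv_matrix.toarray()
```

Each row is one document and each column is one word, so `counts[4]` holds a 2 in the `sky` column.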

3. N-grams model (considers word order; the matrix is very sparse)

bv = CountVectorizer(ngram_range=(2,2))
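To see what the bigram vectorizer produces, here is a sketch on a made-up preprocessed corpus (the `norm_corpus` contents are an assumption):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical preprocessed corpus: 5 documents, 11 distinct words.
norm_corpus = [
    "sky blue beautiful",
    "love blue beautiful sky",
    "quick brown fox jumps lazy dog",
    "brown fox quick blue dog lazy",
    "sky blue sky beautiful today",
]

# ngram_range=(2, 2) keeps only pairs of adjacent words.
bv = CountVectorizer(ngram_range=(2, 2))
bv_matrix = bv.fit_transform(norm_corpus)

# Column order of the matrix: one column per distinct bigram.
bigrams = sorted(bv.vocabulary_, key=bv.vocabulary_.get)
```

On this corpus the 11 words turn into 16 distinct bigrams, which shows how quickly the matrix becomes wider and sparser.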

4. TF-IDF model (term frequency × weight; 5 × 11 matrix)
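A minimal sketch using scikit-learn's `TfidfVectorizer` with its default settings (smoothed IDF, L2-normalized rows). The corpus is hypothetical:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical preprocessed corpus: 5 documents, 11 distinct words.
norm_corpus = [
    "sky blue beautiful",
    "love blue beautiful sky",
    "quick brown fox jumps lazy dog",
    "brown fox quick blue dog lazy",
    "sky blue sky beautiful today",
]

tv = TfidfVectorizer()
tv_matrix = tv.fit_transform(norm_corpus)  # sparse 5 x 11 matrix of TF-IDF weights
dense = tv_matrix.toarray()

# Words that occur in many documents get a low IDF weight.
idf_blue = tv.idf_[tv.vocabulary_["blue"]]
```

Here `blue` appears in four of the five documents, so it receives the smallest IDF weight in the vocabulary.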

5. Similarity features (5 × 5 matrix)
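The original does not name the similarity measure. Cosine similarity on the TF-IDF vectors is one common choice, so the sketch below assumes it, again on a hypothetical corpus:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical preprocessed corpus: 5 documents, 11 distinct words.
norm_corpus = [
    "sky blue beautiful",
    "love blue beautiful sky",
    "quick brown fox jumps lazy dog",
    "brown fox quick blue dog lazy",
    "sky blue sky beautiful today",
]

tv_matrix = TfidfVectorizer().fit_transform(norm_corpus)

# Pairwise cosine similarity between all documents: a symmetric 5 x 5 matrix.
sim = cosine_similarity(tv_matrix)
```

Each row of `sim` can be used as a feature vector for its document, for example as input to clustering.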

6. LDA model
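The original does not say which LDA implementation it uses. As a sketch, assuming scikit-learn's `LatentDirichletAllocation` and an arbitrary choice of 2 topics on a hypothetical corpus:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical preprocessed corpus: 5 documents, 11 distinct words.
norm_corpus = [
    "sky blue beautiful",
    "love blue beautiful sky",
    "quick brown fox jumps lazy dog",
    "brown fox quick blue dog lazy",
    "sky blue sky beautiful today",
]

# LDA works on raw word counts, so start from the bag-of-words matrix.
cv_matrix = CountVectorizer().fit_transform(norm_corpus)

# n_components is the number of topics; 2 is an assumption for this toy corpus.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(cv_matrix)  # 5 x 2 matrix of topic proportions
```

Each row of `doc_topics` sums to 1 and can be used directly as a low-dimensional feature vector for the document.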

7. Word embedding model (5 × embedding dimension)

from gensim.models import word2vec

 

A more detailed treatment of text feature processing is in my previous blog post:

https://blog.csdn.net/weixin_41814051/article/details/104393633

 

 

 

 


Origin blog.csdn.net/weixin_41814051/article/details/104408914