So far, the data has been normalized, and most of the models have relied on the default initialization configuration.
But is the data we encounter in real-world research and work always organized this way? And is the default configuration really the best?
3.1 Practical model skills
Once we decide to use a certain model, the libraries introduced in this book can, relying on the default configuration, help us learn the parameters the model requires from the standard training data;
we can then use this set of parameters to guide the model in making predictions on the test data, and evaluate the model's performance.
However, this scheme does not guarantee that:
① all of the data features used for training are the best available
② the learned parameters are optimal
③ the model under its default configuration is always the best
Therefore, we can improve the performance of the previously used models from multiple angles: preprocessing the data, controlling the parameters, and optimizing the model configuration.
Feature enhancement (feature extraction and feature screening)
Feature extraction
Feature extraction is the process of converting raw data, record by record, into feature vectors; it also involves the quantitative representation of data features.
Raw data:
1 Digitized signal data (e.g., voiceprints, images)
2 A large amount of symbolic text data
① We cannot use symbolic text directly in computing tasks; the text must first be quantized into feature vectors through some processing method.
Some symbolically represented data features are already fairly structured and are stored in a dictionary data structure.
In this case we use DictVectorizer to extract and vectorize the features (a sketch follows the list below).
CODE
How DictVectorizer handles the features in a dictionary:
1 Categorical features are encoded as 0/1 binary (one-hot) columns
2 Numerical features keep their original values
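A minimal sketch of this behavior (the records below are toy data assumed for illustration, not the book's example): the categorical 'city' field becomes 0/1 indicator columns, while the numeric 'temperature' field keeps its original values.

from sklearn.feature_extraction import DictVectorizer

# Hypothetical records: 'city' is categorical, 'temperature' is numeric.
measurements = [
    {"city": "Dubai", "temperature": 33.0},
    {"city": "London", "temperature": 12.0},
    {"city": "San Francisco", "temperature": 18.0},
]

vec = DictVectorizer(sparse=False)
X = vec.fit_transform(measurements)

# Each categorical value gets its own 0/1 column; numeric values stay unchanged.
# (get_feature_names_out requires scikit-learn >= 1.0)
print(vec.get_feature_names_out())  # ['city=Dubai' 'city=London' 'city=San Francisco' 'temperature']
print(X)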
② Other text data is more raw: it is just a series of strings. For this we use the bag-of-words method to extract and vectorize features.
Two counting schemes for the bag-of-words method:
CountVectorizer
TfidfVectorizer
The more entries there are in the training text, the greater the advantage of TfidfVectorizer as a feature quantization method.
Using TfidfVectorizer to suppress the interference of common words in classification decisions can often improve model performance.
Stop words are filtered out using a blacklist, as sketched below.
CODES
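A minimal sketch comparing the two bag-of-words vectorizers on an assumed toy corpus (not the book's dataset); the stop_words="english" option applies a built-in blacklist that drops common words from the vocabulary.

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are popular pets",
]

# CountVectorizer: each feature value is the raw count of a term in a document.
count_vec = CountVectorizer()
X_count = count_vec.fit_transform(corpus)

# TfidfVectorizer: counts are re-weighted so that terms common across the whole
# corpus receive lower weight, suppressing their interference in classification.
tfidf_vec = TfidfVectorizer()
X_tfidf = tfidf_vec.fit_transform(corpus)

# Built-in English stop-word blacklist removes words like "the", "on", "and".
tfidf_stop = TfidfVectorizer(stop_words="english")
X_stop = tfidf_stop.fit_transform(corpus)

print(sorted(count_vec.vocabulary_))   # full vocabulary, stop words included
print(sorted(tfidf_stop.vocabulary_))  # smaller vocabulary after stop-word filtering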
Feature screening
Good