Python Machine Learning and Practice: Practical Coding Tips for Models (Feature Enhancement, Model Regularization, Model Checking, Hyperparameter Search)

In the previous sections, the data was already normalized, and most models were used with their default configurations.

But is the data encountered in real-world research and work always organized this way? Is the default configuration always the best?

3.1 Practical model skills

Once we decide to use a certain model, the libraries introduced in this book can help us learn the parameters required by the model from standard training data, relying on the default configuration;

Next, we use this set of parameters to guide the model's predictions on the test data, and then evaluate the model's performance.

But this scheme does not guarantee that:

① All the data features used for training are the best available

② The learned parameters are optimal

③ The model under its default configuration is always the best

Therefore, we can improve the performance of the previously used models from multiple perspectives: preprocessing the data, controlling parameters, and optimizing the model configuration.

Feature enhancement (feature extraction and feature selection)

Feature extraction

So-called feature extraction converts the original data, item by item, into feature vectors; this process also involves the quantitative representation of data features.

Raw data comes in two forms:

1 Digitized signal data (voiceprints, images)

2 A large amount of symbolic text

① We cannot directly use symbolic text itself for computation; we must first quantize the text into feature vectors through some processing method.

Some data features represented by symbols are already relatively structured and stored in a dictionary data structure.

In this case, we use DictVectorizer to extract and vectorize the features


How DictVectorizer handles features (in a dictionary):

1 Categorical features are encoded with 0/1 binary values

2 Numeric features keep their original values

② Other text data is rawer, just a series of strings. For this we use the bag-of-words method to extract and vectorize features.

Two calculation methods for the bag-of-words model:

CountVectorizer

TfidfVectorizer

The more entries there are in the training text, the greater the advantage of the TfidfVectorizer feature-quantization method.

Using TfidfVectorizer to suppress the interference of common words in classification decisions can often improve model performance.

Stop words are filtered out by blacklisting


Feature selection
