Google Machine Learning Framework TensorFlow Study Notes (11)

Representation

Machine learning models cannot directly see, hear, or otherwise perceive input samples. You must create a representation of the data that gives the model useful signals about the data's key qualities. That is, to train a model, you must choose the set of features that best represents the data. The process of extracting features from raw data is called feature engineering. In practice, machine learning practitioners spend roughly 75% of their time on feature engineering. Traditional programming focuses on code; in machine learning projects, the focus shifts to representation: developers hone the model by adding and improving its features.

Mapping raw data to features

The left side of Figure 1 shows raw data from an input data source; the right side shows the feature vector, the set of floating-point values that make up the samples in your dataset. Feature engineering means transforming this raw data into a feature vector. Expect to spend a significant amount of time on feature engineering.
Machine learning models typically expect samples to be represented as vectors of real numbers. Such a vector is constructed by deriving features for each field and then concatenating them all together.
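The derive-then-concatenate process can be sketched as follows. This is a minimal illustration with hypothetical field names, not any TensorFlow API:

```python
# Minimal sketch: derive one feature per field of a raw sample, then
# concatenate the results into a single feature vector.
# The field names ("num_rooms", "house_age") are hypothetical.
def build_feature_vector(raw_sample):
    features = []
    features.append(float(raw_sample["num_rooms"]))  # numeric field: use value directly
    features.append(float(raw_sample["house_age"]))
    return features

sample = {"num_rooms": 6, "house_age": 25}
print(build_feature_vector(sample))  # [6.0, 25.0]
```

Non-numeric fields (such as strings) need an extra encoding step before they can join this vector, as described below.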
Figure 1. Feature engineering maps raw data to machine learning features.

Mapping numeric values

Machine learning models train on floating-point values, so integer and floating-point raw data do not require special encoding. As Figure 2 suggests, converting the raw integer value 6 to the feature value 6.0 is trivial.

Mapping string values

Models cannot learn patterns from string values, so you'll need to do some feature engineering to convert these values to numeric form:

  1. First, define a vocabulary of the string values for each feature you want to represent. For the street_name feature, the vocabulary would contain all the streets you know.

    Note: All other streets can be grouped into a catch-all "other" category, known as the OOV (out-of-vocabulary) bucket.
  2. Then, use this vocabulary to create a one-hot encoding that represents a given string value as a binary vector. In this vector (corresponding to the given string value):

    • Only one element is set to 1.
    • All other elements are set to 0.

    The length of this vector is equal to the number of elements in the vocabulary.

Figure 3 shows the one-hot encoding of a particular street (Shorebird Way). In this binary vector, the element representing Shorebird Way has the value 1, while the elements representing all other streets have the value 0.


Figure 3. Mapping string values via one-hot encoding.
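The vocabulary-plus-one-hot steps above can be sketched in a few lines. The vocabulary here is illustrative (only Shorebird Way comes from the figure), and the sketch reserves one extra slot at the end of the vector for the OOV bucket:

```python
# One-hot encoding a string feature against a fixed vocabulary.
# The street names below are illustrative; a real vocabulary would
# list every street you want the model to distinguish.
vocabulary = ["Shorebird Way", "Main Street", "Elm Avenue"]
index = {name: i for i, name in enumerate(vocabulary)}

def one_hot(street_name):
    # Vector length = vocabulary size + 1 extra slot for the OOV bucket.
    vec = [0] * (len(vocabulary) + 1)
    vec[index.get(street_name, len(vocabulary))] = 1
    return vec

print(one_hot("Shorebird Way"))  # [1, 0, 0, 0]
print(one_hot("Unknown Road"))   # [0, 0, 0, 1] -> falls into the OOV bucket
```

Exactly one element is 1 in every output vector, matching the definition above; any street outside the vocabulary lands in the shared OOV slot.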

Mapping categorical (enumerated) values

Categorical features have a discrete set of possible values. For example, a feature named Lowland Countries contains only 3 possible values:

{'Netherlands', 'Belgium', 'Luxembourg'}

You might encode a categorical feature (like Lowland Countries) as an enumerated type or as a discrete set of integers representing the distinct values. For example:

  • Represent the Netherlands as 0
  • Represent Belgium as 1
  • Represent Luxembourg as 2

However, machine learning models often represent each categorical feature as a separate boolean value. For example, Lowland Countries could be represented in the model as 3 separate boolean features:

  • x1: Is it the Netherlands?
  • x2: Is it Belgium?
  • x3: Is it Luxembourg?

Encoding the values this way also simplifies situations where a value can belong to more than one category (e.g., "borders France" is true for both Belgium and Luxembourg).




