Preparing data for machine learning algorithms (Machine Learning Study 8)

This article continues directly from the previous two articles in this series.

Experimenting with attribute combinations

Hopefully the previous sections gave you an idea of a few ways to explore your data and gain insights. You discovered some data quirks that you may want to clean up before feeding the data to a machine learning algorithm, and you found interesting correlations between attributes, in particular with the target attribute. You also noticed that some attributes have right-skewed distributions, so you may want to transform them (for example, by taking their logarithm or square root). Of course, your mileage will vary with each project, but the general ideas are similar.

The last thing you may want to do before preparing your data for a machine learning algorithm is to try out various attribute combinations. For example, the total number of rooms in an area is not very useful if you don't know how many households there are; what you really want is the number of rooms per household. Likewise, the total number of bedrooms is not very useful on its own: you probably want to compare it to the number of rooms. The population per household also seems like an interesting attribute combination. Create these new attributes as follows:

housing["rooms_per_house"] = housing["total_rooms"] / housing["households"]
housing["bedrooms_ratio"] = housing["total_bedrooms"] / housing["total_rooms"]
housing["people_per_house"] = housing["population"] / housing["households"]

Then you look at the correlation matrix again:
[Screenshot in the original post: the updated correlations of each attribute with median_house_value]
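For reference, here is a minimal sketch of how that correlation check might look (assuming housing is the training DataFrame explored in the previous articles, and a pandas version that supports numeric_only, which keeps the text column out of the computation):

# Recompute the correlation of every numeric attribute with the target
corr_matrix = housing.corr(numeric_only=True)
corr_matrix["median_house_value"].sort_values(ascending=False)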

The new bedrooms_ratio attribute is much more correlated with the median house value than the total number of rooms or bedrooms. Apparently, houses with a lower bedroom/room ratio tend to be more expensive. The number of rooms per household is also more informative than the total number of rooms in an area: obviously, the larger the house, the more expensive it is.

This round of exploration doesn't need to be absolutely thorough; the key is to start from the right perspective and gain insights quickly that will help you get your first reasonably good prototype. But it's an iterative process: Once you have a prototype up and running, you can analyze its output to gain more insights before returning to this exploration step.

Prepare data for machine learning algorithms

It’s time to prepare the data for your machine learning algorithm. There are several good reasons why you should write a function for this rather than doing it by hand:

  • This will allow you to easily reproduce these transformations on any dataset (e.g. the next time you get a new dataset).
  • You will gradually build a library of transformation functions for reuse in future projects.
  • You can use these functions in real-time systems to transform new data before feeding it into your algorithm.
  • This will allow you to easily try out various transformations and see which combination of transformations works best.

But first, revert to a clean training set (by copying strat_train_set again). You should also separate the predictors from the labels, since you don't necessarily want to apply the same transformations to the predictors and to the target values (note that drop() creates a copy of the data and does not affect strat_train_set):

housing = strat_train_set.drop("median_house_value", axis=1)
housing_labels = strat_train_set["median_house_value"].copy()

Clean the data

Most machine learning algorithms cannot work with missing features, so you need to take care of them. For example, you noticed earlier that the total_bedrooms attribute has some missing values. You have three options for dealing with this:

  1. Remove the corresponding areas (rows).
  2. Remove the whole attribute.
  3. Set the missing values to some value (zero, the mean, the median, etc.). This is called imputation.

You can accomplish these tasks easily using the pandas DataFrame's dropna(), drop(), and fillna() methods:

# The three options below are alternatives; pick one rather than running them all
housing.dropna(subset=["total_bedrooms"], inplace=True) # option 1
housing.drop("total_bedrooms", axis=1) # option 2
median = housing["total_bedrooms"].median() # option 3
housing["total_bedrooms"].fillna(median, inplace=True)

You decide to go with option 3 since it is the least destructive, but instead of the preceding code you will use a convenient Scikit-Learn class: SimpleImputer. The benefit is that it stores the median value of each feature: this makes it possible to impute missing values not only on the training set, but also on the validation set, the test set, and any new data fed to the model. To use it, first create a SimpleImputer instance, specifying that you want to replace each attribute's missing values with the median of that attribute:

from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy="median")

Since the median can only be calculated on numeric attributes, you need to create a copy of the data with only numeric attributes (this will exclude the text attribute ocean_proximity):

import numpy as np

housing_num = housing.select_dtypes(include=[np.number])

Now you can fit the imputer instance to the training data using the fit() method:

imputer.fit(housing_num)

The imputer simply computes the median of each attribute and stores the results in its statistics_ instance variable. Only the total_bedrooms attribute had missing values, but you cannot be sure that there won't be missing values in new data once the system goes live, so it is safer to apply the imputer to all the numerical attributes:

[Screenshot in the original post: the medians stored in imputer.statistics_, matching housing_num.median()]
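The output is not reproduced here, but you can inspect the learned values yourself; a quick sketch:

# The medians learned during fit(), one per column of housing_num
imputer.statistics_

# Sanity check: they should match the medians computed directly with pandas
housing_num.median().values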

Now you can use this "trained" imputer to transform the training set by replacing missing values with the learned medians:

X = imputer.transform(housing_num)

Missing values can also be replaced by the mean (strategy="mean"), by the most frequent value (strategy="most_frequent"), or by a constant value (strategy="constant", fill_value=…). The last two strategies support non-numerical data.

There are also more powerful imputers in the sklearn.impute package (both for numerical features only):

  • KNNImputer replaces each missing value with the mean of the k-nearest neighbors' values for that feature. The distance is based on all the available features.
  • IterativeImputer trains a regression model per feature to predict each missing value based on all the other available features. It then trains the model again on the updated data, and repeats the process several times, improving the models and the imputed values at each iteration. (A short usage sketch follows this list.)
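A minimal usage sketch for these two imputers (illustrative only; the n_neighbors and max_iter values below are arbitrary choices, not taken from the original post):

from sklearn.impute import KNNImputer
from sklearn.experimental import enable_iterative_imputer  # required before importing IterativeImputer
from sklearn.impute import IterativeImputer

# Replace each missing value with the mean of that feature over the 5 nearest rows
knn_imputer = KNNImputer(n_neighbors=5)
X_knn = knn_imputer.fit_transform(housing_num)

# Model each feature from the others and iteratively refine the imputed values
iter_imputer = IterativeImputer(max_iter=10, random_state=42)
X_iter = iter_imputer.fit_transform(housing_num)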

Scikit-Learn transformers output NumPy arrays (or sometimes SciPy sparse matrices) even when they are fed pandas DataFrames as input. So the output of imputer.transform(housing_num) is a NumPy array: X has neither column names nor index. Fortunately, it is not hard to wrap X in a DataFrame and recover the column names and index from housing_num:

import pandas as pd

housing_tr = pd.DataFrame(X, columns=housing_num.columns,
                          index=housing_num.index)

Process text and categorical attributes

So far we have only dealt with numerical attributes, but your data may also contain text attributes. In this dataset there is just one: the ocean_proximity attribute. Let's look at its values for the first few instances:
[Screenshot in the original post: the ocean_proximity values of the first few instances]
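The code that produced that preview is not included in this copy; a minimal sketch (keeping the column as a one-column DataFrame, since the encoders below expect 2D input) would be:

# Keep ocean_proximity as a one-column DataFrame rather than a Series
housing_cat = housing[["ocean_proximity"]]
housing_cat.head(8)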

It is not arbitrary text: there is a limited number of possible values, each of which represents a category. So this is a categorical attribute. Most machine learning algorithms prefer to work with numbers, so let's convert these categories from text to numbers. For this, we can use Scikit-Learn's OrdinalEncoder class:

from sklearn.preprocessing import OrdinalEncoder
ordinal_encoder = OrdinalEncoder()
housing_cat_encoded = ordinal_encoder.fit_transform(housing_cat)

The first few encoded values in housing_cat_encoded look like this:

[Screenshot in the original post: the first few rows of housing_cat_encoded]
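If you want to peek at them yourself, a quick sketch:

# First few rows of the ordinal-encoded column (one float code per category)
housing_cat_encoded[:8]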

You can get the list of categories using the categories_ instance variable. It is a list containing one 1D array of categories per categorical attribute (in this case, the list contains a single array, since there is only one categorical attribute):

[Screenshot in the original post: the contents of ordinal_encoder.categories_]
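A sketch of how to inspect it:

# A list with one array of category names per encoded column
ordinal_encoder.categories_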

One problem with this representation is that ML algorithms will assume that two nearby values are more similar than two distant values. This may be fine in some cases (e.g., for ordered categories such as "Bad", "Average", "Good", and "Excellent"), but it is obviously not the case for the ocean_proximity column (for example, categories 0 and 4 are clearly more similar than categories 0 and 1). To fix this, a common solution is to create one binary attribute per category: one attribute equal to 1 when the category is "<1H OCEAN" (and 0 otherwise), another attribute equal to 1 when the category is "INLAND" (and 0 otherwise), and so on. This is called one-hot encoding, because only one attribute will be equal to 1 (hot) while the others will be 0 (cold). The new attributes are sometimes called dummy attributes. Scikit-Learn provides a OneHotEncoder class to convert categorical values into one-hot vectors:

from sklearn.preprocessing import OneHotEncoder
cat_encoder = OneHotEncoder()
housing_cat_1hot = cat_encoder.fit_transform(housing_cat)

By default, the output of a OneHotEncoder is a SciPy sparse matrix, rather than a NumPy array:

[Screenshot in the original post: housing_cat_1hot displayed as a SciPy sparse matrix]

Sparse matrices are a very efficient representation for matrices that contain mostly zeros: internally, they only store the nonzero values and their positions. When a categorical attribute has hundreds or thousands of categories, one-hot encoding produces a very large matrix with a single 1 per row and 0s everywhere else. In this case a sparse matrix is exactly what you need: it saves a lot of memory and speeds up computations. You can use a sparse matrix mostly like a normal 2D array, but if you want to convert it to a (dense) NumPy array, just call its toarray() method:

[Screenshot in the original post: the dense NumPy array returned by toarray()]
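A sketch of the conversion:

# Convert the SciPy sparse matrix into a regular (dense) NumPy array
housing_cat_1hot.toarray()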

Alternatively, you can set sparse=False when creating the OneHotEncoder, in which case the transform() method will directly return a regular (dense) NumPy array.
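Note that in recent Scikit-Learn releases this parameter has been renamed; a sketch using the newer name:

# Ask the encoder to return a dense NumPy array directly (sparse_output replaces sparse in Scikit-Learn >= 1.2)
cat_encoder_dense = OneHotEncoder(sparse_output=False)
housing_cat_1hot_dense = cat_encoder_dense.fit_transform(housing_cat)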

As with the OrdinalEncoder, you can get the list of categories using the encoder's categories_ instance variable:

[Screenshot in the original post: the contents of cat_encoder.categories_]

Pandas has a get_dummies() function that also converts each categorical feature into a one-hot representation, with one binary feature per category:

[Screenshot in the original post: the one-hot columns produced by pd.get_dummies()]
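The original shows this with a small test DataFrame; a sketch along those lines (the name df_test and its exact contents here are assumptions based on the categories in this dataset):

import pandas as pd

# A tiny DataFrame with two of the known categories
df_test = pd.DataFrame({"ocean_proximity": ["INLAND", "NEAR BAY"]})
pd.get_dummies(df_test)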

It looks nice and simple, so why not use it instead of OneHotEncoder? The advantage of OneHotEncoder is that it remembers which categories it was trained on. This matters because once your model is in production, it should be fed exactly the same features as during training: no more, no less. Look at what our trained cat_encoder outputs when we transform the same df_test (using transform(), not fit_transform()):

[Screenshot in the original post: the output of cat_encoder.transform() on df_test]
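A sketch of that call (reusing the hypothetical df_test from above):

# transform(), not fit_transform(): the encoder reuses the categories learned during fit,
# so the result has one column per learned category, in the learned order
cat_encoder.transform(df_test)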

See the difference? get_dummies() only saw two categories, so it output two columns, whereas OneHotEncoder outputs one column per learned category, in the right order. Moreover, if you feed get_dummies() a DataFrame containing an unknown category (for example, "<2H OCEAN"), it will happily generate a column for it:

[Screenshot in the original post: get_dummies() creating a column for the unknown category]
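A sketch of what that looks like (the name df_test_unknown and its contents are again assumptions):

# get_dummies() has no memory of the training categories, so an unknown value
# such as "<2H OCEAN" simply becomes its own column
df_test_unknown = pd.DataFrame({"ocean_proximity": ["<2H OCEAN", "ISLAND"]})
pd.get_dummies(df_test_unknown)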

But OneHotEncoder is smarter: it will detect the unknown category and raise an exception. If you prefer, you can set the handle_unknown hyperparameter to "ignore", in which case it will just represent the unknown category with zeros:

[Screenshot in the original post: the OneHotEncoder output with handle_unknown="ignore"]
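A sketch of the ignore-unknowns behaviour (building on the hypothetical df_test_unknown above):

# Unknown categories are encoded as all zeros instead of raising an error
cat_encoder_ignore = OneHotEncoder(handle_unknown="ignore")
cat_encoder_ignore.fit(housing_cat)
cat_encoder_ignore.transform(df_test_unknown).toarray()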

When you fit any Scikit-Learn estimator using a DataFrame, the estimator stores the column names in its feature_names_in_ attribute. Scikit-Learn then ensures that any DataFrame fed to this estimator afterwards (e.g., to transform() or predict()) has the same column names. Transformers also provide a get_feature_names_out() method, which you can use to build a DataFrame around the transformer's output:

[Screenshot in the original post: a DataFrame built from get_feature_names_out()]
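A sketch of wrapping the encoder's output back into a DataFrame (reusing the hypothetical df_test_unknown and cat_encoder_ignore from the sketches above):

# Column names remembered from the training DataFrame
cat_encoder_ignore.feature_names_in_

# Output column names, one per learned category
cat_encoder_ignore.get_feature_names_out()

# Rebuild a DataFrame around the (densified) transformed output
df_output = pd.DataFrame(cat_encoder_ignore.transform(df_test_unknown).toarray(),
                         columns=cat_encoder_ignore.get_feature_names_out(),
                         index=df_test_unknown.index)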

Origin blog.csdn.net/coco2d_x2014/article/details/134228012