This article builds directly on the previous two articles in this series.
Experimenting with attribute combinations
Hopefully the previous sections gave you an idea of a few ways to explore your data and gain insights. You discovered some data quirks that you may want to clean up before feeding the data to a machine learning algorithm, and you found interesting correlations between attributes, in particular their correlations with the target attribute. You also noticed that some attributes have right-skewed distributions, so you may want to transform them (for example, by taking their logarithm or square root). Of course, your mileage will vary with each project, but the general ideas are similar.
The last thing you may want to do before preparing your data for a machine learning algorithm is to try out various attribute combinations. For example, the total number of rooms in an area isn't very useful if you don't know how many households there are. What you really want is the number of rooms per household. Likewise, the total number of bedrooms isn't very useful on its own: you probably want to compare it to the number of rooms. The population per household also seems like an interesting attribute combination. Create these new attributes as follows:
housing["rooms_per_house"] = housing["total_rooms"] / housing["households"]
housing["bedrooms_ratio"] = housing["total_bedrooms"] / housing["total_rooms"]
housing["people_per_house"] = housing["population"] / housing["households"]
Then you look at the correlation matrix again:
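As a minimal sketch, here is how that check might look. A tiny synthetic DataFrame stands in for the real California housing data, so the correlation values it produces are illustrative only:

```python
import pandas as pd

# Small synthetic stand-in for the housing DataFrame (illustration only)
housing = pd.DataFrame({
    "median_house_value": [452600.0, 358500.0, 352100.0, 341300.0],
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, 1106.0, 190.0, 235.0],
    "households": [126.0, 1138.0, 177.0, 219.0],
    "population": [322.0, 2401.0, 496.0, 558.0],
})

# The combined attributes from the text
housing["rooms_per_house"] = housing["total_rooms"] / housing["households"]
housing["bedrooms_ratio"] = housing["total_bedrooms"] / housing["total_rooms"]
housing["people_per_house"] = housing["population"] / housing["households"]

# Correlations of every numeric attribute with the target, strongest first
# (with the real data you would first exclude the text column ocean_proximity)
corr_matrix = housing.corr()
print(corr_matrix["median_house_value"].sort_values(ascending=False))
```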
The new bedrooms_ratio attribute is much more correlated with the median house value than the total number of rooms or bedrooms. Apparently, houses with a lower bedroom/room ratio tend to be more expensive. The number of rooms per household is also more informative than the total number of rooms in an area: obviously, the larger the houses, the more expensive they are.
This round of exploration doesn't need to be absolutely thorough; the key is to start from the right perspective and gain insights quickly that will help you get your first reasonably good prototype. But it's an iterative process: Once you have a prototype up and running, you can analyze its output to gain more insights before returning to this exploration step.
Prepare data for machine learning algorithms
It’s time to prepare the data for your machine learning algorithm. There are several good reasons why you should write a function for this rather than doing it by hand:
- This will allow you to easily reproduce these transformations on any dataset (e.g. the next time you get a new dataset).
- You will gradually build a library of transformation functions for reuse in future projects.
- You can use these functions in real-time systems to transform new data before feeding it into your algorithm.
- This will allow you to easily try out various transformations and see which combination of transformations works best.
But first, revert to a clean training set (by copying strat_train_set again). You should also keep predictors and labels separate, as you don't necessarily want to apply the same transformation to predictors and target values (note that drop() creates a copy of the data and does not affect strat_train_set):
housing = strat_train_set.drop("median_house_value", axis=1)
housing_labels = strat_train_set["median_house_value"].copy()
Clean the data
Most machine learning algorithms cannot handle missing features, so you need to handle these features. For example, you noticed earlier that the total_bedrooms attribute has some missing values. You have three options to solve this problem:
- Remove the corresponding area.
- Remove the entire attribute.
- Set missing values to some value (zero, mean, median, etc.). This is called imputation.
You can easily accomplish these tasks using the pandas DataFrame's dropna(), drop(), and fillna() methods:
housing.dropna(subset=["total_bedrooms"], inplace=True) # option 1
housing.drop("total_bedrooms", axis=1) # option 2
median = housing["total_bedrooms"].median() # option 3
housing["total_bedrooms"].fillna(median, inplace=True)
You decide to go with option 3 since it is the least destructive, but instead of the preceding code you will use a handy Scikit-Learn class: SimpleImputer. The benefit is that it stores the median value of each feature: this makes it possible to impute missing values not only on the training set, but also on the validation set, the test set, and any new data fed to the model. To use it, first create a SimpleImputer instance, specifying that you want to replace each attribute's missing values with the median of that attribute:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy="median")
Since the median can only be calculated on numeric attributes, you need to create a copy of the data with only numeric attributes (this will exclude the text attribute ocean_proximity):
housing_num = housing.select_dtypes(include=[np.number])
Now you can fit the imputer instance to the training data using the fit() method:
imputer.fit(housing_num)
The estimator simply computed the median of each attribute and stored the result in its statistics_ instance variable. Only the total_bedrooms attribute had missing values, but you cannot be sure that there won't be any missing values in new data once the system goes live, so it is safer to apply the imputer to all the numerical attributes.
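To see what the imputer learned, you can compare statistics_ against medians computed by hand. This is a small sketch with a toy two-column frame standing in for housing_num:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy numeric frame with a missing value, standing in for housing_num
housing_num = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0],
    "total_bedrooms": [129.0, np.nan, 190.0],
})
imputer = SimpleImputer(strategy="median")
imputer.fit(housing_num)

print(imputer.statistics_)          # medians learned for each column
print(housing_num.median().values)  # same values, computed manually
```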
Now you can use this "trained" estimator to transform the training set by replacing missing values with the learned median:
X = imputer.transform(housing_num)
Missing values can also be replaced by the mean (strategy="mean"), or by the most frequent value (strategy="most_frequent"), or by a constant value (strategy="constant", fill_value=…). The latter two strategies support non-numeric data.
There are also more powerful imputers available in the sklearn.impute package (both for numerical features only):
- KNNImputer replaces each missing value with the mean of the k-nearest neighbors' values for that feature. The distance is based on all the available features.
- IterativeImputer trains a regression model per feature to predict the missing values based on all the other available features. It then trains the model again on the updated data, and repeats the process several times, improving the models and the replacement values at each iteration.
Scikit-Learn transformers output NumPy arrays (or sometimes SciPy sparse matrices) even when they are fed pandas DataFrames. Thus, the output of imputer.transform(housing_num) is a NumPy array. Luckily, it is easy to wrap it in a DataFrame, recovering the column names and index from housing_num:
housing_tr = pd.DataFrame(X, columns=housing_num.columns,
index=housing_num.index)
Process text and categorical attributes
So far we have only dealt with numerical attributes, but your data may also contain text attributes. In this dataset, there is just one: the ocean_proximity attribute. Let's look at its value for the first few instances:
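For illustration, here is a sketch with a handful of made-up rows (the real values come from the training set). Note the double brackets, which keep housing_cat a DataFrame rather than a Series:

```python
import pandas as pd

# Hypothetical slice standing in for the training set's categorical column
housing = pd.DataFrame({"ocean_proximity": [
    "NEAR BAY", "<1H OCEAN", "INLAND", "NEAR OCEAN", "INLAND",
]})

housing_cat = housing[["ocean_proximity"]]  # double brackets -> DataFrame
print(housing_cat.head())
```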
It isn't arbitrary text: there are a limited number of possible values, each representing a category. So this attribute is a categorical attribute. Most machine learning algorithms prefer to work with numbers, so let's convert these categories from text to numbers. For this, we can use Scikit-Learn's OrdinalEncoder class:
from sklearn.preprocessing import OrdinalEncoder
ordinal_encoder = OrdinalEncoder()
housing_cat_encoded = ordinal_encoder.fit_transform(housing_cat)
The first few encoded values in housing_cat_encoded are like this:
You can get the list of categories using the categories_ instance variable. It is a list containing a 1D array of categories for each categorical attribute (in this case, a list containing a single array, since there is just one categorical attribute):
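Here is a minimal, self-contained sketch (with a made-up three-row column in place of the real data) showing both the encoded values and categories_:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy stand-in for the ocean_proximity column
housing_cat = pd.DataFrame({"ocean_proximity": ["NEAR BAY", "INLAND", "NEAR BAY"]})

ordinal_encoder = OrdinalEncoder()
housing_cat_encoded = ordinal_encoder.fit_transform(housing_cat)

print(housing_cat_encoded)          # categories encoded as floats (0.0, 1.0, ...)
print(ordinal_encoder.categories_)  # one 1D array per categorical attribute
```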
One issue with this representation is that ML algorithms will assume that two nearby values are more similar than two distant values. This may be fine in some cases (e.g., for ordered categories such as "bad", "average", "good", and "excellent"), but it is obviously not the case for the ocean_proximity column (for example, categories 0 and 4 are clearly more similar than categories 0 and 1). To fix this issue, a common solution is to create one binary attribute per category: one attribute equal to 1 when the category is "<1H OCEAN" (and 0 otherwise), another attribute equal to 1 when the category is "INLAND" (and 0 otherwise), and so on. This is called one-hot encoding, because only one attribute will be equal to 1 (hot), while the others will be 0 (cold). The new attributes are sometimes called dummy attributes. Scikit-Learn provides a OneHotEncoder class to convert categorical values into one-hot vectors:
from sklearn.preprocessing import OneHotEncoder
cat_encoder = OneHotEncoder()
housing_cat_1hot = cat_encoder.fit_transform(housing_cat)
By default, the output of OneHotEncoder is a SciPy sparse matrix, not a NumPy array:
Sparse matrices are a very efficient representation for matrices that contain mostly zeros: internally, they only store the nonzero values and their positions. When a categorical attribute has hundreds or thousands of categories, one-hot encoding it produces a very large matrix with a single 1 per row and 0s everywhere else. In this case, a sparse matrix is exactly what you need: it will save plenty of memory and speed up computations. You can use a sparse matrix mostly like a normal 2D array, but if you want to convert it to a (dense) NumPy array, just call the toarray() method:
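A small sketch (again with a made-up two-row column) of the sparse output and its dense conversion:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

housing_cat = pd.DataFrame({"ocean_proximity": ["NEAR BAY", "INLAND"]})

cat_encoder = OneHotEncoder()
housing_cat_1hot = cat_encoder.fit_transform(housing_cat)  # SciPy sparse matrix

print(housing_cat_1hot.toarray())  # dense NumPy array: a single 1 per row
```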
Alternatively, you can set sparse=False when creating the OneHotEncoder (renamed sparse_output in Scikit-Learn 1.2 and later), in which case the transform() method will directly return a regular (dense) NumPy array.
As with the OrdinalEncoder, you can get the list of categories using the encoder's categories_ instance variable:
Pandas has a function called get_dummies() that also converts each categorical feature into a one-hot representation, with one binary feature per category:
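The original df_test is not shown in this text, so the following sketch makes one up with two categories to illustrate get_dummies():

```python
import pandas as pd

# Hypothetical test frame; the article's actual df_test is not shown
df_test = pd.DataFrame({"ocean_proximity": ["INLAND", "NEAR BAY"]})

dummies = pd.get_dummies(df_test)
print(dummies)  # one binary column per category seen in df_test
```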
It looks nice and simple, so why not use it instead of OneHotEncoder? Well, the advantage of OneHotEncoder is that it remembers which categories it was trained on. This matters because once your model is in production, it should be fed exactly the same features as during training: no more, no less. Look at what our trained cat_encoder outputs when we make it transform the same df_test (using transform(), not fit_transform()):
See the difference? get_dummies() saw only two categories, so it output two columns, whereas OneHotEncoder output one column per learned category, in the right order. Moreover, if you feed get_dummies() a DataFrame containing an unknown category (for example, "<2H OCEAN"), it will happily generate a column for it:
But OneHotEncoder is smarter: it will detect the unknown category and raise an exception. If you prefer, you can set the handle_unknown hyperparameter to "ignore", in which case it will simply represent the unknown category with zeros:
When you fit any Scikit-Learn estimator using a DataFrame, the estimator stores the column names in the feature_names_in_ attribute. Scikit-Learn then ensures that any DataFrame fed to this estimator after that (e.g., to transform() or predict()) has the same column names. Transformers also provide a get_feature_names_out() method, which you can use to build a DataFrame around the transformer's output: