End-to-end machine learning project (Machine Learning Study 6)

Use real data

When you study machine learning, it is better to use real-world data rather than artificial datasets. As it happens, there are thousands of open datasets to choose from, covering a wide variety of fields.

The data in this article comes from the California housing prices dataset in the StatLib repository (see the picture below). It is based on the 1990 California census, so it is not the most recent data. Don't worry too much about that: our focus is on learning machine learning, not on discussing housing prices.

[Figure: the California housing prices dataset]

Look at the big picture

Your first machine learning task is to use California census data to build a model of housing prices in the state. The model should learn from this data and be able to predict the median house price in any district, given all the other metrics.

Frame the problem

Building a model is probably not the end goal in itself: how does the company expect to use and benefit from the model? Understanding your objective is crucial, because it determines how you frame the problem, which algorithms you choose, which performance metrics you use to evaluate your model, and how much effort you spend tuning it.

The output of your model (a prediction of the median house price in a district) will be fed, along with many other signals, into another machine learning system (see the image below); that downstream system will decide whether it is worth investing in a given area. Getting this right is crucial, as it directly affects revenue.

What is the current solution, if any? The current solution often gives you a performance reference, as well as insights on how to solve the problem. District prices are currently estimated manually by a team of experts: they gather up-to-date information about a district, and when they cannot obtain the median house price, they estimate it using complex rules.

[Figure: a machine learning pipeline for real estate investments]

This is expensive and time-consuming, and the estimates are not great: in cases where the team manages to find out the actual median home price afterwards, they often realize that their estimate was off by more than 30%. That is why the company thinks it would be useful to build a model that predicts a district's median house price from other data about the district. The census data looks like a good dataset for this purpose, because it includes median home prices for thousands of districts, along with other data.

Pipelines

The sequence of data processing components is called a data pipeline. Pipelines are very common in machine learning systems because there is a lot of data to operate on and many data transformations to be applied.

Components usually run asynchronously. Each component pulls in a large amount of data, processes it, and spits out the results into another data store. Then, some time later, the next component in the pipeline pulls in that data and writes out its own output. Each component is fairly self-contained: the interface between components is simply the data store. This makes the system easy to grasp (with the help of a data flow diagram), and different teams can focus on different components. Additionally, if one component breaks, downstream components can usually continue to run normally (at least for a while) by simply using the last output of the broken component. This makes the architecture quite robust.

On the other hand, if proper monitoring is not implemented, a damaged component may go unnoticed for some time. Data becomes stale and overall system performance degrades.
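
As a toy illustration (not part of the original project; all file names, column names, and functions here are hypothetical), two components that communicate only through a shared data store on disk might look like this:

import pandas as pd

def cleaning_component(raw_path="raw_districts.csv", store_path="clean_districts.csv"):
    """Hypothetical component: pulls raw data, cleans it, writes the result to a shared data store."""
    data = pd.read_csv(raw_path)
    data.dropna().to_csv(store_path, index=False)

def feature_component(store_path="clean_districts.csv", out_path="district_features.csv"):
    """Hypothetical downstream component: runs later and reads only the previous component's output."""
    data = pd.read_csv(store_path)
    data["rooms_per_household"] = data["total_rooms"] / data["households"]
    data.to_csv(out_path, index=False)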

With this information, you can start designing your system. But first, answer these questions: is it a supervised, unsupervised, semi-supervised, self-supervised, or reinforcement learning task? Is it a classification task, a regression task, or something else? Should you use batch learning or online learning techniques?

Can you answer them? This is clearly a classic supervised learning task, since the model can be trained with labeled examples (each example comes with the expected output, namely the district's median house price). It is a typical regression task, since the model will be asked to predict a value. More specifically, it is a multiple regression problem, because the system will use multiple features to make its predictions (the district's population, median income, and so on). It is also a univariate regression problem, since we are only trying to predict a single value for each district. If we were trying to predict multiple values per district, it would be a multivariate regression problem. Finally, there is no continuous flow of data into the system, there is no particular need to adapt to rapidly changing data, and the data is small enough to fit in memory, so plain batch learning should do just fine.

Select performance metrics

A typical performance metric for regression problems is the root mean square error (RMSE). It gives an idea of how much error the system typically makes in its predictions, with larger errors given a higher weight. The formula below shows how RMSE is calculated.

$$\mathrm{RMSE}(\mathbf{X}, h) = \sqrt{\frac{1}{m} \sum_{i=1}^{m} \left( h\left(\mathbf{x}^{(i)}\right) - y^{(i)} \right)^{2}}$$

Note:

This equation introduces several very common machine learning notations that I will use throughout the book:

  • m is the number of instances in the dataset on which you measure RMSE.

    For example, if you evaluate RMSE on a validation set of 2,000 regions, then m = 2,000.

  • x(i) is a vector of all the feature values (excluding the label) of the i-th instance in the dataset, and y(i) is its label (the desired output value for that instance). For example, if the first district in the dataset is located at –118.29° longitude and 33.91° latitude, has 1,416 residents, a median income of $38,372, and a median home value of $156,400 (ignoring the other features for now), then:

    $$\mathbf{x}^{(1)} = \begin{pmatrix} -118.29 \\ 33.91 \\ 1{,}416 \\ 38{,}372 \end{pmatrix}$$

and:

$$y^{(1)} = 156{,}400$$

  • X is a matrix containing all the feature values (excluding labels) of all the instances in the dataset. There is one row per instance, and the i-th row is equal to the transpose of x(i), denoted (x(i))⊺. For example, if the first district is as just described, then the matrix X looks like this:

    $$\mathbf{X} = \begin{pmatrix} \left(\mathbf{x}^{(1)}\right)^{\intercal} \\ \left(\mathbf{x}^{(2)}\right)^{\intercal} \\ \vdots \\ \left(\mathbf{x}^{(2000)}\right)^{\intercal} \end{pmatrix} = \begin{pmatrix} -118.29 & 33.91 & 1{,}416 & 38{,}372 \\ \vdots & \vdots & \vdots & \vdots \end{pmatrix}$$

  • h is the system's prediction function, also called the hypothesis. When given the feature vector x(i) of an instance, it outputs the predicted value ŷ(i) = h(x(i)) for that instance (ŷ is pronounced "y-hat"). For example, if your system predicts that the median home price in the first district is $158,400, then ŷ(1) = h(x(1)) = 158,400. The prediction error for this district is ŷ(1) – y(1) = 2,000.

  • RMSE(X, h) is the cost function measured on the set of examples using the hypothesis h.

We use lowercase italics for scalar values (such as m or y(i)) and function names (such as h), lowercase bold for vectors (such as x(i)), and uppercase bold for matrices (such as X).

Although RMSE is generally the preferred performance measure for regression tasks, in some contexts you may prefer another function. For example, if there are many outlier districts, you might consider using the mean absolute error (MAE, also known as the average absolute deviation), shown in the formula below:

$$\mathrm{MAE}(\mathbf{X}, h) = \frac{1}{m} \sum_{i=1}^{m} \left| h\left(\mathbf{x}^{(i)}\right) - y^{(i)} \right|$$

RMSE and MAE are both ways of measuring the distance between two vectors: the vector of predictions and the vector of target values. Various distance measures, or norms, are possible:

  • Computing the root of the sum of squares (RMSE) corresponds to the Euclidean norm: a distance concept that we are all familiar with. It is also called the ℓ2 norm, denoted ∥ · ∥2 (or simply ∥ · ∥).
  • Computing the sum of absolute values (MAE) corresponds to the ℓ1 norm, denoted ∥ · ∥1. This is sometimes called the Manhattan norm, because it measures the distance between two points in a city where you can only travel along orthogonal city blocks.
  • More generally, the ℓk norm of an n-element vector v is defined as $\lVert \mathbf{v} \rVert_k = \left( |v_1|^k + |v_2|^k + \cdots + |v_n|^k \right)^{1/k}$. ℓ0 gives the number of non-zero elements in the vector, and ℓ∞ gives the maximum absolute value in the vector.

The higher the norm index, the greater the emphasis on large values ​​and the neglect of small values. This is why RMSE is more sensitive to outliers than MAE. However, when outliers are exponentially rare (such as in a bell curve), RMSE performs very well and is often preferred.
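
As a quick illustration (not from the original article; the prediction and target values below are made up), here is how you might compute RMSE, MAE, and a general ℓk norm with NumPy:

import numpy as np

# Hypothetical predictions and targets for five districts (toy values).
y_pred = np.array([158400., 310000., 95000., 220000., 410000.])
y_true = np.array([156400., 330000., 90000., 240000., 380000.])
errors = y_pred - y_true

rmse = np.sqrt(np.mean(errors ** 2))   # penalizes large errors more heavily (ℓ2 flavor)
mae = np.mean(np.abs(errors))          # more robust to outliers (ℓ1 flavor)

def lk_norm(v, k):
    # ℓk norm: (|v1|^k + ... + |vn|^k)^(1/k)
    return np.sum(np.abs(v) ** k) ** (1 / k)

print(rmse, mae, lk_norm(errors, 2), lk_norm(errors, 1))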

Check assumptions

Finally, it is good practice to list and verify the assumptions that have been made so far (by you or by others); this can help you catch serious problems early. For example, the district prices output by your system will be fed into a downstream machine learning system, and you assume that these prices will be used as such. But what if the downstream system converts the prices into categories (e.g., "cheap," "moderate," or "expensive") and then uses those categories instead of the prices themselves? In that case, getting the price exactly right is not important at all; your system just needs to get the category right. If so, the problem should be framed as a classification task rather than a regression task. You don't want to discover this after working on a regression system for several months.

Fortunately, after talking to the team responsible for the downstream system, you are convinced that they do want actual prices, not just categories. Great! Everything is set, you have the green light, and you can now start coding!

Download data

In a typical environment, your data would be available in a relational database or some other common data store.

To access it, you first need to obtain credentials and access authorization, and become familiar with the data schema. In this project, however, things are much simpler: you simply download a single compressed file, housing.tgz, which contains a comma-separated values (CSV) file called housing.csv with all the data.

It's usually better to write a function that does this for you rather than manually downloading and decompressing the data. This is especially useful if the data changes regularly: you can write a small script that uses this function to get the latest data (or you can set up a scheduled job to do this automatically at regular intervals). Automating the process of obtaining the data is also useful if you need to install the dataset on multiple computers.

from pathlib import Path
import pandas as pd
import tarfile
import urllib.request

def load_housing_data():
    tarball_path = Path("datasets/housing.tgz")
    if not tarball_path.is_file():
        Path("datasets").mkdir(parents=True, exist_ok=True)
        url = "https://github.com/ageron/data/raw/main/housing.tgz"
        urllib.request.urlretrieve(url, tarball_path)
        with tarfile.open(tarball_path) as housing_tarball:
            housing_tarball.extractall(path="datasets")
    return pd.read_csv(Path("datasets/housing/housing.csv"))

housing = load_housing_data()

When load_housing_data() is called, it looks for the datasets/housing.tgz file. If it is not found, the function creates the datasets directory in the current directory, downloads the housing.tgz file from the ageron/data GitHub repository, and extracts its contents into the datasets directory; this creates the datasets/housing directory containing the housing.csv file. Finally, the function loads this CSV file into a pandas DataFrame containing all the data and returns it.
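
Before moving on, it is worth taking a quick look at the loaded data. The original screenshots are not reproduced here, but a minimal peek with standard pandas methods might be:

housing.head()       # the first five rows
housing.info()       # column names, non-null counts, and dtypes
housing.describe()   # summary statistics of the numerical columns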

Create test set

It may seem odd to voluntarily set aside some of the data at this stage. After all, you have only taken a quick look at the data; surely you should learn more about it before deciding which algorithms to use, right? This is true, but your brain is an amazing pattern detection system, which also means it is prone to overfitting: if you look at the test set, you may stumble upon some seemingly interesting pattern in the test data that leads you to select a particular kind of machine learning model. When you then estimate the generalization error using the test set, your estimate will be too optimistic, and you will launch a system that does not perform as well as expected. This is called data snooping bias. Creating a test set is theoretically simple: pick some instances at random (typically 20% of the dataset) and set them aside:


import numpy as np

def shuffle_and_split_data(data, test_ratio):
    shuffled_indices = np.random.permutation(len(data))
    test_set_size = int(len(data) * test_ratio)
    test_indices = shuffled_indices[:test_set_size]
    train_indices = shuffled_indices[test_set_size:]
    return data.iloc[train_indices], data.iloc[test_indices]
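
You can then use it like this (a minimal usage sketch):

train_set, test_set = shuffle_and_split_data(housing, 0.2)
print(len(train_set), "train +", len(test_set), "test")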


OK, this works, but it is not perfect: if you run the program again, it will generate a different test set! Over time, you (or your machine learning algorithm) will get to see the whole dataset, which is exactly what you want to avoid.

One solution is to save the test set on the first run and then load it on subsequent runs. Another option is to set the random number generator's seed (for example, with np.random.seed(42)) before calling np.random.permutation(), so that it always generates the same shuffled indices.
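
For example, the second option is just (a minimal sketch):

np.random.seed(42)  # makes np.random.permutation() reproducible across runs
train_set, test_set = shuffle_and_split_data(housing, 0.2)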

However, both of these solutions break the next time you fetch an updated dataset. To have a stable train/test split even after updating the dataset, a common solution is to use each instance's identifier to decide whether or not it should go in the test set (assuming instances have unique and immutable identifiers). For example, you could compute a hash of each instance's identifier and put the instance in the test set if the hash is lower than or equal to 20% of the maximum hash value. This ensures that the test set remains consistent across multiple runs, even if you refresh the dataset. The new test set will contain 20% of the new instances, but it will not contain any instance that was previously in the training set.


from zlib import crc32

def is_id_in_test_set(identifier, test_ratio):
    return crc32(np.int64(identifier)) < test_ratio * 2**32

def split_data_with_id_hash(data, test_ratio, id_column):
    ids = data[id_column]
    in_test_set = ids.apply(lambda id_: is_id_in_test_set(id_, test_ratio))
    return data.loc[~in_test_set], data.loc[in_test_set]


Unfortunately, the housing dataset does not have an identifier column. The simplest solution is to use the row index as the ID:

housing_with_id = housing.reset_index()  # adds an `index` column
train_set, test_set = split_data_with_id_hash(housing_with_id, 0.2, "index")

If you use the row index as a unique identifier, you need to make sure that new data always gets appended to the end of the dataset and that no row ever gets deleted. If that is not possible, you can try to use the most stable features to build a unique identifier. For example, a district's latitude and longitude are guaranteed to be stable for a few million years, so you could combine them into an ID like this:


housing_with_id["id"] = housing["longitude"] * 1000 + housing["latitude"]
train_set, test_set = split_data_with_id_hash(housing_with_id, 0.2, "id")


Scikit-Learn provides a few functions to split a dataset into subsets in various ways. The simplest is train_test_split(), which does pretty much the same thing as the shuffle_and_split_data() function we defined earlier, with a couple of additional features. First, there is a random_state parameter that lets you set the random generator seed. Second, you can pass it multiple datasets with the same number of rows, and it will split them on the same indices (this is useful, for example, if you have a separate DataFrame of labels):


from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)
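
The second feature can be used like this; the separate labels DataFrame here is hypothetical, purely to show that both objects are split on the same shuffled indices:

# Hypothetical: suppose the labels lived in their own DataFrame with the same number of rows.
housing_labels = housing[["median_house_value"]]

train_set, test_set, train_labels, test_labels = train_test_split(
    housing, housing_labels, test_size=0.2, random_state=42)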


So far we have only considered purely random sampling methods. This is usually fine if your dataset is large enough (especially relative to the number of attributes), but if it is not, you run the risk of introducing significant sampling bias. When employees at a survey firm decide to call 1,000 people and ask them a few questions, they don't just pick 1,000 people at random from the phone book. They try to ensure that those 1,000 people are representative of the whole population with regard to the questions they want to ask. For example, the US population is 51.1% female and 48.9% male, so a well-conducted survey in the US would try to maintain this ratio in the sample: 511 women and 489 men (at least if the answers are likely to vary by gender). This is called stratified sampling: the population is divided into homogeneous subgroups called strata, and the right number of instances is sampled from each stratum to guarantee that the test set is representative of the overall population. If the people running the survey used purely random sampling, there would be about a 10.7% chance of sampling a skewed test set with less than 48.5% or more than 53.5% female participants. Either way, the survey results would be significantly biased.
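
The 10.7% figure can be checked with a short calculation (a sketch assuming the 51.1% female ratio above and a sample of 1,000 people):

from scipy.stats import binom

sample_size = 1000
ratio_female = 0.511

# Probability that a purely random sample of 1,000 people contains
# fewer than 48.5% or more than 53.5% women.
proba_too_few = binom(sample_size, ratio_female).cdf(485 - 1)
proba_too_many = 1 - binom(sample_size, ratio_female).cdf(535)
print(proba_too_few + proba_too_many)  # roughly 0.107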

Suppose you chat with some experts, and they tell you that the median income is a very important attribute for predicting median house prices. You may want to ensure that the test set is representative of the various income categories in the whole dataset. Since the median income is a continuous numerical attribute, you first need to create an income category attribute. Looking at the histogram of median incomes, most values cluster around 1.5 to 6 (i.e., $15,000 to $60,000), but some go far beyond 6. It is important to have a sufficient number of instances in your dataset for each stratum, or the estimate of a stratum's importance may be biased. This means that you should not have too many strata, and each stratum should be large enough. The following code uses the pd.cut() function to create an income category attribute with five categories (labeled 1 to 5): category 1 ranges from 0 to 1.5 (i.e., less than $15,000), category 2 from 1.5 to 3, and so on:

housing["income_cat"] = pd.cut(housing["median_income"],
                               bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                               labels=[1, 2, 3, 4, 5])

housing["income_cat"].value_counts().sort_index().plot.bar(rot=0, grid=True)
plt.xlabel("Income category")
plt.ylabel("Number of districts")
save_fig("housing_income_cat_bar_plot")  # extra code
plt.show()


[Figure: bar plot of the number of districts per income category]

Now you are ready to conduct stratified sampling based on income categories. Scikit-Learn provides a number of splitter classes in the sklearn.model_selection package that implement various strategies to split a dataset into training and test sets. Each splitter has a split() method that returns an iterator of different train/test splits of the same data.

To be precise, the split() method yields the training and test indices, not the data itself. Having multiple splits is useful if you want a better estimate of your model's performance, as you will see when we discuss cross-validation later. For example, the following code generates 10 different stratified splits of the same dataset:


from sklearn.model_selection import StratifiedShuffleSplit

splitter = StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=42)
strat_splits = []
for train_index, test_index in splitter.split(housing, housing["income_cat"]):
    strat_train_set_n = housing.iloc[train_index]
    strat_test_set_n = housing.iloc[test_index]
    strat_splits.append([strat_train_set_n, strat_test_set_n])


For now, you can just use the first split:

strat_train_set, strat_test_set = strat_splits[0]

Alternatively, since stratified sampling is common, there is an easier way to obtain a single split using the train_test_split() function with the stratify parameter:


strat_train_set, strat_test_set = train_test_split(
    housing, test_size=0.2, stratify=housing["income_cat"], random_state=42)


Let's see if this works as expected. You can start with the income category proportions in the test set:
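
One way to compute them (a minimal sketch):

test_cat_props = strat_test_set["income_cat"].value_counts() / len(strat_test_set)
print(test_cat_props)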

[Output: income category proportions in the stratified test set]

Using similar code, you can measure the proportion of income categories across the entire data set. The following figure compares the proportions of income categories in the overall data set, in a test set generated by stratified sampling, and in a test set generated by pure random sampling. As you can see, the test set generated using stratified sampling has almost the same income category proportions as the income category proportions in the full data set, whereas the test set generated using pure random sampling is skewed.

[Figure: income category proportions in the full dataset, the stratified test set, and a random test set]
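
One way to build such a comparison yourself (a sketch, not the article's original code; it takes a fresh purely random split, since income_cat did not yet exist when the earlier random split was made):

def income_cat_proportions(data):
    return data["income_cat"].value_counts() / len(data)

# A fresh purely random split, taken after income_cat was added, for comparison.
_, random_test_set = train_test_split(housing, test_size=0.2, random_state=42)

compare_props = pd.DataFrame({
    "Overall %": income_cat_proportions(housing),
    "Stratified %": income_cat_proportions(strat_test_set),
    "Random %": income_cat_proportions(random_test_set),
}).sort_index()
print((compare_props * 100).round(2))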

You won't be using the income_cat column again, so you might as well delete it and restore the data to its original state:


for set_ in (strat_train_set, strat_test_set):
    set_.drop("income_cat", axis=1, inplace=True)


We have spent quite a bit of time on test set generation, and for good reason: it is an often overlooked but crucial part of a machine learning project. Moreover, many of these ideas will be useful later when we discuss cross-validation. Now it's time to move on to the next phase: exploring the data.

Source: blog.csdn.net/coco2d_x2014/article/details/133827236