"ML Practice" Regression System: Median House Price Prediction

1. Project Analysis

  • Purpose: use California census data to build a model of housing prices in the state, in order to predict the median house price of any district from the other indicators;

Machine Learning Project Checklist

  • Frame the problem and look at the big picture;
  • Get the data;
  • Explore the data to gain insights;
  • Prepare the data to better expose the underlying patterns to the machine learning algorithms;
  • Explore different models and shortlist the best ones;
  • Fine-tune the models and combine them into a good solution;
  • Present the solution;
  • Launch, monitor, and maintain the system;

1. Frame the problem

  • Business objective: the model's output (the predicted median house price of a district) will be fed, together with other signals, into another machine learning system downstream; that downstream system decides whether a given district is worth investing in;

(figure: a machine learning pipeline for real-estate investment analysis, with the price model feeding a downstream investment system)

  • Pipeline: a sequence of data-processing components; pipelines are very common in machine learning systems for data manipulation and transformation;

The components in a pipeline usually run asynchronously: each component pulls in a large amount of data, processes it, and writes the result to another data store; the next component then pulls that output and produces its own, and so on. Components are independent of one another and connected only through the data stores, which keeps each component simple and lets teams work on them without interfering with each other.

Proper monitoring is still required: if a component breaks, the others can keep running on stale data for a while, so the failure may go unnoticed; left unrepaired for a long time, it gradually degrades the overall system's performance.

  • Current solution: district housing prices are currently estimated manually by a team of experts (they continually gather up-to-date district information and compute the median house price; when it cannot be computed they estimate it with complex rules); this existing solution serves as a performance reference;

  • Expert system: the knowledge is summarized by humans and then taught to the computer; it is expensive to build and maintain, and its results are often unsatisfactory;

This is a typical supervised learning task (labeled training examples are given); it is also a typical multiple regression task (the system predicts a value from multiple features) and a univariate regression task (a single value is predicted for each district); and it is a batch learning system (there is no continuous stream of incoming data, no need to adjust rapidly to changing data, and the data is small enough to fit in memory);

2. Performance indicators

  • Root Mean Square Error (RMSE), based on the Euclidean norm; it reflects how much error the system typically makes in its predictions;

$RMSE(X, h) = \sqrt{ \frac{1}{m} \sum_{i=1}^m \left( h(x^i) - y^i \right)^2 }$

  • m, the number of instances in the dataset on which the RMSE is measured (e.g. if the RMSE is evaluated on a validation set of 2,000 districts, then m = 2000);
  • $x^i$, the vector of all feature values (excluding the label) of the i-th instance in the dataset;
  • $y^i$, the label of that instance (its desired output value);

For example, if the first district in the dataset is at longitude -118.29°, latitude 33.91°, with 1,416 inhabitants, a median income of $38,372, and a median house value of $156,400, then

$x^1 = \begin{pmatrix} -118.29 \\ 33.91 \\ 1416 \\ 38372 \end{pmatrix}$

$y^1 = 156400$

  • X, the matrix containing all the feature values (excluding labels) of all instances in the dataset; each row is one instance, and row i equals the transpose of $x^i$, i.e. $(x^i)^T$;

$X = \begin{pmatrix} (x^1)^T \\ (x^2)^T \\ \vdots \\ (x^{1999})^T \\ (x^{2000})^T \end{pmatrix} = \begin{pmatrix} -118.29 & 33.91 & 1416 & 38372 \\ \vdots & \vdots & \vdots & \vdots \end{pmatrix}$

  • h, the system's prediction function, also called a hypothesis; given an instance's feature vector $x^i$, it outputs a predicted value $\hat{y}^i = h(x^i)$ (if the system predicts that the median house price of the first district is 158,400, then $\hat{y}^1 = h(x^1) = 158400$, and the prediction error is $\hat{y}^1 - y^1 = 2000$);
  • RMSE(X, h), a cost function measured over a set of instances using hypothesis h;

Other performance measures

  • Mean Absolute Error (MAE, also called average absolute deviation), based on the Manhattan norm:

$MAE(X, h) = \frac{1}{m} \sum_{i=1}^m \left| h(x^i) - y^i \right|$

Both RMSE and MAE measure the distance between two vectors: the vector of predictions and the vector of target values; the higher the norm index, the more weight large values get and the more small values are neglected (RMSE is more sensitive to outliers than MAE, but when outliers are exponentially rare, as in a bell-shaped distribution, RMSE performs very well and is generally preferred);
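As a quick illustration, here is a minimal NumPy sketch (with made-up labels and predictions) showing both measures and how a single large error inflates the RMSE much more than the MAE:

import numpy as np

# Hypothetical labels and predictions for five districts (dollar values)
y_true = np.array([156400., 281200., 97300., 310000., 145000.])
y_pred = np.array([158400., 275000., 99000., 255000., 150000.])

errors = y_pred - y_true
rmse = np.sqrt(np.mean(errors ** 2))  # l2 (Euclidean) norm of the errors / sqrt(m)
mae = np.mean(np.abs(errors))         # l1 (Manhattan) norm of the errors / m

print(f"RMSE: {rmse:.0f}")  # dominated by the single large error on the 4th district
print(f"MAE:  {mae:.0f}")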

2. Get data

1. Prepare the workspace

  • Create a workspace directory
export ML_HOME="$HOME/Documents/workspace/projects/aurelius/lmsl/studying/ml/handson-ml2/workspace"
mkdir -p $ML_HOME
  • Install Python (installation details omitted here)

Use a reasonably recent version of Python 3 and keep pip up to date;

# Check the pip version
python3 -m pip --version

# Upgrade pip to the latest version
python3 -m pip install --user -U pip
  • Create a dedicated Python environment
cd $ML_HOME
# Create a dedicated Python environment named `.venv`
python3 -m venv .venv

# Activate the environment
source .venv/bin/activate     # on Linux or macOS
# $ .\.venv\Scripts\activate  # on Windows

# Deactivate the environment
deactivate
  • Install dependent modules

    • requests
    • Jupyter
    • NumPy
    • pandas
    • Matplotlib
    • ScikitLearn
# Install the dependencies via pip
pip install requests jupyter matplotlib numpy pandas scipy scikit-learn

# Register the environment with Jupyter under a name
python -m ipykernel install --user --name=ml-venv
# If installation is slow, switch pip to the Tsinghua mirror source
cd ~/.pip
vi pip.conf
# Add the following configuration to ~/.pip/pip.conf
[global]
index-url = https://pypi.tuna.tsinghua.edu.cn/simple

[install]
trusted-host=pypi.tuna.tsinghua.edu.cn
  • Enable Jupyter Notebook
jupyter notebook

Starting Jupyter Notebook launches a local web server, which can be accessed at http://localhost:8888;

It is also convenient to use the Jupyter extension for VS Code, which removes the need to start the server with the jupyter notebook command (the details are left for you to explore);

2. Download data

The project data is a compressed archive containing a CSV file; it could be downloaded in a browser and unpacked with the tar command, but it is better to write a small Python function so the step is reusable;

import os
import tarfile
import requests

def fetch_data(url, path, tgz):
    if not os.path.isdir(path):
        os.makedirs(path)

    tgz_path = os.path.join(path, tgz)
    with open(tgz_path, 'wb') as w:
        w.write(requests.get(url).content)

    housing_tgz = tarfile.open(tgz_path)
    housing_tgz.extractall(path=path)
    housing_tgz.close()
  • Download and extract the data to the workspace path
import os

DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml2/master/"
HOUSING_PATH = os.path.join("workspace", "datasets", "housing")
HOUSING_URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz"
HOUSING_TGZ = "housing.tgz"

fetch_data(HOUSING_URL, HOUSING_PATH, HOUSING_TGZ)
  • Load and view data using pandas
import pandas as pd

def load_data(path, csv):
    csv_path = os.path.join(path, csv)
    return pd.read_csv(csv_path)

housing = load_data(HOUSING_PATH, 'housing.csv')

3. View data

View the first 5 rows of the dataset

housing.head()

(figure: output of housing.head(), the first five rows of the dataset)

  • Instance attributes
    • longitude: longitude
    • latitude: latitude
    • housing_median_age: housing median age
    • total_rooms: total number of rooms
    • total_bedrooms: total number of bedrooms
    • population: population
    • households: number of households
    • median_income: median income
    • median_house_value: median house price
    • ocean_proximity: proximity to the ocean (a categorical attribute)

View a brief description of the dataset

housing.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  float64
 3   total_rooms         20640 non-null  float64
 4   total_bedrooms      20433 non-null  float64
 5   population          20640 non-null  float64
 6   households          20640 non-null  float64
 7   median_income       20640 non-null  float64
 8   median_house_value  20640 non-null  float64
 9   ocean_proximity     20640 non-null  object
dtypes: float64(9), object(1)
memory usage: 1.6+ MB
  • Dataset summary: it contains 20,640 instances; total_bedrooms has only 20,433 non-null values; ocean_proximity is of type object, while all other attributes are numeric;

View the categories of the categorical attribute

housing['ocean_proximity'].value_counts()

<1H OCEAN     9136
INLAND        6551
NEAR OCEAN    2658
NEAR BAY      2290
ISLAND           5
Name: ocean_proximity, dtype: int64

ocean_proximity takes five distinct values, distributed as shown above;

View a summary of numeric properties

housing.describe()

(figure: output of housing.describe(), summary statistics of the numeric attributes)

  • std, the standard deviation (a measure of how dispersed the values are);
  • 25% / 50% / 75%, the percentiles: the value below which the given percentage of observations falls;
  • count, the number of non-null rows (null values are ignored, which is why total_bedrooms has a count of 20,433);

plot a histogram for each attribute

# Specify which backend Matplotlib should use (not needed in VS Code)
# %matplotlib inline   # only in a Jupyter notebook: use Jupyter's backend so figures render in the notebook
import matplotlib.pyplot as plt

# hist() relies on matplotlib
housing.hist(bins=50, figsize=(20,15))
plt.show()

(figure: histograms of each numeric attribute)

  • median_income is clearly not expressed in plain US dollars but roughly in tens of thousands of dollars, and it has been scaled and capped (upper limit 15, lower limit 0.5); other attributes have also been scaled to varying degrees;
  • housing_median_age and median_house_value were also capped; since median_house_value is the target attribute, this needs special attention, with two options:
    • re-collect proper labels for the districts whose labels were capped;
    • remove those districts from the training and test sets;
  • Most of the histograms are heavy-tailed (a long-tail effect), which can make it harder for some machine learning algorithms to detect patterns; such attributes may need to be transformed into more bell-shaped distributions, as in the sketch below;
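As a rough illustration of such a transformation, the sketch below (assuming the housing DataFrame loaded above) compares the raw, heavy-tailed population attribute with its log transform, one common way to make a distribution more bell-shaped:

import numpy as np
import matplotlib.pyplot as plt

# Compare the raw, right-skewed attribute with its log transform
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
housing["population"].hist(bins=50, ax=axes[0])
axes[0].set_title("population (raw)")
np.log1p(housing["population"]).hist(bins=50, ax=axes[1])
axes[1].set_title("log(1 + population)")
plt.show()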

4. Create a test set

  • Data snooping bias: if you browse the test set in advance, you may stumble on a seemingly interesting pattern in it and choose a particular kind of model as a result; the generalization error estimated on that test set will then be overly optimistic, and the system will fall short of expectations in production;

Randomly select some instances (typically 20%, or less if the dataset is very large) and set them aside;

import numpy as np

def split_train_test(data, test_ratio):
    shuffled_indices = np.random.permutation(len(data))
    test_set_size = int(len(data) * test_ratio)
    test_indices = shuffled_indices[:test_set_size]
    train_indices = shuffled_indices[test_set_size:]
    return data.iloc[train_indices], data.iloc[test_indices]

train_set, test_set = split_train_test(housing, 0.2)
print(len(train_set), len(test_set))
# 16512 4128

Running this split repeatedly produces a different test set each time, so over many runs the learning algorithm effectively gets to see the whole dataset, which is exactly what a test set must prevent;

You can keep the test set identical across runs by saving it to disk, or by fixing the random number generator's seed (e.g. np.random.seed(42)) so the shuffled indices are reproducible;

Both approaches break as soon as the dataset is updated. A more robust solution is to use a deterministic rule (such as a hash) on each instance's unique identifier to decide whether it goes into the test set;

from zlib import crc32

# Deterministic rule deciding whether an instance goes into the test set
def test_set_check(identifier, test_ratio):
    return crc32(np.int64(identifier)) & 0xffffffff < test_ratio * 2**32

def split_train_test_by_id(data, test_ratio, id_column):
    ids = data[id_column]
    in_test_set = ids.apply(lambda id_: test_set_check(id_, test_ratio))
    return data.loc[~in_test_set], data.loc[in_test_set]
  • Use the row index as the unique identifier
housing_with_id = housing.reset_index()   # adds an `index` column
train_set, test_set = split_train_test_by_id(housing_with_id, 0.2, "index")

This requires that new data only ever be appended to the end of the dataset and that no row ever be deleted;

  • Use the longitude and latitude to build the unique identifier
housing_with_id["id"] = housing["longitude"] * 1000 + housing["latitude"]
train_set, test_set = split_train_test_by_id(housing_with_id, 0.2, "id")

Use Scikit-Learn

  • train_test_split()
from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)

random_state fixes the random number generator's seed; you can also pass several datasets with the same number of rows in a single call, and they will be split on the same indices;

  • Random sampling: fine when the dataset is large enough (relative to the number of attributes); otherwise it risks introducing significant sampling bias;

  • Stratified sampling: divide the dataset into homogeneous subgroups (strata) based on an attribute, then draw the same proportion of instances from each stratum and merge them into the test set;

Preserve the original distribution of important attributes in the test set

Median income is an important attribute for predicting the median house price, so the test set should represent the various income categories present in the whole dataset;

# Bin income into 5 strata: 0-1.5, 1.5-3, 3-4.5, 4.5-6, 6-infinity
housing["income_cat"] = pd.cut(housing["median_income"],
                               bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                               labels=[1, 2, 3, 4, 5])

housing["income_cat"].hist()

(figure: histogram of the income_cat attribute)

Stratified sampling by income category via Scikit-Learn's StratifiedShuffleSplit;

from sklearn.model_selection import StratifiedShuffleSplit
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]

# Check the income-category proportions in the stratified test set
strat_test_set["income_cat"].value_counts() / len(strat_test_set)

3    0.350533
2    0.318798
4    0.176357
5    0.114341
1    0.039971
Name: income_cat, dtype: float64

Compare the income-category proportions in the full dataset, the stratified test set, and a purely random test set;

def income_cat_proportions(data):
    return data["income_cat"].value_counts() / len(data)

train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)

compare_props = pd.DataFrame({
    "Overall": income_cat_proportions(housing),
    "Stratified": income_cat_proportions(strat_test_set),
    "Random": income_cat_proportions(test_set),
}).sort_index()
compare_props["Rand. %error"] = 100 * compare_props["Random"] / compare_props["Overall"] - 100
compare_props["Strat. %error"] = 100 * compare_props["Stratified"] / compare_props["Overall"] - 100

compare_props

(figure: the compare_props table, comparing income-category proportions in the full dataset, the stratified test set, and the random test set)

Remove the income_cat attribute so the data is back to its original state;

for set_ in (strat_train_set, strat_test_set):
    set_.drop("income_cat", axis=1, inplace=True)

3. Data Exploration

Create a copy of the training set so that subsequent experiments do not damage it;

housing = strat_train_set.copy()

1. Geolocation Visualization

Plot the distribution of latitude and longitude by data density

housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.1)

(figure: scatter plot of districts by longitude and latitude, alpha=0.1)

High-density areas can be clearly distinguished from the figure;

Plot the districts again, overlaying population and median house price

housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4,
    s=housing["population"]/100, label="population", figsize=(10,7),
    c="median_house_value", cmap=plt.get_cmap("jet"), colorbar=True,
)
plt.legend()

(figure: scatter plot of districts, circle size = population, color = median house value)

The population is represented by the radius of each circle (option s) and the median house price by the color (option c); the color range (option cmap) comes from the predefined color map jet;

It can be confirmed from the figure that housing prices are closely related to geographical location and population density;

2. Look for correlations

Use corr() to compute the standard (Pearson) correlation coefficient between every pair of attributes

corr_matrix = housing.corr()
corr_matrix["median_house_value"].sort_values(ascending=False)

median_house_value    1.000000
median_income         0.687151
total_rooms           0.135140
housing_median_age    0.114146
households            0.064590
total_bedrooms        0.047781
population           -0.026882
longitude            -0.047466
latitude             -0.142673
Name: median_house_value, dtype: float64
  • Correlation coefficient: measures only linear correlation, ranging from -1 to 1; the closer to 1, the stronger the positive correlation; the closer to -1, the stronger the negative correlation; 0 means there is no linear correlation between the two;

Plotting correlations using pandas' scatter_matrix()

from pandas.plotting import scatter_matrix

attributes = ["median_house_value", "median_income", "total_rooms", "housing_median_age"]
scatter_matrix(housing[attributes], figsize=(12, 8))

(figure: scatter matrix of the four selected attributes)

The main diagonal shows the histogram of each attribute; other positions show the correlation between attributes;

  • Zoom in on the most promising attribute for predicting the median house price: median income (the most correlated attribute);
housing.plot(kind="scatter", x="median_income", y="median_house_value", alpha=0.1)

(figure: scatter plot of median_house_value versus median_income)

The figure confirms that the correlation is indeed strong; it also shows clear horizontal lines around $500,000, $450,000, and $350,000, probably caused by price caps in the data. To keep the learning algorithm from reproducing these quirks, the corresponding districts could be removed;

3. Attribute combinations

The correlation analysis above shows that some anomalous data (such as the horizontal lines) should be cleaned up beforehand and that some heavy-tailed distributions should be transformed (for example by taking the logarithm); in addition, trying out attribute combinations may reveal new, more strongly correlated attributes;

  • Try to combine attributes and observe the correlation with the target attribute
housing["rooms_per_household"] = housing["total_rooms"]/housing["households"]
housing["bedrooms_per_room"] = housing["total_bedrooms"]/housing["total_rooms"]
housing["population_per_household"]=housing["population"]/housing["households"]

corr_matrix = housing.corr()
corr_matrix["median_house_value"].sort_values(ascending=False)

median_house_value          1.000000
median_income               0.687151
rooms_per_household         0.146255
total_rooms                 0.135140
housing_median_age          0.114146
households                  0.064590
total_bedrooms              0.047781
population_per_household   -0.021991
population                 -0.026882
longitude                  -0.047466
latitude                   -0.142673
bedrooms_per_room          -0.259952
Name: median_house_value, dtype: float64

The new attribute bedrooms_per_room is considerably more correlated with the median house value than the original attributes it was built from (total_bedrooms, total_rooms);

4. Data preparation

Create a clean copy of the training set, separating the predictors from the labels;

housing = strat_train_set.drop("median_house_value", axis=1)
housing_labels = strat_train_set["median_house_value"].copy()

drop() does not modify strat_train_set; it returns a new copy of the data;

1. Data cleaning

Handle the missing values in total_bedrooms

  • dropna(), drop the districts with missing values;
  • drop(), drop the whole attribute;
  • fillna(), fill the missing values with some value (zero, the mean, the median, etc.);
housing.dropna(subset=["total_bedrooms"])    # option 1
housing.drop("total_bedrooms", axis=1)       # option 2
median = housing["total_bedrooms"].median()  # option 3
housing["total_bedrooms"].fillna(median, inplace=True)

Handling missing values ​​with Scikit-Learn's SimpleImputer

from sklearn.impute import SimpleImputer
# Create an imputer that fills missing values with the median
imputer = SimpleImputer(strategy="median")
# The median can only be computed on numeric attributes, so drop ocean_proximity
housing_num = housing.drop("ocean_proximity", axis=1)
# fit() adapts the imputer to the training data (computes each attribute's median and stores it in statistics_)
imputer.fit(housing_num)
# Inspect the learned medians
imputer.statistics_
# Compare with the medians computed directly on the DataFrame
housing_num.median().values

# transform() replaces the missing values with the learned medians
X = imputer.transform(housing_num)

# Load the NumPy array back into a pandas DataFrame
housing_tr = pd.DataFrame(X, columns=housing_num.columns, index=housing_num.index)

2. Design of Scikit-Learn

  • Consistency: Scikit-Learn's API design follows the principle of consistency, so all objects share a simple and consistent interface;

Estimators

Any object that can estimate some parameters from a dataset (such as the imputer estimating medians) is an estimator; the estimation is performed by fit(), which takes a dataset as its parameter (or two, the second being the label set in supervised learning); any other parameter that guides the estimation process is a hyperparameter (e.g. strategy="median"), and it must be set as an instance variable (usually via the constructor);

Transformers

An estimator that can also transform a dataset (such as the imputer) is called a transformer; transform() takes the dataset to transform as a parameter and returns the transformed dataset; the transformation generally relies on the learned parameters (e.g. imputer.statistics_);

fit_transform() is equivalent to calling fit() and then transform(), but it is sometimes optimized and runs faster;

Predictors

An estimator capable of making predictions given a dataset is called a predictor (e.g. the LinearRegression model); predict() takes a dataset of new instances and returns a dataset of corresponding predictions;

score() measures the quality of the predictions on a given test set (and the corresponding labels, in the case of supervised algorithms);

  • Inspection

All of an estimator's hyperparameters are accessible directly via public instance variables (e.g. imputer.strategy);
all of its learned parameters are accessible via public attributes with an underscore suffix (e.g. imputer.statistics_);

  • Nonproliferation of classes

Datasets are represented as NumPy arrays or SciPy sparse matrices rather than custom types;
hyperparameters are just plain Python strings or numbers;

  • Composition

Existing building blocks are reused as much as possible (any sequence of transformers, followed by a final predictor, can be assembled into a Pipeline estimator);

  • Sensible defaults

Scikit-Learn provides sensible default values for most parameters, making it easy to quickly build a baseline working system; a minimal sketch of the three interfaces follows;
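A minimal sketch of the estimator/transformer/predictor interfaces, reusing the housing_num and housing_labels variables defined above (the particular objects are illustrative, not part of the original walkthrough):

from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression

# Estimator: fit() learns parameters from the data
imputer = SimpleImputer(strategy="median")  # strategy is a hyperparameter
imputer.fit(housing_num)
print(imputer.strategy)                     # hyperparameter: public instance variable
print(imputer.statistics_)                  # learned parameter: trailing underscore

# Transformer: transform() applies what fit() learned
X = imputer.transform(housing_num)

# Predictor: predict() maps new instances to predictions; score() rates them
lin_reg = LinearRegression()
lin_reg.fit(X, housing_labels)
print(lin_reg.predict(X[:5]))
print(lin_reg.score(X, housing_labels))     # R^2 on the training data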

3. Handling text and categorical attributes

View the first 10 rows of the text attribute

housing_cat = housing[["ocean_proximity"]]
housing_cat.head(10)

	ocean_proximity
12655	INLAND
15502	NEAR OCEAN
2908	INLAND
14053	NEAR OCEAN
20496	<1H OCEAN
1481	NEAR BAY
18125	<1H OCEAN
5830	<1H OCEAN
17989	<1H OCEAN
4861	<1H OCEAN

ocean_proximity is not free-form text but a limited set of possible values, i.e. a categorical attribute;

Convert text attributes to numeric attributes using Scikit-Learn's OrdinalEncoder

from sklearn.preprocessing import OrdinalEncoder

ordinal_encoder = OrdinalEncoder()
housing_cat_encoded = ordinal_encoder.fit_transform(housing_cat)

housing_cat_encoded[:10]

array([[1.],
       [4.],
       [1.],
       [4.],
       [0.],
       [3.],
       [0.],
       [0.],
       [0.],
       [0.]])

view category list

ordinal_encoder.categories_

[array(['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN'],
       dtype=object)]
  • One-hot encoding: create one binary attribute per category (1 means hot, 0 means cold); this avoids the pitfall of ordinal encoding, where the algorithm may assume that two numerically close values are more similar than two distant ones;

Convert the text attribute to one-hot vectors using Scikit-Learn's OneHotEncoder

from sklearn.preprocessing import OneHotEncoder

cat_encoder = OneHotEncoder()
housing_cat_1hot = cat_encoder.fit_transform(housing_cat)
housing_cat_1hot
# The output is a SciPy sparse matrix
<16512x5 sparse matrix of type '<class 'numpy.float64'>'
	with 16512 stored elements in Compressed Sparse Row format>
  • Sparse matrix: stores only the locations of the non-zero elements, yet can largely be used like a normal two-dimensional array;

View 2D array representation of sparse matrix

housing_cat_1hot.toarray()

array([[0., 1., 0., 0., 0.],
       [0., 0., 0., 0., 1.],
       [0., 1., 0., 0., 0.],
       ...,
       [1., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0.]])

View the category list of encoders

cat_encoder.categories_

[array(['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN'],
       dtype=object)]

If a categorical attribute has a large number of possible categories, one-hot encoding produces a large number of input features, which may slow down training and degrade performance. In that case it may be better to replace the category with a related numeric feature (e.g. replace ocean_proximity with a distance to the ocean) or with a learnable low-dimensional vector (an embedding); a sketch of the first idea follows;
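A hedged sketch of replacing the categorical attribute with a single numeric feature; the distance values below are made-up placeholders, only to illustrate the encoding:

# Hypothetical "distance to the ocean" values (km) standing in for each category
ocean_distance_km = {
    "ISLAND": 0.0,
    "NEAR BAY": 2.0,
    "NEAR OCEAN": 2.0,
    "<1H OCEAN": 20.0,
    "INLAND": 100.0,
}
housing_ocean_distance = housing["ocean_proximity"].map(ocean_distance_km)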

4. Custom transformers

Custom transformers can implement cleanup operations or combine specific attributes while working seamlessly with Scikit-Learn's built-in components;

Scikit-Learn relies on duck typing rather than inheritance: any class that provides fit() (returning self), transform(), and fit_transform() will work;

  • TransformerMixin provides fit_transform() automatically;
  • BaseEstimator provides get_params() and set_params(), which are useful for automatic hyperparameter tuning;

Implementing the attribute combinations above as a custom transformer

from sklearn.base import BaseEstimator, TransformerMixin

rooms_ix, bedrooms_ix, population_ix, households_ix = 3, 4, 5, 6

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room = True): # no *args or **kargs
        self.add_bedrooms_per_room = add_bedrooms_per_room

    def fit(self, X, y=None):
        return self  # nothing else to do

    def transform(self, X):
        rooms_per_household = X[:, rooms_ix] / X[:, households_ix]
        population_per_household = X[:, population_ix] / X[:, households_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household,
                        bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]

attr_adder = CombinedAttributesAdder(add_bedrooms_per_room=False)
housing_extra_attribs = attr_adder.transform(housing.values)

The hyperparameter add_bedrooms_per_room controls whether the bedrooms_per_room attribute is added, which makes it easy to switch this combination on or off later (for example during a hyperparameter search);

5. Feature Scaling

The most important transformation that needs to be applied to the data is feature scaling;

Two common ways to get all attributes onto the same scale are min-max scaling and standardization (a short sketch follows);

  • Min-max scaling (normalization): values are shifted and rescaled so they end up in the range 0 to 1 (subtract the minimum and divide by the difference between the maximum and the minimum); Scikit-Learn's MinMaxScaler transformer does this, and its feature_range hyperparameter lets you change the range;

  • Standardization: subtract the mean (standardized values always have zero mean) and divide by the standard deviation, so the result has unit variance; standardization does not bound values to a specific range and is much less affected by outliers; Scikit-Learn's StandardScaler transformer does this;
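A minimal sketch of both scalers, applied to the imputed numeric DataFrame housing_tr built above (shown standalone here; in practice the scaler goes into the pipeline of the next section):

from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Min-max scaling: values end up between 0 and 1 (feature_range can change this)
min_max_scaler = MinMaxScaler(feature_range=(0, 1))
housing_num_minmax = min_max_scaler.fit_transform(housing_tr)

# Standardization: zero mean and unit variance, less affected by outliers
std_scaler = StandardScaler()
housing_num_std = std_scaler.fit_transform(housing_tr)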

6. Pipelines

  • Pipeline: a sequence of data transformation steps executed in a fixed order; Scikit-Learn's Pipeline class provides exactly this;
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

num_pipeline = Pipeline([
        ('imputer', SimpleImputer(strategy="median")),
        ('attribs_adder', CombinedAttributesAdder()),
        ('std_scaler', StandardScaler()),
    ])

housing_num_tr = num_pipeline.fit_transform(housing_num)

The Pipeline constructor takes a list of name/estimator pairs defining the sequence of steps; all but the last must be transformers (i.e. implement fit_transform()), while the last can be any kind of estimator;

Calling the pipeline's fit() calls fit_transform() on each transformer in turn, passing each output as the input of the next, until the final estimator, on which only fit() is called;

Use Scikit-Learn's ColumnTransformer transformer to process all columns

from sklearn.compose import ColumnTransformer
num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"]

full_pipeline = ColumnTransformer([
        ("num", num_pipeline, num_attribs),
        ("cat", OneHotEncoder(), cat_attribs),
    ])
housing_prepared = full_pipeline.fit_transform(housing)

ColumnTransformer applies each transformer to the listed columns of the dataset and concatenates the outputs along the second axis (all transformers must return the same number of rows);

When sparse and dense outputs are mixed, ColumnTransformer estimates the density of the final matrix (the ratio of non-zero cells) and returns a sparse matrix only if the density is below a given threshold (sparse_threshold, default 0.3);

5. Select and train the model

1. Training and evaluating on the training set

Train a linear regression model

from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(housing_prepared, housing_labels)

Test predictions using instances from the training set

some_data = housing.iloc[:5]
some_labels = housing_labels.iloc[:5]
some_data_prepared = full_pipeline.transform(some_data)
print("Predictions:", lin_reg.predict(some_data_prepared))
print("Labels:", list(some_labels))

Predictions: [ 86208. 304704. 153536. 185728. 244416.]
Labels: [72100.0, 279600.0, 82700.0, 112500.0, 238300.0]

Measuring the RMSE of the regression model on the training set

Use Scikit-Learn's mean_squared_error() for root mean square error measurement;

from sklearn.metrics import mean_squared_error
housing_predictions = lin_reg.predict(housing_prepared)
lin_mse = mean_squared_error(housing_labels, housing_predictions)
lin_rmse = np.sqrt(lin_mse)
print(lin_rmse)

68633.40810776998

A typical prediction error of about $68,633 is large given that most districts' median_house_value lies between $120,000 and $265,000; an error of this size shows the model is underfitting the training data;

The main remedies are to choose a more powerful model, to feed the algorithm better features, or to reduce the constraints on the model;

Train a decision tree using DecisionTreeRegressor

Decision trees can find complex non-linear relationships in the data;

from sklearn.tree import DecisionTreeRegressor
tree_reg = DecisionTreeRegressor()
tree_reg.fit(housing_prepared, housing_labels)

Measuring the RMSE of the regression model on the training set

housing_predictions = tree_reg.predict(housing_prepared)
tree_mse = mean_squared_error(housing_labels, housing_predictions)
tree_rmse = np.sqrt(tree_mse)
print(tree_rmse)

0.0

An error of 0 means that the model is either absolutely perfect (which is impossible) or severely overfits the data;

2. Cross Validation

Evaluate the decision tree model with cross-validation;

K-fold cross-validation using Scikit-Learn's cross_val_score

from sklearn.model_selection import cross_val_score

scores = cross_val_score(tree_reg, housing_prepared, housing_labels, scoring="neg_mean_squared_error", cv=10)
tree_rmse_scores = np.sqrt(-scores)

cross_val_score expects a utility function (greater is better), so the scores are negative MSE values; np.sqrt(-scores) converts them to RMSE;

def display_scores(scores):
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Standard deviation:", scores.std())

display_scores(tree_rmse_scores)

Scores: [73444.02930862 69237.91537492 67003.65412022 71810.57760783
 70631.08058123 77465.52053272 70962.67507776 73613.93631416
 68442.91744801 72364.26672416]
Mean: 71497.65730896383
Standard deviation: 2835.532019536459

The decision tree's mean RMSE across the validation folds is about 71,497 (versus 0 on the training set), with a standard deviation of about 2,835;

Cross Validation for Linear Regression Models

lin_scores = cross_val_score(lin_reg, housing_prepared, housing_labels,
                             scoring="neg_mean_squared_error", cv=10)
lin_rmse_scores = np.sqrt(-lin_scores)
display_scores(lin_rmse_scores)

Scores: [71800.38078269 64114.99166359 67844.95431254 68635.19072082
 66801.98038821 72531.04505346 73992.85834976 68824.54092094
 66474.60750419 70143.79750458]
Mean: 69116.4347200802
Standard deviation: 2880.6588594759014

The linear regression model's mean RMSE across the validation folds is about 69,116 (training set: 68,633), with a standard deviation of about 2,880;

The decision tree's validation RMSE is actually worse than linear regression's, which confirms that the tree is badly overfitting;

Train Random Forests with RandomForestRegressor

  • Random forest: train many decision trees on random subsets of the features (and of the data), then average their predictions; building a model on top of many other models is called ensemble learning;
from sklearn.ensemble import RandomForestRegressor

forest_reg = RandomForestRegressor()
forest_reg.fit(housing_prepared, housing_labels)
housing_predictions = forest_reg.predict(housing_prepared)
forest_mse = mean_squared_error(housing_labels, housing_predictions)
forest_rmse = np.sqrt(forest_mse)
print(forest_rmse)

forest_scores = cross_val_score(forest_reg, housing_prepared, housing_labels,
                scoring="neg_mean_squared_error", cv=10)
forest_rmse_scores = np.sqrt(-forest_scores)
display_scores(forest_rmse_scores)

18580.285001969234
Scores: [51420.10657898 48950.26905778 46724.70163181 52032.16751813
 47382.48485738 51644.10218989 52532.85241798 50040.96772226
 48869.83863791 53727.35461654]
Mean: 50332.484522865096
Standard deviation: 2191.1726721020977

The training-set RMSE is about 18,580 and the mean validation score about 50,332, with a standard deviation of about 2,191. The random forest performs much better than the previous two models, but the training score being far below the validation score shows it is still overfitting;

Before simplifying or constraining the model, try a few more machine learning algorithms (support vector machines with different kernels, neural networks, etc.) to shortlist a handful of promising models; do not spend too much time tweaking hyperparameters yet;

save model

import joblib

joblib.dump(forest_reg, "./workspace/models/forest_reg.pkl")
# and later, reload model...
forest_reg_loaded = joblib.load("./workspace/models/forest_reg.pkl")

6. Fine-tuning the model

Once you have several valid candidate models, you can fine-tune them;

1. Grid search

  • Grid search: Scikit-Learn's GridSearchCV evaluates every combination of the hyperparameter values you specify, using cross-validation, and reports the best combination;
from sklearn.model_selection import GridSearchCV

param_grid = [
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
    ]
forest_reg = RandomForestRegressor()
grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                            scoring='neg_mean_squared_error',
                            return_train_score=True)
grid_search.fit(housing_prepared, housing_labels)
  • param_grid, the hyperparameter grid;
    • the first dict tries n_estimators and max_features, giving 3 * 4 = 12 combinations;
    • the second dict sets bootstrap=False and tries n_estimators and max_features, giving 1 * 2 * 3 = 6 combinations;
  • cv=5, each of the 18 combinations is trained 5 times (5-fold cross-validation), i.e. 90 training runs in total;
  • refit=True (the default) makes GridSearchCV retrain the best estimator found by cross-validation on the whole training set (more data usually improves performance);

View grid search results

grid_search.best_params_

{'max_features': 6, 'n_estimators': 30}

get the best estimator

grid_search.best_estimator_

Estimator's evaluation score

cvres = grid_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)

63475.5397459137 {'max_features': 2, 'n_estimators': 3}
55754.473565553184 {'max_features': 2, 'n_estimators': 10}
52830.64714547093 {'max_features': 2, 'n_estimators': 30}
60296.33920014068 {'max_features': 4, 'n_estimators': 3}
52504.03498357088 {'max_features': 4, 'n_estimators': 10}
50328.7606181505 {'max_features': 4, 'n_estimators': 30}
59328.255990059035 {'max_features': 6, 'n_estimators': 3}
51909.34406264884 {'max_features': 6, 'n_estimators': 10}
49802.234477838996 {'max_features': 6, 'n_estimators': 30}
58997.87515871176 {'max_features': 8, 'n_estimators': 3}
52036.752607340735 {'max_features': 8, 'n_estimators': 10}
50321.971231209965 {'max_features': 8, 'n_estimators': 30}
62389.547952235145 {'bootstrap': False, 'max_features': 2, 'n_estimators': 3}
53800.36505088281 {'bootstrap': False, 'max_features': 2, 'n_estimators': 10}
59953.45347364427 {'bootstrap': False, 'max_features': 3, 'n_estimators': 3}
52115.46931655621 {'bootstrap': False, 'max_features': 3, 'n_estimators': 10}
59061.9294179386 {'bootstrap': False, 'max_features': 4, 'n_estimators': 3}
52197.755732390906 {'bootstrap': False, 'max_features': 4, 'n_estimators': 10}

The best combination (max_features=6, n_estimators=30) has an RMSE of about 49,802, slightly better than the 50,332 obtained with the default hyperparameters, so the model has improved;

Hyperparameters defined during the data preparation phase (controlling outlier handling, missing features, feature selection, and so on) can be included in the grid search too, so the best overall solution is explored automatically; a hedged sketch follows;
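A hedged sketch of that idea, wrapping the preparation pipeline and the model into one Pipeline so a data-preparation hyperparameter (add_bedrooms_per_room from the custom transformer above) is searched together with a model hyperparameter; the nested parameter names follow Scikit-Learn's step__substep__param convention, and this small grid is purely illustrative:

from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor

# One pipeline from raw DataFrame to predictions
full_model_pipeline = Pipeline([
    ("preparation", full_pipeline),                      # the ColumnTransformer built earlier
    ("forest", RandomForestRegressor(random_state=42)),
])

prep_param_grid = {
    "preparation__num__attribs_adder__add_bedrooms_per_room": [True, False],
    "forest__n_estimators": [10, 30],
}

prep_search = GridSearchCV(full_model_pipeline, prep_param_grid, cv=5,
                           scoring="neg_mean_squared_error")
prep_search.fit(housing, housing_labels)   # raw predictors, not housing_prepared
print(prep_search.best_params_)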

2. Random search

  • Random search: Scikit-Learn's RandomizedSearchCV works much like GridSearchCV, but instead of trying every combination it samples a random value for each hyperparameter at every iteration and evaluates a fixed number of random combinations;
    • A random search can be run for as many iterations as you like, exploring different hyperparameter values each time, whereas a grid search only ever explores the fixed values you listed;
    • Setting the number of iterations gives direct control over the computing budget allocated to the search;

Evaluate Support Vector Machine Regressors Using RandomizedSearchCV

from sklearn.svm import SVR
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import expon, reciprocal

# see https://docs.scipy.org/doc/scipy/reference/stats.html
# for `expon()` and `reciprocal()` documentation and more probability distribution functions.

# Note: gamma is ignored when kernel is "linear"
param_distribs = {
        'kernel': ['linear', 'rbf'],
        'C': reciprocal(20, 200000),
        'gamma': expon(scale=1.0),
    }

svm_reg = SVR()
rnd_search = RandomizedSearchCV(svm_reg, param_distributions=param_distribs,
                                n_iter=50, cv=5, scoring='neg_mean_squared_error',
                                verbose=2, random_state=42)
rnd_search.fit(housing_prepared, housing_labels)

Fitting 5 folds for each of 50 candidates, totalling 250 fits
[CV] END C=629.782329591372, gamma=3.010121430917521, kernel=linear; total time=   3.3s
[CV] END C=629.782329591372, gamma=3.010121430917521, kernel=linear; total time=   3.3s
[CV] END C=629.782329591372, gamma=3.010121430917521, kernel=linear; total time=   3.2s
[CV] END C=629.782329591372, gamma=3.010121430917521, kernel=linear; total time=   3.2s
[CV] END C=629.782329591372, gamma=3.010121430917521, kernel=linear; total time=   3.2s
...

negative_mse = rnd_search.best_score_
rmse = np.sqrt(-negative_mse)
print(rmse)

54767.960710084146

print(rnd_search.best_params_)

{'C': 157055.10989448498, 'gamma': 0.26497040005002437, 'kernel': 'rbf'}

The random search found a good set of hyperparameters for the support vector machine regressor, with a final RMSE of about 54,768;

3. Ensemble methods

  • Ensemble methods: combining the best-performing models usually works better than any individual model (just as a random forest outperforms the individual decision trees it is built from), especially when the individual models make different types of errors; a hedged sketch follows;
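A hedged sketch of a simple ensemble using Scikit-Learn's VotingRegressor, which averages the predictions of the models tried above (the particular combination is illustrative, not a tuned solution):

from sklearn.ensemble import VotingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Average the predictions of three different regressors
voting_reg = VotingRegressor([
    ("lin", LinearRegression()),
    ("tree", DecisionTreeRegressor(random_state=42)),
    ("forest", RandomForestRegressor(random_state=42)),
])
voting_reg.fit(housing_prepared, housing_labels)
ensemble_predictions = voting_reg.predict(housing_prepared)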

4. Analyze the best models and their errors

View the relative importance of each attribute

feature_importances = grid_search.best_estimator_.feature_importances_
print(feature_importances)

array([8.30181927e-02, 7.09849240e-02, 4.24425223e-02, 1.76691115e-02,
       1.61540923e-02, 1.71789859e-02, 1.59395934e-02, 3.39837758e-01,
       6.50843504e-02, 1.04717194e-01, 6.48945156e-02, 1.47186585e-02,
       1.38881431e-01, 6.76526692e-05, 3.02499407e-03, 5.38602332e-03])

extra_attribs = ["rooms_per_hhold", "pop_per_hhold", "bedrooms_per_room"]
cat_encoder = full_pipeline.named_transformers_["cat"]
cat_one_hot_attribs = list(cat_encoder.categories_[0])
attributes = num_attribs + extra_attribs + cat_one_hot_attribs
sorted(zip(feature_importances, attributes), reverse=True)

[(0.3398377582278221, 'median_income'),
 (0.13888143088401578, 'INLAND'),
 (0.10471719429817675, 'pop_per_hhold'),
 (0.0830181926813895, 'longitude'),
 (0.07098492396156919, 'latitude'),
 (0.06508435039879204, 'rooms_per_hhold'),
 (0.06489451561779028, 'bedrooms_per_room'),
 (0.042442522257867, 'housing_median_age'),
 (0.017669111520336293, 'total_rooms'),
 (0.017178985883288055, 'population'),
 (0.016154092256827887, 'total_bedrooms'),
 (0.015939593408818325, 'households'),
 (0.0147186585483286, '<1H OCEAN'),
 (0.005386023320075893, 'NEAR OCEAN'),
 (0.0030249940656810405, 'NEAR BAY'),
 (6.765266922142473e-05, 'ISLAND')]

You can try dropping some of the less useful features (here only one ocean_proximity category, INLAND, is really informative; the other categories could be dropped);

The model can also be improved by adding extra features, removing uninformative ones, and removing outliers; a sketch of keeping only the most important features follows;
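A hedged sketch of dropping the less useful features: keep only the k features with the highest importances from the grid search's best forest (k and the helper below are illustrative):

import numpy as np

def top_k_feature_indices(importances, k):
    # Indices of the k largest importances, returned in ascending column order
    return np.sort(np.argpartition(importances, -k)[-k:])

k = 5
top_k = top_k_feature_indices(feature_importances, k)
print([attributes[i] for i in top_k])

# Keep only those columns of the prepared training set (a dense NumPy array here)
housing_prepared_top_k = housing_prepared[:, top_k]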

5. Evaluate the system on the test set

Evaluate the final model on the test set

final_model = grid_search.best_estimator_
X_test = strat_test_set.drop("median_house_value", axis=1)
y_test = strat_test_set["median_house_value"].copy()
X_test_prepared = full_pipeline.transform(X_test)
final_predictions = final_model.predict(X_test_prepared)
final_mse = mean_squared_error(y_test, final_predictions)
final_rmse = np.sqrt(final_mse)
print(final_rmse)

47785.02562107877

Compute a 95% confidence interval for the generalization error using scipy.stats.t.interval()

from scipy import stats

confidence = 0.95
squared_errors = (final_predictions - y_test) ** 2
np.sqrt(stats.t.interval(confidence, len(squared_errors) - 1,
                         loc=squared_errors.mean(),
                         scale=stats.sem(squared_errors)))

array([45805.04012754, 49686.17157851])

The test-set result is often slightly worse than the cross-validation result. Resist the temptation to keep tweaking hyperparameters until the test-set numbers look better: such improvements would not generalize to new data;

The system's final performance may be no better than the experts' (for example about 20% worse), but it can still be worth launching: it provides useful signals and frees the expert team from part of their workload;

The model's strengths and weaknesses can also be examined on specific slices of the test set (e.g. inland districts versus districts close to the ocean);

7. Deployment, monitoring and system maintenance

1. Deployment

Exposing services via REST API

(figure: deployment architecture, with the saved model exposed through a web service and consumed by a web app)

  • Serialize and save the trained Scikit-Learn model with joblib, including the complete preprocessing and prediction pipeline;
  • Load the model inside a web service in the production environment and expose an endpoint that calls its predict function;
  • A web app in front of the model service can accept new data, call the prediction endpoint, and present the results to desktop and mobile users; a minimal sketch of such a service follows;
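A minimal sketch of such a web service using Flask (one possible choice; the file names and the saved full_pipeline.pkl are assumptions, since only the model itself was saved above):

import joblib
import pandas as pd
from flask import Flask, request, jsonify

app = Flask(__name__)
model = joblib.load("./workspace/models/forest_reg.pkl")        # saved earlier
pipeline = joblib.load("./workspace/models/full_pipeline.pkl")  # hypothetical: save full_pipeline the same way

@app.route("/predict", methods=["POST"])
def predict():
    # Expect a JSON list of districts with the same fields as housing.csv
    districts = pd.DataFrame(request.get_json())
    prepared = pipeline.transform(districts)
    predictions = model.predict(prepared)
    return jsonify(predictions.tolist())

if __name__ == "__main__":
    app.run(port=5000)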

Deploy via Google Cloud AI Platform

  • Upload the joblib-serialized model to Google Cloud Storage (GCS);
  • Create a new model version on Google Cloud AI Platform, pointing it at the model file on GCS;
  • Google Cloud AI Platform then exposes a simple web service (similar to the model service described above);

2. System monitoring

  • Monitoring goals

    • Write monitoring code that regularly checks the system's live performance and triggers alerts when it degrades;
  • What to watch for

    • A broken component in the infrastructure can degrade the system's performance;
    • A slight drop in performance can go unnoticed for a long time;
    • The outside world keeps changing, so after a while the trained model may no longer fit the newly arriving data;
  • Ways to evaluate

    • The model's performance can sometimes be inferred from downstream metrics (e.g. in a recommender system, the number of purchases of recommended versus non-recommended products reflects the recommender's performance);
    • Human analysis can be brought into the loop (experts, non-experts, or workers on crowdsourcing platforms labeling data; Google's CAPTCHAs also double as a way of labeling training data);
    • Monitoring the quality of the model's input data (e.g. comparing the mean and standard deviation of incoming data with the training set, or watching for new categories in a categorical feature) can catch the cause of degradation early; a sketch follows;
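A hedged sketch of that last point, comparing the mean and standard deviation of an incoming batch against statistics stored from the training set (the threshold and helper are illustrative):

def check_input_drift(new_batch, train_stats, threshold=0.25):
    """Return the columns whose mean drifted more than `threshold` training std-devs."""
    alerts = []
    for column, (train_mean, train_std) in train_stats.items():
        drift = abs(new_batch[column].mean() - train_mean) / (train_std + 1e-9)
        if drift > threshold:
            alerts.append((column, round(drift, 2)))
    return alerts

# train_stats is built once from the training data, for example:
# train_stats = {c: (housing_num[c].mean(), housing_num[c].std()) for c in housing_num.columns}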

3. System Maintenance

The best practice for system maintenance is to automate the entire process;

What to do for system maintenance

  • Regularly collect new data and label it (manually if necessary);
  • Write a script that retrains the model regularly and automatically fine-tunes the hyperparameters (schedule it to run as often as needed);
  • Write a script that evaluates both the new model and the old model on an updated test set, and compare their performance to decide whether to replace the production model;
  • Keep every model version for quick rollback, and keep every dataset version so you can roll back if a newer dataset turns out to be corrupted (e.g. full of outliers) and can evaluate other models against it;

Machine learning involves a lot of infrastructure work, so it is normal for a first machine learning project to spend much of its time building and deploying these components; once the process is in place, launching and iterating on model services becomes much easier;

It is recommended to pick a good goal on a competition site like Kaggle and run through the whole process yourself;

8. Available data sources


PS: Everyone is welcome to read, comment, like, follow, and bookmark. Thank you!


