Chapter 7 scikit-learn and machine learning in practice

1 scikit-learn

Navigation page and algorithm guide
API overview: data preprocessing and normalization, feature extraction, feature selection; models: generalized linear models (GLM) for regression, Naive Bayes, support vector machines, decision trees, clustering; model tuning and hyperparameter selection: Model Selection; model combination and boosting: Ensemble Methods; model evaluation: Metrics.
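All of these share a small, consistent estimator API: transformers implement fit/transform, predictors implement fit/predict. A minimal sketch (StandardScaler and LinearRegression are just arbitrary examples, with toy data):

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0], [4.0]])   # toy feature matrix
y = np.array([2.0, 4.0, 6.0, 8.0])           # toy targets

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)           # transformer: fit, then transform

model = LinearRegression()
model.fit(X_scaled, y)                       # predictor: fit, then predict
print(model.predict(scaler.transform([[5.0]])))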

2 A hands-on project

2.1 Project goals

Use California Census data to build a California housing price model. The data includes indicators such as the population of each block, the median income, and the median house price.
We want to build a model to learn from the data and predict the median house price of any neighborhood based on other indicators.

2.2 Defining the problem

The first question to ask the boss is: what is the business goal?
Building a model is not the end goal; what matters is how the company will use the model and how it will benefit from it. Your boss may tell you that the model's output will be fed into another system.
The next question is: how well does the existing solution perform?
Then start designing the system. Is it supervised learning, unsupervised learning, or reinforcement learning? Is it classification or regression?
This is a supervised learning problem: we train on labeled samples (the labels are the median house prices). It is a regression problem: the value to be predicted is numerical.

2.3 Select a performance metric

For regression problems, the root mean square error (RMSE) is generally chosen as the performance metric.
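RMSE is defined as $\mathrm{RMSE} = \sqrt{\dfrac{1}{m}\sum_{i=1}^{m}(\hat{y}_i - y_i)^2}$. A minimal sketch of computing it with scikit-learn (y_true and y_pred are placeholder arrays, not values from this project):

import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([200000, 350000, 150000])   # placeholder house prices
y_pred = np.array([210000, 330000, 160000])   # placeholder predictions
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
print(rmse)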

2.4 Verify assumptions

Communicate with the downstream team to confirm what kind of output they need: a numeric value or a category.

2.5 Get data

  • Download data
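The download itself is not shown here. A minimal sketch of loading the data, assuming the CSV has already been saved locally (the path datasets/housing/housing.csv is an assumption):

import pandas as pd

# Load the California housing data from a local CSV file (path is an assumption)
housing = pd.read_csv("datasets/housing/housing.csv")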

View data structure

housing.head()        # first five rows
housing.info()        # row count, column types, non-null counts
housing.describe()    # summary statistics for the numerical columns
housing["ocean_proximity"].value_counts()   # counts per category
housing.hist(bins=50, figsize=(20,15))      # histogram of every numerical attribute

  • View data description

Pay attention to the amount of data, the type of each attribute, and the number of non-null values. Also check whether any attributes are capped (truncated).

  • Create test set

If the test set is re-sampled every time the program runs, the results will be unstable. You can save the test set once, or fix the random seed so that the same test set is generated on every run.
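A minimal sketch of a reproducible random split with a fixed seed (the stratified version used in this project follows below):

from sklearn.model_selection import train_test_split

# random_state fixes the seed, so the same split is produced on every run
train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)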
In addition, note that a purely random test set may not be representative; in this example, it might not cover high-income groups well. We need to stratify the sampling according to the distribution of median income in the original data set, which can be done with StratifiedShuffleSplit.

import numpy as np

# Divide by 1.5 to limit the number of income categories
housing["income_cat"] = np.ceil(housing["median_income"] / 1.5)
# Merge all categories above 5 into category 5
housing["income_cat"].where(housing["income_cat"] < 5, 5.0, inplace=True)

from sklearn.model_selection import StratifiedShuffleSplit
# Stratified sampling: keep the proportion of samples for each income category
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]

# Compare the income_cat proportions in the test set and in the full data set
strat_test_set["income_cat"].value_counts() / len(strat_test_set)
housing["income_cat"].value_counts() / len(housing)

# Remove income_cat so the data goes back to its original form
for set_ in (strat_train_set, strat_test_set):
    set_.drop("income_cat", axis=1, inplace=True)

You will find that the income_cat distribution in the test set matches the distribution in the original data set.

2.6 Explore and visualize the data to discover patterns

Get an overall feel for the data to be processed.
First, put the test set aside and do not explore it. If the training set is large, you can sample a smaller exploration set.
Second, visualize the geographic data.
Third, look for correlations.
Use corr_matrix = housing.corr(), provided by pandas, to view the correlations.
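A minimal sketch of both steps, the geographic scatter plot and the correlation check (matplotlib is assumed to be installed):

import matplotlib.pyplot as plt

# Geographic scatter plot: a low alpha value reveals high-density areas
housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.1)
plt.show()

# Correlation of every numerical attribute with the median house value
# (newer pandas versions may need housing.corr(numeric_only=True) because of the text column)
corr_matrix = housing.corr()
print(corr_matrix["median_house_value"].sort_values(ascending=False))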

The median income and the median house price are highly correlated. Closer analysis shows obvious horizontal lines at prices such as 500,000, 480,000, and 350,000. It may be necessary to remove the blocks with these values to prevent the algorithm from learning to reproduce these quirks.

Fourth, experiment with attribute combinations

housing["rooms_per_household"] = housing["total_rooms"]/housing["households"]
housing["bedrooms_per_room"] = housing["total_bedrooms"]/housing["total_rooms"]
housing["population_per_household"]=housing["population"]/housing["households"]

Attribute combination is where you can get creative: for example, the number of rooms per household, the number of bedrooms per room, and the number of people per household, as computed above.

2.7 Prepare data for machine learning algorithms

1 Write functions for these transformations
These functions can be reused.
2 Data cleaning
There are three ways to deal with missing values: (1) remove the rows that contain missing values, (2) remove the entire attribute (column), (3) fill in some value (0, the mean, the median, etc.); see the sketch below.
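A minimal sketch of the three options, using total_bedrooms (the attribute with missing values in this dataset) as the example; each call returns a new object rather than modifying housing in place:

# Option 1: remove the rows that contain missing values
housing.dropna(subset=["total_bedrooms"])

# Option 2: remove the whole attribute (column)
housing.drop("total_bedrooms", axis=1)

# Option 3: fill in the median (keep the median value, to reuse it on the test set)
median = housing["total_bedrooms"].median()
housing["total_bedrooms"].fillna(median)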
3 Handle text and categorical attributes
First convert the text attribute to numbers and then apply one-hot encoding: OrdinalEncoder followed by OneHotEncoder, or LabelBinarizer to do both steps at once; see the sketch below.
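A minimal sketch of encoding the ocean_proximity column (OrdinalEncoder and OneHotEncoder are available from scikit-learn 0.20 onward):

from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder

housing_cat = housing[["ocean_proximity"]]       # 2-D DataFrame, as the encoders expect

ordinal_encoder = OrdinalEncoder()
housing_cat_encoded = ordinal_encoder.fit_transform(housing_cat)   # text -> integers

onehot_encoder = OneHotEncoder()
housing_cat_1hot = onehot_encoder.fit_transform(housing_cat)       # text -> sparse one-hot matrix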
4 Custom transformers
Inherit from BaseEstimator and TransformerMixin.

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

# Column indices, assuming the standard column order of the housing data:
# total_rooms=3, total_bedrooms=4, population=5, households=6
rooms_ix, bedrooms_ix, population_ix, household_ix = 3, 4, 5, 6

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room=True):  # no *args or **kwargs
        self.add_bedrooms_per_room = add_bedrooms_per_room
    def fit(self, X, y=None):
        return self  # nothing else to do
    def transform(self, X, y=None):
        rooms_per_household = X[:, rooms_ix] / X[:, household_ix]
        population_per_household = X[:, population_ix] / X[:, household_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household,
                         bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]

5 Feature scaling
Usually we do not scale the target value; here the target is the median house price, and it is left unscaled.
Feature scaling is fitted on the training set only; the same fitted transformation must then be applied to the test set and to new data at prediction time.
There are two common ways of scaling: min-max normalization and standardization.
Normalization (min-max scaling): $x' = \dfrac{x - \min}{\max - \min}$. The result lies in [0, 1], but it is strongly affected by outliers. API: MinMaxScaler.
Standardization: $x' = \dfrac{x - \text{mean}}{\text{standard deviation}}$. The result has no fixed range and is much less affected by outliers. API: StandardScaler.
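A minimal sketch of the fit-on-train, transform-on-test rule (train_features and test_features are placeholder arrays):

import numpy as np
from sklearn.preprocessing import StandardScaler

train_features = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])  # placeholder data
test_features = np.array([[1.5, 250.0]])

scaler = StandardScaler()                              # or MinMaxScaler() for normalization
train_scaled = scaler.fit_transform(train_features)    # learn mean/std on the training set only
test_scaled = scaler.transform(test_features)          # reuse the same mean/std on the test set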

6 Transformation pipeline

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Imputer, StandardScaler, OneHotEncoder
# Note: Imputer was removed in scikit-learn 0.22; newer versions use SimpleImputer from sklearn.impute

# Create a class to select numerical or categorical columns 
# since Scikit-Learn doesn't handle DataFrames yet
class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names].values

# housing_num: the numerical feature columns (housing is assumed here to hold the
# training features, i.e. without the median_house_value label)
housing_num = housing.drop("ocean_proximity", axis=1)
num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"]

num_pipeline = Pipeline([
        ('selector', DataFrameSelector(num_attribs)),
        ('imputer', Imputer(strategy="median")),
        ('attribs_adder', CombinedAttributesAdder()),
        ('std_scaler', StandardScaler()),
    ])

cat_pipeline = Pipeline([
        ('selector', DataFrameSelector(cat_attribs)),
        ('cat_encoder', OneHotEncoder(sparse=False)),
    ])

from sklearn.pipeline import FeatureUnion

full_pipeline = FeatureUnion(transformer_list=[
        ("num_pipeline", num_pipeline),
        ("cat_pipeline", cat_pipeline),
    ])
housing_prepared = full_pipeline.fit_transform(housing)
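fit_transform is called only on the training data; the test set and any new data at prediction time go through transform on the already fitted pipeline, so the learned medians, scaling statistics, and one-hot categories are reused. A sketch, assuming strat_test_set from section 2.5 still holds the label column:

# Never call fit / fit_transform on the test set -- only transform
X_test = strat_test_set.drop("median_house_value", axis=1)
X_test_prepared = full_pipeline.transform(X_test)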

2.8 Select and train the model

2.9 Model fine-tuning

2.10 Test on the test set


Origin blog.csdn.net/flying_all/article/details/113448533