Article Directory
- 1 scikit-learn
- 2 Actual combat of a project
- 2.1 Project goals
- 2.2 Defining the problem
- 2.3 Select performance indicators
- 2.4 Verify hypothesis
- 2.5 Get data
- 2.6 Data exploration and visualization, and discovering laws
- 2.7 Prepare data for machine learning algorithms
- 2.8 Select and train the model
- 2.9 Model fine-tuning
- 2.10 Test on the test set
1 scikit-learn
Navigation page and algorithm guide
API overview:
- Data preparation: Data Preprocessing and Normalization, Feature Extraction, Feature Selection
- Models: Generalized Linear Models (GLM) for regression, Naive Bayes, Support Vector Machines, Decision Trees, Clustering
- Model tuning and hyperparameter selection: Model Selection
- Model combination and boosting: Ensemble Methods
- Model evaluation: Metrics
2 Actual combat of a project
2.1 Project goals
Use California census data to build a model of California housing prices. The data includes indicators such as the population, the median income, and the median house price of each block.
The model should learn from this data and predict the median house price of any block from the other indicators.
2.2 Defining the problem
The first question to ask the boss is: what is the business goal?
Building a model is not the end goal. What matters is how the company will use the model and how it expects to benefit from it. Your boss may tell you, for example, that the output of your model will be fed into another system.
The next question is: how well do the existing solutions work?
Then start designing the system. Is it supervised learning, unsupervised learning or reinforcement learning? Is it classification or regression?
This is a supervised learning problem, because the training samples come with labels (the median house prices). It is also a regression problem, because the value to predict is numerical.
2.3 Select performance indicators
For regression problems, the root mean square error (RMSE) is generally selected as the performance indicator.
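RMSE is the square root of the mean squared difference between predictions and labels, $\mathrm{RMSE} = \sqrt{\frac{1}{m}\sum_{i=1}^{m}(\hat{y}_i - y_i)^2}$. A minimal sketch with NumPy follows; the function name `rmse` and the sample prices are made up for illustration.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error between labels and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))

# Made-up median house prices (labels) and model predictions
labels = [200000.0, 350000.0, 500000.0]
predictions = [210000.0, 340000.0, 470000.0]
error = rmse(labels, predictions)
print(round(error, 2))
```

Because the errors are squared before averaging, the single 30000 miss dominates the result. RMSE penalizes large errors more heavily than small ones.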
2.4 Verify hypothesis
Check with the team that owns the downstream system what kind of output it needs: a numeric value or a category.
2.5 Get data
- Download data
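- View data structure
housing.head()      # first five rows
housing.info()      # number of rows, type of each attribute, non-null counts
housing.describe()  # summary statistics of the numerical attributes
housing["ocean_proximity"].value_counts()  # number of blocks in each category
housing.hist(bins=50, figsize=(20,15))     # one histogram per numerical attribute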
- View data description
Focus on the amount of data, the type of each attribute, and the number of non-null values. Also check whether any values have been truncated (capped).
- Create test set
If a new test set is drawn at random on every run, the evaluation results are unstable and cannot be compared between runs. Either save the test set once, or fix the seed of the random number generator so that the same test set is generated every time.
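The fixed-seed idea can be sketched with NumPy as follows. The helper name `split_train_test` and the row count are assumptions for illustration, not part of the original code.

```python
import numpy as np

def split_train_test(n_rows, test_ratio, seed=42):
    """Return (train_indices, test_indices) from a seeded shuffle."""
    rng = np.random.default_rng(seed)   # fixed seed: same shuffle on every run
    shuffled = rng.permutation(n_rows)
    test_size = int(n_rows * test_ratio)
    return shuffled[test_size:], shuffled[:test_size]

train_idx, test_idx = split_train_test(n_rows=1000, test_ratio=0.2)
train_again, test_again = split_train_test(n_rows=1000, test_ratio=0.2)

print(len(train_idx), len(test_idx))
print(np.array_equal(test_idx, test_again))
```

With a DataFrame, the indices would be applied as `housing.iloc[test_idx]`. Scikit-Learn's `train_test_split` offers the same behaviour through its `random_state` parameter.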
A purely random test set may also be unrepresentative. In this example, it could fail to cover the high-income groups. To avoid that, sample in a stratified way, so that the test set follows the distribution of the median income in the original data set. This can be implemented with StratifiedShuffleSplit, after turning the median income into a small number of categories to stratify on.
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Divide by 1.5 to limit the number of income categories
housing["income_cat"] = np.ceil(housing["median_income"] / 1.5)
# Label those above 5 as 5
housing["income_cat"] = housing["income_cat"].where(housing["income_cat"] < 5, 5.0)

# Stratified sampling: keeps the proportion of samples in each category
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]

# Compare the category proportions in the test set and in the full data set
strat_test_set["income_cat"].value_counts() / len(strat_test_set)
housing["income_cat"].value_counts() / len(housing)

# Remove the helper column so the data is back in its original form
for set_ in (strat_train_set, strat_test_set):
    set_.drop("income_cat", axis=1, inplace=True)
You will find that the income_cat distributions in the test set and the original data set are consistent.
2.6 Data exploration and visualization, and discovering laws
Get an overall understanding of the data to be processed.
First, put the test set aside and do not look at it. If the training set is large, sample a smaller exploration set from it.
Second, visualize geographic data.
Third, find correlations.
Use the corr() method provided by pandas to view the correlations: corr_matrix = housing.corr(). Recent pandas versions raise an error on text columns such as ocean_proximity, so pass numeric_only=True there.
The median income and the median house price are highly correlated. A scatter plot of the two also shows horizontal lines at prices such as 500000, 480000 and 350000, where many blocks share the same value. It may be necessary to remove those blocks so that the algorithm does not learn to reproduce these quirks.
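As a small illustration of reading a correlation matrix, the sketch below builds a synthetic table (the numbers are generated, not taken from the census data) and ranks the attributes by their correlation with the price column.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
income = rng.uniform(1.0, 10.0, size=500)
toy = pd.DataFrame({
    "median_income": income,
    # price rises with income, plus noise
    "median_house_value": 40000 * income + rng.normal(0, 30000, size=500),
    # unrelated to the price by construction
    "housing_median_age": rng.uniform(1, 52, size=500),
})

corr_matrix = toy.corr()
# Rank every attribute by its linear correlation with the target
ranking = corr_matrix["median_house_value"].sort_values(ascending=False)
print(ranking)
```

The coefficient only measures linear relationships, so an attribute with a value near 0 can still be related to the target in a nonlinear way.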
Fourth, try out attribute combinations.
housing["rooms_per_household"] = housing["total_rooms"]/housing["households"]
housing["bedrooms_per_room"] = housing["total_bedrooms"]/housing["total_rooms"]
housing["population_per_household"] = housing["population"]/housing["households"]
This is the time to be creative. For example, try the number of rooms per household, the share of bedrooms among all rooms, and the number of people per household.
2.7 Prepare data for machine learning algorithms
1 Write functions for the data transformations.
Functions can be reused on new data, in the live system, and in later projects.
2 Data cleaning
There are three ways to deal with missing data: 1 remove the samples that have missing values (by row), 2 remove the entire attribute (by column), 3 fill in a value (0, the mean, the median, etc.).
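The three options map onto pandas roughly as follows. This is a sketch on a made-up four-row table; only the column names follow the housing data.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0],
})

# Option 1: drop the rows that have a missing value
dropped_rows = toy.dropna(subset=["total_bedrooms"])

# Option 2: drop the whole attribute (column)
dropped_column = toy.drop("total_bedrooms", axis=1)

# Option 3: fill the gaps with a value, here the median
median = toy["total_bedrooms"].median()
filled = toy.assign(total_bedrooms=toy["total_bedrooms"].fillna(median))

print(len(dropped_rows), list(dropped_column.columns), median)
```

For option 3, compute the fill value on the training set and keep it, because the same value is needed later for the test set and for new data. Scikit-Learn's `SimpleImputer(strategy="median")` stores it for that purpose.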
3 Processing text and categorical attributes
Most algorithms need numbers, so first convert the text categories to numeric codes and then apply one-hot encoding. OrdinalEncoder produces the integer codes, OneHotEncoder produces the one-hot vectors, and LabelBinarizer goes from text to one-hot vectors in one step.
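The two encodings can be compared on a tiny made-up category column. This sketch assumes a Scikit-Learn version that provides `OrdinalEncoder` (0.20 or later) and converts the sparse one-hot output to a dense array only for printing.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder

# Made-up category column; the encoders expect 2-D input of shape (n_samples, 1)
categories = np.array([["INLAND"], ["NEAR BAY"], ["INLAND"], ["ISLAND"]])

ordinal = OrdinalEncoder()
codes = ordinal.fit_transform(categories)           # one numeric code per category

one_hot = OneHotEncoder()
one_hot_matrix = one_hot.fit_transform(categories)  # SciPy sparse matrix by default
dense = one_hot_matrix.toarray()                    # dense copy, for display only

print(ordinal.categories_[0])
print(codes.ravel())
print(dense)
```

Integer codes imply an order and a distance between categories, which is misleading for an attribute such as ocean_proximity. One-hot vectors avoid that at the cost of one column per category.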
4 Custom converter
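Inherit from BaseEstimator and TransformerMixin: TransformerMixin supplies fit_transform(), and BaseEstimator supplies get_params() and set_params(), provided the constructor takes no *args or **kwargs.
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

# Column positions in the numerical array (assumed to follow the column order of the data set)
rooms_ix, bedrooms_ix, population_ix, household_ix = 3, 4, 5, 6

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room=True):  # no *args or **kwargs
        self.add_bedrooms_per_room = add_bedrooms_per_room
    def fit(self, X, y=None):
        return self  # nothing else to do
    def transform(self, X, y=None):
        rooms_per_household = X[:, rooms_ix] / X[:, household_ix]
        population_per_household = X[:, population_ix] / X[:, household_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household,
                         bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]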
5 Feature scaling
Usually the target value is not scaled. Here the target is the median house price, so it is left as it is.
Fit the scaler on the training set only, then apply the same transformation to the test set and to new data at prediction time.
There are two common ways of scaling: normalization and standardization.
Normalization (min-max scaling) is $x' = \dfrac{x - \min}{\max - \min}$. The values fall in the range [0, 1], but the result is strongly affected by outliers. API: MinMaxScaler.
Standardization is $x' = \dfrac{x - \text{mean}}{\text{standard deviation}}$. The values have no fixed range, and the result is less affected by outliers. API: StandardScaler.
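Both operations are easy to try by hand. The sketch below applies them with NumPy to a made-up feature column that contains one outlier. Standardization divides by the standard deviation, which is also what StandardScaler does.

```python
import numpy as np

# Made-up feature column; the last value is an outlier
x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Normalization (min-max scaling): values end up in [0, 1]
x_minmax = (x - x.min()) / (x.max() - x.min())

# Standardization: zero mean and unit standard deviation
x_std = (x - x.mean()) / x.std()

print(x_minmax)
print(x_std)
```

After min-max scaling, the four ordinary values are squeezed into a narrow band just above 0, because the outlier alone defines the upper end of the range.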
6 Conversion pipeline
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Select the numerical or the categorical columns of a DataFrame
# and pass them on to the next step as a NumPy array
class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names].values

# housing_num is assumed to be the DataFrame holding only the numerical attributes
num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"]

num_pipeline = Pipeline([
    ('selector', DataFrameSelector(num_attribs)),
    # SimpleImputer replaces the Imputer class of older Scikit-Learn releases
    ('imputer', SimpleImputer(strategy="median")),
    ('attribs_adder', CombinedAttributesAdder()),
    ('std_scaler', StandardScaler()),
])
cat_pipeline = Pipeline([
    ('selector', DataFrameSelector(cat_attribs)),
    # On Scikit-Learn older than 1.2, write sparse=False instead
    ('cat_encoder', OneHotEncoder(sparse_output=False)),
])

# Run both pipelines and concatenate their outputs column-wise
full_pipeline = FeatureUnion(transformer_list=[
    ("num_pipeline", num_pipeline),
    ("cat_pipeline", cat_pipeline),
])
housing_prepared = full_pipeline.fit_transform(housing)
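Scikit-Learn 0.20 and later also ship ColumnTransformer, which routes DataFrame columns to different transformers without a hand-written selector class. The sketch below is an alternative to the FeatureUnion approach, not the code this article uses; it runs on a made-up four-row table and leaves out the custom attribute adder for brevity.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up sample; only the column names follow the housing data
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0],
    "ocean_proximity": ["NEAR BAY", "INLAND", "INLAND", "NEAR BAY"],
})
num_attribs = ["total_rooms", "total_bedrooms"]
cat_attribs = ["ocean_proximity"]

num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])

# Each entry: (name, transformer, columns it applies to)
full_pipeline = ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    ("cat", OneHotEncoder(), cat_attribs),
])

prepared = full_pipeline.fit_transform(toy)
# The result may be sparse or dense depending on its density, so normalize it
prepared = prepared.toarray() if hasattr(prepared, "toarray") else np.asarray(prepared)
print(prepared.shape)
```

The output has one column per numerical attribute followed by one column per category, in the order in which the transformers are listed.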