Machine Learning - Process Overview

Lately I have been studying machine learning almost exclusively; ever since I discovered Kaggle, becoming a Grand Master there has also made it onto my agenda.

After working through a couple of the Getting Started competitions, I have gradually summed up a routine for machine learning.

First, when doing data science there are three particularly important points:

  1.  Knowledge of the data is particularly important. The characteristics of the data itself and the relationships between features, explored through EDA, tend to inspire the work that comes afterwards.
  2.  Feature engineering is particularly important. There is a term called the Magic Feature: a feature whose introduction gives your score a very noticeable boost. So feature engineering can raise our ranking to a certain extent.
  3.  Model ensembling is particularly important. We all know that getting a baseline model online is particularly easy, so anyone can reach the baseline score. Among all the tricks, the two that most obviously improve your ranking are feature engineering and model ensembling.

These are things worth knowing before a Kaggle competition; now let's take a look at the basic machine learning process.

 

I. EDA

(1) Know your data as a whole

First understand some basic concepts:

First, data basically falls into two categories: Numerical Data and Categorical Data. The former can be further divided into discrete values and continuous values.

Numerical Data has its characteristic statistics: mean, std, max, min and so on. We can use pandas' describe method to summarize them.

# For numerical data
data_set.describe()

Categorical Data has its own characteristic statistics: unique, count and so on. We can still use pandas' describe method for this.

# For categorical data
data_set.describe(include=["O"])    # note that this is an uppercase letter O

Personally I think pandas is the most powerful package for handling missing values, and you will run into some simple data preprocessing with it later. To find out how many missing values there are and what dtype each feature has, we can use the info method.

# See if there is any missing value
data_set.info()

For some specific questions we may also want to know the distribution of a column. This is really simple to implement: we can use the DataFrame's value_counts together with plot:

train_set.Survived.value_counts().plot(kind='bar')

These are all very simple ways to get to know your data. I recommend running them in an IPython REPL environment, and an ipynb (Jupyter notebook) is even better. The strength of a notebook is that it executes code in blocks (unlike other REPL environments that run line by line) and saves the results after execution, so when the next block runs you do not have to re-train everything (unlike IDEs such as PyCharm). So I strongly recommend it.

(2) Understanding the relationships within the data

Once you have seen the overall characteristics of the data, you probably already have some hypotheses about which features to finally select; what follows is putting those hypotheses into practice. Here we can work directly with the DataFrame, or use matplotlib or seaborn to show the results as images.

Let's take Kaggle's Hello-World-level competition, Titanic, as the example. Common sense says "women and children first", so we guess that sex and age are related to the final survival outcome.

Sex, for instance, is typical Categorical Data. Pandas provides a very good API for this, groupby: it creates a groupby object, and by calling various methods on that object we can see the statistics under each category.

# groupby creates a group object
# as_index=False keeps the result as a plain DataFrame
train_set[["Sex", "Survived"]].groupby(["Sex"], as_index=False).mean()

For Categorical Data there is also a preferred way of visualizing: setting stacked=True in matplotlib.

Survived_0 = data_train.Pclass[data_train.Survived == 0].value_counts()
Survived_1 = data_train.Pclass[data_train.Survived == 1].value_counts()
df=pd.DataFrame({'Survived':Survived_1, 'Unsurvived':Survived_0})
df.plot(kind='bar', stacked=True)

Age is also a very typical feature. Although it looks like discrete Numerical Data, in a specific problem the outcome often depends not on the exact value of age but on the age segment. For example, in the Titanic problem children get priority to escape, so we would split off the child ages as their own segment. In feature engineering this is called equal-width binning. (Of course, there is no rush to do the feature engineering yet.)

First we treat age as Numerical Data. We cannot use groupby as before to observe it in the DataFrame, so we use a histogram to view it instead.

Concretely, I want the age distribution of the survivors and the age distribution of the non-survivors, which needs two plots. Here I recommend seaborn's FacetGrid, because matplotlib's API for multiple subplots is messy and hard to understand.

g = sns.FacetGrid(train_set, col='Survived')
g.map(plt.hist, 'Age', bins=20)

If you want to study several variables at the same time, FacetGrid supports that too. Note that adding sns.set() at the top, or introducing the hue attribute, makes the plots even more attractive.

# Compare the counts of men and women at each embarkation port, for each passenger class
grid = sns.FacetGrid(train_set, col='Pclass', hue='Sex', palette='seismic', size=4)
grid.map(sns.countplot, 'Embarked', alpha=.8)
grid.add_legend() 

 

II. Data Preprocessing

Data preprocessing mainly involves three things:

  1.  Handle missing values. Obviously, if the missing values are not dealt with, the work that follows will be hard to carry out.
  2.  Feature engineering. It can generally be split into feature selection and feature construction; I will cover it last.
  3.  Standardization and encoding. Our numerical values are generally fairly large, which is bad for model convergence, so we standardize them to restrict their range. For categorical values we usually choose one-hot encoding; string-type features in particular have to be encoded one way or another.

(1) Processing missing values

There are too many ways of handling missing values to cover here (see the linked post); in general it depends on the specific problem. As mentioned before, pandas provides plenty of APIs for handling missing values. For example, to fill missing values with the median, you can use the following statement:

test_df['Fare'].fillna(test_df['Fare'].dropna().median(), inplace=True)
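
Filling with the most frequent value is another common choice for categorical columns. A minimal sketch of the same fillna API, with the Embarked column chosen purely for illustration:

# Fill a categorical column's missing values with its most frequent value
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])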

If you want to treat missing values as a category of their own:

# If you filled the missing values first, every row would become "yes"; think about why
df.loc[df.Cabin.notna(), "Cabin"] = "yes"
df.loc[df.Cabin.isna(), "Cabin"] = "no"

I will write a dedicated blog post summarizing this later.

(2) Standardization and encoding

This was already explained above: for Numerical data we use a standardization algorithm. sklearn has plenty of them; for example, below I chose StandardScaler from sklearn.preprocessing:

    df["Age"] = StandardScaler().fit_transform(df["Age"].to_frame())

For Categorical data we use one-hot encoding (of course you can use other encodings, such as ordinal encoding or binary encoding). Pandas has an API that does one-hot encoding directly, called get_dummies; it creates the new feature columns from your original DataFrame, see below:

features = ["Cabin", "Embarked", "Sex", "Pclass", "Title"]
one_hot_category = []
for feature in features:
    dummies = pd.get_dummies(df[feature], prefix=feature)
    one_hot_category.append(dummies)
# Concatenate the new dummy columns onto the original data, then throw away the old columns
df = pd.concat([df, *one_hot_category], axis=1)
df = df.drop(features, axis=1)

 

III. Modeling and Prediction

This looks like the most important part, but there is not much to say, because the sklearn calls are just too simple. As for which model to pick for the baseline, it really does not matter, since the final model ensemble will combine a wide variety of models anyway; how exactly is left to the model ensemble section.
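
To make that concrete, here is a minimal baseline sketch (assuming X_train, y_train and X_test are the preprocessed arrays from the steps above; the model choice is arbitrary):

from sklearn.linear_model import LogisticRegression

# Any classifier will do for the baseline: fit, predict, done
baseline = LogisticRegression(max_iter=1000)
baseline.fit(X_train, y_train)
predictions = baseline.predict(X_test)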

 

IV. Drawing the Learning Curve

Andrew Ng devoted a section to learning curves in his Coursera course. He said the learning curve is meant to judge whether our model has converged. Actually, if a model in sklearn fails to converge it should emit a warning, but I silenced warnings with the following statement:

import warnings
warnings.filterwarnings("ignore")

The training process in sklearn is invisible to us, which means we cannot do what we would when hand-training a neural network: define a list and append the metrics to it after every training step. Fortunately sklearn provides learning_curve, used as follows:

from sklearn.model_selection import learning_curve
import numpy as np

# train_sizes is the number of training samples used at each point;
# train_scores and test_scores have one row per size and one column per CV fold
train_sizes, train_scores, test_scores = learning_curve(
    estimator, X, y, cv=None, train_sizes=np.linspace(0.05, 1, 20)
)
train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
test_scores_mean = np.mean(test_scores, axis=1)
test_scores_std = np.std(test_scores, axis=1)

# train_sizes is the number of training samples; fill_between also takes two
# extra y arrays (the lower and upper bounds of the shaded band)
# Note: give fill_between some transparency, otherwise it drowns out the curves
plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                 train_scores_mean + train_scores_std, alpha=0.1, color="b")
plt.fill_between(train_sizes, test_scores_mean - test_scores_std, test_scores_mean + test_scores_std, alpha=0.1, color="r")
plt.plot(train_sizes, train_scores_mean, "o-", color="b")
plt.plot(train_sizes, test_scores_mean, "o-", color="r")
plt.show()

The fill_between method of matplotlib is used here; plotting the mean and standard deviation together works very well.

 

V. Feature Engineering

Feature engineering is not a simple task; of course, since this is a Getting Started competition, I will treat it simplistically.

(1) Numerical to Categorical

This is the equal-width binning we mentioned earlier. I have to say, pandas really is powerful: it has equal-width binning ready for us.

train_df['AgeBand'] = pd.cut(train_df['Age'], 5)

Now we have five equal-width intervals. Interval objects are obviously not something the model can analyse directly, so we convert them into the corresponding Categorical feature as follows:

 df.loc[df["Age"] <= 16, "Age"] = 0
 df.loc[(df["Age"] > 16) & (df["Age"] <= 32), "Age"] = 1
 df.loc[(df["Age"] > 32) & (df["Age"] <= 48), "Age"] = 2
 df.loc[(df["Age"] > 48) & (df["Age"] <= 64), "Age"] = 3
 df.loc[df["Age"] > 64, "Age"] = 4

A small numpy/pandas trick is used here: the & in the middle. Comparing a Series with a number returns a boolean array, and such arrays cannot be combined with the plain and/or logical operators. The & operator is overloaded for this, and it performs an element-wise AND. With the resulting array we can select all the rows where it is True.
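
A tiny sketch of that behaviour (the values are made up purely for illustration):

import pandas as pd

ages = pd.Series([10, 20, 40, 70])
mask = (ages > 16) & (ages <= 32)    # element-wise AND -> [False, True, False, False]
print(ages[mask])                    # keeps only the rows where the mask is True
# (ages > 16) and (ages <= 32) would raise "truth value of a Series is ambiguous"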

(2) String features

Titanic has a very good example of extracting information from a string: extracting the title (the honorific) from the Name field. The natural choice for handling strings is a regular expression, and pandas' built-in str accessor supports that nicely:

df['Title'] = df.Name.str.extract(r' ([A-Za-z]+)\.', expand=False)
df['Title'] = df['Title'].replace(['Lady', 'Countess', 'Capt', 'Col',
                                   'Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')
df['Title'] = df['Title'].replace('Mlle', 'Miss')
df['Title'] = df['Title'].replace('Ms', 'Miss')
df['Title'] = df['Title'].replace('Mme', 'Mrs')
df = df.drop("Name", axis=1)

This is only one small corner of feature engineering; for more, see the linked post.

 

VI. Model Ensemble

An explanation on Zhihu: link; an explanation on Kaggle: link.

Real model ensembling is very complex; I will fill this part in later. Here I simply use what is inside sklearn to do a simple model ensemble.

Again taking Titanic as the example: it is a classification problem, so we have to pick classifiers to ensemble. Common classifiers are:

  1. logistic regression
  2. support vector machine
  3. random forest classifier
  4. naive bayes classifier
  5. perceptron (Perceptron)
  6. KNN

Then a simple call gives us an ensembled model; here I choose the common VotingClassifier to combine heterogeneous models:

from sklearn.linear_model import LogisticRegression, Perceptron
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

LR = LogisticRegression(solver="lbfgs", C=1.0, penalty="l2", tol=1e-6)
SVM = SVC(kernel="rbf", C=1)
RF = RandomForestClassifier(n_estimators=20, max_depth=5)
GNB = GaussianNB()
P = Perceptron()
KNN = KNeighborsClassifier(n_neighbors=3)
model = VotingClassifier([("lr", LR), ("svm", SVM), ("gnb", GNB), ("rf", RF), ("p", P), ("knn", KNN)])
model.fit(X, y)

Another point to note: whether it is feature engineering or model ensembling, we cannot guarantee that the score will rise. We still need to assess it on our own CV.

There are plenty of scores on Kaggle that are wildly off the charts. For example, in regression problems an MSE of exactly 0 actually shows up, and in classification even 100% accuracy appears. These are really just overfitting: the trick is to find the original dataset and then run a KNN with k = 1 against it. That behaviour is completely pointless, but our own CV will not lie to us.

print(cross_val_score(model, X, y))

As long as there is a clear improvement on our CV, the leaderboard score should rise accordingly (I am fairly confident of that).

Other CV methods to test with: link.
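
As one example of an alternative CV setup, here is a minimal sketch using StratifiedKFold (the fold count and random_state are arbitrary choices):

from sklearn.model_selection import StratifiedKFold, cross_val_score

# Stratified folds keep the class ratio of y in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv)
print(scores.mean(), scores.std())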

 

VII. Submitting the Results

There is not much to say about submitting results: first convert our results into a CSV, then go to Submission and upload it.

Converting to CSV is again done through pandas, with the simple to_csv method:

outputs = model.predict(test_set)
result = pd.DataFrame({"PassengerId": range(892, 1310), "Survived": outputs})
result.to_csv("./result.csv", index=False)
# Remind me when it is done
print("Done!")

 

Origin blog.csdn.net/qq_43338695/article/details/103791134