My first Kaggle competition: Titanic

Background

Titanic: Machine Learning from Disaster - Kaggle

Two years ago someone recommended this competition to me as something to practice on. I opened the page, bookmarked it, and had no idea where to start.

Two years later, I opened the page again and clearly saw the Titanic Tutorial - Kaggle, which anyone can follow step by step. What was blinding my eyes back then ~

Goal

use machine learning to create a model that predicts which passengers survived the Titanic shipwreck

Data

Titanic: Machine Learning from Disaster - Kaggle

  • train.csv: the training set (a quick peek at all three files follows this list)
    • Survived: 1 = yes, 0 = no
  • test.csv: the test set, same columns but without Survived
  • gender_submission.csv: an example of what a prediction submission looks like
    • PassengerId: the passenger IDs from test.csv
    • Survived: the predicted result
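A minimal sketch of peeking at the three files; the /kaggle/input paths are the standard Kaggle notebook paths, the same ones used in the code later in this post:

import pandas as pd

# Load the three files provided by the competition
train_data = pd.read_csv("/kaggle/input/titanic/train.csv")
test_data = pd.read_csv("/kaggle/input/titanic/test.csv")
submission = pd.read_csv("/kaggle/input/titanic/gender_submission.csv")

print(train_data.columns.tolist())  # includes the Survived label
print(test_data.columns.tolist())   # same columns, but no Survived
print(submission.head())            # PassengerId plus a Survived prediction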

The guide I started with and followed

Titanic Tutorial - Kaggle

Learning Model

An excerpt of the site's explanation (with a toy illustration of the voting after the list); the concrete details come later.

  • random forest model
    • constructed of several "trees"
      • that will individually consider each passenger's data
      • and vote on whether the individual survived.
      • Then, the random forest model makes a democratic decision:
        • the outcome with the most votes wins!
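Not from the tutorial, just a toy illustration of that "democratic decision":

from collections import Counter

# Hypothetical votes from three trees for one passenger: 1 = survived, 0 = did not
votes = [1, 0, 1]

# The outcome with the most votes wins
outcome, n = Counter(votes).most_common(1)[0]
print(outcome, n)  # 1, 2 -> predicted to survive, with 2 of 3 votes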

sklearn.ensemble.RandomForestClassifier

The Titanic tutorial uses the RandomForestClassifier algorithm. Before getting into the algorithm itself, I noticed that in sklearn this class lives in the ensemble module. My English is not great, and I didn't know what ensemble means, so I decided to look up ensemble first.

ensemble

The dictionary explanation is: a number of things considered as a group

Sounds like it means a combination of things.

After a bit of searching: in ML, ensemble learning is mostly translated (into Chinese) as "integrated learning". The Zhihu post How should one get started with ensemble learning? - Zhihu mentions three common ensemble learning frameworks: bagging, boosting, and stacking.

From the API Reference - scikit-learn 0.22.1 documentation you can also see that each of these frameworks has corresponding algorithms.
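For example, here is my own (possibly incomplete) mapping from framework to one of its estimators in sklearn.ensemble:

# One example estimator per framework in sklearn.ensemble
from sklearn.ensemble import (
    BaggingClassifier,            # bagging
    AdaBoostClassifier,           # boosting
    GradientBoostingClassifier,   # boosting
    StackingClassifier,           # stacking (new in scikit-learn 0.22)
)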

Random Forest is an algorithm under the bagging framework. For now I'll just try to understand this one; the other frameworks can wait until I actually run into them. But before that, we still need to be clear about what Ensemble Learning really is:

In statistics and machine learning, ensemble methods use multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone.

Taken literally, this should mean: combine multiple learning algorithms to get better predictive performance, with the combined result beating any single constituent algorithm alone.

The bagging framework

sklearn.ensemble.BaggingClassifier — scikit-learn 0.22.1 documentation

A Bagging classifier is an ensemble meta-estimator that fits base classifiers each on random subsets of the original dataset and then aggregate their individual predictions (either by voting or by averaging) to form a final prediction.

In plain terms (a runnable sketch follows this list):

  • Randomly sample a subset from the source dataset
  • Train a classifier on this subset
  • Repeat the steps above several times
  • Aggregate the predictions of the individual classifiers (by averaging or voting)
  • To form the final prediction
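Here is a minimal sketch of these steps using sklearn's BaggingClassifier; the dataset and parameter values are made up for illustration:

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# A synthetic stand-in for a real dataset
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# 10 decision trees, each fit on a random half of the rows;
# predictions are aggregated by voting
bag = BaggingClassifier(
    base_estimator=DecisionTreeClassifier(),  # renamed to estimator in sklearn 1.2
    n_estimators=10,   # how many classifiers to build
    max_samples=0.5,   # each classifier sees a random 50% of the data
    random_state=0,
)
bag.fit(X, y)
print(bag.predict(X[:5]))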

My questions at this point:

  • How many subsets should be sampled, i.e., how many classifiers should be built?
  • What algorithm does the random sampling use?
  • When aggregating all the classifiers' results, what are the pros and cons of averaging versus voting?
  • How is each classifier trained?

I don't know yet ~

As mentioned earlier, Random Forest is an algorithm under the bagging framework. Now let's look at how this algorithm answers some of my questions.

Random Forest algorithm

1.11. Ensemble methods — scikit-learn 0.22.1 documentation

The prediction of the ensemble is given as the averaged prediction of the individual classifiers.

First, one thing is clear: this algorithm averages the predictions of its individual classifiers. And what is the forest made of? A forest is naturally made of trees, and the trees here are Decision Trees, so the algorithm is actually an averaging algorithm based on randomized decision trees.

random forest builds multiple decision trees and merges them together to get a more accurate and stable prediction.

Random forest builds a decision tree as each of its classifiers, and then merges them.
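To check the "averaging" claim for myself (a small self-contained experiment, with make_classification standing in for the Titanic data): a fitted RandomForestClassifier exposes its trees through the estimators_ attribute, and its predict_proba should match the mean of the trees' probabilities.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=6, random_state=0)
forest = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)

# Average the probabilistic predictions of the individual trees...
tree_avg = np.mean([t.predict_proba(X[:3]) for t in forest.estimators_], axis=0)

# ...and compare with the forest's own prediction
print(np.allclose(tree_avg, forest.predict_proba(X[:3])))  # True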

So how does each classifier actually classify? Let's use the Titanic code as an example and try to understand:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

train_data = pd.read_csv("/kaggle/input/titanic/train.csv")

# The label to predict: whether each passenger in the training set survived
y = train_data["Survived"]

# The feature columns used for the prediction
features = ["Pclass", "Sex", "SibSp", "Parch"]
X = pd.get_dummies(train_data[features])

model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
model.fit(X, y)
  • y: the survival outcomes of the passengers in the training set
  • features: the feature columns for these passengers, such as sex, cabin class, and so on
  • X: the dummy-encoded data. Why use get_dummies instead of train_data[features] directly?

If you instead assign train_data[features] to X directly and print it, the result looks like this:

     Pclass     Sex  SibSp  Parch
0         3    male      1      0
1         1  female      1      0

If we go on to fit the model with this X, it complains:

ValueError: could not convert string to float: 'male'

Obviously, Sex is a string column while the model needs floats, so train_data[features] cannot be used directly.

Now the role of get_dummies() is clear: it converts these string columns into numeric ones. As the printout below shows, the Sex column is split into two columns, Sex_female and Sex_male, whose values are 0 and 1 respectively.

     Pclass  SibSp  Parch  Sex_female  Sex_male
0         3      1      0           0         1
1         1      1      0           1         0
2         3      0      0           1         0
3         1      1      0           1         0
4         3      0      0           0         1
..      ...    ...    ...         ...       ...
886       2      0      0           0         1
887       1      0      0           1         0
888       3      1      2           1         0
889       1      0      0           0         1
890       3      0      0           0         1
  • RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
    • What do these parameters mean?
      • n_estimators: the number of decision trees
      • max_depth: the maximum depth of each tree
      • random_state: controls the random number generator (honestly I don't fully understand this yet; isn't the sampling supposed to be random? It may need to work together with other parameters such as shuffle). Some references say it pins down the algorithm's randomness so that multiple runs produce the same result (a quick check follows this list):
        • To make a randomized algorithm deterministic (i.e. running it multiple times will produce the same result), an arbitrary integer random_state can be used
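A quick check of that quote (my own experiment, again on synthetic data): two forests built with the same random_state should make identical predictions.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Same random_state, so the same forest, run after run
m1 = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1).fit(X, y)
m2 = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1).fit(X, y)
print((m1.predict(X) == m2.predict(X)).all())  # True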

For details on how to tune these parameters, see the parameter tuning guidelines.
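I haven't worked through the guidelines yet, but one common way to search over these parameters is sklearn's GridSearchCV; the grid values below are my own guesses, not recommendations from the guide:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, n_features=6, random_state=0)

# Try a few candidate values for the two parameters discussed above
grid = GridSearchCV(
    RandomForestClassifier(random_state=1),
    param_grid={"n_estimators": [50, 100, 200], "max_depth": [3, 5, 8]},
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_)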

Random Forest application scenarios

Since it is a classification algorithm, many classification problems are natural application scenarios; beyond that, it can also be used for regression problems.

The article The Random Forest Algorithm: A Complete Guide - Built In gives a practical analogy:

  • You're deciding where to travel, so you go ask a friend
  • The friend asks which aspects of your previous trips you liked and disliked
    • and gives some suggestions based on that
  • This gives you material for your decision
  • You repeat the same steps with another friend
  • And another, and another
  • ...

Likewise, when you receive several job offers and hesitate over which one to take, or when you've viewed several apartments and have to decide, it seems this method is worth a try.

Code I wasn't familiar with before

  • pandas.DataFrame.head: returns the first n rows of the dataset; parameter n defaults to 5
# Peek at the first 5 rows of the test set
test_data = pd.read_csv("/kaggle/input/titanic/test.csv")
test_data.head()

# pandas.DataFrame.loc with a boolean mask: select the male passengers,
# then compute their survival rate
men = train_data.loc[train_data.Sex == 'male']["Survived"]
rate_men = sum(men)/len(men)
