Background
Titanic: Machine Learning from Disaster - Kaggle
Two years ago someone recommended this competition to me; I opened the page, bookmarked it, and had no idea where to start.
Two years later I opened the page again and finally saw the Titanic Tutorial - Kaggle clearly: a walkthrough anyone can follow step by step. What was blinding my eyes back then ~
Target
Use machine learning to create a model that predicts which passengers survived the Titanic shipwreck.
Data
Titanic: Machine Learning from Disaster - Kaggle
- train.csv
- Survived: 1=yes, 0=No
- test.csv
- gender_submission.csv: an example of what a submission file should look like
- PassengerId: the IDs from test.csv
- Survived: the predicted result
Follow the tutorial to get started.
Learning Model
Here is an excerpt of the site's explanation; the concrete details come later.
- random forest model
- constructed of several "trees"
- that will individually consider each passenger's data
- and vote on whether the individual survived.
- Then, the random forest model makes a democratic decision:
- the outcome with the most votes wins!
sklearn.ensemble.RandomForestClassifier
The Titanic tutorial uses the RandomForestClassifier algorithm. Before getting into the algorithm itself, I noticed that in sklearn this class lives in the ensemble module. My English is not great and I did not know what "ensemble" meant, so I looked it up first.
The dictionary definition is: a number of things considered as a group
That sounds like it means some kind of combination.
A quick search shows that ML has a whole subfield called ensemble learning. The Zhihu question "How to get started with ensemble learning? - Zhihu" mentions three common ensemble learning frameworks: bagging, boosting, and stacking.
From the API Reference - scikit-learn 0.22.1 documentation you can also see that each of these frameworks has corresponding algorithms.
Random Forest is an algorithm under the bagging framework. Here I will first try to understand just this one; the other frameworks can wait until I run into them. But before that, I still have to figure out: what exactly is ensemble learning?
In statistics and machine learning, ensemble methods use multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone.
Taken literally, this should mean: combine multiple algorithms to obtain better predictive performance, with results better than any of the single constituent algorithms alone.
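As a toy illustration of that definition (not from the original tutorial), scikit-learn's VotingClassifier combines several different algorithms and lets them vote; the data and parameters below are purely illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Toy data in place of a real dataset
X, y = make_classification(n_samples=200, random_state=0)

# Three different algorithms; the majority vote is the ensemble's answer
ensemble = VotingClassifier(estimators=[
    ("lr", LogisticRegression(max_iter=1000)),
    ("tree", DecisionTreeClassifier(random_state=0)),
    ("nb", GaussianNB()),
])
ensemble.fit(X, y)
print(ensemble.score(X, y))
```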
The bagging framework
sklearn.ensemble.BaggingClassifier — scikit-learn 0.22.1 documentation
A Bagging classifier is an ensemble meta-estimator that fits base classifiers each on random subsets of the original dataset and then aggregate their individual predictions (either by voting or by averaging) to form a final prediction.
To the effect that:
- Randomly sample a subset from the original dataset
- Train a classifier on that subset
- Repeat the steps above several times
- Then aggregate the predictions of all the classifiers (by averaging or voting)
- To form the final prediction
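The steps above can be sketched with scikit-learn's BaggingClassifier (toy synthetic data; the parameters are illustrative, not from the tutorial):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

# Toy data in place of a real dataset
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# Steps 1-3: fit 10 classifiers (decision trees by default),
# each on a random 80% subset of the rows
bag = BaggingClassifier(n_estimators=10, max_samples=0.8, random_state=0)
bag.fit(X, y)

# Steps 4-5: the classifiers' votes are aggregated into the final prediction
print(bag.predict(X[:5]))
```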
My questions are:
- How many subsets should be drawn, i.e. how many classifiers should be trained?
- Which algorithm does the random sampling use?
- When aggregating the classifiers' results, what are the pros and cons of averaging versus voting?
- How is each classifier trained?
I don't know yet ~
As mentioned earlier, Random Forest is an algorithm under the bagging framework. Now let's look at how this algorithm answers some of my questions.
Random Forest algorithm
1.11. Ensemble methods — scikit-learn 0.22.1 documentation
The prediction of the ensemble is given as the averaged prediction of the individual classifiers.
First, one thing is clear: this algorithm averages over the individual classifiers. And what is the forest? A forest is naturally made of trees, and the trees here are Decision Trees, so the algorithm is really "averaging algorithms based on randomized decision trees".
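We can sanity-check that averaging claim in scikit-learn with a quick sketch (toy data; the fitted trees are exposed via the forest's estimators_ attribute):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy data; 4 features loosely mirroring the Titanic example
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
forest = RandomForestClassifier(n_estimators=5, random_state=0).fit(X, y)

# Each fitted decision tree is exposed in forest.estimators_
per_tree = np.stack([tree.predict_proba(X[:3]) for tree in forest.estimators_])

# The forest's predicted probabilities are the mean over its trees
print(np.allclose(per_tree.mean(axis=0), forest.predict_proba(X[:3])))
```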
random forest builds multiple decision trees and merges them together to get a more accurate and stable prediction.
That is, a decision tree is built for each classifier, and then they are merged. So how does each classifier actually classify? Let's take the Titanic code as an example and try to understand:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Load the training set (path follows the Kaggle layout used for test.csv below)
train_data = pd.read_csv("/kaggle/input/titanic/train.csv")

y = train_data["Survived"]
features = ["Pclass", "Sex", "SibSp", "Parch"]
X = pd.get_dummies(train_data[features])

model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
model.fit(X, y)
- y: the Survived column of the training set, i.e. who survived the disaster
- features: the feature columns for these passengers, such as sex, number of siblings/spouses, and so on
- X: the dummy-encoded data. Why call get_dummies instead of using train_data[features] directly?
Try printing train_data[features] directly; the result is this:
Pclass Sex SibSp Parch
0 3 male 1 0
1 1 female 1 0
If we keep using this X to fit the model, it complains:
ValueError: could not convert string to float: 'male'
Obviously, this is because Sex is a string column while the model needs floats, so train_data[features] cannot be used directly.
The role of get_dummies() is now clear: it converts these string columns into numeric ones. As the printout below shows, the Sex column is split into two columns, Sex_female and Sex_male, whose values are 0 and 1 respectively.
Pclass SibSp Parch Sex_female Sex_male
0 3 1 0 0 1
1 1 1 0 1 0
2 3 0 0 1 0
3 1 1 0 1 0
4 3 0 0 0 1
.. ... ... ... ... ...
886 2 0 0 0 1
887 1 0 0 1 0
888 3 1 2 1 0
889 1 0 0 0 1
890 3 0 0 0 1
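One thing to watch (my own note, not from the tutorial): when you later encode test.csv the same way, the dummy columns must line up with the training columns. A tiny sketch with toy frames:

```python
import pandas as pd

# Toy stand-ins; in practice the same encoding is applied to test.csv
train = pd.DataFrame({"Sex": ["male", "female", "female"]})
test = pd.DataFrame({"Sex": ["male", "male"]})  # note: no females here

X_train = pd.get_dummies(train)
# Reindex so the test columns line up with training, filling missing ones with 0
X_test = pd.get_dummies(test).reindex(columns=X_train.columns, fill_value=0)
print(list(X_test.columns))
```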
RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
- What do these parameters mean?
- n_estimators: the number of decision trees
- max_depth: the maximum depth of each tree
- random_state: controls the random number generation (honestly I don't fully understand this one; isn't the sampling supposed to be random?). It may need to work together with other parameters such as shuffle. One reference says it pins down the randomness in the algorithm so that multiple runs produce the same result:
- To make a randomized algorithm deterministic (i.e. running it multiple times will produce the same result), an arbitrary integer random_state can be used
For details on how to tune these parameters, see the parameter tuning guidelines.
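That determinism claim is easy to check with a small sketch (toy data; parameters illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=100, n_features=4, random_state=0)

# Same random_state -> the two forests are built identically,
# so repeated runs give the same predictions
a = RandomForestClassifier(n_estimators=10, random_state=1).fit(X, y)
b = RandomForestClassifier(n_estimators=10, random_state=1).fit(X, y)
print((a.predict(X) == b.predict(X)).all())
```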
Application scenarios of Random Forest
Since it is a classification algorithm, it naturally fits many classification scenarios; beyond that, it can also be applied to regression problems.
The article The Random Forest Algorithm: A Complete Guide - Built In gives a practical analogy:
- You are deciding where to travel, so you ask a friend
- The friend asks what you liked and disliked about your previous trips
- and gives some suggestions on that basis
- which become material for your decision
- You repeat the same steps with another friend
- and another, and another
- ...
Likewise, when you have several job offers and hesitate over which to take, or have seen several apartments and must decide, it seems you could give this method a try.
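The "democratic decision" in this analogy boils down to a majority vote, which can be sketched in a few lines (the destinations are made up):

```python
from collections import Counter

# Each "friend" (or each tree in the forest) casts one vote
votes = ["Paris", "Rome", "Paris", "Lisbon", "Paris"]

# The democratic decision: the option with the most votes wins
decision = Counter(votes).most_common(1)[0][0]
print(decision)  # Paris
```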
Code I was not familiar with before
- pandas.DataFrame.head: returns the first n rows of the dataset; the parameter n defaults to 5
test_data = pd.read_csv("/kaggle/input/titanic/test.csv")
test_data.head()
men = train_data.loc[train_data.Sex == 'male']["Survived"]
rate_men = sum(men)/len(men)
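The .loc snippet above selects rows with a boolean mask. A self-contained sketch with a toy stand-in for train_data (the real file lives under /kaggle/input/titanic/):

```python
import pandas as pd

# Toy stand-in for train_data; values are made up for illustration
train_data = pd.DataFrame({
    "Sex": ["male", "female", "male", "female"],
    "Survived": [0, 1, 1, 1],
})

# Boolean indexing with .loc: keep the male rows, then take their Survived column
men = train_data.loc[train_data.Sex == "male"]["Survived"]
rate_men = sum(men) / len(men)
print(rate_men)  # 0.5
```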
Reference
- How to get started with ensemble learning? - Zhihu
- Ensemble Learning
- The Random Forest Algorithm: A Complete Guide - Built In
This article is published to multiple blog platforms via OpenWrite!