(3) Trading with the Random Forest Algorithm in Python

Author: chen_h
WeChat & QQ: 862251340
WeChat public account: coderpai


(1) Getting Started with Ensemble Learning in Machine Learning

(2) The Bagging Method

(3) Trading with the Random Forest Algorithm in Python


In this article, we will discuss what random forests are, how they work, and how they help overcome the limitations of decision trees.

With machine learning flourishing in the current environment, more and more algorithms are finding applications in a variety of fields. Different machine learning algorithms work in different ways, and one algorithm may suit a given problem better than another. Machine learning algorithms are constantly updated and upgraded to expand their scope of application and minimize their disadvantages.

The random forest algorithm aims to overcome the limitations of the decision tree, and we will introduce it in this blog. Before turning to random forests, let us briefly look at what a decision tree is and how it works.

What is a decision tree?

As its name suggests, a decision tree has a hierarchical, tree-like structure in which the branches act as nodes. We can reach a decision by traversing the nodes, which are selected according to the features of the data.
However, decision trees suffer from overfitting. Overfitting usually happens as more and more nodes are added to increase the tree's specificity until it reaches a conclusion, which increases the depth of the tree and makes it more complex.
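As a rough illustration of this effect (not part of the trading code below; the synthetic data from sklearn's make_classification and the parameter values are assumptions chosen purely for demonstration), the following minimal sketch compares an unconstrained decision tree with one whose depth is limited:

# A minimal sketch: an unconstrained tree tends to memorize the training
# data, while limiting max_depth keeps the tree simpler and less overfit.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=5)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=5)

deep = DecisionTreeClassifier(random_state=5).fit(X_tr, y_tr)
shallow = DecisionTreeClassifier(max_depth=3, random_state=5).fit(X_tr, y_tr)

print('Deep tree    train/test accuracy:', deep.score(X_tr, y_tr), deep.score(X_te, y_te))
print('Shallow tree train/test accuracy:', shallow.score(X_tr, y_tr), shallow.score(X_te, y_te))

Typically, the deep tree scores close to 100% on the training data but noticeably worse on the test data, which is exactly the overfitting described above.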

Later in this blog, we will see how random forests help overcome this shortcoming of decision trees.

What is a random forest?

A random forest is a supervised machine learning algorithm for classification that uses an ensemble approach. In short, a random forest consists of a large number of decision trees, which together help solve the problem of overfitting. The decision trees are constructed from features randomly selected from the given data set.
A random forest bases its decision or prediction on the maximum number of votes received from its trees: the result returned by the largest number of trees is taken as the final result.

How a random forest works

A random forest is based on the ensemble learning technique, which simply means a combination or collection; in this case, a collection of decision trees that together are referred to as a random forest. The accuracy of an ensemble model is better than the accuracy of a single model, because it aggregates the results of the individual models to produce a final result, as the short comparison sketch below illustrates.
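Here is a minimal sketch of that claim, again on synthetic make_classification data (an assumption for illustration only; actual numbers will vary with the data and parameters):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=5)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=5)

# A single decision tree versus an ensemble of 100 trees on the same data
tree = DecisionTreeClassifier(random_state=5).fit(X_tr, y_tr)
forest = RandomForestClassifier(n_estimators=100, random_state=5).fit(X_tr, y_tr)

print('Single tree test accuracy:  ', tree.score(X_te, y_te))
print('Random forest test accuracy:', forest.score(X_te, y_te))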
So how are features selected from the data set to construct the decision trees of a random forest?

We use a method called bagging to randomly select the features. From the set of features available in the data set, a number of random training subsets are created by selecting features with replacement. This means that a given feature can be repeated across different training subsets.

For example, if the data set contains 20 features and subsets of five features are to be selected to build each decision tree, then these five features are selected at random, and any feature can be part of more than one subset. This ensures randomness, making the correlation between the trees smaller and thereby overcoming the problem of overfitting.
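To make this concrete, here is a toy numpy sketch of drawing random five-feature subsets from 20 features (the exact sampling scheme differs between implementations; this is purely illustrative):

import numpy as np

rng = np.random.default_rng(5)
n_features, subset_size, n_trees = 20, 5, 4

# Each tree gets its own random subset of features; the same feature
# can show up in the subsets of several different trees.
for tree_id in range(n_trees):
    subset = rng.choice(n_features, size=subset_size, replace=False)
    print('Tree', tree_id, 'uses features:', sorted(subset.tolist()))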

After the features are selected, the trees are constructed based on the best split. Each tree gives an output, which is considered a 'vote' from that tree for that output.

The output receiving the maximum number of 'votes' is selected by the random forest as the final output/result; in the case of continuous variables, the average of all the outputs is taken as the final output.
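The sketch below peeks at the individual trees of a small scikit-learn forest via its estimators_ attribute to show the 'voting' idea (note that scikit-learn actually averages class probabilities rather than counting hard votes, so this is a conceptual illustration on assumed synthetic data):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=20, random_state=5)
forest = RandomForestClassifier(n_estimators=5, random_state=5).fit(X, y)

sample = X[:1]  # a single observation
# Each fitted tree casts a 'vote' for a class on this observation
votes = [int(tree.predict(sample)[0]) for tree in forest.estimators_]
print('Votes from the individual trees:', votes)
print('Majority vote:', np.bincount(votes).argmax())
print('Forest prediction:', forest.predict(sample)[0])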

[Figure: each decision tree votes for a class, and the class with the majority of votes becomes the random forest's final output]

For example, in the figure above, we can observe that each tree has voted for, or predicted, a specific class. The final output selected by the random forest will be class N, because it received the majority of the votes, being the predicted output of two of the four decision trees.

Random forest Python code

In this code, we will create a random forest classifier, train it, and compute the daily returns of the resulting trading strategy.

Import the libraries
import quantrautil as q
import numpy as np
from sklearn.ensemble import RandomForestClassifier

The libraries imported above are used as follows:

  • quantrautil: used to fetch the BAC stock price data from Yahoo Finance;
  • numpy: used to perform data operations on the BAC stock price to compute the input features and the output. If you want to learn more about numpy, you can visit the official numpy website;
  • sklearn: sklearn is a freely usable collection of tools and many machine learning models;
  • RandomForestClassifier: used to create the random forest classifier model.
Retrieve the data

The next step is to fetch the BAC stock price data using quantrautil. The get_data function of quantrautil is used to obtain the BAC data from January 1, 2000 to January 31, 2019, as follows:

data = q.get_data('BAC','2000-1-1','2019-2-1')
print(data.tail())
[*********************100%***********************]  1 of 1 downloaded
             Open   High    Low  Close  Adj Close     Volume Source
Date                                                               
2019-01-25  29.28  29.72  29.14  29.58      29.27   72182100  Yahoo
2019-01-28  29.32  29.67  29.29  29.63      29.32   59963800  Yahoo
2019-01-29  29.54  29.70  29.34  29.39      29.08   51451900  Yahoo
2019-01-30  29.42  29.47  28.95  29.07      28.77   66475800  Yahoo
2019-01-31  28.75  28.84  27.98  28.47      28.17  100201200  Yahoo
Create input and output data

In this step, I will create input and output variables.

Input variables: I use (Open - Close)/Open, (High - Low)/Low, the standard deviation of the past 5 days' returns (std_5), and the average of the past 5 days' returns (ret_5).

Output variable: if tomorrow's closing price is greater than today's closing price, the output variable is set to 1; otherwise it is set to -1. 1 means buy the stock, and -1 means sell the stock.

The input and output features here are chosen essentially at random; if you are interested, you can view the Wiki.

# Features construction 
data['Open-Close'] = (data.Open - data.Close)/data.Open
data['High-Low'] = (data.High - data.Low)/data.Low
data['percent_change'] = data['Adj Close'].pct_change()
data['std_5'] = data['percent_change'].rolling(5).std()
data['ret_5'] = data['percent_change'].rolling(5).mean()
data.dropna(inplace=True)
# X is the input variable
X = data[['Open-Close', 'High-Low', 'std_5', 'ret_5']]

# y is the target or output variable
y = np.where(data['Adj Close'].shift(-1) > data['Adj Close'], 1, -1)


Split the data into training and test sets

We will now split the data set into 75% training data and 25% test data.

# Total dataset length
dataset_length = data.shape[0]

# Training dataset length
split = int(dataset_length * 0.75)
split
3597

# Splitting the X and y into train and test datasets
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

# Print the size of the train and test dataset
print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)
(3597, 4) (1199, 4)
(3597,) (1199,)
Train the machine learning model

All the data is set up! Let us now train a random forest classifier model. The RandomForestClassifier is stored in the variable clf, and its fit method is then called with the X_train and y_train data sets, so that the classifier model learns the relationship between the inputs and the output.

clf = RandomForestClassifier(random_state=5)
# Create the model on train dataset
model = clf.fit(X_train, y_train)
from sklearn.metrics import accuracy_score
print('Correct Prediction (%): ', accuracy_score(y_test, model.predict(X_test), normalize=True)*100.0)
Correct Prediction (%):  50.29190992493745
# Run the code to view the classification report metrics
from sklearn.metrics import classification_report
report = classification_report(y_test, model.predict(X_test))
print(report)
             precision    recall  f1-score   support

         -1       0.50      0.61      0.55       594
          1       0.51      0.40      0.45       605

avg / total       0.50      0.50      0.50      1199
Strategy returns

The strategy's daily return is the next day's percentage change multiplied by the model's predicted signal (1 or -1):
data['strategy_returns'] = data.percent_change.shift(-1) * model.predict(X)
Histogram of daily returns
%matplotlib inline
import matplotlib.pyplot as plt
data.strategy_returns[split:].hist()
plt.xlabel('Strategy returns (%)')
plt.show()

[Figure: histogram of the strategy's daily returns on the test set]

Cumulative strategy returns
(data.strategy_returns[split:]+1).cumprod().plot()
plt.ylabel('Strategy returns (%)')
plt.show()

[Figure: cumulative strategy returns on the test set]

The figures above show the daily returns and the cumulative returns of the strategy based on the random forest classifier.

Advantages
  • Avoids overfitting
  • Can be used for both classification and regression
  • Can handle missing values
Disadvantages
  • A large number of trees takes up a lot of space and a lot of time

In this blog, we learned how to use a random forest to write a simple trading strategy.
