Performance comparison of random forest algorithm implementations

Random forest is a commonly used machine learning algorithm that can be applied to both classification and regression problems. This article compares the random forest implementations of four platforms: scikit-learn, Spark MLlib, DolphinDB, and xgboost. The evaluation metrics are memory usage, running speed, and classification accuracy. The test uses simulated data as input for binary classification training, and then uses the resulting model to predict a simulated test set.

1. Test software

The platform versions used in this test are as follows:

scikit-learn: Python 3.7.1, scikit-learn 0.20.2

Spark MLlib: Spark 2.0.2, Hadoop 2.7.2

DolphinDB: 0.82

xgboost: Python package, 0.81

2. Environment configuration

CPU: Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz (24 cores and 48 threads in total)

RAM: 512GB

Operating system: CentOS Linux release 7.5.1804

On each platform, the data is loaded into memory before computation, so disk I/O has no effect on the measured performance of the random forest algorithm.

3. Data generation

This test uses a DolphinDB script to generate the simulated data and export it as CSV files. The training set is equally divided into two classes. The feature columns of the two classes follow multivariate normal distributions with different centers, the same standard deviation, and pairwise-independent components: N(0, 1) and N(2/sqrt(20), 1). There are no null values in the training set.

Suppose the training set has n rows and p columns. In this test, n takes the values 10,000, 100,000, and 1,000,000, and p is 50.

Since the test set and the training set are independent and identically distributed, the size of the test set has no significant effect on the evaluation of model accuracy. This test uses 1,000 rows of simulated data as the test set for all training sets of different sizes.
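For reference, the following is a minimal NumPy/pandas sketch of the same data generation scheme. It is only illustrative: the data actually used in the test was produced by the DolphinDB script in Appendix 1, and the gen_norm_data function and random seed handling here are assumptions.

import numpy as np
import pandas as pd

def gen_norm_data(n_rows, n_cols=50, shift=2/np.sqrt(20), stdev=1.0, seed=None):
    # class labels 0/1 with equal probability; features are pairwise independent,
    # N(0, 1) for class 0 and N(2/sqrt(20), 1) for class 1
    rng = np.random.default_rng(seed)
    cls = rng.integers(0, 2, size=n_rows)
    x = rng.normal(cls[:, None] * shift, stdev, (n_rows, n_cols))
    df = pd.DataFrame(x, columns=["col%d" % i for i in range(n_cols)])
    df.insert(0, "cls", cls)
    return df

gen_norm_data(10_000).to_csv("t10k.csv", index=False)   # training set (also 100,000 and 1,000,000 rows)
gen_norm_data(1_000).to_csv("t1000.csv", index=False)   # 1,000-row test set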

See Appendix 1 for the DolphinDB script that generates simulation data.

4. Model parameters

The following parameters are used for random forest training on each platform (a parameter-mapping sketch follows the list):

  • Number of trees: 500
  • Maximum depth: maximum depths of 10 and 30 were tested on all 4 platforms
  • Number of features considered when splitting a node: the square root of the total number of features, i.e. integer(sqrt(50)) = 7
  • Impurity measure when splitting a node: Gini index; this parameter only applies to Python scikit-learn, Spark MLlib, and DolphinDB
  • Number of sampling buckets: 32; this parameter only applies to Spark MLlib and DolphinDB
  • Number of concurrent tasks: the number of CPU threads, i.e. 48 for Python scikit-learn, Spark MLlib, and DolphinDB, and 24 for xgboost.
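As a reference, the sketch below shows how these shared settings map onto the parameter names of scikit-learn and xgboost (Spark MLlib and DolphinDB take the equivalent arguments through their own APIs; see Appendices 3 and 4). The sklearn_params and xgboost_params dictionaries are illustrative placeholders, not complete configurations.

sklearn_params = dict(
    n_estimators=500,                 # number of trees
    max_depth=10,                     # or 30
    max_features="sqrt",              # int(sqrt(50)) = 7 features per split
    criterion="gini",                 # impurity measure
    n_jobs=48,                        # number of concurrent tasks
)

xgboost_params = dict(
    num_parallel_tree=500,            # grow 500 trees within one boosting round
    max_depth=10,                     # or 30
    colsample_bylevel=1 / 50 ** 0.5,  # roughly sqrt(p)/p of the features per level
    objective='binary:logistic',
    nthread=24,                       # see the discussion of nthread below
)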

When testing xgboost, different values of the nthread parameter (the number of concurrent threads at runtime) were tried. When the parameter was set to the number of threads available in this test environment (48), performance was not ideal. Further observation showed that with fewer than 10 threads, performance is positively correlated with the thread count; between 10 and 24 threads, the differences are not obvious; beyond that, performance decreases as the thread count increases. This phenomenon has also been discussed in the xgboost community. Therefore, xgboost was run with 24 threads in this test.
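A hedged sketch of such a sweep is shown below; it assumes the load_csv helper and training files from Appendix 5, and the specific thread counts tried here are an assumption.

import numpy as np
import xgboost as xgb
from time import time

dtrain = load_csv("t100k.csv")    # load_csv as defined in Appendix 5
base_param = {'num_parallel_tree': 500, 'max_depth': 10,
              'objective': 'binary:logistic', 'colsample_bylevel': 1/np.sqrt(50)}

for nthread in (1, 4, 8, 10, 16, 24, 32, 48):
    param = dict(base_param, nthread=nthread)
    start = time()
    xgb.train(param, dtrain, 1)    # one boosting round = one full random forest
    print("nthread=%d: %.3f seconds" % (nthread, time() - start))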

5. Test results

See Appendices 2 to 5 for the test scripts.

When the number of trees is 500 and the maximum depth is 10, the test results are shown in the following table:

When the number of trees is 500 and the maximum depth is 30, the test results are shown in the following table:

In terms of accuracy, Python scikit-learn, Spark MLlib, and DolphinDB are similar, and all are slightly higher than xgboost. In terms of performance, from high to low, the order is DolphinDB, Python scikit-learn, xgboost, Spark MLlib.

In this test, the implementation of Python scikit-learn uses all CPU cores.

The Spark MLlib implementation does not make full use of all CPU cores and has the highest memory usage. With 10,000 rows of data, its peak CPU utilization is about 8%; with 100,000 rows, about 25%; with 1,000,000 rows, execution is interrupted due to insufficient memory.

The DolphinDB implementation uses all CPU cores and is the fastest of all implementations, but its memory footprint is 2-7 times that of scikit-learn and 3-9 times that of xgboost. DolphinDB's random forest implementation provides the numJobs parameter, which can be lowered to reduce the degree of parallelism and thereby the memory usage. For details, please refer to the DolphinDB user manual.

xgboost is mainly used for training boosted trees, but it can also run the random forest algorithm as a special case in which the number of boosting iterations is 1. xgboost reaches its best performance at around 24 threads; its utilization of CPU threads is not as good as scikit-learn and DolphinDB, and its speed is not as good as either. Its advantage is the lowest memory usage. In addition, its implementation differs from the other platforms in some details; for example, it has no bootstrap process and uses sampling without replacement instead of sampling with replacement. This may explain why its accuracy is slightly lower than that of the other platforms.

6. Summary

Python scikit-learn's random forest implementation strikes a relatively good balance among performance, memory overhead, and accuracy. The Spark MLlib implementation is far inferior to the other platforms in both performance and memory overhead. DolphinDB's random forest algorithm achieves the best performance and is seamlessly integrated with the database: users can train and predict directly on data in the database, and the numJobs parameter allows a trade-off between memory and speed. The random forest of xgboost is only a special case with one boosting iteration, and its implementation differs considerably from the other platforms; its best application scenario remains boosted trees.

Appendix

1. DolphinDB script for generating simulated data

// generate a vector of n normal random values with mean cls * a and standard deviation stdev
def genNormVec(cls, a, stdev, n) {
	return norm(cls * a, stdev, n)
}

// generate a simulated table with one label column `cls and colSize feature columns
def genNormData(dataSize, colSize, clsNum, scale, stdev) {
	t = table(dataSize:0, `cls join ("col" + string(0..(colSize-1))), INT join take(DOUBLE,colSize))
	classStat = groupby(count, 1..dataSize, rand(clsNum, dataSize))    // row count per class
	for (row in classStat) {
		cls = row.groupingKey
		classSize = row.count
		cols = [take(cls, classSize)]
		for (i in 0:colSize)
			cols.append!(genNormVec(cls, scale, stdev, classSize))
		insert into t values (cols)
		cols = NULL
	}
	return t
}

colSize = 50
clsNum = 2
t10k = genNormData(10000, colSize, clsNum, 2 / sqrt(20), 1.0)
saveText(t10k, "t10k.csv")
t100k = genNormData(100000, colSize, clsNum, 2 / sqrt(20), 1.0)
saveText(t100k, "t100k.csv")
t1m = genNormData(1000000, colSize, clsNum, 2 / sqrt(20), 1.0)
saveText(t1m, "t1m.csv")
t1000 = genNormData(1000, colSize, clsNum, 2 / sqrt(20), 1.0)
saveText(t1000, "t1000.csv")

 

2. Python scikit-learn training and prediction script

import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from time import time

test_df = pd.read_csv("t1000.csv")

def evaluate(path, model_name, num_trees=500, depth=30, num_jobs=1):
    df = pd.read_csv(path)
    y = df.values[:,0]
    x = df.values[:,1:]

    test_y = test_df.values[:,0]
    test_x = test_df.values[:,1:]

    rf = RandomForestClassifier(n_estimators=num_trees, max_depth=depth, n_jobs=num_jobs)
    start = time()
    rf.fit(x, y)
    end = time()
    elapsed = end - start
    print("Time to train model %s: %.9f seconds" % (model_name, elapsed))

    acc = np.mean(test_y == rf.predict(test_x))
    print("Model %s accuracy: %.3f" % (model_name, acc))

evaluate("t10k.csv", "10k", 500, 10, 48)    # choose your own parameter

 

3. Training and prediction code of Spark MLlib (implemented in Scala)

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.configuration.FeatureType.Continuous
import org.apache.spark.mllib.tree.model.{DecisionTreeModel, Node}
import org.apache.spark.mllib.evaluation.MulticlassMetrics

object Rf {
  def main(args: Array[String]) = {
    evaluate("/t100k.csv", 500, 10)    // choose your own parameter 
  }

  def processCsv(row: Row) = {
    val label = row.getString(0).toDouble
    val featureArray = (for (i <- 1 to (row.size-1)) yield row.getString(i).toDouble).toArray
    val features = Vectors.dense(featureArray)
    LabeledPoint(label, features)
  }

  def evaluate(path: String, numTrees: Int, maxDepth: Int) = {
    val spark = SparkSession.builder.appName("Rf").getOrCreate()
    import spark.implicits._

    val numClasses = 2
    val categoricalFeaturesInfo = Map[Int, Int]()
    val featureSubsetStrategy = "sqrt" 
    val impurity = "gini"
    val maxBins = 32

    val d_test = spark.read.format("CSV").option("header","true").load("/t1000.csv").map(processCsv).rdd
    d_test.cache()

    println("Loading table (1M * 50)")
    val d_train = spark.read.format("CSV").option("header","true").load(path).map(processCsv).rdd
    d_train.cache()
    println("Training table (1M * 50)")
    val now = System.nanoTime
    val model = RandomForest.trainClassifier(d_train, numClasses, categoricalFeaturesInfo,
      numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins)
    println(( System.nanoTime - now )/1e9)

    val scoreAndLabels = d_test.map { point =>
      val score = model.trees.map(tree => softPredict2(tree, point.features)).sum
      if (score * 2 > model.numTrees)
        (1.0, point.label)
      else
        (0.0, point.label)
    }
    val metrics = new MulticlassMetrics(scoreAndLabels)
    println(metrics.accuracy)
  }

  // traverse a single tree manually so that per-tree votes can be summed into a soft score
  def softPredict(node: Node, features: Vector): Double = {
    if (node.isLeaf) {
      //if (node.predict.predict == 1.0) node.predict.prob else 1.0 - node.predict.prob
      node.predict.predict
    } else {
      if (node.split.get.featureType == Continuous) {
        if (features(node.split.get.feature) <= node.split.get.threshold) {
          softPredict(node.leftNode.get, features)
        } else {
          softPredict(node.rightNode.get, features)
        }
      } else {
        if (node.split.get.categories.contains(features(node.split.get.feature))) {
          softPredict(node.leftNode.get, features)
        } else {
          softPredict(node.rightNode.get, features)
        }
      }
    }
  }
  def softPredict2(dt: DecisionTreeModel, features: Vector): Double = {
    softPredict(dt.topNode, features)
  }
}

 

4. DolphinDB training and prediction script

// split an in-memory table into a SEQ-partitioned table with seqSize partitions
def createInMemorySEQTable(t, seqSize) {
	db = database("", SEQ, seqSize)
	dataSize = t.size()
	ts = ()
	for (i in 0:seqSize) {
		ts.append!(t[(i * (dataSize/seqSize)):((i+1)*(dataSize/seqSize))])
	}
	return db.createPartitionedTable(ts, `tb)
}

// fraction of predictions that match the true labels
def accuracy(v1, v2) {
	return (v1 == v2).sum() \ v2.size()
}

def evaluateUnpartitioned(filePath, numTrees, maxDepth, numJobs) {
	test = loadText("t1000.csv")
	t = loadText(filePath); clsNum = 2; colSize = 50
	timer res = randomForestClassifier(sqlDS(<select * from t>), `cls, `col + string(0..(colSize-1)), clsNum, sqrt(colSize).int(), numTrees, 32, maxDepth, 0.0, numJobs)
	print("Unpartitioned table accuracy = " + accuracy(res.predict(test), test.cls).string())
}

evaluateUnpartitioned("t10k.csv", 500, 10, 48)    // choose your own parameter

 

5. xgboost training and prediction script

import pandas as pd
import numpy as np
import xgboost as xgb
from time import time

def load_csv(path):
    df = pd.read_csv(path)
    target = df['cls']
    df = df.drop(['cls'], axis=1)
    return xgb.DMatrix(df.values, label=target.values)

dtest = load_csv('/hdd/hdd1/twonormData/t1000.csv')

def evaluate(path, num_trees, max_depth, num_jobs):
    dtrain = load_csv(path)
    param = {'num_parallel_tree':num_trees, 'max_depth':max_depth, 'objective':'binary:logistic',
        'nthread':num_jobs, 'colsample_bylevel':1/np.sqrt(50)}
    start = time()
    model = xgb.train(param, dtrain, 1)    # a single boosting round grows all num_parallel_tree trees
    end = time()
    elapsed = end - start
    print("Time to train model: %.9f seconds" % elapsed)
    prediction = model.predict(dtest) > 0.5
    print("Accuracy = %.3f" % np.mean(prediction == dtest.get_label()))

evaluate('t10k.csv', 500, 10, 24)    # choose your own parameters
