[Repost] RWeka: open-source machine learning

Background:
1) Weka:

Weka has two meanings: a flightless bird, and an open-source machine learning project (Waikato Environment for Knowledge Analysis, http://www.cs.waikato.ac.nz/~ml/weka/). Of course, it is the second meaning we introduce here. The Weka project began in 1992, supported by the government of New Zealand, and is now famous in the field of machine learning. Weka offers a very comprehensive set of machine learning algorithms, covering data preprocessing, classification, regression, clustering, and association rules. Weka's graphical interface is very convenient for people who cannot write programs, and its "KnowledgeFlow" feature allows multiple steps to be chained into a workflow. In addition, Weka can also be driven from the command line.
2) R
As for R, I won't waste words, heh: it is the popular statistical software (http://www.r-project.org/).
3) R and Weka:
R has plenty of machine learning functions and packages, but what Weka provides is more comprehensive and focused, so I sometimes need to use Weka. I used to combine R and Weka like this:
prepare the training data in R (e.g. extract data features ......);
organize it into the format Weka requires (*.arff);
do the machine learning in Weka (e.g. feature selection, classification ......);
compute statistics from Weka's prediction results (if needed: sensitivity, specificity, MCC ......).
Shuttling back and forth between the two programs was very troublesome; being lazy, I never learned the Weka command line and only used the graphical interface, which suffered badly with large amounts of data and sometimes ran out of memory. I have now discovered RWeka, a package that provides an interface between R and Weka, which makes things much more convenient. Here are some RWeka functions:
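Before diving into the function list, here is how the old four-step workflow collapses into a few lines of R. This is only a sketch (it assumes RWeka and its Java dependency are installed; the built-in iris data stands in for real feature data):

```r
library(RWeka)  # requires a working Java installation

## 1) prepare the training data in R (here: the built-in iris data)
train <- iris

## 2) no *.arff round trip needed -- RWeka works on R data frames directly

## 3) do the machine learning: a C4.5 decision tree via J48()
model <- J48(Species ~ ., data = train)

## 4) compute statistics from the predictions, e.g. a confusion matrix
pred <- predict(model, newdata = train)
print(table(observed = train$Species, predicted = pred))
```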
RWeka ( http://cran.r-project.org/web/packages/RWeka/index.html ):
1) Data input and output
WOW(): lists the parameters (options) of a Weka function.
Weka_control(): sets the parameters of a Weka function.
read.arff(): reads data in Weka's Attribute-Relation File Format (ARFF).
write.arff(): writes data to a file in Weka's Attribute-Relation File Format (ARFF).
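For example, a round trip through an ARFF file, plus looking up and setting a learner's options, might look like this (the file name is only illustrative):

```r
library(RWeka)

## write an R data frame to an ARFF file, then read it back
write.arff(iris, file = "iris.arff")
dat <- read.arff("iris.arff")

## WOW() shows the Weka options a learner accepts,
## which can then be set via Weka_control()
WOW("J48")
m <- J48(Species ~ ., data = dat,
         control = Weka_control(M = 5))  # at least 5 instances per leaf
```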
2) Data preprocessing
Normalize(): unsupervised normalization of continuous data.
Discretize(): supervised discretization of continuous numeric data using the MDL (Minimum Description Length) method.
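A small sketch of both filters on the iris data (Normalize is unsupervised, so no class is needed in its formula; Discretize is supervised and uses the class attribute):

```r
library(RWeka)

## unsupervised: rescale the numeric attributes to [0, 1]
iris_norm <- Normalize(~ ., data = iris)

## supervised: MDL-based discretization of the numeric attributes,
## guided by the class attribute Species
iris_disc <- Discretize(Species ~ ., data = iris)
summary(iris_disc)
```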
3) Classification and regression
IBk(): k-nearest-neighbour classifier.
LBR(): Lazy Bayesian Rules, a naive-Bayes-based classifier.
J48(): the C4.5 decision tree algorithm (the tree considers attributes one at a time, independently of each other).
LMT(): logistic model trees, combining tree learning with logistic regression; each leaf node is a logistic regression model, and the accuracy is better than either a single tree or logistic regression alone.
M5P(): the M5 model tree algorithm, a combination of trees and linear regression; each leaf node is a linear regression model, so it can be used for regression on continuous data.
DecisionStump(): a one-level decision tree, often used as the base learner for boosting.
SMO(): support vector machine.
AdaBoostM1(): the AdaBoost M1 method; the -W parameter specifies the weak learner's algorithm.
Bagging(): builds multiple models by sampling with replacement from the original data.
LogitBoost(): boosting whose weak learner is based on logistic regression (additive logistic regression).
MultiBoostAB(): an improvement on AdaBoost that combines AdaBoost with "wagging".
Stacking(): an ensemble method for combining different base classification algorithms.
LinearRegression(): builds a suitable linear regression model.
Logistic(): builds a logistic regression model.
JRip(): a rule learner.
M5Rules(): generates decision rules with the M5 regression method.
OneR(): the simple 1-R classifier.
PART(): generates PART decision rules.
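These learners all share the same formula interface, and boosting wrappers take their weak learner via Weka_control(). A sketch with three of them on iris:

```r
library(RWeka)

## k-nearest neighbours with k = 3
knn <- IBk(Species ~ ., data = iris, control = Weka_control(K = 3))

## C4.5 decision tree; print the tree structure
tree <- J48(Species ~ ., data = iris)
print(tree)

## boost the one-level DecisionStump with AdaBoostM1;
## -W names the weak learner, as described above
boost <- AdaBoostM1(Species ~ ., data = iris,
                    control = Weka_control(W = "DecisionStump"))
```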
4) Clustering
Cobweb(): a model-based method; it assumes a model for each cluster and finds the data that fit the corresponding model. Not suitable for clustering large databases.
FarthestFirst(): a fast, approximate k-means clustering algorithm.
SimpleKMeans(): the k-means clustering algorithm.
XMeans(): an improved k-means method that can determine the number of clusters automatically.
DBScan(): a density-based clustering method; it grows clusters according to the density of the surrounding objects, and can find clusters of arbitrary shape in spatial databases containing noise. This method defines a cluster as a maximal set of density-connected points.
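The clusterers take a data frame of attributes (without the class column). A sketch with SimpleKMeans:

```r
library(RWeka)

## k-means with 3 clusters on the four numeric iris attributes
cl <- SimpleKMeans(iris[, -5], control = Weka_control(N = 3))
print(cl)

## assign cluster labels to data and compare with the known species
labels <- predict(cl, newdata = iris[, -5])
table(labels, iris$Species)
```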
5) Association rules
Apriori(): Apriori is the most influential basic algorithm in the field of association rules; it is a breadth-first algorithm that obtains the frequent itemsets whose support exceeds the minimum support through repeated scans of the database. It rests on two monotonicity principles for itemsets: any subset of a frequent itemset must be frequent, and any superset of an infrequent itemset must be infrequent. On massive data, the time and space costs of the Apriori algorithm are very high.
Tertius(): the Tertius algorithm.
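Apriori() expects a data frame whose columns are all nominal (factors), so discretized data works well. A minimal sketch:

```r
library(RWeka)

## Apriori needs nominal attributes, so discretize iris first
iris_disc <- Discretize(Species ~ ., data = iris)

## mine association rules with the default support/confidence settings
rules <- Apriori(iris_disc)
print(rules)

## options such as the lower bound on minimum support can be set
## via Weka_control(), e.g.:
## Apriori(iris_disc, control = Weka_control(M = 0.2))
```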
6) Evaluation and prediction:
predict(): predicts the class or cluster of new data from a fitted model.
table(): cross-tabulates two factor objects, e.g. predicted versus observed classes.
evaluate_Weka_classifier(): evaluates a model's performance, e.g. TP rate, FP rate, precision, recall, F-measure.
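For example, the statistics I used to compute by hand can come straight from a 10-fold cross-validation:

```r
library(RWeka)

model <- J48(Species ~ ., data = iris)

## 10-fold cross-validation; reports TP/FP rate, precision,
## recall, F-measure and a confusion matrix per class
ev <- evaluate_Weka_classifier(model, numFolds = 10,
                               complexity = FALSE, class = TRUE)
print(ev)
```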

Reproduced from: https://www.cnblogs.com/caleb/archive/2011/05/03/2035583.html


Origin blog.csdn.net/weixin_34248023/article/details/93174949