Weka Learning and Development

Copyright reserved by the author. Please indicate the source when reproducing in any form: https://blog.csdn.net/xiaokui9/article/details/89185464

 

Reading data from files and databases

Reference links 1 and 2

The most commonly used components are:

- Instances: your data

- Filter: preprocessing the data

- Classifier / Clusterer: built on the preprocessed data to classify / cluster it

- Evaluating: evaluating the classifier / clusterer

- Attribute selection: removing irrelevant attributes from the data

The following code can be used for testing:

import java.io.BufferedReader;
import java.io.FileReader;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.experiment.InstanceQuery;

    public static Instances getInstances(String filePath) {
        try {
            filePath = "C:\\Weka-3-8\\data\\iris.arff";

            // Works in versions 3.5.5 and 3.4.x
            Instances data = new Instances( new BufferedReader( new FileReader(filePath) ) );
            // setting class attribute
            // The class index indicates the target attribute used for classification. In an ARFF file it defaults to the last attribute, which is why it is set to numAttributes() - 1.
            // You must set the class index before calling a Weka function such as weka.classifiers.Classifier.buildClassifier(data).
            data.setClassIndex(data.numAttributes() - 1);
            System.out.println( "#################data:" );
            System.out.println( data );

            // Version 3.5.5 and newer
            // The DataSource class is not limited to ARFF files; it can also read CSV files and other formats (basically all file formats that Weka can import through its converters).
            DataSource source = new DataSource(filePath);
            Instances data2 = source.getDataSet();
            // setting class attribute if the data format does not provide this information
            // E.g., the XRFF format saves the class attribute information as well
            //if (data2.classIndex() == -1)
            data2.setClassIndex(data2.numAttributes() - 1);
            System.out.println(  "#################data2:"  );
            System.out.println( data2 );

            // Reading from a database
            InstanceQuery query = new InstanceQuery();
            // the database connection is configured in the DatabaseUtils.props file inside weka.jar
            query.setUsername("root");
            query.setPassword("");
            query.setQuery("select * from url_features limit 0,10");//url_features
            // if your data is sparse, then you can say so too
            // query.setSparseData(true);
            Instances data3 = query.retrieveInstances();
            // print the whole data set
            System.out.println( data3 );
            // numInstances() returns the number of instances in the data set
            for( int i = 0; i < data3.numInstances(); i++ )
            {
                // instance(i) returns the i-th instance
                System.out.println( data3.instance( i ) );
            }
            return data2;
        } catch (Exception e) {
            e.printStackTrace();
            return null;
        }
    }

For database configuration instructions, see reference link 3.
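The JDBC connection URL can also be set in code instead of editing DatabaseUtils.props inside weka.jar. A minimal sketch, assuming a local MySQL database named mydb (the URL and credentials are placeholders, and the JDBC driver jar must be on the classpath):

import weka.core.Instances;
import weka.experiment.InstanceQuery;

            InstanceQuery query = new InstanceQuery();
            query.setDatabaseURL("jdbc:mysql://localhost:3306/mydb"); // overrides the URL from DatabaseUtils.props
            query.setUsername("root");
            query.setPassword("");
            query.setQuery("select * from url_features limit 0,10");
            Instances dbData = query.retrieveInstances();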

Classifier

Reference Link 4

To classify a data set, the first step is to specify which column is the class. If this step is forgotten (which happens quite often), the error "Class index is negative (not set)!" appears. The class is set with the setClassIndex method of the Instances class; to use the last column as the class, pass the value returned by the numAttributes() method of the Instances class minus one.

            Instances m_instances = getInstances(filePath); // uses the data2 path from the code above
            J48 classifier = new J48();

            //NaiveBayes classifier2 = new NaiveBayes();
            //SMO classifier = new SMO();
            classifier.buildClassifier( m_instances );
            // print the classification results for instances 0, 60 and 110 of the data set
            System.out.println( classifier.classifyInstance( m_instances.instance( 0 ) ) );
            System.out.println( classifier.classifyInstance( m_instances.instance( 60 ) ) );
            System.out.println( classifier.classifyInstance( m_instances.instance( 110 ) ) );
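classifyInstance returns the index of the predicted class value as a double. To print the class label rather than the index, the value can be looked up in the class attribute; a minimal sketch:

            double pred = classifier.classifyInstance( m_instances.instance( 0 ) );
            // map the numeric index back to the label of the class attribute
            String label = m_instances.classAttribute().value( (int) pred );
            System.out.println( label );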

Classification and evaluation

Reference link 5

    // First initialize an Evaluation object. The Evaluation class has no no-argument constructor; an Instances object is usually passed to the constructor.
    // If there is no separate training and test set, the cross-validation method can be used.
    // The four parameters of Evaluation's crossValidateModel method are: the classifier, the data set to evaluate on, the number of cross-validation folds (10 is common), and a random number object.
    // Note that when using crossValidateModel the classifier does not need to be trained first; this should really be common knowledge.
    // Evaluation provides several output methods; if you have used the Weka GUI you will notice that each method's output corresponds to part of what the software displays. The three methods used in the example, toClassDetailsString, toSummaryString and toMatrixString, are the most commonly used.
    public static void crossValidation() throws Exception
    {
        J48 classifier = new J48();
        //NaiveBayes classifier = new NaiveBayes();
        //SMO classifier = new SMO();

        Evaluation eval = new Evaluation( m_instances );
        eval.crossValidateModel( classifier, m_instances, 10, new Random(1));
        System.out.println(eval.toClassDetailsString());
        System.out.println(eval.toSummaryString());
        System.out.println(eval.toMatrixString());
    }

    // If you have separate training and test sets, you can use the evaluateModel method of the Evaluation class.
    // Its parameters are: first a trained classifier, second the data set to evaluate on. For simplicity the training set is used again as the test set in this example; I hope this does not cause confusion.
    public static void evaluateTestData() throws Exception
    {
        J48 classifier = new J48();
        //NaiveBayes classifier = new NaiveBayes();
        //SMO classifier = new SMO();

        classifier.buildClassifier( m_instances );
        Evaluation eval = new Evaluation( m_instances );
        eval.evaluateModel( classifier, m_instances );
        System.out.println(eval.toClassDetailsString());
        System.out.println(eval.toSummaryString());
        System.out.println(eval.toMatrixString());
    }
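If a genuinely separate test set is available (for example a second ARFF file), it can be loaded in the same way as the training data and passed to evaluateModel. A minimal sketch, assuming a hypothetical file iris-test.arff with the same attributes as the training data and reusing the m_instances field from above:

    public static void evaluateSeparateTestData() throws Exception
    {
        // hypothetical test file; it must have the same header as the training data
        Instances test = new DataSource("C:\\Weka-3-8\\data\\iris-test.arff").getDataSet();
        test.setClassIndex(test.numAttributes() - 1);

        J48 classifier = new J48();
        classifier.buildClassifier( m_instances );        // train on the training set
        Evaluation eval = new Evaluation( m_instances );
        eval.evaluateModel( classifier, test );           // evaluate on the separate test set
        System.out.println(eval.toSummaryString());
    }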

The output of the evaluation methods above is:

=== Detailed Accuracy By Class ===

                 TP Rate  FP Rate  Precision  Recall   F-Measure  MCC      ROC Area  PRC Area  Class
                 0.980    0.000    1.000      0.980    0.990      0.985    0.990     0.987     Iris-setosa
                 0.940    0.030    0.940      0.940    0.940      0.910    0.952     0.880     Iris-versicolor
                 0.960    0.030    0.941      0.960    0.950      0.925    0.961     0.905     Iris-virginica
Weighted Avg.    0.960    0.020    0.960      0.960    0.960      0.940    0.968     0.924     


Correctly Classified Instances         144               96      %
Incorrectly Classified Instances         6                4      %
Kappa statistic                          0.94  
Mean absolute error                      0.035 
Root mean squared error                  0.1586
Relative absolute error                  7.8705 %
Root relative squared error             33.6353 %
Total Number of Instances              150     

=== Confusion Matrix ===

  a  b  c   <-- classified as
 49  1  0 |  a = Iris-setosa
  0 47  3 |  b = Iris-versicolor
  0  2 48 |  c = Iris-virginica

Description:

                         Actually positive        Actually negative
Predicted positive       TP (true positive)       FP (false positive)
Predicted negative       FN (false negative)      TN (true negative)

1. FN (False Negative): the sample is judged negative but is actually a positive sample.
2. FP (False Positive): the sample is judged positive but is actually a negative sample.
3. TN (True Negative): the sample is judged negative and is indeed a negative sample.
4. TP (True Positive): the sample is judged positive and is indeed a positive sample.

TP Rate: TP / (TP + FN), the proportion of all positive samples that the classifier identifies as positive.
FP Rate: FP / (FP + TN), the proportion of all negative samples that the classifier mistakenly classifies as positive.
Precision: P = TP / (TP + FP), the proportion of the samples identified as positive by the system that are truly positive.
Recall: R = TP / (TP + FN), the proportion of all truly positive samples that the system identifies.
F-Measure: the weighted harmonic mean of Precision and Recall. P and R can sometimes be contradictory, so they need to be considered together; the most common way is the F-Measure (also called F-Score):
F1 = 2 * P * R / (P + R)
MCC: the Matthews correlation coefficient, a better measure on imbalanced data sets. Among the common performance scores, MCC is the only one that correctly takes the relative sizes of the confusion-matrix cells into account. Especially on an imbalanced data set (e.g., one where positive examples make up 99.9% of the data), MCC can still tell whether the prediction is any good, which accuracy or F1 cannot.
MCC is essentially a correlation coefficient between the actual and the predicted classification. It lies in the range [-1, 1]: a value of 1 indicates a perfect prediction, a value of 0 means the prediction is no better than random, and -1 means the prediction completely disagrees with the actual classification.
The formula is:
MCC = (TP * TN - FP * FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN))


ROC Area
PRC Area

The x-axis of the ROC curve is the false positive rate (FPR) and the y-axis is the true positive rate (TPR).
The model is usually assessed by the area under the curve (AUC), which lies between 0.5 and 1.0; the larger the AUC, the better the model. Optimizing the ROC curve tends to maximize the correct classification of both positive and negative examples. The PR curve is optimized differently: it tends to maximize the correct classification of positive examples and does not directly consider how well the negatives are classified.
When the positive and negative samples are extremely unevenly distributed (highly skewed data sets), the PRC reflects the quality of a classifier more effectively than the ROC. Its x-axis is Recall and its y-axis is Precision.

Correctly Classified Instances: the number of correctly classified instances.
Incorrectly Classified Instances: the number of incorrectly classified instances.
Kappa statistic: a coefficient of inter-rater agreement (internal consistency), an important indicator of how consistent the judgments are. Values lie between 0 and 1: Kappa >= 0.75 indicates good agreement, 0.4 <= Kappa < 0.75 indicates moderate agreement, and Kappa < 0.4 indicates poor agreement.
Mean absolute error
Root mean squared error

The mean absolute error and the root mean squared error measure the difference between the classifier's predictions and the actual results; the smaller, the better.
Relative absolute error
Root relative squared error

Sometimes the absolute error does not reflect the true magnitude of the error; the relative absolute error and the root relative squared error express the error as a proportion of the true values and therefore reflect its relative size.

Confusion Matrix: the confusion matrix.
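The statistics shown above can also be read programmatically from the Evaluation object, which is convenient when the results need to be processed further; a minimal sketch using methods of weka.classifiers.Evaluation:

            // after eval.crossValidateModel(...) or eval.evaluateModel(...)
            System.out.println("Correct (%):  " + eval.pctCorrect());
            System.out.println("Kappa:        " + eval.kappa());
            System.out.println("MAE:          " + eval.meanAbsoluteError());
            System.out.println("RMSE:         " + eval.rootMeanSquaredError());
            // per-class measures take the class index, here class 0 (Iris-setosa)
            System.out.println("Precision(0): " + eval.precision(0));
            System.out.println("Recall(0):    " + eval.recall(0));
            System.out.println("F-Measure(0): " + eval.fMeasure(0));
            System.out.println("ROC Area(0):  " + eval.areaUnderROC(0));
            double[][] cm = eval.confusionMatrix(); // raw confusion matrix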

Attribute Selection

The following two pieces of code can be used for testing; the mathematical principles behind each function are not discussed in detail here. See reference links 8 and 9.

    public static void selectAtt() throws Exception
    {
        // AttributeSelection here is weka.attributeSelection.AttributeSelection
        AttributeSelection attsel = new AttributeSelection();
        CfsSubsetEval eval = new CfsSubsetEval();
        GreedyStepwise search = new GreedyStepwise();
        search.setSearchBackwards(true);
        attsel.setEvaluator(eval);
        attsel.setSearch(search);
        attsel.SelectAttributes(m_instances);

        int[] attarray = attsel.selectedAttributes();
        System.out.println("result:"+attsel.toResultsString());
        System.out.println("the selected attributes are as follows:");
        for (int i=0;i<attarray.length;i++ ){
            //System.out.println(attarray[i]);
            System.out.print(m_instances.attribute((int)attarray[i]).name()+',');
        }
    }


    public static void selectAttribute() throws Exception
    {
        // Initialize the search method and the attribute evaluator
        Ranker rank = new Ranker();
        InfoGainAttributeEval eval = new InfoGainAttributeEval();
        // 3. Evaluate each attribute with the evaluator
        eval.buildEvaluator(m_instances);
        // 4. Filter the attributes with the chosen search method
        // The Ranker algorithm used here simply sorts the attributes by the size of their InfoGain
        int[] attrIndex = rank.search(eval, m_instances);
        // 5. Print the results: the ranking of the attributes together with the InfoGain of each attribute
        StringBuffer attrIndexInfo = new StringBuffer();
        StringBuffer attrInfoGainInfo = new StringBuffer();
        attrIndexInfo.append("Selected attributes:");
        attrInfoGainInfo.append("Ranked attributes:\n");
        for (int i = 0; i < attrIndex.length; i++) {
            attrIndexInfo.append(attrIndex[i]);
            attrIndexInfo.append(",");
            attrInfoGainInfo.append(eval.evaluateAttribute(attrIndex[i]));
            attrInfoGainInfo.append("\t");
            attrInfoGainInfo.append((m_instances.attribute(attrIndex[i]).name()));
            attrInfoGainInfo.append("\n");
        }
        System.out.println(attrIndexInfo.toString());
        System.out.println(attrInfoGainInfo.toString());
    }
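The two methods above only report which attributes were selected. To actually remove the non-selected attributes from a data set, the AttributeSelection filter from weka.filters.supervised.attribute (not to be confused with the class of the same name in weka.attributeSelection used above) can be applied; a minimal sketch with the same evaluator and search method as the first example:

import weka.filters.Filter;
import weka.filters.supervised.attribute.AttributeSelection;

    public static Instances reduceAttributes(Instances data) throws Exception
    {
        AttributeSelection filter = new AttributeSelection();
        filter.setEvaluator(new CfsSubsetEval());
        filter.setSearch(new GreedyStepwise());
        filter.setInputFormat(data);            // must be called before Filter.useFilter
        return Filter.useFilter(data, filter);  // returns a copy containing only the selected attributes
    }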

The following summary of the available methods is based on reference link 10; link 11 contains additional and more detailed content that is worth a further look.

Weka offers two attribute-selection modes:

1. An attribute-subset evaluator + a search method (this can be seen as a loop, with the evaluator run in every iteration of the search)

2. A single-attribute evaluator + a ranking method

 

Feature evaluation functions
The evaluation criterion plays an important role in feature selection; it is the basis on which features are chosen. Evaluation criteria come in two kinds: one measures the predictive power of each feature individually, the other evaluates the overall predictive performance of a feature subset. These correspond to two important families of methods: Filter and Wrapper.
The Filter method generally does not rely on a specific learning algorithm to evaluate a feature subset. Instead it borrows ideas from statistics, information theory and other fields, evaluates the predictive power of each feature from intrinsic properties of the data set, and looks for a feature subset composed of the features that score best. The Wrapper method embeds the subsequent learning algorithm into the feature-selection process as a whole: the merit of a feature subset is determined by testing the prediction performance of the algorithm on that subset, and little attention is paid to the predictive power of each individual feature. Consequently, the individual features in an optimal feature subset need not themselves be optimal.

 

Attribute subset evaluators

CfsSubsetEval: considers the predictive value of each attribute individually, together with the degree of redundancy between attributes.

ClassifierSubsetEval: evaluates attribute subsets with a classifier.

ConsistencySubsetEval: evaluates the consistency of the class values when the training set is projected onto the attribute subset.

WrapperSubsetEval: uses a classifier plus cross-validation (the wrapper method); see the sketch after these lists.

Search methods

BestFirst: greedy search with backtracking.

ExhaustiveSearch: exhaustive search.

GeneticSearch: search based on a genetic algorithm.

GreedyStepwise: greedy search without backtracking.

RandomSearch: random search.

RankSearch: uses an attribute/subset evaluator to rank promising attributes.

Single-attribute evaluators

ChiSquaredAttributeEval: evaluates attributes with the chi-squared statistic with respect to the class.

GainRatioAttributeEval: evaluates attributes based on gain ratio.

InfoGainAttributeEval: evaluates attributes based on information gain.

OneRAttributeEval: evaluates attributes using the OneR method.

PrincipalComponents: performs principal component analysis and transformation.

ReliefFAttributeEval: instance-based attribute evaluator.

SymmetricalUncertAttributeEval: evaluates attributes based on symmetrical uncertainty.

Ranking method

Ranker: sorts the attributes according to their evaluations.
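As an illustration of mode 1 using the wrapper approach, WrapperSubsetEval can be combined with a search method such as BestFirst; a minimal sketch, reusing m_instances and the J48 classifier from above:

import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.BestFirst;
import weka.attributeSelection.WrapperSubsetEval;

        AttributeSelection attsel = new AttributeSelection();
        WrapperSubsetEval wrapper = new WrapperSubsetEval();
        wrapper.setClassifier(new J48());   // subsets are scored by cross-validating this classifier
        wrapper.setFolds(5);                // number of folds used internally by the wrapper
        BestFirst search = new BestFirst(); // greedy search with backtracking
        attsel.setEvaluator(wrapper);
        attsel.setSearch(search);
        attsel.SelectAttributes(m_instances);
        System.out.println(attsel.toResultsString());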

 

Clustering Algorithm

Reference Link 1

A clusterer is built in much the same way as a classifier, except that instead of the buildClassifier(Instances) method it uses buildClusterer(Instances). The following code snippet shows how to build an EM clusterer with a maximum of 100 iterations.

To evaluate a clusterer, the ClusterEvaluation class is available; for example, it can output the number of clusters found.

import weka.clusterers.EM;
import weka.clusterers.ClusterEvaluation;

            String[] options = new String[2];
            options[0] = "-I";                 // max. iterations
            options[1] = "100";
            EM clusterer = new EM();   // new instance of clusterer
            clusterer.setOptions(options);     // set the options
            clusterer.buildClusterer(m_instances);

            ClusterEvaluation eval = new ClusterEvaluation();
            eval.setClusterer(clusterer);       // the clusterer to evaluate

            // data to evaluate the clusterer on
            eval.evaluateClusterer(m_instances);//newData
            // output # of clusters
            System.out.println("# of clusters: " + eval.getNumClusters());

            //eval.crossValidateModel(                 // cross-validate
            //        clusterer, m_instances, 10,               // with 10 folds,newData
            //        new Random(1));        // and random number generator with seed 1
            System.out.println(eval.clusterResultsToString());

Note that the call to setClassIndex needs to be commented out, otherwise the error "weka.clusterers.EM: Cannot handle any class attribute!" occurs.
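Instead of commenting out setClassIndex, the class attribute can be removed from a copy of the data before clustering, which is also the usual preparation for a classes-to-clusters style evaluation; a minimal sketch using the Remove filter:

import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

            // strip the class attribute so that EM does not see it
            Remove rm = new Remove();
            rm.setAttributeIndices("" + (m_instances.classIndex() + 1)); // Remove uses 1-based indices
            rm.setInputFormat(m_instances);
            Instances dataNoClass = Filter.useFilter(m_instances, rm);
            clusterer.buildClusterer(dataNoClass);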

The output for the iris.arff data set is:

EM
==

Number of clusters selected by cross validation: 4
Number of iterations performed: 16


                    Cluster
Attribute                 0       1       2       3
                     (0.32)  (0.33)   (0.2)  (0.14)
====================================================
sepallength
  mean                 5.897   5.006  6.9426  6.1304
  std. dev.           0.5279  0.3489   0.498  0.2943

sepalwidth
  mean                2.7519   3.418  3.1103  2.8088
  std. dev.           0.3103  0.3772  0.2952  0.2361

petallength
  mean                4.2267   1.464  5.8559  5.0993
  std. dev.            0.445  0.1718  0.4626  0.2462

petalwidth
  mean                1.3134   0.244  2.1495  1.8254
  std. dev.           0.1864  0.1061   0.232  0.2152

class
  Iris-setosa              1      51       1       1
  Iris-versicolor    48.1125       1  1.0182  3.8693
  Iris-virginica      2.0983       1 31.0375 19.8641
  [total]            51.2108      53 33.0557 24.7335
Clustered Instances

0       48 ( 32%)
1       50 ( 33%)
2       29 ( 19%)
3       23 ( 15%)


Log likelihood: -2.03504

 

References

1. Weka development [-1]: Using Weka in your own code, https://blog.csdn.net/u010968153/article/details/46275445

2. Weka development [1]: The Instances class, https://blog.csdn.net/zt_706/article/details/8855286

3. Detailed steps for connecting Weka to a database, https://blog.csdn.net/qq_34760892/article/details/54630723

4. Weka development [2]: The classifier classes, https://blog.csdn.net/zt_706/article/details/8855314

5. Weka development [3]: The Evaluation class, https://blog.csdn.net/zt_706/article/details/8855339

6. Analysis of Weka classification and Evaluation output, https://blog.csdn.net/qiao1245/article/details/50886070

7. Weka feature selection (Attribute Selection), http://blog.sciencenet.cn/blog-713110-568654.html

8. Weka feature selection (Attribute Selection), http://blog.sciencenet.cn/blog-713110-568654.html

9. Notes on Weka secondary development, http://www.cnblogs.com/thinkml/p/4170399.html

10. Weka attribute selection, https://www.cnblogs.com/xaf-dfg/p/3558383.html

11. A summary of using the machine learning tool WEKA, covering algorithm selection, parameter optimization and attribute selection, https://www.cnblogs.com/lutaitou/p/5818027.html

12. Papers on attribute-selection algorithms, https://www.docin.com/p-215712031.html , https://wenku.baidu.com/view/2f7b18ece009581b6bd9eb46.html
