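Reading data from files and databases

See references 1 and 2.

The components you are most likely to use are:

- Instances: your data

- Filter: preprocessing of the data

- Classifiers/Clusterer: built on the preprocessed data to classify or cluster it

- Evaluating: evaluating a classifier or clusterer

- Attribute selection: removing irrelevant attributes from the data

The following code has been tested and works: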

import java.io.BufferedReader;
import java.io.FileReader;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.experiment.InstanceQuery;

public static Instances getInstances(String filePath) {
        try {
            // Fall back to the iris data set shipped with Weka when no path is given.
            if (filePath == null) {
                filePath = "C:\\Weka-3-8\\data\\iris.arff";
            }

            // Weka 3.5.5 and 3.4.x: read the ARFF file through a Reader.
            Instances data = new Instances( new BufferedReader( new FileReader(filePath) ) );
            // Set the class attribute.
            // The class index is the index of the target attribute used for classification.
            // In an ARFF file this is by convention the last attribute, hence numAttributes() - 1.
            // It must be set before calling a Weka method that needs it
            // (e.g. weka.classifiers.Classifier.buildClassifier(data)).
            data.setClassIndex(data.numAttributes() - 1);
            System.out.println( "#################data:" );
            System.out.println( data );

            // Weka 3.5.5 and newer: use DataSource.
            // DataSource is not limited to ARFF files. It also reads CSV and other formats
            // (essentially every format Weka can import through its converters).
            DataSource source = new DataSource(filePath);
            Instances data2 = source.getDataSet();
            // Set the class attribute if the data format does not provide this information.
            // E.g., the XRFF format saves the class attribute information as well.
            if (data2.classIndex() == -1) {
                data2.setClassIndex(data2.numAttributes() - 1);
            }
            System.out.println(  "#################data2:"  );
            System.out.println( data2 );

            // Read from a database.
            InstanceQuery query = new InstanceQuery();
            // The database connection is configured inside weka.jar.
            query.setUsername("root");
            query.setPassword("");
            query.setQuery("select * from url_features limit 0,10");//url_features
            // if your data is sparse, then you can say so too
            // query.setSparseData(true);
            Instances data3 = query.retrieveInstances();
            // Print the whole data set.
            System.out.println( data3 );
            // numInstances() returns the number of instances in the data set.
            for( int i = 0; i < data3.numInstances(); i++ )
            {
                // instance( i ) returns the i-th instance.
                System.out.println( data3.instance( i ) );
            }
            return data2;
        } catch (Exception e) {
            e.printStackTrace();
            return null;
        }
    }

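For the database configuration, see reference 3.

Classifiers

See reference 4.

To classify a data set you must first specify which column holds the class. If you forget this step (which happens often), Weka fails with the error "Class index is negative (not set)!". Use the Instances method setClassIndex to mark a column as the class. To use the last column, pass numAttributes() - 1, that is, the number of attributes minus one.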

            // Uses getInstances from the code above, which returns data2.
            Instances m_instances = getInstances(filePath);
            J48 classifier = new J48(); // weka.classifiers.trees.J48

            //NaiveBayes classifier2 = new NaiveBayes();
            //SMO classifier = new SMO();
            classifier.buildClassifier( m_instances );
            // Print the predicted class for instances 0, 60 and 110.
            // classifyInstance returns the index of the predicted class value as a double.
            System.out.println( classifier.classifyInstance( m_instances.instance( 0 ) ) );
            System.out.println( classifier.classifyInstance( m_instances.instance( 60 ) ) );
            System.out.println( classifier.classifyInstance( m_instances.instance( 110 ) ) );

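Evaluating a classifier

See reference 5.

    // m_instances is assumed to be a static field holding the data loaded above.
    // Requires: import weka.classifiers.Evaluation; import java.util.Random;
    // First create an Evaluation object. Evaluation has no no-argument constructor; it is normally constructed from an Instances object.
    // If you have no separate training and test sets, use cross-validation.
    // crossValidateModel takes four arguments: the classifier, the data set to evaluate on, the number of folds (10 is a common choice) and a random number generator.
    // Note that the classifier does not need to be trained before calling crossValidateModel, because it is trained afresh on the training part of each fold.
    // Evaluation offers several output methods. If you have used the Weka GUI you will recognise their output from the result panels there. The three used here, toClassDetailsString, toSummaryString and toMatrixString, are the most common.
    public static void crossValidation() throws Exception
    {
        J48 classifier = new J48();
        //NaiveBayes classifier = new NaiveBayes();
        //SMO classifier = new SMO();

        Evaluation eval = new Evaluation( m_instances );
        eval.crossValidateModel( classifier, m_instances, 10, new Random(1));
        System.out.println(eval.toClassDetailsString());
        System.out.println(eval.toSummaryString());
        System.out.println(eval.toMatrixString());
    }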
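
To make the fold mechanics concrete, here is a minimal Weka-free sketch of how 10-fold cross-validation can partition 150 instances into test folds. The class name KFoldSketch and the method testFolds are hypothetical names chosen for illustration. This is not Weka's implementation: for a nominal class Weka also stratifies the folds by class, which this sketch does not attempt.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class KFoldSketch {

    /** Shuffles the indices 0..n-1 and deals them round-robin into k test folds. */
    static List<List<Integer>> testFolds(int n, int k, long seed) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            indices.add(i);
        }
        Collections.shuffle(indices, new Random(seed));

        List<List<Integer>> folds = new ArrayList<>();
        for (int f = 0; f < k; f++) {
            folds.add(new ArrayList<>());
        }
        for (int pos = 0; pos < n; pos++) {
            folds.get(pos % k).add(indices.get(pos));
        }
        return folds;
    }

    public static void main(String[] args) {
        List<List<Integer>> folds = testFolds(150, 10, 1L);
        for (int f = 0; f < folds.size(); f++) {
            // Each fold is the test set once; the other folds form the training set.
            int testSize = folds.get(f).size();
            System.out.println("fold " + f + ": train on " + (150 - testSize)
                    + " instances, test on " + testSize);
        }
    }
}
```

Every instance ends up in exactly one test fold, so each instance is predicted exactly once by a model that never saw it during training.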

    // If you have separate training and test sets, use the evaluateModel method of the Evaluation class.
    // Its arguments are a trained classifier and the data set to evaluate it on. To keep the example simple, the training set is reused here as the test set; do not let that confuse you.
    public static void evaluateTestData() throws Exception
    {
        J48 classifier = new J48();
        //NaiveBayes classifier = new NaiveBayes();
        //SMO classifier = new SMO();

        classifier.buildClassifier( m_instances );
        Evaluation eval = new Evaluation( m_instances );
        eval.evaluateModel( classifier, m_instances );
        System.out.println(eval.toClassDetailsString());
        System.out.println(eval.toSummaryString());
        System.out.println(eval.toMatrixString());
    }

The output (here from the cross-validation run on iris.arff) is:

=== Detailed Accuracy By Class ===

                 TP Rate  FP Rate  Precision  Recall   F-Measure  MCC      ROC Area  PRC Area  Class
                 0.980    0.000    1.000      0.980    0.990      0.985    0.990     0.987     Iris-setosa
                 0.940    0.030    0.940      0.940    0.940      0.910    0.952     0.880     Iris-versicolor
                 0.960    0.030    0.941      0.960    0.950      0.925    0.961     0.905     Iris-virginica
Weighted Avg.    0.960    0.020    0.960      0.960    0.960      0.940    0.968     0.924

Correctly Classified Instances         144               96      %
Incorrectly Classified Instances         6                4      %
Kappa statistic                          0.94
Mean absolute error                      0.035
Root mean squared error                  0.1586
Relative absolute error                  7.8705 %
Root relative squared error             33.6353 %
Total Number of Instances              150

=== Confusion Matrix ===

  a  b  c   <-- classified as
 49  1  0 |  a = Iris-setosa
  0 47  3 |  b = Iris-versicolor
  0  2 48 |  c = Iris-virginica

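Notes:

                                    Actually positive      Actually negative
Predicted positive (returned)       true positive (TP)     false positive (FP)
Predicted negative (not returned)   false negative (FN)    true negative (TN)

1. FN: False Negative, predicted negative but actually positive.
2. FP: False Positive, predicted positive but actually negative.
3. TN: True Negative, predicted negative and actually negative.
4. TP: True Positive, predicted positive and actually positive.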

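For a multi-class problem such as iris, Weka reports these figures per class, treating that class as positive and all other classes as negative.
TP Rate: TP / (TP + FN), the fraction of all actual positives that the classifier identifies as positive.
FP Rate: FP / (FP + TN), the fraction of all actual negatives that the classifier wrongly labels as positive.
Precision: P = TP / (TP + FP), the correctly identified positives divided by all samples the system labels as positive.
Recall: R = TP / (TP + FN), the correctly identified positives divided by all actual positives. It is the same quantity as the TP Rate.
F-Measure: the weighted harmonic mean of Precision and Recall. P and R can pull in opposite directions, so they need to be considered together; the most common way to do so is the F-Measure (also called F-Score). With equal weights it is F1:
F1 = 2 * P * R / (P + R)
MCC: the Matthews correlation coefficient, a good measure for imbalanced data sets. Unlike accuracy and F1, it uses all four cells of the confusion matrix (TP, TN, FP and FN). On a heavily imbalanced data set (for example one in which 99.9% of the instances are positive), accuracy and F1 can look good for a classifier that simply predicts the majority class, whereas MCC exposes it.
MCC is essentially the correlation coefficient between the actual and the predicted classes. It ranges over [-1, 1]: 1 means perfect prediction, 0 means the prediction is no better than random, and -1 means the predicted and actual classes disagree completely.
The formula is:
MCC = (TP * TN - FP * FN) / sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))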
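
As a cross-check of the definitions above, the following sketch recomputes the per-class figures from the iris confusion matrix printed earlier, treating one class as positive and the other two as negative. The class ConfusionMatrixMetrics and its helper methods are hypothetical names for illustration and do not use Weka. ROC Area and PRC Area are left out because they cannot be derived from the confusion matrix alone.

```java
public class ConfusionMatrixMetrics {

    // Confusion matrix from the iris output: rows = actual class, columns = predicted class.
    static final int[][] IRIS = {
        {49, 1, 0},   // Iris-setosa
        {0, 47, 3},   // Iris-versicolor
        {0, 2, 48}    // Iris-virginica
    };

    /** Returns {TP, FP, FN, TN} for class c, treating every other class as negative. */
    static double[] counts(int[][] m, int c) {
        double tp = m[c][c], fp = 0, fn = 0, total = 0;
        for (int i = 0; i < m.length; i++) {
            for (int j = 0; j < m.length; j++) {
                total += m[i][j];
                if (i == c && j != c) fn += m[i][j];   // actual c, predicted something else
                if (i != c && j == c) fp += m[i][j];   // actual something else, predicted c
            }
        }
        return new double[] {tp, fp, fn, total - tp - fp - fn};
    }

    static double precision(double[] k) { return k[0] / (k[0] + k[1]); }
    static double recall(double[] k)    { return k[0] / (k[0] + k[2]); }
    static double fpRate(double[] k)    { return k[1] / (k[1] + k[3]); }

    static double f1(double[] k) {
        double p = precision(k), r = recall(k);
        return 2 * p * r / (p + r);
    }

    static double mcc(double[] k) {
        double tp = k[0], fp = k[1], fn = k[2], tn = k[3];
        return (tp * tn - fp * fn)
                / Math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn));
    }

    public static void main(String[] args) {
        String[] names = {"Iris-setosa", "Iris-versicolor", "Iris-virginica"};
        for (int c = 0; c < names.length; c++) {
            double[] k = counts(IRIS, c);
            System.out.printf("%-16s TPR=%.3f FPR=%.3f P=%.3f F1=%.3f MCC=%.3f%n",
                    names[c], recall(k), fpRate(k), precision(k), f1(k), mcc(k));
        }
    }
}
```

For Iris-versicolor, for instance, the counts are TP = 47, FP = 3, FN = 3 and TN = 97, which gives a precision of 0.94 and an MCC of 0.91, in line with the Weka table.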

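ROC Area
PRC Area
The ROC curve plots the false positive rate (FPR) on the x-axis against the true positive rate (TPR) on the y-axis. The PR (precision-recall) curve plots Recall on the x-axis against Precision on the y-axis.
Both curves are usually summarised by the area under the curve (AUC), which lies between 0 and 1; the larger the AUC, the better the model. For the ROC curve, 0.5 corresponds to random guessing. Optimising the ROC curve tends to maximise both the correctly classified positives and the correctly classified negatives. Optimising the PR curve, in contrast, tends to maximise the correctly classified positives without directly considering the correctly classified negatives.
When the positive and negative classes are very unevenly distributed (highly skewed datasets), the PRC reflects the quality of a classifier more effectively than the ROC.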
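
One way to build intuition for the ROC area: it equals the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one. The sketch below computes it that way by counting pairs. AucSketch, rocAuc and the scores are made up for illustration; this is not how Weka computes the figure internally.

```java
public class AucSketch {

    /**
     * ROC AUC as the fraction of (positive, negative) pairs in which the positive
     * instance gets the higher score; ties count as half a pair.
     */
    static double rocAuc(double[] positiveScores, double[] negativeScores) {
        double wins = 0;
        for (double p : positiveScores) {
            for (double n : negativeScores) {
                if (p > n) {
                    wins += 1.0;
                } else if (p == n) {
                    wins += 0.5;
                }
            }
        }
        return wins / (positiveScores.length * (double) negativeScores.length);
    }

    public static void main(String[] args) {
        double[] pos = {0.9, 0.8, 0.4};  // made-up scores of actual positives
        double[] neg = {0.7, 0.3, 0.2};  // made-up scores of actual negatives
        // 8 of the 9 pairs are ordered correctly (only 0.4 vs 0.7 is not).
        System.out.println(rocAuc(pos, neg));
    }
}
```

A classifier that ranks every positive above every negative scores 1.0, and one that assigns scores at random lands near 0.5.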

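Correctly Classified Instances: the number and percentage of instances classified correctly.
Incorrectly Classified Instances: the number and percentage of instances classified incorrectly.
Kappa statistic: a chance-corrected measure of agreement (originally an inter-rater agreement coefficient), here between the predicted and the actual classes. It is 1 for perfect agreement and 0 for agreement no better than chance; values normally fall between 0 and 1, although negative values are possible. As a rule of thumb, Kappa >= 0.75 indicates good agreement, 0.4 <= Kappa < 0.75 moderate agreement, and Kappa < 0.4 poor agreement.
Mean absolute error
Root mean squared error
The mean absolute error and the root mean squared error measure the difference between the classifier's predictions and the actual values; smaller is better.
Relative absolute error
Root relative squared error
The relative absolute error and the root relative squared error. An absolute error does not always convey how large an error really is, so the relative errors express it as a proportion of a reference error; in Weka that reference is the error of a trivial predictor that always predicts the class prior.

Confusion Matrix: the confusion matrix. Each row is an actual class and each column a predicted class.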
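
The Kappa statistic can be recomputed from the confusion matrix alone. The sketch below does so for the iris matrix, using the standard Cohen's kappa definition (observed agreement corrected by the agreement expected by chance). KappaSketch and kappa are hypothetical names for illustration, not part of Weka.

```java
public class KappaSketch {

    /** Cohen's kappa for a square confusion matrix (rows = actual, columns = predicted). */
    static double kappa(int[][] m) {
        int n = m.length;
        double total = 0, diagonal = 0;
        double[] rowSum = new double[n];
        double[] colSum = new double[n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                total += m[i][j];
                rowSum[i] += m[i][j];
                colSum[j] += m[i][j];
                if (i == j) diagonal += m[i][j];
            }
        }
        double observed = diagonal / total;   // observed agreement, i.e. accuracy
        double expected = 0;                  // agreement expected by chance
        for (int i = 0; i < n; i++) {
            expected += (rowSum[i] / total) * (colSum[i] / total);
        }
        return (observed - expected) / (1 - expected);
    }

    public static void main(String[] args) {
        int[][] iris = {{49, 1, 0}, {0, 47, 3}, {0, 2, 48}};
        System.out.println(kappa(iris));
    }
}
```

Here the observed agreement is 144/150 = 0.96 and the chance agreement is (50*49 + 50*50 + 50*51) / 150^2 = 1/3, so Kappa = (0.96 - 1/3) / (1 - 1/3) = 0.94, matching the value Weka reports.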

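Attribute selection

The two snippets below worked when I tried them, although I have not yet fully worked out the mathematics behind each function. See references 8 and 9.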

    public static void selectAtt() throws Exception
    {
        // AttributeSelection here is weka.attributeSelection.AttributeSelection;
        // CfsSubsetEval and GreedyStepwise come from the same package.
        AttributeSelection attsel = new AttributeSelection();
        CfsSubsetEval eval = new CfsSubsetEval();
        GreedyStepwise search = new GreedyStepwise();
        search.setSearchBackwards(true);
        attsel.setEvaluator(eval);
        attsel.setSearch(search);
        attsel.SelectAttributes(m_instances);

        int[] attarray = attsel.selectedAttributes();
        System.out.println("result:" + attsel.toResultsString());
        System.out.println("the selected attributes are as follows:");
        for (int i = 0; i < attarray.length; i++) {
            //System.out.println(attarray[i]);
            System.out.print(m_instances.attribute(attarray[i]).name() + ',');
        }
    }


    public static void selectAttribute() throws Exception
    {
        // 1. Create the search method and the attribute evaluator.
        Ranker rank = new Ranker();
        InfoGainAttributeEval eval = new InfoGainAttributeEval();
        // 2. Let the evaluator score every attribute.
        eval.buildEvaluator(m_instances);
        // 3. Filter the attributes with the search method.
        // The Ranker used here simply sorts the attributes by their InfoGain.
        int[] attrIndex = rank.search(eval, m_instances);
        // 4. Print the result: the attribute ranking together with each attribute's InfoGain.
        StringBuilder attrIndexInfo = new StringBuilder();
        StringBuilder attrInfoGainInfo = new StringBuilder();
        attrIndexInfo.append("Selected attributes:");
        attrInfoGainInfo.append("Ranked attributes:\n");
        for (int i = 0; i < attrIndex.length; i++) {
            attrIndexInfo.append(attrIndex[i]);
            attrIndexInfo.append(",");
            attrInfoGainInfo.append(eval.evaluateAttribute(attrIndex[i]));
            attrInfoGainInfo.append("\t");
            attrInfoGainInfo.append((m_instances.attribute(attrIndex[i]).name()));
            attrInfoGainInfo.append("\n");
        }
        System.out.println(attrIndexInfo.toString());
        System.out.println(attrInfoGainInfo.toString());
    }

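The function descriptions below follow reference 10. Reference 11 covers the same ground in more detail and is worth reading next.

Weka has two modes of attribute selection:

1. Attribute subset evaluator + search method (the search method is in effect the loop, and the evaluator is the operation carried out at each step of the loop)

2. Single-attribute evaluator + ranking method

Feature evaluation functions
The evaluation criterion plays an important role in feature selection: it is the basis on which features are chosen. Criteria fall into two kinds. One measures the predictive power of each feature on its own; the other evaluates the overall predictive performance of a feature subset. These broadly correspond to two important families of methods, Filter and Wrapper.
Filter methods generally do not rely on a specific learning algorithm to evaluate features. Instead they borrow ideas from statistics, information theory and other disciplines and judge the predictive power of each feature from the intrinsic properties of the data set, then assemble the top-ranked features into a feature subset. Wrapper methods embed the downstream learning algorithm in the feature selection process and judge a feature subset by the predictive performance it achieves with that algorithm, paying little attention to the predictive power of each individual feature. Consequently, the wrapper approach does not require every feature in the best subset to be individually optimal.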

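Attribute subset evaluators

CfsSubsetEval: weighs the predictive value of each individual attribute against the redundancy among the attributes.

ClassifierSubsetEval: uses a classifier to evaluate attribute subsets.

ConsistencySubsetEval: projects the training data onto the attribute subset and checks the consistency of the class values.

WrapperSubsetEval: uses a classifier together with cross-validation (the wrapper method).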

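Search methods

BestFirst: greedy search with backtracking.

ExhaustiveSearch: exhaustive search.

GeneticSearch: search using a genetic algorithm.

GreedyStepwise: greedy search without backtracking.

RandomSearch: random search.

RankSearch: ranks the attributes and uses an attribute subset evaluator to order the promising ones.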

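Single-attribute evaluators

ChiSquaredAttributeEval: evaluates an attribute by its chi-squared (χ²) statistic with respect to the class.

GainRatioAttributeEval: evaluates an attribute by its gain ratio.

InfoGainAttributeEval: evaluates an attribute by its information gain.

OneRAttributeEval: evaluates attributes using the OneR approach.

PrincipalComponents: performs principal components analysis and transformation.

ReliefAttributeEval: an instance-based attribute evaluator.

SymmetricalUncertAttributeEval: evaluates an attribute by its symmetrical uncertainty.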

Ranking method

Ranker: ranks the attributes by their individual evaluation scores.
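
To show what a single-attribute evaluator such as InfoGainAttributeEval measures, here is a small Weka-free sketch of information gain for a nominal attribute: the entropy of the class minus the entropy of the class once the attribute value is known. InfoGainSketch, entropy and infoGain are hypothetical names and the counts are a toy example. Weka's own evaluator does more than this (it discretizes numeric attributes first, for example), so treat this as an illustration of the formula only.

```java
public class InfoGainSketch {

    /** Shannon entropy (in bits) of a vector of class counts. */
    static double entropy(double[] counts) {
        double total = 0;
        for (double c : counts) total += c;
        double h = 0;
        for (double c : counts) {
            if (c > 0) {
                double p = c / total;
                h -= p * Math.log(p) / Math.log(2);
            }
        }
        return h;
    }

    /**
     * Information gain of a nominal attribute.
     * counts[v][c] = number of instances with attribute value v and class c.
     */
    static double infoGain(double[][] counts) {
        int numClasses = counts[0].length;
        double[] classTotals = new double[numClasses];
        double total = 0;
        for (double[] row : counts) {
            for (int c = 0; c < numClasses; c++) {
                classTotals[c] += row[c];
                total += row[c];
            }
        }
        // Entropy of the class within each attribute value, weighted by how often the value occurs.
        double conditional = 0;
        for (double[] row : counts) {
            double rowTotal = 0;
            for (double c : row) rowTotal += c;
            if (rowTotal > 0) conditional += (rowTotal / total) * entropy(row);
        }
        return entropy(classTotals) - conditional;
    }

    public static void main(String[] args) {
        // Toy counts: 3 attribute values x 2 classes.
        double[][] counts = {{2, 3}, {4, 0}, {3, 2}};
        System.out.printf("InfoGain = %.4f bits%n", infoGain(counts));
    }
}
```

Ranking then amounts to computing this score for every attribute and sorting in descending order, which is the role Ranker plays in the selectAttribute() example.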

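Clustering

See reference 1.

A clusterer is built in much the same way as a classifier, except that you call buildClusterer(Instances) instead of buildClassifier(Instances). The snippet below builds an EM clusterer that runs for at most 100 iterations.

To evaluate a clusterer, use the ClusterEvaluation class, for example to print the number of clusters found.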

import weka.clusterers.EM;
import weka.clusterers.ClusterEvaluation;

            String[] options = new String[2];
            options[0] = "-I";                 // max. iterations
            options[1] = "100";
            EM clusterer = new EM();   // new instance of clusterer
            clusterer.setOptions(options);     // set the options
            clusterer.buildClusterer(m_instances);

            ClusterEvaluation eval = new ClusterEvaluation();
            eval.setClusterer(clusterer);       // the clusterer to evaluate

            // data to evaluate the clusterer on
            eval.evaluateClusterer(m_instances);//newData
            // output # of clusters
            System.out.println("# of clusters: " + eval.getNumClusters());

            //eval.crossValidateModel(                 // cross-validate
            //        clusterer, m_instances, 10,               // with 10 folds,newData
            //        new Random(1));        // and random number generator with seed 1
            System.out.println(eval.clusterResultsToString());

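Note that the setClassIndex call must be commented out, so that no class attribute is set. Otherwise Weka fails with "weka.clusterers.EM: Cannot handle any class attribute!".

With the iris.arff data set the output is: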

EM
==

Number of clusters selected by cross validation: 4
Number of iterations performed: 16


                    Cluster
Attribute                 0       1       2       3
                     (0.32)  (0.33)   (0.2)  (0.14)
====================================================
sepallength
  mean                 5.897   5.006  6.9426  6.1304
  std. dev.           0.5279  0.3489   0.498  0.2943

sepalwidth
  mean                2.7519   3.418  3.1103  2.8088
  std. dev.           0.3103  0.3772  0.2952  0.2361

petallength
  mean                4.2267   1.464  5.8559  5.0993
  std. dev.            0.445  0.1718  0.4626  0.2462

petalwidth
  mean                1.3134   0.244  2.1495  1.8254
  std. dev.           0.1864  0.1061   0.232  0.2152

class
  Iris-setosa              1      51       1       1
  Iris-versicolor    48.1125       1  1.0182  3.8693
  Iris-virginica      2.0983       1 31.0375 19.8641
  [total]            51.2108      53 33.0557 24.7335
Clustered Instances

0       48 ( 32%)
1       50 ( 33%)
2       29 ( 19%)
3       23 ( 15%)


Log likelihood: -2.03504

References

1. Weka Development [-1]: Using Weka in Your Code, https://blog.csdn.net/u010968153/article/details/46275445

2. Weka Development [1]: The Instances Class, https://blog.csdn.net/zt_706/article/details/8855286

3. Detailed Steps for Connecting Weka to a Database, https://blog.csdn.net/qq_34760892/article/details/54630723

4. Weka Development [2]: Classifier Classes, https://blog.csdn.net/zt_706/article/details/8855314

5. Weka Development [3]: The Evaluation Class, https://blog.csdn.net/zt_706/article/details/8855339

6. Analysis of Weka's Evaluation Output for Classification, https://blog.csdn.net/qiao1245/article/details/50886070

7. Feature Selection (Attribute Selection) in Weka, http://blog.sciencenet.cn/blog-713110-568654.html

8. Feature Selection (Attribute Selection) in Weka, http://blog.sciencenet.cn/blog-713110-568654.html

9. Notes on Secondary Development with Weka, http://www.cnblogs.com/thinkml/p/4170399.html

10. Weka Attribute Selection, https://www.cnblogs.com/xaf-dfg/p/3558383.html

11. A Summary of Using the Machine Learning Tool WEKA, Including Algorithm Selection, Parameter Optimisation and Attribute Selection, https://www.cnblogs.com/lutaitou/p/5818027.html
