A Detailed Tutorial on the Weka Machine Learning Software (reproduced)

Reprinted from https://www.cnblogs.com/hxsyl/p/3307343.html

Download and installation: link: https://pan.baidu.com/s/14GMxr1mss_bm0bUoLNJnIw Password: fvby (64-bit)

The functions provided by Weka include data preprocessing, feature selection, classification, regression, clustering, association rules, visualization, and more. This article gives a brief introduction to Weka and, through a simple example, walks through the typical workflow. It focuses on operating the graphical interface rather than the command line or code-level APIs, although a few short Java API sketches are included for reference.

 

2. Introduction to Tools

[Figure: Weka GUI Chooser window]

        There are 4 applications on the right side of the window, namely:

  1. Explorer: An environment for data experiments and mining. It provides functions for classification, clustering, association rules, feature selection, and data visualization.
  2. Experimenter: An environment for running controlled experiments and statistical tests that compare different learning schemes.
  3. KnowledgeFlow: The function is similar to Explorer, but the interface provided is different. Users can use drag and drop to create experimental schemes. In addition, it supports incremental learning.
  4. SimpleCLI: Simple command line interface.

        Weka supports many file formats, including arff, xrff, csv, and even libsvm. Among them, arff is the most commonly used format, and we only introduce this one here.
The full name of Arff is Attribute-Relation File Format. The following is an example of an arff format file.

%
%  Arff file example
%
@relation 'labor-neg-data'
@attribute 'duration' real
@attribute 'wage-increase-first-year' real
@attribute 'wage-increase-second-year' real
@attribute 'wage-increase-third-year' real
@attribute 'cost-of-living-adjustment' {'none','tcf','tc'}
@attribute 'working-hours' real
@attribute 'pension' {'none','ret_allw','empl_contr'}
@attribute 'standby-pay' real
@attribute 'shift-differential' real
@attribute 'education-allowance' {'yes','no'}
@attribute 'statutory-holidays' real
@attribute 'vacation' {'below_average','average','generous'}
@attribute 'longterm-disability-assistance' {'yes','no'}
@attribute 'contribution-to-dental-plan' {'none','half','full'}
@attribute 'bereavement-assistance' {'yes','no'}
@attribute 'contribution-to-health-plan' {'none','half','full'}
@attribute 'class' {'bad','good'}
@data
1,5,?,?,?,40,?,?,2,?,11,'average',?,?,'yes',?,'good'
2,4.5,5.8,?,?,35,'ret_allw',?,?,'yes',11,'below_average',?,'full',?,'full','good'
?,?,?,?,?,38,'empl_contr',?,5,?,11,'generous','yes','half','yes','half','good'
3,3.7,4,5,'tc',?,?,?,?,'yes',?,?,?,?,'yes',?,'good'
3,4.5,4.5,5,?,40,?,?,?,?,12,'average',?,'half','yes','half','good'
2,2,2.5,?,?,35,?,?,6,'yes',12,'average',?,?,?,?,'good'
3,4,5,5,'tc',?,'empl_contr',?,?,?,12,'generous','yes','none','yes','half','good'
3,6.9,4.8,2.3,?,40,?,?,3,?,12,'below_average',?,?,?,?,'good'
2,3,7,?,?,38,?,12,25,'yes',11,'below_average','yes','half','yes',?,'good'
1,5.7,?,?,'none',40,'empl_contr',?,4,?,11,'generous','yes','full',?,?,'good'
3,3.5,4,4.6,'none',36,?,?,3,?,13,'generous',?,?,'yes','full','good'
2,6.4,6.4,?,?,38,?,?,4,?,15,?,?,'full',?,?,'good'
2,3.5,4,?,'none',40,?,?,2,'no',10,'below_average','no','half',?,'half','bad'

        This example comes from the labor.arff file in the data directory of the Weka installation. It describes Canadian labor negotiation cases: the task is to predict the final outcome (good or bad) of a negotiation from the attributes of the proposed contract. In the file, comments begin with "%". The rest is divided into two parts: the header and the data.
In the header, the line starting with "@relation" gives the name of the relation and must be the first non-comment line of the file; its format is @relation <relation-name>. Lines starting with "@attribute" declare the features, in the format @attribute <attribute-name> <datatype>, where attribute-name is the name of the feature and datatype is its data type. Common data types are:

  • numeric: a numeric type, covering both integer and real values.
  • nominal: essentially an enumerated type; the feature takes values from a finite set, whose elements may be strings or numbers.
  • string: the value can be an arbitrary string.

        The part starting from "@data" is the actual data. Each row is one instance and can be viewed as a feature vector; the values appear in the same order as the @attribute declarations in the header and are separated by commas. In supervised classification, the last column is the class label. A missing feature value is written as "?".
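        Although this article focuses on the graphical interface, the same file can also be loaded through Weka's Java API. Below is a minimal sketch, assuming weka.jar is on the classpath and that the path to labor.arff is adjusted to your own installation:

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class LoadArff {
    public static void main(String[] args) throws Exception {
        // Example path; point it at the data directory of your Weka installation.
        Instances data = DataSource.read("data/labor.arff");
        // In supervised classification the last attribute ("class") is the label.
        data.setClassIndex(data.numAttributes() - 1);
        System.out.println("Relation:   " + data.relationName());
        System.out.println("Instances:  " + data.numInstances());
        System.out.println("Attributes: " + data.numAttributes());
    }
}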

       The process of data mining using weka is as follows:

                          [Figure: Weka data mining flow chart]

        Among them, the three steps of data preprocessing, training and verification are carried out in weka.
         1) Data preprocessing: Data preprocessing includes feature selection, feature value processing (such as normalization), sample selection and other operations.
         2) Training: Training includes algorithm selection, parameter adjustment, and model training.
         3) Verification: Verify the model results.
        The rest of this article will take this process as the main line and use classification as an example to introduce the steps of using weka for data mining.

        (Note: my own interface looks slightly different from the screenshots below, but they are kept for demonstration purposes. The contents of area 5 differ simply because a different feature is selected in area 4.)

 

3. Data preprocessing

        Data preprocessing: open the Explorer interface, click "Open file...", and in the data directory of the Weka installation select the "labor.arff" file; you will see the following interface. We divide the whole window into 7 areas and introduce the function of each below.

[Figure: Explorer Preprocess panel, divided into 7 areas]

        There are 6 tabs in area 1, used to select the different data mining panels. From left to right they are Preprocess (preprocessing), Classify (classification), Cluster (clustering), Associate (association rules), Select attributes (feature selection), and Visualize (visualization).
        Area 2 provides functions for opening, saving, and editing files. A file can be opened not only from the local disk but also from a URL or a database. The Generate button can generate artificial data using several methods that Weka provides. Click Edit and you will see the following interface.

[Figure: Edit window]
        In this window you can see the value of every row and column. Right-click a column name (after clicking to select the column) and you will see a number of data-editing functions, which are quite practical.

        Area 3 is called Filter. The name may remind some readers of the filter methods used in feature selection, but these Filters actually provide a large number of operations on both attributes and instances, and they are very powerful.
        Area 4 shows the current features and sample information and provides functions for selecting and deleting features. After you select a single feature in area 4, area 5 displays statistics for that feature, including the minimum, maximum, mean, and standard deviation.
        Area 6 provides visualization: after a feature is selected, this area shows the distribution of its values over intervals, with the different class labels drawn in different colors.
        Area 7 is the status bar. When no task is running, the Weka bird sits down; while a task is running, it stands up and sways from side to side. If the bird is standing but not moving, something has gone wrong with the task.

 

4. Filter Examples

        Click the Choose button under Filter and you will see the following interface.

                   [Figure: filter selection tree]
        Filters fall into two categories, supervised and unsupervised. Supervised methods require class labels, while unsupervised methods do not. Within each category, attribute filters operate on features and instance filters operate on samples.

         Case 1: Feature value normalization
        This operation does not depend on the class label and works on attributes, so we choose Normalize under unsupervised -> attribute. Click the text field showing the selected filter and you will see the following interface. In the left window there are several parameters to choose from; click More and the window on the right appears, describing the filter in detail.

           [Figure: Normalize parameter window]

         Keep the default parameters and click OK to return to the main window. In area 4, select the features to normalize (one or more) and then click Apply. In the visualization area you can see that the feature values have been scaled from the range 1 to 3 down to between 0 and 1 (check the minimum and maximum shown in area 5).

            [Figure: feature values after normalization, shown in area 5]
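        The same filter can be applied from code. A minimal sketch using Weka's Java API, with an illustrative file path and the filter's default options (which scale every numeric attribute to [0, 1]):

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Normalize;

public class NormalizeExample {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data/labor.arff");   // adjust path to your installation
        Normalize norm = new Normalize();   // defaults: scale numeric attributes to [0, 1]
        norm.setInputFormat(data);          // must be called before filtering
        Instances normalized = Filter.useFilter(data, norm);
        // Print the statistics of the first attribute to confirm the new min/max.
        System.out.println(normalized.attributeStats(0));
    }
}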

        Case 2: Supervised feature selection
        This function depends on the class label, so select AttributeSelection under supervised -> attribute. The interface offers two options: evaluator, the method used to assess how good a feature (or feature set) is, and search, the method used to search the feature space. Here we use InfoGainAttributeEval as the evaluator and Ranker as the search method, which means the features will be ranked by their information gain. A threshold can be set in Ranker; features scoring below the threshold are discarded.

             [Figure: AttributeSelection settings]

        Click Apply and you can see that the features in area 4 have been re-ordered, and those scoring below the threshold have been removed.
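        A rough equivalent through the Java API is sketched below; the class names match recent Weka versions, and the threshold value is only an example:

import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.supervised.attribute.AttributeSelection;

public class InfoGainSelection {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data/labor.arff");   // adjust path
        data.setClassIndex(data.numAttributes() - 1);           // supervised: a class label is required

        AttributeSelection filter = new AttributeSelection();
        Ranker search = new Ranker();
        search.setThreshold(0.0);                // example threshold; attributes ranked below it are discarded
        filter.setEvaluator(new InfoGainAttributeEval());
        filter.setSearch(search);
        filter.setInputFormat(data);

        Instances reduced = Filter.useFilter(data, filter);
        System.out.println("Attributes kept: " + reduced.numAttributes());
    }
}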
        Case 3: Sample selection
        Choose RemoveMisclassified under unsupervised -> instance; it has 6 parameters. classIndex sets the class attribute, classifier selects the classifier (here we choose the J48 decision tree), invert is set to true so that only the misclassified samples are kept, and numFolds sets the number of cross-validation folds. After setting the parameters, click Apply and you can see that the number of samples has dropped from 57 to 7.

            [Figure: samples remaining after RemoveMisclassified]
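        For reference, a sketch of the same filter from code; the setter names mirror the parameters shown in the GUI, but verify them against your Weka version:

import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.instance.RemoveMisclassified;

public class KeepMisclassified {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data/labor.arff");   // adjust path
        data.setClassIndex(data.numAttributes() - 1);

        RemoveMisclassified filter = new RemoveMisclassified();
        filter.setClassIndex(data.classIndex());  // attribute to treat as the class
        filter.setClassifier(new J48());          // classifier used to decide which samples are misclassified
        filter.setNumFolds(10);                   // cross-validation folds used for that decision
        filter.setInvert(true);                   // keep only the misclassified samples
        filter.setInputFormat(data);

        Instances kept = Filter.useFilter(data, filter);
        System.out.println("Samples kept: " + kept.numInstances());
    }
}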

5. Classification

        In Explorer, open the classify tab, and the entire interface is divided into several areas. They are
        1) Classifier: Click the choose button to select the classifier provided by weka. Commonly used classifiers are
              a) NaiveBayes (naive Bayes) and BayesNet (Bayesian belief network) under bayes.
              b) LibLINEAR and LibSVM (both require installing additional packages), Logistic (logistic regression), and LinearRegression (linear regression) under functions.
              c) IB1 (1-NN) and IBk (k-NN) under lazy.
              d) Many boosting and bagging classifiers under meta, such as AdaBoostM1.
              e) J48 (the Weka implementation of C4.5) and RandomForest under trees.
         2) Test options
        There are four options for evaluating the effect of the model.
              a) Use training set: evaluate on the training data itself, i.e. the training set and the test set are the same data. This method is generally not used.
              b) Supplied test set: specify a separate test set, either a local file or a URL; the format of the test file must match that of the training file.
              c) Cross-validation: a very common validation method. N-fold cross-validation splits the training data into N parts, trains on N-1 of them and tests on the remaining one, repeats this N times, and finally aggregates the results.
             d) Percentage split: split the data into two parts by a given ratio, one for training and one for testing. Below these options there is a More options button for setting various model output and evaluation parameters.
        3) Result list
        This area saves the history of classification experiments. Right-click the record and you can see many options. Commonly used are some options for saving or loading models and visualization.
        4) Classifier output
       The output of the classifier. By default it contains Run information, a summary of the features, samples, and test mode; Classifier model, the parameters of the trained model (different classifiers print different information); and, at the bottom, the evaluation results, reported with common measures such as precision (Precision), recall (Recall), true positive rate (True positive rate), false positive rate (False positive rate), F value (F-Measure), and ROC area (ROC Area). The Confusion Matrix shows how the test samples were classified, so you can easily see how many samples of each class were classified correctly or incorrectly.
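       The same kind of experiment can also be run entirely from code. A minimal sketch using the Java API (the file path and random seed are illustrative): it trains J48 on labor.arff, evaluates it with ten-fold cross-validation, and prints the summary, the per-class details, and the confusion matrix. The case below performs the equivalent steps in the GUI.

import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class J48CrossValidation {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data/labor.arff");   // adjust path
        data.setClassIndex(data.numAttributes() - 1);

        J48 tree = new J48();                                    // Weka's C4.5, default parameters

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(tree, data, 10, new Random(1));  // ten-fold cross-validation

        System.out.println(eval.toSummaryString());              // overall accuracy and error measures
        System.out.println(eval.toClassDetailsString());         // precision, recall, F-measure, ROC area per class
        System.out.println(eval.toMatrixString());               // confusion matrix
    }
}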

        Case 1: Use J48 to classify labor files
            a. Open labor.arff file and switch to the classify panel.
            b. Select trees->J48 classifier and use the default parameters.
            c. Under Test options keep the default ten-fold cross-validation, click More options, and check Output predictions.
            d. Click the start button to start the experiment.
            e. In the Classifier output on the right, we see the results of the experiment.

           [Figure: Run information in the Classifier output]

        The figure above shows the classifier used in the experiment and its parameters, the relation name, the number of instances, the number of attributes and which attributes were used, and the test mode.

               [Figure: J48 pruned tree and model summary]

        The figure above shows the generated decision tree, the number of leaf nodes, the size of the tree, and the model training time. If this is not intuitive enough, right-click the experiment you just ran in the Result list and click Visualize tree to see the decision tree drawn graphically, which is much more intuitive.
           [Figure: decision tree shown by Visualize tree]
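        The same tree, as text plus a DOT graph description that Graphviz can render, can also be obtained from code. A small sketch, assuming default J48 settings and training on the full data set:

import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class PrintTree {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data/labor.arff");   // adjust path
        data.setClassIndex(data.numAttributes() - 1);
        J48 tree = new J48();
        tree.buildClassifier(data);        // train on the full data set
        System.out.println(tree);          // textual tree, number of leaves, size of the tree
        System.out.println(tree.graph());  // DOT description of the same tree
    }
}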

        Below that is the prediction output: for each test sample you can see its actual class, the predicted class, whether the prediction is an error, and the prediction probability.

           [Figure: prediction output]
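        Per-instance predictions can also be produced in code by classifying instances directly. A rough sketch (it trains and predicts on the same data purely for illustration, unlike the cross-validated output above):

import weka.classifiers.trees.J48;
import weka.core.Instance;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class PrintPredictions {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data/labor.arff");   // adjust path
        data.setClassIndex(data.numAttributes() - 1);
        J48 tree = new J48();
        tree.buildClassifier(data);   // illustration only: same data for training and prediction
        for (int i = 0; i < data.numInstances(); i++) {
            Instance inst = data.instance(i);
            double predicted = tree.classifyInstance(inst);
            double[] dist = tree.distributionForInstance(inst);
            System.out.printf("actual=%s predicted=%s p=%.3f%n",
                    inst.stringValue(data.classIndex()),
                    data.classAttribute().value((int) predicted),
                    dist[(int) predicted]);
        }
    }
}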
        At the bottom are the evaluation results: the overall accuracy is 73.68%; for the bad class the precision is 60.9% and the recall 70.0%, and for the good class the precision is 82.4% and the recall 75.7%.

        5) Visualization
        Open the Visualize panel of Explorer. The top part is a matrix of two-dimensional plots whose rows and columns are the features (including the class label); the cell in row i and column j shows the distribution of features i and j on a two-dimensional plane. Each point on a plot is one sample, and different classes are drawn in different colors. Below the matrix there are several options: PlotSize adjusts the size of each plot, PointSize adjusts the size of the sample points, and Jitter spreads the points apart, which helps when they are too concentrated.
          [Figure: Visualize panel plot matrix]

          [Figure: plot of duration versus class]

        The figure above plots the duration feature against the class. It suggests that duration is not a very discriminative feature: in every value interval the distributions of good and bad are similar.
Clicking a plot in the matrix pops up another window that also shows the distribution of the two features; the difference is that here you can click an individual point to see the detailed values of that sample. Visualization can also be used to inspect misclassified samples, which is a very useful feature. After classification finishes, right-click the classification record in the Result list and select Visualize classifier errors; the following window pops up.

          [Figure: Visualize classifier errors window]

        In this window, crosses are correctly classified samples and squares are misclassified ones; the X-axis is the actual class and the Y-axis the predicted class. Blue marks samples whose actual class is bad and red those whose actual class is good, so a blue square is a sample that is actually bad but misclassified as good, and a red square is a sample that is actually good but misclassified as bad. Clicking these points shows the feature values of the sample, which helps in analyzing why it was misclassified.
        One more practical function: right-click the record in the Result list, select Visualize threshold curve, and choose a class (bad or good); you will see the following graph.

            [Figure: threshold (ROC) curve]

         The graph shows how the evaluation measures change as the classification threshold (confidence) varies. Here it plots the false positive rate against the true positive rate at different thresholds, which is exactly the ROC curve. By choosing what the colors encode, you can also observe how other evaluation measures are distributed. If you set the X-axis and Y-axis to precision and recall, the graph lets you trade these two quantities off against each other and pick a suitable threshold. Other visualization functions are not introduced one by one here.
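         The single-number summary of the ROC curve, the area under it, is also available from code. A brief sketch that reuses the cross-validation setup from the earlier example:

import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class RocArea {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data/labor.arff");   // adjust path
        data.setClassIndex(data.numAttributes() - 1);
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new J48(), data, 10, new Random(1));
        // Area under the ROC curve for each class label.
        for (int i = 0; i < data.numClasses(); i++) {
            System.out.println(data.classAttribute().value(i) + " AUC = " + eval.areaUnderROC(i));
        }
    }
}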
