[Study Notes] Shandong University Bioinformatics-07 Data Mining (WEKA)

Course Address : Bioinformatics, Shandong University


7. Data Mining

Three elements of data mining

  1. Statistics
  2. Database systems
  3. Machine learning

7.1 Database system

Database system

  • Database system (DBS): the complete system, i.e. DB + DBMS.
  • Database management system (DBMS): the software that manages the database.
  • Database (DB): the stored data itself.
  • Database system = database + database management system

Database type

  • Relational database: stores data in tabular form.
  • Object-oriented database: stores data in XML format; the structure is clear and flexible, which suits complex biological data.

Commonly used database systems

  • Relational database system: MySQL (SQL language)
  • Object-oriented database system: eXist-db (written in Java, queried with the XQuery language)

7.2 Machine Learning

  • Machine learning: the design and analysis of algorithms that let computers "learn" automatically. These algorithms derive rules from data and use those rules to make predictions about unknown data.
  • Implementing machine learning: each object the computer should learn from is converted into a vector, so that the object is described by the vector and the computer reads the vector's values.

Common Machine Learning Tasks

1. Classification: given background knowledge, decide which category a new object belongs to.
2. Clustering: without background knowledge, group a set of new objects according to their attributes.
3. Regression: given background knowledge, infer the quantitative relationship between x1, x2, …, xn and y, and use it to calculate y for a new object.

K-fold cross-validation

  • Clustering does not require training data to learn background knowledge (unsupervised).

  • Regression and classification require a training dataset from which the background knowledge is learned (supervised) in order to train a prediction model. Once the model is trained, part of the data with known results must be set aside as a test dataset to measure the model's accuracy.

  • In theory, all data with known results should be used for training, but data outside that set has no known results and cannot be used for testing. Testing on the same data that was used for training hides over-learning and overstates accuracy, whereas holding data back for testing leaves less for training and risks under-learning. K-fold cross-validation mitigates both problems and is one of the common methods for evaluating machine learning.

  • K-fold cross-validation: divide all data with known results into k parts. Take the first part as the test data and use the remaining k-1 parts as training data to train a model, then measure the model's accuracy on the test data. Next take the second part as the test data and train on the remaining k-1 parts, and so on, until every part has served as the test data once. The same algorithm thus builds k models and runs k tests, giving k accuracy values; their average is taken as the accuracy of the final model.

  • See the video for details: Machine Learning-01 P127
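
The k-fold procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the "model" is a toy majority-class predictor, and the helper names (k_fold_indices, train_majority, cross_validate) are made up for this note, not taken from the course or from WEKA.

```python
from collections import Counter


def k_fold_indices(n_samples, k):
    """Split indices 0..n_samples-1 into k contiguous folds of near-equal size."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def train_majority(labels):
    """Toy 'model' (an assumption for this sketch): the most common training label."""
    return Counter(labels).most_common(1)[0][0]


def cross_validate(labels, k):
    """Return the k per-fold accuracies and their average."""
    accuracies = []
    for test_idx in k_fold_indices(len(labels), k):
        test_set = set(test_idx)
        # Train on the k-1 parts that are not the current test part.
        train_labels = [y for i, y in enumerate(labels) if i not in test_set]
        model = train_majority(train_labels)
        # Test on the held-out part.
        correct = sum(1 for i in test_idx if labels[i] == model)
        accuracies.append(correct / len(test_idx))
    return accuracies, sum(accuracies) / k


if __name__ == "__main__":
    # Hypothetical labels, invented for illustration.
    labels = ["yes", "yes", "no", "yes", "yes", "no", "yes", "yes", "no", "yes"]
    per_fold, average = cross_validate(labels, k=5)
    print(per_fold, average)
```

In practice the data is usually shuffled before it is split, so that each fold contains a similar mix of classes; the sketch skips that step to stay deterministic.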

Algorithms for Machine Learning

Several common algorithms :

  • Bayesian: Bayes' theorem
    In general, the probability of event A given that event B has occurred is not the same as the probability of event B given A; however, the two are related in a definite way, and Bayes' theorem states that relationship: P(A|B) = P(B|A)P(A) / P(B). See: [Study Notes] Shandong University Bioinformatics-05 Introduction to High-throughput Sequencing Technology + 06 Statistical Basis and Sequence Algorithm (Principle)

  • Nearest neighbor: Nearest Neighbor
    Known objects are plotted in a coordinate system according to their attributes, and the unknown object is plotted in the same way. The new object is assigned to the class of whichever known object lies closest to it.

  • Decision tree: Decision Tree
    A decision tree is a predictive model that represents a mapping between object attributes and object values. Each internal node tests a condition on an attribute, each branch corresponds to the objects that satisfy that outcome of the test, and each leaf gives the predicted result for the objects that reach it.

  • Support vector machine: Support Vector Machine
    A support vector machine is a two-class model that can also be extended to multiple classes. Because it is based on maximizing the margin, it can handle both linear and nonlinear classification problems flexibly.
    As in the nearest neighbor method, objects are plotted in a coordinate system according to their attributes. A line is then drawn that separates the different classes as well as possible while keeping the perpendicular distance from the line to the closest object on either side as large as possible (the maximum margin). A new object is assigned to the class on whose side of the line it falls.

  • Genetic algorithm

  • Artificial neural networks

  • And many more algorithms...
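
As a concrete illustration of the nearest neighbor idea from the list above, here is a minimal Python sketch. The two-attribute objects and their labels are hypothetical, and Euclidean distance is an assumed choice of "closeness"; the notes do not specify a distance measure.

```python
import math


def nearest_neighbor(known, new_point):
    """Return the label of the known object closest to new_point.

    known is a list of (vector, label) pairs; distance is Euclidean
    (an assumption of this sketch).
    """
    best_label, best_dist = None, math.inf
    for vector, label in known:
        dist = math.dist(vector, new_point)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


# Hypothetical objects described by two attributes, invented for illustration.
known_objects = [
    ((1.0, 1.0), "A"),
    ((1.5, 2.0), "A"),
    ((5.0, 5.0), "B"),
    ((6.0, 4.5), "B"),
]

if __name__ == "__main__":
    print(nearest_neighbor(known_objects, (2.0, 1.5)))
    print(nearest_neighbor(known_objects, (5.5, 5.0)))
```

Using only the single closest object keeps the sketch short; a common variant lets the k closest objects vote on the class.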


7.3 WEKA

  • WEKA (Waikato Environment for Knowledge Analysis) is free data mining software. There is no need to worry about implementing the algorithm: input the data, select a ready-made algorithm, and WEKA outputs the resulting model.
  • Weka 3: Machine Learning Software in Java
  • WEKA stores data in the ARFF (Attribute-Relation File Format) format, which is an ASCII text file. Excel files cannot be read directly.
  • See the video for details: Machine Learning-01 P127
  • Terms in WEKA:
    ◆ A row in a table is called an instance (Instance), which is equivalent to a sample in statistics or a record in a database.
    ◆ A column in a table is called an attribute (Attribute), which is equivalent to a variable in statistics or a field in a database.
    ◆ Such a table, or data set, is seen by WEKA as a relation (Relation) among the attributes; for example, the relation in the weather example is named weather.

7.3.1 ARFF file format

● WEKA reads ARFF files based on line breaks and spaces, so line breaks or spaces cannot be added arbitrarily; blank lines and lines consisting only of spaces are ignored.
● A line beginning with % is a comment line.

Header information, including the relation declaration and the attribute declarations.

  • Relation declaration: the relation name is defined on the first effective line of the ARFF file, in the format: @relation relation-name
    where relation-name is a string; if the string contains spaces, it must be enclosed in quotation marks (ASCII single or double quotation marks).
  • Attribute declaration: defines the name and type of each attribute, in the format: @attribute attr-name attr-type
    where attr-name is a string that must start with a letter. As with relation names, if this string contains spaces, it must be quoted.
    The order of the attribute declarations determines the position of each attribute in the data section. The last declared attribute is called the class attribute; in classification or regression tasks, it is the default target variable.

Data information, that is, the data given in the data set.

  • The line @data marks the start of the data information. It is followed by the data of all instances, one instance per line. The attribute values of an instance are separated by commas. A missing value is represented by a question mark ?, and this question mark cannot be omitted. For example:
    @data
    sunny,85,85,FALSE,no
    ?,78,90,?,yes
  • Values of string and nominal attributes are case-sensitive. If a value contains spaces, it must be enclosed in quotes.
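
To make the layout rules above concrete, the following Python sketch assembles a small ARFF file as text. It is a hedged illustration, not a full ARFF writer: the helper names (quote, to_arff) are made up for this note, the attribute names are assumed to be those of WEKA's weather sample, and values that themselves contain quotation marks are not handled.

```python
def quote(value):
    """Quote a name or value that contains spaces; None (missing) becomes '?'."""
    if value is None:
        return "?"
    text = str(value)
    return f"'{text}'" if " " in text else text


def to_arff(relation, attributes, rows):
    """Build ARFF text from (name, type) attribute pairs and data rows."""
    lines = [f"@relation {quote(relation)}", ""]
    for name, attr_type in attributes:
        lines.append(f"@attribute {quote(name)} {attr_type}")
    lines += ["", "@data"]
    for row in rows:
        # One instance per line, values separated by commas.
        lines.append(",".join(quote(v) for v in row))
    return "\n".join(lines) + "\n"


# Attribute names are an assumption; the two data rows are the ones shown above.
weather_attributes = [
    ("outlook", "{sunny,overcast,rainy}"),
    ("temperature", "numeric"),
    ("humidity", "numeric"),
    ("windy", "{TRUE,FALSE}"),
    ("play", "{yes,no}"),  # declared last, so it is the class attribute
]
weather_rows = [
    ["sunny", 85, 85, "FALSE", "no"],
    [None, 78, 90, None, "yes"],
]

if __name__ == "__main__":
    print(to_arff("weather", weather_attributes, weather_rows))
```

Because play is declared last, a tool that follows the convention described above would treat it as the default target variable.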

7.3.2 ARFF attribute type and format conversion

There are four attribute types in the ARFF format:

  1. Numeric: numeric
    Format: @attribute name numeric
    Note:
    numeric = integer = real
    The type keywords "integer", "real", "numeric", "date" and "string" are case-insensitive.
    "relation", "attribute" and "data" are also case-insensitive.

  2. Nominal: nominal-specification
    Format: @attribute name {nominal-name1, nominal-name2, ...}
    Note: a nominal attribute is declared as a set of possible class names enclosed in braces {}. If "name" or any "nominal-name" contains spaces, it must be enclosed in quotation marks. "name" and all "nominal-name" values are case-sensitive.

  3. String: string
    Format: @attribute name string
    Note:
    This type of attribute is very useful in text mining.
    If "name" contains spaces, it must be enclosed in quotation marks.
    The data of a string attribute can contain arbitrary text; values containing spaces must be enclosed in quotation marks.

  4. Date and time: date
    Format: @attribute name date "<date-format>"
    Note:
    Date and time attributes are uniformly represented by the "date" type.
    "date-format" is a string specifying how date and time values are parsed and displayed.
    If "date-format" is omitted, the default ISO-8601 format yyyy-MM-dd'T'HH:mm:ss is used.

Format conversion

  • WEKA cannot read Excel files, but it can read CSV files (plain text files with columns separated by commas). Save the Excel sheet as a CSV file, open it with WEKA's ArffViewer, and then use File → Save as to save it as an .arff file.
  • The converted ARFF file has its attribute types assigned automatically, which may be inaccurate and require manual correction. For example, a name attribute should be of string type, not nominal type.
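
The reason automatic type assignment can go wrong is easy to see in a small sketch. The Python code below is not WEKA's conversion logic; it is an assumed, simplified rule (a column is numeric if every value parses as a number, otherwise nominal) with an overrides argument standing in for the manual correction. The column names and data are hypothetical, and values are assumed to contain no spaces or commas.

```python
import csv
import io


def infer_type(values):
    """Guess an ARFF type: numeric if every non-missing value parses as a number."""
    present = [v for v in values if v != ""]
    try:
        for v in present:
            float(v)
        return "numeric"
    except ValueError:
        # Anything non-numeric is naively treated as a nominal attribute.
        return "{" + ",".join(sorted(set(present))) + "}"


def csv_to_arff(csv_text, relation, overrides=None):
    """Convert CSV text with a header row to ARFF text.

    overrides maps a column name to a manually corrected attribute type.
    """
    overrides = overrides or {}
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    lines = [f"@relation {relation}", ""]
    for i, name in enumerate(header):
        column = [row[i] for row in data]
        lines.append(f"@attribute {name} {overrides.get(name, infer_type(column))}")
    lines += ["", "@data"]
    for row in data:
        # Empty CSV cells become ARFF missing values.
        lines.append(",".join(v if v != "" else "?" for v in row))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    sample = "name,age,result\nAlice,30,positive\nBob,,negative\n"
    print(csv_to_arff(sample, "patients"))                      # name guessed as nominal
    print(csv_to_arff(sample, "patients", {"name": "string"}))  # manually corrected
```

With the naive rule, name becomes a nominal attribute listing every person, which is exactly the kind of guess that needs to be corrected to string by hand.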

7.3.2 Data preparation: Explorer interface introduction

  • The Explorer interface is where data preprocessing and mining tasks are carried out.

  • See the video for details: WEKA-03 Explorer Interface Introduction P131

  • Open the weather.numeric.arff file with Open file.

  • By function, the current Preprocess interface can be divided into 8 areas.
    ◆ Area 1: the tabs switch between the panels for different mining tasks. The current Preprocess panel is used for data preview and preprocessing.
    ◆ Area 2: commonly used function buttons, including open, undo, edit, save, etc.
    ◆ Area 3 (data preprocessing): the various functions under the Filter option perform data filtering or attribute type conversion.
    ◆ Area 4 (Current relation): shows basic information about the data set, including the relation name, the number of attributes, and the number of instances.
    ◆ Area 5 (Attributes): lists all attributes. The Remove button deletes the selected attributes, and the Undo button in area 2 restores them after deletion. The row of buttons above area 5 is used for quick selection.
    ◆ Area 6 (Selected attribute): select an attribute in area 5 and a summary of it appears here. Note that the summary is displayed differently for numeric attributes and nominal attributes.
    ◆ Area 7: the histogram of the attribute selected in area 5. The histogram shows how the labels of the nominal target attribute (class attribute) are distributed among the labels of the currently selected attribute. WEKA takes the last declared attribute as the target attribute by default; it can also be re-selected through the Class attribute drop-down menu.
      Visualize All shows the histograms of all attributes. The histograms of nominal and numeric attributes differ: the histogram of a numeric attribute is divided into two sections at the average value, and the distribution of class attribute values in each section is counted separately.
    ◆ Area 8: the status bar. Check the log to determine whether the program has reported errors. The WEKA bird reflects the number of tasks and their status. Right-clicking the status bar also lets you run memory garbage collection.

7.3.3 Data preprocessing

Data preprocessing (1) attribute conversion

  • Some algorithms can only handle data whose attributes are all nominal, so numeric attributes must be converted into nominal ones.
    Choose the discretization function filters - unsupervised - attribute - Discretize → click the parameter box to set the parameters (as follows) → Apply.
    attributeIndices: selects which attributes to convert; enter the index numbers of the attributes, separated by commas.
    bins: sets the number of segments to discretize into, that is, the number of nominal values produced.
    The NumericToNominal function can also convert a numeric type to a nominal type. It differs slightly from Discretize: it turns each distinct numeric value into its own nominal label instead of grouping values into segments.
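
What the bins parameter means can be illustrated with equal-width binning, one common discretization strategy. The Python sketch below is an assumption-laden illustration: the function name equal_width_bins and the sample temperatures are invented, and it is not claimed to reproduce the exact behavior or labels of WEKA's Discretize filter.

```python
def equal_width_bins(values, bins):
    """Map each numeric value to a bin index 0..bins-1 using equal-width intervals."""
    low, high = min(values), max(values)
    width = (high - low) / bins
    if width == 0:
        # All values are identical, so everything falls into a single bin.
        return [0 for _ in values]
    # The maximum value would compute to index `bins`, so clamp it into the last bin.
    return [min(int((v - low) / width), bins - 1) for v in values]


if __name__ == "__main__":
    # Hypothetical temperature readings, invented for illustration.
    temperatures = [64, 70, 75, 80, 85]
    print(equal_width_bins(temperatures, bins=3))
```

Each bin index then plays the role of one nominal value, so bins=3 turns a numeric attribute into a nominal attribute with three possible labels.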

Data preprocessing (2) adding attributes

  • Use the AddExpression function to define a formula and add a new attribute computed from it.
  • For example: add a new attribute temp/humi whose value equals the value of temperature divided by the value of humidity.
  • NOTE: newly created attributes are added at the end! This affects which attribute WEKA treats as the class attribute, so the class attribute selection must be corrected manually!
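
The side effect described in the note above can be reproduced outside WEKA with a few lines of Python. This is a sketch, not WEKA's AddExpression: the helper name add_ratio_attribute is hypothetical, the attribute names other than temperature and humidity are assumed from the weather sample, and the data row is the first one from the ARFF example.

```python
def add_ratio_attribute(names, rows, numerator, denominator, new_name):
    """Append a new attribute equal to numerator / denominator to every row."""
    num_idx, den_idx = names.index(numerator), names.index(denominator)
    new_names = names + [new_name]
    new_rows = [row + [row[num_idx] / row[den_idx]] for row in rows]
    return new_names, new_rows


# Attribute names are assumed; "play" is the class attribute and is declared last.
names = ["outlook", "temperature", "humidity", "windy", "play"]
rows = [["sunny", 85, 85, "FALSE", "no"]]

new_names, new_rows = add_ratio_attribute(
    names, rows, "temperature", "humidity", "temp/humi"
)

# The last attribute is now temp/humi, so "last attribute = class" no longer holds.
# The class attribute has to be located explicitly instead.
class_index = new_names.index("play")

if __name__ == "__main__":
    print(new_names[-1], class_index)
```

Looking the class attribute up by name, rather than by position, is the programmatic counterpart of re-selecting it by hand in the Explorer.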

7.3.4 Executing mining tasks

  • See the video for details: WEKA-05 Executing Mining Tasks P133
  • If the target attribute (output variable) is nominal, the task is a classification task; if it is numeric, the task is a regression task.

Typical algorithms that come with WEKA

Typical classification algorithm:

  • Bayes (Bayesian Classifier):
    BayesNet (Bayesian Belief Network)
    NaiveBayes (naive Bayes classifier)
  • Functions:
    MultilayerPerceptron (multilayer feed-forward artificial neural network)
    SMO (support vector machine)
  • Trees (decision tree classifier):
    Id3 (ID3 decision tree learning algorithm)
    J48 (C4.5 decision tree learning algorithm)
    RandomTree (random decision tree algorithm)
    RandomForest (combination method based on decision trees)

Typical regression algorithm:

  • Functions:
    LinearRegression (in addition to the target attribute, supports multiple attributes)
    SimpleLinearRegression (in addition to the target attribute, only supports one attribute)
    PaceRegression (Pace regression)

WEKA practical operation

  • See the video for details: WEKA-05 Executing Mining Tasks P133
  • Data file: data/diabetes.arff in the WEKA installation directory
  • 1. Build a model: predict the diabetes outcome (negative or positive) from 8 physiological indicators (a classification task).
  • 2. Result output: the decision tree in text form and in graphical form.
    The text form of the decision tree makes it easier to program the predictive model for automated use.
    Zooming in on the decision tree: the leaves at the end of the tree are of only two types, a predicted diabetes outcome of negative or positive.
  • 3. Prediction accuracy:
    The predicted TN, FN, TP, and FP can be substituted into the formulas to calculate various accuracy measures, including Sensitivity and Specificity.
    If tested_negative is instead defined as positive and tested_positive as negative, another set of TN, FN, TP, and FP is obtained, and from it another set of accuracy measures. The two sets of statistics respectively reflect the prediction quality for each nominal value of the target attribute.
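
For reference, the standard formulas are Sensitivity = TP / (TP + FN) and Specificity = TN / (TN + FP). The Python sketch below applies them and shows what happens when the positive class is swapped. The four counts are hypothetical numbers chosen for illustration, not results from the diabetes data, and the function names are made up for this note.

```python
def sensitivity(tp, fn):
    """True positive rate: share of actual positives predicted as positive."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """True negative rate: share of actual negatives predicted as negative."""
    return tn / (tn + fp)


def accuracy(tp, tn, fp, fn):
    """Share of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def swap_positive_class(tp, tn, fp, fn):
    """Return (tp, tn, fp, fn) after treating the other class as 'positive'."""
    # Old true negatives become true positives, old false negatives become
    # false positives, and vice versa.
    return tn, tp, fn, fp


if __name__ == "__main__":
    # Hypothetical counts, invented for illustration.
    tp, tn, fp, fn = 40, 45, 5, 10
    print(sensitivity(tp, fn), specificity(tn, fp), accuracy(tp, tn, fp, fn))

    tp2, tn2, fp2, fn2 = swap_positive_class(tp, tn, fp, fn)
    print(sensitivity(tp2, fn2), specificity(tn2, fp2))
```

Swapping the positive class exchanges sensitivity and specificity while leaving the overall accuracy unchanged, which is why the two sets of statistics describe the two class labels separately.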

Origin blog.csdn.net/zea408497299/article/details/125241118