[Data Mining] Study Notes


<Data preprocessing>

  • Aggregation: combine multiple samples or features into one (reduces data size, changes the scale of analysis, gives more stable values)
  • Sampling: select a representative subset of the data to work with
  • Dimensionality reduction: represent samples in a lower-dimensional space (PCA, SVD)
  • Feature selection: keep only the important features (Lasso)
  • Feature creation: construct new, more useful features from existing ones (Fourier transform)
  • Discretization
    • The process of converting continuous attributes into discrete attributes
    • Commonly used for classification
  • Binarization
    • Map continuous or categorical attributes to one or more binary variables
    • Commonly used for association analysis
    • Convert continuous attributes into categorical attributes, then convert each categorical attribute into a set of binary variables
  • Variable transformation
    • Applies a transformation to every value of a given attribute
    • Linear transformation (a simple function)
  • Normalization (standardization)
    • Min-max normalization (rescale to a fixed range such as [0, 1])
    • Z-score normalization (zero-mean normalization)
    • Decimal scaling normalization
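
The transformations listed above can be sketched in a few lines of NumPy. This is only an illustrative sketch: the attribute values are made up, and equal-width binning with three bins is an assumed choice for the discretization step, not something these notes prescribe.

```python
import numpy as np

# Made-up values of one continuous attribute (assumption for illustration).
values = np.array([-986.0, 200.0, 300.0, 600.0, 917.0])

# Min-max normalization: rescale linearly so the values fall in [0, 1].
min_max = (values - values.min()) / (values.max() - values.min())

# Z-score normalization: subtract the mean, divide by the standard deviation.
z_score = (values - values.mean()) / values.std()

# Decimal scaling: divide by 10**j, where j is the smallest integer
# such that the largest absolute scaled value is below 1.
j = int(np.floor(np.log10(np.abs(values).max()))) + 1
decimal_scaled = values / 10 ** j

# Discretization: equal-width binning into 3 intervals (bin indices 0, 1, 2).
edges = np.linspace(values.min(), values.max(), 4)  # 4 edges -> 3 bins
bins = np.digitize(values, edges[1:-1])

# Binarization: turn each bin index into a set of binary (one-hot) variables.
one_hot = np.eye(3, dtype=int)[bins]

print(min_max, z_score, decimal_scaled, bins, one_hot, sep="\n")
```

scikit-learn offers equivalent transformers (for example MinMaxScaler and StandardScaler in sklearn.preprocessing); the NumPy version is used here only to make each formula visible.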

<sklearn machine learning platform>

Machine learning library:

  • Algorithms covered: classification algorithms, clustering algorithms, regression algorithms, dimensionality reduction algorithms
  • Scikit-learn main usage:
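    • Notation: training data, training-set labels, test data, test-set labels, complete data, label data (conventionally X_train, y_train, X_test, y_test, X, y)
    • Data partition:
      • train_test_split(X, y, random_state=...)
      • shuffle=True (the default) shuffles the data before splitting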
    • Data preprocessing
    • Supervised learning algorithms (classification):
      • logistic regression
      • Support Vector Machines
      • Naive Bayes
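
A minimal sketch of the workflow outlined above (partition, preprocess, fit a classifier), assuming the bundled iris dataset, a 70/30 split, and logistic regression as the classifier. None of these choices come from the notes; they are only there to make the example runnable.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Complete data X and labels y (iris is an assumed example dataset).
X, y = load_iris(return_X_y=True)

# Data partition: shuffle, then hold out 30% of the samples as the test set.
# random_state fixes the shuffle so the split is reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, shuffle=True
)

# Data preprocessing: fit the scaler on the training data only,
# then apply the same transformation to the test data.
scaler = StandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)

# Supervised learning: fit a classifier and score it on the held-out data.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train_std, y_train)
accuracy = clf.score(X_test_std, y_test)
print(f"test accuracy: {accuracy:.3f}")
```

Swapping in sklearn.svm.SVC or sklearn.naive_bayes.GaussianNB for LogisticRegression leaves the rest of the code unchanged, since all three expose the same fit/score interface.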

Chapter 3 Regression Analysis

3.1 Basic concepts of regression analysis

  • Regression analysis
  • By the number of variables involved: univariate regression, multivariate regression
  • By the number of dependent variables: simple regression (one dependent variable), multiple regression (several dependent variables)
  • By the type of relationship between the independent and dependent variables: linear regression, nonlinear regression
  • Problems solved by regression analysis:
    • Determine whether variables are related: deterministic (functional) relationship or non-deterministic (statistical) relationship
    • Predict or control the value of one or more variables
  • Regression analysis steps
    • Determine variables: identify the factors that influence the target (independent variables) and pick out the main ones
    • Build a predictive model: fit the regression equation from historical data on the independent and dependent variables
    • Conduct correlation analysis: measure how strongly the independent variables are related to the predicted quantity
    • Calculate prediction error: decide whether the model is accurate enough for actual prediction
    • Determine the predicted value: compute the prediction and analyze it comprehensively

3.2 Univariate linear regression

F-test, t-test

  • Y = a + bX + ε
  • Model features:
    • Y is a linear function of X plus an error term
    • The linear part reflects changes in Y due to changes in X
    • The error term ε is a random variable
    • For a given value of X, the expected value of Y is E(Y) = a+bX
  • Regression equation: E(Y) = a + bX
  • Regression equation solving and model testing:
    • Least squares (solving the equation): choose a and b to minimize the residual sum of squares
    • Goodness-of-fit test (model test)
    • Significance tests at a chosen significance level: of the regression equation (the linear relationship as a whole) and of the regression parameters, using ESS and RSS
    • Univariate linear regression example
    • Evaluation criterion: R² (coefficient of determination)
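
The least-squares solution and the tests listed above can be worked through by hand on a small example. The sketch below uses made-up data and assumes the convention that ESS is the explained (regression) sum of squares and RSS the residual sum of squares; some textbooks swap these two abbreviations.

```python
import math

# Made-up sample (assumption for illustration only).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.2, 4.1, 6.2, 7.9, 10.1]
n = len(x)

x_mean = sum(x) / n
y_mean = sum(y) / n

# Least squares estimates: b = Sxy / Sxx, a = mean(y) - b * mean(x).
s_xx = sum((xi - x_mean) ** 2 for xi in x)
s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
b = s_xy / s_xx
a = y_mean - b * x_mean

# Sums of squares: TSS = ESS + RSS.
y_hat = [a + b * xi for xi in x]
tss = sum((yi - y_mean) ** 2 for yi in y)              # total
ess = sum((fi - y_mean) ** 2 for fi in y_hat)          # explained by the line
rss = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # residual

# Goodness of fit: share of the total variation explained by the line.
r2 = 1 - rss / tss

# F statistic for the equation (1 and n - 2 degrees of freedom).
f_stat = (ess / 1) / (rss / (n - 2))

# t statistic for the slope b (n - 2 degrees of freedom).
se_b = math.sqrt(rss / (n - 2) / s_xx)
t_stat = b / se_b

print(f"a={a:.4f} b={b:.4f} R^2={r2:.5f} F={f_stat:.2f} t={t_stat:.2f}")
```

With a single independent variable the two tests are equivalent: the square of the t statistic for b equals the F statistic. Each statistic is compared against the critical value of its distribution at the chosen significance level.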

3.3 Multiple linear regression

  • Y = a + b1X1 + b2X2 + … + bnXn + ε
  • Model features:
    • Y has a linear relationship with X1, X2, …, Xn
    • The observations Yi (i = 1, 2, 3, …) are independent of each other
    • Random error ε ~ N(0, σ²)
  • Solving the multiple regression equation using the least squares method
  • Goodness of fit test
  • Significance test of regression parameters
  • Multiple linear regression example
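
A minimal sketch of the least-squares solution for two independent variables, using NumPy. The data are made up and deliberately noise-free (generated from a = 1, b1 = 2, b2 = -3), so the recovered coefficients can be checked against the generating values.

```python
import numpy as np

# Made-up observations of two independent variables (one row per sample).
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])

# Assumed true model used to generate y: Y = 1 + 2*X1 - 3*X2 (no noise).
true_coef = np.array([1.0, 2.0, -3.0])  # a, b1, b2
y = true_coef[0] + X @ true_coef[1:]

# Design matrix: a column of ones for the intercept a, then X1 and X2.
A = np.column_stack([np.ones(len(X)), X])

# Least squares: minimize the sum of squared residuals ||A @ coef - y||^2.
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
a, b1, b2 = coef

# Goodness of fit: R^2 = 1 - RSS / TSS.
y_hat = A @ coef
rss = np.sum((y - y_hat) ** 2)
tss = np.sum((y - y.mean()) ** 2)
r2 = 1 - rss / tss

print(f"a={a:.3f} b1={b1:.3f} b2={b2:.3f} R^2={r2:.3f}")
```

On real data with noise, the same code applies unchanged; only the fit is no longer exact and R² drops below 1.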

3.4 Polynomial regression

  • Polynomial regression equation (nonlinear → linear)
  • Polynomial regression equation example
    • Solving polynomial regression equations
    • F-test of the polynomial regression equation
    • t-test of the polynomial regression coefficients
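
The "nonlinear → linear" step can be illustrated as follows: treating x and x² as two separate variables turns a quadratic fit into an ordinary multiple linear regression. The sketch assumes made-up, noise-free data from Y = 1 + 2x + 3x² and a degree of 2; both are illustrative choices.

```python
import numpy as np

# Made-up data from an assumed quadratic model: Y = 1 + 2x + 3x^2 (no noise).
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x + 3.0 * x ** 2

# Substitution z1 = x, z2 = x^2: the model becomes Y = a + b1*z1 + b2*z2,
# which is linear in the coefficients.
Z = np.column_stack([np.ones_like(x), x, x ** 2])

# Solve the resulting linear least-squares problem.
coef, *_ = np.linalg.lstsq(Z, y, rcond=None)

# Cross-check with NumPy's polynomial fit (highest degree first, so reverse).
poly_coef = np.polyfit(x, y, 2)[::-1]

print("least squares:", coef)
print("np.polyfit:   ", poly_coef)
```

Because the transformed model is linear, the F-test and t-test from multiple linear regression carry over to the polynomial equation without modification.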

Evaluation criteria for regression

  • Mean Squared Error (MSE)
  • Root mean square error (RMSE)
  • Mean Absolute Error (MAE)
  • Choose MSE or MAE?
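
The three criteria above can be computed directly from their definitions. The sketch below uses made-up predictions in which one value is badly wrong, to show why the choice between MSE and MAE matters.

```python
import math

# Made-up targets and predictions; the last prediction is an outlier.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 19.0]

errors = [t - p for t, p in zip(y_true, y_pred)]
n = len(errors)

# MSE: mean of the squared errors.
mse = sum(e ** 2 for e in errors) / n

# RMSE: square root of MSE, in the same units as the target.
rmse = math.sqrt(mse)

# MAE: mean of the absolute errors.
mae = sum(abs(e) for e in errors) / n

print(f"MSE={mse} RMSE={rmse:.3f} MAE={mae}")
```

Here the single error of 10 contributes 100 of the 102 total squared error, so MSE is dominated by the outlier, while MAE weights every error in proportion to its size. MSE is the natural choice when large errors should be penalized heavily; MAE is more robust when outliers are present.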

Origin blog.csdn.net/Lenhart001/article/details/132691343