Basic Concepts of Machine Learning (1)


Definition of machine learning

Machine learning selects an algorithm according to existing data, builds a model from the algorithm and the data, and ultimately uses the model to make predictions about the future.

Basic elements of a learning algorithm


  • Input: x ∈ X (attribute values)
  • Output: y ∈ Y (target)
  • Objective (target) function:
    f: X → Y (the ideal but unknown mapping)
  • Input data: D = {(x₁, y₁), (x₂, y₂), …, (xₙ, yₙ)}
  • Final hypothesis with the desired properties:
    g: X → Y (the function obtained by learning)
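The elements above can be sketched in code: an unknown target f, a dataset D of (x, y) pairs, and a learned hypothesis g. The learning algorithm here is ordinary least squares for a line, chosen purely for illustration.

```python
def f(x):
    # The true target function (unknown to the learner in practice).
    return 2 * x + 1

# Input data D = {(x1, y1), ..., (xn, yn)}
D = [(x, f(x)) for x in range(10)]

def learn(data):
    # Fit y = a*x + b by least squares; returns the hypothesis g.
    n = len(data)
    sx = sum(x for x, _ in data)
    sy = sum(y for _, y in data)
    sxx = sum(x * x for x, _ in data)
    sxy = sum(x * y for x, y in data)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return lambda x: a * x + b

g = learn(D)
print(g(100))  # g approximates f: close to f(100) = 201
```

On noise-free linear data g recovers f exactly; on real data g is only an approximation of the unknown target.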

Common terms for describing algorithms and data

  • Fitting: building a model whose behavior is consistent with the characteristics of the given data
    • x^(i): the vector of the i-th sample
    • x_i: the value of the i-th dimension of vector x
  • Robustness: the ability of a system to remain stable; a robust algorithm can still fit the data well even when abnormal (outlier) data are present
  • Overfitting: the algorithm matches the characteristics of the sample data too closely, so it cannot fit the feature data encountered in actual production
  • Underfitting: the algorithm fails to capture the characteristics of the sample data
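A contrived sketch of the last two terms: a constant model underfits (large error on both training and test data), while a model that simply memorizes the training points overfits (zero training error, but real error on new points). All data here is synthetic.

```python
import random

random.seed(0)

def f(x):
    return 2 * x + 1

# Noisy training data and clean test data at unseen points.
train = [(x, f(x) + random.gauss(0, 0.5)) for x in range(10)]
test = [(x + 0.5, f(x + 0.5)) for x in range(10)]

def mse(model, data):
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

# Underfitting: a constant model ignores the structure in the data.
mean_y = sum(y for _, y in train) / len(train)
underfit = lambda x: mean_y

# Overfitting: memorize training points; answer with the nearest stored x.
table = dict(train)
overfit = lambda x: table[min(table, key=lambda t: abs(t - x))]

print("underfit  train/test MSE:", mse(underfit, train), mse(underfit, test))
print("overfit   train/test MSE:", mse(overfit, train), mse(overfit, test))
```

The overfit model scores perfectly on the training set yet still errs on test points, which is exactly the gap the two definitions describe.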

Categories of machine learning

Supervised learning

  • Supervised learning builds a mathematical model from a training set whose samples have known characteristics (labels), then uses the established model to predict unknown samples; it is the most commonly used machine learning approach. Its central task is to infer a model from labeled training data.
    • Discriminative model (Discriminative Model): models the conditional probability p(y | x) directly; common discriminative models include linear regression, decision trees, support vector machines (SVM), k-nearest neighbors, and neural networks.
    • Generative model (Generative Model): models the joint probability distribution p(x, y); common generative models include hidden Markov models (HMM), naive Bayes, Gaussian mixture models (GMM), LDA, and others.
    • Characteristics:
      1. Generative models are more general; discriminative models are more direct and more targeted.
      2. Generative models focus on how the data is generated and look for the distribution of the data.
      3. Discriminative models focus on the differences between the data and look for the classification boundary.
      4. A discriminative model can be derived from a generative model, but a generative model cannot be derived from a discriminative model.
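Point 4 can be illustrated with a toy discrete dataset (invented for this sketch): once a generative model has estimated the joint p(x, y), the conditional p(y | x) that a discriminative model would fit directly can always be derived from it by Bayes' rule, but not the other way around.

```python
from collections import Counter

# Tiny made-up dataset: x = weather, y = whether to play.
data = [("sunny", "yes"), ("sunny", "yes"), ("sunny", "no"),
        ("rainy", "no"), ("rainy", "no"), ("rainy", "yes")]

# Generative view: estimate the joint distribution p(x, y) by counting.
n = len(data)
p_xy = {k: c / n for k, c in Counter(data).items()}

# From the joint we can always derive the conditional p(y | x):
def p_y_given_x(y, x):
    p_x = sum(p for (xi, _), p in p_xy.items() if xi == x)
    return p_xy.get((x, y), 0.0) / p_x

print(p_y_given_x("yes", "sunny"))  # 2/3
```

A discriminative model would estimate p(y | x) directly and never learn p(x), which is why the reverse derivation is impossible.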

Unsupervised Learning

  • Compared with supervised learning, the training set in unsupervised learning carries no human-provided labels; during learning, the data have no specific annotations, and the model's task is to infer some of the internal structure of the data.
  • Unsupervised learning tries to learn the structure behind the data or to extract important common features from it; common algorithms include clustering, dimensionality reduction, and text processing (feature extraction).
  • Unsupervised learning is often used as a preprocessing step for supervised learning; its function is to extract the necessary label information from the raw data.
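A minimal sketch of the clustering idea mentioned above: 1-D k-means with k = 2, written from scratch for illustration. No labels are given; the algorithm discovers the two groups from the structure of the data alone.

```python
def kmeans_1d(points, iters=20):
    # Initialize the two centroids at the extremes of the data.
    c1, c2 = min(points), max(points)
    for _ in range(iters):
        # Assign each point to its nearest centroid, then recompute means.
        a = [p for p in points if abs(p - c1) <= abs(p - c2)]
        b = [p for p in points if abs(p - c1) > abs(p - c2)]
        c1, c2 = sum(a) / len(a), sum(b) / len(b)
    return c1, c2

# Two obvious unlabeled groups around 1 and 10.
points = [0.9, 1.0, 1.1, 9.9, 10.0, 10.1]
print(kmeans_1d(points))  # centroids near (1.0, 10.0)
```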

Semi-supervised learning

  • Semi-supervised learning considers how to use a small number of labeled samples together with a large number of unlabeled samples for training and classification; it is a combination of supervised learning and unsupervised learning.
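One common semi-supervised approach (self-training, not named in the text but illustrative of the idea) can be sketched as follows: a few labeled points plus many unlabeled ones; each round, the unlabeled point closest to the labeled set adopts the label of its nearest labeled neighbor. The data here is invented.

```python
labeled = {1.0: "a", 10.0: "b"}        # small labeled set
unlabeled = [1.5, 2.0, 9.0, 9.5, 5.2]  # larger unlabeled set

while unlabeled:
    # Pick the unlabeled point closest to any labeled point ("most confident").
    x = min(unlabeled, key=lambda u: min(abs(u - v) for v in labeled))
    # Give it the label of its nearest labeled neighbor.
    nearest = min(labeled, key=lambda v: abs(x - v))
    labeled[x] = labeled[nearest]
    unlabeled.remove(x)

print(labeled)  # points near 1.0 end up "a", points near 10.0 end up "b"
```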

Machine learning development process

Data collection and storage

  • Data sources:
    • User access data
    • Business data
    • External third-party data
  • Data storage:
    • Data that needs to be stored: raw data, preprocessed data, model results
    • Storage facilities: MySQL, HDFS, HBase, Solr, Elasticsearch, Kafka, Redis, etc.
  • Data collection tools:
    • Flume & Kafka
  • In practice we can use business data for machine learning development, but during the learning process, when no business data is available, publicly available datasets can be used instead. Common datasets include:
    • http://archive.ics.uci.edu/ml/datasets.html
    • https://aws.amazon.com/cn/public-datasets/
    • https://www.kaggle.com/competitions
    • http://www.kdnuggets.com/datasets/index.html
    • http://www.sogou.com/labs/resource/list_pingce.php
    • https://tianchi.aliyun.com/datalab/index.htm
    • http://www.pkbigdata.com/common/cmptIndex.html

Data preprocessing

  • Initial preprocessing converts the raw data into a representation suitable for machine learning models; for many types of model this representation consists of vectors or matrices of numeric data.
    • Encode categorical data into a numeric representation (typically 1-of-k, i.e. dummy, encoding).
    • Extract useful features from text data (typically bag-of-words or TF-IDF).
    • Process image or audio data (pixels, acoustic features, amplitude, Fourier transforms, etc.).
    • Convert numeric data into categorical data to reduce the number of values a variable takes, e.g. binning ages into segments.
    • Transform numeric data, e.g. with logarithmic transformations.
    • Regularize or standardize features to ensure that different input variables of the same model fall in the same range.
    • Combine or transform existing variables to generate new features, such as averages (as dummy variables); keep experimenting.
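Two of the steps above can be sketched directly: 1-of-k (dummy) encoding of a categorical feature, and standardization of a numeric feature to zero mean and unit variance. The sample values are invented.

```python
def one_hot(value, categories):
    # 1-of-k encoding: a vector with a single 1 at the category's position.
    return [1 if value == c else 0 for c in categories]

def standardize(values):
    # Shift to zero mean, scale to unit (population) variance.
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / var ** 0.5 for v in values]

colors = ["red", "green", "blue"]
print(one_hot("green", colors))  # [0, 1, 0]

ages = [18, 25, 33, 47, 52]
print(standardize(ages))
```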

Feature extraction

Model building

  • Model selection: choosing the best modeling method for a particular task, or choosing the optimal parameters for a particular model.

Model testing and evaluation

  • Run the model (algorithm) on the training dataset, test the results on the test dataset, and iteratively modify the model based on the data. This is called cross-validation: split the data into a training set and a test set, build the model with the training set, evaluate it with the test set, and propose amendments.
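The splitting idea above can be sketched as k-fold cross-validation, in which the data is divided into k folds and every sample serves once as test data; the training and scoring steps are left to the surrounding code.

```python
def k_fold_splits(data, k):
    # Yield (train, test) pairs; fold i takes every k-th item as test data.
    for i in range(k):
        test = data[i::k]
        train = [d for j, d in enumerate(data) if j % k != i]
        yield train, test

data = list(range(10))
for train, test in k_fold_splits(data, 5):
    print("train:", train, "test:", test)
```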

  • During model selection, run as many candidate algorithms as possible and compare their results.

  • Models are usually compared on the following metrics: accuracy, recall, precision, and F-score.

    • Accuracy = number of correctly classified samples / total number of samples.
    • Recall = number of correctly predicted positive samples / number of actual positive samples (i.e., coverage).
    • Precision = number of correctly predicted positive samples / number of samples predicted positive.
    • F-score = 2 × Precision × Recall / (Precision + Recall) (the harmonic mean of precision and recall).
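The four metrics above can be computed from a confusion-matrix count for a binary task (positive class = 1); the labels and predictions here are invented.

```python
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

# Confusion-matrix counts: true/false positives and negatives.
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)  # 0.75 0.75 0.75 0.75
```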

Put into use (model deployment and integration)

  • Once a good model has been built, store the trained model in a database so that other applications can load and use it (a well-built model is usually stored as a matrix of coefficients).
  • Models need to be updated periodically, e.g. monthly or weekly.
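A sketch of storing a trained model for later use, here with Python's pickle module and a stand-in dictionary of linear-model coefficients; in production the serialized bytes might be written to a database or object store instead.

```python
import pickle

# Stand-in "trained model": learned coefficients of a linear model.
model = {"weights": [2.0], "bias": 1.0}

blob = pickle.dumps(model)      # serialize (e.g., to store in a database)
restored = pickle.loads(blob)   # later: another application loads it

x = 5.0
prediction = restored["weights"][0] * x + restored["bias"]
print(prediction)  # 11.0
```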

Iterative optimization

  • Once the model is put into the actual production environment, monitoring its performance is very important, and attention often needs to be paid to business performance and user experience; for this reason an A/B test is sometimes run (e.g. a 3:7 split that routes traffic between the original system and the new algorithm and compares the difference between the two).
  • The model needs to respond to user feedback, i.e. be modified based on it; note that abnormal feedback can distort the model, so the necessary data preprocessing steps must be applied to the feedback as well.


Origin www.cnblogs.com/tankeyin/p/12113762.html