Introduction to Machine Learning with Python

  Machine learning is a cross-disciplinary field that draws on probability theory, statistics, approximation theory, convex analysis, computational complexity theory, and other subjects. It studies how computers can simulate or implement human learning behavior in order to acquire new knowledge or skills, and how they can reorganize existing knowledge structures to continuously improve their own performance. Extracting knowledge from data is also known as predictive analytics or statistical learning.

  Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent.

  Let's first take a look at why machine learning is worth learning.


One: Why learn machine learning, and what problems it can solve

1. Motivation

  Machine learning is now very popular and contributes both to everyday life and to advanced scientific problems.

  In the early days, many systems and programs were built from hand-crafted decision rules, but human-written rules have drawbacks. If a task changes even slightly, the whole system may need to be rewritten; on the other hand, designing the rules requires a deep understanding of how the decision should be made.

  Face recognition is one example: humans and computers describe faces very differently, so hand-coding the recognition logic would require extremely complex rules. Instead, we want the computer to learn to recognize faces automatically, which is exactly what machine learning provides. This is why learning machine learning is necessary.

2. Problems machine learning can solve

  The most successful machine learning algorithms are those that automate decision-making by generalizing from known examples. This approach is called supervised learning.

  In supervised learning, the user provides the algorithm with paired inputs and outputs, and the algorithm finds a way to produce the desired output from a given input. The learned method can then be applied to other data sets, estimating unknown outputs from known inputs. It is as if a teacher were supervising the algorithm.
  Examples of supervised learning: identifying zip codes on envelopes, determining whether a tumor is benign, detecting fraudulent bank credit card transactions.
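The idea of learning from paired inputs and outputs can be sketched with scikit-learn (assuming it is installed, e.g. via Anaconda). The dataset and classifier here are illustrative choices, not part of the examples above:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Paired inputs X and known outputs y ("the teacher's answers")
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The algorithm learns a mapping from inputs to outputs...
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)

# ...and applies it to data it has never seen
score = model.score(X_test, y_test)
```

Here `score` measures how well the learned mapping generalizes to unseen data.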

  The other main class of algorithms is unsupervised learning. Here only the input data is known and no output data is provided, which makes these algorithms harder to evaluate.

  Examples of unsupervised learning: identifying topics in a series of blog posts, segmenting customers or friends into groups of different types, detecting abnormal access patterns on a website.
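A minimal unsupervised sketch, again assuming scikit-learn is available; k-means and the synthetic data are illustrative choices. Note that only inputs are given to the algorithm, no outputs:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic input data only -- no labels are shown to the algorithm
X, _ = make_blobs(n_samples=150, centers=3, random_state=42)

# The algorithm groups the points into clusters on its own
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)
```

Since there is no "correct answer" to compare against, judging whether the discovered groups are meaningful is up to the user.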

  It is useful to represent the input data in tabular form: each row represents one data point, and each column represents a property of the data.

  In machine learning, each entity or row is called a sample or data point, and each column (an attribute describing those entities) is called a feature.
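A quick illustration of this layout with NumPy; the numbers are made up for demonstration:

```python
import numpy as np

# 3 samples (rows), each described by 2 features (columns)
data = np.array([[5.1, 3.5],
                 [4.9, 3.0],
                 [6.2, 2.8]])

n_samples, n_features = data.shape
```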

  No machine learning model can be built without a data set that carries enough information for the task.


Two: Why choose Python as the language for machine learning

  Python combines the power of a general-purpose programming language with the ease of use of domain-specific scripting languages. It has a rich ecosystem of libraries, and one of its main advantages is the ability to interact with code directly, using a terminal or tools such as the Jupyter Notebook.


Three: Introduction to common machine learning libraries

1. scikit-learn

  Description: an open-source Python library containing state-of-the-art machine learning algorithms; it is the best-known Python machine learning library.

  User's guide: http://scikit-learn.org/stable/user_guide.html


  Installing scikit-learn: the easiest way is to install Anaconda, a Python distribution for data analysis that already bundles all the libraries needed for machine learning.

2. Jupyter Notebook

  An interactive environment for running code in the browser, with many convenient interactive features; it can combine code, text, and images in a single document.

3. NumPy

  NumPy provides the fundamental data structure for scientific computing in Python. Its features include multi-dimensional arrays, high-level mathematical functions, and pseudo-random number generators. In scikit-learn, all data must be converted to NumPy multi-dimensional arrays, usually referred to as "NumPy arrays" or simply "arrays."
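A small sketch of the features just listed (assuming NumPy is installed); the values are arbitrary:

```python
import numpy as np

# A 2x3 multi-dimensional array (ndarray), NumPy's core data structure
x = np.array([[1, 2, 3],
              [4, 5, 6]])

total = x.sum()                  # one of many high-level mathematical functions

rng = np.random.default_rng(0)   # a seeded pseudo-random number generator
noise = rng.standard_normal(3)   # three pseudo-random draws
```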

4. SciPy

  SciPy is a collection of Python functions for scientific computing. It provides, among other things, advanced linear algebra routines, mathematical function optimization, signal processing, special functions, and statistical distributions.
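One SciPy capability worth knowing for machine learning is the sparse matrix, which stores only the nonzero entries of a mostly-zero array. A minimal sketch, assuming SciPy and NumPy are installed:

```python
import numpy as np
from scipy import sparse

# A dense 4x4 identity matrix: 16 stored values, only 4 of them nonzero
eye = np.eye(4)

# The sparse CSR representation stores only the 4 nonzero entries
sparse_eye = sparse.csr_matrix(eye)
```

For large, mostly-empty feature matrices (e.g. word counts), the sparse form can save enormous amounts of memory.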

5. matplotlib

  matplotlib is the main scientific plotting library for Python, used to visualize the content produced during data analysis. In a Jupyter Notebook, the %matplotlib notebook and %matplotlib inline commands display images directly in the browser.
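A minimal plotting sketch, assuming matplotlib and NumPy are installed. The non-interactive Agg backend is selected here so the script also runs outside a notebook or display; inside Jupyter you would use the %matplotlib commands instead:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np

# Plot a sine curve and save it to a file
x = np.linspace(-10, 10, 100)
y = np.sin(x)

fig, ax = plt.subplots()
ax.plot(x, y, marker="x")
fig.savefig("sine.png")
```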

6. pandas

  pandas is a Python library for data processing and analysis. It is built around a data structure called the DataFrame, which is similar to a two-dimensional table in a database.
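A small DataFrame sketch, assuming pandas is installed; the names and ages are invented for illustration:

```python
import pandas as pd

# Build a table-like DataFrame from a dictionary of columns
data = {"Name": ["Ann", "Bob", "Carl"],
        "Age": [24, 31, 28]}
df = pd.DataFrame(data)

# Query it much like a database table: select rows where Age > 25
over_25 = df[df["Age"] > 25]
```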

7. mglearn

  A utility library that lets users quickly beautify plots and obtain some interesting data sets.

8. Importing the common machine learning libraries

import sys
import pandas as pd
import matplotlib
import numpy as np
import scipy as sp
import IPython
import sklearn


Four: The machine learning process

1. Abstract the real problem into a mathematical one

  Abstract the real-world problem into a mathematical one: decide whether it is a classification, regression, or clustering problem, identify the specific class of problem, and determine what data can be used for it.

2. Get the data

  The first step of machine learning is collecting data; the quality and quantity of the collected data directly determine whether a good prediction model can be built. We deduplicate, standardize, and fix errors in the collected data to obtain standard data as multi-dimensional arrays, saved to text files (csv, txt, json) or a database.

  Note that acquiring data includes both obtaining the raw data and extracting training and test feature data from it through feature engineering. The data determines the upper limit of what machine learning can achieve; algorithms merely approach that limit as closely as possible. If the data is too large, consider reducing the number of training samples, applying dimensionality reduction, or using a distributed machine learning system.
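The cleaning steps above (deduplicate, then save standard data to a text file) can be sketched with pandas. The CSV content and the file name clean_data.csv are hypothetical:

```python
import io
import pandas as pd

# Hypothetical raw data containing a duplicated record
raw = io.StringIO("id,height,weight\n"
                  "1,170,65\n"
                  "2,180,80\n"
                  "2,180,80\n")

df = pd.read_csv(raw)
df = df.drop_duplicates()                  # deduplication
df.to_csv("clean_data.csv", index=False)   # save the standard data as csv
```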

3. Analyze the data

  This mainly means examining the distribution of each column: finding the maximum, minimum, mean, variance, median, tertiles, quartiles, and the counts or proportions of particular values (such as zeros). The best way to understand the data is visual, intuitive analysis.
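Many of these summary statistics come from a single pandas call. A sketch on a made-up column of values:

```python
import pandas as pd

df = pd.DataFrame({"value": [1.0, 2.5, 0.0, 4.2, 3.3]})

# count, mean, std, min, quartiles, and max in one call
stats = df["value"].describe()

# proportion of a specific value, e.g. zeros
zero_ratio = (df["value"] == 0).mean()
```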

4. Feature engineering

  Feature engineering includes constructing features from the raw data, feature extraction, and feature selection. Good feature engineering can significantly improve the effectiveness and performance of an algorithm; sometimes a simple model with good features outperforms a complex model. Most of the effort in data mining goes into feature engineering, which is a fundamental and essential step in machine learning. Data preprocessing, data cleaning, keeping salient features, and discarding non-significant features are all very important.

5. Vectorization

  Vectorization reprocesses the extracted features into numeric vectors. It aims to strengthen the representational power of the features, prevent the model from becoming too complex to learn, and simplify complex problems.
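As one concrete instance of vectorization, text can be turned into numeric feature vectors with scikit-learn's bag-of-words CountVectorizer (an illustrative choice; other vectorization schemes exist):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["machine learning with python",
        "python for data analysis"]

vec = CountVectorizer()
X = vec.fit_transform(docs)  # each document becomes a row of word counts
```

Each row of `X` is now a fixed-length numeric vector a model can consume, with one column per word in the vocabulary.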

6. Split the data set

  The data is divided into two parts: one is used to train the model; the other is used to evaluate the performance of the trained model and test whether the model is suitable.
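The split can be sketched with scikit-learn's train_test_split; the array contents and the 25% test fraction here are illustrative:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(16).reshape(8, 2)  # 8 samples, 2 features
y = np.arange(8)

# Hold out 25% of the samples for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)
```

Fixing random_state makes the split reproducible across runs.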

7. Model training

  Before training, choose an appropriate algorithm, such as linear regression, decision trees, random forests, logistic regression, gradient boosting, or an SVM. The best approach is to try several different algorithms and select the best one by cross-validation. Note that on a small training set, a high-bias/low-variance classifier (such as naive Bayes) outperforms a low-bias/high-variance classifier (such as k-nearest neighbors) because it is less prone to overfitting; on a large training set, the low-bias/high-variance classifier becomes more suitable.
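Selecting among candidate algorithms by cross-validation can be sketched with scikit-learn; the dataset and the logistic regression model are illustrative stand-ins for your own candidates:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation: train and score on 5 different splits
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
mean_score = scores.mean()
```

Repeating this for each candidate algorithm and comparing mean scores gives a fairer comparison than a single train/test split.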

8. Evaluation

  After training is complete, the model is evaluated on the held-out split of the data by comparing its predictions with the real values to judge whether the model is acceptable. Five common methods: the confusion matrix, the lift chart, the Lorenz chart with the Gini coefficient, the KS curve, and the ROC curve.
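The confusion matrix, the first of these methods, is available in scikit-learn; the tiny label vectors below are invented for illustration:

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 1, 0]  # real values from the held-out split
y_pred = [0, 1, 1, 1, 0, 0]  # the model's predictions

# Rows are the true classes, columns the predicted classes
cm = confusion_matrix(y_true, y_pred)
```

The diagonal counts the correct predictions; the off-diagonal entries show which classes get confused with which.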

  After evaluation, if we want to improve the model further, we can adjust its parameters and then repeat the training and evaluation process.

9. Documentation

  After the model is trained, organize the various documents and files that describe it, so as to ensure the model will function correctly.

10. Interface packaging and going live

  Wrap the model behind a packaged service interface so that it can be called to return predictions. With that, the whole machine learning model is online.

Origin www.cnblogs.com/ITXiaoAng/p/11618546.html