Hands-on: your first machine learning project in Python (no more endless language tutorials)



Want to do machine learning with Python but scratching your head over where to start?
Here I assume you are a beginner, and in this article we will complete your first machine learning project in Python together.
This article walks you through the following:
⦁ Downloading and installing Python, NumPy, SciPy and similar packages, the most useful Python packages for machine learning.
⦁ Loading a dataset and using statistical summaries and data visualization to understand its structure.
⦁ Creating six machine learning models, choosing the best one, and learning how to tell whether the chosen model's prediction accuracy is reliable.
If you are a beginner in machine learning and have decided to start with Python, this article should suit you well.

 
Python can look a bit scary at first
Python is a very popular and very powerful interpreted language. Unlike R, which is mainly a research tool, Python is a complete language and platform that you can use for research, development, and production systems.
Python also offers a great many modules and libraries to choose from, giving you multiple paths to each of the goals above. That can make the whole prospect feel overwhelming.
Once again, the best way to learn machine learning with Python is to complete an entire project.
⦁ Doing so forces you to install Python and start the Python interpreter (at the very least).
⦁ It gives you a chance to see, step by step, how a small project is completed from end to end.
⦁ It gives you confidence, the confidence to start small projects of your own.
Beginners need to complete a project from start to finish
There are plenty of books and courses on this subject, but they tend to be frustrating. They give you lots of recipes and code snippets, and it is hard to see how they all fit together.
You are really working on a complete project when you apply machine learning algorithms to your own dataset.
A machine learning project may not follow the steps below in strict order, but it does have a number of well-known steps:
⦁ define the problem
⦁ prepare the data
⦁ evaluate algorithms
⦁ improve results
⦁ present results
The best way to get familiar with a new platform or tool is to work through a real machine learning project from start to finish, practicing all of the key steps above. In other words, get hands-on: load data, summarize the data, evaluate algorithms, and make some predictions.
If you do, you will have a template you can easily apply to other datasets. Once you are confident, you can improve these key steps further, for example with better data preparation and better results.
The Hello World of machine learning
The best small project to start machine learning in Python with is the classification of iris flowers (the flower, not to be confused with the iris of the eye; download link here).
It is a good project, and the best part is that it is easy to understand.
⦁ The iris attributes are all numeric, so it is easy to figure out how to load and handle the data.
⦁ It is a classification problem, so we can practice with a relatively simple kind of machine learning algorithm, a supervised learning algorithm.
⦁ It is a multi-class problem (more than two classes, i.e. not binary), which may require special handling.
⦁ The dataset has only four attributes and 150 rows, so it is extremely small and fits easily in memory (and fits on a screen or a single A4 page).
⦁ All attribute values use the same units and the same scale, so you can get started without any scaling or transformation.
Hands-on Python machine learning, step by step (now it really begins)
In this section we work through a small project from start to finish.
Here are the steps we will complete:
⦁ install the Python and SciPy platform
⦁ load the dataset
⦁ compute various statistical summaries of the dataset
⦁ visualize the dataset
⦁ apply several algorithms to the dataset and evaluate them
⦁ make predictions
Each step takes a little time to finish.
I recommend typing the commands yourself, though you can copy and paste to speed things up; in short, nothing beats hands-on practice.
1. Download and install the Python SciPy platform
If you do not already have these tools, install them now. I will not walk through a very detailed installation process here, as there are plenty of guides available; if you are a developer, installing a few packages is simple.
1.1 Install the SciPy libraries
This article assumes Python 2.7 or 3.5. Either version should work fine.
There are five key libraries to install. Here are the Python SciPy libraries this article requires:
 scipy
 numpy
 matplotlib
 pandas
 sklearn
There are many ways to install these libraries. My advice is to pick one method and use it consistently to install all of the packages above.
The scipy installation page provides an excellent guide that explains how to install on different platforms (such as Linux, Mac OS X, and Windows). If you run into trouble, refer to it; thousands of people have been through the same experience.
On Mac OS X you can use MacPorts to install Python 2.7 and these libraries. For more information on MacPorts, see its homepage.
On Linux you can use your package manager, for example yum on Fedora, to install the RPMs.
If you are on Windows, or you are not sure about your system, I recommend installing the free Anaconda distribution, which comes with everything you need pre-installed.
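If you already have Python and pip on your system, one common route (just one of the many installation methods mentioned above, offered here as an example rather than the required way) is a single pip command:
pip install scipy numpy matplotlib pandas scikit-learn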
 Note: this tutorial requires your installed scikit-learn to be version 0.18 or higher.
1.2 Start Python and check the installed versions
Once installation is complete, it is a good idea to confirm that your Python environment works properly.
Below is a script for checking the environment. It imports each library we will use and prints its version number.
Open a command line and start the Python interpreter:
python
I recommend typing the script below directly into the interpreter, or saving it as a file and running it from the command line, rather than working in a large editor or IDE. Keep things simple, and keep your attention on the machine learning rather than the toolchain.
The script is as follows:
# Check the versions of the libraries
# Python version
import sys
print('Python: {}'.format(sys.version))
# scipy
import scipy
print('scipy: {}'.format(scipy.__version__))
# numpy
import numpy
print('numpy: {}'.format(numpy.__version__))
# matplotlib
import matplotlib
print('matplotlib: {}'.format(matplotlib.__version__))
# pandas
import pandas
print('pandas: {}'.format(pandas.__version__))
# scikit-learn
import sklearn
print('sklearn: {}'.format(sklearn.__version__))
Here is the output on my machine:
Python: 2.7.11 (default, Mar 1 2016, 18:40:10)
[GCC 4.2.1 Compatible Apple LLVM 7.0.2 (clang-700.1.81)]
scipy: 0.17.0
numpy: 1.10.4
matplotlib: 1.5.1
pandas: 0.17.1
sklearn: 0.18.1
Compare this with your own output. Ideally your versions match the ones above or are newer; these library APIs do not change often, so if your versions are slightly newer, everything in this tutorial should still apply.
If you get an error, find a way to fix it before continuing.
If you cannot run the script above cleanly, you will not be able to complete this tutorial.
I recommend Googling your error message, or asking a question on Stack Exchange.
2. Load the data
We will use the iris dataset. This dataset is famous because it is essentially the "Hello World" dataset of machine learning and statistics.
The dataset contains 150 observations of iris flowers. Each observation has four columns of flower measurements, in centimeters. The fifth column is the observed class, that is, the species of the flower. Every observation belongs to one of three iris species.
Next we will load the iris data from a CSV file at a URL.
2.1 Import the libraries
First, import all the modules, functions, and objects we will use in this tutorial.
# Load libraries
import pandas
from pandas.tools.plotting import scatter_matrix
import matplotlib.pyplot as plt
from sklearn import model_selection
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
Every import should succeed. If you get an error, you may need to reinstall your Python + SciPy environment.
(See the installation advice above.)
2.2 Load the dataset
We can load the data directly from the UCI Machine Learning Repository.
We use pandas to load the data. We will also use pandas later to explore the data, both with descriptive statistics and with visualization.
Note that we give each column a name as we load the data. This will help later when we explore the data.
# Load dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = pandas.read_csv(url, names=names)
The dataset should load without incident.
If you have network problems, you can download the iris file yourself, put it in your working directory, and load it in the same way, replacing the URL with the local file name.
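For example, assuming you saved the file under the (purely illustrative) name iris.data in your working directory, the loading line becomes:
# Same call as above, but reading from a local file; the filename here is just an example
dataset = pandas.read_csv('iris.data', names=names)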
3. Summarize the dataset
Now it is time to take a look at the data.
In this step we examine the data from several different angles:
 1. the dimensions of the dataset
 2. a peek at the data itself
 3. a statistical summary of all attributes
 4. a breakdown of the data by the class variable
These may sound involved, but don't worry: each of them takes only a single command. You will likely reuse the same commands over and over in future projects.
3.1 Dimensions of the dataset
By looking at the shape of the dataset, we can quickly see how many instances (rows) and attributes (columns) it contains.
# shape
print(dataset.shape)
You can see 150 instances and 5 attributes:
(150, 5)
3.2 Peek at the data
# head
print(dataset.head(20))
You should see the first 20 rows of the data:
    sepal-length  sepal-width  petal-length  petal-width        class
0            5.1          3.5           1.4          0.2  Iris-setosa
1            4.9          3.0           1.4          0.2  Iris-setosa
2            4.7          3.2           1.3          0.2  Iris-setosa
3            4.6          3.1           1.5          0.2  Iris-setosa
4            5.0          3.6           1.4          0.2  Iris-setosa
5            5.4          3.9           1.7          0.4  Iris-setosa
6            4.6          3.4           1.4          0.3  Iris-setosa
7            5.0          3.4           1.5          0.2  Iris-setosa
8            4.4          2.9           1.4          0.2  Iris-setosa
9            4.9          3.1           1.5          0.1  Iris-setosa
10           5.4          3.7           1.5          0.2  Iris-setosa
11           4.8          3.4           1.6          0.2  Iris-setosa
12           4.8          3.0           1.4          0.1  Iris-setosa
13           4.3          3.0           1.1          0.1  Iris-setosa
14           5.8          4.0           1.2          0.2  Iris-setosa
15           5.7          4.4           1.5          0.4  Iris-setosa
16           5.4          3.9           1.3          0.4  Iris-setosa
17           5.1          3.5           1.4          0.3  Iris-setosa
18           5.7          3.8           1.7          0.3  Iris-setosa
19           5.1          3.8           1.5          0.3  Iris-setosa
3.3 Statistical summary of all attributes
Now we can look at a statistical summary of each attribute.
This includes the count, mean, minimum and maximum values, and some percentiles.
# descriptions
print(dataset.describe())
We can see that all numerical values have the same units (centimeters) and similar ranges, between 0 and 8 cm.
       sepal-length  sepal-width  petal-length  petal-width
count    150.000000   150.000000    150.000000   150.000000
mean       5.843333     3.054000      3.758667     1.198667
std        0.828066     0.433594      1.764420     0.763161
min        4.300000     2.000000      1.000000     0.100000
25%        5.100000     2.800000      1.600000     0.300000
50%        5.800000     3.000000      4.350000     1.300000
75%        6.400000     3.300000      5.100000     1.800000
max        7.900000     4.400000      6.900000     2.500000
3.4 Class distribution
Let us now look at the number of instances (rows) that belong to each class. We can view this as an absolute count.
# class distribution
print(dataset.groupby('class').size())
We can see that each class has the same number of instances (50, or 33% of the total):
class
Iris-setosa        50
Iris-versicolor    50
Iris-virginica     50
4. Data visualization
We now have a basic understanding of the data. Building on that, we will use visualizations to deepen our understanding further.
We will look at two kinds of plots:
1. Univariate plots, to better understand each individual attribute.
2. Multivariate plots, to better understand the relationships between attributes.
4.1 Univariate plots
Let us start with univariate plots, that is, a plot for each individual variable.
Since the input variables are numeric, we can draw a box-and-whisker plot for each input variable.
# box and whisker plots
dataset.plot(kind='box', subplots=True, layout=(2, 2), sharex=False, sharey=False)
plt.show()
This gives us a much clearer idea of the distribution of the input attributes.

 

(Figure: box-and-whisker plot of each input attribute)
We can also create a histogram of each input variable to get an idea of its distribution.
# histograms
dataset.hist()
plt.show()
It looks like perhaps two of the input variables have a Gaussian distribution. This is worth noting, because we can use algorithms that exploit that assumption.

 

(Figure: histogram of each input attribute)
4.2 Multivariate plots
Now we can look at the interactions between the variables.
First, let us look at scatter plots of all pairs of attributes plotted against one another. These can help us spot structured relationships between the input variables.
# scatter plot matrix
scatter_matrix(dataset)
plt.show()
Note the diagonal grouping of some pairs of attributes. This suggests a high correlation and a predictable relationship.

 

(Figure: scatter plot matrix of the attributes)
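If you want to put a number on the correlations suggested by those diagonal groupings, pandas can print a correlation matrix. This is an optional extra check, not one of the tutorial's steps; the class column is dropped first because it is not numeric:
# Optional: numeric correlations between the four measurement columns
print(dataset.drop('class', axis=1).corr())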
5. Evaluate some algorithms
Now it is time to create some models of the data and estimate their accuracy on unseen data. Unseen data means data the models have never seen before (translator's note: future data, for example, data that was not in the training set).
Here is what this step covers:
 1. Separate out a validation dataset.
 2. Set up the test harness to use 10-fold cross-validation.
 3. Build 6 different models that predict the species of a flower from its measurements.
 4. Select the best model.
5.1 Create a validation dataset
We need to know whether the model we create is actually any good.
Later we will use statistical methods to estimate the accuracy of our models on unseen data. We also want a more concrete estimate of the accuracy of the best model on unseen data, obtained by evaluating it on data it has genuinely never seen.
To do that, we hold back some data that the algorithms do not get to see, and we use this held-back data to get a second, independent idea of how accurate the best model is when making real predictions.
We split the loaded dataset into two parts: 80% of it we use to train our models, and the remaining 20% we hold back as a validation dataset.
# Split-out validation dataset
array = dataset.values
X = array[:, 0:4]
Y = array[:, 4]
validation_size = 0.20
seed = 7
X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(X, Y, test_size=validation_size, random_state=seed)
We now have training data in X_train and Y_train,
and validation data, for use later, in X_validation and Y_validation.
5.2 Set up the test harness
We will use 10-fold cross-validation to estimate accuracy.
This splits the dataset into 10 parts: 9 parts are used for training and 1 for testing, and the procedure repeats across all train/test splits so that every part is used once as the test set.
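To make the splitting concrete, here is a minimal sketch of what 10-fold splitting does, run on a toy array of 20 indices rather than the iris data (this illustration is an addition, not one of the original steps):
# Toy illustration of 10-fold splitting: each pass holds out a different tenth as the test part
import numpy
from sklearn import model_selection
toy = numpy.arange(20)
kfold = model_selection.KFold(n_splits=10)
for train_index, test_index in kfold.split(toy):
    print(test_index)   # prints [0 1], then [2 3], and so on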
# Test options and evaluation metric
seed = 7
scoring = 'accuracy'
We use the metric 'accuracy' to evaluate our models. Accuracy is the number of correctly predicted instances divided by the total number of instances in the dataset, multiplied by 100 to give a percentage (for example, 95% accurate). We will use the scoring variable above when we run and evaluate each model.
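As a tiny worked example of the metric (using made-up labels, not the iris data): four of the five predictions below match the true labels, so the accuracy is 4/5 = 0.8, i.e. 80%:
# Toy accuracy example: 4 of 5 predictions are correct -> 0.8
from sklearn.metrics import accuracy_score
y_true = ['a', 'a', 'b', 'b', 'b']
y_pred = ['a', 'a', 'b', 'b', 'a']
print(accuracy_score(y_true, y_pred))   # 0.8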
5.3 Build the models
We do not know which algorithms will work well on this problem, or which configurations to use. The plots gave us a hint that some of the classes look partially linearly separable in some dimensions (notice how carefully I am phrasing that), so we expect generally good results.
Here we evaluate six different algorithms:
 Logistic Regression (LR)
 Linear Discriminant Analysis (LDA)
 K-Nearest Neighbors (KNN)
 Classification and Regression Trees (CART)
 Gaussian Naive Bayes (NB)
 Support Vector Machines (SVM)
This is a good mix of simple linear algorithms (LR and LDA) and nonlinear algorithms (KNN, CART, NB and SVM). We reset the random number seed before each run to make sure every algorithm is evaluated on exactly the same data splits, so that the results can be compared directly.
Let us build and then evaluate these six models:
# Spot Check Algorithms
models = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC()))
# evaluate each model in turn
results = []
names = []
for name, model in models:
    kfold = model_selection.KFold(n_splits=10, random_state=seed)
    cv_results = model_selection.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)
5.4 Select the best model
We now have six models and an accuracy estimate for each. We need to compare the models with one another and select the most accurate.
Running the example code above gives the following raw results:
LR: 0.966667 (0.040825)
LDA: 0.975000 (0.038188)
KNN: 0.983333 (0.033333)
CART: 0.975000 (0.038188)
NB: 0.975000 (0.053359)
SVM: 0.981667 (0.025000)
We can see that KNN appears to have the highest estimated accuracy score.
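If you would rather pick the winner programmatically than by eye, here is a small sketch that reuses the results and names lists built in the loop above (this snippet is an addition, not part of the original tutorial):
# Pick the model with the highest mean cross-validation accuracy
mean_scores = [cv_result.mean() for cv_result in results]
best_index = mean_scores.index(max(mean_scores))
print('Best model: %s (%f)' % (names[best_index], mean_scores[best_index]))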
We can also plot the model evaluation results and compare the spread and the mean accuracy of each model. There is a population of accuracy measures for each algorithm, because each algorithm was evaluated 10 times (10-fold cross-validation).
# Compare Algorithms
fig = plt.figure()
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()
In the resulting figure you can see that the box plots are squashed at the top of the range, with many evaluations reaching 100% accuracy.

 

(Figure: box plot comparison of the algorithms' accuracy)
6. Make predictions
In our tests, KNN had the best accuracy. Now we want to check the accuracy of that model on our validation set. Accuracy on the validation set is a final, independent check of the accuracy of the best model. Keeping a validation set back is a safeguard against slipping up during training, such as overfitting to the training set or a data leak; both would produce overly optimistic results.
We run the KNN model directly on the validation set and summarize the results as a final accuracy score, a confusion matrix, and a classification report.
# Make predictions on validation dataset
knn = KNeighborsClassifier()
knn.fit(X_train, Y_train)
predictions = knn.predict(X_validation)
print(accuracy_score(Y_validation, predictions))
print(confusion_matrix(Y_validation, predictions))
print(classification_report(Y_validation, predictions))
We can see that the accuracy is 0.9, or 90%. The confusion matrix shows the three prediction errors that were made. Finally, the classification report gives a breakdown of each class by precision, recall, F1-score, and support (granted, the validation set is small).
0.9
[[ 7  0  0]
 [ 0 11  1]
 [ 0  2  9]]
                 precision    recall  f1-score   support
    Iris-setosa       1.00      1.00      1.00         7
Iris-versicolor       0.85      0.92      0.88        12
 Iris-virginica       0.90      0.82      0.86        11
    avg / total       0.90      0.90      0.90        30
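As a sanity check on how the report follows from the confusion matrix above (rows are true classes, columns are predicted classes, in the order setosa, versicolor, virginica), here is the versicolor row worked by hand; this check is an addition to the original tutorial:
# Hand check of precision and recall for Iris-versicolor, read off the confusion matrix
tp = 11                                   # versicolor correctly predicted as versicolor
fn = 1                                    # versicolor wrongly predicted as virginica
fp = 2                                    # virginica wrongly predicted as versicolor
print('precision:', tp / float(tp + fp))  # 11/13, about 0.85
print('recall:', tp / float(tp + fn))     # 11/12, about 0.92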
Summary
This article has walked step by step through a complete machine learning project in Python. Working through a small project from start to finish, from loading the data through to making predictions, is the best way to get familiar with a new platform.


Origin www.cnblogs.com/guran0822/p/12204268.html