A hands-on walkthrough of practical machine learning, so Python developers can no longer say they don't know how to get started ☝☝☝

Want to do machine learning in Python but scratching your head over where to start?
This article assumes you are a complete beginner and walks through your first machine learning project in Python with you.
It covers the following:

  1. Download and install Python, NumPy, SciPy, and related packages, the most useful Python packages for machine learning.
  2. Load a dataset, compute statistical summaries, and visualize the data to understand its structure.
  3. Build six machine learning models, pick the best one, and learn a method for getting a stable estimate of the chosen model's prediction accuracy.

If you are new to machine learning and determined to start with Python, this article should suit you well.

 

 

 

At first, Python can look a bit intimidating

Python is a very popular and powerful interpreted language. Unlike R, Python is a complete language and platform that you can use for research, development, and production systems.

Python also offers a huge choice of modules and libraries, providing many different paths to accomplish the research, development, and production tasks mentioned above. This abundance can make Python feel overwhelming.

Again, the best way to learn machine learning with Python is to complete an entire project.

  1. It forces you to install Python and start the Python interpreter (at the very least).
  2. It gives you the full experience of completing a small project step by step.
  3. It gives you the confidence to start your own small projects.

Beginners need to complete a project from start to finish

There are many books and courses on this subject, but they tend to be frustrating: they give you lots of recipes and code snippets, yet it is hard to see how the pieces fit together.

When you apply machine learning algorithms to your own dataset, you are working on a complete project.
A machine learning project may not proceed strictly in the order below, but it does have some well-known steps:

  1. Define the problem
  2. Prepare data
  3. Evaluate algorithms
  4. Improve results
  5. Present the results

The best way to get familiar with a new platform or tool is to complete a practical machine learning project from start to finish, practicing all of the key steps above. In other words: load data by hand, summarize it statistically, evaluate algorithms, and then make some predictions.

If you do this, you have a template that you can easily apply to another dataset. Once you have confidence, you can refine the key steps further, such as deeper data preparation and result improvement.

The "Hello World" of machine learning
The best small project for starting machine learning with Python is the iris classification problem (the flower, English name iris; not to be confused with the iris of the eye). (Download link here)

This is a good project, above all because it is easy to understand.

  1. The iris attributes are numeric, so it is easy to figure out how to load and process the data.
  2. It is a classification problem, so we can practice with a relatively simple kind of machine learning algorithm: a supervised learning algorithm.
  3. It is a multi-class classification problem (not just two categories, i.e., not binary), which may require special handling.
  4. The dataset has only 4 attributes and 150 rows, meaning it is extremely small and easily fits in memory (and on a screen, or printed on a single sheet of A4 paper).
  5. All attribute values share the same unit and the same scale, so you can get started without any scaling or conversion.

The hands-on Python machine learning walkthrough (this is where it really begins)

In this section, we work through a small project from start to finish.
Here is an overview of the steps:

  1. Install the Python and SciPy platform
  2. Load the dataset
  3. Summarize the dataset with statistics
  4. Visualize the dataset
  5. Evaluate some algorithms on the dataset
  6. Make predictions

Each step takes a little time to complete.

I recommend typing the commands yourself, though you can copy and paste to speed things up; either way, reading alone is no substitute for hands-on practice.

1. Download and install the Python and SciPy platform

If you do not already have these tools, install them. I will not walk through the installation in great detail here, as it is documented extensively elsewhere. If you are a developer, installing a package like this should be straightforward for you.

1.1 Install the SciPy libraries

This article assumes Python version 2.7 or 3.5. Either version should work.

There are five key libraries to install. The SciPy libraries needed for this article are:
 Scipy
 Numpy
 Matplotlib
 Pandas
 Sklearn

There are many ways to install these libraries. My advice is to pick one method and use it consistently to install all of the packages above.

The SciPy installation page provides an excellent guide, with instructions for different platforms (such as Linux, Mac OS X, and Windows). If you have any questions, refer to that guide; thousands of people have gone through the same process.

On Mac OS X, you can use MacPorts to install Python 2.7 and these libraries. See the MacPorts home page for more information.
On Linux, use your package manager, for example yum on Fedora to install the RPMs.
If you are on Windows, or you are unsure about your system, I recommend installing the free Anaconda distribution, which comes with everything you need pre-installed.

 Note: this tutorial requires scikit-learn version 0.18 or above.

1.2 Start Python and check the installed versions

After installation, it is a good idea to confirm that your Python environment is working properly.
The following script checks the environment: it imports each library we will use and prints its version number.
Open a command line and start the Python interpreter:
Python

I recommend typing the script below directly into the interpreter, or writing it out yourself and running it from the command line, rather than working in a heavyweight editor or IDE. Keep things simple, and focus your attention on the machine learning rather than the tool chain.

The script is as follows:

# Check the versions of libraries
# Python version
import sys
print('Python: {}'.format(sys.version))
# scipy
import scipy
print('scipy: {}'.format(scipy.__version__))
# numpy
import numpy
print('numpy: {}'.format(numpy.__version__))
# matplotlib
import matplotlib
print('matplotlib: {}'.format(matplotlib.__version__))
# pandas
import pandas
print('pandas: {}'.format(pandas.__version__))
# scikit-learn
import sklearn
print('sklearn: {}'.format(sklearn.__version__))

Here is my local output:
Python: 2.7.11 (default, Mar 1 2016, 18:40:10)
[GCC 4.2.1 Compatible Apple LLVM 7.0.2 (clang-700.1.81)]
scipy: 0.17.0
numpy: 1.10.4
matplotlib: 1.5.1
pandas: 0.17.1
sklearn: 0.18.1

Compare your output with the above. Ideally your versions match or are slightly newer. The library APIs do not change often, so if your versions are a little higher, everything in this tutorial should still work.
If you get an error, stop and fix it now.

If you cannot run the script above cleanly, you will not be able to complete this tutorial.
I recommend searching for your error message on Google, or asking a question on Stack Exchange.

2. Load data
We will use the iris dataset. It is famous as the "hello world" dataset of machine learning and statistics.
The iris dataset contains 150 observations of iris flowers. Four columns record measurements of the flowers, in centimeters. The fifth column is the species of the observed flower. Every observation belongs to one of three iris species.

Next we will load the iris data from a CSV file at a URL.
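For reference, the raw CSV file has no header row; each line holds the four measurements followed by the species label, like this (these values match the first rows shown in section 3.2 below):

5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa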

2.1 Import libraries
First, import all of the modules, functions, and objects we will use in this tutorial.

# Load libraries
import pandas
from pandas.tools.plotting import scatter_matrix
import matplotlib.pyplot as plt
from sklearn import model_selection
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

Every import should succeed without errors. If one fails, you need to fix your Python + SciPy environment.
(See the installation advice above.)

2.2 Load the dataset
We can load the data directly from the UCI Machine Learning Repository.
We use pandas to load the data. Later we will also use pandas to explore the data with descriptive statistics and visualization.

Note that we specify the name of each column when loading the data. This will help later when we explore the data.

# Load dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = pandas.read_csv(url, names=names)

The dataset should load without incident.
If you have network problems, you can download the iris data file and place it in your working directory, then load it in the same way, replacing the URL with the local file name.
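A minimal sketch of this, assuming you saved the file as iris.data in your working directory and reuse the names list defined above:

# Load the dataset from a local file instead of the URL (hypothetical local path)
dataset = pandas.read_csv("iris.data", names=names)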

3. Summarize the dataset
Now it is time to take a look at the data.
In this step we look at the data from several different angles:
 1. Dimensions of the dataset
 2. Peek at the data itself
 3. Statistical summary of all attributes
 4. Breakdown of the data by the class variable
These may sound complicated, but don't worry: each view of the data takes just one command, and you will reuse these commands again and again in future projects.

3.1 Dimensions of the dataset
By looking at the shape of the dataset, we can quickly see how many instances (rows) and how many attributes (columns) it contains.

# shape

print(dataset.shape)

We can see 150 instances and 5 attributes:
(150, 5)

3.2 Peek at the data
It is always a good idea to actually eyeball your data.

# head

print(dataset.head(20))

We can see the first 20 rows of the data:
    sepal-length  sepal-width  petal-length  petal-width        class
0            5.1          3.5           1.4          0.2  Iris-setosa
1            4.9          3.0           1.4          0.2  Iris-setosa
2            4.7          3.2           1.3          0.2  Iris-setosa
3            4.6          3.1           1.5          0.2  Iris-setosa
4            5.0          3.6           1.4          0.2  Iris-setosa
5            5.4          3.9           1.7          0.4  Iris-setosa
6            4.6          3.4           1.4          0.3  Iris-setosa
7            5.0          3.4           1.5          0.2  Iris-setosa
8            4.4          2.9           1.4          0.2  Iris-setosa
9            4.9          3.1           1.5          0.1  Iris-setosa
10           5.4          3.7           1.5          0.2  Iris-setosa
11           4.8          3.4           1.6          0.2  Iris-setosa
12           4.8          3.0           1.4          0.1  Iris-setosa
13           4.3          3.0           1.1          0.1  Iris-setosa
14           5.8          4.0           1.2          0.2  Iris-setosa
15           5.7          4.4           1.5          0.4  Iris-setosa
16           5.4          3.9           1.3          0.4  Iris-setosa
17           5.1          3.5           1.4          0.3  Iris-setosa
18           5.7          3.8           1.7          0.3  Iris-setosa
19           5.1          3.8           1.5          0.3  Iris-setosa
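As an optional extra (not part of the original tutorial), you can also confirm that the four measurement columns were parsed as numbers and the class column as text:

# optional: column data types (the measurements should be float64, class should be object)
print(dataset.dtypes)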

3.3 Statistical summary of all attributes
Now we can look at a statistical summary of each attribute.
This includes the count, mean, minimum and maximum values, and some percentiles.

# descriptions

print(dataset.describe())

We can see that all of the numeric values have the same unit (centimeters) and similar ranges, between 0 and 8 cm.
       sepal-length  sepal-width  petal-length  petal-width
count    150.000000   150.000000    150.000000   150.000000
mean       5.843333     3.054000      3.758667     1.198667
std        0.828066     0.433594      1.764420     0.763161
min        4.300000     2.000000      1.000000     0.100000
25%        5.100000     2.800000      1.600000     0.300000
50%        5.800000     3.000000      4.350000     1.300000
75%        6.400000     3.300000      5.100000     1.800000
max        7.900000     4.400000      6.900000     2.500000

3.4 Class distribution
Let us now look at the number of instances (rows) that belong to each class. We can view this as an absolute count.

# classdistribution

print(dataset.groupby('class').size())

We can see that each class has the same number of instances (50, or 33% of the total):
class
Iris-setosa        50
Iris-versicolor    50
Iris-virginica     50
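If you also want relative proportions rather than absolute counts, a small optional sketch (not part of the original output):

# class distribution as a fraction of the dataset (each class should be about 0.333)
print(dataset.groupby('class').size() / len(dataset))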

4. Data visualization
Now that we have a basic understanding of the data, we can deepen that understanding with some visualizations.

We will look at two kinds of plots:
1. Univariate plots, to better understand each individual attribute.
2. Multivariate plots, to better understand the relationships between attributes.

4.1 Univariate plots
Let us start with univariate plots, that is, a plot of each individual variable.
Since the input variables are numeric, we can create a box and whisker plot of each one.

# box and whisker plots
dataset.plot(kind='box', subplots=True, layout=(2, 2), sharex=False, sharey=False)

plt.show()

This gives us a much clearer idea of the distribution of the input attributes.

 

 

 

(Figure: box and whisker plots of each input attribute)

We can also create a histogram for each input variable to understand its distribution.

# histograms

dataset.hist()

plt.show()

Two of the input variables appear to have roughly Gaussian distributions. This is worth noting, because we can use algorithms that exploit this assumption.
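As an optional extra, not part of the original tutorial, density plots give a smoother view of the same distributions and make the roughly Gaussian shapes easier to judge:

# density plots of each numeric attribute (a smoother alternative to histograms)
dataset.plot(kind='density', subplots=True, layout=(2, 2), sharex=False)
plt.show()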

 

4.2 Multivariate plots
Now we can look at the interactions between the variables.
First, let us look at scatter plots of all pairs of attributes. This helps us spot structured relationships between the input variables.

# scatter plot matrix

scatter_matrix(dataset)

plt.show()

Note the diagonal grouping of some pairs of attributes in the figure. This suggests a high correlation and a predictable relationship.
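If you prefer a numeric view of the same pairwise relationships, an optional addition is the correlation matrix of the measurement columns:

# correlation matrix of the four numeric columns (the class column is ignored)
print(dataset.corr())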

5. Evaluate some algorithms
Now it is time to build some models of the data and estimate their accuracy on unseen data. Unseen data means data the model has never seen before (translator's note: future data, for example data that is in neither the training set nor the test set).
The steps in this section are:
 1. Separate out a validation dataset
 2. Set up the test harness to use 10-fold cross-validation
 3. Build 6 different models to predict the species from the flower measurements
 4. Select the best model

5.1 Create a validation dataset
We need to know whether the models we build are actually any good.

Later, we will use statistical methods to estimate the accuracy of our models on unseen data. We also want a more concrete estimate of the best model's accuracy, obtained by evaluating it on data it has genuinely never seen.

To do this, we hold back some data in advance that the algorithms never get to see. We will use this held-back data to check how accurate the best model really is when making predictions.

We split the loaded dataset into two parts: 80% is used to train our models, and 20% is held back as a validation dataset.

# Split-out validation dataset
array = dataset.values
X = array[:, 0:4]
Y = array[:, 4]
validation_size = 0.20
seed = 7
X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(X, Y, test_size=validation_size, random_state=seed)

We now have training data in X_train and Y_train, and a validation set in X_validation and Y_validation for later use.
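A quick, optional sanity check of the split sizes (80% of 150 rows is 120 for training, leaving 30 for validation):

# optional: confirm the sizes of the split
print(X_train.shape, X_validation.shape)  # expected: (120, 4) (30, 4)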

5.2 Test harness
We use 10-fold cross-validation to estimate accuracy.
This splits the dataset into 10 parts: train on 9, test on 1, and repeat for every train/test combination (so that each part takes a turn as the test set).

# Test options and evaluation metric
seed = 7
scoring = 'accuracy'

We use the metric 'accuracy' to evaluate our models. Accuracy = (number of correctly predicted instances / total number of instances) * 100, giving a percentage (e.g., 95% accurate). We will use the scoring variable above when we run and evaluate each model.
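As a tiny illustration of the metric itself (not part of the original code), accuracy_score simply computes the fraction of predictions that match the true labels:

# accuracy_score was already imported from sklearn.metrics in section 2.1
# 3 of the 4 predicted labels match the true labels, so this prints 0.75
print(accuracy_score(['a', 'a', 'b', 'b'], ['a', 'b', 'b', 'b']))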

5.3 Build models
We do not know in advance which algorithms will work well on this problem or which configurations to use. The plots suggest that some of the classes are partly linearly separable in some dimensions (note how carefully hedged that statement is), so overall we expect good results.
Here we evaluate six different algorithms:
 logistic regression (LR)
 linear discriminant analysis (LDA)
 K Nearest Neighbor (KNN)
 classification and regression tree (CART)
 Gaussian naive Bayes classifier (NB)
 support vector machine (SVM)
This mixes simple linear algorithms (LR and LDA) with nonlinear algorithms (KNN, CART, NB, and SVM). We reset the random number seed before each run to ensure that every algorithm is evaluated on exactly the same data splits, so that the results are directly comparable.

Let us build the six models and then evaluate them:

# Spot Check Algorithms
models = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC()))

# evaluate each model in turn
results = []
names = []
for name, model in models:
    kfold = model_selection.KFold(n_splits=10, random_state=seed)
    cv_results = model_selection.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)

5.4 Select the best model
We now have six models and an accuracy estimate for each. We need to compare the models and choose the most accurate one.

Running the sample code above produces the following raw results:
LR: 0.966667 (0.040825)
LDA: 0.975000 (0.038188)
KNN: 0.983333 (0.033333)
CART: 0.975000 (0.038188)
NB: 0.975000 (0.053359)
SVM: 0.981667 (0.025000)
We can see that KNN appears to have the highest estimated accuracy.

We can also plot the evaluation results and compare the spread and mean accuracy of each model. Each algorithm has a population of accuracy measures, because each was evaluated 10 times (10-fold cross-validation).

# Compare Algorithms
fig = plt.figure()
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()

In the figure we can see that the box plots are squashed at the top of the range, with many evaluations reaching 100% accuracy.

(Figure: box plot comparison of algorithm accuracy)

6. Make predictions
Our tests showed that KNN has the best accuracy. Now we check the accuracy of this model on our validation set. This gives a final, independent check of the best model's accuracy. It is valuable to keep a validation set in case you slipped up during training, for example by overfitting the training set or leaking data; both mistakes lead to over-optimistic results.

We run the KNN model directly on the validation set and summarize the results with a final accuracy score, a confusion matrix, and a classification report.

# Make predictions on validation dataset
knn = KNeighborsClassifier()
knn.fit(X_train, Y_train)
predictions = knn.predict(X_validation)
print(accuracy_score(Y_validation, predictions))
print(confusion_matrix(Y_validation, predictions))
print(classification_report(Y_validation, predictions))

We can see that the accuracy is 0.9, or 90%. The confusion matrix shows that three predictions were wrong. Finally, the classification report breaks down each class by precision, recall, f1-score, and support (although the validation set is quite small).

0.9

[[ 7 0 0]
[ 0 11 1]
[ 0 2 9]]

     precision recall f1-score support
Iris-setosa 1.00 1.00 1.00 7
Iris-versicolor 0.85 0.92 0.88 12
Iris-virginica 0.90 0.82 0.86 11

avg / total 0.90 0.90 0.90 30
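As a final illustrative sketch, not part of the original tutorial, the fitted KNN model can also classify a single new flower; the measurements below are made up for the example:

# predict the species of one hypothetical flower
# columns: sepal-length, sepal-width, petal-length, petal-width (in cm)
new_flower = [[5.0, 3.4, 1.5, 0.2]]
print(knn.predict(new_flower))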

Summary
This article showed, step by step, how to complete a full machine learning project in Python. Working through a small project from start to finish, from loading the data all the way to making predictions, is the best way to get familiar with a new platform.

 


Origin www.cnblogs.com/itye2/p/11703821.html