Machine learning: the sklearn datasets

Machine learning: model, strategy, and optimization

"Statistical machine learning" that: machine learning models = + strategy + algorithm. In fact, machine learning can be expressed as: Learning = Representation + Evalution + Optimization. We can represent just such a teacher and Li Hang argument correspondence. Machine learning is mainly composed of three parts, namely: that (model), evaluation (strategy) and optimization (algorithms).

Representation (also known as: the model): Representation

The main job here is modeling: converting the real-world problem into a problem a computer can understand, which is what we usually call building a model. As with algorithms and data structures in traditional computer science, the question is how to represent a practical problem in a form a computer can work with (see the earlier "easy-to-learn machine learning algorithms" notes for this part). Given the data and the problem to solve, choosing an appropriate existing model is an important step.

Evaluation (also known as: the strategy): Evaluation

The goal of evaluation is to judge how good the model built in the first step is. An evaluation indicator measures the quality of that model, so this part involves designing evaluation metrics and evaluation functions; the learning machine is then optimized toward these evaluation targets. For example:

  • Classification: evaluated with metrics such as accuracy

Optimization: Optimization

The goal of optimization is to optimize the evaluation function: we hope to find the best model, that is, the model with the highest score under the chosen evaluation.
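As a rough illustration of how the three parts fit together, here is a minimal sketch using only NumPy (the toy data and learning rate are made up for illustration): the representation is a linear model y = w·x + b, the evaluation is mean squared error, and the optimization is plain gradient descent.

import numpy as np

# Toy data: y is roughly 3x + 2 plus noise (made-up example)
rng = np.random.RandomState(42)
X = rng.rand(100)
y = 3 * X + 2 + 0.1 * rng.randn(100)

# Representation (model): y_hat = w * x + b
w, b = 0.0, 0.0

# Evaluation (strategy): mean squared error
def mse(w, b):
    return np.mean((w * X + b - y) ** 2)

# Optimization (algorithm): gradient descent on the MSE
learning_rate = 0.5
for _ in range(1000):
    err = w * X + b - y
    w -= learning_rate * np.mean(2 * err * X)   # d(MSE)/dw
    b -= learning_rate * np.mean(2 * err)       # d(MSE)/db

print(w, b, mse(w, b))   # w close to 3, b close to 2, small error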

Steps in developing a machine learning application

(1) Collect data

We can collect sample data in many ways, for example by writing a web crawler to extract data from websites, reading information from an RSS feed or an API, or reading measurements sent by a device.

(2) Prepare the input data

After obtaining the data, we must also make sure its format meets the requirements of the algorithm.

(3) Analyze the input data

The main purpose of this step is to make sure the data set does not contain garbage data. If the data comes from a trusted source, this step can be skipped.

(4) Train the algorithm

This is where machine learning really begins. If you are using an unsupervised learning algorithm, there is no target variable and hence no training step; everything related to the algorithm is deferred to step (5).

(5) Test the algorithm

This step actually uses the knowledge learned in step (4). Of course, the accuracy of the results also needs to be assessed, and the algorithm retrained as needed.

(6) Use the algorithm

Turn the model into an application and perform the actual task, to check whether the steps above work in a real environment. If new data raises problems, the steps above need to be repeated.
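As a hedged end-to-end sketch of the six steps with scikit-learn (the iris data simply stands in for "collected" data, and KNeighborsClassifier is just one possible choice; in newer scikit-learn versions train_test_split lives in sklearn.model_selection, in older ones in sklearn.cross_validation as used later in this post):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split   # sklearn.cross_validation in older versions
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()                                      # (1) collect data
X_train, X_test, y_train, y_test = train_test_split(   # (2) prepare the input data
    iris.data, iris.target, test_size=0.3, random_state=42)

print(X_train.shape, y_train.shape)                     # (3) analyze the input data

clf = KNeighborsClassifier(n_neighbors=3)               # (4) train the algorithm
clf.fit(X_train, y_train)

print(clf.score(X_test, y_test))                        # (5) test the algorithm (accuracy)

print(clf.predict(X_test[:5]))                          # (6) use the algorithm on new samples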

 

The scikit-learn datasets module

This section describes the sklearn datasets utilities. The sklearn.datasets module includes tools to load datasets, including methods to load and fetch popular reference datasets, as well as some artificial data generators.

sklearn.datasets

(1)datasets.load_*()

Load small, standard datasets that ship with the package (no download needed).

(2)datasets.fetch_*()

Fetch larger datasets that must be downloaded from the network. The first argument is data_home, the directory the datasets are downloaded to (default ~/scikit_learn_data/); to change the default directory, set the SCIKIT_LEARN_DATA environment variable.

(3)datasets.make_*()

Generate synthetic datasets locally.

The load_* and fetch_* functions return a datasets.base.Bunch object, which is essentially a dict whose keys can also be accessed as object attributes. It mainly contains the following attributes:

  • data: feature array, a two-dimensional numpy.ndarray of shape (n_samples, n_features)

  • target: label array, a one-dimensional numpy.ndarray of length n_samples

  • DESCR: description of the dataset

  • feature_names: names of the features

  • target_names: names of the labels

The dataset directory can be obtained with datasets.get_data_home(); datasets.clear_data_home(data_home=None) deletes all downloaded data.

  • datasets.get_data_home(data_home=None)

Return the path of the scikit-learn data directory. This folder is used by some of the large-dataset loaders to avoid downloading the data again. By default, the data directory is a folder named "scikit_learn_data" in the user's home folder. Alternatively, it can be set with the "SCIKIT_LEARN_DATA" environment variable, or programmatically by passing an explicit folder path. The '~' symbol is expanded to the user's home folder. If the folder does not exist, it is created automatically.

  • sklearn.datasets.clear_data_home(data_home=None)

Delete the data storage directory.
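A quick sketch of how the two helpers are used (the printed path will differ per machine):

from sklearn.datasets import get_data_home, clear_data_home

print(get_data_home())    # e.g. /home/<user>/scikit_learn_data
# clear_data_home()       # would delete all downloaded datasets, so it is left commented out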

Loading small datasets

For classification

  • sklearn.datasets.load_iris
sklearn.datasets.load_iris(return_X_y=False)
  """
  Load and return the iris dataset.

  :param return_X_y: if True, return (data, target) instead of a Bunch object; default False

  :return: a Bunch object; if return_X_y is True, a tuple (data, target)
  """
In [12]: from sklearn.datasets import load_iris
    ...: data = load_iris()
    ...:

In [13]: data.target
Out[13]:
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

In [14]: data.feature_names
Out[14]:
['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']

In [15]: data.target_names
Out[15]:
array(['setosa', 'versicolor', 'virginica'],
      dtype='|S10')

In [17]: data.target[[1,10, 100]]
Out[17]: array([0, 0, 2])
Name                Quantity
Classes             3
Features            4
Samples             150
Samples per class   50
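The return_X_y parameter mentioned in the docstring above skips the Bunch wrapper (available in scikit-learn 0.18 and later); a small sketch:

from sklearn.datasets import load_iris

# Return the (data, target) tuple directly instead of a Bunch object
X, y = load_iris(return_X_y=True)
print(X.shape, y.shape)   # (150, 4) (150,)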
  • sklearn.datasets.load_digits
sklearn.datasets.load_digits(n_class=10, return_X_y=False)
    """
    Load and return the digits dataset.

    :param n_class: integer between 0 and 10, optional (default=10), the number of classes to return

    :param return_X_y: if True, return (data, target) instead of a Bunch object; default False

    :return: a Bunch object; if return_X_y is True, a tuple (data, target)
    """
In [20]: from sklearn.datasets import load_digits

In [21]: digits = load_digits()

In [22]: print(digits.data.shape)
(1797, 64)

In [23]: digits.target
Out[23]: array([0, 1, 2, ..., 8, 9, 8])

In [24]: digits.target_names
Out[24]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [25]: digits.images
Out[25]:
array([[[  0.,   0.,   5., ...,   1.,   0.,   0.],
        [  0.,   0.,  13., ...,  15.,   5.,   0.],
        [  0.,   3.,  15., ...,  11.,   8.,   0.],
        ...,
        [  0.,   4.,  11., ...,  12.,   7.,   0.],
        [  0.,   2.,  14., ...,  12.,   0.,   0.],
        [  0.,   0.,   6., ...,   0.,   0.,   0.]],

        [[  0.,   0.,  10., ...,   1.,   0.,   0.],
        [  0.,   2.,  16., ...,   1.,   0.,   0.],
        [  0.,   0.,  15., ...,  15.,   0.,   0.],
        ...,
        [  0.,   4.,  16., ...,  16.,   6.,   0.],
        [  0.,   8.,  16., ...,  16.,   8.,   0.],
        [  0.,   1.,   8., ...,  12.,   1.,   0.]]])
Name       Quantity
Classes    10
Features   64
Samples    1797
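digits.images holds the same pixels as digits.data, just reshaped into 8x8 images; a small check (the matplotlib lines are optional and only needed if you want to look at a digit):

import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()
# Each 8x8 image, flattened row by row, equals the corresponding 64-feature row
print(np.array_equal(digits.images[0].ravel(), digits.data[0]))   # True

# import matplotlib.pyplot as plt
# plt.imshow(digits.images[0], cmap='gray_r')
# plt.show()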

For regression

  • sklearn.datasets.load_boston
sklearn.datasets.load_boston(return_X_y=False)
  """
  Load and return the Boston house-prices dataset.

  :param return_X_y: if True, return (data, target) instead of a Bunch object; default False

  :return: a Bunch object; if return_X_y is True, a tuple (data, target)
  """
In [34]: from sklearn.datasets import load_boston

In [35]: boston = load_boston()

In [36]: boston.data.shape
Out[36]: (506, 13)

In [37]: boston.feature_names
Out[37]:
array(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD',
       'TAX', 'PTRATIO', 'B', 'LSTAT'],
      dtype='|S7')

Name           Quantity
Target range   5-50
Features       13
Samples        506
  • sklearn.datasets.load_diabetes
sklearn.datasets.load_diabetes(return_X_y=False)
  """
  Load and return the diabetes dataset.

  :param return_X_y: if True, return (data, target) instead of a Bunch object; default False

  :return: a Bunch object; if return_X_y is True, a tuple (data, target)
  """
In [13]:  from sklearn.datasets import load_diabetes

In [14]: diabetes = load_diabetes()

In [15]: diabetes.data
Out[15]:
array([[ 0.03807591,  0.05068012,  0.06169621, ..., -0.00259226,
         0.01990842, -0.01764613],
       [-0.00188202, -0.04464164, -0.05147406, ..., -0.03949338,
        -0.06832974, -0.09220405],
       [ 0.08529891,  0.05068012,  0.04445121, ..., -0.00259226,
         0.00286377, -0.02593034],
       ...,
       [ 0.04170844,  0.05068012, -0.01590626, ..., -0.01107952,
        -0.04687948,  0.01549073],
       [-0.04547248, -0.04464164,  0.03906215, ...,  0.02655962,
         0.04452837, -0.02593034],
       [-0.04547248, -0.04464164, -0.0730303 , ..., -0.03949338,
        -0.00421986,  0.00306441]])
Name           Quantity
Target range   25-346
Features       10
Samples        442

Fetching large datasets

  • sklearn.datasets.fetch_20newsgroups
sklearn.datasets.fetch_20newsgroups(data_home=None, subset='train', categories=None, shuffle=True, random_state=42, remove=(), download_if_missing=True)
  """
  Load the filenames and data of the 20 newsgroups dataset.

  :param subset: 'train', 'test' or 'all', optional; select which dataset to load: 'train' for the training set, 'test' for the test set, 'all' for both, in shuffled order

  :param data_home: optional, default: None; specify a download and cache folder for the datasets. If None, all scikit-learn data is stored in the '~/scikit_learn_data' subfolder

  :param categories: None, or a collection of strings or unicode; if None (default), load all categories. If not None, a list of category names to load (the other categories are ignored)

  :param shuffle: whether to shuffle the data

  :param random_state: numpy random number generator or seed integer

  :param download_if_missing: optional, default True; if False, raise an IOError if the data is not locally available instead of trying to download it from the source site

  :param remove: tuple
  """
In [29]: from sklearn.datasets import fetch_20newsgroups

In [30]: data_test = fetch_20newsgroups(subset='test', shuffle=True, random_state=42)

In [31]: data_train = fetch_20newsgroups(subset='train', shuffle=True, random_state=42)
  • sklearn.datasets.fetch_20newsgroups_vectorized
sklearn.datasets.fetch_20newsgroups_vectorized(subset='train', remove=(), data_home=None)
  """
  Load the 20 newsgroups dataset and transform it into tf-idf vectors. This is a convenience function; the tf-idf transformation is done with the default settings of sklearn.feature_extraction.text.Vectorizer. For more advanced usage (stop-word filtering, n-gram extraction, etc.), combine fetch_20newsgroups with a custom Vectorizer or CountVectorizer.

  :param subset: 'train', 'test' or 'all', optional; select which dataset to load: 'train' for the training set, 'test' for the test set, 'all' for both, in shuffled order

  :param data_home: optional, default: None; specify a download and cache folder for the datasets. If None, all scikit-learn data is stored in the '~/scikit_learn_data' subfolder

  :param remove: tuple
  """
In [57]: from sklearn.datasets import fetch_20newsgroups_vectorized

In [58]: bunch = fetch_20newsgroups_vectorized(subset='all')

In [59]: from sklearn.utils import shuffle

In [60]: X, y = shuffle(bunch.data, bunch.target)
    ...: offset = int(X.shape[0] * 0.8)
    ...: X_train, y_train = X[:offset], y[:offset]
    ...: X_test, y_test = X[offset:], y[offset:]
    ...:
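Continuing the split above, the vectorized bunch can feed a classifier that handles sparse matrices directly; MultinomialNB below is just one illustrative choice, not part of the original example:

from sklearn.naive_bayes import MultinomialNB

clf = MultinomialNB()               # works directly on the sparse tf-idf matrix
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))    # accuracy on the held-out 20%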

Generating data locally

Generate classification data locally:

  • sklearn.datasets.make_classification

    make_classification(n_samples=100, n_features=20, n_informative=2, n_redundant=2, n_repeated=0, n_classes=2, n_clusters_per_class=2, weights=None, flip_y=0.01, class_sep=1.0, hypercube=True, shift=0.0, scale=1.0, shuffle=True, random_state=None)
    """
    Generate a dataset for classification.

    :param n_samples: int, optional (default=100), the number of samples

    :param n_features: int, optional (default=20), the total number of features

    :param n_classes: int, optional (default=2), the number of classes (or labels) of the classification problem

    :param random_state: int, RandomState instance or None, optional (default=None)
      If int, random_state is the seed used by the random number generator; if a RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random

    :return: X, the feature dataset; y, the target class values
    """
    
from sklearn.datasets.samples_generator import make_classification
X, y = make_classification(n_samples=100000, n_features=20, n_informative=2, n_redundant=10, random_state=42)
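A quick way to sanity-check the generated data (np.bincount for the class balance is just one option):

import numpy as np

print(X.shape, y.shape)   # (100000, 20) (100000,)
print(np.bincount(y))     # roughly 50000 samples per class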

Generate regression data locally:

  • sklearn.datasets.make_regression
make_regression(n_samples=100, n_features=100, n_informative=10, n_targets=1, bias=0.0, effective_rank=None, tail_strength=0.5, noise=0.0, shuffle=True, coef=False, random_state=None)
  """
  Generate a dataset for regression.

  :param n_samples: int, optional (default=100), the number of samples

  :param n_features: int, optional (default=100), the number of features

  :param coef: boolean, optional (default=False), if True, also return the coefficients of the underlying linear model

  :param random_state: int, RandomState instance or None, optional (default=None)
    If int, random_state is the seed used by the random number generator; if a RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random

  :return: X, the feature dataset; y, the target values
  """
from sklearn.datasets.samples_generator import make_regression
X, y = make_regression(n_samples=200, n_features=5000, random_state=42)
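When coef=True, make_regression also returns the coefficients of the underlying linear model, which is handy for sanity checks; a small sketch (the shapes shown follow from the parameters used here):

from sklearn.datasets import make_regression

# coef=True additionally returns the true coefficients of the generating linear model
X, y, coef = make_regression(n_samples=200, n_features=10, coef=True, random_state=42)
print(X.shape, y.shape, coef.shape)   # (200, 10) (200,) (10,)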

 

Selecting a model

The algorithm is the core; data and computation are the foundation. This sentence illustrates the importance of a good machine learning algorithm. Let us look at several categories of machine learning:

  • Supervised learning
    • Classification: k-nearest neighbors, decision trees, Bayesian classifiers, logistic regression (LR), support vector machines (SVM)
    • Regression: linear regression, ridge regression
    • Tagging: hidden Markov models (HMM)
  • Unsupervised learning
    • Clustering: k-means

How to choose the right algorithm

When solving a problem, we must consider two questions. First, the purpose of using a machine learning algorithm: what task do you want the algorithm to accomplish, for example predicting the probability of rain tomorrow, or grouping voters by their interests? Second, what data is available or needs to be collected for analysis?

Consider the purpose of using the machine learning algorithm first. If you want to predict the value of a target variable, choose a supervised learning algorithm; otherwise, choose an unsupervised learning algorithm. Once you have decided on supervised learning, you need to determine the type of the target variable: if it is discrete, such as yes/no, 1/2/3, A/B/C, or red/black/yellow, choose a classification algorithm; if it is a continuous value, such as 0.0 to 100.0 or -999 to 999, choose a regression algorithm.

If you do not want to predict the value of a target variable, choose an unsupervised algorithm. Then further analyze whether the data needs to be divided into discrete groups; if that is the only requirement, use a clustering algorithm.

Of course, in most cases the guidance above can help you choose an appropriate machine learning algorithm, but it is not set in stone. There are also classification algorithms that can be used for regression.

Second, consider the data. We should understand the data thoroughly: the more fully we understand the actual data, the easier it is to build an application that meets the real demand. The main characteristics to understand are: whether the feature values are discrete or continuous variables, whether there are missing feature values and what caused them, whether there are outliers, how often each feature occurs, and so on. Fully understanding these characteristics of the data helps shorten the time needed to select a machine learning algorithm.

The three types of supervised learning problems explained

(1) Classification. Classification is a core problem of supervised learning. When the output variable takes a finite number of discrete values, the prediction problem becomes a classification problem; the input variables may be discrete or continuous. Supervised learning learns a classification model, i.e. a classification decision function, from the data, called a classifier. The classifier predicts the output for new inputs, which is called classification. The most basic case is binary classification, i.e. a yes/no decision, choosing the prediction from two classes; in addition there are multi-class problems, i.e. choosing one of more than two classes.

Classification comprises two processes, learning and classification. In the learning process, a classifier is learned from a known training data set using an effective learning method; in the classification process, the learned classifier assigns a class to a new input instance. Given the training data set (x1, y1), (x2, y2), ..., (xN, yN), the learning system learns a classifier P(Y|X) or Y = f(X); the classification system then uses the learned classifier to classify a new input instance xN+1, i.e. to predict its output label yN+1.

Classification divides data into "categories" based on its features, so it has a wide range of applications in many fields. For example, in banking, a customer classification model can be built to group customers by the size of their credit risk; in network security, classification of log data can be used to detect illegal intrusions; in image processing, classification can be used to detect whether a face appears in an image; in handwriting recognition, classification can be used to identify handwritten digits; in internet search, classification of web pages can help with crawling, indexing and ranking.

Here is an example of a classification application: text classification. The text may be news reports, web pages, e-mails, or academic papers. The classes are often about the content of the text, such as politics, sports, or economics; they may also be about features of the text, such as positive or negative comments; or they may be determined by the application, such as spam versus non-spam. Text classification assigns a text to one of the existing classes based on its features. The input is the feature vector of the text, and the output is the class of the text. Usually, a word that appears in the text is recorded as 1 and 0 otherwise; the value may also be multi-valued, indicating the frequency of the word in the text. Intuitively, if words like "stock", "bank", and "money" appear often, the text probably belongs to economics; if words like "tennis", "competition", and "athlete" appear frequently, the text probably belongs to sports.
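A hedged sketch of the bag-of-words idea just described, with a binary word-presence representation and a naive Bayes classifier (the tiny corpus and labels are invented purely for illustration):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = [
    "stocks banking money markets rise",       # economics
    "tennis competition athlete wins match",   # sports
    "bank raises money for the stock offer",   # economics
    "the athlete trains for the competition",  # sports
]
labels = [0, 1, 0, 1]   # 0 = economics, 1 = sports

# binary=True records only whether a word appears (1) or not (0)
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(docs)

clf = MultinomialNB().fit(X, labels)
print(clf.predict(vectorizer.transform(["money and stocks"])))   # most likely [0]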

(2) Regression

Regression is another important problem of supervised learning. Regression predicts the relationship between input variables and output variables, in particular how the output variable changes when the input variables change. A regression model is a function that maps input variables to output variables. Learning a regression function is equivalent to function fitting: choosing a function curve that fits the known data well and predicts unknown data well.

Regression is divided, according to the number of input variables, into simple (one-variable) regression and multiple regression; and, according to the type of relationship between the input and output variables, i.e. the type of model, into linear regression and nonlinear regression.

Tasks in many fields can be formalized as regression problems. For example, regression can be used in the business world as a tool for market trend forecasting, product quality management, customer satisfaction surveys, and risk analysis.

(3) Tagging

Tagging is also a supervised learning problem. The tagging problem can be regarded as a generalization of the classification problem, and as a simple form of the more complex structured prediction problem. The input of tagging is an observation sequence, and the output is a tag sequence or sequence of states. Tagging problems are widely used in information extraction, natural language processing and other fields, and are a fundamental problem in those areas. For example, part-of-speech tagging in natural language processing is a typical tagging problem: predict the part-of-speech tag sequence corresponding to a word sequence.

Of course, our main concern here is classification and regression problems; algorithm complexity and the tagging problem are not discussed further.

 

Model validation: cross-validation

When testing a model, we generally split the data into a training set and a test set. From the given sample space, most of the samples are taken as the training set to train the model, and the remaining small portion of samples is used to test the model just created.

Training set and test set

The train_test_split method in cross_validation can be used to split data into a training set and a test set. Most cross-validation iterators have a built-in option to shuffle the data indices before splitting, i.e. train_test_split is used internally by the cross-validation iterators. By default the data is not shuffled: passing cv=some_integer to cross_val_score performs (stratified) k-fold cross-validation directly, without a random split. If the data set has a temporal ordering, do not shuffle the data before splitting!

  • sklearn.cross_validation.train_test_split
def train_test_split(*arrays, **options)
  """
  :param arrays: the allowed inputs are lists and numpy arrays

  :param test_size: float, int or None (default None); if float, should be between 0.0 and 1.0 and represents the proportion of the dataset to include in the test split; if int, represents the absolute number of test samples

  :param train_size: float, int or None (default None); if float, should be between 0.0 and 1.0 and represents the proportion of the dataset to include in the train split; if int, represents the absolute number of train samples

  :param random_state: int or RandomState, pseudo-random number generator state used for random sampling; random_state defaults to None, which means every shuffle is different
  """
from sklearn.cross_validation import train_test_split
from sklearn import datasets

iris = datasets.load_iris()
print(iris.data.shape, iris.target.shape)
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.4, random_state=42)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

The approach above has limitations: because we test only once, the result does not necessarily represent the true accuracy of the model. The accuracy depends on how the data happens to be split, and with a small amount of data this effect is particularly pronounced. So we need a better solution.

Besides training data and test data, model evaluation may also involve validation data. Only by cross-validating with the training and test data can we train a model with a more reliable accuracy estimate, one that can also be expected to perform well on new, unknown data. This is the generalization ability of the model.

holdout method

A typical method for evaluating the generalization ability of a model is holdout cross-validation (holdout cross validation). The holdout method is very simple: we split the original data set into a training set and a test set; the former is used to train the model and the latter is used to evaluate its performance. Strictly speaking, holdout validation is not cross-validation, because the data are never crossed: a portion of the original sample is randomly selected as the validation data, and the rest is kept as training data. In general, less than one third of the original data is selected as the validation data, so the result obtained by this method alone is not very convincing.

k-fold cross-validation

In k-fold cross-validation, the initial sample is divided into K sub-samples. One sub-sample is kept as validation data for the model, and the other K-1 sub-samples are used for training. Cross-validation is repeated K times, with each sub-sample used for validation exactly once; the K results are averaged (or otherwise combined) to obtain a single estimate. The advantage of this method is that randomly generated sub-samples are repeatedly used for both training and validation. 10-fold cross-validation is the most common.

Take 5-fold cross-validation as an example: all available data is divided into five groups; in each iteration one group is selected as the validation set and the other four groups as the training set, and the process iterates through all five groups. The benefit of cross-validation is that every sample has the chance to be used for both training and validation, so the measured performance of the model is more credible.
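Before reaching for the helper below, here is a minimal sketch of what k-fold cross-validation does under the hood (KFold lives in sklearn.model_selection in newer versions; older versions, as used in this post, expose it as sklearn.cross_validation.KFold(n, n_folds)):

import numpy as np
from sklearn.model_selection import KFold   # older versions: sklearn.cross_validation.KFold

X = np.arange(20).reshape(10, 2)
y = np.arange(10)

kf = KFold(n_splits=5)
for train_idx, val_idx in kf.split(X):
    # each sample ends up in the validation fold exactly once
    print("train:", train_idx, "validate:", val_idx)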

The easiest way to use cross-validation is to call the cross_val_score helper function on the estimator and the data set.

  • sklearn.cross_validation.cross_val_score
def cross_val_score(estimator, X, y=None, groups=None, scoring=None, cv=None, n_jobs=1, verbose=0, fit_params=None, pre_dispatch='2*n_jobs')
  """
  :param estimator: the model estimator

  :param X: the feature variables

  :param y: the target variable

  :param cv: int; by default 3-fold cross-validation is used; an integer specifies the number of folds in a (stratified) KFold

  :return: the estimated scores, one per fold
  """
from sklearn.cross_validation import cross_val_score
from sklearn import datasets, linear_model

diabetes = datasets.load_diabetes()
X = diabetes.data[:150]
y = diabetes.target[:150]
lasso = linear_model.Lasso()
print(cross_val_score(lasso, X, y))

The cross-validation approach has two main purposes:

  • to get as much useful information as possible from limited training data;
  • to avoid over-fitting to some extent.




Origin blog.csdn.net/qq_42370150/article/details/104966403