05 Machine learning development process

Types of data

Discrete data

  • Definition: data obtained by counting the number of individuals of different kinds, also known as count data. These values are all integers; they cannot be subdivided further, and their precision cannot be improved.
  • Example: a count of people cannot be 3.6.

Continuous data

  • Definition: a variable can take any value within a certain range, i.e. its values vary continuously, e.g. length, time, mass; such data usually contain a fractional part.

Note: discrete data cannot be subdivided within an interval, while continuous data can.

Machine learning algorithm classification

Supervised learning

  • Feature values + target values

Classification (discrete)

- k-nearest neighbors (KNN)
- Naive Bayes classification
- Decision trees and random forests
- Logistic regression
- Neural networks

Regression (continuous)

- Linear regression
- Ridge regression

Labeling

- Hidden Markov model (not required)

Unsupervised Learning

Clustering

- k-means
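
The algorithm families listed above all have implementations in sklearn. A short import sketch of where they live; the concrete classes picked here (e.g. MultinomialNB for naive Bayes) are illustrative choices, not the only options:

# classification (supervised learning, discrete target values)
from sklearn.neighbors import KNeighborsClassifier    # k-nearest neighbors
from sklearn.naive_bayes import MultinomialNB         # naive Bayes
from sklearn.tree import DecisionTreeClassifier       # decision tree
from sklearn.ensemble import RandomForestClassifier   # random forest
from sklearn.linear_model import LogisticRegression   # logistic regression

# regression (supervised learning, continuous target values)
from sklearn.linear_model import LinearRegression, Ridge

# clustering (unsupervised learning)
from sklearn.cluster import KMeans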

Development Process

Data sources

  1. The company's own data
  2. Data from partnerships
  3. Purchased data

Development Method

  1. Establish the model: decide the kind of application according to the type of data
  2. Basic data processing with pandas (missing values, merging tables, ...)
  3. Feature engineering (feature processing)
  4. Find a suitable algorithm to make predictions
    • classification
    • regression
  5. Evaluate the model and judge the result (model = algorithm + data); if the result is poor:
    • switch to a different algorithm
    • tune the parameters
    • redo the feature engineering / data processing
  6. Put the model into use online (a minimal end-to-end sketch of steps 2-6 follows this list)
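
A minimal end-to-end sketch of steps 2-6 above, using the built-in iris dataset; StandardScaler and KNeighborsClassifier are illustrative choices rather than required ones:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# get the data and split it into training and test sets
iris = load_iris()
x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.25)

# feature engineering: standardize the features
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)   # reuse the statistics learned on the training set

# pick an algorithm and train a model
knn = KNeighborsClassifier()
knn.fit(x_train, y_train)

# evaluate the model; if the accuracy is poor, switch algorithms, tune
# parameters, or redo the feature engineering before putting it online
print(knn.score(x_test, y_test))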

Data splitting and dataset overview

sklearn datasets

  1. Dataset splitting
  • Training set: used to train and build the model (75%)
  • Test set: used to test the model and assess whether it works (25%)
  2. sklearn dataset-splitting API
  • sklearn.model_selection.train_test_split
    • x: the feature values of the dataset
    • y: the target (label) values of the dataset
    • test_size: the size of the test set
    • random_state: random seed; the same seed gives the same split
  3. sklearn dataset-loading API (sklearn.datasets)
  • Load small-scale, popular datasets that are bundled with the package:
    datasets.load_*()

  • Obtain large datasets that need to be downloaded from the network:
    datasets.fetch_*(data_home=None)
    Both return the same type of dataset object (a dictionary-like Bunch)

  4. A large dataset for classification:
    sklearn.datasets.fetch_20newsgroups(data_home=None, subset='train')
    subset: 'train', 'test', or 'all', selecting which part of the dataset to load

datasets.clear_data_home(data_home=None) clears the downloaded-data directory

from sklearn.datasets import load_iris, fetch_20newsgroups
from sklearn.model_selection import train_test_split

li = load_iris()
# print(li.data)          # feature values
# print(li.target)        # target (label) values
# print(li.DESCR)         # dataset description
# print(li.feature_names) # names of the features
# print(li.target_names)  # names of the target classes

# Note the order of the return values:
# training set: x_train, y_train; test set: x_test, y_test
x_train, x_test, y_train, y_test = train_test_split(li.data, li.target, train_size=0.75)

print(x_train)
print('*' * 50)
print(x_test)
print(y_train)
print(y_test)

news = fetch_20newsgroups(subset='all')
print(news.data)
print(news.target)

Transformer and estimator

Transformer

  1. fit_transform(): fits to the input data and transforms it in one step
  2. fit(): takes the input data and computes the mean, variance, etc., without producing transformed output
  3. transform(): transforms data using the statistics computed by fit()
>>> from sklearn.preprocessing import StandardScaler
>>> s = StandardScaler()
>>> s.fit_transform([[1, 2, 3], [4, 5, 6]])
array([[-1., -1., -1.],
       [ 1.,  1.,  1.]])

>>> sa = StandardScaler()
>>> sa.fit([[2, 3, 4], [9, 9, 9]])
StandardScaler(copy=True, with_mean=True, with_std=True)
>>> sa.transform([[1, 2, 3], [4, 5, 6]])  # uses the mean and variance computed in fit()
array([[-1.28571429, -1.33333333, -1.4       ],
       [-0.42857143, -0.33333333, -0.2       ]])

NOTE: in normal use, fit_transform() is preferred. When the two steps are called separately, transform() applies the mean and variance computed by fit() to whatever data is passed to transform().

Estimator

  1. Definition: an estimator is a class that implements an algorithm behind a common API
  2. Classification estimators:
    • sklearn.neighbors k-nearest neighbors
    • sklearn.naive_bayes naive Bayes
    • sklearn.linear_model.LogisticRegression logistic regression
    • sklearn.tree decision trees (random forests live in sklearn.ensemble)
  3. Regression estimators
    • sklearn.linear_model.LinearRegression linear regression
    • sklearn.linear_model.Ridge ridge regression
  4. How an estimator is used (see the sketch after this list)
    1. Call fit(x_train, y_train) to train on the training set
    2. Feed in the test set
      • y_predict = predict(x_test)
      • prediction accuracy: score(x_test, y_test)
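
A minimal sketch of this fit / predict / score cycle, with KNeighborsClassifier standing in for any classification estimator:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

li = load_iris()
x_train, x_test, y_train, y_test = train_test_split(li.data, li.target, test_size=0.25)

estimator = KNeighborsClassifier()
estimator.fit(x_train, y_train)           # 1. train on the training set
y_predict = estimator.predict(x_test)     # 2. predict labels for the test set
print(estimator.score(x_test, y_test))    # prediction accuracy on the test set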
