Day 10: kNN, Linear Regression, Handwritten Digit Recognition, Class Prediction

Day 10: Data Analysis

Examples of linear regression

The relationship between a city's climate and its distance from the sea

import numpy as np
import pandas as pd
from pandas import DataFrame,Series
import matplotlib.pyplot as plt 
%matplotlib inline  

Import the data for each coastal city

ferrara1 = pd.read_csv('./ferrara_150715.csv')
ferrara2 = pd.read_csv('./ferrara_250715.csv')
ferrara3 = pd.read_csv('./ferrara_270615.csv')
ferrara = pd.concat([ferrara1,ferrara2,ferrara3],ignore_index=True)
ferrara         # concatenate; the three files share the same columns

torino1 = pd.read_csv('./torino_150715.csv')
torino2 = pd.read_csv('./torino_250715.csv')
torino3 = pd.read_csv('./torino_270615.csv')
torino = pd.concat([torino1,torino2,torino3],ignore_index=True)
# ..... the remaining cities are loaded the same way
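Since every city follows the same pattern, the repeated blocks could be collapsed into a small helper; a sketch, assuming each city has the same three dated files:

# hypothetical helper: load and concatenate one city's three CSVs
def load_city(name):
    files = [f'./{name}_150715.csv', f'./{name}_250715.csv', f'./{name}_270615.csv']
    return pd.concat([pd.read_csv(f) for f in files], ignore_index=True)

ferrara = load_city('ferrara')
torino = load_city('torino')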


faenza.dtypes (the concatenation leaves an extra 'Unnamed: 0' index column that needs to be removed)

Unnamed: 0       int64
temp           float64
humidity         int64
pressure         int64
description     object
dt               int64
wind_speed     float64

Put all the cities in a list, then delete the extra column from each

citys = [ferrara, torino, mantova, milano, ravenna, asti, bologna, piacenza, cesena, faenza]

for city in citys:
    city.drop('Unnamed: 0',axis=1,inplace=True)


Show the relationship between maximum temperature and distance from the sea

(observed across multiple cities)

# get the maximum temperature recorded in each city
max_temp = []
for city in citys:
    max_temp.append(city['temp'].max())

max_temp

[33.43000000000001,
 34.69,
 34.18000000000001,
 34.81,....

Get each city's distance from the sea

city_dist = []
for city in citys:
    city_dist.append(city['dist'].max())

[47, 357, 121, 250, 8, 315, 71, 200, 14, 37]

Distance is the independent variable; temperature is the dependent variable

plt.scatter(city_dist, max_temp)  # draw the scatter plot



Observation: the cities near the sea seem to form one straight line, and the cities far from the sea form another.

- Using 100 km and 50 km as cutoff points, split the cities into two groups: near the sea (distance < 100) and far from the sea (distance > 50)

groupby cannot be used here; after analysis, the grouping must be done with boolean conditions (expressions)

s_city_dist = Series(data=city_dist)  # distances of all cities
s_city_dist

Build the condition for the cities near the sea

near_condition = s_city_dist<100
near_condition

near_dist = s_city_dist[near_condition]
print(near_dist)

Get the maximum temperatures of the near-sea cities

s_max_temp = Series(data=max_temp)  # all maximum temperatures
near_temp = s_max_temp[near_condition]
near_temp

Plot the relationship between distance and temperature for the near-sea cities

plt.scatter(near_dist,near_temp,c='r')


Now the cities far from the sea

Get the temperature and distance values of the cities far from the sea

Build the condition for the cities far from the sea

far_condition = s_city_dist > 50
far_dist = s_city_dist[far_condition]

Get the maximum temperatures of the far-from-sea cities

far_temp = s_max_temp[far_condition]
far_temp

Draw both groups in one scatter plot to show the relationship between temperature and distance

plt.scatter(near_dist,near_temp,c='r')
plt.scatter(far_dist,far_temp,c='y',s=100)  # c: color, s: marker size


Linear regression on temperature vs. distance for the far-from-sea cities

# imports
from sklearn.linear_model import LinearRegression
# 1. create the algorithm model objects
linear_far = LinearRegression()
linear_near = LinearRegression()
# 2. feed the scatter (sample) data into the model so that it performs
#    linear regression (computation) based on the characteristics of the data
# parameters of fit: X, y
# X: the feature data of the samples (features), two-dimensional
# y: the target data of the samples (target)

# for this project: each point in the scatter plot is one sample
# sample set: multiple samples together form a sample set

# changes in the feature data drive changes in the target data;
# distance is the feature data, temperature is the target data

# note: the feature data must be two-dimensional
linear_far.fit(far_dist.values.reshape(-1,1),far_temp)  # .values: a pandas Series has no reshape in recent versions

--->LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

Prediction

linear_far.predict([[200]])  # newer sklearn versions require 2-D input for predict

array([34.24476188])
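The prediction above is just w * 200 + b; the fitted parameters can be read straight off the model (coef_ and intercept_ are standard LinearRegression attributes):

w = linear_far.coef_[0]    # slope of the fitted line
b = linear_far.intercept_  # intercept
print(w * 200 + b)         # same value as linear_far.predict([[200]])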
Draw the regression line

(enough closely spaced points form what looks like a line)

Generate 300 x-coordinates spanning the range of the scatter plot

xmin = far_dist.min()-20
xmax = far_dist.max()+20
x = np.linspace(xmin, xmax, 300)

predict's parameter must be two-dimensional feature data; y holds the predicted value for each x, the points the regression line passes through

y = linear_far.predict(x.reshape(-1,1))

plt.scatter(x,y,s=10)
plt.scatter(far_dist,far_temp,c='r',s=30)


Linear regression on temperature vs. distance for the near-sea cities

linear_near.fit(near_dist.values.reshape(-1,1),near_temp)
_xmin = near_dist.min()-20
_xmax = near_dist.max()+20
_x = np.linspace(_xmin, _xmax, 300)
_y = linear_near.predict(_x.reshape(-1,1))
plt.scatter(_x,_y,s=10)
plt.scatter(near_dist,near_temp,c='y',s=80)


plt.xlabel('dist')
plt.ylabel('temp')
plt.title('temp&dist')
plt.scatter(_x,_y,s=10)
plt.scatter(near_dist,near_temp,c='y',s=80)
plt.scatter(x,y,s=10)
plt.scatter(far_dist,far_temp,c='r',s=30)

Machine Learning

- What is the relationship between machine learning and AI (artificial intelligence)?
    - Machine learning is one technical means of implementing artificial intelligence.
- Algorithm model:
    - A special object. Inside, it encapsulates an equation (algorithm) that has not yet been solved.
    - Purpose:
        - Prediction: predict an unknown value
        - Classification: assign an unknown item to a known category
        - The result of the prediction or classification is the solution of the model object's equation
- Sample data:
    - Components:
        - Feature data: the independent variables
        - Target data: the dependent variable
    - How are the sample data and the algorithm model object related?
        - The sample data can be fed into the algorithm model to solve its internal equation. Once the model object has a solution, it can classify or predict.
    - Training the model: feeding the sample data into the algorithm model so that the model object has a solution.

- Categories of algorithm models:
    - Supervised learning: the sample data the model needs must contain both feature data and target data
    - Unsupervised learning: the sample data the model needs only has to contain feature data
- Studying the sklearn module
    - It packages many different ready-made algorithm models
Area   Floor   Daylight ratio   Price
100    3       34%              800k
80     6       89%              1000k
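As a toy sketch of how such a table maps onto a model (the two rows above are the whole sample set, so the fit is purely illustrative, and the predicted house is made up):

from sklearn.linear_model import LinearRegression

X = [[100, 3, 0.34],   # area, floor, daylight ratio
     [80,  6, 0.89]]
y = [80, 100]          # price, in units of 10k (80 -> 800k)

toy = LinearRegression()
toy.fit(X, y)                  # features are two-dimensional, target is one column
toy.predict([[90, 4, 0.60]])   # price of an unseen, made-up house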
Import sklearn and the linear regression algorithm model object

Extract the samples

feature = city_dist   # the list of feature data (distances)
target = max_temp     # the list of target data (max temperatures)

feature = np.array(feature)  # feature data as a numpy array
target = np.array(target)    # target data as a numpy array

feature.shape

(10,)
from sklearn.linear_model import LinearRegression
# instantiate the algorithm model object
linner = LinearRegression()  # y = wx + b
# train the model
# X: two-dimensional feature data
# y: target data
linner.fit(feature.reshape((-1,1)),target)
# predict
linner.predict([[175],[201]])

---->array([33.87226813, 34.01981272])

Sample set: used to train the machine learning algorithm model object. A sample set is usually a DataFrame.
    - Feature data: changes in the feature data drive changes in the target data. Usually several columns.
    - Target data: the result. Usually a single column.

Plot all the cities' data in a scatter plot together with the regression line

x = np.linspace(0,360,num=100)
y = linner.predict(x.reshape(-1,1))

plt.scatter(city_dist, max_temp)
plt.scatter(x, y)
plt.xlabel('distance')
plt.ylabel('max temperature')
plt.title('relationship between distance and max temperature')


Knowledge summary:

plt.scatter(city_dist, max_temp)  # draw the scatter plot

from sklearn.linear_model import LinearRegression
# create an algorithm model object
linear_far = LinearRegression()

# note: the feature data must be two-dimensional
linear_far.fit(far_dist.values.reshape(-1,1),far_temp)
linear_far.predict([[200]])

K-nearest neighbors (kNN)

How to classify movies

As we all know, movies can be classified by genre, but how is a genre itself defined? Who decides which genre a given movie belongs to? In other words, what common characteristics do movies of the same genre share? These questions have to be considered before movies can be classified. Nobody would claim their movie is just like an earlier one, yet we do know that movies within one genre tend to be similar in style. What traits do action movies share that make them resemble one another while clearly differing from romance movies? Action movies may contain kissing scenes, and romance movies may contain fight scenes, so we cannot classify a movie simply by whether fighting or kissing appears. But romance movies contain more kissing scenes and action movies contain more fight scenes, so the number of times such scenes appear in a movie can be used for classification.

This section introduces our first machine learning algorithm: k-nearest neighbors. It is very effective and easy to master.

1. Theory of the k-nearest neighbors algorithm

In brief, the k-nearest neighbors algorithm classifies a sample by measuring the distances between different feature values.

  • Advantages: high accuracy (classification is based on computed distances), insensitive to outliers (only the nearest neighbors matter, so isolated special cases are ignored), no assumptions about the input data.
  • Disadvantages: high time complexity and high space complexity.
  • Applicable data: numeric and nominal values.

Working principle

There is a sample data set, also called the training set, and every sample in it carries a label; that is, we know which category each sample in the set belongs to. When new data without a label is input, each feature of the new data is compared against the features of the samples in the set, and the algorithm extracts the class labels of the most similar (nearest-neighbor) samples. In general we look only at the top k most similar samples, which is where the k in k-nearest neighbors comes from; usually k is an integer no greater than 20. Finally, the class that appears most often among the k most similar samples is chosen as the class of the new data.

Back to the movie classification example: we use kNN to separate romance movies from action movies. Someone has counted the fight scenes and kiss scenes in many movies and tabulated the counts for six of them. Given a movie you have not seen, how do you determine whether it is a romance or an action movie? We can use the kNN algorithm to solve this problem.


First we need to know how many fight scenes and kiss scenes the unknown movie contains; in the original figure, a question mark marked the unknown movie's position among the plotted counts.


Even though we do not know the genre of the unknown movie, we can compute it. First, calculate the distance between the unknown movie and each movie in the sample set.

With the distances between the unknown movie and every movie in the sample set in hand, sort them in ascending order to find the k nearest movies. Assume k = 3; the three closest movies are California Man, He's Not Really into Dudes, and Beautiful Woman. kNN decides the genre of the unknown movie from the genres of these three movies, and since all three are romance movies, we conclude that the unknown movie is a romance.
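Carried out by hand, the procedure is only a few lines. The scene counts below are illustrative stand-ins (the real table was an image), but the mechanics are exactly as described: compute the distances, sort them, vote among the k nearest:

import numpy as np
from collections import Counter

# illustrative (fight scenes, kiss scenes) counts; stand-ins for the lost table
samples = {
    'California Man':             ((3, 104),  'Romance'),
    "He's Not Really into Dudes": ((2, 100),  'Romance'),
    'Beautiful Woman':            ((1, 81),   'Romance'),
    'Kevin Longblade':            ((101, 10), 'Action'),
    'Robo Slayer 3000':           ((99, 5),   'Action'),
    'Amped II':                   ((98, 2),   'Action'),
}
unknown = np.array([18, 90])   # the unknown movie's counts (also illustrative)

# Euclidean distance from the unknown movie to every sample, sorted ascending
dists = sorted((np.linalg.norm(np.array(feat) - unknown), label)
               for feat, label in samples.values())
k = 3
votes = Counter(label for _, label in dists[:k])
print(votes.most_common(1)[0][0])   # -> 'Romance'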

Euclidean distance

Euclidean distance is the most common distance metric; it measures the absolute distance between points in a multidimensional space. The formula is:

d(A, B) = sqrt((a1 - b1)^2 + (a2 - b2)^2 + ... + (an - bn)^2)

It handles any number of dimensions.
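In numpy the same formula is a one-liner; a quick sanity check with two 3-dimensional points:

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])
print(np.sqrt(((a - b) ** 2).sum()))  # 5.0
print(np.linalg.norm(a - b))          # same result, built in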

2. Using k-nearest neighbors with the scikit-learn library

  • Classification: from sklearn.neighbors import KNeighborsClassifier

0) A very simple example

1. Movie classification

import pandas as pd
import numpy as np

data = pd.read_excel('./my_films.xlsx')
data


# extract the sample data
feature = data[['Action lens','Love lens']]
target = data['target']

print(feature, target)

from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(feature,target)
knn.predict([[19,19]])

---->array(['Love'], dtype=object)

A simple way to evaluate the choice of k

knn.score(feature,target)

----->0.9166666666666666
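Note that score here is measured on the same data the model was trained on. A simple sketch for comparing candidate k values is to refit and score in a loop:

# refit with several candidate k values and compare (training-set) scores
for k in range(1, 8):
    m = KNeighborsClassifier(n_neighbors=k)
    m.fit(feature, target)
    print(k, m.score(feature, target))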

2. Predicting whether annual income exceeds $50K

df = pd.read_csv('./adults.txt')
df.head()
df


There are many columns.

The features most plausibly related to salary are age, occupation, education_num, and hours_per_week.

Take age, education, occupation, and weekly working hours as the feature data, and salary as the corresponding target.

Extract the sample data

feature = df[['age','education_num','occupation','hours_per_week']]
target = df['salary']

Convert the String-type occupation data to int

s = feature['occupation'].unique()  # deduplicate first; otherwise the codes grow far too large
dic = {}
j = 0
for i in s:
    dic[i] = j
    j += 1
feature['occupation'] = feature['occupation'].map(dic)
feature.head()
  • kNN needs numeric data, because it computes distances.
  • Without deduplication, later occurrences keep overwriting the codes, so the codes grow huge and carry too much weight.
  • After deduplication, two-digit codes are enough and the weights stay reasonable.

Key point: use map to convert the data.
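The counter loop above can be written more compactly; a dict comprehension over the unique values builds the same mapping (an equivalent sketch):

# equivalent, more compact construction of the same mapping
dic = {v: i for i, v in enumerate(feature['occupation'].unique())}
feature['occupation'] = feature['occupation'].map(dic)

pandas also offers pd.factorize(feature['occupation'])[0], which produces the same kind of integer codes in one call.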

# split the 32560 samples: training data and test data (the last 60 rows are held out to verify the model)
x_train = feature[:32500]
y_train = target[:32500]
x_test = feature[32500:]
y_test = target[32500:]
knn = KNeighborsClassifier(n_neighbors=30)
knn.fit(x_train,y_train)
knn.score(x_test,y_test)

--->0.7868852459016393

Use the test data to check the model

print(np.array(y_test))
print(knn.predict(x_test))

# count how many predictions match the true labels
pred = knn.predict(x_test)  # predict once, outside the loop
c = []
for index, i in enumerate(np.array(y_test)):
    c.append(i == pred[index])
j = 0
for i in c:
    if i:
        j += 1

print(j)
print(j/len(c))  # 0.7868852459016393
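The whole counting loop collapses into one vectorized comparison; the mean of the boolean array is exactly the fraction of correct predictions:

pred = knn.predict(x_test)
print((pred == np.array(y_test)).mean())  # 0.7868852459016393, same as above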

knn.predict([[50,26,0,50]])  --->  array(['<=50K'], dtype=object)

knn.predict([[30,13,1,40]])  --->  array(['<=50K'], dtype=object)

The results seem right!

Knowledge summary:

1
data = pd.read_excel('./my_films.xlsx')
#样本数据的提取
feature = data[['Action lens','Love lens']]
target = data['target']
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(feature,target)
knn.predict([[19,19]])

---->array(['Love'], dtype=object)
A simple way to check the value of k

knn.score(feature,target)
----->0.9166666666666666

2
Convert the String-type data to int; the other steps are similar to example 1.

Handwritten Digit Recognition

Import the packages

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.neighbors import KNeighborsClassifier

Extract the sample data

img = plt.imread('./data/3/3_1.bmp')
img
plt.imshow(img)
img.shape  # (28, 28): two-dimensional because the bmp is single-channel grayscale, so there is no color axis


Data source: a ./data directory with one folder per digit 0-9, each containing 500 handwriting samples (0_1 through 0_500); more varied handwriting increases accuracy.

The same approach could also be trained on license plate numbers or the 26 English letters.

Extract the sample data

Read each image, append it to a list, then convert the lists to numpy arrays; arrays are easy to operate on and handle multiple dimensions well.

feature = []
target = []
for i in range(0,10):
    for j in range(1,501):
        # ./data/3/3_1.bmp
        img_path = f'./data/{i}/{i}_{j}.bmp'
        img_arr = plt.imread(img_path)
        feature.append(img_arr)
        target.append(i)  # the target paired with this feature: the digit 0-9
Convert the feature and target lists to numpy arrays

feature = np.array(feature)
feature.shape
target = np.array(target)
target.shape

Shuffle the samples; otherwise they are too regular (all the 1s at the front, all the 9s at the back).
Shuffle the feature data and the target data with the same random seed so the pairs stay matched.

np.random.seed(3)
np.random.shuffle(feature)
np.random.seed(3)
np.random.shuffle(target)
feature.shape
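Reseeding before each shuffle works; an equivalent sketch shuffles a single index array and applies it to both arrays, which keeps the pairs aligned by construction:

# equivalent shuffle: permute one index array and apply it to both
idx = np.random.permutation(len(feature))
feature = feature[idx]
target = target[idx]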
Reshape

Because fit only supports two-dimensional feature data, the data must be reshaped to two dimensions

# ValueError: Found array with dim 3. Estimator expected <= 2.
feature = feature.reshape(5000,784) # 28*28
feature.shape

----> (5000, 784)

Obtaining training data and test data
x_train = feature[:4950]
y_train = target[:4950]
x_test = feature[4950:]
y_test = target[4950:]
Instantiate the model object and train it
knn = KNeighborsClassifier(n_neighbors=15)
knn.fit(x_train,y_train)
knn.score(x_test,y_test)

---->0.92
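The manual slicing above works; scikit-learn also provides train_test_split, which shuffles and splits in one call and could replace the seed/shuffle/slice steps (a sketch):

from sklearn.model_selection import train_test_split

# shuffle and split in one call; test_size=50 matches the 50 held-out rows above
x_train, x_test, y_train, y_test = train_test_split(
    feature, target, test_size=50, random_state=3)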

Save the trained model

Next time there is no need to retrain; the model is ready to use

from sklearn.externals import joblib  # in newer scikit-learn versions, use: import joblib
joblib.dump(knn,'./digist_knn.m')

---->['./digist_knn.m']

Load the trained model
knn = joblib.load('./digist_knn.m')
knn

---->

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=1, n_neighbors=15, p=2,
weights='uniform')

Use test data to test the model's accuracy

print('known labels', y_test)
print('model predictions', knn.predict(x_test))

---->

known labels      [5 9 1 7 7 8 2 7 3 0 4 2 2 ...]
model predictions [5 9 1 7 7 8 2 7 3 0 1 2 2 ...]

Feed an external picture into the model for recognition
img_arr = plt.imread('./数字.jpg')
plt.imshow(img_arr)


Crop a test sample from the picture

four = img_arr[100:160,0:60]
plt.imshow(four)


four.shape

---->(60, 60, 3)

An example of mean-based dimensionality reduction

exam = [[22,22],[333,333]]
exam = np.array(exam)
exam.shape
exam = exam.mean(axis=1)  # averaging along axis 1 collapses each row to a single value
exam.shape
exam

---->array([ 22., 333.])

1. Reduce the dimensionality
four = four.mean(axis=2)
four.shape

---->(60, 60)

2. Compress the pixels proportionally (60x60 down to 28x28)
import scipy.ndimage as ndimage
four = ndimage.zoom(four,zoom=(28/60,28/60))
four.shape

--->(28, 28)

plt.imshow(four)


Because each training sample is a single row of 784 values

3. Reshape the test data the same way
four = four.reshape((1,784))
four.shape

---->(1, 784)

The results show:

knn.predict(four)

--->array([4])
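Apart from the manual crop, the three preprocessing steps generalize to any color crop; a small helper (a sketch, assuming an (h, w, 3) RGB crop) bundles them:

import numpy as np
import scipy.ndimage as ndimage

def preprocess(crop):
    # turn an (h, w, 3) color crop into the (1, 784) row the model expects
    gray = crop.mean(axis=2)                        # 1. reduce to grayscale
    h, w = gray.shape
    small = ndimage.zoom(gray, zoom=(28/h, 28/w))   # 2. scale to 28x28
    return small.reshape((1, 784))                  # 3. flatten to one row

knn.predict(preprocess(img_arr[100:160, 0:60]))     # same pipeline as above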

Knowledge summary:

img = plt.imread('./data/3/3_1.bmp')
plt.imshow(img)
feature = np.array(feature)  # feature starts out as a list

# save the model
from sklearn.externals import joblib
joblib.dump(knn, './digist_knn.m')

# load the model
knn = joblib.load('./digist_knn.m')

# crop the test graphic
four = img_arr[100:160, 0:60]

# reduce dimensionality with the mean
four = four.mean(axis=2)

# compress proportionally
import scipy.ndimage as ndimage
four = ndimage.zoom(four, zoom=(28/60, 28/60))

Source: www.cnblogs.com/Doner/p/11353777.html
Origin www.cnblogs.com/Doner/p/11353777.html