04 - NumPy: reading local data and indexing

1. Reading data with NumPy

  CSV (Comma-Separated Values) is a plain-text file format in which line breaks separate the rows and commas separate the columns; each line is one data record. Because CSV is easy to display, read, and write, it is widely used to store and transfer small and medium-sized datasets. For teaching convenience we usually work with CSV files here, but the same operations are just as easy to apply to data held in a database.

Loading data:

np.loadtxt(fname, dtype=float, delimiter=None, skiprows=0, usecols=None, unpack=False)

What each parameter means:

  fname: a file, a string (path), or a generator; it may also be a .gz or .bz2 compressed file

  dtype: data type, optional; the type that the strings in the CSV are converted to when read into the array. The default is float (older tutorials write np.float, an alias that has been removed from recent NumPy versions)

  delimiter: the string that separates values; the default is any whitespace, so set it to a comma for CSV files

  skiprows: skip the first N lines; usually used to skip the header in the first row

  usecols: read only the specified columns, given as an index or a tuple of indexes

  unpack: if True, the result is transposed so that each column can be written to its own array variable; if False, all the data is written into a single array variable. The default is False
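To make the parameters above concrete, here is a minimal sketch. It assumes a tiny in-memory CSV (the `csv_text` string is made up for illustration and stands in for a real file) so that it runs without any data on disk.

```python
import io
import numpy as np

# Hypothetical in-memory CSV standing in for a real file: one header row + 3 records.
csv_text = "views,likes,dislikes,comment_total\n100,10,1,5\n200,20,2,6\n300,30,3,7\n"

# Read everything into one 2-D array, skipping the header row.
data = np.loadtxt(io.StringIO(csv_text), delimiter=",", dtype=int, skiprows=1)
print(data.shape)  # (3, 4)

# Read only columns 0 and 3 and unpack them into two separate 1-D arrays.
views, comments = np.loadtxt(
    io.StringIO(csv_text), delimiter=",", dtype=int,
    skiprows=1, usecols=(0, 3), unpack=True,
)
print(views)     # [100 200 300]
print(comments)  # [5 6 7]
```

With a real file you would pass the path string as the first argument instead of the `io.StringIO` object.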

 

Reading data with NumPy:

 

  We now have CSV files containing the view, like, dislike, and comment counts (["views", "likes", "dislikes", "comment_total"]) of more than 1000 YouTube videos each from the UK and the US. Let's apply what we have just learned and try working with them. (If you need the exact data files, you can add Q: 1259553287.)

  Source: https://www.kaggle.com/datasnaek/youtube/data

import numpy as np

# Raw string so the backslashes in the Windows path are not treated as escapes
GB_data_path = r"F:\BaiduNetdiskDownload\youtube_video_data\GB_video_data_numbers.csv"
# Load the file and read the data into an array (unpack defaults to False, so only one array variable is written)
# Arguments: 1) file path, 2) comma delimiter, 3) data type; a fourth (unpack) would write each column to a separate variable
t2 = np.loadtxt(GB_data_path, delimiter=",", dtype="int")
"""
delimiter: the character that separates values; if it is not specified, each line is treated as one string and loading fails
dtype: with the default (float), large values are displayed in scientific notation
unpack: defaults to False, so the array has as many rows as the data file
        when True (1), each column of the original data becomes a row, so the array has as many rows as the file has columns;
        the effect is equivalent to a transpose
"""
print(t2)

Transposing in NumPy:

 

  Transposing is a transformation that, for a NumPy array, swaps the data across the diagonal (rows become columns). Its purpose is to make the data more convenient to process.

# NumPy transpose
t2.T
t2.transpose()
t2.swapaxes(1, 0)
# The three approaches above have the same effect
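As a quick check that the three spellings agree, here is a small sketch on a 2x3 array (a stand-in for t2, which needs the data file):

```python
import numpy as np

t = np.arange(6).reshape(2, 3)   # small stand-in for t2
a = t.T
b = t.transpose()
c = t.swapaxes(1, 0)

print(a.shape)  # (3, 2)
print(np.array_equal(a, b) and np.array_equal(b, c))  # True
```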

So, how do we combine this with the matplotlib we learned earlier to present the UK and US data?

        When we see a problem, what should we consider?

        What result do we want to show, and what problem does it solve?

        Which kind of chart should we choose?

        What data processing needs to be done?

        Write the code.

Next, we need some data-related operations:

NumPy indexing and slicing:

  For the data we have just loaded, what should we do if we want to select a particular row (or column)? It is actually very simple, and works much like indexing a Python list.

import numpy as np

t1 = np.arange(12).reshape(3, 4)
print(t1)
# Take single rows
print(t1[2], t1[1], t1[0])
# Take several contiguous rows
print(t1[1:])
# Take several non-contiguous columns
print(t1[:, [0, 2]])
# Take several non-contiguous rows
print(t1[[0, 2]])
# Row 1, from column 1 onward
print(t1[1, 1:])
# Column 1
print(t1[:, 1])
# Take several non-adjacent points
# The selected elements are (0,0) and (2,2)
print(t1[[0, 2], [0, 2]])

Modifying values in NumPy:

t1[1] = 0

Modifying the values of rows and columns is easy, but what if the condition is more complicated?

For example, suppose we want to replace every number in t1 that is less than 6 with 3, as follows.

NumPy Boolean indexing:

t1 < 6
t1[t1 < 6] = 3
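Here is the Boolean-index replacement as a runnable sketch. The name `mask` is only introduced for illustration; the comparison produces a Boolean array of the same shape, and the assignment touches only the positions where it is True.

```python
import numpy as np

t1 = np.arange(12).reshape(3, 4)
mask = t1 < 6      # Boolean array with the same shape as t1
t1[mask] = 3       # assign only where the mask is True
print(t1)
# [[ 3  3  3  3]
#  [ 3  3  6  7]
#  [ 8  9 10 11]]
```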

The ternary operator in NumPy

So the question is:

If we want to replace the numbers in t1 that are less than 6 with 0, and all the others (6 or greater) with 9, how should we do it?

np.where(t1<6,0,9)
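A small sketch of what this call returns on the 3x4 example array. Note that np.where builds a new array and leaves t1 itself unchanged.

```python
import numpy as np

t1 = np.arange(12).reshape(3, 4)
result = np.where(t1 < 6, 0, 9)   # 0 where the condition holds, 9 elsewhere
print(result)
# [[0 0 0 0]
#  [0 0 9 9]
#  [9 9 9 9]]
print(t1[0, 1])  # 1 (t1 is untouched)
```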

So the question is:

If we want to replace the numbers in t1 that are less than 6 with 6, and those greater than 9 with 9, how should we do it?

clip (clipping) in NumPy

t1.clip(6,9)

The operation above:

Replaces values less than 6 with 6 and values greater than 9 with 9. Note that nan is not replaced. So what is nan?
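The sketch below shows both effects at once. It assumes a float version of the example array with one nan planted in it by hand, purely for illustration.

```python
import numpy as np

t1 = np.arange(12, dtype=float).reshape(3, 4)
t1[2, 3] = np.nan            # plant one nan to show that it survives clipping
clipped = t1.clip(6, 9)      # returns a new array
print(clipped)
# rows: [6, 6, 6, 6], [6, 6, 6, 7], [8, 9, 9, nan]
```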

nan and inf in NumPy:

  nan (NAN, NaN): short for "not a number"; it represents a value that is not a number

When does nan appear in NumPy?

      When we read a local file as float and some values are missing, nan appears. It also appears as the result of an invalid calculation (such as infinity (inf) minus infinity).

inf (-inf, inf): infinity; inf represents positive infinity and -inf represents negative infinity. When does inf (or -inf) appear? For example, when a nonzero number is divided by 0 (plain Python raises an error, whereas NumPy returns inf or -inf).

So how do we specify a nan or an inf? Use np.nan and np.inf, and note their type: both are float.
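A minimal sketch of where these values come from. The `np.errstate` context manager is only there to silence the divide-by-zero warnings that NumPy would otherwise print.

```python
import numpy as np

a = np.nan
b = np.inf
print(type(a), type(b))   # both are Python floats

with np.errstate(divide="ignore", invalid="ignore"):
    x = np.array([1.0, -1.0, 0.0]) / 0   # inf, -inf, and nan (0/0 is invalid)

y = np.inf - np.inf                      # an invalid calculation gives nan
print(x, y)
```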

Points to note about nan in NumPy:

  1. Two nan values are not equal: np.nan == np.nan is False

  2. np.nan != np.nan is True

  3. Using the property above, we can count the nan values in an array: np.count_nonzero(t1 != t1)

  4. Because of point 2, how do we check whether a number is nan? Use np.isnan(a), which returns a bool array. For example, to replace nan with 0: t[np.isnan(t)] = 0

  5. Any calculation involving nan gives nan
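The five points can be checked in a few lines. The array `t` here is a made-up example with two missing values.

```python
import numpy as np

print(np.nan == np.nan)   # False
print(np.nan != np.nan)   # True

t = np.array([1.0, np.nan, 3.0, np.nan])
nan_count = np.count_nonzero(t != t)   # nan is the only value not equal to itself
print(nan_count)          # 2
print(np.isnan(t))        # Boolean array marking the nan positions
print(t.sum())            # nan, because any calculation with nan gives nan

t[np.isnan(t)] = 0        # replace nan with 0
print(t)                  # [1. 0. 3. 0.]
```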

 

So the question is: is it appropriate to simply replace the nan values in a dataset with 0? What effect does that have? For example, if the mean was greater than 0 before the replacement, replacing every nan with 0 will certainly lower the mean. A more common approach is therefore to replace missing values with the mean (or median), or to delete the rows that contain missing values. That raises another question: how do we compute the mean (or median) of a dataset, and how do we delete the rows (or columns) that contain missing data? [Deletion is covered in the pandas notes.]

Commonly used NumPy statistical functions:

  Sum: t1.sum(axis=None)

  Mean: t1.mean(axis=None), strongly affected by outliers

  Median: np.median(t1, axis=None)

  Maximum: t1.max(axis=None)

  Minimum: t1.min(axis=None)

  Range: np.ptp(t1, axis=None), i.e. the difference between the maximum and the minimum

  Standard deviation: t1.std(axis=None)

The standard deviation measures how far a set of data is spread out from its mean. A large standard deviation means that most values differ considerably from the mean; a small one means the values are close to the mean. It reflects how much the data fluctuates: the larger the standard deviation, the larger the fluctuation and the less stable the data.

By default these return the statistic over the whole multidimensional array; if axis is specified, they return one result per slice along that axis.
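To see the difference between the default and an explicit axis, here is a short sketch on the 3x4 example array (axis=0 collapses the rows and gives one result per column; axis=1 gives one result per row):

```python
import numpy as np

t1 = np.arange(12).reshape(3, 4)
print(t1.sum())             # 66, over the whole array
print(t1.sum(axis=0))       # [12 15 18 21], one value per column
print(t1.sum(axis=1))       # [ 6 22 38], one value per row
print(t1.mean(axis=0))      # [4. 5. 6. 7.]
print(np.median(t1))        # 5.5
print(np.ptp(t1, axis=1))   # [3 3 3]
print(t1.std(axis=1))       # the same value for every row here
```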

 

Filling missing values in an ndarray with the mean:

If the array t contains nan values, how do we replace each nan with the mean of its column?

t = array([[ 0.,  1.,  2.,  3.,  4.,  5.],

     [ 6.,  7., nan,  9., 10., 11.],

     [12., 13., 14., nan, 16., 17.],

     [18., 19., 20., 21., 22., 23.]])
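One possible way to do this is sketched below. The helper name `fill_nan_with_column_mean` is made up for this example, and the approach (loop over columns, average the non-nan values, assign through a Boolean mask) is just one reasonable option rather than the only one.

```python
import numpy as np

def fill_nan_with_column_mean(arr):
    """Return a copy of arr where each nan is replaced by the mean of the
    non-nan values in its column (hypothetical helper for illustration)."""
    filled = arr.astype(float)            # astype returns a copy
    for col in range(filled.shape[1]):
        column = filled[:, col]           # a view into filled
        nan_mask = np.isnan(column)
        # Skip columns with no nan, and columns that are entirely nan.
        if nan_mask.any() and not nan_mask.all():
            column[nan_mask] = column[~nan_mask].mean()
    return filled

# Rebuild the array shown above: 0..23 with two missing values.
t = np.arange(24, dtype=float).reshape(4, 6)
t[1, 2] = np.nan
t[2, 3] = np.nan

filled = fill_nan_with_column_mean(t)
print(filled[1, 2])  # 12.0, the mean of 2, 14, 20
print(filled[2, 3])  # 11.0, the mean of 3, 9, 21
```

Because the helper works on a copy, the original t still contains its nan values afterwards.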

 

 

Summary

  How do you select one or more rows (or columns) of data?

  How do you assign values to selected rows or columns?

  How do you replace values greater than 10 with 10?

  How do you use np.where? How do you use np.clip?

  How do you transpose (swap axes)?

  How do you read and save data as CSV? What are np.nan and np.inf?

  Do you remember the commonly used statistical functions?

  What does the standard deviation tell you about the data?

 Hands-on practice:

  Using the earlier YouTube data for the UK and the US, draw a histogram of the number of comments for each country with matplotlib

 (Drawing one country is enough; the other is drawn the same way. To draw a histogram you need to decide the bin width and calculate the number of bins.)

When we first plotted all the comment counts, we found that most videos have fewer than 5000 comments, so we plot only the values below 5000 and re-estimate the bin width

import numpy as np
from matplotlib import pyplot as plt
plt.rc("font", family="SimHei")
# Raw string so the backslashes in the Windows path are not treated as escapes
GB_data_path = r"F:\BaiduNetdiskDownload\youtube_video_data\GB_video_data_numbers.csv"
# Load the file and read the data into an array (unpack defaults to False, so only one array variable is written)
# Arguments: 1) file path, 2) comma delimiter, 3) data type; a fourth (unpack) would write each column to a separate variable
t2 = np.loadtxt(GB_data_path, delimiter=",", dtype="int")
"""
Columns: views, likes, dislikes, number of comments
"""
print(t2)
# Take the last column (number of comments)
t3 = t2[:, -1]
# Keep only the videos with fewer than 5000 comments
t3 = t3[t3 < 5000]
print(t3)
d = 100  # bin width
sum_bins = (t3.max() - t3.min()) // d  # number of bins
print(t3.max(), t3.min())
plt.figure(figsize=(20, 8), dpi=80)
plt.hist(t3, sum_bins)
plt.xlabel("Number of comments")
plt.ylabel("Number of videos")
plt.title("Histogram of comment counts")

plt.show()

 

 

Note: if the data range is not evenly divisible by the bin width, the bars toward the right will no longer line up with the grid
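One way to avoid that misalignment, sketched here under assumptions, is to build the bin edges explicitly as multiples of the bin width instead of passing a bin count. The comment counts below are made up and stand in for t3; np.histogram is used so the sketch runs without opening a plot window.

```python
import numpy as np

# Hypothetical comment counts standing in for t3.
t3 = np.array([3, 40, 120, 250, 260, 499, 730, 999, 1450])
d = 100  # bin width

# Round the range outward to multiples of d so every bin is exactly d wide.
start = (t3.min() // d) * d
stop = (t3.max() // d + 1) * d
bin_edges = np.arange(start, stop + d, d)

counts, edges = np.histogram(t3, bins=bin_edges)
print(bin_edges[0], bin_edges[-1])  # 0 1500
print(counts.sum())                 # 9, every value falls in some bin
```

The same edges can then be passed to matplotlib, for example plt.hist(t3, bin_edges) together with plt.xticks(bin_edges), so that bars and ticks share the same boundaries.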

 

  If we want to know the relationship between the number of comments and the number of likes for UK YouTube videos, what kind of chart should we draw?

  (Analysis: as before, doing it for one country is enough. We want to understand how comment counts relate to like counts, and when we want to see the relationship between two variables, a scatter plot shows it best.)

 

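import numpy as np
from matplotlib import pyplot as plt
plt.rc("font",family="SimHei",)
# Raw string so the backslashes in the Windows path are not treated as escapes
GB_data_path = r"F:\BaiduNetdiskDownload\youtube_video_data\GB_video_data_numbers.csv"
# Load the file and read the data into an array (unpack defaults to False, so only one array variable is written)
# Arguments: 1) file path, 2) comma delimiter, 3) data type; a fourth (unpack) would write each column to a separate variable
t2 = np.loadtxt(GB_data_path,delimiter=",",dtype="int")
"""
Columns: views, likes, dislikes, number of comments
"""
print(t2)
# Select all rows whose like count is at most 500000
all_data =t2[t2[:,1]<=500000]
# Get the comment counts and the like counts
content = all_data[:,-1]
like = all_data[:,1]
plt.figure(figsize=(15,6),dpi=100)
plt.scatter(content,like)
plt.xlabel("Number of comments")
plt.ylabel("Number of likes")
plt.title("Relationship between comments and likes")
plt.show()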

 

 

 

 

 

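 We can see that as the number of comments increases, the number of likes increases too.

Concatenating arrays:

Now I want to study and analyze the data for the two countries from the earlier example together. How should we do that?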

np.vstack((t1,t2))  # vertical stacking
np.hstack((t1,t2))  # horizontal stacking
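A small sketch of the resulting shapes, using two made-up 2x3 arrays:

```python
import numpy as np

t1 = np.arange(6).reshape(2, 3)
t2 = np.arange(6, 12).reshape(2, 3)

v = np.vstack((t1, t2))   # rows of t2 go below the rows of t1
h = np.hstack((t1, t2))   # columns of t2 go to the right of t1
print(v.shape)  # (4, 3)
print(h.shape)  # (2, 6)
```

vstack needs the arrays to have the same number of columns, and hstack needs the same number of rows.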

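Swapping rows and columns of an array

Stacking arrays horizontally or vertically is simple, but what should we watch out for before stacking? When stacking vertically, each column must mean the same thing in both arrays, otherwise the result makes no sense. If the columns mean different things, swap the columns of one array so that they match the other. So the question is: how do we swap the rows or columns of an array?

For example: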

import numpy as np

t = np.arange(12,24).reshape(3,4)
print(t)
# Swap two rows
t[[0,2],:] = t[[2,0],:]
print(t)

import numpy as np

t = np.arange(12,24).reshape(3,4)
print(t)
# Swap two columns
t[:,[0,3]] = t[:,[3,0]]
print(t)

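Hands-on:

Now we want to study and analyze the two countries' data from the earlier example together while keeping the country information (which country each record came from). What should we do?

More useful NumPy methods:

    1. Get the position of the maximum or minimum value

      1) np.argmax(t,axis=0)

      2) np.argmin(t,axis=1)

    2. Create an array of all zeros: np.zeros((3,4))

    3. Create an array of all ones: np.ones((3,4))

    4. Create a square array (square matrix) with ones on the diagonal: np.eye(3)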

import numpy as np

t = np.arange(12,24).reshape(3,4)
print(t)
print(np.argmax(t,axis=0))  # axis=0 compares across the rows, within each column
# [2 2 2 2]  the results are positions (indexes), not values
print(np.argmin(t,axis=1))  # [0 0 0]

import numpy as np
print(np.zeros((3,4)))
"""
[[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]"""
print(np.ones((3,4)))
"""
[[1. 1. 1. 1.]
 [1. 1. 1. 1.]
 [1. 1. 1. 1.]]"""
print(np.eye(4))
"""
[[1. 0. 0. 0.]
 [0. 1. 0. 0.]
 [0. 0. 1. 0.]
 [0. 0. 0. 1.]]
"""

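With these methods, the earlier problem is easy to solve. Use 0 to represent the US and 1 to represent the UK: first attach the corresponding column to each country's data, then stack the US and UK data vertically. After that we can run whatever analysis we need.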
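That recipe can be sketched as follows. The `us_data` and `uk_data` arrays are small made-up stand-ins for the two loaded CSV files, and appending the flag as the last column is an assumption; any consistent position would do.

```python
import numpy as np

# Hypothetical stand-ins for the loaded US and UK arrays (same four columns).
us_data = np.arange(8).reshape(2, 4)
uk_data = np.arange(8, 20).reshape(3, 4)

# One flag column per country: 0 = US, 1 = UK.
us_flag = np.zeros((us_data.shape[0], 1), dtype=us_data.dtype)
uk_flag = np.ones((uk_data.shape[0], 1), dtype=uk_data.dtype)

# Attach the flag column horizontally, then stack the countries vertically.
us_labeled = np.hstack((us_data, us_flag))
uk_labeled = np.hstack((uk_data, uk_flag))
combined = np.vstack((us_labeled, uk_labeled))

print(combined.shape)    # (5, 5)
print(combined[:, -1])   # [0 0 1 1 1]
```

Selecting one country again is then a Boolean index on the last column, for example combined[combined[:, -1] == 1].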

Generating random numbers with NumPy:

 

 

 

 

A note on distributions:

  Uniform distribution:

Values within ranges of the same size are equally likely to occur

 

 

 Normal distribution:

Bell-shaped: low at both ends, high in the middle, and symmetric
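To tie the two distributions back to NumPy, here is a minimal sketch that draws samples from each. It uses the np.random.default_rng generator interface, which is one of several ways to do this; the seed and sample sizes are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(10)   # seeded so that repeated runs give the same numbers

u = rng.uniform(0, 10, size=10_000)           # uniform on [0, 10)
n = rng.normal(loc=0, scale=1, size=10_000)   # normal with mean 0 and standard deviation 1
k = rng.integers(0, 5, size=(2, 3))           # random integers in [0, 5), shape (2, 3)

print(u.min() >= 0, u.max() < 10)             # True True
print(round(n.mean(), 1), round(n.std(), 1))  # close to 0 and 1
print(k.shape)                                # (2, 3)
```

Plotting plt.hist(u, 50) and plt.hist(n, 50) would show the flat shape of the uniform samples and the bell shape of the normal ones.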

 

 

 

 

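A point to note in NumPy: copy and view:

a = b: nothing is copied at all; a and b are the same object, so they affect each other

a = b[:]: a view operation, a kind of slice. It creates a new object a, but a's data is held entirely by b, so changes to the data show up in both

a = b.copy(): a copy; a and b do not affect each other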
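The three cases can be told apart with a short experiment (the names a1, a2, a3 are used only to keep the three results side by side):

```python
import numpy as np

b = np.arange(5)
a1 = b          # same object, nothing copied
a2 = b[:]       # new object that shares b's data (a view)
a3 = b.copy()   # independent copy

b[0] = 99       # change the original
print(a1 is b)                  # True
print(a2 is b, a2[0])           # False 99
print(np.shares_memory(a2, b))  # True
print(a3[0])                    # 0, the copy is unaffected
```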

 

Origin www.cnblogs.com/lishuntao/p/11625997.html