"Python Data Science Handbook" NumPy Concise Guide

1. NumPy data types

Python is a dynamically typed language

a = 'four'
a = 4
  • The code above is legal in Python but not in a statically typed language such as C. A statically typed language requires every variable's type to be declared before the variable is assigned, and forbids assigning a value of a different type to it.
  • Dynamic typing makes Python more flexible than statically typed languages, but this flexibility comes at an implementation cost.

How Python implements dynamic typing

  • The standard Python implementation is written in C, and every Python object is represented underneath by a C structure.

  • Below is how the integer type is effectively defined in the Python 3.4 source code
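struct _longobject {
    long ob_refcnt;        /* reference count, used by the interpreter for memory allocation and deallocation */
    PyTypeObject *ob_type; /* pointer to the type object that encodes the variable's type */
    size_t ob_size;        /* size of the data members that follow */
    long ob_digit[1];      /* the actual integer value */
};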
  • A C integer is essentially a label for a position in memory whose bytes encode the integer value.
  • A Python integer is a pointer to a structure like the one above, which stores the value together with the extra information that supports Python's dynamic features.
  • A Python list can be compared to a C array, except that it may be heterogeneous. The following code is legal
L = [True, 'four', 12, 3.141592]
  • For a heterogeneous list to work, every element must be a complete Python object carrying its own type information. This guarantees the flexibility of a dynamic language, but the information is redundant when all elements share one type, as here
L = list(range(1000))
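
As a small sketch of the point above, the snippet below inspects the type carried by each element of a list. The variable names are chosen here for illustration and do not come from the original notes.

```python
# Each element of a Python list is a full object that carries its own type.
L = [True, 'four', 12, 3.141592]
element_types = [type(item) for item in L]
print(element_types)

# A homogeneous list still stores one complete object per element,
# so the same type information is repeated for every entry.
homogeneous = list(range(1000))
distinct_types = {type(item) for item in homogeneous}
print(distinct_types)
```

The second list holds a thousand objects that all report the same type, which is the redundancy a fixed-type array avoids.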

Fixed-type arrays in Python

  • Python provides a fixed-type array object through the built-in array module
import array
L = list(range(50))
A = array.array('i', L)  # 'i' is the type code for integers
  • In practice the array object provided by NumPy is more useful: besides fixed-type storage it offers a range of efficient operations on the data. The rest of this section shows how to create NumPy arrays
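import numpy as np
np.__version__  # if this prints a version number, NumPy is available in the current environment

# Create NumPy arrays from Python lists
np.array([1,4,2,5,3])
np.array([3.14, 4, 2, 3])  # integers are upcast to floating point
np.array([1, 2, 3, 4], dtype='float32')  # set the element data type explicitly

# Create NumPy arrays from scratch
# The calls below are self-explanatory; consult the documentation if any are unclear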
np.zeros(10,dtype=int)
np.ones((3,5),dtype=float)
np.full((3,5),3.14)
np.arange(0,20,2)
np.linspace(0,1,5)
np.random.random((3,3))
np.random.normal(0,1,(3,3))
np.random.randint(0,10,(3,3))
np.eye(3)
np.empty(3)
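
The following sketch illustrates what "fixed type" means in practice, assuming a standard CPython installation with NumPy available. It checks that array.array rejects a value of the wrong type and that np.array settles on a single dtype for mixed input. The variable names are illustrative only.

```python
import array
import numpy as np

# array.array stores raw C integers, so a string cannot be appended.
A = array.array('i', range(5))
try:
    A.append('four')
    rejected = False
except TypeError:
    rejected = True
print(rejected)

# NumPy picks one dtype for the whole array; the integers are upcast to float here.
mixed = np.array([3.14, 4, 2, 3])
print(mixed.dtype)

# An explicit dtype overrides the inferred one.
small = np.array([1, 2, 3, 4], dtype='float32')
print(small.dtype, small.itemsize)

# Creation routines take the shape as their first argument.
print(np.zeros(10, dtype=int).shape, np.ones((3, 5)).shape, np.eye(3).shape)
```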

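2. Basic NumPy array operations

Array attributes

import numpy as np
np.random.seed(0)  # seed the random number generator so that every run produces the same arrays

# Create some array objects
# These arrays are reused in the examples that follow
x1 = np.random.randint(10, size=6)  # one-dimensional array
x2 = np.random.randint(10, size=(3, 4))  # two-dimensional array
x3 = np.random.randint(10, size=(3, 4, 5))  # three-dimensional array

# Array attributes
x3.ndim # 3
x3.shape # (3, 4, 5)
x3.size # 60
x3.dtype # int64 (the default integer type depends on the platform)
x3.itemsize # 8 --> size of each element in bytes
x3.nbytes # 480 --> total size of the array in bytes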

Array indexing

# Accessing a "multidimensional array" built from nested Python lists
A2 = [[1,2],[3,4]]
A2[0][1] # 2
# A2[0,1] # raises TypeError

# Accessing a NumPy multidimensional array
x2[0,1] # the element in row 0, column 1
  • Index notation can also modify array elements. Note that NumPy arrays have a fixed type, so an assigned value is silently converted to the array's type (a float stored into an integer array is truncated), which can cause subtle bugs
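
A minimal sketch of the truncation pitfall described above. The array contents and names are made up for illustration.

```python
import numpy as np

ints = np.array([5, 0, 3, 3, 7, 9])    # integer dtype inferred from the input
ints[0] = 3.14159                       # the float is truncated to fit the integer dtype
print(ints)

floats = ints.astype(float)             # convert first if fractional values are needed
floats[0] = 3.14159
print(floats[0])
```

No warning is raised by the first assignment, which is why the error is easy to miss.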

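Array slicing

# Slicing a multidimensional array
x2[:2,:3] # rows 0 and 1, columns 0, 1, and 2
x2[::-1, ::-1] # rows and columns both reversed

# Single rows and columns: combine indexing with slicing
x2[:,0] # first column
x2[0,:] # first row
x2[0] # shorthand for the first row

# Subarrays are views, not copies
x2_sub = x2[:2,:2]
x2_sub[0,0] = -1
# After this assignment the top-left element of x2_sub changes, and so does the top-left element of the original x2

# Getting a subarray that is a copy
x2_sub_copy = x2[:2,:2].copy()
x2_sub_copy[0,0] = -1
# After this assignment only the top-left element of x2_sub_copy changes; the original x2 is untouched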
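
The view-versus-copy distinction can be checked directly. The sketch below uses np.shares_memory on a small array built with np.arange, chosen here so that the values are predictable.

```python
import numpy as np

grid = np.arange(12).reshape(3, 4)

view = grid[:2, :2]            # a slice is a view onto the same buffer
copy = grid[:2, :2].copy()     # .copy() allocates independent storage

print(np.shares_memory(grid, view))
print(np.shares_memory(grid, copy))

view[0, 0] = -1                # visible through grid as well
copy[0, 1] = -2                # grid is unaffected
print(grid[0, :2])
```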

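Reshaping arrays

# Turn the 9-element array into a 3x3 array
grid = np.arange(1,10).reshape((3,3))
# The size of the original array must match the size of the reshaped array; where possible, reshape returns a no-copy view of the original

# Convert a one-dimensional array to a row or column vector
x = np.array([1,2,3])
x.reshape((1,3)) # row vector
x[np.newaxis,:] # row vector
x.reshape((3,1)) # column vector
x[:,np.newaxis] # column vector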
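
A short sketch of the reshaping behavior described above: the reshaped array shares data with the original in this case, np.newaxis adds a dimension of length 1, and a size mismatch raises an error. Names are illustrative.

```python
import numpy as np

flat = np.arange(1, 10)
grid = flat.reshape((3, 3))
grid[0, 0] = 100               # reshape returned a view here, so flat changes too
print(flat[0])

x = np.array([1, 2, 3])
row = x[np.newaxis, :]
col = x[:, np.newaxis]
print(row.shape, col.shape)

# A target shape with a different total size raises ValueError.
try:
    flat.reshape((2, 5))
    size_error = False
except ValueError:
    size_error = True
print(size_error)
```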

Concatenating and splitting arrays

# Concatenating arrays
x = np.array([1, 2, 3])
y = np.array([3, 2])
z = [6,6,6]
np.concatenate([x, y, z])

# Concatenating two-dimensional arrays
grid = np.array([[1, 2, 3],[4, 5, 6]])
np.concatenate([grid, grid]) # along the first axis
np.concatenate([grid, grid], axis=1) # along the second axis

np.vstack([x,grid]) # stack vertically
np.hstack([grid,y[:,np.newaxis]]) # stack horizontally; y must first become a (2, 1) column

# Splitting arrays
x = [1, 2, 3, 99, 99, 3, 2, 1]
x1, x2, x3 = np.split(x, [3, 5])
# Splitting at the given indices yields three arrays: [1,2,3] [99,99] [3,2,1]

grid = np.arange(16).reshape((4, 4))
upper, lower = np.vsplit(grid, [2])
# upper:
#     [[0,1,2,3],
#      [4,5,6,7]]
# lower:
#     [[8,9,10,11],
#      [12,13,14,15]]
left, right = np.hsplit(grid, [2])
# left:
#     [[0,1],
#      [4,5],
#      [8,9],
#      [12,13]]
# right:
#     [[2,3],
#      [6,7],
#      [10,11],
#      [14,15]]
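
As a sanity check on the functions above, the following sketch splits an array and stacks the pieces back together. It also shows why a one-dimensional array has to become a column before np.hstack can attach it to a two-dimensional array. The names two_rows and widened are illustrative.

```python
import numpy as np

grid = np.arange(16).reshape((4, 4))

upper, lower = np.vsplit(grid, [2])
left, right = np.hsplit(grid, [2])
print(upper.shape, lower.shape, left.shape, right.shape)

# Stacking the pieces back together recovers the original array.
restored_rows = np.vstack([upper, lower])
restored_cols = np.hstack([left, right])
print(np.array_equal(restored_rows, grid), np.array_equal(restored_cols, grid))

# hstack needs matching dimensions, so the 1-D array is turned into a column first.
two_rows = np.array([[1, 2, 3], [4, 5, 6]])
y = np.array([3, 2])
widened = np.hstack([two_rows, y[:, np.newaxis]])
print(widened)
```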

3. Universal functions

Python performance bottlenecks

  • Python tends to hit a performance bottleneck when looping over a large list, because the interpreter checks the type of every element and looks up the function that matches that type.
  • The key to removing this bottleneck is to have the type checking and function lookup done ahead of time in compiled code rather than at run time
  • NumPy's universal functions (ufuncs) remove this bottleneck. Their main purpose is to execute repeated operations on the values of a NumPy array quickly
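
To make the contrast concrete, here is a minimal sketch that computes reciprocals twice: once with an explicit Python loop and once with a vectorized expression. The helper reciprocal_loop is hypothetical and exists only for this comparison; on large arrays the loop version is the one that suffers from per-element overhead.

```python
import numpy as np

def reciprocal_loop(values):
    """Compute 1/x one element at a time in interpreted Python (hypothetical helper)."""
    output = np.empty(len(values))
    for index in range(len(values)):
        output[index] = 1.0 / values[index]
    return output

values = np.arange(1, 6)
looped = reciprocal_loop(values)
vectorized = 1.0 / values          # the same operation, dispatched to a compiled ufunc loop
print(looped)
print(vectorized)
```

Both versions produce the same numbers; only the place where the loop runs differs.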

Common NumPy universal functions

# Arithmetic
# Each arithmetic operator below is a thin wrapper around a NumPy built-in function
# The comment on each line gives the equivalent NumPy function call
x = np.arange(10)
x + 5   # np.add(x,5)
x - 5   # np.subtract(x,5)
x * 2   # np.multiply(x,2)
x / 2   # np.divide(x,2)
x // 2  # np.floor_divide(x,2)
x ** 2  # np.power(x,2)
x % 2   # np.mod(x,2)
-x      # np.negative(x)
-(0.5*x + 1) ** 2 # mind operator precedence in compound expressions


# Absolute value
x = np.arange(-5,6)
abs(x)  # np.absolute(x) or np.abs(x)
x = np.array([3-4j,4-3j,2+0j,0+1j])
abs(x)  # for complex numbers, abs() returns the magnitude

# Trigonometric functions
theta = np.linspace(0, np.pi, 6)
np.cos(theta)
np.sin(theta)
np.tan(theta)
# Inverse trigonometric functions are available as well

# Exponentials and logarithms
x = [1, 2, 3]
np.exp(x)       # e^x
np.exp2(x)      # 2^x
np.power(3,x)   # 3^x

x = [1, 2, 4, 10]
np.log(x)       # ln(x)
np.log2(x)      # log2(x)
np.log10(x)     # log10(x)

# Variants that keep precision for very small inputs
x = [0, 0.001, 0.01, 0.1]
np.expm1(x)      # e^x - 1
np.log1p(x)      # ln(1 + x)

# More specialized ufuncs are available in the scipy.special module

Advanced ufunc features

# Specifying the output: a memory-saving technique
x = np.arange(5)
y = np.empty(5)
np.multiply(x, 10, out=y)

# Aggregation with reduce and accumulate
x = np.arange(1, 6)
np.add.reduce(x)        # sum of all elements
np.multiply.reduce(x)   # product of all elements
np.add.accumulate(x)    # running sums (intermediate results are kept)
np.multiply.accumulate(x) # running products (intermediate results are kept)

# Outer product
x = np.arange(0, 5)
y = np.arange(-4,1)
np.multiply.outer(x, y) # applies multiply to every pair of elements from x and y
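
The sketch below runs the same ufunc methods on a small array so the results can be checked by hand. The variable names are illustrative.

```python
import numpy as np

x = np.arange(1, 6)                       # [1, 2, 3, 4, 5]
total = np.add.reduce(x)                  # 1 + 2 + 3 + 4 + 5
product = np.multiply.reduce(x)           # 1 * 2 * 3 * 4 * 5
running_sum = np.add.accumulate(x)
running_product = np.multiply.accumulate(x)
print(total, product)
print(running_sum, running_product)

# outer builds a table with one row per element of the first argument
# and one column per element of the second.
table = np.multiply.outer(x, x)
print(table.shape, table[1, 2])

# out= writes into an existing array instead of allocating a new one.
y = np.empty(5)
result = np.multiply(x, 10, out=y)
print(result is y)
```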

4. Simple statistics: aggregation functions

Sum, minimum, and maximum

  • The reason for using NumPy's aggregation functions is simple: they are faster
# NumPy's built-in aggregation functions are faster than Python's native ones
# (%timeit is an IPython magic command; the timings shown are indicative and vary by machine)

# Sum
big_array = np.random.rand(1000000)
%timeit sum(big_array) # 326ms
%timeit np.sum(big_array) # 2ms

# Maximum
%timeit np.max(big_array) # 892µs
%timeit max(big_array) # 212ms

# The aggregation methods of the array object can also be called directly
big_array.sum()
big_array.max()

Aggregating along a dimension

  • By default an aggregation function aggregates over all elements of the array; the dimension along which to aggregate can also be specified
M = np.random.random((3, 4))

# The axis argument names the dimension (axis) that is collapsed
M.min(axis=0) # minimum of each column
M.max(axis=1) # maximum of each row
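
Because the axis argument names the dimension that disappears, it is easy to get backwards. The sketch below uses a small hand-written matrix (not from the original notes) so each result can be verified by eye.

```python
import numpy as np

M = np.array([[1, 8, 3],
              [7, 2, 9]])

total = M.sum()                 # all elements
column_min = M.min(axis=0)      # collapse the rows: one value per column
row_max = M.max(axis=1)         # collapse the columns: one value per row
print(total)
print(column_min)
print(row_max)
```

A (2, 3) array aggregated over axis=0 leaves shape (3,), and over axis=1 leaves shape (2,).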

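5. Broadcasting

Introducing broadcasting

# Adding a scalar to an array
import numpy as np
a = np.array([0, 1, 2])
a + 5
# Conceptually, the scalar 5 is stretched into the array [5, 5, 5] and then added to a; NumPy never actually creates that duplicate array. This mental model is a simple way to understand broadcasting

# Adding a one-dimensional array to a two-dimensional array
M = np.ones((3, 3))
M + a
# Here a is stretched into a 3x3 array,
#     [[0,1,2],
#      [0,1,2],
#      [0,1,2]]
# which is then added to M

# Adding a row vector to a column vector
a = np.arange(3)
b = np.arange(3)[:,np.newaxis]
a + b
# Here a is stretched into a 3x3 array,
#     [[0,1,2],
#      [0,1,2],
#      [0,1,2]]
# and b is also stretched into a 3x3 array,
#     [[0,0,0],
#      [1,1,1],
#      [2,2,2]]
# which explains the result of a + b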

Broadcasting rules

  • Rule 1: If the two arrays differ in their number of dimensions, the shape of the one with fewer dimensions is padded with ones on its left side
  • Rule 2: If the shapes of the two arrays do not match in some dimension, the array whose shape equals 1 in that dimension is stretched to match the other shape
  • Rule 3: If the sizes disagree in any dimension and neither is equal to 1, an exception is raised
# Example 1
M = np.ones((2, 3))
a = np.arange(3)
M.shape # (2, 3)
a.shape # (3,)
# By rule 1, a.shape becomes (1, 3)
# By rule 2, a.shape becomes (2, 3)
# Example 2
a = np.arange(3).reshape((3, 1))
b = np.arange(3)
a.shape # (3, 1)
b.shape # (3,)
# By rule 1, b.shape becomes (1, 3)
# By rule 2, a.shape becomes (3, 3) and b.shape becomes (3, 3)
# Example 3
M = np.ones((3, 2))
a = np.arange(3)
M.shape # (3, 2)
a.shape # (3,)
# By rule 1, a.shape becomes (1, 3)
# By rule 2, a.shape becomes (3, 3)
# The shapes of M and a still do not match, so by rule 3 M + a raises an exception
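
The three rules can be written out as a short function. This is only a sketch for positive dimension sizes: broadcast_shape is a hypothetical helper written for this note, not a NumPy function, and it is compared against what NumPy actually does for the three examples above.

```python
import numpy as np

def broadcast_shape(shape_a, shape_b):
    """Apply the three broadcasting rules to two shapes (hypothetical helper)."""
    ndim = max(len(shape_a), len(shape_b))
    # Rule 1: pad the shorter shape with ones on the left.
    padded_a = (1,) * (ndim - len(shape_a)) + tuple(shape_a)
    padded_b = (1,) * (ndim - len(shape_b)) + tuple(shape_b)
    result = []
    for size_a, size_b in zip(padded_a, padded_b):
        if size_a == size_b or size_b == 1:
            # Sizes agree, or b is stretched along this dimension (rule 2).
            result.append(size_a)
        elif size_a == 1:
            # Rule 2: a is stretched along this dimension.
            result.append(size_b)
        else:
            # Rule 3: sizes differ and neither is 1.
            raise ValueError(f"incompatible shapes {shape_a} and {shape_b}")
    return tuple(result)

example_1 = broadcast_shape((2, 3), (3,))
example_2 = broadcast_shape((3, 1), (3,))
print(example_1, example_2)
print((np.ones((2, 3)) + np.arange(3)).shape)

try:
    broadcast_shape((3, 2), (3,))
    example_3_failed = False
except ValueError:
    example_3_failed = True
print(example_3_failed)
```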

6. Masks

Comparison operators

x = np.arange(5)
x < 3   # [True, True, True, False, False]
x != 3  # [True, True, True, False, True]

# Like the arithmetic operators, the comparison operators are implemented as ufuncs
# The comment on each line gives the equivalent ufunc call
x == 3  # np.equal(x, 3)
x != 3  # np.not_equal(x, 3)
x < 3   # np.less(x, 3)
x <= 3  # np.less_equal(x, 3)
x > 3   # np.greater(x, 3)
x >= 3  # np.greater_equal(x, 3)

Boolean logic

x = np.random.randint(10,size=(3,4))
np.count_nonzero(x<6)   # count the True entries
np.sum(x<6)             # another way to count the True entries
np.any(x<6)             # is any entry True?
np.all(x<6)             # are all entries True?
# sum, any, and all can each be applied along a particular dimension

# ufuncs corresponding to the Boolean operators
(x>=2) & (x<=4) # np.bitwise_and((x>=2),(x<=4))
(x>=2) | (x<=4) # np.bitwise_or((x>=2),(x<=4))
(x>=2) ^ (x<=4) # np.bitwise_xor((x>=2),(x<=4))
~(x>=2)         # np.bitwise_not(x>=2); ~ is unary and takes a single operand

Boolean arrays as masks

  • NumPy allows a Boolean array to be used as a mask that selects a subset of an array's elements
x = np.random.randint(10,size=(3,4))
x[x<5] # a one-dimensional array of the elements smaller than 5, e.g. [0, 2, 4, 4, 3, 2]
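
Since the array above is random, its output differs from run to run. The sketch below repeats the masking steps on a fixed array (values chosen here for illustration) so the results are reproducible.

```python
import numpy as np

x = np.array([[5, 0, 3, 3],
              [7, 9, 3, 5],
              [2, 4, 7, 6]])

mask = x < 5
count = mask.sum()               # number of True entries
selected = x[mask]               # selected values, in row-major order
print(count, selected)

in_range = (x >= 2) & (x <= 4)   # parentheses are required around each comparison
print(x[in_range])

per_row = np.sum(x < 5, axis=1)  # count of matching entries in each row
print(per_row)
```

The mask has the same shape as x, while the selected values always come back as a one-dimensional array.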

7. Fancy indexing

  • Simple indexing uses a single scalar as the index; fancy indexing uses an array of indices instead, which makes it easy to access or modify complicated subsets of an array

Accessing values with fancy indexing

# Arrays as indices
x = np.random.randint(100,size=10)
# e.g. [9, 28, 75, 63, 81, 30, 55, 87, 84, 49]

ind = [3,7,4]
x[ind]
# [63, 87, 81]

ind = np.array([[3,4],[7,5]])
x[ind]
# [[63, 81],[87, 30]]
# The shape of the result follows the shape of the index array

# Fancy indexing on multidimensional arrays
x = np.arange(12).reshape(3,4)
# x is:
#     [[ 0,  1,  2,  3],
#      [ 4,  5,  6,  7],
#      [ 8,  9, 10, 11]]

row = np.array([0,1,2])
col = np.array([2,1,3])
x[row,col]
# The result is [2, 5, 11]:
# the elements at positions (0,2), (1,1), and (2,3)

Modifying values with fancy indexing

x = np.arange(10)
i = np.array([2,1,8,4])
x[i] = 99
# x is now [0, 99, 99, 3, 99, 5, 6, 7, 99, 9]
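
A compact, runnable sketch that combines reading and writing with index arrays. The name grid is illustrative, and the in-place subtraction is an extra example beyond the assignment shown above.

```python
import numpy as np

x = np.arange(10)
i = np.array([2, 1, 8, 4])
x[i] = 99                       # assign to several positions at once
print(x)

x[i] -= 10                      # in-place operators work on the selected positions too
print(x[i])

grid = np.arange(12).reshape(3, 4)
row = np.array([0, 1, 2])
col = np.array([2, 1, 3])
picked = grid[row, col]         # pairs (0,2), (1,1), (2,3)
print(picked)
```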

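8. Sorting arrays

# Sorting
x = np.array([2,1,4,3,5])
y = np.sort(x)  # returns a sorted copy
x.sort()        # sorts the original array in place

# Getting the indices that would sort the array
x = np.array([2,1,4,3,5])  # recreate the unsorted array, since x.sort() sorted it in place
i = np.argsort(x)
i               # [1, 0, 3, 2, 4]
x[i]            # [1, 2, 3, 4, 5]

# Sorting along the rows or columns of a multidimensional array
x = np.random.randint(0,10,(4,6))
np.sort(x,axis=0) # sort each column
np.sort(x,axis=1) # sort each row

# The k smallest values
x = np.array([7,2,3,1,6,5,4])
np.partition(x,3)
# e.g. [2, 1, 3, 4, 6, 5, 7]
# The first three entries are the three smallest values of the original array; the order within each of the two partitions is arbitrary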
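
Because np.partition does not promise any order inside each partition, a check on its output should compare sorted values rather than exact positions. The sketch below does that, and also sorts a small fixed table along each axis. The names parts, smallest, and table are illustrative.

```python
import numpy as np

x = np.array([7, 2, 3, 1, 6, 5, 4])

order = np.argsort(x)           # indices that would sort x
print(x[order])

k = 3
parts = np.partition(x, k)
# Order inside the partition is not guaranteed, so sort before comparing.
smallest = sorted(parts[:k].tolist())
print(smallest)

table = np.array([[3, 1, 2],
                  [0, 5, 4]])
by_column = np.sort(table, axis=0)   # each column sorted independently
by_row = np.sort(table, axis=1)      # each row sorted independently
print(by_column)
print(by_row)
```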

9. Structured arrays

# Create a structured array
x = np.zeros(4,dtype={'names':('name','age','weight'),
                        'formats':('U10','i4','f8')})
# U10: Unicode string of at most 10 characters
# i4: 4-byte integer
# f8: 8-byte float

name = ['Alice','Bob','Cathy','Dive']
age = [25,45,37,19]
weight = [55.0,85.5,68.0,61.5]
x['name'] =  name
x['age'] = age
x['weight'] = weight
# x is now:
#     [('Alice', 25, 55.),
#      ('Bob', 45, 85.5),
#      ('Cathy', 37, 68.),
#      ('Dive', 19, 61.5)]

# Working with the structured array
x['name']       # ['Alice', 'Bob', 'Cathy', 'Dive']
x[0]            # ('Alice', 25, 55.)
x[-1]['name']   # 'Dive'
x[x['age']<30]['name'] # ['Alice', 'Dive']

Origin www.cnblogs.com/pkuimyy/p/11485058.html