These are my own study notes; all of the content comes from Chapter 4 of the book Python for Data Analysis.
To illustrate NumPy's performance advantage, consider a NumPy array of one million integers and a Python list with the same contents:
import numpy as np
my_arr = np.arange(1000000)
my_list = list(range(1000000))
# Time ten multiplications of the whole array by 2
%time for _ in range(10):my_arr2 = my_arr*2
Wall time: 21 ms
# Time the equivalent list comprehension, also ten times
%time for _ in range(10):my_list2=[x*2 for x in my_list]
Wall time: 948 ms
NumPy-based operations are generally 10 to 100 times faster than their pure Python counterparts and use significantly less memory.
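The memory claim can be checked roughly with a sketch like the one below. It is only an estimate: the list total adds the size of the list object to the size of every int object it points to, and the exact numbers depend on the platform and Python version.

```python
import sys
import numpy as np

n = 1_000_000
my_arr = np.arange(n)
my_list = list(range(n))

# Bytes used by the array's contiguous data buffer
arr_bytes = my_arr.nbytes

# Rough estimate for the list: the list object itself (its pointer table)
# plus every individual Python int object it refers to
list_bytes = sys.getsizeof(my_list) + sum(sys.getsizeof(x) for x in my_list)

print(arr_bytes, list_bytes)
```

On a typical 64-bit build the list estimate comes out several times larger than the array buffer.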
4.1 NumPy ndarray: a multidimensional array object
One of the core features of NumPy is its N-dimensional array object, the ndarray.
# Import numpy
import numpy as np
# Generate an array of random values
data = np.random.randn(2,3)
data
array([[ 0.53526407, 1.42752699, -0.68798613],
[-0.45544835, -1.35615318, -1.6924118 ]])
# Mathematical operations (applied element-wise)
data*10
array([[ 5.3526407 , 14.27526989, -6.87986133],
[ -4.55448354, -13.56153181, -16.92411803]])
data+data
array([[ 1.07052814, 2.85505398, -1.37597227],
[-0.91089671, -2.71230636, -3.38482361]])
# Shape (the size of each dimension)
data.shape
(2, 3)
# Data type
data.dtype
dtype('float64')
4.1.1 Creating ndarrays
Converting a list with np.array
data1 = [6,7.5,8,0,1]
arr1 = np.array(data1)
arr1
array([6. , 7.5, 8. , 0. , 1. ])
Nested sequences, such as a list of equal-length lists, are automatically converted to multidimensional arrays
data2 = [[1,2,3,4],[5,6,7,8]]
arr2 = np.array(data2)
arr2
array([[1, 2, 3, 4],
[5, 6, 7, 8]])
arr2.ndim
2
arr2.shape
(2, 4)
arr1.dtype
dtype('float64')
arr2.dtype
dtype('int32')
Given a length or a shape tuple, zeros creates an array of all 0s and ones creates an array of all 1s. empty creates an array without initializing its values
np.zeros(10)
array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
np.zeros((3,6))
array([[0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0.]])
np.empty((2,3,2))
array([[[1.05075542e-311, 2.86558075e-322],
[0.00000000e+000, 0.00000000e+000],
[1.05699242e-307, 8.60952352e-072]],
[[4.26976457e-090, 2.00497183e-052],
[1.26141762e-076, 9.91606475e+164],
[6.48224660e+170, 5.82471487e+257]]])
np.ones((2,3))
array([[1., 1., 1.],
[1., 1., 1.]])
It is not safe to assume that np.empty returns an array of all zeros; it may return uninitialized garbage values, as in the output above
arange is an array-valued version of Python's built-in range function
np.arange(15)
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14])
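4.1.2 ndarray data types
The data type, or dtype, holds the information the ndarray needs to interpret its block of memory as a particular type of data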
arr1 = np.array([1,2,3],dtype=np.float64)
arr2 = np.array([1,2,3],dtype=np.int32)
arr1.dtype
dtype('float64')
arr2.dtype
dtype('int32')
Use the astype method to explicitly convert (cast) an array to another data type
arr = np.array([1,2,3,4,5])
arr.dtype
dtype('int32')
Convert integers to floating point
float_arr = arr.astype(np.float64)
float_arr.dtype
dtype('float64')
arr = np.array([3.7,2.5,4.3,5.0])
arr
array([3.7, 2.5, 4.3, 5. ])
When floating-point numbers are cast to an integer type, the fractional part is truncated
arr.astype(np.int32)
array([3, 2, 4, 5])
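A small sketch of what truncation means for negative values, and how it differs from rounding. The sample values and the use of np.rint are my own additions, not taken from the book.

```python
import numpy as np

arr = np.array([3.7, -3.7, 2.5, -0.9])

# astype drops the fractional part, i.e. it rounds toward zero
truncated = arr.astype(np.int32)

# np.rint rounds to the nearest integer first (ties go to the even neighbor)
rounded = np.rint(arr).astype(np.int32)

print(truncated)  # [ 3 -3  2  0]
print(rounded)    # [ 4 -4  2 -1]
```

If rounding is what you want, round explicitly before casting.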
Convert an array of strings representing numbers to numeric form
Be careful when using the numpy.string_ type: string data in NumPy has a fixed size and may truncate input without warning. pandas has more intuitive out-of-the-box behavior for non-numeric data. (In NumPy 2.0 and later the np.string_ alias is no longer available; np.bytes_ is the equivalent name.)
numeric_strings = np.array(['1.25','-3.4','4.0'],dtype=np.string_)
numeric_strings
array([b'1.25', b'-3.4', b'4.0'], dtype='|S4')
numeric_strings.astype(float)
array([ 1.25, -3.4 , 4. ])
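The silent truncation of fixed-size strings can be seen in a sketch like this one. The dtype 'S3' (three bytes per element) and the sample strings are chosen only for illustration.

```python
import numpy as np

# 'S3' is a byte string with a fixed width of 3 bytes per element,
# so longer inputs are cut off without any warning
fixed = np.array(['1.2345', '42'], dtype='S3')
print(fixed)

# The truncated text is what gets converted to a number
as_float = fixed.astype(float)
print(as_float)
```

Here '1.2345' is stored as b'1.2', so the converted value is 1.2 rather than 1.2345.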
You can also pass another array's dtype attribute to astype
int_array = np.arange(10)
int_array
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
calibers = np.array([.22,.270,.345,.234],dtype=np.float64)
calibers
array([0.22 , 0.27 , 0.345, 0.234])
int_array.astype(calibers.dtype)
array([0., 1., 2., 3., 4., 5., 6., 7., 8., 9.])
Shorthand type-code strings can also be used to specify a dtype; 'u4' means a 4-byte unsigned integer (uint32)
empty_uint32 = np.empty(8,dtype='u4')
empty_uint32
array([3264175145, 1070344437, 343597384, 1070679982, 3779571220,
1070994554, 1168231105, 1070461878], dtype=uint32)
4.1.3 Arithmetic with NumPy arrays
arr = np.array([[1.,2.,3.],[4.,5.,6.]])
arr
array([[1., 2., 3.],
[4., 5., 6.]])
arr + arr  # addition
array([[ 2., 4., 6.],
[ 8., 10., 12.]])
arr - arr  # subtraction
array([[0., 0., 0.],
[0., 0., 0.]])
arr * arr  # element-wise multiplication
array([[ 1., 4., 9.],
[16., 25., 36.]])
arr / arr  # element-wise division
array([[1., 1., 1.],
[1., 1., 1.]])
1 / arr  # reciprocal
array([[1. , 0.5 , 0.33333333],
[0.25 , 0.2 , 0.16666667]])
arr ** 0.5  # square root
array([[1. , 1.41421356, 1.73205081],
[2. , 2.23606798, 2.44948974]])
Comparisons between arrays of the same size yield Boolean arrays
arr2 = np.array([[0.,4.,1.],[7.,4.,23.]])
arr2
array([[ 0., 4., 1.],
[ 7., 4., 23.]])
arr2 > arr
array([[False, True, False],
[ True, False, True]])
4.1.4 Basic indexing and slicing
arr = np.arange(10)
arr
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
arr[5]
5
arr[5:8]
array([5, 6, 7])
arr[5:8] = 12
arr
array([ 0, 1, 2, 3, 4, 12, 12, 12, 8, 9])
array_slice = arr[5:8]
array_slice
array([12, 12, 12])
Changing a value in array_slice also changes the original array, because a slice of an array is a view on the original data rather than a copy
array_slice[1] = 123456
array_slice
array([ 12, 123456, 12])
arr
array([ 0, 1, 2, 3, 4, 12, 123456, 12,
8, 9])
If you want a copy of a slice instead of a view, copy it explicitly, for example arr[5:8].copy()
array_copy = arr[2:5].copy()
array_copy
array([2, 3, 4])
array_copy[1] = 12345
array_copy
array([ 2, 12345, 4])
arr
array([ 0, 1, 2, 3, 4, 12, 123456, 12,
8, 9])
A bare slice [:] refers to all of the values in an array, so assigning to it overwrites every element
array_slice[:] = 64
arr
array([ 0, 1, 2, 3, 4, 64, 64, 64, 8, 9])
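Whether two arrays share data can also be checked directly instead of inferred from side effects. The sketch below uses np.shares_memory and the base attribute; the variable names view and copy are mine.

```python
import numpy as np

arr = np.arange(10)
view = arr[5:8]          # basic slicing returns a view
copy = arr[5:8].copy()   # an explicit, independent copy

view[0] = -1   # this write is visible through arr
copy[1] = 99   # this write is not

print(arr)
print(np.shares_memory(arr, view))  # True
print(np.shares_memory(arr, copy))  # False
print(view.base is arr)             # True: the view's data belongs to arr
```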
In a two-dimensional array, the element at each index is a one-dimensional array rather than a scalar
arr2d = np.array([[1,2,3],[4,5,6],[7,8,9]])
arr2d
array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
arr2d[2]
array([7, 8, 9])
Select a single element, either by indexing recursively or by passing a comma-separated list of indices
arr2d[0][2]
3
arr2d[0,2]
3
Three-dimensional array
arr3d = np.array([[[1,2,3],[4,5,6]],[[7,8,9],[10,11,12]]])
arr3d
array([[[ 1, 2, 3],
[ 4, 5, 6]],
[[ 7, 8, 9],
[10, 11, 12]]])
arr3d[0]  # a 2x3 array
array([[1, 2, 3],
[4, 5, 6]])
Both scalar values and arrays can be assigned to arr3d[0]
old_values = arr3d[0].copy()
old_values
array([[1, 2, 3],
[4, 5, 6]])
arr3d[0] = 42
arr3d
array([[[42, 42, 42],
[42, 42, 42]],
[[ 7, 8, 9],
[10, 11, 12]]])
arr3d[0] = old_values
arr3d
array([[[ 1, 2, 3],
[ 4, 5, 6]],
[[ 7, 8, 9],
[10, 11, 12]]])
Similarly, arr3d[1,0] returns all of the values whose indices start with (1, 0), as a one-dimensional array:
arr3d[1,0]
array([7, 8, 9])
This is the same as indexing in two steps:
x = arr3d[1]
x
array([[ 7, 8, 9],
[10, 11, 12]])
x[0]
array([7, 8, 9])
Note: in all of the subset selections above, the returned arrays are views
4.1.4.1 Indexing with slices
arr
array([ 0, 1, 2, 3, 4, 64, 64, 64, 8, 9])
arr[1:6]
array([ 1, 2, 3, 4, 64])
Slicing a two-dimensional array
arr2d
array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
arr2d[:2]  # slices along axis 0, the rows
array([[1, 2, 3],
[4, 5, 6]])
You can pass multiple slices, just as you can pass multiple indices
arr2d[:2,1:]
array([[2, 3],
[5, 6]])
Select the first two columns of the second row
arr2d[1,:2]
array([4, 5])
Select the first two rows of the third column
arr2d[:2,2]
array([3, 6])
arr2d[:,:1]
array([[1],
[4],
[7]])
Assigning to a slice expression assigns to the whole selection
arr2d[:2,1:] = 0
arr2d
array([[1, 0, 0],
[4, 0, 0],
[7, 8, 9]])
4.1.5 Boolean indexing
names = np.array(['Bob','Joe','Will','Bob','Will','Joe','Joe'])
data = np.random.randn(7,4)
names
array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'], dtype='<U4')
data
array([[-0.16858164, -0.33108982, 0.68263748, -0.0983769 ],
[-0.14467573, -1.73207863, -0.20321916, 0.75697117],
[ 1.38042424, -1.31551497, 2.10397966, 1.98598204],
[-0.20164359, 0.81705695, -0.51739626, -1.16344194],
[ 0.07882572, -0.68212957, 0.59073925, 1.49971538],
[ 0.13222977, -1.45147521, 0.54796917, 1.19053359],
[-1.02140787, 0.9426649 , -0.75485246, 0.20162042]])
names == 'Bob'
array([ True, False, False, True, False, False, False])
data[names == 'Bob']
array([[-0.16858164, -0.33108982, 0.68263748, -0.0983769 ],
[-0.20164359, 0.81705695, -0.51739626, -1.16344194]])
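Note: the Boolean array must have the same length as the axis it indexes. The book warns that Boolean selection does not fail when the Boolean array has the wrong length; recent NumPy versions raise an IndexError in that case, but it is still worth checking the length yourself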
data[names == 'Bob',2:]
array([[ 0.68263748, -0.0983769 ],
[-0.51739626, -1.16344194]])
data[names == 'Bob',3]
array([-0.0983769 , -1.16344194])
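The note above about the length of the Boolean array can be tested with a sketch like this. The array data and the name short_mask are made up for illustration, and the behavior described in the comments is what I would expect from a recent NumPy release; very old releases may behave differently.

```python
import numpy as np

data = np.arange(28).reshape(7, 4)
short_mask = np.array([True, False, True])  # length 3, but data has 7 rows

try:
    data[short_mask]
    outcome = "no error"
except IndexError as exc:
    # Recent NumPy versions report that the Boolean index does not
    # match the indexed array along that dimension
    outcome = "IndexError"
    print(exc)

print(outcome)
```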
To select everything except 'Bob', use != or negate the condition with ~
names != 'Bob'
array([False, True, True, False, True, True, True])
data[~(names == 'Bob')]
array([[-0.14467573, -1.73207863, -0.20321916, 0.75697117],
[ 1.38042424, -1.31551497, 2.10397966, 1.98598204],
[ 0.07882572, -0.68212957, 0.59073925, 1.49971538],
[ 0.13222977, -1.45147521, 0.54796917, 1.19053359],
[-1.02140787, 0.9426649 , -0.75485246, 0.20162042]])
cond = names == 'Bob'
data[~cond]
array([[-0.14467573, -1.73207863, -0.20321916, 0.75697117],
[ 1.38042424, -1.31551497, 2.10397966, 1.98598204],
[ 0.07882572, -0.68212957, 0.59073925, 1.49971538],
[ 0.13222977, -1.45147521, 0.54796917, 1.19053359],
[-1.02140787, 0.9426649 , -0.75485246, 0.20162042]])
mask = (names == 'Bob') | (names == 'Will')
mask
array([ True, False, True, True, True, False, False])
data[mask]
array([[-0.16858164, -0.33108982, 0.68263748, -0.0983769 ],
[ 1.38042424, -1.31551497, 2.10397966, 1.98598204],
[-0.20164359, 0.81705695, -0.51739626, -1.16344194],
[ 0.07882572, -0.68212957, 0.59073925, 1.49971538]])
Note: the Python keywords and and or do not work with Boolean arrays; use & (and) and | (or) instead
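A short sketch of why the keywords fail: and asks Python for the truth value of a whole array, which NumPy refuses to give for arrays with more than one element. The shortened names array is my own example.

```python
import numpy as np

names = np.array(['Bob', 'Joe', 'Will', 'Bob'])

try:
    bad = (names == 'Bob') and (names == 'Will')
    raised = False
except ValueError as exc:
    # "The truth value of an array with more than one element is ambiguous"
    raised = True
    print(exc)

# The element-wise operator works as intended
mask = (names == 'Bob') | (names == 'Will')
print(mask)  # [ True False  True  True]
```

The parentheses around each comparison matter, because | binds more tightly than ==.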
data[data < 0]=0
data
array([[0. , 0. , 0.68263748, 0. ],
[0. , 0. , 0. , 0.75697117],
[1.38042424, 0. , 2.10397966, 1.98598204],
[0. , 0.81705695, 0. , 0. ],
[0.07882572, 0. , 0.59073925, 1.49971538],
[0.13222977, 0. , 0.54796917, 1.19053359],
[0. , 0.9426649 , 0. , 0.20162042]])
names != 'Joe'
array([ True, False, True, True, True, False, False])
data[names != 'Joe']=7
data
array([[7. , 7. , 7. , 7. ],
[0. , 0. , 0. , 0.75697117],
[7. , 7. , 7. , 7. ],
[7. , 7. , 7. , 7. ],
[7. , 7. , 7. , 7. ],
[0.13222977, 0. , 0.54796917, 1.19053359],
[0. , 0.9426649 , 0. , 0.20162042]])
4.1.6 Fancy indexing
arr = np.empty((8,4))
for i in range(8):
arr[i]=i
arr
array([[0., 0., 0., 0.],
[1., 1., 1., 1.],
[2., 2., 2., 2.],
[3., 3., 3., 3.],
[4., 4., 4., 4.],
[5., 5., 5., 5.],
[6., 6., 6., 6.],
[7., 7., 7., 7.]])
Select a subset of rows in a particular order by passing a list or array of integers
arr[[4,3,0,6]]
array([[4., 4., 4., 4.],
[3., 3., 3., 3.],
[0., 0., 0., 0.],
[6., 6., 6., 6.]])
Negative indices select rows counting from the end
arr[[-3,-5,-7]]
array([[5., 5., 5., 5.],
[3., 3., 3., 3.],
[1., 1., 1., 1.]])
arr = np.arange(32).reshape((8,4))
arr
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11],
[12, 13, 14, 15],
[16, 17, 18, 19],
[20, 21, 22, 23],
[24, 25, 26, 27],
[28, 29, 30, 31]])
# Passing two index lists selects the individual elements (1,0), (5,3), (7,1) and (2,2)
arr[[1,5,7,2],[0,3,1,2]]
array([ 4, 23, 29, 10])
# Select the rows first, then reorder the columns, to get a rectangular region
arr[[1,5,7,2]][:,[0,3,1,2]]
array([[ 4, 7, 5, 6],
[20, 23, 21, 22],
[28, 31, 29, 30],
[ 8, 11, 9, 10]])
Fancy indexing, unlike slicing, always copies the data into a new array
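The copy behavior and the two selection styles can be checked with the sketch below. np.ix_ is not used in these notes; I include it as an alternative way to write the rectangular selection.

```python
import numpy as np

arr = np.arange(32).reshape((8, 4))

# Indexing with an integer list returns a new array, not a view
picked = arr[[1, 5, 7, 2]]
picked[0, 0] = -1                  # arr is left untouched

# Pairwise selection: elements (1,0), (5,3), (7,1), (2,2)
pairs = arr[[1, 5, 7, 2], [0, 3, 1, 2]]

# Rectangular selection: rows [1,5,7,2] crossed with columns [0,3,1,2]
block = arr[np.ix_([1, 5, 7, 2], [0, 3, 1, 2])]

print(arr[1, 0])   # still 4
print(pairs)       # [ 4 23 29 10]
print(block)
```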
4.1.7 Transposing arrays and swapping axes
Transposing is a special form of reshaping that returns a view on the underlying data without copying anything. Arrays have a transpose method and also the special T attribute.
arr = np.arange(15).reshape((3,5))
arr
array([[ 0, 1, 2, 3, 4],
[ 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14]])
arr.T
array([[ 0, 5, 10],
[ 1, 6, 11],
[ 2, 7, 12],
[ 3, 8, 13],
[ 4, 9, 14]])
Matrix inner products such as X^T X can be computed with np.dot
arr = np.random.randn(6,3)
arr
array([[-0.23144783, -1.53102926, -0.2230637 ],
[ 1.65451328, -0.74725816, -0.64295544],
[ 1.78178001, 0.19446786, -1.34621907],
[ 0.12343761, 1.37570397, -0.92405543],
[ 1.12624911, -1.76795706, -1.18655746],
[ 0.92947622, 2.64016736, -1.06539457]])
np.dot(arr.T,arr)
array([[ 8.11332223, 0.09713011, -5.85149832],
[ 0.09713011, 14.92898035, -1.42608964],
[-5.85149832, -1.42608964, 5.67231753]])
For higher-dimensional arrays, transpose accepts a tuple of axis numbers that specifies how to permute the axes
arr = np.arange(16).reshape((2,2,4))
arr
array([[[ 0, 1, 2, 3],
[ 4, 5, 6, 7]],
[[ 8, 9, 10, 11],
[12, 13, 14, 15]]])
arr.transpose(1,0,2)
array([[[ 0, 1, 2, 3],
[ 8, 9, 10, 11]],
[[ 4, 5, 6, 7],
[12, 13, 14, 15]]])
Here the axes have been reordered so that the original second axis comes first and the original first axis comes second, while the last axis is unchanged
ndarray also has a swapaxes method, which takes a pair of axis numbers and swaps those two axes to rearrange the data
arr
array([[[ 0, 1, 2, 3],
[ 4, 5, 6, 7]],
[[ 8, 9, 10, 11],
[12, 13, 14, 15]]])
arr.swapaxes(1,2)
array([[[ 0, 4],
[ 1, 5],
[ 2, 6],
[ 3, 7]],
[[ 8, 12],
[ 9, 13],
[10, 14],
[11, 15]]])
swapaxes likewise returns a view on the data without making a copy
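One way to read an axis permutation is as a rule on indices. The sketch below spells that rule out for the two calls used above and confirms that both results are views; the names t and s are mine.

```python
import numpy as np

arr = np.arange(16).reshape((2, 2, 4))

t = arr.transpose(1, 0, 2)   # t[i, j, k] == arr[j, i, k]
s = arr.swapaxes(1, 2)       # s[i, j, k] == arr[i, k, j]

print(t.shape, s.shape)           # (2, 2, 4) (2, 4, 2)
print(t[1, 0, 3], arr[0, 1, 3])   # 7 7
print(s[1, 3, 0], arr[1, 0, 3])   # 11 11

# Neither call copies the data
print(np.shares_memory(arr, t), np.shares_memory(arr, s))
```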
4.2 Universal Functions: Fast Element-wise Array Functions
arr = np.arange(10)
arr
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
# Square root
np.sqrt(arr)
array([0. , 1. , 1.41421356, 1.73205081, 2. ,
2.23606798, 2.44948974, 2.64575131, 2.82842712, 3. ])
# Square
np.square(arr)
array([ 0, 1, 4, 9, 16, 25, 36, 49, 64, 81], dtype=int32)
# Exponential (e raised to each element)
np.exp(arr)
array([1.00000000e+00, 2.71828183e+00, 7.38905610e+00, 2.00855369e+01,
5.45981500e+01, 1.48413159e+02, 4.03428793e+02, 1.09663316e+03,
2.98095799e+03, 8.10308393e+03])
Binary universal functions take two arrays and return a single array as the result
x = np.random.randn(8)
y = np.random.randn(8)
x
array([ 0.43774471, 0.30353109, -0.4385476 , -0.07085461, -0.41682892,
1.74171657, 0.22694261, 0.48012626])
y
array([ 0.38091604, 0.7351168 , 0.04363922, 0.39276555, -0.11270609,
-0.68831551, -0.64187507, 0.2514712 ])
# Compute the element-wise maximum of x and y
np.maximum(x,y)
array([ 0.43774471, 0.7351168 , 0.04363922, 0.39276555, -0.11270609,
1.74171657, 0.22694261, 0.48012626])
Some universal functions return multiple arrays. For example, modf, which the book describes as a vectorized version of Python's built-in divmod, returns the fractional and integral parts of a floating-point array
arr = np.random.randn(7)*5
arr
array([ 0.69713224, -0.39436563, -1.4239261 , 10.89444784,
8.31602522, -0.52237816, -10.31292285])
remainder, whole_part = np.modf(arr)
remainder
array([ 0.69713224, -0.39436563, -0.4239261 , 0.89444784, 0.31602522,
-0.52237816, -0.31292285])
whole_part
array([ 0., -0., -1., 10., 8., -0., -10.])
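For negative inputs np.modf and divmod do not agree, which the following sketch illustrates. The sample values are chosen to be exactly representable in binary floating point; the comparison with math.modf is my own addition.

```python
import math
import numpy as np

arr = np.array([0.75, -1.5, 10.25, -10.5])
frac, whole = np.modf(arr)

print(frac)    # [ 0.75 -0.5   0.25 -0.5 ]
print(whole)   # [  0.  -1.  10. -10.]

# Both parts carry the sign of the input, like math.modf ...
print(math.modf(-1.5))   # (-0.5, -1.0)
# ... whereas divmod floors the quotient
print(divmod(-1.5, 1))   # (-2.0, 0.5)
```

So the divmod analogy is best read loosely: np.modf splits each value into two parts that add back up to it.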
arr
array([ 0.69713224, -0.39436563, -1.4239261 , 10.89444784,
8.31602522, -0.52237816, -10.31292285])
np.sqrt(arr)
<ipython-input-85-b58949107b3d>:1: RuntimeWarning: invalid value encountered in sqrt
np.sqrt(arr)
array([0.83494446, nan, nan, 3.30067385, 2.88375193,
nan, nan])
# The optional second argument is the output array, so this call overwrites arr in place
np.sqrt(arr,arr)
<ipython-input-86-e3ca18b15869>:1: RuntimeWarning: invalid value encountered in sqrt
np.sqrt(arr,arr)
array([0.83494446, nan, nan, 3.30067385, 2.88375193,
nan, nan])
arr
array([0.83494446, nan, nan, 3.30067385, 2.88375193,
nan, nan])
4.3 Array-Oriented Programming with Arrays
Suppose we want to evaluate the function sqrt(x^2 + y^2) across a regular grid of values. The np.meshgrid function takes two one-dimensional arrays and produces two two-dimensional matrices corresponding to all (x, y) pairs from the two arrays.
# 1000 equally spaced points from -5 up to (but not including) 5
points = np.arange(-5,5,0.01)
# Build the two coordinate matrices
xs, ys = np.meshgrid(points,points)
ys
array([[-5. , -5. , -5. , ..., -5. , -5. , -5. ],
[-4.99, -4.99, -4.99, ..., -4.99, -4.99, -4.99],
[-4.98, -4.98, -4.98, ..., -4.98, -4.98, -4.98],
...,
[ 4.97, 4.97, 4.97, ..., 4.97, 4.97, 4.97],
[ 4.98, 4.98, 4.98, ..., 4.98, 4.98, 4.98],
[ 4.99, 4.99, 4.99, ..., 4.99, 4.99, 4.99]])
xs
array([[-5. , -4.99, -4.98, ..., 4.97, 4.98, 4.99],
[-5. , -4.99, -4.98, ..., 4.97, 4.98, 4.99],
[-5. , -4.99, -4.98, ..., 4.97, 4.98, 4.99],
...,
[-5. , -4.99, -4.98, ..., 4.97, 4.98, 4.99],
[-5. , -4.99, -4.98, ..., 4.97, 4.98, 4.99],
[-5. , -4.99, -4.98, ..., 4.97, 4.98, 4.99]])
# Evaluate the function on the grid
z = np.sqrt(xs ** 2 + ys ** 2)
z
array([[7.07106781, 7.06400028, 7.05693985, ..., 7.04988652, 7.05693985,
7.06400028],
[7.06400028, 7.05692568, 7.04985815, ..., 7.04279774, 7.04985815,
7.05692568],
[7.05693985, 7.04985815, 7.04278354, ..., 7.03571603, 7.04278354,
7.04985815],
...,
[7.04988652, 7.04279774, 7.03571603, ..., 7.0286414 , 7.03571603,
7.04279774],
[7.05693985, 7.04985815, 7.04278354, ..., 7.03571603, 7.04278354,
7.04985815],
[7.06400028, 7.05692568, 7.04985815, ..., 7.04279774, 7.04985815,
7.05692568]])
Generate visualizations of two-dimensional arrays using matplotlib
import matplotlib.pyplot as plt
plt.imshow(z,cmap=plt.cm.gray)
plt.colorbar()
# Set the title
plt.title('sqrt(x^2+y^2)')
Text(0.5, 1.0, 'sqrt(x^2+y^2)')
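With a thousand points the structure of the meshgrid output is hard to see, so here is the same computation on a tiny grid. The input values are picked so that the results are easy to verify by hand.

```python
import numpy as np

x = np.array([0., 3.])
y = np.array([0., 4., 8.])

# xs repeats x along the rows, ys repeats y down the columns;
# both have shape (len(y), len(x))
xs, ys = np.meshgrid(x, y)
z = np.sqrt(xs ** 2 + ys ** 2)

print(xs)
print(ys)
print(z[1, 1])   # 5.0, the distance of the point (3, 4) from the origin
```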
4.3.1 Expressing conditional logic as array operations
The np.where function is a vectorized version of the ternary expression x if condition else y
xarr = np.array([1.1,1.2,1.3,1.4,1.5])
yarr = np.array([2.1,2.2,2.3,2.4,2.5])
cond = np.array([True,False,True,True,False])
result = [(x if c else y)for x,y,c in zip(xarr,yarr,cond)]
result
[1.1, 2.2, 1.3, 1.4, 2.5]
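This approach has two problems: it is slow for large arrays, because all the work is done in interpreted Python code, and it does not work with multidimensional arrays. With np.where the same result can be written very concisely
result = np.where(cond,xarr,yarr)  # the second and third arguments need not be arrays; either can be a scalar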
result
array([1.1, 2.2, 1.3, 1.4, 2.5])
arr = np.random.randn(4,4)
arr
array([[ 1.45673658, 0.97095783, -0.90075114, -0.86810283],
[ 0.7691019 , -1.44098307, 1.23655136, -0.0863179 ],
[-0.26002458, -0.44007831, -0.64002542, 0.58748434],
[ 1.23704204, -1.42979856, 1.10834965, 0.50134018]])
arr>0
array([[ True, True, False, False],
[ True, False, True, False],
[False, False, False, True],
[ True, False, True, True]])
# Replace all positive values with 2 and all other values with -2
np.where(arr>0,2,-2)
array([[ 2, 2, -2, -2],
[ 2, -2, 2, -2],
[-2, -2, -2, 2],
[ 2, -2, 2, 2]])
# Replace only the positive values with 2
np.where(arr>0,2,arr)
array([[ 2. , 2. , -0.90075114, -0.86810283],
[ 2. , -1.44098307, 2. , -0.0863179 ],
[-0.26002458, -0.44007831, -0.64002542, 2. ],
[ 2. , -1.42979856, 2. , 2. ]])
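Because the second and third arguments can themselves be arrays, calls to np.where can be nested to express more than two cases. The three-way sign example below is my own illustration, not from the book.

```python
import numpy as np

arr = np.array([[1.5, -0.5],
                [-2.0, 0.0]])

# 1 where positive, -1 where negative, 0 otherwise
signs = np.where(arr > 0, 1, np.where(arr < 0, -1, 0))
print(signs)
```

Note that zero is neither positive nor negative, so it falls through to the last case.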
4.3.2 Mathematical and statistical methods
# Generate some data
arr = np.random.randn(5,4)
arr
array([[-0.24008142, -0.08617688, 0.42879457, -1.05699554],
[-0.86102647, -0.01481326, -0.49326453, -0.51728933],
[-1.04369519, -0.07668856, 0.12641113, -0.34170659],
[-0.34358427, -1.19146826, 0.79855649, -0.56526347],
[ 0.34119469, 0.60338427, 0.23612535, 1.70667616]])
# Mean
arr.mean()
-0.1295455547355409
np.mean(arr)
-0.1295455547355409
# Sum
arr.sum()
-2.5909110947108185
# Mean across the columns (axis=1), giving one value per row
arr.mean(axis=1)
array([-0.23861482, -0.47159839, -0.3339198 , -0.32543988, 0.72184512])
# Sum down the rows (axis=0), giving one value per column
arr.sum(axis=0)
array([-2.14719266, -0.76576269, 1.09662301, -0.77457876])
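The axis argument is easier to follow on an array whose sums can be checked by hand. A minimal sketch with a 2x3 array of my own choosing:

```python
import numpy as np

arr = np.arange(6).reshape(2, 3)
# [[0 1 2]
#  [3 4 5]]

col_sums = arr.sum(axis=0)     # collapse the rows: one value per column
row_sums = arr.sum(axis=1)     # collapse the columns: one value per row
row_means = arr.mean(axis=1)

print(col_sums)    # [3 5 7]
print(row_sums)    # [ 3 12]
print(row_means)   # [1. 4.]
```

The axis you name is the one that disappears from the result's shape.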
arr = np.array([0,1,2,3,4,5,6,7])
# Cumulative sum of the elements, starting from 0
arr.cumsum()
array([ 0, 1, 3, 6, 10, 15, 21, 28], dtype=int32)
arr = np.array([[0,1,2],[3,4,5],[6,7,8]])
arr
array([[0, 1, 2],
[3, 4, 5],
[6, 7, 8]])
arr.cumsum(axis=0)
array([[ 0, 1, 2],
[ 3, 5, 7],
[ 9, 12, 15]], dtype=int32)
# Cumulative product of the elements (starting from 1), here along each row (axis=1)
arr.cumprod(axis=1)
array([[ 0, 0, 0],
[ 3, 12, 60],
[ 6, 42, 336]], dtype=int32)
4.3.3 Methods for Arrays of Boolean Values
arr = np.random.randn(100)
# Count the positive values
(arr>0).sum()
51
bools = np.array([False,False,True,False])
bools.any()  # is at least one value True?
True
bools.all()  # are all values True?
False
4.3.4 Sorting
arr = np.random.randn(6)
arr
array([-0.28600425, 0.20138334, 0.61513703, -1.54104191, 0.71169457,
1.28541225])
arr.sort()  # sort in place
arr
array([-1.54104191, -0.28600425, 0.20138334, 0.61513703, 0.71169457,
1.28541225])
arr = np.random.randn(5,3)
arr
array([[ 0.44551524, 0.22691436, -1.49874737],
[ 0.36256785, 1.19204608, 0.31673416],
[ 0.07827487, 0.64557507, -1.31371171],
[-1.01458161, -0.82770194, -0.06353473],
[-0.40078359, 2.48821946, -0.50991488]])
arr.sort(1)
arr
array([[-1.49874737, 0.22691436, 0.44551524],
[ 0.31673416, 0.36256785, 1.19204608],
[-1.31371171, 0.07827487, 0.64557507],
[-1.01458161, -0.82770194, -0.06353473],
[-0.50991488, -0.40078359, 2.48821946]])
# A quick way to estimate a quantile: sort the array and pick the value at the corresponding rank (here the 5% quantile)
large_arr = np.random.randn(1000)
large_arr.sort()
large_arr[int(0.05*len(large_arr))]
-1.7200330679547906
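The rank-based estimate can be compared with np.quantile, which interpolates between neighboring order statistics. This sketch assumes a fixed seed so that it is repeatable, and uses np.sort (which returns a sorted copy) so the original order is preserved.

```python
import numpy as np

rng = np.random.RandomState(0)    # fixed seed, chosen arbitrarily
large_arr = rng.randn(1000)

sorted_arr = np.sort(large_arr)   # sorted copy; large_arr is unchanged
by_index = sorted_arr[int(0.05 * len(sorted_arr))]

# Default linear interpolation between the two nearest order statistics
by_quantile = np.quantile(large_arr, 0.05)

print(by_index, by_quantile)
```

The two numbers are close but generally not identical, since the rank-based version simply picks one element.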
4.3.5 Unique Values and Other Set Logic
np.unique returns the sorted unique values in an array
names = np.array(['Bob','Joe','Will','Bob','Will','Joe','Joe'])
np.unique(names)
array(['Bob', 'Joe', 'Will'], dtype='<U4')
ints = np.array([3,3,3,2,2,1,1,4,4])
np.unique(ints)
array([1, 2, 3, 4])
The pure Python equivalent of np.unique, for comparison
sorted(set(names))
['Bob', 'Joe', 'Will']
np.in1d tests whether each value of one array is contained in another and returns a Boolean array
values = np.array([6,0,0,3,2,5,6])
np.in1d(values,[2,3,6])
array([ True, False, False, True, True, False, True])
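A sketch of the same membership test with np.isin, together with a few of the other set routines. As far as I know, np.isin is the name recommended in newer NumPy releases, where np.in1d is deprecated; treat that as something to check against the version you have installed.

```python
import numpy as np

values = np.array([6, 0, 0, 3, 2, 5, 6])
targets = [2, 3, 6]

mask = np.isin(values, targets)            # element-wise membership test
common = np.intersect1d(values, targets)   # sorted values present in both
union = np.union1d(values, targets)        # sorted values present in either
diff = np.setdiff1d(values, targets)       # values in `values` but not in `targets`

print(mask)
print(common, union, diff)
```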
4.4 File input and output with arrays
np.save and np.load are the two workhorse functions for efficiently saving and loading array data on disk. Arrays are saved by default in an uncompressed raw binary format with the file extension .npy.
arr = np.arange(10)
arr
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
np.save('some_array',arr)
np.load('some_array.npy')
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
np.savez saves multiple arrays in a single uncompressed archive; the arrays are passed as keyword arguments. Loading the resulting .npz file gives a dict-like object that loads each array lazily.
np.savez('array_archive.npz',a=arr,b=arr)
arch = np.load('array_archive.npz')
arch['a']
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
arch['b']
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
If the data compresses well, you can use np.savez_compressed instead.
np.savez_compressed('arrays_compressed.npz',a=arr,b=arr)
arch1 = np.load('arrays_compressed.npz')
arch1['a']
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
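A round trip that cleans up after itself might look like the sketch below. Writing into a temporary directory and opening the archive in a with block are my own choices here, so that the example leaves no files behind and closes the archive explicitly.

```python
import os
import tempfile
import numpy as np

arr = np.arange(10)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'arrays_compressed.npz')
    np.savez_compressed(path, a=arr, b=arr * 2)

    # The loaded archive is dict-like; .files lists the stored names
    with np.load(path) as arch:
        keys = sorted(arch.files)
        a = arch['a']
        b = arch['b']

print(keys)
print(a, b)
```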
4.5 Linear Algebra
x = np.array([[1.,2.,3.],[4.,5.,6.]])
y = np.array([[6.,23.],[-1,7],[8,9]])
x
array([[1., 2., 3.],
[4., 5., 6.]])
y
array([[ 6., 23.],
[-1., 7.],
[ 8., 9.]])
x.dot(y)
array([[ 28., 64.],
[ 67., 181.]])
x.dot(y) is equivalent to np.dot(x,y)
np.dot(x,y)
array([[ 28., 64.],
[ 67., 181.]])
np.dot(x,np.ones(3))
array([ 6., 15.])
The @ symbol also works as an infix operator that performs matrix multiplication
x @ np.ones(3)
array([ 6., 15.])
numpy.linalg has a standard set of matrix decompositions, as well as functions for common operations such as computing an inverse or a determinant
from numpy.linalg import inv, qr
X = np.random.randn(5,5)
mat = X.T.dot(X)
mat
array([[ 6.88097643, -0.40153042, -0.11773682, 4.82061317, -0.00948514],
[-0.40153042, 2.93777143, 2.28436549, -3.33712964, 0.27895677],
[-0.11773682, 2.28436549, 2.34334495, -1.8758072 , 0.8700664 ],
[ 4.82061317, -3.33712964, -1.8758072 , 8.08801733, -1.40096259],
[-0.00948514, 0.27895677, 0.8700664 , -1.40096259, 5.84629622]])
# Inverse
inv(mat)
array([[ 1.6344894 , -5.30599418, 3.59242482, -2.48157756,
-0.87347595],
[ -5.30599418, 20.96406817, -14.76492044, 8.96595671,
3.3369904 ],
[ 3.59242482, -14.76492044, 11.00957592, -6.09349382,
-2.38834458],
[ -2.48157756, 8.96595671, -6.09349382, 4.14309849,
1.46783835],
[ -0.87347595, 3.3369904 , -2.38834458, 1.46783835,
0.71759003]])
mat.dot(inv(mat))
array([[ 1.00000000e+00, -3.62276349e-15, 2.97708758e-15,
1.00365641e-15, -1.28563628e-15],
[ 1.16986541e-15, 1.00000000e+00, -1.04489259e-15,
-1.43234162e-15, 1.85031636e-16],
[-9.17892019e-16, 5.90211635e-15, 1.00000000e+00,
1.54913164e-15, 7.28854924e-16],
[ 1.67430797e-15, -6.24584441e-15, 2.02034005e-15,
1.00000000e+00, -1.38074650e-15],
[ 6.79495103e-16, -4.09884887e-15, 3.43563276e-15,
-1.61484598e-15, 1.00000000e+00]])
# QR decomposition
q,r = qr(mat)
r
array([[-8.41197519, 2.41336148, 1.31408829, -8.76534057, 0.8426876 ],
[ 0. , -4.40454207, -3.50602341, 5.0520601 , -1.60816204],
[ 0. , 0. , -0.98999659, -0.25669779, -3.68304671],
[ 0. , 0. , 0. , -1.68870316, 4.4795456 ],
[ 0. , 0. , 0. , 0. , 0.22210084]])
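Rather than reading the printed matrices by eye, the results can be checked numerically with np.allclose. The seed and the tolerance below are arbitrary choices of mine.

```python
import numpy as np
from numpy.linalg import inv, qr

rng = np.random.RandomState(0)   # fixed seed so the run is repeatable
X = rng.randn(5, 5)
mat = X.T @ X

# mat times its inverse should be the identity, up to rounding error
identity_ok = np.allclose(mat @ inv(mat), np.eye(5), atol=1e-6)

q, r = qr(mat)
qr_ok = np.allclose(q @ r, mat)                            # factors reproduce mat
orthogonal_ok = np.allclose(q.T @ q, np.eye(5), atol=1e-6) # q has orthonormal columns
upper_ok = np.allclose(r, np.triu(r))                      # r is upper triangular

print(identity_ok, qr_ok, orthogonal_ok, upper_ok)
```

The tiny off-diagonal values in the mat.dot(inv(mat)) output above are this kind of rounding error.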
4.6 Pseudo-random number generation
Use normal to draw a 4x4 array of samples from the standard normal distribution
samples = np.random.normal(size=(4,4))
samples
array([[-1.08982894, -0.38664288, 0.08795078, -0.58766288],
[-0.55362143, 0.53318817, -1.24544404, -0.28009587],
[-0.62227897, -0.96513278, 0.94540138, -0.1743617 ],
[-1.02020369, 0.44070475, 0.16880846, 1.32297271]])
numpy.random is well over an order of magnitude faster than Python's built-in random module for generating very large samples
from random import normalvariate
N = 1000000
%timeit samples = [normalvariate(0,1) for _ in range(N)]
966 ms ± 23 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit np.random.normal(size=N)
29.7 ms ± 517 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
np.random.seed(1234)  # set the global random seed
# To avoid global state, use numpy.random.RandomState to create a random number generator that is isolated from other random state
rng = np.random.RandomState(1234)
rng.randn(10)
array([ 0.47143516, -1.19097569, 1.43270697, -0.3126519 , -0.72058873,
0.88716294, 0.85958841, -0.6365235 , 0.01569637, -2.24268495])
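Newer NumPy releases also offer np.random.default_rng, which returns a Generator object that plays the same role as an isolated RandomState. This is not covered in these notes; the sketch below only shows that two generators built from the same seed produce the same stream.

```python
import numpy as np

rng_a = np.random.default_rng(1234)
rng_b = np.random.default_rng(1234)

# Same seed, same sequence, independent of any global state
sample_a = rng_a.standard_normal(5)
sample_b = rng_b.standard_normal(5)
print(np.array_equal(sample_a, sample_b))   # True

# integers(0, 2) draws from {0, 1}; the upper bound is exclusive
ints = rng_a.integers(0, 2, size=10)
print(ints)
```

The numbers differ from what RandomState(1234) produces, because the two classes use different algorithms.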
4.7 Random walks
# A single random walk of 1000 steps in pure Python
import random
position = 0
walk = [position]
steps = 1000
for i in range(steps):
step = 1 if random.randint(0,1) else -1
position += step
walk.append(position)
plt.plot(walk[:100])
# The same walk vectorized: 1000 coin flips, each mapped to a step of 1 or -1
nsteps = 1000
draws = np.random.randint(0,2,size=nsteps)
steps = np.where(draws>0,1,-1)
walk = steps.cumsum()
walk.min()
-9
walk.max()
60
plt.plot(walk[:100])
[<matplotlib.lines.Line2D at 0x2225fe86e20>]
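# np.abs(walk)>=10 marks the steps at which the walk is at least 10 away from the origin; argmax() returns the first index of the maximum value in the Boolean array (True is the maximum), i.e. the first crossing time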
(np.abs(walk)>=10).argmax()
297
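One caveat with argmax on a Boolean array: if no element is True it still returns 0, which looks like a crossing at the very first step. A sketch of a guarded version, using a short hand-written walk so the answers are easy to verify:

```python
import numpy as np

walk = np.array([1, 2, 1, 0, -1, -2, -3, -2])

# Distance 3 is first reached at index 6
reached3 = np.abs(walk) >= 3
first3 = reached3.argmax() if reached3.any() else None

# Distance 10 is never reached, yet argmax alone reports index 0
reached10 = np.abs(walk) >= 10
naive = reached10.argmax()
first10 = reached10.argmax() if reached10.any() else None

print(first3, naive, first10)   # 6 0 None
```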
4.7.1 Simulating multiple random walks at once
# Compute the cumulative sum across each row to get all 5000 random walks at once
nwalks = 5000
nsteps = 1000
draws = np.random.randint(0,2,size=(nwalks,nsteps))  # each draw is 0 or 1
steps = np.where(draws>0,1,-1)
walks = steps.cumsum(1)
walks
array([[ 1, 2, 3, ..., 46, 47, 46],
[ 1, 0, 1, ..., 40, 41, 42],
[ 1, 2, 3, ..., -26, -27, -28],
...,
[ 1, 0, 1, ..., 64, 65, 66],
[ 1, 2, 1, ..., 2, 1, 0],
[ -1, -2, -3, ..., 32, 33, 34]], dtype=int32)
plt.plot(walk[:100])
[<matplotlib.lines.Line2D at 0x2225fe82fd0>]
walks.max()
122
walks.min()
-128
# Compute the minimum crossing time to 30 or -30
# Use the any method to find which walks get that far
hits30 = (np.abs(walks)>30).any(1)
hits30
array([ True, True, True, ..., True, False, True])
hits30.sum()  # number of walks whose absolute position exceeds 30
3210
# Select the rows of walks that cross the threshold, then use argmax along axis 1 to get the first crossing time of each
crossing_times = (np.abs(walks[hits30])>=30).argmax(1)
crossing_times.mean()
501.89283489096573