Data Analysis: Representation

Data Access and Functions

Reading and Writing CSV Files

CSV (Comma-Separated Values)

CSV is a common file format for storing batches of data.
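np.savetxt(frame, array, fmt='%.18e', delimiter=' ')
- frame: a file, a string, or a generator; it may be a .gz or .bz2 compressed file.
- array: the array to write to the file.
- fmt: the format used when writing, for example %d, %.2f, or %.18e.
- delimiter: the string that separates columns; the default is a single space.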
```
In [1]: import numpy as np

In [2]: a = np.arange(100).reshape(5, 20)

In [3]: np.savetxt('a.csv', a, fmt = '%d', delimiter = ',')

In [4]: np.savetxt('a.csv', a, fmt = '%.1f', delimiter = ',')
```

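np.loadtxt(frame, dtype=float, delimiter=None, unpack=False)
- frame: a file, a string, or a generator; it may be a .gz or .bz2 compressed file.
- dtype: the data type of the result; optional.
- delimiter: the string that separates columns; the default is any whitespace.
- unpack: if True, the columns are returned separately so that each can be assigned to its own variable.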

In [5]: b = np.loadtxt('a.csv', delimiter = ',')

In [6]: b
Out[6]:
array([[  0.,   1.,   2.,   3.,   4.,   5.,   6.,   7.,   8.,   9.,  10.,
         11.,  12.,  13.,  14.,  15.,  16.,  17.,  18.,  19.],
       [ 20.,  21.,  22.,  23.,  24.,  25.,  26.,  27.,  28.,  29.,  30.,
         31.,  32.,  33.,  34.,  35.,  36.,  37.,  38.,  39.],
       [ 40.,  41.,  42.,  43.,  44.,  45.,  46.,  47.,  48.,  49.,  50.,
         51.,  52.,  53.,  54.,  55.,  56.,  57.,  58.,  59.],
       [ 60.,  61.,  62.,  63.,  64.,  65.,  66.,  67.,  68.,  69.,  70.,
         71.,  72.,  73.,  74.,  75.,  76.,  77.,  78.,  79.],
       [ 80.,  81.,  82.,  83.,  84.,  85.,  86.,  87.,  88.,  89.,  90.,
         91.,  92.,  93.,  94.,  95.,  96.,  97.,  98.,  99.]])

In [7]: b = np.loadtxt('a.csv', dtype = int, delimiter = ',')

In [8]: b
Out[8]:
array([[ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
        17, 18, 19],
       [20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36,
        37, 38, 39],
       [40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56,
        57, 58, 59],
       [60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76,
        77, 78, 79],
       [80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96,
        97, 98, 99]])
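
The unpack parameter is not exercised in the session above. The following is a minimal sketch of how it might be used; the file name table.csv and the variable names are assumptions chosen for illustration, and the file is written to a temporary directory.

```python
import os
import tempfile

import numpy as np

# Write a small 3-row, 4-column integer table to a temporary CSV file.
table = np.arange(12).reshape(3, 4)
path = os.path.join(tempfile.mkdtemp(), "table.csv")  # hypothetical file name
np.savetxt(path, table, fmt="%d", delimiter=",")

# unpack=True transposes the result, so each column can be bound to its own name.
c0, c1, c2, c3 = np.loadtxt(path, dtype=int, delimiter=",", unpack=True)

print(c0)  # first column of the table
print(c3)  # last column of the table
```

With unpack=False (the default), the same call would return a single 3x4 array instead.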

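Limitations of CSV files
- CSV can only store one- and two-dimensional arrays effectively.
- np.savetxt() and np.loadtxt() can therefore only read and write one- and two-dimensional arrays effectively.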
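
To make the limitation concrete, the sketch below tries to save a three-dimensional array and then shows one possible workaround: flatten to two dimensions before saving and restore the shape by hand after loading. The file name and the choice of reshape are assumptions for illustration, not a recommended storage scheme.

```python
import os
import tempfile

import numpy as np

cube = np.arange(24).reshape(2, 3, 4)
path = os.path.join(tempfile.mkdtemp(), "cube.csv")  # hypothetical file name

# Saving the 3-D array directly is rejected by np.savetxt.
try:
    np.savetxt(path, cube, fmt="%d", delimiter=",")
    saved_directly = True
except (ValueError, TypeError):
    saved_directly = False

# Workaround: flatten to 2-D for storage, then restore the known shape by hand.
np.savetxt(path, cube.reshape(cube.shape[0], -1), fmt="%d", delimiter=",")
restored = np.loadtxt(path, dtype=int, delimiter=",").reshape(cube.shape)

print(saved_directly)                  # False
print(np.array_equal(cube, restored))  # True
```

The shape (2, 3, 4) is not stored in the CSV file, so the reader has to know it in advance.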

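Reading and Writing Multidimensional Data

a.tofile(frame, sep='', format='%s')
- frame: a file or a string.
- sep: the string that separates items; if it is an empty string, the file is written in binary.
- format: the format of the written data.

np.fromfile(frame, dtype=float, count=-1, sep='')
- frame: a file or a string.
- dtype: the data type to read.
- count: the number of items to read; -1 reads the whole file.
- sep: the string that separates items; if it is an empty string, the file is read as binary.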

In [1]: import numpy as np

In [2]: a = np.arange(100).reshape((5, 10, 2))

In [3]: a.tofile("b.dat", sep = ',', format = '%d')

In [4]: c = np.fromfile("b.dat", dtype = int, sep = ',')

In [5]: c
Out[5]:
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
       34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50,
       51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67,
       68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84,
       85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99])

In [6]: d = np.arange(100).reshape((5, 10, 2))

In [7]: d.tofile("e.dat", format = '%d')

In [8]: f = np.fromfile("e.dat", dtype = int).reshape((5, 10, 2))

In [9]: f
Out[9]:
array([[[ 0,  1],
        [ 2,  3],
        [ 4,  5],
        [ 6,  7],
        [ 8,  9],
        [10, 11],
        [12, 13],
        [14, 15],
        [16, 17],
        [18, 19]],

       [[20, 21],
        [22, 23],
        [24, 25],
        [26, 27],
        [28, 29],
        [30, 31],
        [32, 33],
        [34, 35],
        [36, 37],
        [38, 39]],

       [[40, 41],
        [42, 43],
        [44, 45],
        [46, 47],
        [48, 49],
        [50, 51],
        [52, 53],
        [54, 55],
        [56, 57],
        [58, 59]],

       [[60, 61],
        [62, 63],
        [64, 65],
        [66, 67],
        [68, 69],
        [70, 71],
        [72, 73],
        [74, 75],
        [76, 77],
        [78, 79]],

       [[80, 81],
        [82, 83],
        [84, 85],
        [86, 87],
        [88, 89],
        [90, 91],
        [92, 93],
        [94, 95],
        [96, 97],
        [98, 99]]])

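This approach requires knowing, at read time, the shape and element type the array had when it was written. a.tofile() and np.fromfile() therefore have to be used together, and the extra information can be stored in a separate metadata file.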
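
One way to keep that extra information is a small JSON sidecar file next to the binary dump. The sketch below is only an illustration under that assumption: the file names b.dat and b.json and the keys "shape" and "dtype" are hypothetical choices, not a NumPy convention.

```python
import json
import os
import tempfile

import numpy as np

workdir = tempfile.mkdtemp()
data_path = os.path.join(workdir, "b.dat")
meta_path = os.path.join(workdir, "b.json")  # hypothetical sidecar file

a = np.arange(100).reshape((5, 10, 2))

# Raw binary dump: neither shape nor dtype is stored in b.dat.
a.tofile(data_path)

# Record what tofile() discards.
with open(meta_path, "w") as fh:
    json.dump({"shape": a.shape, "dtype": a.dtype.str}, fh)

# Reading: recover dtype and shape from the sidecar, then rebuild the array.
with open(meta_path) as fh:
    meta = json.load(fh)
flat = np.fromfile(data_path, dtype=np.dtype(meta["dtype"]))
restored = flat.reshape(meta["shape"])

print(np.array_equal(a, restored))  # True
```

Storing a.dtype.str (for example '<i8') also records the byte order, which matters if the file is read on a different machine.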

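NumPy's Convenient File Access

np.save(fname, array) or np.savez(fname, array)
- fname: the file name; np.save uses the .npy extension, and np.savez writes an .npz archive (uncompressed; np.savez_compressed writes a compressed one).
- array: the array variable.

np.load(fname)
- fname: the file name, with the .npy or .npz extension.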

In [1]: import numpy as np

In [2]: a = np.arange(100).reshape((5, 10, 2))

In [3]: np.save("a.npy", a)

In [4]: b = np.load("a.npy")

In [5]: b
Out[5]:
array([[[ 0,  1],
        [ 2,  3],
        [ 4,  5],
        [ 6,  7],
        [ 8,  9],
        [10, 11],
        [12, 13],
        [14, 15],
        [16, 17],
        [18, 19]],

       [[20, 21],
        [22, 23],
        [24, 25],
        [26, 27],
        [28, 29],
        [30, 31],
        [32, 33],
        [34, 35],
        [36, 37],
        [38, 39]],

       [[40, 41],
        [42, 43],
        [44, 45],
        [46, 47],
        [48, 49],
        [50, 51],
        [52, 53],
        [54, 55],
        [56, 57],
        [58, 59]],

       [[60, 61],
        [62, 63],
        [64, 65],
        [66, 67],
        [68, 69],
        [70, 71],
        [72, 73],
        [74, 75],
        [76, 77],
        [78, 79]],

       [[80, 81],
        [82, 83],
        [84, 85],
        [86, 87],
        [88, 89],
        [90, 91],
        [92, 93],
        [94, 95],
        [96, 97],
        [98, 99]]])
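
The session above only covers np.save. As a complement, here is a minimal sketch of np.savez storing two arrays in one .npz archive and reading them back by name; the file name arrays.npz and the keys ints and floats are assumptions for illustration.

```python
import os
import tempfile

import numpy as np

path = os.path.join(tempfile.mkdtemp(), "arrays.npz")  # hypothetical file name

ints = np.arange(100).reshape((5, 10, 2))
floats = np.linspace(0.0, 1.0, 5)

# Keyword names become the keys used to look the arrays up after loading.
np.savez(path, ints=ints, floats=floats)

# np.load returns a dict-like archive for .npz files.
with np.load(path) as archive:
    names = sorted(archive.files)
    ints_back = archive["ints"]
    floats_back = archive["floats"]

print(names)            # ['floats', 'ints']
print(ints_back.shape)  # (5, 10, 2)
```

Unlike tofile(), both formats keep the shape and dtype, so no reshape is needed after loading.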

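NumPy's Random Number Sublibrary

NumPy's random sublibrary
- np.random.*
- np.random.rand()
- np.random.randn()
- np.random.randint()

Random number functions in np.random

$rand(d_0, d_1, \dots, d_n)$
- Creates an array of random floats with shape $d_0, \dots, d_n$, uniformly distributed over [0, 1).
$randn(d_0, d_1, \dots, d_n)$
- Creates an array of random floats with shape $d_0, \dots, d_n$, drawn from the standard normal distribution.
$randint(low, high, size)$
- Creates a random integer, or an integer array of shape size, with values in the range [low, high).

seed(s)
- Sets the random number seed; s is the given seed value.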

In [1]: import numpy as np

In [2]: a = np.random.rand(3, 4, 5)

In [3]: a
Out[3]:
array([[[ 0.83217382,  0.61118163,  0.30590278,  0.71116985,  0.17603919],
        [ 0.81172826,  0.48699483,  0.81460026,  0.623493  ,  0.55107553],
        [ 0.86620657,  0.90390364,  0.81578716,  0.40891691,  0.2326408 ],
        [ 0.68340764,  0.63524826,  0.83587159,  0.19651474,  0.40335556]],

       [[ 0.08825054,  0.7536702 ,  0.20601117,  0.74694086,  0.50562464],
        [ 0.40090569,  0.55872767,  0.8516932 ,  0.6771278 ,  0.17050715],
        [ 0.2302661 ,  0.40845969,  0.7862477 ,  0.58725055,  0.60185095],
        [ 0.65965904,  0.53130834,  0.07635645,  0.12635217,  0.29800591]],

       [[ 0.83050967,  0.91737354,  0.58201733,  0.34622195,  0.09050391],
        [ 0.35995164,  0.67825314,  0.4664572 ,  0.81583095,  0.45641495],
        [ 0.87039701,  0.19466886,  0.99467247,  0.90765088,  0.14004337],
        [ 0.61259295,  0.3306319 ,  0.92665368,  0.88046747,  0.91534592]]])

In [4]: sn = np.random.randn(3, 4, 5)

In [5]: sn
Out[5]:
array([[[ 1.19890639, -1.04116259,  0.86310116, -0.44739765,  1.69247186],
        [-0.00500273, -1.72542962, -1.85303537,  0.32333834,  0.14319963],
        [-0.34740334,  1.13620253, -0.80952734,  0.5312918 , -0.40971214],
        [-0.16907081,  0.64456188,  0.68710157,  0.65388462, -1.67621268]],

       [[ 0.03925995, -2.42998597,  0.51701801, -0.33539791, -0.0650079 ],
        [-0.64138021,  0.63553016,  1.32243291, -1.22898913,  1.11420398],
        [ 0.55977458,  1.31880826, -1.90351377,  0.17989029,  0.82177797],
        [ 0.439646  ,  0.70095378, -1.13840963, -0.75624264, -0.44955757]],

       [[ 0.91002957,  0.6574978 ,  0.17688196,  0.12094879,  2.6644552 ],
        [ 2.07175029,  1.67643254, -0.55679737,  0.40536107,  1.5346217 ],
        [-0.27773992, -1.19676653,  0.71463593,  0.66457126,  0.14682141],
        [ 0.72601705,  1.4557889 , -1.60036546,  0.87030802, -0.82342311]]])

In [6]: b = np.random.randint(100, 200, (3, 4))

In [7]: b
Out[7]:
array([[171, 145, 171, 120],
       [191, 140, 162, 148],
       [128, 187, 191, 162]])

In [8]: np.random.seed(10)

In [9]: np.random.randint(100, 200, (3, 4))
Out[9]:
array([[109, 115, 164, 128],
       [189, 193, 129, 108],
       [173, 100, 140, 136]])

In [10]: np.random.seed(10)

In [11]: np.random.randint(100, 200, (3, 4))
Out[11]:
array([[109, 115, 164, 128],
       [189, 193, 129, 108],
       [173, 100, 140, 136]])
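
The stated ranges and distributions can be checked empirically. The sketch below is a rough sanity check and not a statistical test; the sample sizes and the seed value are arbitrary assumptions.

```python
import numpy as np

np.random.seed(10)  # fixed seed so the run is repeatable

u = np.random.rand(10000)               # uniform floats
sn = np.random.randn(100000)            # standard normal floats
k = np.random.randint(100, 200, 10000)  # integers

# rand() stays inside [0, 1), and randint() never returns high.
print(u.min() >= 0.0 and u.max() < 1.0)   # True
print(k.min() >= 100 and k.max() <= 199)  # True

# The sample mean and standard deviation of randn() should be near 0 and 1.
print(round(float(sn.mean()), 2), round(float(sn.std()), 2))
```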

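$shuffle(a)$
- Randomly permutes array a along its first axis (axis 0), modifying a in place.
$permutation(a)$
- Returns a new array with the rows of a permuted along the first axis (axis 0); a itself is not modified.

choice(a, size, replace, p)
- Draws elements from the one-dimensional array a with probabilities p to form a new array of shape size; replace indicates whether an element may be drawn more than once and defaults to True.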


In [1]: import numpy as np

In [2]: a = np.random.randint(100, 200, (3, 4))

In [3]: a
Out[3]:
array([[141, 109, 136, 153],
       [115, 118, 192, 152],
       [169, 143, 169, 126]])

In [4]: np.random.shuffle(a)    # array a is modified in place

In [5]: a
Out[5]:
array([[115, 118, 192, 152],
       [169, 143, 169, 126],
       [141, 109, 136, 153]])

In [6]: np.random.shuffle(a)

In [7]: a
Out[7]:
array([[141, 109, 136, 153],
       [169, 143, 169, 126],
       [115, 118, 192, 152]])

In [8]: b = np.random.randint(100, 200, (3, 4))

In [9]: b
Out[9]:
array([[136, 197, 152, 165],
       [195, 154, 132, 155],
       [127, 121, 123, 161]])

In [10]: np.random.permutation(b)   # array b is not modified
Out[10]:
array([[127, 121, 123, 161],
       [195, 154, 132, 155],
       [136, 197, 152, 165]])

In [11]: b
Out[11]:
array([[136, 197, 152, 165],
       [195, 154, 132, 155],
       [127, 121, 123, 161]])

In [12]: c = np.random.randint(100, 200, (8,))

In [13]: c
Out[13]: array([195, 112, 130, 169, 156, 180, 102, 130])

In [14]: np.random.choice(c, (3, 2))
Out[14]:
array([[102, 102],
       [180, 130],
       [130, 130]])

In [15]: np.random.choice(c, (3, 2), replace = False)
Out[15]:
array([[180, 112],
       [130, 130],
       [156, 169]])

In [16]: np.random.choice(c, (3, 2), p = c / np.sum(c))
Out[16]:
array([[130, 130],
       [130, 130],
       [130, 169]])
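
To see how p and replace behave, the following sketch draws many samples with explicit probabilities and then attempts an oversized draw without replacement. The values, probabilities, sample size, and seed are assumptions chosen for illustration.

```python
import numpy as np

np.random.seed(0)  # fixed seed so the run is repeatable

values = np.array([10, 20, 30])
p = np.array([0.7, 0.2, 0.1])  # one probability per element; must sum to 1

# Draw many samples and compare the observed frequencies with p.
draws = np.random.choice(values, size=10000, p=p)
freq = np.array([(draws == v).mean() for v in values])
print(freq)  # should be close to [0.7, 0.2, 0.1]

# Without replacement, asking for more items than the array holds is an error.
try:
    np.random.choice(values, size=5, replace=False)
    oversample_ok = True
except ValueError:
    oversample_ok = False
print(oversample_ok)  # False
```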

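$uniform(low, high, size)$
- Creates an array drawn from a uniform distribution; low is the start value, high is the end value, and size is the shape.
$normal(loc, scale, size)$
- Creates an array drawn from a normal distribution; loc is the mean, scale is the standard deviation, and size is the shape.

poisson(lam, size)
- Creates an array drawn from a Poisson distribution; lam is the rate at which the random event occurs, and size is the shape.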

In [1]: import numpy as np

In [2]: u = np.random.uniform(0, 10, (3, 4))

In [3]: u
Out[3]:
array([[ 5.77047246,  1.01256917,  1.95423089,  9.20612219],
       [ 2.74586998,  3.40639646,  8.98460191,  4.54858063],
       [ 7.79810175,  3.70017557,  0.29809813,  0.99915164]])

In [4]: n = np.random.normal(10, 5, (3, 4))

In [5]: n
Out[5]:
array([[  3.29396378,  23.49129387,  12.98561076,  18.09697008],
       [  5.62780735,  10.50709279,   7.99811177,   2.961821  ],
       [ 15.52654509,   7.95058117,  11.39385202,  12.41859651]])
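
The session above does not call poisson(). A minimal sketch follows, assuming a rate of lam = 3.0 and an arbitrary seed; for a Poisson distribution the mean and the variance both equal lam, which the larger sample should roughly reproduce.

```python
import numpy as np

np.random.seed(0)  # fixed seed so the run is repeatable

lam = 3.0  # assumed event rate, chosen for illustration

# A small array of event counts; the values are non-negative integers.
counts = np.random.poisson(lam, (3, 4))
print(counts)

# With many samples, both the mean and the variance approach lam.
big = np.random.poisson(lam, 100000)
print(round(float(big.mean()), 2), round(float(big.var()), 2))
```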

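NumPy's Statistical Functions

Statistical functions provided directly by NumPy
- np.*
- np.std()
- np.var()
- np.average()

NumPy's statistical functions

sum(a, axis = None)
- Computes the sum of the elements of array a along the given axis; axis is an integer or a tuple.
mean(a, axis = None)
- Computes the arithmetic mean of the elements of array a along the given axis; axis is an integer or a tuple.
average(a, axis = None, weights = None)
- Computes the weighted average of the elements of array a along the given axis.
std(a, axis = None)
- Computes the standard deviation of the elements of array a along the given axis.
var(a, axis = None)
- Computes the variance of the elements of array a along the given axis.

axis = None is the standard parameter of the statistical functions; with the default None, the statistic is computed over all elements of the array.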


In [1]: import numpy as np

In [2]: a = np.arange(15).reshape(3, 5)

In [3]: a
Out[3]:
array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14]])

In [4]: np.sum(a)
Out[4]: 105

In [5]: np.mean(a, axis = 1)
Out[5]: array([  2.,   7.,  12.])

In [6]: np.mean(a, axis = 0)
Out[6]: array([ 5.,  6.,  7.,  8.,  9.])

In [7]: np.average(a, axis = 0, weights = [10, 5, 1])
Out[7]: array([ 2.1875,  3.1875,  4.1875,  5.1875,  6.1875])

In [8]: np.std(a)
Out[8]: 4.3204937989385739

In [9]: np.var(a)
Out[9]: 18.666666666666668
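
The weighted average in the session can be reproduced by hand, which also shows where 2.1875 comes from: (10*0 + 5*5 + 1*10) / 16. The sketch below spells out that arithmetic along with the definitions behind np.var and np.std; the variable names are illustrative.

```python
import numpy as np

a = np.arange(15).reshape(3, 5)
w = np.array([10, 5, 1])  # one weight per row, as in the session

# Weighted average along axis 0: sum_i(w[i] * a[i, j]) / sum_i(w[i]).
manual_avg = (w[:, None] * a).sum(axis=0) / w.sum()
print(manual_avg)
print(np.allclose(manual_avg, np.average(a, axis=0, weights=w)))  # True

# Variance is the mean squared deviation from the mean (population form),
# and the standard deviation is its square root.
manual_var = np.mean((a - np.mean(a)) ** 2)
print(np.isclose(manual_var, np.var(a)))           # True
print(np.isclose(np.sqrt(manual_var), np.std(a)))  # True
```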

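min(a), max(a)
- Compute the minimum and maximum of the elements of array a.
argmin(a), argmax(a)
- Compute the index of the minimum and maximum element of array a in the flattened (one-dimensional) array.
unravel_index(index, shape)
- Converts the one-dimensional index into a multidimensional index according to shape.
ptp(a)
- Computes the difference between the maximum and minimum elements of array a.

median(a)
- Computes the median of the elements of array a.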

In [1]: import numpy as np

In [2]: b = np.arange(15, 0, -1).reshape(3, 5)

In [3]: b
Out[3]:
array([[15, 14, 13, 12, 11],
       [10,  9,  8,  7,  6],
       [ 5,  4,  3,  2,  1]])

In [4]: np.max(b)
Out[4]: 15

In [5]: np.argmax(b)    # index into the flattened array
Out[5]: 0

In [6]: np.unravel_index(np.argmax(b), b.shape)     # convert to a multidimensional index
Out[6]: (0, 0)

In [7]: np.ptp(b)
Out[7]: 14

In [8]: np.median(b)
Out[8]: 8.0
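
The same pattern works for the minimum, where the flat and multidimensional indices differ more visibly than in the argmax case. This is a small sketch on the same array b; the variable names are illustrative.

```python
import numpy as np

b = np.arange(15, 0, -1).reshape(3, 5)

# argmin returns a position in the flattened array ...
flat = np.argmin(b)

# ... which unravel_index maps back to (row, column) for the given shape.
row, col = np.unravel_index(flat, b.shape)
print(flat, (int(row), int(col)))  # 14 (2, 4)
print(b[row, col])                 # 1

# ptp ("peak to peak") is simply max minus min.
print(np.ptp(b) == np.max(b) - np.min(b))  # True
```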

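NumPy's Gradient Function

np.gradient(f)

Computes the gradient of the elements of array f; when f is multidimensional, it returns the gradient along each dimension.
Gradient: the rate of change between consecutive values, that is, the slope.

For three consecutive X coordinates whose Y values are a, b, and c, the gradient at b is (c - a) / 2.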

In [1]: import numpy as np

In [2]: a = np.random.randint(0, 20, (5))

In [3]: a
Out[3]: array([ 2,  4, 11, 12, 17])

In [4]: np.gradient(a)
Out[4]: array([ 2. ,  4.5,  4. ,  3. ,  5. ])

In [5]: b = np.random.randint(0, 20, (5))

In [6]: b
Out[6]: array([ 2,  7,  7, 18, 15])

In [7]: np.gradient(b)
Out[7]: array([ 5. ,  2.5,  5.5,  4. , -3. ])   # two cases: neighbors on both sides, e.g. 2.5 = (7 - 2) / 2; a neighbor on one side only, e.g. -3 = (15 - 18) / 1

In [8]: c = np.random.randint(0, 50, (3, 5))

In [9]: c
Out[9]:
array([[41, 17, 33, 14, 20],
       [ 0, 17, 37, 19,  6],
       [20,  1, 11, 49, 20]])

In [10]: np.gradient(c)
Out[10]:
[array([[-41. ,   0. ,   4. ,   5. , -14. ],
        [-10.5,  -8. , -11. ,  17.5,   0. ],
        [ 20. , -16. , -26. ,  30. ,  14. ]]),  # gradient along the outermost dimension (axis 0)
 array([[-24. ,  -4. ,  -1.5,  -6.5,   6. ],
        [ 17. ,  18.5,   1. , -15.5, -13. ],
        [-19. ,  -4.5,  24. ,   4.5, -29. ]])]  # gradient along the second dimension (axis 1)
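
The one-dimensional results above can be reproduced with plain slicing, which makes the two cases explicit: interior points use (c - a) / 2, and the two end points use the difference to their single neighbor. This is a sketch of that rule for unit spacing, not of how np.gradient is implemented internally.

```python
import numpy as np

y = np.array([2, 7, 7, 18, 15], dtype=float)

manual = np.empty_like(y)
manual[1:-1] = (y[2:] - y[:-2]) / 2  # interior points: (c - a) / 2
manual[0] = y[1] - y[0]              # left edge: only a right neighbor exists
manual[-1] = y[-1] - y[-2]           # right edge: only a left neighbor exists

print(manual)
print(np.allclose(manual, np.gradient(y)))  # True
```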
