Data analysis --- numpy basics (3)

The previous two articles covered the basic usage of numpy functions and of its extended functions. This article introduces how to read and write files with the numpy library.

I. Reading and writing files with numpy

1. Storing and reading CSV files with numpy

CSV (comma-separated values) is a common file format for storing batches of data.

Storage:

# Save an array to a file
np.savetxt(fname, X, fmt='%.18e', delimiter=' ', newline='\n', 
          header='', footer='', comments='# ', encoding=None)

fname: filename or file handle; if the filename ends in .gz, the file is saved in compressed gzip format
X: the array to write to the file (one-dimensional or two-dimensional)
fmt: format of each written value, for example %d, %.2f, %.18e
delimiter: the string separating the columns; the default is a single space
newline: the string separating the lines
header: string written at the beginning of the file
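To make the effect of fmt, delimiter, and header concrete, here is a minimal sketch. The array values, the file name scores.csv, and the use of a temporary directory are assumptions chosen for illustration only.

```python
import os
import tempfile

import numpy as np

# Hypothetical data: 2 rows, 2 columns of floats.
scores = np.array([[1.5, 2.25], [3.0, 4.5]])

# Write into a temporary directory so the sketch leaves nothing behind.
path = os.path.join(tempfile.mkdtemp(), "scores.csv")

# fmt='%.2f' keeps two decimals, delimiter=',' separates the columns,
# and header is written as the first line, prefixed by the default
# comments string '# '.
np.savetxt(path, scores, fmt="%.2f", delimiter=",", header="x,y")

with open(path) as f:
    text = f.read()
print(text)
```

Because the header line starts with '# ', np.loadtxt treats it as a comment by default and skips it when the file is read back.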

Read:

# Read an array from a file
np.loadtxt(fname,  delimiter=None, skiprows=0,
           usecols=None)

fname: the name of the file to read
delimiter: the string separating the columns; the default is any whitespace
skiprows: the number of leading lines to skip; the default is 0, and it is usually used to skip the file header
usecols: the indices of the columns to read; the default reads all columns
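The following sketch shows skiprows and usecols working together on a file with a plain-text header line. The file name table.csv and its contents are hypothetical and exist only for this example.

```python
import os
import tempfile

import numpy as np

path = os.path.join(tempfile.mkdtemp(), "table.csv")

# A small hypothetical CSV file with one plain-text header line.
with open(path, "w") as f:
    f.write("id,height,weight\n")
    f.write("1,170.5,60.0\n")
    f.write("2,182.0,75.5\n")

# skiprows=1 skips the header line; usecols=(1, 2) keeps only the
# height and weight columns (column indices start at 0).
data = np.loadtxt(path, delimiter=",", skiprows=1, usecols=(1, 2))
print(data)
```

Without skiprows=1, np.loadtxt would try to convert the header text to floats and raise an error.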

Example 1. Storage:

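# Storage
import os
import numpy as np

a = np.arange(50).reshape(5, 10)
# Create the target directory if it does not exist yet
os.makedirs('./test', exist_ok=True)
# Save as a .csv file (np.savetxt returns None, so nothing is assigned)
np.savetxt('./test/a.csv', a, fmt='%d', delimiter=',')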

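The saved file contains five lines of ten comma-separated integers each, starting with 0,1,2,3,4,5,6,7,8,9 and ending with 40,41,42,43,44,45,46,47,48,49.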

Example 2. Reading:

# Read the whole file
np_file = np.loadtxt('./test/a.csv', delimiter=',')
print(np_file)
# Read only the first and fifth columns
np_file1 = np.loadtxt('./test/a.csv', usecols=(0, 4), delimiter=',')
print(np_file1)

"""
np_file: [[ 0.  1.  2.  3.  4.  5.  6.  7.  8.  9.]
           [10. 11. 12. 13. 14. 15. 16. 17. 18. 19.]
           [20. 21. 22. 23. 24. 25. 26. 27. 28. 29.]
           [30. 31. 32. 33. 34. 35. 36. 37. 38. 39.]
           [40. 41. 42. 43. 44. 45. 46. 47. 48. 49.]]
Columns 1 and 5: [[ 0.  4.]
                  [10. 14.]
                  [20. 24.]
                  [30. 34.]
                  [40. 44.]]
"""

Note: a CSV file can only effectively store one-dimensional and two-dimensional arrays, so np.savetxt() and np.loadtxt() only handle one-dimensional and two-dimensional arrays.
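The limitation can be checked directly. The sketch below assumes a recent NumPy version, where np.savetxt raises ValueError for arrays with more than two dimensions. The reshape workaround is one possible approach, not something the functions do for you; the array and the file name cube.csv are made up for the example.

```python
import os
import tempfile

import numpy as np

cube = np.arange(24).reshape(2, 3, 4)
path = os.path.join(tempfile.mkdtemp(), "cube.csv")

# Trying to save a 3D array directly is rejected.
try:
    np.savetxt(path, cube, fmt="%d", delimiter=",")
    saved_directly = True
except ValueError:
    saved_directly = False
print(saved_directly)

# Workaround: flatten to 2D before saving, then restore the shape
# after loading. The original shape has to be remembered separately.
flat = cube.reshape(cube.shape[0], -1)
np.savetxt(path, flat, fmt="%d", delimiter=",")
restored = np.loadtxt(path, delimiter=",", dtype=int).reshape(cube.shape)
print(np.array_equal(restored, cube))
```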

2. Storing and reading multi-dimensional data with numpy

Storage:

a.tofile(fid, sep="", format="%s")

fid: an open file object or a filename string
sep: the separator between elements for text output; if it is an empty string, the file is written as binary
format: the format string for each element in text output

Read:

np.fromfile(file, dtype=float, count=-1, sep='')

file: an open file object or a filename string
dtype: the data type of the elements to read
count: the number of elements to read; -1 means read the entire file
sep: the separator between elements in a text file; if it is an empty string, the file is read as binary
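The binary mode (sep left as an empty string) and the role of dtype and count can be seen in a short sketch. The array, the file name values.bin, and the choice of int32 are assumptions for illustration.

```python
import os
import tempfile

import numpy as np

values = np.arange(6, dtype=np.int32)
path = os.path.join(tempfile.mkdtemp(), "values.bin")

# With the default sep="", the raw bytes of the array are written.
values.tofile(path)
size = os.path.getsize(path)
print(size)  # 6 elements * 4 bytes each = 24

# Reading with the same dtype recovers the original values.
right = np.fromfile(path, dtype=np.int32)

# count limits how many elements are read.
first_two = np.fromfile(path, dtype=np.int32, count=2)

# Reading with a different dtype silently reinterprets the bytes:
# 24 bytes become 12 int16 elements instead of 6 int32 elements.
wrong = np.fromfile(path, dtype=np.int16)
print(right.size, first_two.size, wrong.size)
```

The last call shows why the dtype has to be known in advance: the file itself carries no type information.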

Storage:

# Store a multi-dimensional array
b = np.arange(50).reshape(5, 5, 2)
b.tofile("./test/b.bat", sep=",", format="%d")

Read:

# Read the multi-dimensional array back (np.int was removed in NumPy 1.24; use int)
np.fromfile('./test/b.bat', dtype=int, sep=',')
"""
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
       34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49])
"""
np.fromfile('./test/b.bat', dtype=int, sep=',').reshape(5, 5, 2)
"""
array([[[ 0,  1], [ 2,  3], [ 4,  5], [ 6,  7], [ 8,  9]],
        [[10, 11], [12, 13], [14, 15], [16, 17], [18, 19]],
        [[20, 21], [22, 23], [24, 25], [26, 27], [28, 29]],
        [[30, 31], [32, 33], [34, 35], [36, 37], [38, 39]],
        [[40, 41], [42, 43], [44, 45], [46, 47], [48, 49]]])
"""

Note: when reading, this method requires you to know the shape and element type the array had when it was written, because tofile() stores neither. b.tofile() and np.fromfile() must be used together, and the additional information can be kept in a separate metadata file.
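One way to keep that additional information is a small JSON file next to the data file. The sketch below is only an illustration of the idea: the helper names save_with_meta and load_with_meta and the ".json" suffix are hypothetical, not part of numpy.

```python
import json
import os
import tempfile

import numpy as np


def save_with_meta(arr, data_path):
    """Hypothetical helper: write raw data plus a JSON metadata file."""
    arr.tofile(data_path)
    meta = {"shape": list(arr.shape), "dtype": arr.dtype.str}
    with open(data_path + ".json", "w") as f:
        json.dump(meta, f)


def load_with_meta(data_path):
    """Hypothetical helper: read the metadata, then rebuild the array."""
    with open(data_path + ".json") as f:
        meta = json.load(f)
    flat = np.fromfile(data_path, dtype=np.dtype(meta["dtype"]))
    return flat.reshape(meta["shape"])


b = np.arange(50).reshape(5, 5, 2)
path = os.path.join(tempfile.mkdtemp(), "b.bin")
save_with_meta(b, path)
restored = load_with_meta(path)
print(restored.shape, restored.dtype == b.dtype)
```

The caller no longer has to hard-code reshape(5, 5, 2) or the dtype when reading the file back.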

3. Convenient file access in numpy

np.save(file, arr)   np.savez(file, arr)

file: file name; np.save writes a single array with the .npy extension, and np.savez writes an archive of several arrays with the .npz extension (np.savez_compressed writes a compressed .npz archive)
arr: array variable

np.load() automatically recognizes .npz files and returns a dictionary-like object. Each array can be retrieved by using its name as the key.

np.load(file)

file: the name of the .npy or .npz file to load

a = np.arange(50).reshape(5,5,2)
np.save("a.npy", a)
b = np.load('a.npy')
print(b)

Storing data this way makes it convenient to save the training set, validation set, test set, and their labels in deep learning. You load only what you need, the number of files is greatly reduced, and you do not have to change file names all over the code. It is a better way to store data.
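As a rough sketch of that use case, several arrays can go into one .npz file with np.savez and be retrieved by name. The array contents, the key names, and the file name dataset.npz are made up for the example.

```python
import os
import tempfile

import numpy as np

# Hypothetical tiny dataset: features and labels for two splits.
x_train = np.arange(12, dtype=float).reshape(6, 2)
y_train = np.array([0, 1, 0, 1, 0, 1])
x_test = np.arange(4, dtype=float).reshape(2, 2)
y_test = np.array([1, 0])

path = os.path.join(tempfile.mkdtemp(), "dataset.npz")

# Keyword arguments become the names of the arrays inside the archive.
np.savez(path, x_train=x_train, y_train=y_train,
         x_test=x_test, y_test=y_test)

# np.load returns a dictionary-like object; arrays are read on access.
with np.load(path) as data:
    names = sorted(data.files)
    loaded_x_train = data["x_train"]
    loaded_y_test = data["y_test"]

print(names)
print(loaded_x_train.shape, loaded_y_test)
```

Unlike tofile(), the .npy and .npz formats store the shape and dtype, so no reshape or dtype argument is needed when loading.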
