Table of contents
1. Basic introduction to numpy
2. The data structure of numpy
(1) Commonly used creation functions
(2) Common conversion functions
(5) Multidimensional data access
2. Basic introduction of pandas
2. The data structure of pandas
(2) Addition, deletion, modification and query of Dataframe
3. Demonstration of the use of numpy and pandas
(3) Change index, use loc, iloc index
1. Basic introduction to numpy
1. What is numpy
NumPy (Numerical Python) is an open-source numerical computing extension for Python. It can store and process large matrices far more efficiently than Python's own nested lists (which can also represent matrices), supports multidimensional arrays and matrix operations, and provides a large library of mathematical functions that operate on arrays.
2. The data structure of numpy
The data structure of numpy is ndarray, an N-dimensional array type that describes a collection of "items" of the same type. Because every element in an ndarray has the same data type, numpy can compute on the data very quickly.
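As a small illustration of the "same type" rule, the sketch below builds an ndarray from a nested list and from a list that mixes integers and floats. The variable names are only for this example.

```python
import numpy as np

# A nested Python list can represent a matrix; np.array turns it into an ndarray.
nested = [[1, 2, 3], [4, 5, 6]]
arr = np.array(nested)

# Mixed ints and floats are upcast so that every item shares one dtype.
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)        # float64

# Arithmetic is applied to every element at once, with no explicit loop.
doubled = arr * 2
print(doubled.tolist())   # [[2, 4, 6], [8, 10, 12]]
```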
3. numpy data type
numpy supports many more data types than Python's built-in types. They correspond closely to the data types of the C language, and some of them also correspond to Python's built-in types. The following table lists common NumPy primitive types.
Name | Description |
bool_ | Boolean data type (True or False) |
int_ | default integer type |
intp | Integer type used for indexing |
... | ... |
4. Properties of numpy arrays
In NumPy, each dimension of an array is called an axis. For example, a two-dimensional array is effectively a one-dimensional array whose elements are themselves one-dimensional arrays. The first axis runs over the outer array, and the second axis runs over the elements inside each inner array. The number of axes, also called the rank, is the number of dimensions of the array. The entries below are attributes, so they are read without parentheses (a.shape, not a.shape()).
Attributes | Description |
ndarray.ndim | Rank, i.e. the number of axes or dimensions |
ndarray.shape | The dimensions of the array as a tuple; for a matrix with n rows and m columns it is (n, m) |
ndarray.size | The total number of array elements, equal to n*m for shape (n, m) |
ndarray.dtype | The element type of the ndarray object |
ndarray.itemsize | The size of each element in the ndarray object, in bytes |
ndarray.flags | Memory layout information of the ndarray object |
ndarray.real | The real part of the ndarray elements |
ndarray.imag | The imaginary part of the ndarray elements |
ndarray.data | The buffer containing the actual array elements; since elements are generally obtained by indexing the array, this attribute is usually not needed |
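The attributes in the table can be checked on a small array. This is a minimal sketch; the arrays a and c are made up for the example.

```python
import numpy as np

a = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])  # 2 rows, 3 columns

print(a.ndim)      # 2
print(a.shape)     # (2, 3)
print(a.size)      # 6
print(a.dtype)     # float64
print(a.itemsize)  # 8 (bytes per float64 element)

# real and imag are most useful on complex arrays.
c = np.array([1 + 2j, 3 - 4j])
print(c.real.tolist())  # [1.0, 3.0]
print(c.imag.tolist())  # [2.0, -4.0]
```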
5. Common functions of numpy
(1) Commonly used creation functions
Function name | Effect |
np.array(x) | Create an array from a list, tuple or other array-like x (np.ndarray() is the low-level constructor and is rarely called directly) |
np.arange(n) | ndarray with elements from 0 to n-1 |
np.ones(shape) | Generate an array of all 1s according to shape; shape is a tuple |
np.zeros(shape) | Generate an array of all 0s according to shape; shape is a tuple |
np.full(shape, val) | Generate an array according to shape in which every element is val |
np.eye(n) | Create a square n*n identity matrix with 1s on the diagonal and 0s elsewhere |
np.ones_like(a) | Generate an array of all 1s with the shape of the array a |
np.zeros_like(a) | Generate an array of all 0s with the shape of the array a |
np.full_like(a, val) | Generate an array with the shape of the array a in which every element is val |
np.linspace(1,10,10) | Fill in evenly spaced values between the start and end values to form an array (here 10 values from 1 to 10, both ends included) |
np.concatenate((a, b), axis=0) | Merge two or more arrays into a new array; axis=0 joins them vertically |
Note: shape describes the dimensions of the array. If shape is (2,3), the generated matrix has 2 rows and 3 columns; shape has the same meaning in the tables that follow.
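The creation functions above can be tried together on one small shape. The values 7 and 99 below are arbitrary choices for this sketch.

```python
import numpy as np

shape = (2, 3)                    # 2 rows, 3 columns

r = np.arange(5)                  # [0 1 2 3 4]
ones = np.ones(shape)
zeros = np.zeros(shape)
full = np.full(shape, 7)          # every element is 7
eye = np.eye(3)                   # 3x3 identity matrix
like = np.full_like(ones, 99)     # same shape and dtype as ones
lin = np.linspace(1, 10, 10)      # 1.0, 2.0, ..., 10.0

# axis=0 stacks the second array below the first one.
joined = np.concatenate((ones, zeros), axis=0)
print(joined.shape)               # (4, 3)
```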
(2) Common conversion functions
①Array dimension conversion
Method | Effect |
a.reshape(shape) | Returns an array with the given shape without changing the array elements; the original array a keeps its shape |
a.resize(shape) | Changes the shape of the array in place, modifying the original array a |
a.swapaxes(ax1, ax2) | Returns the array with the two given axes swapped |
a.flatten() | Collapses the array and returns a one-dimensional copy; the original array remains unchanged |
②Array type conversion
function | effect |
b.astype(np.int16) | By default astype() creates a new array (a copy of the original data), even if the two types are the same |
b.tolist() | Convert to a Python list |
Note: b is an array variable, not the numpy module itself. By convention numpy is imported under the alias np (import numpy as np).
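A short sketch of the shape and type conversions, written as method calls on array variables. The arrays a, b and c are invented for the example.

```python
import numpy as np

a = np.arange(6)                 # [0 1 2 3 4 5]
b = a.reshape((2, 3))            # new shape; a itself is still 1-D
flat = b.flatten()               # 1-D copy; b is unchanged
swapped = b.swapaxes(0, 1)       # axes exchanged, shape (3, 2)

c = np.arange(6)
c.resize((3, 2))                 # modifies c in place and returns None

small = b.astype(np.int16)       # converted copy
same = b.astype(b.dtype)         # still a new array, even with the same type
print(same is b)                 # False

lst = b.tolist()
print(lst)                       # [[0, 1, 2], [3, 4, 5]]
```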
(3) Array operations
Function | Effect |
np.abs(a) | Take the absolute value of each element |
np.sqrt(a) | Calculate the square root of each element |
np.square(a) | Calculate the square of each element |
np.log(a) np.log10(a) np.log2(a) | Calculate the natural, base-10 and base-2 logarithm of each element |
np.ceil(a) np.floor(a) | Calculate the ceiling and floor of each element (ceiling rounds up, floor rounds down) |
np.rint(a) | Round each element to the nearest integer |
np.modf(a) | Return the fractional and integer parts of each element as two separate arrays |
np.exp(a) | Calculate the exponential (e to the power x) of each element |
np.sign(a) | Calculate the sign of each element: 1 (+), 0, -1 (-) |
np.maximum(a, b) np.fmax() | Element-wise maximum |
np.minimum(a, b) np.fmin() | Element-wise minimum |
np.mod(a, b) | Element-wise modulo operation |
np.copysign(a, b) | Give each element of a the sign of the corresponding element of b |
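Several of the element-wise functions are easiest to understand on a tiny input. The arrays below are arbitrary sample values.

```python
import numpy as np

a = np.array([-1.5, 0.0, 2.5])
b = np.array([1.0, -2.0, 2.0])

print(np.abs(a).tolist())          # [1.5, 0.0, 2.5]
print(np.ceil(a).tolist())         # [-1.0, 0.0, 3.0]
print(np.floor(a).tolist())        # [-2.0, 0.0, 2.0]
print(np.sign(a).tolist())         # [-1.0, 0.0, 1.0]

# modf splits each element into a fractional part and an integer part.
frac, whole = np.modf(a)

# maximum and minimum compare the two arrays position by position.
hi = np.maximum(a, b)              # [1.0, 0.0, 2.5]
lo = np.minimum(a, b)              # [-1.5, -2.0, 2.0]

rem = np.mod(np.array([7, 8, 9]), 4)   # [3, 0, 1]

# copysign keeps the magnitudes of a and takes the signs from b.
signed = np.copysign(a, b)
```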
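(4) CSV file access
np.savetxt(frame, array, fmt='%.18e', delimiter=' '): frame is a file, a filename string, etc., and may be a .gz or .bz2 compressed file; array is the array to be saved; fmt is the format of each element; delimiter is the separator string written between columns (use ',' for a CSV file).
np.loadtxt(frame, dtype=float, delimiter=None, unpack=False): frame is a file, a filename string, etc., and may be a .gz or .bz2 compressed file; dtype: the data type in which the data read is stored; delimiter: the separator string, whitespace by default; unpack: if True, the columns read are returned separately so that they can be assigned to different variables.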
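A minimal round trip through a CSV file might look like the sketch below. The file name demo.csv is hypothetical, and a temporary directory is used so that nothing is left on disk.

```python
import os
import tempfile

import numpy as np

data = np.arange(12).reshape(3, 4)   # 3 rows, 4 columns

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.csv")   # hypothetical file name

    # Write integers separated by commas.
    np.savetxt(path, data, fmt="%d", delimiter=",")

    # Read the whole table back as integers.
    loaded = np.loadtxt(path, dtype=int, delimiter=",")

    # unpack=True returns one array per column.
    col0, col1, col2, col3 = np.loadtxt(path, delimiter=",", unpack=True)

print(loaded.shape)     # (3, 4)
print(col0.tolist())    # [0.0, 4.0, 8.0]
```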
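(5) Multidimensional data access
Function | Description |
a.tofile(frame, sep='', format='%s') | frame: a file or filename string; sep: the separator string between items, and if it is empty the file is written as binary; format: the format of the written data |
np.fromfile(frame, dtype=float, count=-1, sep='') | frame: a file or filename string; dtype: the type in which the data read is stored; count: the number of elements to read, -1 means the whole file; sep: the separator string between items, and if it is empty the file is read as binary |
np.save(frame, array) | frame: file name with the .npy extension (archives written by np.savez or np.savez_compressed use the .npz extension); array: the array variable |
np.load(fname) | fname: file name with the .npy extension, or .npz for an archive |
Note:
a.tofile() and np.fromfile() must be used together, and you need to know the data type and dimensions yourself, because the file stores neither
np.save() and np.load() record the data type and dimensions for you, so you do not have to track them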
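The difference between the two pairs of functions shows up as soon as a multidimensional array is written and read back. The file names a.dat and a.npy are hypothetical, and the files are written to a temporary directory.

```python
import os
import tempfile

import numpy as np

a = np.arange(24, dtype=np.int32).reshape(2, 3, 4)

with tempfile.TemporaryDirectory() as tmp:
    # tofile / fromfile: raw binary, shape and dtype are not stored.
    raw_path = os.path.join(tmp, "a.dat")
    a.tofile(raw_path)
    raw = np.fromfile(raw_path, dtype=np.int32)   # comes back one-dimensional
    restored = raw.reshape(2, 3, 4)               # shape must be supplied by hand

    # save / load: the .npy file remembers shape and dtype.
    npy_path = os.path.join(tmp, "a.npy")
    np.save(npy_path, a)
    again = np.load(npy_path)

print(raw.shape)     # (24,)
print(again.shape)   # (2, 3, 4)
```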
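(6) Random number functions
Function | Effect |
np.random.rand(d0, d1, …,dn) | Each element is a float in [0, 1), drawn from a uniform distribution |
np.random.randn(d0, d1, …,dn) | Standard normal distribution |
np.random.randint(low,high,(shape)) | Create a random integer or integer array according to shape, in the range [low, high) |
np.random.seed(s) | Set the random number seed |
np.random.shuffle(a) | Randomly reorder the array a along its first axis, changing a in place |
np.random.permutation(a) | Randomly reorder the array a along its first axis without changing the original array; a new array is returned |
np.random.choice(a[, size, replace, p]) | Draw elements from the one-dimensional array a with probabilities p to form a new array of shape size; replace says whether an element may be drawn more than once and defaults to True |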
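The sketch below exercises each random function once. The seed value 0 and the array sizes are arbitrary; no particular random values are assumed.

```python
import numpy as np

np.random.seed(0)                          # makes the run repeatable

u = np.random.rand(2, 3)                   # uniform floats in [0, 1)
n = np.random.randn(2, 3)                  # standard normal samples
ints = np.random.randint(1, 10, (2, 3))    # integers in [1, 10)

a = np.arange(5)
perm = np.random.permutation(a)            # shuffled copy
before = a.tolist()                        # a is still [0, 1, 2, 3, 4] here
np.random.shuffle(a)                       # now a itself is reordered

# replace=False forbids drawing the same element twice.
picked = np.random.choice(np.arange(5), size=3, replace=False)
```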
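(7) Gradient function
np.gradient(a): computes the gradient of the elements of the array a; when a is multidimensional, the gradient along each dimension is returned
Discrete gradient: if a, b and c are the y-values at three consecutive x coordinates (with spacing 1), the gradient at b is (c-a)/2
If c is the last element and so has no right-hand neighbour, its gradient is (c-b)/1
For a two-dimensional array, np.gradient(a) returns two arrays: the first holds the gradient along the outermost dimension and the second holds the gradient along the second dimension.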
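The rule can be checked by hand on a short array, assuming the default spacing of 1 between points. The sample values are arbitrary.

```python
import numpy as np

y = np.array([1.0, 2.0, 4.0, 7.0])
g = np.gradient(y)
# Interior points use (next - previous) / 2, end points use the one-sided difference:
# [2-1, (4-1)/2, (7-2)/2, 7-4]
print(g.tolist())        # [1.0, 1.5, 2.5, 3.0]

m = np.array([[1.0, 2.0, 4.0],
              [3.0, 6.0, 10.0]])
g_rows, g_cols = np.gradient(m)   # gradient down the rows, then across the columns
print(g_rows.tolist())   # [[2.0, 4.0, 6.0], [2.0, 4.0, 6.0]]
print(g_cols.tolist())   # [[1.0, 1.5, 2.0], [3.0, 3.5, 4.0]]
```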
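(8) Statistical functions
Function | Effect |
np.sum(a, axis=None) | Sum the elements of the array a along the given axis; axis is an integer or a tuple |
np.mean(a, axis=None) | Likewise, compute the mean |
np.average(a, axis=None, weights=None) | Compute the weighted average of the elements of a along the given axis |
np.std(a, axis=None) | Likewise, compute the standard deviation |
np.var(a, axis=None) | Compute the variance |
np.min(a) np.max(a) | Compute the minimum and maximum of the array a |
np.argmin(a) np.argmax(a) | Compute the index of the minimum and maximum of the array a (note: this is an index into the flattened, one-dimensional array) |
np.unravel_index(index, shape) | Convert the one-dimensional index into a multidimensional index according to shape |
np.ptp(a) | Compute the difference between the maximum and minimum of the array a |
np.median(a) | Compute the median of the elements of the array a |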
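A small two-row array makes the role of axis and of the flattened index visible. The values and weights are made up for this sketch.

```python
import numpy as np

a = np.array([[1, 5, 3],
              [8, 2, 6]])

total = np.sum(a)                    # 25, all elements
col_sums = np.sum(a, axis=0)         # [9, 7, 9], one sum per column
row_means = np.mean(a, axis=1)       # one mean per row

# Weighted average of each row with weights 1, 2, 1.
weighted = np.average(a, axis=1, weights=[1, 2, 1])   # [3.5, 4.5]

flat_idx = np.argmax(a)                      # 3, position in the flattened array
idx = np.unravel_index(flat_idx, a.shape)    # row 1, column 0

spread = np.ptp(a)                   # 8 - 1 = 7
med = np.median(a)                   # 4.0
```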
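2. Basic introduction of pandas
1. What is pandas
pandas is a tool built on NumPy that was created to solve data analysis tasks. It incorporates a number of libraries and some standard data models, and provides the tools needed to operate on large data sets efficiently. pandas offers a large number of functions and methods for processing data quickly and conveniently, and it is one of the important factors that make Python a powerful and efficient data analysis environment.
2. The data structure of pandas
pandas has two main data structures: Series and DataFrame.
①Series: a one-dimensional array-like object that holds a sequence of values (of types similar to those in NumPy) together with data labels, called the index. The simplest Series can be built from just an array. Note: index values in a Series may repeat.
②DataFrame: a table of data representing a matrix. It contains an ordered collection of columns, and each column can hold a different value type (numeric, string, Boolean, etc.). A DataFrame has both a row index and a column index, and it can be viewed as a dictionary of Series that share the same index. Inside a DataFrame the data is stored as one or more two-dimensional blocks, rather than as a list, dictionary or other collection of one-dimensional arrays.
③Time-Series: a Series indexed by time
④Panel: a three-dimensional array, which can be understood as a container of DataFrames
⑤Panel4D: a four-dimensional data container similar to Panel
⑥PanelND: a module with a set of factory functions for creating N-dimensional named containers like Panel4D
Note: Panel, Panel4D and PanelND belong to older pandas releases and have been removed from current versions.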
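The two main structures, and a time-indexed Series, can be built in a few lines. All names and values below are invented for illustration.

```python
import pandas as pd

# A Series carries values plus an index; index labels may repeat.
s = pd.Series([10, 20, 30], index=["a", "b", "a"])
print(s["a"].tolist())     # [10, 30]

# A DataFrame holds columns of different types that share one row index.
df = pd.DataFrame({
    "name": ["Ann", "Bob"],
    "score": [90.5, 82.0],
    "passed": [True, True],
})
col = df["score"]          # each column is a Series

# A Series indexed by time.
ts = pd.Series([1.0, 2.0], index=pd.to_datetime(["2016-01-01", "2016-01-02"]))
```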
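3. Common pandas operations
(1) Data reading and writing
(2) Addition, deletion, modification and query of Dataframe
Attribute | Effect |
df.values | View all elements |
df.index | View the index |
df.columns | View all column names |
df.dtypes | View the type of each field |
df.size | Total number of elements |
df.ndim | Number of dimensions of the table |
df.shape | Returns the number of rows and columns of the table |
df.info() | Detailed summary of the DataFrame |
df.T | Transpose of the table |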
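A quick way to get familiar with these attributes is to print them for a tiny table. The column names col1 and col2 follow the placeholders used in this article; the values are made up.

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 2, 3], "col2": ["x", "y", "z"]})

print(df.values)            # the underlying 2-D array of elements
print(list(df.index))       # [0, 1, 2]
print(list(df.columns))     # ['col1', 'col2']
print(df.dtypes)            # one dtype per column
print(df.size)              # 6
print(df.ndim)              # 2
print(df.shape)             # (3, 2)
df.info()                   # prints a summary; note the parentheses
print(df.T.shape)           # (2, 3)
```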
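(3) Viewing a Dataframe
①Basic ways of viewing
Method | Effect |
df['col1'] | A single column |
df['col1'][2:7] | Several rows of a single column |
df[['col1','col2']][2:7] | Several rows of several columns |
df[:][2:7] | Several rows of all columns |
df.head() | The first few rows |
df.tail() | The last few rows |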
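The basic viewing forms can be compared on a ten-row table with the default integer index. The data is made up; only the placeholder column names come from the table above.

```python
import pandas as pd

df = pd.DataFrame({
    "col1": range(10),
    "col2": range(10, 20),
    "col3": list("abcdefghij"),
})

one_col = df["col1"]                    # a Series
some_rows = df["col1"][2:7]             # rows at positions 2 to 6
two_cols = df[["col1", "col2"]][2:7]    # same rows, two columns
all_cols = df[:][2:7]                   # same rows, every column

first = df.head()                       # first 5 rows by default
last = df.tail()                        # last 5 rows by default
print(some_rows.tolist())               # [2, 3, 4, 5, 6]
```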
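②Viewing with loc and iloc (loc is recommended most of the time)
# loc[row index label or condition, column index label]
# iloc[row index position, column index position]
Type | loc | iloc |
Single-column slice | df.loc[:,'col1'] | df.iloc[:,3] |
Multi-column slice | df.loc[:,['col1','col2']] | df.iloc[:,[1,3]] |
Fancy slice | df.loc[2:5,['col1','col2']] | df.iloc[2:5,[1,3]] |
Conditional slice | df.loc[df['col1']=='245',['col1','col2']] | df.iloc[(df['col1']=='245').values,[1,5]] |
Note: a loc slice such as 2:5 selects by label and includes both ends, while an iloc slice such as 2:5 selects by position and excludes the end.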
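The following sketch puts loc and iloc side by side on a small table. The values are invented, and the column positions used with iloc are chosen to fit this three-column example rather than copied from the table.

```python
import pandas as pd

df = pd.DataFrame({
    "col1": ["245", "111", "245", "300", "120", "245"],
    "col2": [1, 2, 3, 4, 5, 6],
    "col3": list("abcdef"),
})

# loc slices by label and includes both ends: labels 2, 3, 4, 5.
by_label = df.loc[2:5, ["col1", "col2"]]

# iloc slices by position and excludes the end: positions 2, 3, 4.
by_pos = df.iloc[2:5, [0, 1]]

# Conditional selection: loc accepts the Boolean Series directly,
# iloc needs the plain Boolean array behind it.
mask = df["col1"] == "245"
cond_loc = df.loc[mask, ["col1", "col2"]]
cond_iloc = df.iloc[mask.values, [0, 1]]

print(len(by_label), len(by_pos))    # 4 3
```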
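③Data modification
Method | Effect |
df.loc[df['col1']=='258','col1']=214 | Change the values of a field; the change is made in place and cannot be undone |
df['col2'] = expression or constant | Add a column of data |
df.drop(labels=range(1,11),axis=0,inplace=True) | Delete several rows; when inplace is True the rows are deleted from the source data, when False the result is returned as a new data set |
df.drop(labels=['col1','col2'],axis=1,inplace=True) | Delete several columns |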
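Because these changes cannot be undone, one cautious pattern is to work on a copy. The sketch below assumes a tiny made-up table; the names work, col3 and flag are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"col1": ["258", "100", "258"], "col2": [1, 2, 3]})
work = df.copy()                       # keep the original untouched

# Overwrite the matching cells of col1.
work.loc[work["col1"] == "258", "col1"] = "214"

# Add columns from an expression and from a constant.
work["col3"] = work["col2"] * 10
work["flag"] = 0

# inplace=True removes the rows labelled 1 and 2 from work itself.
work.drop(labels=range(1, 3), axis=0, inplace=True)

# inplace=False (the default) returns a new object instead.
trimmed = work.drop(labels=["col3", "flag"], axis=1)

print(list(trimmed.columns))           # ['col1', 'col2']
```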
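④Processing time series
Method | Effect |
df['time'] = pd.to_datetime(df['time']) | Convert time strings into standard datetime values |
year = df['time'].dt.year | Extract information from the time series (through the .dt accessor) |
df['time'] = df['time'] + pd.Timedelta(days=1) | Time addition and subtraction with Timedelta, which supports weeks, days, hours, minutes and seconds, but not months or years |
df['time'] = df['time'] - pd.to_datetime('2016-1-1') | Time subtraction against a fixed date |
df['time'].max() - df['time'].min() | Compute the time span |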
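The time operations can be tried on two made-up timestamps. The results are kept in separate variables here so that the time column is not overwritten at each step.

```python
import pandas as pd

df = pd.DataFrame({"time": ["2016-01-01 08:00:00", "2016-01-03 20:30:00"]})

# Strings become datetime values.
df["time"] = pd.to_datetime(df["time"])

# Components are read through the .dt accessor.
years = df["time"].dt.year

# Shift every timestamp by one day.
shifted = df["time"] + pd.Timedelta(days=1)

# Time elapsed since a fixed date.
since = df["time"] - pd.to_datetime("2016-1-1")

# Span between the earliest and the latest timestamp.
span = df["time"].max() - df["time"].min()
print(span)        # 2 days 12:30:00
```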
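3. Demonstration of the use of numpy and pandas
This demonstration focuses on indexing and slicing; the calculation functions are all explained above. In real data preprocessing we filter the data we want out of dirty data and then generate a cleaned data file.
1. Data introduction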
import pandas as pd
import numpy as np
df = pd.read_csv("student.csv")
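Data file name: student.csv
Data layout after reading the file with pandas' pd.read_csv() method (the phone numbers in the figure were generated at random; please get in touch to have them removed if they cause any problem):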
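2. Data extraction
(1) Column extraction
Single-column extraction
A single column is selected by its name in columns; the result contains the index plus all values of the selected column.
Multi-column extraction
Note that selecting several columns requires two pairs of square brackets, as in df[['学号','姓名']], rather than df['学号','姓名'] (学号 is the student ID column and 姓名 is the name column). The column names do not have to follow their order in the table: list them in whatever order you need, and the result follows the order you wrote.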
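Since student.csv itself is not reproduced here, the sketch below uses a hypothetical stand-in table. Only the column names 学号 and 姓名 come from the text; the score column and all values are invented.

```python
import pandas as pd

# Hypothetical stand-in for student.csv.
df = pd.DataFrame({
    "学号": [1001, 1002, 1003],
    "姓名": ["A", "B", "C"],
    "score": [88, 92, 75],
})

single = df["姓名"]             # one pair of brackets: a Series with the index
multi = df[["姓名", "学号"]]    # two pairs: a DataFrame in the order requested

# A single pair of brackets with two names is looked up as one key and fails.
try:
    df["学号", "姓名"]
    failed = False
except KeyError:
    failed = True

print(list(multi.columns))      # ['姓名', '学号']
```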
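(2) Selecting multiple rows of columns
When no column name is given in front, all columns are shown. The rows returned are determined by the slice that follows, which obeys Python's slicing rule: the left bound is included and the right bound is excluded.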
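(3) Change index, use loc, iloc index
A change to the index made in place is permanent, so before making it you can create a new data object by copying, to avoid "polluting" the original data. The same holds for changes to the data itself. When you need to add, delete, modify or query under different requirements, copying the data first and then changing the copy is the best choice.
Slicing a dataframe (rows first, then columns)
df.loc
df.loc[first row index value:last row index value, first column name:last column name]
df.iloc
df.iloc[start row position:end row position, start column position:end column position]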
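A possible way to combine a changed index with both slicing styles is sketched below. The table is a hypothetical stand-in for student.csv, and choosing 学号 as the new index is an assumption made for the example.

```python
import pandas as pd

# Hypothetical stand-in for student.csv.
df = pd.DataFrame({
    "学号": [1001, 1002, 1003, 1004],
    "姓名": ["A", "B", "C", "D"],
    "score": [88, 92, 75, 60],
})

# Work on a copy so the original table keeps its default index.
indexed = df.copy().set_index("学号")

# loc: index values and column names, both ends included.
by_label = indexed.loc[1002:1003, "姓名":"score"]

# iloc: positions, end excluded. Rows 1 and 2, columns 0 and 1.
by_pos = indexed.iloc[1:3, 0:2]

print(by_label.equals(by_pos))   # True
```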
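(4) Extracting specific values
Specific values are extracted with syntax of the form variable[variable['column name'] == 'filter condition']. Remember that writing only variable['column name'] == 'filter condition' produces the output shown in the second figure below: the system compares each value and returns only the True/False results of the comparison.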
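The difference between the two forms can be seen on a hypothetical table; the class column and its values are invented for this sketch.

```python
import pandas as pd

# Hypothetical data; only the 姓名 column name comes from the text.
df = pd.DataFrame({"姓名": ["A", "B", "C"], "class": ["x", "y", "x"]})

# The comparison alone gives one True/False value per row.
mask = df["class"] == "x"
print(mask.tolist())          # [True, False, True]

# Wrapping it in df[...] keeps only the rows where the value is True.
rows = df[df["class"] == "x"]
print(rows["姓名"].tolist())  # ['A', 'C']
```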