Introduction to numpy and pandas for data preprocessing in Python, with commonly used functions

Table of contents

1. Basic introduction to numpy

1. What is numpy

2. The data structure of numpy

3. numpy data type

4. Properties of numpy arrays

5. Common functions of numpy

(1) Commonly used creation functions

(2) Common conversion functions

(3) Array operations

(4) CSV file access

(5) Multidimensional data access

(6) Random number function

(7) Gradient function

(8) Statistics function

2. Basic introduction of pandas

1. What is pandas

2. The data structure of pandas

3. Pandas common operations

(1) Data reading and writing

(2) Addition, deletion, modification and query of Dataframe

(3) View Dataframe

3. Demonstration of the use of numpy and pandas

1. Data introduction

2. Data extraction

(1) Column extraction

(2) Column multi-row index

(3) Change index, use loc, iloc index

(4) Specific value extraction


1. Basic introduction to numpy

1. What is numpy

NumPy (Numerical Python) is an open-source numerical computing extension for Python. It stores and processes large matrices far more efficiently than Python's own nested lists (which can also be used to represent matrices), supports multidimensional arrays and matrix operations, and provides a large library of mathematical functions that operate on arrays.

2. The data structure of numpy

The data structure of numpy is the ndarray, an N-dimensional array type that describes a collection of "items" of the same type. Because every element of an ndarray has the same data type, numpy can store the data compactly and process it in fast compiled loops, which is why its computations are so quick.
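As a minimal sketch of the point above (the variable names are only illustrative), a nested list can be converted to an ndarray, all elements end up sharing one dtype, and arithmetic applies to every element at once:

```python
import numpy as np

# A nested Python list and the equivalent ndarray
nested = [[1, 2, 3], [4, 5, 6]]
arr = np.array(nested)

# Every element shares one dtype; mixing ints and floats upcasts to float
mixed = np.array([1, 2.5, 3])

# Arithmetic applies to all elements at once, no explicit loop needed
doubled = arr * 2

print(type(arr).__name__)   # ndarray
print(mixed.dtype)          # float64
print(doubled.tolist())     # [[2, 4, 6], [8, 10, 12]]
```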

3. numpy data type

numpy supports many more data types than Python's built-in types. Most of them map onto the data types of the C language, and some correspond to Python's built-in types. The following table lists common NumPy basic types.

numpy basic data types
Name     Description
bool_    Boolean data type (True or False)
int_     Default integer type
intp     Integer type used for indexing
...      ...
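A short sketch of how these type names are used in practice: the dtype argument selects the element type when an array is created. The int8 array is an extra illustration that is not listed in the table above.

```python
import numpy as np

flags = np.array([True, False], dtype=np.bool_)
counts = np.array([1, 2, 3], dtype=np.int_)
index = np.array([0, 2], dtype=np.intp)      # integer type used for indexing
small = np.array([1, 2, 3], dtype=np.int8)   # 1 byte per element

print(flags.dtype)             # bool
print(small.itemsize)          # 1
print(counts[index].tolist())  # [1, 3]
```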

4. Properties of numpy arrays

In NumPy, dimensions are called axes. A two-dimensional array can be viewed as a one-dimensional array whose elements are themselves one-dimensional arrays: the first axis (axis 0) runs across those inner arrays (the rows), and the second axis (axis 1) runs within each inner array (the columns). The number of axes is called the rank, which is the number of dimensions of the array.

Important attributes of the ndarray object
Attribute          Description
ndarray.ndim       Rank, i.e. the number of axes (dimensions)
ndarray.shape      The dimensions of the array as a tuple; (n, m) for a matrix with n rows and m columns
ndarray.size       The total number of array elements, equal to n*m for shape (n, m)
ndarray.dtype      The element type of the ndarray object
ndarray.itemsize   The size of each element in the ndarray object, in bytes
ndarray.flags      Memory layout information of the ndarray object
ndarray.real       The real part of the ndarray elements
ndarray.imag       The imaginary part of the ndarray elements
ndarray.data       The buffer containing the actual array elements; since elements are generally obtained by indexing the array, this attribute is usually not needed

Note: these are attributes, not methods, so they are read without parentheses (a.shape, not a.shape()).
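The attributes in the table can be inspected on a small array. This is only an illustrative sketch; the arrays a and z are made up for the example.

```python
import numpy as np

a = np.arange(6, dtype=np.float64).reshape(2, 3)

print(a.ndim)      # 2
print(a.shape)     # (2, 3)
print(a.size)      # 6
print(a.dtype)     # float64
print(a.itemsize)  # 8

# real and imag are most useful on complex arrays
z = np.array([1 + 2j, 3 - 4j])
print(z.real.tolist())  # [1.0, 3.0]
print(z.imag.tolist())  # [2.0, -4.0]
```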

5. Common functions of numpy

(1) Commonly used creation functions

Commonly used creation functions
Function                          Effect
np.array(obj)                     Create an array from a list, tuple or other sequence
np.arange(n)                      ndarray with elements from 0 to n-1
np.ones(shape)                    Generate an array of all 1s with the given shape; shape is a tuple
np.zeros(shape)                   Generate an array of all 0s with the given shape; shape is a tuple
np.full(shape, val)               Generate an array with the given shape in which every element is val
np.eye(n)                         Create an n*n identity matrix with 1s on the diagonal and 0s elsewhere
np.ones_like(a)                   Generate an array of all 1s with the same shape as array a
np.zeros_like(a)                  Generate an array of all 0s with the same shape as array a
np.full_like(a, val)              Generate an array with the same shape as array a in which every element is val
np.linspace(1, 10, 10)            Fill an array with evenly spaced values between the start and end values (both included); here 10 values from 1 to 10
np.concatenate((a, b), axis=0)    Merge two or more arrays into a new array; axis=0 joins them vertically

Note: shape describes the shape of the array. For example, shape (2, 3) produces a matrix with 2 rows and 3 columns. shape has the same meaning everywhere below.
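A brief sketch of the creation functions, with arbitrary shapes and values chosen only for illustration:

```python
import numpy as np

r = np.arange(5)                 # elements 0..4
ones = np.ones((2, 3))           # 2 rows, 3 columns of 1.0
zeros = np.zeros((2, 3))
sevens = np.full((2, 3), 7)
ident = np.eye(3)                # 3x3 identity matrix
like = np.full_like(sevens, 99)  # same shape as sevens, every element 99
grid = np.linspace(1, 10, 10)    # ten evenly spaced values from 1 to 10

# axis=0 stacks the two (2, 3) arrays vertically
stacked = np.concatenate((ones, zeros), axis=0)

print(r.tolist())     # [0, 1, 2, 3, 4]
print(stacked.shape)  # (4, 3)
print(like.tolist())  # [[99, 99, 99], [99, 99, 99]]
```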

(2) Common conversion functions

①Array dimension conversion

Array dimension conversion
Method                  Effect
a.reshape(shape)        Returns an array with the given shape without changing the elements; the shape of the original array a remains unchanged
a.resize(shape)         Changes the shape of the array in place, modifying the original array
a.swapaxes(ax1, ax2)    Swaps two axes of the array
a.flatten()             Collapses the array and returns a one-dimensional copy; the original array remains unchanged

Note: these are methods of an array a. np.reshape(a, shape) and np.swapaxes(a, ax1, ax2) also exist as functions, but there is no np.flatten function.
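The difference between these methods can be sketched as follows. The refcheck=False argument is an addition for this example: it only makes the in-place resize robust when other references to the array exist, for instance in an interactive session.

```python
import numpy as np

a = np.arange(6)
b = a.reshape((2, 3))   # new shape; a itself stays one-dimensional
c = b.swapaxes(0, 1)    # axes swapped, shape (3, 2)
d = b.flatten()         # one-dimensional copy of b

d[0] = 100              # b is unaffected because flatten() copies

e = np.arange(6)
e.resize((3, 2), refcheck=False)  # changes e itself

print(a.shape, b.shape, c.shape)  # (6,) (2, 3) (3, 2)
print(b[0, 0])                    # 0
print(e.shape)                    # (3, 2)
```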

②Array type conversion

Array type conversion
Method                Effect
b.astype(np.int16)    By default astype() always creates a new array (a copy of the original data), even if the two types are the same
b.tolist()            Converts the array to a Python list

Note: b is the name of an array variable, not a name defined by numpy. By convention, numpy is imported as np (import numpy as np).
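A small sketch of both conversions, with sample values chosen for illustration. Converting floats to an integer type truncates the fractional part.

```python
import numpy as np

b = np.array([1.7, 2.2, 3.9])

as_int = b.astype(np.int16)   # fractional part is dropped
same = b.astype(b.dtype)      # same type, but still a new array
py_list = b.tolist()

print(as_int.tolist())         # [1, 2, 3]
print(same is b)               # False
print(type(py_list).__name__)  # list
```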

(3) Array operations

Element-wise operations
Function                              Effect
np.abs(a)                             Take the absolute value of each element
np.sqrt(a)                            Calculate the square root of each element
np.square(a)                          Calculate the square of each element
np.log(a) np.log10(a) np.log2(a)      Calculate the natural, base-10 and base-2 logarithm of each element
np.ceil(a) np.floor(a)                Calculate the ceiling and floor of each element (ceil rounds up, floor rounds down)
np.rint(a)                            Round each element to the nearest integer
np.modf(a)                            Return the fractional and integer parts of each element as two separate arrays
np.exp(a)                             Calculate the exponential e**x of each element
np.sign(a)                            Calculate the sign of each element: 1 (+), 0, -1 (-)
np.maximum(a, b) np.fmax(a, b)        Element-wise maximum
np.minimum(a, b) np.fmin(a, b)        Element-wise minimum
np.mod(a, b)                          Element-wise modulo
np.copysign(a, b)                     Assign the sign of each element of b to the corresponding element of a
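A few of these functions applied to two small sample arrays (the values are arbitrary and chosen so the results are easy to check by hand):

```python
import numpy as np

a = np.array([-1.5, 0.0, 2.25])
b = np.array([1.0, -2.0, -3.0])

print(np.abs(a).tolist())          # [1.5, 0.0, 2.25]
print(np.ceil(a).tolist())         # [-1.0, 0.0, 3.0]
print(np.floor(a).tolist())        # [-2.0, 0.0, 2.0]
print(np.sign(a).tolist())         # [-1.0, 0.0, 1.0]
print(np.maximum(a, b).tolist())   # [1.0, 0.0, 2.25]
print(np.copysign(a, b).tolist())  # [1.5, -0.0, -2.25]

# modf returns the fractional parts first, then the integer parts
frac, whole = np.modf(a)
print(frac.tolist())               # [-0.5, 0.0, 0.25]
print(whole.tolist())              # [-1.0, 0.0, 2.0]
```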

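(4) CSV file access

np.savetxt(fname, array, fmt='%.18e', delimiter=' '): fname is a file, a file name (string), etc., and may be a .gz or .bz2 compressed file; array is the array to be saved; fmt is the format of each element; delimiter is the string that separates the columns (use ',' for CSV).

np.loadtxt(fname, dtype=float, delimiter=None, unpack=False): fname is a file, a file name (string), etc., and may be a .gz or .bz2 compressed file; dtype is the data type in which the data that is read is stored; delimiter is the separator string, whitespace by default; unpack: if True, the columns that are read are returned separately so that they can be assigned to different variables.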
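A round trip through a CSV file might look like the sketch below. The temporary directory and the file name demo.csv are assumptions made for the example.

```python
import os
import tempfile

import numpy as np

data = np.arange(6, dtype=float).reshape(2, 3)

# Hypothetical file location used only for this example
path = os.path.join(tempfile.mkdtemp(), "demo.csv")

# Write with two decimals and a comma between columns
np.savetxt(path, data, fmt="%.2f", delimiter=",")

# Read the whole table back as a 2-D array
loaded = np.loadtxt(path, dtype=float, delimiter=",")

# unpack=True returns one array per column
c0, c1, c2 = np.loadtxt(path, delimiter=",", unpack=True)

print(loaded.tolist())  # [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0]]
print(c1.tolist())      # [1.0, 4.0]
```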

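(5) Multidimensional data access

Multidimensional data access
Function                                             Parameters
a.tofile(fname, sep='', format='%s')                 fname: file or file name (string); sep: separator string between items, if it is empty the file is written in binary; format: format of the written data
np.fromfile(fname, dtype=float, count=-1, sep='')    fname: file or file name (string); dtype: data type in which the data that is read is stored; count: number of elements to read, -1 reads the whole file; sep: separator string between items, if it is empty the file is read as binary
np.save(fname, array)                                fname: file name with the .npy extension (.npz for compressed archives); array: the array variable
np.load(fname)                                       fname: file name with the .npy extension (.npz for compressed archives)

Note:

a.tofile() and np.fromfile() must be used together, and you need to know the data type and dimensions yourself, because they are not stored in the file.

With np.save() and np.load() you do not have to keep track of the data type and dimensions yourself.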
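The two notes above can be sketched as follows. The temporary folder and the file names are assumptions made for the example.

```python
import os
import tempfile

import numpy as np

a = np.arange(12, dtype=np.int32).reshape(3, 4)
folder = tempfile.mkdtemp()  # hypothetical location for the demo files

# tofile/fromfile: raw values only, so dtype and shape must be supplied again
raw_path = os.path.join(folder, "a.bin")
a.tofile(raw_path)
flat = np.fromfile(raw_path, dtype=np.int32)  # comes back one-dimensional
restored = flat.reshape(3, 4)

# save/load: dtype and shape travel with the .npy file
npy_path = os.path.join(folder, "a.npy")
np.save(npy_path, a)
loaded = np.load(npy_path)

print(flat.shape)                  # (12,)
print(loaded.shape, loaded.dtype)  # (3, 4) int32
```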

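(6) Random number functions

numpy random number functions
Function                                   Effect
np.random.rand(d0, d1, ..., dn)            Array of shape (d0, ..., dn) whose elements are floats in [0, 1), uniformly distributed
np.random.randn(d0, d1, ..., dn)           Array of shape (d0, ..., dn) drawn from the standard normal distribution
np.random.randint(low, high, shape)        Create a random integer or integer array with the given shape, in the range [low, high)
np.random.seed(s)                          Set the random number seed
np.random.shuffle(a)                       Randomly permute array a along its first axis, changing a itself
np.random.permutation(a)                   Randomly permute array a along its first axis without changing the original array; a new array is returned
np.random.choice(a[, size, replace, p])    Draw elements from the one-dimensional array a with probabilities p to form a new array of shape size; replace indicates whether an element can be drawn more than once, and defaults to True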
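A minimal sketch of these functions. The seed value and shapes are arbitrary, and no concrete outputs are shown because they depend on the random generator.

```python
import numpy as np

np.random.seed(0)                     # make the run repeatable

u = np.random.rand(2, 3)              # uniform floats in [0, 1)
n = np.random.randn(4)                # standard normal samples
k = np.random.randint(1, 7, (2, 2))   # integers in [1, 7), like dice rolls

a = np.arange(5)
p = np.random.permutation(a)          # new shuffled array, a is untouched
a_before = a.copy()
np.random.shuffle(a)                  # shuffles a in place

# replace=False draws each element at most once
picks = np.random.choice(np.arange(5), size=3, replace=False)

print(u.shape, n.shape, k.shape)      # (2, 3) (4,) (2, 2)
```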

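(7) Gradient function

np.gradient(a): computes the gradient of the elements of array a; when a is multidimensional, the gradient along each dimension is returned.
Discrete gradient: for three consecutive points on the x axis with y values a, b and c (unit spacing), the gradient at the interior point b is (c-a)/2,
while the gradient at the end point c is (c-b)/1.
For a two-dimensional array, np.gradient(a) returns two arrays: the first holds the gradient along the outermost dimension and the second holds the gradient along the second dimension.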
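The formulas above can be checked by hand on a small example; the sample values are made up for illustration.

```python
import numpy as np

# y values at three consecutive x positions
y = np.array([1.0, 4.0, 9.0])
g = np.gradient(y)
# ends use one-sided differences: (4-1)/1 = 3 and (9-4)/1 = 5
# the interior point uses (9-1)/2 = 4
print(g.tolist())   # [3.0, 4.0, 5.0]

# For a 2-D array one gradient array is returned per dimension
m = np.array([[1.0, 2.0, 4.0],
              [2.0, 4.0, 8.0]])
g0, g1 = np.gradient(m)
print(g0.tolist())  # [[1.0, 2.0, 4.0], [1.0, 2.0, 4.0]]
print(g1.tolist())  # [[1.0, 1.5, 2.0], [2.0, 3.0, 4.0]]
```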

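(8) Statistics functions

Statistics functions
Function                                   Effect
np.sum(a, axis=None)                       Sum of the elements of array a along the given axis; axis is an integer or a tuple of integers
np.mean(a, axis=None)                      Likewise, the mean
np.average(a, axis=None, weights=None)     Weighted average of the elements of array a along the given axis
np.std(a, axis=None)                       Likewise, the standard deviation
np.var(a, axis=None)                       The variance
np.min(a) np.max(a)                        The minimum and maximum of array a
np.argmin(a) np.argmax(a)                  The index of the minimum and maximum of array a (note: an index into the flattened, one-dimensional array)
np.unravel_index(index, shape)             Convert the one-dimensional index into a multidimensional index according to shape
np.ptp(a)                                  The difference between the maximum and minimum of array a
np.median(a)                               The median of the elements of array a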
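A sketch of the statistics functions on a small 2x3 array, with values chosen so the results can be verified by hand:

```python
import numpy as np

a = np.array([[1, 5, 3],
              [8, 2, 6]])

print(np.sum(a))                           # 25
print(np.sum(a, axis=0).tolist())          # [9, 7, 9]
print(np.mean(a, axis=1)[0])               # 3.0
print(np.average(a[0], weights=[1, 1, 2])) # 3.0, i.e. (1 + 5 + 2*3) / 4

flat_index = int(np.argmax(a))             # position in the flattened array
row, col = (int(i) for i in np.unravel_index(flat_index, a.shape))
print(flat_index, (row, col))              # 3 (1, 0)

print(np.ptp(a))                           # 7
print(np.median(a))                        # 4.0
```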

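2. Basic introduction of pandas

1. What is pandas

pandas is a tool built on NumPy that was created for data analysis tasks. pandas incorporates a large number of libraries and some standard data models, and provides the tools needed to operate on large data sets efficiently. It offers many functions and methods for processing data quickly and conveniently, and in practice it is one of the important reasons why Python is a powerful and efficient data analysis environment.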

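2. The data structure of pandas

pandas has two main data structures: Series and DataFrame.
        ① Series: a one-dimensional array-like object that holds a sequence of values (of types similar to those in NumPy) together with data labels, called the index. The simplest Series can be built from a single array. Note: index values in a Series may be repeated.
        ② DataFrame: a rectangular table of data that contains an ordered collection of columns, each of which can hold a different value type (numeric, string, Boolean, etc.). A DataFrame has both a row index and a column index, and can be viewed as a dictionary of Series that share the same index. In a DataFrame, the data is stored as one or more two-dimensional blocks rather than as a list, dictionary or other collection of one-dimensional arrays.

        ③ Time-Series: a Series indexed by time

        ④ Panel: a three-dimensional array, which can be understood as a container of DataFrames

        ⑤ Panel4D: a four-dimensional data container similar to Panel

        ⑥ PanelND: a module with a factory collection for creating N-dimensional named containers like Panel4D

Note: Panel, Panel4D and PanelND only exist in old pandas versions and have been removed from current releases.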
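The two main structures, plus a time-indexed Series, can be sketched as follows. All names and values are invented for the example.

```python
import pandas as pd

# Series: values plus an index; index labels may repeat
s = pd.Series([10, 20, 30], index=["a", "b", "a"])

# DataFrame: columns of different types sharing one row index
df = pd.DataFrame({
    "name": ["Ann", "Bob"],
    "score": [90.5, 78.0],
    "passed": [True, True],
})

# A Series indexed by time
ts = pd.Series([1.0, 2.0], index=pd.to_datetime(["2022-05-01", "2022-05-02"]))

print(s["a"].tolist())             # [10, 30]
print(type(df["score"]).__name__)  # Series
print(df.shape)                    # (2, 3)
print(type(ts.index).__name__)     # DatetimeIndex
```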

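3. Pandas common operations

(1) Data reading and writing

(2) Addition, deletion, modification and query of Dataframe

Addition, deletion, modification and query of Dataframe
Attribute     Effect
df.values     View all elements
df.index      View the index
df.columns    View all column names
df.dtypes     View the data type of each field
df.size       Total number of elements
df.ndim       Number of dimensions of the table
df.shape      Number of rows and columns of the table
df.info()     Detailed summary of the DataFrame
df.T          Transpose of the table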
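These attributes can be tried on a small DataFrame. The column names col1 and col2 follow the placeholders used in this article; the values are invented.

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 2, 3], "col2": ["x", "y", "z"]})

print(df.values.shape)             # (3, 2)
print(list(df.index))              # [0, 1, 2]
print(list(df.columns))            # ['col1', 'col2']
print(df.dtypes)                   # one dtype per column
print(df.size, df.ndim, df.shape)  # 6 2 (3, 2)
df.info()                          # prints a summary of the table
print(df.T.shape)                  # (2, 3)
```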

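(3) View Dataframe

① Basic viewing methods

Viewing methods
Expression                  Effect
df['col1']                  A single column
df['col1'][2:7]             Multiple rows of a single column
df[['col1','col2']][2:7]    Multiple rows of multiple columns
df[:][2:7]                  Multiple rows of all columns
df.head()                   The first few rows (5 by default)
df.tail()                   The last few rows (5 by default)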
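A sketch of the viewing methods on a ten-row table whose contents are invented for illustration:

```python
import pandas as pd

df = pd.DataFrame({"col1": range(10), "col2": list("abcdefghij")})

one_col = df["col1"]                # a Series
rows_of_col = df["col1"][2:7]       # rows 2..6 of one column
multi = df[["col1", "col2"]][2:7]   # rows 2..6 of two columns
rows = df[:][2:7]                   # rows 2..6 of all columns

print(rows_of_col.tolist())             # [2, 3, 4, 5, 6]
print(multi.shape)                      # (5, 2)
print(len(df.head()), len(df.tail(3)))  # 5 3
```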

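② Viewing with loc and iloc (loc is recommended in most cases)

# loc[row index labels or condition, column index labels]
# iloc[row index positions, column index positions]

Slicing methods
Type                   loc                                          iloc
Single-column slice    df.loc[:,'col1']                             df.iloc[:,3]
Multi-column slice     df.loc[:,['col1','col2']]                    df.iloc[:,[1,3]]
Fancy slice            df.loc[2:5,['col1','col2']]                  df.iloc[2:5,[1,3]]
Conditional slice      df.loc[df['col1']=='245',['col1','col2']]    df.iloc[(df['col1']=='245').values,[1,5]]

Note: a loc slice such as 2:5 is label-based and includes both ends, while an iloc slice 2:5 is position-based and excludes the end.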
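The difference between label-based and position-based selection can be sketched on a small table (columns and values invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "col1": ["245", "100", "245", "300"],
    "col2": [1, 2, 3, 4],
    "col3": [0.1, 0.2, 0.3, 0.4],
})

# loc slices by label and includes both ends: rows 1 and 2
by_label = df.loc[1:2, ["col1", "col2"]]

# iloc slices by position and excludes the end: row 1 only
by_pos = df.iloc[1:2, [0, 1]]

# Conditional selection: loc takes the boolean Series, iloc needs its values
mask = df["col1"] == "245"
cond_loc = df.loc[mask, ["col1", "col3"]]
cond_iloc = df.iloc[mask.values, [0, 2]]

print(len(by_label), len(by_pos))  # 2 1
print(cond_loc["col3"].tolist())   # [0.1, 0.3]
```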

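③ Data modification

Data modification operations
Expression                                             Effect
df.loc[df['col1']=='258','col1']=214                   Change the data of a field; irreversible
df['col2'] = formula/constant                          Add a column of data
df.drop(labels=range(1,11),axis=0,inplace=True)        Delete several rows; with inplace=True the rows are deleted from the source data, with inplace=False a new DataFrame is returned and must be assigned to a variable
df.drop(labels=['col1','col2'],axis=1,inplace=True)    Delete several columns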
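A sketch of the modification operations. The example assigns the string "214" rather than the number 214 so that the column keeps a single type; that choice, like the sample data, is an assumption of this example.

```python
import pandas as pd

df = pd.DataFrame({
    "col1": ["258", "100", "258", "300"],
    "col2": [1, 2, 3, 4],
})

# Change the values of col1 in the rows that match the condition
df.loc[df["col1"] == "258", "col1"] = "214"

# Add a column computed from an existing one
df["col3"] = df["col2"] * 10

# inplace=False (the default) returns a new DataFrame; df keeps all rows
trimmed = df.drop(labels=[0, 1], axis=0)

# inplace=True removes the column from df itself
df.drop(labels=["col2"], axis=1, inplace=True)

print(df["col1"].tolist())  # ['214', '100', '214', '300']
print(list(df.columns))     # ['col1', 'col3']
print(len(trimmed))         # 2
```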

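④ Handling time series

Dataframe time handling
Expression                                              Effect
df['time'] = pd.to_datetime(df['time'])                 Convert string times to standard datetimes
year = df['time'].dt.year                               Extract information from the time series
df['time'] = df['time'] + pd.Timedelta(days=1)          Time addition and subtraction with Timedelta, which supports weeks, days, hours, minutes and seconds, but not months or years
df['time'] = df['time'] - pd.to_datetime('2016-1-1')    Subtract a fixed date to get the elapsed time
df['time'].max() - df['time'].min()                     Calculate the time span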
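The time operations can be sketched on three sample dates, chosen only for illustration. Results are stored in separate variables instead of overwriting df['time'] so that each step stays visible.

```python
import pandas as pd

df = pd.DataFrame({"time": ["2016-01-01", "2016-03-15", "2017-01-01"]})

# Strings become datetime values
df["time"] = pd.to_datetime(df["time"])

years = df["time"].dt.year                          # parts come from .dt
shifted = df["time"] + pd.Timedelta(days=1)         # one day later
elapsed = df["time"] - pd.to_datetime("2016-1-1")   # time since a fixed date
span = df["time"].max() - df["time"].min()          # total time span

print(years.tolist())            # [2016, 2016, 2017]
print(shifted.iloc[0].date())    # 2016-01-02
print(elapsed.dt.days.tolist())  # [0, 74, 366]
print(span.days)                 # 366
```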

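3. Demonstration of the use of numpy and pandas

This demonstration focuses on indexing and slicing; the computing functions have all been explained above. In real data preprocessing, we filter the data we want out of dirty data and then generate a cleaned data file.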

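1. Data introduction

import pandas as pd
import numpy as np
df = pd.read_csv("student.csv")

Data file name: student.csv

Data layout, read in with pandas' pd.read_csv() method (the phone numbers in the figure are randomly generated; contact me for removal if they cause any problem):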

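2. Data extraction

(1) Column extraction

Single-column extraction

A single column is indexed by its name in columns; the result consists of the index plus all values of the selected column.

Multi-column extraction

Note that multi-column indexing needs two pairs of square brackets, [['学号','姓名']] rather than ['学号','姓名'] (the columns are the student ID and the name). The column names do not have to follow their original order: list them in whatever order you need, and the result follows the order in which you listed them.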
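Since student.csv itself is not included here, the sketch below uses a small hypothetical stand-in. Only the column names '学号' and '姓名' come from the text; the '年级' column and all values are invented.

```python
import pandas as pd

# Hypothetical stand-in for student.csv
df = pd.DataFrame({
    "学号": [1001, 1002, 1003],
    "姓名": ["A", "B", "C"],
    "年级": [1, 2, 3],
})

single = df["姓名"]          # one pair of brackets: a Series
multi = df[["姓名", "学号"]]  # two pairs: a DataFrame in the listed order

# One pair of brackets with two names is treated as a single key and fails
try:
    df["学号", "姓名"]
    failed = False
except KeyError:
    failed = True

print(type(single).__name__)  # Series
print(list(multi.columns))    # ['姓名', '学号']
print(failed)                 # True
```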

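(2) Column multi-row index

When no column name is given in front, all columns are shown. The number of rows is determined by the slice that follows, which obeys Python's slicing rule: the left end is included and the right end is excluded.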
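A sketch of this rule on a hypothetical six-row table. The column names '学号' and '姓名' are taken from the text; the values are invented.

```python
import pandas as pd

# Hypothetical stand-in for student.csv
df = pd.DataFrame({
    "学号": [1001, 1002, 1003, 1004, 1005, 1006],
    "姓名": ["A", "B", "C", "D", "E", "F"],
})

all_cols = df[:][1:4]        # no column given: every column, rows 1..3
one_col = df[["姓名"]][1:4]   # only the listed column, same rows

print(all_cols.index.tolist())    # [1, 2, 3]
print(list(all_cols.columns))     # ['学号', '姓名']
print(one_col["姓名"].tolist())    # ['B', 'C', 'D']
```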

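(3) Change index, use loc, iloc index

Changing the index in place is permanent, so before doing it you can copy the data into a new object to avoid "polluting" the original data. The same holds for changes to the data itself: they are permanent. When we need to add, delete, modify or query according to different requirements, working on a copy is therefore the best choice.

Dataframe slicing (rows first, then columns)
df.loc
df.loc[first row index label:last row index label, first column name:last column name]


df.iloc
df.iloc[start row position:end row position, start column position:end column position]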
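A sketch of working on a copy and then slicing by label and by position. The table is a hypothetical stand-in for student.csv; the '年级' column and all values are invented.

```python
import pandas as pd

# Hypothetical stand-in for student.csv
df = pd.DataFrame({
    "学号": [1001, 1002, 1003],
    "姓名": ["A", "B", "C"],
    "年级": [1, 2, 3],
})

# Work on a copy so the original index and columns stay untouched
work = df.copy().set_index("学号")

# loc: index labels and column names, both ends included
by_label = work.loc[1002:1003, "姓名":"年级"]

# iloc: integer positions, end excluded
by_pos = work.iloc[1:3, 0:2]

print(by_label.equals(by_pos))  # True
print(df.index.tolist())        # [0, 1, 2]
```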

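(4) Specific value extraction

Specific values are extracted with the syntax variable[variable['column name'] == 'filter condition'], where variable is the name of the DataFrame. Be careful: writing only variable['column name'] == 'filter condition' produces the output shown in the second figure below, because the system compares each row and returns the True/False results instead of the matching rows.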
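The difference between the two forms can be sketched on a hypothetical stand-in for student.csv (values invented):

```python
import pandas as pd

# Hypothetical stand-in for student.csv
df = pd.DataFrame({
    "学号": [1001, 1002, 1003],
    "姓名": ["A", "B", "C"],
})

# The comparison alone gives one True/False value per row
mask = df["姓名"] == "B"

# Using it inside df[...] keeps only the rows where it is True
rows = df[df["姓名"] == "B"]

print(mask.tolist())        # [False, True, False]
print(rows["学号"].tolist())  # [1002]
```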

 


Origin blog.csdn.net/Sheenky/article/details/124716291