Data analysis - pandas

Brief introduction

pandas is a powerful Python data analysis package built on NumPy. Largely because of pandas, Python has become one of the most widely used and capable data analysis environments.

Main features of pandas:

  1. Provides the Series and DataFrame data structures and their associated functions
  2. Integrated time series functionality
  3. A rich set of mathematical operations
  4. Flexible handling of missing data
  •  Installation:

>: pip install pandas

  • Import convention:

import pandas as pd

 Series

A Series object is similar to a one-dimensional array. It consists of a set of data and an associated set of labels (the index).

 Ways to create a Series

  • Basic creation

Printing a Series shows the index on the left and the values on the right. Because no index was specified for the data, pandas automatically creates an integer index from 0 to N-1 (where N is the length of the data), and values can be retrieved through that index.
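As a minimal sketch of basic creation (the sample values are made up for illustration), building a Series from a plain list might look like this:

```python
import pandas as pd

# No index is passed, so pandas assigns the integer index 0..N-1.
sr = pd.Series([4, 5, 6, 7, 8])

print(sr)      # index on the left, values on the right
print(sr[0])   # look up a value through the default index
```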

  •  Custom index, method 1

Here index is a list of labels, in this case strings. Values can still be retrieved by their default integer position as well.
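A small sketch of a custom string index, with illustrative data. It uses iloc for the positional lookup, on the assumption that a recent pandas version is in use, where plain integer keys on a string-indexed Series are no longer the recommended way to access by position:

```python
import pandas as pd

# Pass a list of labels as the index.
sr = pd.Series([4, 5, 6, 7, 8], index=["a", "b", "c", "d", "e"])

by_label = sr["a"]        # lookup by the custom label
by_position = sr.iloc[0]  # lookup by integer position
print(by_label, by_position)
```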

  •  Custom index, method 2

  •  Other ways to create a Series

 Create a Series whose values are all zero
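One way to do this, assuming the intent is a constant-filled Series, is to pass a scalar together with an index. The labels below are hypothetical:

```python
import pandas as pd

# A scalar value is repeated once for every label in the index.
sr = pd.Series(0, index=["a", "b", "c"])
print(sr)
```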

A Series can be thought of as a fixed-length, ordered dictionary, because it maps each index label to a data value. It can be used in many of the contexts where you would otherwise use a dictionary.

 Handling missing data

  • dropna()  # filters out rows whose value is NaN
  • fillna()  # fills in missing data
  • isnull()  # returns a Boolean array in which missing values are True
  • notnull()  # returns a Boolean array in which missing values are False

 Processing missing values

  •  Method 1: dropna

By default, dropna filters out the rows whose value is NaN without modifying the original data. If inplace=True is specified, the original data is modified instead.
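The difference between the default behavior and inplace=True can be sketched as follows, using made-up sample data:

```python
import numpy as np
import pandas as pd

sr = pd.Series([1.0, np.nan, 3.0], index=["a", "b", "c"])

# Default: returns a new Series and leaves sr untouched.
cleaned = sr.dropna()
length_before = len(sr)

# inplace=True: modifies sr itself and returns None.
sr.dropna(inplace=True)
length_after = len(sr)

print(length_before, length_after)
```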

  •  Method 2: fillna

 fillna replaces NaN with a given value (typically 0). The original data is not modified unless inplace=True is specified, in which case it is modified.
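A short sketch of fillna with illustrative data, showing that the original Series keeps its NaN unless inplace=True is passed:

```python
import numpy as np
import pandas as pd

sr = pd.Series([1.0, np.nan, 3.0])

# Returns a new Series with NaN replaced by 0; sr is unchanged.
filled = sr.fillna(0)

print(filled)
print(sr.isnull().sum())  # sr still has its missing value
```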

  • Checking for missing values: isnull, notnull
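As an illustration with made-up data, the two checks return complementary Boolean Series, and the result of notnull can be used as a filter:

```python
import numpy as np
import pandas as pd

sr = pd.Series([1.0, np.nan, 3.0])

missing = sr.isnull()    # True where the value is missing
present = sr.notnull()   # True where the value is present

# Boolean filtering keeps only the non-missing values.
only_present = sr[present]
print(only_present)
```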

 Series features

Because pandas is built on NumPy, a Series supports the features of an ndarray:

  • Creating a Series from an ndarray: Series(arr)
  • Arithmetic with a scalar (number): sr * 2
  • Arithmetic between two Series
  • Universal functions: np.abs(sr)
  • Boolean filtering: sr[sr>0]
  • Statistical functions: mean(), sum(), cumsum()
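The ndarray-style features listed above can be sketched together like this (the array contents are arbitrary sample values):

```python
import numpy as np
import pandas as pd

arr = np.array([-2, -1, 1, 2])
sr = pd.Series(arr)        # create a Series from an ndarray

doubled = sr * 2           # arithmetic with a scalar
summed = sr + sr           # arithmetic between two Series
absolute = np.abs(sr)      # NumPy universal function
positive = sr[sr > 0]      # Boolean filtering

# Statistical functions
print(sr.mean(), sr.sum())
print(sr.cumsum())
```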

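A Series also supports dictionary-like features:

  • Creating a Series from a dictionary: Series(dic)
  • The in operator and iteration: 'a' in sr, for x in sr
  • Key indexing: sr['a'], sr[['a','b','d']]
  • Key slicing: sr['a':'c']
  • Other functions: get('a', default=0), and so on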
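A brief sketch of the dictionary-like features, with a hypothetical dictionary. Two details worth noting: iterating over a Series yields its values rather than its keys, and a label slice includes its end label:

```python
import pandas as pd

dic = {"a": 1, "b": 2, "c": 3, "d": 4}
sr = pd.Series(dic)                 # keys become the index

has_a = "a" in sr                   # membership tests the index labels
values = [x for x in sr]            # iteration yields the values

single = sr["a"]                    # key indexing
subset = sr[["a", "b", "d"]]        # indexing with a list of keys
sliced = sr["a":"c"]                # key slicing, end label included
fallback = sr.get("z", default=0)   # default returned for a missing key

print(has_a, single, fallback)
```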

 Selecting values by index

  • loc attribute  # interprets the key as a label
  • iloc attribute  # interprets the key as an integer position
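The distinction matters most when the index labels are themselves integers. A minimal sketch, using a deliberately reversed integer index chosen for illustration:

```python
import pandas as pd

# The labels are integers that do not match the positions.
sr = pd.Series([10, 20, 30], index=[2, 1, 0])

by_label = sr.loc[0]      # the value whose label is 0
by_position = sr.iloc[0]  # the value at position 0

print(by_label, by_position)
```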

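 Series data alignment

When pandas performs arithmetic, it first aligns the operands by index and then computes. If a label appears in only one of the operands, the result for that label is NaN.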
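To illustrate alignment, here is a sketch with two hypothetical Series whose indexes are ordered differently and do not fully overlap:

```python
import pandas as pd

sr1 = pd.Series([12, 23, 34], index=["c", "a", "d"])
sr2 = pd.Series([11, 20, 10, 14], index=["d", "c", "a", "b"])

# Values are matched by label, not by position.
# Label "b" exists only in sr2, so its result is NaN.
total = sr1 + sr2
print(total)
```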

 To treat missing values as 0 when adding two Series objects:
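A minimal sketch using the add method with fill_value. The data is hypothetical, chosen so that label b exists only in the second Series:

```python
import pandas as pd

sr1 = pd.Series([12, 23, 34], index=["c", "a", "d"])
sr2 = pd.Series([11, 20, 10, 14], index=["d", "c", "a", "b"])

# fill_value=0 substitutes 0 for a label missing from one operand.
total = sr1.add(sr2, fill_value=0)
print(total)
```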

 Because missing values are treated as 0, the result for index b comes out to 14.

Note: the flexible arithmetic methods are add, sub, div, and mul.

DataFrame

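A DataFrame is a tabular data structure, comparable to a two-dimensional array, that contains an ordered set of columns. It can be viewed as a dictionary of Series that all share one index.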

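 Ways to create a DataFrame

  • Method 1

 The resulting DataFrame is automatically given an integer row index. In older pandas versions the columns were arranged in sorted order; recent versions keep the order in which the columns were supplied.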
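Assuming method 1 refers to passing a dictionary of equal-length lists (the column names and values here are made up), a sketch might look like this:

```python
import pandas as pd

# Each key becomes a column; each list supplies that column's values.
df = pd.DataFrame({"one": [1, 2, 3], "two": [4, 5, 6]})

print(df)
print(df.index)  # automatically assigned integer row index
```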

  •  Method 2:

 Custom row index, derived from the custom indexes of the Series
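Assuming method 2 builds the DataFrame from a dictionary of Series that carry their own custom indexes, it can be sketched as follows. The data is hypothetical; the columns are aligned by label, and a label missing from one Series produces NaN:

```python
import pandas as pd

df = pd.DataFrame({
    "one": pd.Series([1, 2, 3], index=["a", "b", "c"]),
    "two": pd.Series([1, 2, 3, 4], index=["b", "a", "c", "d"]),
})

# The row index is the union of the two Series indexes.
print(df)
```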

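 Viewing data

Common attributes and methods:

  • index: gets the row index
  • columns: gets the column index
  • T: transposes the DataFrame
  • values: gets the values as an array
  • describe: gets quick summary statistics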
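These attributes and methods can be tried on a small DataFrame; the one below is made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({"one": [1, 2, 3], "two": [4, 5, 6]}, index=["a", "b", "c"])

print(df.index)       # row labels
print(df.columns)     # column labels
print(df.T)           # rows and columns swapped
print(df.values)      # underlying two-dimensional array

stats = df.describe() # count, mean, std, min, quartiles, max per column
print(stats)
```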

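 Indexing and slicing

  • A DataFrame has both a row index and a column index.
  • A DataFrame can likewise be indexed and sliced in two ways: by label and by position.

Indexing and slicing a DataFrame:

  • Method 1: two pairs of square brackets, selecting the column first and then the row. df['A'][0]
  • Method 2 (recommended): use the loc/iloc attributes with one pair of square brackets and a comma, selecting the row first and then the column.
    • loc attribute: interprets keys as labels
    • iloc attribute: interprets keys as integer positions
  • When writing values into a DataFrame, use only method 2.
  • The row and column parts can each be a regular index, a slice, a Boolean index, or a fancy index, in any combination. (Note: when both parts are fancy indexes, the result may differ from what you expect.)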
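The two methods can be compared on a small example. This is a sketch with hypothetical labels; it uses string row labels so that the difference between loc and iloc is visible:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]}, index=["x", "y", "z"])

v1 = df["A"]["x"]         # method 1: column first, then row
v2 = df.loc["x", "A"]     # method 2 by label: row first, then column
v3 = df.iloc[0, 0]        # method 2 by position

# Mixed selectors: a label slice for rows, a list for columns.
block = df.loc["x":"y", ["A", "B"]]

# A Boolean index for rows, a regular index for the column.
filtered = df.loc[df["A"] > 1, "B"]

# Writing a value uses method 2.
df.loc["x", "A"] = 100
print(df)
```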

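 Common ways to read and write data

  • read_<file format>: reads data, for example read_csv

  •  head: returns the specified number of rows from the top

  •  to_<file format>: saves data, for example to_csv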
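A self-contained sketch of the read, head, and save steps using CSV. To avoid depending on a file on disk, it reads from and writes to in-memory text buffers; the column names and rows are invented for illustration:

```python
import io

import pandas as pd

csv_text = "name,score\nann,90\nbob,85\ncat,70\n"

# read_csv accepts a path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

# head(n) returns the first n rows.
top = df.head(2)
print(top)

# to_csv writes the data back out; index=False omits the row index.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
print(buffer.getvalue())
```

With real files, a path string would take the place of the buffers.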

Origin www.cnblogs.com/waller/p/11978921.html