Use of pandas module for data analysis (1)

What is Pandas?

  • The name Pandas comes from panel data and Python data analysis.

  • Pandas is a powerful tool set for analyzing structured data. It is built on NumPy and provides advanced data structures and data manipulation tools. It is one of the important factors that make Python a powerful and efficient data analysis environment.

  • A powerful set of tools needed to analyze and manipulate large structured data sets

  • The foundation is NumPy, which provides high-performance matrix operations

  • Provides a large number of functions and methods to process data quickly and conveniently

  • Widely used in data mining and data analysis

  • Provides data-cleaning functions

pandas official website: http://pandas.pydata.org

Common data types of pandas:

  • Series: a one-dimensional labeled array
  • DataFrame: a two-dimensional labeled table, a container of Series

Series:

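A Series is an object similar to a one-dimensional array. It consists of a set of data (any NumPy data type) and a set of
index labels associated with that data.

An object similar to a one-dimensional array
Made up of data and an index
  The index is shown on the left, the values on the right
  If no index is given, one is created automatically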

Series creation:

In [1]: import pandas as pd

In [2]: import string

In [3]: import numpy as np
                                        
In [4]: t = pd.Series(np.arange(10),index=list(string.ascii_uppercase[:10]))  # create a Series with an explicit index

In [5]: t
Out[5]: 
A    0
B    1
C    2
D    3
E    4
F    5
G    6
H    7
I    8
J    9
dtype: int64

In [6]: type(t)
Out[6]: pandas.core.series.Series

---------------------------------


In [7]: a = {string.ascii_uppercase[i]: i for i in range(10)}  # build a dict with a dict comprehension

In [8]: a
Out[8]: 
{'A': 0,
 'B': 1,
 'C': 2,
 'D': 3,
 'E': 4,
 'F': 5,
 'G': 6,
 'H': 7,
 'I': 8,
 'J': 9}

In [9]: pd.Series(a)  # create a Series from a dict; the dict keys become the index
Out[9]: 
A    0
B    1
C    2
D    3
E    4
F    5
G    6
H    7
I    8
J    9
dtype: int64

In [10]: pd.Series(a,index=list(string.ascii_uppercase[5:15]))  # new index for the dict: labels found in the dict keep their value, the others become NaN
Out[10]: 
F    5.0
G    6.0
H    7.0
I    8.0
J    9.0
K    NaN
L    NaN
M    NaN
N    NaN
O    NaN
dtype: float64


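Why is the dtype float?
In NumPy, nan is a float, so pandas automatically changes the dtype of the Series to fit the data.
To change the dtype yourself, use the same method as in NumPy,
for example: pd.Series(range(10)).astype(float)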
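
The dtype change described above can be checked with a short script. This is a minimal sketch that rebuilds the dict from the session; the variable names s, full and as_float are only illustrative.

```python
import string

import numpy as np
import pandas as pd

# Dict with keys 'A'..'J' mapped to 0..9, as in the session above.
a = {string.ascii_uppercase[i]: i for i in range(10)}

# Labels F..J exist in the dict; K..O do not, so they become NaN.
s = pd.Series(a, index=list(string.ascii_uppercase[5:15]))
print(s.dtype)         # float64, because NaN is a float

# With no missing labels the integer dtype is kept.
full = pd.Series(a)
print(full.dtype)

# astype converts explicitly, just as it does on a NumPy array.
as_float = full.astype(float)
print(as_float.dtype)  # float64
```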

Series slice and index:

In [15]: t
Out[15]: 
A    0
B    1
C    2
D    3
E    4
F    5
G    6
H    7
I    8
J    9
dtype: int64

In [16]: t[2::2]  # slicing
Out[16]: 
C    2
E    4
G    6
I    8
dtype: int64

In [17]: t[2]  # get a value by position
Out[17]: 2

In [18]: t[[2,3,6]]  # take non-adjacent values by position
Out[18]: 
C    2
D    3
G    6
dtype: int64

In [19]: t[t>4]       # boolean indexing
Out[19]: 
F    5
G    6
H    7
I    8
J    9
dtype: int64

In [20]: t["F"]      # get a value by label (key)
Out[20]: 5

In [22]: t[["A","F"]]  # take non-adjacent values by label
Out[22]: 
A    0
F    5
dtype: int64

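Slicing: pass start, end, or step directly.
Indexing: for a single element pass its position or its index label; for several elements pass a list of positions or a list of labels.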
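
The same selections can also be written with the explicit accessors .iloc (by position) and .loc (by label). As far as I know, recent pandas releases deprecate integer-position lookups through plain square brackets on a Series with a non-integer index, so the explicit form may be the safer habit. The sketch below rebuilds t from the session; the result variable names are illustrative.

```python
import string

import numpy as np
import pandas as pd

t = pd.Series(np.arange(10), index=list(string.ascii_uppercase[:10]))

by_position = t.iloc[2]            # third element, regardless of its label
by_label = t.loc["F"]              # element whose label is "F"
several_pos = t.iloc[[2, 3, 6]]    # non-adjacent positions
several_lab = t.loc[["A", "F"]]    # non-adjacent labels
pos_slice = t.iloc[2::2]           # positional slice, end excluded as in Python
label_slice = t.loc["C":"E"]       # label slice, end label included

print(by_position, by_label)
print(label_slice)
```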

Index and value of Series:

In [25]: t.index
Out[25]: Index(['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'], dtype='object')

In [26]: t.values
Out[26]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [27]: type(t.values)
Out[27]: numpy.ndarray

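A Series is essentially made of two arrays: one holds the keys (the index)
and the other holds the values, so each key maps to a value.
Many ndarray methods also work on a Series, for example argmax and clip.
Series has a where method, but its result differs from that of NumPy's where.

For example: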
 In [30]: t.where(t>2)
 Out[30]: 
 A    NaN
 B    NaN
 C    NaN
 D    3.0
 E    4.0
 F    5.0
 G    6.0
 H    7.0
 I    8.0
 J    9.0
 dtype: float64
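
To make the difference concrete, here is a small sketch that puts Series.where next to np.where and also calls clip. It rebuilds t from the session; the names kept, replaced, positions and clipped are illustrative.

```python
import string

import numpy as np
import pandas as pd

t = pd.Series(np.arange(10), index=list(string.ascii_uppercase[:10]))

# Series.where keeps values where the condition is True and puts NaN elsewhere,
# so the result has the same length and index as the original.
kept = t.where(t > 2)

# A second argument supplies the replacement value instead of NaN.
replaced = t.where(t > 2, -1)

# np.where with only a condition returns the positions where it is True.
positions = np.where(t.values > 2)

# clip limits the values to the range [3, 6].
clipped = t.clip(3, 6)

print(kept)
print(positions[0])
```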

pandas reads external data:

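Our data is stored in a CSV file, so we can read it directly with pd.read_csv.

The result is not quite what we might expect: we might assume it is a Series, but it is actually a DataFrame,
so next we look at that data type.

How do we use data that is stored in a database such as MySQL or MongoDB?

Read from MySQL:   pd.read_sql(sql_sentence,connection)

Read from MongoDB: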

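from pymongo import MongoClient
import pandas as pd

client = MongoClient()  # connect to a local MongoDB with the default settings
collection = client["douban"]["t1"]  # database "douban", collection "t1"

data = list(collection.find())  # fetch every document as a dict

t1 = data[0]  # take the first document

t1 = pd.Series(t1)  # the dict keys become the Series index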
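
Since the CSV file and the databases are not available here, the following sketch exercises pd.read_csv and pd.read_sql on stand-ins: an in-memory CSV string and an in-memory SQLite database instead of MySQL. The table name dogs, the column name uses and the two sample rows are assumptions made for illustration only.

```python
import io
import sqlite3

import pandas as pd

# read_csv accepts a path or any file-like object; StringIO stands in for a file.
csv_text = "Row_Labels,Count_AnimalName\nBELLA,1195\nMAX,1153\n"
df_csv = pd.read_csv(io.StringIO(csv_text))
print(type(df_csv))  # a DataFrame, not a Series

# read_sql takes a query string and a connection; SQLite stands in for MySQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dogs (name TEXT, uses INTEGER)")
conn.executemany("INSERT INTO dogs VALUES (?, ?)", [("BELLA", 1195), ("MAX", 1153)])
conn.commit()
df_sql = pd.read_sql("SELECT name, uses FROM dogs ORDER BY name", conn)
conn.close()
print(df_sql)
```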

DataFrame:

In [31]: t = pd.DataFrame(np.arange(12).reshape(3,4))

In [32]: t
Out[32]: 
   0  1   2   3
0  0  1   2   3
1  4  5   6   7
2  8  9  10  11

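A DataFrame has both a row index and a column index.
Row index: identifies the rows; its labels run down the left side; called index; axis 0 (axis=0)
Column index: identifies the columns; its labels run across the top; called columns; axis 1 (axis=1)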

In [35]: t = pd.DataFrame(np.arange(12).reshape(3,4),index=list("ABC"),columns=list("WXYZ"))

In [36]: t
Out[36]: 
   W  X   Y   Z
A  0  1   2   3
B  4  5   6   7
C  8  9  10  11

Passing a dict to DataFrame:

In [37]: data = {"name":["xiaowang","xiaoli"],"age":[13,14],"id":[1001,1002]}

In [38]: data
Out[38]: {'name': ['xiaowang', 'xiaoli'], 'age': [13, 14], 'id': [1001, 1002]}

In [39]: pd.DataFrame(data)
Out[39]: 
       name  age    id
0  xiaowang   13  1001
1    xiaoli   14  1002


This means data from MongoDB can be passed straight into a DataFrame.

A list of dicts:

In [46]: data = [{"age":18,"id":1001,"name":"aa"},{"name":"bb","id":1002},
    ...  {"name":"cc","age":16}]

In [47]: data
Out[47]: 
[{'age': 18, 'id': 1001, 'name': 'aa'},
 {'name': 'bb', 'id': 1002},
 {'name': 'cc', 'age': 16}]

In [48]: pd.DataFrame(data)
Out[48]: 
    age      id name
0  18.0  1001.0   aa
1   NaN  1002.0   bb
2  16.0     NaN   cc
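
The list-of-dicts case can be reproduced as a script. The sketch below reuses the records from the session and additionally passes columns= to fix the column order, which is my addition rather than something the session shows.

```python
import pandas as pd

records = [
    {"age": 18, "id": 1001, "name": "aa"},
    {"name": "bb", "id": 1002},
    {"name": "cc", "age": 16},
]

# Keys missing from a record become NaN, which turns those columns into float.
df = pd.DataFrame(records)
print(df)
print(df.dtypes)

# columns= selects the columns and sets their order explicitly.
df_cols = pd.DataFrame(records, columns=["name", "age", "id"])
print(df_cols)
```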

Basic properties of DataFrame:

df.shape  # number of rows and columns
df.dtypes  # data type of each column
df.ndim  # number of dimensions
df.index  # row index
df.columns  # column index
df.values  # the values, as a two-dimensional ndarray

DataFrame overall situation query:

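df.head(3) # show the first rows; 5 by default
df.tail(3) # show the last rows; 5 by default
df.info()  # overview: number of rows and columns, column index, non-null count per column, column dtypes, memory usage
df.describe()  # quick summary statistics: count, mean, standard deviation, minimum, quartiles, maximum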
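
A quick way to see these attributes and methods together is to run them on the small 3x4 frame used earlier. This is only a sketch; the variable name summary is illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(3, 4), index=list("ABC"), columns=list("WXYZ"))

print(df.shape)          # (3, 4)
print(df.ndim)           # 2
print(list(df.index))    # row labels
print(list(df.columns))  # column labels
print(type(df.values))   # a NumPy ndarray

print(df.head(2))        # first two rows
print(df.tail(1))        # last row
df.info()                # prints its overview directly and returns None

summary = df.describe()  # one column of statistics per numeric column
print(summary)
```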

Commonly used statistical description methods:

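Method                 Description
count                  Number of non-NA values
describe               Summary statistics for a Series or for each DataFrame column
min, max               Minimum and maximum values
argmin, argmax         Integer positions of the minimum and maximum values
idxmin, idxmax         Index labels of the minimum and maximum values
quantile               Sample quantile (between 0 and 1)
sum                    Sum of the values
mean                   Mean of the values
median                 Median of the values (50% quantile)
mad                    Mean absolute deviation from the mean (removed in pandas 2.0)
var                    Sample variance
std                    Sample standard deviation
skew                   Sample skewness (third moment)
kurt                   Sample kurtosis (fourth moment)
cumsum                 Cumulative sum
cummin, cummax         Cumulative minimum and cumulative maximum
cumprod                Cumulative product
diff                   First-order difference (useful for time series)
pct_change             Percentage change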
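
A few of the methods in the table, applied to a small made-up Series. The data and the variable names are assumptions for illustration; the mean absolute deviation is computed by hand so the sketch does not depend on the mad method.

```python
import pandas as pd

s = pd.Series([3, 1, 4, 1, 5], index=list("abcde"))

print(s.count())      # number of non-NA values
print(s.argmax())     # integer position of the maximum
print(s.idxmax())     # index label of the maximum
print(s.median())     # 50% quantile

running_total = s.cumsum()
running_max = s.cummax()
step = s.diff()            # first value is NaN: nothing precedes it
change = s.pct_change()    # relative change from the previous value

# Mean absolute deviation from the mean, written out explicitly.
mean_abs_dev = (s - s.mean()).abs().mean()
print(running_total.tolist(), running_max.tolist(), mean_abs_dev)
```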

Sort:

# by names the column to sort on; ascending defaults to True (ascending order).
df.sort_values(by="Count_AnimalName",ascending=False)

Take row or column:

df_sorted = df.sort_values(by="Count_AnimalName")


Select rows:   df_sorted[:100]


How do we select one particular column?         df["Count_AnimalName"]
How do we select rows and a column together?    df[:100]["Count_AnimalName"]
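
Sorting and the bracket selections above, sketched on a four-row stand-in for the dog-name data (the real CSV is not available here, so the frame is built inline from counts quoted elsewhere in this post).

```python
import pandas as pd

df = pd.DataFrame({
    "Row_Labels": ["BELLA", "COCO", "MAX", "ROCKY"],
    "Count_AnimalName": [1195, 852, 1153, 823],
})

# Sort by count, largest first; sort_values returns a new DataFrame.
df_sorted = df.sort_values(by="Count_AnimalName", ascending=False)

top2 = df_sorted[:2]                      # a slice in brackets selects rows
counts = df["Count_AnimalName"]           # a string in brackets selects a column
top2_names = df_sorted[:2]["Row_Labels"]  # rows first, then the column

print(top2)
print(top2_names.tolist())
```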

pandas loc:

df.loc selects data by label.


In [50]: t
Out[50]: 
   W  X   Y   Z
A  0  1   2   3
B  4  5   6   7
C  8  9  10  11

In [51]: t.loc["A","W"]  # a single element
Out[51]: 0

In [52]: t.loc["A",["W","Z"]]  # one row and several non-adjacent columns
Out[52]: 
W    0
Z    3
Name: A, dtype: int64

In [53]: type(t.loc[["A"],["W","Z"]])  # the result is a DataFrame
Out[53]: pandas.core.frame.DataFrame

In [54]: t.loc[["A","C"],["W","Z"]]   # several non-adjacent rows and columns
Out[54]: 
   W   Z
A  0   3
C  8  11

In [55]: t.loc["A":,["W","Z"]]    # several rows and columns (row "A" onward)
Out[55]: 
   W   Z
A  0   3
B  4   7
C  8  11

In [56]: t.loc["A":"C",["W","Z"]]   # several rows and columns (rows "A" through "C")
Out[56]: 
   W   Z
A  0   3
B  4   7
C  8  11


Slices in loc are closed: the label after the colon is included in the selection.

iloc of pandas:

df.iloc selects data by position.

In [57]: t
Out[57]: 
   W  X   Y   Z
A  0  1   2   3
B  4  5   6   7
C  8  9  10  11

In [58]: t.iloc[1:3,[2,3]]
Out[58]: 
    Y   Z
B   6   7
C  10  11

In [59]: t.iloc[1:3,1:3]
Out[59]: 
   X   Y
B  5   6
C  9  10

In [60]: t.loc["A","Y"]=100  # change a value

In [61]: t
Out[61]: 
   W  X    Y   Z
A  0  1  100   3
B  4  5    6   7
C  8  9   10  11

In [62]: t.iloc[1:2,0:2] = np.nan  # pandas converts the affected columns to float automatically here

In [63]: t
Out[63]: 
     W    X    Y   Z
A  0.0  1.0  100   3
B  NaN  NaN    6   7
C  8.0  9.0   10  11
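
One point that is easy to trip over is that loc slices include their end label while iloc slices exclude their end position. The sketch below shows both on the same frame and repeats an assignment through each accessor; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

t = pd.DataFrame(np.arange(12).reshape(3, 4), index=list("ABC"), columns=list("WXYZ"))

# Label slice: both "A" and "B" are included.
by_label = t.loc["A":"B"]

# Position slice: position 0 is included, position 1 is not.
by_position = t.iloc[0:1]

print(by_label.shape)     # (2, 4)
print(by_position.shape)  # (1, 4)

# Both accessors can also be used on the left-hand side to change data.
t.loc["A", "Y"] = 100
t.iloc[2, 3] = -1
print(t)
```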

Boolean index of pandas:

In the given data set, suppose we want to find every dog name that is used more than 800 times. How should we select them?

Data source: https://www.kaggle.com/new-york-city/nyc-dog-names/data


In [67]: df = pd.read_csv("./笔记/python/dogNames2.csv")

In [68]: df[800<df["Count_AnimalName"]]
Out[68]: 
      Row_Labels  Count_AnimalName
1156       BELLA              1195
2660     CHARLIE               856
3251        COCO               852
9140         MAX              1153
12368      ROCKY               823


Back to the dog names: suppose we want every name that is used more than 700 times and is longer than 4 characters.
How should we select them?

In [71]: df[(df["Row_Labels"].str.len()>4) & (df["Count_AnimalName"]>700)]
Out[71]: 
      Row_Labels  Count_AnimalName
1156       BELLA              1195
2660     CHARLIE               856 
8552       LUCKY               723
12368      ROCKY               823
------------------------------------
& means "and", | means "or"

Note: each condition must be wrapped in its own parentheses
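
The same filters can be tried without the CSV file. The sketch below uses a six-row stand-in built from the counts shown in the outputs above, and adds | and ~ (negation) as further illustrations; the result variable names are my own.

```python
import pandas as pd

df = pd.DataFrame({
    "Row_Labels": ["BELLA", "CHARLIE", "COCO", "MAX", "LUCKY", "ROCKY"],
    "Count_AnimalName": [1195, 856, 852, 1153, 723, 823],
})

# One condition: names used more than 800 times.
popular = df[df["Count_AnimalName"] > 800]

# Two conditions joined with & (and); each one sits in its own parentheses.
long_and_popular = df[(df["Row_Labels"].str.len() > 4) & (df["Count_AnimalName"] > 700)]

# | (or): names of at most 3 characters, or names used more than 1000 times.
short_or_top = df[(df["Row_Labels"].str.len() <= 3) | (df["Count_AnimalName"] > 1000)]

# ~ negates a condition.
not_popular = df[~(df["Count_AnimalName"] > 800)]

print(long_and_popular)
```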

pandas str string method:

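Method                   Description
cat                      Element-wise string concatenation, with an optional separator
contains                 Boolean array indicating whether each string contains the given pattern
count                    Number of occurrences of the pattern
endswith, startswith     Equivalent to x.endswith(pattern) or x.startswith(pattern) for each element
findall                  List of all matches of the pattern in each string
get                      The i-th character of each element
join                     Join the strings in each element of the Series with the given separator
len                      Length of each string
lower, upper             Convert case; equivalent to x.lower() or x.upper() for each element
match                    Apply re.match to each element with the given regular expression
pad                      Add whitespace to the left, right, or both sides of the strings
center                   Equivalent to pad(side='both')
repeat                   Repeat values; for example, s.str.repeat(3) is equivalent to x*3 for each string
replace                  Replace occurrences of the pattern with the given string
slice                    Take a substring of each string in the Series
split                    Split strings on a separator or regular expression
strip, rstrip, lstrip    Remove whitespace, including newlines; equivalent to x.strip(), x.rstrip(), x.lstrip() for each element


Example: df["Row_Labels"].str.len()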
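
Several of the str methods from the table, applied to a small made-up Series. The sample strings and variable names are assumptions for illustration.

```python
import pandas as pd

names = pd.Series(["  Bella ", "charlie", "Max,Rocky"])

stripped = names.str.strip()                 # remove surrounding whitespace
lengths = stripped.str.len()                 # length of each string
upper = stripped.str.upper()                 # upper-case every element
has_ll = stripped.str.contains("ll")         # boolean Series
first_char = stripped.str.get(0)             # first character of each element
parts = stripped.str.split(",")              # list of pieces per element
joined = stripped.str.cat(sep="|")           # one string built from all elements
replaced = stripped.str.replace(",", " & ", regex=False)

print(lengths.tolist())
print(parts.tolist())
print(joined)
```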

Processing of missing data:

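Missing data usually comes in two forms:
one is an empty value, None, and so on, which pandas represents as NaN (the same as np.nan);
the other is a value that was deliberately set to 0.

How did we handle NaN data in NumPy?
In pandas it is very easy to handle.

Check whether data is NaN: pd.isnull(df), pd.notnull(df)

Option 1: drop the rows or columns that contain NaN: dropna(axis=0, how='any', inplace=False)
Option 2: fill in the data: t.fillna(t.mean()), t.fillna(t.median()), t.fillna(0)


Handling data that is 0: t[t==0]=np.nan
Of course, not every 0 needs to be treated this way.
When computing the mean and similar statistics, NaN values are left out of the calculation, but 0 values are included.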

-------------------------------
In [76]: t2
Out[76]: 
    age      id name
0  18.0  1001.0   aa
1   NaN  1002.0   bb
2  16.0     NaN   cc

In [77]: pd.isnull(t2)     # True where the value is NaN
Out[77]: 
     age     id   name
0  False  False  False
1   True  False  False
2  False   True  False
 
In [78]: pd.notnull(t2)   # True where the value is not NaN
Out[78]: 
     age     id  name
0   True   True  True
1  False   True  True
2   True  False  True

In [81]: t2[pd.notnull(t2["age"])]  # rows whose age is not NaN
Out[81]: 
    age      id name
0  18.0  1001.0   aa
2  16.0     NaN   cc

In [82]: t2.dropna(axis=0)      # drop the rows that contain NaN
Out[82]: 
    age      id name
0  18.0  1001.0   aa

In [83]: t2.dropna(axis=0,how="all") # "all" drops a row only when every value in it is NaN
Out[83]: 
    age      id name
0  18.0  1001.0   aa
1   NaN  1002.0   bb
2  16.0     NaN   cc

In [84]: t2.dropna(axis=0,how="any") # "any" is the default value of how
Out[84]: 
    age      id name
0  18.0  1001.0   aa


In [87]: t2.dropna(axis=0,how="any",inplace=True)  # inplace=True modifies t2 itself; by default a new DataFrame is returned instead

In [88]: t2
Out[88]: 
    age      id name
0  18.0  1001.0   aa


In [91]: t2
Out[91]: 
    age      id name
0  18.0  1001.0   aa
1   NaN  1002.0   bb
2  16.0     NaN   cc

In [92]: t2.fillna(0)     # fill NaN with 0
Out[92]: 
    age      id name
0  18.0  1001.0   aa
1   0.0  1002.0   bb
2  16.0     0.0   cc

In [93]: t2.fillna(t2.mean(numeric_only=True)) # fill NaN with the mean of its column (numeric columns only)
Out[93]: 
    age      id name
0  18.0  1001.0   aa
1  17.0  1002.0   bb
2  16.0  1001.5   cc

In [94]: t2["age"] = t2["age"].fillna(t2["age"].mean()) # fill NaN in the "age" column only

In [95]: t2
Out[95]: 
    age      id name
0  18.0  1001.0   aa
1  17.0  1002.0   bb
2  16.0     NaN   cc
-------------------------------
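
The steps of the session above, collected into one runnable sketch. It rebuilds t2 inline and ends with a short check of the remark that NaN is skipped by mean while 0 is not; the scores Series and the result variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

t2 = pd.DataFrame({
    "age": [18, np.nan, 16],
    "id": [1001, 1002, np.nan],
    "name": ["aa", "bb", "cc"],
})

mask = t2.isnull()                             # True where a value is NaN
complete_rows = t2.dropna(axis=0, how="any")   # keep rows without any NaN
filled_zero = t2.fillna({"age": 0, "id": 0})   # fill the numeric columns with 0
filled_mean = t2.fillna(t2.mean(numeric_only=True))  # fill with each column's mean

# Fill a single column: call fillna on that column, not on the whole frame.
t2["age"] = t2["age"].fillna(t2["age"].mean())

# NaN is skipped by mean(); 0 is counted.
scores = pd.Series([0, 10, 20])
scores_nan = scores.replace(0, np.nan)
print(scores.mean(), scores_nan.mean())
```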

pandas commonly used statistical methods:

  • Suppose we have data on the 1,000 most popular movies from 2006 to 2016. How do we get the average rating, the number of directors, and similar information?

  • Data source: https://www.kaggle.com/damianpanek/sunday-eda/data

Data style:

Rank,Title,Genre,Description,Director,Actors,Year,Runtime (Minutes),Rating,Votes,Revenue (Millions),Metascore
1,Guardians of the Galaxy,"Action,Adventure,Sci-Fi",A group of intergalactic criminals are forced to work together to stop a fanatical warrior from taking control of the universe.,James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe Saldana",2016,108,7.2,60545,270.32,59

Code:

import pandas as pd

path = "./IMDB-Movie-Data.csv"

# Read the data
df = pd.read_csv(path)
# Basic information about the data (info() prints directly and returns None)
df.info()
# Look at the first row to get a feel for the data
print(df.head(1))

# Mean rating
print(df["Rating"].mean())

# Number of directors: convert the column to a list, then to a set to drop duplicates, and take its length
print(len(set(df["Director"].tolist())))
print(len(df["Director"].unique()))

# Number of actors: split each row into a list, then flatten the nested list with a double for in a list comprehension
temp_list = df["Actors"].str.split(", ").tolist()
temp_list = [j for i in temp_list for j in i]
print(len(set(temp_list)))

# Maximum, minimum, and median runtime; argmax/argmin give the position of the row
print(df["Runtime (Minutes)"].max())
print(df["Runtime (Minutes)"].argmax())
print(df["Runtime (Minutes)"].min())
print(df["Runtime (Minutes)"].argmin())
print(df["Runtime (Minutes)"].median())

  • For this movie data, how should we present the distribution of ratings and of runtime?

runtime code:

import pandas as pd
from matplotlib import pyplot as plt


path = "./IMDB-Movie-Data.csv"
# Prepare the data
df = pd.read_csv(path)

# Extract the runtime column as an ndarray
runtime_data = df["Runtime (Minutes)"].values

max_runtime = runtime_data.max()
min_runtime = runtime_data.min()

# Bin width d and the resulting number of bins
d = 5
bin_nums = (max_runtime - min_runtime)//d

# Set the figure size
plt.figure(figsize=(20,8),dpi=80)


# Draw the histogram
plt.hist(runtime_data,bin_nums)


# Set the x ticks, one every d minutes
x_ticks = [i for i in range(min_runtime,max_runtime+d,d)]
plt.xticks(x_ticks)

# Add axis labels
plt.xlabel("Runtime (Minutes)")
plt.ylabel("total numbers")

# Show the figure
plt.show()

Effect picture: [histogram of Runtime (Minutes); image not available]

rating code:

import pandas as pd
from matplotlib import pyplot as plt

# Prepare the data
df = pd.read_csv("./IMDB-Movie-Data.csv")

# print(df.head(1))
# print(df.info())

# Extract the rating column as an ndarray
rating_info = df["Rating"]
data_list = rating_info.values

# Compute the maximum and minimum
max_data = data_list.max()
min_data = data_list.min()
# print(max_data,min_data)


# d = 20

# num_bin = (max_data-min_data)//0.5

# Use unevenly spaced bin edges: one wide bin from 1.9 to 3.5, then steps of 0.5
num_bin = [1.9,3.5]
i = 3.5

# Append the remaining bin edges in a loop
while i<=9.0:
    i += 0.5
    num_bin.append(i) 

print(num_bin)

# Set the figure size
plt.figure(figsize=(20,8),dpi=80)

# Draw the histogram
plt.hist(data_list,num_bin)


# Set the x ticks to match the bin edges passed to hist
plt.xticks(num_bin)


# Show the figure
plt.show()

Effect picture: [histogram of Rating; image not available]

  • For this movie data, how should we process the data to count the movies in each genre?
  • Idea: build an array of all zeros whose column names are the genres; whenever a genre appears in a row, change that row's 0 to 1 in the genre's column

Code:

from matplotlib import pyplot as plt
import numpy as np
import pandas as pd

path = "./IMDB-Movie-Data.csv"
# Prepare the data
df = pd.read_csv(path)

# print(df["Genre"].head(3))

# Split each Genre string into a list of genres
temp_list = df["Genre"].str.split(",").tolist()

# Build a list that contains every genre once
gen_list = list(set([j for i in temp_list for j in i]))

# Build an all-zero DataFrame whose columns are the genres
zero_df = pd.DataFrame(np.zeros((df.shape[0],len(gen_list))),columns=gen_list)

# Fill it in: set the genres of each movie to 1 in that movie's row
for i in range(df.shape[0]):

    # zero_df.loc[0,["Sci-Fi","Musical"]] = 1
    zero_df.loc[i,temp_list[i]] = 1

# Sum each column, which gives the number of movies per genre
total = zero_df.sum(axis=0)

# Sort
total = total.sort_values()

# Set the figure size
plt.figure(figsize=(20,8),dpi=80)

# Draw the bar chart
plt.bar(range(len(total)),total,width=0.3,color="orange")

# Set the x ticks; total.index holds the genre names
plt.xticks(range(len(total)),total.index)

# Show the figure
plt.show()

Effect picture: [bar chart of movie counts per genre, sorted in ascending order; image not available]
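
The zero-matrix idea can be checked on a three-row stand-in for the movie data; the Genre strings below are made up. The sketch also shows Series.str.get_dummies, which I believe builds the same indicator table in one call. That is an alternative I am adding, not the approach used in the post.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Genre": ["Action,Adventure,Sci-Fi", "Comedy", "Action,Comedy"]})

# Split each Genre string into a list and collect the distinct genres.
temp_list = df["Genre"].str.split(",").tolist()
gen_list = sorted({g for row in temp_list for g in row})

# All-zero frame: one row per movie, one column per genre.
zero_df = pd.DataFrame(np.zeros((df.shape[0], len(gen_list))), columns=gen_list)

# Set the genres of each movie to 1 in that movie's row.
for i in range(df.shape[0]):
    zero_df.loc[i, temp_list[i]] = 1

# Column sums give the number of movies per genre.
total = zero_df.sum(axis=0).sort_values()
print(total)

# Alternative: get_dummies splits on the separator and builds the indicators directly.
dummies = df["Genre"].str.get_dummies(sep=",")
total_alt = dummies.sum(axis=0)
print(total_alt)
```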

Origin blog.csdn.net/qq_46456049/article/details/108919194