Use of pandas module (3)

Add a column of data to the DataFrame:

In [16]: x = pd.DataFrame(np.arange(12).reshape(3,4),columns=list("QWER"))

In [17]: x
Out[17]: 
   Q  W   E   R
0  0  1   2   3
1  4  5   6   7
2  8  9  10  11

In [18]: x["O"]= pd.DataFrame(np.arange(3).reshape(3,1))

In [19]: x
Out[19]: 
   Q  W   E   R  O
0  0  1   2   3  0
1  4  5   6   7  1
2  8  9  10  11  2
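
The same column can be added from other kinds of values. The following is a minimal sketch, assuming the same x as above; the column name "P" and its values are made up for illustration. It shows that a list or array is assigned by position, while a Series is aligned on the index.

```python
import numpy as np
import pandas as pd

x = pd.DataFrame(np.arange(12).reshape(3, 4), columns=list("QWER"))

# A list or 1-D array is assigned by position, row by row.
x["O"] = np.arange(3)

# A Series is aligned on the index: row labels missing from the Series become NaN.
# ("P" and its values are hypothetical, for illustration only.)
x["P"] = pd.Series([10, 20], index=[0, 2])

print(x)
```

Because of this alignment, assigning a Series or DataFrame only gives the expected result when its index matches the index of the target DataFrame.
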
  • We now have data on about 250,000 911 emergency calls made between 2015 and 2017. Count the number of calls for each type of emergency in this data. If we also want to see how the number of calls of each type changes from month to month, how should we do it?

  • Data source: https://www.kaggle.com/mchirico/montcoalert/data

Data Format

lat	lng	desc	zip	title	timeStamp	twp	addr	e
40.2978759	-75.5812935	REINDEER CT & DEAD END;  NEW HANOVER; Station 332; 2015-12-10 @ 17:10:52;	19525	EMS: BACK PAINS/INJURY	2015/12/10 17:10	NEW HANOVER	REINDEER CT & DEAD END	1
40.2580614	-75.2646799	BRIAR PATH & WHITEMARSH LN;  HATFIELD TOWNSHIP; Station 345; 2015-12-10 @ 17:29:21;	19446	EMS: DIABETIC EMERGENCY	2015/12/10 17:29	HATFIELD TOWNSHIP	BRIAR PATH & WHITEMARSH LN	1

Code:

import pandas as pd
import numpy as np

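# Load the data
df = pd.read_csv("./911.csv")

# print(df["title"].head(5))
# print(df.info())

# Split each title into [category, detail]
temp_list = df["title"].str.split(": ").tolist()
# The distinct emergency categories
cate_list = list(set([i[0] for i in temp_list]))
# print(cate_list)

# Build an all-zeros DataFrame with one column per category
zeros_df = pd.DataFrame(np.zeros((df.shape[0],len(cate_list))),columns=cate_list)

# Fill in the flags
for cate in cate_list:
    # Boolean indexing sets a whole column at once; .loc avoids chained assignment
    zeros_df.loc[df["title"].str.contains(cate),cate] = 1

# print(zeros_df)

# Sum each column to get the count per category
print(zeros_df.sum(axis=0))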

----------------------------------

import pandas as pd
import numpy as np

# Load the data
df = pd.read_csv("./911.csv")

# print(df["title"].head(5))
# print(df.info())


# Split each title into [category, detail]
temp_list = df["title"].str.split(": ").tolist()
# The category of every call
cate_list = [i[0] for i in temp_list]


# Add the categories as a new column
df["cate"] = pd.DataFrame(np.array(cate_list).reshape((df.shape[0],1)))

# Group by category and count
print(df.groupby(by="cate").count()["title"])
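
Both versions above can be written more compactly. The sketch below is one possible shortcut, not the approach used in the code above; the five sample titles are hypothetical rows in the same "CATEGORY: detail" format, used here so the example runs without 911.csv.

```python
import pandas as pd

# Hypothetical sample rows in the same "CATEGORY: detail" format as the title column
df = pd.DataFrame({"title": [
    "EMS: BACK PAINS/INJURY",
    "EMS: DIABETIC EMERGENCY",
    "Fire: GAS-ODOR/LEAK",
    "Traffic: VEHICLE ACCIDENT -",
    "EMS: FALL VICTIM",
]})

# Take the part before ": " directly with the .str accessor
df["cate"] = df["title"].str.split(": ").str[0]

# Count the rows per category
counts = df["cate"].value_counts()
print(counts)

# The one-hot idea from the first version, using get_dummies instead of a loop
one_hot_counts = pd.get_dummies(df["cate"]).sum()
print(one_hot_counts)
```
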

Time series in pandas:

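Generate a range of dates:


pd.date_range(start=None, end=None, periods=None, freq='D')

start, end and freq together produce the timestamps between start and end at frequency freq.
start, periods and freq together produce periods timestamps at frequency freq, beginning at start.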

In [2]: pd.date_range(start="20191010",end="20201122")
Out[2]: 
DatetimeIndex(['2019-10-10', '2019-10-11', '2019-10-12', '2019-10-13',
               '2019-10-14', '2019-10-15', '2019-10-16', '2019-10-17',
               '2019-10-18', '2019-10-19',
               ...
               '2020-11-13', '2020-11-14', '2020-11-15', '2020-11-16',
               '2020-11-17', '2020-11-18', '2020-11-19', '2020-11-20',
               '2020-11-21', '2020-11-22'],
              dtype='datetime64[ns]', length=410, freq='D')


In [3]: pd.date_range(start="20191010",end="20201122",freq="BM")
Out[3]: 
DatetimeIndex(['2019-10-31', '2019-11-29', '2019-12-31', '2020-01-31',
               '2020-02-28', '2020-03-31', '2020-04-30', '2020-05-29',
               '2020-06-30', '2020-07-31', '2020-08-31', '2020-09-30',
               '2020-10-30'],
              dtype='datetime64[ns]', freq='BM')

In [4]: pd.date_range(start="20191010",periods=10,freq="WOM-3FRI")
Out[4]: 
DatetimeIndex(['2019-10-18', '2019-11-15', '2019-12-20', '2020-01-17',
               '2020-02-21', '2020-03-20', '2020-04-17', '2020-05-15',
               '2020-06-19', '2020-07-17'],
              dtype='datetime64[ns]', freq='WOM-3FRI')

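More frequency aliases:

Alias        Offset type            Description
D            Day                    every calendar day
B            BusinessDay            every business day
H            Hour                   every hour
T or min     Minute                 every minute
S            Second                 every second
L or ms      Milli                  every millisecond (one thousandth of a second)
U            Micro                  every microsecond (one millionth of a second)
M            MonthEnd               last calendar day of each month
BM           BusinessMonthEnd       last business day of each month
MS           MonthBegin             first calendar day of each month
BMS          BusinessMonthBegin     first business day of each month

Note: pandas 2.2 and later deprecate several of these spellings: H -> h, T -> min, S -> s, L -> ms, U -> us, M -> ME and BM -> BME.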
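
As a small illustration of how the alias changes the result, the sketch below uses only D, B and MS, which should behave the same across pandas versions. The date ranges are chosen for illustration.

```python
import pandas as pd

# Every calendar day from 2019-10-10 to 2019-10-20 (both ends included)
days = pd.date_range(start="20191010", end="20191020", freq="D")

# Only business days (Monday to Friday) in the same range
bdays = pd.date_range(start="20191010", end="20191020", freq="B")

# The first three month starts on or after 2019-10-10
month_starts = pd.date_range(start="20191010", periods=3, freq="MS")

print(len(days), len(bdays))
print(month_starts)
```
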

Use time series in DataFrame:

index=pd.date_range("20170101",periods=10)
df = pd.DataFrame(np.random.rand(10),index=index)
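
Once the index is a DatetimeIndex, rows can be selected by date. A minimal sketch, assuming the same df as above (the chosen dates are arbitrary):

```python
import numpy as np
import pandas as pd

index = pd.date_range("20170101", periods=10)
df = pd.DataFrame(np.random.rand(10), index=index)

# Slice rows with date strings; both endpoints are included
some_days = df.loc["2017-01-03":"2017-01-05"]
print(some_days)

# Select a single row with a Timestamp label
one_day = df.loc[pd.Timestamp("2017-01-10")]
print(one_day)
```
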


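Back to the 911 data from the beginning: pandas provides a method that converts the time strings into a time series.

df["timeStamp"] = pd.to_datetime(df["timeStamp"],format="")

The format argument (shown above as an empty placeholder) can usually be omitted. It is needed only for time strings that pandas cannot parse on its own, for example ones containing Chinese characters.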
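
A short sketch of what passing format can look like. The first two strings follow the timeStamp layout shown in the data format above; the strings with Chinese characters are hypothetical examples, not taken from the 911 data.

```python
import pandas as pd

# Strings in the layout shown in the data format section
raw = pd.Series(["2015/12/10 17:10", "2015/12/10 17:29"])
parsed = pd.to_datetime(raw, format="%Y/%m/%d %H:%M")
print(parsed)

# Hypothetical strings containing Chinese characters: the literal characters
# are written into the format string between the directives
raw_cn = pd.Series(["2015年12月10日", "2016年01月02日"])
parsed_cn = pd.to_datetime(raw_cn, format="%Y年%m月%d日")
print(parsed_cn)
```
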


So here is the question:
how do we count the number of calls per month or per quarter?

pandas resampling:

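Resampling is the process of converting a time series from one frequency to another. Converting high-frequency data to a lower frequency is downsampling;
converting low-frequency data to a higher frequency is upsampling.

pandas provides the resample method to do this frequency conversion.

np.random.uniform(10,50,(100,1)) generates a two-dimensional array with 100 rows and 1 column of random numbers between 10 and 50.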

In [3]: t = pd.DataFrame(np.random.uniform(10,50,(100,1)),index=pd.date_range("20210912",periods=100))

In [4]: t
Out[4]: 
                    0
2021-09-12  48.035949
2021-09-13  10.695941
2021-09-14  40.903627
2021-09-15  19.935708
2021-09-16  35.640888
...               ...
2021-12-16  40.102581
2021-12-17  24.833710
2021-12-18  10.483476
2021-12-19  29.515657
2021-12-20  39.668012


In [6]: t.resample("M").mean()  # after downsampling, mean() averages the values that fall in each month
Out[6]: 
                    0
2021-09-30  33.396144
2021-10-31  30.037768
2021-11-30  28.952921
2021-12-31  26.474234

In [7]: t.resample("10D").count()
Out[7]: 
             0
2021-09-12  10
2021-09-22  10
2021-10-02  10
2021-10-12  10
2021-10-22  10
2021-11-01  10
2021-11-11  10
2021-11-21  10
2021-12-01  10
2021-12-11  10

In [8]: t.resample("QS-JAN").count()
Out[8]: 
             0
2021-07-01  19
2021-10-01  81
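
The session above only downsamples. The sketch below puts downsampling and upsampling side by side on small, made-up series so the results can be checked by hand. Forward filling is just one possible way to fill the new rows when upsampling.

```python
import pandas as pd

# Downsampling: 20 daily values 0..19 grouped into 10-day bins
daily = pd.Series(range(20), index=pd.date_range("20210912", periods=20))
down = daily.resample("10D").sum()
print(down)

# Upsampling: one value every 2 days, expanded to one row per day.
# The new rows have no data of their own, so ffill() repeats the last known value.
sparse = pd.Series([1.0, 2.0, 3.0],
                   index=pd.date_range("20210912", periods=3, freq="2D"))
up = sparse.resample("D").ffill()
print(up)
```
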
  • Count how the number of calls changes from month to month in the 911 data
  • Count how the number of calls of each type changes from month to month in the 911 data

Code 1

import pandas as pd
from matplotlib import pyplot as plt

# Load the data
df = pd.read_csv("./911.csv")

# Convert the time strings to datetimes
df["timeStamp"] = pd.to_datetime(df["timeStamp"])
# Use the datetimes as the index
df.set_index("timeStamp",inplace=True)

print(df.head())

# Count the calls in each month (resample groups the rows by month;
# in pandas 2.2+ the alias "M" is spelled "ME")
count_by_month = df.resample("M").count()["title"]
print(count_by_month)

# Prepare the plot data
_x = count_by_month.index
_y = count_by_month.values

# dir() shows which methods the current object has
# for i in _x:
#     print(dir(i))
#     break

# Format the timestamps as strings for the tick labels
_x = [i.strftime("%Y%m%d") for i in _x]

# Set the figure size
plt.figure(figsize=(20,8),dpi=80)

# Draw the line chart
plt.plot(range(len(_x)),_y)

# Set the x ticks
plt.xticks(range(len(_x)),_x,rotation=45)


# Show the figure
plt.show()

Result:

[Figure: line chart of the number of 911 calls per month]
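
One detail of the counting step is worth a sketch: count() counts the non-missing values in each column, while size() counts rows. The three rows below are hypothetical, with one missing title to make the difference visible. The "MS" alias (month start) is used here instead of "M" only so that the example behaves the same across pandas versions.

```python
import pandas as pd

# Hypothetical rows: two calls in December 2015 (one without a title), one in January 2016
idx = pd.to_datetime(["2015-12-10", "2015-12-25", "2016-01-03"])
df = pd.DataFrame({"title": ["EMS: A", None, "Fire: B"]}, index=idx)

# count() ignores the missing title
by_count = df.resample("MS").count()["title"]

# size() counts every row in the bin
by_size = df.resample("MS").size()

print(by_count)
print(by_size)
```

If the title column has no missing values, both give the same monthly totals.
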

Code 2

import pandas as pd
from matplotlib import pyplot as plt
import numpy as np


# Load the data
df = pd.read_csv("./911.csv")


# Convert the time strings to datetimes
df["timeStamp"] = pd.to_datetime(df["timeStamp"])


# Add the category column
temp_list = df["title"].str.split(": ").tolist()
cate_list = [i[0] for i in temp_list]
df["cate"] = pd.DataFrame(np.array(cate_list).reshape(df.shape[0],1))


# Use the datetimes as the index
df.set_index("timeStamp",inplace=True)

# Set the figure size
plt.figure(figsize=(20,8),dpi=80)


for group_name,group_data in df.groupby(by="cate"):

    # Count the calls per month for this category
    # (in pandas 2.2+ the alias "M" is spelled "ME")
    count_by_month = group_data.resample("M").count()["title"]


    # Prepare the plot data
    _x = count_by_month.index
    _y = count_by_month.values

    # Format the timestamps as strings
    _x = [i.strftime("%Y%m%d") for i in _x]

    # Draw one line per category
    plt.plot(range(len(_x)),_y,label=group_name)

# Set the x ticks (uses the months of the last group plotted)
plt.xticks(range(len(_x)),_x,rotation=45)
# Add the legend
plt.legend(loc="best")

# Show the figure
plt.show()

Result:

[Figure: line chart of monthly 911 calls, one line per category]
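
The per-category monthly counts can also be built as a single table instead of a loop. This is a sketch of an alternative, not the method used in Code 2; the five rows are hypothetical and only mimic the timeStamp and title columns.

```python
import pandas as pd

# Hypothetical sample in the same shape as the 911 data
df = pd.DataFrame({
    "timeStamp": ["2015-12-10 17:10", "2015-12-11 09:00", "2016-01-02 08:30",
                  "2016-01-15 12:00", "2016-01-20 23:59"],
    "title": ["EMS: BACK PAINS/INJURY", "Fire: GAS-ODOR/LEAK", "EMS: FALL VICTIM",
              "EMS: HEAD INJURY", "Traffic: VEHICLE ACCIDENT -"],
})
df["timeStamp"] = pd.to_datetime(df["timeStamp"])
df["cate"] = df["title"].str.split(": ").str[0]
df = df.set_index("timeStamp")

# Group by month (pd.Grouper works on the datetime index) and by category,
# count the rows, then turn the categories into columns
table = df.groupby([pd.Grouper(freq="MS"), "cate"]).size().unstack(fill_value=0)
print(table)
```

Each column of table is one category and each row one month, so table.plot() would draw one line per category on a shared time axis.
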

  • We now have air quality data for five cities: Beijing, Shanghai, Guangzhou, Shenzhen and Shenyang. Plot how PM2.5 changes over time in the five cities.

  • Look at how time is stored in this data set: it is not a single string but separate year, month, day and hour columns. What should we do in this case?

  • Data source: https://www.kaggle.com/uciml/pm25-data-for-five-chinese-cities

  • The DatetimeIndex we learned earlier can be thought of as a set of timestamps; a PeriodIndex can be thought of as a set of time spans.

  • periods = pd.PeriodIndex(year=data["year"],month=data["month"],day=data["day"],hour=data["hour"],freq="H")

  • So how do we downsample these periods?

  • data = df.set_index(periods).resample("10D").mean()

  • Data Format:

      No	year	month	day	hour	season	PM_Dongsi	PM_Dongsihuan	PM_Nongzhanguan	PM_US Post	DEWP	HUMI	PRES	TEMP	cbwd	Iws	precipitation	Iprec
      1	2010	1	1	0	4	NA	NA	NA	NA	-21	43	1021	-11	NW	1.79	0	0
      2	2010	1	1	1	4	NA	NA	NA	NA	-21	47	1020	-12	NW	4.92	0	0
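
To make the timestamp versus time span distinction concrete, here is a minimal sketch with a single Timestamp and a single Period. The dates are arbitrary examples.

```python
import pandas as pd

ts = pd.Timestamp("2010-01-15 08:00")   # a single instant
p = pd.Period("2010-01", freq="M")      # the whole month of January 2010

# A Period has a first and a last instant
print(p.start_time)
print(p.end_time)

# The timestamp falls inside the period, and converts to the same monthly period
inside = p.start_time <= ts <= p.end_time
same_month = ts.to_period("M") == p
print(inside, same_month)
```
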
    

Code:

import pandas as pd
from matplotlib import pyplot as plt

file_path = "./PM2.5/BeijingPM20100101_20151231.csv"

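# Load the data
df = pd.read_csv(file_path)
# print(df.info())
# Combine the separate columns into time periods:
# PeriodIndex turns the year/month/day/hour columns into a pandas time type
# (pandas 2.2+ deprecates these keywords in favor of pd.PeriodIndex.from_fields)
period = pd.PeriodIndex(year=df["year"],month=df["month"],day=df["day"],hour=df["hour"],freq="H")

df["datetime"] = period
# print(period)
# print(type(period))
print(df.head(10))

# Make datetime the index; inplace=True modifies df itself
df.set_index("datetime",inplace=True)

# Downsample to 7-day bins; mean() averages each bin and skips missing values.
# numeric_only=True leaves out text columns such as cbwd, which cannot be averaged.
df = df.resample("7D").mean(numeric_only=True) # resample works on the index by default

# Select the two PM2.5 series to compare (missing values are not dropped here)
data = df["PM_US Post"]
data_china = df["PM_Dongsi"]

# Prepare the plot data
_x = data.index
_x_china = [i.strftime("%Y%m%d") for i in data_china.index]
_x = [i.strftime("%Y%m%d") for i in _x]
_y = data.values
_y_china = data_china.values

# Set the figure size
plt.figure(figsize=(20,8),dpi=80)


# Draw the line charts
plt.plot(range(len(_x)),_y,label="US_POST")
plt.plot(range(len(_x_china)),_y_china,label="CN_POST")

# Set the x ticks (every 10th label)
plt.xticks(range(0,len(_x),10),list(_x)[::10],rotation=45)

plt.legend(loc="best")

# Show the figure
plt.show()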

Result:

[Figure: line chart of 7-day mean PM2.5 in Beijing, US_POST and CN_POST series]
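
An alternative way to handle the split time columns is to assemble them into timestamps with pd.to_datetime, which accepts a DataFrame whose columns are named year, month, day and hour. This is a sketch of that alternative, not the PeriodIndex approach used above; the 14 rows are hypothetical values chosen so the weekly means are easy to check.

```python
import pandas as pd

# Hypothetical data: one reading per day for 14 days, in the same column layout
days = list(range(1, 15))
df = pd.DataFrame({
    "year": 2010,
    "month": 1,
    "day": days,
    "hour": 0,
    "PM_US Post": [float(d) for d in days],
    "cbwd": "NW",   # a text column, like the wind direction in the real data
})

# Assemble the year/month/day/hour columns into one datetime column
df["datetime"] = pd.to_datetime(df[["year", "month", "day", "hour"]])

# Downsample one numeric column to 7-day means
weekly = df.set_index("datetime")["PM_US Post"].resample("7D").mean()
print(weekly)
```

Selecting the numeric column before calling mean() avoids the problem of averaging text columns such as cbwd.
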

Origin blog.csdn.net/qq_46456049/article/details/108931389