Getting started with pandas -- date_range

pandas date_range

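start: the start time

end: the end time

periods: the number of timestamps to generate

freq: the frequency, 'D' by default; other options include H (hour), W (week), B (business day), SM (semi-month), T (minute), S (second), A (year),…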
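The parameters can be combined in more ways than the session below shows. Here is a minimal sketch, assuming date_range is called with keyword arguments; the variable names are only for illustration.

```python
import pandas as pd

# end + periods: count backwards from the end date (daily frequency by default)
last_three = pd.date_range(end='2010-01-10', periods=3)

# start + end + periods: evenly spaced timestamps between the two endpoints
evenly_spaced = pd.date_range(start='2010-01-01', end='2010-01-05', periods=3)

print(list(last_three.strftime('%Y-%m-%d')))
print(list(evenly_spaced.strftime('%Y-%m-%d')))
```

Any three of start, end, periods and freq are enough for pandas to work out the rest.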

In [23]: import datetime

In [24]: datetime.datetime.strptime('2010-01-01', '%Y-%m-%d')  # convert a string to a datetime
Out[24]: datetime.datetime(2010, 1, 1, 0, 0)

In [25]: # the first argument is the time string, the second is the time format

In [26]: import dateutil.parser

In [27]: dateutil.parser.parse('2001-01-01')
Out[27]: datetime.datetime(2001, 1, 1, 0, 0)

In [28]: # like datetime.strptime, but there is no need to spell out the format

In [29]: dateutil.parser.parse('2001/01/01')
Out[29]: datetime.datetime(2001, 1, 1, 0, 0)

In [30]: dateutil.parser.parse('01/01/2001')
Out[30]: datetime.datetime(2001, 1, 1, 0, 0)

In [31]: dateutil.parser.parse('JAN/01/2001')
Out[31]: datetime.datetime(2001, 1, 1, 0, 0)

In [33]: pd.to_datetime(['2001-01-01', '2010/Feb/02'])
Out[33]: DatetimeIndex(['2001-01-01', '2010-02-02'], dtype='datetime64[ns]', freq=None)

In [34]: # convert date strings written in different formats to datetimes
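As far as I know, newer pandas releases (2.0 and later) infer a single format from the first element, so a list that mixes formats like the one above may raise an error instead of parsing. A possible workaround, sketched here under that assumption, is to parse each string separately with dateutil and build the index from the results.

```python
import pandas as pd
from dateutil import parser

strings = ['2001-01-01', '2010/Feb/02']

# Parse each string on its own, so every element may use a different format,
# then collect the results into a DatetimeIndex.
idx = pd.DatetimeIndex([parser.parse(s) for s in strings])
print(idx)
```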

In [35]: pd.date_range('2010-01-01','2010-01-15')
Out[35]:
DatetimeIndex(['2010-01-01', '2010-01-02', '2010-01-03', '2010-01-04',
               '2010-01-05', '2010-01-06', '2010-01-07', '2010-01-08',
               '2010-01-09', '2010-01-10', '2010-01-11', '2010-01-12',
               '2010-01-13', '2010-01-14', '2010-01-15'],
              dtype='datetime64[ns]', freq='D')

In [36]: pd.date_range('2010-01-01',periods=20)
Out[36]:
DatetimeIndex(['2010-01-01', '2010-01-02', '2010-01-03', '2010-01-04',
               '2010-01-05', '2010-01-06', '2010-01-07', '2010-01-08',
               '2010-01-09', '2010-01-10', '2010-01-11', '2010-01-12',
               '2010-01-13', '2010-01-14', '2010-01-15', '2010-01-16',
               '2010-01-17', '2010-01-18', '2010-01-19', '2010-01-20'],
              dtype='datetime64[ns]', freq='D')

In [37]: # periods is the number of timestamps to generate

In [38]: pd.date_range('2010-01-01',periods=20, freq='W')
Out[38]:
DatetimeIndex(['2010-01-03', '2010-01-10', '2010-01-17', '2010-01-24',
               '2010-01-31', '2010-02-07', '2010-02-14', '2010-02-21',
               '2010-02-28', '2010-03-07', '2010-03-14', '2010-03-21',
               '2010-03-28', '2010-04-04', '2010-04-11', '2010-04-18',
               '2010-04-25', '2010-05-02', '2010-05-09', '2010-05-16'],
              dtype='datetime64[ns]', freq='W-SUN')

In [39]: # weekly (anchored on Sunday, as freq='W-SUN' shows)

In [40]: pd.date_range('2010-01-01',periods=20, freq='W-MON')
Out[40]:
DatetimeIndex(['2010-01-04', '2010-01-11', '2010-01-18', '2010-01-25',
               '2010-02-01', '2010-02-08', '2010-02-15', '2010-02-22',
               '2010-03-01', '2010-03-08', '2010-03-15', '2010-03-22',
               '2010-03-29', '2010-04-05', '2010-04-12', '2010-04-19',
               '2010-04-26', '2010-05-03', '2010-05-10', '2010-05-17'],
              dtype='datetime64[ns]', freq='W-MON')

In [41]: # every Monday

In [42]: pd.date_range('2010-01-01',periods=20, freq='B')
Out[42]:
DatetimeIndex(['2010-01-01', '2010-01-04', '2010-01-05', '2010-01-06',
               '2010-01-07', '2010-01-08', '2010-01-11', '2010-01-12',
               '2010-01-13', '2010-01-14', '2010-01-15', '2010-01-18',
               '2010-01-19', '2010-01-20', '2010-01-21', '2010-01-22',
               '2010-01-25', '2010-01-26', '2010-01-27', '2010-01-28'],
              dtype='datetime64[ns]', freq='B')

In [43]: # business days only, Saturdays and Sundays are left out

In [45]: dt = _

In [46]: dt[0]
Out[46]: Timestamp('2010-01-01 00:00:00', offset='B')

In [47]: dt[0].to_pydatetime()
Out[47]: datetime.datetime(2010, 1, 1, 0, 0)

In [48]: # convert to a Python datetime

In [49]: pd.date_range('2010-01-01',periods=20, freq='1h20min')
Out[49]:
DatetimeIndex(['2010-01-01 00:00:00', '2010-01-01 01:20:00',
               '2010-01-01 02:40:00', '2010-01-01 04:00:00',
               '2010-01-01 05:20:00', '2010-01-01 06:40:00',
               '2010-01-01 08:00:00', '2010-01-01 09:20:00',
               '2010-01-01 10:40:00', '2010-01-01 12:00:00',
               '2010-01-01 13:20:00', '2010-01-01 14:40:00',
               '2010-01-01 16:00:00', '2010-01-01 17:20:00',
               '2010-01-01 18:40:00', '2010-01-01 20:00:00',
               '2010-01-01 21:20:00', '2010-01-01 22:40:00',
               '2010-01-02 00:00:00', '2010-01-02 01:20:00'],
              dtype='datetime64[ns]', freq='80T')

In [50]: # arbitrary intervals

Common pandas functions

In [2]: import pandas as pd

In [3]: df = pd.DataFrame({'one':[1,2,3,4], 'two':[4,5,6,7]}, index=['a','b','c'
   ...: ,'d'])

In [4]: df
Out[4]:
   one  two
a    1    4
b    2    5
c    3    6
d    4    7

In [5]: df.mean()
Out[5]:
one    2.5
two    5.5
dtype: float64

In [6]: df.mean(axis=1)
Out[6]:
a    2.5
b    3.5
c    4.5
d    5.5
dtype: float64

In [7]: # mean of each row

In [8]: df.sum()
Out[8]:
one    10
two    22
dtype: int64

In [9]: df.sum(axis=1)
Out[9]:
a     5
b     7
c     9
d    11
dtype: int64

In [10]: # sum of each row

In [11]: df.sort_values(by='one')
Out[11]:
   one  two
a    1    4
b    2    5
c    3    6
d    4    7

In [12]: df.sort_values(by='one', ascending=False)
Out[12]:
   one  two
d    4    7
c    3    6
b    2    5
a    1    4

In [13]: # ascending=False sorts by column one in descending order

In [16]: # when the column contains NaN, the NaN rows always come last, whatever the value of ascending
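The session has no example for the NaN case, so here is a small sketch. The DataFrame is made up for illustration; na_position is the sort_values parameter that moves the NaN rows to the front.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'one': [3, np.nan, 1, 2]}, index=['a', 'b', 'c', 'd'])

asc = df.sort_values(by='one')                          # NaN row last
desc = df.sort_values(by='one', ascending=False)        # NaN row still last
first = df.sort_values(by='one', na_position='first')   # NaN row first

print(list(asc.index))
print(list(desc.index))
print(list(first.index))
```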

In [17]: df.sort_index()
Out[17]:
   one  two
a    1    4
b    2    5
c    3    6
d    4    7

In [18]: # sort by index

In [19]: df.sort_index(ascending=False)
Out[19]:
   one  two
d    4    7
c    3    6
b    2    5
a    1    4

In [20]: # sort by row index in descending order

In [21]: df.sort_index(ascending=False, axis=1)
Out[21]:
   two  one
a    4    1
b    5    2
c    6    3
d    7    4

In [22]: # sort by column labels (axis=1)

In [23]:

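pandas - time series

A time series is a Series or DataFrame indexed by time

When datetime objects serve as the index, they are stored in a DatetimeIndex object

Special features of time series:

Slicing by passing a year or a year and month

Slicing by passing a date range

Supporting functions: resample(), truncate(),…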

In [2]: import pandas as pd

In [3]: pd.date_range("2010-01-01", '2010-01-20')
Out[3]:
DatetimeIndex(['2010-01-01', '2010-01-02', '2010-01-03', '2010-01-04',
               '2010-01-05', '2010-01-06', '2010-01-07', '2010-01-08',
               '2010-01-09', '2010-01-10', '2010-01-11', '2010-01-12',
               '2010-01-13', '2010-01-14', '2010-01-15', '2010-01-16',
               '2010-01-17', '2010-01-18', '2010-01-19', '2010-01-20'],
              dtype='datetime64[ns]', freq='D')

In [4]: import numpy as np

In [5]: sr = pd.Series(np.arange(20),index=pd.date_range('2017-01-01',periods=20
   ...: ))

In [6]: sr
Out[6]:
2017-01-01     0
2017-01-02     1
2017-01-03     2
2017-01-04     3
2017-01-05     4
2017-01-06     5
2017-01-07     6
2017-01-08     7
2017-01-09     8
2017-01-10     9
2017-01-11    10
2017-01-12    11
2017-01-13    12
2017-01-14    13
2017-01-15    14
2017-01-16    15
2017-01-17    16
2017-01-18    17
2017-01-19    18
2017-01-20    19
Freq: D, dtype: int64

In [7]: sr.index
Out[7]:
DatetimeIndex(['2017-01-01', '2017-01-02', '2017-01-03', '2017-01-04',
               '2017-01-05', '2017-01-06', '2017-01-07', '2017-01-08',
               '2017-01-09', '2017-01-10', '2017-01-11', '2017-01-12',
               '2017-01-13', '2017-01-14', '2017-01-15', '2017-01-16',
               '2017-01-17', '2017-01-18', '2017-01-19', '2017-01-20'],
              dtype='datetime64[ns]', freq='D')

In [8]: # the index of the sr we created here is a DatetimeIndex

In [9]: # so sr is now a time series

In [10]: sr['2017-01']  # get the data for January 2017
Out[10]:
2017-01-01     0
2017-01-02     1
2017-01-03     2
2017-01-04     3
2017-01-05     4
2017-01-06     5
2017-01-07     6
2017-01-08     7
2017-01-09     8
2017-01-10     9
2017-01-11    10
2017-01-12    11
2017-01-13    12
2017-01-14    13
2017-01-15    14
2017-01-16    15
2017-01-17    16
2017-01-18    17
2017-01-19    18
2017-01-20    19
Freq: D, dtype: int64

In [11]: sr['2017-01-01':'2017-01-09']
Out[11]:
2017-01-01    0
2017-01-02    1
2017-01-03    2
2017-01-04    3
2017-01-05    4
2017-01-06    5
2017-01-07    6
2017-01-08    7
2017-01-09    8
Freq: D, dtype: int64
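Because the series above only covers one month, slicing by year cannot be seen in it. The sketch below uses a made-up series that crosses a year boundary, and uses .loc, which I assume to be the safer spelling across pandas versions.

```python
import numpy as np
import pandas as pd

# 40 daily values starting 2016-12-20, so the index covers two years
sr = pd.Series(np.arange(40), index=pd.date_range('2016-12-20', periods=40))

by_year = sr.loc['2016']                        # all rows in 2016
by_month = sr.loc['2017-01']                    # all rows in January 2017
by_range = sr.loc['2016-12-30':'2017-01-02']    # both endpoints are included

print(len(by_year), len(by_month), len(by_range))
```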

In [12]: sr.resample('W').sum()
Out[12]:
2017-01-01     0
2017-01-08    28
2017-01-15    77
2017-01-22    85
Freq: W-SUN, dtype: int64

In [13]: # sum of the values in each week

In [14]: sr.resample('M').sum()
Out[14]:
2017-01-31    190
Freq: M, dtype: int64

In [15]: # sum of the values in each month

In [16]: sr.resample('M').mean()
Out[16]:
2017-01-31    9.5
Freq: M, dtype: float64
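Several aggregations can be computed in one pass. The sketch below rebuilds the same sr and sticks to the weekly alias 'W'. pandas 2.2 deprecates the month-end alias 'M' in favour of 'ME', so on recent versions the 'M' calls above may emit a warning.

```python
import numpy as np
import pandas as pd

sr = pd.Series(np.arange(20), index=pd.date_range('2017-01-01', periods=20))

# One row per week (ending on Sunday), one column per aggregation
weekly = sr.resample('W').agg(['sum', 'mean'])
print(weekly)
```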

In [17]: sr.truncate(before='2017-01-03')  # there is also an after parameter
Out[17]:
2017-01-03     2
2017-01-04     3
2017-01-05     4
2017-01-06     5
2017-01-07     6
2017-01-08     7
2017-01-09     8
2017-01-10     9
2017-01-11    10
2017-01-12    11
2017-01-13    12
2017-01-14    13
2017-01-15    14
2017-01-16    15
2017-01-17    16
2017-01-18    17
2017-01-19    18
2017-01-20    19
Freq: D, dtype: int64
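For completeness, a short sketch of before and after used together, on the same sr as above. Both bounds are kept in the result.

```python
import numpy as np
import pandas as pd

sr = pd.Series(np.arange(20), index=pd.date_range('2017-01-01', periods=20))

# Keep only the rows from 2017-01-03 to 2017-01-05, endpoints included
middle = sr.truncate(before='2017-01-03', after='2017-01-05')
print(middle)
```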

In [18]:

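Reading files with pandas

Common data file format: csv

pandas can load data from a file name, a URL, or a file object

read_csv uses a comma "," as the default separator

read_table uses a tab "\t" as the default separator

Main parameters of read_csv and read_table:

sep: the separator; a regular expression such as '\s+' is allowed

header=None: the file has no column-name row

names: the column names to use

index_col: the column to use as the row index

skiprows: rows to skip, e.g. [1,2,3] skips the rows numbered 1, 2 and 3 (counting from 0)

na_values: strings to treat as missing values, e.g. ['None','null'] makes the strings None and null parse as NaN

parse_dates: what to parse as dates; a bool (True tries the index) or a list [] of columns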

In [19]: # pd.read_csv("filename.csv", index_col='')  # index_col names the column to use as the row index

In [20]: # pd.read_csv("filename.csv", index_col=0)  # index_col=0 uses the first column as the row index

In [21]: # pd.read_csv("filename.csv", index_col=0, parse_dates=True)  # parse_dates=True tries to parse the index as dates

In [22]: # here that is the first column, since it was chosen as the index

In [23]: # pd.read_csv("filename.csv", index_col=0, parse_dates=['column', 'column'])  # parse_dates with a list parses the named columns as dates

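In [24]: # some csv files have no column-name row; the first data row is then read as the column names, which is not what we want

In [25]: # pd.read_csv('filename.csv', header=None)  # with header=None, pandas generates the column names 0, 1, 2, 3... so the data is left intact

In [26]: # pd.read_csv('filename.csv', header=None, names=['name1', 'name2'])  # with names as well, pandas uses the list we provide as the column names

In [27]: # read_table and read_csv hardly differ; read_table just defaults to a tab separator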
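The calls above are commented out because they need a real file. The sketch below swaps in an in-memory io.StringIO object, so it runs as-is; the file contents and the column names are invented for illustration.

```python
import io
import pandas as pd

# Hypothetical file contents: no header row, 'null' marks a missing value
raw = io.StringIO(
    "2017-01-01,1.5,a\n"
    "2017-01-02,null,b\n"
    "2017-01-03,3.0,c\n"
)

df = pd.read_csv(
    raw,
    header=None,                      # the data has no column-name row
    names=['date', 'value', 'tag'],   # so we supply the column names
    index_col='date',                 # use the date column as the row index
    parse_dates=True,                 # try to parse the index as dates
    na_values=['None', 'null'],       # treat these strings as NaN
)
print(df)
print(type(df.index))
```

Since a file object is one of the accepted inputs, the same arguments should carry over unchanged to a file name or URL.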

In [28]:

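Writing to a csv file: the to_csv function

Main parameters

sep: the field separator for the output file

na_rep: the string written for missing values, an empty string by default

header=False: do not write the column-name row

index=False: do not write the row-index column

columns: the columns to write, passed as a list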
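A small sketch of these parameters working together. The DataFrame is made up; calling to_csv without a path returns the csv text as a string, which makes the result easy to inspect.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'one': [1.0, np.nan], 'two': [3, 4]}, index=['a', 'b'])

# No path argument: to_csv returns the CSV text instead of writing a file
text = df.to_csv(sep=';', na_rep='NA', header=False, index=False,
                 columns=['two', 'one'])
print(text)
```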

pandas also supports other file types:

json, xml, html, databases, pickle, excel…