Pandas DataFrame: Selecting and Filtering Data

This would allow chaining operations like:

pd.read_csv('imdb.txt')
  .sort(columns='year')
  .filter(lambda x: x['year']>1990)   # <---this is missing in Pandas
  .to_csv('filtered.csv')

For current alternatives see:

http://stackoverflow.com/questions/11869910/pandas-filter-rows-of-dataframe-with-operator-chaining

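You can do it like this:

# DataFrame.sort(columns=...) was removed in later pandas releases; sort_values(by=...) replaces it
df = pd.read_csv('imdb.txt').sort_values(by='year')
df[df['year']>1990].to_csv('filtered.csv')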

# however, could potentially do something like this:

pd.read_csv('imdb.txt')
  .sort(columns='year')
  .[lambda x: x['year']>1990]
  .to_csv('filtered.csv')
or

pd.read_csv('imdb.txt')
  .sort(columns='year')
  .loc[lambda x: x['year']>1990]
  .to_csv('filtered.csv')
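In current pandas releases the chained form can be written without new syntax. The sketch below is only an illustration: the `movies` frame is a hypothetical stand-in for the contents of 'imdb.txt', `sort_values` stands in for the older `sort`, and the whole chain is wrapped in parentheses so that it can span several lines.

```python
import pandas as pd

# Hypothetical stand-in for the data read from 'imdb.txt'.
movies = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "year": [1985, 1994, 2001, 1990],
})

# .loc accepts a callable; it receives the already-sorted frame.
via_loc = (
    movies
    .sort_values(by="year")
    .loc[lambda x: x["year"] > 1990]
)

# The same filter expressed as a query string.
via_query = movies.sort_values(by="year").query("year > 1990")

print(via_loc)
```

Both variants keep the filter inside the chain, so no intermediate variable is needed before calling `.to_csv(...)`.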

from:https://yangjin795.github.io/pandas_df_selection.html

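Pandas, the Python Data Analysis Library, is a Python library built on NumPy and designed for data analysis. It provides many tools and methods that make working with large amounts of data in Python efficient and convenient.

This article covers the methods and tools pandas offers for filtering and selecting data in a DataFrame. The raw data used throughout is as follows: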

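import numpy as np
import pandas as pd

# dates is not defined in the original; a six-day range matches the index shown in the output
dates = pd.date_range('2017-04-01', periods=6)
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))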

    Out[9]:
                       A         B         C         D
    2017-04-01  0.522241  0.495106 -0.268194 -0.035003
    2017-04-02  2.104572 -0.977768 -0.139632 -0.735926
    2017-04-03  0.480507  1.215048  1.313314 -0.072320
    2017-04-04  1.700309  0.287588 -0.012103  0.525291
    2017-04-05  0.526615 -0.417645  0.405853 -0.835213
    2017-04-06  1.143858 -0.326720  1.425379  0.531037
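Because `np.random.randn` produces different numbers on every run, the outputs in this article cannot be reproduced from the one-liner alone. A minimal sketch that rebuilds the same frame from the printed values (rounded to six decimals, as displayed) might look like this:

```python
import numpy as np
import pandas as pd

# Index inferred from the printed output: six consecutive days.
dates = pd.date_range("2017-04-01", periods=6)

# Values copied from the printed table (six-decimal display precision).
values = np.array([
    [0.522241,  0.495106, -0.268194, -0.035003],
    [2.104572, -0.977768, -0.139632, -0.735926],
    [0.480507,  1.215048,  1.313314, -0.072320],
    [1.700309,  0.287588, -0.012103,  0.525291],
    [0.526615, -0.417645,  0.405853, -0.835213],
    [1.143858, -0.326720,  1.425379,  0.531037],
])

df = pd.DataFrame(values, index=dates, columns=list("ABCD"))
print(df)
```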

Selection

Selecting with []

Select one or more columns:

df['A']
Out:
2017-04-01    0.522241
2017-04-02    2.104572
2017-04-03    0.480507
2017-04-04    1.700309
2017-04-05    0.526615
2017-04-06    1.143858

df[['A','B']]
Out:
                   A         B
2017-04-01  0.522241  0.495106
2017-04-02  2.104572 -0.977768
2017-04-03  0.480507  1.215048
2017-04-04  1.700309  0.287588
2017-04-05  0.526615 -0.417645
2017-04-06  1.143858 -0.326720

Select one or more rows:

df['2017-04-01':'2017-04-01']
Out:
                   A         B         C         D
2017-04-01  0.522241  0.495106 -0.268194 -0.035003

df['2017-04-01':'2017-04-03']
Out:
                   A         B         C         D
2017-04-01  0.522241  0.495106 -0.268194 -0.035003
2017-04-02  2.104572 -0.977768 -0.139632 -0.735926
2017-04-03  0.480507  1.215048  1.313314 -0.072320
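The [] operator behaves differently depending on what it is given: a name or a list of names selects columns, while a slice selects rows. The sketch below illustrates this with placeholder values (0 to 23) rather than the article's random data.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2017-04-01", periods=6)
# Placeholder values; only the shape and the labels matter here.
df = pd.DataFrame(np.arange(24.0).reshape(6, 4), index=dates, columns=list("ABCD"))

one_column = df["A"]            # a single name returns a Series
two_columns = df[["A", "B"]]    # a list of names returns a DataFrame

# A slice is applied to the rows. Label slices include both endpoints.
by_label = df["2017-04-01":"2017-04-03"]
# Integer slices follow the usual Python rule and exclude the stop position.
by_position = df[0:2]

print(len(by_label), len(by_position))
```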

loc: selecting data by row label

df.loc['2017-04-01','A']

df.loc['2017-04-01']
Out:
A    0.522241
B    0.495106
C   -0.268194
D   -0.035003

df.loc['2017-04-01':'2017-04-03']
Out:
                   A         B         C         D
2017-04-01  0.522241  0.495106 -0.268194 -0.035003
2017-04-02  2.104572 -0.977768 -0.139632 -0.735926
2017-04-03  0.480507  1.215048  1.313314 -0.072320

df.loc['2017-04-01':'2017-04-04',['A','B']]
Out:
                   A         B
2017-04-01  0.522241  0.495106
2017-04-02  2.104572 -0.977768
2017-04-03  0.480507  1.215048
2017-04-04  1.700309  0.287588

df.loc[:,['A','B']]
Out:
                   A         B
2017-04-01  0.522241  0.495106
2017-04-02  2.104572 -0.977768
2017-04-03  0.480507  1.215048
2017-04-04  1.700309  0.287588
2017-04-05  0.526615 -0.417645
2017-04-06  1.143858 -0.326720

iloc: selecting data by row number (integer position)

df.iloc[2]
Out:
A    0.480507
B    1.215048
C    1.313314
D   -0.072320

df.iloc[1:3]
Out:
                   A         B         C         D
2017-04-02  2.104572 -0.977768 -0.139632 -0.735926
2017-04-03  0.480507  1.215048  1.313314 -0.072320

df.iloc[1,1]
df.iloc[1:3,1]
df.iloc[1:3,1:2]
df.iloc[[1,3],[2,3]]
Out:
                   C         D
2017-04-02 -0.139632 -0.735926
2017-04-04 -0.012103  0.525291
df.iloc[[1,3],:]
df.iloc[:,[2,3]]
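One difference between loc and iloc that is easy to miss is how slices end: a label slice includes its last label, while a position slice excludes its stop index. The following sketch, again with placeholder values, shows the two forms selecting the same rows, and how a scalar versus a slice in the column position changes the result type.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2017-04-01", periods=6)
# Placeholder values; only positions and labels matter here.
df = pd.DataFrame(np.arange(24.0).reshape(6, 4), index=dates, columns=list("ABCD"))

by_label = df.loc["2017-04-02":"2017-04-03"]   # both endpoints included
by_position = df.iloc[1:3]                     # position 3 excluded

as_series = df.iloc[1:3, 1]      # scalar column position: Series
as_frame = df.iloc[1:3, 1:2]     # column slice: one-column DataFrame

print(by_label.equals(by_position))
```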

iat: getting the value of a single cell

df.iat[1,2]
Out: -0.13963224781812655
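iat addresses a cell by integer position; its label-based counterpart is at. A short sketch with placeholder values, where the cell in row 1, column 2 holds 6.0:

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2017-04-01", periods=6)
# Placeholder values 0..23, so row 1 is [4, 5, 6, 7].
df = pd.DataFrame(np.arange(24.0).reshape(6, 4), index=dates, columns=list("ABCD"))

by_position = df.iat[1, 2]                           # row 1, column 2
by_label = df.at[pd.Timestamp("2017-04-02"), "C"]    # same cell, by label

print(by_position, by_label)
```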

Filtering

Filtering with []

Put a boolean expression inside []; every row for which it evaluates to True is selected. When the expression is a boolean DataFrame rather than a Series, the shape is kept and every cell where it is False becomes NaN.

df[df.A>1]
Out:
                   A         B         C         D
2017-04-02  2.104572 -0.977768 -0.139632 -0.735926
2017-04-04  1.700309  0.287588 -0.012103  0.525291
2017-04-06  1.143858 -0.326720  1.425379  0.531037

df[df>1]
Out:
                   A         B         C   D
2017-04-01       NaN       NaN       NaN NaN
2017-04-02  2.104572       NaN       NaN NaN
2017-04-03       NaN  1.215048  1.313314 NaN
2017-04-04  1.700309       NaN       NaN NaN
2017-04-05       NaN       NaN       NaN NaN
2017-04-06  1.143858       NaN  1.425379 NaN

df[df.A+df.B>1.5]
Out:
                   A         B         C         D
2017-04-03  0.480507  1.215048  1.313314 -0.072320
2017-04-04  1.700309  0.287588 -0.012103  0.525291
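Several conditions can be combined in one mask. The sketch below rebuilds the article's data from its printed values (an assumption, since the original numbers are random) and combines two column conditions with `&`; each condition needs its own parentheses because `&` binds more tightly than the comparison operators. It also shows that masking with a boolean DataFrame gives the same result as `DataFrame.where`.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2017-04-01", periods=6)
# Values copied from the printed table (six-decimal display precision).
df = pd.DataFrame(
    np.array([
        [0.522241,  0.495106, -0.268194, -0.035003],
        [2.104572, -0.977768, -0.139632, -0.735926],
        [0.480507,  1.215048,  1.313314, -0.072320],
        [1.700309,  0.287588, -0.012103,  0.525291],
        [0.526615, -0.417645,  0.405853, -0.835213],
        [1.143858, -0.326720,  1.425379,  0.531037],
    ]),
    index=dates,
    columns=list("ABCD"),
)

# Rows where A is above 1 AND B is negative.
both = df[(df["A"] > 1) & (df["B"] < 0)]

# Element-wise mask: shape is preserved, non-matching cells become NaN.
masked = df[df > 1]

print(both)
```

Use `|` for "or" and `~` to negate a condition; the Python keywords `and`, `or`, and `not` do not work element-wise on a Series.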

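The next example is more involved: it selects the rows whose index lies between '2017-04-01' and '2017-04-04' and, from those, the columns whose sum is greater than 1. Note that df.sum() sums each column, so the boolean mask applies to columns, not rows: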

df.loc['2017-04-01':'2017-04-04',df.sum()>1]
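To make the axis explicit, the following sketch contrasts a column mask built from `df.sum()` with a row mask built from `df.sum(axis=1)`. The data is rebuilt from the article's printed values, which is an assumption since the original numbers are random.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2017-04-01", periods=6)
# Values copied from the printed table (six-decimal display precision).
df = pd.DataFrame(
    np.array([
        [0.522241,  0.495106, -0.268194, -0.035003],
        [2.104572, -0.977768, -0.139632, -0.735926],
        [0.480507,  1.215048,  1.313314, -0.072320],
        [1.700309,  0.287588, -0.012103,  0.525291],
        [0.526615, -0.417645,  0.405853, -0.835213],
        [1.143858, -0.326720,  1.425379,  0.531037],
    ]),
    index=dates,
    columns=list("ABCD"),
)

# df.sum() has one entry per column, so it filters columns.
cols = df.loc["2017-04-01":"2017-04-04", df.sum() > 1]

# df.sum(axis=1) has one entry per row, so it filters rows.
rows = df.loc["2017-04-01":"2017-04-04"].loc[lambda x: x.sum(axis=1) > 1]

print(list(cols.columns))
print(list(rows.index.strftime("%Y-%m-%d")))
```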

You can also combine filtering with the apply method to build more complex filters, using any function that returns a boolean as the filter condition:

df[df.apply(lambda x: x['B'] > x['C'], axis=1)]
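When the condition can be written directly on whole columns, a vectorized comparison gives the same result as the row-wise apply and is typically faster on large frames. A sketch using the B and C columns from the article's printed values (an assumption, since the original numbers are random):

```python
import pandas as pd

dates = pd.date_range("2017-04-01", periods=6)
# B and C columns copied from the printed table.
df = pd.DataFrame(
    {
        "B": [0.495106, -0.977768, 1.215048, 0.287588, -0.417645, -0.326720],
        "C": [-0.268194, -0.139632, 1.313314, -0.012103, 0.405853, 1.425379],
    },
    index=dates,
)

# Row-wise: the lambda is called once per row.
rowwise = df[df.apply(lambda x: x["B"] > x["C"], axis=1)]

# Vectorized: one comparison over the two columns.
vectorized = df[df["B"] > df["C"]]

print(vectorized)
```

apply remains useful when the condition needs logic that cannot be expressed as column arithmetic.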

Using isin

df['E']=['one', 'one','two','three','four','three']

                   A         B         C         D      E
2017-04-01  0.522241  0.495106 -0.268194 -0.035003    one
2017-04-02  2.104572 -0.977768 -0.139632 -0.735926    one
2017-04-03  0.480507  1.215048  1.313314 -0.072320    two
2017-04-04  1.700309  0.287588 -0.012103  0.525291  three
2017-04-05  0.526615 -0.417645  0.405853 -0.835213   four
2017-04-06  1.143858 -0.326720  1.425379  0.531037  three

df[df.E.isin(['one'])]
Out:
                   A         B         C         D    E
2017-04-01  0.522241  0.495106 -0.268194 -0.035003  one
2017-04-02  2.104572 -0.977768 -0.139632 -0.735926  one
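isin also accepts several values at once, and the resulting mask can be negated with `~` to exclude values instead. A small sketch using column A and the E labels from the example above:

```python
import pandas as pd

dates = pd.date_range("2017-04-01", periods=6)
df = pd.DataFrame(
    {
        "A": [0.522241, 2.104572, 0.480507, 1.700309, 0.526615, 1.143858],
        "E": ["one", "one", "two", "three", "four", "three"],
    },
    index=dates,
)

# Keep rows whose E value is any of the listed values.
in_set = df[df["E"].isin(["one", "three"])]

# Negate the mask to drop rows whose E value is 'one'.
not_one = df[~df["E"].isin(["one"])]

print(in_set)
print(not_one)
```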