pandas cumulative operations, binning, and window functions

cummax, cummin, cumprod, cumsum

Sometimes, for each row, we need the maximum, minimum, product, or sum of all values from the first row up to and including the current row.

import pandas as pd


df = pd.DataFrame({"a": [10, 20, 15, 50, 40]})

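# cummax: the maximum of all values from the first row through the current row
# Row 1 is 10 and row 2 is 20; row 3 is 15, which is less than 20, so the result stays 20; row 4 is 50, and likewise row 5 stays 50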
print(df["a"].cummax())
"""
0    10
1    20
2    20
3    50
4    50
Name: a, dtype: int64
"""

# cummin: the running minimum, which works the same way
print(df["a"].cummin())
"""
0    10
1    10
2    10
3    10
4    10
Name: a, dtype: int64
"""

# cumprod: the cumulative product up to each row
print(df["a"].cumprod())
"""
0         10
1        200
2       3000
3     150000
4    6000000
Name: a, dtype: int64
"""

# cumsum: the cumulative sum up to each row
print(df["a"].cumsum())
"""
0     10
1     30
2     45
3     95
4    135
Name: a, dtype: int64
"""
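
The cumulative methods combine naturally with ordinary column arithmetic. As a small sketch (the name drop_from_peak is only illustrative and not part of pandas), subtracting the running maximum from the column shows how far each value sits below the highest value seen so far:

```python
import pandas as pd

df = pd.DataFrame({"a": [10, 20, 15, 50, 40]})

# Running maximum seen so far, as in the cummax example above
running_max = df["a"].cummax()

# How far each value sits below the running maximum (0 means a new high)
drop_from_peak = df["a"] - running_max
print(drop_from_peak.tolist())  # [0, 0, -5, 0, -10]
```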

shift: moving values up or down

import pandas as pd


df = pd.DataFrame({"a": range(1, 10)})
print(df)
"""
   a
0  1
1  2
2  3
3  4
4  5
5  6
6  7
7  8
8  9
"""

df["b"] = df["a"].shift(1)
df["c"] = df["a"].shift(-1)
print(df)
"""
   a    b    c
0  1  NaN  2.0
1  2  1.0  3.0
2  3  2.0  4.0
3  4  3.0  5.0
4  5  4.0  6.0
5  6  5.0  7.0
6  7  6.0  8.0
7  8  7.0  9.0
8  9  8.0  NaN
"""

As we can see, calling shift(n) on a column moves its values down or up. When n is greater than 0 the values move down by n rows; when n is less than 0 they move up by |n| rows. Because the values are moved, NaN inevitably appears in the vacated positions, which is also why columns b and c become float.

Picture a frame around the column: shift(1) slides the frame up by one row, and whatever the frame now encloses becomes the new column.
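
If the NaN is unwanted, shift also accepts a fill_value argument. The sketch below assumes that 0 is an acceptable placeholder for the vacated positions, which depends on the data:

```python
import pandas as pd

s = pd.Series(range(1, 6))

# fill_value replaces the NaN that the shift would otherwise introduce
down = s.shift(1, fill_value=0)
up = s.shift(-1, fill_value=0)
print(down.tolist())  # [0, 1, 2, 3, 4]
print(up.tolist())    # [2, 3, 4, 5, 0]
```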

Suppose we need the difference between each element of a column and the element before it. How can we do that?

import pandas as pd


df = pd.DataFrame({"a": [10, 20, 15, 50, 40]})

print(df["a"] - df["a"].shift(1))
"""
0     NaN
1    10.0
2    -5.0
3    35.0
4   -10.0
Name: a, dtype: float64
"""

diff: differences between rows

We have already implemented this by shifting manually and then subtracting. There is a simpler way, though: diff(n). When n is greater than 0, it subtracts the value n rows earlier from the current row; when n is less than 0, it subtracts the value |n| rows later from the current row.

import pandas as pd


df = pd.DataFrame({"a": [10, 20, 15, 50, 40]})

df["b"] = df["a"].diff(1)
df["c"] = df["a"].diff(-1)
print(df)
"""
    a     b     c
0  10   NaN -10.0
1  20  10.0   5.0
2  15  -5.0 -35.0
3  50  35.0  10.0
4  40 -10.0   NaN
"""
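
The relationship between diff and shift described above can be checked directly. This is only a quick sanity check on the sample data, not a statement about how pandas implements diff internally:

```python
import pandas as pd

s = pd.Series([10, 20, 15, 50, 40])

for n in (1, -1, 2):
    # Manual version: current row minus the row n positions away
    manual = s - s.shift(n)
    # Series.equals treats NaN in matching positions as equal
    assert s.diff(n).equals(manual)

print(s.diff(2).tolist())  # [nan, nan, 5.0, 30.0, 25.0]
```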

pct_change: relative change between rows

pct_change(n) compares the same rows as diff(n) but adds one more step: after the subtraction, the difference is divided by the value that was subtracted.

import pandas as pd


df = pd.DataFrame({"a": [10, 20, 15, 50, 40]})

df["b"] = df["a"].diff(1)
df["b_pct"] = df["a"].pct_change(1)
df["c"] = df["a"].diff(-1)
df["c_pct"] = df["a"].pct_change(-1)
print(df)
"""
    a     b     b_pct     c     c_pct
0  10   NaN       NaN -10.0 -0.500000
1  20  10.0  1.000000   5.0  0.333333
2  15  -5.0 -0.250000 -35.0 -0.700000
3  50  35.0  2.333333  10.0  0.250000
4  40 -10.0 -0.200000   NaN       NaN
"""
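
The "difference divided by the subtracted value" rule can be reproduced by hand with shift. The sketch below compares the manual result with pct_change on the sample data, using a tolerance-based comparison because the two routes may differ in the last floating-point digit:

```python
import pandas as pd

s = pd.Series([10, 20, 15, 50, 40])

for n in (1, -1):
    other = s.shift(n)            # the row being subtracted
    manual = (s - other) / other  # difference divided by that row
    # assert_series_equal compares floats with a small tolerance by default
    pd.testing.assert_series_equal(s.pct_change(n), manual)
```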

cut: binning

Sometimes we need to sort data into categories. Take test scores: below 60 is a fail, from 60 up to but not including 80 is good, and from 80 up to and including 100 is excellent.

import pandas as pd


df = pd.DataFrame({"a": [60, 50, 80, 96, 75]})

# bins: an array of bin edges in ascending order
df["b"] = pd.cut(df["a"], bins=[0, 59, 79, 100])
# labels can also be given. Note: len(labels) == len(bins) - 1, because n bin edges form n - 1 intervals
df["c"] = pd.cut(df["a"], bins=[0, 59, 79, 100], labels=["fail", "good", "excellent"])
print(df)
"""
    a          b          c
0  60   (59, 79]       good
1  50    (0, 59]       fail
2  80  (79, 100]  excellent
3  96  (79, 100]  excellent
4  75   (59, 79]       good
"""

Note that the intervals are open on the left and closed on the right. To make them closed on the left and open on the right instead, pass right=False; the default is True.
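
With right=False the bin edges can be written exactly as the grading rule states them. The sketch below is one possible setup: the scores 0 and 100 are added for illustration, and the last edge is 101 (an assumption made here so that a score of 100 still lands in a bin). With the bins used earlier, a score of 0 would fall outside (0, 59] and come back as NaN.

```python
import pandas as pd

scores = pd.Series([60, 50, 80, 96, 75, 0, 100])

# right=False makes each bin closed on the left and open on the right:
# [0, 60), [60, 80), [80, 101)
grades = pd.cut(
    scores,
    bins=[0, 60, 80, 101],
    right=False,
    labels=["fail", "good", "excellent"],
)
print(grades.tolist())
# ['good', 'fail', 'excellent', 'excellent', 'good', 'fail', 'excellent']
```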

rolling: window functions

Suppose we have a year of historical data and need an average over every 8 consecutive days. How can we do that? For example, the first result is the average of days 1 to 8, the second is the average of days 2 to 9, and the third is the average of days 3 to 10.

import pandas as pd


df = pd.DataFrame({"a": [10, 20, 10, 60, 40, 20, 50]})

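# Calling rolling(n) opens a window of n rows at every row
# The window can then be used to compute the mean
# n should be a positive integer; a negative n raises an error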
window = df["a"].rolling(2)
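# The window looks upward: with n = 2, each row is taken together with the row above it, two rows in all. So the first row is NaN, because there is nothing above it.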
print(window.mean())
"""
0     NaN
1    15.0
2    15.0
3    35.0
4    50.0
5    30.0
6    35.0
Name: a, dtype: float64
"""
# Likewise, n = 3 means counting three rows upward, current row included
# so both row 1 and row 2 are NaN
print(df["a"].rolling(3).mean())
"""
0          NaN
1          NaN
2    13.333333
3    30.000000
4    36.666667
5    40.000000
6    36.666667
Name: a, dtype: float64
"""

# Besides the mean, we can of course compute the maximum, minimum, sum, and so on
df["b"] = df["a"].rolling(2).max()
df["c"] = df["a"].rolling(2).min()
df["d"] = df["a"].rolling(2).sum()
print(df)
"""
    a     b     c      d
0  10   NaN   NaN    NaN
1  20  20.0  10.0   30.0
2  10  20.0  10.0   30.0
3  60  60.0  10.0   70.0
4  40  60.0  40.0  100.0
5  20  40.0  20.0   60.0
6  50  50.0  20.0   70.0
"""

# Custom functions work too
df["e"] = df["a"].rolling(2).sum()
# Each window can be thought of as a Series; df["a"].rolling(2).sum() is equivalent to df["a"].rolling(2).agg("sum")
# Here we add 1 to each window's sum
df["f"] = df["a"].rolling(2).agg(lambda x: sum(x) + 1)
print(df[["e", "f"]])
"""
       e      f
0    NaN    NaN
1   30.0   31.0
2   30.0   31.0
3   70.0   71.0
4  100.0  101.0
5   60.0   61.0
6   70.0   71.0
"""

# With n = 1, each window is just the row itself
# Whatever method is called returns the row's own value, because the window size is 1
print((df["a"].rolling(1).sum() == df["a"].rolling(1).min()).all())  # True
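
Two practical variations on the examples above, offered as a sketch rather than the only way to do it. First, min_periods lets a window produce a result before it is full, which removes the leading NaN. Second, since a window ends at the current row, the "row 1 holds the average of days 1 to n" layout described at the start of this section can be obtained by shifting the result up by n - 1 rows (the name forward is only illustrative):

```python
import pandas as pd

s = pd.Series([10, 20, 10, 60, 40, 20, 50])

# min_periods=1 lets a window produce a result as soon as it holds one value,
# so the first rows are averaged over the rows available instead of being NaN
partial = s.rolling(3, min_periods=1).mean()
print(partial.round(2).tolist())
# [10.0, 15.0, 13.33, 30.0, 36.67, 40.0, 36.67]

# A window of n rows ends at the current row; shifting the result up by n - 1
# relabels it so that row i holds the mean of rows i .. i + n - 1
n = 2
forward = s.rolling(n).mean().shift(-(n - 1))
print(forward.tolist())
# [15.0, 15.0, 35.0, 50.0, 30.0, 35.0, nan]
```

The trailing NaN in forward appears because the last row has no following row to complete its window.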

Origin www.cnblogs.com/traditional/p/12234328.html