1. String and text data
Series supports string-processing methods that make it easy to operate on each element of the array. Perhaps most importantly, these methods automatically skip missing (NA) values and leave them as missing in the result. They are accessed through the str attribute of a Series. Their names generally match Python's built-in string methods, such as .str.lower() and .str.upper(), alongside helpers such as .str.len().
Example:
import pandas as pd
import numpy as np
s = pd.Series([' Tom ',' xiaoming ',' john '])
s
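# strip leading and trailing whitespace
s.str.strip()
# split each string on 'o'
s.str.split('o')
# concatenate all strings with a separator
s.str.cat(sep="<=>")
# one-hot encode the distinct values
s.str.get_dummies()
# test whether each string contains 'm'
s.str.contains('m')
# replace 'o' with 'dd'
s.str.replace('o',"dd")
# count the occurrences of 'i' in each string
s.str.count('i')
# test whether each string in a Series is numeric
s = pd.Series([' Tom ','778899',' xiaoming ',' john '])
s.str.isnumeric()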
Output:
# original Series
0 Tom
1 xiaoming
2 john
dtype: object
# after stripping whitespace
0 Tom
1 xiaoming
2 john
dtype: object
# strings split on 'o'
0 [ T, m ]
1 [ xia, ming ]
2 [ j, hn ]
dtype: object
# concatenated string
' Tom <=> xiaoming <=> john '
# one-hot encoding
Tom john xiaoming
0 1 0 0
1 0 0 1
2 0 1 0
# whether each string contains 'm'
0 True
1 True
2 False
dtype: bool
# after replacing 'o' with 'dd'
0 Tddm
1 xiaddming
2 jddhn
dtype: object
# count of 'i' in each string
0 0
1 2
2 0
dtype: int64
# whether each string is numeric
0 False
1 True
2 False
3 False
dtype: bool
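The claim that string methods skip missing values can be checked with a short sketch. The sample data below is hypothetical (it is not the Series used above); it only assumes that one entry is NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: the middle entry is missing (NaN)
names = pd.Series([" Tom ", np.nan, " john "])

# String methods skip the missing entry and leave it missing in the result
stripped_upper = names.str.strip().str.upper()
lengths = names.str.len()

print(stripped_upper)
print(lengths)
```

No error is raised for the missing entry; it simply stays missing, which is why the methods can be chained without checking for NaN first.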
2. Indexing and selecting data
In pandas, besides indexing with a plain subscript or a column name, you can index data with the .loc and .iloc indexers. Both are used with square brackets rather than called like methods.
2.1 .loc[] selects by label
pandas provides a suite of methods for purely label-based indexing. This is a strict inclusion-based protocol: every label requested must be in the index, or a KeyError is raised. When slicing, both the start bound and the stop bound are included, if present in the index. Integers are valid labels, but they refer to the label and not the position.
The .loc attribute is the primary access method. The following are valid inputs:
- A single label, such as 5 or 'a'. (Note that 5 is interpreted as a label of the index; this use is not an integer position along the index.)
- A list or array of labels, such as ['a', 'b', 'c'].
- A slice object with labels, such as 'a':'f'. (Note that, contrary to usual Python slices, both the start and the stop are included when present in the index. See Slicing with labels.)
- A boolean array.
- A callable; see Selection by Callable.
Example: randomly generate an eight-row, four-column DataFrame to work with
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(8,4),index=['a','b','c','d','e','f','g','h'],columns=["A","B","C","D"])
df
# output:
A B C D
a 0.529671 -0.076485 0.379469 1.494926
b -0.082312 -0.328869 0.175183 -0.798430
c 0.681922 0.741320 -0.910726 -2.176608
d 1.500632 -1.165229 0.316722 0.402977
e -2.044217 0.930242 0.433050 0.542472
f 1.332038 0.476599 1.661994 2.102483
g 0.488362 -1.667154 -0.651079 -0.049332
h -0.676308 0.904894 1.592176 0.409881
1. Select all rows of columns A and B (using a list of labels)
# select every row of columns A and B, by label
df.loc[:,['A','B']]
# output:
A B
a 0.529671 -0.076485
b -0.082312 -0.328869
c 0.681922 0.741320
d 1.500632 -1.165229
e -2.044217 0.930242
f 1.332038 0.476599
g 0.488362 -1.667154
h -0.676308 0.904894
2. Select rows a through e and the columns from B onward (using slices)
# select rows a to e and columns B onward
df.loc['a':'e','B':]
# output:
B C D
a -0.076485 0.379469 1.494926
b -0.328869 0.175183 -0.798430
c 0.741320 -0.910726 -2.176608
d -1.165229 0.316722 0.402977
e 0.930242 0.433050 0.542472
3. Test which values in row a are greater than 1
# boolean mask of the values in row a that are greater than 1
df.loc['a']>1
# output:
A False
B False
C False
D True
Name: a, dtype: bool
4. Select the columns whose value in row a is greater than 1
# select the columns where row a is greater than 1
df.loc[:,df.loc['a']>1]
# output:
D
a 1.494926
b -0.798430
c -2.176608
d 0.402977
e 0.542472
f 2.102483
g -0.049332
h 0.409881
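The remaining .loc behaviours described in this section (inclusive slices, callables, and KeyError for a missing label) can be sketched on a small frame with fixed values. The frame below is illustrative and is not the random data used above.

```python
import pandas as pd

# Small deterministic frame (illustrative values, not the random data above)
df = pd.DataFrame(
    {"A": [1, 2, 3, 4], "B": [10, 20, 30, 40]},
    index=["a", "b", "c", "d"],
)

# Label slices include both endpoints: rows a, b and c are returned
inclusive = df.loc["a":"c", "A"]

# A callable receives the frame and returns a valid indexer (here a boolean mask)
large_b = df.loc[lambda frame: frame["B"] > 20]

# A label that is not in the index raises KeyError
try:
    df.loc["z"]
    missing_raises = False
except KeyError:
    missing_raises = True

print(inclusive.tolist(), large_b.index.tolist(), missing_raises)
```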
2.2 .iloc[] selects by position
pandas provides a suite of methods for purely integer-based indexing. The semantics closely follow Python and NumPy slicing. These are 0-based indexes. When slicing, the start bound is included and the upper bound is excluded. Attempting to use a non-integer, even a valid label, will raise an IndexError.
The .iloc attribute is the primary access method. The following are valid inputs:
- An integer, e.g. 5.
- A list or array of integers, e.g. [4, 3, 0].
- A slice object with ints, e.g. 1:7.
- A boolean array.
- A callable; see Selection by Callable.
Example: using the DataFrame from the previous section
A B C D
a 0.529671 -0.076485 0.379469 1.494926
b -0.082312 -0.328869 0.175183 -0.798430
c 0.681922 0.741320 -0.910726 -2.176608
d 1.500632 -1.165229 0.316722 0.402977
e -2.044217 0.930242 0.433050 0.542472
f 1.332038 0.476599 1.661994 2.102483
g 0.488362 -1.667154 -0.651079 -0.049332
h -0.676308 0.904894 1.592176 0.409881
1. Select a row by its position
# index by (row) position
df.iloc[0]
# output:
A 0.529671
B -0.076485
C 0.379469
D 1.494926
Name: a, dtype: float64
df.iloc[1]
# output:
A -0.082312
B -0.328869
C 0.175183
D -0.798430
Name: b, dtype: float64
2. Select everything from the fourth row onward and from the second column onward
df.iloc[3:,1:]
# output:
B C D
d -1.165229 0.316722 0.402977
e 0.930242 0.433050 0.542472
f 0.476599 1.661994 2.102483
g -1.667154 -0.651079 -0.049332
h 0.904894 1.592176 0.409881
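The difference between positional and label slicing is easy to get wrong, so here is a minimal side-by-side sketch. It uses a small illustrative frame with fixed values rather than the random data above.

```python
import pandas as pd

# Small deterministic frame (illustrative values)
df = pd.DataFrame(
    {"A": [1, 2, 3, 4], "B": [10, 20, 30, 40]},
    index=["a", "b", "c", "d"],
)

# Positional slice: position 1 is included, position 3 is excluded
by_position = df.iloc[1:3]

# Label slice: both 'b' and 'd' are included
by_label = df.loc["b":"d"]

# A list of integer positions returns rows in the requested order
reordered = df.iloc[[3, 0]]

print(by_position.index.tolist())
print(by_label.index.tolist())
print(reordered.index.tolist())
```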
2.3 Get data using attributes
Columns of a DataFrame like the one above can also be retrieved as attributes. (The values printed below come from a different random draw than the table shown earlier.)
Example: retrieve column A and column D
# attribute access: retrieve column A
df.A
# output:
a 1.310455
b -1.015628
c 1.281924
d 0.496812
e -1.733183
f 0.140338
g -0.179063
h -0.642013
Name: A, dtype: float64
df.D
# output:
a -0.298131
b -1.141310
c -0.302760
d 1.188531
e -1.608952
f 0.437460
g -0.696010
h -0.525048
Name: D, dtype: float64
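Attribute access is a convenience and does not work for every column name. The sketch below uses a hypothetical frame whose column names were chosen to show two cases where it does not apply: a name that collides with an existing DataFrame method, and a name that is not a valid Python identifier.

```python
import pandas as pd

# Hypothetical frame: "count" collides with the DataFrame.count method,
# and "unit price" is not a valid Python identifier
df = pd.DataFrame({"A": [1, 2], "count": [3, 4], "unit price": [5.0, 6.0]})

# For an ordinary column name, attribute access matches bracket access
same = df.A.equals(df["A"])

# df.count is the method, not the column; brackets always reach the column
count_is_method = callable(df.count)
count_column = df["count"].tolist()
price_column = df["unit price"].tolist()

print(same, count_is_method, count_column, price_column)
```

Bracket indexing works for every column name, so it is the safer default in reusable code.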
3. Statistical functions
pandas provides a variety of statistical functions, such as percent change (.pct_change()), covariance (.cov()), correlation (.corr()) and data ranking (.rank()).
3.1 Percent change .pct_change()
Series and DataFrame have a .pct_change() method that computes the percent change over a given number of periods (using fill_method to fill NA/null values before computing the percent change).
Basic syntax:
Series.pct_change()
or
DataFrame.pct_change(periods=number_of_rows)
Example:
import pandas as pd
import numpy as np
# create a basic Series
s = pd.Series([877,865,874,890,912])
s
# output:
0 877
1 865
2 874
3 890
4 912
dtype: int64
# create a basic DataFrame
df = pd.DataFrame(np.random.randn(5, 4))
df
# output:
0 1 2 3
0 0.655875 -2.195588 -0.785019 1.122582
1 0.852057 -2.276063 1.528201 -0.167119
2 -1.057979 -0.396548 -0.915528 0.026226
3 -0.490155 1.803235 0.005851 -1.252117
4 0.946558 -2.680471 -0.055739 -0.624553
Compute the percent change:
# percent change between consecutive values
s.pct_change()
# output:
0 NaN
1 -0.013683
2 0.010405
3 0.018307
4 0.024719
dtype: float64
# percent change from the previous row, per column
df.pct_change(periods=1)
# output:
0 1 2 3
0 NaN NaN NaN NaN
1 0.299115 0.036653 -2.946706 -1.148870
2 -2.241677 -0.825775 -1.599088 -1.156933
3 -0.536707 -5.547331 -1.006391 -48.742482
4 -2.931143 -2.486479 -10.526903 -0.501202
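To make the calculation concrete, the sketch below reproduces .pct_change() by hand on the Series from the example, dividing each value by an earlier one and subtracting 1. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series([877, 865, 874, 890, 912])

# pct_change() compares each value with the previous one
automatic = s.pct_change()
manual = s / s.shift(1) - 1

# periods=2 compares each value with the one two rows earlier
two_back = s.pct_change(periods=2)
manual_two_back = s / s.shift(2) - 1

print(automatic.round(6).tolist())
print(two_back.round(6).tolist())
```

With periods=2 the first two rows have no earlier value to compare against, so both are NaN.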
3.2 Covariance.cov()
Series.cov() can be used to calculate the covariance between series (excluding missing values).
DataFrame.cov() computes pairwise covariance between series in a DataFrame, also excluding NA/null values.
Example:
# compute the covariance between two Series
s1 = pd.Series(np.random.randn(10))
s2 = pd.Series(np.random.randn(10))
# covariance of the two Series
s1.cov(s2)
# output:
-0.0751790891671201
# compute the pairwise covariance of the columns in a DataFrame
frame = pd.DataFrame(np.random.randn(1000, 5), columns=["a", "b", "c", "d", "e"])
frame.cov()
# output:
a b c d e
a 1.000882 -0.003177 -0.002698 -0.006889 0.031912
b -0.003177 1.024721 0.000191 0.009212 0.000857
c -0.002698 0.000191 0.950735 -0.031743 -0.005087
d -0.006889 0.009212 -0.031743 1.002983 -0.047952
e 0.031912 0.000857 -0.005087 -0.047952 1.042487
DataFrame.cov also supports an optional min_periods keyword that specifies the minimum number of observations required for each pair of columns in order to produce a valid result. For example:
frame.cov(min_periods=12)
Each pair of columns must have at least 12 rows in which both values are present; a pair with fewer than 12 such observations yields NaN.
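A minimal sketch of the min_periods behaviour, using hypothetical data in which one column is missing half of its values so that the threshold of 12 is not met for every pair:

```python
import numpy as np
import pandas as pd

# Hypothetical data: 20 rows, but column "b" is missing its first 10 values
a = np.arange(20, dtype=float)
b = a.copy()
b[:10] = np.nan
frame = pd.DataFrame({"a": a, "b": b})

# Only 10 rows have both "a" and "b" present, fewer than the 12 required
result = frame.cov(min_periods=12)

print(result)
```

The ("a", "a") entry is still computed because column "a" has 20 observations, while every entry involving "b" is NaN.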
3.3 Correlation.corr()
Correlation can be computed with the .corr() method. Its method parameter selects one of several ways of computing the correlation:
Method name | Description |
---|---|
pearson (default) | Standard correlation coefficient |
kendall | Kendall Tau correlation coefficient |
spearman | Spearman rank correlation coefficient |
Example:
1. Correlation between two series
s1 = pd.Series(np.random.randn(10))
s2 = s1*2
# correlation between s1 and s2
s1.corr(s2)
# output:
0.9999999999999999
2. The correlation between the three sets of data (dataframe)
s1 = pd.Series(np.random.randn(10))
s2 = s1*2
s3 = pd.Series(np.random.randn(10))
df = pd.DataFrame({
's1':s1,
's2':s2,
's3':s3
})
df
# resulting DataFrame
s1 s2 s3
0 -1.149359 -2.298718 0.742016
1 0.476084 0.952168 -0.375759
2 -0.998627 -1.997255 0.721653
3 1.047331 2.094663 -0.078039
4 0.444710 0.889420 -0.525895
5 -0.411778 -0.823557 -0.402789
6 -0.935911 -1.871822 -0.597614
7 -0.652570 -1.305140 0.636498
8 1.055361 2.110722 -0.763907
9 -1.222631 -2.445262 -0.153914
# pairwise correlation of the three columns
df.corr()
# output:
s1 s2 s3
s1 1.000000 1.000000 -0.548589
s2 1.000000 1.000000 -0.548589
s3 -0.548589 -0.548589 1.000000
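The method parameter matters when a relationship is monotonic but not linear. The sketch below uses hypothetical data (y is the cube of x) to compare the Pearson and Spearman coefficients through DataFrame.corr.

```python
import pandas as pd

# Hypothetical data: y grows monotonically with x, but not linearly
x = pd.Series(range(1, 11), dtype=float)
df = pd.DataFrame({"x": x, "y": x ** 3})

# Pearson (the default) measures linear association
pearson = df.corr(method="pearson").loc["x", "y"]

# Spearman works on ranks, so a strictly increasing relationship scores 1
spearman = df.corr(method="spearman").loc["x", "y"]

print(pearson, spearman)
```

Pearson comes out below 1 because the points do not lie on a straight line, while Spearman only looks at the ordering of the values.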
3.4 Data Ranking.rank()
The .rank() method produces a ranking of the data in which ties are assigned the mean of the ranks for the group, for example:
s = pd.Series([877,865,874,890,912])
s
# output:
0 877
1 865
2 874
3 890
4 912
dtype: int64
s.rank()
# output:
0 3.0
1 1.0
2 2.0
3 4.0
4 5.0
dtype: float64
In a DataFrame, rank() can rank either the rows (axis=0) or the columns (axis=1). NaN values are excluded from the ranking.
rank optionally takes an ascending argument, which defaults to True; when False, the data is reverse-ranked, with larger values assigned smaller ranks.
rank supports different tie-breaking methods, specified with the method parameter:
- average: average rank of the tied group
- min: lowest rank in the group
- max: highest rank in the group
- first: ranks assigned in the order the values appear in the array
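The tie-breaking options are easiest to compare on data that actually contains a tie. The Series below is a hypothetical example with two equal values competing for ranks 2 and 3.

```python
import pandas as pd

# Hypothetical data with one tie: the two 20s compete for ranks 2 and 3
s = pd.Series([10, 20, 20, 30])

average = s.rank()                     # default method="average"
lowest = s.rank(method="min")
highest = s.rank(method="max")
in_order = s.rank(method="first")
descending = s.rank(ascending=False)   # largest value gets rank 1

print(average.tolist())     # [1.0, 2.5, 2.5, 4.0]
print(lowest.tolist())      # [1.0, 2.0, 2.0, 4.0]
print(highest.tolist())     # [1.0, 3.0, 3.0, 4.0]
print(in_order.tolist())    # [1.0, 2.0, 3.0, 4.0]
print(descending.tolist())  # [4.0, 2.5, 2.5, 1.0]
```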
4. Window functions
pandas includes a compact set of APIs for performing window operations, that is, operations that perform an aggregation over a sliding partition of values. The API works much like the groupby API: a Series or DataFrame calls the windowing method with the necessary parameters and then calls the aggregation function.
pandas supports 4 types of window operations:
- Rolling window: generic fixed or variable sliding window over the values.
- Weighted window: weighted, non-rectangular window supplied by the scipy.signal library.
- Expanding window: accumulating window over the values.
- Exponentially weighted window: accumulating and exponentially weighted window over the values.
Concept | Method | Returned object | Supports time-based windows | Supports chained groupby | Supports table method | Supports online operations |
---|---|---|---|---|---|---|
Rolling window | rolling | Rolling | Yes | Yes | Yes (since version 1.3) | No |
Weighted window | rolling | Window | No | No | No | No |
Expanding window | expanding | Expanding | No | Yes | Yes (since version 1.3) | No |
Exponentially weighted window | ewm | ExponentialMovingWindow | No | Yes (since version 1.2) | No | Yes (since version 1.3) |
Example:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(10,4))
df
# output:
0 1 2 3
0 2.599818 0.451315 -0.428038 0.035233
1 0.395523 -0.098377 0.059649 -0.489922
2 0.550164 -0.469461 1.193710 0.567562
3 1.483434 -0.793989 -0.738174 0.515078
4 0.395409 0.425578 -0.439963 -0.207277
5 -0.035479 -1.438315 -0.863333 -0.129948
6 -0.336889 -0.094188 -1.452638 0.083352
7 -0.626117 0.120990 -0.566740 0.665003
8 -1.437816 -0.112235 -0.232150 -0.099910
9 -0.582537 0.388641 1.008226 0.321893
1. .rolling(): rolling window
# rolling mean over each window of three rows
df.rolling(window=3).mean()
# output:
0 1 2 3
0 NaN NaN NaN NaN
1 NaN NaN NaN NaN
2 1.181835 -0.038841 0.275107 0.037625
3 0.809707 -0.453942 0.171729 0.197573
4 0.809669 -0.279291 0.005191 0.291788
5 0.614455 -0.602242 -0.680490 0.059284
6 0.007681 -0.368975 -0.918644 -0.084624
7 -0.332828 -0.470504 -0.960904 0.206135
8 -0.800274 -0.028478 -0.750509 0.216148
9 -0.882157 0.132465 0.069779 0.295662
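The first two rows of the rolling result are NaN because a window of three rows is not yet full. As a minimal sketch on a hypothetical Series with fixed values, the min_periods argument of rolling() relaxes that requirement:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Default: a result needs a full window of 3 values, so the first two are NaN
full_window = s.rolling(window=3).mean()

# min_periods=1 lets partial windows at the start produce a value
partial_ok = s.rolling(window=3, min_periods=1).mean()

print(full_window.tolist())   # [nan, nan, 2.0, 3.0, 4.0]
print(partial_ok.tolist())    # [1.0, 1.5, 2.0, 3.0, 4.0]
```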
2. .expanding(): expanding window
# expanding mean, requiring at least three rows
df.expanding(min_periods=3).mean()
# output:
0 1 2 3
0 NaN NaN NaN NaN
1 NaN NaN NaN NaN
2 1.181835 -0.038841 0.275107 0.037625
3 1.257235 -0.227628 0.021787 0.156988
4 1.084869 -0.096987 -0.070563 0.084135
5 0.898145 -0.320542 -0.202691 0.048455
6 0.721711 -0.288205 -0.381255 0.053440
7 0.553233 -0.237056 -0.404441 0.129885
8 0.332005 -0.223187 -0.385297 0.104352
9 0.240551 -0.162004 -0.245945 0.126106
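The exponentially weighted window has no example above, so here is a minimal sketch that places it next to an expanding mean on a hypothetical Series with fixed values. The choice of alpha=0.5 and adjust=False is an assumption made to keep the arithmetic easy to follow by hand.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Expanding mean: each row averages everything from the start up to that row
expanding_mean = s.expanding(min_periods=3).mean()

# Exponentially weighted mean; with adjust=False it follows the recursion
# y[t] = alpha * x[t] + (1 - alpha) * y[t-1], starting from y[0] = x[0]
ewm_mean = s.ewm(alpha=0.5, adjust=False).mean()

print(expanding_mean.tolist())  # [nan, nan, 2.0, 2.5, 3.0]
print(ewm_mean.tolist())        # [1.0, 1.5, 2.25, 3.125, 4.0625]
```

The expanding mean weights every past row equally, whereas the exponentially weighted mean gives recent rows more weight, so it tracks the rising values more closely.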