Pandas-03 (string and text data, index and select data, statistical functions, window functions)

Table of contents

1. String and text data

2. Indexing and selecting data

2.1 .loc() selects by label

2.2 .iloc() selects by position

2.3 Get data using attributes

3. Statistical functions

3.1 Percent change: .pct_change()

3.2 Covariance: .cov()

3.3 Correlation: .corr()

3.4 Data ranking: .rank()

4. Window functions


1. String and text data

Series supports string-processing methods that make it easy to operate on each element of the array. Perhaps their most important feature is that they automatically skip missing (NA) values. These methods are accessed through the str attribute of a Series. Their names generally match Python's built-in string methods, such as .lower() and .upper(); .len() returns the length of each string.

Example:

import pandas as pd
import numpy as np

s = pd.Series([' Tom ',' xiaoming ',' john '])
s

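# Strip leading and trailing whitespace
s.str.strip()

# Split each string on 'o'
s.str.split('o')

# Concatenate all strings, separated by "<=>"
s.str.cat(sep="<=>")

# Build indicator (one-hot) columns for the distinct values
s.str.get_dummies()

# Test whether each string contains 'm'
s.str.contains('m')

# Replace 'o' with "dd"
s.str.replace('o',"dd")

# Count occurrences of 'i' in each string
s.str.count('i')

# Test whether each string in a Series is numeric
s = pd.Series([' Tom ','778899',' xiaoming ',' john '])
s.str.isnumeric()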

Output result:

# Original Series
0          Tom 
1     xiaoming 
2         john 
dtype: object

# Whitespace stripped
0         Tom
1    xiaoming
2        john
dtype: object

# Strings split on 'o'
0         [ T, m ]
1    [ xia, ming ]
2        [ j, hn ]
dtype: object

# Strings concatenated
' Tom <=> xiaoming <=> john '

# Indicator columns
	 Tom 	 john 	 xiaoming 
0	1	0	0
1	0	0	1
2	0	1	0

# Whether each string contains 'm'
0     True
1     True
2    False
dtype: bool

# Strings after replacement
0          Tddm 
1     xiaddming 
2         jddhn 
dtype: object

# Count (number of 'i' characters in each string)
0    0
1    2
2    0
dtype: int64

# Whether each string in the Series is numeric
0    False
1     True
2    False
3    False
dtype: bool
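
The claim that the str methods skip missing values can be checked with a small sketch. The Series below is a made-up example (it is not from the original post), and the exact missing-value marker shown in the output may vary with the pandas version.

```python
import numpy as np
import pandas as pd

# Hypothetical Series mixing strings with missing values (np.nan and None)
s = pd.Series([" Tom ", np.nan, " xiaoming ", None])

# Missing entries are skipped by the str methods and stay missing in the result
stripped = s.str.strip()
lengths = s.str.len()

print(stripped)
print(lengths)
```

No exception is raised for the missing entries; they simply propagate as missing values in each result.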

2. Indexing and selecting data

In pandas, besides plain subscripting by position or by column name, you can also index data with .loc and .iloc. Both are indexers used with square brackets, for example df.loc[...] and df.iloc[...].

2.1 .loc() selects by label

pandas provides a set of methods for purely label-based indexing. This is a strict inclusion-based protocol: every label requested must be in the index, or a KeyError is raised. When slicing, both the start and the stop bound are included if they are present in the index. Integers are valid labels, but they refer to the label, not the position.

The .loc attribute is the primary access method. The following are valid inputs:

  • A single label, such as 5 or 'a'. (Note that 5 is interpreted as a label of the index, not as an integer position along the index.)

  • A list or array of labels, such as ['a', 'b', 'c'].

  • A slice object with labels, such as 'a':'f'. (Note that, contrary to usual Python slices, both the start and the stop are included when they are present in the index.)

  • A boolean array.

  • A callable that takes the calling Series or DataFrame as its one argument and returns valid output for indexing (one of the above).
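
As a minimal sketch of the input types listed above, the following uses a small hypothetical DataFrame whose index holds the integer labels 10, 20, 30, 40, chosen so that labels and positions cannot be confused. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical demo frame: integer labels that are deliberately not positions
df = pd.DataFrame(
    np.arange(12).reshape(4, 3),
    index=[10, 20, 30, 40],
    columns=["A", "B", "C"],
)

row_10 = df.loc[10]                              # single label: the row labeled 10
rows_list = df.loc[[30, 10]]                     # list of labels, in the requested order
rows_slice = df.loc[20:30]                       # label slice: 20 and 30 are both included
rows_bool = df.loc[[True, False, True, False]]   # boolean array: keep rows where True
rows_call = df.loc[lambda d: d["A"] > 3]         # callable: receives df, returns an indexer

# 0 is a valid position but not a label in this index, so .loc raises KeyError
try:
    df.loc[0]
    missing_label_error = False
except KeyError:
    missing_label_error = True
```

Note how df.loc[20:30] returns two rows, whereas a positional slice with the same numbers would return nothing for a four-row frame.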

Example: Randomly generate a DataFrame with eight rows and four columns to work with

import pandas as pd
import numpy as np


df = pd.DataFrame(np.random.randn(8,4),index=['a','b','c','d','e','f','g','h'],columns=["A","B","C","D"])
df

# Output:
	A	            B	        C	       D
a	0.529671	-0.076485	0.379469	1.494926
b	-0.082312	-0.328869	0.175183	-0.798430
c	0.681922	0.741320	-0.910726	-2.176608
d	1.500632	-1.165229	0.316722	0.402977
e	-2.044217	0.930242	0.433050	0.542472
f	1.332038	0.476599	1.661994	2.102483
g	0.488362	-1.667154	-0.651079	-0.049332
h	-0.676308	0.904894	1.592176	0.409881

1. Select all rows of columns A and B (a full slice for the rows, a list of labels for the columns)

# Select every row of columns A and B, by label
df.loc[:,['A','B']]

# Output:
        A	        B
a	0.529671	-0.076485
b	-0.082312	-0.328869
c	0.681922	0.741320
d	1.500632	-1.165229
e	-2.044217	0.930242
f	1.332038	0.476599
g	0.488362	-1.667154
h	-0.676308	0.904894

2. Select rows a through e and columns from B onward (slices)

# Select rows a to e and the columns from B onward
df.loc['a':'e','B':]


# Output:
        B	        C	        D
a	-0.076485	0.379469	1.494926
b	-0.328869	0.175183	-0.798430
c	0.741320	-0.910726	-2.176608
d	-1.165229	0.316722	0.402977
e	0.930242	0.433050	0.542472

3. Test which values in the row labeled a are greater than 1

# Boolean mask: which values in row a are greater than 1
df.loc['a']>1

# Output:
A    False
B    False
C    False
D     True
Name: a, dtype: bool

4. Select the columns whose value in row a is greater than 1

# Select the columns where row a is greater than 1
df.loc[:,df.loc['a']>1]


# Output:
        D
a	1.494926
b	-0.798430
c	-2.176608
d	0.402977
e	0.542472
f	2.102483
g	-0.049332
h	0.409881
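
The same boolean technique also works along the rows. The sketch below assumes a freshly generated random DataFrame (seeded only so that it is reproducible, so its values differ from the listing above) and keeps the rows where column A is positive.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)   # seed is an arbitrary choice for reproducibility
df = pd.DataFrame(
    rng.standard_normal((8, 4)),
    index=list("abcdefgh"),
    columns=list("ABCD"),
)

# Boolean Series aligned on the row index: True where column A is positive
mask = df["A"] > 0

# Keep only those rows, and only columns A and B
positive_a = df.loc[mask, ["A", "B"]]
print(positive_a)
```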

2.2 .iloc() selects by position

pandas provides a set of methods for purely integer-based indexing. The semantics closely follow Python and NumPy slicing, and positions are 0-based. When slicing, the start bound is included and the upper bound is excluded. Using a non-integer key, even one that is a valid label, raises an error (an IndexError according to older pandas documentation; recent versions may raise a TypeError instead).

The .iloc attribute is the primary access method. The following are valid inputs:

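  • An integer, e.g. 5.

  • A list or array of integers, such as [4, 3, 0].

  • A slice object with ints, such as 1:7.

  • A boolean array.

  • A callable that takes the calling Series or DataFrame as its one argument and returns valid output for indexing (one of the above).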
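
The input types above can be sketched on a small hypothetical DataFrame (the values are just 0 to 31, so results are easy to check by eye). The last lines illustrate that a slice may run past the end of the data, while a single out-of-bounds integer raises IndexError.

```python
import numpy as np
import pandas as pd

# Hypothetical demo frame with the same labels as the example DataFrame
df = pd.DataFrame(
    np.arange(32).reshape(8, 4),
    index=list("abcdefgh"),
    columns=list("ABCD"),
)

one_row = df.iloc[5]                          # integer: the sixth row (label 'f')
picked = df.iloc[[4, 3, 0]]                   # list of integers, in the requested order
sliced = df.iloc[1:7]                         # slice: positions 1 to 6, position 7 excluded
masked = df.iloc[[True, False] * 4]           # boolean array: every other row
called = df.iloc[lambda d: [0, len(d) - 1]]   # callable returning positions

# Slices tolerate out-of-bounds ends; a single out-of-bounds integer does not
tail = df.iloc[6:100]
try:
    df.iloc[100]
    out_of_bounds_error = False
except IndexError:
    out_of_bounds_error = True
```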

Example: using the same DataFrame as in the previous section

        A	        B	       C	       D
a	0.529671	-0.076485	0.379469	1.494926
b	-0.082312	-0.328869	0.175183	-0.798430
c	0.681922	0.741320	-0.910726	-2.176608
d	1.500632	-1.165229	0.316722	0.402977
e	-2.044217	0.930242	0.433050	0.542472
f	1.332038	0.476599	1.661994	2.102483
g	0.488362	-1.667154	-0.651079	-0.049332
h	-0.676308	0.904894	1.592176	0.409881

1. Index by (row) position

# Position-based row indexing
df.iloc[0]


# Output:
A    0.529671
B   -0.076485
C    0.379469
D    1.494926
Name: a, dtype: float64


df.iloc[1]

# Output:
A   -0.082312
B   -0.328869
C    0.175183
D   -0.798430
Name: b, dtype: float64

2. Select from the fourth row (position 3) onward and from the second column (position 1) onward

df.iloc[3:,1:]


# Output:
        B	        C	       D
d	-1.165229	0.316722	0.402977
e	0.930242	0.433050	0.542472
f	0.476599	1.661994	2.102483
g	-1.667154	-0.651079	-0.049332
h	0.904894	1.592176	0.409881

2.3 Get data using attributes

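Columns of a DataFrame can also be retrieved through attribute access. The values shown below do not match the table above, presumably because the random DataFrame was regenerated before this example was run.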

Example: Take out the data in column A and column D

# Attribute access: retrieve column A
df.A

# Output:
a    1.310455
b   -1.015628
c    1.281924
d    0.496812
e   -1.733183
f    0.140338
g   -0.179063
h   -0.642013
Name: A, dtype: float64



df.D

# Output:
a   -0.298131
b   -1.141310
c   -0.302760
d    1.188531
e   -1.608952
f    0.437460
g   -0.696010
h   -0.525048
Name: D, dtype: float64
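
Attribute access is a convenience and has limits. The sketch below uses a hypothetical DataFrame (the column names are invented for illustration) to show two cases where bracket access is needed: a column name that is not a valid Python identifier, and a column name that collides with an existing DataFrame method.

```python
import pandas as pd

# Hypothetical frame: "A" is a valid identifier, "unit price" contains a space,
# and "count" collides with the DataFrame.count method
df = pd.DataFrame({"A": [1, 2], "unit price": [3.0, 4.0], "count": [5, 6]})

# Attribute access returns the same column as bracket access
same = df.A.equals(df["A"])

# "unit price" cannot be written as an attribute, so brackets are required
price = df["unit price"]

# df.count is the method, not the column; brackets retrieve the column
count_is_method = callable(df.count)
count_column = df["count"]
```

For that reason, df["A"] is the safer form in reusable code, while df.A is handy for interactive work.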

3. Statistical functions

pandas provides a variety of statistical functions, such as percentage change (.pct_change()), covariance (.cov()), correlation (.corr()), and data ranking (.rank()).

3.1 Percent change: .pct_change()

Series and DataFrame both have a .pct_change() method that computes the percentage change over a given number of periods. The fill_method parameter controls how NA/null values are filled before the percentage change is calculated; note that it is deprecated in recent pandas releases.

Basic syntax:

Series.pct_change()
or
DataFrame.pct_change(periods=number_of_rows)

Example:

import pandas as pd
import numpy as np

# Create a basic Series
s = pd.Series([877,865,874,890,912])
s

# Output:
0    877
1    865
2    874
3    890
4    912
dtype: int64


# Create a basic DataFrame
df = pd.DataFrame(np.random.randn(5, 4))
df

# Output:
        0	         1	        2	       3
0	0.655875	-2.195588	-0.785019	1.122582
1	0.852057	-2.276063	1.528201	-0.167119
2	-1.057979	-0.396548	-0.915528	0.026226
3	-0.490155	1.803235	0.005851	-1.252117
4	0.946558	-2.680471	-0.055739	-0.624553

Compute the percentage change:

# Percentage change between consecutive values
s.pct_change()


# Output:
0         NaN
1   -0.013683
2    0.010405
3    0.018307
4    0.024719
dtype: float64



# Percentage change between consecutive rows
df.pct_change(periods=1)


# Output:

        0	        1	         2	         3
0	     NaN	     NaN	     NaN	     NaN
1	0.299115	0.036653	-2.946706	-1.148870
2	-2.241677	-0.825775	-1.599088	-1.156933
3	-0.536707	-5.547331	-1.006391	-48.742482
4	-2.931143	-2.486479	-10.526903	-0.501202
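
To make the calculation behind pct_change() concrete, the sketch below recomputes it by hand with shift() on the same Series used above. The by_hand names are illustrative; the formula is (current - previous) / previous.

```python
import pandas as pd

s = pd.Series([877, 865, 874, 890, 912])

# Default periods=1: compare each value with the previous one
by_method = s.pct_change()
by_hand = (s - s.shift(1)) / s.shift(1)

# periods=2: compare each value with the one two rows earlier
two_back = s.pct_change(periods=2)
two_back_by_hand = s / s.shift(2) - 1

print(pd.DataFrame({"pct_change": by_method, "by_hand": by_hand}))
```

The first periods rows are NaN because there is no earlier value to compare with.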

3.2 Covariance: .cov()

Series.cov() can be used to calculate the covariance between series (excluding missing values).

DataFrame.cov() computes pairwise covariance between series in a DataFrame, also excluding NA/null values.

Example:

# Two Series to compute the covariance between
s1 = pd.Series(np.random.randn(10))
s2 = pd.Series(np.random.randn(10))


# Covariance of the two Series
s1.cov(s2)


# Output:
-0.0751790891671201



# Compute the pairwise covariance of the columns in a DataFrame
frame = pd.DataFrame(np.random.randn(1000, 5), columns=["a", "b", "c", "d", "e"])


frame.cov()

# Output:
       a         b         c         d         e
a  1.000882 -0.003177 -0.002698 -0.006889  0.031912
b -0.003177  1.024721  0.000191  0.009212  0.000857
c -0.002698  0.000191  0.950735 -0.031743 -0.005087
d -0.006889  0.009212 -0.031743  1.002983 -0.047952
e  0.031912  0.000857 -0.005087 -0.047952  1.042487
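
As a sanity check on what .cov() returns, the sketch below recomputes the sample covariance by hand: the sum of the products of deviations from the mean, divided by n - 1. The random data is seeded only for reproducibility, so the numbers differ from the output above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)   # seed is an arbitrary choice for reproducibility
s1 = pd.Series(rng.standard_normal(10))
s2 = pd.Series(rng.standard_normal(10))

# Sample covariance by hand: sum of products of deviations, divided by (n - 1)
n = len(s1)
by_hand = ((s1 - s1.mean()) * (s2 - s2.mean())).sum() / (n - 1)
by_method = s1.cov(s2)

# In DataFrame.cov(), the diagonal holds each column's variance
frame = pd.DataFrame({"s1": s1, "s2": s2})
cov_matrix = frame.cov()
print(cov_matrix)
```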

DataFrame.cov also supports an optional min_periods keyword that specifies the minimum number of observations required for each pair of columns in order to produce a valid result. For example:

frame.cov(min_periods=12)

Each pair of columns must share at least 12 non-null observations (rows); if a pair has fewer than 12, its covariance is returned as NaN.
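
The effect of min_periods only shows up when data is missing, so the sketch below builds a hypothetical 20-row frame and blanks out part of two columns. The column names and the positions of the missing values are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)   # seed is an arbitrary choice for reproducibility
frame = pd.DataFrame(rng.standard_normal((20, 3)), columns=["a", "b", "c"])

# Introduce missing values: a keeps 15 valid rows, b keeps 10, c keeps all 20
frame.loc[frame.index[:5], "a"] = np.nan
frame.loc[frame.index[5:15], "b"] = np.nan

# Pairs with fewer than 12 shared non-null rows come back as NaN
cov12 = frame.cov(min_periods=12)
print(cov12)
```

Here a and c share 15 valid rows, so their covariance is computed, while every pair involving b has at most 10 shared rows and is reported as NaN.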

3.3 Correlation: .corr()

Correlation can be calculated with the .corr() method. Its method parameter selects one of several ways of computing the correlation:

Method name          Description
pearson (default)    Standard correlation coefficient
kendall              Kendall Tau correlation coefficient
spearman             Spearman rank correlation coefficient
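
The difference between the methods is easiest to see on data that is monotonic but not linear. The sketch below uses a hypothetical pair of columns, x and y = x cubed, and compares pearson with spearman. method="kendall" is called the same way; it is left out here because, as far as I know, pandas relies on SciPy being installed for it.

```python
import pandas as pd

# Hypothetical data: y is a monotonic but nonlinear function of x
df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0]})
df["y"] = df["x"] ** 3

# Pearson measures linear association, so it falls short of 1 here
pearson = df.corr(method="pearson").loc["x", "y"]

# Spearman works on ranks, so any strictly increasing relationship gives 1
spearman = df.corr(method="spearman").loc["x", "y"]

print(pearson, spearman)
```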

Example:

1. Correlation between two series

s1 = pd.Series(np.random.randn(10))
s2 = s1*2

# Correlation between s1 and s2
s1.corr(s2)

# Output:
0.9999999999999999

2. Correlation among three sets of data (DataFrame)

s1 = pd.Series(np.random.randn(10))
s2 = s1*2
s3 = pd.Series(np.random.randn(10))
df = pd.DataFrame({
    's1':s1,
    's2':s2,
    's3':s3
})

df
# Output:
        s1	        s2	       s3
0	-1.149359	-2.298718	0.742016
1	0.476084	0.952168	-0.375759
2	-0.998627	-1.997255	0.721653
3	1.047331	2.094663	-0.078039
4	0.444710	0.889420	-0.525895
5	-0.411778	-0.823557	-0.402789
6	-0.935911	-1.871822	-0.597614
7	-0.652570	-1.305140	0.636498
8	1.055361	2.110722	-0.763907
9	-1.222631	-2.445262	-0.153914


# Pairwise correlation of the three columns
df.corr()

# Output:
        s1	       s2	       s3
s1	1.000000	1.000000	-0.548589
s2	1.000000	1.000000	-0.548589
s3	-0.548589	-0.548589	1.000000

3.4 Data ranking: .rank()

The .rank() method produces a ranking of the data in which ties are assigned the mean of the ranks for the tied group, for example:

s = pd.Series([877,865,874,890,912])
s

# Output:
0    877
1    865
2    874
3    890
4    912
dtype: int64



s.rank()
# Output:
0    3.0
1    1.0
2    2.0
3    4.0
4    5.0
dtype: float64

On a DataFrame, rank() can rank along the rows (axis=0, the default, which ranks the values within each column) or along the columns (axis=1, which ranks the values within each row). NaN values are excluded from the ranking.

rank optionally takes an ascending argument that defaults to True; if it is False, the data is reverse-ranked, with larger values assigned smaller ranks.

rank supports different tie-breaking methods, specified with the method parameter:

  • average: average rank of the tied group (the default)

  • min: lowest rank in the group

  • max: highest rank in the group

  • first: ranks assigned in the order the values appear in the array
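
The tie-breaking methods and the ascending flag can be compared side by side. The sketch below uses a hypothetical Series with one tie (the two 20s compete for ranks 2 and 3) and collects the results in a DataFrame for easy reading.

```python
import pandas as pd

# Hypothetical data: the two 20s are tied for ranks 2 and 3
s = pd.Series([10, 20, 20, 30])

ranks = pd.DataFrame({
    "average": s.rank(method="average"),   # tied values share the mean rank 2.5
    "min": s.rank(method="min"),           # tied values both get rank 2
    "max": s.rank(method="max"),           # tied values both get rank 3
    "first": s.rank(method="first"),       # ties broken by order of appearance
    "descending": s.rank(ascending=False), # largest value gets rank 1
})
print(ranks)
```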

4. Window functions

pandas includes a compact set of APIs for performing window operations, that is, operations that perform an aggregation over a sliding partition of values. The API works much like the groupby API: a Series or DataFrame calls the windowing method with the necessary parameters and then calls the aggregation function.

pandas supports 4 types of window operations:

  1. Rolling window: generic fixed or variable sliding window over the values.

  2. Weighted window: weighted, non-rectangular window supplied by the scipy.signal library.

  3. Expanding window: accumulating window over the values.

  4. Exponentially weighted window: accumulating and exponentially weighted window over the values.

Concept                        Method     Returned object          Supports time-based windows  Supports chained groupby  Supports table method    Supports online operations
Rolling window                 rolling    Rolling                  Yes                          Yes                       Yes (since version 1.3)  No
Weighted window                rolling    Window                   No                           No                        No                       No
Expanding window               expanding  Expanding                No                           Yes                       Yes (since version 1.3)  No
Exponentially weighted window  ewm        ExponentialMovingWindow  No                           Yes (since version 1.2)   No                       Yes (since version 1.3)
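
As a rough illustration of the window types, the sketch below applies rolling, expanding, and ewm to a tiny hypothetical Series, plus a time-based rolling window on a daily index. The weighted window is left out because it requires SciPy; alpha=0.5 and adjust=False are arbitrary choices that keep the arithmetic easy to follow.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Rolling window: mean of the last 3 values (NaN until 3 values are available)
rolling_mean = s.rolling(window=3).mean()

# Expanding window: mean of all values seen so far
expanding_mean = s.expanding().mean()

# Exponentially weighted window with adjust=False:
# y[0] = x[0], y[t] = (1 - alpha) * y[t-1] + alpha * x[t]
ewm_mean = s.ewm(alpha=0.5, adjust=False).mean()

# Time-based rolling window: requires a datetime-like index;
# "2D" covers the current day and the day before
ts = pd.Series(
    [1.0, 2.0, 3.0, 4.0, 5.0],
    index=pd.date_range("2023-01-01", periods=5, freq="D"),
)
time_rolling_sum = ts.rolling("2D").sum()

print(pd.DataFrame({"rolling": rolling_mean, "expanding": expanding_mean, "ewm": ewm_mean}))
print(time_rolling_sum)
```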

Example:

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(10,4))
df
# Output:
        0	        1	         2	       3
0	2.599818	0.451315	-0.428038	0.035233
1	0.395523	-0.098377	0.059649	-0.489922
2	0.550164	-0.469461	1.193710	0.567562
3	1.483434	-0.793989	-0.738174	0.515078
4	0.395409	0.425578	-0.439963	-0.207277
5	-0.035479	-1.438315	-0.863333	-0.129948
6	-0.336889	-0.094188	-1.452638	0.083352
7	-0.626117	0.120990	-0.566740	0.665003
8	-1.437816	-0.112235	-0.232150	-0.099910
9	-0.582537	0.388641	1.008226	0.321893

1. .rolling(): rolling window

# Rolling mean over a window of three rows
df.rolling(window=3).mean()

# Output:

        0	        1	        2	      3
0	    NaN	       NaN	       NaN	      NaN
1	    NaN	       NaN	       NaN	      NaN
2	1.181835	-0.038841	0.275107	0.037625
3	0.809707	-0.453942	0.171729	0.197573
4	0.809669	-0.279291	0.005191	0.291788
5	0.614455	-0.602242	-0.680490	0.059284
6	0.007681	-0.368975	-0.918644	-0.084624
7	-0.332828	-0.470504	-0.960904	0.206135
8	-0.800274	-0.028478	-0.750509	0.216148
9	-0.882157	0.132465	0.069779	0.295662

2. .expanding(): expanding window

# Expanding mean, requiring at least three observations
df.expanding(min_periods=3).mean()

# Output:
        0	         1	        2	      3
0	     NaN	     NaN	     NaN	     NaN
1	     NaN	     NaN	     NaN	     NaN
2	1.181835	-0.038841	0.275107	0.037625
3	1.257235	-0.227628	0.021787	0.156988
4	1.084869	-0.096987	-0.070563	0.084135
5	0.898145	-0.320542	-0.202691	0.048455
6	0.721711	-0.288205	-0.381255	0.053440
7	0.553233	-0.237056	-0.404441	0.129885
8	0.332005	-0.223187	-0.385297	0.104352
9	0.240551	-0.162004	-0.245945	0.126106
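
Two details of the outputs above can be checked in code: the leading NaN rows come from min_periods (which defaults to the window size for a fixed rolling window), and the expanding mean is simply a cumulative mean. The sketch below uses freshly generated, seeded random data, so its values differ from the tables above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)   # seed is an arbitrary choice for reproducibility
df = pd.DataFrame(rng.standard_normal((10, 4)))

# rolling(window=3) needs 3 observations by default, so rows 0 and 1 are NaN;
# min_periods=1 lets the window produce values from the first row
rolling_full = df.rolling(window=3).mean()
rolling_partial = df.rolling(window=3, min_periods=1).mean()

# expanding().mean() is the cumulative mean: running sum / running count
expanding_mean = df.expanding(min_periods=3).mean()
cumulative_mean = df.cumsum().div(np.arange(1, len(df) + 1), axis=0)

print(rolling_partial.head(3))
print(expanding_mean.head(4))
```

From row 2 onward, expanding_mean and cumulative_mean agree; the first two rows differ only because min_periods=3 masks them as NaN.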

Origin blog.csdn.net/damadashen/article/details/126901690