笔记——pandas学习1

导入pandas

import pandas as pd

1. 字典键将用作列标题,而每个列表中的值将用作的行DataFrame。

In [2]: df = pd.DataFrame({
   ...:     "Name": ["Braund, Mr. Owen Harris",
   ...:              "Allen, Mr. William Henry",
   ...:              "Bonnell, Miss. Elizabeth"],
   ...:     "Age": [22, 35, 58],
   ...:     "Sex": ["male", "male", "female"]}
   ...: )
df
Name Age Sex
0 Braund, Mr. Owen Harris 22 male
1 Allen, Mr. William Henry 35 male
2 Bonnell, Miss. Elizabeth 58 female

2. DataFrame的每一列都是一个Series,并且每个Series都又是一个DataFrame

df["Age"]
0    22
1    35
2    58
Name: Age, dtype: int64
ages = pd.Series([22, 35, 58], name="Age")
ages
0    22
1    35
2    58
Name: Age, dtype: int64

3. 对数据进行操作

df["Age"].max()
58
ages.max()
58
df.describe()#提供了对数值数据的快速概述DataFrame。
#由于Name和Sex列是文本数据,因此默认情况下该describe()方法不考虑.
Age
count 3.000000
mean 38.333333
std 18.230012
min 22.000000
25% 28.500000
50% 35.000000
75% 46.500000
max 58.000000

读取和写入表格数据

1.read_csv()将存储为csv文件的数据读入

titanic=pd.read_csv("data/titanic.csv") #read_csv()将存储为csv文件的数据读入
titanic
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
... ... ... ... ... ... ... ... ... ... ... ... ...
886 887 0 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.0000 NaN S
887 888 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 112053 30.0000 B42 S
888 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female NaN 1 2 W./C. 6607 23.4500 NaN S
889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.0000 C148 C
890 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.7500 NaN Q

891 rows × 12 columns

2.使用head()查看前n行,tail()查看后m行

titanic.head(8)
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
5 6 0 3 Moran, Mr. James male NaN 0 0 330877 8.4583 NaN Q
6 7 0 1 McCarthy, Mr. Timothy J male 54.0 0 0 17463 51.8625 E46 S
7 8 0 3 Palsson, Master. Gosta Leonard male 2.0 3 1 349909 21.0750 NaN S
titanic.tail(3)
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
888 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female NaN 1 2 W./C. 6607 23.45 NaN S
889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.00 C148 C
890 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.75 NaN Q

3. dtypes属性:解释每种列数据类型

titanic.dtypes#数据类型DataFrame为整数(int64),浮点数(float63)和字符串(object)。
PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

4.数据存储

  1. read_*函数用来读取数据
  2. to_*方法用于存储数据
  3. to_excel()方法将数据存储为excel文件。
titanic.to_excel('titanic.xlsx',sheet_name='passengers',index=False)#sheet_name名为passemgers,通过设置 index=False行索引标签不会保存在电子表格中。
titanic=pd.read_excel('titanic.xlsx',sheet_name='passengers')
titanic.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

5. 方法info()提供有关的技术信息

titanic.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

小结:

  1. read_*函数支持将数据从许多不同的文件格式或数据源导入。

  2. 通过不同的to_*方法可以将数据导出 。

  3. head/ tail/ info方法和dtypes属性。

选择子集

1.选择单列

ages=titanic['Age']
ages.head()
0    22.0
1    38.0
2    26.0
3    35.0
4    35.0
Name: Age, dtype: float64
type(titanic["Age"])#每一列DataFrame都是一个Series
pandas.core.series.Series
titanic["Age"].shape#单个列返回的对象是仍然是DataFrame
(891,)

2.选择多列

age_sex=titanic[["Age","Sex"]]
age_sex.head()
Age Sex
0 22.0 male
1 38.0 female
2 26.0 female
3 35.0 female
4 35.0 male
type(titanic[["Age","Sex"]])
pandas.core.frame.DataFrame
titanic[["Age","Sex"]].shape#返回了DataFrame891行和2列。请记住,a DataFrame是二维的,具有行和列的维
(891, 2)

过滤特定行

above_35=titanic[titanic["Age"]>35]
above_35.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
6 7 0 1 McCarthy, Mr. Timothy J male 54.0 0 0 17463 51.8625 E46 S
11 12 1 1 Bonnell, Miss. Elizabeth female 58.0 0 0 113783 26.5500 C103 S
13 14 0 3 Andersson, Mr. Anders Johan male 39.0 1 5 347082 31.2750 NaN S
15 16 1 2 Hewlett, Mrs. (Mary D Kingcome) female 55.0 0 0 248706 16.0000 NaN S
titanic["Age"]>35
0      False
1       True
2      False
3      False
4      False
       ...  
886    False
887    False
888    False
889    False
890    False
Name: Age, Length: 891, dtype: bool
above_35.shape
(217, 12)
class_23=titanic[titanic["Pclass"].isin([2,3])]
class_23.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
5 6 0 3 Moran, Mr. James male NaN 0 0 330877 8.4583 NaN Q
7 8 0 3 Palsson, Master. Gosta Leonard male 2.0 3 1 349909 21.0750 NaN S

1.isin()条件函数

  1. 与条件表达式类似,返回布尔型
  2. 为在列表中的每一行某个不为空的值都返回一个True
    ( the isin() conditional function returns a True for each row the values are in the provided list.)
class_23=titanic[(titanic["Pclass"]==2)|(titanic["Pclass"]==3)]
class_23.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
5 6 0 3 Moran, Mr. James male NaN 0 0 330877 8.4583 NaN Q
7 8 0 3 Palsson, Master. Gosta Leonard male 2.0 3 1 349909 21.0750 NaN S

注意

  1. 组合多个条件语句时,每个条件必须用括号括起来()。
  2. 不能使用 or,and只能使用|和&。

2.notna():为每一行的某一个not Null的值返回一个True

age_no_na=titanic[titanic["Age"].notna()]
age_no_na.shape
(714, 12)

选择特定的行和列

1. loc[] :行和列名

adult_names=titanic.loc[titanic["Age"]>35,"Name"]
adult_names.head()
1     Cumings, Mrs. John Bradley (Florence Briggs Th...
6                               McCarthy, Mr. Timothy J
11                             Bonnell, Miss. Elizabeth
13                          Andersson, Mr. Anders Johan
15                     Hewlett, Mrs. (Mary D Kingcome) 
Name: Name, dtype: object

2. iloc[]:基于位置

titanic.iloc[9:25,2:5]#取第10至25行和第3至5列
Pclass Name Sex
9 2 Nasser, Mrs. Nicholas (Adele Achem) female
10 3 Sandstrom, Miss. Marguerite Rut female
11 1 Bonnell, Miss. Elizabeth female
12 3 Saundercock, Mr. William Henry male
13 3 Andersson, Mr. Anders Johan male
14 3 Vestrom, Miss. Hulda Amanda Adolfina female
15 2 Hewlett, Mrs. (Mary D Kingcome) female
16 3 Rice, Master. Eugene male
17 2 Williams, Mr. Charles Eugene male
18 3 Vander Planke, Mrs. Julius (Emelia Maria Vande... female
19 3 Masselmani, Mrs. Fatima female
20 2 Fynney, Mr. Joseph J male
21 2 Beesley, Mr. Lawrence male
22 3 McGowan, Miss. Anna "Annie" female
23 1 Sloper, Mr. William Thompson male
24 3 Palsson, Miss. Torborg Danira female

3.将表中元素进行替换

titanic.iloc[:3,3]="YuHongxia"#将前三行的第四列换成YuHongxia
titanic.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 YuHongxia male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 YuHongxia female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 YuHongxia female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

小结:

  1. 选择数据子集时,使用方括号[]不是()。

  2. 在括号内,可以使用单个列/行标签,列/行标签列表,标签切片,条件表达式或冒号。

  3. 选择特定的行和列时,如果使用行和列名可以使用loc

  4. 选择特定的行和列时,如果使用表中的位置可以使用iloc

  5. 可以基于loc/ iloc为选择分配新值。


猜你喜欢

转载自blog.csdn.net/chairon/article/details/107595717