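pandas Study Notes (Part 1)

Preface: I have long wanted to work through pandas thoroughly from start to finish, but kept putting off writing it down because it felt like too much effort. Today I finally decided to sort it out properly. This is only the first part; later parts will be published in this same column. The notes may be somewhat long-winded, but they can serve as a reference when I forget how something is used.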

1. Structure of a DataFrame

1.1 Importing the required packages

import pandas as pd
import numpy as np

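1.2 Setting pandas display limits

# Show at most 8 columns and 8 rows. The full "display." prefix is used because
# short keys such as "max_columns" can be ambiguous in newer pandas versions.
pd.set_option("display.max_columns", 8, "display.max_rows", 8)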
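When the limits should apply to only one piece of output, a temporary setting may be more convenient than a global one. The following is a minimal sketch using `pd.option_context`; the 20-row `demo` frame is invented for illustration and is not part of the movie dataset.

```python
import pandas as pd

# Hypothetical demo frame with 20 rows; it is not the movie dataset.
demo = pd.DataFrame({"a": range(20), "b": range(20)})

default_rows = pd.get_option("display.max_rows")

# Inside the block at most 8 rows are printed; the middle rows are elided.
with pd.option_context("display.max_rows", 8):
    rows_inside = pd.get_option("display.max_rows")
    truncated_text = repr(demo)

# On leaving the block the previous setting is restored.
rows_after = pd.get_option("display.max_rows")
print(rows_inside, rows_after == default_rows)
```

The global `pd.set_option` call stays in effect for the whole session, whereas `option_context` reverts as soon as the `with` block ends.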

1.3 Loading a DataFrame from a file

movies = pd.read_csv("./pandasLearnData/movie.csv")
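Since `movie.csv` may not be at hand, the same call can be tried on in-memory text. This is only a sketch: the CSV content below is made up, and only the column names are borrowed from the dataset.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for movie.csv; the values are invented.
csv_text = """color,director_name,duration
Color,Director A,120
,Director B,
Color,,95
"""

# read_csv accepts a path or any file-like object.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)          # (rows, columns)
print(sample.isna().sum())   # empty fields are parsed as NaN
```

Empty fields become `NaN`, which is also why a numeric column with gaps, such as `duration` here, is read as `float64` rather than `int64`.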

1.4 Showing the first n rows

# Shows the first five rows by default; head(n) shows the first n rows
movies.head()
   color      director_name  num_critic_for_reviews  duration  ...  actor_2_facebook_likes  imdb_score  aspect_ratio  movie_facebook_likes
0  Color      James Cameron                   723.0     178.0  ...                   936.0         7.9          1.78                 33000
1  Color     Gore Verbinski                   302.0     169.0  ...                  5000.0         7.1          2.35                     0
2  Color         Sam Mendes                   602.0     148.0  ...                   393.0         6.8          2.35                 85000
3  Color  Christopher Nolan                   813.0     164.0  ...                 23000.0         8.5          2.35                164000
4    NaN        Doug Walker                     NaN       NaN  ...                    12.0         7.1           NaN                     0

5 rows × 28 columns

1.5 Showing the last n rows

# Shows the last five rows by default; tail(n) shows the last n rows
movies.tail()
      color     director_name  num_critic_for_reviews  duration  ...  actor_2_facebook_likes  imdb_score  aspect_ratio  movie_facebook_likes
4911  Color       Scott Smith                     1.0      87.0  ...                   470.0         7.7           NaN                    84
4912  Color               NaN                    43.0      43.0  ...                   593.0         7.5         16.00                 32000
4913  Color  Benjamin Roberds                    13.0      76.0  ...                     0.0         6.3           NaN                    16
4914  Color       Daniel Hsia                    14.0     100.0  ...                   719.0         6.3          2.35                   660
4915  Color          Jon Gunn                    43.0      90.0  ...                    23.0         6.6          1.85                   456

5 rows × 28 columns

2. Accessing the components of a DataFrame

2.1 Extracting the column index

columns_index = movies.columns
columns_index
Index(['color', 'director_name', 'num_critic_for_reviews', 'duration',
       'director_facebook_likes', 'actor_3_facebook_likes', 'actor_2_name',
       'actor_1_facebook_likes', 'gross', 'genres', 'actor_1_name',
       'movie_title', 'num_voted_users', 'cast_total_facebook_likes',
       'actor_3_name', 'facenumber_in_poster', 'plot_keywords',
       'movie_imdb_link', 'num_user_for_reviews', 'language', 'country',
       'content_rating', 'budget', 'title_year', 'actor_2_facebook_likes',
       'imdb_score', 'aspect_ratio', 'movie_facebook_likes'],
      dtype='object')

2.2 Extracting the row index

rows_index = movies.index
rows_index
RangeIndex(start=0, stop=4916, step=1)

2.3 Extracting the data

movies_data = movies.values
movies_data
array([['Color', 'James Cameron', 723.0, ..., 7.9, 1.78, 33000],
       ['Color', 'Gore Verbinski', 302.0, ..., 7.1, 2.35, 0],
       ['Color', 'Sam Mendes', 602.0, ..., 6.8, 2.35, 85000],
       ...,
       ['Color', 'Benjamin Roberds', 13.0, ..., 6.3, nan, 16],
       ['Color', 'Daniel Hsia', 14.0, ..., 6.3, 2.35, 660],
       ['Color', 'Jon Gunn', 43.0, ..., 6.6, 1.85, 456]], dtype=object)

2.4 Checking their types

print("Column index type:", type(columns_index))
print("Row index type:", type(rows_index))
print("Data container type:", type(movies_data))
Column index type: <class 'pandas.core.indexes.base.Index'>
Row index type: <class 'pandas.core.indexes.range.RangeIndex'>
Data container type: <class 'numpy.ndarray'>

2.5 RangeIndex is a subclass of Index

issubclass(pd.RangeIndex,pd.Index)
True
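The three components can be seen side by side on a small frame. The sketch below uses an invented three-row DataFrame and `to_numpy()`, which the pandas documentation recommends over `.values` for obtaining the underlying array.

```python
import pandas as pd

# Hypothetical three-row frame, used only to illustrate the three components.
df = pd.DataFrame(
    {"title": ["A", "B", "C"], "score": [7.9, 7.1, 6.8]}
)

columns = df.columns      # column labels (an Index)
index = df.index          # row labels (a RangeIndex by default)
data = df.to_numpy()      # underlying values as a NumPy array

# Mixed column types force the array to the generic object dtype.
print(type(index).__name__, data.dtype, data.shape)
```

This mirrors the `dtype=object` seen in the movie data: once string and numeric columns are combined into a single array, NumPy has to fall back to a common object type.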

2.6 Getting the values of the row index

rows_index.values
array([   0,    1,    2, ..., 4913, 4914, 4915])

2.7 Getting the values of the column index

columns_index.values
array(['color', 'director_name', 'num_critic_for_reviews', 'duration',
       'director_facebook_likes', 'actor_3_facebook_likes',
       'actor_2_name', 'actor_1_facebook_likes', 'gross', 'genres',
       'actor_1_name', 'movie_title', 'num_voted_users',
       'cast_total_facebook_likes', 'actor_3_name',
       'facenumber_in_poster', 'plot_keywords', 'movie_imdb_link',
       'num_user_for_reviews', 'language', 'country', 'content_rating',
       'budget', 'title_year', 'actor_2_facebook_likes', 'imdb_score',
       'aspect_ratio', 'movie_facebook_likes'], dtype=object)

3. Data types in a DataFrame

3.1 Getting the data type of each column

movies.dtypes
color                      object
director_name              object
num_critic_for_reviews    float64
duration                  float64
                           ...   
actor_2_facebook_likes    float64
imdb_score                float64
aspect_ratio              float64
movie_facebook_likes        int64
Length: 28, dtype: object

3.2 Counting the columns of each data type

movies.dtypes.value_counts()
float64    13
object     12
int64       3
dtype: int64

4. Series

4.1 Selecting one column of the dataset as a Series

4.1.1 Using index (bracket) notation

director_name = movies["director_name"]
director_name
0           James Cameron
1          Gore Verbinski
2              Sam Mendes
3       Christopher Nolan
              ...        
4912                  NaN
4913     Benjamin Roberds
4914          Daniel Hsia
4915             Jon Gunn
Name: director_name, Length: 4916, dtype: object

4.1.2 Using attribute (dot) notation

director_name = movies.director_name
director_name
0           James Cameron
1          Gore Verbinski
2              Sam Mendes
3       Christopher Nolan
              ...        
4912                  NaN
4913     Benjamin Roberds
4914          Daniel Hsia
4915             Jon Gunn
Name: director_name, Length: 4916, dtype: object
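The two notations are not always interchangeable. As a small sketch on an invented frame, attribute access breaks down when a column name contains a space or collides with an existing DataFrame method, while bracket notation keeps working.

```python
import pandas as pd

# Hypothetical frame: one column name contains a space, another shadows a method.
df = pd.DataFrame({"imdb score": [7.9, 7.1], "count": [10, 20]})

by_key = df["count"]   # always returns the column
by_attr = df.count     # returns the DataFrame.count method, not the column

print(type(by_key).__name__)   # Series
print(callable(by_attr))       # True
# "df.imdb score" is not valid Python, so only df["imdb score"] works.
```

For that reason bracket notation is the safer default, and the dot form is best kept for quick interactive use.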

4.2 Getting the name of a Series

director_name.name
'director_name'

4.3 Converting a Series to a DataFrame

director_name.to_frame().head()
       director_name
0      James Cameron
1     Gore Verbinski
2         Sam Mendes
3  Christopher Nolan
4        Doug Walker

5. Calling Series methods

5.1 Counting the attributes and methods of DataFrame and Series

# dir() lists every attribute name of a class, methods included
dataframe_methods = set(dir(pd.DataFrame))
series_methods = set(dir(pd.Series))
print("DataFrame attributes:", len(dataframe_methods))
print("Series attributes:", len(series_methods))
print("Shared by Series and DataFrame:", len(dataframe_methods & series_methods))
DataFrame attributes: 451
Series attributes: 465
Shared by Series and DataFrame: 394
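The same set arithmetic can also show which names belong to only one of the two classes. This is a small sketch; the exact counts depend on the installed pandas version, so only a few well-known names are checked.

```python
import pandas as pd

dataframe_attrs = set(dir(pd.DataFrame))
series_attrs = set(dir(pd.Series))

# Names available on Series but not on DataFrame, and vice versa.
only_series = series_attrs - dataframe_attrs
only_dataframe = dataframe_attrs - series_attrs

print("unique" in only_series, "to_frame" in only_series)
print("merge" in only_dataframe, "pivot" in only_dataframe)
```

Because most names are shared, much of what is learned on a Series carries over to a DataFrame.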

5.2 Showing the first n rows

actor1_facebook_likes = movies["actor_1_facebook_likes"]
actor1_facebook_likes.head()
0     1000.0
1    40000.0
2    11000.0
3    27000.0
4      131.0
Name: actor_1_facebook_likes, dtype: float64

5.3 Counting how often each value occurs in a Series

actor1_facebook_likes.value_counts()
1000.0     436
11000.0    206
2000.0     189
3000.0     150
          ... 
216.0        1
859.0        1
225.0        1
334.0        1
Name: actor_1_facebook_likes, Length: 877, dtype: int64

5.4 Size of a Series

actor1_facebook_likes.size
4916

5.5 Shape of a Series

actor1_facebook_likes.shape
# (number of rows,): a Series is one-dimensional, so the tuple has no column entry
(4916,)

5.6 Number of elements in a Series

len(actor1_facebook_likes)
4916

5.7 Number of non-null elements in a Series

actor1_facebook_likes.count()
# 7 fewer than the 4916 elements counted above, so 7 values are missing
4909
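The relationship between `size`, `count()` and the number of missing values can be checked on a tiny example. The Series below is invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical likes column with two missing values.
likes = pd.Series([1000.0, np.nan, 131.0, np.nan, 86.0])

total = likes.size             # counts every element, NaN included
non_null = likes.count()       # counts non-missing elements only
missing = likes.isna().sum()   # counts missing elements directly

print(total, non_null, missing)  # 5 3 2
```

Counting with `isna().sum()` gives the number of missing values directly, without subtracting two separate results.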

5.8 Quantile of a Series

# q defaults to 0.5, i.e. the median
actor1_facebook_likes.quantile()
982.0

5.9 Minimum of a Series

actor1_facebook_likes.min()
0.0

5.10 Maximum of a Series

actor1_facebook_likes.max()
640000.0

5.11 Mean of a Series

actor1_facebook_likes.mean()
6494.488490527602

5.12 Standard deviation of a Series

actor1_facebook_likes.std()
15106.986883848309

5.13 Median of a Series

actor1_facebook_likes.median()
982.0

5.14 Sum of a Series

actor1_facebook_likes.sum()
31881444.0

5.15 Descriptive statistics of a Series

actor1_facebook_likes.describe()
count      4909.000000
mean       6494.488491
std       15106.986884
min           0.000000
25%         607.000000
50%         982.000000
75%       11000.000000
max      640000.000000
Name: actor_1_facebook_likes, dtype: float64
director_name.describe()
# count: number of non-null elements
# unique: number of distinct values
# top: the most frequent value
# freq: how many times the most frequent value occurs
count                 4814
unique                2397
top       Steven Spielberg
freq                    26
Name: director_name, dtype: object
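In the numeric summary above the mean (about 6494) is far larger than the median (982), which points to a few very large values. The sketch below reproduces that effect on invented numbers and also shows that `quantile(0.5)` and `median()` agree; the `percentiles` argument of `describe()` selects which percentile rows are reported.

```python
import pandas as pd

# Hypothetical values, chosen so the statistics are easy to check by hand.
s = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])

median = s.median()
q50 = s.quantile(0.5)    # same value as the median
mean = s.mean()          # pulled upward by the single large value

# Ask describe() for the 10th and 90th percentiles.
summary = s.describe(percentiles=[0.1, 0.9])

print(median, q50, mean)  # 3.0 3.0 22.0
print(list(summary.index))
```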

5.16 Testing Series elements for null values

# isnull() tests every element and returns a Series of bool values
director_name.isnull()
0       False
1       False
2       False
3       False
        ...  
4912     True
4913    False
4914    False
4915    False
Name: director_name, Length: 4916, dtype: bool

5.17 Non-null values in a Series

# The opposite of isnull(): True where the value is not null
actor1_facebook_likes.notnull()
0       True
1       True
2       True
3       True
        ... 
4912    True
4913    True
4914    True
4915    True
Name: actor_1_facebook_likes, Length: 4916, dtype: bool

5.18 Filling null values in a Series

# Fill null values with 0 and check the size
actor1_facebook_likes.fillna(0).size
4916

5.19 Dropping null values from a Series

# Drop the null values from the Series and check the size
actor1_facebook_likes.dropna().size
4909
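Both `fillna()` and `dropna()` return a new Series and leave the original unchanged, as the sketch below illustrates on an invented three-element Series.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with one missing value.
likes = pd.Series([1000.0, np.nan, 86.0])

filled = likes.fillna(0)   # new Series, NaN replaced by 0
dropped = likes.dropna()   # new Series, NaN rows removed

# The original is untouched, and dropna keeps the original index labels.
print(likes.isna().sum(), filled.size, dropped.size)  # 1 3 2
print(list(dropped.index))                            # [0, 2]
```

To keep the result, assign it to a name, for example `likes = likes.fillna(0)`.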

5.20 Further topics

5.20.1 Quantiles

# Value at the 20% quantile of the data
actor1_facebook_likes.quantile(0.2)
510.0
# Several quantiles at once
actor1_facebook_likes.quantile(np.linspace(0.1,1,10))
0.1       240.0
0.2       510.0
0.3       694.0
0.4       854.0
         ...   
0.7      8000.0
0.8     13000.0
0.9     18000.0
1.0    640000.0
Name: actor_1_facebook_likes, Length: 10, dtype: float64

5.20.2 Relative frequency of each value in a Series

director_name.value_counts(normalize=True)
Steven Spielberg    0.005401
Woody Allen         0.004570
Martin Scorsese     0.004155
Clint Eastwood      0.004155
                      ...   
Paul Bunnell        0.000208
Kelly Asbury        0.000208
Jeff Crook          0.000208
Khalil Sullins      0.000208
Name: director_name, Length: 2397, dtype: float64

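5.20.3 Checking whether a Series still contains missing values

print("Before dropping nulls:", director_name.hasnans)
print("After dropping nulls:", director_name.dropna().hasnans)
Before dropping nulls: True
After dropping nulls: False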

6. Arithmetic operations on a Series

6.1 Series addition

6.1.1 The original data

actor1_facebook_likes
0        1000.0
1       40000.0
2       11000.0
3       27000.0
         ...   
4912      841.0
4913        0.0
4914      946.0
4915       86.0
Name: actor_1_facebook_likes, Length: 4916, dtype: float64

6.1.2 Using the operator

actor1_facebook_likes + 1
0        1001.0
1       40001.0
2       11001.0
3       27001.0
         ...   
4912      842.0
4913        1.0
4914      947.0
4915       87.0
Name: actor_1_facebook_likes, Length: 4916, dtype: float64

6.1.3 Using the method

actor1_facebook_likes.add(1)
0        1001.0
1       40001.0
2       11001.0
3       27001.0
         ...   
4912      842.0
4913        1.0
4914      947.0
4915       87.0
Name: actor_1_facebook_likes, Length: 4916, dtype: float64
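For a plain scalar the operator and the method give the same result. One reason to reach for the method form is its extra parameters, such as `fill_value`. The sketch below uses two invented Series to show the difference when values are missing.

```python
import numpy as np
import pandas as pd

# Hypothetical Series, each with a missing value in a different position.
a = pd.Series([1.0, np.nan, 3.0])
b = pd.Series([10.0, 20.0, np.nan])

with_operator = a + b               # any NaN operand gives NaN
with_method = a.add(b, fill_value=0)  # a missing operand is treated as 0

print(with_operator.tolist())  # [11.0, nan, nan]
print(with_method.tolist())    # [11.0, 20.0, 3.0]
```

The other arithmetic methods (`sub`, `mul`, `div`, `floordiv`, `mod`) accept the same `fill_value` parameter.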

6.2 Series subtraction

6.2.1 Using the operator

actor1_facebook_likes - 1
0         999.0
1       39999.0
2       10999.0
3       26999.0
         ...   
4912      840.0
4913       -1.0
4914      945.0
4915       85.0
Name: actor_1_facebook_likes, Length: 4916, dtype: float64

6.2.2 Using the method

# sub() has the long-form alias subtract(); both behave identically
actor1_facebook_likes.sub(1)
0         999.0
1       39999.0
2       10999.0
3       26999.0
         ...   
4912      840.0
4913       -1.0
4914      945.0
4915       85.0
Name: actor_1_facebook_likes, Length: 4916, dtype: float64

6.3 Series multiplication

6.3.1 Using the operator

actor1_facebook_likes * 5
0         5000.0
1       200000.0
2        55000.0
3       135000.0
          ...   
4912      4205.0
4913         0.0
4914      4730.0
4915       430.0
Name: actor_1_facebook_likes, Length: 4916, dtype: float64

6.3.2 Using the method

# Long-form alias: multiply()
actor1_facebook_likes.mul(5)
0         5000.0
1       200000.0
2        55000.0
3       135000.0
          ...   
4912      4205.0
4913         0.0
4914      4730.0
4915       430.0
Name: actor_1_facebook_likes, Length: 4916, dtype: float64

6.4 Series division

6.4.1 Using the operator

actor1_facebook_likes / 10
0        100.0
1       4000.0
2       1100.0
3       2700.0
         ...  
4912      84.1
4913       0.0
4914      94.6
4915       8.6
Name: actor_1_facebook_likes, Length: 4916, dtype: float64

6.4.2 Using the method

# Long-form alias: divide()
actor1_facebook_likes.div(10)
0        100.0
1       4000.0
2       1100.0
3       2700.0
         ...  
4912      84.1
4913       0.0
4914      94.6
4915       8.6
Name: actor_1_facebook_likes, Length: 4916, dtype: float64

6.5 Series floor division

6.5.1 Using the operator

actor1_facebook_likes // 10
0        100.0
1       4000.0
2       1100.0
3       2700.0
         ...  
4912      84.0
4913       0.0
4914      94.0
4915       8.0
Name: actor_1_facebook_likes, Length: 4916, dtype: float64

6.5.2 Using the method

actor1_facebook_likes.floordiv(10)
0        100.0
1       4000.0
2       1100.0
3       2700.0
         ...  
4912      84.0
4913       0.0
4914      94.0
4915       8.0
Name: actor_1_facebook_likes, Length: 4916, dtype: float64

6.6 Series modulo

6.6.1 Using the operator

actor1_facebook_likes % 7
0       6.0
1       2.0
2       3.0
3       1.0
       ... 
4912    1.0
4913    0.0
4914    1.0
4915    2.0
Name: actor_1_facebook_likes, Length: 4916, dtype: float64

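6.6.2 Using the method

# mod is short for modulo; it is not the same as Series.mode(), which returns the most frequent value(s)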
actor1_facebook_likes.mod(7)
0       6.0
1       2.0
2       3.0
3       1.0
       ... 
4912    1.0
4913    0.0
4914    1.0
4915    2.0
Name: actor_1_facebook_likes, Length: 4916, dtype: float64
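Because the names are so similar, it may help to see `mod()` and `mode()` next to each other. The values below are invented; 1000 appears twice so that it is the most frequent value.

```python
import pandas as pd

# Hypothetical values; 1000 appears twice so it is the most frequent value.
likes = pd.Series([1000.0, 40000.0, 1000.0, 86.0])

remainders = likes.mod(7)    # element-wise remainder, same as likes % 7
most_common = likes.mode()   # most frequent value(s), unrelated to mod

print(remainders.tolist())   # [6.0, 2.0, 6.0, 2.0]
print(most_common.tolist())  # [1000.0]
```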

7. Comparison operations on a Series

7.1 Greater than

7.1.1 Using the operator

actor1_facebook_likes > 5
0        True
1        True
2        True
3        True
        ...  
4912     True
4913    False
4914     True
4915     True
Name: actor_1_facebook_likes, Length: 4916, dtype: bool

7.1.2 Using the method

actor1_facebook_likes.gt(5)
0        True
1        True
2        True
3        True
        ...  
4912     True
4913    False
4914     True
4915     True
Name: actor_1_facebook_likes, Length: 4916, dtype: bool

7.2 Greater than or equal to

7.2.1 Using the operator

actor1_facebook_likes >= 5
0        True
1        True
2        True
3        True
        ...  
4912     True
4913    False
4914     True
4915     True
Name: actor_1_facebook_likes, Length: 4916, dtype: bool

7.2.2 Using the method

actor1_facebook_likes.ge(5)
0        True
1        True
2        True
3        True
        ...  
4912     True
4913    False
4914     True
4915     True
Name: actor_1_facebook_likes, Length: 4916, dtype: bool

7.3 Less than

7.3.1 Using the operator

actor1_facebook_likes < 5
0       False
1       False
2       False
3       False
        ...  
4912    False
4913     True
4914    False
4915    False
Name: actor_1_facebook_likes, Length: 4916, dtype: bool

7.3.2 Using the method

actor1_facebook_likes.lt(5)
0       False
1       False
2       False
3       False
        ...  
4912    False
4913     True
4914    False
4915    False
Name: actor_1_facebook_likes, Length: 4916, dtype: bool

7.4 Less than or equal to

7.4.1 Using the operator

actor1_facebook_likes <= 5
0       False
1       False
2       False
3       False
        ...  
4912    False
4913     True
4914    False
4915    False
Name: actor_1_facebook_likes, Length: 4916, dtype: bool

7.4.2 Using the method

actor1_facebook_likes.le(5)
0       False
1       False
2       False
3       False
        ...  
4912    False
4913     True
4914    False
4915    False
Name: actor_1_facebook_likes, Length: 4916, dtype: bool

7.5 Equal to

7.5.1 Using the operator

actor1_facebook_likes == 5
0       False
1       False
2       False
3       False
        ...  
4912    False
4913    False
4914    False
4915    False
Name: actor_1_facebook_likes, Length: 4916, dtype: bool

7.5.2 Using the method

actor1_facebook_likes.eq(5)
0       False
1       False
2       False
3       False
        ...  
4912    False
4913    False
4914    False
4915    False
Name: actor_1_facebook_likes, Length: 4916, dtype: bool

7.6 Not equal to

7.6.1 Using the operator

actor1_facebook_likes != 5
0       True
1       True
2       True
3       True
        ... 
4912    True
4913    True
4914    True
4915    True
Name: actor_1_facebook_likes, Length: 4916, dtype: bool

7.6.2 Using the method

actor1_facebook_likes.ne(5)
0       True
1       True
2       True
3       True
        ... 
4912    True
4913    True
4914    True
4915    True
Name: actor_1_facebook_likes, Length: 4916, dtype: bool
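The boolean Series produced by a comparison is typically used as a mask. A minimal sketch on invented data, which also shows that a missing value compares as `False`:

```python
import numpy as np
import pandas as pd

# Hypothetical likes column with one missing value.
likes = pd.Series([1000.0, 0.0, np.nan, 86.0])

mask = likes.gt(5)            # same as likes > 5; NaN compares as False

print(mask.tolist())          # [True, False, False, True]
print(mask.sum())             # number of True values: 2
print(likes[mask].tolist())   # rows where the mask is True: [1000.0, 86.0]
```

Summing a boolean Series counts the `True` entries, which gives a quick way to see how many rows satisfy a condition.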

8. More on Series

8.1 Type conversion with astype()

# Pass the target type as the argument
actor1_facebook_likes.astype(str)
0        1000.0
1       40000.0
2       11000.0
3       27000.0
         ...   
4912      841.0
4913        0.0
4914      946.0
4915       86.0
Name: actor_1_facebook_likes, Length: 4916, dtype: object
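Converting to `int` needs some care when the Series contains missing values. The sketch below, on an invented Series, shows that a plain `int` conversion fails while NaN is present, and that the nullable `"Int64"` extension dtype is one possible alternative.

```python
import numpy as np
import pandas as pd

# Hypothetical float Series with a missing value.
likes = pd.Series([1000.0, np.nan, 86.0])

# Plain int cannot represent NaN, so this conversion raises an error.
try:
    likes.astype(int)
    plain_int_failed = False
except (ValueError, TypeError):
    plain_int_failed = True

# The nullable "Int64" extension dtype keeps the gap as <NA>.
nullable = likes.astype("Int64")

print(plain_int_failed)   # True
print(nullable.dtype)     # Int64
```

The other common approach is to fill the missing values first, for example with `fillna(0)`, and then call `astype(int)`.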

8.2 The dtype of a Series

actor1_facebook_likes.dtype
dtype('float64')

8.3 Method chaining on a Series

# Fill null values with 0 -> divide by 10 -> add 5 -> convert to int -> sum
actor1_facebook_likes.fillna(0)\
.div(10)\
.add(5)\
.astype(int)\
.sum()
3211610
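The same kind of chain can be written without backslashes by wrapping it in parentheses. The sketch below uses an invented three-element Series so that each intermediate result can be checked by hand.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with one missing value.
likes = pd.Series([1000.0, np.nan, 86.0])

# Wrapping the chain in parentheses removes the need for backslashes.
total = (
    likes.fillna(0)   # [1000.0, 0.0, 86.0]
    .div(10)          # [100.0, 0.0, 8.6]
    .add(5)           # [105.0, 5.0, 13.6]
    .astype(int)      # [105, 5, 13], truncating toward zero
    .sum()
)

print(total)  # 123
```

With parentheses each step sits on its own line and can carry a comment, which is not possible after a trailing backslash.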


Reposted from blog.csdn.net/qq_42359956/article/details/105451514