Foreword: I have long wanted to review pandas thoroughly from beginning to end, and working without any notes had become too much trouble. Today I finally made up my mind to work through pandas properly. This is only one part. Some of the writing may be long-winded, but it can serve as a reference when I forget how something is used.
1. The structure of the DataFrame
1.1 Import the necessary packages
import pandas as pd
import numpy as np
1.2 Set pandas display limits
pd.set_option("display.max_columns", 8)
pd.set_option("display.max_rows", 8)
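Display options set with pd.set_option stay in force for the whole session. As a minimal sketch, assuming a made-up wide frame named wide, pd.option_context applies a limit only inside a with block and restores the previous value on exit:

```python
import pandas as pd

# Hypothetical frame with more columns than the limit, for illustration only.
wide = pd.DataFrame({f"col_{i}": range(3) for i in range(12)})

before = pd.get_option("display.max_columns")
with pd.option_context("display.max_columns", 4):
    # Inside the block at most 4 columns are printed; the rest collapse to "...".
    inside = pd.get_option("display.max_columns")
    print(wide)
after = pd.get_option("display.max_columns")  # same value as before the block

# Return a single option to its library default.
pd.reset_option("display.max_rows")
```

Using the full option name (display.max_columns) avoids ambiguity when several options share the same suffix.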
1.3 Import DataFrame from file
movies = pd.read_csv("./pandasLearnData/movie.csv")
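If movie.csv is not at hand, the same call can be tried on an in-memory string. The sketch below assumes a three-row sample with only four of the columns; csv_text and sample are illustrative names, not part of the original data set:

```python
import io

import pandas as pd

# Hypothetical three-row sample; the column names mirror a few of those in movie.csv.
csv_text = """color,director_name,duration,imdb_score
Color,James Cameron,178,7.9
Color,Gore Verbinski,169,7.1
,Doug Walker,,7.1
"""

# read_csv accepts any file-like object, so StringIO stands in for a file path.
sample = pd.read_csv(io.StringIO(csv_text))
print(sample.shape)         # (rows, columns)
print(sample.isna().sum())  # empty fields are read as NaN
```

Because the duration column contains an empty field, pandas reads it as float64 rather than int64, which is likely why so many numeric columns in the full data set are float64.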
1.4 Display the first n rows
# Shows the first five rows by default; head(n) shows the first n rows
movies.head()
 | color | director_name | num_critic_for_reviews | duration | ... | actor_2_facebook_likes | imdb_score | aspect_ratio | movie_facebook_likes |
---|---|---|---|---|---|---|---|---|---|
0 | Color | James Cameron | 723.0 | 178.0 | ... | 936.0 | 7.9 | 1.78 | 33000 |
1 | Color | Gore Verbinski | 302.0 | 169.0 | ... | 5000.0 | 7.1 | 2.35 | 0 |
2 | Color | Sam Mendes | 602.0 | 148.0 | ... | 393.0 | 6.8 | 2.35 | 85000 |
3 | Color | Christopher Nolan | 813.0 | 164.0 | ... | 23000.0 | 8.5 | 2.35 | 164000 |
4 | NaN | Doug Walker | NaN | NaN | ... | 12.0 | 7.1 | NaN | 0 |
5 rows × 28 columns
1.5 Display the last n rows
# Shows the last five rows by default; tail(n) shows the last n rows
movies.tail()
 | color | director_name | num_critic_for_reviews | duration | ... | actor_2_facebook_likes | imdb_score | aspect_ratio | movie_facebook_likes |
---|---|---|---|---|---|---|---|---|---|
4911 | Color | Scott Smith | 1.0 | 87.0 | ... | 470.0 | 7.7 | NaN | 84 |
4912 | Color | NaN | 43.0 | 43.0 | ... | 593.0 | 7.5 | 16.00 | 32000 |
4913 | Color | Benjamin Roberds | 13.0 | 76.0 | ... | 0.0 | 6.3 | NaN | 16 |
4914 | Color | Daniel Hsia | 14.0 | 100.0 | ... | 719.0 | 6.3 | 2.35 | 660 |
4915 | Color | Jon Gunn | 43.0 | 90.0 | ... | 23.0 | 6.6 | 1.85 | 456 |
5 rows × 28 columns
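Both head and tail take an explicit row count. A small sketch, assuming a ten-row stand-in frame named df:

```python
import pandas as pd

# Hypothetical ten-row frame standing in for the movie data.
df = pd.DataFrame({"n": range(10)})

first_three = df.head(3)        # rows 0, 1, 2
last_two = df.tail(2)           # rows 8, 9
all_but_last_two = df.head(-2)  # a negative n drops that many rows from the end
```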
2. Access DataFrame components
2.1 Extract column index
columns_index = movies.columns
columns_index
Index(['color', 'director_name', 'num_critic_for_reviews', 'duration',
'director_facebook_likes', 'actor_3_facebook_likes', 'actor_2_name',
'actor_1_facebook_likes', 'gross', 'genres', 'actor_1_name',
'movie_title', 'num_voted_users', 'cast_total_facebook_likes',
'actor_3_name', 'facenumber_in_poster', 'plot_keywords',
'movie_imdb_link', 'num_user_for_reviews', 'language', 'country',
'content_rating', 'budget', 'title_year', 'actor_2_facebook_likes',
'imdb_score', 'aspect_ratio', 'movie_facebook_likes'],
dtype='object')
2.2 Extract row index
rows_index = movies.index
rows_index
RangeIndex(start=0, stop=4916, step=1)
2.3 Extracting data
movies_data = movies.values
movies_data
array([['Color', 'James Cameron', 723.0, ..., 7.9, 1.78, 33000],
['Color', 'Gore Verbinski', 302.0, ..., 7.1, 2.35, 0],
['Color', 'Sam Mendes', 602.0, ..., 6.8, 2.35, 85000],
...,
['Color', 'Benjamin Roberds', 13.0, ..., 6.3, nan, 16],
['Color', 'Daniel Hsia', 14.0, ..., 6.3, 2.35, 660],
['Color', 'Jon Gunn', 43.0, ..., 6.6, 1.85, 456]], dtype=object)
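The pandas documentation recommends to_numpy() over .values for getting the underlying array. A minimal sketch on an assumed two-column frame shows why the movie data comes back as dtype=object: mixed column types force a common object dtype, while a single numeric column keeps its own dtype.

```python
import numpy as np
import pandas as pd

# Hypothetical mixed-type frame, for illustration only.
df = pd.DataFrame({"name": ["a", "b"], "score": [7.9, 7.1]})

mixed = df.to_numpy()            # text + float columns fall back to dtype=object
scores = df["score"].to_numpy()  # a single numeric column keeps float64
print(mixed.dtype, scores.dtype)
```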
2.4 View their types
print("Column index type:", type(columns_index))
print("Row index type:", type(rows_index))
print("Data container type:", type(movies_data))
Column index type: <class 'pandas.core.indexes.base.Index'>
Row index type: <class 'pandas.core.indexes.range.RangeIndex'>
Data container type: <class 'numpy.ndarray'>
2.5 RangeIndex is a subclass of Index
issubclass(pd.RangeIndex,pd.Index)
True
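A RangeIndex is only the default. As a sketch on an assumed three-row frame, moving a column into the index with set_index yields a plain Index instead:

```python
import pandas as pd

# Hypothetical data, for illustration only.
df = pd.DataFrame({"title": ["A", "B", "C"], "score": [1, 2, 3]})

default_index = df.index                   # RangeIndex(start=0, stop=3, step=1)
labelled = df.set_index("title").index     # Index(['A', 'B', 'C'], name='title')

print(isinstance(default_index, pd.RangeIndex))
print(isinstance(labelled, pd.RangeIndex))
```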
2.6 Get the value of the row index
rows_index.values
array([ 0, 1, 2, ..., 4913, 4914, 4915])
2.7 Get the value of the column index
columns_index.values
array(['color', 'director_name', 'num_critic_for_reviews', 'duration',
'director_facebook_likes', 'actor_3_facebook_likes',
'actor_2_name', 'actor_1_facebook_likes', 'gross', 'genres',
'actor_1_name', 'movie_title', 'num_voted_users',
'cast_total_facebook_likes', 'actor_3_name',
'facenumber_in_poster', 'plot_keywords', 'movie_imdb_link',
'num_user_for_reviews', 'language', 'country', 'content_rating',
'budget', 'title_year', 'actor_2_facebook_likes', 'imdb_score',
'aspect_ratio', 'movie_facebook_likes'], dtype=object)
3. Data types in DataFrame
3.1 Get the data type of each column
movies.dtypes
color object
director_name object
num_critic_for_reviews float64
duration float64
...
actor_2_facebook_likes float64
imdb_score float64
aspect_ratio float64
movie_facebook_likes int64
Length: 28, dtype: object
3.2 Count the columns of each data type
movies.dtypes.value_counts()
float64 13
object 12
int64 3
dtype: int64
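Once the dtypes are known, select_dtypes can split a frame by type. A minimal sketch, assuming a three-column frame whose column names are borrowed from the movie data:

```python
import pandas as pd

# Hypothetical frame; the values are illustrative.
df = pd.DataFrame({
    "director_name": ["James Cameron", "Sam Mendes"],
    "duration": [178.0, 148.0],
    "movie_facebook_likes": [33000, 85000],
})

numeric = df.select_dtypes(include="number")  # float and int columns
text = df.select_dtypes(exclude="number")     # everything else
print(list(numeric.columns), list(text.columns))
```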
4. Series
4.1 Select a column in the data set as Series
4.1.1 Select with bracket notation
director_name = movies["director_name"]
director_name
0 James Cameron
1 Gore Verbinski
2 Sam Mendes
3 Christopher Nolan
...
4912 NaN
4913 Benjamin Roberds
4914 Daniel Hsia
4915 Jon Gunn
Name: director_name, Length: 4916, dtype: object
4.1.2 Select with attribute access
director_name = movies.director_name
director_name
0 James Cameron
1 Gore Verbinski
2 Sam Mendes
3 Christopher Nolan
...
4912 NaN
4913 Benjamin Roberds
4914 Daniel Hsia
4915 Jon Gunn
Name: director_name, Length: 4916, dtype: object
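The two forms are not always interchangeable. The sketch below uses made-up column names to show two cases where only bracket notation works: a name that collides with a DataFrame method, and a name containing a space.

```python
import pandas as pd

# Hypothetical columns chosen to show where attribute access breaks down.
df = pd.DataFrame({"count": [1, 2], "imdb score": [7.9, 7.1]})

by_brackets = df["count"]  # the column, as a Series
by_attribute = df.count    # the DataFrame.count method, not the column
spaced = df["imdb score"]  # no attribute form exists for a name with a space

print(type(by_brackets).__name__, callable(by_attribute))
```

For that reason bracket notation is the safer default in scripts.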
4.2 Get Series Name
director_name.name
'director_name'
4.3 Convert a Series to a DataFrame
director_name.to_frame().head()
 | director_name |
---|---|
0 | James Cameron |
1 | Gore Verbinski |
2 | Sam Mendes |
3 | Christopher Nolan |
4 | Doug Walker |
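to_frame uses the Series name as the column name unless another one is passed. A small sketch, assuming a two-element Series:

```python
import pandas as pd

s = pd.Series(["James Cameron", "Sam Mendes"], name="director_name")

same_name = s.to_frame()               # column is called "director_name"
renamed = s.to_frame(name="director")  # column is called "director"
print(list(same_name.columns), list(renamed.columns))
```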
5. Calling Series methods
5.1 Count the methods and attributes of DataFrame and Series
dataframe_methods = set(dir(pd.DataFrame))
series_methods = set(dir(pd.Series))
print("DataFrame methods and attributes:", len(dataframe_methods))
print("Series methods and attributes:", len(series_methods))
print("Shared by Series and DataFrame:", len(dataframe_methods & series_methods))
DataFrame methods and attributes: 451
Series methods and attributes: 465
Shared by Series and DataFrame: 394
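Note that dir() reports every attribute name, including private ones, so the exact counts vary between pandas versions. As a sketch, the same set arithmetic restricted to public names shows what is unique to each class:

```python
import pandas as pd

# Keep only public names (those not starting with an underscore).
df_attrs = {name for name in dir(pd.DataFrame) if not name.startswith("_")}
s_attrs = {name for name in dir(pd.Series) if not name.startswith("_")}

only_series = sorted(s_attrs - df_attrs)  # e.g. to_frame
only_frame = sorted(df_attrs - s_attrs)   # e.g. columns
print(len(df_attrs & s_attrs), "shared public names")
print(only_series[:5])
```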
5.2 Display the first n rows
actor1_facebook_likes = movies["actor_1_facebook_likes"]
actor1_facebook_likes.head()
0 1000.0
1 40000.0
2 11000.0
3 27000.0
4 131.0
Name: actor_1_facebook_likes, dtype: float64
5.3 Count how often each value occurs in the Series
actor1_facebook_likes.value_counts()
1000.0 436
11000.0 206
2000.0 189
3000.0 150
...
216.0 1
859.0 1
225.0 1
334.0 1
Name: actor_1_facebook_likes, Length: 877, dtype: int64
5.4 Series size
actor1_facebook_likes.size
4916
5.5 Series shape
actor1_facebook_likes.shape
# (number of rows, [number of columns])
(4916,)
5.6 Number of elements in Series
len(actor1_facebook_likes)
4916
5.7 Number of non-null elements in Series
actor1_facebook_likes.count()
# This is 7 fewer than the total element count above, so there are 7 missing values
4909
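The relationship between size, len, and count can be checked directly. A minimal sketch on an assumed four-element Series with two missing values:

```python
import numpy as np
import pandas as pd

s = pd.Series([1000.0, np.nan, 131.0, np.nan])  # hypothetical values

total = s.size            # counts every element, NaN included
non_null = s.count()      # counts only non-null elements
missing = s.isna().sum()  # counts only the missing ones
print(total, non_null, missing)
```

The missing count is always the total minus the non-null count, which is the subtraction done by hand above.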
5.8 Series quantile (the median by default)
# The default is q=0.5
actor1_facebook_likes.quantile()
982.0
5.9 Series minimum
actor1_facebook_likes.min()
0.0
5.10 Series maximum
actor1_facebook_likes.max()
640000.0
5.11 Series mean
actor1_facebook_likes.mean()
6494.488490527602
5.12 Series standard deviation
actor1_facebook_likes.std()
15106.986883848309
5.13 Series median
actor1_facebook_likes.median()
982.0
5.14 Series sum
actor1_facebook_likes.sum()
31881444.0
5.15 Series descriptive statistics
actor1_facebook_likes.describe()
count 4909.000000
mean 6494.488491
std 15106.986884
min 0.000000
25% 607.000000
50% 982.000000
75% 11000.000000
max 640000.000000
Name: actor_1_facebook_likes, dtype: float64
director_name.describe()
# count: number of non-null elements
# unique: number of distinct values
# top: the most frequent value
# freq: how many times the most frequent value occurs
count 4814
unique 2397
top Steven Spielberg
freq 26
Name: director_name, dtype: object
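For a numeric Series, describe() bundles the individual statistics shown earlier, and its percentiles parameter selects which quantiles appear. A sketch on five made-up values:

```python
import pandas as pd

s = pd.Series([0.0, 10.0, 20.0, 30.0, 1000.0])  # hypothetical values

# Ask for the 10%, 50% and 90% quantiles instead of the default 25/50/75.
summary = s.describe(percentiles=[0.1, 0.5, 0.9])
print(summary)

# Each entry matches the corresponding single-statistic method.
print(summary["50%"] == s.median(), summary["mean"] == s.mean())
```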
5.16 Test each element of a Series for null
# isnull() checks every element of the Series and returns a bool for each
director_name.isnull()
0 False
1 False
2 False
3 False
...
4912 True
4913 False
4914 False
4915 False
Name: director_name, Length: 4916, dtype: bool
5.17 Test each element of a Series for non-null
# The opposite of the previous call: True marks a non-null value
actor1_facebook_likes.notnull()
0 True
1 True
2 True
3 True
...
4912 True
4913 True
4914 True
4915 True
Name: actor_1_facebook_likes, Length: 4916, dtype: bool
5.18 Fill null values in a Series
# Fill null values with 0 and check the size
actor1_facebook_likes.fillna(0).size
4916
5.19 Drop null values from a Series
# Drop the null values from the Series and check the size
actor1_facebook_likes.dropna().size
4909
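Both fillna and dropna return a new Series and leave the original untouched. The sketch below, on an assumed three-element Series, also fills with the mean rather than 0, which is one common alternative:

```python
import numpy as np
import pandas as pd

s = pd.Series([1000.0, np.nan, 131.0])  # hypothetical values

filled = s.fillna(s.mean())  # mean() skips NaN, so this fills with 565.5
dropped = s.dropna()         # keeps the original labels 0 and 2

print(s.size, filled.size, dropped.size)
print(s.isna().sum())        # the original still has its missing value
```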
5.20 Extras
5.20.1 Quantile
# Value at the 20% quantile of the data
actor1_facebook_likes.quantile(0.2)
510.0
# Several quantiles at once
actor1_facebook_likes.quantile(np.linspace(0.1,1,10))
0.1 240.0
0.2 510.0
0.3 694.0
0.4 854.0
...
0.7 8000.0
0.8 13000.0
0.9 18000.0
1.0 640000.0
Name: actor_1_facebook_likes, Length: 10, dtype: float64
5.20.2 Relative frequency of each value in the Series
director_name.value_counts(normalize=True)
Steven Spielberg 0.005401
Woody Allen 0.004570
Martin Scorsese 0.004155
Clint Eastwood 0.004155
...
Paul Bunnell 0.000208
Kelly Asbury 0.000208
Jeff Crook 0.000208
Khalil Sullins 0.000208
Name: director_name, Length: 2397, dtype: float64
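By default value_counts ignores missing values, so the frequencies above are fractions of the non-null entries. A minimal sketch on an assumed four-element Series, including the dropna=False variant:

```python
import pandas as pd

s = pd.Series(["a", "b", "a", None])  # hypothetical values, one missing

counts = s.value_counts()                # missing value excluded
freq = s.value_counts(normalize=True)    # fractions of the 3 non-null values
with_nan = s.value_counts(dropna=False)  # missing value gets its own row
print(freq)
```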
5.20.3 Check whether the Series still contains missing values
print("Before dropping nulls:", director_name.hasnans)
print("After dropping nulls:", director_name.dropna().hasnans)
Before dropping nulls: True
After dropping nulls: False
6. Series arithmetic operation
6.1 Series addition
6.1.1 Display the raw data
actor1_facebook_likes
0 1000.0
1 40000.0
2 11000.0
3 27000.0
...
4912 841.0
4913 0.0
4914 946.0
4915 86.0
Name: actor_1_facebook_likes, Length: 4916, dtype: float64
6.1.2 Operator methods
actor1_facebook_likes + 1
0 1001.0
1 40001.0
2 11001.0
3 27001.0
...
4912 842.0
4913 1.0
4914 947.0
4915 87.0
Name: actor_1_facebook_likes, Length: 4916, dtype: float64
6.1.3 Function method
actor1_facebook_likes.add(1)
0 1001.0
1 40001.0
2 11001.0
3 27001.0
...
4912 842.0
4913 1.0
4914 947.0
4915 87.0
Name: actor_1_facebook_likes, Length: 4916, dtype: float64
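The operator and the method give the same result here, but the method form accepts extra parameters. As a sketch on an assumed three-element Series, fill_value lets add treat a missing value as a given number instead of propagating NaN:

```python
import numpy as np
import pandas as pd

s = pd.Series([1000.0, np.nan, 131.0])  # hypothetical values

with_operator = s + 1                 # NaN + 1 stays NaN
with_method = s.add(1, fill_value=0)  # NaN is treated as 0 before adding
print(with_operator.tolist(), with_method.tolist())
```

The other arithmetic methods (sub, mul, div, floordiv, mod) take the same parameter.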
6.2 Series subtraction
6.2.1 Operator methods
actor1_facebook_likes - 1
0 999.0
1 39999.0
2 10999.0
3 26999.0
...
4912 840.0
4913 -1.0
4914 945.0
4915 85.0
Name: actor_1_facebook_likes, Length: 4916, dtype: float64
6.2.2 Function method
# subtract() is the full name; the short and full forms behave the same
actor1_facebook_likes.sub(1)
0 999.0
1 39999.0
2 10999.0
3 26999.0
...
4912 840.0
4913 -1.0
4914 945.0
4915 85.0
Name: actor_1_facebook_likes, Length: 4916, dtype: float64
6.3 Series multiplication
6.3.1 Operator methods
actor1_facebook_likes * 5
0 5000.0
1 200000.0
2 55000.0
3 135000.0
...
4912 4205.0
4913 0.0
4914 4730.0
4915 430.0
Name: actor_1_facebook_likes, Length: 4916, dtype: float64
6.3.2 Function method
# Full name: multiply()
actor1_facebook_likes.mul(5)
0 5000.0
1 200000.0
2 55000.0
3 135000.0
...
4912 4205.0
4913 0.0
4914 4730.0
4915 430.0
Name: actor_1_facebook_likes, Length: 4916, dtype: float64
6.4 Series division
6.4.1 Operator methods
actor1_facebook_likes / 10
0 100.0
1 4000.0
2 1100.0
3 2700.0
...
4912 84.1
4913 0.0
4914 94.6
4915 8.6
Name: actor_1_facebook_likes, Length: 4916, dtype: float64
6.4.2 Function method
# Full name: divide()
actor1_facebook_likes.div(10)
0 100.0
1 4000.0
2 1100.0
3 2700.0
...
4912 84.1
4913 0.0
4914 94.6
4915 8.6
Name: actor_1_facebook_likes, Length: 4916, dtype: float64
6.5 Series floor division
6.5.1 Operator methods
actor1_facebook_likes // 10
0 100.0
1 4000.0
2 1100.0
3 2700.0
...
4912 84.0
4913 0.0
4914 94.0
4915 8.0
Name: actor_1_facebook_likes, Length: 4916, dtype: float64
6.5.2 Function method
actor1_facebook_likes.floordiv(10)
0 100.0
1 4000.0
2 1100.0
3 2700.0
...
4912 84.0
4913 0.0
4914 94.0
4915 8.0
Name: actor_1_facebook_likes, Length: 4916, dtype: float64
6.6 Series modulus operation
6.6.1 Operator methods
actor1_facebook_likes % 7
0 6.0
1 2.0
2 3.0
3 1.0
...
4912 1.0
4913 0.0
4914 1.0
4915 2.0
Name: actor_1_facebook_likes, Length: 4916, dtype: float64
6.6.2 Function method
# mod is short for modulo; do not confuse it with mode(), which returns the most frequent value(s)
actor1_facebook_likes.mod(7)
0 6.0
1 2.0
2 3.0
3 1.0
...
4912 1.0
4913 0.0
4914 1.0
4915 2.0
Name: actor_1_facebook_likes, Length: 4916, dtype: float64
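Floor division and modulus are two halves of the same operation: the quotient times the divisor plus the remainder gives back the original value. A small sketch using a few values taken from the output above:

```python
import pandas as pd

s = pd.Series([1000.0, 841.0, 86.0, 0.0])

quotient = s // 7   # whole number of sevens in each value
remainder = s % 7   # what is left over
rebuilt = quotient * 7 + remainder

print(quotient.tolist(), remainder.tolist())
print(rebuilt.equals(s))
```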
7. Series comparison operation
7.1 Greater than
7.1.1 Operator methods
actor1_facebook_likes > 5
0 True
1 True
2 True
3 True
...
4912 True
4913 False
4914 True
4915 True
Name: actor_1_facebook_likes, Length: 4916, dtype: bool
7.1.2 Function method
actor1_facebook_likes.gt(5)
0 True
1 True
2 True
3 True
...
4912 True
4913 False
4914 True
4915 True
Name: actor_1_facebook_likes, Length: 4916, dtype: bool
7.2 Greater than or equal
7.2.1 Operator methods
actor1_facebook_likes >= 5
0 True
1 True
2 True
3 True
...
4912 True
4913 False
4914 True
4915 True
Name: actor_1_facebook_likes, Length: 4916, dtype: bool
7.2.2 Function method
actor1_facebook_likes.ge(5)
0 True
1 True
2 True
3 True
...
4912 True
4913 False
4914 True
4915 True
Name: actor_1_facebook_likes, Length: 4916, dtype: bool
7.3 Less than
7.3.1 Operator methods
actor1_facebook_likes < 5
0 False
1 False
2 False
3 False
...
4912 False
4913 True
4914 False
4915 False
Name: actor_1_facebook_likes, Length: 4916, dtype: bool
7.3.2 Function method
actor1_facebook_likes.lt(5)
0 False
1 False
2 False
3 False
...
4912 False
4913 True
4914 False
4915 False
Name: actor_1_facebook_likes, Length: 4916, dtype: bool
7.4 Less than or equal to
7.4.1 Operator methods
actor1_facebook_likes <= 5
0 False
1 False
2 False
3 False
...
4912 False
4913 True
4914 False
4915 False
Name: actor_1_facebook_likes, Length: 4916, dtype: bool
7.4.2 Function method
actor1_facebook_likes.le(5)
0 False
1 False
2 False
3 False
...
4912 False
4913 True
4914 False
4915 False
Name: actor_1_facebook_likes, Length: 4916, dtype: bool
7.5 Equal to
7.5.1 Operator methods
actor1_facebook_likes == 5
0 False
1 False
2 False
3 False
...
4912 False
4913 False
4914 False
4915 False
Name: actor_1_facebook_likes, Length: 4916, dtype: bool
7.5.2 Function method
actor1_facebook_likes.eq(5)
0 False
1 False
2 False
3 False
...
4912 False
4913 False
4914 False
4915 False
Name: actor_1_facebook_likes, Length: 4916, dtype: bool
7.6 Not equal to
7.6.1 Operator methods
actor1_facebook_likes != 5
0 True
1 True
2 True
3 True
...
4912 True
4913 True
4914 True
4915 True
Name: actor_1_facebook_likes, Length: 4916, dtype: bool
7.6.2 Function method
actor1_facebook_likes.ne(5)
0 True
1 True
2 True
3 True
...
4912 True
4913 True
4914 True
4915 True
Name: actor_1_facebook_likes, Length: 4916, dtype: bool
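The boolean Series returned by a comparison is mostly useful as a mask. The sketch below, on four made-up values, counts and filters with a mask and shows how a missing value behaves: every comparison with NaN is False except "not equal", which is True.

```python
import numpy as np
import pandas as pd

s = pd.Series([1000.0, 0.0, np.nan, 5.0])  # hypothetical values

mask = s > 5          # NaN > 5 evaluates to False
n_above = mask.sum()  # True counts as 1, so this is the number of matches
above = s[mask]       # keep only the rows where the mask is True

print(mask.tolist(), n_above, above.tolist())
print(s.ne(5).tolist())  # NaN != 5 evaluates to True
```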
8. More Series features
8.1 astype() type conversion
# Pass in the target type
actor1_facebook_likes.astype(str)
0 1000.0
1 40000.0
2 11000.0
3 27000.0
...
4912 841.0
4913 0.0
4914 946.0
4915 86.0
Name: actor_1_facebook_likes, Length: 4916, dtype: object
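Converting to str always works, but converting a float Series that contains NaN to a plain integer type does not. A minimal sketch on an assumed three-element Series shows the failure and two ways around it: filling first, or using the nullable "Int64" dtype.

```python
import numpy as np
import pandas as pd

s = pd.Series([1000.0, np.nan, 131.0])  # hypothetical values

try:
    s.astype(int)  # NaN has no integer representation
    cast_failed = False
except ValueError:
    cast_failed = True

filled_int = s.fillna(0).astype(int)  # replace NaN first, then cast
nullable_int = s.astype("Int64")      # nullable integers keep the gap as <NA>

print(cast_failed, filled_int.tolist())
print(nullable_int)
```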
8.2 Series type
actor1_facebook_likes.dtype
dtype('float64')
8.3 Series method chaining
# Fill nulls with 0 -> divide by 10 -> add 5 -> cast to int -> sum
actor1_facebook_likes.fillna(0)\
.div(10)\
.add(5)\
.astype(int)\
.sum()
3211610
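The same chain can be written inside parentheses instead of with trailing backslashes, which is a common style because a stray space after a backslash is a syntax error. A sketch on an assumed three-element Series:

```python
import numpy as np
import pandas as pd

s = pd.Series([1000.0, np.nan, 131.0])  # hypothetical values

total = (
    s.fillna(0)   # [1000.0, 0.0, 131.0]
    .div(10)      # [100.0, 0.0, 13.1]
    .add(5)       # [105.0, 5.0, 18.1]
    .astype(int)  # [105, 5, 18], the fractional part is truncated
    .sum()
)
print(total)
```

Each method returns a new Series, so the original s is unchanged after the chain runs.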