Pandas study notes (part 1)

Foreword: I have long wanted to review pandas thoroughly from start to finish, but without written notes it was always too much trouble. Today I finally made up my mind to work through pandas properly. This is only the first part. Some of the writing may be long-winded, but it can serve as a reference when I forget how something is used.

1. The structure of the DataFrame

1.1 Import the necessary packages

import pandas as pd
import numpy as np

1.2 Set the display limits used by pandas

pd.set_option("display.max_columns", 8)
pd.set_option("display.max_rows", 8)
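As a minimal sketch, the same limits can also be applied temporarily with `pd.option_context`, which restores the previous values when the block exits. The fully qualified option names are used here because short patterns such as "max_columns" can match more than one option in newer pandas versions. The small DataFrame is made up for illustration and is not part of movie.csv.

```python
import pandas as pd

# Hypothetical 20-row frame, only to have something to display.
df = pd.DataFrame({"a": range(20), "b": range(20)})

before = pd.get_option("display.max_rows")

# Inside the block the limits are 8; on exit the old values are restored.
with pd.option_context("display.max_rows", 8, "display.max_columns", 8):
    inside = pd.get_option("display.max_rows")
    print(df)

after = pd.get_option("display.max_rows")
print(inside, before == after)
```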

1.3 Load a DataFrame from a file

movies = pd.read_csv("./pandasLearnData/movie.csv")

1.4 Display the first n rows

# Shows the first five rows by default; head(n) shows the first n rows
movies.head()
   color      director_name  num_critic_for_reviews  duration  ...  actor_2_facebook_likes  imdb_score  aspect_ratio  movie_facebook_likes
0  Color      James Cameron                   723.0     178.0  ...                   936.0         7.9          1.78                 33000
1  Color     Gore Verbinski                   302.0     169.0  ...                  5000.0         7.1          2.35                     0
2  Color         Sam Mendes                   602.0     148.0  ...                   393.0         6.8          2.35                 85000
3  Color  Christopher Nolan                   813.0     164.0  ...                 23000.0         8.5          2.35                164000
4    NaN        Doug Walker                     NaN       NaN  ...                    12.0         7.1           NaN                     0

5 rows × 28 columns

1.5 Display the last n rows

# Shows the last five rows by default; tail(n) shows the last n rows
movies.tail()
      color     director_name  num_critic_for_reviews  duration  ...  actor_2_facebook_likes  imdb_score  aspect_ratio  movie_facebook_likes
4911  Color       Scott Smith                     1.0      87.0  ...                   470.0         7.7           NaN                    84
4912  Color               NaN                    43.0      43.0  ...                   593.0         7.5         16.00                 32000
4913  Color  Benjamin Roberds                    13.0      76.0  ...                     0.0         6.3           NaN                    16
4914  Color       Daniel Hsia                    14.0     100.0  ...                   719.0         6.3          2.35                   660
4915  Color          Jon Gunn                    43.0      90.0  ...                    23.0         6.6          1.85                   456

5 rows × 28 columns
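A short sketch of `head` and `tail` with an explicit n, using a made-up seven-row frame in place of movie.csv. It also shows that a negative n means "all rows except the last (or first) |n|".

```python
import pandas as pd

# Made-up data standing in for movie.csv.
df = pd.DataFrame({
    "title": list("abcdefg"),
    "score": [7.9, 7.1, 6.8, 8.5, 7.1, 6.6, 6.3],
})

first_two = df.head(2)      # first 2 rows
last_three = df.tail(3)     # last 3 rows
all_but_last = df.head(-1)  # negative n: every row except the last one

print(first_two.index.tolist())
print(last_three.index.tolist())
print(len(all_but_last))
```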

2. Access DataFrame components

2.1 Extract column index

columns_index = movies.columns
columns_index
Index(['color', 'director_name', 'num_critic_for_reviews', 'duration',
       'director_facebook_likes', 'actor_3_facebook_likes', 'actor_2_name',
       'actor_1_facebook_likes', 'gross', 'genres', 'actor_1_name',
       'movie_title', 'num_voted_users', 'cast_total_facebook_likes',
       'actor_3_name', 'facenumber_in_poster', 'plot_keywords',
       'movie_imdb_link', 'num_user_for_reviews', 'language', 'country',
       'content_rating', 'budget', 'title_year', 'actor_2_facebook_likes',
       'imdb_score', 'aspect_ratio', 'movie_facebook_likes'],
      dtype='object')

2.2 Extract row index

rows_index = movies.index
rows_index
RangeIndex(start=0, stop=4916, step=1)

2.3 Extracting data

movies_data = movies.values
movies_data
array([['Color', 'James Cameron', 723.0, ..., 7.9, 1.78, 33000],
       ['Color', 'Gore Verbinski', 302.0, ..., 7.1, 2.35, 0],
       ['Color', 'Sam Mendes', 602.0, ..., 6.8, 2.35, 85000],
       ...,
       ['Color', 'Benjamin Roberds', 13.0, ..., 6.3, nan, 16],
       ['Color', 'Daniel Hsia', 14.0, ..., 6.3, 2.35, 660],
       ['Color', 'Jon Gunn', 43.0, ..., 6.6, 1.85, 456]], dtype=object)

2.4 View their types

print("Column index type:",type(columns_index))
print("Row index type:",type(rows_index))
print("Data container type:",type(movies_data))
Column index type: <class 'pandas.core.indexes.base.Index'>
Row index type: <class 'pandas.core.indexes.range.RangeIndex'>
Data container type: <class 'numpy.ndarray'>

2.5 RangeIndex is a subclass of Index

issubclass(pd.RangeIndex,pd.Index)
True

2.6 Get the value of the row index

rows_index.values
array([   0,    1,    2, ..., 4913, 4914, 4915])

2.7 Get the value of the column index

columns_index.values
array(['color', 'director_name', 'num_critic_for_reviews', 'duration',
       'director_facebook_likes', 'actor_3_facebook_likes',
       'actor_2_name', 'actor_1_facebook_likes', 'gross', 'genres',
       'actor_1_name', 'movie_title', 'num_voted_users',
       'cast_total_facebook_likes', 'actor_3_name',
       'facenumber_in_poster', 'plot_keywords', 'movie_imdb_link',
       'num_user_for_reviews', 'language', 'country', 'content_rating',
       'budget', 'title_year', 'actor_2_facebook_likes', 'imdb_score',
       'aspect_ratio', 'movie_facebook_likes'], dtype=object)
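The pandas documentation recommends `to_numpy()` over `.values` for getting the underlying array. The sketch below uses a made-up two-column frame to show the same three components extracted that way; it is an alternative spelling, not what these notes originally ran.

```python
import numpy as np
import pandas as pd

# Made-up frame with one text column and one float column.
df = pd.DataFrame({"name": ["a", "b"], "score": [7.9, 7.1]})

data = df.to_numpy()             # mixed columns give an object array
scores = df["score"].to_numpy()  # a single float column stays float64
labels = df.columns.to_numpy()   # also works on an Index

print(data.dtype, scores.dtype, labels.tolist())
```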

3. Data types in DataFrame

3.1 Get the dtype of each column

movies.dtypes
color                      object
director_name              object
num_critic_for_reviews    float64
duration                  float64
                           ...   
actor_2_facebook_likes    float64
imdb_score                float64
aspect_ratio              float64
movie_facebook_likes        int64
Length: 28, dtype: object

3.2 Count the columns of each dtype in the data set

movies.dtypes.value_counts()
float64    13
object     12
int64       3
dtype: int64
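Once the dtypes are known, `select_dtypes` can pick columns by type. A minimal sketch on a made-up three-column frame that borrows column names from the data set above:

```python
import pandas as pd

# Made-up rows; only the column names come from the notes.
df = pd.DataFrame({
    "director_name": ["x", "y"],
    "duration": [178.0, 169.0],
    "movie_facebook_likes": [33000, 0],
})

numeric = df.select_dtypes(include="number")   # ints and floats
floats = df.select_dtypes(include="float64")   # floats only

print(numeric.columns.tolist())
print(floats.columns.tolist())
```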

4. Series

4.1 Select a column in the data set as Series

4.1.1 Select by column label (bracket indexing)

director_name = movies["director_name"]
director_name
0           James Cameron
1          Gore Verbinski
2              Sam Mendes
3       Christopher Nolan
              ...        
4912                  NaN
4913     Benjamin Roberds
4914          Daniel Hsia
4915             Jon Gunn
Name: director_name, Length: 4916, dtype: object

4.1.2 Select by attribute access

director_name = movies.director_name
director_name
0           James Cameron
1          Gore Verbinski
2              Sam Mendes
3       Christopher Nolan
              ...        
4912                  NaN
4913     Benjamin Roberds
4914          Daniel Hsia
4915             Jon Gunn
Name: director_name, Length: 4916, dtype: object
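The two selection styles return the same Series, but attribute access only works when the label is a valid Python identifier that does not collide with an existing DataFrame attribute. The sketch below illustrates this with made-up columns ("count" and "imdb score" are hypothetical labels chosen to trigger the two failure cases).

```python
import pandas as pd

# Hypothetical labels: "count" clashes with a method, "imdb score" has a space.
df = pd.DataFrame({
    "director_name": ["a", "b"],
    "count": [1, 2],
    "imdb score": [7.9, 7.1],
})

# Both styles give the same Series for an ordinary label.
same = df["director_name"].equals(df.director_name)

# df.count is the DataFrame.count method, not the "count" column.
is_method = callable(df.count)
count_col = df["count"]

# A label containing a space can only be reached with brackets.
score_col = df["imdb score"]

print(same, is_method, count_col.tolist(), score_col.name)
```

Bracket indexing is therefore the safer default in scripts.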

4.2 Get Series Name

director_name.name
'director_name'

4.3 Convert a Series to a DataFrame

director_name.to_frame().head()
       director_name
0      James Cameron
1     Gore Verbinski
2         Sam Mendes
3  Christopher Nolan
4        Doug Walker
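A small sketch of `to_frame` on a made-up Series: by default the column takes the Series name, and the `name` argument overrides it.

```python
import pandas as pd

# Two made-up entries with the same Series name as in the notes.
s = pd.Series(["James Cameron", "Gore Verbinski"], name="director_name")

frame = s.to_frame()                   # column named after the Series
renamed = s.to_frame(name="director")  # override the column name
back = frame["director_name"]          # selecting the column gives a Series again

print(frame.columns.tolist(), renamed.columns.tolist())
```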

5. Calling Series methods

5.1 Count the members (methods and attributes) of DataFrame and Series

dataframe_methods = set(dir(pd.DataFrame))
series_methods = set(dir(pd.Series))
print("DataFrame member count:",len(dataframe_methods))
print("Series member count:",len(series_methods))
print("Members shared by Series and DataFrame:",len(dataframe_methods & series_methods))
DataFrame member count: 451
Series member count: 465
Members shared by Series and DataFrame: 394

5.2 Display the first n rows

actor1_facebook_likes = movies["actor_1_facebook_likes"]
actor1_facebook_likes.head()
0     1000.0
1    40000.0
2    11000.0
3    27000.0
4      131.0
Name: actor_1_facebook_likes, dtype: float64

5.3 Count the occurrences of each value in the Series

actor1_facebook_likes.value_counts()
1000.0     436
11000.0    206
2000.0     189
3000.0     150
          ... 
216.0        1
859.0        1
225.0        1
334.0        1
Name: actor_1_facebook_likes, Length: 877, dtype: int64

5.4 Series size

actor1_facebook_likes.size
4916

5.5 Series shape

actor1_facebook_likes.shape
# (number of rows, [number of columns])
(4916,)

5.6 Number of elements in Series

len(actor1_facebook_likes)
4916

5.7 Number of non-null elements in Series

actor1_facebook_likes.count()
# This is 7 fewer than the total element count above, so there are 7 missing values
4909

5.8 Series quantile (the median by default)

# The default is q=0.5
actor1_facebook_likes.quantile()
982.0

5.9 Series minimum

actor1_facebook_likes.min()
0.0

5.10 Series maximum

actor1_facebook_likes.max()
640000.0

5.11 Series mean

actor1_facebook_likes.mean()
6494.488490527602

5.12 Series standard deviation

actor1_facebook_likes.std()
15106.986883848309

5.13 Series median

actor1_facebook_likes.median()
982.0

5.14 Series sum

actor1_facebook_likes.sum()
31881444.0

5.15 Series descriptive statistics

actor1_facebook_likes.describe()
count      4909.000000
mean       6494.488491
std       15106.986884
min           0.000000
25%         607.000000
50%         982.000000
75%       11000.000000
max      640000.000000
Name: actor_1_facebook_likes, dtype: float64
director_name.describe()
# count: number of non-null elements
# unique: number of distinct values
# top: the most frequent value
# freq: how many times the most frequent value occurs
count                 4814
unique                2397
top       Steven Spielberg
freq                    26
Name: director_name, dtype: object

5.16 Test each Series element for null

# isnull() checks every element of the Series and returns a bool for each
director_name.isnull()
0       False
1       False
2       False
3       False
        ...  
4912     True
4913    False
4914    False
4915    False
Name: director_name, Length: 4916, dtype: bool

5.17 Test each Series element for non-null

# The opposite of the previous one: True marks a non-null value
actor1_facebook_likes.notnull()
0       True
1       True
2       True
3       True
        ... 
4912    True
4913    True
4914    True
4915    True
Name: actor_1_facebook_likes, Length: 4916, dtype: bool

5.18 Fill null values in a Series

# Fill the null values with 0 and check the size
actor1_facebook_likes.fillna(0).size
4916

5.19 Drop null values from a Series

# Drop the null values from the Series and check the size
actor1_facebook_likes.dropna().size
4909
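Both `fillna` and `dropna` return a new Series and leave the original untouched, which is why the sizes above differ without `actor1_facebook_likes` itself changing. A minimal sketch on made-up values:

```python
import numpy as np
import pandas as pd

# Three made-up values, one missing.
s = pd.Series([1.0, np.nan, 3.0])

filled = s.fillna(0)   # same length, NaN replaced by 0
dropped = s.dropna()   # shorter, original index labels kept

print(filled.tolist(), dropped.index.tolist(), s.isna().sum())
```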

5.20 Going further

5.20.1 Quantile

# Look up the value at the 20% mark of the data
actor1_facebook_likes.quantile(0.2)
510.0
# Look up several quantiles at once
actor1_facebook_likes.quantile(np.linspace(0.1,1,10))
0.1       240.0
0.2       510.0
0.3       694.0
0.4       854.0
         ...   
0.7      8000.0
0.8     13000.0
0.9     18000.0
1.0    640000.0
Name: actor_1_facebook_likes, Length: 10, dtype: float64
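A small worked check, on made-up values, that `quantile()` with its default matches `median()`, that NaN is ignored, and that passing a list returns a Series indexed by the requested quantiles. The numbers assume the default linear interpolation.

```python
import numpy as np
import pandas as pd

# Four made-up values plus one missing value.
s = pd.Series([1.0, 2.0, 3.0, 4.0, np.nan])

median = s.median()
q50 = s.quantile()             # q defaults to 0.5
qs = s.quantile([0.25, 0.75])  # list in, Series out, indexed by q

print(median, q50)
print(qs)
```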

5.20.2 Count each value in the Series and return relative frequencies

director_name.value_counts(normalize=True)
Steven Spielberg    0.005401
Woody Allen         0.004570
Martin Scorsese     0.004155
Clint Eastwood      0.004155
                      ...   
Paul Bunnell        0.000208
Kelly Asbury        0.000208
Jeff Crook          0.000208
Khalil Sullins      0.000208
Name: director_name, Length: 2397, dtype: float64
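Note that the frequencies are computed over non-null values only, since `value_counts` drops NaN by default. The sketch below uses a made-up five-element Series to show the effect of `normalize` and `dropna=False`.

```python
import pandas as pd

# Made-up labels; None becomes a missing value.
s = pd.Series(["a", "b", "a", None, "a"])

counts = s.value_counts()                # missing value excluded
freq = s.value_counts(normalize=True)    # shares of the 4 non-null values
with_nan = s.value_counts(dropna=False)  # missing value gets its own row

print(counts["a"], freq["a"], len(with_nan))
```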

5.20.3 Check whether the Series still contains missing values

print("Before dropping nulls:",director_name.hasnans)
print("After dropping nulls:",director_name.dropna().hasnans)
Before dropping nulls: True
After dropping nulls: False

6. Series arithmetic operations

6.1 Series addition

6.1.1 Raw data display

actor1_facebook_likes
0        1000.0
1       40000.0
2       11000.0
3       27000.0
         ...   
4912      841.0
4913        0.0
4914      946.0
4915       86.0
Name: actor_1_facebook_likes, Length: 4916, dtype: float64

6.1.2 Operator methods

actor1_facebook_likes + 1
0        1001.0
1       40001.0
2       11001.0
3       27001.0
         ...   
4912      842.0
4913        1.0
4914      947.0
4915       87.0
Name: actor_1_facebook_likes, Length: 4916, dtype: float64

6.1.3 Function method

actor1_facebook_likes.add(1)
0        1001.0
1       40001.0
2       11001.0
3       27001.0
         ...   
4912      842.0
4913        1.0
4914      947.0
4915       87.0
Name: actor_1_facebook_likes, Length: 4916, dtype: float64
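With a scalar the operator and the method give identical results, as above. One practical difference appears when two Series are combined: the method form accepts `fill_value`, which substitutes a default when only one side is missing. A sketch with two made-up Series:

```python
import numpy as np
import pandas as pd

# Made-up operands, each with one missing value.
a = pd.Series([1.0, np.nan, 3.0])
b = pd.Series([10.0, 20.0, np.nan])

op = a + b                   # NaN wherever either side is missing
fn = a.add(b, fill_value=0)  # a missing side is treated as 0

print(op.tolist())
print(fn.tolist())
```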

6.2 Series subtraction

6.2.1 Operator methods

actor1_facebook_likes - 1
0         999.0
1       39999.0
2       10999.0
3       26999.0
         ...   
4912      840.0
4913       -1.0
4914      945.0
4915       85.0
Name: actor_1_facebook_likes, Length: 4916, dtype: float64

6.2.2 Function method

# subtract is the full name; the short and full names behave the same
actor1_facebook_likes.sub(1)
0         999.0
1       39999.0
2       10999.0
3       26999.0
         ...   
4912      840.0
4913       -1.0
4914      945.0
4915       85.0
Name: actor_1_facebook_likes, Length: 4916, dtype: float64

6.3 Series multiplication

6.3.1 Operator methods

actor1_facebook_likes * 5
0         5000.0
1       200000.0
2        55000.0
3       135000.0
          ...   
4912      4205.0
4913         0.0
4914      4730.0
4915       430.0
Name: actor_1_facebook_likes, Length: 4916, dtype: float64

6.3.2 Function method

# Full name: multiply()
actor1_facebook_likes.mul(5)
0         5000.0
1       200000.0
2        55000.0
3       135000.0
          ...   
4912      4205.0
4913         0.0
4914      4730.0
4915       430.0
Name: actor_1_facebook_likes, Length: 4916, dtype: float64

6.4 Series division

6.4.1 Operator methods

actor1_facebook_likes / 10
0        100.0
1       4000.0
2       1100.0
3       2700.0
         ...  
4912      84.1
4913       0.0
4914      94.6
4915       8.6
Name: actor_1_facebook_likes, Length: 4916, dtype: float64

6.4.2 Function method

# Full name: divide()
actor1_facebook_likes.div(10)
0        100.0
1       4000.0
2       1100.0
3       2700.0
         ...  
4912      84.1
4913       0.0
4914      94.6
4915       8.6
Name: actor_1_facebook_likes, Length: 4916, dtype: float64

6.5 Series floor division

6.5.1 Operator methods

actor1_facebook_likes // 10
0        100.0
1       4000.0
2       1100.0
3       2700.0
         ...  
4912      84.0
4913       0.0
4914      94.0
4915       8.0
Name: actor_1_facebook_likes, Length: 4916, dtype: float64

6.5.2 Function method

actor1_facebook_likes.floordiv(10)
0        100.0
1       4000.0
2       1100.0
3       2700.0
         ...  
4912      84.0
4913       0.0
4914      94.0
4915       8.0
Name: actor_1_facebook_likes, Length: 4916, dtype: float64

6.6 Series modulo operation

6.6.1 Operator methods

actor1_facebook_likes % 7
0       6.0
1       2.0
2       3.0
3       1.0
       ... 
4912    1.0
4913    0.0
4914    1.0
4915    2.0
Name: actor_1_facebook_likes, Length: 4916, dtype: float64

6.6.2 Function method

# mod() has no longer spelling; Series.mode() is unrelated (it returns the most frequent value)
actor1_facebook_likes.mod(7)
0       6.0
1       2.0
2       3.0
3       1.0
       ... 
4912    1.0
4913    0.0
4914    1.0
4915    2.0
Name: actor_1_facebook_likes, Length: 4916, dtype: float64
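Floor division and modulo fit together: the quotient times the divisor plus the remainder gives back the original value. A quick check on four values taken from the column shown above:

```python
import pandas as pd

# Four values that appear in the output above.
s = pd.Series([1000.0, 841.0, 946.0, 86.0])

q = s.floordiv(7)    # same as s // 7
r = s.mod(7)         # same as s % 7
rebuilt = q * 7 + r  # recovers the original values

print(q.tolist())
print(r.tolist())
```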

7. Series comparison operations

7.1 Greater than

7.1.1 Operator methods

actor1_facebook_likes > 5
0        True
1        True
2        True
3        True
        ...  
4912     True
4913    False
4914     True
4915     True
Name: actor_1_facebook_likes, Length: 4916, dtype: bool

7.1.2 Function method

actor1_facebook_likes.gt(5)
0        True
1        True
2        True
3        True
        ...  
4912     True
4913    False
4914     True
4915     True
Name: actor_1_facebook_likes, Length: 4916, dtype: bool

7.2 Greater than or equal to

7.2.1 Operator methods

actor1_facebook_likes >= 5
0        True
1        True
2        True
3        True
        ...  
4912     True
4913    False
4914     True
4915     True
Name: actor_1_facebook_likes, Length: 4916, dtype: bool

7.2.2 Function method

actor1_facebook_likes.ge(5)
0        True
1        True
2        True
3        True
        ...  
4912     True
4913    False
4914     True
4915     True
Name: actor_1_facebook_likes, Length: 4916, dtype: bool

7.3 Less than

7.3.1 Operator methods

actor1_facebook_likes < 5
0       False
1       False
2       False
3       False
        ...  
4912    False
4913     True
4914    False
4915    False
Name: actor_1_facebook_likes, Length: 4916, dtype: bool

7.3.2 Function methods

actor1_facebook_likes.lt(5)
0       False
1       False
2       False
3       False
        ...  
4912    False
4913     True
4914    False
4915    False
Name: actor_1_facebook_likes, Length: 4916, dtype: bool

7.4 Less than or equal to

7.4.1 Operator methods

actor1_facebook_likes <= 5
0       False
1       False
2       False
3       False
        ...  
4912    False
4913     True
4914    False
4915    False
Name: actor_1_facebook_likes, Length: 4916, dtype: bool

7.4.2 Function methods

actor1_facebook_likes.le(5)
0       False
1       False
2       False
3       False
        ...  
4912    False
4913     True
4914    False
4915    False
Name: actor_1_facebook_likes, Length: 4916, dtype: bool

7.5 Equal to

7.5.1 Operator methods

actor1_facebook_likes == 5
0       False
1       False
2       False
3       False
        ...  
4912    False
4913    False
4914    False
4915    False
Name: actor_1_facebook_likes, Length: 4916, dtype: bool

7.5.2 Function method

actor1_facebook_likes.eq(5)
0       False
1       False
2       False
3       False
        ...  
4912    False
4913    False
4914    False
4915    False
Name: actor_1_facebook_likes, Length: 4916, dtype: bool

7.6 Not equal to

7.6.1 Operator methods

actor1_facebook_likes != 5
0       True
1       True
2       True
3       True
        ... 
4912    True
4913    True
4914    True
4915    True
Name: actor_1_facebook_likes, Length: 4916, dtype: bool

7.6.2 Function method

actor1_facebook_likes.ne(5)
0       True
1       True
2       True
3       True
        ... 
4912    True
4913    True
4914    True
4915    True
Name: actor_1_facebook_likes, Length: 4916, dtype: bool
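The boolean Series returned by these comparisons are mostly useful as masks for filtering and counting. A minimal sketch on made-up values, which also shows how a missing value behaves: every comparison with NaN is False except "not equal".

```python
import numpy as np
import pandas as pd

# Made-up values, one missing.
s = pd.Series([1000.0, 0.0, np.nan, 86.0])

mask = s.gt(5)       # NaN > 5 evaluates to False
selected = s[mask]   # keep only rows where the mask is True
n_large = mask.sum() # True counts as 1

# Combine conditions with & and parentheses.
between = s[(s >= 50) & (s <= 1000)]

not_five = s.ne(5)   # NaN != 5 evaluates to True

print(mask.tolist(), n_large)
print(between.index.tolist(), not_five.tolist())
```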

8. More Series operations

8.1 Type conversion with astype()

# Pass in the target type
actor1_facebook_likes.astype(str)
0        1000.0
1       40000.0
2       11000.0
3       27000.0
         ...   
4912      841.0
4913        0.0
4914      946.0
4915       86.0
Name: actor_1_facebook_likes, Length: 4916, dtype: object
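The printed values look unchanged after `astype(str)`, but each element is now a Python string. The sketch below, on made-up values, also shows that converting a float Series containing NaN straight to `int` raises an error, so the missing values have to be filled first.

```python
import numpy as np
import pandas as pd

# Made-up values, one missing.
s = pd.Series([1000.0, np.nan, 86.0])

as_str = s.astype(str)
first = as_str.iloc[0]  # a str, no longer a float

# NaN has no integer representation, so this conversion fails.
try:
    s.astype(int)
    failed = False
except (ValueError, TypeError):
    failed = True

as_int = s.fillna(0).astype(int)  # fill first, then convert

print(repr(first), failed, as_int.tolist())
```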

8.2 Series type

actor1_facebook_likes.dtype
dtype('float64')

8.3 Series method chaining

# Fill nulls with 0 -> divide by 10 -> add 5 -> convert to int -> sum
actor1_facebook_likes.fillna(0)\
.div(10)\
.add(5)\
.astype(int)\
.sum()
3211610
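The same chain can be written inside parentheses instead of with backslash continuations, which allows a comment on each step. A sketch on three made-up values so that the arithmetic can be followed by hand:

```python
import numpy as np
import pandas as pd

# Made-up values, one missing.
s = pd.Series([1000.0, np.nan, 86.0])

total = (
    s.fillna(0)   # [1000.0, 0.0, 86.0]
    .div(10)      # [100.0, 0.0, 8.6]
    .add(5)       # [105.0, 5.0, 13.6]
    .astype(int)  # [105, 5, 13], truncating toward zero
    .sum()        # 123
)

print(total)
```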

Origin blog.csdn.net/qq_42359956/article/details/105451514