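Processing data with pandas can be roughly divided into the following steps:
read data --> analyze data --> process data --> export data
This first pass mainly walks through the whole workflow once.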
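As a rough illustration of the four steps, here is a minimal sketch that runs entirely in memory. The CSV text, the column names `name` and `score`, and the use of `io.StringIO` in place of real files are assumptions made for the example, not part of the original data.

```python
import io

import pandas as pd

# Hypothetical in-memory CSV standing in for a real file on disk.
csv_text = "name,score\nalpha,3\nbeta,1\ngamma,2\n"

# 1. Read: parse the CSV text into a DataFrame.
df = pd.read_csv(io.StringIO(csv_text))

# 2. Analyze: inspect the size and the column labels.
print(df.shape)
print(list(df.columns))

# 3. Process: sort the rows by score, highest first.
result = df.sort_values("score", ascending=False)

# 4. Export: write CSV text to a buffer instead of a file, without the index.
out = io.StringIO()
result.to_csv(out, index=False)
print(out.getvalue())
```

With a real dataset, the two `io.StringIO` objects would simply be replaced by file paths.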
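import pandas as pd
from pandas import DataFrame
df1 = pd.read_csv('/path/xx.csv') # read the data with pd.read_csv; the result is a DataFrame
# df1.to_csv('df1.csv',index=False) # write the contents to a file named df1.csv, omitting the row index
The contents of df1 are: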
Sep 2018 Sep 2017 Change Programming Language Ratings Change.1
0 1 1 NaN Java 17.436% +4.75%
1 2 2 NaN C 15.447% +8.06%
2 3 5 change Python 7.653% +4.67%
3 4 3 change C++ 7.394% +1.83%
4 5 8 change Visual Basic .NET 5.308% +3.33%
5 6 4 change C# 3.295% -1.48%
6 7 6 change PHP 2.775% +0.57%
7 8 7 change JavaScript 2.131% +0.11%
8 9 - change SQL 2.062% +2.06%
9 10 18 change Objective-C 1.509% +0.00%
df1.columns # lists the column labels of the loaded data; we can use these labels to select the content we want
Index(['Sep 2018', 'Sep 2017', 'Change', 'Programming Language', 'Ratings','Change.1'],dtype='object')
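df_new = DataFrame(df1, columns=['Sep 2019','Sep 2018', 'Change', 'Programming Language']) # extracts the columns we want from the original data;
# a requested column that does not exist yet (here 'Sep 2019') is added, with its values initialized to NaN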
Sep 2019 Sep 2018 Change Programming Language
0 NaN 1 NaN Java
1 NaN 2 NaN C
2 NaN 3 change Python
3 NaN 4 change C++
4 NaN 5 change Visual Basic .NET
5 NaN 6 change C#
6 NaN 7 change PHP
7 NaN 8 change JavaScript
8 NaN 9 change SQL
9 NaN 10 change Objective-C
df_new['Sep 2019'] = range(10) # assign values to the newly added column; this is also very useful
Sep 2019 Sep 2018 Change Programming Language
0 0 1 NaN Java
1 1 2 NaN C
2 2 3 change Python
3 3 4 change C++
4 4 5 change Visual Basic .NET
5 5 6 change C#
6 6 7 change PHP
7 7 8 change JavaScript
8 8 9 change SQL
9 9 10 change Objective-C
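The same select-then-fill pattern can be tried on a small table. The sketch below uses `reindex(columns=...)` as a swapped-in alternative to `DataFrame(df1, columns=...)`; the three-row `df1` is a hypothetical miniature of the ranking table above, not the real file.

```python
import pandas as pd

# Hypothetical miniature of the ranking table shown above.
df1 = pd.DataFrame({
    "Sep 2018": [1, 2, 3],
    "Programming Language": ["Java", "C", "Python"],
    "Ratings": ["17.436%", "15.447%", "7.653%"],
})

# Keep two existing columns and request one that does not exist yet.
wanted = ["Sep 2019", "Sep 2018", "Programming Language"]
df_new = df1.reindex(columns=wanted)

# The requested-but-missing column is created and filled with NaN.
all_nan_before = bool(df_new["Sep 2019"].isna().all())
print(all_nan_before)

# Assign values to the new column; the length must match the number of rows.
df_new["Sep 2019"] = range(len(df_new))
print(df_new)
```

Using `range(len(df_new))` rather than a hard-coded `range(10)` keeps the assignment valid if the number of rows changes.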
Sorting a DataFrame
df2 = df1.sort_values('Sep 2018') # sort all rows by the values in one column
df2.sort_index() # returns a copy sorted by index
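To make the difference between the two sorts concrete, here is a small sketch. The column names `lang` and `rank` and the out-of-order index labels are invented for the example.

```python
import pandas as pd

# Hypothetical data; the index labels are deliberately out of order.
df = pd.DataFrame(
    {"lang": ["C", "Java", "Python"], "rank": [2, 1, 3]},
    index=[10, 5, 7],
)

# Sort rows by the values in one column (ascending by default).
by_value = df.sort_values("rank")

# Pass ascending=False for descending order.
by_value_desc = df.sort_values("rank", ascending=False)

# Sort rows by their index labels instead of by column values.
by_index = by_value.sort_index()

print(by_value)
print(by_index)
```

Both methods return a new DataFrame and leave the original untouched, so the result has to be assigned to a name if it is needed later.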
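How to extract the content we want from the data and sort it by a feature in a single step
Key code:
Below we first go step by step, then do it all in one step.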
df = pd.read_csv('./movie_metadata.csv') # load the raw data
df.shape # check the size (shape) of the raw data; the output shows 5043 rows and 28 columns,
# i.e. 5043 samples and 28 features
(5043, 28)
df.columns # view the column labels
Index(['color', 'director_name', 'num_critic_for_reviews', 'duration',
'director_facebook_likes', 'actor_3_facebook_likes', 'actor_2_name',
'actor_1_facebook_likes', 'gross', 'genres', 'actor_1_name',
'movie_title', 'num_voted_users', 'cast_total_facebook_likes',
'actor_3_name', 'facenumber_in_poster', 'plot_keywords',
'movie_imdb_link', 'num_user_for_reviews', 'language', 'country',
'content_rating', 'budget', 'title_year', 'actor_2_facebook_likes',
'imdb_score', 'aspect_ratio', 'movie_facebook_likes'],
dtype='object')
df.head(5) # view the first five rows; the output is too wide to show here
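The three inspection calls above can be tried without the movie file. In this sketch the CSV text and the film and director names are placeholders, and `io.StringIO` stands in for the file on disk.

```python
import io

import pandas as pd

# Hypothetical three-row stand-in for movie_metadata.csv.
csv_text = (
    "movie_title,director_name,imdb_score\n"
    "Film A,Director X,7.9\n"
    "Film B,Director Y,8.5\n"
    "Film C,Director Z,6.8\n"
)
df = pd.read_csv(io.StringIO(csv_text))

# shape is a (rows, columns) tuple, so it can be unpacked directly.
n_rows, n_cols = df.shape
print(n_rows, n_cols)

# columns is an Index object; tolist() gives a plain Python list of labels.
print(df.columns.tolist())

# head(n) returns the first n rows as a new DataFrame.
preview = df.head(2)
print(preview)
```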
# Extract imdb_score, director_name and movie_title from the raw data and sort by imdb_score in descending order, all in one line of code
df_new = DataFrame(df,columns=['imdb_score','director_name','movie_title']).sort_values('imdb_score',ascending=False)
imdb_score director_name movie_title
2765 9.5 John Blanchard Towering Inferno
1937 9.3 Frank Darabont The Shawshank Redemption
3466 9.2 Francis Ford Coppola The Godfather
4409 9.1 John Stockwell Kickboxer: Vengeance
2824 9.1 NaN Dekalog
3207 9.1 NaN Dekalog
66 9.0 Christopher Nolan The Dark Knight
5043 rows × 3 columns
The code that does everything in one step:
pd.read_csv('movie_metadata.csv')[['imdb_score','director_name','movie_title']].sort_values('imdb_score',ascending=False).to_csv('imbd.csv')
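This does read ---> select ---> sort ---> export in a single statement. The steps could be written separately, but writing them as one chain helps build fluency with the code.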
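A long chain is often easier to read when wrapped in parentheses with one step per line. The sketch below follows the same read, select, sort, export sequence on placeholder data; the CSV text is invented, and the output goes to an in-memory buffer rather than to a file.

```python
import io

import pandas as pd

# Hypothetical stand-in for movie_metadata.csv; one director is missing.
csv_text = (
    "movie_title,director_name,imdb_score,duration\n"
    "Film A,Director X,7.9,120\n"
    "Film B,Director Y,8.5,95\n"
    "Film C,,6.8,101\n"
)
columns = ["imdb_score", "director_name", "movie_title"]
out = io.StringIO()

# Parentheses let the chain span several lines, one step per line.
(
    pd.read_csv(io.StringIO(csv_text))[columns]   # read, then select columns
    .sort_values("imdb_score", ascending=False)   # sort, highest score first
    .to_csv(out, index=False)                     # export without the index
)

print(out.getvalue())
```

The missing director is written as an empty field, which is how `to_csv` represents NaN by default.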
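This chained style is convenient in Jupyter, because you can inspect the data as you write. For example,
run pd.read_csv('movie_metadata.csv').columns to view the column labels, or call head() to show the first five rows; after running it once, delete that call and keep extending the chain.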
Index(['color', 'director_name', 'num_critic_for_reviews', 'duration',
'director_facebook_likes', 'actor_3_facebook_likes', 'actor_2_name',
'actor_1_facebook_likes', 'gross', 'genres', 'actor_1_name',
'movie_title', 'num_voted_users', 'cast_total_facebook_likes',
'actor_3_name', 'facenumber_in_poster', 'plot_keywords',
'movie_imdb_link', 'num_user_for_reviews', 'language', 'country',
'content_rating', 'budget', 'title_year', 'actor_2_facebook_likes',
'imdb_score', 'aspect_ratio', 'movie_facebook_likes'],
dtype='object')
# Now extract the content we want, for example the three features ['movie_title','imdb_score','director_name'], and use head() to check the result
pd.read_csv('movie_metadata.csv')[['movie_title','imdb_score','director_name']].head() # shows the first five rows by default
| | movie_title | imdb_score | director_name |
|---|---|---|---|
| 0 | Avatar | 7.9 | James Cameron |
| 1 | Pirates of the Caribbean: At World's End | 7.1 | Gore Verbinski |
| 2 | Spectre | 6.8 | Sam Mendes |
| 3 | The Dark Knight Rises | 8.5 | Christopher Nolan |
| 4 | Star Wars: Episode VII - The Force Awakens ... | 7.1 | Doug Walker |
# Next, process the data, for example by sorting on imdb_score; remove head() and keep extending the chain
# Once the data is sorted by imdb_score, head(10) shows the first 10 rows
pd.read_csv('movie_metadata.csv')[['imdb_score','movie_title','director_name']].sort_values('imdb_score',ascending=False).head(10)
imdb_score movie_title director_name
2765 9.5 Towering Inferno John Blanchard
1937 9.3 The Shawshank Redemption Frank Darabont
3466 9.2 The Godfather Francis Ford Coppola
4409 9.1 Kickboxer: Vengeance John Stockwell
2824 9.1 Dekalog NaN
3207 9.1 Dekalog NaN
66 9.0 The Dark Knight Christopher Nolan
2837 9.0 The Godfather: Part II Francis Ford Coppola
3481 9.0 Fargo NaN
339 8.9 The Lord of the Rings: The Return of the King Peter Jackson
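Several director_name values in the output above are NaN. As one possible way to handle such missing values, the sketch below counts them and then either drops or fills the affected rows. The data is a hypothetical three-row table, and the replacement string "Unknown" is an arbitrary choice for illustration.

```python
import pandas as pd

# Hypothetical rows with two missing director names.
df = pd.DataFrame({
    "imdb_score": [9.1, 9.0, 8.9],
    "movie_title": ["Film A", "Film B", "Film C"],
    "director_name": [None, "Director Y", None],
})

# Count the missing values in each column.
missing_per_column = df.isna().sum()
print(missing_per_column)

# Option 1: drop the rows where director_name is missing.
dropped = df.dropna(subset=["director_name"])

# Option 2: keep every row and fill the gaps with a placeholder value.
filled = df.fillna({"director_name": "Unknown"})

print(dropped)
print(filled)
```

Which option is appropriate depends on the analysis; dropping loses rows, while filling keeps them but introduces a made-up value.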
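# Once the data is processed (of course much more could be handled, such as missing values), write the result to disk
# Remove head(10) and keep extending the chain to write the data to disk under the name imbd_ex.csv
pd.read_csv('movie_metadata.csv')[['imdb_score','movie_title','director_name']].sort_values('imdb_score',ascending=False).to_csv('imbd_ex.csv')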
2018/10/02 17:33 212,877 imbd.csv
2018/10/02 18:18 212,877 imbd_ex.csv
2017/11/13 19:09 1,494,688 movie_metadata.csv
A file named imbd_ex.csv now exists; let's open it and take a look.
| | imdb_score | director_name | movie_title |
|---|---|---|---|
| 2765 | 9.5 | John Blanchard | Towering Inferno |
| 1937 | 9.3 | Frank Darabont | The Shawshank Redemption |
| 3466 | 9.2 | Francis Ford Coppola | The Godfather |
| 4409 | 9.1 | John Stockwell | Kickboxer: Vengeance |
| 2824 | 9.1 | | Dekalog |
| 3207 | 9.1 | | Dekalog |
| 66 | 9 | Christopher Nolan | The Dark Knight |
| 2837 | 9 | Francis Ford Coppola | The Godfather: Part II |
| 3481 | 9 | | Fargo |
| 339 | 8.9 | Peter Jackson | The Lord of the Rings: The Return of the King |
| 4822 | 8.9 | Sidney Lumet | 12 Angry Men |
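Notice that the file contains the row index from the original data. If you don't want it, add index=False when writing, i.e. to_csv('imbd_ex.csv',index=False), then open the file again: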
| imdb_score | director_name | movie_title |
|---|---|---|
| 9.5 | John Blanchard | Towering Inferno |
| 9.3 | Frank Darabont | The Shawshank Redemption |
| 9.2 | Francis Ford Coppola | The Godfather |
| 9.1 | John Stockwell | Kickboxer: Vengeance |
| 9.1 | | Dekalog |
| 9.1 | | Dekalog |
| 9 | Christopher Nolan | The Dark Knight |
| 9 | Francis Ford Coppola | The Godfather: Part II |
| 9 | | Fargo |
| 8.9 | Peter Jackson | The Lord of the Rings: The Return of the King |
| 8.9 | Sidney Lumet | 12 Angry Men |
| 8.9 | Sergio Leone | The Good, the Bad and the Ugly |
| 8.9 | Quentin Tarantino | Pulp Fiction |
| 8.9 | Steven Spielberg | Schindler's List |
| 8.8 | David Fincher | Fight Club |
| 8.8 | Robert Zemeckis | Forrest Gump |
| 8.8 | Peter Jackson | The Lord of the Rings: The Fellowship of the Ring |
| 8.8 | Irvin Kershner | Star Wars: Episode V - The Empire Strikes Back |
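The index column is now gone. You may wonder: if I read this file back in, will there still be row numbers? The answer is yes. A DataFrame is built from Series and always carries an index, so when the file has no index column pandas generates a new default one (0, 1, 2, ...) for us to work with: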
pd.read_csv('imbd_ex.csv').head(10)
| | imdb_score | movie_title | director_name |
|---|---|---|---|
| 0 | 9.5 | Towering Inferno | John Blanchard |
| 1 | 9.3 | The Shawshank Redemption | Frank Darabont |
| 2 | 9.2 | The Godfather | Francis Ford Coppola |
| 3 | 9.1 | Kickboxer: Vengeance | John Stockwell |
| 4 | 9.1 | Dekalog | NaN |
| 5 | 9.1 | Dekalog | NaN |
| 6 | 9.0 | The Dark Knight | Christopher Nolan |
| 7 | 9.0 | The Godfather: Part II | Francis Ford Coppola |
| 8 | 9.0 | Fargo | NaN |
| 9 | 8.9 | The Lord of the Rings: The Return of the King | Peter Jackson |
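The index behaviour on a write and read round trip can be checked with a small sketch. The two-row table and its film titles are placeholders (the index labels are borrowed from the output above), and the CSV is kept as a string instead of being written to disk. The `index_col=0` variant is an addition not used in the text above; it shows how the original labels could be restored if they had been written.

```python
import io

import pandas as pd

# Hypothetical two-row table that keeps non-default index labels.
df = pd.DataFrame(
    {"imdb_score": [9.5, 9.3], "movie_title": ["Film A", "Film B"]},
    index=[2765, 1937],
)

# With no path argument, to_csv returns the CSV text as a string.
with_index = df.to_csv()                # index becomes an unnamed first column
without_index = df.to_csv(index=False)  # index is left out entirely

# Reading a file that has no index column yields a fresh 0..n-1 index.
reloaded = pd.read_csv(io.StringIO(without_index))
print(reloaded.index.tolist())

# If the index was written, index_col=0 restores it on read.
restored = pd.read_csv(io.StringIO(with_index), index_col=0)
print(restored.index.tolist())
```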
# More to be added later