pandas Data Processing in Practice, Part 1 (A Quick Walkthrough)

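Processing data with pandas can be roughly divided into the following steps:

read data --> analyze data --> process data --> export data

This first post mainly walks through the overall workflow once.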
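The four steps can be sketched end to end on a tiny example. This is only a minimal illustration: the in-memory CSV text and the column names `name` and `score` are hypothetical and stand in for a real file on disk.

```python
import io
import pandas as pd

# Hypothetical in-memory CSV standing in for a real file on disk.
raw = io.StringIO("name,score\nb,2\na,3\nc,1\n")

df = pd.read_csv(raw)                                  # 1. read
n_rows, n_cols = df.shape                              # 2. analyze: how big is the data?
df_sorted = df.sort_values("score", ascending=False)   # 3. process: sort by one column
csv_text = df_sorted.to_csv(index=False)               # 4. export (a string is returned when no path is given)
print(csv_text)
```

Passing a file name to `to_csv` instead would write the same text to disk.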

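import pandas as pd

df1 = pd.read_csv('/path/xx.csv')   # read the data with pd.read_csv; the result is a DataFrame

# df1.to_csv('df1.csv', index=False) # write the contents to a file named df1.csv, leaving out the index numbers

The contents of df1 are: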

Sep 2018	Sep 2017	Change	Programming Language	Ratings	Change.1
0	1	1	NaN	Java	17.436%	+4.75%
1	2	2	NaN	C	15.447%	+8.06%
2	3	5	change	Python	7.653%	+4.67%
3	4	3	change	C++	7.394%	+1.83%
4	5	8	change	Visual Basic .NET	5.308%	+3.33%
5	6	4	change	C#	3.295%	-1.48%
6	7	6	change	PHP	2.775%	+0.57%
7	8	7	change	JavaScript	2.131%	+0.11%
8	9	-	change	SQL	2.062%	+2.06%
9	10	18	change	Objective-C	1.509%	+0.00%

df1.columns  # lists the column labels of the loaded data; with these labels we can select the content we want

Index(['Sep 2018', 'Sep 2017', 'Change', 'Programming Language', 'Ratings','Change.1'],dtype='object')
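As a small sketch of how the column labels are used for selection: the toy frame below reuses three columns and the first three rows of the table above, and the variable names are just for illustration.

```python
import pandas as pd

# Toy frame built from the first three rows of the table above.
toy = pd.DataFrame({
    "Sep 2018": [1, 2, 3],
    "Programming Language": ["Java", "C", "Python"],
    "Ratings": ["17.436%", "15.447%", "7.653%"],
})

labels = list(toy.columns)                            # column labels as a plain Python list
one_col = toy["Ratings"]                              # a single label returns a Series
two_cols = toy[["Programming Language", "Ratings"]]   # a list of labels returns a DataFrame
```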

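df_new = pd.DataFrame(df1, columns=['Sep 2019', 'Sep 2018', 'Change', 'Programming Language'])
# This extracts the columns we want from the original data; a column that does not
# exist in the source (here 'Sep 2019') is added with its values initialized to NaN.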

Sep 2019	Sep 2018	Change	Programming Language
0	NaN	1	NaN	Java
1	NaN	2	NaN	C
2	NaN	3	change	Python
3	NaN	4	change	C++
4	NaN	5	change	Visual Basic .NET
5	NaN	6	change	C#
6	NaN	7	change	PHP
7	NaN	8	change	JavaScript
8	NaN	9	change	SQL
9	NaN	10	change	Objective-C
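The same effect can also be obtained with `reindex(columns=...)`. The sketch below, on a two-row toy frame, shows both spellings side by side; it is meant as an illustration of the behaviour, not as the only way to do it.

```python
import pandas as pd

toy = pd.DataFrame({"Sep 2018": [1, 2], "Programming Language": ["Java", "C"]})

wanted = ["Sep 2019", "Sep 2018", "Programming Language"]
via_constructor = pd.DataFrame(toy, columns=wanted)   # the form used in this article
via_reindex = toy.reindex(columns=wanted)             # an alternative spelling

# Both keep the existing columns and fill the unknown 'Sep 2019' column with NaN.
print(via_reindex)
```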

df_new['Sep 2019'] = range(10) # assign values to the newly inserted column; this is also very useful

Sep 2019	Sep 2018	Change	Programming Language
0	0	1	NaN	Java
1	1	2	NaN	C
2	2	3	change	Python
3	3	4	change	C++
4	4	5	change	Visual Basic .NET
5	5	6	change	C#
6	6	7	change	PHP
7	7	8	change	JavaScript
8	8	9	change	SQL
9	9	10	change	Objective-C
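One detail worth keeping in mind when assigning a column: a sequence has to supply exactly one value per row, while a single scalar is repeated for every row. A minimal sketch on a toy frame (the column names `rank`, `note` and `bad` are made up for this example):

```python
import pandas as pd

toy = pd.DataFrame({"Programming Language": ["Java", "C", "Python"]})

toy["rank"] = range(len(toy))   # a sequence needs exactly one value per row
toy["note"] = "example"         # a scalar is broadcast to every row

error_message = None
try:
    toy["bad"] = range(2)       # wrong length: pandas raises ValueError
except ValueError as exc:
    error_message = str(exc)
```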

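Sorting a DataFrame

df2 = df1.sort_values('A') # sort all rows by the values in one column ('A' stands for the chosen column name)
df2.sort_index() # sort by index; this returns a new DataFrame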
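A small sketch of both calls on a toy frame, assuming a column that really is named `A`:

```python
import pandas as pd

toy = pd.DataFrame({"A": [3, 1, 2], "B": ["x", "y", "z"]})

by_value = toy.sort_values("A")                         # ascending by default
by_value_desc = toy.sort_values("A", ascending=False)   # descending
restored = by_value.sort_index()                        # back to the original row order

# sort_values and sort_index return new objects; `toy` itself is unchanged.
```

Each row keeps its original index label after `sort_values`, which is why `sort_index` can undo the sort.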

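How to extract the content we want and sort it by a chosen feature in a single step

Key code:

Below we first do it step by step, then in one step.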

df = pd.read_csv('./movie_metadata.csv') # load the raw data

df.shape # view the size (shape) of the raw data; the output shows 5043 rows and 28 columns,
         # i.e. 5043 samples and 28 features
    (5043, 28)

df.columns # view the column labels
    Index(['color', 'director_name', 'num_critic_for_reviews', 'duration',
       'director_facebook_likes', 'actor_3_facebook_likes', 'actor_2_name',
       'actor_1_facebook_likes', 'gross', 'genres', 'actor_1_name',
       'movie_title', 'num_voted_users', 'cast_total_facebook_likes',
       'actor_3_name', 'facenumber_in_poster', 'plot_keywords',
       'movie_imdb_link', 'num_user_for_reviews', 'language', 'country',
       'content_rating', 'budget', 'title_year', 'actor_2_facebook_likes',
       'imdb_score', 'aspect_ratio', 'movie_facebook_likes'],
      dtype='object')

df.head(5) # view the first five rows; the output is too wide to show here

# Extract imdb_score, director_name and movie_title from the raw data and sort by imdb_score in descending order, all in one statement
df_new = pd.DataFrame(df, columns=['imdb_score','director_name','movie_title']).sort_values('imdb_score',ascending=False)
imdb_score	director_name	movie_title
2765	9.5	John Blanchard	Towering Inferno
1937	9.3	Frank Darabont	The Shawshank Redemption
3466	9.2	Francis Ford Coppola	The Godfather
4409	9.1	John Stockwell	Kickboxer: Vengeance
2824	9.1	NaN	Dekalog
3207	9.1	NaN	Dekalog
66	9.0	Christopher Nolan	The Dark Knight

5043 rows × 3 columns

The one-step version:

 pd.read_csv('movie_metadata.csv')[['imdb_score','director_name','movie_title']].sort_values('imdb_score',ascending=False).to_csv('imbd.csv')


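This single statement goes from read ---> select ----> sort -----> export in one go. It could be written as separate statements, but chaining the calls like this helps build fluency with the code.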
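A long chain can be wrapped in parentheses so that each stage sits on its own line. The sketch below is a self-contained variant of the chain: the three rows are taken from the head() output shown in this article, while the in-memory CSV and the `extra` column are placeholders standing in for movie_metadata.csv and the columns we do not need.

```python
import io
import pandas as pd

# Three rows from the head() output in this article; 'extra' is a placeholder column.
raw = io.StringIO(
    "movie_title,imdb_score,director_name,extra\n"
    "Avatar,7.9,James Cameron,x\n"
    "Spectre,6.8,Sam Mendes,y\n"
    "The Dark Knight Rises,8.5,Christopher Nolan,z\n"
)

result = (
    pd.read_csv(raw)                                   # read
    [["imdb_score", "director_name", "movie_title"]]   # select
    .sort_values("imdb_score", ascending=False)        # sort
)
csv_text = result.to_csv(index=False)                  # export (string, since no path is given)
print(csv_text)
```

Commenting out the last stages of the chain and appending `.head()` is an easy way to inspect an intermediate result.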

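This style is very convenient in Jupyter, because you can inspect the data as you write. For example:

Run pd.read_csv('movie_metadata.csv').columns to view the column labels, or append head() to show the first five rows. After running it once, delete that call and keep extending the chain.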

Index(['color', 'director_name', 'num_critic_for_reviews', 'duration',
       'director_facebook_likes', 'actor_3_facebook_likes', 'actor_2_name',
       'actor_1_facebook_likes', 'gross', 'genres', 'actor_1_name',
       'movie_title', 'num_voted_users', 'cast_total_facebook_likes',
       'actor_3_name', 'facenumber_in_poster', 'plot_keywords',
       'movie_imdb_link', 'num_user_for_reviews', 'language', 'country',
       'content_rating', 'budget', 'title_year', 'actor_2_facebook_likes',
       'imdb_score', 'aspect_ratio', 'movie_facebook_likes'],
      dtype='object')

# Now extract the content we want, for example the three features ['movie_title','imdb_score','director_name'], and check the result with head()

pd.read_csv('movie_metadata.csv')[['movie_title','imdb_score','director_name']].head() # shows the first five rows by default

  movie_title imdb_score director_name
0 Avatar 7.9 James Cameron
1 Pirates of the Caribbean: At World's End 7.1 Gore Verbinski
2 Spectre 6.8 Sam Mendes
3 The Dark Knight Rises 8.5 Christopher Nolan
4 Star Wars: Episode VII - The Force Awakens  ... 7.1 Doug Walker

# Next, process the data, for example by sorting on imdb_score; just remove head() and keep writing

# Once the data is sorted by imdb_score, head(10) shows the first 10 rows

pd.read_csv('movie_metadata.csv')[['movie_title','imdb_score','director_name']].sort_values('imdb_score',ascending=False).head(10)

      imdb_score  movie_title                                    director_name
2765  9.5         Towering Inferno                               John Blanchard
1937  9.3         The Shawshank Redemption                       Frank Darabont
3466  9.2         The Godfather                                  Francis Ford Coppola
4409  9.1         Kickboxer: Vengeance                           John Stockwell
2824  9.1         Dekalog                                        NaN
3207  9.1         Dekalog                                        NaN
66    9.0         The Dark Knight                                Christopher Nolan
2837  9.0         The Godfather: Part II                         Francis Ford Coppola
3481  9.0         Fargo                                          NaN
339   8.9         The Lord of the Rings: The Return of the King  Peter Jackson
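The NaN entries under director_name in the output above are missing values. How to treat them depends on the task; the sketch below only illustrates two common options on a three-row in-memory sample (the placeholder text "Unknown" is an arbitrary choice for this example).

```python
import io
import pandas as pd

# Small sample with two missing directors, mirroring rows from the output above.
raw = io.StringIO(
    "imdb_score,movie_title,director_name\n"
    "9.3,The Shawshank Redemption,Frank Darabont\n"
    "9.1,Dekalog,\n"
    "9.0,Fargo,\n"
)
movies = pd.read_csv(raw)

n_missing = movies["director_name"].isna().sum()        # count the missing directors
dropped = movies.dropna(subset=["director_name"])       # option 1: discard those rows
filled = movies.fillna({"director_name": "Unknown"})    # option 2: substitute a placeholder
```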

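# Once the data has been processed (there is of course much more to handle, such as missing values), write the result to disk

# Remove head(10) and keep writing: save the data to disk under the name imbd_ex.csv

pd.read_csv('movie_metadata.csv')[['imdb_score','movie_title','director_name']].sort_values('imdb_score',ascending=False).to_csv('imbd_ex.csv')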

2018/10/02  17:33           212,877 imbd.csv
2018/10/02  18:18           212,877 imbd_ex.csv
2017/11/13  19:09         1,494,688 movie_metadata.csv

The file imbd_ex.csv now exists. Let's open it and take a look.

      imdb_score  director_name         movie_title
2765  9.5         John Blanchard        Towering Inferno
1937  9.3         Frank Darabont        The Shawshank Redemption
3466  9.2         Francis Ford Coppola  The Godfather
4409  9.1         John Stockwell        Kickboxer: Vengeance
2824  9.1                               Dekalog
3207  9.1                               Dekalog
66    9           Christopher Nolan     The Dark Knight
2837  9           Francis Ford Coppola  The Godfather: Part II
3481  9                                 Fargo
339   8.9         Peter Jackson         The Lord of the Rings: The Return of the King
4822  8.9         Sidney Lumet          12 Angry Men

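Notice that the file contains the row index of the original data. If you do not want it, pass index=False when writing, i.e. to_csv('imbd_ex.csv', index=False), then open the file again: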

imdb_score  director_name         movie_title
9.5         John Blanchard        Towering Inferno
9.3         Frank Darabont        The Shawshank Redemption
9.2         Francis Ford Coppola  The Godfather
9.1         John Stockwell        Kickboxer: Vengeance
9.1                               Dekalog
9.1                               Dekalog
9           Christopher Nolan     The Dark Knight
9           Francis Ford Coppola  The Godfather: Part II
9                                 Fargo
8.9         Peter Jackson         The Lord of the Rings: The Return of the King
8.9         Sidney Lumet          12 Angry Men
8.9         Sergio Leone          The Good, the Bad and the Ugly
8.9         Quentin Tarantino     Pulp Fiction
8.9         Steven Spielberg      Schindler's List
8.8         David Fincher         Fight Club
8.8         Robert Zemeckis       Forrest Gump
8.8         Peter Jackson         The Lord of the Rings: The Fellowship of the Ring
8.8         Irvin Kershner        Star Wars: Episode V - The Empire Strikes Back
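The difference between the two files can be reproduced without touching the disk, since `to_csv` returns the CSV text when no path is given. A minimal sketch on two rows from the table above:

```python
import pandas as pd

toy = pd.DataFrame(
    {"imdb_score": [9.5, 9.3],
     "movie_title": ["Towering Inferno", "The Shawshank Redemption"]},
    index=[2765, 1937],   # row labels carried over from the original data
)

with_index = toy.to_csv()                 # first column holds the row labels
without_index = toy.to_csv(index=False)   # row labels are left out
print(with_index)
print(without_index)
```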

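The index column is now gone from the file. You may wonder: if I load this file with pandas, will there be row numbers again? The answer is yes. A DataFrame is built from Series, each of which carries an index, so a fresh index (0, 1, 2, ...) is generated for us to work with: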

pd.read_csv('imbd_ex.csv').head(10)

  imdb_score movie_title director_name
0 9.5 Towering Inferno John Blanchard
1 9.3 The Shawshank Redemption Frank Darabont
2 9.2 The Godfather Francis Ford Coppola
3 9.1 Kickboxer: Vengeance John Stockwell
4 9.1 Dekalog NaN
5 9.1 Dekalog NaN
6 9.0 The Dark Knight Christopher Nolan
7 9.0 The Godfather: Part II Francis Ford Coppola
8 9.0 Fargo NaN
9 8.9 The Lord of the Rings: The Return of the King Peter Jackson
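If the original row labels are needed later, one option is to write them to the file and read them back with `index_col=0`. The sketch below contrasts this with the fresh index; it uses in-memory text instead of imbd_ex.csv so that it runs on its own.

```python
import io
import pandas as pd

toy = pd.DataFrame({"imdb_score": [9.5, 9.3]}, index=[2765, 1937])

# Written without the index: read_csv builds a fresh 0..n-1 index.
fresh = pd.read_csv(io.StringIO(toy.to_csv(index=False)))

# Written with the index: index_col=0 uses the first column as the index again.
restored = pd.read_csv(io.StringIO(toy.to_csv()), index_col=0)
```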

# More to be added later


Reposted from blog.csdn.net/weixin_42398658/article/details/82926901