Data processing is everywhere, and mastering a few common techniques lets you do more with less effort. This series uses Pandas for data processing and analysis and summarizes commonly used, practical data analysis techniques.
The Pandas version I use is shown below; the same snippet also imports the Pandas library.
>>> import pandas as pd
>>> pd.__version__
'0.25.1'
Before starting, make sure the interpreter's working directory is the one that contains the dataset:
>>> import os
>>> os.chdir('D://source/dataset') # the directory where my datasets live
>>> os.listdir() # confirm that the IMDB-Movie-Data dataset is in this directory
['drinksbycountry.csv', 'IMDB-Movie-Data.csv', 'movietweetings', 'titanic_eda_data.csv', 'titanic_train_data.csv']
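As an alternative to changing the working directory, the file can be read by its full path. The following is only a minimal sketch: the temporary directory and the sample.csv file are stand-ins invented for illustration, not the author's dataset.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical stand-in for the dataset directory; replace with your own path.
data_dir = Path(tempfile.mkdtemp())

# Write a tiny sample file so the sketch runs anywhere (illustrative data only).
sample_csv = data_dir / "sample.csv"
sample_csv.write_text("Rank,Title\n1,Guardians of the Galaxy\n2,Prometheus\n")

# Confirm the file is there, then read it by full path (no os.chdir needed).
print(sample_csv.exists())
sample_df = pd.read_csv(sample_csv)
print(sample_df.shape)
```

Reading by full path keeps the rest of the session independent of the current working directory.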
With the setup done, we can start on the data processing techniques.
1 Remove a column with Pandas
Import Data
>>> df = pd.read_csv("IMDB-Movie-Data.csv")
>>> df.head(1) # load the data and display the first row
Rank Title Genre ... Votes Revenue (Millions) Metascore
0 1 Guardians of the Galaxy Action,Adventure,Sci-Fi ... 757074 333.13 76.0
[1 rows x 12 columns]
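Use the pop method to remove a specified column. It returns the removed column as a Series, which to_frame() converts into a DataFrame:
>>> meta = df.pop("Title").to_frame() # remove the Title column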
Check if it has been removed:
>>> df.head(1) # df now has 11 columns
Rank Genre ... Revenue (Millions) Metascore
0 1 Action,Adventure,Sci-Fi ... 333.13 76.0
[1 rows x 11 columns]
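To make the behavior of pop clearer, here is a small sketch that contrasts it with drop. The demo frame is made up for illustration (it only mirrors a few column names from the dataset), and drop is shown as a comparison rather than as part of the steps above.

```python
import pandas as pd

# Tiny illustrative frame; the rows are invented, not taken from the real file.
demo = pd.DataFrame({
    "Rank": [1, 2],
    "Title": ["Guardians of the Galaxy", "Prometheus"],
    "Genre": ["Action,Adventure,Sci-Fi", "Drama"],
})

# pop removes the column from demo in place and returns it as a Series.
titles = demo.pop("Title")
print(type(titles).__name__)
print(list(demo.columns))

# drop returns a new DataFrame and leaves demo itself unchanged.
without_genre = demo.drop(columns=["Genre"])
print(list(without_genre.columns))
print(list(demo.columns))
```

pop is convenient when the removed column is still needed, as with Title here; drop fits better when the original frame should stay intact.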
2 Count the number of title words
The pop call above returned meta. Display its first 3 rows:
>>> meta.head(3)
Title
0 Guardians of the Galaxy
1 Prometheus
2 Split
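Titles are made up of words separated by spaces, so, assuming a single space between words, the number of spaces plus one gives the number of words.
# .str.count(" ") + 1 gives the number of words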
>>> meta["words_count"] = meta["Title"].str.count(" ") + 1
>>> meta.head(3) # the words_count column holds the number of words
Title words_count
0 Guardians of the Galaxy 4
1 Prometheus 1
2 Split 1
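Counting spaces works when words are separated by exactly one space. As a hedged alternative, splitting on whitespace and taking the length of the result also copes with doubled or trailing spaces. The sketch below uses a small made-up Series; the last title is an invented edge case.

```python
import pandas as pd

# Three titles from the output above plus one invented edge case
# with a double space and a trailing space.
titles = pd.Series(["Guardians of the Galaxy", "Prometheus", "Split", "Two  spaces "])

# Approach from the text: number of spaces plus one.
by_spaces = titles.str.count(" ") + 1

# Alternative: split on any run of whitespace and count the pieces.
by_split = titles.str.split().str.len()

print(by_spaces.tolist())  # [4, 1, 1, 4]
print(by_split.tolist())   # [4, 1, 1, 2]
```

The two approaches agree on clean titles and differ only on the edge case, where splitting gives the more sensible count.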
3 Genre frequency statistics
The following counts how often each Genre value occurs:
>>> vc = df["Genre"].value_counts()
The top 5 Genre values are shown below. The most frequent is the combination Action,Adventure,Sci-Fi with 50 occurrences, followed by Drama with 48:
>>> vc.head()
Action,Adventure,Sci-Fi 50
Drama 48
Comedy,Drama,Romance 35
Comedy 32
Drama,Romance 31
Name: Genre, dtype: int64
Show the top 5 as a pie chart:
>>> import matplotlib.pyplot as plt
>>> vc[:5].plot(kind='pie')
<matplotlib.axes._subplots.AxesSubplot object at 0x000001D65B114948>
>>> plt.show()
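Note that value_counts treats each comma-separated combination as one value. If the goal is instead to count individual genres, one possible approach is to split the strings and use explode, which is available from Pandas 0.25 onward. This is only a sketch on a made-up Series that reuses the five combinations shown above, one row each, so the counts do not reflect the real dataset.

```python
import pandas as pd

# Illustrative Series: the five combinations from the output above, once each.
genres = pd.Series([
    "Action,Adventure,Sci-Fi",
    "Drama",
    "Comedy,Drama,Romance",
    "Comedy",
    "Drama,Romance",
], name="Genre")

# Split each string into a list, then give every list element its own row.
single = genres.str.split(",").explode()

# Count how often each individual genre appears.
counts = single.value_counts()
print(counts["Drama"])  # 3
print(counts.sum())     # 10
```

On the real data, the same two calls could be applied to df["Genre"] in place of the sample Series.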