3 Tips for Data Processing

Data processing is everywhere, and mastering a few common techniques lets you get more done with less effort. This series uses Pandas for data processing and analysis and summarizes frequently used, practical techniques.

The Pandas version I use is shown below; the same snippet also imports the Pandas library.

>>> import pandas as pd
>>> pd.__version__
'0.25.1'

Before starting, make sure the interpreter's working directory is the directory that contains the dataset:

>>> import os
>>> os.chdir('D://source/dataset') # the directory where my datasets live
>>> os.listdir() # confirm that the IMDB-Movie-Data dataset is in this directory
['drinksbycountry.csv', 'IMDB-Movie-Data.csv', 'movietweetings', 'titanic_eda_data.csv', 'titanic_train_data.csv']
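
Changing the working directory is one option. As an alternative, here is a minimal sketch that leaves the working directory alone and passes a full path to pd.read_csv instead. The directory data_dir and the file sample.csv are hypothetical stand-ins created only so the example runs without the real dataset.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical stand-in for the dataset directory; replace with your own path.
data_dir = Path(tempfile.mkdtemp())

# Write a tiny sample file so the example runs without the real dataset.
sample_path = data_dir / "sample.csv"
sample_path.write_text("Rank,Title\n1,Guardians of the Galaxy\n2,Prometheus\n")

# List the CSV files in the directory without changing the working directory.
csv_files = sorted(p.name for p in data_dir.glob("*.csv"))

# Read the file through its full path.
sample_df = pd.read_csv(sample_path)
```

With this approach the rest of the session does not depend on where the interpreter was started.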

With the setup done, we can move on to the tips.

1 Remove a column with Pandas

Import the data:

>>> df = pd.read_csv("IMDB-Movie-Data.csv")
>>> df.head(1) # load the data and show the first row
   Rank                    Title                    Genre  ...   Votes Revenue (Millions) Metascore
0     1  Guardians of the Galaxy  Action,Adventure,Sci-Fi  ...  757074             333.13      76.0

[1 rows x 12 columns]

Use the pop method to remove a specified column. It returns the removed column as a Series, which to_frame() converts to a DataFrame:

>>> meta = df.pop("Title").to_frame() # remove the Title column

Check that it has been removed:

>>> df.head(1) # df now has 11 columns
   Rank                    Genre  ... Revenue (Millions) Metascore
0     1  Action,Adventure,Sci-Fi  ...             333.13      76.0

[1 rows x 11 columns]
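
Because pop modifies the DataFrame in place, it can help to compare it with drop, which returns a new DataFrame by default. The sketch below uses a small made-up frame named toy rather than the IMDB data.

```python
import pandas as pd

# Small made-up frame standing in for the IMDB data.
toy = pd.DataFrame({
    "Rank": [1, 2],
    "Title": ["Guardians of the Galaxy", "Prometheus"],
    "Genre": ["Action,Adventure,Sci-Fi", "Adventure,Mystery,Sci-Fi"],
})

# pop removes the column in place and returns it as a Series.
titles = toy.pop("Title")

# drop returns a new DataFrame and leaves the original untouched by default.
without_genre = toy.drop(columns=["Genre"])
```

After this runs, toy still has the Rank and Genre columns, while without_genre has only Rank. Use pop when you want to keep working with the removed column, and drop when you only want a copy without it.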

2 Count the words in each title

meta holds the Title column that was popped above. Display its first 3 rows:

>>> meta.head(3)
                     Title
0  Guardians of the Galaxy
1               Prometheus
2                    Split

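Titles are made up of words separated by spaces, so the number of spaces plus 1 gives the number of words.

# .str.count(" ") + 1 gives the number of words
>>> meta["words_count"] = meta["Title"].str.count(" ") + 1
>>> meta.head(3) # the words_count column holds the number of words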
                     Title  words_count
0  Guardians of the Galaxy            4
1               Prometheus            1
2                    Split            1
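
Counting spaces assumes exactly one space between words and none at the ends. As a hedged alternative, the sketch below splits on whitespace and counts the pieces. The frame titles is made up, and its third title is deliberately malformed (a double space and a trailing space) to show the difference.

```python
import pandas as pd

# Made-up titles; the last one has a double space and a trailing space.
titles = pd.DataFrame({"Title": ["Guardians of the Galaxy", "Split", "La  La Land "]})

# Counting spaces overcounts when there are repeated or trailing spaces.
titles["by_spaces"] = titles["Title"].str.count(" ") + 1

# str.split() with no pattern splits on runs of whitespace and ignores the extras.
titles["by_split"] = titles["Title"].str.split().str.len()
```

For clean titles both columns agree. For the malformed title, by_spaces gives 5 while by_split gives 3.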

3 Genre frequency statistics

The following counts how often each value occurs in the Genre column:

>>> vc = df["Genre"].value_counts()

The top 5 Genre values are shown below. The most frequent is the combination Action,Adventure,Sci-Fi with 50 occurrences, followed by Drama with 48:

>>> vc.head()
Action,Adventure,Sci-Fi    50
Drama                      48
Comedy,Drama,Romance       35
Comedy                     32
Drama,Romance              31
Name: Genre, dtype: int64
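
The counts above treat each comma-separated combination as a single value. If you want counts per individual genre instead, one possible approach (not part of the original walkthrough) is to split the strings and use Series.explode, which is available from Pandas 0.25. The Series genres below is made-up sample data.

```python
import pandas as pd

# Made-up sample standing in for df["Genre"].
genres = pd.Series(
    ["Action,Adventure,Sci-Fi", "Drama", "Action,Drama", "Drama"],
    name="Genre",
)

# Split each comma-separated string into a list, then give each genre its own row.
single = genres.str.split(",").explode()

# Count how often each individual genre appears.
genre_counts = single.value_counts()
```

In this sample, Drama appears 3 times and Action twice, even though neither count is visible when the combinations are counted as whole strings.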

Show the top 5 as a pie chart:

>>> import matplotlib.pyplot as plt
>>> vc[:5].plot(kind='pie')
<matplotlib.axes._subplots.AxesSubplot object at 0x000001D65B114948>
>>> plt.show()

[Figure: pie chart of the top 5 Genre values]
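
A plain pie chart does not show the size of each share. The sketch below is one way to label the wedges with percentages, using the top-5 counts copied from the output above. The Agg backend and the output file name top5_genres.png are assumptions chosen so the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Top-5 counts copied from the value_counts output shown earlier.
top5 = pd.Series(
    [50, 48, 35, 32, 31],
    index=["Action,Adventure,Sci-Fi", "Drama", "Comedy,Drama,Romance",
           "Comedy", "Drama,Romance"],
    name="Genre",
)

# autopct labels each wedge with its share of the five values plotted.
ax = top5.plot(kind="pie", autopct="%.1f%%")
ax.set_ylabel("")  # hide the Series name that pandas puts on the y-axis
ax.figure.savefig("top5_genres.png")  # hypothetical output file name
```

Note that the percentages are shares of these five values only, not of the whole dataset.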


Origin blog.csdn.net/weixin_38037405/article/details/123869850