This is the 8th article in the series on using Excel to learn Python.
I want to walk through a complete case to explain the whole process and the basics of data analysis in Python, using a real data set as the example: a short-term rental data set from Tianchi. To get it, reply "short-term rental data set" in the back end.
Start by thinking about the data analysis process: the first step is obtaining the data, so this section covers obtaining data and basic operations on it.
1. Data Import
1.1 Import .xlsx file
To import an Excel file with the .xlsx suffix, use the pd.read_excel(path) method.
import pandas as pd

# Import the .xlsx file
df_review = pd.read_excel(r"D:\个人\data\reviews.xlsx")
df_review
Result: the df_review data contains two fields, listing_id and date.
The one required argument when reading data is the file path. Paths are written differently on different operating systems. On Windows there are usually two ways to write one:
Backslash "\": right-click the file and choose Properties to see its location, which Windows shows with \ by default. Because Python treats the backslash as the escape character inside strings, a path written this way needs the raw-string prefix r in front of it: r"D:\personal\data\reviews.xlsx"
Slash "/": no r prefix is needed, just write every separator as /: "D:/personal/data/reviews.xlsx"
Which of the two you use is a matter of personal habit.
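The difference between the two spellings can be checked without touching any file. The sketch below uses made-up paths (they are illustrations only, not files from the data set) and the standard-library PureWindowsPath class to compare them.

```python
from pathlib import PureWindowsPath

# Hypothetical paths, used only to compare the spellings.
raw_path = r"D:\data\reviews.xlsx"       # raw string: backslashes are kept literally
escaped_path = "D:\\data\\reviews.xlsx"  # same path with every backslash doubled
slash_path = "D:/data/reviews.xlsx"      # forward slashes need no special handling

# The raw string and the doubled-backslash string are the same text.
same_text = raw_path == escaped_path

# PureWindowsPath treats "/" and "\" as equivalent separators.
same_location = PureWindowsPath(raw_path) == PureWindowsPath(slash_path)

# Without the r prefix, "\t" is read as a tab character and the path is silently changed.
has_tab = "\t" in "D:\table.csv"

print(same_text, same_location, has_tab)
```

The last check shows why the r prefix matters: a folder or file name that happens to start with t or n after a backslash would otherwise turn into a tab or a newline.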
sheet_name parameter
An .xlsx file may contain several sheets, so you can set the sheet_name parameter to choose which sheet to import. You can pass either the sheet's name or its position, counting up from 0. If sheet_name is not specified, the first sheet is imported by default.
# Specify the sheet
df_review = pd.read_excel(r"D:\个人\data\reviews.xlsx", sheet_name=0)  # pass a sheet name or a position
df_review
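To see that a name and a position select the same sheet, here is a minimal sketch that first writes a small two-sheet workbook to a temporary folder and then reads it back. The file name, sheet names and values are invented for the demonstration, and the sketch assumes an Excel engine such as openpyxl is installed.

```python
import os
import tempfile

import pandas as pd

# Build a throwaway workbook with two sheets (all names and values are made up).
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "demo.xlsx")
with pd.ExcelWriter(path) as writer:
    pd.DataFrame({"listing_id": [1, 2]}).to_excel(writer, sheet_name="reviews", index=False)
    pd.DataFrame({"id": [10, 20, 30]}).to_excel(writer, sheet_name="listings", index=False)

first_default = pd.read_excel(path)                   # no sheet_name: first sheet
by_position = pd.read_excel(path, sheet_name=1)       # second sheet, counted from 0
by_name = pd.read_excel(path, sheet_name="listings")  # same sheet, selected by name

print(list(first_default.columns))
print(by_position.equals(by_name))
```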
1.2 Import .csv file
Use the pd.read_csv(path) method to import files in .csv format.
# Import the .csv file
df_list = pd.read_csv(r"D:\个人\data\listings.csv")
df_list
Result: the df_list data mainly includes host ID, host name, latitude and longitude, room type, price, minimum number of rentable days, number of reviews, date of the last review, reviews per month, number of rentable listings, days available per year, and other fields.
Specify encoding format
An important point for .csv files is the encoding. When importing a file you need to know its encoding to avoid garbled characters. So how do you find out which encoding a file uses? Open it in Notepad++ and the lower right corner shows the file's encoding; the listings.csv file just imported, for example, is utf-8. Encoding names are not case-sensitive, and utf-8 can also be written as utf8.
You can use the encoding parameter to set the encoding. If it is omitted, read_csv assumes utf-8.
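As a rough illustration of the encoding parameter, the sketch below encodes the same small table twice in memory, once as gbk and once as utf-8, and reads each copy back with the matching encoding. The table contents are invented, and in-memory bytes stand in for a file on disk.

```python
import io

import pandas as pd

# A tiny made-up table containing Chinese text.
text = "id,name\n1,房东\n"

gbk_bytes = text.encode("gbk")     # what a gbk-encoded file would contain
utf8_bytes = text.encode("utf-8")  # what a utf-8 encoded file would contain

# Tell read_csv which encoding each source uses.
df_gbk = pd.read_csv(io.BytesIO(gbk_bytes), encoding="gbk")
# Encoding names are case-insensitive and the hyphen is optional.
df_utf8 = pd.read_csv(io.BytesIO(utf8_bytes), encoding="UTF8")

print(df_gbk.equals(df_utf8))
```

Both reads recover the same table because each one names the encoding the bytes were actually written in.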
Garbled Chinese characters
If Chinese characters in the file path cause garbled text or a failed import, passing the engine parameter can avoid the problem.
# Avoid garbled characters
df_list = pd.read_csv(r"D:\个人\data\listings.csv", engine="python")
df_list
Specify row index
If no row index is specified, pandas uses a sequence that increases from 0 as the row index. You can also make a column such as id the row index by passing the index_col parameter.
# Specify the row index
df_list = pd.read_csv(r"D:\个人\data\listings.csv", index_col="id")
df_list.head()
Result: As you can see, the id column becomes the row index column.
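The effect of index_col is easy to check on a small in-memory table. The sketch below uses a made-up three-column CSV (not the real listings.csv) and compares the default index with an index taken from the id column.

```python
import io

import pandas as pd

# Made-up sample data standing in for a CSV file.
csv_text = "id,host_id,price\n101,7,300\n102,8,450\n"

default_index = pd.read_csv(io.StringIO(csv_text))            # index is 0, 1, ...
id_index = pd.read_csv(io.StringIO(csv_text), index_col="id")  # index comes from the id column

print(list(default_index.index))
print(list(id_index.index))

# With id as the index, a row can be looked up by its id value.
price_102 = id_index.loc[102, "price"]
print(price_102)
```

Note that once id becomes the index it is no longer listed among the ordinary columns.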
Specify column index
By default, the first row is used as the column index (the column names). You can also set this explicitly with the header parameter: header=0 means the first row supplies the column names.
# Specify the column index
df_list = pd.read_csv(r"D:\个人\data\listings.csv", header=0)
df_list.head()
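To make the role of header more visible, the sketch below reads the same made-up CSV text twice: once with header=0 and once with header=None, which tells pandas that the file has no header row at all. The header=None case goes slightly beyond the text above and is included only for contrast.

```python
import io

import pandas as pd

# Made-up sample data; the first line holds the column names.
csv_text = "id,price\n101,300\n102,450\n"

with_header = pd.read_csv(io.StringIO(csv_text), header=0)   # first row becomes the column names
no_header = pd.read_csv(io.StringIO(csv_text), header=None)  # every row is treated as data

print(list(with_header.columns), with_header.shape)
print(list(no_header.columns), no_header.shape)
```

With header=None the columns are numbered from 0 and the original header line shows up as an ordinary data row, so the frame has one more row.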
Specify import columns
Sometimes you only want to import certain columns. In that case, pass the usecols parameter.
# Import only columns 1 and 4 (positions 0 and 3)
df_list = pd.read_csv(r"D:\个人\data\listings.csv", usecols=[0, 3])
df_list.head()
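usecols accepts column names as well as positions. The sketch below, run on a made-up four-column table rather than the real file, selects the same two columns both ways.

```python
import io

import pandas as pd

# Made-up sample data with four columns.
csv_text = "id,name,host_id,price\n101,Loft,7,300\n102,Studio,8,450\n"

by_position = pd.read_csv(io.StringIO(csv_text), usecols=[0, 3])       # 1st and 4th columns
by_name = pd.read_csv(io.StringIO(csv_text), usecols=["id", "price"])  # same columns by name

print(list(by_position.columns))
print(by_position.equals(by_name))
```

Selecting by name tends to be easier to read and keeps working if the column order in the file changes.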
2. Basic operations on data
After importing the data, you need a general understanding of it: how many rows and columns the data set has, what the data type of each field is, whether there are null values, and so on.
Preview
You don't need to print the whole data set; looking at the first few rows is enough. The head method returns the first 5 rows of data.
# Preview the data
df_list = pd.read_csv(r"D:\个人\data\listings.csv")
df_list.head()
You can also pass a number to head(), for example to preview the first 10 rows of data.
# Preview the data
df_list = pd.read_csv(r"D:\个人\data\listings.csv")
df_list.head(10)
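The behaviour of head can be confirmed on a small frame built in memory. The 20-row table below is made up for the purpose and stands in for df_list.

```python
import pandas as pd

# A made-up table with 20 rows.
df = pd.DataFrame({"id": range(1, 21)})

first_five = df.head()       # default: first 5 rows
first_ten = df.head(10)      # explicit row count
everything = df.head(100)    # asking for more rows than exist returns all of them

print(len(first_five), len(first_ten), len(everything))
```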
View data dimensions
To find out how many rows and columns the data set has, use shape.
# View the data set's dimensions
df_list.shape
Result: you can see that the df_list data set has 28452 rows and 16 columns.
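shape is an attribute, not a method, and it returns a (rows, columns) tuple that can be unpacked directly. A minimal sketch on a made-up table:

```python
import pandas as pd

# A made-up table with 3 rows and 2 columns.
df = pd.DataFrame({"id": [1, 2, 3], "price": [300, 450, 500]})

n_rows, n_cols = df.shape  # no parentheses: shape is an attribute

print(n_rows, n_cols)
```

The row count is the same number that len(df) gives.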
View data type
Use dtypes to view the data types of all fields in the dataset
# Data types
df_list.dtypes
You can also view the data type of a field separately
# View the data type of a single field
df_list["host_id"].dtypes
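As a small illustration of how pandas infers types on import, the sketch below reads a made-up three-column table and inspects its types. The column names echo fields mentioned above, but the values are invented.

```python
import io

import pandas as pd

# Made-up sample data: an integer column, a decimal column and a text column.
csv_text = "host_id,price,room_type\n7,300.5,Entire home\n8,450.0,Private room\n"
df = pd.read_csv(io.StringIO(csv_text))

# dtypes on the whole frame lists one type per column.
print(df.dtypes)

# On a single column, dtype and dtypes return the same thing.
host_id_type = df["host_id"].dtype
price_type = df["price"].dtype
print(host_id_type, price_type)
```

Whole numbers are read as an integer type and numbers with a decimal point as a floating-point type; text columns get a generic string or object type depending on the pandas version.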
Closing remarks