Python data analysis in practice: getting the data

This is the 8th article in the series on using Excel to learn Python

I want to use one complete case study to walk through the whole process and the basics of Python data analysis. The running example is a short-term rental data set from Tianchi. Reply "short-term rental data set" in the back end to get it.

Think about the data analysis process first: the first step is to obtain the data. This section therefore covers importing data and some basic operations on it.

1. Data Import

1.1 Import .xlsx file

To import an Excel file with the .xlsx suffix, use the pd.read_excel(path) method.

# Import an .xlsx file
import pandas as pd

df_review = pd.read_excel(r"D:\个人\data\reviews.xlsx")
df_review

Result: the df_review data contains two fields, listing_id and date.


The one required argument when reading data is the path. File paths are written differently on different operating systems. On Windows there are usually two ways to write a path:

  • Backslash "\": right-click the file and select Properties to see its location, which is shown with \ by default. Because the backslash is the escape character in Python strings, put the raw-string prefix r in front of the path: r"D:\personal\data\reviews.xlsx"

  • Slash "/": no r is needed; write every separator as /: "D:/personal/data/reviews.xlsx"

Which of the two to use is a matter of personal habit.
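The two spellings can be compared directly. The sketch below is only an illustration: it uses the standard library's PureWindowsPath (not something the article itself uses) to show that both strings name the same Windows path, and that a raw string is the same as a string with doubled backslashes. No file needs to exist for this to run.

```python
from pathlib import PureWindowsPath

raw_path = r"D:\personal\data\reviews.xlsx"         # raw string: backslashes are kept literally
slash_path = "D:/personal/data/reviews.xlsx"        # forward slashes: no r prefix needed
escaped_path = "D:\\personal\\data\\reviews.xlsx"   # doubled backslashes also work

# PureWindowsPath treats "\" and "/" as the same separator
same_file = PureWindowsPath(raw_path) == PureWindowsPath(slash_path)
print(same_file)                 # True
print(raw_path == escaped_path)  # True
```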

sheet_name parameter

An .xlsx file may contain several sheets, so you can set the sheet_name parameter to choose which one to import. Pass either the sheet's name or its position, counting from 0. If sheet_name is not specified, the first sheet is imported.

# Specify the sheet
df_review = pd.read_excel(r"D:\个人\data\reviews.xlsx", sheet_name=0)  # sheet name or position
df_review

1.2 Import .csv file

Use the pd.read_csv(path) method to import files in .csv format.

# Import a .csv file
df_list = pd.read_csv(r"D:\个人\data\listings.csv")
df_list

Result: the df_list data mainly includes the host ID, host name, latitude and longitude, room type, price, minimum number of nights, number of reviews, date of the last review, reviews per month, number of rentable listings, days available per year, and other fields.

Specify encoding format

An important point for .csv files is the encoding. When importing a file you need to know its encoding to avoid garbled characters. How do you find out which encoding a file uses? Open it in Notepad++ and the lower right corner shows the encoding; the listings.csv file imported above, for example, is utf-8. Encoding names are case-insensitive, and utf-8 can also be written as utf8.

You can set the encoding with the encoding parameter. The default is utf-8.
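As a rough sketch of the encoding parameter, the example below builds a tiny CSV in memory (the rows are made up for illustration, not taken from listings.csv), encodes it once as gbk and once as utf-8, and reads each copy back with the matching encoding.

```python
import io
import pandas as pd

# Made-up sample containing Chinese text
text = "id,name\n1,北京小屋\n2,上海公寓\n"
gbk_bytes = text.encode("gbk")
utf8_bytes = text.encode("utf-8")

# The encoding passed to read_csv must match how the bytes were written
df_gbk = pd.read_csv(io.BytesIO(gbk_bytes), encoding="gbk")
df_utf8 = pd.read_csv(io.BytesIO(utf8_bytes), encoding="utf8")  # same as "utf-8"

print(df_gbk.equals(df_utf8))  # True: both decode to the same table
```

If the encoding does not match the file, the read either fails or produces garbled text, which is why it is worth checking the file's encoding first.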


Garbled Chinese characters

If Chinese characters in the file path cause garbled text or a read error, passing the engine parameter can avoid the problem.

# Avoid problems caused by Chinese characters in the path
df_list = pd.read_csv(r"D:\个人\data\listings.csv", engine="python")
df_list


Specify row index

If no row index is specified, a column of integers counting up from 0 is used as the row index. To use the id column as the row index instead, pass it to the index_col parameter.

# Specify the row index
df_list = pd.read_csv(r"D:\个人\data\listings.csv", index_col="id")
df_list.head()

Result: As you can see, the id column becomes the row index column.

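The effect of index_col can be checked on a small in-memory CSV. This is a minimal sketch; the three rows and their values are invented for illustration, and only the id column name follows the article.

```python
import io
import pandas as pd

# Made-up sample data
csv_text = "id,host_id,price\n101,9001,120\n102,9002,85\n103,9001,300\n"

df_default = pd.read_csv(io.StringIO(csv_text))                  # default row index
df_by_id = pd.read_csv(io.StringIO(csv_text), index_col="id")    # id as row index

print(df_default.index.tolist())   # [0, 1, 2]
print(df_by_id.index.tolist())     # [101, 102, 103]
print(df_by_id.loc[102, "price"])  # look up a row by its id: 85
```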

Specify column index

By default the first row is used as the column index. You can also set this explicitly with the header parameter: header=0 means the first row is the column index.

# Specify the column index
df_list = pd.read_csv(r"D:\个人\data\listings.csv", header=0)
df_list.head()

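To see what header actually changes, it helps to compare header=0 with header=None on the same data. The sketch below uses a made-up in-memory CSV; header=None is not discussed in the article and is shown only for contrast.

```python
import io
import pandas as pd

# Made-up sample data
csv_text = "id,host_id,price\n101,9001,120\n102,9002,85\n103,9001,300\n"

# header=0: the first row supplies the column names
df_header = pd.read_csv(io.StringIO(csv_text), header=0)

# header=None: no row is treated as a header, so columns are numbered
# from 0 and the first row is read as ordinary data
df_no_header = pd.read_csv(io.StringIO(csv_text), header=None)

print(df_header.columns.tolist())     # ['id', 'host_id', 'price']
print(df_no_header.columns.tolist())  # [0, 1, 2]
print(df_header.shape, df_no_header.shape)  # (3, 3) (4, 3)
```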

Specify import columns

Sometimes we only want to import certain columns; in that case, pass in the usecols parameter.

# Import only the 1st and 4th columns (positions 0 and 3)
df_list = pd.read_csv(r"D:\个人\data\listings.csv", usecols=[0, 3])
df_list.head()

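usecols accepts column positions, as above, or column names. A small sketch on a made-up in-memory CSV (the column names and rows are illustrative only):

```python
import io
import pandas as pd

# Made-up sample data with four columns
csv_text = (
    "id,name,host_id,room_type\n"
    "101,Cozy flat,9001,Entire home/apt\n"
    "102,Quiet room,9002,Private room\n"
)

by_position = pd.read_csv(io.StringIO(csv_text), usecols=[0, 3])
by_name = pd.read_csv(io.StringIO(csv_text), usecols=["id", "room_type"])

print(by_position.columns.tolist())  # ['id', 'room_type']
print(by_position.equals(by_name))   # True: both select the same columns
```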

2. Basic operations on data

After importing the data, you need a general understanding of it: how many rows and columns the data set has, what the data type of each field is, whether there are null values, and so on.

Preview

You don't need to print the whole data set; looking at the first few rows is enough. The head method returns the first 5 rows.

# Preview the data
df_list = pd.read_csv(r"D:\个人\data\listings.csv")
df_list.head()

You can also pass a number to head(), for example to preview the first 10 rows:

# Preview the first 10 rows
df_list = pd.read_csv(r"D:\个人\data\listings.csv")
df_list.head(10)
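The same behaviour can be tried without the data file. This sketch uses a made-up 20-row DataFrame in place of df_list:

```python
import pandas as pd

# Made-up frame with 20 rows standing in for df_list
df = pd.DataFrame({"id": range(1, 21), "price": range(100, 120)})

first_five = df.head()    # default: first 5 rows
first_ten = df.head(10)   # first 10 rows

print(len(first_five), len(first_ten))  # 5 10
```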

View data dimensions

To see how many rows and columns the data set has, use shape.

# View the dimensions of the data set
df_list.shape

Result: the df_list data set has 28452 rows and 16 columns.
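shape is an attribute, not a method, and it returns a (rows, columns) tuple that can be unpacked. A minimal sketch on a made-up frame:

```python
import pandas as pd

# Made-up frame: 3 rows, 2 columns
df = pd.DataFrame({"id": [1, 2, 3], "price": [120.0, 85.5, 300.0]})

n_rows, n_cols = df.shape  # no parentheses: shape is an attribute
print(df.shape)            # (3, 2)
print(n_rows, n_cols)      # 3 2
```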

View data type

Use dtypes to view the data types of all fields in the data set.

# Data types of all fields
df_list.dtypes

You can also view the data type of a single field on its own.


# View the data type of a single field
df_list["host_id"].dtypes
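As a small illustration of both forms, the sketch below reads a made-up in-memory CSV and inspects the types pandas inferred. Only the host_id column name comes from the article; the rows are invented.

```python
import io
import pandas as pd

# Made-up sample data
csv_text = "id,host_id,price\n101,9001,120.0\n102,9002,85.5\n"
df = pd.read_csv(io.StringIO(csv_text))

print(df.dtypes)            # one dtype per column
print(df["host_id"].dtype)  # dtype of a single column

# Helper functions in pd.api.types test a dtype without comparing strings
host_id_is_int = pd.api.types.is_integer_dtype(df["host_id"])
price_is_float = pd.api.types.is_float_dtype(df["price"])
print(host_id_is_int, price_is_float)  # True True
```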


Origin blog.51cto.com/15064638/2598063