Python data analysis: exploratory analysis

Preface

If you have forgotten the previous articles, you can revisit them to refresh your memory:
Pandas data processing
Python data analysis in practice: missing value processing
Python data analysis in practice: obtaining data

With that, on to today's topic.

1. Descriptive statistical analysis

In Excel, the [Descriptive Statistics] option under [Data Analysis] reports the common summary statistics of a data set, but it works only on numerical data.

image

In pandas, the describe method produces descriptive statistics for the whole data set. By default it summarizes only numerical columns; non-numerical columns are left out.

# Descriptive statistics
df_list.describe()

The results are shown below: count (number of non-null values), mean, std (standard deviation), min (minimum), and max (maximum), while 25%, 50%, and 75% are the first quartile, the median, and the third quartile respectively.

image
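As a rough illustration of what describe reports, here is a minimal sketch on a made-up table. The name df_demo and all values are assumptions for the example and are not the article's df_list. It also shows that calling describe on a text column gives a different summary (count, unique, top, freq).

```python
import pandas as pd

# Hypothetical toy data: column names mirror the article, values are made up.
df_demo = pd.DataFrame({
    "price": [100, 200, 300, 400, 500],
    "minimum_nights": [1, 1, 1, 2, 30],
    "room_type_new": ["Entire home", "Entire home", "Entire home",
                      "Private room", "Shared room"],
})

# On a mixed table, describe() summarizes only the numeric columns.
numeric_summary = df_demo.describe()

# On a single text column, describe() reports count, unique, top, and freq.
text_summary = df_demo["room_type_new"].describe()

print(numeric_summary)
print(text_summary)
```

In this toy data the median of minimum_nights is 1 even though one listing requires 30 nights, which is why the quartiles are often more informative than the mean for skewed fields.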

Transpose

Because there are many fields, the result can be transposed with .T for easier reading.

# Transpose rows and columns
df_list.describe().T

The result is shown in the figure, laid out more like a conventional table. Only numerical columns are summarized; string columns are not.

image

The minimum, first quartile, median, and third quartile of the minimum nights (minimum_nights) field are all 1, which means that at least three quarters of the listings require a minimum stay of just one night. The number of reviews per month (reviews_per_month) field shows the same pattern.

2. Group analysis

In Excel, a pivot table handles grouped calculations.

image

First check which values the neighborhood_new field contains, using the value_counts method to count how often each one occurs.

# Count occurrences of each value
df_list["neighborhood_new"].value_counts()

The result shows which districts and counties appear in the neighborhood_new field, sorted by frequency in descending order. Chaoyang District has the most listings and Pinggu District the fewest.

image


The groupby method can produce the same group counts:

# Group and count
df_list.groupby("neighborhood_new")["neighborhood_new"].count()

The counts are the same, although groupby orders the rows by district name rather than by count.

image
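To make the comparison concrete, the following sketch runs both approaches on a tiny made-up column. The name df_demo and the values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical sample: district names follow the article, counts are made up.
df_demo = pd.DataFrame({
    "neighborhood_new": ["Chaoyang", "Haidian", "Chaoyang",
                         "Pinggu", "Chaoyang", "Haidian"],
})

# value_counts orders the result by frequency, most common first.
by_value_counts = df_demo["neighborhood_new"].value_counts()

# groupby(...).count() orders the result by the group key instead.
by_groupby = df_demo.groupby("neighborhood_new")["neighborhood_new"].count()

# Once both are put in the same order, the counts match.
same_counts = (
    by_value_counts.sort_index().tolist() == by_groupby.sort_index().tolist()
)
print(by_value_counts)
print(by_groupby)
print(same_counts)
```

value_counts is the shorter choice for a quick frequency table; groupby becomes useful once other columns need to be aggregated per group.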

You can count the room_type_new column in the same way:

df_list["room_type_new"].value_counts()

There are three room types: entire homes are the most common and shared rooms the least.

image

3. Cross analysis

Grouping

Group by district and summarize the price level in each one. The groupby method again does the grouping, and the agg method applies several aggregation functions at once.

df_list.groupby("neighborhood_new")["price"].agg(["max","min","mean","count"])

The results are shown in the figure. The data is grouped by neighborhood_new, and the maximum, minimum, mean, and count of price are computed for each group. Huairou District has the highest average price and Fengtai District the lowest.

image
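The grouped aggregation can also be written with named outputs, which makes the resulting column names self-explanatory. This is only a sketch on made-up numbers; df_demo and the output names (max_price, min_price, mean_price, listings) are assumptions for the example.

```python
import pandas as pd

# Hypothetical toy data: district names follow the article, prices are made up.
df_demo = pd.DataFrame({
    "neighborhood_new": ["Chaoyang", "Chaoyang", "Huairou", "Huairou", "Fengtai"],
    "price": [300, 500, 900, 1100, 200],
})

# Named aggregation: each keyword becomes a column in the result.
price_by_area = (
    df_demo.groupby("neighborhood_new")["price"]
    .agg(max_price="max", min_price="min", mean_price="mean", listings="count")
    .sort_values("mean_price", ascending=False)
)
print(price_by_area)
```

Keeping the count next to the mean helps when judging whether a district's average rests on many listings or only a handful.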
Group by room type and sort the results by mean price in descending order:


r_p = df_list.groupby("room_type_new")["price"].agg(["max","min","mean","count"]).reset_index()
r_p.sort_values("mean",ascending = False)

The result is shown in the figure. Entire homes have the highest average price and shared rooms the lowest, which is a reasonable result.

image

Pivot table

To cross-tabulate room type against district, use the pivot_table function, which works like a pivot table in Excel. The first argument is the data to pivot. The values parameter corresponds to the Values area of an Excel pivot table, that is, the field to summarize; index corresponds to the Rows area and columns to the Columns area; aggfunc is the aggregation function to apply. Setting margins = True adds an All row and an All column with the overall results.

pd.pivot_table(df_list,values = "price",index = "neighborhood_new",
                columns = "room_type_new",aggfunc = "mean",margins = True)

The result is shown in the figure: the average price of entire homes, shared rooms, and private rooms in each district.

image
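A small worked example may help in reading the pivot output, especially the All margins. The table df_demo and its prices are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data: names follow the article, prices are made up.
df_demo = pd.DataFrame({
    "neighborhood_new": ["Chaoyang", "Chaoyang", "Chaoyang", "Haidian", "Haidian"],
    "room_type_new": ["Entire home", "Entire home", "Private room",
                      "Entire home", "Private room"],
    "price": [400, 600, 200, 500, 300],
})

pivot = pd.pivot_table(
    df_demo,
    values="price",             # field to summarize
    index="neighborhood_new",   # rows
    columns="room_type_new",    # columns
    aggfunc="mean",             # aggregation function
    margins=True,               # adds the "All" row and column
)
print(pivot)
```

The All values are computed from the underlying rows, not from the cell averages: for Chaoyang the All value is 400 (the mean of 400, 600, and 200), not 350 (the mean of the two cell averages 500 and 200).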

4. Correlation analysis

Correlation analysis describes how strongly variables are related. It is expressed by the correlation coefficient r: r > 0 indicates a positive correlation, r < 0 a negative correlation, and the closer the absolute value of r is to 1, the stronger the correlation. In Excel, the [Correlation Coefficient] option in the [Data Analysis] tool computes the correlation coefficients between fields directly.
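For intuition, the sketch below computes Pearson's r by hand on two short made-up series and compares it with the pandas result (Pearson is the default method of corr). The series x and y are assumptions for illustration.

```python
import math
import pandas as pd

# Hypothetical sample values, made up for the example.
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = pd.Series([2.0, 4.0, 5.0, 4.0, 5.0])

# Deviations from the mean.
dx = x - x.mean()
dy = y - y.mean()

# Pearson's r: sum of products of deviations divided by the
# square root of the product of the sums of squared deviations.
r_manual = (dx * dy).sum() / math.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# The same coefficient from pandas.
r_pandas = x.corr(y)

print(round(r_manual, 4), round(r_pandas, 4))
```

Because r is positive and fairly close to 1 here, y tends to rise with x, though not perfectly.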

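In pandas, the corr method computes the correlation coefficients between columns. Here it is applied to the whole data table and the result is rounded to 4 decimal places.

# Compute correlation coefficients
# numeric_only=True (pandas 1.5+) skips text columns; pandas 2.0+ raises an error on them otherwise
df_list.corr(numeric_only=True).round(4)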

The result gives the correlation coefficient between every pair of columns.

image

What matters most here, though, is how each field correlates with price, the part marked in red in the figure. That column can be sorted by value.

image

Sorting by value

The sort_values method sorts data in ascending or descending order. On a data frame, the first parameter names the column to sort by; on a single column (a Series), as here, no column name is needed. Select the price column of the correlation table and sort it in descending order with ascending = False; the default, ascending = True, sorts in ascending order.

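# Sort by value
# numeric_only=True (pandas 1.5+) skips text columns; pandas 2.0+ raises an error on them otherwise
corr_p = df_list.corr(numeric_only=True).round(4)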
corr_p["price"].sort_values(ascending = False)

The results are as follows. Price correlates most strongly with latitude and longitude (latitude, longitude). Among the remaining variables, the number of days available for booking (availability_365) has the strongest positive correlation with price, while the number of reviews per month (reviews_per_month) is negatively correlated with price.

image
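A plain descending sort pushes negative correlations to the bottom even when they are strong. One possible variation, sketched here on made-up data, drops the price row itself and ranks the other fields by the absolute value of their correlation. The table df_demo and its values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data: column names follow the article, values are made up.
df_demo = pd.DataFrame({
    "price": [100, 200, 300, 400, 500],
    "availability_365": [10, 50, 40, 90, 100],
    "reviews_per_month": [5.0, 4.0, 4.5, 2.0, 1.0],
    "room_type_new": ["a", "b", "a", "b", "a"],
})

# Keep numeric columns only, then take each column's correlation with price.
corr_with_price = (
    df_demo.select_dtypes("number")
    .corr()["price"]
    .drop("price")  # the correlation of price with itself is always 1
)

# Rank by strength (absolute value) while keeping the sign in the result.
ranked = corr_with_price.sort_values(key=lambda s: s.abs(), ascending=False)
print(ranked.round(4))
```

In this toy data availability_365 is positively related to price and reviews_per_month negatively, and both show up near the top because the ranking ignores the sign.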

Origin blog.51cto.com/15064638/2598046