Python Data Analysis and Visualization (15): Hands-On Case Studies (Douyin User Data Analysis, Second-Hand Housing Data Analysis)

This hands-on installment works through two example datasets to review the techniques used most often in data analysis.

Douyin user data analysis

1. Understand the data

Data field meaning

Before analysis, understand what the data contains, confirm that the data source is normal, safe, and legal, and learn the meaning of each field:
Column A: a serial number ID; it is not continuous, carries little meaning, and can be deleted.
Column B, uid: the ID of the user who watched the video.
Column C, user_city: the city where the user is located, encoded as a number.
Column D, item_id: the ID of the work.
Column E, author_id: the ID of the author who published the work.
Column F, item_city: the city from which the author released the video.
Column G, channel: the source of the video. Videos are now pushed not only in the app but also on other websites.
Column H, finish: whether the video was watched to the end.
Column I, like: whether the work was liked.
Column J, music_id: the music used.
Column K, duration_time: the duration of the work.
Column L, real_time: the actual release time of the work.
Column M, H: the time, given to the hour.
Column N, date: the release date.
This commonly used setup code can be copied and used as is:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
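import pyecharts

# Jupyter magic: render matplotlib figures inline in the notebook
%matplotlib inline
plt.style.use('ggplot')  # set the plotting style

plt.rcParams['font.family'] = 'SimHei'  # font that can render Chinese characters
plt.rcParams['axes.unicode_minus'] = False  # render the minus sign correctly on axes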

The dataset to import is relatively large. When a file exceeds the row limit of Excel or WPS, opening it there drops the rows beyond the limit. As a rule of thumb, Excel and WPS are adequate for a few hundred to about ten thousand rows. Once the data grows from thousands to hundreds of thousands of rows and beyond, the spreadsheets can no longer display it properly, so pandas is needed for the analysis. Data at the scale of billions of rows calls for big data tools.
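A minimal sketch of loading such a file with pandas. The file name mentioned in the comment is hypothetical, and a small in-memory CSV stands in for the real file so that the example runs on its own.

```python
import io
import pandas as pd

# Stand-in for the real file. In practice this would be something like
# pd.read_csv("douyin_dataset.csv"), where the file name is hypothetical.
csv_text = """uid,item_id,finish,like
1,100,1,0
2,100,0,0
3,101,1,1
"""

df = pd.read_csv(io.StringIO(csv_text))

# shape is (rows, columns); use it to confirm that no rows were lost
print(df.shape)
```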
After importing the data, preview it: there are more than 1 million rows. info() normally shows the non-null count for each column, but for a very large DataFrame pandas omits these counts by default, so they do not appear here. Use describe() to compute summary statistics for the numeric columns and check them for abnormal values; string columns are not included in the output.
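The preview step might look like the sketch below. The sample values are invented for illustration; only the column names uid, duration_time, and channel come from the field list above.

```python
import pandas as pd

# Invented sample rows; the real table has more than 1 million
df = pd.DataFrame({
    "uid": [1, 2, 3, 4],
    "duration_time": [9, 10, 11, 300],   # 300 stands out as a possible outlier
    "channel": ["a", "b", "a", "a"],     # string column
})

df.info()                      # dtypes and memory usage
missing = df.isnull().sum()    # explicit missing-value count per column
stats = df.describe()          # numeric columns only by default
print(missing)
print(stats)
```

Comparing the max row of describe() with the mean and quartiles is a quick way to spot suspicious values.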

2. Data processing

In machine learning, data processing is called data cleaning and feature engineering. In exploratory data analysis (EDA), where no algorithm is applied, only the corresponding data processing, including data cleaning, is needed.
To modify the original DataFrame directly, pass inplace=True to the method. Without it, the method returns a modified copy and leaves the original untouched; assign the result back to the original variable name to keep the change, or to a new name to keep both versions.
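The two styles can be compared on a small example. The column name "Unnamed: 0" for the serial number column is an assumption made for illustration.

```python
import pandas as pd

# "Unnamed: 0" is a hypothetical name for the serial number column
df = pd.DataFrame({"Unnamed: 0": [3, 7, 9], "uid": [1, 2, 3]})

# Style 1: drop() returns a new DataFrame; df itself is unchanged
dropped = df.drop(columns=["Unnamed: 0"])

# Style 2: inplace=True modifies the object and returns None
df2 = df.copy()
result = df2.drop(columns=["Unnamed: 0"], inplace=True)

print(df.columns.tolist(), dropped.columns.tolist(), df2.columns.tolist())
```

Because the in-place call returns None, writing df = df.drop(..., inplace=True) would overwrite df with None, which is a common mistake.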

3. Analyze data

Analyze the data visually with charts. Visualization is the main tool of exploratory data analysis (EDA); confirmatory data analysis additionally requires statistical knowledge to test hypotheses, algorithms to make predictions, and model building.
Before drawing a chart, prepare the x-axis and y-axis data it needs. The daily play volume, daily user volume, daily author volume, and daily work volume are all functions of time: the x-axis is the date, and the y-axis values for plays, users, authors, and works can be computed by grouping by date.
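A sketch of that grouping, assuming each row of the table is one play. The sample rows, including the year in the dates, are invented.

```python
import pandas as pd

# Invented sample rows; each row is assumed to represent one play
df = pd.DataFrame({
    "date": ["2019-10-01", "2019-10-01", "2019-10-01", "2019-10-02"],
    "uid": [1, 2, 2, 1],
    "author_id": [10, 10, 11, 12],
    "item_id": [100, 100, 101, 102],
})

daily = df.groupby("date").agg(
    plays=("uid", "count"),            # rows per day
    users=("uid", "nunique"),          # distinct viewers per day
    authors=("author_id", "nunique"),  # distinct authors per day
    works=("item_id", "nunique"),      # distinct works per day
)

x = daily.index.tolist()           # dates for the x-axis
y_plays = daily["plays"].tolist()  # one of the y-axis series
print(daily)
```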
Next, among the top 50 works, examine how the number relates to the play rate and the like rate.
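One possible reading of this step is sketched below. It assumes that the "top" works are those with the most plays, that the play rate is the mean of the finish column, and that the like rate is the mean of the like column; none of these definitions is confirmed by the text.

```python
import pandas as pd

# Invented sample rows, one per play
df = pd.DataFrame({
    "item_id": [100, 100, 100, 101, 101, 102],
    "finish":  [1, 1, 0, 1, 0, 0],
    "like":    [1, 0, 0, 0, 0, 0],
})

per_item = df.groupby("item_id").agg(
    plays=("finish", "count"),       # number of plays per work
    finish_rate=("finish", "mean"),  # share of plays watched to the end
    like_rate=("like", "mean"),      # share of plays that were liked
)

# head(2) keeps the toy example small; use head(50) on the real data
top = per_item.sort_values("plays", ascending=False).head(2)
print(top)
```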

4. Conclusion

4.1 Analysis of the relationship between the daily playback volume, daily user volume, daily author volume, and daily work volume and time

The daily playback volume, daily user volume, daily author volume, and daily work volume follow the same trend over time. All four rose steadily at first, increased sharply from 10-20 to 10-29, and then declined. A possible explanation is that the platform ran a promotional event during this period to attract users to publish and watch works: the numbers of works, authors, and users all rose sharply during the event and returned to normal levels after it ended.

4.2 The relationship between the quantity and the play rate and like rate

The number of works rises together with the play rate; there is no obvious relationship between the number of works and the like rate.

Data analysis of second-hand housing on a certain platform

Use pandas for data processing and pyecharts for the charts, analyze the basic characteristics and the distribution of the second-hand housing on the market, and explore the patterns behind it.

1. Import library, read data

Everyday data can be understood directly, while specialized data requires domain knowledge and background beforehand.

View information

Check the summary statistics and basic information: the floor, area, price, and year columns are numeric, and the elevator column has missing values.

2. Data processing

Missing values

The elevator column is missing 8257 values. Missing values can be handled either by deleting them or by filling them. The unique values of the elevator column are "with elevator", "no elevator", and NaN. When the true value cannot be determined, NaN can be treated as a third category, for example by filling it with "unknown".
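A minimal sketch of the fill step. The column name "elevator" and the sample values are assumptions based on the translated description.

```python
import numpy as np
import pandas as pd

# "elevator" is an assumed column name; the values are invented samples
df = pd.DataFrame(
    {"elevator": ["with elevator", np.nan, "no elevator", np.nan]}
)

n_missing = int(df["elevator"].isnull().sum())   # count the NaN entries
df["elevator"] = df["elevator"].fillna("unknown")  # NaN becomes a third category
values = df["elevator"].unique().tolist()

print(n_missing, values)
```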
Inspect the unique values of the orientation column: some of them describe the same orientation in different words. For example, "southwest" and "south west" denote the same orientation, so one can be replaced with the other to unify the values. Using groupby to count the second-hand houses in each urban district shows that Fengtai, Changping, Chaoyang, and Haidian have the most.
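Both steps can be sketched as follows. The column names "district" and "orientation" are assumed for illustration, and the rows are invented.

```python
import pandas as pd

# Assumed column names and invented rows
df = pd.DataFrame({
    "district": ["Chaoyang", "Haidian", "Chaoyang", "Fengtai"],
    "orientation": ["southwest", "south west", "south", "south west"],
})

# Map the duplicate spelling onto a single canonical value
df["orientation"] = df["orientation"].replace({"south west": "southwest"})

# Count listings per district, largest first
counts = (
    df.groupby("district")["orientation"]
    .count()
    .sort_values(ascending=False)
)
print(counts)
```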

Data conversion

Convert the data to a list for easy charting.
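For instance, a grouped count held in a Series can be split into two plain lists. The Series below is an invented stand-in for the real result.

```python
import pandas as pd

# Invented stand-in for a per-district count
counts = pd.Series(
    [2, 1, 1], index=["Chaoyang", "Fengtai", "Haidian"], name="houses"
)

districts = counts.index.tolist()  # category labels
values = counts.values.tolist()    # plain Python ints rather than numpy types

print(districts, values)
```

Converting to built-in Python types avoids serialization problems in charting libraries that render through JSON.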

3. Visual Analysis

3.1 Distribution map of the number of second-hand houses in each urban area

Take the name of each district and append the string "district" to it, instantiate the Map class, pass in the name-value pairs, and draw the map. Hovering the mouse over a district shows its housing data, and dragging the color scale on the left filters which districts are shown in color on the map.
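The data preparation for the map could look like this sketch. It assumes the district names are stored in Chinese without the suffix "区" (district) and that the map expects full names; the names and values are samples. The chart call itself is left out, and only the list of [name, value] pairs that would be passed to it is built.

```python
# Sample district names and counts; assumed to be stored without the suffix
districts = ["朝阳", "海淀"]
values = [2, 1]

# Append the suffix so the names match the region names used by the map
names = [d + "区" for d in districts]

# Build [name, value] pairs, the shape in which map data is usually passed
pairs = [list(z) for z in zip(names, values)]
print(pairs)
```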

3.2 The average price of second-hand housing in various urban areas

Copy column names directly from the data instead of typing them by hand: if a name contains whitespace-like characters, a hand-typed name will not match it.
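An alternative to copying is to normalize the names once after loading. This is a sketch with invented column names, not necessarily what the original code does.

```python
import pandas as pd

# Invented column names with hidden leading and trailing whitespace
df = pd.DataFrame({" price ": [800, 900], "district\t": ["Chaoyang", "Haidian"]})

# Printing the list shows the names with quotes, which reveals the whitespace
print(df.columns.tolist())

# Strip whitespace from every column name
df.columns = df.columns.str.strip()
print(df.columns.tolist())
```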
Plot the charts with the district on the x-axis and, respectively, the number of houses and the average price on the y-axis.
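The two y-axis series can come from a single groupby. The column names "district" and "price" and the sample prices are assumptions.

```python
import pandas as pd

# Assumed column names; invented prices
df = pd.DataFrame({
    "district": ["Chaoyang", "Chaoyang", "Haidian"],
    "price": [800.0, 1000.0, 900.0],
})

# Number of houses and average price per district in one pass
summary = df.groupby("district")["price"].agg(["count", "mean"])

x = summary.index.tolist()
y_count = summary["count"].tolist()
y_mean = summary["mean"].round(1).tolist()
print(x, y_count, y_mean)
```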

3.3 Top 15 second-hand houses with the highest prices

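One way to pick out the most expensive listings is DataFrame.nlargest. The sketch uses invented column names ("community", "price") and selects the top 2 of 4 sample rows; the real analysis would ask for 15.

```python
import pandas as pd

# Invented column names and sample rows
df = pd.DataFrame({
    "community": ["A", "B", "C", "D"],
    "price": [500, 3000, 1200, 800],
})

# nlargest(15, "price") on the real data
top = df.nlargest(2, "price")

names = top["community"].tolist()
prices = top["price"].tolist()
print(names, prices)
```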

3.4 Scatter plot of the total price and area of second-hand housing

The plot shows that most houses have an area below 400 square meters and a price below 30 million.

3.5 Pie chart of house orientation

Most of the houses are oriented north-south.

3.6 Bar chart of decoration and rose chart of elevator availability

A rose chart is a variant of the donut chart in which the sectors have different radii.

3.7 Column chart of second-hand housing floor distribution

The data show that houses on the 6th floor account for the highest volume.

3.8 Histogram of housing area distribution

The area of each house is a continuous value, and almost every house has a different area, so grouping by the exact value is not useful. Instead, divide the areas into intervals (binning) and count the houses in each interval.
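Binning of this kind is commonly done with pd.cut. The bin edges and labels below are illustrative choices, not the ones used in the original analysis.

```python
import pandas as pd

# Invented sample areas in square meters
areas = pd.Series([45.0, 88.5, 120.0, 149.9, 260.0])

# Illustrative bin edges; intervals are right-closed by default: (0, 50], (50, 100], ...
bins = [0, 50, 100, 150, 400]
labels = ["0-50", "50-100", "100-150", "150-400"]

binned = pd.cut(areas, bins=bins, labels=labels)
counts = binned.value_counts(sort=False)  # houses per interval
print(counts)
```

The resulting counts can be plotted as a bar chart with the interval labels on the x-axis.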

4. Analysis conclusion

Analyzing the second-hand housing data from different angles, the charts support the following conclusions:
Fengtai, Changping, Chaoyang, and Haidian have the most second-hand houses, together accounting for half of the total;
the average selling price in Fengtai, Changping, Chaoyang, and Haidian is more than 8 million;
renovated houses are more numerous, which suggests that most listings are homes their owners have lived in;
most of the houses on sale are on the 6th floor, and most are within 150 square meters.

Origin blog.csdn.net/hwwaizs/article/details/127780284