A practical article on data analysis with Python: it works through example datasets and uses hands-on cases to cover the techniques most frequently used in data analysis.
Douyin user data analysis
1. Understand the data
Data field meaning
Understand the data content and make sure the data source is reliable, safe, and legal. Then understand the meaning of each field:
Column A is the serial number ID; it is not continuous, carries little meaning, and can be deleted.
Column B, uid, is the ID of the user who watches the video.
Column C, user_city, is the city where the user is located, encoded as a number.
Column D, item_id, is the ID of the work.
Column E, author_id, is the ID of the author who published the work.
Column F, item_city, is the city where the work was published.
Column G, channel, is the source of the video; videos are no longer served only in the app and can also be pushed on other websites.
Column H, finish, indicates whether the work was watched to the end.
Column I, like, indicates whether the work was liked.
Column J, music_id, identifies the music used.
Column K, duration_time, is the duration of the work.
Column L, real_time, is the actual release time of the work.
Column M, H, is the current time, to the hour.
Column N, date, is the day before the release.
Commonly used setup code, which can be copied and used directly:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import pyecharts
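# Render plots inline in the notebook (IPython magic; keep it on its own line)
%matplotlib inline
plt.style.use('ggplot')  # set the plotting style
plt.rcParams['font.family'] = 'SimHei'  # use a font that supports Chinese characters
plt.rcParams['axes.unicode_minus'] = False  # display the minus sign correctly on the axes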
When the data is large and exceeds what an Excel or WPS file can hold, opening it in a spreadsheet will show incomplete data. For data ranging from a few hundred to about ten thousand rows, Excel and WPS can handle the processing. Once the data grows to hundreds of thousands of rows or more, it goes beyond what Excel and WPS tables can display properly, so pandas is needed for the analysis. If the data reaches the billions, big data tools are required.
After importing the data, preview it; here there are more than 1 million rows. Viewing the basic information with info normally shows the non-null count of each column, but for a table this large pandas may omit those counts by default, so check for missing values separately. Use describe to compute summary statistics for the numeric columns and check for abnormal values. String columns are not included in the describe output by default.
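The preview step can be sketched as follows. This is a minimal illustration on a tiny invented table rather than the real file (which would normally be loaded with pd.read_csv); the column names follow the field list above, while the values and the string column "remark" are made up for the example.

```python
import pandas as pd

# Tiny invented sample standing in for the real data set.
df = pd.DataFrame({
    "uid": [1, 2, 2, 3],
    "user_city": [10, 20, 20, 30],
    "finish": [0, 1, 1, 0],
    "like": [0, 0, 1, 0],
    "remark": ["a", "b", "b", "a"],  # hypothetical string column, for illustration only
})

print(df.head())              # preview the first rows
df.info()                     # column dtypes and memory usage
missing = df.isnull().sum()   # number of missing values per column
summary = df.describe()       # summary statistics, numeric columns only by default
print(missing)
print(summary)
```

Here the string column does not appear in summary, which matches the behavior described above.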
2. Data processing
In machine learning, data processing is referred to as data cleaning and feature engineering. In exploratory data analysis (EDA), when no algorithm is applied, only the corresponding data processing is needed, which includes data cleaning.
To modify the original DataFrame directly, add inplace=True to the method's parameters. Without inplace=True, the method returns a new object and leaves the original untouched, so to keep the change you assign the result back to the original variable name.
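The two styles can be compared in a short sketch. The table is invented, and "Unnamed: 0" is only a hypothetical name for the serial-number column.

```python
import pandas as pd

df = pd.DataFrame({
    "Unnamed: 0": [3, 7, 8],   # hypothetical name for the serial-number column
    "uid": [1, 2, 2],
    "like": [0, 1, 1],
})

# Style 1: inplace=True changes df itself; the call returns None.
df.drop(columns=["Unnamed: 0"], inplace=True)

# Style 2: the method returns a new DataFrame, which is assigned back to df.
df = df.drop_duplicates()
print(df)
```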
3. Analyze data
Analyze the data with charts and other visual means. Exploratory data analysis (EDA) relies mostly on visualization; confirmatory data analysis goes further and uses statistics to test hypotheses, algorithms to make predictions, and models built on the data.
Before drawing a chart, prepare the data for the x-axis and y-axis. The daily play volume, daily user volume, daily author volume, and daily work volume are all related to time. The x-axis is the date, and the y-axis is the number of plays, users, authors, or works, each of which can be calculated by grouping by date.
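One way to prepare these series is sketched below on invented rows. It assumes that each row is one play, that item_id is the work ID column, and that the date values look like the strings shown; none of this is confirmed by the original data.

```python
import pandas as pd

# Invented sample: each row is assumed to be one play of a work by a user.
df = pd.DataFrame({
    "date": ["10-01", "10-01", "10-01", "10-02", "10-02"],
    "uid": [1, 2, 2, 1, 3],
    "author_id": [100, 100, 101, 100, 102],
    "item_id": [9001, 9001, 9002, 9003, 9004],
})

daily = df.groupby("date").agg(
    plays=("uid", "count"),            # rows per day
    users=("uid", "nunique"),          # distinct viewers per day
    authors=("author_id", "nunique"),  # distinct authors per day
    works=("item_id", "nunique"),      # distinct works per day
)

x = daily.index.tolist()           # x-axis: dates
y_plays = daily["plays"].tolist()  # y-axis: daily play volume
print(daily)
```

The other three columns of daily can be turned into y-axis lists in the same way.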
Next, among the top 50 works, look at how the number relates to the play rate and the like rate.
4. Conclusion
4.1 Analysis of the relationship between the daily playback volume, daily user volume, daily author volume, and daily work volume and time
The daily playback volume, daily user volume, daily author volume, and daily work volume follow the same trend over time. All of them increased steadily in the early stage, rose sharply from 10-20 to 10-29, and then declined. A possible explanation is that the platform ran a campaign during this period to attract users to publish and watch works: the numbers of works, authors, and users all increase sharply during the campaign and return to normal levels after it ends.
4.2 The relationship between the quantity and the play rate and like rate
The number of works is positively related to the play rate; there is no obvious relationship between the number of works and the like rate.
Data analysis of second-hand housing on a certain platform
Use pandas for data processing and pyecharts for visual charts, analyze the basic characteristics of the second-hand houses on the market and the distribution of listings, and explore the patterns behind the second-hand housing data.
1. Import library, read data
Everyday data can be understood directly, while specialized data requires domain knowledge and background before it can be analyzed.
View information
Check the summary statistics and basic information of the data. The floor, area, price, and year columns are numeric, and the elevator column has missing values.
2. Data processing
Missing values
The elevator column is missing 8257 values. Missing values can be handled either by deleting them or by filling them. Checking the unique values of the elevator column gives "with elevator", "no elevator", and NaN. In cases like this, NaN can be treated as a third category, for example by filling it with "unknown".
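A minimal sketch of the check-and-fill step, using an invented column; the column name "elevator" is an assumption about how the real table is labeled.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "elevator": ["with elevator", np.nan, "no elevator", np.nan],  # invented values
})

n_missing = df["elevator"].isnull().sum()  # how many values are missing
print(n_missing)
print(df["elevator"].unique())             # the two labels plus NaN

# Treat NaN as a third category instead of deleting the rows.
df["elevator"] = df["elevator"].fillna("unknown")
print(df["elevator"].value_counts())
```

Filling keeps every row, whereas dropna would remove each row with a missing elevator value.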
Checking the unique values of the orientation column shows that the same orientation is written in more than one way. For example, "southwest" and "south west" indicate the same orientation, so the values can be replaced to unify them. Using groupby to count the number of second-hand houses in each urban area shows that Fengtai, Changping, Chaoyang, and Haidian have the most second-hand houses.
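Both steps can be sketched on a small invented table; the column names "district" and "orientation" are assumptions, and only the "south west" to "southwest" mapping comes from the text.

```python
import pandas as pd

df = pd.DataFrame({
    "district": ["Chaoyang", "Chaoyang", "Haidian", "Fengtai"],
    "orientation": ["southwest", "south west", "south", "southwest"],
})

print(df["orientation"].unique())  # reveals the two spellings of the same direction

# Map the variant spelling onto a single canonical value.
df["orientation"] = df["orientation"].replace({"south west": "southwest"})

# Number of listings per district, largest first.
counts = df.groupby("district").size().sort_values(ascending=False)
print(counts)
```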
Data conversion
Convert the data to a list for easy charting.
3. Visual Analysis
3.1 Distribution map of the number of second-hand houses in each urban area
Take the name of each district, append the string "district" to it, instantiate the map class, pass in the key-value pairs, and draw the map. Hovering the mouse over a district shows its housing data, and dragging the visual map slider on the left filters the districts, which are shown on the map in different colors.
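The data preparation for the map can be sketched as follows, without drawing anything. The counts are invented and the suffix " District" stands in for whatever suffix the map's region names actually use. A list of [name, value] pairs like this is the kind of input a pyecharts Map series takes as its data.

```python
import pandas as pd

# Invented per-district counts, e.g. the result of a groupby.
counts = pd.Series({"Chaoyang": 3, "Haidian": 2, "Fengtai": 1})

# Convert to plain Python lists, appending the suffix to each name.
names = [name + " District" for name in counts.index.tolist()]
values = counts.values.tolist()

# Key-value pairs in the form [name, value].
data_pair = [list(pair) for pair in zip(names, values)]
print(data_pair)
```

The names must match the region names of the map exactly, otherwise the corresponding districts stay uncolored.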
3.2 The average price of second-hand housing in various urban areas
Copy column names directly from the data rather than typing them. If a column name contains spaces or other invisible characters, a name typed by hand in the code will not match it.
Plot the chart with the urban area on the x-axis and the number of houses and the average price, respectively, on the y-axis.
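As an alternative to copying names by hand, the column names can be stripped of surrounding whitespace first; the sketch below then computes the count and average price per district. The table, the column names, and the price values are all invented.

```python
import pandas as pd

df = pd.DataFrame({
    " district": ["Chaoyang", "Chaoyang", "Haidian"],  # note the stray leading space
    "price ": [500, 700, 900],                         # and the trailing space
})

print(df.columns.tolist())           # the list repr makes hidden spaces visible
df.columns = df.columns.str.strip()  # remove leading and trailing whitespace

stats = df.groupby("district")["price"].agg(["count", "mean"])

x = stats.index.tolist()             # x-axis: districts
y_count = stats["count"].tolist()    # y-axis 1: number of houses
y_mean = stats["mean"].tolist()      # y-axis 2: average price
print(stats)
```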
3.3 Top 15 second-hand houses with the highest prices
3.4 The scatter diagram of the total price and area of second-hand housing
The scatter plot shows that housing area is concentrated below 400 square meters and total price below 30 million.
3.5 Pie chart of house orientation
Most of the houses are oriented north-south.
3.6 The histogram of decoration and the rose diagram of whether there is an elevator
A rose chart is a variant of the donut chart in which the sectors are drawn with different radii.
3.7 Column chart of second-hand housing floor distribution
It can be seen from the data that the transaction volume of buildings on the 6th floor is the highest.
3.8 Histogram of housing area distribution
The area of each home is a continuous value, and almost every home has a different area, so grouping directly by area is not useful. Instead, divide the areas into intervals (bins) and count the homes in each interval.
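Interval division can be sketched with pd.cut. The areas and the interval edges below are invented for illustration and are not the ones used in the original analysis.

```python
import pandas as pd

area = pd.Series([45.0, 88.5, 120.0, 149.9, 230.0, 410.0])  # invented areas, square meters

bins = [0, 50, 100, 150, 200, 500]                          # hypothetical interval edges
labels = ["0-50", "50-100", "100-150", "150-200", "200-500"]

# Each area is assigned to the interval it falls in (right edge inclusive by default).
binned = pd.cut(area, bins=bins, labels=labels)

# Count the homes per interval, keeping the intervals in their natural order.
counts = binned.value_counts(sort=False)
print(counts)
```

The interval labels and their counts can then be passed to a bar chart as the x-axis and y-axis.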
4. Analysis conclusion
Analyzing the second-hand housing data from different angles, the charts lead to the following conclusions:
Fengtai, Changping, Chaoyang, and Haidian have the largest numbers of second-hand houses, together accounting for half of all second-hand houses.
The average selling price in Fengtai, Changping, Chaoyang, and Haidian is more than 8 million.
Renovated houses make up the larger share, which suggests that most of the houses on sale were previously lived in by their owners.
Most of the houses on sale are on the 6th floor, and most are within 150 square meters.