1. Purpose of analysis: Analyze past e-commerce transaction data to discover patterns and problems that can guide business decisions
2. Data
Import libraries
Import data
After loading the data, the first step is to use the describe and info methods to get a rough picture of how the data is distributed
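A minimal sketch of this first look, assuming pandas. The three-row CSV below is a hypothetical stand-in for the real order file, whose name is not given here; only the column names come from this document.

```python
import io
import pandas as pd

# Hypothetical three-row sample standing in for the real order file.
sample_csv = io.StringIO(
    "orderId,userId,productId,cityId,price,payMoney,channelId,deviceType,createTime,payTime\n"
    "1001,1,10,100,1500,1500,c1,1,2016-01-01 12:00:00,2016-01-01 12:03:00\n"
    "1002,2,0,101,2300,-100,,2,2016-02-01 13:00:00,2016-02-01 13:20:00\n"
    "1003,1,11,100,900,800,c2,3,2015-12-31 20:00:00,2016-01-01 00:05:00\n"
)
df = pd.read_csv(sample_csv)

summary = df.describe()  # count, mean, std, min, quartiles, max of numeric columns
df.info()                # dtypes and non-null counts, printed to stdout
print(summary.loc[["min", "max"], ["productId", "price", "payMoney"]])
```

The min row of describe is where suspicious values such as a productId of 0 or a negative payMoney show up, and the non-null counts from info reveal columns with missing data.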
Load device_type
3. Data cleaning
orderId
orderId should be unique within the system
First check whether it contains duplicate values
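One possible way to run this check, sketched on a hypothetical four-row sample in which orderId 1002 appears twice:

```python
import pandas as pd

# Hypothetical sample: orderId 1002 appears twice.
orders = pd.DataFrame({
    "orderId": [1001, 1002, 1002, 1003],
    "userId": [1, 2, 2, 3],
})

n_total = len(orders)
n_unique = orders["orderId"].nunique()
# duplicated() flags every occurrence after the first one
n_duplicate_rows = orders["orderId"].duplicated().sum()
print(n_total, n_unique, n_duplicate_rows)
```

If n_unique is smaller than n_total, the column contains duplicates.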
Duplicate values are generally handled last, because cleaning the other columns may affect which of the duplicate records get deleted
So process the other columns first
userId
For userId, it is enough to check from the describe and info output above that the values are in a normal range
In order data a user may have multiple orders, so duplicate userId values are expected
productId
The minimum value of productId is 0, so first count the records whose productId is 0
There are 177 such records, which is not many; they may be a side effect of how products were put on the shelves
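Counting these records can be sketched as follows. The sample data is hypothetical and will not reproduce the 177 found in the real data.

```python
import pandas as pd

# Hypothetical sample: two of the four orders carry productId 0.
orders = pd.DataFrame({
    "orderId": [1, 2, 3, 4],
    "productId": [0, 15, 0, 27],
})

zero_product_count = (orders["productId"] == 0).sum()
zero_product_share = zero_product_count / len(orders)
print(zero_product_count, zero_product_share)
```

Looking at the share alongside the raw count helps decide whether dropping these rows later could distort the statistics.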
cityId
cityId is similar to userId: the values are in a normal range and need no handling
price has no null values and all values are greater than 0. Note that the unit is fen (1/100 yuan), so convert it to yuan
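A small sketch of the unit conversion, assuming the raw price is stored as an integer number of fen (100 fen = 1 yuan). The sample values are hypothetical.

```python
import pandas as pd

# Hypothetical prices in fen.
orders = pd.DataFrame({"orderId": [1, 2, 3], "price": [1500, 2300, 900]})

# Sanity checks mirroring the text: no nulls, all values positive.
has_nulls = orders["price"].isna().any()
all_positive = (orders["price"] > 0).all()

orders["price"] = orders["price"] / 100  # fen -> yuan
print(orders)
```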
payMoney
payMoney contains negative values, but the amount paid for an order cannot be negative, so those records should be deleted
Delete the records with negative values
Convert the unit to yuan
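These two steps might look like the sketch below, on hypothetical data and assuming payMoney is stored in fen like price:

```python
import pandas as pd

# Hypothetical sample: order 2 has a negative payMoney (in fen).
orders = pd.DataFrame({
    "orderId": [1, 2, 3],
    "payMoney": [1500, -100, 800],
})

# Keep only non-negative payments; copy() avoids modifying a view of the original.
orders = orders[orders["payMoney"] >= 0].copy()
orders["payMoney"] = orders["payMoney"] / 100  # fen -> yuan
print(orders)
```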
channelId
According to the info output, channelId has some null values, possibly because a temporary bug or some other cause meant the channelId field was not passed when the order was placed
When the data set is large, deleting a small number of records with null values does not affect the statistical results, so delete them directly here
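Dropping the null records can be sketched with dropna restricted to the channelId column. The data below is hypothetical.

```python
import pandas as pd

# Hypothetical sample: order 2 was placed without a channelId.
orders = pd.DataFrame({
    "orderId": [1, 2, 3],
    "channelId": ["c1", None, "c2"],
})

n_null = orders["channelId"].isna().sum()
# subset= limits the null check to channelId, so nulls elsewhere are not touched
orders = orders.dropna(subset=["channelId"])
print(n_null, len(orders))
```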
The meaning of each deviceType value is listed in the device_type.txt file; the values look fine and need no handling
Neither createTime nor payTime has null values, but the analysis covers only 2016, so records outside 2016 must be deleted
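A sketch of the year filter on hypothetical data. It assumes that an order belongs to 2016 when its createTime falls in 2016; filtering on payTime instead would be an equally plausible choice that the text does not settle.

```python
import pandas as pd

# Hypothetical sample: only order 2 was created in 2016.
orders = pd.DataFrame({
    "orderId": [1, 2, 3],
    "createTime": pd.to_datetime(
        ["2015-12-31 23:50:00", "2016-06-01 12:00:00", "2017-01-01 00:10:00"]
    ),
    "payTime": pd.to_datetime(
        ["2016-01-01 00:02:00", "2016-06-01 12:04:00", "2017-01-01 00:15:00"]
    ),
})

# Assumption: the order year is decided by createTime.
in_2016 = orders["createTime"].dt.year == 2016
orders_2016 = orders[in_2016]
print(orders_2016)
```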
Now go back and delete the records with duplicate orderId values
Delete the records whose productId is 0
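The two final cleaning steps might be written as below. The data is hypothetical, and keeping the first occurrence of each duplicated orderId is an assumption; the text does not say which copy to keep.

```python
import pandas as pd

# Hypothetical sample: orderId 2 is duplicated, order 3 has productId 0.
orders = pd.DataFrame({
    "orderId": [1, 2, 2, 3, 4],
    "productId": [10, 11, 11, 0, 12],
    "payMoney": [15.0, 8.0, 8.0, 3.0, 20.0],
})

# Assumption: keep the first record of each duplicated orderId.
cleaned = orders.drop_duplicates(subset="orderId", keep="first")
cleaned = cleaned[cleaned["productId"] != 0]
print(cleaned)
```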
After data cleaning, start analysis
4. Data processing and analysis
First look at the overall situation of the data
The total number of orders, total order users, total sales, the number of products with turnover
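These four overall figures can be sketched on a hypothetical cleaned table as follows:

```python
import pandas as pd

# Hypothetical cleaned orders, payMoney in yuan.
orders = pd.DataFrame({
    "orderId": [1, 2, 3, 4],
    "userId": [1, 1, 2, 3],
    "productId": [10, 11, 10, 12],
    "payMoney": [15.0, 8.0, 15.0, 20.0],
})

total_orders = orders["orderId"].count()
total_users = orders["userId"].nunique()       # distinct users who ordered
total_sales = orders["payMoney"].sum()         # total amount paid
products_sold = orders["productId"].nunique()  # distinct products with at least one order
print(total_orders, total_users, total_sales, products_sold)
```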
Data analysis can be approached from two angles: dimensions and indicators. A dimension can be thought of as the x-axis and an indicator as the y-axis. One dimension can be used to analyze multiple indicators, and the same dimension
Dimensionality reduction
By productId
First look at the top ten and bottom ten products by sales volume (number of orders)
Sales amount
Then take the intersection of the bottom 100 products by sales volume and the bottom 100 by sales amount. Products that do poorly on both measures should be reviewed to decide whether to optimize them or remove them from the shelves
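A minimal sketch of this product ranking on hypothetical data. It uses the bottom 2 products instead of the bottom 100 so the result is easy to follow, and it measures sales volume as the number of orders per product, which is an assumption.

```python
import pandas as pd

# Hypothetical orders: product 10 sells often but cheaply, product 12 sells poorly.
orders = pd.DataFrame({
    "orderId": range(1, 11),
    "productId": [10, 10, 10, 10, 11, 11, 11, 12, 13, 13],
    "payMoney": [5.0, 5.0, 5.0, 5.0, 40.0, 40.0, 40.0, 2.0, 30.0, 30.0],
})

by_product = orders.groupby("productId")
order_counts = by_product["orderId"].count().sort_values(ascending=False)
sales_amount = by_product["payMoney"].sum().sort_values(ascending=False)

n = 2  # the text uses 100 on the full data set
weak_by_count = set(order_counts.tail(n).index)
weak_by_amount = set(sales_amount.tail(n).index)
weak_both = weak_by_count & weak_by_amount  # weak on both measures
print(order_counts.head(n), sales_amount.head(n), weak_both)
```

Here product 13 is weak by order count only and product 10 by amount only, so just product 12 lands in the intersection.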
price
For price, look at the distribution of orders across price ranges to see which price ranges sell best
Many price ranges contain no products at all. If competitor data is available, check whether products should be added to fill these gaps.
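One way to sketch the price-range view is with pd.cut. The bin edges and the data below are hypothetical; real bins would be chosen from the actual price range.

```python
import pandas as pd

# Hypothetical order prices in yuan.
orders = pd.DataFrame({"price": [12.0, 18.0, 95.0, 480.0]})

bins = [0, 50, 100, 200, 500]  # hypothetical edges; intervals are right-inclusive
price_band = pd.cut(orders["price"], bins=bins)

# observed=False keeps bands that contain no orders, with a count of 0
orders_per_band = orders["price"].groupby(price_band, observed=False).count()
empty_bands = orders_per_band[orders_per_band == 0]
print(orders_per_band)
print(empty_bands)
```

The empty bands are the price ranges without any orders, which are the candidates to compare against competitor data.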
Corresponding price
Order time analysis
The distribution of order counts by hour can guide the timing of promotions
Orders peak at 12:00, 13:00 and 14:00, presumably during the lunch break, with a second peak around 20:00
By day of the week, the most orders are placed on Saturday, followed by Friday and Sunday
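The hourly and weekday breakdowns can be sketched by grouping on parts of createTime. The five orders below are hypothetical and only illustrate the mechanics.

```python
import pandas as pd

# Hypothetical orders; 2016-01-01 was a Friday.
orders = pd.DataFrame({
    "orderId": [1, 2, 3, 4, 5],
    "createTime": pd.to_datetime([
        "2016-01-01 12:10:00",  # Friday
        "2016-01-02 12:40:00",  # Saturday
        "2016-01-02 13:05:00",  # Saturday
        "2016-01-02 20:30:00",  # Saturday
        "2016-01-03 20:15:00",  # Sunday
    ]),
})

orders_by_hour = orders.groupby(orders["createTime"].dt.hour)["orderId"].count()
# dayofweek: Monday=0 ... Sunday=6
orders_by_weekday = orders.groupby(orders["createTime"].dt.dayofweek)["orderId"].count()
print(orders_by_hour)
print(orders_by_weekday)
```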
How long after ordering the payment is made
Most payments are completed within ten minutes, which suggests that users rarely hesitate and have a clear intent to purchase
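The payment delay can be sketched as the difference between payTime and createTime. The four orders below are hypothetical.

```python
import pandas as pd

# Hypothetical orders paid 1, 5, 9 and 60 minutes after creation.
orders = pd.DataFrame({
    "createTime": pd.to_datetime(["2016-03-01 12:00:00"] * 4),
    "payTime": pd.to_datetime([
        "2016-03-01 12:01:00",
        "2016-03-01 12:05:00",
        "2016-03-01 12:09:00",
        "2016-03-01 13:00:00",
    ]),
})

delay_seconds = (orders["payTime"] - orders["createTime"]).dt.total_seconds()
share_within_10_min = (delay_seconds <= 600).mean()  # fraction paid within ten minutes
print(delay_seconds.tolist(), share_within_10_min)
```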
Monthly turnover
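A sketch of the monthly figure on hypothetical data. It assumes turnover is the sum of payMoney grouped by the month of createTime; since the cleaned data covers only 2016, the month number alone identifies the period.

```python
import pandas as pd

# Hypothetical orders, payMoney in yuan.
orders = pd.DataFrame({
    "createTime": pd.to_datetime(
        ["2016-01-05 10:00:00", "2016-01-20 15:00:00", "2016-02-03 09:00:00"]
    ),
    "payMoney": [10.0, 20.0, 5.0],
})

# Assumption: attribute each order to the month in which it was created.
monthly_turnover = orders.groupby(orders["createTime"].dt.month)["payMoney"].sum()
print(monthly_turnover)
```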