E-commerce data analysis: a detailed walkthrough

077c2929b8259836107e23bbc863f489.jpeg

Learn how to analyze data and business problems through a concrete project case.

The following is a project submitted by a community member. At the end of this article, I give suggestions for revising the project, so that you can improve your analytical thinking through it.


1. Background introduction

This is a public Brazilian e-commerce dataset released by Olist Store. It contains information on about 100,000 orders placed on multiple Brazilian marketplaces from 2016 to 2018.

The dataset contains 9 files:

e924f8b3231876a2e9fe25546539a23a.jpeg

1) olist_customers_dataset.csv
This dataset contains information about customers and their locations. Use it to identify unique customers in the orders dataset and to find where an order was delivered.

2) olist_geolocation_dataset.csv
This dataset contains Brazilian postal codes and their latitude/longitude coordinates. Use it to draw maps and to find the distance between sellers and customers.

3) olist_order_items_dataset.csv
This dataset contains data about the items purchased in each order.

4) olist_order_payments_dataset.csv
This dataset contains data about the payment options used for orders.

5) olist_order_reviews_dataset.csv
This dataset contains data about the reviews written by customers.

6) olist_orders_dataset.csv
This is the core dataset. From each order you can reach all the other information.

7) olist_products_dataset.csv
This dataset contains data about the products sold through Olist.

8) olist_sellers_dataset.csv
This dataset contains data about the sellers who fulfill orders on Olist. Use it to find seller locations and to determine which seller fulfilled each product.

9) product_category_name_translation.csv
This file translates product category names from Portuguese to English.

A detailed description of each field is available at the data source:

https://www.datafountain.cn/dataSets/22/details
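The original project was done in Excel, but the nine files can also be loaded with pandas. The following is a minimal sketch, not the author's workflow. It assumes the CSV files sit in one folder under the file names listed above; `load_olist` and the short table keys are hypothetical names introduced here for illustration.

```python
from pathlib import Path

import pandas as pd

# Short keys (hypothetical) mapped to the file names listed above.
# Adjust the names if your copy of the dataset differs.
OLIST_FILES = {
    "customers": "olist_customers_dataset.csv",
    "geolocation": "olist_geolocation_dataset.csv",
    "order_items": "olist_order_items_dataset.csv",
    "payments": "olist_order_payments_dataset.csv",
    "reviews": "olist_order_reviews_dataset.csv",
    "orders": "olist_orders_dataset.csv",
    "products": "olist_products_dataset.csv",
    "sellers": "olist_sellers_dataset.csv",
    "category_translation": "product_category_name_translation.csv",
}


def load_olist(data_dir):
    """Read every listed CSV found in data_dir into a dict of DataFrames."""
    data_dir = Path(data_dir)
    tables = {}
    for key, filename in OLIST_FILES.items():
        path = data_dir / filename
        if path.exists():  # skip missing files instead of failing
            tables[key] = pd.read_csv(path)
    return tables
```

Calling `load_olist("data")` would then return a dictionary such as `tables["orders"]`, `tables["reviews"]`, and so on, one entry per file that is present.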

2. Ask questions

Observe the trends of the dataset's key indicators to expose the problems behind them, then evaluate how the Olist platform is operating and where it can improve. The analysis is organized along the following three dimensions:

f85fcca083bf2f406a640dea63eaa75e.jpeg

1. Platform sales

Which product category has the most orders on the platform?

Which product category has the fewest orders?

Which price range has the most orders?

How do the number of orders and the transaction volume change over time?

How does the customer unit price change?

Can future orders be predicted from the 2016-2017 and 2017-2018 data?

2. Logistics delivery performance

What is the average delivery time, and what is the on-time rate?

What do freight charges look like? Based on this information, how can logistics and delivery methods be improved?

3. User information

How are users distributed geographically?

What do user reviews look like?

Which payment methods are most commonly used?

How do different spending groups consume?

How can the platform's operations be improved based on user reviews?

3. Data cleaning

Clean the data according to the questions to be analyzed. The dataset has 9 tables, so first identify the table that matches each question. For example, to examine review content and scores, go to olist_order_reviews_dataset.csv. Get a general understanding of the information each table carries.

Group the 9 tables, select the subsets you need, and rename columns and files so that everything is clearly and consistently organized.

Check the data for duplicates, outliers, and missing values. No duplicates were found, because each order number is unique. Outliers and missing values do exist, as follows:

628cdc4ee006dcf1e8654da39681557f.jpeg

The gray part is the missing values, and the part below the gray is the outliers (the actual delivery time cannot be earlier than the shipping time). These missing values and outliers are deleted, for two reasons. First, this is an anonymized public dataset, so the records cannot be traced back to their source. Second, they are few compared with the large number of orders, so removing them has little effect on the results.
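The same cleaning rule can be sketched in pandas. This is an illustration under assumptions, not the Excel steps used in the project: the column names follow the public orders table, `clean_orders` is a hypothetical helper, and the four sample rows are made up.

```python
import pandas as pd

# Timestamp columns assumed to exist in the orders table.
DATE_COLS = [
    "order_purchase_timestamp",
    "order_delivered_carrier_date",
    "order_delivered_customer_date",
    "order_estimated_delivery_date",
]


def clean_orders(orders):
    """Drop duplicate orders, rows with missing dates, and impossible deliveries."""
    df = orders.copy()
    for col in DATE_COLS:
        df[col] = pd.to_datetime(df[col], errors="coerce")
    df = df.drop_duplicates(subset="order_id")
    df = df.dropna(subset=DATE_COLS)
    # Outlier rule from the text: delivery cannot precede shipping.
    valid = df["order_delivered_customer_date"] >= df["order_delivered_carrier_date"]
    return df[valid].reset_index(drop=True)


# Made-up sample: "b" has a missing delivery date, "c" arrives before it ships.
sample = pd.DataFrame({
    "order_id": ["a", "b", "c", "d"],
    "order_purchase_timestamp": ["2018-01-01"] * 4,
    "order_delivered_carrier_date": ["2018-01-02", "2018-01-02", "2018-01-05", "2018-01-02"],
    "order_delivered_customer_date": ["2018-01-06", None, "2018-01-03", "2018-01-04"],
    "order_estimated_delivery_date": ["2018-01-10"] * 4,
})
cleaned = clean_orders(sample)
print(cleaned["order_id"].tolist())
```

Only orders "a" and "d" remain after cleaning in this sample.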

Consistency processing: the content of this dataset is already fairly consistent, so no further consistency processing is needed.

Use the IF function to judge whether each order was delivered on time

0873bf5b2e87e731474ee9db1cb15e38.jpeg

In this way, based on the difference between the actual delivery time and the estimated delivery time, the IF function judges whether each order arrived on time. Storing the result as its own column makes it easy to use in a pivot table.
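For readers who prefer code to spreadsheet formulas, here is a rough pandas counterpart of the IF step. It is a sketch under assumptions: the column names come from the public orders table, while `add_on_time_flag`, `delay_days`, and the "on time"/"late" labels are names chosen here for illustration.

```python
import numpy as np
import pandas as pd


def add_on_time_flag(orders):
    """Add the delay in days and an on-time label, like an Excel IF column."""
    df = orders.copy()
    delivered = pd.to_datetime(df["order_delivered_customer_date"])
    estimated = pd.to_datetime(df["order_estimated_delivery_date"])
    # Negative values mean the order arrived before the estimated date.
    df["delay_days"] = (delivered - estimated).dt.total_seconds() / 86400
    df["on_time"] = np.where(df["delay_days"] <= 0, "on time", "late")
    return df


# Made-up sample: one early order and one late order.
sample = pd.DataFrame({
    "order_id": ["a", "b"],
    "order_delivered_customer_date": ["2018-01-06", "2018-01-12"],
    "order_estimated_delivery_date": ["2018-01-10", "2018-01-10"],
})
flagged = add_on_time_flag(sample)
print(flagged[["order_id", "delay_days", "on_time"]])
```

The sketch assumes the rows with missing delivery dates have already been removed, since a missing date would otherwise be labeled "late".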

4. Analysis

Use Excel pivot tables, the VLOOKUP function, and other functions to answer the questions under analysis. Let's go back to the questions we set out to study.
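In pandas, a left `merge` plays roughly the role of VLOOKUP, and `groupby` with aggregation plays the role of a pivot table. The sketch below uses made-up rows; the column names are assumed to match the public order items, products, and translation tables.

```python
import pandas as pd

# Made-up sample rows for illustration only.
order_items = pd.DataFrame({
    "order_id": ["o1", "o1", "o2", "o3"],
    "product_id": ["p1", "p2", "p1", "p3"],
    "price": [50.0, 20.0, 50.0, 120.0],
})
products = pd.DataFrame({
    "product_id": ["p1", "p2", "p3"],
    "product_category_name": ["beleza_saude", "brinquedos", "beleza_saude"],
})
translation = pd.DataFrame({
    "product_category_name": ["beleza_saude", "brinquedos"],
    "product_category_name_english": ["health_beauty", "toys"],
})

# VLOOKUP counterpart: look up each item's category, then its English name.
items = (
    order_items
    .merge(products, on="product_id", how="left")
    .merge(translation, on="product_category_name", how="left")
)

# Pivot table counterpart: orders and revenue per category.
by_category = (
    items.groupby("product_category_name_english")
    .agg(orders=("order_id", "nunique"), revenue=("price", "sum"))
    .sort_values("orders", ascending=False)
)
print(by_category)
```

Sorting by `orders` puts the category with the most orders first and the one with the fewest last, which answers the first two sales questions.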

f4e63dcfc3f0eec62ad8c41337ca19f6.jpeg

1. Platform sales

1) Transaction volume

f4f1971db3a8c5f554aa209afb632e08.jpeg bd9f6b80eb3025b2111afa277ef36dfc.jpeg 121e130be8e44729f0e11204701d67ac.jpeg

2) Order variation

ac356d341275b0e2a39d21002db808c0.jpeg

3) Customer unit price:

e004cbd9786915eb9d25c0ef0abf3620.jpeg
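Monthly orders, sales, and customer unit price can be computed along these lines. This is a sketch with made-up rows. It assumes that customer unit price means sales divided by the number of orders, which the article does not define, and that the column names match the public orders and payments tables.

```python
import pandas as pd

# Made-up sample rows; order o1 is paid with two payment records.
orders = pd.DataFrame({
    "order_id": ["o1", "o2", "o3"],
    "order_purchase_timestamp": ["2018-01-05", "2018-01-20", "2018-02-03"],
})
payments = pd.DataFrame({
    "order_id": ["o1", "o1", "o2", "o3"],
    "payment_value": [100.0, 20.0, 60.0, 90.0],
})

# One total per order, then attach the purchase month.
order_totals = payments.groupby("order_id", as_index=False)["payment_value"].sum()
df = orders.merge(order_totals, on="order_id", how="inner")
df["month"] = pd.to_datetime(df["order_purchase_timestamp"]).dt.to_period("M")

monthly = df.groupby("month").agg(
    orders=("order_id", "nunique"),
    sales=("payment_value", "sum"),
)
# Assumed definition: customer unit price = sales / number of orders.
monthly["avg_order_value"] = monthly["sales"] / monthly["orders"]
print(monthly)
```

If you define customer unit price per customer instead, replace the order count with the number of unique customers in each month.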

4) Orders by product category

8535981d50d90112ed6b504919171099.jpeg

2. Logistics delivery performance

This part counts only orders that were successfully delivered to the customer. Orders that have not shipped or were canceled are excluded.

3c5e17f5b8d859dcf736cac41c6aaa3b.jpeg
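The two logistics indicators can be sketched as follows. The code assumes that delivery time is measured from purchase to arrival at the customer, which the article does not state, and that delivered orders carry the status value "delivered". `delivery_performance` is a hypothetical helper and the rows are made up.

```python
import pandas as pd


def delivery_performance(orders):
    """Average delivery days and on-time rate, for delivered orders only."""
    delivered = orders[orders["order_status"] == "delivered"].copy()
    purchase = pd.to_datetime(delivered["order_purchase_timestamp"])
    arrived = pd.to_datetime(delivered["order_delivered_customer_date"])
    estimated = pd.to_datetime(delivered["order_estimated_delivery_date"])
    delivery_days = (arrived - purchase).dt.total_seconds() / 86400
    on_time = arrived <= estimated
    return {
        "orders": int(len(delivered)),
        "avg_delivery_days": float(delivery_days.mean()),
        "on_time_rate": float(on_time.mean()),
    }


# Made-up sample: one on-time order, one late order, one canceled order.
sample = pd.DataFrame({
    "order_id": ["o1", "o2", "o3"],
    "order_status": ["delivered", "delivered", "canceled"],
    "order_purchase_timestamp": ["2018-01-01", "2018-01-01", "2018-01-01"],
    "order_delivered_customer_date": ["2018-01-06", "2018-01-13", None],
    "order_estimated_delivery_date": ["2018-01-10", "2018-01-10", "2018-01-10"],
})
stats = delivery_performance(sample)
print(stats)
```

The canceled order is ignored, so the averages are taken over the two delivered orders only.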

3. User information

1) Orders by state and geographical distribution of users

49bc02f88c51b0ad32836a1a1bbc871f.jpeg

2) User evaluation

What is the platform's satisfaction score? Descriptive statistics are used to summarize the scores and show their trend over time.

91226ea82bdf19b36c9d7a859982c1ee.jpeg
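A score overview and monthly trend could be produced like this. It is a sketch on made-up reviews, with column names assumed from the public reviews table.

```python
import pandas as pd

# Made-up sample reviews for illustration only.
reviews = pd.DataFrame({
    "review_score": [5, 4, 1, 5, 2, 2],
    "review_creation_date": [
        "2018-01-03", "2018-01-15", "2018-01-28",
        "2018-02-02", "2018-02-10", "2018-02-21",
    ],
})
reviews["month"] = pd.to_datetime(reviews["review_creation_date"]).dt.to_period("M")

# Descriptive statistics: count, mean, std, min, quartiles, max.
overview = reviews["review_score"].describe()

# Trend over time: average score per month.
monthly_mean = reviews.groupby("month")["review_score"].mean()

# Share of 1-2 point reviews, the group examined in the word cloud.
low_share = (reviews["review_score"] <= 2).mean()

print(overview)
print(monthly_mean)
print(low_share)
```

Tracking `low_share` by month in the same way would show whether 1-2 point reviews are growing.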

3) Analyze the content of customer comments rated 1-2 points (shown as a word cloud)

ac175b768ec5f386510f5a5d38ecc2c6.jpeg

4) Spending groups and the share of each payment method

745ad063161d35c09b30c58506bc3ad7.jpeg
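Payment method shares and spending groups can be sketched with `value_counts` and `pd.cut`. The thresholds of 50 and 200 and the group labels are assumptions made for this example, not the cut-offs used in the project; the rows are made up and the column names are assumed from the public payments table.

```python
import pandas as pd

# Made-up sample payments for illustration only.
payments = pd.DataFrame({
    "order_id": ["o1", "o2", "o3", "o4", "o5"],
    "payment_type": ["credit_card", "boleto", "credit_card", "voucher", "credit_card"],
    "payment_value": [40.0, 150.0, 320.0, 25.0, 90.0],
})

# Share of each payment method.
method_share = payments["payment_type"].value_counts(normalize=True)

# Hypothetical spending tiers: (0, 50], (50, 200], (200, inf).
bins = [0, 50, 200, float("inf")]
labels = ["low", "medium", "high"]
payments["spend_group"] = pd.cut(payments["payment_value"], bins=bins, labels=labels)

group_summary = payments.groupby("spend_group", observed=True)["payment_value"].agg(
    ["count", "sum"]
)
print(method_share)
print(group_summary)
```

Choose the bin edges from the actual distribution of payment values, for example from its quartiles, before drawing conclusions about each group.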

5. Analysis summary and suggestions

1. The platform's annual sales and order counts have grown significantly. Broken down by quarter and month, however, growth is currently slow, and timely adjustments are needed to acquire users outside the areas highlighted on the heat map.

In addition, the low-spending group is very large, but the high-spending group also needs guidance from the platform, and it still has a lot of room to grow.

2. Logistics delivery performance is not encouraging. Freight prices are somewhat high, and users who pay high freight fees do not receive a matching delivery service, which has led to a significant increase in 1-2 point reviews. The platform needs to discuss countermeasures jointly with the logistics side and adjust its operation promptly.

3. Customer satisfaction has declined slightly. Improvement is needed mainly in three directions:

1) Product quality: the platform needs to strictly control the products sold.

2) Logistics delivery: if freight prices are not lowered, service quality must improve accordingly. If service quality cannot match the freight price, the platform should help customers resolve the problem from the operations side.

3) The platform application itself: this part can use the AARRR model to examine customer churn at each stage, optimize the purchase and after-sales experience, and improve retention and repurchase rates.

The above is the community member's project after its second revision

(https://zhuanlan.zhihu.com/p/61309012)


The following are project revision comments:

[Question] Teacher, if I analyze my dataset with the AARRR funnel model, a lot of the relevant data is missing. What should I do?

[Answer]

1. Analysis is not limited to the AARRR funnel model, and not every analysis has to follow an analysis-method template. Choose the method according to the data and the problem; different problems call for different methods.

2. Each of your analysis dimensions stands alone, with no links between them. Data needs to be considered from multiple angles: analyze each dimension on its own, but also look at several dimensions together.

3. Comment from the first review: the analysis ideas were not written down at the start, so it was unclear what was being analyzed. This time you have added them, which is very good.

A common mistake among people who are just starting to learn data analysis is to begin cleaning the data right away with no analysis plan. By the end, they do not know what they have been analyzing.

The normal process at work is as follows. On receiving a task, first talk with business staff to understand the meaning behind each business term, then think about the relationships between indicators. A dedicated meeting is held to discuss the overall analysis plan, and then the data is gathered according to the purpose of the analysis. If the data is insufficient, data engineers add event tracking to collect what is missing.

Therefore, settle the analysis plan before starting, and only then go to the data to analyze the problem.

4. Comment from the first review: the PPT background was too prominent; add a mask over the background so that the text stands out. This has been fixed.

What still needs improvement: writing a project article is different from presenting an analysis report in PPT.

When you present a report in PPT, your audience is listening to you. You don't need much text on the slides; explaining the charts clearly is enough.

When you write a project article, your audience is reading what you wrote. So don't put the analysis conclusions inside the slides. Use the slides to show your charts, then describe in text the conclusion each chart is meant to convey.

Doing a project is a process of continuous optimization and learning. I hope the discussions and suggestions within the community help members improve their projects.


Origin blog.csdn.net/zhongyangzhong/article/details/129457972