Project description
Based on an e-commerce dataset, this project walks through the data-processing pipeline in detail, combining a Hive data warehouse with Spark development to perform big-data analysis in several ways.
Data can be collected from logs, crawlers, and databases, then cleaned, transformed, and imported into the data warehouse; analysis inside the warehouse produces summaries that support corporate decision-making. This project analyzes the e-commerce data warehouse using the following tables: orders (user behavior table), trains (order/training table), products (product table), departments (category table), and order_products__prior (user history behavior table), enabling multi-dimensional warehouse analysis.
Data warehouse concept:
A data warehouse (Data Warehouse, abbreviated DW) is a strategic collection of data from all systems that supports an enterprise's decision-making process. Analysis in the data warehouse helps the enterprise improve business processes, control costs, and improve product quality.
The data warehouse is not the final destination of the data; rather, it prepares the data for that destination. The preparation includes cleaning, transformation, classification, reorganization, merging, splitting, and aggregation.
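As a minimal sketch of what the cleaning and transformation steps can look like in practice, here is a pure-Python example over a made-up two-line sample in the orders.csv layout (the field names follow the tables described below; nothing here is the project's actual ETL code):

```python
import csv
import io

# A tiny sample in the orders.csv layout: header line + one data row
raw = ("order_id,user_id,eval_set,order_number,"
       "order_dow,order_hour_of_day,days_since_prior_order\n"
       "2398795,1,prior,2,3,07,15.0\n")

cleaned = []
for row in csv.DictReader(io.StringIO(raw)):  # DictReader consumes the header row (cleaning)
    gap = row["days_since_prior_order"]
    # Transformation: cast the day gap to float; an empty string means a first order
    row["days_since_prior_order"] = float(gap) if gap else None
    cleaned.append(row)

print(cleaned[0]["days_since_prior_order"])  # 15.0
```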
[infobox title="I. Data tables"]
1. orders.csv (data warehouse positioning: user behavior table)
order_id: order number
user_id: user id
eval_set: which set the order belongs to (prior history, or the training set)
order_number: the sequence number of this order among the user's orders
order_dow: day of the week the order was placed (0-6)
order_hour_of_day: hour of the day the order was placed (0-23)
days_since_prior_order: number of days between this order and the previous one
order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order
2539329,1,prior,1,2,08,
2398795,1,prior,2,3,07,15.0
473747,1,prior,3,3,12,21.0
2254736,1,prior,4,4,07,29.0
431534,1,prior,5,4,15,28.0
2. trains.csv (data warehouse positioning: order table / training set)
order_id: order number
product_id: product ID
add_to_cart_order: position at which the product was added to the cart
reordered: whether the product was repurchased (1 = yes, 0 = no)
order_id,product_id,add_to_cart_order,reordered
1,49302,1,1
1,11109,2,1
1,10246,3,0
1,49683,4,0
3. products.csv (data warehouse positioning: product dimension table)
product_id: product id
product_name: product name
aisle_id: shelf (aisle) id
department_id: the category the product belongs to (frozen foods, daily necessities, etc.)
product_id,product_name,aisle_id,department_id
1,Chocolate Sandwich Cookies,61,19
2,All-Seasons Salt,104,13
3,Robust Golden Unsweetened Oolong Tea,94,7
4,Smart Ones Classic Favorites Mini Rigatoni With Vodka Cream Sauce,38,1
5,Green Chile Anytime Sauce,5,13
4. departments.csv (data warehouse positioning: category dimension table)
department_id: category id
department: category name
department_id,department
1,frozen
2,other
3,bakery
5. order_products__prior.csv (data warehouse positioning: user history behavior table)
order_id,product_id,add_to_cart_order,reordered
2,33120,1,1
2,28985,2,1
2,9327,3,0
[/infobox]
[infobox title="II. Data Analysis"]
1. Create the orders and trains tables and import the data into Hive
Build the orders table
create table badou.orders(
order_id string
,user_id string
,eval_set string
,order_number string
,order_dow string
,order_hour_of_day string
,days_since_prior_order string)
row format delimited fields terminated by ','
lines terminated by '\n';
a. Load local data (overwrite replaces the table's existing data; omit overwrite to append)
load data local inpath '/badou20/03hive/data/orders.csv'
overwrite into table orders;
hive> select * from orders limit 10;
OK
order_id user_id eval_set order_number order_dow order_hour_of_day days_since_prior_order
2539329 1 prior 1 2 08
2398795 1 prior 2 3 07 15.0
473747 1 prior 3 3 12 21.0
2254736 1 prior 4 4 07 29.0
431534 1 prior 5 4 15 28.0
b. Load HDFS data (without the local keyword)
load data inpath '/orders.csv'
overwrite into table orders;
Build the trains table
create table badou.trains(
order_id string,
product_id string,
add_to_cart_order string,
reordered string
)
row format delimited fields terminated by ','
lines terminated by '\n';
load data local inpath '/badou20/03hive/data/order_products__train.csv'
overwrite into table trains;
2. How to remove the dirty data in the first row of the table? (The first line of the raw file is the header row of column names and should be removed on import.)
Method 1: shell command
Idea: clean the abnormal first row before loading, e.g. with sed '1d' orders.csv
head -10 orders.csv > tmp.csv
cat tmp.csv
sed '1d' tmp.csv > tmp_res.csv
cat tmp_res.csv
Method 2: HQL (hive sql)
insert overwrite table badou.orders
select * from orders where order_id != 'order_id';
insert overwrite table badou.trains
select * from trains where order_id != 'order_id';
3. How many orders does each user have (group by + count(distinct))?
user_id, order_id => user_id, order_cnt
Grouping: put rows into categories, commonly done with group by
Result: order count => order_cnt
Select two columns, user_id and order_cnt; the count column can be written in any of these ways (equivalent here, since order_id is unique per row):
, count(distinct order_id) order_cnt
--, count(*) order_cnt
--, count(1) order_cnt
--, count(order_id) order_cnt
Complete statement:
select
user_id
, count(distinct order_id) order_cnt
from orders
group by user_id
order by order_cnt desc
limit 10
Result: Two jobs, Total MapReduce CPU Time Spent: 1 minutes 4 seconds 370 msec
133983 100
181936 100
14923 100
55827 100
4. On average, how many products per order does each user purchase?
Example: a user placed 2 orders today, one with 10 products and the other with 4.
Products per order = (10 + 4) / 2 = 7
Result: on average this user purchases 7 products per order
a. First use the priors table (the order_products__prior data) to count how many products are in each order (the 10 and 4 above)
Note: aggregate functions (count, sum, avg, max, min) are used together with group by
select
order_id,count(distinct product_id) pro_cnt
from priors
group by order_id
limit 10;
b. Join the priors table with the orders table on order_id to attach the per-order product count from step a to each user
Result: the product counts belonging to each user
select
od.user_id, t.pro_cnt
from orders od
inner join (
select
order_id, count(1) as pro_cnt
from priors
group by order_id
limit 10000
) t
on od.order_id=t.order_id
limit 10;
c. Sum the per-order product counts from step b for each user
select
od.user_id, sum(t.pro_cnt) as sum_prods
from orders od
inner join (
select
order_id, count(1) as pro_cnt
from priors
group by order_id
limit 10000
) t
on od.order_id=t.order_id
group by od.user_id
limit 10;
d. Calculate the average
Result: the user's total product count / the user's order count
select
od.user_id
, sum(t.pro_cnt) / count(1) as sc_prod
, avg(pro_cnt) as avg_prod
from orders od
inner join (
select
order_id, count(1) as pro_cnt
from priors
group by order_id
limit 10000
) t
on od.order_id=t.order_id
group by od.user_id
limit 10;
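Steps a through d can be mirrored in plain Python on a toy dataset to sanity-check the logic (the order, product, and user ids below are made up for illustration):

```python
from collections import defaultdict

# priors rows: (order_id, product_id); orders rows: (order_id, user_id)
priors = [("o1", "p1"), ("o1", "p2"), ("o2", "p3")]
orders = [("o1", "u1"), ("o2", "u1")]

# Step a: count products per order (the group-by-order_id subquery)
per_order = defaultdict(int)
for order_id, _ in priors:
    per_order[order_id] += 1

# Steps b/c: join on order_id and gather each user's per-order counts
user_counts = defaultdict(list)
for order_id, user_id in orders:
    if order_id in per_order:
        user_counts[user_id].append(per_order[order_id])

# Step d: average products per order for each user
avg_prod = {u: sum(c) / len(c) for u, c in user_counts.items()}
print(avg_prod)  # {'u1': 1.5}
```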
inner join: joins multiple tables, keeping only the matching rows
where: filters for the rows we care about
5. What is the distribution of each user's purchase orders across the week (pivoting rows into columns)? dow => day of week, 0-6 covering the seven days
order_dow
Long format (one row per day): orderday, pro_cnt
2020-12-19 1000000
2020-12-18 1000010
Wide (pivoted) format, one row per user:
user_id, dow0, dow1, dow2, dow3, dow4, dow5, dow6
1 0 3 2 2 4 0 0
2 0 5 5 2 1 1 0
Note: in real development, always verify the code logic on a small batch of data first, and only then run on the full dataset!!
How the sum accumulates for one user and one day, e.g. user 1 and dow0:
user_id order_dow
1 0 sum = 0 + 1 = 1
1 0 sum = 1 + 1 = 2
1 1
2 1
Method 1 (case when):
select
user_id
, sum(case when order_dow='0' then 1 else 0 end) dow0
, sum(case when order_dow='1' then 1 else 0 end) dow1
, sum(case when order_dow='2' then 1 else 0 end) dow2
, sum(case when order_dow='3' then 1 else 0 end) dow3
, sum(case when order_dow='4' then 1 else 0 end) dow4
, sum(case when order_dow='5' then 1 else 0 end) dow5
, sum(case when order_dow='6' then 1 else 0 end) dow6
from orders
-- where user_id in ('1','2','3')
group by user_id
Method 2 (if):
select
user_id
, sum(if( order_dow='0',1,0)) dow0
, sum(if( order_dow='1',1,0)) dow1
, sum(if( order_dow='2',1,0)) dow2
, sum(if( order_dow='3',1,0)) dow3
, sum(if( order_dow='4',1,0)) dow4
, sum(if( order_dow='5',1,0)) dow5
, sum(if( order_dow='6',1,0)) dow6
from orders
where user_id in ('1','2','3')
group by user_id
Verify the correctness of the result on the sampled users:
user_id dow0 dow1 dow2 dow3 dow4 dow5 dow6
1 0 3 2 2 4 0 0
2 0 6 5 2 1 1 0
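The case when / if pivot above is just a conditional count per column; the same logic in a short Python sketch (the sample rows are made up):

```python
from collections import defaultdict

# Sample (user_id, order_dow) rows mirroring the orders table
rows = [("1", "1"), ("1", "1"), ("1", "2"), ("2", "4"), ("2", "1")]

# One array of 7 counters per user: index = day of week (0-6)
pivot = defaultdict(lambda: [0] * 7)
for user_id, dow in rows:
    pivot[user_id][int(dow)] += 1  # the sum(if(order_dow=d,1,0)) for each column

print(pivot["1"])  # [0, 2, 1, 0, 0, 0, 0]
```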
In-class exercise: find which products each user purchased during a given time period.
Analysis: we need user_id and product_id:
orders : order_id, user_id
trains:order_id, product_id
select
ord.user_id, tr.product_id
from orders ord
inner join trains tr
on ord.order_id=tr.order_id
where order_hour_of_day = '10'
limit 10
CREATE TABLE `udata`(
`user_id` string,
`item_id` string,
`rating` string,
`timestamp` string)
ROW FORMAT DELIMITED;
Note: timestamp is a reserved keyword in Hive, so the column name must be wrapped in backticks (`)
Example conversion: the epoch timestamp 881250949 corresponds to 1997-12-04 23:55:49 (UTC+8)
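This epoch-to-date conversion can be double-checked with Python's datetime; treating the displayed time as UTC+8 (China Standard Time) is an assumption here:

```python
from datetime import datetime, timedelta, timezone

# Assumption: the example conversion above is rendered in UTC+8
cst = timezone(timedelta(hours=8))
dt = datetime.fromtimestamp(881250949, tz=cst)
print(dt.strftime("%Y-%m-%d %H:%M:%S"))  # 1997-12-04 23:55:49
```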
Every row of the udata table carries a timestamp.
Requirement: when making recommendations, we want to know the times closest to and furthest from now, i.e. the max and min timestamps:
select
max(`timestamp`) max_timestamp, min(`timestamp`) min_timestamp
from udata
max_timestamp min_timestamp
893286638 874724710
Requirement: get the specific days on which a given user left reviews. The result shows on which days the user was active; this may mean ① the user is genuinely active, or ② the user is padding orders with fake reviews.
Target shape: user_id ['2020-12-19', '2020-12-18', ...]
24*60*60 = the number of seconds in one day
collect_list: collects a group's values into an array without deduplication (collect_set is the deduplicating variant)
select
user_id, collect_list(cast(days as int)) as day_list
from
(select
user_id
, (cast(893286638 as bigint) - cast(`timestamp` as bigint)) / (24*60*60) as days
from udata
) t
group by user_id
limit 10;
Requirement: which users have purchased 100 or more distinct products?
union all: merges data without deduplication; note that the column types and column count must match on both sides of the union all
union: merges data and deduplicates
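The difference is easy to demonstrate with an in-memory SQLite database standing in for Hive (a sketch; the tables a and b and their contents are made up):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE a(x INTEGER)")
cur.execute("CREATE TABLE b(x INTEGER)")
cur.executemany("INSERT INTO a VALUES (?)", [(1,), (2,)])
cur.executemany("INSERT INTO b VALUES (?)", [(2,), (3,)])

# union all keeps the duplicate value 2; union removes it
union_all = cur.execute("SELECT x FROM a UNION ALL SELECT x FROM b").fetchall()
union = cur.execute("SELECT x FROM a UNION SELECT x FROM b").fetchall()
print(len(union_all), len(union))  # 4 3
```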
Method 1:
select
user_id, count(distinct product_id) pro_cnt
from
(
-- order training data (scenario: merging two data sources, like a new and an old system)
select
a.user_id,b.product_id
from orders as a
left join trains b
on a.order_id=b.order_id
union all
-- order history data
select
a.user_id,b.product_id
from orders as a
left join priors b
on a.order_id=b.order_id
) t
group by user_id
having pro_cnt >= 100
limit 10;
Method 2: the with keyword. Use it when the logic is complex and deeply nested; it improves code readability and makes troubleshooting easier.
A block defined by with can be understood as a temporary table or temporary data set.
with user_pro_cnt_tmp as (
select * from
( -- order training data
select
a.user_id,b.product_id
from orders as a
left join trains b
on a.order_id=b.order_id
union all
-- order history data
select
a.user_id,b.product_id
from orders as a
left join priors b
on a.order_id=b.order_id
) t
)
--, order_pro_tmp as (
--), ....
select
user_id
, count(distinct product_id) pro_cnt
from user_pro_cnt_tmp
group by user_id
having pro_cnt >= 100
limit 10;
[/infobox]