Hive E-commerce Data Warehouse in Practice

Project description

This project walks through the data processing pipeline for an e-commerce dataset in detail, combining a Hive data warehouse with Spark development to perform big data analysis in several ways.

Source data can come from log collection, crawlers, or databases. After cleaning and transformation it is imported into the data warehouse, where analysis produces summaries that feed corporate decision-making. This project builds an e-commerce data warehouse on the following tables: orders (user behavior table), trains (order/training table), products (product table), departments (category table), and order_products__prior (user historical behavior table), and uses them for multi-dimensional warehouse analysis.

Data warehouse concept:

A Data Warehouse (abbreviated DW) is a strategic collection of data from all systems that supports an enterprise's decision-making process. Analysis performed in the warehouse helps the enterprise improve business processes, control costs, and raise product quality.

The data warehouse is not the data's final destination; rather, it prepares data for that destination. This preparation includes cleaning, transformation, classification, reorganization, merging, splitting, and aggregation.

[infobox title="I. Data Tables"]

1. orders.csv (positioning in the warehouse: user behavior table)

order_id: order number
user_id: user id
eval_set: which set the order belongs to (prior history or training)
order_number: the sequence number of the order among the user's purchases
order_dow: day of week on which the order was placed (0-6)
order_hour_of_day: hour of day in which the order was placed (0-23)
days_since_prior_order: number of days between this order and the previous one

order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order
2539329,1,prior,1,2,08,
2398795,1,prior,2,3,07,15.0
473747,1,prior,3,3,12,21.0
2254736,1,prior,4,4,07,29.0
431534,1,prior,5,4,15,28.0

2. trains.csv (positioning in the warehouse: order table)

order_id: order number
product_id: product id
add_to_cart_order: position at which the product was added to the cart
reordered: whether the product was repurchased (1 = yes, 0 = no)

order_id,product_id,add_to_cart_order,reordered
1,49302,1,1
1,11109,2,1
1,10246,3,0
1,49683,4,0

3. products.csv (positioning in the warehouse: product dimension table)

product_id: product id
product_name: product name
aisle_id: shelf (aisle) id
department_id: the category the product belongs to, e.g. daily necessities

product_id,product_name,aisle_id,department_id
1,Chocolate Sandwich Cookies,61,19
2,All-Seasons Salt,104,13
3,Robust Golden Unsweetened Oolong Tea,94,7
4,Smart Ones Classic Favorites Mini Rigatoni With Vodka Cream Sauce,38,1
5,Green Chile Anytime Sauce,5,13

4. departments.csv (category dimension table)

department_id: department (category) id
department: category name

department_id,department
1,frozen
2,other
3,bakery

5.order_products__prior.csv (user historical behavior data)

order_id,product_id,add_to_cart_order,reordered
2,33120,1,1
2,28985,2,1
2,9327,3,0

[/infobox]

[infobox title="II. Data Analysis"]

1. Create the orders and trains tables and import the data into Hive

Build the orders table:

create table badou.orders(
order_id string
,user_id string
,eval_set string
,order_number string
,order_dow string
,order_hour_of_day string
,days_since_prior_order string)
row format delimited fields terminated by ','
lines terminated by '\n';

1) Load local data (overwrite replaces the table's existing data; without overwrite, the load appends):

load data local inpath '/badou20/03hive/data/orders.csv'
overwrite into table orders;

select * from orders limit 10;

hive> select * from orders limit 10;
OK
order_id user_id eval_set order_number order_dow order_hour_of_day days_since_prior_order
2539329 1 prior 1 2 08
2398795 1 prior 2 3 07 15.0
473747 1 prior 3 3 12 21.0
2254736 1 prior 4 4 07 29.0
431534 1 prior 5 4 15 28.0

2) Load data from HDFS (no local keyword):

load data inpath '/orders.csv'
overwrite into table orders;

Build the trains table:

create table badou.trains(
order_id string,
product_id string,
add_to_cart_order string,
reordered string)
row format delimited fields terminated by ','
lines terminated by '\n';

load data local inpath '/badou20/03hive/data/order_products__train.csv'
overwrite into table trains;

2. How do we remove the dirty data in the first row of each table? (The first line of the raw data holds the column names and should be dropped on import.)

Method 1: shell commands

Idea: before loading, clean the abnormal row with sed '1d' orders.csv:

head -10 orders.csv > tmp.csv
cat tmp.csv
sed '1d' tmp.csv > tmp_res.csv
cat tmp_res.csv


Method 2: HQL (Hive SQL)

insert overwrite table badou.orders
select * from orders where order_id != 'order_id';

insert overwrite table badou.trains
select * from trains where order_id != 'order_id';
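The same header-row cleanup can be sketched outside Hive in plain Python (a minimal sketch using inline sample rows; in practice the lines would come from the CSV file):

```python
# Drop the header row before loading, like `sed '1d'` or the HQL filter above.
rows = [
    "order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order",
    "2539329,1,prior,1,2,08,",
    "2398795,1,prior,2,3,07,15.0",
]
# Keep only lines whose first field is not the literal column name,
# mirroring: select * from orders where order_id != 'order_id'
clean = [r for r in rows if r.split(",")[0] != "order_id"]
print(clean)
```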

3. How many orders does each user have? (group by + count(distinct))

user_id, order_id => user_id, order_cnt

Grouping: sorting rows into categories, usually with group by.

Result: the order count => order_cnt

We select two columns, user_id and order_cnt; the second column can be written as any of:

, count(distinct order_id) order_cnt
--, count(*) order_cnt
--, count(1) order_cnt
--, count(order_id) order_cnt

Complete statement:

select
user_id
, count(distinct order_id) order_cnt
from orders
group by user_id
order by order_cnt desc
limit 10;

Result (two jobs, Total MapReduce CPU Time Spent: 1 minutes 4 seconds 370 msec):

133983 100
181936 100
14923 100
55827 100
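The logic of this aggregation can be sketched in plain Python (a minimal sketch with a few made-up (order_id, user_id) rows):

```python
# Orders per user, mirroring:
#   select user_id, count(distinct order_id) order_cnt from orders group by user_id
orders = [
    ("2539329", "1"), ("2398795", "1"), ("473747", "1"),  # (order_id, user_id)
    ("1000001", "2"), ("1000001", "2"),                   # duplicate order row
]
order_sets = {}
for order_id, user_id in orders:
    order_sets.setdefault(user_id, set()).add(order_id)   # set() = count(distinct)
result = {u: len(s) for u, s in order_sets.items()}
print(result)  # user "1" has 3 distinct orders, user "2" has 1
```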

4. What is the average number of items per order for each user?

Example: a user placed 2 orders today, one with 10 products and one with 4.
Products per order = (10 + 4) / 2 = 7

Result: on average this user purchases 7 products per order.

a. First use the priors table to compute how many items each order contains (the 10 and 4 in the example above).

Note: aggregate functions (count, sum, avg, max, min) are used together with group by.

select
order_id, count(distinct product_id) pro_cnt
from priors
group by order_id
limit 10;

b. Join the priors table with the orders table on order_id to attach the per-order product count from step a to each user.

Result: the product counts associated with each user.

select
od.user_id, t.pro_cnt
from orders od
inner join (
select
order_id, count(1) as pro_cnt
from priors
group by order_id
limit 10000
) t
on od.order_id = t.order_id
limit 10;

c. Sum the per-order product counts from step b for each user.

select
od.user_id, sum(t.pro_cnt) as sum_prods
from orders od
inner join (
select
order_id, count(1) as pro_cnt
from priors
group by order_id
limit 10000
) t
on od.order_id = t.order_id
group by od.user_id
limit 10;

d. Compute the average.

Result: the user's total product quantity / the user's order quantity.

select
od.user_id
, sum(t.pro_cnt) / count(1) as sc_prod
, avg(pro_cnt) as avg_prod
from orders od
inner join (
select
order_id, count(1) as pro_cnt
from priors
group by order_id
limit 10000
) t
on od.order_id = t.order_id
group by od.user_id
limit 10;

inner join: joins multiple tables on matching keys.

where: filters down to the rows we care about.
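Steps a through d can be sketched end to end in plain Python (a minimal sketch with hypothetical ids "o1", "u1", etc.):

```python
# Average products per order for each user, mirroring steps a-d.
priors = [("o1", "p1"), ("o1", "p2"), ("o1", "p3"), ("o2", "p4")]  # (order_id, product_id)
orders = [("o1", "u1"), ("o2", "u1")]                              # (order_id, user_id)

# step a: products per order
pro_cnt = {}
for order_id, _ in priors:
    pro_cnt[order_id] = pro_cnt.get(order_id, 0) + 1

# steps b-d: inner join on order_id, then sum(pro_cnt) / count(orders) per user
sums, cnts = {}, {}
for order_id, user_id in orders:
    if order_id in pro_cnt:  # inner join keeps only matching order_ids
        sums[user_id] = sums.get(user_id, 0) + pro_cnt[order_id]
        cnts[user_id] = cnts.get(user_id, 0) + 1
avg_prod = {u: sums[u] / cnts[u] for u in sums}
print(avg_prod)  # u1: (3 + 1) / 2 = 2.0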

5. What is the distribution of each user's purchase orders over the week (row-to-column pivot)? dow => day of week, 0-6 meaning Monday through Sunday

Before the pivot, counts are one row per day, keyed by order_dow, e.g.:

orderday, pro_cnt
2020-12-19  1000000
2020-12-18  1000010

Target layout after the pivot:

user_id dow0 dow1 dow2 dow3 dow4 dow5 dow6
1       0    3    2    2    4    0    0
2       0    5    5    2    1    1    0

Note: in real development, always validate the code logic on a small batch of data first, and only then run on the full dataset!

For each user, every matching row adds 1 to the corresponding dow column:

user_id order_dow
1    0   sum = 0 + 1 = 1
1    0   sum = 1 + 1 = 2
1    1
2    1

Method 1 (case when):

select
user_id
, sum(case when order_dow='0' then 1 else 0 end) dow0
, sum(case when order_dow='1' then 1 else 0 end) dow1
, sum(case when order_dow='2' then 1 else 0 end) dow2
, sum(case when order_dow='3' then 1 else 0 end) dow3
, sum(case when order_dow='4' then 1 else 0 end) dow4
, sum(case when order_dow='5' then 1 else 0 end) dow5
, sum(case when order_dow='6' then 1 else 0 end) dow6
from orders
-- where user_id in ('1','2','3')
group by user_id;

Method 2 (if):

select
user_id
, sum(if(order_dow='0',1,0)) dow0
, sum(if(order_dow='1',1,0)) dow1
, sum(if(order_dow='2',1,0)) dow2
, sum(if(order_dow='3',1,0)) dow3
, sum(if(order_dow='4',1,0)) dow4
, sum(if(order_dow='5',1,0)) dow5
, sum(if(order_dow='6',1,0)) dow6
from orders
where user_id in ('1','2','3')
group by user_id;

Verify the result's accuracy on a sample:

user_id dow0    dow1    dow2    dow3    dow4    dow5    dow6
1       0       3       2       2       4       0       0
2       0       6       5       2       1       1       0
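The case-when/if pivot can be sketched in plain Python (a minimal sketch over a few made-up (user_id, order_dow) rows):

```python
# Pivot each user's orders into day-of-week columns, mirroring
#   sum(case when order_dow='0' then 1 else 0 end) dow0, ...
orders = [("1", "1"), ("1", "1"), ("1", "1"), ("1", "2"), ("2", "1")]  # (user_id, order_dow)
pivot = {}
for user_id, dow in orders:
    row = pivot.setdefault(user_id, [0] * 7)  # one counter per dow0..dow6
    row[int(dow)] += 1
print(pivot)  # {"1": [0, 3, 1, 0, 0, 0, 0], "2": [0, 1, 0, 0, 0, 0, 0]}
```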

Classroom exercise: which products did each user purchase in a given time period?

Analysis: we need user_id and product_id.

orders: order_id, user_id
trains: order_id, product_id

select
ord.user_id, tr.product_id
from orders ord
inner join trains tr
on ord.order_id = tr.order_id
where order_hour_of_day = '10'
limit 10;

CREATE TABLE `udata`(
`user_id` string,
`item_id` string,
`rating` string,
`timestamp` string)
ROW FORMAT DELIMITED;

Note: timestamp is a keyword, so wrap it in backticks (``).

881250949 --> 1997-12-04 23:55:49

The udata table is timestamped. Requirement: when building recommendations, find which timestamps are closest to and furthest from now.

select
max(`timestamp`) max_timestamp, min(`timestamp`) min_timestamp
from udata;

max_timestamp   min_timestamp
893286638   874724710
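The timestamp conversion above can be checked in plain Python (a minimal sketch; the assumption here is that the post's "1997-12-04 23:55:49" was rendered in China Standard Time, UTC+8):

```python
from datetime import datetime, timezone, timedelta

# Convert the Unix timestamp 881250949 to a readable date string.
# Assumption: render in UTC+8 (China Standard Time), matching the post.
cst = timezone(timedelta(hours=8))
dt = datetime.fromtimestamp(881250949, tz=cst)
stamp = dt.strftime("%Y-%m-%d %H:%M:%S")
print(stamp)  # 1997-12-04 23:55:49
```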

Requirement: get the specific days on which a user posted reviews. The result shows which days the user was active, which may mean ① the user is genuinely active, or ② the user may be faking orders, so those reviews deserve scrutiny.

Target shape: user_id ['2020-12-19','2020-12-18', ...]

One day = 24*60*60 seconds.

collect_list: collects all values without removing duplicates, e.g.:

select collect_list('1,2,3')

select
user_id, collect_list(cast(days as int)) as day_list
from
(select
user_id
, (cast(893286638 as bigint) - cast(`timestamp` as bigint)) / (24*60*60) as days
from udata
) t
group by user_id
limit 10;
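A collect_list analogue in plain Python (a minimal sketch; the (user_id, timestamp) rows are made up, and 893286638 is the max timestamp found above):

```python
# Gather each user's review-day offsets into a list WITHOUT deduplication,
# like collect_list (collect_set would deduplicate instead).
udata = [("196", "881250949"), ("196", "881250949"), ("186", "891717742")]  # (user_id, timestamp)
MAX_TS = 893286638
day_list = {}
for user_id, ts in udata:
    days = int((MAX_TS - int(ts)) / (24 * 60 * 60))  # seconds -> whole days
    day_list.setdefault(user_id, []).append(days)
print(day_list)  # duplicates kept: {"196": [139, 139], "186": [18]}
```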

Requirement: which users purchased 100 or more distinct products?

union all: merges data without deduplication; the field types and the number of fields on both sides must match.

union: merges data and deduplicates.

Method 1:

select
user_id, count(distinct product_id) pro_cnt
from
(
-- order training data: a scenario of integrating new- and old-system data
select
a.user_id, b.product_id
from orders as a
left join trains b
on a.order_id = b.order_id
union all
-- order history data
select
a.user_id, b.product_id
from orders as a
left join priors b
on a.order_id = b.order_id
) t
group by user_id
having pro_cnt >= 100
limit 10;
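The union all + count(distinct) pattern can be sketched in plain Python (a minimal sketch with a handful of (user_id, product_id) pairs; the threshold is lowered so the toy data produces output):

```python
# union all: concatenate train and prior pairs WITHOUT deduplication,
# then count distinct products per user.
trains = [("1", "49302"), ("1", "11109")]   # (user_id, product_id)
priors = [("1", "49302"), ("2", "33120")]
combined = trains + priors                  # keeps the duplicate ("1", "49302")
pro_sets = {}
for user_id, product_id in combined:
    pro_sets.setdefault(user_id, set()).add(product_id)
result = {u: len(s) for u, s in pro_sets.items()}
print(result)  # count(distinct product_id) dedupes: {"1": 2, "2": 1}
```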

Method 2: introduce the with keyword. When the logic is complicated and deeply nested, with improves code readability and makes troubleshooting easier.

A block defined with with can be understood as a temporary table or temporary data set.

with user_pro_cnt_tmp as (
select * from
( -- order training data
select
a.user_id, b.product_id
from orders as a
left join trains b
on a.order_id = b.order_id
union all
-- order history data
select
a.user_id, b.product_id
from orders as a
left join priors b
on a.order_id = b.order_id
) t
)
--, order_pro_tmp as (
--), ....
select
user_id
, count(distinct product_id) pro_cnt
from user_pro_cnt_tmp
group by user_id
having pro_cnt >= 100
limit 10;

[/infobox]

 


Origin blog.csdn.net/qq_36816848/article/details/112861864