Website PV Analysis with Hive

Original link:

https://www.toutiao.com/i6773241257528394248/

Previously we implemented website PV analysis with Java MapReduce. This time we use Hive to compute some of the required metrics.

Requirement: compute PV and UV for each of the 24 hours of the day.

Analysis:

(1) PV: total page views, count(url)

(2) UV: unique visitors, count(distinct guid)

(3) Time: extract the date and hour from the time field (partitioned table)
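To make the two metrics concrete, here is a minimal Python sketch of what count(url) and count(distinct guid) compute per date and hour. The function name and the sample records are invented for illustration and are not part of the original data set.

```python
from collections import defaultdict

def pv_uv_by_hour(records):
    """records: iterable of (date, hour, url, guid); None marks a missing field."""
    pv = defaultdict(int)          # page views per (date, hour)
    visitors = defaultdict(set)    # distinct guids per (date, hour)
    for date, hour, url, guid in records:
        key = (date, hour)
        if url is not None:        # count(url) skips NULL values
            pv[key] += 1
        if guid is not None:       # count(distinct guid) skips NULL values
            visitors[key].add(guid)
    keys = sorted(set(pv) | set(visitors))
    return {k: (pv.get(k, 0), len(visitors.get(k, ()))) for k in keys}

# Hypothetical sample rows
sample = [
    ("28", "18", "/a", "u1"),
    ("28", "18", "/b", "u1"),
    ("28", "18", "/a", "u2"),
    ("28", "19", "/a", "u1"),
    ("28", "19", None, "u3"),
]
print(pv_uv_by_hour(sample))
```

Each key is a (date, hour) pair and each value is a (pv, uv) pair, which is the shape of the result table built later in this post.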

The expected final result is a table with one row per date and hour, containing the pv and uv values.

Next, go through the stages in turn: collection, cleaning, and analysis.

Prepare the data and read the data dictionary to understand the structure and meaning of the fields (the data and the data dictionary are omitted here). At this point the collection stage can be considered complete; normally the staff responsible for collection hand the data over to us.

Log in with the beeline client

Start the server: bin/hiveserver2 &

Start the client

bin/beeline -u jdbc:hive2://mastercdh:10000 -n root -p password

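Create the database and the table according to the data dictionary.

Create a database and switch to it (the warehouse path used in the export step, exp_track_log.db, shows that it is named exp_track_log).

Create the source table: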

create table track_log_source(
id string,
url string,
referer string,
keyword string,
type string,
guid string,
pageId string,
moduleId string,
linkId string,
attachedInfo string,
sessionId string,
trackerU string,
trackerType string,
ip string,
trackerSrc string,
cookie string,
orderCode string,
trackTime string,
endUserId string,
firstLink string,
sessionViewNo string,
productId string,
curMerchantId string,
provinceId string,
cityId string,
fee string,
edmActivity string,
edmEmail string,
edmJobId string,
ieVersion string,
platform string,
internalKeyword string,
resultSum string,
currentPage string,
linkPosition string,
buttonPosition string
)
row format delimited fields terminated by '\t';

Prepare the data files (data1 and data2 under /data/test), then load them into the table:

load data local inpath '/data/test/data1' into table track_log_source;

load data local inpath '/data/test/data2' into table track_log_source;

Take a look at the loaded data.

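Once collection is complete, the data needs to be cleaned, as was done earlier in "MapReduce for data deduplication".

Based on the analysis above, create a table that contains only the fields we need to extract. (The column name date follows the original post. In Hive 1.2 and later, date is a reserved keyword by default, so it may need to be quoted with backticks or renamed.)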

create table track_log_qingxi(
id string,
url string,
guid string,
date string,
hour string
)
row format delimited fields terminated by '\t';

Insert data

insert into table track_log_qingxi select id,url,guid,substring(trackTime,9,2) date,substring(trackTime,12,2) hour from track_log_source;
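Hive's substring(str, pos, len) counts positions from 1, so the two calls above pick out the day and the hour of trackTime. The sketch below mirrors that indexing in Python. The sample value and the 'yyyy-MM-dd HH:mm:ss' layout are assumptions inferred from the offsets and the partition values used in this post; hive_substring is a hypothetical helper, not a Hive API.

```python
def hive_substring(s, pos, length):
    """Mimic Hive substring(str, pos, len) for a positive, 1-based pos."""
    start = pos - 1                     # convert to Python's 0-based index
    return s[start:start + length]

track_time = "2015-08-28 18:10:00"      # assumed sample value

day = hive_substring(track_time, 9, 2)    # characters 9-10: day of month
hour = hive_substring(track_time, 12, 2)  # characters 12-13: hour
print(day, hour)  # prints: 28 18
```

Note that the date column therefore holds only the day of the month ('28'), which is why the static-partition inserts later filter on date='28'.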

Partitioned table: partition on the time fields (date and hour)

create table track_log_part1(
id string,
url string,
guid string
)
partitioned by(date string,hour string)
row format delimited fields terminated by '\t';

Insert data

insert into table track_log_part1 partition(date='20150828',hour='18') select id,url,guid from track_log_qingxi where date='28' and hour='18';

insert into table track_log_part1 partition(date='20150828',hour='19') select id,url,guid from track_log_qingxi where date='28' and hour='19';

Written this way, the partition values and filter conditions have to be spelled out by hand for every partition, which is very inconvenient.

Let's look at a concept: dynamic partitioning.

Hive's hive-site.xml configuration file has two relevant properties.

The first indicates whether dynamic partitioning is enabled (it is enabled by default):

<property>
<name>hive.exec.dynamic.partition</name>
<value>true</value>
</property>

The second sets the mode. To let every partition column be determined dynamically, the mode must be nonstrict; the default, shown here, is strict:

<property>
<name>hive.exec.dynamic.partition.mode</name>
<value>strict</value>
</property>

We change it with a command for the current session rather than editing the configuration file:

set hive.exec.dynamic.partition.mode=nonstrict;

Then we create a new partitioned table

create table track_log_part2(
id string,
url string,
guid string
)
partitioned by(date string,hour string)
row format delimited fields terminated by '\t';

Insert the data again (this time using dynamic partitioning; the last two columns of the select supply the partition values)

insert into table track_log_part2 partition(date,hour) select * from track_log_qingxi;

Viewing the data shows that Hive has split it into partitions for us automatically. If the data covered more time periods, those would be partitioned automatically as well.
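As a rough illustration of what dynamic partitioning does, the Python sketch below groups rows by their trailing partition columns and builds the date=.../hour=... directory names that Hive uses for partitions. The function names and sample rows are invented for illustration; this is a simplified model and not Hive's implementation.

```python
from collections import defaultdict

def dynamic_partition(rows, partition_cols=2):
    """Group rows by their last `partition_cols` values, as a dynamic
    partition insert routes each row by its trailing select columns."""
    partitions = defaultdict(list)
    for row in rows:
        data, keys = row[:-partition_cols], row[-partition_cols:]
        partitions[keys].append(data)   # partition values are not stored in the data
    return dict(partitions)

def partition_path(keys, names=("date", "hour")):
    """Build the relative partition directory, e.g. date=28/hour=18."""
    return "/".join(f"{n}={v}" for n, v in zip(names, keys))

# Hypothetical rows in the column order id, url, guid, date, hour
rows = [
    ("1", "/a", "u1", "28", "18"),
    ("2", "/b", "u2", "28", "18"),
    ("3", "/c", "u1", "28", "19"),
]
for keys, data in dynamic_partition(rows).items():
    print(partition_path(keys), len(data))
```

This also shows why the column order of the select matters: the partition columns must come last and in the same order as in the partition clause.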

Data analysis

PV analysis

select date,hour,count(url) pv from track_log_part2 group by date,hour;

UV analysis

select date,hour,count(distinct guid) uv from track_log_part2 group by date,hour;

Save the combined results into a final result table

create table result as select date,hour,count(url) pv,count(distinct guid) uv from track_log_part2 group by date,hour;

Data output

The final result is saved in MySQL.

Create a table in MySQL

create table track_pv_uv_save(
date varchar(30),
hour varchar(30),
pv varchar(30),
uv varchar(30),
primary key (date,hour)
);

Export with Sqoop (Hive to MySQL)

bin/sqoop export \
--connect jdbc:mysql://mastercdh:3306/track_log_mysql \
--username root \
--password password \
--table track_pv_uv_save \
--export-dir /user/hive/warehouse/exp_track_log.db/result \
-m 1 \
--input-fields-terminated-by '\001'

Check the result in MySQL

We can also download the result file to the local file system

bin/hdfs dfs -get /user/hive/warehouse/exp_track_log.db/result/000000_0 /data/test
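The downloaded file uses the same '\001' (Ctrl-A) field separator that the Sqoop command above declares, so it looks run together in a plain text viewer. A small Python sketch for reading it follows; it assumes the result table is stored as plain text with the columns date, hour, pv, uv in that order, and the function names and sample line are made up for illustration.

```python
def parse_result_line(line):
    """Split one \\x01-delimited line into the four result columns."""
    date, hour, pv, uv = line.rstrip("\n").split("\x01")
    return {"date": date, "hour": hour, "pv": int(pv), "uv": int(uv)}

def read_result(path):
    """Read every non-empty line of a downloaded result file."""
    with open(path, encoding="utf-8") as f:
        return [parse_result_line(line) for line in f if line.strip()]

# Hypothetical line, not taken from the real data
sample_line = "28\x0118\x013\x012\n"
print(parse_result_line(sample_line))
```

For the file fetched above, read_result("/data/test/000000_0") would return one dictionary per date and hour, provided those assumptions hold.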

Check the downloaded data: it looks correct.

Origin www.cnblogs.com/bqwzy/p/12535810.html