Data Download Link:
https://pan.baidu.com/s/1OGyO2jFj393-Dcq3eosbjA&shfl=sharepset
Extraction code: jtdi
Sample data (one record taken from any of the files):
[{"beCommentWeiboId":"","beForwardWeiboId":"","catchTime":"1387157643","commentCount":"682","content":"喂!2014。。。2014!喂。。。","createTime":"1387086483","info1":"","info2":"","info3":"","mlevel":"","musicurl":[],"pic_list":["http://ww1.sinaimg.cn/square/47119b17jw1ebkc9b07x9j218g0xcair.jpg","http://ww4.sinaimg.cn/square/47119b17jw1ebkc9ebakij218g0xc113.jpg","http://ww2.sinaimg.cn/square/47119b17jw1ebkc9hml7dj218g0xcgt6.jpg","http://ww3.sinaimg.cn/square/47119b17jw1ebkc9kyakyj218g0xcqb3.jpg"],"praiseCount":"1122","reportCount":"671","source":"iPhone客户端","userId":"1192336151","videourl":[],"weiboId":"3655768039404271","weiboUrl":"http://weibo.com/1192336151/AnoMrDstN"}]
Field descriptions:
19 fields in total
beCommentWeiboId whether the post is a comment
beForwardWeiboId whether the post is a forwarded microblog
catchTime crawl time
commentCount number of comments
content post content
createTime creation time
info1 information field 1
info2 information field 2
info3 information field 3
mlevel meaning not known
musicurl music links
pic_list picture list (may contain several pictures)
praiseCount number of likes
reportCount number of forwards
source data source
userId user id
videourl video links
weiboId microblog id
weiboUrl microblog URL
Functional Requirements:
- Create the tables as external tables
- Data storage directory: hdfs://hadoop01:9000/data/weibo
1. Create a Hive table weibo_json(json string) with a single field, load all the data into it, and verify the load by querying the first 5 rows
Create the JSON table
create external table if not exists weibo_json(json string) location "/data/weibo";
Load the data
load data local inpath "/home/hadoop/hive_data/weibo.json" into table weibo_json;
Check the data
select json
from weibo_json limit 5;
2. Parse the JSON data in weibo_json into a weibo table with 19 fields, and write the necessary SQL statements
Create the weibo table
create table if not exists weibo(
beCommentWeiboId string,
beForwardWeiboId string,
catchTime string,
commentCount int,
content string,
createTime string,
info1 string,
info2 string,
info3 string,
mlevel string,
musicurl string,
pic_list string,
praiseCount int,
reportCount int,
source string,
userId string,
videourl string,
weiboId string,
weiboUrl string
) row format delimited fields terminated by '\t';
Insert data
insert into table weibo
select
get_json_object(json,'$[0].beCommentWeiboId') beCommentWeiboId,
get_json_object(json,'$[0].beForwardWeiboId') beForwardWeiboId,
get_json_object(json,'$[0].catchTime') catchTime,
get_json_object(json,'$[0].commentCount') commentCount,
get_json_object(json,'$[0].content') content,
get_json_object(json,'$[0].createTime') createTime,
get_json_object(json,'$[0].info1') info1,
get_json_object(json,'$[0].info2') info2,
get_json_object(json,'$[0].info3') info3,
get_json_object(json,'$[0].mlevel') mlevel,
get_json_object(json,'$[0].musicurl') musicurl,
get_json_object(json,'$[0].pic_list') pic_list,
get_json_object(json,'$[0].praiseCount') praiseCount,
get_json_object(json,'$[0].reportCount') reportCount,
get_json_object(json,'$[0].source') source,
get_json_object(json,'$[0].userId') userId,
get_json_object(json,'$[0].videourl') videourl,
get_json_object(json,'$[0].weiboId') weiboId,
get_json_object(json,'$[0].weiboUrl') weiboUrl
from weibo_json;
3. Count the total number of microblogs and the number of distinct users
Fields involved
- weiboId microblog id
- userId user id
The number of distinct users requires deduplicating userId
select count(weiboId)c1,count(distinct userId)c2
from weibo;
4. Compute the total number of times each user's microblogs were forwarded, and output the top 5 users together with their totals
Fields involved
- reportCount number of forwards
- userId user id
Use the sum function to total the forwards per user, sort the result in descending order, and take the first five rows
select
userId,sum(reportCount) sums
from weibo
group by userId order by sums desc limit 5;
5. Count the microblogs that have pictures
Fields involved
- weiboId microblog id
- pic_list picture list
A microblog with pictures has an http URL in its pic_list field; instr is used to check whether one is present
select count(weiboId) sum
from weibo
where instr(pic_list,'http')>0;
6. Count the distinct users who post microblogs from the iPhone client
Fields involved
- userId user id
- source data source
Similar to question 5
select count(distinct userId)c1
from weibo
where instr(source,'iPhone客户端')>0;
7. For each user, add up the number of likes and the number of forwards of their microblogs, sort the totals in descending order, take the first 10 rows, and output the userId and the total
Fields involved
- userId user id
- praiseCount number of likes
- reportCount number of forwards
Group by userId, then sort the totals in descending order
select userId,sum(praiseCount)+sum(reportCount) sums
from weibo
group by userId
order by sums desc limit 10;
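The task asks for likes and forwards to be added up, which calls for summing the values rather than counting rows (count only reports how many non-null values a column has). A minimal sketch of the per-user total in plain Java, assuming each row is given as {userId, praiseCount, reportCount}; the class EngagementTotals and its method are hypothetical names used only for illustration, and the second and third rows are made-up data:

```java
import java.util.HashMap;
import java.util.Map;

final class EngagementTotals {
    // Adds likes + forwards of every row to the running total of its user.
    // Each row is {userId, praiseCount, reportCount}, all stored as text.
    static Map<String, Long> totalsByUser(String[][] rows) {
        Map<String, Long> totals = new HashMap<>();
        for (String[] row : rows) {
            long likes = Long.parseLong(row[1]);
            long forwards = Long.parseLong(row[2]);
            totals.merge(row[0], likes + forwards, Long::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        String[][] rows = {
            {"1192336151", "1122", "671"}, // values from the sample record
            {"1192336151", "10", "5"},     // made-up row
            {"42", "3", "4"}               // made-up row
        };
        System.out.println(totalsByUser(rows));
    }
}
```

For user 1192336151 the total is 1122 + 671 + 10 + 5 = 1808, whereas counting the two columns would only give 4.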
8. Put the user id and data source of the microblogs with fewer than 1000 comments into a view, then count the rows in the view whose data source is the iPad client ("iPad客户端")
Fields involved
- userId user id
- commentCount number of comments
- source data source
The task requires a view; the rest is similar to question 5
create view weibo_view
as select userId,source
from weibo
where commentCount<1000;
select count(userId) c
from weibo_view
where instr(source,'iPad客户端')>0;
9. Use a custom function to find the user whose microblog content contains "iphone" most often, and output the user id and the count (note: the count is the number of occurrences of "iphone", not the number of microblogs in which "iphone" appears)
Fields involved
- userId user id
- content content
Custom class
package udf;

import org.apache.hadoop.hive.ql.exec.UDF;

public class MyUDF extends UDF {
    public static int evaluate(String s) {
        if (s == null) {
            return 0; // no content, no occurrences
        }
        int count = 0; // counter
        int index = 0; // position to search from
        String lower = s.toLowerCase();
        // indexOf returns the position of the next "iphone" at or after index, or -1 if there is none
        while ((index = lower.indexOf("iphone", index)) != -1) {
            count++;
            index += "iphone".length(); // continue searching after this match
        }
        return count;
    }
    /* public static void main(String[] args) {
        String s = "iphone1212iphone421iphone";
        System.out.println(evaluate(s));
    } */
}
Package this class into a jar and copy it to the Linux machine
add jar /home/hadoop/hive_data/test.jar;
Create a temporary function in Hive; note that the fully qualified class name is required here
create temporary function mysum as "udf.MyUDF";
The function can then be used directly
select userId,sum(mysum(content)) c
from weibo
group by userId
order by c desc limit 1;
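The counting logic can be tried out without a Hive installation. The sketch below is plain Java with no Hive dependency; the class OccurrenceCounter and its method are hypothetical names used only for illustration and are not part of the jar above. It counts non-overlapping, case-insensitive matches, which is an assumption about what "number of occurrences" should mean here.

```java
import java.util.Locale;

final class OccurrenceCounter {
    // Counts non-overlapping, case-insensitive occurrences of needle in text.
    static int countOccurrences(String text, String needle) {
        if (text == null || needle == null || needle.isEmpty()) {
            return 0;
        }
        String haystack = text.toLowerCase(Locale.ROOT);
        String target = needle.toLowerCase(Locale.ROOT);
        int count = 0;
        int index = 0;
        while ((index = haystack.indexOf(target, index)) != -1) {
            count++;
            index += target.length(); // resume the search after this match
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countOccurrences("iphone1212iphone421iphone", "iphone")); // 3
        System.out.println(countOccurrences("iPhone and IPHONE", "iphone"));         // 2
    }
}
```

Summing this per-post count over all posts of a user gives the per-user figure the question asks for.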
10. Find the user who posted the most microblogs in a single day, and output the user id and the number of microblogs
Fields involved
- userId user id
- createTime creation time
- weiboId microblog id
The task requires grouping by userId and then by day
createTime is stored as a string of epoch seconds, so it is cast to bigint before being converted to a date
select
userId,from_unixtime(cast(createTime as bigint),"yyyy-MM-dd") day,count(weiboId) totalCount
from weibo
group by userId,from_unixtime(cast(createTime as bigint),"yyyy-MM-dd")
order by totalCount desc limit 1;
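To see what the day conversion produces for one value, here is a minimal plain-Java sketch of the same step: parse the epoch seconds, then format them as yyyy-MM-dd. The class EpochDay and its method are hypothetical names for illustration. The time zone is passed explicitly; Hive's from_unixtime is understood to use the session or system time zone, so the zone chosen here (Asia/Shanghai) is an assumption.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;

final class EpochDay {
    private static final DateTimeFormatter DAY = DateTimeFormatter.ofPattern("yyyy-MM-dd");

    // Converts epoch seconds stored as text into a yyyy-MM-dd day in the given zone.
    static String toDay(String epochSeconds, ZoneId zone) {
        long seconds = Long.parseLong(epochSeconds.trim());
        return DAY.format(Instant.ofEpochSecond(seconds).atZone(zone));
    }

    public static void main(String[] args) {
        // createTime of the sample record
        System.out.println(toDay("1387086483", ZoneId.of("Asia/Shanghai"))); // 2013-12-15
    }
}
```

Posts made close to midnight can land on different days depending on the zone, which is why the grouping key depends on the time zone setting.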
11. Count the pictures that are referenced multiple times (a picture counts as referenced multiple times if it appears in more than one microblog)
Fields involved
- weiboId microblog id
- pic_list picture list
The pic_list field is either empty ([]) or holds several URLs separated by ","; the explode function is used to expand those URLs into separate rows
Explanation of the expressions
substr(pic_list,2,length(pic_list)-2)
strips the enclosing [ and ] from pic_list: it starts at the second character and keeps length(pic_list)-2 characters, leaving only the comma-separated URLs
split(substr(pic_list,2,length(pic_list)-2),",")
splits that string into an array of URLs
explode(split(substr(pic_list,2,length(pic_list)-2),","))
expands the array into one row per URL
Create a temporary table to store weiboId (the join key) and the URL
create table pic_temp as
select
weiboId,ps.pic_url pic_url
from weibo lateral view explode(split(substr(pic_list,2,length(pic_list)-2),",")) ps as pic_url where length(pic_list)>2;
Count the microblogs per URL and find the pictures that are referenced more than once
select
count(*) totalCount
from
(select
pic_url,count(distinct weiboId) totalCount
from pic_temp
group by pic_url
having totalCount>1) a;
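The same steps (strip the brackets, split on commas, count distinct microblogs per URL, keep the URLs seen in more than one) can be sketched in plain Java to check the logic on a few rows. The class SharedPictures and its methods are hypothetical names for illustration, and the sample rows and example.com URLs are made up. As in the Hive query, the URLs keep their surrounding quotes, which does not affect the comparison.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class SharedPictures {
    // Mirrors split(substr(pic_list,2,length(pic_list)-2),","): drop [ and ], split on commas.
    static List<String> splitPicList(String picList) {
        List<String> urls = new ArrayList<>();
        if (picList == null || picList.length() <= 2) {
            return urls; // "[]" or missing value: no pictures
        }
        for (String url : picList.substring(1, picList.length() - 1).split(",")) {
            urls.add(url);
        }
        return urls;
    }

    // Counts the pictures that appear in more than one distinct weiboId.
    static long countShared(Map<String, String> picListByWeiboId) {
        Map<String, Set<String>> weibosByUrl = new HashMap<>();
        for (Map.Entry<String, String> e : picListByWeiboId.entrySet()) {
            for (String url : splitPicList(e.getValue())) {
                weibosByUrl.computeIfAbsent(url, k -> new HashSet<>()).add(e.getKey());
            }
        }
        return weibosByUrl.values().stream().filter(ids -> ids.size() > 1).count();
    }

    public static void main(String[] args) {
        Map<String, String> sample = new HashMap<>(); // made-up rows: weiboId -> pic_list
        sample.put("w1", "[\"http://example.com/a.jpg\",\"http://example.com/b.jpg\"]");
        sample.put("w2", "[\"http://example.com/a.jpg\"]");
        sample.put("w3", "[]");
        System.out.println(countShared(sample)); // 1 (only a.jpg is shared)
    }
}
```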