Hive case study: Weibo (microblog) data analysis

Data Download Link:

https://pan.baidu.com/s/1OGyO2jFj393-Dcq3eosbjA&shfl=sharepset
extraction code: jtdi

Sample data (any two of the files will do):

[{"beCommentWeiboId":"","beForwardWeiboId":"","catchTime":"1387157643","commentCount":"682","content":"喂!2014。。。2014!喂。。。","createTime":"1387086483","info1":"","info2":"","info3":"","mlevel":"","musicurl":[],"pic_list":["http://ww1.sinaimg.cn/square/47119b17jw1ebkc9b07x9j218g0xcair.jpg","http://ww4.sinaimg.cn/square/47119b17jw1ebkc9ebakij218g0xc113.jpg","http://ww2.sinaimg.cn/square/47119b17jw1ebkc9hml7dj218g0xcgt6.jpg","http://ww3.sinaimg.cn/square/47119b17jw1ebkc9kyakyj218g0xcqb3.jpg"],"praiseCount":"1122","reportCount":"671","source":"iPhone客户端","userId":"1192336151","videourl":[],"weiboId":"3655768039404271","weiboUrl":"http://weibo.com/1192336151/AnoMrDstN"}]

Field Description:

A total of 19 fields

beCommentWeiboId whether the post is a comment
beForwardWeiboId whether the post is a forward
catchTime crawl time
commentCount number of comments
content content
createTime creation time
info1 information field 1
info2 information field 2
info3 information field 3
mlevel not determined
musicurl music link
pic_list picture list (may contain several)
praiseCount number of likes
reportCount number of forwards
source data source
userId user id
videourl video link
weiboId Weibo id
weiboUrl Weibo URL

Functional Requirements:

  • Create the table as an external table
  • Data storage directory: hdfs://hadoop01:9000/data/weibo

1. Create a Hive table weibo_json(json string) with a single field, load all the data into it, and verify the load by querying the first 5 rows

Create the JSON table

create external table if not exists weibo_json(json string) location "/data/weibo";

Load the data

load data local inpath "/home/hadoop/hive_data/weibo.json" into table weibo_json;

Check the data

select json 
from weibo_json limit 5;

2. Parse the JSON data in weibo_json into a weibo table with 19 fields; write the necessary SQL statements

Create a table weibo

create table if not exists weibo(
 beCommentWeiboId string,
 beForwardWeiboId string,
 catchTime string,
 commentCount int,
 content string,
 createTime string,
 info1 string, 
 info2 string, 
 info3 string,
 mlevel string, 
 musicurl string, 
 pic_list string, 
 praiseCount int,
 reportCount int, 
 source string, 
 userId string, 
 videourl string,
 weiboId string, 
 weiboUrl string 
) row format delimited fields terminated by '\t';

Insert data

insert into table weibo 
select 
get_json_object(json,'$[0].beCommentWeiboId') beCommentWeiboId,
get_json_object(json,'$[0].beForwardWeiboId') beForwardWeiboId,
get_json_object(json,'$[0].catchTime') catchTime,
get_json_object(json,'$[0].commentCount') commentCount,
get_json_object(json,'$[0].content') content, 
get_json_object(json,'$[0].createTime') createTime,
get_json_object(json,'$[0].info1') info1, 
get_json_object(json,'$[0].info2') info2,
get_json_object(json,'$[0].info3') info3,
get_json_object(json,'$[0].mlevel') mlevel,
get_json_object(json,'$[0].musicurl') musicurl,
get_json_object(json,'$[0].pic_list') pic_list,
get_json_object(json,'$[0].praiseCount') praiseCount,
get_json_object(json,'$[0].reportCount') reportCount,
get_json_object(json,'$[0].source') source,
get_json_object(json,'$[0].userId') userId,
get_json_object(json,'$[0].videourl') videourl,
get_json_object(json,'$[0].weiboId') weiboId,
get_json_object(json,'$[0].weiboUrl') weiboUrl
from weibo_json;

3. Count the total number of Weibo posts and the number of distinct users

Fields involved

  • weiboId Weibo id
  • userId user id

The number of distinct users is obtained by de-duplicating userId

select count(weiboId)c1,count(distinct userId)c2 
from weibo;

4. Compute how many times each user's Weibo posts were forwarded in total, and output the top 5 users together with their totals

Fields involved

  • reportCount number of forwards
  • userId user id

Use the sum function to total the forwards per user, sort the result in descending order, and take the first five rows

select 
userId,sum(reportCount) sums 
from weibo 
group by userId order by sums desc limit 5;

5. Count the number of Weibo posts with pictures

Fields involved

  • weiboId Weibo id
  • pic_list picture list

A post with pictures has "http" in its pic_list field; use instr to check whether it is present

select count(weiboId) sum 
from weibo 
where instr(pic_list,'http')>0;

6. Count the distinct users who post Weibo from the iPhone client

Fields involved

  • userId user id
  • source data source

This is similar to question 5

select count(distinct userId)c1 
from weibo 
where instr(source,'iPhone客户端')>0;

7. Add up the number of likes and the number of forwards of the Weibo posts, sort by the sum in descending order, take the first 10 rows, and output the userId and the total

Fields involved

  • userId user id
  • praiseCount number of likes
  • reportCount number of forwards

Group by userId, then sort the result in descending order

select userId,sum(praiseCount)+sum(reportCount) sums 
from weibo 
group by userId 
order by sums desc limit 10;

8. Select the user id and data source of the Weibo posts with fewer than 1000 comments and put them in a view, then count the users in the view whose data source is the iPad client ("iPad客户端")

Fields involved

  • userId user id
  • commentCount number of comments
  • source data source

The question requires a view; the rest is similar to question 5

create view weibo_view 
as select userId,source 
from weibo 
where commentCount<1000;

select count(userId) c 
from weibo_view 
where instr(source,'iPad客户端')>0;

9. Use a custom function to find the user whose Weibo content contains "iphone" the most times, and output the user id and the count (note: the count is the number of occurrences of "iphone", not the number of posts in which "iphone" appears)

Fields involved

  • userId user id
  • content content

Custom class

package udf;

import org.apache.hadoop.hive.ql.exec.UDF;

public class MyUDF extends UDF {
    // Counts how many times "iphone" occurs in s, ignoring case
    public int evaluate(String s) {
        if (s == null) {
            return 0; // a missing content field contains no "iphone"
        }
        String text = s.toLowerCase();
        int count = 0; // counter
        int index = 0; // position to search from
        // indexOf returns the position of the next "iphone", or -1 if there is none
        while ((index = text.indexOf("iphone", index)) != -1) {
            count++;
            index += "iphone".length(); // continue searching after this match
        }
        return count;
    }

    /* public static void main(String[] args) {
        String s = "iphone1212iphone421iphone";
        System.out.println(new MyUDF().evaluate(s));
    } */
}
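Before packaging, the counting loop can be checked as plain Java without Hive on the classpath. The following is only a sketch: the class name IphoneCounter and the method name countIphone are made up for illustration and are not part of the original jar, and Locale.ROOT is added here so that lower-casing does not depend on the machine's locale.

```java
import java.util.Locale;

// Hypothetical stand-alone copy of the counting loop (no Hive dependency).
final class IphoneCounter {
    // Counts non-overlapping, case-insensitive occurrences of "iphone" in s.
    static int countIphone(String s) {
        if (s == null) {
            return 0;
        }
        String text = s.toLowerCase(Locale.ROOT);
        int count = 0;
        int index = 0;
        // indexOf returns -1 once no further match exists
        while ((index = text.indexOf("iphone", index)) != -1) {
            count++;
            index += "iphone".length(); // skip past the match just found
        }
        return count;
    }

    public static void main(String[] args) {
        // Same test string as in the commented-out main of the UDF
        System.out.println(countIphone("iphone1212iphone421iphone")); // prints 3
    }
}
```

Because the text is lower-cased first, "iPhone" in a value such as "iPhone客户端" is counted as well.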

Package this class into a jar and upload it to the Linux machine

add jar /home/hadoop/hive_data/test.jar;

Create a temporary function in Hive; note that the fully qualified class name is required here

create temporary function mysum as "udf.MyUDF";

The function can now be used directly; sum its result per user to get each user's total

select userId,sum(mysum(content)) c 
from weibo 
group by userId 
order by c desc limit 1;

10. Find the ID of the user who posted the most Weibo posts in a single day, together with the number of posts

Fields involved

  • userId user id
  • createTime creation time
  • weiboId Weibo id

From the question we can see that we need to group by userId and then by day

createTime is stored as a string of epoch seconds, so it has to be cast to bigint before from_unixtime can turn it into a date

select 
userId,from_unixtime(cast(createTime as bigint),"yyyy-MM-dd") day,count(weiboId) totalCount 
from weibo 
group by userId,from_unixtime(cast(createTime as bigint),"yyyy-MM-dd") 
order by totalCount desc limit 1;
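The conversion used in the query above, from_unixtime(cast(createTime as bigint),"yyyy-MM-dd"), can be mirrored outside Hive to sanity-check a single value. This is a rough sketch only: the class EpochDay and the method toDay are hypothetical names, and the time zone is passed explicitly here, whereas Hive is assumed to format in the time zone of the server or session.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;

// Hypothetical helper that reproduces the string -> bigint -> day conversion.
final class EpochDay {
    // Formats epoch seconds (given as a string, like createTime) as yyyy-MM-dd in the given zone.
    static String toDay(String epochSeconds, ZoneId zone) {
        long seconds = Long.parseLong(epochSeconds); // plays the role of cast(createTime as bigint)
        return DateTimeFormatter.ofPattern("yyyy-MM-dd")
                .withZone(zone)
                .format(Instant.ofEpochSecond(seconds));
    }

    public static void main(String[] args) {
        // createTime of the sample record shown at the top of this post
        System.out.println(toDay("1387086483", ZoneId.of("UTC")));           // 2013-12-15
        System.out.println(toDay("1387086483", ZoneId.of("Asia/Shanghai"))); // 2013-12-15
    }
}
```

Posts created close to midnight can fall on different days in different time zones, so the per-day counts depend on the zone in which the conversion is done.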

11. Count the photos that are referenced more than once (a photo that appears in more than one Weibo post counts as referenced multiple times)

Fields involved

  • weiboId Weibo id
  • pic_list picture list

Looking at the pic_list field, some values are an empty [] and some hold several URLs separated by ","; so we use the explode function to split a field with several URLs into separate rows

Explanation of the expressions

substr(pic_list,2,length(pic_list)-2)
 cuts pic_list starting at the second character and keeps length-2 characters, which strips the surrounding [ and ] and leaves only the URLs

split(substr(pic_list,2,length(pic_list)-2),",")
 splits the result on "," into an array of URLs

explode(split(substr(pic_list,2,length(pic_list)-2),","))
 expands the array into one row per URL

Create a temporary table to store weiboId (the join key) and the URL

create table pic_temp as 
select 
weiboId,ps.pic_url pic_url 
from weibo lateral view explode(split(substr(pic_list,2,length(pic_list)-2),",")) ps as pic_url where length(pic_list)>2;
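The substr and split steps in the statement above can be traced on a small value outside Hive. The sketch below assumes that get_json_object returned pic_list as JSON array text, that is, with square brackets and double-quoted elements; the class PicListSplit and the method splitPicList are hypothetical names used only for this illustration.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper mirroring split(substr(pic_list,2,length(pic_list)-2),",").
final class PicListSplit {
    // Drops the first and last character (the brackets), then splits on commas.
    // Intended for non-empty lists, matching the filter length(pic_list)>2 in the query.
    static List<String> splitPicList(String picList) {
        String inner = picList.substring(1, picList.length() - 1);
        return Arrays.asList(inner.split(","));
    }

    public static void main(String[] args) {
        // Two of the URLs from the sample record, in the assumed JSON array form
        String sample = "[\"http://ww1.sinaimg.cn/square/47119b17jw1ebkc9b07x9j218g0xcair.jpg\","
                + "\"http://ww4.sinaimg.cn/square/47119b17jw1ebkc9ebakij218g0xc113.jpg\"]";
        for (String url : splitPicList(sample)) {
            System.out.println(url); // each element still carries its double quotes
        }
    }
}
```

Under that assumption every pic_url value keeps its surrounding double quotes. The quotes are the same for every row, so grouping by pic_url still puts identical photos together.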

Count the distinct posts per URL, keep the photos referenced by more than one post, and count them

select 
count(*) totalCount 
from 
(select 
pic_url,count(distinct weiboId) totalCount 
from pic_temp 
group by pic_url 
having totalCount>1) a;


Origin blog.csdn.net/young_0609/article/details/103000976