Big Data Pipeline Exercise

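This is a comprehensive exercise made up of the following parts:
1. Data preprocessing
2. Loading the data into Hive
3. Data analysis
4. Saving the results to a database
5. Querying and displaying the data
The data format tables and sample data are given below. Read the data description first, then work through the questions.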

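Data description:
Table 1-1 Video table (columns: field, note, detailed description; table image not included)
Table 1-2 User table (columns: field, note, field type; table image not included)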

Raw data:

qR8WRLrO2aQ:mienge:406:People & Blogs:599:2788:5:1:0:4UUEKhr6vfA:zvDPXgPiiWI:TxP1eXHJQ2Q:k5Kb1K0zVxU:hLP_mJIMNFg:tzNRSSTGF4o:BrUGfqJANn8:OVIc-mNxqHc:gdxtKvNiYXc:bHZRZ-1A-qk:GUJdU6uHyzU:eyZOjktUb5M:Dv15_9gnM2A:lMQydgG1N2k:U0gZppW_-2Y:dUVU6xpMc6Y:ApA6VEYI8zQ:a3_boc9Z_Pc:N1z4tYob0hM:2UJkU2neoBs

Data after preprocessing:

qR8WRLrO2aQ:mienge:406:People,Blogs:599:2788:5:1:0:4UUEKhr6vfA,zvDPXgPiiWI,TxP1eXHJQ2Q,k5Kb1K0zVxU,hLP_mJIMNFg,tzNRSSTGF4o,BrUGfqJANn8,OVIc-mNxqHc,gdxtKvNiYXc,bHZRZ-1A-qk,GUJdU6uHyzU,eyZOjktUb5M,Dv15_9gnM2A,lMQydgG1N2k,U0gZppW_-2Y,dUVU6xpMc6Y,ApA6VEYI8zQ,a3_boc9Z_Pc,N1z4tYob0hM,2UJkU2neoBs

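1. Preprocess the raw data into the format of the preprocessed sample above. In the raw data, fields are separated by ":". A video can belong to several categories, which are separated by "&" with a space on each side. A video can also have several related videos, and these are separated by ":" as well. To make later analysis easier, first restructure and clean the data.
That is: separate the categories of each record with "," and remove the surrounding spaces, and separate the multiple related video ids with "," too.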
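The original exercise does not show code for this step. The following is a minimal sketch of the cleaning rule in plain Java, assuming that the first 9 fields are fixed (videoId through comments), that the category is the 4th field, and that every field after the 9th is a related video id. The class name `VideoLineCleaner` and the choice to return `null` for short lines are illustrative assumptions, not part of the source.

```java
class VideoLineCleaner {
    // Assumption: videoId, uploader, age, category, length, views, rate, ratings, comments
    private static final int FIXED_FIELDS = 9;
    private static final int CATEGORY_INDEX = 3;

    /** Returns the cleaned line, or null when the line has fewer than 9 fields. */
    static String clean(String rawLine) {
        String[] fields = rawLine.trim().split(":");
        if (fields.length < FIXED_FIELDS) {
            return null; // incomplete record, skipped in this sketch
        }
        // "People & Blogs" -> "People,Blogs": drop the spaces around '&'
        fields[CATEGORY_INDEX] = fields[CATEGORY_INDEX].trim().replaceAll("\\s*&\\s*", ",");

        StringBuilder sb = new StringBuilder(fields[0]);
        for (int i = 1; i < fields.length; i++) {
            // ':' between the fixed fields and before the first related id,
            // ',' between the remaining related ids
            sb.append(i <= FIXED_FIELDS ? ':' : ',').append(fields[i]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Shortened version of the sample record, keeping three related ids
        String raw = "qR8WRLrO2aQ:mienge:406:People & Blogs:599:2788:5:1:0:"
                + "4UUEKhr6vfA:zvDPXgPiiWI:TxP1eXHJQ2Q";
        System.out.println(clean(raw));
    }
}
```

In a real job the same function could be called per line from a MapReduce mapper or a simple file reader; which one to use depends on the size of the input.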
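2. Load the preprocessed data into Hive
2.1 Create the database and tables

	Database name: video
	Raw data tables:
	Video table: video_ori  User table: video_user_ori
	ORC tables:
	Video table: video_orc  User table: video_user_orc

The statements for the raw tables are given below.
Create the video_ori video table: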

create table video_ori(
    videoId string, 
    uploader string, 
    age int, 
    category array<string>, 
    length int, 
    views int, 
    rate float, 
    ratings int, 
    comments int,
    relatedId array<string>)
row format delimited 
fields terminated by ":"
collection items terminated by ","
stored as textfile;

Create the video_user_ori user table:

create table video_user_ori(
    uploader string,
    videos int,
    friends int)
row format delimited 
fields terminated by "," 
stored as textfile;

Write the CREATE TABLE statements for the ORC tables:
Create the video_orc table:

create table video_orc(
    videoId string, 
    uploader string, 
    age int, 
    category array<string>, 
    length int, 
    views int, 
    rate float, 
    ratings int, 
    comments int,
    relatedId array<string>)
stored as orc;

Create the video_user_orc table:

create table video_user_orc(
    uploader string,
    videos int,
    friends int)
stored as orc;

2.2 Load the preprocessed video data into the raw table video_ori, and load the raw user data into video_user_ori
Write the load statements:
video_ori:

load data local inpath '/opt/video_new.txt' into table video_ori ;

video_user_ori:

load data local inpath '/opt/user.txt' into table video_user_ori;

2.3 Query the raw tables and insert the data into the corresponding ORC tables
Write the insert statements:
video_orc:

INSERT INTO TABLE video_orc SELECT * FROM video_ori; 

video_user_orc:

INSERT INTO TABLE video_user_orc SELECT * FROM video_user_ori; 

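3. Run HiveQL queries on the loaded data
3.1 From the video table, select the videos whose rating is 5 and save the result to /export/rate.txt. Hive treats this path as a directory, so the result lands in files such as /export/rate.txt/000000_0.
Write the SQL statement: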

insert overwrite local directory "/export/rate.txt"   row format delimited 
fields terminated by ":"
collection items terminated by ","
select * from video_ori where rate = 5 ;

3.2 From the video table, select the videos with more than 100 comments and save the result to /export/comments.txt
Write the SQL statement:

insert overwrite local directory "/export/comments.txt"  row format delimited 
fields terminated by ":"
collection items terminated by ","
select * from video_ori where comments>100 ;

4. Save the data analyzed in Hive to HBase
4.1 Create the corresponding external tables in Hive
Write the statement that creates the rate external table:

Create external table rate(
    videoId string, 
    uploader string, 
    age int, 
    category array<string>, 
    length int, 
    views int, 
    rate float, 
    ratings int, 
    comments int,
    relatedId array<string>)
row format delimited 
fields terminated by ":"
collection items terminated by ","
stored as textfile;

Write the statement that creates the comments external table:

Create external table comments(
    videoId string, 
    uploader string, 
    age int, 
    category array<string>, 
    length int, 
    views int, 
    rate float, 
    ratings int, 
    comments int,
    relatedId array<string>)
row format delimited 
fields terminated by ":"
collection items terminated by ","
stored as textfile;

4.2 Load the result data from step 3 into the external tables
Write the load statement for the rate table:

load data local inpath '/export/rate.txt/000000_0' overwrite   into table rate;

Write the load statement for the comments table:

load data local inpath '/export/comments.txt' overwrite    into table comments;

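4.3 Create Hive managed tables mapped to HBase
The statements for this step are given. The Hive tables rate and comments correspond to the HBase tables hbase_rate and hbase_comments. Create the hbase_rate table with its mapping: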

create table hbase_rate( key string,videoId string, uploader string, age int, category array<string>, length int, views int, rate float, ratings int, comments int,relatedId array<string>)  
stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'  
with serdeproperties("hbase.columns.mapping" =":key,info:videoId,info:uploader,info:age,info:category,info:length,info:views,info:rate,info:ratings,info:comments,info:relatedId") 
tblproperties("hbase.table.name" = "hbase_rate");

Create the hbase_comments table with its mapping:

create table hbase_comments( key string,videoId string, uploader string, age int, category array<string>, length int, views int, rate float, ratings int, comments int,relatedId array<string>)  
stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'  
with serdeproperties("hbase.columns.mapping" =":key,info:videoId,info:uploader,info:age,info:category,info:length,info:views,info:rate,info:ratings,info:comments,info:relatedId") 
tblproperties("hbase.table.name" = "hbase_comments");
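The hbase.columns.mapping string is positional: it needs one entry per Hive column, in declaration order, with `:key` standing for the row key column. Typing it by hand for ten columns invites copy-paste mistakes, so one option is to generate it from the column names. The sketch below assumes a single column family and that the row key is the first Hive column; the class `HBaseMappingBuilder` is a hypothetical helper, not part of Hive or HBase.

```java
import java.util.Arrays;
import java.util.List;

class HBaseMappingBuilder {
    /** Builds ":key,<family>:<col1>,<family>:<col2>,..." for the given non-key columns. */
    static String build(String family, List<String> columns) {
        StringBuilder sb = new StringBuilder(":key");
        for (String column : columns) {
            sb.append(',').append(family).append(':').append(column);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Non-key columns of the video tables, in declaration order
        List<String> columns = Arrays.asList("videoId", "uploader", "age", "category",
                "length", "views", "rate", "ratings", "comments", "relatedId");
        System.out.println(build("info", columns));
    }
}
```

The printed string can be pasted into the serdeproperties clause of either table, since both share the same columns.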

4.4 Write the insert overwrite ... select statement that populates the hbase_rate table

insert overwrite table hbase_rate select row_number() over (),rate.* from rate;

Write the insert overwrite ... select statement that populates the hbase_comments table

insert overwrite table hbase_comments select row_number() over (),comments.* from comments;

5. Query through the HBase API
5.1 Use the HBase API to scan the hbase_rate table with startRowKey=1 and endRowKey=100 and print the result.

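import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

    public void rowKeyFilter() throws IOException {
        // HBaseConfiguration.create() also loads hbase-default.xml and hbase-site.xml
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "node01:2181,node02:2181,node03:2181");
        // try-with-resources closes the table and the connection even on failure
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table mytable = connection.getTable(TableName.valueOf("hbase_rate"))) {
            // Range scan: the start row is inclusive, the stop row is exclusive
            // (setStartRow/setStopRow are deprecated in HBase 2.x in favor of withStartRow/withStopRow)
            Scan scan = new Scan();
            scan.setStartRow(Bytes.toBytes("1"));
            scan.setStopRow(Bytes.toBytes("100"));

            try (ResultScanner scanner = mytable.getScanner(scan)) {
                // Each Result holds one row (possibly several column families and columns)
                for (Result result : scanner) {
                    System.out.println("rowkey -->" + Bytes.toString(result.getRow()));
                    // The Hive mapping stores every column in the "info" family
                    System.out.println("age:" + Bytes.toString(result.getValue(Bytes.toBytes("info"), Bytes.toBytes("age"))));
                }
            }
        }
    }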
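One point worth checking with this scan: HBase orders row keys by their bytes, not numerically, and the stop row is exclusive. If the row keys are the decimal strings produced by row_number() (an assumption about how the key was written), the range from "1" to "100" covers far fewer rows than the numbers suggest. The sketch below imitates that ordering in plain Java, without any HBase dependency; `RowKeyRangeDemo` and `scanRange` are illustrative names.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class RowKeyRangeDemo {
    /** Unsigned byte-wise comparison, the ordering HBase applies to row keys. */
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length; // a shorter key sorts before its extensions
    }

    /** Returns the keys k with start <= k < stop, in byte order. */
    static List<String> scanRange(List<String> keys, String start, String stop) {
        byte[] lo = start.getBytes(StandardCharsets.UTF_8);
        byte[] hi = stop.getBytes(StandardCharsets.UTF_8);
        List<String> out = new ArrayList<>();
        for (String key : keys) {
            byte[] k = key.getBytes(StandardCharsets.UTF_8);
            if (compareBytes(k, lo) >= 0 && compareBytes(k, hi) < 0) {
                out.add(key);
            }
        }
        out.sort((x, y) -> compareBytes(x.getBytes(StandardCharsets.UTF_8),
                y.getBytes(StandardCharsets.UTF_8)));
        return out;
    }

    public static void main(String[] args) {
        List<String> keys = List.of("1", "2", "10", "11", "99", "100", "101");
        System.out.println(scanRange(keys, "1", "100")); // prints [1, 10]
    }
}
```

If a numeric range is what is wanted, a common approach is to zero-pad the key when writing it (for example "001" to "100"), so that byte order and numeric order agree.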

5.2 Use the HBase API to query only the values of the comments column in the hbase_comments table.

    // Needs org.apache.hadoop.hbase.{Cell, CellUtil, HBaseConfiguration, TableName},
    // org.apache.hadoop.hbase.client.* and org.apache.hadoop.hbase.util.Bytes
    public void search() throws Exception {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "node01:2181,node02:2181,node03:2181");
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table mytable = connection.getTable(TableName.valueOf("hbase_comments"))) {

            Scan scan = new Scan();
            // Ask the server for the info:comments column only instead of filtering on the client
            scan.addColumn(Bytes.toBytes("info"), Bytes.toBytes("comments"));

            try (ResultScanner scanner = mytable.getScanner(scan)) {
                for (Result result : scanner) {
                    for (Cell cell : result.rawCells()) {
                        // Prints family:qualifier-value, e.g. info:comments-<value>
                        System.out.println(Bytes.toString(CellUtil.cloneFamily(cell)) + ":" + Bytes.toString(CellUtil.cloneQualifier(cell)) + "-" + Bytes.toString(CellUtil.cloneValue(cell)));
                    }
                }
            }
        }
    }

Reposted from blog.csdn.net/weixin_45749011/article/details/103867660