Big Data Offline Pipeline (Exercise 2)

1. Data preprocessing stage
2. Data loading stage
3. Data analysis stage
4. Saving the results to the database stage
5. Query/display stage, using the HBase API (not written up here; the important part is the offline pipeline above, but a minimal sketch is given at the end of this post)

Raw data:

qR8WRLrO2aQ:mienge:406:People & Blogs:599:2788:5:1:0:4UUEKhr6vfA:zvDPXgPiiWI:TxP1eXHJQ2Q:k5Kb1K0zVxU:hLP_mJIMNFg:tzNRSSTGF4o:BrUGfqJANn8:OVIc-mNxqHc:gdxtKvNiYXc:bHZRZ-1A-qk:GUJdU6uHyzU:eyZOjktUb5M:Dv15_9gnM2A:lMQydgG1N2k:U0gZppW_-2Y:dUVU6xpMc6Y:ApA6VEYI8zQ:a3_boc9Z_Pc:N1z4tYob0hM:2UJkU2neoBs

Data after preprocessing:

qR8WRLrO2aQ:mienge:406:People,Blogs:599:2788:5:1:0:4UUEKhr6vfA,zvDPXgPiiWI,TxP1eXHJQ2Q,k5Kb1K0zVxU,hLP_mJIMNFg,tzNRSSTGF4o,BrUGfqJANn8,OVIc-mNxqHc,gdxtKvNiYXc,bHZRZ-1A-qk,GUJdU6uHyzU,eyZOjktUb5M,Dv15_9gnM2A,lMQydgG1N2k,U0gZppW_-2Y,dUVU6xpMc6Y,ApA6VEYI8zQ,a3_boc9Z_Pc,N1z4tYob0hM,2UJkU2neoBs

 

  1. Preprocess the raw data into the format of the preprocessed sample shown above.

  2. Looking at the raw data, the fields are separated by ":". A video can have several categories, separated by "&" with a space character on each side, and it can also have several related videos, which are again separated by ":". To make the analysis step easier, we first restructure and clean the data.

    That is: in each record, join the categories with "," (dropping the surrounding spaces), and join the multiple related-video IDs with "," as well, as sketched below.
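A minimal sketch of this cleaning rule in plain Java (independent of the MapReduce job that follows; the CleanSketch and cleanLine names are just for illustration):

public class CleanSketch {

    // Rewrites one raw record: the category field "People & Blogs" becomes "People,Blogs",
    // and the ":"-separated related-video ids become ","-separated.
    static String cleanLine(String line) {
        String[] fields = line.split(":");
        if (fields.length < 10) {
            return null;    // no related-video ids; such records are skipped, as in the job below
        }
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < 9; i++) {
            // fields[3] is the category list; " & " is collapsed to ","
            out.append(i == 3 ? fields[i].replace(" & ", ",") : fields[i]).append(":");
        }
        for (int i = 9; i < fields.length; i++) {
            out.append(fields[i]);
            if (i < fields.length - 1) {
                out.append(",");    // related ids are re-joined with ","
            }
        }
        return out.toString();
    }
}

Applied to the raw sample record above, this produces exactly the preprocessed line shown earlier.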

 

  1. Data preprocessing stage:

Result of the run [screenshot]

Implementation code [code and screenshot]:

package MapReduce;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

import java.io.IOException;

/**
 * Created by 一个蔡狗 on 2020/1/2.
 */
public class DPMap {

    public static void main(String[] args) throws Exception {

        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "DPjob");
        job.setJarByClass(DPMap.class);
        job.setInputFormatClass(TextInputFormat.class);
        TextInputFormat.addInputPath(job, new Path("E:\\video.txt"));
        job.setMapperClass(DPmap.class);
        job.setMapOutputKeyClass(LongWritable.class);
        job.setMapOutputValueClass(Text.class);
        job.setReducerClass(DPReduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        TextOutputFormat.setOutputPath(job, new Path("E:\\video2"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);

    }


    static class DPmap extends Mapper<LongWritable, Text, LongWritable, Text> {

        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String line = value.toString();
            String[] split = line.split(":");
            // keep records that have the 9 fixed fields plus at least one related-video id
            if (split.length >= 10) {
                // clean the category field, e.g. "People & Blogs" -> "People,Blogs"
                String categoryOld = split[3];
                String categoryNew = Util.replacedata(categoryOld);

                // the related-video ids are everything from field 9 onwards, ":"-separated in the raw line
                String relatedIdsOld = line.substring(line.indexOf(split[9]));
                StringBuilder sb = new StringBuilder();
                for (int i = 9; i < split.length; i++) {
                    sb.append(split[i]);
                    if (i < split.length - 1) {
                        sb.append(",");
                    }
                }
                String relatedIdsNew = sb.toString();

                // rebuild the record with the cleaned category and related-id fields
                String data = line.replace(categoryOld, categoryNew).replace(relatedIdsOld, relatedIdsNew);
                // a single constant key is enough: the reducer only forwards the cleaned lines
                context.write(new LongWritable(111), new Text(data));
            }
        }

    }


    static class DPReduce extends Reducer<LongWritable, Text, Text, NullWritable> {

        @Override
        protected void reduce(LongWritable key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
            // write each cleaned record as the output key; no value is needed
            for (Text value : values) {
                context.write(value, NullWritable.get());
            }
        }
    }


}


Util code:

package MapReduce;

/**
 * Created by 一个蔡狗 on 2020/1/2.
 */
public class Util {


    /**
     * Rewrites a category field such as "People & Blogs" into "People,Blogs":
     * split on "&", trim the surrounding spaces, and rejoin with ",".
     */
    public static String replacedata(String category) {
        if (!category.contains("&")) {
            return category;
        }
        StringBuilder categories = new StringBuilder();
        for (String oneCategory : category.split("&")) {
            categories.append(oneCategory.trim()).append(",");
        }
        // drop the trailing ","
        return categories.substring(0, categories.lastIndexOf(","));
    }




}

Driver code:

   public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "MyMapperReduceDriver");

        job.setJarByClass(MyMApperReduce.class);
        job.setInputFormatClass(TextInputFormat.class);
        TextInputFormat.addInputPath(job, new Path("E:\\video.txt"));
        job.setMapperClass(MyMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setReducerClass(MyReduce.class);
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        TextOutputFormat.setOutputPath(job, new Path("E:\\video2"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);

    }

Load the preprocessed data into Hive

     Create the database and tables

Data-loading result [screenshot]:

 Data-loading commands [commands]:

Create a database named video.
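A minimal pair of statements for this, assuming the default warehouse location:

create database video;
use video;

The use video; line makes the table DDL below land in that database.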

Create the raw-data tables:

Video table: video_ori    User table: video_user_ori

Create the ORC-format tables:

Video table: video_orc    User table: video_user_orc

The create statements for the raw tables are given below.

Create the video_ori video table:

create table video_ori(
    videoId string,
    uploader string,
    age int,
    category array<string>,
    length int,
    views int,
    rate float,
    ratings int,
    comments int,
    relatedId array<string>)
row format delimited
fields terminated by ":"
collection items terminated by ","
stored as textfile;

Create the video_user_ori user table:

create table video_user_ori(
    uploader string,
    videos int,
    friends int)
row format delimited
fields terminated by ","
stored as textfile;

The create statements for the ORC-format tables (the column types mirror the raw tables so that insert ... select * works without type conversions):


 

create table video_orc(
    videoId string,
    uploader string,
    age int,
    category array<string>,
    length int,
    views int,
    rate float,
    ratings int,
    comments int,
    relatedId array<string>)
stored as ORC;




create table video_user_orc(
    uploader string,
    videos int,
    friends int)
stored as ORC;

Load the preprocessed video data into the raw table video_ori, and the raw user data into video_user_ori.

video_ori:

load data local inpath '/opt/part-r-00000' overwrite into table video_ori;

video_user_ori: 

load data local inpath '/opt/user.txt' into table video_user_ori;

Query the raw tables and insert the data into the corresponding ORC tables.

video_orc:

insert into table video_orc select * from video_ori;

video_user_orc:

insert into table video_user_orc select * from video_user_ori;

 

Run HiveQL queries on the loaded data.

From the video table, get the videos whose rating is 5 and save the result to /export/rate.txt:

#!/bin/bash

hive -e "select * from video.video_orc where rate=5 " > /export/rate.txt

From the video table, get the videos with more than 100 comments and save the result to /export/comments.txt:

#!/bin/bash
hive -e "select * from video.video_orc where comments >100 " >/export/comments.txt

Create the corresponding Hive external tables

DDL for the rate external table:

create external table rate(
    videoId string,
    uploader string,
    age string,
    category string,
    length string,
    views string,
    rate string,
    ratings string,
    comments string,
    relatedId string)
row format delimited
fields terminated by "\t"
stored as textfile;

DDL for the comments external table:

create external table comments(
    videoId string,
    uploader string,
    age string,
    category string,
    length string,
    views string,
    rate string,
    ratings string,
    comments string,
    relatedId string)
row format delimited
fields terminated by "\t"
stored as textfile;

Load the result data from step 3 into the external tables

Data-load statements:

load data local inpath '/export/rate.txt' into table rate;

load data local inpath '/export/comments.txt' into table comments;

Create the Hive-HBase mapping tables

The Hive tables rate and comments map to the HBase tables hbase_rate and hbase_comments, respectively; videoId is mapped to the HBase row key (:key) and the remaining columns go into the cf column family.

create table video.hbase_rate(
    videoId string,
    uploader string,
    age string,
    category string,
    length string,
    views string,
    rate string,
    ratings string,
    comments string,
    relatedId string)
stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
with serdeproperties("hbase.columns.mapping" = ":key,cf:uploader,cf:age,cf:category,cf:length,cf:views,cf:rate,cf:ratings,cf:comments,cf:relatedId")
tblproperties("hbase.table.name" = "hbase_rate");

 

create table video.hbase_comments(
    videoId string,
    uploader string,
    age string,
    category string,
    length string,
    views string,
    rate string,
    ratings string,
    comments string,
    relatedId string)
stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
with serdeproperties("hbase.columns.mapping" = ":key,cf:uploader,cf:age,cf:category,cf:length,cf:views,cf:rate,cf:ratings,cf:comments,cf:relatedId")
tblproperties("hbase.table.name" = "hbase_comments");

Write the insert overwrite ... select statements that populate hbase_rate and hbase_comments:

insert overwrite table hbase_rate select * from rate;

 

insert overwrite table hbase_comments select * from comments;
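Step 5 in the outline (querying and displaying the data through the HBase API) is only mentioned above. Below is a minimal point-query sketch, assuming the hbase_rate table and the cf column family created earlier; the ZooKeeper quorum hosts and the class name HBaseQuerySketch are placeholders, not from the original.

package MapReduce;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseQuerySketch {

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // assumption: the cluster's ZooKeeper quorum; replace with your own hosts
        conf.set("hbase.zookeeper.quorum", "node01,node02,node03");

        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("hbase_rate"))) {
            // the row key is the videoId (mapped to :key above); this one comes from the sample record
            Get get = new Get(Bytes.toBytes("qR8WRLrO2aQ"));
            Result result = table.get(get);
            for (Cell cell : result.rawCells()) {
                System.out.println(Bytes.toString(CellUtil.cloneQualifier(cell))
                        + " = " + Bytes.toString(CellUtil.cloneValue(cell)));
            }
        }
    }
}

For listing multiple rows, a Scan over the same table can be used instead of Get.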


Reprinted from blog.csdn.net/bbvjx1314/article/details/103901357