The idea of MapReduce

Hands-on project: Sogou log query analysis

Data: Sogou query logs (SogouQ1.txt)

1. The overall architecture of the e-commerce big data platform
    1. Big data (Hadoop, Spark, Hive) is one way to implement a data warehouse. The core problems: data storage and data computation.
       What is a data warehouse? It is the traditional way of solving the big data problem: a database that, in general, only serves queries.

    2. The overall architecture of the big data platform
       Deployment options: Apache (vanilla), Ambari (HDP), CDH

2. Using the waterfall model in the project (software engineering methodology)
    1. How many stages does the waterfall model have?
    2. The tasks completed in each stage

3. Analysis and processing using MapReduce (Java program)
    1. The basic principle of MapReduce (the programming model)
       (1) Source of the idea: Google's papers on MapReduce and PageRank (page ranking)
       (2) Split first, then merge -----> distributed computing

    2. Use MapReduce for log analysis
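The "split first, then merge" idea above can be sketched with plain Scala collections using the classic word-count example. This is a minimal illustration of the programming model only, not actual Hadoop code; the input lines are made up:

```scala
// Word count as a "split first, then merge" computation.
val lines = List("hello world", "hello mapreduce")

// Map phase (split): turn every line into (word, 1) pairs
val mapped = lines.flatMap(_.split(" ")).map(word => (word, 1))

// Reduce phase (merge): group the pairs by word and sum the counts
val counts = mapped.groupBy(_._1).map { case (w, pairs) => (w, pairs.map(_._2).sum) }
```

In real MapReduce, the map phase runs in parallel on splits of the input across the cluster, and the framework performs the group-by (shuffle) before the reduce phase merges the partial results.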

4. Use Spark for analysis and processing (Scala or Java)
1. The advantages and architecture of Spark
2. Use Scala to develop Spark tasks for log analysis
bin/spark-shell --master spark://bigdata11:7077

    // load the raw Sogou query log from HDFS
    val rdd1 = sc.textFile("hdfs://mydemo71:8020/myproject/data/SogouQ1.txt")
    // split each line on tabs and keep only well-formed 6-field records
    val rdd2 = rdd1.map(_.split("\t")).filter(_.length == 6)
    rdd2.count()
    // records where the result ranked first and it was the user's second click
    val rdd3 = rdd2.filter(_(3).toInt == 1).filter(_(4).toInt == 2)
    rdd3.count()
    rdd3.take(3)   // inspect the first three matching records
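The parsing and filter logic in the spark-shell session above can be checked locally on a plain List, without a cluster. The sample lines below are made-up records in the assumed 6-field, tab-separated SogouQ layout (accesstime, userID, keyword, rank, click order, url):

```scala
val sample = List(
  "00:00:00\tu1\t[query1]\t1\t2\thttp://example.com/a",
  "00:00:01\tu2\t[query2]\t3\t1\thttp://example.com/b",
  "00:00:02\tu3\tbroken line"   // malformed: fewer than 6 fields
)

// keep only well-formed 6-field records, exactly like rdd2
val records = sample.map(_.split("\t")).filter(_.length == 6)

// result ranked first AND it was the user's second click, exactly like rdd3
val hits = records.filter(r => r(3).toInt == 1 && r(4).toInt == 2)
```

Spark's `map` and `filter` have the same semantics on RDDs as on local collections, which is why the logic can be prototyped this way before running it on the cluster.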

5. Use Hive for analysis and processing
    1. What is Hive? Its features? The Hive architecture
       - A data warehouse built on top of HDFS
       - Supports SQL statements
       - Essentially a translator: SQL ----> MapReduce (or Spark) jobs

    2. Use Hive to run queries
        ① Create the corresponding Hive table
        create table sogoulog(accesstime string,useID string,keyword string,no1 int,clickid int,url string) row format delimited fields terminated by ',';

        ② Clean the raw data: some lines do not have all 6 fields
        val rdd1 = sc.textFile("hdfs://mydemo71:8020/myproject/data/SogouQ1.txt")
        val rdd2 = rdd1.map(_.split("\t")).filter(_.length == 6)
        val rdd3 = rdd2.map(x => x.mkString(","))   // note: join the fields back into a comma-separated string
        rdd3.saveAsTextFile("hdfs://mydemo71:8020/myproject/cleandata/sogou")
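The `mkString(",")` step matters because the sogoulog table in step ① is declared with `fields terminated by ','`, so each cleaned record must be written as one comma-separated line before Hive can read it. A small sketch with a made-up record:

```scala
// one cleaned record in the assumed 6-field layout
val fields = Array("00:00:00", "u1", "[query1]", "1", "2", "http://example.com/a")

// join the fields into the comma-delimited form the Hive table expects
val line = fields.mkString(",")
```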

        ③ Load the cleaned data into Hive
        load data inpath '/myproject/cleandata/sogou/part-00000' into table sogoulog;
        load data inpath '/myproject/cleandata/sogou/part-00001' into table sogoulog;

        ④ Use SQL to query the matching records (show only the first 10)
        select * from sogoulog where no1=1 and clickid=2 limit 10;
  Exercise: query the employees in department 10 whose salary is greater than 2000.

