[MapReduce, Hive] classroom test

  It has been a long time since my last blog post.


   The test has three stages: cleaning, processing, and visualization. The goal is to process the data in the Result file and display the resulting statistics.

Phase I  Data cleaning: clean the data as required and import the cleaned data into the Hive database

  The task asks for the data format to be changed from ->

ip:    199.30.25.88

time:  10/Nov/2016:00:01:03 +0800

traffic:  62

Article: article/11325

Video: video/3235

  Becomes ->

ip ---> city (derived from the IP)

date--> time:2016-11-10 00:01:03

day: 10

traffic:62

type:article/video

id:11325

  However, the time field is not used in the later steps, so I did not do this cleaning with MapReduce. If needed, the time format can be converted in a map step.
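  For reference, the time conversion could look roughly like the Python sketch below. This is not code from the test: the function name `convert_time` is made up for illustration, and it assumes an English/C locale so that `%b` matches `Nov`. In a MapReduce job the same logic would sit inside the map function.

```python
from datetime import datetime

def convert_time(raw):
    """Convert a log timestamp such as '10/Nov/2016:00:01:03 +0800'
    into the target 'YYYY-MM-DD HH:MM:SS' string plus the day of month."""
    # %z consumes the '+0800' offset; the clock time itself is kept as-is.
    parsed = datetime.strptime(raw.strip(), "%d/%b/%Y:%H:%M:%S %z")
    return parsed.strftime("%Y-%m-%d %H:%M:%S"), parsed.day

print(convert_time("10/Nov/2016:00:01:03 +0800"))
# ('2016-11-10 00:01:03', 10)
```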

  One more cleaning step is needed: the traffic field in the text file contains extra spaces, which can be removed with either shell text processing or MapReduce. (I recommend the latter.)
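  A minimal sketch of this whitespace cleanup in Python, under some assumptions: the post does not show the raw file layout, so the comma separator is only a guess, and the helper names `clean_fields` and `split_course` are hypothetical. The second helper splits a value like `article/11325` into the `type` and `id` columns of the target format.

```python
def clean_fields(line, sep=","):
    """Split one record on `sep` (assumed separator) and strip stray
    spaces around every field, including the traffic value."""
    return [field.strip() for field in line.rstrip("\n").split(sep)]

def split_course(value):
    """Split 'article/11325' (or 'article / 11325') into (type, id)."""
    kind, _, ident = value.partition("/")
    return kind.strip(), ident.strip()

fields = clean_fields("199.30.25.88, 10/Nov/2016:00:01:03 +0800, 62 ,article/11325\n")
print(fields)
# ['199.30.25.88', '10/Nov/2016:00:01:03 +0800', '62', 'article/11325']
print(split_course(fields[3]))
# ('article', '11325')
```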

  After that, import the data directly into Hive.

Phase II  Data processing: three statistical tasks

  Because the statistics are relatively simple, I did not write a MapReduce Java program and used HiveQL (which is essentially SQL) instead. I will implement them again with MapReduce later.

  • Top 10 most popular videos/articles by number of visits (video/article)

    select id,type, count(*) as times from result group by id,type order by times DESC limit 10;

  • Top 10 most popular courses per city (ip)

    select b.id,b.ip,b.type,b.times from(

      select a.*, row_number() over(partition by a.ip order by a.times desc) n from (select id,ip,type, count(*) times from result group by id,ip,type)a

    )b where b.n<=10;

  • Top 10 most popular courses by traffic (traffic)

    select id,type,sum(traffic) traffics from result group by id,type order by traffics desc limit 10;

  Writing SQL statements is convenient: you can run them directly in Hive and see the results. The overall structure also becomes very clear, so I can write the MapReduce program by following it, which will make that work much easier.
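  To show how the first query maps onto a MapReduce structure, here is a small in-memory Python sketch. It is not a Hadoop job and not the program from the test; the function names and the sample records are invented for illustration. The map step emits a key per record, the reduce step plays the role of `group by`, and the last step corresponds to `order by ... desc limit 10`.

```python
from collections import defaultdict

def map_record(record):
    """Map step: emit ((id, type), 1) for one cleaned record."""
    return (record["id"], record["type"]), 1

def reduce_counts(pairs):
    """Reduce step: sum the values per key, like GROUP BY id, type."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return totals

def top_n(totals, n=10):
    """Final step: ORDER BY times DESC LIMIT n."""
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:n]

# Made-up sample records, only to exercise the functions.
records = [
    {"id": "11325", "type": "article"},
    {"id": "3235", "type": "video"},
    {"id": "11325", "type": "article"},
]
result = top_n(reduce_counts(map_record(r) for r in records))
print(result)
# [(('11325', 'article'), 2), (('3235', 'video'), 1)]
```

  For the traffic query, the map step would emit the traffic value instead of 1, and the rest stays the same.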

Phase III  Data visualization: import the statistical results into a MySQL database and present them graphically

  This stage is not finished yet. The plan is to use Sqoop to export the query results into the MySQL database and then build the visualization with ECharts.

Origin www.cnblogs.com/limitCM/p/11853884.html