Hadoop MapReduce overview and the WordCount word frequency statistics case

MapReduce

The idea of MapReduce: split first, then merge (divide and conquer)

Map: apply the same processing to every element of a data set

Reduce: further consolidate the intermediate results produced by Map

  • MapReduce data type
    -> key-value (k-v) pairs
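
To make the two roles concrete, here is a minimal sketch in plain Python (not the Hadoop API). The names `map_fn` and `reduce_fn` and the sample "year,temperature" records are hypothetical, chosen only to show that both stages read and write key-value pairs.

```python
from collections import defaultdict

def map_fn(record):
    """Map: turn one input record into a (key, value) pair."""
    year, temperature = record.split(",")
    return year, int(temperature)

def reduce_fn(key, values):
    """Reduce: combine all values that share a key into one result."""
    return key, max(values)

# Hypothetical input: one "year,temperature" reading per record.
records = ["2021,31", "2022,35", "2021,28", "2022,33"]

# The same map step is applied to every element independently.
mapped = [map_fn(r) for r in records]

# Group the intermediate values by key (a real framework does this
# between the two stages).
grouped = defaultdict(list)
for key, value in mapped:
    grouped[key].append(value)

results = dict(reduce_fn(k, v) for k, v in grouped.items())
print(results)  # {'2021': 31, '2022': 35}
```

Here Reduce keeps the maximum per key; any other aggregation (sum, count, average) would slot into `reduce_fn` the same way.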

Phase composition:

A MapReduce program can contain only one Map stage and one Reduce stage, or only a Map stage

If the user's business logic is very complex, the only option is to run multiple MapReduce programs serially

Mapper > Reducer > Mapper > Reducer
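
A rough sketch of such a serial chain, in plain Python rather than the Hadoop API. The helper `run_job` and the two-job pipeline (count the words, then group the words by their count) are hypothetical; the point is only that the second job takes the first job's output as its input.

```python
from collections import defaultdict

def run_job(records, mapper, reducer):
    """Run one simulated Map -> Reduce pass and return its output pairs."""
    grouped = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            grouped[key].append(value)
    # Keys are visited in sorted order, one reducer call per key.
    return [reducer(key, values) for key, values in sorted(grouped.items())]

# Job 1: count how often each word occurs.
def count_mapper(line):
    return [(word, 1) for word in line.split()]

def count_reducer(word, ones):
    return word, sum(ones)

# Job 2: invert the pairs so that words are grouped by their frequency.
def invert_mapper(pair):
    word, count = pair
    return [(count, word)]

def invert_reducer(count, words):
    return count, sorted(words)

lines = ["hello hadoop", "hello mapreduce"]
job1_output = run_job(lines, count_mapper, count_reducer)
job2_output = run_job(job1_output, invert_mapper, invert_reducer)

print(job1_output)  # [('hadoop', 1), ('hello', 2), ('mapreduce', 1)]
print(job2_output)  # [(1, ['hadoop', 'mapreduce']), (2, ['hello'])]
```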

MapReduce instance processes

  • MRAppMaster: Responsible for process scheduling and status coordination of the entire MR program
  • MapTask: Responsible for all data processing in the Map phase
  • ReduceTask: Responsible for all data processing in the Reduce phase

Advantages and disadvantages of MapReduce

  • Advantages of MapReduce
    • Easy to program: by simply implementing a few interfaces, you can complete a distributed program
    • Good scalability: computing power can be expanded by adding machines
    • High fault tolerance: the failure of any single node does not prevent the job from completing
    • Well suited to offline processing of massive data sets
  • Disadvantages of MapReduce
    • Poor real-time computing performance: it is mainly used for offline jobs and cannot deliver second-level response times
    • No stream computing: MapReduce mainly targets offline, static data sets

MapReduce programming case: WordCount word frequency statistics

Implementation ideas

  1. The core of the Map stage: split the input data into words and mark every word with 1, so the output is <word, 1>
  2. The core of the Shuffle stage: using the MR program's built-in sorting and grouping, collect the pairs that have the same key into one group, forming a new key-value pair
  3. The core of the Reduce stage: process one shuffled group at a time, which holds all the key-value pairs for a single word. Summing all the 1s gives the total number of occurrences of that word
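
The three stages above can be sketched as follows. This is a single-process simulation in plain Python under simplifying assumptions, not the Hadoop implementation; the function names `map_phase`, `shuffle_phase` and `reduce_phase` and the two sample lines are made up for illustration.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    """Split each line into words and emit <word, 1> for every word."""
    return [(word, 1) for line in lines for word in line.split()]

def shuffle_phase(pairs):
    """Sort by key, then group the values that share the same key."""
    ordered = sorted(pairs, key=itemgetter(0))
    return [(word, [count for _, count in group])
            for word, group in groupby(ordered, key=itemgetter(0))]

def reduce_phase(grouped):
    """Sum the 1s of each word to get its total number of occurrences."""
    return [(word, sum(ones)) for word, ones in grouped]

lines = ["hello world", "hello hadoop"]

mapped = map_phase(lines)
print(mapped)    # [('hello', 1), ('world', 1), ('hello', 1), ('hadoop', 1)]

shuffled = shuffle_phase(mapped)
print(shuffled)  # [('hadoop', [1]), ('hello', [1, 1]), ('world', [1])]

counts = reduce_phase(shuffled)
print(counts)    # [('hadoop', 1), ('hello', 2), ('world', 1)]
```

In a real cluster the map and reduce steps run as many parallel tasks on different machines, and the shuffle moves data between them over the network; the data flow, however, is the same as in this sketch.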

Steps

  1. Create a new file to run the word frequency statistics on, and enter some content

  2. Upload the file to the HDFS file system

  3. Run Hadoop's built-in wordcount example

    • The example JAR is under the $HADOOP_HOME/share/hadoop/mapreduce path and is named hadoop-mapreduce-examples-*.jar

    • Run the example:

    hadoop jar hadoop-mapreduce-examples-2.7.1.jar wordcount /input /output
    

    The wordcount argument selects the word frequency statistics example to run,

    /input is the HDFS path of the file on which word frequency statistics are performed,

    /output is the output path for the result; it does not need to be created manually (the job fails if this directory already exists)


    • View Results
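
The job writes its result as text files inside the output directory (typically named like part-r-00000), and by default each line holds a word and its count separated by a tab. As a hedged sketch, such output could be post-processed in Python as shown below. The sample text is made up, and `parse_wordcount_output` is a hypothetical helper, not part of Hadoop.

```python
def parse_wordcount_output(text):
    """Parse tab-separated "word<TAB>count" lines into a dict."""
    counts = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Split on the last tab only, in case a key itself contains a tab.
        word, count = line.rsplit("\t", 1)
        counts[word] = int(count)
    return counts

# Made-up sample in the default output layout (assumption: tab separator).
sample = "hadoop\t1\nhello\t2\nworld\t1\n"

counts = parse_wordcount_output(sample)
print(counts)  # {'hadoop': 1, 'hello': 2, 'world': 1}

# Most frequent words first; ties are broken alphabetically.
top = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
print(top)     # [('hello', 2), ('hadoop', 1), ('world', 1)]
```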


Origin blog.csdn.net/weixin_45735297/article/details/129770763