MapReduce: a basic implementation on Hadoop

One: The MapReduce way of thinking

(A) The problems MapReduce has to solve

1. How to distribute the code to the nodes in the cluster and run it there

2. How to assign the distributed code to the designated machines for execution

3. How to monitor the running nodes in real time

4. How to collect and summarize the results

In short: how to easily extend our simple business logic to a distributed computing environment over massive amounts of data.

(B) Basic concepts and program logic of MapReduce

The execution of a MapReduce program is divided into two stages: the Mapper stage and the Reducer stage.

The Mapper stage includes:

1> Specify the path of the input file. The input file is logically split into several splits, and each split is then parsed into input key-value pairs <k1, v1> according to certain rules, where k1 is the byte offset of the start of the line and v1 is the content of that line.

2> Call the user-written map function, which converts each input key-value pair <k1, v1> into output key-value pairs <k2, v2>; the map function is called once for each input pair <k1, v1>.

3> Partition, sort, and group the output key-value pairs <k2, v2>, where grouping collects the values with the same key into one set.

4> (Optional) Merge the grouped data locally (combiner).

The Reducer stage includes:

5> The outputs of the multiple Mapper tasks are copied over the network to different Reducer nodes according to their partitions; the outputs from the multiple Mapper tasks are then merged and sorted.

6> Call the user-written reduce function, which converts the input key-value pairs <k2, {v2, ...}> into output key-value pairs <k3, v3>.

7> The Reducer task saves its output to the specified file.
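
To make the flow concrete, take the single input line "Hello kitty Hello Mark" from the experimental data below, starting at byte offset 0. It is read as <0, "Hello kitty Hello Mark">; the map function emits <Hello, 1>, <kitty, 1>, <Hello, 1>, <Mark, 1>; sorting and grouping turn this into <Hello, <1, 1>>, <Mark, <1>>, <kitty, <1>>; and the reduce function writes <Hello, 2>, <Mark, 1>, <kitty, 1>.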

Two: Implementing the WordCount program

(A) Writing the map program

package cn.hadoop.mr.wc;

import java.io.IOException;

import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// There are 4 generic parameters: the first two specify the mapper's input key-value types,
// the last two specify the mapper's output key-value types.
// Both map and reduce wrap their input and output as <k, v> key-value pairs.
// By default, for text input the key is the byte offset of the start of the line and the value is the content of the line.
public class WCMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    @Override    // Hadoop's own serializable types carry no redundant data on the wire, which improves network transmission efficiency and the cluster's data communication capability
    protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, LongWritable>.Context context)
            throws IOException, InterruptedException {
        // Get the content of the line
        String lineVal = value.toString();

        // Split this line of text by the delimiter
        String[] words = StringUtils.split(lineVal, " ");

        // Iterate over the words and output each one in <k, v> form
        for (String word : words) {
            // Write the output data into the context
            context.write(new Text(word), new LongWritable(1));
        }

        // The <k, v> pairs are not actually sent to the cluster nodes one by one; after the traversal finishes,
        // data with the same key is grouped and sent out from the buffer.
        // The result is transmitted in the form <k, <v1, v2, v3, ..., vn>>, which here is <word, <1, 1, 1, ..., 1>>.
    }
}
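
As a side note (not part of the original code, just a sketch of a common refinement), the mapper can reuse its output Writable objects instead of allocating a new Text and LongWritable for every word, which reduces object creation on large inputs:

// same package declaration and imports as the WCMapper above
public class WCMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    // reused across map() calls; context.write serializes the pair immediately, so reuse is safe
    private final Text outKey = new Text();
    private final LongWritable one = new LongWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String word : StringUtils.split(value.toString(), " ")) {
            outKey.set(word);
            context.write(outKey, one);
        }
    }
}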

(B) Writing the reduce program

package cn.hadoop.mr.wc;

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WCReducer extends Reducer<Text, LongWritable, Text, LongWritable>{
    @Override
    protected void reduce(Text key, Iterable<LongWritable> values,
            Reducer<Text, LongWritable, Text, LongWritable>.Context context) throws IOException, InterruptedException {
        long count = 0;
        // Iterate over the values and count the total number of occurrences of each key
        for (LongWritable value : values) {
            count += value.get();
        }
        // Output the count for this word
        context.write(key, new LongWritable(count));
    }
}
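
For example, if this reducer receives the grouped pair <Hello, <1, 1, 1, 1>>, the loop sums the four ones and context.write outputs <Hello, 4>.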

(C) Creating the job and submitting the map and reduce to the cluster

package cn.hadoop.mr.wc;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WCRunner {

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        Job wcjob = Job.getInstance(conf);

        // Set the jar that contains the classes used by the whole job
        wcjob.setJarByClass(WCRunner.class);

        // Set the mapper and reducer classes used by the job
        wcjob.setMapperClass(WCMapper.class);
        wcjob.setReducerClass(WCReducer.class);

        // Set the output key and value classes for both map and reduce
        wcjob.setOutputKeyClass(Text.class);
        wcjob.setOutputValueClass(LongWritable.class);

        // The map output key/value types can also be set separately;
        // these two calls only matter when the map output types differ from the ones above
        wcjob.setMapOutputKeyClass(Text.class);
        wcjob.setMapOutputValueClass(LongWritable.class);

        // Specify the path where the input data to be processed is stored
        FileInputFormat.setInputPaths(wcjob, new Path("/wc/input/"));
        // Specify the path where the output results are stored
        FileOutputFormat.setOutputPath(wcjob, new Path("/wc/output"));

        // Submit the job to the cluster and run it
        wcjob.waitForCompletion(true);  // the boolean argument controls whether to print the job's status and progress
    }
}
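
Step 4> of the Mapper stage mentioned the optional local merge (combiner). Because summing counts is associative, the existing WCReducer can also serve as the combiner; a one-line addition to the runner (not in the original code, shown only as a sketch):

        // optional: run WCReducer on the map side as a combiner to pre-sum counts before the shuffle
        wcjob.setCombinerClass(WCReducer.class);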

(D) Supplementary notes

To run the map and reduce, you also need to import the jar packages from the mapreduce folder under Hadoop's share folder.

(E) Exporting the jar package

(F) Experimental test

1. Experimental data

Hello kitty Hello Mark
Bye Kitty
Good morning
Say Hello System Path
Hosts file
Open file
Hello ssyfj
Get access
Get default
wcdata.txt

2. Upload to the HDFS system

Create a file input directory

hadoop fs -mkdir -p /wc/input/

Upload files to the input directory

hadoop fs -put wcdata.txt /wc/input/

Note: Do not create the output directory for the results in advance, or an error will occur.

3. Run the program to analyze the data

hadoop jar wc.jar cn.hadoop.mr.wc.WCRunner

wc.jar is the jar package we exported, and cn.hadoop.mr.wc.WCRunner is the main class to run.

4. Check results
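
Assuming the output path set in WCRunner, the results can be inspected with the usual HDFS commands; part-r-00000 is Hadoop's default name for the first reducer's output file:

hadoop fs -ls /wc/output
hadoop fs -cat /wc/output/part-r-00000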

Three: Hadoop local test

(A) Import the hdfs, mapreduce, and yarn jar packages (including the dependent jars in their lib folders) into the project

(B) Resolving the Exception in thread "main" java.lang.NullPointerException problem (two ways)

1. Add the following at the beginning of the main method of the main class WCRunner:

        System.setProperty("hadoop.home.dir", "E:\\Hadoop\\hadoop-2.7.1");

2. Alternatively, add a HADOOP_HOME environment variable pointing to the local Hadoop directory (E:\Hadoop\hadoop-2.7.1 in this example)

(C) Download the following files into the bin directory under your local Hadoop home directory

Download: https://github.com/steveloughran/winutils/tree/master/hadoop-2.7.1

(D) Copy all the files in the bin directory into the C:\Windows\System32 directory

(E) Viewing the results

(F) Resolving the problem of logs not being shown when running Hadoop from Eclipse

The following usually appears, and no MapReduce progress information is shown:

log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.Shell).  
log4j:WARN Please initialize the log4j system properly.  
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.  

This is usually caused by the log4j logging module not having its configuration provided. Create a new file named "log4j.properties" under the project's src directory and fill in the following:

log4j.rootLogger=INFO, stdout  
log4j.appender.stdout=org.apache.log4j.ConsoleAppender  
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout  
log4j.appender.stdout.layout.ConversionPattern=%d %p [%c] - %m%n  
log4j.appender.logfile=org.apache.log4j.FileAppender  
log4j.appender.logfile.File=target/spring.log  
log4j.appender.logfile.layout=org.apache.log4j.PatternLayout  
log4j.appender.logfile.layout.ConversionPattern=%d %p [%c] - %m%n  

 


Origin www.cnblogs.com/ssyfj/p/12327470.html