Hadoop Learning (4) - Some Notes on MapReduce

Some details worth paying attention to in MapReduce.

If you package a MapReduce program into a jar and run it on Linux with the command

java -cp xxx.jar <main class name>

and errors are reported, it means the required Hadoop dependency jars are missing from the classpath.

Instead, use the command hadoop jar xxx.jar <main class name> (this is what cluster machines use, e.g. hadoop jar xx.jar mr.wc.JobSubmitter). When this command starts the client's main method, it adds the jars and configuration files from the machine's Hadoop installation directory to the runtime classpath.

So when the client's main method calls new Configuration(), it loads the configuration files found on that classpath, which naturally already contain parameters such as

fs.defaultFS, mapreduce.framework.name and yarn.resourcemanager.hostname,

and all the relevant local Hadoop jars are referenced as well.
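A minimal sketch of that behaviour (the class name here is just an example; it assumes the Hadoop configuration directory is on the classpath, as it is when the program is launched with hadoop jar):

import org.apache.hadoop.conf.Configuration;

public class ShowLoadedConfig {
    public static void main(String[] args) {
        // new Configuration() automatically reads core-site.xml, mapred-site.xml,
        // yarn-site.xml, etc. found on the classpath
        Configuration conf = new Configuration();
        System.out.println("fs.defaultFS = " + conf.get("fs.defaultFS"));
        System.out.println("mapreduce.framework.name = " + conf.get("mapreduce.framework.name"));
        System.out.println("yarn.resourcemanager.hostname = " + conf.get("yarn.resourcemanager.hostname"));
    }
}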

MapReduce also has a local job runner: a job does not have to be submitted to YARN, it can run in standalone mode, simulating parallelism with multiple threads on the local machine.

Whether on Linux or Windows, a submitted job goes to this local runner by default.

If you want jobs submitted on Linux to go to YARN by default, you need to write the configuration file hadoop/etc/hadoop/mapred-site.xml:

<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>
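The same choice can also be made from client code using the standard property keys mentioned above (a sketch only; the hostname and HDFS address are hypothetical examples, and a real submission to a remote cluster usually needs more setup, such as shipping the job jar):

import org.apache.hadoop.conf.Configuration;

public class FrameworkConfigSketch {

    // settings equivalent to the mapred-site.xml above: submit jobs to YARN
    public static Configuration yarnConf() {
        Configuration conf = new Configuration();
        conf.set("mapreduce.framework.name", "yarn");        // same key as in mapred-site.xml
        conf.set("yarn.resourcemanager.hostname", "hdp-01"); // hypothetical ResourceManager host
        conf.set("fs.defaultFS", "hdfs://hdp-01:9000");      // hypothetical HDFS address
        return conf;
    }

    // settings for the local simulation runner instead
    public static Configuration localConf() {
        Configuration conf = new Configuration();
        conf.set("mapreduce.framework.name", "local"); // multi-threaded local runner
        conf.set("fs.defaultFS", "file:///");          // read/write the local file system
        return conf;
    }
}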

 

If you want to use your own class as a key or value, the class must implement Writable. In the overridden write(DataOutput) method you serialize the fields you care about into binary, and the other overridden method, readFields(DataInput), is used to deserialize them.

Note that at deserialization time Hadoop first constructs an object through the class's no-argument constructor and then restores its state via the readFields method, so the class must have a no-argument constructor.

 

DataOutput is also a stream interface; Hadoop supplies its own wrapper around an underlying output stream, so if you use one yourself you need to back it with something like a FileOutputStream.

When writing a String, use writeUTF("string"): when encoding, it prefixes the string with its length (two bytes), taking the character encoding into account, so when Hadoop deserializes it first reads those two bytes to learn how long the string is. If you instead wrote write(string.getBytes()), there would be no way to tell how many bytes the string occupies.
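A small sketch of that length prefix (a hypothetical demo class using plain java.io, since DataOutputStream implements DataOutput; nothing here is Hadoop-specific):

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class WriteUtfDemo {
    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);

        out.writeUTF("hello");          // writes a 2-byte length (0x00 0x05) followed by the bytes
        System.out.println(buf.size()); // 7 = 2 (length prefix) + 5 (content)

        buf.reset();
        buf.write("hello".getBytes());  // raw bytes only, no length information
        System.out.println(buf.size()); // 5 - a reader cannot tell where the string ends
    }
}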

 

In the reduce phase, when an object is written out to HDFS, its toString method is called to produce the text, so you can override toString in your class to control the output format.

For example, the following class can be serialized by Hadoop:

package mapreduce2;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

public class FlowBean implements Writable {

    private int up;       // upstream traffic
    private int down;     // downstream traffic
    private int sum;      // total traffic
    private String phone; // phone number
    
    // a no-argument constructor is required so Hadoop can instantiate the
    // object before calling readFields() during deserialization
    public FlowBean() {
    }
    public FlowBean(int up, int down, String phone) {
        this.up = up;
        this.down = down;
        this.sum = up + down;
        this.phone = phone;
    }
    public int getUp() {
        return up;
    }
    public void setUp(int up) {
        this.up = up;
    }
    public int getDown() {
        return down;
    }
    public void setDown(int down) {
        this.down = down;
    }
    public int getSum() {
        return sum;
    }
    public void setSum(int sum) {
        this.sum = sum;
    }
    public String getPhone() {
        return phone;
    }
    public void setPhone(String phone) {
        this.phone = phone;
    }
    @Override
    public void readFields(DataInput di) throws IOException {
        // note: fields must be read in the same order they were written
        this.up = di.readInt();
        this.down = di.readInt();
        this.sum = di.readInt();
        this.phone = di.readUTF();
    }
    @Override
    public void write(DataOutput dout) throws IOException {
        dout.writeInt(this.up);
        dout.writeInt(this.down);
        dout.writeInt(this.sum);
        dout.writeUTF(this.phone);
    }
    @Override
    public String toString() {
        return "phone number: " + this.phone + "  total flow: " + this.sum;
    }
}

 

 

After a ReduceTask has finished processing all of its key groups, it invokes the cleanup method once.

Application exercise: count the total views of each page in the data and output the top n pages.

Approach 1: use only one ReduceTask together with the cleanup method. In the reduce phase, do not write results straight to HDFS; keep them in a TreeMap instead.

Then, once the ReduceTask has processed everything, cleanup iterates the TreeMap and writes the top n (for example, five) entries to HDFS.

package cn.edu360.mr.page.topn;

public class PageCount implements Comparable<PageCount>{
    
    private String page;
    private int count;
    
    public void set(String page, int count) {
        this.page = page;
        this.count = count;
    }
    
    public String getPage() {
        return page;
    }
    public void setPage(String page) {
        this.page = page;
    }
    public int getCount() {
        return count;
    }
    public void setCount(int count) {
        this.count = count;
    }

    @Override
    public int compareTo(PageCount o) {
        // sort by count in descending order; if counts are equal, sort by page name
        return o.getCount()-this.count==0?this.page.compareTo(o.getPage()):o.getCount()-this.count;
    }
    
    

}

 

The Mapper class:

package cn.edu360.mr.page.topn;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class PageTopnMapper extends Mapper<LongWritable, Text, Text, IntWritable>{
    
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        String[] split = line.split(" ");
        // the second field of each line is assumed to be the page/URL; emit (page, 1)
        context.write(new Text(split[1]), new IntWritable(1));
    }

}

The Reducer class:

package cn.edu360.mr.page.topn;

import java.io.IOException;
import java.util.Map.Entry;
import java.util.Set;
import java.util.TreeMap;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class PageTopnReducer extends Reducer<Text, IntWritable, Text, IntWritable>{
    
    TreeMap<PageCount, Object> treeMap = new TreeMap<>();
    
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values,
            Reducer<Text, IntWritable, Text, IntWritable>.Context context) throws IOException, InterruptedException {
        int count = 0;
        for (IntWritable value : values) {
            count += value.get();
        }
        PageCount pageCount = new PageCount();
        pageCount.set(key.toString(), count);
        
        treeMap.put(pageCount,null);
        
    }
    @Override
    protected void cleanup(Context context)
            throws IOException, InterruptedException {
        Configuration conf = context.getConfiguration();
        // in cleanup we can read the Configuration to find out how many top entries to output
        int topn = conf.getInt("top.n", 5);
        Set<Entry<PageCount, Object>> entrySet = treeMap.entrySet();
        int i = 0;
        for (Entry<PageCount, Object> entry : entrySet) {
            context.write(new Text(entry.getKey().getPage()), new IntWritable(entry.getKey().getCount()));
            i++;
            if (i == topn) {
                return;
            }
        }
    }
}

Next comes the JobSubmitter class. Note that it has to set up the Configuration, and there are several ways to do so.

The first way is to load a configuration file:

        Configuration conf = new Configuration();
        conf.addResource("xx-oo.xml");

Then write the following in the xx-oo.xml file:

<configuration>
    <property>
        <name>top.n</name>
        <value>6</value>
    </property>
</configuration>

The second way:

        // set the value directly in code
        conf.setInt("top.n", 3);
        // or take it from an argument passed directly to the Java main program
        conf.setInt("top.n", Integer.parseInt(args[0]));

The third way is to read the parameter from a properties file:

     Properties props = new Properties();
        props.load(JobSubmitter.class.getClassLoader().getResourceAsStream("topn.properties"));
        conf.setInt("top.n", Integer.parseInt(props.getProperty("top.n")));

Then configure the parameter in topn.properties:

top.n=5
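Beyond these three ways, another common option (not covered in these notes; sketched here with Hadoop's GenericOptionsParser) is to pass the value on the command line with -D, e.g. hadoop jar xxx.jar JobSubmitter -D top.n=5:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.GenericOptionsParser;

public class TopnFromCommandLine {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // parses generic Hadoop options such as "-D top.n=5" into the Configuration
        // and returns whatever arguments are left for the application itself
        String[] remainingArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        System.out.println("top.n = " + conf.getInt("top.n", 5));
        System.out.println("remaining application args: " + remainingArgs.length);
    }
}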

The JobSubmitter class (by default the job runs in local simulation mode):

package cn.edu360.mr.page.topn;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class JobSubmitter {

    public static void main(String[] args) throws Exception {

        /**
         * Parse parameters by loading the *-site.xml files on the classpath
         */
        Configuration conf = new Configuration();
        conf.addResource("xx-oo.xml");
        
        /**
         * Set parameters in code
         */
        //conf.setInt("top.n", 3);
        //conf.setInt("top.n", Integer.parseInt(args[0]));
        
        /**
         * Read parameters from a properties configuration file
         */
        /*Properties props = new Properties();
        props.load(JobSubmitter.class.getClassLoader().getResourceAsStream("topn.properties"));
        conf.setInt("top.n", Integer.parseInt(props.getProperty("top.n")));*/
        
        Job job = Job.getInstance(conf);

        job.setJarByClass(JobSubmitter.class);

        job.setMapperClass(PageTopnMapper.class);
        job.setReducerClass(PageTopnReducer.class);

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.setInputPaths(job, new Path("F:\\mrdata\\url\\input"));
        FileOutputFormat.setOutputPath(job, new Path("F:\\mrdata\\url\\output"));

        job.waitForCompletion(true);

    }
}

 

 

 

Additional Java notes

TreeMap: entries are automatically kept sorted as they are inserted.

There are two ways to customize a TreeMap's ordering. The first is to pass in a Comparator:

import java.util.Comparator;
import java.util.Map.Entry;
import java.util.Set;
import java.util.TreeMap;

// note: the FlowBean used here is a variant with a (phone, up, down) constructor
// and a getAmountFlow() method for the total flow
public class TreeMapTest {
    
    public static void main(String[] args) {
        
        TreeMap<FlowBean, String> tm1 = new TreeMap<>(new Comparator<FlowBean>() {
            @Override
            public int compare(FlowBean o1, FlowBean o2) {
                // if the two beans have the same total flow, compare by phone number
                if( o2.getAmountFlow()-o1.getAmountFlow()==0){
                    return o1.getPhone().compareTo(o2.getPhone());
                }
                // otherwise sort by total flow in descending order
                return o2.getAmountFlow()-o1.getAmountFlow();
            }
        });
        FlowBean b1 = new FlowBean("1367788", 500, 300);
        FlowBean b2 = new FlowBean("1367766", 400, 200);
        FlowBean b3 = new FlowBean("1367755", 600, 400);
        FlowBean b4 = new FlowBean("1367744", 300, 500);
        
        tm1.put(b1, null);
        tm1.put(b2, null);
        tm1.put(b3, null);
        tm1.put(b4, null);
        // iterate over the TreeMap entries
        Set<Entry<FlowBean,String>> entrySet = tm1.entrySet();
        for (Entry<FlowBean,String> entry : entrySet) {
            System.out.println(entry.getKey() +"\t"+ entry.getValue());
        }
    }

}

The second way is to have the class itself implement the Comparable interface:

package cn.edu360.mr.page.topn;

public class PageCount implements Comparable<PageCount>{
    
    private String page;
    private int count;
    
    public void set(String page, int count) {
        this.page = page;
        this.count = count;
    }
    
    public String getPage() {
        return page;
    }
    public void setPage(String page) {
        this.page = page;
    }
    public int getCount() {
        return count;
    }
    public void setCount(int count) {
        this.count = count;
    }

    @Override
    public int compareTo(PageCount o) {
        // sort by count in descending order; if counts are equal, sort by page name
        return o.getCount()-this.count==0?this.page.compareTo(o.getPage()):o.getCount()-this.count;
    }
    
    

}

 

 

 

 
