Hadoop distributed cache

Concept:

    When running a MapReduce job, some read-only data may need to be shared among the mappers. If that data is not large, each task can load it into memory from HDFS through the distributed cache.

Using the distributed cache:

    1. In the main (driver) method, register the HDFS path of the shared file with the job. The path can be a directory or a single file. You can append "#" plus an alias to the end of the path; the alias becomes the name of a symlink that the map stage can open directly (a fuller driver sketch follows the snippet below).

   String cache = "hdfs://10.203.87.5:8020/cache/file";

   cache = cache + "#myfile"; // "myfile" is the alias (the symlink name)

   job.addCacheFile(new URI(cache)); // add to the job; Job.addCacheFile takes a java.net.URI
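For context, here is a minimal driver sketch showing where that call sits, assuming the Hadoop 2.x Job API; the class names CacheDriver and CacheMapper and the input/output arguments are illustrative placeholders, not part of the original:

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CacheDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "distributed cache demo");
        job.setJarByClass(CacheDriver.class);
        job.setMapperClass(CacheMapper.class); // the Mapper from step 2 below
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        // Register the shared HDFS file; "#myfile" makes Hadoop create a
        // symlink named "myfile" in each task's working directory.
        job.addCacheFile(new URI("hdfs://10.203.87.5:8020/cache/file#myfile"));

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}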

   2. In the setup method of the Mapper (or Reducer) class, open the cached file with an ordinary input stream. Thanks to the "#myfile" alias, the file appears as a local symlink named myfile in the task's working directory:

@Override
protected void setup(Context context) throws IOException {
    // "myfile" is the symlink to the cached HDFS file
    BufferedReader br = new BufferedReader(new FileReader("myfile"));
    String line;
    while ((line = br.readLine()) != null) {
        // parse each line into an in-memory structure
    }
    br.close();
}
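Putting it together, here is a sketch of a complete Mapper that loads the cached file into a HashMap and uses it for a map-side join; the tab-separated record format and the CacheMapper name are assumptions for illustration:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CacheMapper extends Mapper<LongWritable, Text, Text, Text> {

    // In-memory copy of the shared data, built once per task in setup()
    private final Map<String, String> lookup = new HashMap<>();

    @Override
    protected void setup(Context context) throws IOException {
        // Read the cached file through its "myfile" symlink
        try (BufferedReader br = new BufferedReader(new FileReader("myfile"))) {
            String line;
            while ((line = br.readLine()) != null) {
                // assume "key<TAB>value" records in the cached file
                String[] parts = line.split("\t", 2);
                if (parts.length == 2) {
                    lookup.put(parts[0], parts[1]);
                }
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Join each input record against the in-memory table
        String[] parts = value.toString().split("\t", 2);
        String match = lookup.get(parts[0]);
        if (match != null) {
            context.write(new Text(parts[0]), new Text(match));
        }
    }
}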
Loading into memory happens before any map or reduce calls run, and every slave node caches its own copy of the same shared data. If the shared data is too large to fit in memory, you can split it into batches, cache one batch per run, and execute the job repeatedly, as sketched below.
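As a rough sketch of that batching idea (the pre-split part-N file names, input/output paths, and per-batch output layout are all assumptions for illustration, not a Hadoop feature):

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BatchedCacheDriver {
    public static void main(String[] args) throws Exception {
        // Hypothetical pre-split pieces of the shared data, one per run
        String[] batches = {
            "hdfs://10.203.87.5:8020/cache/part-0#myfile",
            "hdfs://10.203.87.5:8020/cache/part-1#myfile"
        };
        for (int i = 0; i < batches.length; i++) {
            Job job = Job.getInstance(new Configuration(), "cache batch " + i);
            job.setJarByClass(BatchedCacheDriver.class);
            job.setMapperClass(CacheMapper.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            job.addCacheFile(new URI(batches[i])); // only this batch is cached
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1] + "/batch-" + i));
            if (!job.waitForCompletion(true)) {
                System.exit(1); // stop early if a batch fails
            }
        }
    }
}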
