Concept:
When executing a MapReduce job, some information may need to be shared among all mappers. If the shared data is small, it can be loaded from HDFS into memory on every node.
Using the DistributedCache mechanism:
1. In the main method, register the HDFS path of the shared file. The path can be a directory or a single file. You can append "#" + alias to the end of the path; the alias can then be used to open the file in the map stage.
String cache = "hdfs://10.203.87.5:8020/cache/file"; // HDFS path of the shared file
cache = cache + "#myfile"; // "myfile" is an alias for the cached file
job.addCacheFile(new URI(cache)); // register the file with the job
2. In the setup method of the Mapper (or Reducer) class, open the distributed cache file with an input stream:
protected void setup(Context context) throws IOException {
    // "myfile" is the alias registered in the main method; the cached
    // file appears under that name in the task's working directory
    BufferedReader br = new BufferedReader(new FileReader("myfile"));
    // ... read the shared data into memory, e.g. into a HashMap ...
}
The data is loaded into memory before the map tasks run, and each slave node caches one copy of the same shared data. If the shared data is too large to fit in memory, cache it in batches and run the job repeatedly, once per batch.
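The setup logic above amounts to parsing the cached file into an in-memory lookup table that the map method can then consult. A minimal plain-Java sketch of that parsing step (no Hadoop dependencies; the tab-separated format, the CacheLookupDemo class name, and the sample data are assumptions for illustration):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;

public class CacheLookupDemo {
    // Parse "key<TAB>value" lines into a map, as setup() would do
    // with the BufferedReader opened on the cached file.
    static Map<String, String> loadLookup(BufferedReader br) throws IOException {
        Map<String, String> lookup = new HashMap<>();
        String line;
        while ((line = br.readLine()) != null) {
            String[] parts = line.split("\t", 2);
            if (parts.length == 2) {
                lookup.put(parts[0], parts[1]);
            }
        }
        return lookup;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for new FileReader("myfile"); real code reads the cache alias.
        String cached = "1001\tBeijing\n1002\tShanghai\n";
        Map<String, String> lookup =
                loadLookup(new BufferedReader(new StringReader(cached)));
        System.out.println(lookup.get("1001")); // Beijing
        System.out.println(lookup.size());      // 2
    }
}
```

In a real Mapper the resulting map would be stored in an instance field during setup, so that map() can join each input record against it without any further I/O.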