Using the Hadoop distributed cache

Introduction

DistributedCache is a mechanism provided by the Hadoop framework. Before a job runs, it distributes the files the job specifies to the machines where the job's tasks will execute, and it manages those cached files for the duration of the job.
The cached content lives in files on HDFS; each node fetches its local copy from the file's HDFS path.

Steps for usage

1. Add files to the distributed cache

First define the cache path:

String cacheFile = "hdfs://xxxx";

You can attach an alias with "#"; the part after "#" is the name the file will be visible under in the task's working directory:

cacheFile = cacheFile + "#alias";

Then register the file with the job in the main (driver) method, after which it can be used in the map phase:

// Cache a jar on the classpath of the task's node
job.addArchiveToClassPath(archive);
// Cache an ordinary file on the classpath of the task's node
job.addFileToClassPath(file);
// Cache an archive (compressed file) in the task node's working directory
job.addCacheArchive(uri);
// Cache an ordinary file in the task node's working directory
job.addCacheFile(uri);
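The "#alias" travels as the fragment of the cache file's URI; Hadoop creates a symlink with that name in the task's working directory. A minimal stdlib-only sketch of how the local link name follows from the URI (the HDFS host and paths here are made up for illustration):

```java
import java.net.URI;
import java.net.URISyntaxException;

public class CacheUriDemo {
    // The local link name for a cache URI: the fragment after '#'
    // if present, otherwise the file name itself.
    static String linkName(URI uri) {
        if (uri.getFragment() != null) {
            return uri.getFragment();
        }
        String path = uri.getPath();
        return path.substring(path.lastIndexOf('/') + 1);
    }

    public static void main(String[] args) throws URISyntaxException {
        URI plain = new URI("hdfs://nn:8020/data/dict.txt");
        URI aliased = new URI("hdfs://nn:8020/data/dict.txt#dict");
        System.out.println(linkName(plain));   // dict.txt
        System.out.println(linkName(aliased)); // dict
    }
}
```

So with the alias set, the mapper can open the file simply by the short name, regardless of its full HDFS path.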

2. Read from the distributed cache

  @Override
  protected void setup(Mapper<LongWritable, Text, Text, Text>.Context context)
          throws IOException, InterruptedException {
      super.setup(context);
      // Read the cached file's contents directly via its alias, which
      // Hadoop symlinks into the task's working directory.
      BufferedReader br = new BufferedReader(new FileReader("alias"));
      // ... read lines from br here, then close it ...
      br.close();
  }
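A common pattern is to load the cached file into an in-memory map in setup() and look entries up during map(). The sketch below mirrors that read logic using only the standard library, with a temp file standing in for the "alias" symlink and a hypothetical tab-separated dictionary format:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

public class CacheReadDemo {
    // Load a tab-separated dictionary file into a map, mirroring what
    // a Mapper's setup() would do with the cached alias file.
    static Map<String, String> loadDict(String path) throws IOException {
        Map<String, String> dict = new HashMap<>();
        try (BufferedReader br = new BufferedReader(new FileReader(path))) {
            String line;
            while ((line = br.readLine()) != null) {
                String[] parts = line.split("\t", 2);
                if (parts.length == 2) {
                    dict.put(parts[0], parts[1]);
                }
            }
        }
        return dict;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the "alias" symlink in the task working directory.
        Path f = Files.createTempFile("dict", ".txt");
        Files.write(f, "1001\tBeijing\n1002\tShanghai\n".getBytes());
        Map<String, String> dict = loadDict(f.toString());
        System.out.println(dict.get("1001")); // Beijing
    }
}
```

Loading once in setup() rather than per record keeps the lookup cost out of the inner map() loop, which is the main reason to use the cache for small side tables.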

Origin blog.csdn.net/sc9018181134/article/details/104054235