1. Introduction to PCollection.apply

Before getting started, here is a quick look at the PCollection.apply method:
public <OutputT extends POutput> OutputT apply(
    String name, PTransform<? super PCollection<T>, OutputT> t) {
  return Pipeline.applyTransform(name, this, t);
}
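The second parameter is a Beam transform, PTransform<? super PCollection<T>, OutputT>. PTransform declares two type parameters: the first is the input type and the second is the output type. In the previous post we used ParDo.of(), which returns a PTransform object. ParDo is Beam's transform for general-purpose parallel processing. Its processing model resembles the "Map" phase of a map/shuffle/reduce algorithm: a ParDo transform visits every element of the input PCollection, runs some processing on that element, and emits zero or more output elements. The next post will cover the kinds of transforms Beam provides.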

2. Using files as the input source and the output sink

Now to the main topic. Code first:

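import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.SimpleFunction;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

// The enclosing class name is illustrative; the original post does not show it.
public class BeamWordCount {

    public static void main(String[] args) {
        getDataFromFile();
    }

    public static void getDataFromFile() {
        // Create the pipeline.
        PipelineOptions options = PipelineOptionsFactory.create();
        Pipeline p = Pipeline.create(options);

        // Read the file; each element of the PCollection is one line of text.
        PCollection<String> lines = p.apply(
                "ReadMyFile", TextIO.read().from("pom.xml"));

        lines.apply(new CountWords()) // returns a PCollection<KV<String, Long>>
                .apply(MapElements.via(new FormatAsTextFn())) // returns a PCollection<String>
                .apply("WriteCounts", TextIO.write().to("count.txt"));
        p.run().waitUntilFinish();
    }

    public static class CountWords extends PTransform<PCollection<String>,
            PCollection<KV<String, Long>>> {
        @Override
        public PCollection<KV<String, Long>> expand(PCollection<String> lines) {
            // Convert lines of text into individual words.
            PCollection<String> words = lines.apply(
                    ParDo.of(new ExtractWordsFn()));

            // Count the occurrences of each word.
            PCollection<KV<String, Long>> wordCounts =
                    words.apply(Count.<String>perElement());

            return wordCounts;
        }
    }

    /**
     * 1. Writing the pipeline logic as a DoFn keeps the code concise.
     * This DoFn splits each input line into words and emits them.
     */
    static class ExtractWordsFn extends DoFn<String, String> {
        @ProcessElement
        public void processElement(ProcessContext c) {
            if (c.element().trim().isEmpty()) {
                return;
            }

            // Split the line into words.
            String[] words = c.element().split("[^a-zA-Z']+");
            // Emit each non-empty word to the output PCollection.
            for (String word : words) {
                if (!word.isEmpty()) {
                    c.output(word);
                }
            }
        }
    }

    /**
     * 2. Formats each word/count pair as a printable string.
     */
    public static class FormatAsTextFn extends SimpleFunction<KV<String, Long>, String> {
        @Override
        public String apply(KV<String, Long> input) {
            return input.getKey() + ": " + input.getValue();
        }
    }
}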

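First, the data flows through these steps:
1. Read the file
2. Tokenize the file contents and extract the words
3. Count the occurrences of each word
4. Format the data for output
5. Write the data to a file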
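
To make the middle steps concrete, here is a minimal plain-Java sketch of steps 2 to 4 running on in-memory lines. It is not Beam code and does not run in parallel; it only mirrors the logic. The class name WordCountSketch and its method names are made up for this illustration, and the TreeMap is used only so the printed order is predictable (a Beam pipeline makes no ordering guarantee).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: a sequential, in-memory mirror of steps 2-4. Not Beam code.
public class WordCountSketch {

    // Step 2: split one line into words, using the same regex as ExtractWordsFn.
    public static List<String> extractWords(String line) {
        List<String> words = new ArrayList<>();
        if (line.trim().isEmpty()) {
            return words;
        }
        for (String word : line.split("[^a-zA-Z']+")) {
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        return words;
    }

    // Step 3: count how often each word occurs across all lines.
    // TreeMap keeps keys sorted so the result order is deterministic.
    public static Map<String, Long> countWords(List<String> lines) {
        Map<String, Long> counts = new TreeMap<>();
        for (String line : lines) {
            for (String word : extractWords(line)) {
                counts.merge(word, 1L, Long::sum);
            }
        }
        return counts;
    }

    // Step 4: format each word/count pair the same way FormatAsTextFn does.
    public static List<String> format(Map<String, Long> counts) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Long> e : counts.entrySet()) {
            out.add(e.getKey() + ": " + e.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("<name>beam demo</name>", "", "beam word count");
        // Prints one "word: count" line per distinct word, in alphabetical order.
        format(countWords(lines)).forEach(System.out::println);
    }
}
```

In the Beam version, steps 1 and 5 are handled by TextIO, and each of the three methods above corresponds to one transform applied to a PCollection.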

3. Explanation of the code above

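1. TextIO.read().from("pom.xml") reads the pom.xml file and returns a Read object. The path here is relative; an absolute path works as well. Read extends PTransform<PBegin, PCollection<String>>, so it can be passed directly to apply. As a result, lines = p.apply("ReadMyFile", TextIO.read().from("pom.xml")) produces a PCollection<String> in which each element is one line of the file.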

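2. CountWords extends PTransform<PCollection<String>, PCollection<KV<String, Long>>>. Its input element type is String and its output element type is KV<String, Long>, a key/value pair comparable to a Map entry.
3. CountWords.expand runs only once, when the pipeline is constructed. Its input is a PCollection<String> and its output is a PCollection<KV<String, Long>>. Inside it, lines.apply(ParDo.of(new ExtractWordsFn())) splits the text into individual words and puts them into a PCollection<String> (the raw words, duplicates included).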

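4. words.apply(Count.<String>perElement()) returns a collection of key/value pairs. Count.perElement() returns a PerElement object; PerElement extends PTransform and performs the conversion shown here. The finer details of the mapping are left for you to explore: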

 private static class PerElement<T> extends PTransform<PCollection<T>, PCollection<KV<T, Long>>> {
    private PerElement() {}
    @Override
    public PCollection<KV<T, Long>> expand(PCollection<T> input) {
      return input
          .apply(
              "Init",
              MapElements.via(
                  new SimpleFunction<T, KV<T, Void>>() {
                    @Override
                    public KV<T, Void> apply(T element) {
                      return KV.of(element, (Void) null);
                    }
                  }))
          .apply(Count.perKey());
    }
  }
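
The two stages in PerElement (pair every element with a null value, then count per key) can be mimicked in plain Java. The sketch below is only an analogy for reading the code above: PerElementSketch and countPerElement are names invented for this illustration, and it runs sequentially in memory rather than as a distributed Beam transform.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: an in-memory analogy for the "Init" + count-per-key structure.
public class PerElementSketch {

    public static <T> Map<T, Long> countPerElement(List<T> input) {
        // Stage 1 ("Init"): turn each element into a (key, null) pair.
        // The value carries no information; only the key matters.
        List<Map.Entry<T, Void>> keyed = new ArrayList<>();
        for (T element : input) {
            keyed.add(new SimpleEntry<T, Void>(element, null));
        }

        // Stage 2: count how many pairs share each key.
        Map<T, Long> counts = new HashMap<>();
        for (Map.Entry<T, Void> pair : keyed) {
            counts.merge(pair.getKey(), 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Long> counts = countPerElement(Arrays.asList("beam", "word", "beam"));
        System.out.println(counts.get("beam")); // 2
        System.out.println(counts.get("word")); // 1
    }
}
```

Reducing "count each element" to "count each key" is what lets PerElement reuse Count.perKey() instead of implementing its own counting logic.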

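5. apply(MapElements.via(new FormatAsTextFn())) converts each incoming key/value pair into a String. FormatAsTextFn extends SimpleFunction; its apply method receives a single KV<String, Long> (similar to a Map entry), whose key and value are read with getKey and getValue. MapElements.via applies that function to every element of the collection.
6. .apply("WriteCounts", TextIO.write().to("count.txt")) writes the results to the relative path count.txt. Note that by default TextIO treats count.txt as a filename prefix and writes sharded files with names such as count.txt-00000-of-00001; calling withoutSharding() on the write produces a single file.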
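
One detail from step 3 deserves a closer look: why ExtractWordsFn checks !word.isEmpty() after splitting. The small plain-Java sketch below feeds the same regex a line resembling one from pom.xml. SplitDemo and tokens are names made up for this illustration.

```java
import java.util.Arrays;

// Illustrative only: shows what the regex used in ExtractWordsFn produces.
public class SplitDemo {

    // Same split expression as in ExtractWordsFn: every run of characters
    // that are not ASCII letters or apostrophes acts as a delimiter.
    public static String[] tokens(String line) {
        return line.split("[^a-zA-Z']+");
    }

    public static void main(String[] args) {
        // The line starts with delimiters, so the first token is an empty string.
        System.out.println(Arrays.toString(tokens("  <groupId>org.apache.beam</groupId>")));
        // prints [, groupId, org, apache, beam, groupId]

        // Apostrophes are kept inside words; digits act as delimiters.
        System.out.println(Arrays.toString(tokens("don't stop")));
        // prints [don't, stop]
        System.out.println(Arrays.toString(tokens("beam2go")));
        // prints [beam, go]
    }
}
```

Because String.split keeps a leading empty string when the input begins with a delimiter, an indented XML line would otherwise contribute an empty "word" to the counts. The isEmpty check filters it out.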

Note: the code in this post comes from vbay's GitHub.
