[Introduction to Apache Crunch]

The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines. Its goal is to make pipelines that are composed of many user-defined functions simple to write, easy to test, and efficient to run.

Running on top of Hadoop MapReduce and Apache Spark, the Apache Crunch™ library is a simple Java API for tasks like joining and data aggregation that are tedious to implement on plain MapReduce. The APIs are especially useful when processing data that does not fit naturally into a relational model, such as time series, serialized object formats like protocol buffers or Avro records, and HBase rows and columns. For Scala users, there is the Scrunch API, which is built on top of the Java APIs and includes a REPL (read-eval-print loop) for creating MapReduce pipelines.



 

Data Model and Operators

Crunch's Java API is centered around three interfaces that represent distributed datasets: PCollection, PTable, and PGroupedTable.

A PCollection<T> represents a distributed, immutable collection of elements of type T. For example, we represent a text file as a PCollection<String> object. PCollection<T> provides a method, parallelDo, that applies a DoFn to each element in the PCollection<T> in parallel, and returns a new PCollection<U> as its result.
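To make the parallelDo contract concrete, here is a minimal plain-Java model of it. This is not Crunch code: MiniDoFn, MiniEmitter, and ParallelDoSketch are hypothetical names, and an in-memory List stands in for the distributed PCollection. The sketch only mirrors the shape of a DoFn, which receives one input plus an emitter and may therefore emit zero, one, or many outputs per element. In Crunch itself, parallelDo additionally takes a PType that describes how the output is serialized.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ParallelDoSketch {
    /** Hypothetical stand-in for an emitter: receives the outputs of a function. */
    public interface MiniEmitter<T> {
        void emit(T value);
    }

    /** Hypothetical stand-in for a DoFn: may emit zero, one, or many outputs per input. */
    public interface MiniDoFn<S, T> {
        void process(S input, MiniEmitter<T> emitter);
    }

    /** Applies fn to every element and returns a new list; the input list is never modified. */
    public static <S, T> List<T> parallelDo(List<S> input, MiniDoFn<S, T> fn) {
        List<T> output = new ArrayList<>();
        for (S element : input) {
            fn.process(element, output::add);
        }
        return output;
    }

    /** Splits each line into words; an empty line emits nothing. */
    public static List<String> splitWords(List<String> lines) {
        return parallelDo(lines, (line, emitter) -> {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    emitter.emit(word);
                }
            }
        });
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("to be or", "", "not to be");
        System.out.println(splitWords(lines)); // prints [to, be, or, not, to, be]
    }
}
```

The three input lines become six output elements, which is why the result type U of parallelDo is independent of both the input type and the input size.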

A PTable<K, V> is a sub-interface of PCollection<Pair<K, V>> that represents a distributed, unordered multimap of its key type K to its value type V. In addition to the parallelDo operation, PTable provides a groupByKey operation that aggregates all of the values in the PTable that have the same key into a single record. It is the groupByKey operation that triggers the sort phase of a MapReduce job. Developers can exercise fine-grained control over the number of reducers and the partitioning, grouping, and sorting strategies used during the shuffle by providing an instance of the GroupingOptions class to the groupByKey function.
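The effect of groupByKey can be sketched in plain Java as follows. Again this is an in-memory model under stated assumptions, not the Crunch API: GroupByKeySketch and its pair helper are hypothetical, a list of Map.Entry objects stands in for the unordered multimap, and a TreeMap stands in for the sorted output of the shuffle. GroupingOptions has no counterpart here.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class GroupByKeySketch {
    /** Hypothetical helper that builds one key/value pair of the multimap. */
    public static <K, V> Map.Entry<K, V> pair(K key, V value) {
        return new SimpleEntry<K, V>(key, value);
    }

    /** Collects all values that share a key into one record, with keys in sorted order. */
    public static <K extends Comparable<K>, V> TreeMap<K, List<V>> groupByKey(
            List<Map.Entry<K, V>> table) {
        TreeMap<K, List<V>> grouped = new TreeMap<>();
        for (Map.Entry<K, V> entry : table) {
            grouped.computeIfAbsent(entry.getKey(), k -> new ArrayList<>())
                   .add(entry.getValue());
        }
        return grouped;
    }

    /** Sample multimap in which both keys occur twice and arrive unsorted. */
    public static TreeMap<String, List<Integer>> example() {
        List<Map.Entry<String, Integer>> table = Arrays.asList(
                pair("pear", 3), pair("apple", 1), pair("pear", 5), pair("apple", 2));
        return groupByKey(table);
    }

    public static void main(String[] args) {
        System.out.println(example()); // prints {apple=[1, 2], pear=[3, 5]}
    }
}
```

In this model the values of a key keep their insertion order. A real MapReduce shuffle sorts by key only, so the order of values within a group should not be relied on unless a secondary sort is configured.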

The result of a groupByKey operation is a PGroupedTable<K, V> object, which is a distributed, sorted map of keys of type K to an Iterable<V> of values that may be iterated over exactly once. In addition to parallelDo processing via DoFns, PGroupedTable provides a combineValues operation that allows a commutative and associative Aggregator to be applied to the values of the PGroupedTable instance on both the map and reduce sides of the shuffle. A number of common Aggregator<V> implementations are provided in the Aggregators class.
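Why the Aggregator must be commutative and associative becomes clearer with a small plain-Java model of a two-phase combine. This is a sketch, not Crunch code: CombineValuesSketch, its helpers, and the sample data are hypothetical, two in-memory lists stand in for the outputs of two map tasks, and Long::sum plays the role of the aggregation. The same operator runs once per partition (the map side) and once more over the partial results (the reduce side).

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BinaryOperator;

public class CombineValuesSketch {
    /** Hypothetical helper that builds one key/value pair. */
    public static Map.Entry<String, Long> pair(String key, long value) {
        return new SimpleEntry<String, Long>(key, value);
    }

    /** Folds the values of each key with op; models a single combiner pass. */
    public static Map<String, Long> combine(List<Map.Entry<String, Long>> pairs,
                                            BinaryOperator<Long> op) {
        Map<String, Long> result = new TreeMap<>();
        for (Map.Entry<String, Long> p : pairs) {
            result.merge(p.getKey(), p.getValue(), op);
        }
        return result;
    }

    /** Combines each partition on its own (map side), then combines the partials (reduce side). */
    public static Map<String, Long> twoPhase(List<List<Map.Entry<String, Long>>> partitions,
                                             BinaryOperator<Long> op) {
        List<Map.Entry<String, Long>> partials = new ArrayList<>();
        for (List<Map.Entry<String, Long>> partition : partitions) {
            partials.addAll(combine(partition, op).entrySet());
        }
        return combine(partials, op);
    }

    /** Reference result: combines all pairs in one pass, with no map-side step. */
    public static Map<String, Long> singlePass(List<List<Map.Entry<String, Long>>> partitions,
                                               BinaryOperator<Long> op) {
        List<Map.Entry<String, Long>> all = new ArrayList<>();
        for (List<Map.Entry<String, Long>> partition : partitions) {
            all.addAll(partition);
        }
        return combine(all, op);
    }

    /** Sample data: the outputs of two hypothetical map tasks. */
    public static List<List<Map.Entry<String, Long>>> partitions() {
        return Arrays.asList(
                Arrays.asList(pair("a", 1), pair("b", 2), pair("a", 3)),
                Arrays.asList(pair("a", 4), pair("b", 5)));
    }

    public static void main(String[] args) {
        System.out.println(twoPhase(partitions(), Long::sum));   // prints {a=8, b=7}
        System.out.println(singlePass(partitions(), Long::sum)); // prints {a=8, b=7}
    }
}
```

Both paths agree because addition gives the same total however the values are grouped and ordered. An operation without those properties, such as subtraction, could produce different results depending on how the data happened to be partitioned, which is why it would not be safe to run on the map side.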

Finally, PCollection, PTable, and PGroupedTable all support a union operation, which takes a series of distinct PCollections that all have the same data type and treats them as a single virtual PCollection.

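The three abstract interfaces for distributed datasets: PCollection, PTable, and PGroupedTable

1) PCollection<T> represents a distributed, immutable dataset. It provides the parallelDo and union methods; parallelDo applies a DoFn to every element and returns a new PCollection<U>.

2) PTable<K, V> is a sub-interface of PCollection<Pair<K, V>> and represents a distributed, unordered multimap. Besides the parallelDo inherited from PCollection, it overrides union and adds a groupByKey method. groupByKey corresponds to the sort phase of a MapReduce job. By passing a GroupingOptions instance to groupByKey, developers can exercise fine-grained control over the number of reducers and over the partitioning, grouping, and sorting strategies used during the shuffle.

3) PGroupedTable<K, V> is the result of a groupByKey operation and represents a distributed, sorted map whose values can be iterated over exactly once; it is a sub-interface of PCollection<Pair<K, Iterable<V>>>. Besides parallelDo and union, it provides a combineValues method, which applies a commutative and associative aggregation operator (see the Aggregator class) to the values of the PGroupedTable instance on both the map and reduce sides of the shuffle.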

All of the other data transformation operations supported by the Crunch APIs (aggregations, joins, sorts, secondary sorts, and cogrouping) are implemented in terms of these four primitives: parallelDo, groupByKey, combineValues, and union. The patterns themselves are defined in the org.apache.crunch.lib package and its children, and a few of the most common patterns have convenience functions defined on the PCollection and PTable interfaces.

Every Crunch data pipeline is coordinated by an instance of the Pipeline interface, which defines methods for reading data into a pipeline via Source instances and writing data out from a pipeline to Target instances. There are currently three implementations of the Pipeline interface that are available for developers to use:

MRPipeline: Executes the pipeline as a series of MapReduce jobs.

MemPipeline: Executes the pipeline in-memory on the client.

SparkPipeline: Executes the pipeline by converting it to a series of Spark pipelines.

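Apache Crunch is an implementation of FlumeJava. Because MapReduce programs are inconvenient to develop and use directly, Crunch offers a way to build MR pipelines: it has a data representation model, provides both basic and higher-level primitives, and optimizes the execution of MR jobs according to the underlying execution engine. From a distributed-computing perspective, many of the computational primitives that Crunch provides have close counterparts in Spark, Hive, Pig, and similar systems, while its own data reading and writing, its serialization handling, and its implementation of grouping, sorting, and aggregation, much like the split into MapReduce phases, can all be traced back to Hadoop.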


Reposted from gaojingsong.iteye.com/blog/2367231