Introduction to Apache Crunch

The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines. Its goal is to make pipelines that are composed of many user-defined functions simple to write, easy to test, and efficient to run.

 

Running on top of Hadoop MapReduce and Apache Spark, the Apache Crunch™ library is a simple Java API for tasks like joining and data aggregation that are tedious to implement on plain MapReduce. The APIs are especially useful when processing data that does not fit naturally into a relational model, such as time series, serialized object formats like protocol buffers or Avro records, and HBase rows and columns. For Scala users, there is the Scrunch API, which is built on top of the Java APIs and includes a REPL (read-eval-print loop) for creating MapReduce pipelines.



 

Data Model and Operators

Crunch's Java API is centered around three interfaces that represent distributed datasets: PCollection, PTable, and PGroupedTable.

 

A PCollection<T> represents a distributed, immutable collection of elements of type T. For example, we represent a text file as a PCollection<String> object. PCollection<T> provides a method, parallelDo, that applies a DoFn to each element in the PCollection<T> in parallel, and returns a new PCollection<U> as its result.
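
As a minimal sketch, a DoFn that tokenizes lines of text and a parallelDo call that applies it might look like the following (the Tokenizer class, the lines collection, and the use of the Writables type family are illustrative assumptions, not part of the text above):

    import org.apache.crunch.DoFn;
    import org.apache.crunch.Emitter;
    import org.apache.crunch.PCollection;
    import org.apache.crunch.types.writable.Writables;

    // A DoFn that splits each line of text into individual words.
    public class Tokenizer extends DoFn<String, String> {
      @Override
      public void process(String line, Emitter<String> emitter) {
        for (String word : line.split("\\s+")) {
          emitter.emit(word);
        }
      }
    }

    // Apply the DoFn to every element of a PCollection<String> in parallel:
    // PCollection<String> words = lines.parallelDo(new Tokenizer(), Writables.strings());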

 

A PTable<K, V> is a sub-interface of PCollection<Pair<K, V>> that represents a distributed, unordered multimap of its key type K to its value type V. In addition to the parallelDo operation, PTable provides a groupByKey operation that aggregates all of the values in the PTable that have the same key into a single record. It is the groupByKey operation that triggers the sort phase of a MapReduce job. Developers can exercise fine-grained control over the number of reducers and the partitioning, grouping, and sorting strategies used during the shuffle by providing an instance of the GroupingOptions class to the groupByKey function.
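
A sketch of building a PTable and triggering the shuffle might look like this, assuming the hypothetical words collection from the earlier sketch and an arbitrary reducer count:

    import org.apache.crunch.GroupingOptions;
    import org.apache.crunch.MapFn;
    import org.apache.crunch.PGroupedTable;
    import org.apache.crunch.PTable;
    import org.apache.crunch.Pair;
    import org.apache.crunch.types.writable.Writables;

    // Turn each word into a (word, 1) pair, yielding a PTable.
    PTable<String, Long> ones = words.parallelDo(
        new MapFn<String, Pair<String, Long>>() {
          @Override
          public Pair<String, Long> map(String word) {
            return Pair.of(word, 1L);
          }
        },
        Writables.tableOf(Writables.strings(), Writables.longs()));

    // groupByKey triggers the sort phase of the MapReduce job;
    // GroupingOptions gives fine-grained control over the shuffle.
    GroupingOptions options = GroupingOptions.builder()
        .numReducers(10) // hypothetical reducer count
        .build();
    PGroupedTable<String, Long> grouped = ones.groupByKey(options);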

 

The result of a groupByKey operation is a PGroupedTable<K, V> object, which is a distributed, sorted map of keys of type K to an Iterable that may be iterated over exactly once. In addition to parallelDo processing via DoFns, PGroupedTable provides a combineValues operation that allows a commutative and associative Aggregator to be applied to the values of the PGroupedTable instance on both the map and reduce sides of the shuffle. A number of common Aggregator<V> implementations are provided in the Aggregators class.
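
Continuing the same sketch, the grouped values can be summed with one of the built-in aggregators:

    import org.apache.crunch.PTable;
    import org.apache.crunch.fn.Aggregators;

    // SUM_LONGS is commutative and associative, so Crunch can apply it
    // on both the map and reduce sides of the shuffle.
    PTable<String, Long> wordCounts = grouped.combineValues(Aggregators.SUM_LONGS());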

 

Finally, PCollection, PTable, and PGroupedTable all support a union operation, which takes a series of distinct PCollections that all have the same data type and treats them as a single virtual PCollection.
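
For example, two collections of the same element type can be combined into one logical collection (firstWords and secondWords are hypothetical):

    // The inputs are simply treated as a single virtual PCollection.
    PCollection<String> allWords = firstWords.union(secondWords);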

In summary, Crunch defines three abstract interfaces for distributed datasets: PCollection, PTable, and PGroupedTable.

 

1) PCollection<T> represents a distributed, immutable data set. It provides the parallelDo and union methods; parallelDo applies a DoFn to each element and returns a new PCollection<U>.

 

2) PTable<K, V> is a PCollection<Pair<K, V>> sub-interface that represents a distributed, unordered multimap. In addition to the parallelDo method inherited from PCollection, it overrides union and adds the groupByKey method, which corresponds to the sort phase of a MapReduce job. Within groupByKey, developers can exercise fine-grained control over the number of reducers and over the partitioning, grouping, and sorting strategies used during the shuffle (see the GroupingOptions class).

 

3) PGroupedTable<K, V> is the result of a groupByKey operation. It represents a distributed, sorted map whose values can be iterated over, and is effectively a PCollection<Pair<K, Iterable<V>>>. In addition to the parallelDo and union methods inherited from PCollection, it provides the combineValues method, which applies an aggregation operator that satisfies the commutative and associative laws (see Aggregator) to the values of the PGroupedTable instance on both the map and reduce sides of the shuffle.

 

 

All of the other data transformation operations supported by the Crunch APIs (aggregations, joins, sorts, secondary sorts, and cogrouping) are implemented in terms of these four primitives. The patterns themselves are defined in the org.apache.crunch.lib package and its children, and a few of the most common patterns have convenience functions defined on the PCollection and PTable interfaces.
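
For instance, an inner join between two PTables can be expressed with the Join utility class from org.apache.crunch.lib (both input tables here are hypothetical):

    import org.apache.crunch.PTable;
    import org.apache.crunch.Pair;
    import org.apache.crunch.lib.Join;

    // Inner join on the shared key type; each output value pairs the
    // matching values from the left and right tables.
    PTable<String, Pair<Long, String>> joined = Join.join(countsByWord, labelsByWord);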

 

Every Crunch data pipeline is coordinated by an instance of the Pipeline interface, which defines methods for reading data into a pipeline via Source instances and writing data out from a pipeline to Target instances. There are currently three implementations of the Pipeline interface that are available for developers to use:

 

MRPipeline: Executes the pipeline as a series of MapReduce jobs.

MemPipeline: Executes the pipeline in-memory on the client.

SparkPipeline: Executes the pipeline by converting it to a series of Spark pipelines.
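
Putting the pieces together, a minimal end-to-end word-count pipeline built on MRPipeline might look like the following sketch (the input and output paths are hypothetical, and Tokenizer is the DoFn sketched earlier):

    import org.apache.crunch.PCollection;
    import org.apache.crunch.PTable;
    import org.apache.crunch.Pipeline;
    import org.apache.crunch.impl.mr.MRPipeline;
    import org.apache.crunch.types.writable.Writables;

    public class WordCount {
      public static void main(String[] args) throws Exception {
        // MRPipeline compiles the logical plan into MapReduce jobs.
        Pipeline pipeline = new MRPipeline(WordCount.class);

        // Read data into the pipeline from a Source.
        PCollection<String> lines = pipeline.readTextFile("/tmp/input");

        // count() is a convenience method that groups elements and sums occurrences.
        PTable<String, Long> counts =
            lines.parallelDo(new Tokenizer(), Writables.strings()).count();

        // Write data out to a Target and execute the plan.
        pipeline.writeTextFile(counts, "/tmp/output");
        pipeline.done();
      }
    }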

 

Apache Crunch is an implementation of FlumeJava. It provides a way to build MapReduce pipelines for programs that are awkward to develop directly against MapReduce: it defines a data representation model, supplies both basic and higher-level primitives, and optimizes the execution of the resulting MapReduce jobs for the underlying execution engine. From the perspective of distributed computing, many of the computing primitives that Crunch provides have close counterparts in Spark, Hive, and Pig, while its implementation of data reading and writing, serialization, grouping, sorting, and aggregation mirrors the stage-by-stage structure of MapReduce jobs in Hadoop.
