Apache Beam 2.25.0 released: a unified programming model for big data batch and stream processing

Apache Beam 2.25.0 has been released. Beam is a unified programming model for defining and executing data processing pipelines, including ETL, batch processing, and stream processing. The project focuses on the programming model and interface definitions for data processing rather than implementing a specific execution engine; ideally, a pipeline written with Beam can run on any distributed computing engine.

The main feature changes in this version include the following; brief example sketches for several of the items follow the list:

  • Added support for repeatable fields in the JSON decoder of ReadFromBigQuery. (Python)
  • Added an opt-in, performance-driven runtime type checking system to the Python SDK.
  • Added support for Python 3 type annotations on PTransforms using typed PCollections.
  • Improved the Interactive Beam API: recording a streaming job now starts a long-running background recording job, and running ib.show() or ib.collect() samples from that recording.
  • In Interactive Beam, ib.show() and ib.collect() now take "n" and "duration" parameters, meaning that at most "n" elements and at most "duration" seconds of data are read from the recording.
  • Initial preview of Dataframes support.
  • Fixed type hint support for the @ptransform_fn decorator in the Python SDK. This is not enabled by default, to preserve backward compatibility; it can be enabled with the --type_check_additional=ptransform_fn flag and may become the default in a future version of Beam.
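
A minimal sketch of reading a BigQuery table whose schema contains a REPEATED field with ReadFromBigQuery; the project, dataset, table, and field names are placeholders, and the repeated field is assumed to arrive in each row dict as a Python list:

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    # The export-based read needs a GCS staging location; the bucket is a placeholder.
    options = PipelineOptions(temp_location='gs://my-bucket/tmp')

    with beam.Pipeline(options=options) as p:
        rows = p | beam.io.ReadFromBigQuery(table='my-project:my_dataset.articles')
        # Each row is a dict; a REPEATED field such as "tags" comes back as a list.
        tags = rows | beam.FlatMap(lambda row: row.get('tags', []))
        tags | beam.Map(print)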
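
A sketch of opting in to the new runtime type checking system; the flag name --performance_runtime_type_check is an assumption here (the pre-existing --runtime_type_check option is the older, exhaustive checker):

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    # Assumed opt-in flag for the performance-driven checker added in 2.25.0.
    options = PipelineOptions(['--performance_runtime_type_check'])

    with beam.Pipeline(options=options) as p:
        (p
         | beam.Create(['1', '2', '3'])
         | beam.Map(int).with_input_types(str).with_output_types(int)
         # A mismatch against the declared hints would now surface as a
         # runtime type-check error instead of an opaque downstream failure.
         | beam.Map(lambda x: x + 1))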
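
A sketch of declaring element types with Python 3 annotations on a composite transform's expand() method, using typed PCollections instead of the with_input_types/with_output_types decorators:

    from typing import Tuple

    import apache_beam as beam
    from apache_beam.pvalue import PCollection

    class CountChars(beam.PTransform):
        # Input and output element types are read directly from these annotations.
        def expand(self, pcoll: PCollection[str]) -> PCollection[Tuple[str, int]]:
            return pcoll | beam.Map(lambda word: (word, len(word)))

    with beam.Pipeline() as p:
        p | beam.Create(['apache', 'beam']) | CountChars() | beam.Map(print)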
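
A sketch of the new "n" and "duration" parameters in Interactive Beam; the pipeline itself is a placeholder, and duration is assumed to be given in seconds:

    import apache_beam as beam
    from apache_beam.runners.interactive import interactive_beam as ib
    from apache_beam.runners.interactive.interactive_runner import InteractiveRunner

    p = beam.Pipeline(InteractiveRunner())
    words = p | beam.Create(['a', 'b', 'c'])

    # Read at most 10 elements and at most 60 seconds of data from the recording,
    # whichever limit is reached first.
    ib.show(words, n=10, duration=60)
    df = ib.collect(words, n=10, duration=60)  # returns a pandas DataFrame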
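
A sketch of the Dataframes preview, converting a schema-aware PCollection to a deferred, pandas-like DataFrame and back; since this is an initial preview, the set of supported pandas operations may be limited:

    import apache_beam as beam
    from apache_beam.dataframe.convert import to_dataframe, to_pcollection

    with beam.Pipeline() as p:
        rows = p | beam.Create([
            beam.Row(word='apache', length=6),
            beam.Row(word='beam', length=4),
        ])
        df = to_dataframe(rows)       # deferred DataFrame backed by the PCollection
        filtered = df[df.length > 4]  # pandas-style operations become pipeline steps
        to_pcollection(filtered) | beam.Map(print)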
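
A sketch of a @ptransform_fn transform carrying type hints, together with the opt-in flag mentioned above; the transform and element types are illustrative:

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    @beam.ptransform_fn
    @beam.typehints.with_input_types(str)
    @beam.typehints.with_output_types(int)
    def Lengths(pcoll):
        return pcoll | beam.Map(len)

    # Opt in to the additional checking for @ptransform_fn transforms.
    options = PipelineOptions(['--type_check_additional=ptransform_fn'])

    with beam.Pipeline(options=options) as p:
        p | beam.Create(['apache', 'beam']) | Lengths() | beam.Map(print)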

For details, see the full changelog: https://github.com/apache/beam/blob/master/CHANGES.md#2250---2020-10-23

Source: www.oschina.net/news/119426/beam-2-25-0-released