Apache Beam 2.28.0 has been released. Beam is a unified programming model for defining and executing data processing pipelines, including ETL, batch processing, and stream processing. The project focuses on the programming paradigm and interface definitions for data processing and does not itself implement a specific execution engine. Ideally, a data processing program written with Beam can run on any distributed computing engine.
Update highlights
- Numerous improvements related to Parquet support (BEAM-11460, BEAM-8202, and BEAM-11526)
- Hash function in BeamSQL (BEAM-10074)
- Hash function in ZetaSQL (BEAM-11624)
- Use HLL Impl to create ApproximateDistinct (BEAM-10324)
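For readers unfamiliar with the HLL (HyperLogLog) technique named in the last highlight, the following is a minimal, self-contained sketch of the general idea in plain Python. It is not Beam's implementation and does not use any Beam API. The class name HyperLogLog, the precision parameter, and the choice of SHA-1 as the hash are all assumptions made for illustration only.

```python
import hashlib
import math


class HyperLogLog:
    """Toy HyperLogLog sketch (illustrative only, not Beam's code)."""

    def __init__(self, precision=12):
        # precision = number of hash bits used to pick a register (assumed default)
        self.p = precision
        self.m = 1 << precision            # number of registers
        self.registers = [0] * self.m

    def add(self, item):
        # Derive a 64-bit hash from the item's string form.
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        index = h >> (64 - self.p)                  # top p bits select a register
        rest = h & ((1 << (64 - self.p)) - 1)       # remaining 64 - p bits
        # Rank = 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def estimate(self):
        # Bias-correction constant; this closed form is valid for m >= 128.
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw


if __name__ == "__main__":
    hll = HyperLogLog()
    for i in range(50000):
        hll.add(f"user-{i}")
        hll.add(f"user-{i}")   # duplicates do not change the sketch
    print(round(hll.estimate()))
```

The sketch keeps only 4096 small registers regardless of input size, which is why this family of algorithms suits approximate distinct counts over large or unbounded collections. With 4096 registers the typical relative error is on the order of 1 to 2 percent.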
I/Os
- SpannerIO supports the use of BigDecimal for Numeric fields (BEAM-11643)
- Add Beam schema support to ParquetIO (BEAM-11526)
- Support ParquetTable Writer (BEAM-8202)
- GCP BigQuery sink (streaming inserts) now uses runner-determined sharding (BEAM-11408)
- PubSub now supports the types TIMESTAMP, DATE, TIME, and DATETIME (BEAM-11533)
New features/improvements
- ParquetIO adds the readGenericRecords and readFilesGenericRecords methods for reading files with unknown schemas. For details, see PR-13554 and BEAM-11460.
- Add Thrift support to KafkaTableProvider (BEAM-11482)
- Add support in HadoopFormatIO for skipping key/value cloning (BEAM-11457)
- Support conversion to GenericRecords in the Convert.to transform (BEAM-11571)
- Support reading Parquet files of unknown schema (BEAM-11460)