Inventory of big data related projects promoted to Apache TLP in 2020

In the past year, many Apache incubator projects successfully graduated to become top-level projects (Top-Level Project, TLP for short). Here is an inventory of the big-data-related projects promoted to Apache TLP in 2020. In 2020, a total of four big-data-related projects graduated as top-level projects: Apache® ShardingSphere™, Apache® Iceberg™, Apache® Hudi™ and Apache® IoTDB™. They are introduced below in order of graduation.

For the big data projects that graduated to TLP in previous years, please refer to "Inventory of Big Data Related Projects Promoted to Apache TLP in 2017", "Inventory of Big Data Related Projects Promoted to Apache TLP in 2018" and "Inventory of Big Data Related Projects Promoted to Apache TLP in 2019".

Apache ShardingSphere: an ecosystem of open source distributed database middleware solutions

Apache ShardingSphere is an ecosystem composed of a set of open source distributed database middleware solutions. It consists of three independent products, ShardingSphere-JDBC, ShardingSphere-Proxy, and ShardingSphere-Sidecar (planned), which can also be mixed and deployed together. They all provide standardized data sharding, distributed transactions, and database governance functions, and can be applied to diverse scenarios such as homogeneous Java applications, heterogeneous languages, and cloud-native environments. Its structure is as follows:

[Figure: Apache ShardingSphere architecture]
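
To make the JDBC product concrete, here is a minimal sketch of how an application might obtain a sharded DataSource with ShardingSphere-JDBC 5.x. The YAML file name, table and column names are illustrative assumptions, not taken from the article; the sharding rule itself lives in the YAML file.

```java
import java.io.File;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import javax.sql.DataSource;

import org.apache.shardingsphere.driver.api.yaml.YamlShardingSphereDataSourceFactory;

public class ShardingSphereJdbcExample {
    public static void main(String[] args) throws Exception {
        // sharding-rule.yaml (hypothetical) declares the physical data sources
        // and a sharding rule, e.g. t_order sharded by order_id across two DBs.
        DataSource dataSource = YamlShardingSphereDataSourceFactory
                .createDataSource(new File("sharding-rule.yaml"));

        // The application then uses plain JDBC; ShardingSphere routes the SQL
        // to the right physical shard(s) transparently.
        try (Connection conn = dataSource.getConnection();
             PreparedStatement ps = conn.prepareStatement(
                     "SELECT order_id, user_id FROM t_order WHERE order_id = ?")) {
            ps.setLong(1, 10L);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getLong(1) + ", " + rs.getLong(2));
                }
            }
        }
    }
}
```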

Apache ShardingSphere is positioned as relational database middleware. It aims to make full and reasonable use of the computing and storage capabilities of relational databases in distributed scenarios, rather than to implement a brand-new relational database. By focusing on what does not change, it grasps the essence of the problem: relational databases still occupy a huge market today, are the cornerstone of each company's core business, and will be difficult to displace. At this stage, the project focuses on building on this foundation rather than overturning it.

Apache ShardingSphere was initiated by JD.com with contributions from multiple companies. It is the first open source project from JD Group to enter the Apache Foundation, and the first distributed database middleware project in the Apache Foundation. The project entered the Apache incubator in November 2018 and became a top-level project of the Apache Software Foundation on April 16, 2020 [1]. For more information about Apache ShardingSphere, please refer to the official website: https://shardingsphere.apache.org/ .

Apache Iceberg: Data Lake Solution for Tracking Very Large Scale Tables

Apache Iceberg was originally designed and developed at Netflix. It is a lightweight data lake solution and a new table format for tracking very large-scale tables, designed to solve the problems of slow listing of large numbers of partitions, time-consuming metadata operations, and inconsistencies between metadata and the data on HDFS. It is specially designed for object storage (such as S3). Its core idea is to track all changes to a table on a timeline. An important concept in Iceberg is the snapshot: a snapshot represents a complete set of table data files, and each update operation generates a new snapshot.
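
A minimal sketch of the snapshot model, assuming Iceberg's Java API with a Hadoop-path-based table (the table location here is made up for illustration):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.Snapshot;
import org.apache.iceberg.Table;
import org.apache.iceberg.hadoop.HadoopTables;

public class IcebergSnapshotsExample {
    public static void main(String[] args) {
        // Load an existing Iceberg table by its storage location
        // (hypothetical path; could be HDFS or an object store like S3).
        Table table = new HadoopTables(new Configuration())
                .load("hdfs://nameservice/warehouse/db/events");

        // Every committed write produced a snapshot: a complete view of the
        // table's data files at that point on the timeline.
        for (Snapshot snapshot : table.snapshots()) {
            System.out.printf("snapshot %d at %d, operation=%s%n",
                    snapshot.snapshotId(), snapshot.timestampMillis(),
                    snapshot.operation());
        }
    }
}
```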

Apache Iceberg has the following characteristics:

  • Optimized data ingestion: Iceberg provides ACID transaction capability, so upstream data becomes visible as soon as it is written, without affecting currently running processing tasks, which greatly simplifies ETL; Iceberg also provides upsert and MERGE INTO capabilities (see the sketch after this list), which can greatly reduce data ingestion latency;

  • Support for more analysis engines: an excellent kernel abstraction keeps it from being bound to any specific computing engine. Currently, Iceberg supports Spark, Flink, Presto and Hive.

  • Unified data storage and flexible file organization: Iceberg provides both a stream-based incremental computation model and a batch-based full computation model; batch and streaming tasks can use the same storage model, so data is no longer siloed. Iceberg supports hidden partitioning and partition evolution, which makes it easy to update a table's partitioning strategy, and it supports storage formats such as Parquet, Avro, and ORC.

  • Incremental read processing capability: Iceberg supports reading incremental data in a streaming manner, with support for Structured Streaming and the Flink table source.
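
As a sketch of the MERGE INTO capability mentioned in the first bullet, here is what an upsert batch might look like through Spark SQL, assuming Spark 3 with the Iceberg runtime on the classpath; the catalog, warehouse path, database, table and column names are all hypothetical:

```java
import org.apache.spark.sql.SparkSession;

public class IcebergMergeExample {
    public static void main(String[] args) {
        // The Iceberg SQL extensions and a Hadoop catalog named "demo"
        // are configured here so MERGE INTO is available.
        SparkSession spark = SparkSession.builder()
                .appName("iceberg-merge")
                .config("spark.sql.extensions",
                        "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
                .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
                .config("spark.sql.catalog.demo.type", "hadoop")
                .config("spark.sql.catalog.demo.warehouse", "hdfs://nameservice/warehouse")
                .getOrCreate();

        // MERGE INTO applies a batch of upserts as a single ACID commit,
        // which produces one new snapshot of the target table.
        spark.sql(
            "MERGE INTO demo.db.events t " +
            "USING demo.db.events_updates s " +
            "ON t.id = s.id " +
            "WHEN MATCHED THEN UPDATE SET t.payload = s.payload " +
            "WHEN NOT MATCHED THEN INSERT *");
    }
}
```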

Apache Iceberg entered the Apache incubator on November 16, 2018, and graduated as an Apache top-level project on May 20, 2020. Curiously, I have not seen Apache officially announce it as a top-level project; this section mainly draws on https://incubator.apache.org/projects/iceberg.html .

More detailed information about Apache Iceberg can be found on its official website: https://iceberg.apache.org/

Apache Hudi: Big Data Incremental Processing Framework

Apache Hudi (Hadoop Upserts Deletes and Incrementals) was developed by Uber to solve the inefficiency of ingestion pipelines and ETL pipelines that need upsert and incremental-consumption primitives in the big data ecosystem. It is a data storage abstraction optimized for analytical scans that can apply changes to data sets on HDFS within minutes, and it supports multiple incremental processing systems. Integration with the current Hadoop ecosystem (including Apache Hive, Apache Parquet, Presto, and Apache Spark) through a custom InputFormat makes the framework seamless for end users.

Hudi's design goal is to update data sets on HDFS quickly and incrementally. It provides two ways to update data: Copy On Write and Merge On Read. In Copy On Write mode, when we update data, we first locate the files containing the affected records through an index, then read those files and rewrite them with the updated data merged in. This mode is relatively simple, but when an update touches a large amount of data it is very inefficient. In Merge On Read mode, updates are written to separate new files, which can later be merged with the original data either synchronously or asynchronously (this step is called compaction). Because an update only writes new files, this mode updates much faster.
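
A hedged sketch of writing upserts into a Merge On Read table through the Spark data source, assuming a recent Hudi release with the hudi-spark bundle on the classpath; the paths, field names and table name are illustrative:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class HudiUpsertExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("hudi-upsert")
                .getOrCreate();

        // Hypothetical batch of changed rows with columns (id, ts, ...).
        Dataset<Row> updates = spark.read().format("parquet")
                .load("/tmp/updates.parquet");

        updates.write().format("hudi")
                // MERGE_ON_READ appends updates to delta files and merges them
                // during compaction; COPY_ON_WRITE rewrites the base files.
                .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
                .option("hoodie.datasource.write.operation", "upsert")
                .option("hoodie.datasource.write.recordkey.field", "id")
                .option("hoodie.datasource.write.precombine.field", "ts")
                .option("hoodie.table.name", "events")
                .mode(SaveMode.Append)
                .save("/data/hudi/events");
    }
}
```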

With Hudi, we can collect incremental data from MySQL, HBase, and Cassandra in real time and write it to Hudi; Presto, Spark, and Hive can then quickly read the incrementally updated data, as shown below:

[Figure: Hudi ingestion and query architecture]
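
And a corresponding sketch of incrementally reading those changes back, again assuming a recent Hudi release (the begin instant time below is made up; Hudi commit times use the yyyyMMddHHmmss format):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class HudiIncrementalReadExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("hudi-incremental")
                .getOrCreate();

        // Read only the records committed after the given instant time,
        // instead of rescanning the whole table.
        Dataset<Row> changes = spark.read().format("hudi")
                .option("hoodie.datasource.query.type", "incremental")
                .option("hoodie.datasource.read.begin.instanttime", "20200601000000")
                .load("/data/hudi/events");

        changes.show();
    }
}
```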

The Apache Hudi project began development at Uber in 2016 under the codename Hoodie. It was open sourced in 2017, entered the Apache incubator in January 2019, and officially became a top-level project on June 4, 2020 [2].

For more information about Apache Hudi, please refer to "Apache Hudi: Uber's Open Source Big Data Incremental Processing Framework" and "The Evolution of Uber's Big Data Platform (2014~2019)", as well as the official Apache Hudi documentation: http://hudi.apache.org/

Apache IoTDB: Internet of Things database

Apache IoTDB (Internet of Things Database) is a software system that integrates the collection, storage, management and analysis of IoT time series data. Apache IoTDB adopts a lightweight architecture with high performance and rich functionality, and is deeply integrated with Apache Hadoop, Spark, Flink, etc., meeting the industrial IoT field's needs for massive data storage, high-speed data reads and complex data analysis.

The Apache IoTDB suite is composed of several components that together cover the pipeline of data collection, data writing, data storage, data query, data visualization and data analysis. Its structure is as follows:

[Figure: Apache IoTDB suite architecture]

Users can import time series data collected from sensors on devices, system status data such as server load and CPU/memory usage, time series data from message queues, time series data from applications, or time series data from other databases into a local or remote IoTDB instance through JDBC. Users can also write such data directly into local TsFile files (or TsFile files on HDFS). TsFile files written to HDFS can feed data processing tasks such as anomaly detection and machine learning on a Hadoop or Spark data processing platform; for TsFile files on HDFS or the local file system, the TsFile-Hadoop or TsFile-Spark connector lets Hadoop or Spark process the data, and analysis results can be written back into TsFile files. IoTDB and TsFile also provide client tools that let users view and write data in SQL, scripted, and graphical forms.
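
As an example of the JDBC path mentioned above, writing and querying one sensor reading might look like the following minimal sketch, assuming a local IoTDB instance with default credentials and an IoTDB version that uses the SET STORAGE GROUP syntax; the time series path and values are illustrative:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class IoTDBJdbcExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.iotdb.jdbc.IoTDBDriver");
        // Default local instance and default root/root credentials.
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:iotdb://127.0.0.1:6667/", "root", "root");
             Statement stmt = conn.createStatement()) {

            // Declare a storage group and a time series, then write one point.
            stmt.execute("SET STORAGE GROUP TO root.vehicle");
            stmt.execute("CREATE TIMESERIES root.vehicle.d0.s0 "
                    + "WITH DATATYPE=FLOAT, ENCODING=RLE");
            stmt.execute("INSERT INTO root.vehicle.d0(timestamp, s0) "
                    + "VALUES (1618000000000, 36.5)");

            // Query the series back; column 1 is the timestamp.
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT s0 FROM root.vehicle.d0")) {
                while (rs.next()) {
                    System.out.println(rs.getLong(1) + " -> " + rs.getFloat(2));
                }
            }
        }
    }
}
```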

Apache IoTDB is a time series database independently developed at Tsinghua University. It was launched in 2014 and officially entered the Apache incubator on November 18, 2018, becoming the first project from a Chinese university to enter the Apache incubator. It officially graduated and became an Apache top-level project on September 23, 2020 [3].

Reference link

[1] Became a top-level project of the Apache Software Foundation on April 16, 2020: https://blogs.apache.org/foundation/entry/the-apache-software-foundation-announces60
[2] Officially became a top-level project on June 4, 2020: https://blogs.apache.org/foundation/entry/the-apache-software-foundation-announces64
[3] Officially graduated and became an Apache top-level project on September 23, 2020: https://blogs.apache.org/foundation/entry/the-apache-software-foundation-announces68
