Inventory of big data projects promoted to Apache TLP or entering the Apache Incubator in 2019

Today is the last day of 2019, and tomorrow begins a new year. I wish you all a Happy New Year, and thank you for your support over the past year! For the past two years this blog has taken stock of the big data projects promoted to Apache TLP (Apache Top-Level Project) during the year; see "Inventory of Big Data-related Projects Promoted to Apache TLP in 2017" and "Inventory of Big Data-related Projects Promoted to Apache TLP in 2018". Continuing that tradition, this article reviews the big data projects promoted to Apache TLP in 2019. Only three big data projects were promoted to TLP this year, and two of them have little to do with our daily work, so this time I have also included the big data projects that entered the Apache Incubator this year. Projects are introduced in order of their graduation from the incubator; together with the projects that entered incubation this year, there are six in total, as follows.

Apache Airflow: Open source distributed task scheduling framework

Apache Airflow is a flexible and extensible workflow automation and scheduling system used to manage big data processing pipelines of hundreds of petabytes. The project was originally developed at Airbnb in 2014 and entered the Apache Incubator in March 2016. On January 8, 2019, the Apache Software Foundation officially announced that Airflow had become a top-level Apache project.

Apache Airflow assembles tasks and their upstream/downstream dependencies into a directed acyclic graph (DAG), and the Airflow scheduler runs these tasks on a group of workers according to those dependencies. Airflow can interact with data sources such as Hive, Presto, MySQL, HDFS, and Postgres, and provides hooks that make it highly extensible. Besides the command line, the tool also provides a web UI to visualize dependencies, monitor progress, and trigger tasks.
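As a concrete illustration of the DAG model, here is a minimal sketch of a DAG definition, assuming Airflow 1.10.x (the series current in 2019); the task commands, schedule, and names are placeholders for the example:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# Default settings applied to every task in this DAG.
default_args = {
    "owner": "airflow",
    "start_date": datetime(2019, 1, 1),
    "retries": 1,
}

# The DAG itself: run the pipeline once a day.
dag = DAG(
    dag_id="example_etl",
    default_args=default_args,
    schedule_interval="@daily",
)

extract = BashOperator(task_id="extract", bash_command="echo extract", dag=dag)
transform = BashOperator(task_id="transform", bash_command="echo transform", dag=dag)
load = BashOperator(task_id="load", bash_command="echo load", dag=dag)

# Declare upstream/downstream dependencies; the scheduler walks this DAG.
extract >> transform >> load
```

Placing a file like this in the DAGs folder is enough for the scheduler to pick it up and run the three tasks in order on the configured workers.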

The Apache Airflow system consists of many services with different roles, and its architecture is as follows:

[Figure: Apache Airflow architecture]

As can be seen, the Apache Airflow system consists of six parts: a metadata database, a web server, a scheduler, the Celery executor, a message broker, and Celery workers. Their roles are as follows:

•Metadata database: stores task state information; MySQL is a common choice;
•Airflow web server: the web interface that queries the metadata to monitor and trigger DAGs;
•Scheduler: reads task state from the metadata database and determines which tasks need to run and in what priority order; the scheduler usually runs as a long-lived service;
•Executor: bound to the scheduler, the executor determines the worker processes that actually run each scheduled task. There are different types of executors, each using a class that specifies which worker processes perform the tasks. For example, LocalExecutor executes tasks with parallel processes running on the same machine as the scheduler, while executors such as CeleryExecutor execute tasks on worker processes spread across a separate cluster of worker machines;
•Message broker: queues the commands for tasks that need to run, e.g. RabbitMQ;
•Celery workers: the processes that actually execute the task logic; a worker retrieves a command from the queue, runs it, and then updates the metadata.

More information about Apache Airflow can be found at http://airflow.apache.org/.

Apache Rya: cloud-based big data triple store

Apache Rya is a cloud-based triple store (subject-predicate-object) database for big data that answers queries with millisecond response times. The project was developed by the Laboratory for Telecommunication Sciences (University of Maryland, USA) and was submitted to the Apache Incubator in September 2015. On September 24, 2019, the Apache Software Foundation officially announced that Rya had become a top-level Apache project.

Apache Rya is mainly used to store RDF data. RDF (Resource Description Framework) is essentially a data model: it provides a unified standard for describing entities/resources; simply put, it is a way of expressing things. RDF is formally expressed as SPO (subject-predicate-object) triples, sometimes called statements; in a knowledge graph, each triple is also called a piece of knowledge.

Apache Rya is a scalable RDF data management system built on top of Apache Accumulo; a MongoDB back end is also implemented. Rya uses novel storage methods, indexing schemes, and query processing techniques that scale to billions of triples across multiple nodes. It provides fast and easy access to the data through SPARQL, the standard query language for RDF data.
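As a sketch of what querying such a store over SPARQL can look like from Python, the snippet below uses the third-party SPARQLWrapper library against a SPARQL protocol endpoint; the endpoint URL and the triple pattern are hypothetical and only for illustration:

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Hypothetical SPARQL endpoint exposed by a Rya web deployment.
sparql = SPARQLWrapper("http://localhost:8080/web.rya/queryrdf")

# A simple SPO pattern: find every predicate/object attached to one subject.
sparql.setQuery("""
    SELECT ?predicate ?object
    WHERE {
        <http://example.org/people/alice> ?predicate ?object .
    }
    LIMIT 10
""")
sparql.setReturnFormat(JSON)

results = sparql.query().convert()
for row in results["results"]["bindings"]:
    print(row["predicate"]["value"], row["object"]["value"])
```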

More information about Apache Rya can be found at http://rya.apache.org/.

Apache SINGA: a general-purpose distributed deep learning platform

Apache SINGA is a general-purpose distributed deep learning platform for training large deep learning models on large-scale datasets. The project was originally developed at the National University of Singapore in 2014 and was submitted to the Apache Incubator in March 2015. On October 16, 2019, the Apache Software Foundation officially announced that SINGA had become a top-level Apache project.

The design of Apache SINGA is based on an intuitive programming model, namely the abstraction of deep learning layers. Apache SINGA supports most deep learning models, including Convolutional Neural Networks (CNN), Restricted Boltzmann Machines (RBM), and Recurrent Neural Networks (RNN), and provides many built-in layers that users can use directly. Apache SINGA has a flexible architecture that supports synchronous, asynchronous, and hybrid training. To train deep learning models in parallel, SINGA supports different neural network partitioning schemes: batch-dimension partitioning, feature-dimension partitioning, and multi-dimensional hybrid partitioning.

As a distributed system, the primary goal of Apache SINGA is good scalability; in other words, SINGA aims to reduce model training time by using more computing resources (i.e., machines) while maintaining a given level of accuracy.

Another goal of Apache SINGA is ease of use. It is difficult for programmers to develop and train models with deep and complex structures, and distributed training further increases their burden with concerns such as data and model partitioning and network communication. It is therefore important to provide an easy-to-use programming model that lets programmers implement their own deep learning models and algorithms without having to think about the underlying distributed platform.

More information about Apache SINGA can be found at http://singa.apache.org/

Apache Hudi: Big Data Incremental Processing Framework (Incubating)

Apache Hudi (Hoodie) was developed by Uber to address the inefficiency of ingestion and ETL pipelines that need upsert and incremental-consumption primitives in the big data ecosystem. Development began in 2016, the project was open sourced in 2017, and it entered the Apache Incubator in January 2019.

Hudi (Hadoop Upserts Deletes and Incrementals) is a data storage abstraction optimized for analytical scans. It can apply changes to datasets on HDFS within minutes and supports multiple incremental processing systems consuming the data. Through a custom InputFormat it integrates with the current Hadoop ecosystem (including Apache Hive, Apache Parquet, Presto, and Apache Spark), making the framework seamless for end users.

Hudi's design goal is to update datasets on HDFS quickly and incrementally. It provides two ways to apply updates: Copy On Write and Merge On Read. In Copy On Write mode, an update first locates the affected files through an index, then reads those files and rewrites them with the updated data merged in; this mode is simple, but when an update touches a large amount of data it is inefficient. In Merge On Read mode, updates are written to separate new files, which can later be merged with the base data either synchronously or asynchronously (compaction); because an update only writes new files, this mode is faster to write.
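For a sense of how this is used from Spark, here is a minimal PySpark write sketch. The datasource option names follow recent Hudi documentation and may differ in older releases, and the table name, key fields, and path are invented for the example:

```python
from pyspark.sql import SparkSession

# Assumes Spark was launched with the Hudi datasource bundle on the classpath.
spark = SparkSession.builder.appName("hudi-upsert-example").getOrCreate()

# Hypothetical incremental batch of records to upsert into the table.
updates = spark.createDataFrame(
    [(1, "alice", "2019-12-31 10:00:00"), (2, "bob", "2019-12-31 10:05:00")],
    ["id", "name", "ts"],
)

(updates.write.format("org.apache.hudi")
    .option("hoodie.table.name", "user_events")
    # Record key used by the index to locate the files affected by an update.
    .option("hoodie.datasource.write.recordkey.field", "id")
    # Field used to pick the latest version when two records collide.
    .option("hoodie.datasource.write.precombine.field", "ts")
    .option("hoodie.datasource.write.operation", "upsert")
    # Choose MERGE_ON_READ or COPY_ON_WRITE, as discussed above.
    .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
    .mode("append")
    .save("hdfs:///data/hudi/user_events"))
```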

With Hudi, we can collect incremental data from MySQL, HBase, and Cassandra in near real time and write it into Hudi; Presto, Spark, and Hive can then quickly read the incrementally updated data, as follows:

[Figure: incremental data flows from upstream stores into Hudi and is then queried by Presto, Spark, and Hive]

For more information about Apache Hudi, please refer to "Apache Hudi: Uber's Open Source Big Data Incremental Processing Framework" and "The Evolution of Uber's Big Data Platform (2014–2019)", as well as the official Apache Hudi documentation: http://hudi.apache.org/.

Apache DolphinScheduler: Distributed workflow task scheduling system (incubating)

Apache DolphinScheduler is a distributed, extensible, visual DAG workflow task scheduling system dedicated to solving the intricate dependencies in data processing, so that the scheduler can be used out of the box for data pipelines. Apache DolphinScheduler is a distributed big data scheduling system developed independently by Analysys; it was formerly known as EasyScheduler, was officially open sourced on March 28, 2019, and entered the Apache Incubator on August 29, 2019. The main goals of Apache DolphinScheduler are as follows:

•Associates tasks according to their dependencies in the form of a DAG and visually monitors task status in real time; supports rich task types: Shell, MR, Spark, SQL (MySQL, PostgreSQL, Hive, Spark SQL), Python, Sub_Process, Procedure, etc.
•Supports scheduled, dependency-based, and manual workflow scheduling, manual pause/stop/resume, as well as failure retries/alerts, recovery from a designated failed node, killing tasks, etc.
•Supports workflow priority, task priority, task failover, and task timeout alerting/failure handling
•Supports workflow-level global parameters and node-level custom parameter settings
•Supports online upload/download and management of resource files, and online file creation and editing
•Supports online viewing and tailing of task logs, online log download, etc.
•Implements cluster HA: the Master cluster and Worker cluster are decentralized via ZooKeeper
•Supports online viewing of Master/Worker CPU load and memory usage
•Supports display of workflow run history as a tree/Gantt chart, with task status and process status statistics
•Supports backfilling (re-running workflows for historical dates)
•Supports multi-tenancy
•Supports internationalization
•And more features waiting to be explored

The architecture of Apache DolphinScheduler is as follows:

[Figure: Apache DolphinScheduler architecture]

It consists of the following modules:

•MasterServer: MasterServer adopts a decentralized design. It is mainly responsible for splitting DAGs into tasks, submitting and monitoring tasks, and monitoring the health of the other MasterServer and WorkerServer nodes. When the MasterServer service starts, it registers a temporary node with ZooKeeper, and fault tolerance is handled by watching changes to these ZooKeeper temporary nodes.
•WorkerServer: WorkerServer also adopts a decentralized design. It is mainly responsible for executing tasks and providing log services. When the WorkerServer service starts, it registers a temporary node with ZooKeeper and maintains a heartbeat.
•ZooKeeper: The MasterServer and WorkerServer nodes use ZooKeeper for cluster management and fault tolerance. The system also performs event monitoring and distributed locking based on ZooKeeper.
•Task Queue: Provides the task queue operations; the queue is currently also implemented on ZooKeeper. Since the information stored in the queue is small, there is no need to worry about the queue holding too much data; stress tests with millions of queued entries showed no impact on system stability or performance.
•Alert: Provides alerting-related interfaces, mainly the storage, query, and notification of two types of alert data. Two notification channels exist: email and SNMP (the latter not yet implemented).
•API: The API layer, mainly responsible for handling requests from the front-end UI. The service exposes a unified RESTful API, covering workflow creation, definition, query, modification, release, taking offline, manual start, stop, pause, resume, starting execution from a given node, and so on.
•UI: The front-end pages of the system, providing its various visual operation interfaces.

As can be seen, Apache DolphinScheduler is similar in function to Apache Airflow, introduced earlier. For the detailed differences between the two, please refer to https://dolphinscheduler.apache.org/; I will not go into them here.

Apache TubeMQ: Distributed messaging middleware system (incubating)

TubeMQ is a distributed messaging middleware system developed by Tencent starting in 2013, focused on high-performance storage and transmission of massive data in big data scenarios. After nearly seven years of production hardening at trillion-message scale, it currently handles more than 25 trillion messages per day on average. Compared with other popular open-source MQ components, TubeMQ's core advantages are production-proven stability and performance at massive scale, together with low cost. The project officially entered the Apache Incubator on November 3, 2019.

The architecture of TubeMQ is inspired by Apache Kafka, but its implementation adapts and optimizes many aspects based on production experience, such as partition management, the assignment mechanism, a new node-communication flow, and a self-developed high-performance low-level RPC module. These let TubeMQ achieve good robustness and higher throughput while preserving real-time delivery and consistency.

TubeMQ is well suited to scenarios with high concurrency and massive data volumes where a small amount of data loss is tolerable under abnormal conditions, such as massive log collection, metric statistics, and monitoring. It is not suitable for scenarios that require very strict data reliability.

Like other message queue systems, TubeMQ is built on the publish-subscribe model: producers publish messages to topics, and consumers subscribe to those topics; after processing a message, a consumer sends back an acknowledgement. The overall structure of TubeMQ is as follows:
[Figure: Apache TubeMQ architecture]

As can be seen from the above figure, TubeMQ consists of five modules in total:

•Portal: the part responsible for external interaction and operations/maintenance, including the API and the Web. The API connects to management systems outside the cluster, and the Web is a page-level wrapper around the API for day-to-day operations and maintenance;
•Master: the control part of the cluster, composed of one or more Master nodes. Master HA is achieved through heartbeat keep-alive and real-time hot-standby switchover between Master nodes. The active Master is responsible for managing the state of the entire cluster, resource scheduling, permission checks, metadata queries, and so on;
•Broker: the storage part responsible for actual data storage, composed of independent Broker nodes. Each Broker manages the set of topics on that node, including topic creation, deletion, modification, and query, as well as message storage, consumption, aging, partition expansion, and recording of consumption offsets. The cluster's external capacity, including the number of topics, throughput, and storage, is scaled out by adding Broker nodes horizontally;
•Client: the part responsible for data production and consumption, provided externally as a library. The consumer side is the most heavily used: it now supports both Push and Pull modes of data retrieval, and consumption can be either sequential or filtered. In Pull mode, the business can reset a precise offset through the client to support exactly-once consumption, and a new BidConsumer client allows cross-cluster switching without a restart;
•ZooKeeper: the part responsible for offset storage. Its role has been reduced to persisting offsets only; with the planned multi-node replication feature in mind, this module is retained for now.
For more information about Apache TubeMQ, please see https://github.com/Tencent/TubeMQ
