The message queue Pulsar, the storage system BookKeeper, and the SQL query engine Presto

The most obvious difference between Pulsar and Kafka is that Pulsar supports multi-tenancy, with the concepts of tenants and namespaces. A tenant represents an organizational unit in the system. Assuming a Pulsar cluster is used to support multiple applications (as at Yahoo), each tenant in the cluster can represent an organizational team, a core function, or a product line. A tenant can contain multiple namespaces, and a namespace can contain any number of topics.
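These layers show up directly in Pulsar's topic naming scheme, `persistent://tenant/namespace/topic`. A minimal sketch in Python of composing and parsing such names, assuming the standard four-part format (the helper function names here are illustrative, not part of any Pulsar client API):

```python
# Sketch: compose and parse Pulsar topic names of the form
#   persistent://<tenant>/<namespace>/<topic>
# Helper names are illustrative, not part of the Pulsar client library.

def make_topic(tenant: str, namespace: str, topic: str,
               persistent: bool = True) -> str:
    """Build a fully qualified Pulsar topic name."""
    scheme = "persistent" if persistent else "non-persistent"
    return f"{scheme}://{tenant}/{namespace}/{topic}"

def parse_topic(name: str) -> dict:
    """Split a fully qualified topic name into its parts."""
    scheme, rest = name.split("://", 1)
    tenant, namespace, topic = rest.split("/", 2)
    return {"scheme": scheme, "tenant": tenant,
            "namespace": namespace, "topic": topic}

full = make_topic("ads-team", "clicks", "raw-events")
print(full)                              # persistent://ads-team/clicks/raw-events
print(parse_topic(full)["namespace"])    # clicks
```

Kafka, by contrast, has only the flat topic name; the tenant and namespace levels are what make per-team isolation and policies possible in Pulsar.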

 

Partitioning: both Pulsar and Kafka support multi-partitioned topics.

Persistence: Kafka stores data in files on the brokers themselves, while Pulsar uses Apache BookKeeper for storage; this is also a significant advantage of Pulsar. Kafka's data files are distributed across the brokers in the cluster; once a broker goes down or a new broker joins, a replica leader election or partition rebalancing is triggered, which costs the cluster performance. Pulsar's brokers are stateless: data lives in BookKeeper, so serving and storage are separated. Pulsar therefore does not face this problem, and the cluster can be resized freely. In addition, a Kafka broker must worry about its data exceeding its own disk capacity; Pulsar has no such problem.

Starting with version 2.2, Pulsar introduces SQL, making it convenient to query and analyze data stored in Pulsar. Pulsar SQL leverages Presto to provide efficient, scalable queries. This efficiency is largely due to Apache BookKeeper, Pulsar's underlying storage system.

 

"Pulsar (challenge kafka new generation message system) official document translation - entry and actual combat - build a local independent cluster environment

"Pulsar (Challenge Kafka New Generation Messaging System) Official Document Translation-Must-see for Beginners-Concept and Architecture-(2) Messaging Concepts"

Apache BookKeeper is a distributed, scalable, fault-tolerant (multi-replica), low-latency storage system that provides high-performance, high-throughput storage. BookKeeper implements writes in append-only mode.

BookKeeper has a very successful application: Apache Pulsar, an MQ open-sourced by Yahoo in recent years. Compared with Kafka, Pulsar has an advantage in storage. The storage capacity of a single Kafka partition is limited by the disk capacity of the broker hosting it; when a large volume of data must pass through the MQ, a partition can hit this bottleneck and cannot be expanded. Of course, the number of partitions and brokers can be increased in advance to meet the storage requirement, but when messages must be retained for a long time, say one month for monthly batch jobs that pull the data back for computation, this scenario is very wasteful for a Kafka cluster, because what is needed is more storage, not more broker capacity. BookKeeper gives Pulsar an architecture that separates storage from computing: storage can be scaled independently of Pulsar's brokers, which Kafka cannot do.

basic concepts

  • Entry: an entry is a record stored in BookKeeper

  • Ledger: a ledger stores entries; a sequence of entries makes up a ledger

  • Bookie: a bookie is a BookKeeper storage server that stores ledgers. Strictly speaking, it stores fragments of ledgers, because storage is distributed: each ledger is stored on multiple bookies.

  • Metadata storage: stores bookie-related metadata, such as which ledgers live on which bookies. BookKeeper currently uses ZooKeeper for this, so a ZooKeeper cluster must exist before BookKeeper is deployed

  • Data storage files and caches:

  • Journal: in effect BookKeeper's WAL (write-ahead log), storing BookKeeper's transaction log. A journal file has a maximum size; once it is reached, a new journal file is created

  • Entry log: the file that stores entries. A ledger is a logical concept: entries from different ledgers are first grouped by ledger and then written to the entry log file. Likewise, an entry log has a maximum size, and a new entry log file is created once it is reached

  • Index file: a ledger's index file. Entries are written into entry log files; the index file locates each ledger's data inside those entry log files, recording the storage offset and length of each ledger's data in the entry log

  • Ledger cache: caches index files in memory to speed up lookups

  • Data storage: a LastLogMark is kept in memory, containing txnLogId (the id of a journal file) and txnLogPos (a position within that journal file). Entry log files and index files are first cached in memory; when memory reaches a threshold, or a scheduled thread fires a certain interval after the last flush, the entry log and index files are flushed to disk and the LastLogMark is then persisted. Once the LastLogMark is persisted, every entry and index record before it is guaranteed to be on disk, so journal files older than the LastLogMark can be cleared. If a crash happens before the LastLogMark is persisted, the data can be recovered from the journal, so no data is lost.
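The flush/checkpoint cycle above can be sketched as a toy model. This is illustrative Python, not BookKeeper code: journals are plain in-memory lists and "disk" is simulated, just to show why journals older than the persisted LastLogMark become deletable:

```python
# Toy model of BookKeeper's flush/checkpoint cycle (illustrative only).
# Entries go to the journal (WAL) first and are cached in memory; a flush
# persists the entry log + index, then the LastLogMark, after which older
# journal files can be removed.

class ToyBookie:
    def __init__(self):
        self.journals = {0: []}          # txnLogId -> appended entries
        self.current_journal = 0
        self.mem_entries = []            # cached, not yet flushed
        self.entry_log = []              # simulated on-disk entry log
        self.last_log_mark = (0, 0)      # (txnLogId, txnLogPos), persisted

    def add_entry(self, entry):
        self.journals[self.current_journal].append(entry)  # WAL write
        self.mem_entries.append(entry)                     # in-memory cache

    def roll_journal(self):
        """Journal reached its max size: start a new journal file."""
        self.current_journal += 1
        self.journals[self.current_journal] = []

    def flush(self):
        # 1. persist cached entry log / index data
        self.entry_log.extend(self.mem_entries)
        self.mem_entries = []
        # 2. persist the LastLogMark at the current journal position
        pos = len(self.journals[self.current_journal])
        self.last_log_mark = (self.current_journal, pos)
        # 3. journals entirely before the mark are now safe to delete
        for jid in list(self.journals):
            if jid < self.last_log_mark[0]:
                del self.journals[jid]

b = ToyBookie()
b.add_entry("e1"); b.add_entry("e2")
b.roll_journal()
b.add_entry("e3")
b.flush()
print(sorted(b.journals))    # [1]  -- journal 0 was reclaimed
print(b.entry_log)           # ['e1', 'e2', 'e3']
```

A crash before `flush()` completes loses nothing: the entries are still replayable from the surviving journal files.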

  • Data compaction: the merging of data, somewhat like HBase's compaction process. On a bookie, although entries are grouped by ledger before being flushed, data keeps arriving, so each ledger's data ends up spread across many entry log files. A garbage-collection thread on the bookie deletes entry log files that are no longer associated with any ledger, in order to reclaim disk space. Compaction handles the case where only a few entries in an entry log still belong to live ledgers: such a file should not keep occupying disk space, so the garbage-collection thread copies the live entries into a new entry log file (updating the index at the same time) and then deletes the old file. Like HBase, BookKeeper has two kinds of compaction:

  • Minor compaction: runs when valid entries make up less than 20% of an entry log

  • Major compaction: runs when valid entries make up less than 80% of an entry log
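A sketch of that threshold logic in Python. The 20%/80% ratios follow the text above; in real BookKeeper the two compactions run on different schedules and a file under 20% valid data qualifies for both, so prioritizing "minor" here is a simplification (the function name is mine):

```python
# Decide which compaction, if any, an entry log file qualifies for,
# based on the fraction of its bytes still referenced by live ledgers.
# Thresholds mirror the ratios described above (20% / 80%); this is a
# simplified single decision, not BookKeeper's actual scheduler.

MINOR_THRESHOLD = 0.2
MAJOR_THRESHOLD = 0.8

def compaction_kind(valid_bytes: int, total_bytes: int) -> str:
    ratio = valid_bytes / total_bytes
    if ratio < MINOR_THRESHOLD:
        return "minor"
    if ratio < MAJOR_THRESHOLD:
        return "major"
    return "none"   # mostly live data: leave the file alone

print(compaction_kind(10, 100))   # minor
print(compaction_kind(50, 100))   # major
print(compaction_kind(90, 100))   # none
```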

APIs provided

BookKeeper provides two levels of API:

  • Ledger API: operates directly on ledgers; relatively complex, this is BookKeeper's low-level API
  • DistributedLog: a distributed log, a higher-level API built on the ledger API that is simpler and easier to use

Distributed Log Architecture:

 

Logs written through the DistributedLog API are stored on BookKeeper in the same order they were written:

 

Applicable scenarios for BookKeeper

  • WAL: BookKeeper can be used as a WAL solution
  • Stream storage: for example, Pulsar stores its messages in BookKeeper
  • Object/Blob storage

 

Presto is a distributed SQL query engine open-sourced by Facebook, suited to interactive analytical queries over data volumes ranging from gigabytes to petabytes. Presto's architecture evolved from that of relational databases. Presto stands out among in-memory compute engines for the following reasons:

  1. A clean architecture: Presto runs as an independent system and does not depend on any external system. For scheduling, for example, Presto itself monitors the cluster and can schedule work based on that monitoring information.
  2. A simple data model: columnar storage with logical rows; most data can easily be converted into the structures Presto requires.
  3. A rich plugin interface: for integrating with external storage systems or adding custom functions.

This article introduces Presto from the outside in.

architecture

Presto adopts a typical master-slave model:

  1. The coordinator (master) is responsible for metadata management, worker management, and query parsing and scheduling.
  2. Workers are responsible for computation and reading and writing data.
  3. The discovery server, usually embedded in the coordinator node but deployable separately, handles node heartbeats. In what follows, assume discovery and the coordinator share a machine.

In a worker's configuration, one of the following can be chosen:

  1. The ip:port of the discovery server.
  2. An HTTP address whose content is the service inventory, including the discovery address.
  3. A local file path, with content such as:
{
    "environment": "production",
    "services": [
        {
            "id": "ffffffff-ffff-ffff-ffff-ffffffffffff",
            "type": "discovery",
            "location": "/ffffffff-ffff-ffff-ffff-ffffffffffff",
            "pool": "general",
            "state": "RUNNING",
            "properties": {
                "http": "http://192.168.1.1:8080"
            }
        }
    ]
}

Options 2 and 3 are both based on the service inventory. The worker dynamically watches the inventory; when it changes, the worker loads the latest configuration and points at the latest discovery node.
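A worker resolving the discovery address from such a service inventory amounts to simple JSON parsing. A sketch, assuming the inventory layout shown above (no Presto APIs involved):

```python
import json

# Extract the discovery server's HTTP address from a service-inventory
# document shaped like the example above. Pure parsing, illustrative only.

INVENTORY = """
{
    "environment": "production",
    "services": [
        {
            "id": "ffffffff-ffff-ffff-ffff-ffffffffffff",
            "type": "discovery",
            "pool": "general",
            "state": "RUNNING",
            "properties": {"http": "http://192.168.1.1:8080"}
        }
    ]
}
"""

def discovery_url(inventory_json: str) -> str:
    doc = json.loads(inventory_json)
    for svc in doc["services"]:
        if svc["type"] == "discovery" and svc["state"] == "RUNNING":
            return svc["properties"]["http"]
    raise LookupError("no running discovery service in inventory")

print(discovery_url(INVENTORY))  # http://192.168.1.1:8080
```

Because the worker re-reads this file on change, a monitoring program only has to rewrite the `http` property to repoint every worker at a standby discovery node.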

By design, both discovery and the coordinator are single nodes. If multiple coordinators are alive at the same time, each worker will report its process and task status to a random one of them, producing split brain, and query scheduling may deadlock.

Availability design for discovery and the coordinator: because the service inventory is used, a monitoring program can, on detecting that discovery is down, rewrite the service inventory to point at a standby discovery node and switch over seamlessly. The coordinator's configuration must be fixed when the process starts, and multiple coordinators cannot be alive in the same cluster, so the best approach is to co-locate the coordinator and discovery on one machine and deploy a standby discovery and coordinator on a secondary machine. In normal operation the secondary machine forms a cluster containing only itself; when the primary machine goes down, the workers' heartbeats switch to the secondary machine instantly.


data model

Presto adopts a three-tier table structure:

  1. A catalog corresponds to a class of data source, such as Hive data or MySQL data
  2. A schema corresponds to a database in MySQL
  3. A table corresponds to a table in MySQL

Presto storage units include:

  1. Page: a collection of rows spanning multiple columns. It exposes logical rows but is physically stored column by column.
  2. Block: one column of data. Different encodings are typically used for different data types; understanding these encodings helps when connecting your own storage system to Presto.

Different types of blocks:

  1. Array block, used for fixed-width types such as int, long, and double. The block consists of two parts:
  • boolean valueIsNull[]: whether each row has a value.
  • T values[]: the value of each row.

2. Variable-width block, used for string data; it consists of three parts:

    • Slice: the data of all rows concatenated into one buffer.
    • int offsets[]: the starting offset of each row's data. A row's length equals the next row's starting offset minus its own starting offset.
    • boolean valueIsNull[]: whether each row has a value. A row with no value has the same offset as the previous row.

3. Fixed-width string block: the data of all rows is concatenated into one long Slice, and every row has the same fixed length.

4. Dictionary block: for columns with few distinct values, dictionary encoding saves space. It has two main parts:

    • The dictionary, which can be a block of any type (dictionary blocks can even be nested); each entry in it is sorted and numbered sequentially.
    • int ids[]: for each row, the dictionary number of its value. A lookup first finds the row's id, then fetches the real value from the dictionary.
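The variable-width and dictionary layouts above can be demonstrated with plain Python lists. This is a sketch of the encoding idea only, not Presto's actual Slice/Block classes:

```python
# Sketch of two Presto block encodings using plain Python structures.

# Variable-width (string) block: one concatenated buffer plus offsets.
def encode_varwidth(rows):
    slice_buf, offsets, is_null = "", [], []
    for row in rows:
        offsets.append(len(slice_buf))   # a null row repeats the offset
        is_null.append(row is None)
        if row is not None:
            slice_buf += row
    offsets.append(len(slice_buf))       # end sentinel for the last row
    return slice_buf, offsets, is_null

def varwidth_get(slice_buf, offsets, is_null, i):
    if is_null[i]:
        return None
    # row length = next row's start offset minus this row's start offset
    return slice_buf[offsets[i]:offsets[i + 1]]

# Dictionary block: sorted distinct values plus per-row ids.
def encode_dictionary(rows):
    dictionary = sorted(set(rows))
    ids = [dictionary.index(r) for r in rows]
    return dictionary, ids

buf, offs, nulls = encode_varwidth(["ab", None, "cde"])
print(varwidth_get(buf, offs, nulls, 2))   # cde

d, ids = encode_dictionary(["us", "cn", "us", "us"])
print(d, ids)   # ['cn', 'us'] [1, 0, 1, 1]
```

Note how the dictionary version stores each distinct string once, which is exactly why it pays off for low-cardinality columns.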

plug-in

After understanding Presto's data model, you can write a plugin that connects Presto to your own storage system. Presto provides a set of connector interfaces for reading metadata and columnar data from custom storage. First, the basic connector concepts:

  1. ConnectorMetadata: manages table metadata, partitions, and related information. When processing a request, meta information is needed to locate the data to read; Presto passes in filter conditions to narrow the range of data read. Meta information can be read from disk or cached in memory.
  2. ConnectorSplit: the collection of data processed by one I/O task, and the unit of scheduling. A split can correspond to one partition or to several.
  3. SplitManager: constructs splits from a table's metadata.
  4. SlsPageSource: based on the split information and the columns to be read, reads zero or more pages from disk for the compute engine.
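How a SplitManager might cut a partitioned table into schedulable splits can be sketched like this. It is a toy grouping with hypothetical names, not Presto's SPI:

```python
# Toy SplitManager: group a table's partitions into splits, the unit of
# scheduling. As noted above, a split may cover one partition or several.

def make_splits(partitions, partitions_per_split=2):
    splits = []
    for i in range(0, len(partitions), partitions_per_split):
        splits.append(partitions[i:i + partitions_per_split])
    return splits

parts = ["p_2020_01", "p_2020_02", "p_2020_03", "p_2020_04", "p_2020_05"]
print(make_splits(parts))
# [['p_2020_01', 'p_2020_02'], ['p_2020_03', 'p_2020_04'], ['p_2020_05']]
```

Each resulting split would then be handed to a worker, whose page source reads the covered partitions column by column.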

Plugins help developers add these features:

  1. Connect to your own storage system.
  2. Add custom data types.
  3. Add custom handlers.
  4. Custom permission control.
  5. Custom resource controls.
  6. Add query event processing logic.

Presto ships a simple connector, the local file connector, which can serve as a reference for implementing your own. Note, however, that the local file connector traverses data by cursor, that is, row by row, rather than by page. The Hive connector implements three formats: a Parquet connector, an ORC connector, and an RCFile connector.

The above introduced some of Presto's principles from a macro perspective. The next few articles go deep inside Presto to examine its internal design, which is of great use for performance tuning and also helps when adding custom operators.


memory management

Presto is an in-memory compute engine, so memory management must be fine-grained to keep queries running in an orderly, smooth fashion; otherwise starvation or deadlock can occur.

memory pool

Presto uses logical memory pools to manage different types of memory requirements.

Presto divides memory into three pools: the System Pool, the Reserved Pool, and the General Pool.

  1. The System Pool is reserved for system use; by default 40% of memory is set aside for it.
  2. The Reserved Pool and the General Pool are used to allocate query runtime memory.
  3. Most queries use the General Pool. The single largest query uses the Reserved Pool, so the Reserved Pool's size equals the maximum memory a single query may use on a machine, by default 10% of the space.
  4. The General Pool gets all remaining memory beyond the System Pool and the Reserved Pool.
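With the default ratios above, the split of a machine's memory is simple arithmetic. A quick sketch (ratios as stated in the text; real Presto derives them from configuration properties):

```python
# Split total memory into the three pools using the ratios stated above:
# 40% system, 10% reserved (max per-query memory), remainder general.

def pool_sizes(total_gb, system_frac=0.40, reserved_frac=0.10):
    system = total_gb * system_frac
    reserved = total_gb * reserved_frac
    general = total_gb - system - reserved
    return {"system": system, "reserved": reserved, "general": general}

for name, gb in pool_sizes(100.0).items():
    print(f"{name}: {gb:.0f} GB")
# system: 40 GB
# reserved: 10 GB
# general: 50 GB
```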

Why memory pools?

The System Pool covers memory used by the system itself, for example the buffers maintained in memory when shuffling data between machines; that memory is accounted under the system name.

So why is a Reserved Pool needed, and why is it exactly equal to the maximum memory a single query may use on the machine?

Without a Reserved Pool, consider many queries running with memory nearly exhausted when a query with a relatively large memory requirement starts. There is no memory left for it to run in, so it stays suspended, waiting for memory to become available. But as the running small-memory queries finish, new small-memory queries arrive, and since they need little memory they easily find space. In this situation the large query hangs until it starves to death.

Therefore, to prevent this starvation, space must be reserved in which a large query can run, and the reserved space equals the maximum memory a query is allowed to use. Every second, Presto picks the query with the largest memory usage and allows it to use the Reserved Pool, so that no query is left without memory to run in.
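That per-second promotion can be sketched as follows: the coordinator scans the running queries and the single largest memory consumer is granted the Reserved Pool. A toy scheduler for illustration, not Presto's actual code:

```python
# Toy version of the coordinator's once-per-second decision: among the
# running queries, promote the one using the most memory to the Reserved
# Pool, so it cannot be starved by a stream of small queries.

def pick_reserved(queries):
    """queries maps query_id -> current memory usage in bytes."""
    if not queries:
        return None
    return max(queries, key=queries.get)

running = {"q1": 2 << 30, "q2": 48 << 30, "q3": 512 << 20}
print(pick_reserved(running))  # q2
```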

memory management

Presto's memory management has two parts:

  • Query memory management
    • A query is split into many tasks, and each task has a thread that loops to fetch the task's status, including the memory it uses; these figures are summed into the query's total memory usage.
    • If a query's aggregate memory exceeds a limit, the query is forcibly terminated.
  • Machine memory management
    • The coordinator has a thread that periodically polls each machine to check its current memory status.

Combining the query-memory and machine-memory views, the coordinator selects the query with the largest memory usage and assigns it to the Reserved Pool.

Memory management is driven by the coordinator, which decides once per second which single query may use the Reserved Pool on all machines. A natural question: if that query is not running on some machine, isn't that machine's reserved memory wasted? Why not instead pick the largest task on each machine individually? The reason is deadlock: the chosen query, enjoying reserved memory on the other machines, will finish soon; but if on some machine it is not the largest task and therefore never gets to run there, the query can never finish.


Origin blog.csdn.net/qq_35240226/article/details/107974165