Data highway: Detailed explanation of data warehouse cluster communication technology

This article is shared from the Huawei Cloud Community article "Live Broadcast Review | Data Highway: Detailed Explanation of Data Warehouse Cluster Communication Technology" by Hu Latang.

In the era of big data, cluster sizes keep growing, business concurrency keeps rising, and the communication pressure between the nodes of a database cluster increases accordingly. In this live broadcast, "Data Highway: Detailed Explanation of Data Warehouse Cluster Communication Technology", we invited Mr. Wei Deng, Huawei Cloud GaussDB(DWS) technical evangelist, to explain in depth how GaussDB(DWS) cluster communication technology implements a high-performance distributed communication system when a large-scale cluster carries highly concurrent services.

1. Overview of GaussDB (DWS) cluster communication

A GaussDB(DWS) cluster contains one or more coordinator nodes (CN), several data nodes (DN) on each host, a global transaction controller (GTM), an operation and maintenance management module (OM), a cluster management module (CM), and a data import/export module (GDS).

  • Coordinator node (CN): responsible for decomposing requests, scheduling, and returning results; performs SQL parsing and optimization; stores only metadata, not user data.
  • Data node (DN): responsible for storing the actual table data (with a specified distribution method: hash table, replication table, or round-robin table, as illustrated after this list); executes SQL tasks and returns execution results to the CN.
  • Global transaction controller (GTM): responsible for generating and maintaining globally unique information such as global transaction IDs, transaction snapshots, and timestamps.
  • Operation and maintenance management module (OM): provides daily operation, maintenance, and configuration management.
  • Cluster management module (CM): manages the cluster and monitors the physical resource usage of each unit.
  • GDS Loader: batch data loading with parallel acceleration.
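
As a hedged illustration of the three distribution methods mentioned above, table definitions might look as follows (table and column names are made up; the DISTRIBUTE BY clause is the commonly documented GaussDB(DWS) syntax, and ROUNDROBIN support may depend on the version):

-- Illustrative only: hypothetical tables showing the three distribution methods.
CREATE TABLE orders     (id int, customer_id int, amount numeric)
    DISTRIBUTE BY HASH(id);         -- hash table: rows spread across DNs by hash of id
CREATE TABLE dim_region (region_id int, name text)
    DISTRIBUTE BY REPLICATION;      -- replication table: a full copy on every DN
CREATE TABLE event_log  (ts timestamp, msg text)
    DISTRIBUTE BY ROUNDROBIN;       -- round-robin table: rows dealt out evenly across DNs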

All of the above modules communicate with each other over the cluster network. Unlike traditional database modules such as the executor, optimizer, and storage, cluster communication is unique to distributed databases, and understanding it matters greatly for optimizing cluster performance and for locating cluster problems.

The following figure gives an overview of a GaussDB(DWS) cluster (simplified for this sharing). GaussDB(DWS) is an MPP distributed database with a shared-nothing architecture: data is distributed across the DN nodes, while the CN stores no data. As the entry point that receives queries, the CN pushes the generated plan down to the DNs for parallel execution as much as possible to improve performance. When DNs perform a multi-table join, a local DN holds only part of the data, so data must be exchanged between DNs to redistribute table data or intermediate results.

Data communication process of a typical GaussDB(DWS) query (green arrows in the figure):

  • The client connects to the CN and issues the query;
  • CN connects all DNs, generates and issues execution plans;
  • Exchange table data or intermediate results between DNs through the network;
  • DN performs data processing locally and returns the result set to CN;
  • CN aggregates and processes the result set and returns it to the client.

GaussDB(DWS) cluster communication overview

2. Introduction to CN communication framework

 

1. IP and port information

The client connects to the CN through its IP address and port. The pgxc_node system table on the CN stores the IP and port information of all nodes in the cluster, which the CN uses to connect to the other nodes.

In the pgxc_node system table shown in the figure below, node_port and node_host hold the primary node's information, while node_port1 and node_host1 hold the standby's. hostis_primary indicates the primary/standby relationship: when it is 't', the CN connects to the primary first and then to the standby, and vice versa. Its value is refreshed automatically by the CM cluster management component during a primary/standby switchover.
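
For example, the node address information described above could be inspected with a query along these lines (a sketch; only columns mentioned in this article plus node_name/node_type are selected, and the exact column set may vary by version):

-- Sketch: list the primary and standby addresses of every node in the cluster.
SELECT node_name, node_type,
       node_host,  node_port,     -- primary address
       node_host1, node_port1,    -- standby address
       hostis_primary
FROM   pgxc_node
ORDER  BY node_name;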

2. Client communicates with CN

The client executes the query process:

  • The client initiates a connection to the listening port of CN;
  • CN postmaster main thread accepts the connection, creates a postgres thread and hands the connection to this thread for processing;
  • The client sends the query to the CN;
  • The CN's postgres thread delivers the query plan to other CNs/DNs, and the query results are returned to the client along the original path;
  • The client query ends and the connection is closed;
  • The corresponding postgres thread on CN is destroyed and exits.

Diagram of communication between client and CN

The process of establishing a connection between a CN and a DN is basically the same as that between the client and the CN. To reduce the overhead of establishing CN-DN connections and of creating and destroying postgres threads in the DN processes, a pooler connection pool is implemented on the CN side.

3. Pooler connection pool

The pooler connection pool holds all connections from this CN to other CN/DN processes; each connection corresponds to a postgres thread on the peer CN/DN. By reusing connections and threads, the pooler reduces the overhead of establishing connections and of creating and destroying postgres threads.

Pooler reuse process:

  • When a session needs a connection, it uses database + user as the key to find the matching pooler connection pool and preferentially takes an existing connection from it;
  • After a query ends, the CN's postgres thread does not return the connection, so it can be used by the next query of the current session;
  • After the session ends, the CN's postgres thread returns the connection to the corresponding pooler. The postgres thread on the peer DN does not exit; it waits in ReadCommand for a new CN postgres thread to issue tasks once the connection is reused.

Pooler connection pool diagram

4. Pooler view

The pg_pooler_status view records all connection information in the pooler connection pool. As shown in the figure below, each row represents a connection initiated by this CN and corresponds to a postgres thread of the peer process. in_use = 't' means the connection is currently used by a thread, while 'f' means it is idle and waiting for reuse. The tid column is the ID of the thread on this CN that holds the connection, node_name is the peer node, and remote_pid is the peer thread ID. When query_id is 0 or the query IDs on the CN and DN are inconsistent, the CN-DN connection relationship can still be traced through this pooler view.
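
A hedged sketch of using the view to map local CN threads to their peer threads (only columns described above are used; additional columns exist in practice):

-- Sketch: show which CN thread (tid) holds a connection to which peer thread (remote_pid).
SELECT node_name,      -- peer node
       tid,            -- local CN thread holding the connection
       remote_pid,     -- peer postgres thread on the other CN/DN
       in_use          -- 't' = busy, 'f' = idle and waiting for reuse
FROM   pg_pooler_status
ORDER  BY node_name, in_use DESC;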

5. Pooler connection cleaning

There are two connection cleaning mechanisms: one for connections held by a session and one for idle connections in the pooler's connection pool.

Connections held by a session (a usage sketch follows this list):

  • cache_connection: whether to use the pooler connection pool to cache connections;
  • session_timeout: an idle client connection reports an error after this timeout, exits, and returns its connection;
  • enable_force_reuse_connections: forces connections to be returned when the transaction ends;
  • conn_recycle_timeout (2.1): returns connections after an idle CN session times out.
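
A minimal sketch of adjusting these parameters at the session level. The parameter names come from this article; the value formats and the assumption that they are settable with a plain SET are mine and may not hold for every version:

SET cache_connection = on;                -- keep using the pooler to cache connections
SET session_timeout = 600;                -- drop idle client sessions after ~10 minutes (unit is an assumption)
SET enable_force_reuse_connections = on;  -- return connections as soon as the transaction ends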

Idle connections in the pooler connection pool (a cleanup sketch follows this list):

  • pg_clean_free_conn: cleans up 1/4 of the connections in the idle connection pool; CM calls it periodically;
  • CLEAN CONNECTION: cleans up all idle connections for a given database or user.
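
A hedged sketch of invoking the two cleanup mechanisms manually (the function signature and the exact CLEAN CONNECTION syntax may differ between versions; the database name is illustrative):

SELECT pg_clean_free_conn();                -- release 1/4 of the idle pooled connections
CLEAN CONNECTION TO ALL FOR DATABASE mydb;  -- drop all idle connections of database "mydb"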

3. Introduction to DN communication framework

1. Stream operator

GaussDB(DWS) is an MPP distributed database with a shared-nothing architecture, and data is stored across the DN nodes. For a join, the rows of the two tables that satisfy the join condition must reside on the same DN; when a table's distribution does not guarantee this, its data must be redistributed, which is done by generating a stream operator.

Each stream operator requires two threads to handle asynchronous network I/O: the lower-layer thread that sends data is called the producer, and the upper-layer thread that receives data is called the consumer.

2. Stream thread

Each stream operator on a DN needs to start a stream thread to send network data asynchronously. If SMP parallelism is enabled, a stream operator may need to start multiple stream threads and establish more connections between DNs. There are three types of stream operators (Streaming), illustrated by the plan sketch after this list:

  • GATHER: the CN communicates with the DNs and collects the DN result sets;
  • BROADCAST: a DN broadcasts all of its local data to the other DNs;
  • REDISTRIBUTE: a DN hashes its local data and sends each row to the corresponding DN.
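
As a hedged illustration, a join whose key does not match a table's distribution key typically produces redistribution in the plan. Table and column names here are made up, and the exact operator labels depend on the version:

-- Illustrative: t2 is redistributed because its distribution key differs from the join key.
EXPLAIN
SELECT t1.id, t2.val
FROM   t1
JOIN   t2 ON t1.id = t2.t1_id;
-- A typical GaussDB(DWS) plan would contain stream operators such as:
--   Streaming (type: GATHER)        -- the CN collects the DN result sets
--   Streaming (type: REDISTRIBUTE)  -- t2 rows rehashed on t1_id and exchanged between DNs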

3. Stream thread pool

The stream thread pool implements the reuse of DN stream threads and avoids the overhead of stream thread creation, initialization, cleaning, and destruction.

The stream thread pool is implemented with a lock-free queue; with it, starting 2000 stream threads concurrently drops from about 2 seconds to about 10 ms. When a stream operator needs a stream thread, it matches the corresponding stream thread pool by database name and preferentially reuses existing threads of the same database. After a query completes, the stream threads it created are put into the pool to wait for reuse. Idle threads in the pool have their own timeout mechanism: 1/4 of them are recycled every 60 s. The max_stream_pool parameter sets the upper limit of threads cached in the pool; when it is 0, the stream thread pool is disabled. It can also be lowered temporarily to clean up cached stream threads.
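
A small sketch of inspecting and adjusting the cache limit. This assumes max_stream_pool can be read and changed like an ordinary parameter; in a real cluster it may have to be modified through the cluster's configuration tooling instead:

SHOW max_stream_pool;      -- current upper limit of cached stream threads
SET  max_stream_pool = 0;  -- 0 disables the stream thread pool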

Stream thread pool diagram

4. Libcomm communication library

When a cluster reaches 1000 DNs, each stream thread needs to establish 1000 connections; with 1000 concurrent stream threads, a DN would have to maintain a total of one million connections, consuming a large amount of connection, memory, and file descriptor (FD) resources. The Libcomm communication library was designed for this situation. Libcomm multiplexes n logical connections over a single long-lived physical connection, so that all concurrent data flows run over one physical connection, which solves the problems of an excessive number of physical connections and time-consuming connection establishment.

4. Locating communication problems

 

1. Communication hang problem

Steps to locate communication hang issues:

  • Find the query_id of the problem query in the pgxc_stat_activity view;
  • Query the pgxc_thread_wait_status view with that query_id (query sketches follow this list);
  • Filter out the threads in the wait node, flush data, and synchronize quit states; the remaining entries point to the query's blocking point;
  • If all threads are in one of those three states, use the Libcomm logical connection view for further analysis.
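
Hedged query sketches for the first two steps (the filter and the query_id value are placeholders; the views are queried only by the columns mentioned in this article):

-- Step 1 sketch: find the query_id of the problem statement.
SELECT query_id, query
FROM   pgxc_stat_activity
WHERE  query LIKE '%my_slow_table%';   -- illustrative filter

-- Step 2 sketch: check what every thread of that query is waiting for.
SELECT *
FROM   pgxc_thread_wait_status
WHERE  query_id = 12345;               -- placeholder for the query_id found above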

2. Communication error problem

Common communication errors are shown in the figure below:

3. Locating communication performance problems

  • Use EXPLAIN PERFORMANCE to analyze where the time is spent (see the sketch after this list);
  • Locate the hot or blocking stack using the hang-location steps above;
  • Use the gsar tool to check whether network packet loss or retransmission occurs in the environment.
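
A hedged example of the first step. EXPLAIN PERFORMANCE prints detailed per-operator, per-DN runtime statistics, including time spent in stream operators; the statement itself is illustrative:

EXPLAIN PERFORMANCE
SELECT t1.id, count(*)
FROM   t1
JOIN   t2 ON t1.id = t2.t1_id
GROUP  BY t1.id;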

4. Network environment issues

  • Use the gsar tool to confirm whether network packet loss and retransmission occur;
  • Use the netstat command to confirm which connection the retransmission occurs on;

gs_ssh -c "netstat -anot | grep 'on (' | grep -v '/0/0' | sort -rnk3 | head" | grep tcp

  • Use the top command to check whether the CPU usage of the ksoftirqd process is abnormal on the machines at both ends of the connection;
  • Use ping, telnet and tcpdump to further analyze packet loss issues;

That concludes this sharing. For more technical analysis of GaussDB(DWS) products and introductions to new data warehouse features, please follow the GaussDB(DWS) forum, where technical blog posts and live broadcast schedules are published first.

Forum link: https://bbs.huaweicloud.com/forum/forum-598-1.html

Live replay link: https://bbs.huaweicloud.com/live/cloud_live/202312191630.html

 

Click to follow and learn about Huawei Cloud’s new technologies as soon as possible~
