Introduction to TiDB: deployment, principles, and usage

From MySQL architecture to TiDB

Database classification

Before introducing the TiDB database, let's first look at its usage scenarios. There are many kinds of databases today, including RDBMS (relational databases), NoSQL (Not Only SQL), and NewSQL, and each has its place in the database field; it is fair to say a hundred schools of thought are contending. So why should we learn TiDB? Let's start with MySQL, the database we are most familiar with.

MySQL pain points

Suppose there is a rapidly growing Internet company whose core business database, MySQL, has reached nearly 100 million rows and is still growing. The company attaches great importance to its data assets: all data must be kept in multiple copies for at least 5 years, and besides the offline reporting workload that performs statistical analysis on historical data, there are also requirements for real-time queries of user data.

According to past experience, MySQL performs well when a single table holds fewer than 50 million rows; beyond that, performance and maintainability drop dramatically. Of course, at that point you can shard MySQL (split databases and tables), for example with Mycat or Sharding-jdbc.

MySQL sharding (splitting databases and tables)

The advantages of MySQL sharding are obvious:

  • By splitting a large table into small tables, the data volume of a single table can be kept under 50 million rows, making MySQL performance stable and controllable.

  • After a single large table is split into small tables, it can be scaled horizontally across multiple servers, improving cluster-wide database service metrics such as QPS, TPS, and latency.

The shortcomings of MySQL sharding are also obvious:

  • After the table is divided across instances, distributed transaction management problems arise. Once the database server goes down, there is a risk of transaction inconsistency.

  • After tables are split, there are restrictions on the SQL statements that can be used, which significantly limits the functionality available to the business side; the impact on real-time report statistics is especially large.

  • After tables are split, the number of objects to maintain grows exponentially (the number of MySQL instances, the number of SQL changes that must be applied, etc.).

MySQL pain point solutions

Based on the above core pain points, we need to explore new database technology solutions to cope with the challenges brought by the explosive growth of business and provide better database service support for the business.
After researching the major databases on the market, we can consider using NewSQL technology to solve the problem, because NewSQL technology has the following salient features:

  • Unlimited horizontal scalability
  • Distributed strong consistency ensures 100% data security
  • Complete distributed transaction processing capabilities and ACID features

Judging by its GitHub activity and community contributors, TiDB is a genuinely international open source project and a representative NewSQL product, so we choose the TiDB database.

Four core application scenarios

  1. Scenarios with financial industry attributes that require high data consistency and high reliability, high system availability, scalability, and disaster recovery

As we all know, the financial industry has high requirements for data consistency and reliability, system availability, scalability, and disaster recovery. The traditional solution is two data centers in the same city providing services, plus a third data center elsewhere that only provides data disaster recovery and does not serve traffic. This solution has clear shortcomings: low resource utilization, high maintenance cost, and RTO (Recovery Time Objective) and RPO (Recovery Point Objective) values that fall short of what the enterprise expects. TiDB uses multiple replicas plus the Multi-Raft protocol to schedule data across different data centers, racks, and machines. When some machines fail, the system switches over automatically, guaranteeing RTO <= 30s and RPO = 0.

  2. Massive data and high-concurrency OLTP scenarios that require high storage capacity, scalability, and concurrency

With the rapid development of business, data grows explosively, and traditional single-node databases can no longer meet the capacity requirements. Feasible solutions include sharding middleware, a NewSQL database, or high-end storage hardware; among these, the most cost-effective is a NewSQL database such as TiDB. TiDB adopts an architecture that separates computing from storage, so each layer can be scaled out or in independently: the computing layer supports up to 512 nodes, each node supports up to 1,000 concurrent connections, and cluster capacity scales to the PB level.

  3. Real-time HTAP scenarios

With the rapid development of 5G, the Internet of Things, and artificial intelligence, enterprises produce more and more data, possibly reaching hundreds of TB or even PB scale. The traditional solution is to handle online transactions with an OLTP database and synchronize the data via ETL tools to an OLAP database for analysis. This approach suffers from high storage cost and poor real-time performance. TiDB introduced the columnar storage engine TiFlash in version 4.0 and combined it with the row-based storage engine TiKV to build a true HTAP database: for a small increase in storage cost, online transaction processing and real-time analytics can be done in the same system, greatly reducing cost for enterprises.

  4. Data aggregation and secondary processing scenarios

Currently, most enterprises' business data is scattered across different systems with no unified view. As the business grows, decision makers need to understand the operating status of the whole company in order to make timely decisions, so data scattered across systems must be gathered into one system and undergo secondary processing to generate T+0 or T+1 reports. The traditional solution is ETL + Hadoop, but the Hadoop stack is too complex, and its operation, maintenance, and storage costs are too high to meet users' needs. Compared with Hadoop, TiDB is much simpler: the business synchronizes data into TiDB through ETL tools or TiDB's own replication tools, and reports can be generated directly in TiDB with SQL.
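
As a minimal sketch of what such a report query might look like once data has been synchronized into TiDB (the orders table and its columns here are purely hypothetical):

# Hypothetical T+1 report: yesterday's order count and revenue per region
SELECT region,
       COUNT(*)    AS order_cnt,
       SUM(amount) AS revenue
FROM orders
WHERE order_date = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
GROUP BY region
ORDER BY revenue DESC;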

Summary

Traditional relational databases have a long history. Today's RDBMSs are represented by Oracle, MySQL, and PostgreSQL; they are the veterans of the database field and are widely used in all walks of life. Most RDBMSs use local storage or shared storage.
However, this type of database has problems such as inherent capacity limits. As business volume keeps increasing, capacity gradually becomes a bottleneck, at which point DBAs shard the tables across databases to relieve the pressure. Large-scale sharding not only consumes a lot of manpower but also complicates the routing logic the business uses to access the database. In addition, RDBMSs scale poorly, cluster scale-out and scale-in are usually expensive, and they do not meet the needs of distributed transactions.
Representative NoSQL databases include HBase, Redis, MongoDB, and Cassandra. These databases solve the poor scalability of RDBMSs and make cluster expansion much more convenient. However, because data is stored as key-value pairs, SQL compatibility is greatly compromised, and NoSQL databases can only satisfy some of the properties of distributed transactions.
Representatives in the NewSQL field are Google's Spanner and F1, which claim to achieve disaster recovery across data centers worldwide and to fully satisfy the ACID requirements of distributed transactions, but they can only be used on Google's cloud. TiDB was born against this backdrop and filled the gap in the domestic NewSQL field. Since its first line of code was written in May 2015, TiDB has released dozens of versions, large and small, and iterates very quickly.

Introduction to TiDB database

TiDB overview

TiDB database official website: https://pingcap.com/zh/

TiDB is an open source distributed relational database independently designed and developed by PingCAP. It is a converged distributed database that supports both online transaction processing and online analytical processing (Hybrid Transactional and Analytical Processing, HTAP). Its key features include horizontal scale-out and scale-in, financial-grade high availability, real-time HTAP, a cloud-native distributed architecture, and compatibility with the MySQL 5.7 protocol and ecosystem. Its goal is to provide users with a one-stop OLTP (Online Transactional Processing), OLAP (Online Analytical Processing), and HTAP solution. TiDB is suitable for scenarios that require high availability, strong consistency, and large data scale.

Database classification and sorting

SQL, NoSQL, NewSQL

SQL

Relational databases (RDBMS, i.e. SQL databases)
Commercial software: Oracle, DB2
Open source software: MySQL, PostgreSQL
Single-node relational databases struggle to meet the demands of massive data volumes.

NoSQL

NoSQL = Not Only SQL, meaning "not just SQL"; it advocates non-relational data storage. NoSQL systems generally sacrifice support for complex SQL and ACID transactions in exchange for elastic scalability, and they usually do not guarantee strong consistency (only eventual consistency).

Classification:

  • Key-Value database: such as MemcacheDB, Redis
  • Document storage: such as MongoDB
  • Column storage: convenient for storing structured and semi-structured data and for data compression; offers large I/O advantages when querying specific columns. Examples: HBase, Cassandra
  • Graph database: stores graph relationships (note: not pictures), such as Neo4J

NewSQL

For OLTP reads and writes, NewSQL provides the same scalability and performance as NoSQL while supporting ACID transactions; that is, it keeps NoSQL's high scalability and high performance while preserving the relational model.

Why is NewSQL needed?

  • NoSQL cannot completely replace RDBMS

  • Standalone RDBMS cannot meet performance requirements

  • Using the "stand-alone RDBMS + middleware" approach, it is difficult to solve distributed transactions and high availability issues at the middleware layer.

NewSQL design architecture

  • It can be based on a new database platform or based on existing SQL engine optimization.
  • A shared-nothing (MPP) architecture is relatively common
  • Implement high availability and disaster recovery based on multiple copies
  • Distributed query
  • Data Sharding mechanism
  • Achieve data consistency through 2PC, Paxos/Raft and other protocols

Representative products

  • Google Spanner
  • OceanBase
  • TiDB

OLTP, OLAP

OLTP
  • It emphasizes the ability to support a large number of concurrent transaction operations (add, delete, modify, and query) in a short period of time, and the amount of data involved in each operation is very small (such as dozens to hundreds of bytes)

  • Emphasis on strong consistency of transactions (such as bank transfer transactions, zero tolerance for errors)

For example: During "Double Eleven", hundreds of thousands of users may place orders in the same second. The backend database must be able to process these order requests concurrently and at near real-time speed

OLAP
  • Prefers complex read-only queries, reading massive data for analysis and calculation, and the query time is often very long

For example: after "Double Eleven" ends, Taobao's operations staff analyze and mine the orders to discover market patterns. Such analysis may need to read all historical orders and can take tens of seconds or even tens of minutes.

  • OLAP representative products:
    • Greenplum
    • TeraData
    • Alibaba AnalyticDB

The birth of TiDB

Huang Dongxu, the author of the well-known open source distributed caching service Codis, is co-founder & CTO of PingCAP and a senior infrastructure engineer who excels at the design and implementation of distributed storage systems, a technical guru among open source enthusiasts. Even in today's flourishing Internet era, he kept searching for a practical direction in the blurry, uncertain territory of databases.
At the end of 2012, he read two papers published by Google, and they refracted what he had been sensing like light through a prism. The papers describe F1/Spanner, a massive-scale relational database used internally at Google, which solves the problems of elastic scaling and global distribution for relational databases and is used at scale in production. "If this can be realized, it will be disruptive to the field of data storage." Huang Dongxu was excited by the emergence of such a solution, and PingCAP's TiDB was born on this basis.

TiDB architectural features

TiDB overall architecture

A TiDB cluster mainly includes three core components: TiDB Server, PD Server, and TiKV Server. In addition, there is the TiSpark component for users' complex OLAP needs and the TiDB Operator component for simplifying deployment and management on the cloud.

As a new-generation NewSQL database, TiDB has gradually gained a foothold in the database field. It combines outstanding features of Etcd, MySQL, HDFS, HBase, Spark, and other technologies. As TiDB is adopted at scale, it will gradually blur the boundary between OLTP and OLAP and simplify the currently cumbersome ETL process, triggering a new wave of technology. In a word, TiDB has a bright and promising future.

TiDB architecture diagram

TiDB Server

TiDB Server is responsible for receiving SQL requests, processing SQL-related logic, finding the TiKV address that stores the data required for calculation through PD Server, interacting with TiKV to obtain data, and finally returning results. TiDB Server is stateless. It does not store data itself. It is only responsible for calculation. It can be infinitely expanded horizontally and can provide a unified access address to the outside world through load balancing components (such as LVS, HAProxy or F5).

PD Server

Placement Driver (PD for short) is the management module of the entire cluster. It has three main tasks:

  • One is to store the meta-information of the cluster (which TiKV node a certain Key is stored on);

  • The second is to schedule and load balance the TiKV cluster (such as data migration, Raft group leader migration, etc.);

  • The third is to assign a globally unique and increasing transaction ID.

PD ensures the safety of its own data through the Raft protocol: the Raft leader handles all operations, and the other PD servers exist only to ensure high availability. It is recommended to deploy an odd number of PD nodes.

TiKV Server

TiKV Server is responsible for storing data. Externally, TiKV is a distributed, transactional Key-Value storage engine. The basic unit for storing data is the Region; each Region stores the data of one Key Range (a left-closed, right-open interval from StartKey to EndKey), and each TiKV node serves multiple Regions. TiKV uses the Raft protocol for replication to maintain data consistency and disaster recovery. Replicas are managed at the Region level: the replicas of a Region, located on different nodes, form a Raft Group. Load balancing of data across multiple TiKV nodes is scheduled by PD, also in units of Regions.
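
To get an intuition for how a table is split into Regions, you can inspect the Region distribution directly from SQL. A hedged sketch (the table name t is hypothetical; SHOW TABLE ... REGIONS is available in recent TiDB versions):

# Show how table t is split into Regions and where each Region's leader lives
SHOW TABLE t REGIONS;
# Typical output columns include REGION_ID, START_KEY, END_KEY, LEADER_STORE_ID and PEERS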

TiSpark

TiSpark, as the main component in TiDB for addressing users' complex OLAP needs, runs Spark SQL directly on the TiDB storage layer, combining the advantages of the TiKV distributed cluster with integration into the big data ecosystem. With TiSpark, TiDB can support both OLTP and OLAP in one system, removing the worry of synchronizing user data.

TiDB Operator

TiDB Operator provides the ability to deploy and manage TiDB clusters on mainstream cloud infrastructure (Kubernetes). It combines the container-orchestration best practices of the cloud-native community with TiDB's operational expertise, integrating one-click deployment, co-deployment of multiple clusters, automated operations, fault self-healing, and other capabilities, greatly lowering the threshold and cost for users to use and manage TiDB.

TiDB core features

TiDB has many features as follows, two of which are core features: horizontal expansion and high availability.

Highly compatible with MySQL

In most cases you can migrate from MySQL to TiDB without modifying any code, and sharded MySQL clusters can also be migrated in real time with TiDB's migration tools. For users, the switch from MySQL to TiDB is transparent, except that this "new MySQL" has effectively unlimited storage behind it and is no longer constrained by local disk capacity. For operations, TiDB can also be attached to a MySQL master-slave architecture as a replica.
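
A quick way to see this compatibility in practice is to connect with any ordinary MySQL client or driver and run standard MySQL statements; TiDB also exposes a few helper functions of its own. A small sketch (tidb_version() is TiDB-specific; the rest is plain MySQL syntax):

# Standard MySQL statements work unchanged
SELECT VERSION();                 # reports a MySQL 5.7-compatible version string
SELECT tidb_version();            # TiDB-specific: detailed build and component information
SHOW VARIABLES LIKE 'version%';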

Distributed transactions

TiDB 100% supports standard ACID transactions.
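
A minimal sketch of what this means in practice: a multi-statement transaction written in standard MySQL syntax either commits atomically or rolls back as a whole, even though the rows involved may live in different Regions on different TiKV nodes (the accounts table and values here are hypothetical):

# Hypothetical transfer between two accounts, executed as a single ACID transaction
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
COMMIT;   # or ROLLBACK; to undo both updates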

One-stop HTAP solution

HTAP: Hybrid Transactional/Analytical Processing
TiDB is at its core an OLTP row-store database that also delivers strong OLAP performance. Together with TiSpark it provides a one-stop HTAP solution: one system handles both OLTP and OLAP at the same time, with no need for the traditional, cumbersome ETL process.
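
With TiFlash (introduced in TiDB 4.0 and included in the single-machine topology deployed later in this article), adding a columnar replica for analytics is itself just a SQL statement. A hedged sketch (the table name t is hypothetical):

# Add one TiFlash columnar replica for table t
ALTER TABLE t SET TIFLASH REPLICA 1;
# Check replication progress (AVAILABLE = 1 means the replica is ready to serve queries)
SELECT TABLE_SCHEMA, TABLE_NAME, REPLICA_COUNT, AVAILABLE, PROGRESS
FROM information_schema.tiflash_replica WHERE TABLE_NAME = 't';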

Cloud native SQL database

TiDB is a database designed for the cloud. It supports public clouds, private clouds and hybrid clouds. It can realize automated operation and maintenance with the TiDB Operator project, making deployment, configuration and maintenance very simple.

Horizontal elastic expansion

By simply adding new nodes, you can achieve horizontal expansion of TiDB, expand throughput or storage on demand, and easily cope with high concurrency and massive data scenarios.

True financial grade high availability

Compared with the traditional master-slave (M-S) replication scheme, the Raft-based majority-election protocol provides financial-grade 100% data consistency. As long as a majority of replicas survive, the system recovers from failures automatically (auto-failover) without manual intervention.

Core Feature: Horizontal Expansion

Unlimited horizontal expansion is a major feature of TiDB. The horizontal expansion mentioned here includes two aspects: computing power (TiDB) and storage capacity (TiKV).

  • TiDB Server is responsible for processing SQL requests. As your business grows, you can simply add TiDB Server nodes to improve overall processing capabilities and provide higher throughput.

  • TiKV is responsible for storing data. As the data volume grows, more TiKV Server nodes can be deployed to scale out storage capacity.

  • PD will schedule between TiKV nodes in Region units and migrate some data to the newly added nodes.

Therefore, in the early stage of business, you can deploy only a small number of service instances (it is recommended to deploy at least 3 TiKV, 3 PD, and 2 TiDB). As the business volume increases, TiKV or TiDB instances can be added as needed.
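
After scaling out or in, one way to confirm which instances currently make up the cluster from inside SQL is the cluster information table (a hedged sketch; in recent TiDB versions this is exposed as information_schema.cluster_info):

# List the type (tidb/pd/tikv/tiflash), address and version of every instance in the cluster
SELECT TYPE, INSTANCE, VERSION, START_TIME
FROM information_schema.cluster_info;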

Core Feature: High Availability

High availability is another major feature of TiDB. The three components of TiDB/TiKV/PD can tolerate the failure of some instances without affecting the availability of the entire cluster. The following describes the availability of these three components, the consequences of a single instance failure, and how to recover.
TiDB
TiDB is stateless. It is recommended to deploy at least two instances. The front end provides external services through the load balancing component. When a single instance fails, it will affect the Session currently running on this instance. From an application perspective, a single request will fail. You can continue to obtain services after reconnecting. After a single instance fails, you can restart the instance or deploy a new instance.

PD
PD is a cluster that maintains data consistency through the Raft protocol. When a single instance fails, if that instance is not the Raft leader, the service is not affected at all; if it is the Raft leader, a new leader is elected and service is restored automatically. PD cannot serve requests during the election, which takes roughly 3 seconds. It is recommended to deploy at least three PD instances; after an instance fails, restart it or add a new one.

TiKV
TiKV is a cluster that maintains data consistency through the Raft protocol (the number of replicas is configurable; three replicas are kept by default), with load-balancing scheduling done by PD. When a single node fails, all Regions stored on that node are affected: for Regions whose Leader is on that node, service is interrupted until a new leader is elected; for Regions that only have Followers there, service is not affected. When a failed TiKV node cannot be recovered within a certain period (30 minutes by default), PD migrates its data to other TiKV nodes.

TiDB storage capacity and computing power

Storage Capacity-TiKV-LSM

A TiKV cluster usually has at least 3 servers, and TiDB keeps 3 replicas of each piece of data by default. This is somewhat similar to HDFS, except that data is replicated through the Raft protocol. Data on TiKV Servers is managed in units of Regions, with the PD Server performing unified scheduling across the cluster, similar to HBase's Region scheduling.
The TiKV cluster stores data in key-value format. TiDB does not write data to HDD/SSD directly; instead, TB-scale local storage is implemented through RocksDB. An important point: like HBase, RocksDB uses an LSM tree as its storage structure, which avoids the large amounts of random reads and writes caused by splitting B+ tree leaf nodes and thus improves overall throughput.

Computing power-TiDB Server

TiDB Server itself is stateless, which means that when computing power becomes the bottleneck you can simply add machines, transparently to users. In theory there is no upper limit on the number of TiDB Servers.

TiDB experimental environment installation and deployment

[Note]: Here we first deploy an experimental environment for quick familiarity, and the production-level installation and deployment will be introduced later.

Simulate the deployment of a production environment cluster on a single machine

Apply for an Alibaba Cloud ECS instance as the deployment environment; a preemptible (spot) instance is fine. Just choose a cheap option: it only costs a few yuan a day.

You can preset the root password and then connect to the public IP with Xshell. If you have a local server, you can use your own existing machine instead.

Machine environment information:

[root@iZ0jlfl8zktqzyt15o1o16Z ~]# cat /etc/redhat-release 
CentOS Linux release 7.4.1708 (Core) 
[root@iZ0jlfl8zktqzyt15o1o16Z ~]# free -g
              total        used        free      shared  buff/cache   available
Mem:             30           0          29           0           0          30
Swap:             0           0           0
[root@iZ0jlfl8zktqzyt15o1o16Z ~]# lscpu 
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
On-line CPU(s) list:   0-7
Thread(s) per core:    2
Core(s) per socket:    4
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 85
Model name:            Intel(R) Xeon(R) Platinum 8269CY CPU @ 2.50GHz

Install MySQL-mariadb-server

Install MariaDB server; it provides a MySQL client for connecting to the TiDB database and is used for subsequent testing.

# Install via yum
[root@iZ0jlfl8zktqzyt15o1o16Z ~]# yum install mariadb* -y
# Start the service and enable it at boot
[root@iZ0jlfl8zktqzyt15o1o16Z ~]# systemctl start mariadb && systemctl enable mariadb
Created symlink from /etc/systemd/system/multi-user.target.wants/mariadb.service to /usr/lib/systemd/system/mariadb.service.
[root@iZ0jlfl8zktqzyt15o1o16Z ~]# 

[root@iZ0jlfl8zktqzyt15o1o16Z ~]# netstat -tnlpu|grep 3306
tcp6       0      0 :::3306                 :::*                    LISTEN      11576/mysqld        

[root@iZ0jlfl8zktqzyt15o1o16Z ~]# mysql -uroot -p
Enter password: 
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 2
Server version: 5.5.68-MariaDB MariaDB Server

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> ALTER USER 'root'@'localhost' IDENTIFIED BY 'Root@123';
ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MariaDB server version for the right syntax to use near 'USER 'root'@'localhost' IDENTIFIED BY 'Root@123'' at line 1
MariaDB [(none)]> use mysql;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A
Database changed
MariaDB [mysql]> UPDATE user SET password=password('123456') WHERE user='root';
Query OK, 4 rows affected (0.00 sec)
Rows matched: 4  Changed: 4  Warnings: 0
MariaDB [mysql]> flush privileges;
Query OK, 0 rows affected (0.00 sec)
MariaDB [mysql]> exit;
Bye
[root@iZ0jlfl8zktqzyt15o1o16Z ~]# mysql -uroot -p
Enter password: 
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 3
Server version: 5.5.68-MariaDB MariaDB Server
Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
MariaDB [(none)]> GRANT ALL PRIVILEGES ON *.* TO 'root'@'%' IDENTIFIED BY '123456' WITH GRANT OPTION;
Query OK, 0 rows affected (0.00 sec)
MariaDB [(none)]> FLUSH PRIVILEGES;
Query OK, 0 rows affected (0.00 sec)
MariaDB [(none)]> exit
Bye
[root@iZ0jlfl8zktqzyt15o1o16Z ~]# 

Install stand-alone version

Simulate the deployment of a production environment cluster on a single machine

Applicable scenario: You want to use a single Linux server to experience the smallest complete topology cluster of TiDB and simulate the deployment steps in a production environment.

# Simulate a production cluster deployment on a single machine
# Download and install TiUP
[root@iZ0jlfl8zktqzyt15o1o16Z ~]# curl --proto '=https' --tlsv1.2 -sSf https://tiup-mirrors.pingcap.com/install.sh | sh
WARN: adding root certificate via internet: https://tiup-mirrors.pingcap.com/root.json
You can revoke this by remove /root/.tiup/bin/7b8e153f2e2d0928.root.json
Successfully set mirror to https://tiup-mirrors.pingcap.com
Detected shell: bash
Shell profile:  /root/.bash_profile
/root/.bash_profile has been modified to add tiup to PATH
open a new terminal or source /root/.bash_profile to use it
Installed path: /root/.tiup/bin/tiup
===============================================
Have a try:     tiup playground
===============================================
# Source the profile as prompted to load the environment variables
[root@iZ0jlfl8zktqzyt15o1o16Z ~]# source /root/.bash_profile
# Install the TiUP cluster component
[root@iZ0jlfl8zktqzyt15o1o16Z ~]# tiup cluster
tiup is checking updates for component cluster ...
A new version of cluster is available:
   The latest version:         v1.10.3
   Local installed version:    
   Update current component:   tiup update cluster
   Update all components:      tiup update --all

The component `cluster` version  is not installed; downloading from repository.
download https://tiup-mirrors.pingcap.com/cluster-v1.10.3-linux-amd64.tar.gz 8.28 MiB / 8.28 MiB 100.00% 12.07 MiB/s                                               
Starting component `cluster`: /root/.tiup/components/cluster/v1.10.3/tiup-cluster
Deploy a TiDB cluster for production
# Update as prompted
[root@iZ0jlfl8zktqzyt15o1o16Z ~]# tiup update --self && tiup update cluster
download https://tiup-mirrors.pingcap.com/tiup-v1.10.3-linux-amd64.tar.gz 6.81 MiB / 6.81 MiB 100.00% 12.83 MiB/s                                                  
Updated successfully!
component cluster version v1.10.3 is already installed
Updated successfully!
# Since we are simulating a multi-machine deployment on one host, increase the sshd connection limit as the root user
[root@iZ0jlfl8zktqzyt15o1o16Z ~]# cat /etc/ssh/sshd_config | grep MaxSessions
#MaxSessions 10
[root@iZ0jlfl8zktqzyt15o1o16Z ~]# echo "MaxSessions 20" >> /etc/ssh/sshd_config
# Or open the file with vim, uncomment the line, change 10 to 20, and save
[root@iZ0jlfl8zktqzyt15o1o16Z ~]# cat /etc/ssh/sshd_config | grep MaxSessions|grep -vE "^#"
MaxSessions 20
[root@iZ0jlfl8zktqzyt15o1o16Z ~]# service sshd restart
Redirecting to /bin/systemctl restart sshd.service
# Create the cluster topology definition file
# Edit the configuration file using the template below and name it topo.yaml, where:
	# user: "tidb": the cluster is managed internally through the tidb system user (created automatically during deployment); target machines are accessed over ssh on port 22 by default
	# replication.enable-placement-rules: set this PD parameter to ensure TiFlash runs properly
	# host: set to the IP of this deployment host
[root@iZ0jlfl8zktqzyt15o1o16Z ~]# vi topo.yaml

# # Global variables are applied to all deployments and used as the default value of
# # the deployments if a specific deployment value is missing.
global:
 user: "tidb"
 ssh_port: 22
 deploy_dir: "/tidb-deploy"
 data_dir: "/tidb-data"

# # Monitored variables are applied to all the machines.
monitored:
 node_exporter_port: 9100
 blackbox_exporter_port: 9115

server_configs:
 tidb:
   log.slow-threshold: 300
 tikv:
   readpool.storage.use-unified-pool: false
   readpool.coprocessor.use-unified-pool: true
 pd:
   replication.enable-placement-rules: true
   replication.location-labels: ["host"]
 tiflash:
   logger.level: "info"

pd_servers:
 - host: 172.28.54.199

tidb_servers:
 - host: 172.28.54.199

tikv_servers:
 - host: 172.28.54.199
   port: 20160
   status_port: 20180
   config:
     server.labels: {
    
     host: "logic-host-1" }

 - host: 172.28.54.199
   port: 20161
   status_port: 20181
   config:
     server.labels: {
    
     host: "logic-host-2" }

 - host: 172.28.54.199
   port: 20162
   status_port: 20182
   config:
     server.labels: {
    
     host: "logic-host-3" }

tiflash_servers:
 - host: 172.28.54.199

monitoring_servers:
 - host: 172.28.54.199

grafana_servers:
 - host: 172.28.54.199

# List the TiDB versions available for deployment
[root@iZ0jlfl8zktqzyt15o1o16Z ~]# tiup list tidb

# tiup cluster deploy <cluster-name> <tidb-version> ./topo.yaml --user root -p
# <cluster-name> sets the cluster name
# <tidb-version> sets the cluster version; run tiup list tidb to see the TiDB versions currently available for deployment
# -p means log in with a password when connecting to the target machines
# Follow the prompts and enter "y" and the root password to complete the deployment:
[root@iZ0jlfl8zktqzyt15o1o16Z ~]# tiup cluster deploy wangtingtidb v5.2.4 ./topo.yaml --user root -p
tiup is checking updates for component cluster ...
Starting component `cluster`: /root/.tiup/components/cluster/v1.10.3/tiup-cluster deploy wangtingtidb v5.2.4 ./topo.yaml --user root -p
Input SSH password: 

+ Detect CPU Arch Name
  - Detecting node 172.28.54.199 Arch info ... Done

+ Detect CPU OS Name
  - Detecting node 172.28.54.199 OS info ... Done
Please confirm your topology:
Cluster type:    tidb
Cluster name:    wangtingtidb
Cluster version: v5.2.4
Role        Host           Ports                            OS/Arch       Directories
----        ----           -----                            -------       -----------
pd          172.28.54.199  2379/2380                        linux/x86_64  /tidb-deploy/pd-2379,/tidb-data/pd-2379
tikv        172.28.54.199  20160/20180                      linux/x86_64  /tidb-deploy/tikv-20160,/tidb-data/tikv-20160
tikv        172.28.54.199  20161/20181                      linux/x86_64  /tidb-deploy/tikv-20161,/tidb-data/tikv-20161
tikv        172.28.54.199  20162/20182                      linux/x86_64  /tidb-deploy/tikv-20162,/tidb-data/tikv-20162
tidb        172.28.54.199  4000/10080                       linux/x86_64  /tidb-deploy/tidb-4000
tiflash     172.28.54.199  9000/8123/3930/20170/20292/8234  linux/x86_64  /tidb-deploy/tiflash-9000,/tidb-data/tiflash-9000
prometheus  172.28.54.199  9090                             linux/x86_64  /tidb-deploy/prometheus-9090,/tidb-data/prometheus-9090
grafana     172.28.54.199  3000                             linux/x86_64  /tidb-deploy/grafana-3000
Attention:
    1. If the topology is not what you expected, check your yaml file.
    2. Please confirm there is no port/directory conflicts in same host.
Do you want to continue? [y/N]: (default=N) y
+ Generate SSH keys ... Done
+ Download TiDB components
  - Download pd:v5.2.4 (linux/amd64) ... Done
  - Download tikv:v5.2.4 (linux/amd64) ... Done
  - Download tidb:v5.2.4 (linux/amd64) ... Done
  - Download tiflash:v5.2.4 (linux/amd64) ... Done
  - Download prometheus:v5.2.4 (linux/amd64) ... Done
  - Download grafana:v5.2.4 (linux/amd64) ... Done
  - Download node_exporter: (linux/amd64) ... Done
  - Download blackbox_exporter: (linux/amd64) ... Done
+ Initialize target host environments
  - Prepare 172.28.54.199:22 ... Done
+ Deploy TiDB instance
  - Copy pd -> 172.28.54.199 ... Done
  - Copy tikv -> 172.28.54.199 ... Done
  - Copy tikv -> 172.28.54.199 ... Done
  - Copy tikv -> 172.28.54.199 ... Done
  - Copy tidb -> 172.28.54.199 ... Done
  - Copy tiflash -> 172.28.54.199 ... Done
  - Copy prometheus -> 172.28.54.199 ... Done
  - Copy grafana -> 172.28.54.199 ... Done
  - Deploy node_exporter -> 172.28.54.199 ... Done
  - Deploy blackbox_exporter -> 172.28.54.199 ... Done
+ Copy certificate to remote host
+ Init instance configs
  - Generate config pd -> 172.28.54.199:2379 ... Done
  - Generate config tikv -> 172.28.54.199:20160 ... Done
  - Generate config tikv -> 172.28.54.199:20161 ... Done
  - Generate config tikv -> 172.28.54.199:20162 ... Done
  - Generate config tidb -> 172.28.54.199:4000 ... Done
  - Generate config tiflash -> 172.28.54.199:9000 ... Done
  - Generate config prometheus -> 172.28.54.199:9090 ... Done
  - Generate config grafana -> 172.28.54.199:3000 ... Done
+ Init monitor configs
  - Generate config node_exporter -> 172.28.54.199 ... Done
  - Generate config blackbox_exporter -> 172.28.54.199 ... Done
+ Check status
Enabling component pd
	Enabling instance 172.28.54.199:2379
	Enable instance 172.28.54.199:2379 success
Enabling component tikv
	Enabling instance 172.28.54.199:20162
	Enabling instance 172.28.54.199:20160
	Enabling instance 172.28.54.199:20161
	Enable instance 172.28.54.199:20160 success
	Enable instance 172.28.54.199:20162 success
	Enable instance 172.28.54.199:20161 success
Enabling component tidb
	Enabling instance 172.28.54.199:4000
	Enable instance 172.28.54.199:4000 success
Enabling component tiflash
	Enabling instance 172.28.54.199:9000
	Enable instance 172.28.54.199:9000 success
Enabling component prometheus
	Enabling instance 172.28.54.199:9090
	Enable instance 172.28.54.199:9090 success
Enabling component grafana
	Enabling instance 172.28.54.199:3000
	Enable instance 172.28.54.199:3000 success
Enabling component node_exporter
	Enabling instance 172.28.54.199
	Enable 172.28.54.199 success
Enabling component blackbox_exporter
	Enabling instance 172.28.54.199
	Enable 172.28.54.199 success
Cluster `wangtingtidb` deployed successfully, you can start it with command: `tiup cluster start wangtingtidb --init`

# Start the cluster with the command from the prompt: tiup cluster start wangtingtidb --init
[root@iZ0jlfl8zktqzyt15o1o16Z ~]# tiup cluster start wangtingtidb --init
tiup is checking updates for component cluster ...
Starting component `cluster`: /root/.tiup/components/cluster/v1.10.3/tiup-cluster start wangtingtidb --init
Starting cluster wangtingtidb...
+ [ Serial ] - SSHKeySet: privateKey=/root/.tiup/storage/cluster/clusters/wangtingtidb/ssh/id_rsa, publicKey=/root/.tiup/storage/cluster/clusters/wangtingtidb/ssh/id_rsa.pub
+ [Parallel] - UserSSH: user=tidb, host=172.28.54.199
+ [Parallel] - UserSSH: user=tidb, host=172.28.54.199
+ [Parallel] - UserSSH: user=tidb, host=172.28.54.199
+ [Parallel] - UserSSH: user=tidb, host=172.28.54.199
+ [Parallel] - UserSSH: user=tidb, host=172.28.54.199
+ [Parallel] - UserSSH: user=tidb, host=172.28.54.199
+ [Parallel] - UserSSH: user=tidb, host=172.28.54.199
+ [Parallel] - UserSSH: user=tidb, host=172.28.54.199
+ [ Serial ] - StartCluster
Starting component pd
	Starting instance 172.28.54.199:2379
	Start instance 172.28.54.199:2379 success
Starting component tikv
	Starting instance 172.28.54.199:20162
	Starting instance 172.28.54.199:20160
	Starting instance 172.28.54.199:20161
	Start instance 172.28.54.199:20160 success
	Start instance 172.28.54.199:20161 success
	Start instance 172.28.54.199:20162 success
Starting component tidb
	Starting instance 172.28.54.199:4000
	Start instance 172.28.54.199:4000 success
Starting component tiflash
	Starting instance 172.28.54.199:9000
	Start instance 172.28.54.199:9000 success
Starting component prometheus
	Starting instance 172.28.54.199:9090
	Start instance 172.28.54.199:9090 success
Starting component grafana
	Starting instance 172.28.54.199:3000
	Start instance 172.28.54.199:3000 success
Starting component node_exporter
	Starting instance 172.28.54.199
	Start 172.28.54.199 success
Starting component blackbox_exporter
	Starting instance 172.28.54.199
	Start 172.28.54.199 success
+ [ Serial ] - UpdateTopology: cluster=wangtingtidb
Started cluster `wangtingtidb` successfully
The root password of TiDB database has been changed.
The new password is: 'rpi381$9*!cvX07D-w'.
Copy and record it to somewhere safe, it is only displayed once, and will not be stored.
The generated password can NOT be get and shown again.
# Note: TiDB's initial root password is printed to the console; copy it and keep a record
[root@iZ0jlfl8zktqzyt15o1o16Z ~]# 
[root@iZ0jlfl8zktqzyt15o1o16Z ~]# mysql -h 172.28.54.199 -P 4000 -u root -p
Enter password: 
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MySQL connection id is 15
Server version: 5.7.25-TiDB-v5.2.4 TiDB Server (Apache License 2.0) Community Edition, MySQL 5.7 compatible

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MySQL [(none)]> use mysql;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A
Database changed
MySQL [mysql]> Set password for 'root'@'%'=password('123456');
Query OK, 0 rows affected (0.02 sec)

MySQL [mysql]> flush privileges;
Query OK, 0 rows affected (0.01 sec)

MySQL [mysql]> GRANT ALL PRIVILEGES ON *.* TO 'root'@'%' IDENTIFIED BY '123456' WITH GRANT OPTION;
Query OK, 0 rows affected (0.02 sec)

MySQL [mysql]> exit;
Bye

Verify cluster

# Run the following command to confirm the list of clusters that are currently deployed:
[root@iZ0jlfl8zktqzyt15o1o16Z ~]# tiup cluster list
tiup is checking updates for component cluster ...
Starting component `cluster`: /root/.tiup/components/cluster/v1.10.3/tiup-cluster list
Name          User  Version  Path                                               PrivateKey
----          ----  -------  ----                                               ----------
wangtingtidb  tidb  v5.2.4   /root/.tiup/storage/cluster/clusters/wangtingtidb  /root/.tiup/storage/cluster/clusters/wangtingtidb/ssh/id_rsa

# Run the following command to view the cluster's topology and status:
[root@iZ0jlfl8zktqzyt15o1o16Z ~]# tiup cluster display wangtingtidb
tiup is checking updates for component cluster ...
Starting component `cluster`: /root/.tiup/components/cluster/v1.10.3/tiup-cluster display wangtingtidb
Cluster type:       tidb
Cluster name:       wangtingtidb
Cluster version:    v5.2.4
Deploy user:        tidb
SSH type:           builtin
Dashboard URL:      http://172.28.54.199:2379/dashboard
Grafana URL:        http://172.28.54.199:3000
ID                   Role        Host           Ports                            OS/Arch       Status   Data Dir                    Deploy Dir
--                   ----        ----           -----                            -------       ------   --------                    ----------
172.28.54.199:3000   grafana     172.28.54.199  3000                             linux/x86_64  Up       -                           /tidb-deploy/grafana-3000
172.28.54.199:2379   pd          172.28.54.199  2379/2380                        linux/x86_64  Up|L|UI  /tidb-data/pd-2379          /tidb-deploy/pd-2379
172.28.54.199:9090   prometheus  172.28.54.199  9090                             linux/x86_64  Up       /tidb-data/prometheus-9090  /tidb-deploy/prometheus-9090
172.28.54.199:4000   tidb        172.28.54.199  4000/10080                       linux/x86_64  Up       -                           /tidb-deploy/tidb-4000
172.28.54.199:9000   tiflash     172.28.54.199  9000/8123/3930/20170/20292/8234  linux/x86_64  Up       /tidb-data/tiflash-9000     /tidb-deploy/tiflash-9000
172.28.54.199:20160  tikv        172.28.54.199  20160/20180                      linux/x86_64  Up       /tidb-data/tikv-20160       /tidb-deploy/tikv-20160
172.28.54.199:20161  tikv        172.28.54.199  20161/20181                      linux/x86_64  Up       /tidb-data/tikv-20161       /tidb-deploy/tikv-20161
172.28.54.199:20162  tikv        172.28.54.199  20162/20182                      linux/x86_64  Up       /tidb-data/tikv-20162       /tidb-deploy/tikv-20162
Total nodes: 8

Access TiDB's Grafana monitoring:
Open the cluster's Grafana monitoring page at http://39.101.65.150:3000. The default username and password are both admin.
On first login you are required to change the initial password; here it was changed to 123456.

Visit TiDB’s Dashboard:

Open the cluster's TiDB Dashboard at http://39.101.65.150:2379/dashboard. The default username is root, and the password is the 123456 just set through the MySQL client.

# Create a database in MariaDB and in TiDB from the command line
[root@iZ0jlfl8zktqzyt15o1o16Z ~]# mysql -u root -P 3306 -h 39.101.65.150 -p"123456" -e "create database mariadb_666;"
[root@iZ0jlfl8zktqzyt15o1o16Z ~]# mysql -u root -P 4000 -h 39.101.65.150 -p"123456" -e "create database tidb_666;"

You can see that the new database has been created in each of the two systems.

Introduction to using TiDB

SQL operations

[root@iZ0jlfl8zktqzyt15o1o16Z ~]# mysql -u root -P 4000 -h 39.101.65.150 -p"123456"
Welcome to the MariaDB monitor.  Commands end with ; or \g.
# Create a database named samp_db
MySQL [(none)]> CREATE DATABASE IF NOT EXISTS samp_db;
Query OK, 0 rows affected (0.08 sec)
# List databases
MySQL [(none)]> SHOW DATABASES;
+--------------------+
| Database           |
+--------------------+
| INFORMATION_SCHEMA |
| METRICS_SCHEMA     |
| PERFORMANCE_SCHEMA |
| mysql              |
| samp_db            |
| test               |
| tidb_666           |
+--------------------+
7 rows in set (0.01 sec)
# Drop the database
MySQL [(none)]> DROP DATABASE samp_db;
Query OK, 0 rows affected (0.19 sec)

MySQL [(none)]> SHOW DATABASES;
+--------------------+
| Database           |
+--------------------+
| INFORMATION_SCHEMA |
| METRICS_SCHEMA     |
| PERFORMANCE_SCHEMA |
| mysql              |
| test               |
| tidb_666           |
+--------------------+
6 rows in set (0.01 sec)

MySQL [(none)]> CREATE DATABASE IF NOT EXISTS samp_db;
Query OK, 0 rows affected (0.08 sec)
# Switch to the database
MySQL [(none)]> USE samp_db;
Database changed
# Create a table
MySQL [samp_db]> CREATE TABLE IF NOT EXISTS person (
    ->       number INT(11),
    ->       name VARCHAR(255),
    ->       birthday DATE
    -> );
Query OK, 0 rows affected (0.08 sec)
# Show the CREATE TABLE statement
MySQL [samp_db]> SHOW CREATE table person;
+--------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Table  | Create Table                                                                                                                                                                            |
+--------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| person | CREATE TABLE `person` (
  `number` int(11) DEFAULT NULL,
  `name` varchar(255) DEFAULT NULL,
  `birthday` date DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin |
+--------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.01 sec)
# Show the table's columns
MySQL [samp_db]> SHOW FULL COLUMNS FROM person;
+----------+--------------+-------------+------+------+---------+-------+---------------------------------+---------+
| Field    | Type         | Collation   | Null | Key  | Default | Extra | Privileges                      | Comment |
+----------+--------------+-------------+------+------+---------+-------+---------------------------------+---------+
| number   | int(11)      | NULL        | YES  |      | NULL    |       | select,insert,update,references |         |
| name     | varchar(255) | utf8mb4_bin | YES  |      | NULL    |       | select,insert,update,references |         |
| birthday | date         | NULL        | YES  |      | NULL    |       | select,insert,update,references |         |
+----------+--------------+-------------+------+------+---------+-------+---------------------------------+---------+
3 rows in set (0.00 sec)

MySQL [samp_db]> DROP TABLE IF EXISTS person;
Query OK, 0 rows affected (0.20 sec)

MySQL [samp_db]> CREATE TABLE IF NOT EXISTS person (
    ->       number INT(11),
    ->       name VARCHAR(255),
    ->       birthday DATE
    -> );
Query OK, 0 rows affected (0.08 sec)
# Create an index; for columns whose values are not unique, you can use CREATE INDEX or an ALTER TABLE statement
MySQL [samp_db]> CREATE INDEX person_num ON person (number);
Query OK, 0 rows affected (2.76 sec)
# Show all indexes on the table
MySQL [samp_db]> SHOW INDEX from person;
+--------+------------+------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+-----------+
| Table  | Non_unique | Key_name   | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment | Visible | Expression | Clustered |
+--------+------------+------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+-----------+
| person |          1 | person_num |            1 | number      | A         |           0 |     NULL | NULL   | YES  | BTREE      |         |               | YES     | NULL       | NO        |
+--------+------------+------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+-----------+
1 row in set (0.01 sec)
# Drop the index
MySQL [samp_db]> DROP INDEX person_num ON person;
Query OK, 0 rows affected (0.26 sec)
# Create a unique index
MySQL [samp_db]> CREATE UNIQUE INDEX person_num ON person (number);
Query OK, 0 rows affected (2.76 sec)
# Insert data
MySQL [samp_db]> INSERT INTO person VALUES("1","tom","20170912");
Query OK, 1 row affected (0.01 sec)
# Query data
MySQL [samp_db]> SELECT * FROM person;
+--------+------+------------+
| number | name | birthday   |
+--------+------+------------+
|      1 | tom  | 2017-09-12 |
+--------+------+------------+
1 row in set (0.01 sec)
# Update data in the table
MySQL [samp_db]> UPDATE person SET birthday='20200202' WHERE name='tom';
Query OK, 1 row affected (0.01 sec)
Rows matched: 1  Changed: 1  Warnings: 0

MySQL [samp_db]> SELECT * FROM person;
+--------+------+------------+
| number | name | birthday   |
+--------+------+------------+
|      1 | tom  | 2020-02-02 |
+--------+------+------------+
1 row in set (0.00 sec)
# Delete data from the table
MySQL [samp_db]> DELETE FROM person WHERE number=1;
Query OK, 1 row affected (0.01 sec)

MySQL [samp_db]> SELECT * FROM person;
Empty set (0.01 sec)
# Create a user tiuser that can only log in locally from localhost
MySQL [samp_db]> CREATE USER 'tiuser'@'localhost' IDENTIFIED BY '123456';
Query OK, 0 rows affected (0.02 sec)
# Verify locally
[root@iZ0jlfl8zktqzyt15o1o16Z ~]# mysql -u tiuser  -P 4000 -h 127.0.0.1 -p"123456"
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MySQL connection id is 99
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
MySQL [(none)]> show databases;
+--------------------+
| Database           |
+--------------------+
| INFORMATION_SCHEMA |
+--------------------+
1 row in set (0.00 sec)
MySQL [(none)]> 

At this point tiuser cannot log in remotely.

Authorize the ordinary user to log in remotely:

MySQL [mysql]> update user set host='%' where user='tiuser';
Query OK, 1 row affected (0.01 sec)
Rows matched: 1  Changed: 1  Warnings: 0
MySQL [mysql]> flush privileges;
Query OK, 0 rows affected (0.00 sec)

Grant the ordinary user access to the database:

MySQL [mysql]> GRANT SELECT ON samp_db.* TO 'tiuser'@'%';
Query OK, 0 rows affected (0.01 sec)
MySQL [mysql]> SHOW GRANTS for 'tiuser'@'%';
+-------------------------------------------+
| Grants for tiuser@%                       |
+-------------------------------------------+
| GRANT USAGE ON *.* TO 'tiuser'@'%'        |
| GRANT SELECT ON samp_db.* TO 'tiuser'@'%' |
+-------------------------------------------+
2 rows in set (0.00 sec)

Delete users

# Drop the user tiuser
MySQL [samp_db]> DROP USER 'tiuser'@'localhost';
Query OK, 0 rows affected (0.03 sec)
# Show all privileges of the current user
MySQL [samp_db]> SHOW GRANTS;
+-------------------------------------------------------------+
| Grants for User                                             |
+-------------------------------------------------------------+
| GRANT ALL PRIVILEGES ON *.* TO 'root'@'%' WITH GRANT OPTION |
+-------------------------------------------------------------+
1 row in set (0.00 sec)
MySQL [samp_db]> 

Read historical data

Function Description

TiDB implements the function of reading historical data through the standard SQL interface, without the need for a special client or driver. After the data is updated or deleted, the data before the update/deletion can still be read out through the SQL interface.

In addition, even if the table structure changes after the data is updated, TiDB can still read the data using the old table structure.

Operating procedures

To support reading historical versions of data, TiDB introduces a system variable: tidb_snapshot. This variable is session-scoped and can be modified with the standard SET statement. Its value is text and can hold either a TSO or a datetime. A TSO is a globally allocated timestamp obtained from PD; the datetime format is, for example, "2020-10-08 16:45:26.999", though in practice it is usually written only to second precision, such as "2020-10-08 16:45:26". When this variable is set, TiDB creates a Snapshot with that timestamp (there is no overhead; it is just a data structure), and all subsequent SELECT operations read data from this Snapshot.
Note:
TiDB transactions obtain timestamps globally from PD, so stored data versions also use PD-granted timestamps as version numbers. When a Snapshot is generated, the value of the tidb_snapshot variable is used as the version number. If the local clocks of the TiDB Server machine and the PD Server machine differ significantly, PD's time prevails.
When you are done reading historical versions, end the current session or use the SET statement to set tidb_snapshot to "" in order to read the latest data again.

Historical data retention policy

TiDB uses MVCC to manage versions. When data is updated or deleted, it is not actually removed; a new version is appended instead, so historical data can be retained. However, not all history is kept: historical data older than a certain period is physically deleted to reduce space usage and avoid the performance overhead of keeping too many historical versions.
TiDB uses GC (Garbage Collection) that runs periodically for cleaning. For details about GC, see TiDB Garbage Collection (GC).
The settings to pay attention to here are tikv_gc_life_time and tikv_gc_safe_point. tikv_gc_life_time configures how long historical versions are retained and can be modified manually; tikv_gc_safe_point records the current safePoint, and users can safely create snapshots with any timestamp greater than the safePoint to read historical versions. The safePoint is updated automatically each time GC runs.
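
As a hedged sketch of how this retention window can be inspected and lengthened (in the TiDB 5.x versions used in this article the GC life time is also exposed as the global system variable tidb_gc_life_time; older versions are configured through the mysql.tidb table):

# Check how long historical versions are currently retained
SHOW GLOBAL VARIABLES LIKE 'tidb_gc_life_time';
# Keep historical versions for 24 hours instead of the short default
SET GLOBAL tidb_gc_life_time = '24h';
# The GC bookkeeping values mentioned above are visible in the mysql.tidb table
SELECT VARIABLE_NAME, VARIABLE_VALUE FROM mysql.tidb
WHERE VARIABLE_NAME IN ('tikv_gc_life_time', 'tikv_gc_safe_point');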

Example of reading historical data operation

# Create a table and insert a few rows of test data
MySQL [mysql]> create table t (c int);
Query OK, 0 rows affected (0.08 sec)

MySQL [mysql]> insert into t values (1), (2), (3);
Query OK, 3 rows affected (0.04 sec)
Records: 3  Duplicates: 0  Warnings: 0
# View the data in the table
MySQL [mysql]> select * from t;
+------+
| c    |
+------+
|    1 |
|    2 |
|    3 |
+------+
3 rows in set (0.00 sec)

# Check the current time; it serves as a reference point for data versions: changes made before and after this moment can be distinguished when querying
MySQL [mysql]> select now();
+---------------------+
| now()               |
+---------------------+
| 2022-08-22 10:17:26 |
+---------------------+
1 row in set (0.01 sec)
# Update one row
MySQL [mysql]> update t set c=222222 where c=2;
Query OK, 1 row affected (0.00 sec)
Rows matched: 1  Changed: 1  Warnings: 0

MySQL [mysql]> select * from t;
+--------+
| c      |
+--------+
|      1 |
| 222222 |
|      3 |
+--------+
3 rows in set (0.00 sec)
# Set a special session-scoped variable; it means reads return the latest version as of that time, so queries no longer see real-time data but the data as of 2022-08-22 10:17:26; modifications made after that time are not visible, which is how historical data is read
MySQL [mysql]> set @@tidb_snapshot="2022-08-22 10:17:26";
Query OK, 0 rows affected (0.00 sec)
# Note:
# The time set here is a time before the UPDATE statement was executed.
# tidb_snapshot must be prefixed with @@ rather than @, because @@ denotes a system variable while @ denotes a user variable
MySQL [mysql]> select * from t;
+------+
| c    |
+------+
|    1 |
|    2 |
|    3 |
+------+
3 rows in set (0.00 sec)
# After clearing this variable, the latest version of the data can be read again
MySQL [mysql]> set @@tidb_snapshot="";
Query OK, 0 rows affected (0.00 sec)

MySQL [mysql]> select * from t;
+--------+
| c      |
+--------+
|      1 |
| 222222 |
|      3 |
+--------+
3 rows in set (0.00 sec)

TiDB technical principles

Databases, operating systems, and compilers are collectively known as the three major systems and can be said to be the cornerstone of all computer software. The database is closer to the application layer and underpins many businesses. After decades of development, this field continues to make new progress.

Many people have used databases, but few have implemented a database, especially a distributed database. Understanding the implementation principles and details of the database can, on the one hand, improve personal skills and help build other systems, and on the other hand, it can also help make good use of the database.

The best way to study a technology is to study an open source project in that area, and databases are no exception. There are many good open source projects among single-node databases, of which MySQL and PostgreSQL are the best known, and many people have read their code. In distributed databases, however, there are not many good open source projects. TiDB has attracted wide attention, especially from technology enthusiasts who hope to participate in the project. Because a distributed database is inherently complex, many people do not understand the whole project well, so I hope to write some articles, from top to bottom and from shallow to deep, about TiDB's technical principles, covering both the technologies visible to users and the large number of technical points hidden behind the SQL interface.

Data storage

The most fundamental function of a database is to save data, so we start here.

There are many ways to save data. The simplest is to build a data structure directly in memory to hold the data sent by users, for example an array, appending a record each time a piece of data arrives. This solution is extremely simple, meets the most basic requirement, and performs very well, but otherwise it is full of holes. The biggest problem is that the data lives entirely in memory: once the service crashes or restarts, the data is lost forever.

In order to solve the problem of data loss, we can put the data on non-volatile storage media (such as a hard disk). The improved solution is to create a file on the disk and append a line to it whenever a piece of data is received. OK, we now have a solution for persistently storing data. But it is not good enough. What if the disk develops bad sectors? We can use RAID (Redundant Array of Independent Disks) to provide redundant storage on a single machine. What if the entire machine fails? For example, in a fire, RAID cannot protect the data. We can switch to network storage, or replicate the storage through hardware or software. At this point it seems we have solved the data safety problem and can breathe a sigh of relief. But can consistency between the copies be guaranteed during replication? In other words, on the premise that the data is not lost, we must also ensure that the data is correct. Ensuring that data is not lost is only the most basic requirement; more headaches are waiting to be solved:

• Can it support disaster recovery across data centers?

• Is the writing speed fast enough?

• Once the data is saved, is it easy to read?

• How to modify the saved data? How to support concurrent modifications?

• How to modify multiple records atomically?

Each of these problems is very difficult, but to make an excellent data storage system, each of the above problems must be solved. In order to solve the data storage problem, we developed the TiKV project. Next, I will introduce to you some design ideas and basic concepts of TiKV.

Key-Value

As a system for saving data, the first thing to decide is the data storage model, that is, the form in which the data is saved. TiKV chose the Key-Value model and provides an ordered traversal method. To put it simply, TiKV can be regarded as a huge Map in which both Key and Value are raw Byte arrays, and the Keys are arranged in the byte-wise (binary) order of those arrays.

  1. This is a huge Map, which stores Key-Value pairs.

  2. The Key-Value pairs in this Map are ordered according to the binary order of the Keys; that is, we can Seek to the position of a certain Key and then call the Next method repeatedly to obtain the Key-Values larger than this Key in increasing order.

Now let us forget any SQL concepts and focus on how to implement a huge, high-performance, highly reliable (distributed) Map like TiKV.
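
To make this ordered-Map abstraction concrete, here is a minimal, purely illustrative Go sketch (not TiKV's actual API) of an in-memory ordered Key-Value store that keeps raw byte keys in binary order and supports Seek plus forward iteration:

```go
package main

import (
	"bytes"
	"fmt"
	"sort"
)

// kv is one Key-Value pair; both sides are raw byte arrays,
// matching the model described above.
type kv struct {
	key, value []byte
}

// orderedKV keeps its entries sorted by the binary order of the key.
type orderedKV struct {
	entries []kv
}

// Put inserts or overwrites a key while keeping the slice sorted.
func (s *orderedKV) Put(key, value []byte) {
	i := sort.Search(len(s.entries), func(i int) bool {
		return bytes.Compare(s.entries[i].key, key) >= 0
	})
	if i < len(s.entries) && bytes.Equal(s.entries[i].key, key) {
		s.entries[i].value = value
		return
	}
	s.entries = append(s.entries, kv{})
	copy(s.entries[i+1:], s.entries[i:])
	s.entries[i] = kv{key: key, value: value}
}

// Seek returns the index of the first entry whose key is >= the given key.
func (s *orderedKV) Seek(key []byte) int {
	return sort.Search(len(s.entries), func(i int) bool {
		return bytes.Compare(s.entries[i].key, key) >= 0
	})
}

func main() {
	store := &orderedKV{}
	store.Put([]byte("b"), []byte("2"))
	store.Put([]byte("a"), []byte("1"))
	store.Put([]byte("c"), []byte("3"))

	// Seek to "b", then iterate forward ("Next") in increasing key order.
	for i := store.Seek([]byte("b")); i < len(store.entries); i++ {
		fmt.Printf("%s -> %s\n", store.entries[i].key, store.entries[i].value)
	}
}
```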

RocksDB

For any persistent storage engine, data must ultimately be saved on disk, and TiKV is no exception. However, TiKV does not write data directly to the disk; instead, it saves the data in RocksDB, which is responsible for the actual storage. The reason for this choice is that developing a stand-alone storage engine is a lot of work, especially a high-performance one that requires all kinds of detailed optimizations. RocksDB is an excellent open source stand-alone storage engine that satisfies our various requirements, and the Facebook team keeps optimizing it, so with only a small investment of effort we can enjoy a very powerful and constantly improving stand-alone engine. Of course, we have also contributed some code to RocksDB and hope this project keeps getting better. Here you can simply think of RocksDB as a stand-alone Key-Value Map.

RocksDB's underlying LSM tree keeps incremental modifications to the data in memory; after they reach a specified size limit, they are flushed to disk in batches. The trees on disk can then be merged periodically into one larger tree to optimize read performance.
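
The toy Go sketch below is only an illustration of that LSM idea, not RocksDB's implementation: writes accumulate in an in-memory table, and once a size threshold is reached the table is frozen, sorted, and written out as an immutable run.

```go
package main

import (
	"fmt"
	"sort"
)

// A toy LSM: writes go to an in-memory table first; once the table
// reaches flushLimit entries it is sorted and appended to the list of
// "on-disk" runs (simulated here as in-memory slices).
const flushLimit = 3

type run []struct{ key, value string }

type lsm struct {
	memtable map[string]string
	runs     []run
}

func newLSM() *lsm { return &lsm{memtable: map[string]string{}} }

func (l *lsm) Put(key, value string) {
	l.memtable[key] = value
	if len(l.memtable) >= flushLimit {
		l.flush()
	}
}

// flush sorts the memtable and writes it out as one immutable run,
// mirroring "flush incremental modifications to disk in batches".
func (l *lsm) flush() {
	keys := make([]string, 0, len(l.memtable))
	for k := range l.memtable {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	var r run
	for _, k := range keys {
		r = append(r, struct{ key, value string }{k, l.memtable[k]})
	}
	l.runs = append(l.runs, r)
	l.memtable = map[string]string{}
}

func main() {
	db := newLSM()
	for i := 0; i < 7; i++ {
		db.Put(fmt.Sprintf("key%d", i), fmt.Sprintf("val%d", i))
	}
	fmt.Println("runs flushed:", len(db.runs), "entries still in memtable:", len(db.memtable))
}
```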

Raft

How do we ensure that data is not lost and no errors occur when a single machine fails? Simply put, we need to find a way to copy the data to multiple machines, so that if one machine fails we still have replicas elsewhere; more precisely, we also need this replication scheme to be reliable and efficient, and to be able to handle replica failures. That sounds difficult, but fortunately we have the Raft protocol. Raft is a consensus algorithm equivalent to Paxos but easier to understand. Readers who are interested can read the Raft paper; this article only gives a brief introduction, and details can be found in the paper. One more point worth mentioning: the Raft paper describes only a basic solution, and implementing it strictly as written would perform poorly, so many optimizations have been made in the implementation of the Raft protocol. For specific optimization details, please refer to tangliu's article "TiKV Source Code Analysis Series - Raft Optimization".
Raft is a consistency protocol that provides several important functions:
1. Leader election
2. Member change
3. Log replication
TiKV uses Raft for data replication. Each data change is recorded as a Raft log, and through Raft's log replication feature the data is synchronized safely and reliably to a majority of the nodes in the Raft Group.

Through stand-alone RocksDB, we can store data on disk quickly; through Raft, we can replicate data to multiple machines to guard against single-machine failure. Data is written through the interface of the Raft layer instead of directly to RocksDB. By implementing Raft, we have a distributed KV, and we no longer have to worry about a single machine crashing.
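
The following is a highly simplified picture of that write path, written as a Go sketch. The names are illustrative only; real Raft involves log persistence, leader election, and network RPCs, none of which are modeled here.

```go
package main

import "fmt"

// A write is first proposed as a Raft log entry; only after the entry is
// committed by a majority of replicas is it applied to each replica's
// local state machine (RocksDB in TiKV).

type raftLogEntry struct {
	key, value string
}

type replica struct {
	name    string
	applied map[string]string // stands in for the local RocksDB instance
}

type raftGroup struct {
	leader    *replica
	followers []*replica
}

// propose replicates the entry to the followers and "commits" it once a
// majority has acknowledged; replication always succeeds in this toy model.
func (g *raftGroup) propose(e raftLogEntry) {
	acks := 1 // the leader itself
	for range g.followers {
		acks++ // in reality: an AppendEntries RPC over the network
	}
	if acks > (1+len(g.followers))/2 {
		// Committed: apply the entry to every replica's state machine.
		for _, r := range append([]*replica{g.leader}, g.followers...) {
			r.applied[e.key] = e.value
		}
	}
}

func main() {
	newReplica := func(name string) *replica {
		return &replica{name: name, applied: map[string]string{}}
	}
	g := &raftGroup{
		leader:    newReplica("tikv-1"),
		followers: []*replica{newReplica("tikv-2"), newReplica("tikv-3")},
	}
	g.propose(raftLogEntry{key: "foo", value: "bar"})
	fmt.Println("follower tikv-3 sees:", g.followers[1].applied["foo"])
}
```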

Region

Here we introduce a very important concept: Region. It is the basis for understanding the series of mechanisms that follow, so please read this section carefully.
As mentioned earlier, we regard TiKV as a huge ordered KV Map, so in order to scale storage horizontally we need to distribute the data across multiple machines. Scattering data across multiple machines and Raft's data replication are not the same concept; in this section, let us set Raft aside and assume for simplicity that all data has only one copy.
For a KV system, there are two typical ways to distribute data across multiple machines: one is to Hash the Key and select the storage node according to the Hash value; the other is to divide by Range, storing a contiguous segment of Keys on one node. TiKV chose the second approach: it divides the entire Key-Value space into many segments, each of which is a series of consecutive Keys. We call each segment a Region, and we try to keep the amount of data stored in each Region below a certain size (this size is configurable; the default mentioned here is 64MB). Each Region can be described by a left-closed, right-open interval from StartKey to EndKey.
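
As an illustration of the Range-based scheme, the sketch below locates the Region that owns a given key by comparing against each Region's [StartKey, EndKey) interval. The structures are hypothetical, not PD's real metadata format.

```go
package main

import (
	"bytes"
	"fmt"
	"sort"
)

// Region describes one left-closed, right-open key range [StartKey, EndKey).
// An empty EndKey means "up to the end of the key space".
type Region struct {
	ID       uint64
	StartKey []byte
	EndKey   []byte
}

// regionFor returns the Region whose range contains key.
// regions must be sorted by StartKey, mirroring how the routing
// component keeps Region metadata ordered.
func regionFor(regions []Region, key []byte) Region {
	i := sort.Search(len(regions), func(i int) bool {
		return bytes.Compare(regions[i].StartKey, key) > 0
	})
	return regions[i-1]
}

func main() {
	regions := []Region{
		{ID: 1, StartKey: []byte(""), EndKey: []byte("g")},
		{ID: 2, StartKey: []byte("g"), EndKey: []byte("p")},
		{ID: 3, StartKey: []byte("p"), EndKey: []byte("")},
	}
	fmt.Println("key 'hello' lives in Region", regionFor(regions, []byte("hello")).ID)
	fmt.Println("key 'zebra' lives in Region", regionFor(regions, []byte("zebra")).ID)
}
```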

Note that the Region here still has nothing to do with the table in SQL! Please continue to forget SQL and only talk about KV. After dividing the data into Regions, we will do two important things:

  • Using Region as the unit, distribute the data across all nodes in the cluster, and try to ensure that the number of Regions served on each node is about the same.
  • Do Raft replication and member management based on Region

Let's look at the first point first. The data is divided into many Regions by Key, and the data of each Region is stored on only one node (ignoring replicas for now). Our system has a component responsible for spreading the Regions as evenly as possible across all the nodes in the cluster. This achieves horizontal scaling of storage capacity (after a new node is added, Regions on other nodes are automatically scheduled onto it) and also achieves load balancing (one node will not end up with a lot of data while others have little). At the same time, to ensure that upper-layer clients can access the data they need, the system also has a component that records the distribution of Regions on the nodes; that is, given any Key, it can be determined which Region the Key belongs to and which node that Region currently lives on.
Regarding the second point, TiKV replicates data in units of Regions, meaning the data of one Region is kept in multiple copies, each of which we call a Replica. Raft is used to keep the data consistent among the Replicas. The Replicas of one Region are stored on different nodes and form a Raft Group. One Replica acts as the Leader of the Group and the others act as Followers. All reads and writes go through the Leader and are then replicated from the Leader to the Followers.

By dispersing and replicating data in units of Regions, we get a distributed Key-Value system with a certain degree of disaster recovery capability. We no longer need to worry about running out of capacity or losing data because of a disk failure. This is cool, but it is not perfect yet; we need more features.

MVCC

Many databases implement multi-version concurrency control (MVCC), and TiKV is no exception. Imagine two clients modifying the Value of the same Key at the same time. Without MVCC, the data would need to be locked, which in a distributed setting can cause performance problems and deadlocks. TiKV implements MVCC by appending a Version to the Key. To put it simply, before MVCC, TiKV can be thought of like this:

Key1 -> Value
Key2 -> Value
……
KeyN -> Value

With MVCC, TiKV's Key arrangement is as follows:

Key1-Version3 -> Value
Key1-Version2 -> Value
Key1-Version1 -> Value
……
Key2-Version4 -> Value
Key2-Version3 -> Value
Key2-Version2 -> Value
Key2-Version1 -> Value
……
KeyN-Version2 -> Value
KeyN-Version1 -> Value
……

Note that for multiple versions of the same Key, we put the larger version numbers first and the smaller ones after (recall from the Key-Value section that Keys are stored in order). In this way, when the user fetches a Value with a Key plus a Version, the Key and Version can be used to construct the MVCC Key, i.e. Key-Version; then a direct Seek(Key-Version) locates the first position greater than or equal to this Key-Version.
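
Here is a toy Go model of that layout, tying back to the tidb_snapshot example earlier: versions of a key are kept with larger versions first, and a snapshot read "seeks" to the first version not newer than the snapshot. The real MVCC key encoding in TiKV is more involved; this only demonstrates the idea.

```go
package main

import (
	"fmt"
	"sort"
)

// For one user key we keep every (version, value) pair, sorted so that
// larger versions come first, matching the layout shown above.
type versioned struct {
	version uint64
	value   string
}

type mvccStore map[string][]versioned

func (s mvccStore) Put(key string, version uint64, value string) {
	vs := append(s[key], versioned{version, value})
	// Larger version first: "bigger version number in front".
	sort.Slice(vs, func(i, j int) bool { return vs[i].version > vs[j].version })
	s[key] = vs
}

// Get performs a snapshot read: it "seeks" to the first entry whose
// version is <= the requested snapshot version.
func (s mvccStore) Get(key string, snapshot uint64) (string, bool) {
	for _, v := range s[key] {
		if v.version <= snapshot {
			return v.value, true
		}
	}
	return "", false
}

func main() {
	store := mvccStore{}
	store.Put("c", 1, "2")
	store.Put("c", 5, "222222") // the later update, as in the example above

	v, _ := store.Get("c", 3) // snapshot taken before the update
	fmt.Println("read at version 3:", v)
	v, _ = store.Get("c", 9) // latest snapshot
	fmt.Println("read at version 9:", v)
}
```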

Transactions

TiKV's transactions use the Percolator model, with many optimizations on top of it. TiKV's transactions use optimistic locking: during the execution of a transaction, write-write conflicts are not detected; conflict detection is done only during the commit phase. Of the conflicting parties, the one that commits first writes successfully, and the other must retry the entire transaction. This model performs very well when write conflicts in the workload are rare, for example when randomly updating rows of a very large table. However, if write conflicts are severe, performance will be very poor. An extreme example is a counter: many clients modify a small number of rows at the same time, causing serious conflicts and a large number of useless retries.
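
The sketch below is a minimal illustration of the optimistic idea only, not the actual Percolator two-phase commit: each key remembers the logical timestamp of its last committed write, and a transaction fails at commit time if any key it wants to write was changed after the transaction started.

```go
package main

import (
	"errors"
	"fmt"
)

// Each key remembers the logical timestamp of its last committed write.
type store struct {
	commitTS map[string]uint64
	data     map[string]string
	clock    uint64
}

type txn struct {
	s       *store
	startTS uint64
	writes  map[string]string
}

func (s *store) begin() *txn {
	s.clock++
	return &txn{s: s, startTS: s.clock, writes: map[string]string{}}
}

func (t *txn) set(key, value string) { t.writes[key] = value }

// commit checks for write-write conflicts only at commit time: if another
// transaction committed the same key after startTS, this one must retry.
func (t *txn) commit() error {
	for key := range t.writes {
		if t.s.commitTS[key] > t.startTS {
			return errors.New("write conflict on key " + key + ": retry the whole transaction")
		}
	}
	t.s.clock++
	for key, value := range t.writes {
		t.s.data[key] = value
		t.s.commitTS[key] = t.s.clock
	}
	return nil
}

func main() {
	s := &store{commitTS: map[string]uint64{}, data: map[string]string{}}

	t1 := s.begin()
	t2 := s.begin()
	t1.set("counter", "1")
	t2.set("counter", "2")

	fmt.Println("t1 commit:", t1.commit()) // succeeds: first to commit wins
	fmt.Println("t2 commit:", t2.commit()) // conflicts and must be retried
}
```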

Data calculation

Mapping of relational model to Key-Value model

Here we simply understand the relational model as Tables and SQL statements; the question then becomes how to save a Table on the KV structure and how to run SQL statements on top of the KV structure. Suppose we have the following table definition:
CREATE TABLE User (
	ID int,
	Name varchar(20),
	Role varchar(20),
	Age int,
	PRIMARY KEY (ID),
	KEY idxAge (Age)
);

There is a huge difference between the SQL structure and the KV structure, so how to map between them conveniently and efficiently becomes a very important question. A good mapping scheme must serve the needs of data operations, so let us first look at what those operations are and what characteristics they have.
For a Table, the data that needs to be stored consists of three parts:
1. The meta information of the table
2. The Rows in the Table
3. The Index data
We will not discuss the meta information of the table for now; it is introduced later. For Rows, we can choose either row storage or column storage, each with its own advantages and disadvantages. TiDB's primary target is OLTP workloads, which need to quickly read, save, modify, and delete a single row of data, so row storage is more appropriate. For Indexes, TiDB needs to support not only the Primary Index but also Secondary Indexes. Indexes assist queries, improve query performance, and guarantee certain Constraints.

There are two query patterns. One is a point query, such as querying through an equality condition on the Primary Key or a Unique Key, for example select name from user where id = 1;, which requires quickly locating a single row of data through an index. The other is a Range query, such as select name from user where age > 30 and age < 35;, which uses the idxAge index to find the rows whose age is between 30 and 35. Indexes are further divided into Unique Indexes and non-Unique Indexes, and both need to be supported.
After analyzing the characteristics of the data that needs to be stored, let's look at the operational requirements for these data, mainly considering the four statements of Insert/Update/Delete/Select.
For the Insert statement, Row needs to be written to KV and index data needs to be established.
For the Update statement, you need to update the Row and at the same time update the index data (if necessary).
For the Delete statement, the index needs to be deleted while deleting the Row.
The above three statements are very simple to process. For Select statements, the situation is a bit more complicated. First of all, we need to be able to read a row of data easily and quickly, so each Row needs to have an ID (explicit or implicit ID). Secondly, multiple consecutive rows of data may be read, such as Select * from user;. Finally, there is the need to read data through indexes. The use of indexes may be point queries or range queries.
The general requirements have been analyzed, now let us see what is available: a globally ordered distributed Key-Value engine. The overall order is important and can help us solve many problems. For example, to quickly obtain a row of data, assuming we can construct one or several Keys and locate this row, we can use the Seek method provided by TiKV to quickly locate the location of this row of data. Another example is the need to scan the entire table. If it can be mapped to a Key's Range and scanned from StartKey to EndKey, then the entire table data can be obtained simply in this way. The same idea applies to operating Index data. Next let's see how TiDB does it.
TiDB assigns a TableID to each table, an IndexID to each index, and a RowID to each row (if the table has an integer Primary Key, the value of the Primary Key is used as the RowID). The TableID is unique within the entire cluster, IndexID and RowID are unique within the table, and all of these IDs are of type int64.
Each row of data is encoded into a Key-Value pair according to the following rules:
Key: tablePrefix{tableID}_recordPrefixSep{rowID}
Value: [col1, col2, col3, col4]
where tablePrefix and recordPrefixSep in the Key are specific string constants used to distinguish this data from other data in the KV space.
For Index data, it will be encoded into a Key-Value pair according to the following rules:
Key: tablePrefix{tableID}_indexPrefixSep{indexID}_indexedColumnsValue
Value: rowID
Index data also needs to handle both Unique Indexes and non-Unique Indexes. A Unique Index can follow the encoding rule above. For a non-Unique Index, however, this encoding cannot produce a unique Key, because the tablePrefix{tableID}_indexPrefixSep{indexID} part is the same for the whole Index and the ColumnsValue of multiple rows may be identical, so the encoding of non-Unique Indexes is slightly adjusted:
Key: tablePrefix{tableID}_indexPrefixSep{indexID}_indexedColumnsValue_rowID
Value: null
This way, a unique Key can be constructed for each row of data in the index.
Note that the various xxPrefix values in the Keys of the above encoding rules are string constants whose purpose is to separate namespaces and avoid conflicts between different types of data. They are defined as follows:

var (
	tablePrefix     = []byte{'t'}
	recordPrefixSep = []byte("_r")
	indexPrefixSep  = []byte("_i")
)
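
Using these constants, the sketch below assembles a Row Key and a non-Unique Index Key following the rules above. It is a simplification: real TiDB encodes the IDs and column values with a memcomparable binary codec rather than the decimal strings used here for readability.

```go
package main

import "fmt"

// The namespace prefixes quoted above (from TiDB's source).
var (
	tablePrefix     = []byte{'t'}
	recordPrefixSep = []byte("_r")
	indexPrefixSep  = []byte("_i")
)

// encodeRowKey builds tablePrefix{tableID}_recordPrefixSep{rowID}.
func encodeRowKey(tableID, rowID int64) []byte {
	key := append([]byte{}, tablePrefix...)
	key = append(key, []byte(fmt.Sprintf("%d", tableID))...)
	key = append(key, recordPrefixSep...)
	key = append(key, []byte(fmt.Sprintf("%d", rowID))...)
	return key
}

// encodeIndexKey builds
// tablePrefix{tableID}_indexPrefixSep{indexID}_indexedColumnsValue_rowID,
// i.e. the non-Unique-Index form in which the rowID is part of the key.
func encodeIndexKey(tableID, indexID int64, indexedValue string, rowID int64) []byte {
	key := append([]byte{}, tablePrefix...)
	key = append(key, []byte(fmt.Sprintf("%d", tableID))...)
	key = append(key, indexPrefixSep...)
	key = append(key, []byte(fmt.Sprintf("%d_%s_%d", indexID, indexedValue, rowID))...)
	return key
}

func main() {
	// Matches the example further below: Table ID 10, Index ID 1,
	// row 1 with Age = 10.
	fmt.Println(string(encodeRowKey(10, 1)))            // t10_r1
	fmt.Println(string(encodeIndexKey(10, 1, "10", 1))) // t10_i1_10_1
}
```
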
In addition, note that in the above scheme, whether for the Row or the Index Key encoding, all Rows of a Table share the same prefix, and the data of one Index also shares a common prefix. Data with the same prefix is therefore arranged together in TiKV's Key space.
At the same time, as long as we carefully design the encoding of the suffix part so that the ordering before encoding is preserved after encoding, Row and Index data can be stored in TiKV in order. An encoding that preserves the comparison relationship is called Memcomparable: for values of any type, the comparison result of two objects before encoding is consistent with the byte-wise comparison result of their encoded byte arrays (recall that both Keys and Values in TiKV are raw byte arrays). With such an encoding, all Row data of a table is arranged in TiKV's Key space in RowID order, and the data of a particular Index is arranged in order of the Index's ColumnValue.
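
A tiny demonstration of why the memcomparable property matters (this shows the property only, it is not TiDB's exact codec): decimal strings do not preserve integer order, while a fixed-width big-endian encoding does.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// encodeUint64 produces a fixed-width big-endian encoding, which is
// memcomparable: byte-wise comparison of the encodings matches numeric
// comparison of the original values.
func encodeUint64(v uint64) []byte {
	b := make([]byte, 8)
	binary.BigEndian.PutUint64(b, v)
	return b
}

func main() {
	// Naive decimal strings break the ordering: "10" sorts before "9".
	fmt.Println(`"10" < "9" as bytes:`, bytes.Compare([]byte("10"), []byte("9")) < 0)

	// The fixed-width big-endian form keeps 9 < 10, as required.
	fmt.Println("enc(9) < enc(10):", bytes.Compare(encodeUint64(9), encodeUint64(10)) < 0)
}
```
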
Now let's take a look at the requirements mentioned at the beginning and TiDB's mapping solution to see whether this solution can meet the needs.
First, we use this mapping scheme to convert both Row and Index data into Key-Value data, and each row and each index data has a unique Key.
Secondly, this mapping scheme is friendly to both point queries and range queries: we can easily construct the Key corresponding to a certain row or index entry, or the Key range corresponding to adjacent rows or adjacent index values.
Finally, when enforcing certain Constraints on the table, we can construct the corresponding Key and check whether it already exists to determine whether the Constraint is satisfied.
So far we have covered how a Table maps to KV. Here is a simple example to help understanding, using the table structure above. Suppose there are 3 rows of data in the table:
1, “TiDB”, “SQL Layer”, 10
2, “TiKV”, “KV Engine”, 20
3, “PD”, “Manager”, 30
Then first, each row of data will be mapped to a Key-Value pair. Note that this table has a Primary Key of type Int, so the value of RowID is the value of this Primary Key. Assume that the Table ID of this table is 10, and its Row data is:
t10_r1 --> ["TiDB", "SQL Layer", 10]
t10_r2 --> ["TiKV", "KV Engine", 20]
t10_r3 --> ["PD", "Manager", 30]
In addition to the Primary Key, this table also has an Index. Assume that the ID of this Index is 1; its data is:
t10_i1_10_1 --> null
t10_i1_20_2 --> null
t10_i1_30_3 --> null

Meta information management

The previous section introduced how the data and indexes in a table are mapped to KV; this section introduces how meta information is stored. Databases and Tables both have meta information, that is, their definitions and various attributes. This information also needs to be persisted, and we store it in TiKV as well. Each Database/Table is assigned a unique ID that serves as its unique identifier. When encoding to Key-Value, this ID is encoded into the Key with an m_ prefix, so that a Key can be constructed and the serialized meta information stored in the Value.

In addition, there is a dedicated Key-Value pair that stores the version of the current Schema information. TiDB uses Google F1's Online Schema Change algorithm: a background thread constantly checks whether the Schema version stored in TiKV has changed, ensuring that a version change is picked up within a bounded period of time (if it does change).

SQL on KV architecture

The main function of TiKV Cluster is to store data as a KV engine. The details have been introduced before and will not be described here. Here we mainly introduce the SQL layer, which is the TiDB Servers layer. The nodes in this layer are stateless nodes and do not store data themselves. The nodes are completely equal. The most important work of this layer of TiDB Server is to process user requests and execute SQL operation logic. Next, we will make some brief introductions.

SQL operations

​ After understanding the mapping scheme from SQL to KV, we can understand how relational data is stored. Next, we need to understand how to use this data to meet the user's query needs, that is, how a query statement operates on the underlying stored data.
The simplest solution that can be thought of is to use the mapping solution described in the previous section to map the SQL query into a KV query, then obtain the corresponding data through the KV interface, and finally perform various calculations.
For example, for a statement such as select count(*) from user where name = "TiDB";, we need to read all the data in the table and check whether the Name field is "TiDB"; if so, the row is returned. This process is converted into the following KV operations (a toy sketch of this naive plan appears right after the list):

  • Construct the Key Range: All RowIDs in a table are in the range [0, MaxInt64), then we use 0 and MaxInt64 according to the Key encoding rules of the Row to construct a left-closed and right-open [StartKey, EndKey) interval
  • Scan Key Range: Read the data in TiKV based on the Key Range constructed above
  • Filter data: For each row of data read, calculate the expression name="TiDB". If it is true, return this row upwards, otherwise discard this row of data.
  • Calculate Count: For each row that meets the requirements, accumulate the Count value.
This naive plan will certainly work, but it does not work well, for obvious reasons:
    • When scanning data, each row must be read from TiKV through KV operation, which requires at least one RPC overhead. If there is a lot of data to be scanned, this overhead will be very large.
    • Not all rows are useful. If the conditions are not met, you don’t need to read them out.
    • The values of the qualifying rows are not actually needed; what is needed here is only how many rows there are.
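
For concreteness, here is the naive plan written out as a purely illustrative Go sketch. The helper scanAll is hypothetical and simply returns decoded rows in memory; in reality each row would be read from TiKV over the network, which is exactly the cost the next section removes.

```go
package main

import "fmt"

// One decoded row of the User table from the example above.
type userRow struct {
	ID   int64
	Name string
	Role string
	Age  int64
}

// scanAll stands in for "construct the key range [StartKey, EndKey) and
// scan every row from TiKV"; here it just returns an in-memory slice.
func scanAll() []userRow {
	return []userRow{
		{1, "TiDB", "SQL Layer", 10},
		{2, "TiKV", "KV Engine", 20},
		{3, "PD", "Manager", 30},
	}
}

func main() {
	// select count(*) from user where name = "TiDB";
	count := 0
	for _, row := range scanAll() { // Scan Key Range
		if row.Name == "TiDB" { // Filter data
			count++ // Calculate Count
		}
	}
	fmt.Println("count(*):", count)
	// Every row crossed the "network" just to be counted, which is the
	// inefficiency the distributed (pushdown) version avoids.
}
```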

Distributed SQL operations

How to avoid the defects above is also obvious:

First, we need to move the calculation as close as possible to the storage node to avoid a large number of RPC calls.

Secondly, we need to push the Filter down to the storage node for calculation, so that only valid rows need to be returned to avoid meaningless network transmission.

Finally, we can push aggregation functions and GroupBy down to the storage nodes for pre-aggregation. Each node only needs to return a Count value, and tidb-server then sums up these Count values.
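
The sketch below gives a conceptual picture of that pushdown for the same count(*) query; it is not the real TiKV coprocessor interface. Each storage node filters and pre-aggregates locally, and the SQL layer only sums the small partial results.

```go
package main

import "fmt"

type row struct {
	Name string
}

// partialCount runs on the storage side: filter + pre-aggregate,
// returning only a single number per node.
func partialCount(rows []row, name string) int {
	n := 0
	for _, r := range rows {
		if r.Name == name {
			n++
		}
	}
	return n
}

func main() {
	// Rows of the same table spread across three storage nodes.
	nodes := [][]row{
		{{"TiDB"}, {"TiKV"}},
		{{"PD"}, {"TiDB"}},
		{{"TiDB"}},
	}

	// The SQL layer sums the partial counts it receives from each node.
	total := 0
	for _, rows := range nodes {
		total += partialCount(rows, "TiDB")
	}
	fmt.Println(`count(*) where name = "TiDB":`, total)
}
```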

(Schematic diagram: the data is filtered, pre-aggregated, and returned layer by layer.)

SQL layer architecture

The sections above briefly introduced some functions of the SQL layer; I hope everyone now has a basic understanding of how SQL statements are processed. In reality, TiDB's SQL layer is far more complicated, with many modules and layers. (The figure in the original article lists the important modules and their calling relationships.)

​ The user's SQL request will be sent to tidb-server directly or through Load Balancer. tidb-server will parse the MySQL Protocol Packet, obtain the request content, and then perform syntax analysis, query plan formulation and optimization, and execute the query plan to obtain and process data.

All data is stored in the TiKV cluster, so in this process tidb-server needs to interact with tikv-server to obtain data.

Finally, tidb-server returns the query results to the user.

Task scheduling

Why Scheduling

Let's first recall some of the information from the storage section above. The TiKV cluster is the distributed KV storage engine of the TiDB database, in which data is replicated and managed in units of Regions. Each Region has multiple Replicas distributed on different TiKV nodes; the Leader is responsible for reads and writes, and the Followers are responsible for applying the Raft log sent by the Leader. Now, with this information in mind, consider the following questions:

  • How to ensure that multiple Replicas of the same Region are distributed on different nodes? Furthermore, what are the problems if multiple TiKV instances are started on one machine?
  • When the TiKV cluster is deployed across computer rooms for disaster recovery, how to ensure that if one computer room goes offline, multiple replicas of the Raft Group will not be lost?
  • After a new node is added to the TiKV cluster, how can data on the other nodes be migrated onto it?
  • What problems occur when a node goes offline? What does the entire cluster need to do? What to do if the node is only temporarily offline (restarting the service)? What should be done if a node is offline for a long time (disk failure, all data is lost)?
  • Assume that the cluster requires every Raft Group to have N replicas. For a single Raft Group, the number of Replicas may be insufficient (for example, a node went offline and a replica was lost) or excessive (for example, a failed node came back online and automatically rejoined the cluster). How should the number of Replicas be adjusted?
  • Reading/writing is performed through the Leader. If the Leader is only concentrated on a small number of nodes, what impact will it have on the cluster?
  • Not all Regions are frequently accessed. The access hotspots may only be in a few Regions. What do we need to do at this time?
  • When the cluster performs load balancing, data migration is often required. Will this migration consume a lot of network bandwidth, disk IO, and CPU, and thereby affect online services?

These problems may have simple solutions when taken individually, but mixed together they are not easy to solve. Some seem to require considering only the internal state of a single Raft Group, for example deciding whether to add a replica based on whether there are enough of them; but where to add that replica requires global information. The whole system is also changing dynamically: Region splits, node additions, node failures, shifts in access hotspots, and so on keep happening, and the scheduling system must keep moving toward the optimal state amid these changes. Without a component that has global information, can command the overall situation, and can perform scheduling, it is difficult to meet these requirements. Therefore, we need a central node to control and adjust the overall state of the system, and that is the PD module.

Scheduling needs

There are a lot of questions listed above, let’s classify and organize them first. Generally speaking, there are two major categories of problems:
1. As a distributed high-availability storage system, there are four requirements that must be met:

  • The number of replicas must be neither more nor fewer than required

  • Replicas need to be distributed on different machines

  • After a new node is added, replicas from other nodes can be migrated onto it.

  • After a node goes offline, the data on that node needs to be migrated away.

2. As a good distributed system, areas that need to be optimized include:

  • Maintain even distribution of Leaders throughout the cluster

  • Maintain uniform storage capacity of each node

  • Maintain even distribution of access hotspots

  • Control the speed of Balance to avoid affecting online services

  • Manage node status, including manually bringing nodes online/offline, and automatically taking offline failed nodes.

After meeting the first type of requirements, the entire system will have the functions of multi-copy fault tolerance, dynamic expansion/shrinking, tolerance of node disconnection, and automatic error recovery.
After meeting the second type of requirements, the load of the overall system can be made more even and can be easily managed.
In order to meet these needs, we first need to collect enough information, such as the state of each node, information about each Raft Group, and statistics on business access patterns; secondly, we need to set some scheduling policies, and PD, based on this information and the policies, formulates a scheduling plan that satisfies the requirements above as far as possible; finally, some basic operations are needed to carry out the scheduling plan.

Basic operations of scheduling

Let’s first introduce the simplest point, which is the basic operation of scheduling, that is, what functions we can use in order to meet the scheduling strategy. This is the basis of the entire scheduling. Only when you understand what kind of hammer you have in your hand can you know what posture to use to smash nails.

The above scheduling requirements may seem complicated, but the final implementation is nothing more than the following three things:

  • Add a Replica

  • Delete a Replica

  • Transfer the Leader role between different Replicas of a Raft Group

It just so happens that the Raft protocol can meet these three needs. Through the three commands of AddReplica, RemoveReplica, and TransferLeader, it can support the above three basic operations.

Collecting information

Scheduling relies on information collected from the entire cluster. Simply put, we need to know the state of every TiKV node and of every Region. The TiKV cluster reports two kinds of messages to PD:
Each TiKV node (Store) regularly reports its overall state to PD.
There is a heartbeat packet between each Store and PD. On the one hand, PD uses the heartbeat to detect whether each Store is alive and whether any new Store has been added; on the other hand, the heartbeat also carries the Store's status information, mainly including:

  • Total disk capacity
  • Available disk capacity
  • Number of Regions carried
  • Data writing speed
  • Number of Snapshots being sent/received (Replicas may synchronize data through Snapshots)
  • Whether the Store is overloaded
  • Label information (labels are a series of Tags with a hierarchical relationship)

The leader of each Raft Group will regularly report information to the PD.
There is a heartbeat packet between the leader of each Raft Group and the PD, which is used to report the status of the Region, which mainly includes the following information:

  • The position of the Leader
  • The positions of the Followers
  • The number of offline Replicas
  • Data writing/reading speed

PD continuously collects information about the whole cluster through these two kinds of heartbeat messages and uses it as the basis for decisions. In addition, PD can receive extra information through its management interface to make more accurate decisions. For example, when the heartbeat packets of a Store are interrupted, PD cannot tell whether the node has failed temporarily or permanently, so it waits for a period of time (30 minutes by default); if there is still no heartbeat after that, the Store is considered offline, and PD decides that all Regions on that Store need to be scheduled away. However, sometimes an operator proactively takes a machine offline; in that case, the operator can tell PD through its management interface that the Store is unavailable, and PD can immediately decide to schedule all Regions off that Store.
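
To make the two message types easier to picture, here are illustrative Go struct shapes for them. The field names are made up for readability; the real messages are protobuf definitions in the TiKV/PD projects.

```go
package main

import "fmt"

// StoreHeartbeat: each TiKV node (Store) regularly reports its own state to PD.
type StoreHeartbeat struct {
	StoreID       uint64
	Capacity      uint64 // total disk capacity, in bytes
	Available     uint64 // available disk capacity, in bytes
	RegionCount   int    // number of Regions carried
	WriteBytesSec uint64 // data writing speed
	SendingSnaps  int    // snapshots being sent/received
	IsBusy        bool   // whether the Store is overloaded
	Labels        map[string]string
}

// RegionHeartbeat: the Leader of each Raft Group reports the Region's state.
type RegionHeartbeat struct {
	RegionID       uint64
	LeaderStore    uint64   // where the Leader is
	FollowerStores []uint64 // where the Followers are
	DownReplicas   int      // number of offline Replicas
	ReadBytesSec   uint64
	WriteBytesSec  uint64
}

func main() {
	hb := RegionHeartbeat{RegionID: 2, LeaderStore: 1, FollowerStores: []uint64{2, 3}}
	fmt.Printf("Region %d: leader on store %d, followers on stores %v\n",
		hb.RegionID, hb.LeaderStore, hb.FollowerStores)
}
```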

Scheduling strategy

After PD collects this information, it also needs some strategies to develop specific scheduling plans.

  1. The number of Replica in a Region is correct.
    When PD discovers that the number of Replica in this Region does not meet the requirements through the heartbeat packet of a Region Leader, it needs to adjust the number of Replica through the Add/Remove Replica operation. Possible reasons for this happening are:
    • When a node goes offline, all the data on it is lost, resulting in insufficient replicas in some regions.

    • An offline node comes back online and automatically rejoins the cluster, so a Region that already had enough Replicas now has one too many and a Replica needs to be removed.

    • The administrator adjusted the replica policy and changed the max-replicas configuration.

2. Multiple Replicas in a Raft Group are not in the same location.
Please note the wording in the second point: "multiple Replicas of a Raft Group are not in the same location" says "same location" rather than "same node". Under normal circumstances, PD only ensures that multiple Replicas do not land on the same node, to avoid losing multiple Replicas when a single node fails. In actual deployments, the following requirements may also arise:

  • Multiple nodes deployed on the same physical machine

  • TiKV nodes are distributed on multiple racks, and it is hoped that system availability can be guaranteed even when a single rack is powered off.

  • TiKV nodes are distributed across multiple IDCs, and it is hoped that system availability can be guaranteed even when a single data center loses power.

These requirements are essentially that certain nodes share common location attributes and constitute a minimum fault-tolerant unit, and we hope that no single such unit holds more than one Replica of a Region. To achieve this, you can configure labels on the nodes and specify which labels are location identifiers through the location-labels configuration on PD. When allocating Replicas, PD tries to ensure that a Region does not have multiple Replicas on nodes with the same location identifier.

3. Replicas are evenly distributed among the Stores.
As mentioned before, the upper limit on the amount of data each Replica stores is fixed, so keeping the number of Replicas balanced across the nodes makes the overall load more balanced.
4. The number of Leaders is evenly distributed among the Stores.
The Raft protocol reads and writes through the Leader, so the computing load is mainly on the Leader, and PD will spread the Leader among the nodes as much as possible.
5. The number of access hotspots is evenly distributed among the Stores.
Each Store and Region Leader carry information about the current access load when reporting information, such as the read/write speed of the Key. PD detects access hotspots and spreads them among nodes.
6. The storage space occupied by each Store is roughly equal.
When each Store is started, a Capacity parameter is specified, indicating the upper limit of the storage space of this Store. PD will consider the remaining storage space of the node when scheduling.
7. Control the scheduling speed to avoid affecting online services.
Scheduling operations consume CPU, memory, disk IO, and network bandwidth, so we need to avoid affecting online services too much. PD controls the number of operations in progress at any time, and the default rate limit is fairly conservative. If you want to speed up scheduling (for example, you have stopped the service for an upgrade or added new nodes and want the data rebalanced as soon as possible), you can manually increase the scheduling speed through pd-ctl.
8. Supports manual node offline.
When a node is manually taken offline through pd-ctl, PD schedules the data on that node away under a certain rate limit. When the scheduling is completed, the node is placed in the offline state.

Implementation of scheduling

After understanding the above information, let's take a look at the entire scheduling process.

PD continuously collects information through the heartbeat packets of Stores and Region Leaders to obtain detailed data about the whole cluster, and generates a sequence of scheduling operations based on this information and the scheduling policies. Every time PD receives a heartbeat packet from a Region Leader, it checks whether there are pending operations for that Region; if so, it returns the required operations to the Region Leader in the reply to the heartbeat packet, and then monitors the results in subsequent heartbeats. Note that these operations are only suggestions to the Region Leader; there is no guarantee that they will be executed. Whether and when they are executed is decided by the Region Leader itself based on its current state.
