Distributed Storage Technology (Part 2): Architecture Principles, Characteristics, Advantages and Disadvantages of Wide Table Storage and Full-text Search Engine

For write-intensive applications, the daily write volume is huge, data growth is unpredictable, and the requirements for performance and reliability are very high; ordinary relational databases cannot meet these needs. The same is true for scenarios such as full-text search and data analysis, which demand extremely high query performance. Wide table storage and search engine technology were developed to serve these two classes of scenarios, and this article introduces their architecture, principles, advantages, and disadvantages.

— Wide table storage

Wide table storage originated with Google's Bigtable paper, which defined it as:

A Bigtable is a sparse, distributed, persistent multidimensional sorted map. The map is indexed by a row key, column key, and a timestamp; each value in the map is an uninterpreted array of bytes.

From "Bigtable: A Distributed Storage System for Structured Data"

Bigtable stores data in a number of Tables. Each Cell (data unit) in a Table holds an uninterpreted byte string and is addressed along three dimensions: row, column, and timestamp.

The picture comes from "Bigtable: A distributed storage system for structured data"

When Bigtable stores data, it sorts each Table by the Cells' Row Keys, splits the Table into several Tablets of adjacent rows, and distributes the Tablets across different Tablet Servers. As a result, when a client queries Row Keys that are close to each other, the corresponding Cells are more likely to fall on the same Tablet and the query is more efficient.

A Cell in a Table can hold multiple versions of the same data, distinguished by timestamp. The timestamp is essentially a 64-bit integer; Bigtable can set it automatically to the current time in microseconds when the data is written, or the application can set it explicitly, in which case the application must guarantee that timestamps do not collide. For Cells with the same Row Key and Column Key, Bigtable sorts the versions in descending timestamp order so that the newest data is read first.
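To make the three-dimensional addressing concrete, here is a conceptual sketch in Java (illustrative only, not Bigtable code) of a table as a sparse, sorted map keyed by row, column, and timestamp; rows and columns sort ascending, while timestamps sort descending so that the newest version is returned first.

```java
import java.util.Comparator;
import java.util.NavigableMap;
import java.util.TreeMap;

// Conceptual sketch: row key -> column key -> timestamp -> value.
public class SortedCellMap {
    private final NavigableMap<String, NavigableMap<String, NavigableMap<Long, byte[]>>> table =
            new TreeMap<>();

    public void put(String rowKey, String columnKey, long timestamp, byte[] value) {
        table.computeIfAbsent(rowKey, r -> new TreeMap<>())
             // reverse timestamp order: the newest version is always the first entry
             .computeIfAbsent(columnKey, c -> new TreeMap<Long, byte[]>(Comparator.reverseOrder()))
             .put(timestamp, value);
    }

    /** Returns the newest version of a cell, or null if the cell does not exist. */
    public byte[] getLatest(String rowKey, String columnKey) {
        NavigableMap<String, NavigableMap<Long, byte[]>> row = table.get(rowKey);
        if (row == null) return null;
        NavigableMap<Long, byte[]> versions = row.get(columnKey);
        return versions == null ? null : versions.firstEntry().getValue();
    }
}
```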

Because Google Bigtable addressed core business requirements such as concurrent retrieval and high-speed log writing over massive data sets, it inspired a family of projects in the industry, such as HBase and Cassandra, collectively called Wide Column Stores (wide table storage, also known as table storage). Wide table storage imposes no fixed schema: the columns of a table can be extended freely, a table can have an effectively unlimited number of columns, each row can have a different set of columns, and rows may contain many null values, much like a sparse matrix. A column family groups related data that is frequently queried together.

The current DB-Engines ranking of wide-column NoSQL databases is shown below; the most popular are Cassandra, HBase, and Azure Cosmos DB. Next we introduce HBase in more detail.

HBase is a column-oriented distributed NoSQL database and an open-source implementation of Google's Bigtable design, built to serve random, real-time data retrieval. Its primary storage and processing targets are large, wide tables, and its storage layer is compatible with the file systems Hadoop supports, such as local storage, HDFS, and Amazon S3. Compared with an RDBMS, it scales out nearly linearly. HBase maintains a stable write rate by adopting an LSM-tree-based storage design, and relies on its log management mechanism together with HDFS's multi-replica mechanism for fault tolerance. Typical use cases are OLTP services with highly concurrent writes and queries over multi-version, sparse, semi-structured, and structured data.

HBase's data model is made up of several logical concepts: table, row, row key, column, column family, cell, and timestamp; a minimal Java client sketch follows the figure below.

  • Table : the organizational form of data in HBase and a collection of columns, similar in meaning to a table in a traditional database; it can also cover update records of column data under different timestamps.

  • Column : a single data item in the database; each Column holds one type of data.

  • Column Family (ColumnFamily) : data in an HBase table is grouped by ColumnFamily, an aggregation of columns of similar type within one or more HBase tables. HBase stores the data of one ColumnFamily in the same file, which works much like vertical partitioning: unnecessary scanning is reduced and queries are faster.

  • Row : a collection consisting of a RowKey and its ColumnFamilies; one Row can include one or more ColumnFamilies.

  • RowKey : row data in HBase is sorted by RowKey, which acts as a primary key. HBase can locate data by RowKey when querying, and Regions are also divided by RowKey.

  • Timestamp : the version identifier of a given value, written to the HBase database together with the value. A Timestamp can be in any time format, and each RowKey can carry multiple time-stamped versions of a record.

 

The picture comes from "HBase: The Definitive Guide"
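The data model above can be exercised through HBase's Java client API. The sketch below is a minimal example under assumed conditions: an existing table named "users" with a column family "info", and a cluster reachable through the hbase-site.xml on the classpath. It writes one cell and reads back its newest version.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseDataModelSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();          // reads hbase-site.xml
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"))) {

            // Write one cell: RowKey "user#001", ColumnFamily "info", column qualifier "name".
            Put put = new Put(Bytes.toBytes("user#001"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            table.put(put);

            // Read it back; by default HBase returns the version with the newest timestamp.
            Result result = table.get(new Get(Bytes.toBytes("user#001")));
            System.out.println(Bytes.toString(
                    result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
        }
    }
}
```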

In HBase, a table is split into multiple Regions by RowKey range; the Region is the basic unit of data management in HBase. Because Regions are cut along RowKey boundaries, they behave like horizontal range partitions, so data can be distributed across the nodes of the cluster, and the Regions on different nodes together form the overall logical view of the table. Capacity can be increased simply by adding Regions.
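Because Regions are split along RowKey boundaries, a table can also be pre-split at creation time so that load is spread across RegionServers from the beginning. The sketch below uses the HBase 2.x Admin API; the table name, column family, and split points are illustrative assumptions.

```java
import java.io.IOException;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitSketch {
    // Create "users" pre-split into three Regions: (-inf,"g"), ["g","p"), ["p",+inf).
    static void createPreSplitTable(Connection connection) throws IOException {
        try (Admin admin = connection.getAdmin()) {
            TableDescriptor desc = TableDescriptorBuilder
                    .newBuilder(TableName.valueOf("users"))
                    .setColumnFamily(ColumnFamilyDescriptorBuilder.of("info"))
                    .build();
            byte[][] splitKeys = { Bytes.toBytes("g"), Bytes.toBytes("p") };
            admin.createTable(desc, splitKeys);
        }
    }
}
```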

The picture comes from "HBase: The Definitive Guide"

Regions are maintained by HRegionServers, and HRegionServers are managed by the HMaster. The HMaster can automatically adjust the number of Regions on each HRegionServer, allowing stored data to scale out essentially without limit.

 

The picture comes from "HBase: The Definitive Guide"

The technical architecture of HBase is shown in the figure above. The main components or services include:

  • Client : the access entry point of the whole HBase cluster; it communicates with the HMaster for cluster-management operations and with HRegionServers for data reads and writes.

  • ZooKeeper : the status of every node in the cluster is registered in ZooKeeper, and the HMaster perceives the health of each HRegionServer through it. HBase also allows several HMasters to be started, and ZooKeeper guarantees that only one HMaster is active in the cluster at a time.

  • HMaster : manages operations on data tables (such as creating, altering, and deleting tables), balances load across HRegionServers, allocates new Regions, and migrates Regions off HRegionServers that have failed or been shut down.

  • HRegionServer : one per node; it receives the Regions assigned by the HMaster, communicates with Clients, and handles all read/write requests for the Regions it manages.

  • HStore : the storage unit of HBase, consisting of 1 MemStore and 0 or more StoreFiles. Data is first written to the in-memory MemStore, then flushed into StoreFiles (wrappers around HFiles), and finally persisted to HDFS. When a column is queried, only the corresponding HDFS blocks need to be read.

  • HLog : log management and replay; every operation that enters the MemStore is also recorded in the HLog.

HBase uses an LSM tree as its underlying storage structure. Compared with the B+ tree commonly used by RDBMSs, the LSM tree offers a better write rate. Because random disk I/O is orders of magnitude slower than sequential I/O, database storage designs try to avoid random disk access. A B+ tree tries to keep a node's entries on one page, but this only helps while the data volume is relatively small; under heavy random writes, nodes split frequently and the probability of random disk reads and writes rises. To protect the write rate, an LSM tree first writes to memory and then flushes to disk sequentially in batches. This sequential-write design gives the LSM tree better bulk-write performance than the B+ tree, but reads must merge in-memory data with historical data on disk, so read performance is sacrificed to some degree. The LSM tree recovers some read speed by compacting small files into larger ones (compaction) and by using Bloom filters.
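The following toy, in-memory sketch (not HBase internals) captures the LSM idea: writes land in a sorted memtable, the memtable is frozen into an immutable sorted segment once it passes a size threshold (standing in for an HFile written sequentially to disk), and reads check the memtable first and then segments from newest to oldest.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.TreeMap;

public class TinyLsm {
    private static final int FLUSH_THRESHOLD = 4;
    private TreeMap<String, String> memtable = new TreeMap<>();                 // in-memory, sorted
    private final Deque<TreeMap<String, String>> segments = new ArrayDeque<>(); // flushed, newest first

    public void put(String key, String value) {
        memtable.put(key, value);                  // cheap in-memory write
        if (memtable.size() >= FLUSH_THRESHOLD) {  // "flush": freeze the memtable as a segment
            segments.addFirst(memtable);
            memtable = new TreeMap<>();
        }
    }

    public String get(String key) {
        if (memtable.containsKey(key)) return memtable.get(key);  // newest data wins
        for (TreeMap<String, String> segment : segments) {        // then newest-to-oldest segments
            if (segment.containsKey(key)) return segment.get(key);
        }
        return null;
    }
}
```

A real LSM engine would additionally compact segments and consult Bloom filters before touching each one, which is exactly the read-side mitigation mentioned above.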

The picture comes from "HBase: The Definitive Guide"

HBase's storage-related structures include the MemStore, HFile, and WAL. The MemStore is an in-memory structure that turns random writes into sequential writes: incoming data is written to the MemStore first and is flushed to disk once the in-memory store can no longer hold it. The HFile is the file format in which HBase data is finally written to disk, that is, the underlying storage format of a StoreFile; in HBase, one StoreFile corresponds to one HFile. HFiles are usually stored on HDFS, which guarantees data integrity and provides distributed storage. The WAL (Write-Ahead Log) provides high-concurrency, durable log storage and replay; every business operation applied to HBase is recorded in the WAL for disaster recovery. For example, if the machine loses power while MemStore data is being persisted to an HFile, the lost updates can be replayed from the WAL without data loss.
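As a rough illustration of this write path (again a conceptual sketch, not HBase code; Java 16+ for the record type), every mutation is appended to the WAL before being applied to the MemStore, so unflushed MemStore contents can be rebuilt by replaying the log after a crash.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

public class WalBackedStore {
    record LogEntry(String rowKey, String column, String value) {}

    private final List<LogEntry> wal = new ArrayList<>();         // stands in for an HLog on HDFS
    private final TreeMap<String, String> memStore = new TreeMap<>();

    public void put(String rowKey, String column, String value) {
        wal.add(new LogEntry(rowKey, column, value));             // 1. durable append to the WAL
        memStore.put(rowKey + "/" + column, value);               // 2. apply to the in-memory MemStore
    }

    // After a crash, rebuild the MemStore entries that were never flushed to HFiles.
    public void replayAfterCrash() {
        memStore.clear();
        for (LogEntry entry : wal) {
            memStore.put(entry.rowKey() + "/" + entry.column(), entry.value());
        }
    }
}
```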

— Full-text search engine

Unlike relational databases, which store data as row records with a fixed format, search engines store data as documents, physically organized in hierarchical or tree structures. The advantage of this approach is that it is very easy to extend the system's ability to process semi-structured and structured data.

At present, the representative search-engine database is the open-source Elasticsearch; in China, Transwarp Technology has a self-developed search product, Scope. In the DB-Engines ranking, Elasticsearch sits inside the top ten all year round. Compared with SQL databases, ES provides scalable, near-real-time distributed search: it splits text data into multiple parts that are stored and replicated across the cluster's nodes to improve retrieval speed and protect data integrity, and it supports automatic load balancing and failover to keep the whole cluster highly available. ES suits businesses that need to process unstructured data such as documents, for example intelligent word segmentation, full-text search, and relevance ranking.

ES defines its own set of elements and concepts for storing and managing data, the most important being Field, Document, Type, and Index. A Field is the smallest data unit in Elasticsearch, similar to a column in a relational database: a collection of data values of the same type. A Document is similar to a row in a relational database and contains a value for each of its Fields. Type is roughly the table-level concept, while Index is the largest data unit in Elasticsearch; unlike an index in SQL, an Index in ES corresponds roughly to a database or schema in SQL. The mapping is summarized below, followed by a minimal usage sketch:

Elasticsearch | SQL Database
------------- | ------------
Index         | Database
Type          | Table
Document      | Row
Field         | Column
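As a concrete illustration of this mapping, the sketch below uses the Elasticsearch 7.x RestHighLevelClient (since superseded by the newer Java API Client) to index one Document into a "logs" Index and run a full-text match query against one of its Fields. The host, index name, and field names are assumptions, not part of the original article.

```java
import org.apache.http.HttpHost;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.builder.SearchSourceBuilder;

public class EsDocumentSketch {
    public static void main(String[] args) throws Exception {
        try (RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")))) {

            // Index one Document into the "logs" Index; each key/value pair is a Field.
            client.index(new IndexRequest("logs")
                            .id("1")
                            .source("message", "connection timeout on node-3", "level", "ERROR"),
                    RequestOptions.DEFAULT);

            // Full-text match query on the "message" Field (served by the inverted index).
            // Note: newly indexed documents become searchable only after the next refresh (~1s).
            SearchResponse response = client.search(
                    new SearchRequest("logs").source(new SearchSourceBuilder()
                            .query(QueryBuilders.matchQuery("message", "timeout"))),
                    RequestOptions.DEFAULT);
            System.out.println("hits: " + response.getHits().getTotalHits());
        }
    }
}
```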

In terms of the physical model, the main concepts in Elasticsearch are the Cluster, Node, and Shard. A Node is one Elasticsearch instance, that is, one Java process; if hardware resources allow, multiple instances can run on a single machine. A Shard is the smallest unit of data handling in an ES cluster and is an instance of a Lucene index; each Index consists of shards on one or more nodes. Shards are divided into primary shards and replica shards. Every Document is stored in one primary shard; if that primary shard or the node it resides on fails, a replica shard is promoted to primary to keep the data highly available. During retrieval, queries can also be executed on replica shards to relieve the primary shards and improve query performance.
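A brief sketch of how these shard counts are configured (same 7.x client; the index name and numbers are assumed): the number of primary shards is fixed when an Index is created, while the replica count can be adjusted later.

```java
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.client.indices.CreateIndexRequest;
import org.elasticsearch.common.settings.Settings;

public class EsIndexSetupSketch {
    // Create "logs" with 3 primary shards and 1 replica per primary,
    // i.e. 6 shard copies that Elasticsearch spreads across the cluster's nodes.
    static void createLogsIndex(RestHighLevelClient client) throws Exception {
        CreateIndexRequest request = new CreateIndexRequest("logs");
        request.settings(Settings.builder()
                .put("index.number_of_shards", 3)       // fixed at creation time
                .put("index.number_of_replicas", 1));   // can be changed afterwards
        client.indices().create(request, RequestOptions.DEFAULT);
    }
}
```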

When writing data, Elasticsearch assigns each document to a shard based on the document ID; when querying, it queries all shards and merges the results. To keep results from being skewed when some shard queries fail, and to narrow the set of shards a query must touch, Elasticsearch provides a routing function: at write time a routing value directs the document to a specific shard, and the same routing value at query time indicates which shard the data should be retrieved from.
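By default, the target shard is chosen roughly as shard = hash(_routing) % number_of_primary_shards, with _routing defaulting to the document ID. The sketch below (same 7.x client; index name and routing value are assumptions) supplies the same custom routing value at write time and query time, so the query only has to touch the shard that value maps to instead of fanning out to every shard.

```java
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.builder.SearchSourceBuilder;

public class EsRoutingSketch {
    static void indexAndSearchWithRouting(RestHighLevelClient client) throws Exception {
        // Write with a routing value ("user-42" is illustrative).
        client.index(new IndexRequest("logs")
                        .id("2")
                        .routing("user-42")
                        .source("message", "login failed", "user", "user-42"),
                RequestOptions.DEFAULT);

        // Query with the same routing value, so only the matching shard is searched.
        client.search(new SearchRequest("logs")
                        .routing("user-42")
                        .source(new SearchSourceBuilder()
                                .query(QueryBuilders.matchQuery("message", "login"))),
                RequestOptions.DEFAULT);
    }
}
```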

Built on Lucene's inverted-index technology, Elasticsearch performs extremely well when querying and searching text and log data, surpassing almost all relational databases and other Lucene-based products (such as Solr), so it is widely used in log analysis, intelligence analysis, and knowledge retrieval, especially through the many industry solutions built on the Elastic Stack. In 2021, however, Elastic changed its license model to restrict cloud vendors from directly hosting and selling Elasticsearch, and the industry has tried to work around those license restrictions in other ways (for example OpenSearch, forked from Elasticsearch 7.10).

However, ES also has several obvious architectural shortcomings that limit further expansion of its application scenarios, including:

  • No transaction support; it offers only eventual consistency, so it is unsuitable for storing critical data

  • Weak analytical capabilities; for example, complex aggregations are limited

  • High availability between shards relies on master-slave replication, which can suffer split-brain problems

  • The data-processing capacity of a single Node still needs improvement

  • ES's security module is a commercial plug-in, so a large number of ES clusters run without adequate security protection

These architectural shortcomings also point other search engines in the industry toward directions for improvement, especially ES security, whose gaps have contributed to several major data leaks in China. Transwarp Technology began developing a domestic alternative to ES in 2017 and in 2019 released Scope, a distributed search engine based on Lucene. It adopts a new high-availability architecture based on the Paxos protocol, supports cross-data-center deployment, has built-in native security features, and supports multiple Node instances on a single server, which together greatly improve the stability, reliability, and performance of the search engine. It has been deployed in several large production clusters in China, the largest single cluster exceeding 500 server nodes.

— Summary

This article introduced the architecture, principles, advantages, and disadvantages of wide table storage and search engine technology (these technologies evolve rapidly, so some descriptions may lag behind the latest developments). With basic data storage and management in place, the next step is to obtain the required results through integrated computing. Faced with large volumes of data, achieving high throughput, low latency, high scalability, and fault tolerance is the key challenge of distributed computing. Starting from the next article, we will introduce distributed computing technologies represented by MapReduce and Spark.

References

[1] Chang F, Dean J, Ghemawat S, et al. Bigtable: A Distributed Storage System for Structured Data. ACM Transactions on Computer Systems (TOCS), 2008, 26(2): 1-26.

[2] Lars George. HBase: The Definitive Guide. 2011.
