Storage Series Summary 2.2 -- BigTable Core Design Principles

Introduction

BigTable is a distributed storage system developed by Google to manage structured data, built on top of GFS, Chubby, and other Google technologies. Among them:

  • GFS: Google's distributed file system; the open-source distributed file system HDFS follows a similar design;
  • Chubby: a distributed lock service implemented by Google on top of the Paxos algorithm; it also provides consistent reads and writes of small amounts of data.

1 Data model

1.1 Storage method

BigTable stores data as a sorted map of key (row:string, column:string, time:int64) -> value (string), which makes lookups and range scans fast. In more detail, entries are sorted first by row name in ascending order, then by column name in ascending order when the row names are equal, and finally by timestamp (used as the version number) in descending order. The data is ultimately persisted to GFS as files in an LSM-tree fashion.
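To make the ordering concrete, here is a minimal Python sketch of the comparator described above; the rows, columns, and values are invented examples, not BigTable's actual API:

```python
# Sketch of BigTable's key ordering: ascending by row, then by column,
# with newer timestamps (larger int64 values) sorted first.

def sort_key(row: str, column: str, timestamp: int):
    # Negating the timestamp makes a plain ascending sort place the
    # newest version of each cell first.
    return (row, column, -timestamp)

cells = [
    ("com.cnn.www", "contents:", 3, "<html>v3</html>"),
    ("com.cnn.www", "anchor:cnnsi.com", 9, "CNN"),
    ("com.cnn.www", "contents:", 5, "<html>v5</html>"),
    ("com.bbc.www", "contents:", 7, "<html>bbc</html>"),
]

cells.sort(key=lambda c: sort_key(c[0], c[1], c[2]))
for row, col, ts, value in cells:
    print(row, col, ts, value)
# com.bbc.www sorts before com.cnn.www; within com.cnn.www the
# "contents:" cell with timestamp 5 comes before the one with timestamp 3.
```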

1.2 Data split

To keep related data together, BigTable partitions a table into contiguous ranges of row names, and each range is called a tablet. A tablet is therefore a collection of rows whose keys fall in the same range, which in practice often share a common key prefix. For example, if row keys begin with a 9-digit user id, splitting tablets on the first 9 digits of the row key keeps all of one user's data within a single tablet (see the sketch below).
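A rough sketch of how a row key maps to the tablet whose range covers it; the tablet boundaries and names are made up for illustration, not BigTable's real data structures:

```python
import bisect

# Hypothetical tablet boundaries: tablet i covers row keys in
# [tablet_starts[i], tablet_starts[i+1]). With 9-digit user ids as the
# row-key prefix, all of one user's rows land in the same tablet.
tablet_starts = ["000000000", "300000000", "600000000"]
tablet_names  = ["tablet-0", "tablet-1", "tablet-2"]

def find_tablet(row_key: str) -> str:
    # The responsible tablet is the one with the largest start key <= row_key.
    i = bisect.bisect_right(tablet_starts, row_key) - 1
    return tablet_names[i]

print(find_tablet("123456789#order#42"))  # -> tablet-0
print(find_tablet("654321987#profile"))   # -> tablet-2
```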

1.3 Comparison with SQL

BigTable manages structured data, but it differs from a traditional SQL database in several ways:

  • A SQL table has a fixed set of columns with fixed data types; in BigTable the number of columns and the columns present in each row can differ from row to row, so BigTable tables are sparse (a toy sketch follows this list).
  • A SQL table typically identifies rows by a numeric key; a BigTable row is identified by a string row name, which is specified dynamically at write time.
  • At the storage layer, many traditional SQL engines (such as InnoDB) use B+ trees, while BigTable uses an LSM-tree approach: it is, at bottom, key-value storage with key (row:string, column:string, time:int64) -> value (string).
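To illustrate the sparse-column point, a toy sketch with invented data, modeling each row as its own set of column-value pairs:

```python
# Each BigTable row can carry a different set of columns; a fixed SQL
# schema would force every row to have the same columns.
rows = {
    "user#000000001": {"profile:name": "alice", "profile:email": "a@x.com"},
    "user#000000002": {"profile:name": "bob", "orders:last": "order-17"},
    "user#000000003": {"orders:last": "order-90"},  # no profile columns at all
}
```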

2 Architecture

2.1 Overall architecture

[Figure: overall architecture of BigTable]
The whole of BigTable consists of three parts: the client library, the master service, and the tablet servers, all built on top of Chubby and GFS. Among them:

  • The master service is responsible only for cluster-state tasks such as assigning tablets to tablet servers and tracking the status of each tablet server;
  • The client library is embedded in applications and provides the API they use to access BigTable;
  • The tablet servers are the cluster nodes that handle data requests. Each tablet server manages one or more tablets and persists their data to GFS.

2.2 Metadata management

As noted above, BigTable's data ultimately lives in GFS, so how does the system find which GFS files hold each tablet? The metadata is itself split into tablets and stored hierarchically in the same way, much like Ceph storing its own file metadata in itself. Chubby records only the location (file name) of the top-level metadata tablet; following the chain through the second-level metadata tablets eventually locates the tablet that holds the user data. The details are as follows:
[Figure: hierarchical location lookup: Chubby -> root metadata tablet -> second-level metadata tablets -> user tablets]
Because each tablet is made up of several SST files, the recorded location can, as an implementation detail, be a list of those files, or it can point to a manifest file (as in LevelDB) that in turn lists the related SST files.
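A simplified sketch of the three-level lookup described above; the data structures and names here are hypothetical, not BigTable's real ones:

```python
# Chubby stores where the root metadata tablet lives; the root tablet maps
# row-key ranges to second-level metadata tablets, which in turn map ranges
# to the user tablets whose locations name SST (or manifest) files in GFS.

chubby = {"root_metadata_tablet": "metadata-root"}

# Each metadata tablet: sorted list of (end_row_key, child tablet location).
metadata_root = [("m", "metadata-1"), ("~", "metadata-2")]
metadata_tablets = {
    "metadata-1": [("c", "user-tablet-A"), ("m", "user-tablet-B")],
    "metadata-2": [("t", "user-tablet-C"), ("~", "user-tablet-D")],
}

def lookup_range(tablet, row_key):
    # Return the first child whose end key is >= row_key.
    for end_key, child in tablet:
        if row_key <= end_key:
            return child
    return tablet[-1][1]

def locate_user_tablet(row_key: str) -> str:
    root = metadata_root  # found via chubby["root_metadata_tablet"]
    second_level = metadata_tablets[lookup_range(root, row_key)]
    return lookup_range(second_level, row_key)

print(locate_user_tablet("com.cnn.www"))  # -> user-tablet-B
print(locate_user_tablet("org.example"))  # -> user-tablet-C
```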

3 Reading and writing process

  • Writing process

Each tablet server manages one or more tablets. For every write it receives, the server first appends the mutation to a WAL (write-ahead log) file, then inserts the data into an in-memory table; when the in-memory table reaches its size limit, its contents are flushed to a new SST file belonging to the corresponding tablet. The tablet server also performs background merging (compaction) of SST files. The whole process is similar to LevelDB.
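A minimal sketch of this write path; the file formats, names, and flush threshold are invented for illustration, and a real tablet server is far more involved:

```python
import json

MEMTABLE_LIMIT = 4  # flush after this many entries (tiny, for the demo)

class TabletWriter:
    def __init__(self, wal_path: str):
        self.wal = open(wal_path, "a")
        self.memtable = {}  # (row, column) -> (timestamp, value)
        self.sst_seq = 0

    def put(self, row, column, timestamp, value):
        # 1. Append to the write-ahead log for durability.
        self.wal.write(json.dumps([row, column, timestamp, value]) + "\n")
        self.wal.flush()
        # 2. Insert into the in-memory table.
        self.memtable[(row, column)] = (timestamp, value)
        # 3. Flush to a new SST-like file when the memtable is full.
        if len(self.memtable) >= MEMTABLE_LIMIT:
            self.flush()

    def flush(self):
        self.sst_seq += 1
        with open(f"sst-{self.sst_seq:06d}.jsonl", "w") as sst:
            # Write entries in sorted key order, as an SST would.
            for (row, column) in sorted(self.memtable):
                ts, value = self.memtable[(row, column)]
                sst.write(json.dumps([row, column, ts, value]) + "\n")
        self.memtable.clear()
```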

  • Reading process

The read path is also similar to LevelDB: the tablet server first checks whether the data is in its in-memory table; if not, it looks in the tablet's SST files; if the key is in neither place, the lookup is reported as not found.
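Continuing the write-path sketch above, a simplified read path; a real implementation would consult per-SST indexes and Bloom filters rather than scanning files linearly:

```python
def get(writer: "TabletWriter", sst_files, row, column):
    # 1. Check the in-memory table first (it always holds the newest data).
    if (row, column) in writer.memtable:
        return writer.memtable[(row, column)][1]
    # 2. Fall back to the SST files, newest first.
    for path in reversed(sst_files):
        with open(path) as sst:
            for line in sst:
                r, c, ts, value = json.loads(line)
                if (r, c) == (row, column):
                    return value
    # 3. Not found anywhere: report lookup failure.
    return None
```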

Source: blog.csdn.net/fs3296/article/details/113179666