Introduction to the Principles of Cassandra

Originally developed at Facebook, Cassandra combines the column-oriented data model of Google BigTable with the peer-to-peer (P2P) distributed hash table (DHT) design of [Amazon Dynamo](http://en.wikipedia.org/wiki/Dynamo_(storage_system)). It offers high performance, scalability, fault tolerance, and simple deployment.

Despite these advantages, few companies in China seem to use it; it is far less popular there than HBase or MongoDB, as a comparison of the three systems on the Baidu Index makes obvious. In contrast to the quiet domestic market, Cassandra is thriving abroad. On DB-Engines, a site that ranks database popularity, Cassandra sits in the top ten, several places above HBase, and it has grown rapidly since 2013; more than 1,500 companies now use it. Unfortunately, few of them are Chinese: only one cloud storage startup, Yunnuo, appears on the list. This also explains why Chinese-language resources on Cassandra are scarce. I had to turn to English materials, which at least strengthened my English reading skills.

Features that attracted me

Three main features attracted me to Cassandra as our NoSQL store:

Extremely high read and write performance

When Cassandra handles a write, it first appends the request to the Commit Log to guarantee the data will not be lost, then writes it to the in-memory Memtable. When the Memtable exceeds its capacity, its contents are flushed to SSTables on disk, and SSTables are periodically merged asynchronously (Compaction) to reduce query time on reads. Because a write involves only a sequential disk append and memory operations, write performance is very high. For reads, Cassandra uses a mechanism similar to LevelDB: data is stored in tiers, with hot data kept in the Memtable and relatively small SSTables, so read performance is also very high.

Simple deployment structure

Compared with HBase's master-slave structure, Cassandra is a decentralized P2P structure: all nodes are identical and there is no single point of failure. As a small company we can set the replication factor to 2 and build a Cassandra cluster with just two or three machines, which both guarantees data reliability and makes it easy to add machines later, whereas HBase needs at least four or five machines. Later, to serve customers better, we may need data centers in several locations; Cassandra supports multiple data centers very well and makes them easy to deploy (just this morning I read a case study from Russia's largest telecom company). In addition, our machines are currently hosted in a small server room; if it fills up and cannot be expanded, we will have to relocate, and multiple data centers would also let us migrate seamlessly.

Combination with Spark

As an underlying storage system, Cassandra integrates easily with Spark for real-time computation, which is irresistibly attractive for our business scenarios. I have seen many use cases abroad that pair Spark with Cassandra for velocity-layer computation, such as Ooyala (site blocked in China; a VPN is required).

Basic Architecture

Unlike BigTable or HBase, Cassandra has no central control node; it uses a masterless P2P architecture. All nodes in the cluster are peers arranged in a ring, and they exchange data with each other every second via a P2P protocol, so each node knows the location, status, and other information of every other node, as shown in the figure below.

Cassandra Ring

A client can connect to any node in the cluster; the node it connects to is called the coordinator. The coordinator acts as a proxy: it locates the nodes that actually hold the data for the request and fetches it from them. How results are gathered and returned depends mainly on the Consistency Level the client requires. For example, ONE means the coordinator can respond to the client as soon as one node returns data; QUORUM means a majority of replicas (based on the configured replication factor) must respond; and ALL means every replica must return a result before the client gets a response. When consistency requirements are not especially strict, ONE is the fastest choice.
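As an illustration of choosing a Consistency Level per request, here is a minimal sketch using the DataStax Python driver; the contact points, the `demo` keyspace, and the `users` table are assumptions made up for the example:

```python
from cassandra.cluster import Cluster
from cassandra import ConsistencyLevel
from cassandra.query import SimpleStatement

# Connect to any node; that node becomes the coordinator for our requests.
cluster = Cluster(["10.0.0.1", "10.0.0.2"])
session = cluster.connect("demo")  # the 'demo' keyspace is hypothetical

# ONE: respond as soon as a single replica answers (fastest, weakest).
fast = SimpleStatement(
    "SELECT * FROM users WHERE username = %s",
    consistency_level=ConsistencyLevel.ONE,
)
print(session.execute(fast, ("alice",)).one())

# QUORUM: a majority of replicas must answer before the coordinator replies.
safe = SimpleStatement(
    "SELECT * FROM users WHERE username = %s",
    consistency_level=ConsistencyLevel.QUORUM,
)
print(session.execute(safe, ("alice",)).one())
```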

The core components of Cassandra include:

Gossip: a peer-to-peer communication protocol used to exchange node location and status information. Gossip state is persisted locally as soon as a node starts, but stale history must be cleared when a node's information changes (for example, its IP address). Through the Gossip protocol, each node exchanges its own state, and the state of the nodes it knows about, with other nodes every second. Every piece of gossiped information carries a version number, so newer data overwrites older data. To keep the exchange accurate, all nodes must share the same list of contact points; these nodes are called seeds.

Partitioner: responsible for distributing data across the cluster; it determines which node receives the first replica of a row. In general, the partition key is hashed and each row is distributed to a different node, which guarantees the cluster's scalability.

Replica placement strategy: the replication strategy, which determines which nodes hold the replicas of each row and how many replicas there are.

Snitch: defines the network topology, which determines how replicas are placed and how requests are routed efficiently.

cassandra.yaml: the main configuration file, which sets the cluster's initial configuration, per-table cache parameters, tuning parameters and resource usage, timeouts, client connections, backup, and security.
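As a rough sketch of the cluster state that Gossip and the Snitch expose, the DataStax Python driver can report the partitioner and each host's data center and rack; the local contact point is an assumption:

```python
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])  # assumes a reachable local node
cluster.connect()

meta = cluster.metadata
print(meta.partitioner)  # e.g. org.apache.cassandra.dht.Murmur3Partitioner
for host in meta.all_hosts():
    # Data center and rack come from the snitch, propagated via Gossip.
    print(host.address, host.datacenter, host.rack, host.is_up)
```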

Write request

When a write arrives, it is first captured by the Commit Log and persisted there to guarantee data reliability. The data is then also written to an in-memory structure called the Memtable. When the Memtable fills up, it is written out as a data file called an SSTable (Sorted String Table). If the client sets the Consistency Level to ONE, then as soon as one replica writes successfully, the coordinator tells the client the write is complete. Other replicas may of course fail to write in the meantime; Cassandra guarantees eventual consistency through Hinted Handoff, Read Repair, and Anti-Entropy Node Repair.
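To make the write path concrete, here is a toy sketch of the Commit Log / Memtable / SSTable flow. It only illustrates the idea and is nothing like Cassandra's actual storage code; the JSON file layout and the tiny flush threshold are inventions for the example:

```python
import json
import time
from pathlib import Path

MEMTABLE_LIMIT = 4  # tiny flush threshold, just for illustration


class TinyLSM:
    """Toy sketch of the write path: Commit Log -> Memtable -> SSTable."""

    def __init__(self, data_dir="data"):
        self.dir = Path(data_dir)
        self.dir.mkdir(exist_ok=True)
        # Every write is appended here first (sequential, crash-safe).
        self.commit_log = open(self.dir / "commit.log", "a")
        self.memtable = {}  # in-memory; sorted only at flush time
        self.sstable_count = 0

    def write(self, key, value):
        # 1. Append to the commit log before anything else.
        self.commit_log.write(json.dumps({"k": key, "v": value, "ts": time.time()}) + "\n")
        self.commit_log.flush()
        # 2. Update the memtable; an update simply overwrites in memory.
        self.memtable[key] = value
        # 3. Flush to an immutable SSTable once the memtable is "full".
        if len(self.memtable) >= MEMTABLE_LIMIT:
            self.flush()

    def flush(self):
        # Sort by key and write the whole memtable sequentially, like a real flush.
        path = self.dir / f"sstable-{self.sstable_count}.json"
        path.write_text(json.dumps(dict(sorted(self.memtable.items()))))
        self.sstable_count += 1
        self.memtable.clear()
        # A real node would now also recycle the corresponding commit log segment.


db = TinyLSM()
for i in range(5):
    db.write(f"user:{i}", {"email": f"u{i}@example.com"})
```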

Cassandra also optimizes write requests that span multiple data centers. A coordinator is chosen within each data center to carry out the replication there, so the node the client connected to only needs to forward the replication request to one node per data center; replication inside each data center is then completed by that data center's own coordinator.
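In a multi-data-center deployment, clients typically pair data-center-aware routing with a local consistency level. A hedged sketch with the DataStax Python driver, where the data center name `dc1` and the `demo.users` table are assumptions:

```python
from cassandra.cluster import Cluster
from cassandra import ConsistencyLevel
from cassandra.policies import DCAwareRoundRobinPolicy, TokenAwarePolicy
from cassandra.query import SimpleStatement

# Route requests to the local data center; 'dc1' is a hypothetical DC name.
cluster = Cluster(
    ["10.0.0.1"],
    load_balancing_policy=TokenAwarePolicy(DCAwareRoundRobinPolicy(local_dc="dc1")),
)
session = cluster.connect("demo")

# LOCAL_QUORUM: a majority of replicas *in the local data center* must ack,
# so the write does not wait on cross-data-center round trips.
stmt = SimpleStatement(
    "INSERT INTO users (username, email) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.LOCAL_QUORUM,
)
session.execute(stmt, ("alice", "alice@example.com"))
```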

Cassandra's storage engine uses a structure similar to an LSM tree (Log-Structured Merge Tree). Unlike traditional databases, which generally use B+ trees, it writes to disk sequentially, in an append-only fashion, and writes can proceed concurrently without the locking a B+ tree requires, so write throughput is very high. LevelDB and HBase use similar storage structures.

The Commit Log records the complete content of every write request. Entries are not sorted by primary key at this point but written strictly sequentially, so disk writes involve none of the seeks caused by random I/O, which helps performance enormously; LevelDB, reputedly the fastest embedded database, uses the same strategy. The Commit Log is cleared after the Memtable's data has been flushed to SSTables, so it never occupies much disk space, and a separate storage location for it can be set in Cassandra's configuration. A good hybrid layout is to store the Commit Log on a high-performance but small and expensive SSD while storing the data itself on traditional mechanical disks, which are slower but large and very cheap.

When writing to the Memtable, Cassandra can allocate memory for it dynamically, and you can also tune it yourself with the usual tools. When the threshold is reached, the Memtable's data and indexes are placed on a queue and then flushed to disk; the memtable_flush_queue_size parameter sets the queue length. While a flush is in progress, write requests are paused. You can also flush to disk manually with nodetool flush, which is best done before restarting a node to shorten Commit Log replay. During a flush, Memtables are sorted by partition key and then written to disk sequentially. The whole process is very fast, since it involves only commit log appends and sequential disk writes.

Once the Memtable's data has been flushed to an SSTable, the corresponding data in the Commit Log is cleared. Each table has its own Memtables and multiple SSTables. After a flush completes, an SSTable is immutable and accepts no further writes, so a partition will generally span several SSTable files; Compaction later merges these files to improve read and write performance.

"Write request" here means not only an Insert but also an Update. Cassandra handles Updates completely differently from a traditional relational database: it does not modify the original data in place but appends a new record, and the versions are merged during Compaction. Deletes work the same way: the data to be deleted is first marked with a tombstone and only permanently removed during a later Compaction.
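The timestamps behind this reconciliation can be observed from CQL. A small hedged example via the Python driver, again assuming the hypothetical `demo.users` table:

```python
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect()  # a reachable node is assumed

# An UPDATE appends a new, newer-timestamped cell; nothing is edited in place.
session.execute(
    "UPDATE demo.users SET email = %s WHERE username = %s",
    ("new@example.com", "alice"),
)

# WRITETIME() exposes the cell timestamp that reads and Compaction use to
# decide which version of a column wins.
print(session.execute(
    "SELECT email, WRITETIME(email) FROM demo.users WHERE username = %s",
    ("alice",),
).one())

# A DELETE writes a tombstone: the row vanishes from results immediately,
# but the data is physically purged only by a later Compaction.
session.execute("DELETE FROM demo.users WHERE username = %s", ("alice",))
```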

Read request

When reading data, Cassandra first checks the Bloom filter: each SSTable has one, used to test whether the partition key could be in that SSTable, and this happens before any disk I/O. If the key may be present, the partition key cache is checked next, and then one of the following paths is taken:

  • If the index entry is found in the key cache, go straight to the compression offset map, find the data block that holds the row, read the compressed data from disk, and return the result set.
  • If the index entry is not in the cache, consult the partition summary to determine the approximate position of the index on disk, fetch the index entry with a single seek and a sequential read of the SSTable, then likewise use the compression offset map to find the data block, read the compressed data from disk, and return the result set.

When reading, the data in the Memtable is merged with the data in any relevant SSTables before the final result is returned. For example, after a user's email is updated, the username and password still sit in an old SSTable while the new email is recorded in a newer one; to return the row, the old and new data must be read and merged, as the sketch below shows.
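Below is a toy sketch of that merge step, assuming each fragment maps a column name to a (value, timestamp) pair; it illustrates the idea only and is not Cassandra's internal code:

```python
def merge_fragments(fragments):
    """Merge column fragments from the Memtable and several SSTables.

    Each fragment maps column name -> (value, timestamp); the newest
    timestamp wins, mirroring how Cassandra reconciles a row that is
    spread across multiple SSTables.
    """
    merged = {}
    for frag in fragments:
        for col, (value, ts) in frag.items():
            if col not in merged or ts > merged[col][1]:
                merged[col] = (value, ts)
    return {col: value for col, (value, ts) in merged.items()}


# The user row from the example above: username/password in an old SSTable,
# the updated email in a newer one.
old_sstable = {"username": ("alice", 100), "password": ("x", 100), "email": ("old@a.com", 100)}
new_sstable = {"email": ("new@a.com", 200)}
print(merge_fragments([old_sstable, new_sstable]))
# {'username': 'alice', 'password': 'x', 'email': 'new@a.com'}
```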

Since version 2.0, the Bloom filter, compression offset map, and partition summary live off-heap; only the partition key cache remains on the heap. The Bloom filter grows by roughly 1-2 GB per billion partitions. The partition summary is a sample of the partition index, and the sampling frequency can be configured with index_interval. The compression offset map grows by 1-3 GB per TB of data: the more data is compressed, the more compressed blocks there are and the larger the compression offset map becomes.

There are two kinds of read requests. The first is the direct read request: data is read according to the Consistency Level configured by the client, after which the result can be returned. The second is the background read repair request, which is sent to the remaining replica nodes beyond those the direct request reached, in order to repair nodes holding stale writes and guarantee eventual consistency. On a client read, the coordinator contacts the number of nodes the Consistency Level requires, sending the data request to the fastest-responding replicas. If multiple nodes are contacted, the rows returned by each replica are compared in memory; if they are inconsistent, the newest data (by timestamp) is returned to the client, and the out-of-date replicas are updated in the background. This is called Read Repair.

The figure below shows the read flow when the Consistency Level is ONE. The client connects to an arbitrary node, which forwards the request to the nodes that actually own the data. The fastest-responding node returns the data to the coordinator, which returns it to the client. If other nodes hold stale data, the coordinator sends them the latest version in the background to repair them.

Read One

Data merging (Compaction)

An update does not modify data in place, since that would require random disk reads and writes and would be inefficient. Instead, Cassandra writes the data sequentially into a new SSTable with a timestamp that distinguishes new data from old. Deletes are not immediate either: a tombstone marks the data for deletion. During Compaction, the data from multiple SSTable files is merged into a new SSTable file; once in-flight reads on the old SSTables finish, they are deleted immediately and their space can be reused. Although Compaction involves no random I/O, it is still a heavyweight operation, so it runs in the background and is throttled: the compaction_throughput_mb_per_sec parameter controls it, with a default of 16 MB/s. In addition, if the key cache shows that the compacted data is hot, the operating system keeps it in the page cache, improving performance. The merge strategies are as follows:

  • SizeTieredCompactionStrategy: each update, rather than modifying the original data in place (which would mean random disk access and poor performance), writes directly into the next SSTable on insert or update, so write performance is high. But because a row's data ends up spread across several SSTables, a read needs multiple disk seeks, and read performance suffers. To counter this, SSTables of similar size are periodically merged in the background, which is fast; by default a merge is triggered once there are 4 such SSTables. If no expired data is purged during a merge, it temporarily takes double the space, so the worst case requires 50% free disk.
  • LeveledCompactionStrategy: creates fixed-size SSTables, 5 MB by default. The top level is L0 and below it is L1, with each lower level ten times the size of the one above. This strategy gives very fast reads and suits read-heavy workloads, and its worst case needs only about 10% free disk space. It follows LevelDB's design; see LevelDB's implementation for the details. (A configuration sketch follows this list.)

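The strategy is chosen per table in CQL. A hedged sketch executed through the Python driver, where `demo.users` is hypothetical and the 5 MB size mirrors the default mentioned above:

```python
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect()

# Leveled compaction for a read-heavy (hypothetical) table; sstable_size_in_mb
# mirrors the 5 MB default described above.
session.execute("""
    ALTER TABLE demo.users
    WITH compaction = {'class': 'LeveledCompactionStrategy',
                       'sstable_size_in_mb': 5}
""")

# Or the size-tiered default: merge once 4 similar-sized SSTables accumulate.
session.execute("""
    ALTER TABLE demo.users
    WITH compaction = {'class': 'SizeTieredCompactionStrategy',
                       'min_threshold': 4}
""")
```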

Data replication and distribution

Data distribution and replication usually go together. Data is organized into tables, and the primary key identifies which nodes store it; the copies of a row are called replicas. When a cluster is created, at least the following must be specified: Virtual Nodes, Partitioner, Replication Strategy, and Snitch.

There are two replication strategies. SimpleStrategy is suitable for a single data center: the first replica is placed on the node determined by the Partitioner, and subsequent replicas go to the next nodes found clockwise around the ring, with no awareness of data centers or racks. NetworkTopologyStrategy places the first replica the same way but puts subsequent replicas on different racks, and each data center can hold a different number of replicas.
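The replication strategy is chosen when the keyspace is defined. A minimal CQL sketch, executed here through the Python driver; the keyspace and data center names are made up for the example:

```python
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect()

# Single data center: SimpleStrategy, 2 copies of every row.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 2}
""")

# Multiple data centers: NetworkTopologyStrategy with a per-DC replica count.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo_multidc
    WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3, 'dc2': 2}
""")
```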

There are three Partitioners. The default is Murmur3Partitioner, which uses MurmurHash; RandomPartitioner uses an MD5 hash; and ByteOrderedPartitioner partitions by the raw bytes of the key, preserving order. Cassandra defaults to MurmurHash because it is faster.
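The token the partitioner assigns to a key can be inspected directly in CQL; a hedged one-liner via the Python driver, with `demo.users` still hypothetical:

```python
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect()

# token() applies the cluster's partitioner (Murmur3 by default) to the
# partition key, giving the ring position that determines the row's owner.
print(session.execute(
    "SELECT token(username) FROM demo.users WHERE username = %s",
    ("alice",),
).one())
```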

The Snitch determines which data centers and racks are written to and read from. There are several options:

  • DynamicSnitch: monitors each node's performance and adjusts routing automatically based on it; recommended in most cases
  • SimpleSnitch: ignores data centers and racks; use it together with the SimpleStrategy replication strategy
  • RackInferringSnitch: infers the data center and rack from nodes' IP addresses
  • PropertyFileSnitch: topology defined by hand in the cassandra-topology.properties file
  • GossipingPropertyFileSnitch: each node defines its own data center and rack locally, then propagates that information to the other nodes via the Gossip protocol; the corresponding configuration file is cassandra-rackdc.properties

Failure detection and recovery

Cassandra determines from Gossip information whether a node is available, so that client requests are not routed to an unavailable node (or to a slow one, which the dynamic snitch can detect). Rather than using a fixed threshold to mark nodes as failed, Cassandra continuously evaluates each node's network performance, workload, and other conditions to decide whether it is down. A node may fail because of a hardware fault, a network outage, and so on; outages are usually brief but can sometimes last a long time. An outage does not mean the node is permanently gone, so it is not permanently removed from the ring; other nodes periodically check via the Gossip protocol whether it has come back. To remove a node permanently, use nodetool.

When a node comes back after an outage, it lacks the data written while it was down. Other replica nodes hold that data for it temporarily in a mechanism called Hinted Handoff, from which it can recover automatically. However, if the outage lasts longer than max_hint_window_in_ms (3 hours by default), the hints are discarded; in that case you must run nodetool repair manually on all nodes to restore data consistency.
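A hedged sketch of triggering that manual repair from Python; nodetool repair itself is the standard tool, while the keyspace name and the subprocess wrapper are just for illustration:

```python
import subprocess

# Anti-entropy repair on a (hypothetical) keyspace named 'demo'. This is the
# fallback when a node was down longer than max_hint_window_in_ms and its
# hints were discarded.
subprocess.run(["nodetool", "repair", "demo"], check=True)
```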

Dynamic expansion

Early versions of Cassandra implemented dynamic node expansion with plain consistent hashing. Its advantage is that adding or removing a node affects only the adjacent nodes, but this creates uneven load: when a node is added or removed, data migrates only to or from its neighbors, and the other machines cannot share the work, which inevitably causes performance problems. Since version 1.2, Cassandra has had virtual nodes (Virtual Nodes): each physical node is assigned many virtual nodes (256 by default), placed on the ring not in hash order but randomly, so that when a node joins or leaves, many physical nodes take part in the data migration and the load stays balanced.
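A minimal Python sketch of a hash ring with virtual nodes, using an MD5-based token for simplicity (Cassandra itself defaults to Murmur3); the node names and counts are made up:

```python
import hashlib
from bisect import bisect_right


def token(key: str) -> int:
    # MD5-based token, for illustration only; Cassandra defaults to Murmur3.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)


def build_ring(nodes, vnodes=256):
    # Each physical node gets `vnodes` randomly scattered virtual tokens.
    return sorted((token(f"{node}:{i}"), node) for node in nodes for i in range(vnodes))


def owner(ring, key):
    # Walk clockwise from the key's token to the first virtual node.
    tokens = [t for t, _ in ring]
    idx = bisect_right(tokens, token(key)) % len(ring)
    return ring[idx][1]


ring = build_ring(["node-a", "node-b", "node-c"])
print(owner(ring, "user:42"))

# Adding a node pulls data from many existing nodes, not just one neighbor.
bigger = build_ring(["node-a", "node-b", "node-c", "node-d"])
moved = sum(owner(ring, f"user:{i}") != owner(bigger, f"user:{i}") for i in range(1000))
print(f"{moved} of 1000 keys moved")  # roughly 1/4, spread across all nodes
```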

Recommended configuration

  • Memory: 16 GB to 64 GB; the more the better, since it reduces disk reads and writes
  • CPU: 8+ cores, for highly concurrent requests
  • Disk: disks are touched when writing the commit log and when flushing Memtables to SSTables. Putting the commit log on an SSD will not improve write performance; putting the data on SSDs improves it significantly but costs too much. The ideal approach is to use an SSD as a second-level Row Cache; see: Cassandra: An SSD Boosted Key-Value Store (blocked in China; a VPN is required)
  • RAID: data disks do not need RAID; Cassandra already replicates data across machines, and its JBOD support can automatically detect damaged disks
  • File system: XFS (file size practically unlimited on 64-bit); Ext4 on 64-bit supports at most 16 TB
  • Network: gigabit NIC

For more information, please refer to the Apache Cassandra 2.0 documentation.

Original link: http://yikebocai.com/2014/06/cassandra-principle/
