Sequence and concurrency control of object read and write requests in Ceph

In a distributed system, the ordering of reads and writes and the handling of concurrent access to objects (or records, files, data blocks, and so on) always have to be considered. Generally speaking, if two objects share no resources, they can be accessed concurrently; if they do share resources, those resources must be protected by locks. For concurrent reads and writes of the same object (especially concurrent writes and updates), ordering and concurrency control are needed to avoid data corruption. This article describes how Ceph orders object reads and writes and what mechanisms it uses to control concurrency.

1. The concept of PG

Ceph introduces the concept of the PG (Placement Group), a logical group that is mapped to a set of OSDs by the CRUSH algorithm. Each PG holds a complete replication relationship; for example, with three-way replication the three data copies of a PG are distributed across three different OSDs. The figure below, taken from the Ceph paper, makes this easier to visualize: a file is split into objects of a given size, each object is hashed to a PG, and each PG is then mapped to a set of OSDs on the backend by CRUSH:

[Figure ceph-chapter13-1: file → objects → PGs → OSDs mapping]
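
To make the two-step mapping concrete, here is a small standalone sketch. It is not Ceph code: the real implementation uses the rjenkins hash of the object name, a stable modulo on the PG count and the CRUSH algorithm, while std::hash and the toy placement function below are only stand-ins for illustration.

#include <cstdint>
#include <functional>
#include <iostream>
#include <string>
#include <vector>

// Step 1: object name -> PG id (hash modulo the pool's PG count).
uint32_t object_to_pg(const std::string& object_name, uint32_t pg_num) {
	return std::hash<std::string>{}(object_name) % pg_num;
}

// Step 2: PG id -> an ordered set of OSDs (the acting set).
// A toy stand-in for CRUSH that just picks `replicas` consecutive OSD ids.
std::vector<uint32_t> pg_to_osds(uint32_t pg_id, uint32_t osd_count, uint32_t replicas) {
	std::vector<uint32_t> acting;
	for (uint32_t r = 0; r < replicas; ++r)
		acting.push_back((pg_id + r) % osd_count);
	return acting;
}

int main() {
	// Two different objects may hash into the same PG and then share the same OSDs.
	for (const std::string name : {"rbd_data.0000000001", "rbd_data.0000000002"}) {
		uint32_t pg = object_to_pg(name, 128);
		std::cout << name << " -> pg " << pg << " -> osds";
		for (uint32_t osd : pg_to_osds(pg, 12, 3))
			std::cout << " " << osd;
		std::cout << "\n";
	}
	return 0;
}

The point to take away is that placement is purely computed: from the object name alone, a client can derive which PG, and therefore which OSDs, hold the object.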

2. Concurrency control of different objects

Different objects may fall into the same PG. In the Ceph implementation, the OSD processing thread takes the PG lock and does not release it until the transaction has been queued to the journal in queue_transactions() (taking FileStore as an example). So for different objects in the same PG, concurrency is controlled through the PG lock. Fortunately, no object IO takes place while the lock is held, so the impact on efficiency is limited. Objects in different PGs can simply be accessed concurrently.

void OSD::ShardedOpWQ::_process(uint32_t thread_index, heartbeat_handle_d *hb ) {
	...

	// take the PG lock before the op is processed
	(item.first)->lock_suspend_timeout(tp_handle);

	...

	// release the PG lock only after the op has been handed to the store backend
	(item.first)->unlock();
}
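
The following minimal sketch (an illustration of the idea only, not Ceph code) shows the effect of such a per-PG lock: ops on different objects in the same PG are serialized, while ops in different PGs can be handled by worker threads concurrently.

#include <array>
#include <cstdint>
#include <iostream>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

struct PG {
	std::mutex lock;                          // plays the role of the PG lock
	std::vector<std::string> journal_queue;   // stand-in for the journal/store queue
};

std::array<PG, 2> pgs;                            // two PGs for the example

void handle_op(std::size_t pg_id, const std::string& op) {
	PG& pg = pgs[pg_id];
	std::lock_guard<std::mutex> guard(pg.lock);   // held while the op is prepared and queued
	pg.journal_queue.push_back(op);               // stand-in for queue_transactions()
}                                                 // released before any real disk IO happens

int main() {
	std::thread a(handle_op, 0, "write obj-A");   // same PG: serialized by pg.lock
	std::thread b(handle_op, 0, "write obj-B");
	std::thread c(handle_op, 1, "write obj-C");   // different PG: can run concurrently
	a.join(); b.join(); c.join();
	std::cout << "pg0 queued " << pgs[0].journal_queue.size() << " ops\n";
	return 0;
}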

3. Ordering and concurrency control for the same object

From the above we know that access to the same object is also serialized by the PG lock, but that only covers the OSD's processing logic. For access to the same object the situation to consider is more complicated. From the point of view of user scenarios, there are two usage patterns:

1) A single client: the client's update logic for the same object is serial, and it waits for the previous write request to complete before issuing the next read or write;

2) Multiple clients accessing the same object concurrently: this scenario resembles NFS, is rarely needed in current distributed systems, and involves the data-consistency issues caused by simultaneous updates from multiple clients; it generally requires the support of a cluster file system.

For the multi-client scenario, Ceph's RBD makes no such guarantee; CephFS may (this has not been studied in depth here, so the claim remains unverified). Therefore this article mainly covers the scenario of a single client accessing a Ceph RBD block device. Consider an extreme example: the same client sends two asynchronous write requests to the same object. The explanation below uses this example.

3.1 Ordering guarantee of TCP messages

Anyone familiar with TCP knows that it uses sequence numbers to guarantee message order. Every time the sender transmits data, TCP assigns a sequence number to the packet and waits a certain time for the receiving host to acknowledge it; if no acknowledgement arrives within that time, the sender retransmits the packet. The receiver uses the sequence numbers to check the received data for loss or reordering, reassembles the segments into an in-order data stream, and passes it up to the application layer. Note: TCP sequence numbers only guarantee the ordering of messages on the same TCP connection.

3.2 Ordering guarantee of the Ceph messenger layer

Each Ceph message carries a seq sequence number. Taking the simple messenger (SimpleMessenger) model as an example, a Pipe maintains three sequence numbers: in_seq, in_seq_acked and out_seq.

  • in_seq is the sequence number of the message most recently received by the Reader;

  • in_seq_acked is the sequence number of the last message that was successfully processed and acknowledged; it is updated after the receiving end has received a message, finished processing it, and successfully sent the ack back to the sender;

  • out_seq is maintained by the sender: in general a random sequence number is generated when a new connection is created, and for each subsequent message out_seq is incremented and assigned to the message's m->seq.

Note: See Pipe::reader() and Pipe::writer() functions

[Figure ceph-chapter13-2: sequence number handling in Pipe::reader()/Pipe::writer()]
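
A minimal sketch of how the three counters cooperate is shown below. It is an illustrative model only, not the real Pipe code; the class, fields and functions are made up for the example.

#include <cstdint>
#include <deque>
#include <iostream>
#include <string>

struct Message { uint64_t seq; std::string payload; };

struct PipeSketch {
	uint64_t out_seq = 0;          // last seq assigned to a sent message
	uint64_t in_seq = 0;           // last seq received in order
	uint64_t in_seq_acked = 0;     // last received seq for which an ack has been sent
	std::deque<Message> out_q;     // sent but not yet acked (kept for resend)

	// Sending side: stamp the message and keep it until the peer acks it.
	Message send(const std::string& payload) {
		Message m{++out_seq, payload};
		out_q.push_back(m);
		return m;
	}

	// Receiving side: accept only seqs newer than in_seq; older seqs are duplicates
	// replayed after a reconnect and are dropped so they cannot be applied twice.
	bool receive(const Message& m) {
		if (m.seq <= in_seq) return false;    // already seen, discard
		in_seq = m.seq;
		in_seq_acked = in_seq;                // pretend the ack was sent successfully
		return true;
	}

	// Sending side, on receiving an ack: forget everything up to that seq.
	void handle_ack(uint64_t acked_seq) {
		while (!out_q.empty() && out_q.front().seq <= acked_seq)
			out_q.pop_front();
	}
};

int main() {
	PipeSketch sender, receiver;
	Message m1 = sender.send("write #1");
	Message m2 = sender.send("write #2");
	receiver.receive(m1);
	receiver.receive(m1);                     // duplicate after a reconnect: ignored
	receiver.receive(m2);
	sender.handle_ack(receiver.in_seq_acked); // sender can now drop m1 and m2
	std::cout << "unacked messages left: " << sender.out_q.size() << "\n";
	return 0;
}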

When a network exception interrupts the TCP connection, Pipe::fault() is called to handle it: the socket is closed, requeue_sent() puts the messages that have not yet been acked back at the head of the out_q queue (at the head so that they are processed first) and decrements out_seq accordingly, and the connection is then re-established. In this way, the seq carried by messages retransmitted over the new connection is the same as before.

Because the sender closes the socket in Pipe::fault() and performs a TCP close, the receiver's next read returns 0; it treats tcp_read as failed and also calls Pipe::fault(), which calls shutdown_socket() to close its socket. Therefore, when the connection is re-established after an abnormal disconnect, in_seq does not simply continue from its previous value but still follows the out_seq generated by the sender, so the order of messages is preserved.
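
The requeue-on-fault behaviour described above can be sketched as follows (again an assumption-level model written for illustration, not the real requeue_sent() code):

#include <cstdint>
#include <deque>
#include <iostream>

struct Msg { uint64_t seq; };

struct ConnSketch {
	uint64_t out_seq = 0;
	std::deque<Msg> out_q;   // not yet sent
	std::deque<Msg> sent;    // sent, waiting for ack

	void fault() {
		// Walk the sent list backwards so the oldest unacked message ends up first,
		// and rewind out_seq so resent messages keep exactly the same seq numbers.
		while (!sent.empty()) {
			out_q.push_front(sent.back());
			out_seq = sent.back().seq - 1;
			sent.pop_back();
		}
	}
};

int main() {
	ConnSketch c;
	c.out_seq = 7;
	c.sent = {{5}, {6}, {7}};      // seq 5..7 were sent, no ack received yet
	c.fault();                     // socket closed, requeue for the new connection
	std::cout << "next seq to resend: " << c.out_q.front().seq
	          << ", out_seq rewound to " << c.out_seq << "\n";   // prints 5 and 4
	return 0;
}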

3.3 PG layer order guarantee and object lock mechanism

On the OSD side, ops taken from the message queue are dispatched to multiple shards, each shard can be configured with multiple threads, and PGs are mapped to shards by taking the PG id modulo the number of shards. In addition, when the OSD processes a PG, it takes the PG (write) lock as the op is taken off the queue and releases it only after the request has been handed to the store backend. The PG layer on the OSD side therefore also processes requests in order.

When writing an object, the write first takes the object lock ondisk_write_lock(); a read request for an object first takes ondisk_read_lock(). The two operations are mutually exclusive: while an object is still being written, ondisk_read_lock() waits; likewise, while an object is being read, ondisk_write_lock() waits. Where the two locks are taken and released is shown in the figure below: the lock is held for the duration of the read, while a write request does not release the lock until the data has been written to the underlying file system (into the file system cache), so that a subsequent read of this object can be served directly from the file system cache (or from disk after the data has been flushed).

[Figure ceph-chapter13-3: where ondisk_read_lock()/ondisk_write_lock() are taken and released]
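
The mutual exclusion described above can be modelled roughly like this (a simplified sketch; the real lock lives on the object context inside the OSD and differs in detail): a read waits while a write is in flight, a write waits while reads are in flight, but reads do not block reads and writes do not block writes.

#include <condition_variable>
#include <mutex>

class OndiskRWLockSketch {
	std::mutex m;
	std::condition_variable cond;
	int readers = 0;   // reads currently holding the lock
	int writers = 0;   // writes in flight, not yet applied to the file system

public:
	void read_lock() {                       // cf. ondisk_read_lock()
		std::unique_lock<std::mutex> l(m);
		cond.wait(l, [&] { return writers == 0; });
		++readers;
	}
	void read_unlock() {                     // cf. ondisk_read_unlock()
		std::lock_guard<std::mutex> l(m);
		if (--readers == 0) cond.notify_all();
	}
	void write_lock() {                      // cf. ondisk_write_lock()
		std::unique_lock<std::mutex> l(m);
		cond.wait(l, [&] { return readers == 0; });
		++writers;
	}
	void write_unlock() {                    // released once data reached the fs cache
		std::lock_guard<std::mutex> l(m);
		if (--writers == 0) cond.notify_all();
	}
};

int main() {
	OndiskRWLockSketch obj_lock;
	obj_lock.write_lock();     // a write to the object is in flight
	// a concurrent read_lock() on another thread would block here until...
	obj_lock.write_unlock();   // ...the data has reached the file system (cache)
	obj_lock.read_lock();      // now the read proceeds and sees the written data
	obj_lock.read_unlock();
	return 0;
}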

Note: these two lock operations only restrict concurrency between reads and writes of the same object; they do not restrict read-read or write-write concurrency. Two read requests for the same object are theoretically harmless (and in fact cannot run at the same time, because the PG lock is taken first). But would two write requests for the same object corrupt data? Remember that the PG lock is taken before entering do_op() and is released only after the request has been handed to the store layer. Two write requests for the same object therefore never enter the PG layer concurrently: the earlier write is processed by the PG layer first and passed to the store layer (handled by another thread), and only then does the later write enter the PG layer and get passed to the store layer. Reads and writes of the same object are thus always processed in order at the PG layer. The next question is whether there is any ordering guarantee when the two writes of the same object are processed by the store layer.

3.4 Store layer order guarantee

Ceph's store layer supports various storage backends, managed as plug-ins. FileStore is used as the example here. FileStore can be written in several modes; we take filestore journal writeahead as the example. When a write request reaches the FileStore (the entry point is FileStore::queue_transactions()), an OpSequencer is obtained (if one has already been created for this PG it is reused; each PG has one osr of type ObjectStore::Sequencer, whose osr->p points to the OpSequencer). The OpSequencer is used to keep the ops of a PG in order; how it is used is described later.

Each transaction that encapsulates a write request carries an op seq, which increases monotonically. The transaction is first placed, in order, into completions and then appended to the tail of the writeq, and write_thread is notified. write_thread uses aio to write the transaction to the journal asynchronously, puts the IO information into aio_queue, and notifies write_finish_thread through write_finish_cond. In write_finish_thread, completed IOs are placed into the journal's finisher queue in order of the op's seq (aio does not guarantee completion order, so the op seq is used to restore ordering after completion); if some earlier op has not yet completed, a completed op waits until all ops before it have completed, and they are then placed into the finisher queue together. See the following code:

int FileStore::queue_transactions(Sequencer *posr, vector<Transaction>& tls,
				  TrackedOpRef osd_op,
				  ThreadPool::TPHandle *handle)
{
	...

	OpSequencer *osr;
	assert(posr);
	if (posr->p) {
		// the PG already has an OpSequencer: reuse it
		osr = static_cast<OpSequencer *>(posr->p.get());
		dout(5) << "queue_transactions existing " << osr << " " << *osr << dendl;
	} else {
		// first op for this PG: create a new OpSequencer and attach it to the Sequencer
		osr = new OpSequencer(next_osr_id.inc());
		osr->set_cct(g_ceph_context);
		osr->parent = posr;
		posr->p = osr;
		dout(5) << "queue_transactions new " << osr << " " << *osr << dendl;
	}

	...

	if (journal && journal->is_writeable() && !m_filestore_journal_trailing) {
		...

		// every op gets a monotonically increasing seq under the submit lock
		uint64_t op_num = submit_manager.op_submit_start();
		o->op = op_num;

		if (m_filestore_journal_parallel) {
			dout(5) << "queue_transactions (parallel) " << o->op << " " << o->tls << dendl;

			_op_journal_transactions(tbl, orig_len, o->op, ondisk, osd_op);

			// queue inside submit_manager op submission lock
			queue_op(osr, o);
		} else if (m_filestore_journal_writeahead) {
			dout(5) << "queue_transactions (writeahead) " << o->op << " " << o->tls << dendl;

			// record the seq on the PG's OpSequencer, then write the journal entry;
			// C_JournaledAhead queues the op to the store once the journal write is done
			osr->queue_journal(o->op);

			_op_journal_transactions(tbl, orig_len, o->op,
				new C_JournaledAhead(this, osr, o, ondisk),
				osd_op);

		} else {
			assert(0);
		}
		submit_manager.op_submit_finish(op_num);

		...
	}

	...
}

void FileJournal::submit_entry(uint64_t seq, bufferlist& e, uint32_t orig_len,
			       Context *oncommit, TrackedOpRef osd_op)
{
	...

	// completions are queued in seq order; they are fired in the same order later
	completions.push_back(
		completion_item(
			seq, oncommit, ceph_clock_now(g_ceph_context), osd_op));

	// wake up write_thread if the write queue was empty
	if (writeq.empty())
		writeq_cond.Signal();

	// the journal entry itself goes to the tail of writeq, again in seq order
	writeq.push_back(write_item(seq, e, orig_len, osd_op));
}

void FileJournal::write_thread_entry()
{
	...

	// write the batched journal entries: asynchronously via libaio when available,
	// otherwise with a synchronous write
	#ifdef HAVE_LIBAIO
		if (aio)
			do_aio_write(bl);
		else
			do_write(bl);
	#else
		do_write(bl);
	#endif

	...
}
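
The key step described above, reordering aio completions by op seq before they reach the finisher, can be sketched as follows (a minimal model written for illustration, not FileJournal's actual code): each completed seq is parked until every earlier seq has also completed, and only then are they handed over in seq order.

#include <cstdint>
#include <functional>
#include <iostream>
#include <map>

class CompletionReorderer {
	uint64_t next_seq = 1;                               // next op seq expected by the finisher
	std::map<uint64_t, std::function<void()>> done;      // completed but out-of-order ops

public:
	// Called from the aio completion path with the seq of the op that just finished.
	void complete(uint64_t seq, std::function<void()> finisher_cb) {
		done[seq] = std::move(finisher_cb);
		// Drain in seq order and stop at the first gap (an earlier op still in flight).
		while (!done.empty() && done.begin()->first == next_seq) {
			done.begin()->second();                  // "queue to the finisher"
			done.erase(done.begin());
			++next_seq;
		}
	}
};

int main() {
	CompletionReorderer r;
	auto report = [](uint64_t s) { return [s] { std::cout << "finish op " << s << "\n"; }; };
	r.complete(2, report(2));   // op 2 finished first: held back
	r.complete(1, report(1));   // op 1 finishes: now 1 and 2 are released in order
	r.complete(3, report(3));
	return 0;
}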

In the journal's finisher callback, ops are put into the OpSequencer's queue in order and also queued to FileStore::op_wq. The thread pool behind FileStore::OpWQ then calls FileStore::_do_op(): it first takes osr->apply_lock.Lock(), then takes the op from the queue for processing (via osr->peek_queue(), without dequeuing) and writes the data to the file system. When the operation completes, FileStore::_finish_op() calls osr->dequeue() and osr->apply_lock.Unlock(). In other words, OpSequencer controls the concurrency of writing IO to the file system within one PG, while IO of different PGs can be processed concurrently by the OpWQ thread pool.
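A rough model of this per-PG apply serialization might look like the following (illustrative only; names such as OpSequencerSketch and the separate qlock are made up, and the real FileStore code differs in detail):

#include <deque>
#include <functional>
#include <iostream>
#include <mutex>

struct OpSequencerSketch {
	std::mutex qlock;                            // protects the queue itself
	std::mutex apply_lock;                       // serializes the apply step per PG
	std::deque<std::function<void()>> q;         // ops in submission order

	void queue(std::function<void()> op) {       // called from the journal finisher
		std::lock_guard<std::mutex> l(qlock);
		q.push_back(std::move(op));
	}

	// Called by an OpWQ worker thread; cf. FileStore::_do_op()/_finish_op().
	void do_one() {
		std::lock_guard<std::mutex> apply(apply_lock);  // one apply at a time per PG
		std::function<void()> op;
		{
			std::lock_guard<std::mutex> l(qlock);
			if (q.empty()) return;
			op = q.front();                      // peek: stays queued while applying
		}
		op();                                    // write the data to the file system
		std::lock_guard<std::mutex> l(qlock);
		q.pop_front();                           // dequeue only after the apply is done
	}
};

int main() {
	OpSequencerSketch osr;                       // one sequencer per PG
	osr.queue([] { std::cout << "apply write #1\n"; });
	osr.queue([] { std::cout << "apply write #2\n"; });
	osr.do_one();                                // worker threads would call this
	osr.do_one();                                // ops applied strictly in order
	return 0;
}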

In summary, within FileStore the op seq controls the order of persisting writes to the journal, OpSequencer then guarantees the order of writing data to the file system, and FIFO queues preserve ordering throughout the whole path. Two write requests for the same object are therefore processed by the FileStore in order and complete in order.

BlueStore has an OpSequencer mechanism as well: it is used to ensure that a later request is not processed until the earlier one has finished. Two write requests for the same object are therefore processed in order, and a read request for an object waits for any outstanding writes to that object to finish before reading.

3.5 Ordering of requests sent by the primary to replicas

Given the ordering at the messenger layer and the ordered processing on the primary, the requests sent to the replicas (that is, the two writes to the same object) are also ordered, and a replica receives the two writes in the same order in which the primary issued them (guaranteed by the ordering of the single TCP connection and of the Ceph messenger layer). The replica then processes them in order through its OSD and store layers as well. As a result, the two writes to the same object cannot be reordered between primary and replica and cause an inconsistency.
