[Ceph] Design and use of Bufferlist in Ceph

If I had to single out the most important class in all of Ceph, I would pick bufferlist. The reason is simple: bufferlist manages all of the memory in Ceph. Every operation that touches memory is built on it, whether that is msg allocating memory to receive a message, the OSD building the persistent representation (encode/decode) of its data structures, or the actual disk I/O.

ceph::buffer is one of the lowest-level components in Ceph and is responsible for managing Ceph's memory. Its design is fairly involved, but ceph::buffer holds no content of its own; it mainly contains four classes: buffer::list, buffer::ptr, buffer::raw, and buffer::hash. These classes are defined in src/include/buffer.h and src/common/buffer.cc.

buffer::raw: Maintains the reference count (nref) of a block of physical memory and releases that memory.
buffer::ptr: A pointer to a buffer::raw.
buffer::list: A list of ptrs (std::list<bufferptr>), which is equivalent to stitching N ptrs into one larger, virtually contiguous block of memory.
buffer::hash: The effective hash of one or more bufferlists.

The relationship between these three types of buffer can be represented by the following diagram:

In the figure, blue represents bufferlist, orange represents bufferptr, and green represents bufferraw.

In this figure, three segments of system memory are actually occupied, represented by raw0, raw1, and raw2. Among them:
raw0 is used by ptr0, ptr1, and ptr2.
raw1 is used by ptr3, ptr4, and ptr6.
raw2 is used by ptr5 and ptr7.
list0 is composed of ptr0-5, and list1 is composed of ptr6 and ptr7.

This picture shows the design idea behind bufferlist: a bufferlist deals only with ptrs. It chains ptrs together and uses them as one contiguous block of memory. You can therefore iterate over the entire content of a bufferlist byte by byte with bufferlist::iterator without caring how many ptrs it contains, let alone how those ptrs map to system memory. You can also write the content of a bufferlist straight to a file with bufferlist::write_file, or to a file descriptor with bufferlist::write_fd.

The opposite of bufferlist is bufferraw, which is responsible for managing system memory. Bufferraw only cares about one thing: it maintains the reference count of the system memory it manages, and releases this memory when the reference count is reduced to 0—that is, when there is no more ptr to use this memory.

bufferptr connects bufferlist and bufferraw. A bufferptr is concerned with how memory is used: each bufferptr must be backed by a bufferraw that supplies the system memory, and the ptr then decides which part of that memory it uses. A bufferlist can reach system memory only through ptrs. A bufferptr can exist on its own, but most ptrs serve a bufferlist, and standalone ptrs are rare.
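The three layers described above can be modeled in a few lines. This is only a simplified sketch, not the Ceph implementation: Raw, Ptr, List, and demo_layers are hypothetical names, and std::shared_ptr stands in for the nref reference count.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-ins for buffer::raw, buffer::ptr and buffer::list.
// std::shared_ptr plays the role of the nref reference count.
struct Raw {
    std::vector<char> data;               // the physical memory
    explicit Raw(const std::string& s) : data(s.begin(), s.end()) {}
};

struct Ptr {
    std::shared_ptr<Raw> raw;             // which memory block is used
    std::size_t off;                      // where the window starts
    std::size_t len;                      // how many bytes the window covers
    char at(std::size_t i) const {
        assert(i < len);
        return raw->data[off + i];
    }
};

struct List {
    std::list<Ptr> buffers;               // windows in logical order
    std::size_t len = 0;                  // total bytes across all windows

    void push_back(const Ptr& p) { buffers.push_back(p); len += p.len; }

    // Walk every byte as if the segments were one contiguous buffer.
    std::string to_string() const {
        std::string out;
        out.reserve(len);
        for (const Ptr& p : buffers)
            for (std::size_t i = 0; i < p.len; ++i)
                out.push_back(p.at(i));
        return out;
    }
};

// Two raws, three windows, one logical byte stream.
inline std::string demo_layers() {
    auto raw0 = std::make_shared<Raw>("hello, world");
    auto raw1 = std::make_shared<Raw>("ceph!");
    List bl;
    bl.push_back(Ptr{raw0, 0, 5});        // "hello"
    bl.push_back(Ptr{raw1, 0, 4});        // "ceph"
    bl.push_back(Ptr{raw0, 5, 7});        // ", world"
    return bl.to_string();
}
```

The caller of to_string() never sees that the bytes come from two separate allocations, or that raw0 is used twice.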

By introducing an intermediate level such as ptr, the way bufferlist uses memory can be very flexible. Here are two scenarios:

1. Fast encode/decode
In Ceph it is often necessary to encode one bufferlist into another. For example, when msg sends a message, the logic layer (such as the OSD) usually hands msg a bufferlist, and msg then has to add a message header and a message footer to it; the header and footer are themselves bufferlists. In this case msg usually constructs an empty bufferlist and encodes the header, the content, and the footer into it. Encoding between bufferlists only needs to copy ptrs; it neither allocates nor copies system memory, which makes it efficient.
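A rough sketch of this idea follows. Seg, SegList, and frame are hypothetical names rather than Ceph APIs, and std::shared_ptr again stands in for the reference count. Appending one list to another copies only the small segment descriptors, so the payload bytes are shared instead of duplicated.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <memory>
#include <string>

// Hypothetical simplified types; these are not the Ceph classes.
struct Seg {
    std::shared_ptr<const std::string> raw;  // shared, reference-counted bytes
    std::size_t off, len;
};

struct SegList {
    std::list<Seg> segs;
    std::size_t len = 0;

    // Appending raw bytes allocates one new backing block.
    void append(const std::string& bytes) {
        segs.push_back(Seg{std::make_shared<const std::string>(bytes), 0, bytes.size()});
        len += bytes.size();
    }
    // Appending another list copies descriptors only; no bytes are copied.
    void append(const SegList& other) {
        for (const Seg& s : other.segs) segs.push_back(s);
        len += other.len;
    }
    std::string flatten() const {
        std::string out;
        for (const Seg& s : segs) out.append(*s.raw, s.off, s.len);
        return out;
    }
};

// Frame a payload with a header and a footer without copying the payload.
inline SegList frame(const SegList& header, const SegList& payload, const SegList& footer) {
    SegList msg;
    msg.append(header);
    msg.append(payload);
    msg.append(footer);
    assert(msg.len == header.len + payload.len + footer.len);
    return msg;
}
```

After frame() returns, the payload segment inside the message points at the same backing block as the original payload list; only the reference count has changed.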

2. Allocate once, use many times
Calling a function such as malloc to allocate memory is a heavyweight operation. Using ptr as an intermediate layer mitigates this: we can allocate one large block of memory, that is, one large bufferraw, and then, each time memory is needed, construct a bufferptr that points to a different part of that bufferraw. No further allocation from the system is required. Finally, adding these ptrs to a bufferlist yields a virtually contiguous block of memory.
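The pattern can be sketched as a small slab that is allocated once and then handed out as windows. Slab, Window, and take are hypothetical names for illustration; Ceph's own logic for this is more elaborate.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical sketch: one big allocation handed out as many small windows.
struct Slab {
    std::shared_ptr<std::vector<char>> raw;  // the single large allocation
    std::size_t used = 0;                    // bytes already handed out

    explicit Slab(std::size_t capacity)
        : raw(std::make_shared<std::vector<char>>(capacity)) {}

    struct Window {
        std::shared_ptr<std::vector<char>> raw;  // keeps the slab memory alive
        std::size_t off, len;
        char* data() const { return raw->data() + off; }
    };

    // Carve the next n bytes out of the slab without calling the allocator.
    Window take(std::size_t n) {
        assert(used + n <= raw->size());
        Window w{raw, used, n};
        used += n;
        return w;
    }
};
```

Each Window holds its own reference to the slab, so the memory stays valid until the last window is destroyed, which mirrors the role nref plays for bufferraw.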

About the author 1: Dr. Yuan Dong, vice president of UnitedStack products, responsible for UnitedStack products, pre-sales and external cooperation; cloud computing expert, has rich experience in cloud computing, virtualization, distributed systems and enterprise applications; He has a deep understanding of distributed storage, unstructured data storage and storage virtualization, and has rich R&D and practical experience in the field of cloud storage and enterprise storage; he is a core code contributor to open source storage projects such as Ceph.

Related articles: https://www.jianshu.com/p/6c8b361cc665

Source code analysis ( http://bean-li.github.io/bufferlist-in-ceph/ )

buffer::raw

Before introducing the buffer list, we must first introduce buffer::raw and buffer::ptr. Compared with the buffer list, these two data structures are relatively easy to understand.

class buffer::raw {
    public:
        char *data;
        unsigned len;
        atomic_t nref;

        mutable simple_spinlock_t crc_spinlock;
        map<pair<size_t, size_t>, pair<uint32_t, uint32_t> > crc_map;
        
        ...
}

Note that in this data structure, data is a pointer that points to the real data, and len records the length of the data in the buffer::raw data area, and nref represents the reference count.

The memory that data points to can come from different sources. The most obvious one is malloc. It can also be allocated with mmap by creating an anonymous memory mapping, and zero-copy access is even possible through pipe + splice. Sometimes the allocation also has alignment requirements, such as page alignment.

Because these sources and requirements differ, buffer::raw has several variants:

buffer::raw_malloc

The data source of this variant comes from malloc. Therefore, when it is created, a space of len length needs to be allocated through malloc. Not surprisingly, when it is destructed, free is used to release the space.

    class buffer::raw_malloc : public buffer::raw {
        public:
            explicit raw_malloc(unsigned l) : raw(l) {
                if (len) {
                    data = (char *)malloc(len);
                    if (!data)
                        throw bad_alloc();
                } else {
                    data = 0;
                }
                inc_total_alloc(len);
                inc_history_alloc(len);
                bdout << "raw_malloc " << this << " alloc " << (void *)data << " " << l << " " << buffer::get_total_alloc() << bendl;
            }
            raw_malloc(unsigned l, char *b) : raw(b, l) {
                inc_total_alloc(len);
                bdout << "raw_malloc " << this << " alloc " << (void *)data << " " << l << " " << buffer::get_total_alloc() << bendl;
            }
            ~raw_malloc() {
                free(data);
                dec_total_alloc(len);
                bdout << "raw_malloc " << this << " free " << (void *)data << " " << buffer::get_total_alloc() << bendl;
            }
            raw* clone_empty() {
                return new raw_malloc(len);
            }
    };

buffer::raw_mmap_pages

As the name suggests, the data of this variant comes from an anonymous memory mapping allocated with mmap. Accordingly, the destructor calls munmap to remove the mapping and return the space to the system.

    class buffer::raw_mmap_pages : public buffer::raw {
        public:
            explicit raw_mmap_pages(unsigned l) : raw(l) {
                data = (char*)::mmap(NULL, len, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANON, -1, 0);
                if (!data)
                    throw bad_alloc();
                inc_total_alloc(len);
                inc_history_alloc(len);
                bdout << "raw_mmap " << this << " alloc " << (void *)data << " " << l << " " << buffer::get_total_alloc() << bendl;
            }
            ~raw_mmap_pages() {
                ::munmap(data, len);
                dec_total_alloc(len);
                bdout << "raw_mmap " << this << " free " << (void *)data << " " << buffer::get_total_alloc() << bendl;
            }
            raw* clone_empty() {
                return new raw_mmap_pages(len);
            }
    };

buffer::raw_posix_aligned

The name shows that this variant has an alignment requirement. It uses the posix_memalign function on Linux to allocate memory with the requested alignment. Memory allocated this way is also released with free, which returns the space to the system.

class buffer::raw_posix_aligned : public buffer::raw {
        unsigned align;
        public:
        raw_posix_aligned(unsigned l, unsigned _align) : raw(l) {
            align = _align;
            assert((align >= sizeof(void *)) && (align & (align - 1)) == 0);
#ifdef DARWIN
            data = (char *) valloc (len);
#else
            data = 0;
            int r = ::posix_memalign((void**)(void*)&data, align, len);
            if (r)
                throw bad_alloc();
#endif /* DARWIN */
            if (!data)
                throw bad_alloc();
            inc_total_alloc(len);
            inc_history_alloc(len);
            bdout << "raw_posix_aligned " << this << " alloc " << (void *)data << " l=" << l << ", align=" << align << " total_alloc=" << buffer::get_total_alloc() << bendl;
        }
        ~raw_posix_aligned() {
            ::free((void*)data);
            dec_total_alloc(len);
            bdout << "raw_posix_aligned " << this << " free " << (void *)data << " " << buffer::get_total_alloc() << bendl;
        }
        raw* clone_empty() {
            return new raw_posix_aligned(len, align);
        }
    };

There are also zero-copy variants based on pipe and splice, which we won't go into here. As the code above shows, the buffer::raw family lives up to its name: it is truly raw, with few twists and turns, and simply uses the APIs provided by the system to allocate space.
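As a standalone illustration of the three allocation strategies, here is a minimal sketch using only POSIX calls. The helper names (alloc_malloc, alloc_mmap, alloc_aligned, is_aligned) are hypothetical, and the helpers return nullptr on failure rather than throwing. One detail worth noting: mmap reports failure by returning MAP_FAILED, not a null pointer, so the sketch checks for that value.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <sys/mman.h>

// Hypothetical helpers mirroring the three allocation strategies.
// Each returns nullptr on failure instead of throwing.

inline char* alloc_malloc(std::size_t len) {
    return static_cast<char*>(std::malloc(len));      // release with free()
}

inline char* alloc_mmap(std::size_t len) {
    void* p = ::mmap(nullptr, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANON, -1, 0);  // release with munmap()
    return p == MAP_FAILED ? nullptr : static_cast<char*>(p);
}

inline char* alloc_aligned(std::size_t len, std::size_t align) {
    // posix_memalign needs a power of two that is a multiple of sizeof(void*).
    assert(align >= sizeof(void*) && (align & (align - 1)) == 0);
    void* p = nullptr;                                // release with free()
    return ::posix_memalign(&p, align, len) == 0 ? static_cast<char*>(p) : nullptr;
}

inline bool is_aligned(const void* p, std::size_t align) {
    return reinterpret_cast<std::uintptr_t>(p) % align == 0;
}
```

The release function has to match the allocator: free for malloc and posix_memalign, munmap for mmap. That pairing is the reason each buffer::raw variant carries its own destructor.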

buffer::ptr

Buffer::ptr is based on the buffer::raw series. This class is also aliased as bufferptr.

src/include/buffer_fwd.h


#ifndef BUFFER_FWD_H
#define BUFFER_FWD_H

namespace ceph {
  namespace buffer {
    class ptr;
    class list;
    class hash;
  }

  using bufferptr = buffer::ptr;
  using bufferlist = buffer::list;
  using bufferhash = buffer::hash;
}

#endif

The member variables of this class are shown below. The class builds on the raw class: _raw points to a buffer::raw object, while _off and _len select the part of that raw's data that this ptr uses.

        class CEPH_BUFFER_API ptr {
            raw *_raw;
            unsigned _off, _len;
          ......    
      }

Most of its operations are straightforward:


   buffer::ptr& buffer::ptr::operator= (const ptr& p)
    {
        if (p._raw) {
            p._raw->nref.inc();
            bdout << "ptr " << this << " get " << _raw << bendl;
        }
        buffer::raw *raw = p._raw; 
        release();
        if (raw) {
            _raw = raw;
            _off = p._off;
            _len = p._len;
        } else {
            _off = _len = 0;
        }
        return *this;
    }
    
    buffer::raw *buffer::ptr::clone()
    {
        return _raw->clone();
    }
    
    void buffer::ptr::swap(ptr& other)
    {
        raw *r = _raw;
        unsigned o = _off;
        unsigned l = _len;
        _raw = other._raw;
        _off = other._off;
        _len = other._len;
        other._raw = r;
        other._off = o;
        other._len = l;
    }
    
   const char& buffer::ptr::operator[](unsigned n) const
    {
        assert(_raw);
        assert(n < _len);
        return _raw->get_data()[_off + n];
    }
    char& buffer::ptr::operator[](unsigned n)
    {
        assert(_raw);
        assert(n < _len);
        return _raw->get_data()[_off + n];
    }
    
    int buffer::ptr::cmp(const ptr& o) const
    {
        int l = _len < o._len ? _len : o._len;
        if (l) {
            int r = memcmp(c_str(), o.c_str(), l);
            if (r)
                return r;
        }
        if (_len < o._len)
            return -1;
        if (_len > o._len)
            return 1;
        return 0;
    }
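The ordering rule used by cmp can be tried in isolation. The sketch below is a hypothetical free function (range_cmp), not part of Ceph; it follows the same steps as the code above: compare the common prefix bytewise, and if the prefixes are equal, let the shorter range sort first.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical free function with the same ordering rule as buffer::ptr::cmp.
inline int range_cmp(const char* a, std::size_t alen,
                     const char* b, std::size_t blen) {
    assert((a != nullptr || alen == 0) && (b != nullptr || blen == 0));
    const std::size_t common = alen < blen ? alen : blen;
    if (common) {
        const int r = std::memcmp(a, b, common);  // compare the shared prefix
        if (r) return r;
    }
    if (alen < blen) return -1;                   // equal prefix: shorter sorts first
    if (alen > blen) return 1;
    return 0;
}

// Convenience overload for experimenting with string literals.
inline int range_cmp(const std::string& a, const std::string& b) {
    return range_cmp(a.data(), a.size(), b.data(), b.size());
}
```

When the prefixes differ, the result is whatever memcmp returns, so only its sign is meaningful; the values -1 and 1 are guaranteed only when the prefixes match.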

bufferlist

The bufferlist is our destination. The first two classes are actually relatively easy to understand, but the bufferlist is relatively complicated.

bufferlist is an alias for buffer::list:

class CEPH_BUFFER_API list {
            // my private bits
            std::list<ptr> _buffers;
            unsigned _len;
            unsigned _memcopy_count; //the total of memcopy using rebuild().
            ptr append_buffer;  // where i put small appends

Multiple bufferptrs form a list, and that list is the bufferlist. The member variables are not hard to understand: _buffers is a linked list of ptrs, _len is the total length of the data in all the ptrs in _buffers, _memcopy_count counts the bytes copied by rebuild(), and append_buffer is a buffer used to optimize small append operations. As this shows, a bufferlist stores non-contiguous pieces of data in a linked list.

The harder part is the bufferlist iterator. Once the iterator is understood, the various bufferlist operations are not difficult to follow. To understand the iterator, first look at the meaning of its member variables:

                        bl_t* bl;
                        list_t* ls;  // meh.. just here to avoid an extra pointer dereference..
                        unsigned off; // in bl
                        list_iter_t p;
                        unsigned p_off;   // in *p

bl: pointer to the bufferlist
ls: pointer to the member _buffers of the bufferlist
p: The type is std::list::iterator, used to iterate through the bufferptr in the bufferlist
p_off: the offset of the current position in the corresponding bufferptr
off: the offset of the current position within the entire bufferlist, treating the whole bufferlist as if it were a single buffer::raw

This progressive relationship is more obvious, from the macro bufferlist, to an internal bufferptr, and then to an offset position in the raw data area of the bufferptr. In addition, it also contains the offset off of the current position in the entire bufferlist.

Note that p_off and off are easy to confuse; read the seek function carefully to see the difference.

seek(unsigned o), as the name implies, moves the position to o, where o is an offset within the entire bufferlist. It is implemented on top of the more general advance, which takes an int argument.

If o > 0, the position moves toward the end of the bufferlist; if o < 0, it moves back toward the beginning. Either move may cross the boundary of the data area that the current bufferptr points to.

    template<bool is_const>
        void buffer::list::iterator_impl<is_const>::advance(int o)
        {
            //cout << this << " advance " << o << " from " << off << " (p_off " << p_off << " in " << p->length() << ")" << std::endl;
            if (o > 0) {
                p_off += o;
                while (p_off > 0) {
                    if (p == ls->end())
                        throw end_of_buffer();
                    if (p_off >= p->length()) {
                        // skip this buffer
                        p_off -= p->length();
                        p++;
                    } else {
                        // somewhere in this buffer!
                        break;
                    }
                }
                off += o;
                return;
            }
            while (o < 0) {
                if (p_off) {
                    unsigned d = -o;
                    if (d > p_off)
                        d = p_off;
                    p_off -= d;
                    off -= d;
                    o += d;
                } else if (off > 0) {
                    assert(p != ls->begin());
                    p--;
                    p_off = p->length();
                } else {
                    throw end_of_buffer();
                }
            }
        }

    template<bool is_const>
        void buffer::list::iterator_impl<is_const>::seek(unsigned o)
        {
            p = ls->begin();
            off = p_off = 0;
            advance(o);
        }
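To see how p, p_off, and off move together, the same stepping logic can be run over a plain list of segment lengths. This is a simplified sketch: Cursor is a hypothetical type, p is an index rather than a std::list iterator, and std::out_of_range stands in for end_of_buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical cursor over segment lengths only; the field names follow the
// iterator members described above (p, p_off, off).
struct Cursor {
    const std::vector<std::size_t>* ls;  // length of each segment
    std::size_t p = 0;                   // index of the current segment
    std::size_t p_off = 0;               // offset inside segment p
    std::size_t off = 0;                 // offset inside the whole list

    explicit Cursor(const std::vector<std::size_t>& lens) : ls(&lens) {}

    void advance(long o) {
        if (o > 0) {
            p_off += static_cast<std::size_t>(o);
            while (p_off > 0) {
                if (p == ls->size()) throw std::out_of_range("end of buffer");
                if (p_off >= (*ls)[p]) { p_off -= (*ls)[p]; ++p; }  // skip this segment
                else break;                                         // inside segment p
            }
            off += static_cast<std::size_t>(o);
            return;
        }
        while (o < 0) {
            if (p_off) {                         // step back inside this segment
                std::size_t d = static_cast<std::size_t>(-o);
                if (d > p_off) d = p_off;
                p_off -= d; off -= d; o += static_cast<long>(d);
            } else if (off > 0) {                // hop to the end of the previous segment
                assert(p > 0);
                --p; p_off = (*ls)[p];
            } else {
                throw std::out_of_range("before start of buffer");
            }
        }
    }

    void seek(std::size_t o) { p = 0; p_off = off = 0; advance(static_cast<long>(o)); }
};
```

With segment lengths 500, 300, and 400, seek(1000) skips the first two segments and stops 200 bytes into the third, so p is 2, p_off is 200, and off is 1000. A following advance(-300) first consumes those 200 bytes, hops back into the second segment, and ends with p at 1, p_off at 200, and off at 700.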

In addition, it is also very interesting to get the ptr of the current position. Understanding this function also helps to understand the meaning of the five members of the iterator.

template<bool is_const>
    buffer::ptr buffer::list::iterator_impl<is_const>::get_current_ptr() const
    {
        if (p == ls->end())
            throw end_of_buffer();
        return ptr(*p, p_off, p->length() - p_off);
    }

The buffer::raw blocks behind the bufferptrs of a bufferlist are not necessarily contiguous, which can be inconvenient. For this reason Ceph provides a rebuild function, which creates a single buffer::raw that provides the same amount of space and holds the same content.

    void buffer::list::rebuild()
    {
        if (_len == 0) {
            _buffers.clear();
            return;
        }
        ptr nb;
        if ((_len & ~CEPH_PAGE_MASK) == 0)
            nb = buffer::create_page_aligned(_len);
        else
            nb = buffer::create(_len);
        rebuild(nb);
    }

    void buffer::list::rebuild(ptr& nb)
    {
        unsigned pos = 0;
        for (std::list<ptr>::iterator it = _buffers.begin();
                it != _buffers.end();
                ++it) {
            nb.copy_in(pos, it->length(), it->c_str(), false);
            pos += it->length();
        }
        _memcopy_count += pos;
        _buffers.clear();
        if (nb.length())
            _buffers.push_back(nb);
        invalidate_crc();
        last_p = begin();
    }

The following test code makes the meaning of rebuild clear: it consolidates the scattered pieces into a whole by rebuilding a single buffer::raw to provide the space.

  {
    bufferlist bl;
    const std::string str(CEPH_PAGE_SIZE, 'X');
    bl.append(str.c_str(), str.size());
    bl.append(str.c_str(), str.size());
    EXPECT_EQ((unsigned)2, bl.buffers().size());
    bl.rebuild();
    EXPECT_EQ((unsigned)1, bl.buffers().size());
  }
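The same behavior can be reproduced without Ceph in a toy model. In this sketch each std::string stands in for one ptr together with its raw, and ToyList is a hypothetical type; it ignores page alignment and CRC invalidation, which the real rebuild handles.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <string>
#include <utility>

// Hypothetical model: each std::string stands in for one ptr and its raw.
struct ToyList {
    std::list<std::string> buffers;
    std::size_t len = 0;
    std::size_t memcopy_count = 0;   // bytes copied by rebuild()

    void append(const std::string& s) { buffers.push_back(s); len += s.size(); }

    // Copy every segment into one new contiguous buffer and keep only that.
    void rebuild() {
        if (len == 0) { buffers.clear(); return; }
        std::string nb;
        nb.reserve(len);
        for (const std::string& seg : buffers) nb += seg;
        assert(nb.size() == len);
        memcopy_count += nb.size();
        buffers.clear();
        buffers.push_back(std::move(nb));
    }
};
```

The cost is visible in memcopy_count: unlike appending one list to another, rebuild copies every byte, so it pays off only when contiguous memory is actually required.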

Once the above is understood, the remaining thousands of lines of bufferlist code are mostly routine and not hard to follow, so I won't go through them here.

"Ceph source code reading buffer" https://zhuanlan.zhihu.com/p/96659509

"ceph: bufferlist implementation" https://www.it610.com/article/1231080926906257408.htm

ceph blog: http://bean-li.github.io/

Bufferlist is an alias of buffer::list, and its origin is described in detail in http://bean-li.github.io/bufferlist-in-ceph/

The meaning of the p, p_off, and off fields can be understood from the advance(int o) function.

Simply put, p is an iterator into _buffers, p_off is the offset within the current buffer::ptr, and off is the offset of the current position when all the data in _buffers is viewed as a single whole. As shown in the figure below, when o is 1000, p is ptr3, p_off is 200 (derived from 1000-500-300), and off is 1000.