ZLMediaKit High-Concurrency Implementation Principles

Disclaimer: This is an original article by the author and may not be reproduced without permission. https://blog.csdn.net/tanningzhong/article/details/88798661

Project Introduction

ZLMediaKit is a high-performance streaming media server framework that currently supports the RTMP, RTSP, HLS, and HTTP-FLV streaming protocols. The project runs on Linux, macOS, Windows, iOS, and Android, and supports the H264, AAC, and H265 codecs (H265 is supported over RTSP only). Its concurrency model is multi-threaded, non-blocking, multiplexed IO (epoll on Linux, select on other platforms).

The framework is written in C++11. It avoids raw pointers and minimizes memory copies; the code is lean and reliable, and concurrency is high. On Linux a single process can take full advantage of a multi-core CPU and squeeze the most out of both the CPU and the network card, easily reaching the limits of a gigabit NIC. Thanks to this level of performance, latency is extremely low and streams open within a second.

After several iterations, the current version of ZLMediaKit has gone through many upgrades and optimizations of its programming model; it has become mature and stable and has been verified in a variety of production environments. This article focuses on the principles behind the project's high-performance characteristics.

Network Model Comparison

Unlike SRS (single-threaded with multiple coroutines), node.js and redis (single-threaded), or NGINX (multi-process), ZLMediaKit uses a single-process, multi-threaded model. So why did ZLMediaKit adopt this programming model?

As a C++ server back-end engineer, many years of work experience have taught me that a server program's most demanding requirement is stability; performance can be merely adequate, but the server must never core dump casually. Service interruptions, restarts, and crashes are disastrous for a project that is already in production. So how do we guarantee server stability? The common approaches are:

  • Single-threaded model
  • Single thread plus coroutines
  • Multiple single-threaded processes
  • Multi-threading plus locks
  • Abandoning C/C++ altogether

The advantage of the single-threaded model is that the server is simple and reliable: there is no resource contention or mutual exclusion to worry about, so high stability is easier to achieve; typical projects using this model are redis and node.js. But because it is single-threaded, the drawbacks are also obvious: it simply cannot exploit the compute power of a multi-core CPU, so on multi-core machines the CPU becomes the main performance bottleneck (anyone who has run `keys *` on redis and waited has experienced this).


Single thread plus coroutines is essentially no different from the pure single-threaded solution; the main difference is programming style. The pure single-threaded model achieves high concurrency with non-blocking callbacks everywhere, which leads to the so-called callback hell and makes programming more troublesome. Single thread plus coroutines simplifies programming: you write in a natural blocking style while the coroutine library schedules tasks internally, so it is still non-blocking under the hood. However, coroutine libraries are relatively low-level and closely tied to the operating system, so they are hard to make cross-platform, and designing and implementing a coroutine library has a high barrier to entry. SRS uses this programming model, and due to the limitations of its coroutine library, SRS cannot run on Windows.

To address the shortcomings of the single-threaded model, many servers adopt a single-threaded multi-process programming model. This model keeps the simplicity and reliability of the single-threaded model while maximizing the performance of multi-core CPUs, and a crashed process does not affect the others; NGINX uses this model. But it also has its limitations: sessions are isolated from each other, and two sessions may run in different processes, which makes communication between sessions difficult. For example, if user A is connected to server process A and user B to server process B, exchanging data between the two is extremely hard and must go through inter-process communication, which is costly and difficult to program. If sessions never need to exchange data (for example, an HTTP server), this model fits particularly well, which is why NGINX is so successful as an HTTP server. But for services such as instant messaging, where sessions must communicate with each other, this development model is not very suitable. Moreover, more and more services now need to support distributed cluster deployment, so the defects of the single-threaded multi-process solution are becoming increasingly obvious.

Since C/C++ is a statically and strongly typed language and its exception handling is simple and crude, it core dumps readily. The C/C++ design philosophy is to expose errors as early as possible; in a sense the crash is a good thing, because it draws your attention so you can detect, locate, and solve the problem early, rather than let the issue fester until it can no longer be fixed. But for the average person this makes C/C++ unfriendly: humans are not as rigorous as machines, a little negligence is inevitable, and some small problems really do not warrant the destructive treatment of a core dump. On top of that, the C/C++ learning curve is unusually steep; many people study it for years to little avail, so plenty have abandoned it and turned to Go, Erlang, node.js, and the like.

However, because of its performance advantages and for historical reasons, C/C++ is still the best choice in some scenarios, and it is the truly cross-platform language. Moreover, with the introduction of smart pointers, memory management is no longer a problem, and with lambda support, binding context no longer makes programs hard to follow. With the support of new C++ language features and stronger compile-time checks, modern C++ programming is increasingly simple and quick. ZLMediaKit uses the C++11 standard to implement a high-performance streaming media server framework and the ideas described in this article.

Unlike the programming models above, ZLMediaKit uses a multi-threaded development model, but it differs from traditional multi-threading: ZLMediaKit uses C++11 smart pointers for memory management, so an object can be handed from one thread to another and have its lifetime shared across threads cleanly. At the same time, mutex granularity is reduced to the point of being almost negligible. As a result, the performance loss of ZLMediaKit's multi-threaded model is very low: each thread performs almost as well as a single-threaded model, while every CPU core can be fully utilized.
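
As a rough illustration of the idea in plain C++11 (not ZLMediaKit code): a buffer owned by a std::shared_ptr can be captured by value into a lambda that runs on another thread; the reference count keeps the buffer alive until the last owner releases it, so neither a deep copy nor an explicit lock is needed for lifetime management.

#include <iostream>
#include <memory>
#include <string>
#include <thread>

int main() {
    // The frame is owned by a shared_ptr; copying the pointer only bumps a reference count.
    auto frame = std::make_shared<std::string>("media frame data");

    // Capturing by value shares ownership with the worker thread; no data is copied.
    std::thread worker([frame]() {
        std::cout << "worker sees " << frame->size() << " bytes" << std::endl;
    });
    worker.join();

    // The frame is destroyed automatically when the last shared_ptr goes out of scope.
    return 0;
}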

The Network Model in Detail

At startup ZLMediaKit automatically creates several epoll instances according to the number of CPU cores (on non-Linux platforms it falls back to select); each epoll instance has a dedicated thread running the epoll_wait function, waiting for events to trigger.
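
A minimal sketch of this arrangement (hypothetical code, not the actual EventPollerPool implementation): one epoll instance and one polling thread are created per CPU core, and each thread blocks in epoll_wait on its own instance.

#include <algorithm>
#include <sys/epoll.h>
#include <thread>
#include <vector>

int main() {
    unsigned cores = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::thread> pollers;

    for (unsigned i = 0; i < cores; ++i) {
        int efd = epoll_create1(0);
        pollers.emplace_back([efd]() {
            epoll_event events[64];
            for (;;) { // runs forever; a real server would add a shutdown path
                int n = epoll_wait(efd, events, 64, -1);
                for (int j = 0; j < n; ++j) {
                    // dispatch events[j] to the session object bound to this poller ...
                }
            }
        });
    }
    for (auto &t : pollers) {
        t.join();
    }
    return 0;
}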

Take ZLMediaKit's RTMP service as an example. When the TcpServer is created, ZLMediaKit adds this TCP service's listening socket to every epoll instance, so when a new RTMP playback request arrives, the kernel's scheduling across the multiple epoll instances automatically lets a relatively lightly loaded thread receive the accept event. The following is the code snippet:

template <typename SessionType>
void start(uint16_t port, const std::string& host = "0.0.0.0", uint32_t backlog = 1024) {
   start_l<SessionType>(port,host,backlog);
   //Automatically add this listener to every epoll thread for monitoring
   EventPollerPool::Instance().for_each([&](const TaskExecutor::Ptr &executor){
      EventPoller::Ptr poller = dynamic_pointer_cast<EventPoller>(executor);
      if(poller == _poller || !poller){
         return;
      }
      auto &serverRef = _clonedServer[poller.get()];
      if(!serverRef){
         //Bind the cloned server to this epoll instance
         serverRef = std::make_shared<TcpServer>(poller);
      }
      serverRef->cloneFrom(*this);
   });
}


void cloneFrom(const TcpServer &that){
		if(!that._socket){
			throw std::invalid_argument("TcpServer::cloneFrom other with null socket!");
		}
		_sessionMaker = that._sessionMaker;
		//Clone a Socket object that shares the same listening fd
		_socket->cloneFromListenSocket(*(that._socket));
		_timer = std::make_shared<Timer>(2, [this]()->bool {
			this->onManagerSession();
			return true;
		},_poller);
		this->mINI::operator=(that);
        _cloned = true;
	}
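
For context, bringing up an RTMP service with this start template looks roughly like the following (modeled on the project's example programs; header paths and session class names may differ between versions):

#include <cstdio>
#include <memory>
#include "Network/TcpServer.h"   // assumed header path
#include "Rtmp/RtmpSession.h"    // assumed header path

int main() {
    // One TcpServer instance; start() clones the listening socket onto every epoll thread,
    // so accepted RTMP connections are spread across all cores.
    auto rtmpServer = std::make_shared<TcpServer>();
    rtmpServer->start<RtmpSession>(1935);
    getchar(); // keep the process alive
    return 0;
}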

After the server receives the accept event, it creates a TcpSession object and binds it to an epoll instance (at the same time the corresponding peer fd is added to that epoll instance for monitoring). Each TCP connection corresponds to one TcpSession object; during the subsequent data exchange between client and server, the TcpSession object handles all of the connection's business data, and every event in its lifetime is triggered by that epoll thread. In this way each epoll thread is assigned a reasonable, evenly balanced number of clients. The following snippet shows the server's accept event-handling logic:

// A client connection request has been received
    virtual void onAcceptConnection(const Socket::Ptr & sock) {
		weak_ptr<TcpServer> weakSelf = shared_from_this();
        //Create a TcpSession; different service session types are instantiated here
		auto sessionHelper = _sessionMaker(weakSelf,sock);
		auto &session = sessionHelper->session();
        //Pass this server's configuration to the TcpSession
        session->attachServer(*this);

        //The TcpSession's unique identifier, e.g. a GUID
        auto sessionId = session->getIdentifier();
        //Record this TcpSession
        if(!SessionMap::Instance().add(sessionId,session)){
            //A session with the same id already exists, so getIdentifier produced a flawed identifier
            WarnL << "SessionMap::add failed:" << sessionId;
            return;
        }
        //If SessionMap has no such record, then _sessionMap certainly has none either,
        //so _sessionMap::emplace is guaranteed to succeed
        auto success = _sessionMap.emplace(sessionId, sessionHelper).second;
        assert(success == true);

        weak_ptr<TcpSession> weakSession(session);
		//Data-received event for the session
		sock->setOnRead([weakSession](const Buffer::Ptr &buf, struct sockaddr *addr){
			//Obtain a strong reference to the session
			auto strongSession=weakSession.lock();
			if(!strongSession) {
				//The session object has already been released
				return;
			}
            //The TcpSession handles the business data
			strongSession->onRecv(buf);
		});


		//Error event for the session
		sock->setOnErr([weakSelf,weakSession,sessionId](const SockException &err){
		    //Remove the session object when this function scope ends.
            //This guarantees onError runs before the session is removed,
            //and that the session is still removed even if onError throws an exception
		    onceToken token(nullptr,[&](){
                //Remove the session
                SessionMap::Instance().remove(sessionId);
                auto strongSelf = weakSelf.lock();
                if(!strongSelf) {
                    return;
                }
                //Erase the map record in the TcpServer's own thread
                strongSelf->_poller->async([weakSelf,sessionId](){
                    auto strongSelf = weakSelf.lock();
                    if(!strongSelf){
                        return;
                    }
                    strongSelf->_sessionMap.erase(sessionId);
                });
		    });
			//Obtain a strong reference to the session
			auto strongSession=weakSession.lock();
            if(strongSession) {
                //Trigger the onError event callback
				strongSession->onError(err);
			}
		});
	}
From the description above we should now have a general understanding of ZLMediaKit's network model. This model can basically squeeze all the compute out of the CPU, but CPU cycles can still be wasted on useless work if they are used improperly. So what other techniques does ZLMediaKit use to improve performance? We discuss them in the following sections.

Disabling Mutexes

From the previous discussion we know that TcpSession is the key element in ZLMediaKit: most of the server's computation is done inside TcpSession. A TcpSession's life cycle is managed by a single epoll instance, and other threads cannot manipulate the TcpSession object directly (they must switch to the corresponding epoll thread to operate on it); in that sense TcpSession is confined to one thread. ZLMediaKit therefore disables the mutex that protects the TcpSession's network IO. When ZLMediaKit runs in server mode there is essentially no locking at all, so the impact of locks on performance is almost negligible. The following code fragments show how ZLMediaKit disables the mutex:

virtual Socket::Ptr onBeforeAcceptConnection(const EventPoller::Ptr &poller){
    	/**
    	 * In server mode the socket is confined to one thread and is therefore thread-safe,
    	 * so the mutex is disabled to improve performance.
    	 * The second argument of the Socket constructor controls whether the mutex is enabled.
    	 */
		return std::make_shared<Socket>(poller,false);
	}

//Socket constructor; the second argument controls whether the mutex is enabled
Socket::Socket(const EventPoller::Ptr &poller,bool enableMutex) :
		_mtx_sockFd(enableMutex),
		_mtx_bufferWaiting(enableMutex),
		_mtx_bufferSending(enableMutex) {
	_poller = poller;
	if(!_poller){
		_poller = EventPollerPool::Instance().getPoller();
	}

    _canSendSock = true;
	_readCB = [](const Buffer::Ptr &buf,struct sockaddr *) {
		WarnL << "Socket not set readCB";
	};
	_errCB = [](const SockException &err) {
		WarnL << "Socket not set errCB:" << err.what();
	};
	_acceptCB = [](Socket::Ptr &sock) {
		WarnL << "Socket not set acceptCB";
	};
	_flushCB = []() {return true;};

	_beforeAcceptCB = [](const EventPoller::Ptr &poller){
		return nullptr;
	};
}

//Definition of MutexWrapper; locking can be switched off per instance
template <class Mtx = recursive_mutex>
class MutexWrapper {
public:
    MutexWrapper(bool enable){
        _enable = enable;
    }
    ~MutexWrapper(){}

    inline void lock(){
        if(_enable){
            _mtx.lock();
        }
    }
    inline void unlock(){
        if(_enable){
            _mtx.unlock();
        }
    }
private:
    bool _enable;
    Mtx _mtx;
};
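
For illustration (not project code), the wrapper satisfies the BasicLockable requirements, so it works with std::lock_guard like an ordinary mutex; when constructed with false, every lock and unlock becomes a no-op:

#include <mutex>

// Assumes the MutexWrapper template above is visible in this translation unit.
MutexWrapper<std::recursive_mutex> g_mtx(false); // server mode: locking is a no-op

void touchSessionState() {
    // Compiles and runs like ordinary locking code, but costs only a branch
    // when the mutex was constructed with enable == false.
    std::lock_guard<MutexWrapper<std::recursive_mutex>> lck(g_mtx);
    // ... access state that is already confined to a single epoll thread ...
}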

Avoiding Memory Copies

Under a traditional multi-threaded model, forwarding data involves thread switching, and memory copies are generally used to keep that switch thread-safe; splitting data into packets is also very hard to implement without copying memory. But in streaming media, the number of users watching the same live stream can be massive; if every distribution required a memory copy, the cost would be considerable and would seriously drag down server performance.

When forwarding media data, ZLMediaKit does not copy memory. This is hard to achieve with conventional multi-threaded C++, but with the blessing of C++11 reference counting, the problem of managing memory lifetimes across threads is solved elegantly. The following snippet shows how the RTMP server avoids memory copies when distributing media:

void RtmpProtocol::sendRtmp(uint8_t ui8Type, uint32_t ui32StreamId,
        const Buffer::Ptr &buf, uint32_t ui32TimeStamp, int iChunkId){
    if (iChunkId < 2 || iChunkId > 63) {
        auto strErr = StrPrinter << "不支持发送该类型的块流 ID:" << iChunkId << endl;
        throw std::runtime_error(strErr);
    }
	//Whether an extended timestamp is needed
    bool bExtStamp = ui32TimeStamp >= 0xFFFFFF;

    //RTMP header
	BufferRaw::Ptr bufferHeader = obtainBuffer();
	bufferHeader->setCapacity(sizeof(RtmpHeader));
	bufferHeader->setSize(sizeof(RtmpHeader));
	//Fill the RTMP header field by field; assigning whole integers can cause bus errors on ARM Android due to data alignment
	RtmpHeader *header = (RtmpHeader*) bufferHeader->data();
    header->flags = (iChunkId & 0x3f) | (0 << 6);
    header->typeId = ui8Type;
    set_be24(header->timeStamp, bExtStamp ? 0xFFFFFF : ui32TimeStamp);
    set_be24(header->bodySize, buf->size());
    set_le32(header->streamId, ui32StreamId);
    //Send the RTMP header
    onSendRawData(bufferHeader);

    //Extended timestamp field
	BufferRaw::Ptr bufferExtStamp;
    if (bExtStamp) {
        //Build the extended timestamp
		bufferExtStamp = obtainBuffer();
		bufferExtStamp->setCapacity(4);
		bufferExtStamp->setSize(4);
		set_be32(bufferExtStamp->data(), ui32TimeStamp);
	}

	//Build the one-byte flag that identifies the chunk id
	BufferRaw::Ptr bufferFlags = obtainBuffer();
	bufferFlags->setCapacity(1);
	bufferFlags->setSize(1);
	bufferFlags->data()[0] = (iChunkId & 0x3f) | (3 << 6);
    
    size_t offset = 0;
	uint32_t totalSize = sizeof(RtmpHeader);
    while (offset < buf->size()) {
        if (offset) {
            //Send the chunk id flag
            onSendRawData(bufferFlags);
            totalSize += 1;
        }
        if (bExtStamp) {
            //Extended timestamp
            onSendRawData(bufferExtStamp);
            totalSize += 4;
        }
        size_t chunk = min(_iChunkLenOut, buf->size() - offset);
        //Distribute the media payload; BufferPartial avoids a memory copy here
        onSendRawData(std::make_shared<BufferPartial>(buf,offset,chunk));
        totalSize += chunk;
        offset += chunk;
    }
    _ui32ByteSent += totalSize;
    if (_ui32WinSize > 0 && _ui32ByteSent - _ui32LastSent >= _ui32WinSize) {
        _ui32LastSent = _ui32ByteSent;
        sendAcknowledgement(_ui32ByteSent);
    }
}

//BufferPartial slices an RTMP packet into chunk-sized pieces while avoiding memory copies
class BufferPartial : public Buffer {
public:
    BufferPartial(const Buffer::Ptr &buffer,uint32_t offset,uint32_t size){
        _buffer = buffer;
        _data = buffer->data() + offset;
        _size = size;
    }

    ~BufferPartial(){}

    char *data() const override {
        return _data;
    }
    uint32_t size() const override{
        return _size;
    }
private:
    Buffer::Ptr _buffer;
    char *_data;
    uint32_t _size;
};

The same principle is applied when sending RTP packets to avoid memory copies.
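
The key point is that each slice holds a shared_ptr to its parent buffer, so the parent cannot be freed while any slice is still queued for sending. A standalone sketch of the idea (hypothetical Slice type, not the project's Buffer classes):

#include <algorithm>
#include <cstddef>
#include <iostream>
#include <memory>
#include <vector>

// Hypothetical slice type: points into a parent buffer and keeps it alive.
struct Slice {
    std::shared_ptr<std::vector<char>> parent; // shared ownership, no copy of the bytes
    const char *data;
    size_t size;
};

int main() {
    auto frame = std::make_shared<std::vector<char>>(4096, 'x');
    size_t chunkSize = 1400; // e.g. an RTMP chunk size

    std::vector<Slice> sendQueue;
    for (size_t off = 0; off < frame->size(); off += chunkSize) {
        size_t n = std::min(chunkSize, frame->size() - off);
        sendQueue.push_back({frame, frame->data() + off, n}); // zero-copy slice
    }
    std::cout << "queued " << sendQueue.size() << " slices" << std::endl;
    return 0;
}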

Using Object Recycling Pools

Memory allocation and deallocation are protected by a global mutex; too many new/delete operations not only reduce performance but also lead to memory fragmentation. ZLMediaKit uses recycling pools wherever possible to avoid these problems. The following snippet shows the recycling pool being used when building RTP packets:

RtpPacket::Ptr RtpInfo::makeRtp(TrackType type, const void* data, unsigned int len, bool mark, uint32_t uiStamp) {
    uint16_t ui16RtpLen = len + 12;
    uint32_t ts = htonl((_ui32SampleRate / 1000) * uiStamp);
    uint16_t sq = htons(_ui16Sequence);
    uint32_t sc = htonl(_ui32Ssrc);

   	//Obtain the RTP object from the recycling pool
    auto rtppkt = ResourcePoolHelper<RtpPacket>::obtainObj();
    unsigned char *pucRtp = rtppkt->payload;
    pucRtp[0] = '$';
    pucRtp[1] = _ui8Interleaved;
    pucRtp[2] = ui16RtpLen >> 8;
    pucRtp[3] = ui16RtpLen & 0x00FF;
    pucRtp[4] = 0x80;
    pucRtp[5] = (mark << 7) | _ui8PlayloadType;
    memcpy(&pucRtp[6], &sq, 2);
    memcpy(&pucRtp[8], &ts, 4);
    //ssrc
    memcpy(&pucRtp[12], &sc, 4);
    //payload
    memcpy(&pucRtp[16], data, len);

    rtppkt->PT = _ui8PlayloadType;
    rtppkt->interleaved = _ui8Interleaved;
    rtppkt->mark = mark;
    rtppkt->length = len + 16;
    rtppkt->sequence = _ui16Sequence;
    rtppkt->timeStamp = uiStamp;
    rtppkt->ssrc = _ui32Ssrc;
    rtppkt->type = type;
    rtppkt->offset = 16;
    _ui16Sequence++;
    _ui32TimeStamp = uiStamp;
    return rtppkt;
}
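
The recycling idea can be illustrated with a simplified pool (a sketch of the general technique, not the project's ResourcePool): objects are handed out as shared_ptr instances whose custom deleter returns the object to the pool instead of freeing it.

#include <memory>
#include <mutex>
#include <vector>

// Simplified recycling pool sketch; not the project's ResourcePool implementation.
template <class T>
class SimplePool : public std::enable_shared_from_this<SimplePool<T>> {
public:
    std::shared_ptr<T> obtain() {
        std::unique_ptr<T> obj;
        {
            std::lock_guard<std::mutex> lck(_mtx);
            if (!_free.empty()) {
                obj = std::move(_free.back());
                _free.pop_back();
            }
        }
        if (!obj) {
            obj.reset(new T()); // pool empty: allocate a fresh object
        }
        // The custom deleter puts the object back into the pool instead of deleting it.
        std::weak_ptr<SimplePool> weakSelf = this->shared_from_this();
        return std::shared_ptr<T>(obj.release(), [weakSelf](T *ptr) {
            auto strongSelf = weakSelf.lock();
            if (!strongSelf) {
                delete ptr; // the pool is already gone
                return;
            }
            std::lock_guard<std::mutex> lck(strongSelf->_mtx);
            strongSelf->_free.emplace_back(ptr);
        });
    }

private:
    std::mutex _mtx;
    std::vector<std::unique_ptr<T>> _free;
};

The pool itself must be owned by a std::shared_ptr (for example created with std::make_shared) so that shared_from_this works, and a recycled object should be reset to a clean state before it is reused.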

Setting Socket Flags

Enabling TCP_NODELAY improves server responsiveness, which matters for latency-sensitive services such as SSH. But for a streaming service, the data flow is continuous and relatively large, so turning TCP_NODELAY off can reduce the number of ACK packets and make fuller use of bandwidth.

Another flag that improves network throughput is MSG_MORE; with this flag the kernel buffers outgoing data until a certain amount has accumulated and then sends it as one packet. For a business scenario such as RTSP, MSG_MORE is particularly suitable: since RTP packets are generally small (less than an MTU), the number of packets can be greatly reduced.

During the handshake with a player, ZLMediaKit turns TCP_NODELAY on and MSG_MORE off; the aim is to reduce the latency of the handshake exchange, shorten connection setup, and speed up video startup. After the handshake succeeds, ZLMediaKit turns TCP_NODELAY off and MSG_MORE on, which reduces the number of packets and improves network utilization.
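
As a generic illustration of these two knobs on Linux (standard socket API, not ZLMediaKit's Socket wrapper):

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <sys/types.h>

// Toggle Nagle's algorithm: on = true disables coalescing (low latency),
// on = false lets the kernel merge small writes (fewer packets).
void setNoDelay(int fd, bool on) {
    int flag = on ? 1 : 0;
    setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &flag, sizeof(flag));
}

// MSG_MORE is a per-send() hint: the kernel may hold this data and merge it
// with the next send into a single segment.
ssize_t sendCoalesced(int fd, const void *data, size_t len) {
    return send(fd, data, len, MSG_MORE);
}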

Sending Data in Batches

In network programming everyone has used the send/sendto/write functions, but writev/sendmsg are used far less. ZLMediaKit uses sendmsg to transmit data in batches, so when the network is poor or the server load is high, the number of system calls (whose overhead is relatively large) drops significantly and program performance improves. The following is the code fragment:

int BufferList::send_l(int fd, int flags,bool udp) {
    int n;
    do {
        struct msghdr msg;
        msg.msg_name = NULL;
        msg.msg_namelen = 0;
        msg.msg_iov = &(_iovec[_iovec_off]);
        msg.msg_iovlen = _iovec.size() - _iovec_off;
        if(msg.msg_iovlen > IOV_MAX){
            msg.msg_iovlen = IOV_MAX;
        }
        msg.msg_control = NULL;
        msg.msg_controllen = 0;
        msg.msg_flags = flags;
        n = udp ? send_iovec(fd,&msg,flags) : sendmsg(fd,&msg,flags);
    } while (-1 == n && UV_EINTR == get_uv_error(true));

    if(n >= _remainSize){
        //Everything was written
        _iovec_off = _iovec.size();
        _remainSize = 0;
        return n;
    }

    if(n > 0){
        //Partially sent
        reOffset(n);
        return n;
    }

    //Not a single byte was sent
    return n;
}
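
For reference, the iovec array passed to this call can be gathered from a list of pending buffers roughly as follows (a generic sketch of scatter-gather sending, not the project's BufferList):

#include <string>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <vector>

// Generic sketch: gather several pending buffers into one sendmsg() call.
ssize_t sendAll(int fd, const std::vector<std::string> &buffers) {
    std::vector<iovec> iov(buffers.size());
    for (size_t i = 0; i < buffers.size(); ++i) {
        iov[i].iov_base = const_cast<char *>(buffers[i].data());
        iov[i].iov_len = buffers[i].size();
    }
    msghdr msg = {};
    msg.msg_iov = iov.data();
    msg.msg_iovlen = iov.size(); // one system call covers every buffer
    return sendmsg(fd, &msg, MSG_DONTWAIT);
}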

Batched Thread Switching

In a multi-threaded model, a streaming media server that distributes media data must switch threads. The first purpose of the switch is thread safety, preventing several threads from operating on the same object or resource simultaneously; the second is to take full advantage of multi-core compute power so that forwarding performance is not bottlenecked by a single thread. When forwarding media, ZLMediaKit also uses thread switching to distribute data across multiple threads. But thread switching is relatively expensive; too many switches will seriously hurt server performance.

Now assume a scenario: RTMP client A pushes a live stream to the server, the stream is hot, and 10K users are watching it at the same time. Do we then perform 10K thread switches before sending the data for every RTMP packet we distribute? Even though ZLMediaKit's thread switch is relatively lightweight, switching that frequently would be more than it can bear.

To handle this, ZLMediaKit performs thread switches in batches, minimizing the number of switches. If those 10K users are spread across a 32-core CPU, ZLMediaKit performs at most 32 thread switches per packet. This greatly reduces the number of switches while still using multiple threads to distribute the data, greatly improving network throughput. Below is the batched thread-switching code fragment:

void emitRead(const T &in){
        LOCK_GUARD(_mtx_map);
        for (auto &pr : _dispatcherMap) {
            auto second = pr.second;
            //Batched thread switch: one async task per poller
            pr.first->async([second,in](){
                second->emitRead(in);
            },false);
        }
    }

//After the thread switch, iterate over the readers
void emitRead(const T &in){
        for (auto it = _readerMap.begin() ; it != _readerMap.end() ;) {
            auto reader = it->second.lock();
            if(!reader){
                it = _readerMap.erase(it);
                --_readerSize;
                onSizeChanged();
                continue;
            }
            //Trigger the data-distribution callback
            reader->onRead(in);
            ++it;
        }
	}
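
In other words, readers are grouped by the poller (thread) that owns them, and one async task per poller fans the packet out to every reader on that thread. A simplified sketch of the grouping (hypothetical Poller and Reader types, not the project's RingBuffer dispatcher):

#include <functional>
#include <map>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-ins for an epoll thread and a stream reader.
struct Poller {
    void async(std::function<void()> task) {
        task(); // sketch: run inline; a real poller queues the task onto its own thread
    }
};
struct Reader {
    void onRead(const std::shared_ptr<std::string> &pkt) { /* consume the packet */ }
};

// Readers grouped by the poller (thread) that owns them.
std::map<Poller *, std::vector<std::shared_ptr<Reader>>> readersByPoller;

void distribute(const std::shared_ptr<std::string> &pkt) {
    // One thread switch per poller, no matter how many readers that poller hosts.
    for (auto &pr : readersByPoller) {
        auto readers = pr.second; // copy the reader list into the task
        pr.first->async([readers, pkt]() {
            for (auto &reader : readers) {
                reader->onRead(pkt); // runs on the reader's own thread
            }
        });
    }
}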

Using Rvalue References

ZLMediaKit also makes extensive use of rvalue references to avoid memory copies; we will not expand on that here.

Other Features

Optimizing Instant Playback of Newly Pushed Streams

Some application scenarios require the device to start pushing a stream and the APP to start watching it immediately. A traditional RTMP server has no optimization for this scenario: if the play request arrives before the pushed stream has been established, playback fails in the APP, which lowers the video-open success rate and makes for a very poor user experience.

ZLMediaKit makes a special optimization for this scenario; the implementation principle is as follows (a simplified code sketch appears after the steps below):

1. When a play request is received, immediately check whether the media source already exists. If it does, respond with playback success; otherwise go to step 2.

2. Listen for the media-source registration event and start a playback timeout timer, then return without responding to the player yet. The logic then proceeds to step 3 or step 4.

3. The media source gets registered: immediately respond to the player with playback success, delete the playback timeout timer, and remove the registration event listener.

4. The timeout timer fires: respond to the player with playback failure, delete the playback timeout timer, and remove the registration event listener.

With ZLMediaKit as the streaming media server, the device can start pushing the stream and the APP can request playback at the same time.
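
A simplified sketch of this wait-with-timeout logic (the helper functions are hypothetical stand-ins; the real implementation hooks ZLMediaKit's media registration broadcast and timer facilities):

#include <atomic>
#include <functional>
#include <memory>
#include <string>

// Hypothetical stand-in declarations for the media registry, registration
// broadcast and timer; not ZLMediaKit's actual API.
bool mediaSourceExists(const std::string &stream);
void onMediaRegistered(const std::string &stream, std::function<void()> cb);
void startTimer(int seconds, std::function<void()> cb);

void handlePlayRequest(const std::string &stream,
                       std::function<void(bool /*success*/)> replyToPlayer) {
    // Step 1: the source already exists, answer immediately.
    if (mediaSourceExists(stream)) {
        replyToPlayer(true);
        return;
    }
    // Step 2: defer the reply; whichever event fires first wins.
    auto replied = std::make_shared<std::atomic<bool>>(false);
    // Step 3: the push arrived before the timeout.
    onMediaRegistered(stream, [replied, replyToPlayer]() {
        if (!replied->exchange(true)) replyToPlayer(true);
    });
    // Step 4: the timeout fired first, so fail the play request.
    startTimer(10, [replied, replyToPlayer]() {
        if (!replied->exchange(true)) replyToPlayer(false);
    });
}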

Performance Comparison Test

ZLMediaKit has undergone some performance testing; see the report here: Benchmark

The tests show that under relatively low load, ZLMediaKit's single-thread performance is about 50% of SRS's, with a single thread able to support about 5K players. The main reason for this gap is that the benchmark runs over the local loopback network, where conditions are ideal, so sendmsg's batched sending has nothing to optimize; meanwhile SRS uses a merged-write feature (it caches roughly 300 milliseconds of data and then sends it in one go), which reduces the number of system calls. In a real network environment the single-thread gap between ZLMediaKit and SRS should be smaller, and the test report also shows that as the number of clients grows, ZLMediaKit's single-thread performance improves considerably.

Since ZLMediaKit supports multi-threading, it can take advantage of multi-core CPU performance, so on a multi-core server the CPU is no longer the bottleneck. To keep live latency low, we do not currently intend to add the merged-write feature.

Project Address

ZLMediaKit is already open source; the project address is: ZLMediaKit

QQ chat group

542509000
