How to improve glog performance by 10 times

Summary: by optimizing the glog source code, performance is improved by a factor of 10.

Background

Recently I have been working on performance optimization for glog. I stress-tested the C++ version glog-0.3.4: 12 threads writing 133-byte log entries in a loop, 1.5 GB of data in total. The baseline run took 175s, roughly 8-9 MB/s of throughput.
Based on this test I performed a series of performance optimizations on glog. After optimization the same run took 16s, 10 times the performance of the native version of glog.
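A rough sketch of the shape of this stress test (not the author's actual harness; the thread count, entry size, and total volume come from the description above, everything else is illustrative):

    #include <glog/logging.h>
    #include <string>
    #include <thread>
    #include <vector>

    int main(int argc, char* argv[]) {
      google::InitGoogleLogging(argv[0]);

      const int kThreads = 12;                  // writer threads
      const size_t kEntryBytes = 133;           // per-entry payload size
      const size_t kTotalBytes = 1536UL << 20;  // ~1.5 GB of log data overall
      const size_t kPerThread = kTotalBytes / kEntryBytes / kThreads;

      // LOG(INFO) prepends its own header, so on-disk lines are somewhat larger.
      const std::string payload(kEntryBytes, 'x');

      std::vector<std::thread> writers;
      for (int t = 0; t < kThreads; ++t) {
        writers.emplace_back([&] {
          for (size_t i = 0; i < kPerThread; ++i) LOG(INFO) << payload;
        });
      }
      for (auto& th : writers) th.join();
      google::ShutdownGoogleLogging();
      return 0;
    }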

Optimization process

Remove the localtime function call

Looking at the glog source code, the functions localtime and localtime_r are used when obtaining the date. Both call __tz_convert, which takes the tzset_lock global lock, so a kernel-level futex is involved every time the time is obtained. The first optimization step is therefore to remove glibc's localtime, use gettimeofday to obtain the seconds and time zone, and compute the calendar date with pure CPU arithmetic. The only slightly tricky part of that calculation is handling leap years. After replacing this function, the run time dropped from 175s to 46s, an instant 4-5x improvement.
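A minimal sketch of the idea, not glog's actual patch: take the wall-clock seconds from gettimeofday() (a vDSO call on Linux, no futex), cache the UTC offset once via tzset(), and turn days-since-epoch into a calendar date with pure arithmetic. The date math here is Howard Hinnant's civil-from-days algorithm, which covers the leap-year cases mentioned above; all names are illustrative.

    #include <sys/time.h>
    #include <cstdint>
    #include <ctime>

    struct Date { int year, month, day; };

    // Howard Hinnant's civil-from-days: days since 1970-01-01 -> year/month/day.
    static Date CivilFromDays(int64_t z) {
      z += 719468;
      const int64_t era = (z >= 0 ? z : z - 146096) / 146097;
      const unsigned doe = static_cast<unsigned>(z - era * 146097);        // [0, 146096]
      const unsigned yoe = (doe - doe / 1460 + doe / 36524 - doe / 146096) / 365;
      const unsigned doy = doe - (365 * yoe + yoe / 4 - yoe / 100);        // [0, 365]
      const unsigned mp  = (5 * doy + 2) / 153;                            // [0, 11]
      const unsigned d   = doy - (153 * mp + 2) / 5 + 1;                   // [1, 31]
      const unsigned m   = mp < 10 ? mp + 3 : mp - 9;                      // [1, 12]
      const int64_t  y   = static_cast<int64_t>(yoe) + era * 400 + (m <= 2);
      return {static_cast<int>(y), static_cast<int>(m), static_cast<int>(d)};
    }

    // Cache the UTC offset once; assumes it does not change while running.
    static long UtcOffsetSeconds() {
      static const long offset = [] {
        tzset();           // one locked tz lookup at startup, none afterwards
        return -timezone;  // glibc global: seconds west of UTC, so negate
      }();
      return offset;
    }

    Date LocalDateNow() {
      timeval tv;
      gettimeofday(&tv, nullptr);  // no kernel lock on the hot path
      const int64_t local_sec = static_cast<int64_t>(tv.tv_sec) + UtcOffsetSeconds();
      return CivilFromDays(local_sec / 86400);
    }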

Reduce lock granularity

Looking at the source code of glog again, glog performs multi-threaded synchronous write operations. The simplified flow is lock(); dosomething(); fwrite(); unlock();. Since fwrite is itself thread-safe, narrowing the lock granularity means changing this to lock(); dosomething(); unlock(); fwrite();. Other shared variables such as the file name are easy to handle; the hard part is that the fd is replaced during log rotation while fwrite() still uses it. I handled this with pointer management and reference counting: when rotating the file, current_fd_ is assigned to old_fd_ rather than being deleted or fclose'd directly. The simplified flow becomes lock(); dosomething(); if (rotating) old_fd_ = current_fd_; current_fd_.incr(); unlock(); fwrite(); current_fd_.decr();. Only when old_fd_'s reference count reaches 0 is the fd pointer really fclose'd and deleted. After this optimization, the stress test takes 30s.
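A minimal sketch of this reference-counting scheme (RefCountedFile, Write, and Rotate are illustrative names, not glog's): a writer pins the current handle while still inside the critical section, so rotation can swap the pointer without closing a file that another thread is still fwrite()-ing to.

    #include <atomic>
    #include <cstdio>
    #include <mutex>

    class RefCountedFile {
     public:
      explicit RefCountedFile(FILE* f) : file_(f), refs_(1) {}  // creator holds one ref
      FILE* get() const { return file_; }
      void Ref() { refs_.fetch_add(1, std::memory_order_relaxed); }
      void Unref() {
        if (refs_.fetch_sub(1, std::memory_order_acq_rel) == 1) {
          fclose(file_);  // the last user really closes the rotated-out file
          delete this;
        }
      }
     private:
      FILE* file_;
      std::atomic<int> refs_;
    };

    std::mutex mu;
    RefCountedFile* current_file_ = nullptr;  // set up at log-file creation

    void Write(const char* buf, size_t len) {
      RefCountedFile* f;
      {
        std::lock_guard<std::mutex> lock(mu);
        // ... format the message, decide whether to rotate ...
        f = current_file_;
        f->Ref();  // pin the handle while still holding the lock
      }
      fwrite(buf, 1, len, f->get());  // thread-safe, now outside the lock
      f->Unref();
    }

    void Rotate(FILE* new_file) {
      std::lock_guard<std::mutex> lock(mu);
      RefCountedFile* old = current_file_;
      current_file_ = new RefCountedFile(new_file);
      old->Unref();  // drop the creator ref; closes once in-flight writers finish
    }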

Introduce a lock-free queue for asynchronous IO

After the second optimization, lock hot spots are already rare and performance has improved considerably, enough to meet OCS's requirements. But in this multi-threaded, synchronous, blocking-write IO model, once IO hangs, every worker thread gets stuck. Look at the __IO_fwrite function: it takes __IO_acquire_lock() before writing and releases it afterwards.
To avoid all threads getting stuck, the multi-threaded synchronous blocking writes have to be turned into single-threaded asynchronous IO, while avoiding new locks that would cost performance. So a lock-free queue is introduced, with O(1) complexity per operation. The structure is shown in the figure:
[Figure: each producer thread has its own lock-free queue, drained by a single consumer IO thread]
Each producer thread has its own lock-free queue; producer threads handle log serialization and similar work, while the whole glog instance has a single consumer thread that handles only the real IO requests. The lock-free queue is implemented as a ring buffer, and tcmalloc is introduced for memory management. The consumer thread may still hang, but because the lock-free queue uses CAS, memory does not grow without bound when the queue is full: the producer retries a few times and then gives up on the operation, avoiding a memory blow-up. After this change the run takes 33s.
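A minimal sketch of the bounded ring-buffer queue (illustrative names; simplified to the one-producer-per-queue case, so plain acquire/release ordering stands in for the CAS loop a shared queue would need). The key property from the text: a full queue makes the producer retry a few times and then drop the entry rather than grow memory.

    #include <atomic>
    #include <cstddef>
    #include <string>

    template <size_t N>  // ring capacity
    class SpscLogQueue {
      static_assert((N & (N - 1)) == 0, "N must be a power of two");
     public:
      bool TryPush(std::string&& msg) {
        const size_t head = head_.load(std::memory_order_relaxed);
        if (head - tail_.load(std::memory_order_acquire) == N) return false;  // full
        slots_[head & (N - 1)] = std::move(msg);
        head_.store(head + 1, std::memory_order_release);
        return true;
      }
      bool TryPop(std::string* out) {
        const size_t tail = tail_.load(std::memory_order_relaxed);
        if (tail == head_.load(std::memory_order_acquire)) return false;     // empty
        *out = std::move(slots_[tail & (N - 1)]);
        tail_.store(tail + 1, std::memory_order_release);
        return true;
      }
     private:
      std::string slots_[N];
      std::atomic<size_t> head_{0};  // advanced only by the producer
      std::atomic<size_t> tail_{0};  // advanced only by the consumer
    };

    // Producer side: retry a few times, then give up instead of growing memory.
    bool Enqueue(SpscLogQueue<1024>& q, std::string msg) {
      for (int attempt = 0; attempt < 4; ++attempt) {
        if (q.TryPush(std::move(msg))) return true;  // msg untouched on failure
      }
      return false;  // queue stayed full; this log entry is dropped
    }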

Small detail optimizations

On Linux, glog uses pthread_rw_lock by default. With the lock granularity already reduced in step two, a kernel-level read-write lock is no longer needed, so the rwlock is replaced with a user-space spinlock. __GI_fwrite still shows up as something of a hotspot, so queued writes are merged to reduce the number of write operations, combined with a timeout mechanism so that buffered log entries still reach disk promptly. The optimizations, in summary (a sketch follows the list):

  • Merge queued writes forward (coalesce adjacent entries into one write)
  • Replace glog's default read-write lock and mutex with a spinlock
  • Tune the buffer size of a single message
  • Set a file buffer for fwrite
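A minimal sketch of these detail optimizations (illustrative names, not glog's code): a user-space spinlock to replace the default pthread locks, plus a consumer loop that gives fwrite a large file buffer via setvbuf, forward-merges everything queued into a single fwrite, and flushes when idle so cached entries still land on disk.

    #include <atomic>
    #include <cstdio>
    #include <string>

    class SpinLock {
     public:
      void Lock() {
        // Busy-wait in user space; no futex, cheap for short critical sections.
        while (flag_.test_and_set(std::memory_order_acquire)) {}
      }
      void Unlock() { flag_.clear(std::memory_order_release); }
     private:
      std::atomic_flag flag_ = ATOMIC_FLAG_INIT;
    };

    // Queue is any queue offering TryPop(std::string*), e.g. the ring buffer
    // sketched earlier.
    template <typename Queue>
    void ConsumeLoop(Queue& q, FILE* f, char* io_buf, size_t io_buf_len) {
      setvbuf(f, io_buf, _IOFBF, io_buf_len);  // large, fully buffered file buffer
      std::string batch, msg;
      for (;;) {
        batch.clear();
        while (q.TryPop(&msg)) batch += msg;  // forward-merge queued writes
        if (!batch.empty()) {
          fwrite(batch.data(), 1, batch.size(), f);
        } else {
          fflush(f);  // idle/timeout path: don't let buffered logs linger
          // (a real loop would sleep or wait on a condition here)
        }
      }
    }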

With these optimizations done, the run time comes down to 16s.

Use cases

The optimized glog version is suitable for products that need high log throughput, such as OCS, a distributed system with high concurrency and high throughput.

High-performance logging system summary

From the optimizations above, the characteristics of a high-performance logging system can be summarized:

  • Use asynchronous IO to achieve high-concurrency log throughput: decouple the logging thread from the worker threads, so workers only do work such as serialization and the logging thread does only IO. This prevents the main path from blocking and the service from becoming completely unavailable when the disk fills up or some other anomaly occurs; this deserves attention in any high-concurrency system.
  • Other detail points:
    • Do not use localtime to obtain the date. A micro-benchmark of localtime versus gettimeofday shows gettimeofday is about 20 times faster.
    • Choose a lock-free queue that can retry and then abandon the operation, to avoid runaway memory growth.
    • Use memory-pool management, e.g. tcmalloc.
    • Reference-count key pointers such as the fd to avoid coarse-grained locks.
