Building Douyu's big data real-time computing platform from zero: experience and pits

Last Monday, Douyu TV, a live-streaming platform based in Wuhan, announced its Series C financing: a 1.5 billion yuan round led by Tencent, less than half a year after it closed a 100 million US dollar Series B.

What interests Xiaoxun more, though, is Douyu's big data architecture. For a company that rose in a little over two years, its data volume has leapt from zero to the petabyte level.

In March this year, the head of Douyu's big data team spoke at the first Wuhan Developer Summit hosted by Jianxun and shared the team's experience and the pitfalls they hit. Drawing on that talk and some additional material, Xiaoxun has put this post together as a reference for readers interested in big data.


[Photo: Wu Ruicheng]

About Wu Ruicheng: he joined Douyu in 2014 as the first member of its big data team, has seen Douyu grow from 100,000 to 10 million users, and built Douyu's real-time big data computing platform from scratch. Before joining Douyu, he spent three years at Taobao, working mainly on HBase.


[Screenshot: a typical Douyu live room]

This is a very typical Douyu live room, the streamer 55开's: the game is played well, the trash talk is even better, and that goes down especially well on Douyu. The dense text everyone sees scrolling across the screen is the danmu (bullet-screen comments), the signature scene of live video streaming. When a room gets really popular, gifts start appearing: a user sends the streamer a rocket, a shark rides in on the rocket gift, and a rocket like that is worth quite a lot.

The icons in the lower-right corner are gifts that users give the streamer: shark fins can be bought with top-ups, fish balls can be given for free. On the right is the contribution ranking of the big spenders (the more you contribute, the higher you rank), and the danmu area sits alongside it. That is what the content and the format look like, yet it is hugely popular now, and sometimes we simply cannot predict these phenomenon-level surges.

Main content to share

First, log retrieval: global log search. It will be expanded on later, using mainly Nginx and PHP logs as examples.

Second, a real-time CEP system, a KV-style processing system.

Third, real-time stream computing.

1. Log retrieval


[Figure: Douyu's current big data architecture]

This is a diagram of the current big data architecture. It was only pieced together recently, and the more we tidied it up, the more deeply we felt it. Look at the colorful blocks in this diagram; students who have seen a lot of slide decks are probably used to this kind of thing. A big data architecture diagram may well end up looking like this, but every block in it sits on countless pits we stepped in, and it became what it is now through lessons paid for in blood.

When I joined Douyu, I was the one person responsible for this whole area. Later, as the site's traffic grew, one person's throughput hit its ceiling, so we recruited the first batch of people. That first batch was trained up within the team: some Java developers joined the big data team. The small system at the very beginning grew bigger and bigger into the architecture we have now.

The bottom layer is the data source layer: Nginx and PHP logs. The company's technology stack is quite diverse, which becomes more and more painful to deal with; the unified ingestion layer is now Kafka, though not everything has been moved onto it yet.

The layer above handles data cleaning and format conversion, including some preliminary aggregation.

The layer above that includes archived data kept in MySQL and, the biggest piece, offline computing based on Hadoop and YARN. Spark was brought in last year, but its scope is still fairly small, mainly risk control and personalized recommendation.

In addition, the real-time (online) part is HBase, which I know well from past experience. Everyone points out that HBase has many alternatives, but for holding massive amounts of hot data at the first layer I think HBase's advantages are still very clear, so I have kept HBase in this role all along. For fast queries, Presto covers self-service querying.

The real-time computing on the right is mainly based on Storm, and this year's big goal is to bring in Spark as a focus. When weighing a new framework, a small company's capacity for secondary development or customization is weaker, so we mostly take short, fast paths: big companies such as BAT, JD.com, and Meituan have stepped on the pits before we go in, which lowers our cost a great deal. We will follow up on Spark; its community is so active that it cannot be ignored, now an order of magnitude more active than Hadoop's.

On the far right is Elastic, which was initially brought in as a search engine, but the more we use it, the better it feels; it really is a gem. More on that later.

Above that is the service data layer, consumed by the front-end website and the service layer, and then the data applications: dashboards, monitoring, personalized recommendation, user behavior analysis, the risk control system, the search engine, and others.

Platform monitoring is very important.


[Figure: the Lambda architecture]

This is the Lambda architecture: three layers, with a batch processing layer and a speed (acceleration) layer feeding the serving layer above. These three layers should cover most of the scenarios a big data team's architecture has to handle.

Real-time log retrieval

Before: grep + awk

Evolution: rsync + Hive UDF

Now: ELK

At the beginning we had only a few PHP instances; when something went wrong, I would grep and awk the logs. Then the scale grew and the number of machines and application instances shot up, so we used rsync plus a Hive UDF: logs were cut by time granularity, dragged over, and matched with Hive, forming a very rudimentary system.

Now we have ELK, a very comfortable system: it handles large volumes, scales out, and offers full-text retrieval, which makes it very convenient for the technical teams to locate problems. Those features cover the basic needs; if we can also add alerting on top of it, including volume alerts and text alerts, it will be even better. That is what we are doing now.


[Figure: real-time log retrieval architecture]

This is the architecture diagram of real-time log retrieval. Looking at the application scenarios: at the time Flume was used the most, but scenarios changed quickly and our business adjusted very rapidly. In the new scenarios we found two that Flume could not satisfy. One is the C++ scenario, where the volume is too large and the logs are written locally into per-instance folders; the other is that the Java agent is too resource-hungry, in both CPU and memory. We later felt something had to change here.

The initial plan: since we have a C++ team building services, they felt we could build our own wheel, and the wheel is fairly simple. Then I did a round of comparison and found that newer versions of Logstash come with a Beats component, implemented in Go.

The middle of the architecture diagram is dominated by the Elastic stack. The intermediate aggregation layer will be replaced in the near future, but for the existing scenarios that are already stable we keep the status quo, because stability is what matters to us at this stage.

Flume's memory channel will OOM when the volume gets large. The overflow can be spilled to disk instead, which raises the throughput Flume can absorb while keeping efficiency; that made Flume very stable, and it has stayed in use until now.

We still use Flume as the aggregation layer now, but we will switch to Kafka for it. In many scenarios some logs need to be consumed again later, or need pub-sub, which the current mode makes hard to implement, so Kafka is a must.

The log data lands in Elastic at the bottom of the diagram, with the UI built on Kibana. After Kibana 2.0 there are quite a few inconveniences, so I actually recommend doing secondary development: the cost is not high and it is fairly easy to get started, since every interface can go through the API, which makes customization easy. At the top of the diagram are the HDFS reports.
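Since everything in Elasticsearch goes through its REST API, here is a minimal hedged Java sketch of the kind of `_search` call a custom UI would make directly; the host, index name, and field names are hypothetical placeholders, not Douyu's actual setup.

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

// Minimal sketch: querying Elasticsearch's standard _search REST endpoint,
// the same API a custom dashboard built on top of ES data would call.
// Host, index name and field names are hypothetical placeholders.
public class EsSearchExample {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://es-host:9200/nginx-logs-2016.08.20/_search");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);

        // Full-text match on the log message field, newest entries first.
        String query = "{"
                + "\"query\": {\"match\": {\"message\": \"error\"}},"
                + "\"sort\": [{\"@timestamp\": {\"order\": \"desc\"}}],"
                + "\"size\": 20"
                + "}";
        try (OutputStream os = conn.getOutputStream()) {
            os.write(query.getBytes(StandardCharsets.UTF_8));
        }

        // Print the raw JSON response; a custom UI would render it instead.
        try (Scanner scanner = new Scanner(conn.getInputStream(), "UTF-8")) {
            scanner.useDelimiter("\\A");
            System.out.println(scanner.hasNext() ? scanner.next() : "");
        }
    }
}
```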

Flume

  • Selection

  • Channel

  • Monitoring

Let's talk about some of the pits we have stepped in.

First, the choice of Flume. At first I liked it because it was an Apache project and assumed it was stable. Skimming many companies' slide decks, I roughly estimated that Flume showed up far more often than other products, so I ran some stress tests and made a comparison, though not a very thorough one, and chose Flume. Today, for anything new or any change of component, we would do much more detailed stress testing.

For the channel, we went from memory to disk to the current setup, where the two schemes are mixed together, but it is particularly resource-hungry.

For Flume monitoring: sort out the monitoring first, then put the new piece of the stack into use.

ELK

  • ES vs Solr

  • ES plugins: kopf for cluster monitoring, head for index operations;

  • ES read-write separation

  • Independent small clusters solve slow queries;

  • Avoid overly large indexes

  • Avoid using Range queries in the hottest queries;

  • JVM heapsize setting;

  • CMS vs G1

On Elastic, we compared it with Solr. Set a purely community-run open source project against an open source product backed by a commercial team and the community activity and product iteration are not on the same order of magnitude. Elastic has also started to focus on user experience, which is one point Solr has not taken into account.

Beyond its original role as our traditional search engine for full-text search, a much bigger use of Elastic now is multi-dimensional self-service querying, and its horizontal scalability is very strong. Thanks to its data structures, it delivers very strong performance in scenarios such as ad-hoc multi-dimensional queries.

For ES plugins, we use kopf for cluster monitoring and head for index operations.

ES read-write separation. The ES cluster topology keeps growing, and with the default topology it may not cope with many scenarios once the volume rises. For example, without read-write separation a heavy query can easily overwhelm the nodes serving online writes, so it is advisable to dedicate nodes to reads.

For resource isolation, we run several small Elastic clusters, each serving its own function. Elastic is peer-to-peer and masterless, and being masterless has a problem: sometimes you cannot strongly control the behavior of certain nodes. When you need isolation, the most effective approach is to isolate directly by small cluster.

Avoid overly large indexes. Simply taking care not to index unnecessary fields solves most of this.

Avoid range queries in the hottest queries.

JVM heap size: we have always used 32 GB, and the HBase cluster is the same; even though the machines are highly specced, HBase still gets 32 GB.

For GC we use CMS; judging by production use and stress tests, G1's stability and user experience both look somewhat worse.

2. Real-time CEP system

  • Before: Redis

  • Evolution: HBase

  • Now: TSDB

At the beginning we did simple metric counting: other teams pushed data over to us, we did the counting with Redis and stored the result data back in Redis, which was fine for simple counting scenarios. Later the business scenarios got more complex and the product lines multiplied; a single Redis instance was clearly not enough, and scalability and data volume were thresholds Redis could not cross for us at the time, so we quite naturally moved to HBase.
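As a rough illustration of moving this kind of counting from Redis onto HBase, here is a minimal sketch using the standard HBase 1.x client's atomic counters; the table name, column family, and rowkey layout are hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Counting a metric with HBase atomic counters instead of Redis INCR.
// Table name, column family and rowkey layout are hypothetical.
public class MetricCounter {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("metric_counters"))) {

            // One row per metric per hour, e.g. "danmu_count:2016082018".
            byte[] rowKey = Bytes.toBytes("danmu_count:2016082018");

            // Atomic server-side increment; no read-modify-write on the client.
            long newValue = table.incrementColumnValue(
                    rowKey, Bytes.toBytes("c"), Bytes.toBytes("total"), 1L);
            System.out.println("current count = " + newValue);
        }
    }
}
```

Because the increment happens server-side, many writers can bump the same counter without the coordination a single Redis instance forces on you, and the table scales out with the cluster.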

There are two big things to watch when using HBase:

First, rowkey design: apart from the rowkey, HBase gives you no index to use.

Second, data compression: compressing historical data is critical. Sampling and archiving one or two metrics is easy, but doing it in a unified way that is still simple enough to pick up and use directly is harder. That is when we ran into OpenTSDB, a time-series storage solution.

We also tried InfluxDB at first, but it felt like once the pressure went up it could crash without warning, so we simply moved on to OpenTSDB. Dragging and dropping data into charts, built on OpenTSDB, can handle a very large volume.

The real performance test in this system is still HBase: if HBase is fine, OpenTSDB is fine, so we will keep going with this solution. On top of OpenTSDB we can customize very flexibly; it is itself built as a customization on HBase, including the rowkey design I just mentioned.

For data compression, each metric gets one row per hour, and OpenTSDB does that for us. If we have custom needs later we can build on it ourselves; that part is fairly simple, since the underlying HBase performance is not a problem. Looking further ahead, HBase keeps getting more general-purpose in many places; since raw performance is not the issue, the stalling problems should improve noticeably.


[Figure: CEP system architecture]

Coming back to the figure above: this is the CEP system; take a look at the diagram.

From data collection, the first parser goes into Kafka, then from Spark into HBase, and at that point we reach the business systems, including our monitoring system. There is a business flow here; for now you can simply think of it as: when certain metrics exceed a threshold, we treat it as a suspicious event that needs an alert. We are about to introduce a rules engine for this part, because the business changes too fast and release speed was holding us back; it is already in testing.

Further on there is storage of the results and then alert pushing, which also goes straight to HBase. Some pre-aggregated metrics that can be reused go into OpenTSDB. I did not redraw this diagram; it is borrowed directly from the Cloudera blog, and the architecture is exactly the same as our system.
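To make the "metric above a threshold becomes a suspicious event" step concrete, here is a minimal sketch of a threshold check; the metric names, thresholds, and alert handling are hypothetical stand-ins, not the rules engine mentioned above.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal threshold-rule check: if a metric's latest value exceeds its
// configured threshold, raise a suspicious event. Metric names and
// thresholds are made-up placeholders.
public class ThresholdRules {
    private final Map<String, Double> thresholds = new HashMap<>();

    public ThresholdRules() {
        thresholds.put("room.danmu_per_sec", 50000.0);
        thresholds.put("gift.rocket_per_min", 200.0);
    }

    /** Returns true when the value crosses the configured threshold. */
    public boolean isSuspicious(String metric, double value) {
        Double limit = thresholds.get(metric);
        return limit != null && value > limit;
    }

    public static void main(String[] args) {
        ThresholdRules rules = new ThresholdRules();
        if (rules.isSuspicious("room.danmu_per_sec", 82000.0)) {
            // In the real pipeline this would push an alert and write to HBase.
            System.out.println("ALERT: room.danmu_per_sec exceeded threshold");
        }
    }
}
```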

  • About OpenTSDB:

  • Periodic IO fluctuation: (1) disable OpenTSDB's compaction mechanism; (2) throttle the speed of data compaction

With OpenTSDB, business metrics are very flexible. We now have things like CPU metrics being emitted and collected, all the metrics converge in one place, and the granularity is at the second level. Even with a large metric volume, fine time granularity, and an ever-growing number of service machines, we have not hit a bottleneck yet.

  • About HBase:

  • Rowkey design is the key;

  • Not suitable for multi-dimensional indexes, transactions, or extremely high stability requirements;

  • Watch out for write hotspots;

  • writebuffer, WAL, autoflush;

  • Disable compact/split, trigger GC manually

On using HBase: more and more companies use HBase now. Taobao already had it in large-scale production in 2011. HBase is stable, very stable from 0.96 on; 1.0 brought some changes, and HBase from 1.0 onward is well worth using.

You could write a whole book about rowkey design; here is only a brief note. HBase has no secondary index, so the rowkey is critical: we locate data through the rowkey, and the more precisely a rowkey can pinpoint the data, the more efficient the query. Looking at your business scenarios and usage with this in mind, you can make the corresponding optimizations and gains.

Scenarios HBase is not suited to include multi-dimensional indexing, the need for transactions, and extremely high stability requirements.

Watch out for write hotspots. With the default region split scheme, if write pressure is high after going live you will generally hit write hotspots, so consider pre-splitting regions. Then, under write pressure, consider writebuffer, WAL, and autoflush. If the write requirements are high and the data consistency requirements are also high, it gets awkward: you have to trade off write performance against data consistency. As soon as you tune or turn off the three parameters below, you give up some reliability and take on that risk; consider this a warning up front.

For log-type tables, consider disabling compaction and triggering GC manually.
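Below is a hedged Java sketch of the pre-splitting and client-side write tuning discussed above, using the standard HBase 1.x client; the table name, salting scheme, and split points are hypothetical, and skipping the WAL is exactly the durability trade-off warned about earlier.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

// Pre-split regions to avoid write hotspots, then batch writes through a
// BufferedMutator (the 1.x replacement for autoflush/writebuffer tuning).
// Table name, salt prefixes and column layout are hypothetical.
public class HotspotFriendlyWriter {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        TableName table = TableName.valueOf("barrage_events");

        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {

            // Pre-split on a one-character salt prefix so writes spread across regions.
            byte[][] splits = {Bytes.toBytes("1"), Bytes.toBytes("2"), Bytes.toBytes("3")};
            HTableDescriptor desc = new HTableDescriptor(table);
            desc.addFamily(new HColumnDescriptor("d"));
            if (!admin.tableExists(table)) {
                admin.createTable(desc, splits);
            }

            // BufferedMutator batches puts client-side (bigger write buffer,
            // fewer RPCs) instead of flushing on every put.
            try (BufferedMutator mutator = conn.getBufferedMutator(table)) {
                String roomId = "90001";
                long ts = System.currentTimeMillis();
                // Salted rowkey: salt + roomId + timestamp.
                byte[] rowKey = Bytes.toBytes((roomId.hashCode() & 3) + ":" + roomId + ":" + ts);

                Put put = new Put(rowKey);
                put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("msg"), Bytes.toBytes("666"));
                // Trades durability for write throughput -- only if you accept the risk.
                put.setDurability(Durability.SKIP_WAL);
                mutator.mutate(put);
            } // close() flushes the remaining buffered mutations
        }
    }
}
```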


[Figure: OpenTSDB table design, from the official documentation]

OpenTSDB's table design: the metadata (UID) table and the data table. This is the official diagram, and it explains things very thoroughly. Look at how, with many dimensions and a large data volume, a system built on OpenTSDB stays efficient: through its rowkey scheme and, as in the figure on the right, by compacting rows by time granularity. I think these two features are what guarantee its performance.

Those are the two points most closely tied to OpenTSDB.
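As an illustration of the rowkey scheme the figure describes, the sketch below assembles a rowkey the way the OpenTSDB documentation lays it out: a metric UID, a base timestamp aligned to the hour, then tag key/value UID pairs, with the in-hour offset going into the column qualifier. The UID values here are made up.

```java
import java.nio.ByteBuffer;

// Illustration of OpenTSDB's documented rowkey layout:
//   [metric UID][base timestamp aligned to the hour][tagk UID][tagv UID]...
// All data points of one metric/tag combination within an hour share a row,
// which is what enables the row-level compaction shown in the figure.
// UID width (3 bytes) follows the default; the UID values are made up.
public class TsdbRowKeySketch {
    static byte[] uid(int value) {               // 3-byte UID, big-endian
        return new byte[] {(byte) (value >> 16), (byte) (value >> 8), (byte) value};
    }

    public static void main(String[] args) {
        long nowSec = System.currentTimeMillis() / 1000L;
        int baseHour = (int) (nowSec - (nowSec % 3600));   // align to the hour

        ByteBuffer rowKey = ByteBuffer.allocate(3 + 4 + 3 + 3);
        rowKey.put(uid(1));            // metric UID, e.g. "sys.cpu.user"
        rowKey.putInt(baseHour);       // 4-byte base timestamp
        rowKey.put(uid(2));            // tagk UID, e.g. "host"
        rowKey.put(uid(3));            // tagv UID, e.g. "web01"

        // The column qualifier then stores the offset (seconds) within the hour.
        int offsetInHour = (int) (nowSec - baseHour);
        System.out.println("rowkey bytes = " + rowKey.position()
                + ", offset in hour = " + offsetInHour + "s");
    }
}
```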

3. Real-time stream computing

Douyu's use of this is fairly large in scale now; next to the big companies it may look small, but I still want to share the journey from 0 to 1, and in the third part from 1 to 1.1.

  • Real-time stream computing

  • Before: guesswork

  • Evolution: Redis

  • Now: Storm + Spark Streaming

Stream computing. For example, when we ran a special event or, as I mentioned at the start, a League of Legends final: there is traffic online, but how much? We could only judge by whether things felt laggy, a purely subjective assessment. The backend server metrics lagged quite a bit, so at the beginning it was guessing, going by feel: it feels like we need more machines, we should shift some traffic or load to another set of machines. By feel.

The same went for special events, say a new product launch by Smartisan, Meizu, or LeTV: sometimes the traffic is not as big as you imagine, sometimes it is enormous, and we had no way to prepare contingency plans. So, pressed by that, we gradually built our first solution: real-time counting with Redis.

With more users come all kinds of birds, and all kinds of freeloaders, so we added risk control. Then there is personalized recommendation: as the user base grows, the user groups become more diverse, so we looked at personalized recommendation, a thousand faces for a thousand users. That was the second-stage requirement, and it is what the current Storm plus Spark Streaming setup runs.


[Figure: real-time stream computing data flow]

This is the data flow architecture. At the very beginning we only had the topmost part: web and app traffic going into Nginx plus Lua, the OpenResty project that the Smartisan 2 launch event donated to. It puts together two of the fastest systems around, Nginx and Lua, and the combined performance is formidable. Built on Lua and Redis, it performs well, is easy to use and stable, and barely consumes resources.

At the Kafka layer, other data comes in as well, for example user behavior data, plus relational tables from MySQL; we have no other relational store. Out of Kafka the data goes into Storm, our largest production deployment; the data products I just mentioned are all built on Storm. I will briefly cover some Storm pits later.

Spark's throughput is very good; the two different data models mean the two engines suit different business scenarios. Behind that sits offline computing, and in the middle there is a data serving layer: real-time computation feeds the serving layer, writes into the intermediate offline layer, and another batch of data flows to the front-end application layer, real-time monitoring and other applications.

  • About data collection

  • Before: piling on PHP

  • Now: OpenResty

On data collection, as just mentioned, especially user behavior data and some service-layer services: we started by piling on PHP, which ate too many resources, and then we discovered OpenResty.

Then there is Storm. Let me put this up first: Storm tuning is essentially based on these diagrams of logical objects.


[Figure: Storm topology logical objects]

Newer versions of Storm have stripped out the dependency on ZooKeeper. All of our tuning adjusts the parameters of these objects, for instance raising parallelism; improving timeliness is all based on this diagram.

In this diagram, the question is how the data flows through the pipeline as fast as possible, in and out. That is the original intent of real-time stream computing and, in the end, the solution: keep optimizing. For example, coming out of Kafka or Redis into Storm at the first stage, the simpler and faster we can pull messages in, the better; once they are in, the faster we finish processing and counting and push the data onward, the lower the pressure and the higher the processing timeliness and throughput.

When we optimize, we analyze whether there is a backlog in bolt1 or bolt2 and in which piece of logic it sits, and then consider increasing parallelism or simplifying that logic, so that the data flows from the first stage to the second to the third and out as quickly as possible. That is our whole optimization approach.

From bolt1 and bolt2 to bolt3, one thing worth sharing: when optimizing Storm we often overlook that the external resources Storm depends on can become the bottleneck. If we cannot push the data out or land it, the backlog in that downstream layer directly limits how far our optimization can go.

We were writing to Redis at the end. Redis is fast, and a single Storm topology was fine, but even after hashing across Redis to spread the load it still could not keep up, so we eventually replaced Redis.

  • On Storm optimization

  • Match the number of spouts to the number of partitions of the Kafka topic

  • Use execute latency to find each component's processing cost

  • Keep spout nextTuple as simple as possible

  • Mind external resources when raising a Storm topology's performance

That is our overall approach to Storm optimization; it is fairly simple, just a few big items: match the spout count to the number of partitions of the Kafka topic, and monitor the execute latency of every component so we notice in time which components need optimizing.
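Here is a hedged sketch of that parallelism-matching idea with the standard Storm 1.x TopologyBuilder API; the partition count, spout, and bolt logic are hypothetical stand-ins for the real Kafka spout and business bolts.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;
import org.apache.storm.utils.Utils;

// Assumes Storm 1.x. KAFKA_PARTITIONS stands in for "one spout task per
// Kafka partition"; in production the spout would be a KafkaSpout.
public class BarrageCountTopology {

    static final int KAFKA_PARTITIONS = 4;   // hypothetical partition count

    // Stand-in spout: nextTuple stays trivial -- it only emits, no heavy work.
    public static class RoomSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;
        private final Random random = new Random();

        @Override
        public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
            this.collector = collector;
        }

        @Override
        public void nextTuple() {
            Utils.sleep(10);   // avoid busy-spinning in this demo spout
            collector.emit(new Values("room-" + random.nextInt(100)));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("roomId"));
        }
    }

    // Counting bolt: keep per-bolt logic small so backlogs are easy to locate.
    public static class CountBolt extends BaseBasicBolt {
        private transient Map<String, Long> counts;

        @Override
        public void prepare(Map stormConf, TopologyContext context) {
            counts = new HashMap<>();
        }

        @Override
        public void execute(Tuple input, BasicOutputCollector collector) {
            String room = input.getStringByField("roomId");
            counts.merge(room, 1L, Long::sum);
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // terminal bolt, no output stream
        }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        // Spout parallelism matches the topic's partition count.
        builder.setSpout("barrage-spout", new RoomSpout(), KAFKA_PARTITIONS);
        // Bolt parallelism is tuned separately, based on observed execute latency.
        builder.setBolt("room-counter", new CountBolt(), 2 * KAFKA_PARTITIONS)
               .fieldsGrouping("barrage-spout", new Fields("roomId"));

        Config conf = new Config();
        conf.setNumWorkers(2);
        conf.setMaxSpoutPending(1000);   // caps un-acked tuples (with acking enabled)

        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("barrage-count", conf, builder.createTopology());
    }
}
```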

Not long after we went live with Storm we also had Spark Streaming, used in monitoring scenarios; that is the big direction for 2016.

  • About Spark Streaming

  • Set a reasonable batch interval (batchDuration)

  • Cache data that is used frequently

  • Cluster task parallelism

  • Use Kryo serialization

These are some simple lessons from using streaming, and pits we stepped in: the batch interval, caching frequently used data, cluster task parallelism, and Kryo serialization.
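A hedged sketch of those four settings with the standard Spark Streaming Java API; the 10-second batch interval, the socket source standing in for Kafka, and the parallelism value are illustrative assumptions, not the production configuration.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.StorageLevels;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

// Spark Streaming settings from the list above: batch interval, caching,
// parallelism and Kryo serialization. The socket source stands in for the
// real Kafka input; host, port and values are placeholders.
public class StreamingTuningSketch {
    public static void main(String[] args) throws Exception {
        SparkConf conf = new SparkConf()
                .setAppName("streaming-tuning-sketch")
                .setMaster("local[2]")   // local run for the sketch only
                .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
                .set("spark.default.parallelism", "48");   // cluster task parallelism

        // Batch interval: pick it so each batch finishes before the next one starts.
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));

        // Socket source as a stand-in for the real Kafka input.
        JavaDStream<String> lines = jssc.socketTextStream("localhost", 9999);

        // Cache the stream that several downstream computations reuse.
        lines.persist(StorageLevels.MEMORY_AND_DISK_SER);

        lines.count().print();
        lines.filter(l -> l.contains("error")).count().print();

        jssc.start();
        jssc.awaitTermination();
    }
}
```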

These are the biggest pits we stepped in; let me stress them at the end.

The biggest pits

  • Monitoring

  • Security

  • Headroom

The first big pit is monitoring.

We get a lot of phenomenon-level traffic: a million users can pour into a single live room within one to ten seconds, and if that room sits on the same server as other rooms, everything lags immediately. Good monitoring can provide a lot of alerts and early warnings, including monitoring of business metrics; monitoring is extremely important.

A big piece of this year's work is the unified monitoring platform, and it is where our main development resources are going right now. Our front end is the website and our back end is C++ services; with a polyglot stack, when a problem appears it is hard to track down and pin to a location, and everyone's first instinct is to pass the buck, so we need a unified monitoring platform.

Second, security.

At the beginning we were far too loose. The first thing we did was network isolation: the cluster was isolated at the network level for the first time. Then the team kept growing, since I cannot do everything alone or cover so many business scenarios, and more and more people use the platform, including other teams and business analysts doing data analysis against the production environment. Security really matters here.

Third, keep a certain amount of headroom.

Estimating the business, raising the request, and getting the machines: that whole cycle takes one or two months. Small companies should not skimp on this cost; reserve at least 20% headroom.

To do

An exploratory data mart, the recommendation system, and the risk control system: these are our three biggest goals for this year.
