Youzan: Exploration and Practice of Hundred-Million-Scale Order Synchronization

The machine does not learn 2019-05-19 13:54:15

I. Introduction

Youzan is a SaaS company that provides services to merchants. As more and more merchants use Youzan, demand for search and detail queries keeps growing, and so does the number of scenarios that need them. Earlier articles on the evolution of the order management architecture and on the AKF architecture already touched on this. Because queries are served from different NoSQL stores, how to synchronize data into these near-real-time storage systems becomes an interesting problem.

1.1 Current state of synchronization

The current state of Youzan order synchronization and its business flow is shown in the figure. An ES + HBase (Tip-1) architecture serves search and detail queries, canal (Tip-2) writes database changes to MQ, and a synchronization system then handles the data synchronization itself. The problems Youzan order synchronization has faced, and the solutions adopted, are described below.

[Figure: current order synchronization architecture]

II. Synchronization

2.1 Basic synchronization: single table

2.1.1 The out-of-order problem

Single-table synchronization is shown in the figure:

[Figure: single-table synchronization]

In a business scenario, at any link in this chain, two messages for the same primary key may arrive concurrently. If the NoSQL store does no version control, out-of-order messages can cause problems. Conversely, if the binlog is parsed sequentially and every executed SQL statement is assigned a sequence number, SeqNo, that is guaranteed to be ordered (Tip-3), then optimistic locking in the NoSQL store solves the problem.
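
The optimistic-locking idea can be sketched as follows. This is a minimal in-memory illustration, not Youzan's implementation: `VersionedStore` and `put_if_newer` are hypothetical names standing in for whatever conditional-write primitive the NoSQL store offers.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Tuple


@dataclass
class VersionedStore:
    """In-memory stand-in for a NoSQL store with optimistic locking (hypothetical)."""
    rows: Dict[str, Tuple[int, Any]] = field(default_factory=dict)

    def put_if_newer(self, key: str, seq_no: int, value: Any) -> bool:
        """Apply the write only if seq_no is higher than the stored one."""
        current = self.rows.get(key)
        if current is not None and current[0] >= seq_no:
            return False  # stale or duplicate message: drop it
        self.rows[key] = (seq_no, value)
        return True


store = VersionedStore()
# Two changes to the same order arrive in the wrong order.
applied_new = store.put_if_newer("order-1", seq_no=2, value={"state": "PAID"})
applied_old = store.put_if_newer("order-1", seq_no=1, value={"state": "CREATED"})
```

The later change (SeqNo 2) is kept and the delayed earlier change (SeqNo 1) is rejected, so the store converges to the newest state regardless of arrival order.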

2.1.2 HBase synchronization

HBase synchronization is relatively simple. HBase keeps a timestamp that versions each qualifier, so as long as the ordered SeqNo described above is passed in as that timestamp, the data read from each field is eventually consistent.

2.1.3 ES synchronization

For the single-table case, ES synchronization can write with the index operation and use external versioning, that is, a version number controlled from outside ES. The SeqNo above serves as that version, so optimistic locking solves the problem here as well.
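
As a rough sketch, an index request with an external version can be built like this. The `version` and `version_type=external` query parameters are part of the Elasticsearch index API; the index name `orders`, the document id, and the typeless `_doc` endpoint are assumptions made for illustration, and no request is actually sent.

```python
import json
from urllib.parse import quote, urlencode


def build_index_request(index: str, doc_id: str, seq_no: int, doc: dict):
    """Return (method, path, body) for an externally versioned index request."""
    params = urlencode({"version": seq_no, "version_type": "external"})
    path = f"/{quote(index)}/_doc/{quote(doc_id, safe='')}?{params}"
    return "PUT", path, json.dumps(doc, sort_keys=True)


# Hypothetical order document; the SeqNo is passed as the external version.
method, path, body = build_index_request(
    "orders", "E2019", 15582740550001, {"state": "PAID"}
)
```

With external versioning, ES accepts the write only when the supplied version is greater than the stored one and otherwise answers with a version conflict, which is the optimistic lock the text relies on.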

2.1.4 Synchronization flow diagram

[Figure: single-table synchronization flow]

2.2 Advanced synchronization: multiple tables

2.2.1 The out-of-order problem

Synchronizing multiple associated tables builds on single-table synchronization, as shown in the figure:

[Figure: multi-table synchronization]

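Suppose two tables in the database trigger updates (whether or not they live on the same instance; if they do not, it is even worse, because SeqNo is then not necessarily ordered). Assume the SeqNo of the binlog entry generated by t1 is smaller than that of t2. If the t1 message is consumed later than t2 because of network jitter somewhere in the link or for any other reason, so that t2's SeqNo is written to the NoSQL store first, then t1's data can no longer be written to the NoSQL store.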
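
The failure can be reproduced with a toy model of a store that keeps a single version per document. `DocStore`, the field names, and the SeqNo values are made up for illustration.

```python
class DocStore:
    """Toy store with one version per document (hypothetical)."""

    def __init__(self):
        self.version = 0
        self.doc = {}

    def write(self, seq_no, fields):
        """Accept the write only if seq_no is higher than the document version."""
        if seq_no <= self.version:
            return False
        self.version = seq_no
        self.doc.update(fields)
        return True


es_doc = DocStore()
t1_msg = (101, {"order.state": "PAID"})    # change from table t1
t2_msg = (102, {"item.title": "T-shirt"})  # change from table t2

ok_t2 = es_doc.write(*t2_msg)  # t2 overtakes t1 and is written first
ok_t1 = es_doc.write(*t1_msg)  # t1 now looks stale and is rejected
```

Although the two messages touch unrelated fields, the shared document-level version makes the t1 change look stale, so `order.state` never reaches the store.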

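2.2.2 HBase synchronization

The multi-table out-of-order problem above is not absolute. A store such as HBase, which has built-in column versions, automatically handles or discards lower-version data. To take advantage of this, the design stores every field of every table as a column in HBase. As an example, consider an order that is stored in an order main table and an order item table.

Scenario 1:1

For the order main table, we hash the order number and use "hash value:order number" as the rowkey to reduce hotspotting. We define a single column family; the qualifier format is "table:field", the value is the field's value, and the timestamp is the SeqNo, as shown in the figure: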

[Figure: HBase layout for the order main table (1:1)]

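Scenario 1:n

For the order item table, the rowkey is likewise generated from the order number, again with a single column family. The qualifier format is "table:field:record id", the value is the field's value, and the timestamp is the SeqNo, as shown in the figure: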

[Figure: HBase layout for the order item table (1:n)]

With the writes above, every individual field carries its own timestamp, so later updates can be applied at the level of a single field.
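
The layout above can be sketched in a few lines. Everything here is an in-memory approximation: the article does not say which hash is used or how long the prefix is, so the MD5 prefix of four hex characters is an assumption, and `CellStore` only imitates the "newest timestamp wins" behaviour of reading the latest HBase cell version.

```python
import hashlib


def row_key(order_no: str) -> str:
    """Build 'hash:order_no'; the 4-character MD5 prefix is an assumed choice."""
    prefix = hashlib.md5(order_no.encode("utf-8")).hexdigest()[:4]
    return f"{prefix}:{order_no}"


def qualifier(table: str, field: str, record_id=None) -> str:
    """'table:field' for 1:1 tables, 'table:field:id' for 1:n tables."""
    if record_id is None:
        return f"{table}:{field}"
    return f"{table}:{field}:{record_id}"


class CellStore:
    """Keeps, per (rowkey, qualifier), the value with the highest timestamp."""

    def __init__(self):
        self.cells = {}

    def put(self, row: str, qual: str, ts: int, value) -> None:
        current = self.cells.get((row, qual))
        if current is None or ts > current[0]:
            self.cells[(row, qual)] = (ts, value)

    def get_row(self, row: str) -> dict:
        return {q: v for (r, q), (_, v) in self.cells.items() if r == row}


cells = CellStore()
row = row_key("E123")
cells.put(row, qualifier("order_item", "title", 7), 102, "T-shirt")  # t2 arrives first
cells.put(row, qualifier("order", "state"), 101, "PAID")             # t1 still lands
cells.put(row, qualifier("order", "state"), 100, "CREATED")          # stale, ignored
snapshot = cells.get_row(row)
```

Because each qualifier is versioned on its own, the late t1 write is accepted even though a higher SeqNo from another table was written first.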

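2.2.3 ES synchronization

ES has no column-level version like HBase; it only has a version per document. If SeqNo2 > SeqNo1 and SeqNo2 is written to ES before SeqNo1, the content carried by SeqNo1 cannot be written, and the data ends up inconsistent. How can this multi-table synchronization problem be solved?

Since messages can arrive out of order, the straightforward answer is to keep them in order. Keeping the whole link ordered means that canal consumes the binlog in order and publishes to MQ in order, and that MQ delivers in order to the Sync consumer, which consumes one message at a time and acks to tell MQ whether it succeeded. That keeps all data ordered (processing multiple MQ partitions with multiple threads or multiple machines would break the guarantee). As shown in the figure: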

[Figure: ordered end-to-end synchronization link]

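This way, as long as the data from table t1 and the data from table t2 do not depend on each other in ES, and every write is performed as an update (with a create if the document does not exist yet), all data is applied in order.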
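
A minimal sketch of the "update, create if missing" write, assuming messages are already consumed in SeqNo order. The request body uses the `doc` and `doc_as_upsert` fields of the Elasticsearch update API; `apply_update`, the in-memory `index`, and the field names are hypothetical stand-ins for the real cluster.

```python
def build_update_body(partial: dict) -> dict:
    """Partial update that creates the document when it does not exist."""
    return {"doc": partial, "doc_as_upsert": True}


def apply_update(index: dict, doc_id: str, body: dict) -> None:
    """Imitate the effect of the update request on an in-memory index."""
    if doc_id not in index:
        if not body.get("doc_as_upsert"):
            raise KeyError(doc_id)
        index[doc_id] = {}
    index[doc_id].update(body["doc"])


index = {}
# Messages from two tables, already delivered in SeqNo order.
ordered_messages = [
    (101, "E1", {"order_state": "PAID"}),     # from t1
    (102, "E1", {"item_title": "T-shirt"}),   # from t2
    (103, "E1", {"order_state": "SHIPPED"}),  # from t1 again
]
for _seq_no, doc_id, partial in ordered_messages:
    apply_update(index, doc_id, build_update_body(partial))
```

Each table only touches its own fields, so applying the partial updates in order yields the latest value of every field without any version check.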

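2.3 Configuration-driven synchronization

As described above, data synchronization moves data from one data source (MQ, MySQL, and so on) to another (MySQL, ES, HBase, Alert, and so on), which makes it a pipeline. Borrowing from Logstash, the processing flow is likewise split into input, filter, and output components, and such a flow is called a task, as shown in the figure: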

[Figure: input, filter, and output components of a task]

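Each of these components is abstracted so that it has its own configuration, which is used to initialize the component and drive it through the flow. Put simply, configuring a few components on a page is enough to create a synchronization task, without writing a single line of code. As shown in the figure: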

[Figure: component configuration]

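A task is assembled from a series of configuration entries; the configuration page is shown in the figure that closes this subsection. Business logic can be scripted in the dynamic language Groovy (complex business scenarios can be supported through UDFs). Fields obtained from the MQ input are processed and passed through filters, the relevant data is assembled, and the result is written to ES as configured. No Java code has to be written to set up the flow, and complex requirements can still be supported efficiently.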
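
To make the idea of a configuration-driven task concrete, here is a small sketch in which a dictionary plays the role of the task configuration. The filter types `drop_if_missing` and `rename`, the config keys, and `build_task` are all invented for this example; the real system's configuration format is not described in the article.

```python
from typing import Callable, Dict, Iterable, List, Optional

Record = Dict[str, object]
Filter = Callable[[Record], Optional[Record]]

# Registry of filter factories, keyed by the "type" used in the configuration.
FILTERS: Dict[str, Callable[[dict], Filter]] = {
    "drop_if_missing": lambda cfg: (lambda r: r if cfg["field"] in r else None),
    "rename": lambda cfg: (
        lambda r: {cfg["mapping"].get(k, k): v for k, v in r.items()}
    ),
}


def build_task(config: dict) -> Callable[[Iterable[Record], List[Record]], None]:
    """Turn a configuration into a runnable input -> filters -> output task."""
    filters = [FILTERS[f["type"]](f) for f in config["filters"]]

    def run(records: Iterable[Record], sink: List[Record]) -> None:
        for record in records:
            for apply_filter in filters:
                record = apply_filter(record)
                if record is None:
                    break  # a filter dropped the record
            else:
                sink.append(record)  # the list stands in for the output component

    return run


config = {
    "filters": [
        {"type": "drop_if_missing", "field": "order_no"},
        {"type": "rename", "mapping": {"order_no": "id"}},
    ]
}
task = build_task(config)
output: List[Record] = []
task([{"order_no": "E1", "state": "PAID"}, {"state": "orphan"}], output)
```

Adding a new synchronization flow then means writing a new configuration rather than new code, which is the property the text describes.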

[Figure: task configuration page]

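2.3.1 Performance bottlenecks

The approach above solves multi-table synchronization to ES, but some problems remain:

  • Performance bottleneck
  • Failure backlog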

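Performance bottleneck: when the write volume is very large, the Sync consumer can only consume per MQ partition (the Kafka notion of a partition), and each partition can be handled by only one thread. The consumption rate is proportional to the number of partitions and inversely proportional to the consumption RT. During big promotions in particular, consumption cannot keep up and data backs up badly.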
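
A back-of-the-envelope calculation shows how quickly a backlog builds up under this relationship. The partition count, RT, and inflow below are hypothetical numbers chosen only to illustrate the proportionality, and the model assumes exactly one consumer thread per partition.

```python
def max_consume_rate(partitions: int, rt_ms: float) -> float:
    """Messages per second when each partition has one thread taking rt_ms per message."""
    return partitions * 1000.0 / rt_ms


def backlog_after(inflow_per_s: float, partitions: int, rt_ms: float, seconds: float) -> float:
    """Messages left unconsumed after the given time at a steady inflow."""
    deficit = inflow_per_s - max_consume_rate(partitions, rt_ms)
    return max(0.0, deficit * seconds)


rate = max_consume_rate(partitions=16, rt_ms=20)  # 16 threads, 20 ms per message
backlog = backlog_after(inflow_per_s=2000, partitions=16, rt_ms=20, seconds=60)
```

With these assumed numbers the consumer tops out at 800 messages per second, so one minute at 2000 messages per second leaves 72000 messages queued.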

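Failure backlog: because consumption is ordered, as soon as one message in a partition fails, every message behind it piles up and data latency becomes very high. An ordered queue is therefore advisable only when the business volume poses no performance bottleneck. So how can the ordered queue be fixed, or removed altogether?

The only reason to use an ordered queue is to guarantee order, and that is needed because ES lacks the field-level versions that HBase has. Orders currently use HBase as an intermediate processing layer to solve this, as shown in the figure: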

[Figure: HBase as an intermediate layer in front of ES]

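HBase's field-level versions keep the fields inside each table ordered. After the put has written the data, an increment is performed on an extra field, version. As soon as both writes have completed, a get reads the row from HBase and the result is written to ES. However high the concurrency, at least one get request eventually sees the largest version, and that version is used as the ES external version, which solves the ES versioning problem.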
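
The put, increment, get, index sequence can be sketched with two toy classes. `HBaseRow` and `EsDoc` are in-memory stand-ins written for this example, not client APIs, and the run below is single-threaded, so it only illustrates the data flow rather than proving the concurrent case.

```python
class HBaseRow:
    """Toy row: per-qualifier timestamps plus a counter column named version."""

    def __init__(self):
        self.cells = {}   # qualifier -> (timestamp, value)
        self.version = 0

    def put(self, qual, ts, value):
        current = self.cells.get(qual)
        if current is None or ts > current[0]:
            self.cells[qual] = (ts, value)

    def increment(self):
        self.version += 1
        return self.version

    def get(self):
        return self.version, {q: v for q, (_, v) in self.cells.items()}


class EsDoc:
    """Toy document that accepts only strictly higher external versions."""

    def __init__(self):
        self.version = 0
        self.source = {}

    def index(self, version, source):
        if version <= self.version:
            return False
        self.version, self.source = version, source
        return True


def sync(row, es, seq_no, fields):
    """Handle one binlog message: put, increment, get, then index into ES."""
    for qual, value in fields.items():
        row.put(qual, seq_no, value)
    row.increment()
    version, snapshot = row.get()
    return es.index(version, snapshot)


row, es = HBaseRow(), EsDoc()
sync(row, es, 102, {"order_item:title:7": "T-shirt"})  # t2 arrives first
sync(row, es, 101, {"order:state": "PAID"})            # t1 arrives late
```

Because every message re-reads the whole row and the counter only grows, the late t1 message produces a higher version with both fields present, so ES ends up with the complete document even though the messages arrived out of order.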

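This approach has several benefits:

  • HBase manages the field-level versions and, through the operations above, helps ES obtain the matching version, while the data read is always the latest;
  • The ordered queue is gone, and HBase offers good throughput, higher than an ordered queue can provide;
  • The consumption rate can be raised by scaling out horizontally;
  • ES can use the index operation, which performs better.

There are drawbacks too: HBase can suffer from jitter, and there is the issue of primary/standby switchover.

Jitter or a primary/standby switchover can lead to inconsistent data. How should that be handled?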

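2.4 Future extensions

Order synchronization is currently driven by loading a configuration file, which means every horizontally scaled machine loads the same configuration file. Tasks are isolated from one another through exception handling, so in theory they do not affect each other, but the relative importance of the loaded tasks becomes an issue.

For example: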

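  • I need one machine to temporarily consume data in order to fix a production issue;
  • There is a very high-volume but not very important task, and it should not affect the other tasks;
  • A comparison is needed, either an incremental delayed comparison or a full comparison, without affecting other data;
  • Checking logs requires looking on every machine (of course, the company has an internal log system where they can be viewed directly).

With this in mind, the synchronization system can be made stateless: each task's configuration is loaded from a task configuration platform, which assigns specific machines to specific work, and extra capacity can be requested dynamically. As shown in the figure, machines can be freely allocated to different tasks.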

 

[Figure: assigning machines to tasks]

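III. Consistency Guarantees

The sections above covered how Youzan synchronizes order data to ES or HBase. The data originates in the binlog and is written to MQ, so everything we process comes from MQ. In one sentence: we do not produce messages, we only carry them. A "carrier" can do other useful things as well, and Youzan's data comparison works the same way. This chapter covers what the carrier can do:

3.1 Data comparison

Under normal circumstances the approach above does not go wrong, but what if it does? Then the data has to be compared, and the data source for that is the ordered queue we just discarded. The weakness of an ordered queue is backlog, but that property can be put to use: hold the first message back for ten minutes and the messages behind it will be held back for roughly ten minutes as well. Each message can then be consumed, the latest data pulled, and the two sides compared, as shown in the figure: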

[Figure: delayed data comparison]

The comparison results are sent to alert, which shows which data is inconsistent and how often. This, too, is a form of synchronization (mq->filter->alert)!
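
A sketch of the delayed comparison, under the assumption that each message carries a key and an event time in seconds. `compare_due`, the message shape, and the dictionaries standing in for the database and the NoSQL store are hypothetical.

```python
def compare_due(messages, now, delay_s, source, target):
    """Compare every message older than delay_s; stop at the first one that is not due.

    Returns (alerts, pending). Because the queue is ordered, everything behind
    the first message that is not due is not due either.
    """
    alerts = []
    for i, msg in enumerate(messages):
        if now - msg["ts"] < delay_s:
            return alerts, messages[i:]
        key = msg["key"]
        expected, actual = source.get(key), target.get(key)
        if expected != actual:
            alerts.append({"key": key, "expected": expected, "actual": actual})
    return alerts, []


queue = [
    {"key": "E1", "ts": 0},
    {"key": "E2", "ts": 10},
    {"key": "E3", "ts": 500},
]
database = {"E1": "PAID", "E2": "CLOSED", "E3": "CREATED"}  # source of truth
nosql = {"E1": "PAID", "E2": "PAID"}                        # synchronized copy

alerts, pending = compare_due(queue, now=620, delay_s=600, source=database, target=nosql)
```

The ten-minute delay gives the normal synchronization path time to finish, so a mismatch found at comparison time is likely a real inconsistency rather than a write still in flight.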

3.2 Full comparison / data backfill

So far we have described synchronizing data into NoSQL, but only the incremental path. Historical data has to be migrated as well, which is again a form of data synchronization. A later post will cover how to do full data synchronization.

IV. Tips

Tip-1: Why use ES + HBase for search and details?

In general, once a company reaches a certain scale and has full-text search or high-frequency key-value lookup needs, we recommend an ES + HBase architecture to serve search and detail queries. In practice, most production environments do not write data directly to ES or HBase; they write to the database first. Double-writing is avoided because the extra link would affect the business. HBase may fare a little better here, but ES is itself a near-real-time query system (if you are curious why, look into the ES read and write path), which makes the combined ES + HBase system near-real-time too. For the business, near-real-time is sufficient; merchant order search, for example, does not need to be real-time.

Tip-2: Why does Youzan use canal to parse the binlog rather than business messages for data synchronization?

  • When table data is changed directly, for example when repairing data, no business message is triggered, so the change is never written to the corresponding NoSQL store and the data becomes inconsistent;
  • Ordering cannot be guaranteed;
  • Business messages do not necessarily carry all the data that has to be written to NoSQL.

Tip-3: How is SeqNo implemented, and why not use the binlog offset?

Canal instances map to MySQL instances 1:N (1:1 is recommended), and in most business scenarios the related data lives on the same instance. Canal can therefore combine the processing time with a per-second counter for the instance, for example: timestamp * 10000 + counter++. The binlog offset is not used because, if the MySQL instance goes down, the binlog offset may no longer be in order.
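
The formula can be sketched as a small generator. Several details are assumptions not stated in the text: the timestamp is taken in whole seconds, the counter resets at the start of each second, time never goes backwards, and fewer than 10000 events occur per second.

```python
class SeqNoGenerator:
    """Sketch of timestamp * 10000 + counter++ with a per-second counter (assumed reset)."""

    def __init__(self):
        self._second = -1
        self._counter = 0

    def next(self, epoch_second: int) -> int:
        if epoch_second != self._second:
            # New second: restart the counter (assumes a non-decreasing clock).
            self._second = epoch_second
            self._counter = 0
        if self._counter >= 10000:
            raise OverflowError("more than 10000 events in one second")
        seq_no = epoch_second * 10000 + self._counter
        self._counter += 1
        return seq_no


gen = SeqNoGenerator()
first = gen.next(1558274055)
second = gen.next(1558274055)
third = gen.next(1558274056)
```

Here `first` is 15582740550000, `second` is 15582740550001, and `third` is 15582740560000, so the values increase strictly as long as the stated assumptions hold.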

V. Conclusion

Youzan's transaction order management handles synchronization tasks at the hundred-million scale and has faced many requirements and challenges along the way. From the initial MySQL-based setup to today's productized synchronization tasks, from single-table to multi-table synchronization, from a single index to multiple indexes, and from incremental to full synchronization, each stage has had its own solution. The search platform now carries search and synchronization traffic at the hundred-million scale. If you are interested, you are welcome to join us and discuss.


Origin blog.csdn.net/u013322876/article/details/90573504