MySQL to TiDB: Hive Metastore's road to horizontal scaling

Author: vivo Internet Big Data Team-Wang Zhiwen


This article describes vivo's exploration of horizontally scaling its big data metadata service. Starting from the actual problems we faced, it surveys and benchmarks the current mainstream horizontal scaling solutions, selects the TiDB solution after comparing the data from multiple angles, and then shares the full rollout process, the problems encountered, and their solutions. It should be a useful reference for developers facing the same metadata performance dilemma.


1. Background


The big data metadata service, Hive Metastore Service (hereinafter HMS), stores all the metadata the data warehouse depends on and serves the corresponding queries, allowing compute engines (Hive, Spark, Presto) to accurately locate the specific data they need within massive datasets. It plays a decisive role in the stable operation of the offline data warehouse. vivo's offline data warehouse Hadoop cluster is built on CDH 5.14.4; the HMS version follows the CDH major version and is currently 1.1.0-cdh5.14.4.


Before the underlying storage architecture of HMS was upgraded, vivo used the MySQL storage engine. As vivo's business grew, however, data exploded and the stored metadata reached the hundreds of millions of rows (PARTITION_PARAMS: 810 million, PARTITION_KEY_VALS: 350 million, PARTITIONS: 140 million). On such a large base, our team often hit machine resource bottlenecks: when many users concurrently queried certain large partitioned tables (500,000+ partitions), machine utilization would max out, causing metadata query timeouts and, in severe cases, making the whole HMS cluster unavailable. The only recovery path was to temporarily stop all HMS nodes until the MySQL machine load dropped, then gradually restore service. Given the serious performance bottleneck of the MySQL solution, HMS urgently needed a complete horizontal scaling solution.


2. Selecting a horizontal scaling solution


To solve HMS's performance problem, our team researched horizontal scaling options extensively. Broadly, the industry's approaches fall into two camps: sharding MySQL across multiple databases, or replacing MySQL with a high-performance distributed engine. A relatively mature solution along the first line is Waggle Dance, open-sourced by Hotels.com, which implements a cross-cluster Hive Metastore proxy gateway and lets users access data from multiple clusters at once; these clusters can be deployed on different platforms, cloud platforms in particular. The mainstream approach along the second line is to replace the traditional MySQL engine with the distributed storage engine TiDB; in the Hive community, many companies have tested Hive 2.x against TiDB extensively and applied it in production.


2.1 Waggle Dance


Waggle-dance gives users a unified entry point, routing requests from Metastore clients to the corresponding underlying Metastore services while hiding the underlying Metastore distribution from users, thereby integrating the Hive databases and tables of multiple Metastores at the logical level. Waggle-dance implements the Metastore Thrift API, so clients need no modification; to users, Waggle-dance is a Metastore. Its overall architecture is as follows:


Waggle Dance Architecture


The most prominent feature of Waggle-dance's architecture is that it spreads the load of the original MySQL instance across multiple different MySQL instances. It also has the following advantages:

  1. The client side can configure multiple Waggle-dance connections, following normal Metastore client usage, and switch to another Waggle-dance service when the current one becomes unavailable.

  2. Waggle-dance starts in only a few seconds and, being stateless, scales dynamically and efficiently: new service nodes can be launched quickly during peak business hours to spread the load, and nodes can be taken offline during quiet periods to release resources.

  3. As a gateway service, beyond routing, Waggle-dance supports customized development and differentiated deployment; the platform can add features such as authentication and firewall filtering as needed.


2.2 TiDB


TiDB is an open source distributed relational database independently designed and developed by PingCAP. It is a converged distributed database supporting both online transaction processing and online analytical processing (Hybrid Transactional and Analytical Processing, HTAP), with key features including horizontal scale-out and scale-in, financial-grade high availability, real-time HTAP, cloud-native distributed architecture, and compatibility with the MySQL 5.7 protocol and the MySQL ecosystem. TiDB 4.x greatly improved performance and stability over previous versions and meets HMS's metadata query performance requirements, so we researched and tested TiDB accordingly. Combining HMS with the big data ecosystem, the overall deployment architecture using TiDB as metadata storage is as follows:


HMS on TiDB architecture   


Since TiDB itself scales horizontally, it can spread query load evenly after scale-out; this feature is our trump card for solving HMS's query performance bottleneck. The architecture also has the following advantages:

  1. Users need no changes at all; nothing changes on the HMS side except the underlying storage it relies on.

  2. Data integrity is preserved: there is no need to split the data across multiple instances to share the load. To HMS, it remains one complete, independent database.

  3. Apart from introducing TiDB as the storage engine, no extra services are needed to support the architecture.


2.3 Comparison between TiDB and Waggle Dance


The preceding sections briefly introduced the Waggle-dance and TiDB solutions and summarized their advantages. The following compares the two solutions across multiple dimensions:



Across these dimensions, the TiDB solution beats the Waggle-dance solution in performance, horizontal scalability, operation and maintenance complexity, and machine cost, so we chose TiDB for production.


3. TiDB rollout plan


We chose the TiDB engine to replace the original MySQL storage engine. Since TiDB and MySQL cannot form a dual-master setup, all HMS services had to be fully stopped and restarted to complete the switch. To keep the switch smooth, and to allow a timely rollback should any major problem surface later, we put the following data synchronization architecture in place beforehand, ensuring that MySQL and TiDB data were consistent before the switch and that MySQL remained a backup afterwards.


Data synchronization architecture before and after the TiDB & MySQL switch


In the architecture above, the only writable data source before the switch is the source MySQL primary; all other TiDB and MySQL nodes are read-only. Only when all HMS nodes have been stopped, and the maximum synchronized timestamps of both the MySQL replica and the TiDB primary match the source MySQL primary, is the TiDB primary granted write permission; the HMS services are then brought up one by one after their underlying storage connection strings are updated.


After the above architecture is completed, the specific switching process can be started. The overall switching process is as follows:


HMS underlying storage switching process


Before bringing up synchronization between the source MySQL and TiDB, the following must be configured on TiDB:

  • tidb_skip_isolation_level_check must be set to 1, otherwise HMS throws a MetaException at startup.

  • tidb_txn_mode must be set to pessimistic to improve transaction consistency.

  • The transaction size limit is set to 3 GB, adjustable to the actual situation of your business.

  • The connection limit is set to a maximum of 3000, adjustable to the actual situation of your business.
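As a rough sketch, the first two items map onto TiDB system variables that can be set online (the variable and config-key names below are real TiDB 4.x names; the values are simply the ones chosen for this cluster and should be tuned per deployment):

```sql
-- Settable online; take effect for new sessions:
SET GLOBAL tidb_skip_isolation_level_check = 1;  -- avoid MetaException at HMS startup
SET GLOBAL tidb_txn_mode = 'pessimistic';        -- pessimistic transaction mode

-- The 3 GB transaction size limit (performance.txn-total-size-limit) and the
-- 3000-connection cap are server configuration settings, not variables, and
-- are set in the TiDB configuration file before startup.
```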


In addition, with the Sentry service enabled, check whether the NOTIFICATION_ID value in Sentry's metadata lags behind the NEXT_EVENT_ID value in the NOTIFICATION_SEQUENCE table of the HMS metastore database. If it does, NEXT_EVENT_ID must be replaced with Sentry's NOTIFICATION_ID value; otherwise table or partition creation may time out with exceptions.
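A hedged sketch of that check follows; the HMS side is the standard metastore schema, but the exact Sentry-side table holding NOTIFICATION_ID varies by Sentry version, so it is shown only as a placeholder:

```sql
-- Read the next event id HMS will hand out:
SELECT NEXT_EVENT_ID FROM NOTIFICATION_SEQUENCE;

-- Compare it with Sentry's recorded notification id (table name is
-- illustrative; consult your Sentry schema):
-- SELECT NOTIFICATION_ID FROM <sentry_notification_table>;

-- If Sentry's value is behind, overwrite NEXT_EVENT_ID with it:
-- UPDATE NOTIFICATION_SEQUENCE SET NEXT_EVENT_ID = <sentry_notification_id>;
```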


The TiDB solution performed as follows across different dimensions:

  1. HQL compatibility: the TiDB solution is fully compatible with the metadata queries issued by all engines online; there are no syntax compatibility issues, and HQL syntax compatibility reaches 100%.

  2. Performance: the average latency of query-type APIs is better than MySQL, an overall improvement of 15%; table creation latency dropped by 80%, and higher concurrency is supported. TiDB's performance is no worse than MySQL's.

  3. Machine resource usage: overall disk utilization is below 10%; with no hot-data access, average CPU utilization is around 12% and average CPU.WAIT.IO is below 0.025%; the cluster has no resource bottleneck.

  4. Scalability: TiDB supports one-click horizontal scale-out and scale-in and implements an internal query-balancing algorithm; once data is balanced, all nodes share the query load evenly.

  5. Disaster recovery: TiDB Binlog reliably supports data synchronization between TiDB and MySQL, and between TiDB clusters, providing complete data backup and a rollback option.

  6. Service availability: load balancing and failover can be implemented with services such as LVS or HAProxy.


The following are latency statistics for the main HMS API calls after going online:








4. Problems and solutions


4.1 Primary key conflicts when simulating a rollback from TiDB to MySQL


After data in TiDB had grown threefold, switching back to MySQL produced a duplicate primary key exception. The log looked like this:



Primary key conflict exception log


The root cause is that each TiDB node, when allocating primary key IDs, requests a whole range of IDs as a cache and only asks the storage layer for the next range once it is used up, rather than requesting an ID on every allocation. As a result, TiDB's AUTO_INCREMENT values are monotonically increasing on a single node, but can jump wildly across multiple nodes. Globally, AUTO_INCREMENT values in a multi-node TiDB cluster are therefore not strictly monotonic, i.e. not strictly ordered, so the values recorded in the metastore's SEQUENCE_TABLE can fall behind the actual maximum IDs of the corresponding tables.


The primary key conflict thus stems from SEQUENCE_TABLE recording values lower than the actual maxima in the metadata. In that state, after switching back to MySQL, the metastore may generate primary keys that already exist, raising the conflict exception. The fix is simply to set the records in SEQUENCE_TABLE to the current actual maximum values of the corresponding tables.
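For the PARTITIONS table, for example, the repair is a single update (a sketch; in the standard metastore schema SEQUENCE_TABLE keys each sequence by JDO class name, and HMS should be stopped while this runs):

```sql
-- Realign the partition-id sequence with the actual maximum PART_ID.
-- The +1 makes the next allocated id strictly greater than any existing one.
UPDATE SEQUENCE_TABLE
SET NEXT_VAL = (SELECT MAX(PART_ID) + 1 FROM PARTITIONS)
WHERE SEQUENCE_NAME = 'org.apache.hadoop.hive.metastore.model.MPartition';
```

The same pattern applies to the other tables whose sequences have drifted.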


4.2 The PARTITION_KEY_VALS index trade-off


While on the MySQL engine, we collected some slow query logs. These queries mainly look up partitions of partitioned tables, similar to the following SQL:

# Template for querying a three-level partitioned table, with a filter condition on each partition level
SELECT PARTITIONS.PART_ID
FROM PARTITIONS
  INNER JOIN TBLS ON PARTITIONS.TBL_ID = TBLS.TBL_ID AND TBLS.TBL_NAME = '${TABLE_NAME}'
  INNER JOIN DBS ON TBLS.DB_ID = DBS.DB_ID AND DBS.NAME = '${DB_NAME}'
  INNER JOIN PARTITION_KEY_VALS FILTER0 ON FILTER0.PART_ID = PARTITIONS.PART_ID AND FILTER0.INTEGER_IDX = ${INDEX1}
  INNER JOIN PARTITION_KEY_VALS FILTER1 ON FILTER1.PART_ID = PARTITIONS.PART_ID AND FILTER1.INTEGER_IDX = ${INDEX2}
  INNER JOIN PARTITION_KEY_VALS FILTER2 ON FILTER2.PART_ID = PARTITIONS.PART_ID AND FILTER2.INTEGER_IDX = ${INDEX3}
WHERE FILTER0.PART_KEY_VAL = '${PART_KEY}'
  AND CASE WHEN FILTER1.PART_KEY_VAL <> '__HIVE_DEFAULT_PARTITION__' THEN CAST(FILTER1.PART_KEY_VAL AS decimal(21, 0)) ELSE NULL END = 10
  AND FILTER2.PART_KEY_VAL = '068';


In testing we replayed this type of SQL under controlled concurrency. As concurrency rose, the average latency of every API grew, and the replayed queries' average latency climbed above 100 s. Although neither TiDB nor HMS showed any anomaly during the stress test, such query performance is clearly unacceptable to users. The DBAs found that the query did not pick a suitable index and fell back to a full table scan, and suggested adding an extra index on the PART_KEY_VAL column of PARTITION_KEY_VALS to speed it up. With the index, this type of query improved dramatically: even at a concurrency of 100, average latency stayed within 500 ms. We therefore tried adding this index to PARTITION_KEY_VALS.
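The DBAs' suggestion amounts to an index along these lines (a sketch; the name matches the idx_PART_KEY_VAL index discussed below, and adding an online index to a table with hundreds of millions of rows should be scheduled carefully):

```sql
-- Secondary index so partition-value predicates can seek by value
-- instead of scanning PARTITION_KEY_VALS in full.
CREATE INDEX idx_PART_KEY_VAL ON PARTITION_KEY_VALS (PART_KEY_VAL);
```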


In real online queries, however, the partition lookups that never produced slow queries are mostly single-level, by-day partition queries, similar to:

SELECT "PARTITIONS"."PART_ID"
FROM "PARTITIONS"
  INNER JOIN "TBLS" ON "PARTITIONS"."TBL_ID" = "TBLS"."TBL_ID" AND "TBLS"."TBL_NAME" = 'tb1'
  INNER JOIN "DBS" ON "TBLS"."DB_ID" = "DBS"."DB_ID" AND "DBS"."NAME" = 'db1'
  INNER JOIN "PARTITION_KEY_VALS" "FILTER0" ON "FILTER0"."PART_ID" = "PARTITIONS"."PART_ID" AND "FILTER0"."INTEGER_IDX" = 0
  INNER JOIN "PARTITION_KEY_VALS" "FILTER1" ON "FILTER1"."PART_ID" = "PARTITIONS"."PART_ID" AND "FILTER1"."INTEGER_IDX" = 1
WHERE "FILTER0"."PART_KEY_VAL" = '2021-12-28'
  AND CASE WHEN "FILTER1"."PART_KEY_VAL" <> '__HIVE_DEFAULT_PARTITION__' THEN CAST("FILTER1"."PART_KEY_VAL" AS decimal(21, 0)) ELSE NULL END = 10;


Because the index added on PARTITION_KEY_VALS.PART_KEY_VAL for query optimization also appears in the execution plans generated for this query type, data scans here likewise go through the idx_PART_KEY_VAL index. The plan looks like this:


Execution plan using the idx_PART_KEY_VAL index


The idx_PART_KEY_VAL index improves query efficiency when few rows share the same value in that column, since the index then retrieves only a small amount of data. But first-level partitions of Hive tables are basically day partitions, and statistics show roughly 260,000 new day-partition rows per day. At that rate, a query with the condition day >= 2021-12-21 and day < 2021-12-26 would have to retrieve nearly 1.6 million rows through idx_PART_KEY_VAL, which is clearly not a good execution plan.


If the plan does not use idx_PART_KEY_VAL, TiDB can find all associated partition rows via DBS and TBLS, then scan PARTITION_KEY_VALS by part_id and the filter conditions and return the result. The amount of data such a plan scans depends on the total number of partitions of the queried table: if the table has only a few partitions, the query responds quickly, but if it has millions of partitions, this plan is not optimal for the query type either.


Execution plan not using the idx_PART_KEY_VAL index


Given the characteristics of the two execution plans, we compiled the following comparison:



In production, metadata is predominantly partitioned by day, growing by roughly 260,000 partitions per day, and range queries are common. The execution plan driven by idx_PART_KEY_VAL does not fit these online scenarios, so the index is not suitable for the production environment.


4.3 Crashes caused by sudden TiDB memory surges


In the early days of the TiDB service, we faced TiDB out-of-memory incidents several times. They occurred at random moments, with memory spiking almost instantaneously. Even when TiDB survived the spike, the surged memory would not be released for a very long time, ultimately causing jitter in HMS service stability.


TiDB memory surge


Joint analysis with the TiDB developers and our DBAs confirmed that the memory spikes were triggered by users analyzing slow queries through the Dashboard. During such analysis, TiDB loads all local slow-query logs into memory; if the logs are large, memory surges. Moreover, if the user clicks cancel during the analysis, TiDB may leak memory. We adopted the following measures:

  1. Replace the original small-memory machines with large-memory ones, so slow-query analysis does not run out of memory.

  2. Raise the slow-query threshold to 3 s to reduce log generation.

  3. Periodically mv slow-query logs to a backup directory.
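The third measure can be as small as a cron-driven shell function along these lines (a sketch; the log location and rotated-file naming are assumptions about a typical TiDB deployment, not vivo's actual paths):

```shell
# Move rotated TiDB slow-query logs out of the directory that Dashboard
# scans, so a slow-query analysis never loads the whole history into memory.
rotate_slow_logs() {
    log_dir="$1"      # directory TiDB writes slow logs into
    backup_dir="$2"   # archive directory outside Dashboard's reach
    mkdir -p "$backup_dir"
    # Keep the active tidb_slow_query.log; move only rotated files,
    # which carry a timestamp suffix.
    for f in "$log_dir"/tidb_slow_query.log.*; do
        [ -e "$f" ] || continue
        mv "$f" "$backup_dir"/
    done
}
```

Scheduled from cron, e.g. hourly, this keeps the active log small without losing history.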


4.4 locate() queries bypassing the index and overloading TiKV


Some HMS queries fetch partitions via JDO, filtering PART_NAME with the locate() function in their predicates. In TiDB, applying a function to a column prevents index usage, so these queries load all rows of the table into TiDB for filtering, and TiKV keeps scanning the full table and shipping data to the TiDB side, driving TiKV load abnormally high.


Full table scan caused by locate()


The same condition, however, can be expressed with LIKE. With LIKE syntax, the query successfully filters through the UNIQUEPARTITION index on the PARTITIONS table, so the index filtering happens on the TiKV side and the load drops.


LIKE syntax filtering through the index


Converting the locate() queries to LIKE effectively reduced TiKV load. After the change on the HMS side, TiKV CPU usage dropped by nearly half; since index filtering now happens on the KV side, I/O usage rose somewhat, but average network transfer fell markedly, from about 1 GB down to around 200 MB.
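Schematically, the rewrite looks as follows. This is a simplified, hypothetical query rather than the exact JDO-generated SQL; the TBL_ID and partition prefix are made-up values, and UNIQUEPARTITION refers to the metastore's unique index on PARTITIONS(PART_NAME, TBL_ID):

```sql
-- Before: the function wraps the column, so TiDB cannot use an index
-- and TiKV scans the whole PARTITIONS table.
SELECT PART_ID FROM PARTITIONS
WHERE locate('day=2021-12-28', PART_NAME) = 1 AND TBL_ID = 12345;

-- After: the same prefix match expressed as LIKE; the optimizer can turn
-- it into a range scan on the UNIQUEPARTITION index, pushing the filter
-- down to TiKV.
SELECT PART_ID FROM PARTITIONS
WHERE PART_NAME LIKE 'day=2021-12-28%' AND TBL_ID = 12345;
```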


TiKV load before and after the change


Beyond the clear drop in TiKV load, overall TiDB performance also improved noticeably, with latencies of all operations falling by an order of magnitude. The daily average latencies of TiDB inserts, deletes, updates, and queries are summarized below:


TiDB P999 daily average latency statistics


4.5 get_all_functions优化


As Hive UDFs kept growing in number, the average latency of HMS's get_all_functions API grew longer and longer, averaging 40-90 s. This API is called to register all UDFs when the first query is executed in a hive shell, so the long latency hurts the user experience of the Hive engine: even a simple show databases can take a minute or more to return.


Average latency of the original get_all_functions API


The main cause of the severe latency is that HMS fetches all functions via JDO: retrieving all UDFs makes the backend iterate over every function and join it against the DBS and FUNC_RU tables, which performs extremely poorly. Fetching all UDF data via directSQL instead responds within one second, a striking improvement. The directSQL logic is as follows:


select FUNCS.FUNC_NAME, DBS.NAME, FUNCS.CLASS_NAME, FUNCS.OWNER_NAME, FUNCS.OWNER_TYPE, FUNCS.CREATE_TIME, FUNCS.FUNC_TYPE, FUNC_RU.RESOURCE_URI, FUNC_RU.RESOURCE_TYPE
from FUNCS
left join FUNC_RU on FUNCS.FUNC_ID = FUNC_RU.FUNC_ID
left join DBS on FUNCS.DB_ID = DBS.DB_ID


5. Summary


We began researching TiDB in July 2021 and, after several months of testing, switched the MySQL engine over to TiDB at the end of November that year. Because early testing focused on compatibility and performance, we overlooked TiDB's own potential issues, and in the early days online we were hit several times by slow-query logs blowing up TiDB's memory. Special thanks here to our DBA team, platform operations team, and the TiDB team for helping analyze and resolve the problem so it would not recur. Meanwhile, since the HMS version we use is quite old and the big data components keep evolving, we also have to accommodate changes brought by upgrades, such as support for reading EC files after upgrading HDFS to 3.x, and reworking Spark partition retrieval to avoid full table scans. In addition, because TiDB's latin character set accepts Chinese characters, users can mistakenly write invalid Chinese partitions; such data cannot be deleted through the existing APIs, so we also need to block this kind of bad partition write at the application layer to keep useless data from accumulating.


After more than a year in production, overall TiDB memory usage stays below 10%, TiKV CPU usage is steady with peaks within 30 cores, and there is no system bottleneck for now. HMS service stability is under control overall, and the performance of key APIs meets actual business needs, providing reliable support for business growth. Over the next three years we will keep this architecture to support the stable operation of the whole big data platform, while continuing to watch developments in the industry and absorb more good practices into our production environment, including but not limited to trying higher-performing newer TiDB versions and further HMS performance optimizations.


END



This article is shared from the WeChat public account vivo互联网技术 (vivoVMIC).



Origin my.oschina.net/vivotech/blog/10114822