Elasticsearch system tuning in practice: behind 1.6 billion health code scans

Background: To better prevent and control the epidemic, and to handle the complex movement of people and the screening of suspected cases efficiently, Tencent and its partners jointly launched the "epidemic prevention health code" on February 9. People only need to submit their own health information to obtain a WeChat QR code that serves as an electronic travel pass, giving them convenient access to public places during the epidemic.

The Tencent epidemic prevention health code has now been rolled out in Beijing, Guangdong, Sichuan, Yunnan, Shanghai and other provincial-level administrative regions, about 20 in all, covering more than 300 cities and counties. The code has been displayed more than 1.6 billion times in total, it covers more than 900 million people, and cumulative visits have passed 6 billion.


 
The architects and developers of the health code faced several questions: how to pick the most suitable product for the business from a wide variety of storage products, and how to support rapid, iterative development of the system within a limited time. On top of that, with a sudden nationwide epidemic as the backdrop, how could the system respond quickly to the challenge of trillion-scale data access? This article explains the Elasticsearch tuning practice behind the health code system.

Why Elasticsearch: selection and technical considerations
The epidemic prevention health code serves many scenarios, including mutual scanning within communities, checkpoint access, home quarantine and so on. In terms of data types, the system therefore has to support queries over common structured information such as travel time and vehicle information, as well as long text fields such as street, community and district names. As prevention and control requirements change, fields must also be added or removed quickly. In terms of queries, the system has to support not only traditional structured queries but also keyword search, aggregation analysis over massive data, and geographic location calculations.
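To make these requirements concrete, here is a minimal sketch of how one request can combine a keyword search, a structured time filter and a geographic filter in the Elasticsearch Query DSL. The index mapping and all field names (`travel_time`, `vehicle_no`, `community_name`, `checkpoint_location`) are hypothetical; the article does not publish the real schema.

```python
import json

# Hypothetical mapping: field names are illustrative assumptions,
# not the schema used by the health code system.
mapping = {
    "mappings": {
        "properties": {
            "travel_time": {"type": "date"},                # structured field
            "vehicle_no": {"type": "keyword"},              # exact-match field
            "community_name": {"type": "text"},             # full-text field
            "checkpoint_location": {"type": "geo_point"},   # lat/lon field
        }
    }
}


def build_query(keyword, since, lat, lon, radius="2km"):
    """Build a bool query: full-text match plus time and distance filters."""
    return {
        "query": {
            "bool": {
                # Scored keyword search on the text field.
                "must": [{"match": {"community_name": keyword}}],
                # Non-scoring filters on structured and geo fields.
                "filter": [
                    {"range": {"travel_time": {"gte": since}}},
                    {
                        "geo_distance": {
                            "distance": radius,
                            "checkpoint_location": {"lat": lat, "lon": lon},
                        }
                    },
                ],
            }
        }
    }


query = build_query("Nanshan", "2020-02-09", 22.54, 113.93)
print(json.dumps(query, indent=2))
```

The resulting dictionary would be sent as the JSON body of a search request; nothing here contacts a cluster.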


 
When selecting the data store, we compared and weighed several mainstream products:

Traditional relational databases such as MySQL perform very well for transactional workloads and for queries that join tables across multiple services. Faced with the varied and complex data types of the health code system, however, they fall short, especially in text keyword search. Tencent Cloud ES is built on the Lucene query engine. Its inverted index structure lets it find the required records quickly by keyword across massive data sets, and it still achieves millisecond query responses at trillion scale. Compared with matching by a LIKE statement in a traditional relational database, query efficiency improves by nearly a hundred times.

Inverted index structure
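The idea behind an inverted index can be sketched in a few lines: each term maps to the set of documents that contain it, so a keyword lookup is a dictionary access rather than a scan of every row. This is a toy illustration with whitespace tokenization and made-up documents, not how Lucene stores its postings.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercase term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, *terms):
    """Return sorted ids of documents that contain every given term."""
    if not terms:
        return []
    postings = [index.get(term.lower(), set()) for term in terms]
    return sorted(set.intersection(*postings))


# Made-up sample documents for illustration only.
docs = {
    1: "Nanshan district checkpoint",
    2: "Futian district community",
    3: "Nanshan community gate",
}
index = build_inverted_index(docs)

print(search(index, "nanshan"))                # documents 1 and 3
print(search(index, "district", "community"))  # document 2 only
```

A LIKE '%keyword%' predicate generally has to examine every row, whereas the lookup above touches only the posting sets of the queried terms.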

 
MongoDB, a popular NoSQL product, resembles ES in that it supports diverse data types and lets fields be added dynamically as the business requires, without affecting normal reads and writes. It likewise lacks strong text keyword search, however. Compared with ES, it also lacks aggregation analysis over massive data and a graphical UI component. Tencent Cloud ES relies on the columnar doc_values storage structure and its aggregation framework, which support bucketing by keyword, bucketing by time, bucketing by distance, averaging, summing, geographic bounds and more, up to 60 aggregation operators in total.
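As an illustration of nested aggregations, the sketch below builds a request body that buckets documents by a keyword field, then by day, and computes an average inside each bucket. The field names (`city`, `scan_time`, `body_temperature`) are hypothetical, and `calendar_interval` assumes Elasticsearch 7.2 or later (older versions used `interval`).

```python
import json

# Hypothetical field names; only the aggregation types come from the
# Elasticsearch aggregation framework itself.
agg_request = {
    "size": 0,  # return aggregation results only, no individual hits
    "aggs": {
        "by_city": {
            # Bucket by keyword value.
            "terms": {"field": "city", "size": 10},
            "aggs": {
                "per_day": {
                    # Bucket by time within each city.
                    "date_histogram": {
                        "field": "scan_time",
                        "calendar_interval": "day",
                    },
                    "aggs": {
                        # Metric computed inside each daily bucket.
                        "avg_temperature": {
                            "avg": {"field": "body_temperature"}
                        }
                    },
                }
            },
        }
    },
}

print(json.dumps(agg_request, indent=2))
```

Because aggregations read from doc_values, which are stored column by column, a query like this does not need to load whole documents.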


 
With the companion Kibana UI component, massive data sets can be analyzed graphically. Reports can also be configured through a graphical interface, which simplifies otherwise complex development work for analysts. Ultimately, the data analysis capabilities of ES let epidemic prevention departments and staff keep the prevention and control situation at their fingertips.

For massive data storage, there are many big data products, such as the Hive data warehouse and HBase, that can store huge volumes of data and offer some analysis capability. Compared with ES, however, their technology stack and architecture are heavy, they involve many open-source components that need maintenance, and they usually require a dedicated operations team for routine cluster maintenance. Their development methods and interfaces are also more complicated: a developer new to a big data platform needs a good deal of background knowledge before starting to build on it.

Tencent Cloud ES uses a RESTful API, which makes debugging and everyday use much simpler. It offers more than 10 official SDKs and more than 20 community SDKs, covering essentially every mainstream development language. The community ecosystem is very active, the documentation is complete, and ecosystem components are plentiful. Integrating the SDKs and ecosystem components reduces the amount of code to write and greatly speeds up development, so urgent business requirements can be developed and launched efficiently.

Explosive growth of business data: how to scale out quickly
As epidemic prevention and control spread across the country, the number of provinces using the health code grew rapidly. To date the code has been scanned 1.6 billion times and covers 900 million users. Handling the fast-growing volume of queries poses a great challenge to the data storage system.
Tencent Cloud ES has a distributed architecture. A partitioning algorithm divides the data of an index into multiple shards, which are distributed evenly across the nodes of the cluster. By adding nodes and shards, the write and query throughput of an index can be scaled linearly, which a traditional single-instance database cannot offer.
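The sharding described above can be sketched as follows: a document's routing key is hashed and taken modulo the number of primary shards, and the shards are spread evenly over the nodes. Elasticsearch uses a murmur3 hash of the routing value; `zlib.crc32` stands in for it here, and the round-robin placement and node names are simplifying assumptions.

```python
import zlib
from collections import Counter


def shard_for(routing_key, num_primary_shards):
    """Pick a shard as hash(routing_key) mod shard count.

    crc32 is a stand-in for the murmur3 hash that Elasticsearch uses.
    """
    return zlib.crc32(routing_key.encode("utf-8")) % num_primary_shards


def assign_shards(num_primary_shards, nodes):
    """Spread shards over nodes round-robin (a simplified allocator)."""
    return {
        shard: nodes[shard % len(nodes)]
        for shard in range(num_primary_shards)
    }


# Six shards on three hypothetical nodes: two shards per node.
small = assign_shards(6, ["node-1", "node-2", "node-3"])
# The same six shards on six nodes: one shard per node.
large = assign_shards(6, ["node-%d" % i for i in range(1, 7)])

print(Counter(small.values()))
print(Counter(large.values()))
print(shard_for("user-42", 6))
```

Doubling the node count halves the number of shards each node serves, which is why throughput can grow with the cluster as long as there are enough shards to spread.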
 
The data in the health code system keeps growing as the epidemic develops, and its final size is hard to predict, so ES storage space has to be expandable in a flexible way. On a self-built cluster, upgrading a node's configuration usually means buying and installing new storage devices, or adding new nodes to the cluster and waiting for data to migrate from the old nodes. Depending on the size of the cluster's data, this process typically takes hours to days.

Tencent Cloud ES is built on the cloud's IaaS layer, namely CVM instances and CBS cloud disks, which gives it a degree of storage-compute separation. Storage can be expanded dynamically, and the expansion is completely transparent to the ES nodes. For growing data sets like that of the health code, a storage expansion that used to take hours or days now takes seconds. All cluster change operations can be performed in the Tencent Cloud console, which greatly reduces the complexity of cluster configuration changes and frees backend engineers from heavy operations work.

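High-availability architecture of the Tencent Cloud ES service
In epidemic prevention and control, no link in the chain can be allowed to fail. The storage system must provide uninterrupted service 24/7 and high availability to keep the whole health code system stable.

Tencent Cloud ES supports multi-availability-zone cluster disaster recovery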

 
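The Tencent Cloud ES service supports multi-availability-zone disaster recovery. When one availability zone becomes unavailable because of a power, network or other failure in its data center, the nodes in another availability zone continue to provide stable, uninterrupted service, safeguarding the reliability of the customer's business.

This also builds on the distributed design of ES. When a user chooses a Tencent Cloud ES cluster with multi-availability-zone disaster recovery, the system deploys nodes for the user in multiple availability zones and spreads them evenly across the data centers of those zones. Index data can be split into shards, and replicas can be configured. Under the ES shard allocation rules, all shards and replicas are distributed evenly over all nodes. This guarantees that, if the configured number of replicas matches the number of availability zones, the shards on the remaining nodes still form a complete copy of the data when a node or even an entire availability zone becomes unavailable. Primary and replica shards switch over automatically, and the cluster continues to serve writes and queries.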
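The availability argument can be checked with a small simulation. The sketch below places every copy of a shard (primary plus replicas) in a different zone and then tests whether the surviving zones still hold all shards after one zone is lost. The placement rule, zone names and function names are illustrative assumptions, not the allocator used by Tencent Cloud ES.

```python
def place_copies(num_shards, copies_per_shard, zones):
    """Return {zone: set of shard ids}, one copy of a shard per zone at most.

    copies_per_shard counts the primary plus its replicas.
    """
    if copies_per_shard > len(zones):
        raise ValueError("cannot put every copy of a shard in a distinct zone")
    placement = {zone: set() for zone in zones}
    for shard in range(num_shards):
        for copy in range(copies_per_shard):
            # Rotate the starting zone per shard to balance the load.
            zone = zones[(shard + copy) % len(zones)]
            placement[zone].add(shard)
    return placement


def survives_loss_of(placement, lost_zone, num_shards):
    """True if the remaining zones together still hold every shard."""
    remaining = set()
    for zone, shards in placement.items():
        if zone != lost_zone:
            remaining |= shards
    return remaining == set(range(num_shards))


zones = ["zone-a", "zone-b"]  # hypothetical zone names

# Primary plus one replica, spread over two zones.
with_replica = place_copies(4, 2, zones)
print(all(survives_loss_of(with_replica, z, 4) for z in zones))

# No replica: losing a zone loses the shards stored only there.
without_replica = place_copies(4, 1, zones)
print(all(survives_loss_of(without_replica, z, 4) for z in zones))
```

With at least two copies of every shard in distinct zones, any single zone can fail and a complete data set remains; with a single copy it cannot.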
 
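Epidemic prevention agencies and staff need to stay up to date on the prevention and control situation every day, so they run summary and analysis queries on the data at unpredictable times. With nationwide volumes of epidemic data, however, a carelessly written aggregation query can easily force a node to bucket and sort large amounts of data in memory. The node then runs out of memory (OOM), which can cascade into an avalanche that takes down the node or even the entire cluster.

To prevent this kind of cluster outage, Tencent Cloud ES added to its storage kernel a circuit-breaking and rate-limiting mechanism based on actual memory usage. When the cluster detects that JVM usage on some nodes exceeds the configured circuit-breaker threshold, it degrades service and rejects a progressively larger share of query requests. Once JVM usage exceeds 95%, the breaker trips fully and blocks all query requests. This interception covers both the REST layer and the Transport layer. Moving the breaker forward to the REST layer lets requests be rejected as early as possible, which reduces query pressure on the cluster while the breaker is active. These measures safeguard service availability and query performance for the health code mini program under massive query load.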
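One way to picture the graded interception is as a rejection fraction that rises with heap usage. The sketch below is an assumption-laden illustration: the 95% hard limit comes from the article, while the 85% soft threshold, the linear ramp and the function names are hypothetical and do not describe the actual Tencent Cloud ES kernel.

```python
import random


def reject_fraction(heap_used_ratio, soft_limit=0.85, hard_limit=0.95):
    """Share of query requests to reject at a given JVM heap usage.

    soft_limit (0.85) and the linear ramp are assumptions; only the
    95% full-trip point is taken from the article.
    """
    if heap_used_ratio >= hard_limit:
        return 1.0  # breaker fully tripped: block every query
    if heap_used_ratio <= soft_limit:
        return 0.0  # healthy: admit every query
    # Degraded: reject a share that grows as usage nears the hard limit.
    return (heap_used_ratio - soft_limit) / (hard_limit - soft_limit)


def admit(heap_used_ratio, rng=random.random):
    """Decide at the entry (REST) layer whether to admit one request."""
    return rng() >= reject_fraction(heap_used_ratio)


for usage in (0.50, 0.90, 0.99):
    print(usage, round(reject_fraction(usage), 2), admit(usage))
```

Making the decision before a request is parsed and dispatched to data nodes is what keeps rejected queries cheap, which matches the article's point about moving the breaker forward to the REST layer.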

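Data security: nothing left to chance
Data breaches have been frequent in recent years, both in China and abroad. The Tencent Cloud ES service used by the health code system has been hardened to the highest level on security. It supports configurable blacklists and whitelists for internal and external network access, and it supports cluster permission authentication, which greatly improves data security. In addition, clusters belonging to different users are network-isolated through VPC, eliminating the risk of potential intrusion by attackers.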


 
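Tencent Cloud ES supports incremental data backup based on COS. Using the native ES index lifecycle management feature, users can schedule incremental backups of the underlying data files to Tencent Cloud Object Storage (COS). When needed, a backup can be restored to any cluster at any time, which keeps the data safe and reliable and leaves nothing to chance.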
 
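Conclusion: The Tencent epidemic prevention health code is now widely used at epidemic checkpoints in airports, railway stations, highways, commercial districts, residential communities and similar places. It helps health and epidemic prevention departments screen suspected patients comprehensively and curb the rapid spread of the epidemic. That the health code can support billion-scale data access so stably and securely owes much to the technology of Tencent Cloud ES in search and query, high concurrency, elastic scaling and security. Going forward, Tencent Cloud will keep refining its technology and products in response to user needs, to provide more users with a stable, secure and reliable Elasticsearch service.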

Origin blog.csdn.net/xiaoyaGrace/article/details/105363989