Exploring TiDB at TELD (特来电)

1. Why We Studied TiDB

The TELD big data platform was built through a combination of open-source software and in-house development, and now runs several clusters online to serve different business needs. We currently use HBase, Elasticsearch, Druid, Spark, and Flink for big data storage and computation. Big data technology is flourishing, and each technology targets its own scenarios; in practice, choosing the right one is not easy.

As more core business moved onto the big data platform, we ran into the following pain points in OLAP:

  • Applications increasingly demand SQL-based deep analysis and computation, but SQL support in our existing big data clusters is relatively weak
  • Data currently enters the clusters mainly as wide tables, which raises the cost of reconciling application data and delays releases when the underlying data changes
  • Some data warehouse ETL pipelines are built on a complex T+1 model, with high latency, and cannot respond to business changes in real time
  • Because each cluster mainly serves one specific scenario, data is often stored redundantly, which increases storage costs and also leads to data inconsistency
  • Updating historical data already written into HDFS / Druid / ES is expensive, which reduces flexibility

Big data technology develops rapidly, and we have long wanted to solve these problems by adopting new technology. We noticed that NewSQL technology now has production-ready products in use at many companies, so we decided to try introducing NewSQL into our platform to address these pain points. Let's first take a look at NewSQL.

 

Figure 1

As Figure 1 shows, databases have evolved from RDBMS through NoSQL to today's NewSQL, with different products for each generation. Behind each generation lies landmark theory: Google's GFS paper in 2003 and the BigTable paper in 2006 gave birth to the Hadoop ecosystem, and the Spanner paper in 2012 followed by the F1 paper in 2013 pointed the industry toward the future of relational databases.

With the development of big data technology, the boundary between SQL and NoSQL has in fact gradually blurred; for example, HBase now has Phoenix on top of it, and there are HiveSQL, SparkSQL, and so on. Some even hold that NewSQL = SQL + NoSQL. Each technology has its best-fit scenarios. Spanner and F1 are considered the first NewSQL systems to serve production environments, and the main open-source products based on their concepts are CockroachDB and TiDB. Weighing community activity, reference cases, and technical support, we decided to introduce TiDB as our NewSQL technology.

2. Introduction to TiDB

a) Overview

TiDB is an open-source distributed HTAP database designed by PingCAP, inspired by Google's Spanner and F1 papers, combining the best features of traditional RDBMS and NoSQL. TiDB is compatible with MySQL, supports unlimited horizontal scaling, and provides strong consistency and high availability. TiDB's design goal is to cover 100% of OLTP scenarios and 80% of OLAP scenarios; more sophisticated OLAP analysis can be done through the TiSpark project.

b) Overall architecture

 

 

  • TiDB Server

   TiDB Server is responsible for receiving SQL requests, processing the SQL-related logic, finding the TiKV address that stores the required data through PD, interacting with TiKV to fetch the data, and returning the final result. TiDB Server is stateless: it does not store data itself and is only responsible for computation, so it can scale horizontally without limit and expose a unified external access address through a load-balancing component (such as LVS, HAProxy, or F5).
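Because every TiDB Server instance is stateless, a load balancer can send any request to any instance. A minimal round-robin sketch in Python (the server addresses are made up; in production a component such as LVS, HAProxy, or F5 plays this role):

```python
from itertools import cycle

# Hypothetical TiDB Server addresses behind the load balancer.  Any
# instance can serve any request because TiDB Server stores no data.
TIDB_SERVERS = ["10.0.0.1:4000", "10.0.0.2:4000", "10.0.0.3:4000"]

class RoundRobinBalancer:
    """Toy round-robin selection over stateless TiDB Server instances."""

    def __init__(self, servers):
        self._it = cycle(servers)

    def next_server(self):
        return next(self._it)

balancer = RoundRobinBalancer(TIDB_SERVERS)
picks = [balancer.next_server() for _ in range(4)]
# The 4th pick wraps around to the first server again.
```

Statelessness is what makes this trivial: no session needs to stick to a particular server, so scaling out is just adding addresses to the pool.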

  • PD Server

 Placement Driver (PD) is the management module of the entire cluster. Its job is threefold: first, storing the cluster's metadata (which TiKV node stores which Key); second, scheduling and load-balancing the TiKV cluster (such as data migration and Raft group leader transfer); third, allocating globally unique, monotonically increasing transaction IDs.
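PD's third job, allocating globally unique and increasing transaction IDs, can be sketched as a toy timestamp oracle in Python. This is a deliberate simplification for illustration only: the real PD allocates hybrid physical/logical timestamps in batches and persists its state through Raft.

```python
import threading

class TimestampOracle:
    """Toy sketch of PD's global timestamp allocation: every call returns
    a strictly larger ID, so transactions can be ordered cluster-wide."""

    def __init__(self):
        self._lock = threading.Lock()
        self._last = 0

    def next_ts(self):
        # The lock guarantees uniqueness and monotonicity under concurrency.
        with self._lock:
            self._last += 1
            return self._last

tso = TimestampOracle()
a, b, c = tso.next_ts(), tso.next_ts(), tso.next_ts()
# a < b < c always holds, which is the ordering property transactions rely on.
```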

  • TiKV Server

   TiKV Server is responsible for storing data. From the outside, TiKV is a distributed Key-Value storage engine that provides transactions. The basic unit of data storage is the Region; each Region stores the data of one Key Range (a left-closed, right-open interval from StartKey to EndKey), and each TiKV node is responsible for multiple Regions. TiKV uses the Raft protocol for replication, disaster recovery, and data consistency. The Region is also the unit of replica management: the replicas of one Region on different nodes form a Raft Group. Data load balancing across TiKV nodes is scheduled by PD, again with the Region as the scheduling unit.
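The key-to-Region routing described above can be sketched in a few lines of Python. The Region boundaries below are invented for illustration; real TiKV keys are encoded byte strings, and the routing table is metadata held by PD:

```python
import bisect

# Toy routing table: Regions sorted by StartKey.  Each Region owns the
# left-closed, right-open interval [StartKey, EndKey).
REGIONS = [
    {"id": 1, "start": "",  "end": "g"},
    {"id": 2, "start": "g", "end": "p"},
    {"id": 3, "start": "p", "end": None},  # None = unbounded EndKey
]
STARTS = [r["start"] for r in REGIONS]

def locate_region(key):
    """Return the Region whose [StartKey, EndKey) interval contains key."""
    # Rightmost Region whose StartKey <= key.
    i = bisect.bisect_right(STARTS, key) - 1
    return REGIONS[i]

# locate_region("apple") -> Region 1; locate_region("g") -> Region 2
# (left-closed boundary); locate_region("zebra") -> Region 3.
```

When a Region grows too large it splits at some middle key, which in this model just means inserting one more entry into the sorted table.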

Core features

  • Highly compatible with MySQL: migrating from MySQL to TiDB is easy, usually without code changes
  • Horizontal elastic scaling: easily handles high-concurrency, massive-data scenarios
  • Distributed transactions: TiDB fully supports standard ACID transactions
  • High availability: based on Raft's majority-vote protocol, providing financial-grade strong consistency guarantees for data
  • One-stop HTAP solution: a single store handles both OLTP and OLAP, without the tedious traditional ETL process

For the details of distributed storage and distributed computing involved, refer to the TiDB official website; we will not expand on them here.

c) TiSpark

For large, complex computations, PingCAP combined TiKV with Spark from the current big data ecosystem, providing another open-source product: TiSpark. This is a clever design that makes full use of an enterprise's existing Spark cluster resources, with no need for a new cluster. TiSpark's architecture and core principles are briefly described below.

 

Figure 2

  • TiSpark integrates deeply with Spark's Catalyst engine, giving precise control over computation so that Spark can read data in TiKV efficiently; it also supports index lookups for high-speed point queries.
  • It pushes various computations down to reduce the amount of data Spark SQL has to process and thus speeds up queries, and it uses TiDB's built-in statistics to choose better query plans.
  • From a cluster point of view, TiSpark + TiDB lets users run transactional and analytical workloads directly on the same platform, without fragile and hard-to-maintain ETL, simplifying the system architecture and operations.
  • In addition, TiSpark lets users apply the tools of the Spark ecosystem to data in TiDB: for example, doing ETL and data analysis with TiSpark, using TiKV as a data source for machine learning, generating scheduled reports through a scheduling system, and so on.
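The pushdown point in the list above can be illustrated with a toy Python model: filtering inside the storage layer (as TiSpark pushes predicates down toward TiKV) ships far fewer rows across the storage/compute boundary than a full scan followed by filtering. The table and predicate here are invented for illustration:

```python
# Fake "storage layer": 10,000 rows of made-up charging records.
STORAGE = [{"id": i, "station": i % 50, "kw": i % 120} for i in range(10_000)]

def scan_then_filter(pred):
    """No pushdown: every row crosses the storage/compute boundary."""
    transferred = list(STORAGE)                 # full table shipped to compute
    return [r for r in transferred if pred(r)], len(transferred)

def pushdown_filter(pred):
    """Pushdown: the predicate runs inside storage; only matches ship."""
    transferred = [r for r in STORAGE if pred(r)]
    return transferred, len(transferred)

pred = lambda r: r["station"] == 7
rows_a, moved_a = scan_then_filter(pred)
rows_b, moved_b = pushdown_filter(pred)
assert rows_a == rows_b     # same answer...
assert moved_b < moved_a    # ...but far less data moved
```

The answers are identical; only the volume of data crossing the boundary differs, which is exactly why pushdown speeds up queries over large tables.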

3. Current Applications

Since many users have already deployed TiDB in production, we did not invest a lot of effort in testing it all over again. After a simple performance test, we set up our first TiDB cluster and tried it on our business. At present we use it mainly for offline computation and some ad-hoc query scenarios; based on subsequent usage, we will adjust the cluster size and bring more of our online applications onto it.

The current cluster configuration

 

 

Figure 3

Application architecture planning

 

 

Based on TiDB, we planned a complete data-processing flow, from data ingestion to data presentation. Since TiDB is highly compatible with MySQL, many mature tools can be used for data ingestion and UI presentation, such as Flume, Grafana, Saiku, and so on.

Application overview

Nationwide charging power statistics

Whenever a user charges at a TELD charging pile, the vehicle's BMS data, charging-pile data, ambient temperature, and other data are saved to the big data store in real time. Based on the stored charging data, we need to display the nationwide charging power over time windows: for example, the power curve of the past day, aggregated every 15 or 30 minutes. As our business has grown, the computation for this scenario has also been upgraded step by step.

 

 

At present a single table holds nearly 2 billion rows, growing by about 8 million rows per day. With TiDB, our offline analysis logic is now defined and maintained directly as SQL on our offline computing platform, which greatly improves maintenance efficiency, and the computation has also become significantly faster.
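The 15-minute aggregation in this scenario is essentially a GROUP BY over time buckets; the windowing logic can be sketched in plain Python (the sample rows are made up):

```python
from collections import defaultdict
from datetime import datetime

def bucket_15min(ts):
    """Floor a timestamp to the start of its 15-minute window."""
    return ts.replace(minute=ts.minute - ts.minute % 15, second=0, microsecond=0)

def aggregate_power(samples):
    """Sum charging power (kW) per 15-minute window.

    samples: iterable of (timestamp, kw) pairs.
    """
    totals = defaultdict(float)
    for ts, kw in samples:
        totals[bucket_15min(ts)] += kw
    return dict(totals)

samples = [
    (datetime(2019, 6, 24, 8, 3),  30.0),
    (datetime(2019, 6, 24, 8, 14), 45.0),
    (datetime(2019, 6, 24, 8, 20), 60.0),
]
curve = aggregate_power(samples)
# Two windows: 08:00 totals 75.0 kW, 08:15 totals 60.0 kW.
```

In TiDB the same bucketing is expressed directly in SQL, which is what moved this logic out of application code and onto the offline computing platform.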

Charging process analysis

As mentioned above, we already hold a massive amount of valuable data from the charging process. One way to make this data pay off is to analyze the charging process itself: for example, how the maximum charging voltage and current vary for different vehicle models under different conditions (ambient environment, battery state, etc.), and the demand-power satisfaction rate of our charging piles.

 

 

For computation over the massive historical data, we used TiSpark, running directly on our existing Spark cluster. Because we allocated relatively few resources, the jobs took somewhat longer; after talking with TiDB engineers, we learned that raising the configuration and tuning some parameters would improve performance considerably. In this scenario TiDB and TiSpark work together well and meet our business needs.

4. Summary and Issues

Best-fit application scenarios

Based on our online validation, we see the following main advantages of TiDB:

  • SQL support is much better than in our existing clusters, with greatly enhanced flexibility and functionality
  • Joins between tables are possible, reducing the complexity of building wide tables and the maintenance costs they bring
  • Historical data is easy to modify
  • High MySQL compatibility means many mature tools from the MySQL ecosystem are available (development tools, presentation, data ingestion)
  • Index-based SQL performance basically meets our offline-computation needs; for ad-hoc queries it is best suited to precise multi-dimensional queries over massive data, "one-in-a-million" style lookups
  • Using TiSpark for complex offline computation makes full use of the existing cluster, keeps a single copy of the data, and also reduces operational costs

Current positioning

Given our current situation, at this stage we mainly use TiDB for offline computation and some ad-hoc query scenarios. As our usage deepens, we will gradually consider adding more applications as well as some OLTP scenarios.

Issues found

During online validation, we also found a few issues:

  • TiDB's performance on aggregate queries over massive data under high concurrency needs improvement
  • Because TiDB does not yet make full use of a query cache, executing the same query several times in a row does not get faster
  • Some ad-hoc query scenarios degrade as data volume grows; after diagnosing together with TiDB engineers (for an open-source product, TiDB's technical support is impressive), we determined the cause is that we are not yet using SSDs. Since our applications are still few, we will consider SSDs as usage grows

Origin www.cnblogs.com/pbc1984/p/11074449.html