Interview Rapid-Fire Series, Part 3: MySQL Database and Table Sharding

1. Interview questions

Why shard databases and tables (that is, when designing a high-concurrency system, how do you design the database layer)? Which sharding middleware have you used? What are the advantages and disadvantages of the different middleware options? How exactly did you split your databases, vertically or horizontally?


2. What the interviewer is thinking

This question is really an extension of the high-concurrency topic, because databases and tables are sharded precisely to handle two problems: high concurrency and large data volumes.

And frankly, in interviews today, especially at Internet companies, you will almost certainly be asked about common techniques like sharding. It would be odd if they did not ask, and there is no excuse for not knowing the answer!


3. Analysis of the questions

(1) Why shard databases and tables? (When designing a high-concurrency system, how do you design the database layer?)

To put it plainly, splitting databases and splitting tables are two different things, so do not confuse them. You might split databases without splitting tables, or split tables without splitting databases; either is possible. Let me start with a scenario.

Suppose we are a small start-up (or a newly formed department at a BAT company). We have 200,000 registered users and 10,000 daily active users, a single table grows by 1,000 rows per day, and peak load is at most 10 concurrent requests per second. A system like this can be built by anyone with a few years of experience leading a few people fresh out of training.



Then, to our surprise, luck is on our side: we have a CEO who leads us onto the right road, and the business takes off. A few months later we have 20 million registered users and 1 million daily active users, a single table grows by 100,000 rows per day, and peak load reaches 1,000 requests per second!

Along the way the company also closes two rounds of financing and raises several hundred million yuan! Its valuation reaches a staggering several hundred million dollars! This is the pace of a small unicorn!

Now we start to feel some pressure. Why? At 100,000 new rows per day, a table grows by about 3 million rows per month. Our single table already holds several million rows and will soon pass ten million. Still, we can just about hold on. Peak load is 1,000 requests per second; we have deployed several machines online behind load balancing, and the database can still cope with 1,000 QPS.



But we are starting to worry. What do we do next?

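Over the next few months, the CEO outdoes himself again: the company reaches 100 million users and raises several billion yuan more. Its valuation hits an astonishing several billion dollars, and it becomes the hottest start-up in the country this year. We are incredibly lucky.

But we are also unlucky, because by now we have over ten million daily active users, a single table grows by as much as 500,000 rows per day, and one table already holds 20 to 30 million rows. It cannot cope! Database disk space keeps being consumed, and peak concurrency reaches an astonishing 5,000 to 8,000 requests per second. Seriously, I guarantee your system would not have survived this long; it would already be down.

By now you should more or less understand what sharding is about. It follows your company's business growth: the better the business does, the more users you have, the more data and the more requests, and a single database is bound to buckle.

Say a single table holds tens of millions of rows. Are you sure it can cope? It cannot: a table that large severely hurts SQL execution performance, and eventually your queries may become very slow. In my experience, performance starts to degrade once a single table reaches several million rows, and that is when you need to split the table.

What does splitting a table mean? It means spreading one table's data across multiple tables, so that each query only needs to touch one of them.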

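For example, split by user id so that all of a given user's data lives in one table; any operation for that user then goes to that one table. This keeps each table's size within a manageable range, say under 2 million rows per table.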
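
Splitting by user id can be sketched in a few lines. This is only an illustration: the table count of 4, the `user_data_` prefix, and the use of a plain modulo are assumptions, not something the scenario above prescribes.

```python
TABLE_COUNT = 4  # hypothetical number of physical tables


def table_for_user(user_id: int, table_count: int = TABLE_COUNT) -> str:
    """Return the one physical table that holds every row for this user."""
    # The same user id always maps to the same table, so any operation
    # for that user touches exactly one table.
    return f"user_data_{user_id % table_count}"


print(table_for_user(10))  # user_data_2
print(table_for_user(11))  # user_data_3
```

Because the mapping depends only on the user id, reads and writes for one user never need to scan the other tables.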



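What does splitting a database mean? In our experience a single database can sustain at most about 2,000 concurrent requests per second before it must be scaled out, and a healthy single database is best kept at around 1,000 per second. So you split one database's data across multiple databases, and each request accesses just one of them.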
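
As a rough back-of-the-envelope check, the rule of thumb above can be turned into a small calculation. The function name is hypothetical, and the sketch assumes load spreads evenly across databases, which real traffic rarely does exactly.

```python
import math


def databases_needed(peak_qps: int, healthy_qps_per_db: int = 1000) -> int:
    """Estimate how many databases keep each one near the healthy load."""
    # Assumes requests are spread evenly; 1000 per second is the
    # rule-of-thumb figure from the text, not a measured limit.
    return max(1, math.ceil(peak_qps / healthy_qps_per_db))


print(databases_needed(1000))  # 1
print(databases_needed(8000))  # 8
```

With the 5,000 to 8,000 requests per second from the scenario, this estimate lands at 5 to 8 databases.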

That is what sharding databases and tables means, and why you do it. Clear now?


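(2) Which sharding middleware have you used? What are the advantages and disadvantages of the different options?

This question checks which sharding middleware you know, what the strengths and weaknesses of each are, and which ones you have actually used.

The more common ones include cobar, TDDL, atlas, sharding-jdbc, and mycat.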

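cobar: developed and open-sourced by Alibaba's B2B team; a proxy-layer solution. It was usable in earlier years, but it has not been updated in recent years, hardly anyone uses it, and it is more or less abandoned. It also does not support read/write splitting, stored procedures, cross-database joins, or pagination.

TDDL: developed by the Taobao team; a client-layer solution. It does not support joins or multi-table queries, only basic CRUD syntax, but it does support read/write splitting. It is not widely used today, in part because it depends on Taobao's diamond configuration management system.

atlas: open-sourced by 360; a proxy-layer solution. Some companies used it in the past, but its big problem is that the community's most recent maintenance was about five years ago, so very few companies use it now.

sharding-jdbc: open-sourced by Dangdang; a client-layer solution. It has been fairly widely used because it supports a broad range of SQL syntax without many restrictions. It has now reached version 2.0 and supports database and table sharding, read/write splitting, distributed id generation, and flexible transactions (best-effort delivery transactions and TCC transactions).

Quite a few companies have adopted it (the official site lists registered users, and you can see companies using it from 2017 to the present). The community is still actively developing and maintaining it, so I consider it a viable choice today.

mycat: built on cobar; a proxy-layer solution. Its feature set is very complete, it is currently a very popular and still-growing database middleware, the community is active, and some companies have started using it. Compared with sharding-jdbc, though, it is younger and less battle-tested.

In summary, the two options worth considering today are sharding-jdbc and mycat.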

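The advantage of a client-layer solution such as sharding-jdbc is that there is nothing to deploy, operating costs are low, and there is no extra forwarding hop through a proxy layer, so performance is high. The drawback is that an upgrade requires every system to bump the version and redeploy, and every system is coupled to the sharding-jdbc dependency.

The drawback of a proxy-layer solution such as mycat is that you must deploy and operate a middleware tier yourself, so operating costs are high. The benefit is that it is transparent to every project: an upgrade only needs to be carried out on the middleware itself.

Generally both are workable, but I personally recommend sharding-jdbc for small and medium-sized companies: a client-layer solution is lightweight, cheap to maintain, and needs no extra staff, and such companies tend to have less complex systems and fewer projects.

Medium and large companies are better off with a proxy-layer solution such as mycat. They may have many systems and projects, large teams, and enough staff, so it makes sense to dedicate someone to studying and maintaining mycat while the many projects use it transparently.

In our case, the database middleware was developed in-house; we used a proxy-layer design first and later a client-layer design.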


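(3) How exactly did you split your databases, vertically or horizontally?

Horizontal splitting means spreading one table's data across multiple tables in multiple databases. Every database has the same table structure, but each database's tables hold different data, and the data across all of them adds up to the full data set.

The point of horizontal splitting is to distribute data evenly across more databases, so that multiple databases can carry higher concurrency and their combined storage capacity can be used to scale out.

Vertical splitting means breaking a table with many columns into multiple tables, possibly in multiple databases. Each resulting table has a different structure and holds a subset of the columns.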

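Typically, the small number of frequently accessed columns go into one table and the larger number of rarely accessed columns into another. The database has a cache, so the fewer columns in the frequently accessed rows, the more rows fit in the cache and the better the performance. This is mostly done at the table level.

This is quite common, and many of you have probably done it yourselves: breaking one large table apart into, say, an order table, an order payment table, and an order item table.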
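
To make the hot/cold column idea concrete, here is a minimal sketch that splits one wide order row into two narrower rows. The column names and the choice of which columns count as frequently accessed are made up for illustration.

```python
# Hypothetical set of frequently accessed columns for an order row.
HOT_COLUMNS = {"order_id", "user_id", "status", "total_amount"}


def split_order_row(row: dict) -> tuple:
    """Split a wide row into a small hot row and a larger cold row."""
    hot = {k: v for k, v in row.items() if k in HOT_COLUMNS}
    cold = {k: v for k, v in row.items() if k not in HOT_COLUMNS}
    # Both halves keep the primary key so they can be joined back together.
    cold["order_id"] = row["order_id"]
    return hot, cold


wide_row = {
    "order_id": 1001,
    "user_id": 42,
    "status": "PAID",
    "total_amount": 59.9,
    "shipping_address": "(long text)",
    "invoice_title": "(long text)",
    "remark": "(long text)",
}
hot, cold = split_order_row(wide_row)
print(sorted(hot))   # ['order_id', 'status', 'total_amount', 'user_id']
print(sorted(cold))  # ['invoice_title', 'order_id', 'remark', 'shipping_address']
```

The hot row stays small, so more of them fit in the database cache, while the bulky columns are only read when actually needed.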

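There is also splitting at the table level, that is, turning one table into N tables so that each table's size stays within a certain range and SQL performance is preserved. Otherwise, the larger a single table grows, the worse SQL performs. Around 2 million rows per table is typical, but it depends on your workload: it might be 5 million or 1 million. The more complex your SQL, the fewer rows each table should hold.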

Whether you are splitting databases or tables, all of the database middleware mentioned above can support it. Essentially, once you have sharded, the middleware takes the value of a field you specify, such as userid, routes the request to the corresponding database, and then to the corresponding table.
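
The two-step routing described here can be sketched as follows. This is a generic illustration, not how sharding-jdbc or mycat are actually implemented; the shard counts, the `db_` and `orders_` names, and the modulo scheme are all assumptions.

```python
DB_COUNT = 4        # hypothetical number of databases
TABLES_PER_DB = 8   # hypothetical number of tables inside each database


def route(user_id: int) -> tuple:
    """Map a sharding key to a (database, table) pair."""
    # One slot per physical table across the whole cluster.
    slot = user_id % (DB_COUNT * TABLES_PER_DB)
    db_index = slot // TABLES_PER_DB     # first pick the database
    table_index = slot % TABLES_PER_DB   # then the table inside it
    return f"db_{db_index}", f"orders_{table_index}"


print(route(35))  # ('db_0', 'orders_3')
print(route(42))  # ('db_1', 'orders_2')
```

The application only supplies the sharding key; the routing layer decides which database and table the statement is sent to.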

You need to think through how your own project is sharded. In general, vertical splitting is done at the table level, by splitting off some of a table's columns. With horizontal splitting, the reason is that concurrency or data volume has outgrown what one database can carry, so you have to decide which field to split on. As for splitting tables, consider that even after you split across databases and concurrency and capacity are fine, the tables within each database may still be too large; in that case you split the tables as well, so that no single table holds too much data.

There are also two ways to shard. One is by range, where each database holds a contiguous span of data, usually a time range. This is used less often because it easily creates hotspots: most traffic lands on the most recent data. The other is to hash a field so that data is spread evenly, and this is more commonly used.

The advantage of range sharding is that later expansion is easy: simply prepare a new database for each month, and when the new month arrives, writes naturally go to the new database. The drawback is that most requests access the latest data. Whether to use range sharding in production depends on the scenario: it fits when users access not only the latest data but current and historical data evenly.
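
A month-based range scheme can be sketched like this. The `orders_YYYYMM` naming is an assumption made for the example; the point is that every write dated in the current month lands in the same database, which is the hotspot described above.

```python
from collections import Counter
from datetime import date


def db_for_date(d: date) -> str:
    """Pick the database that holds the month containing this date."""
    return f"orders_{d.year}{d.month:02d}"


# Writes that arrive during February 2020 all go to one database.
writes = [date(2020, 2, 3), date(2020, 2, 14), date(2020, 2, 28)]
print(Counter(db_for_date(d) for d in writes))

# A new month needs no migration: writes simply start using a new database.
print(db_for_date(date(2020, 3, 1)))  # orders_202003
```

Expansion is just a matter of creating next month's database ahead of time, but the newest database absorbs nearly all of the traffic.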



The advantage of hash sharding is that both the data volume and the request load are spread evenly across the databases. The drawback is that expansion is troublesome, because it involves a data migration.
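
The migration cost is easy to see with a small experiment. The sketch below assumes the simplest possible scheme, a plain modulo on a numeric id, and counts how many ids would end up in a different database after the shard count changes.

```python
def moved_fraction(ids, old_count: int, new_count: int) -> float:
    """Fraction of ids whose shard changes when the shard count changes."""
    moved = sum(1 for i in ids if i % old_count != i % new_count)
    return moved / len(ids)


ids = range(8000)
print(moved_fraction(ids, 4, 8))  # 0.5
print(moved_fraction(ids, 4, 5))  # 0.8
```

Doubling the shard count moves half of the rows under this scheme, and growing by a single shard moves even more, which is why expansion is usually planned as a doubling.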



Origin blog.51cto.com/14480698/2470314