Splitting and merging as a way of thinking, and the difference between MySQL sharding (splitting databases and tables) and partitioned tables

I. Splitting and merging

  I have said many times: do not get fixated on a single technology, because technologies are interchangeable. What matters is the programming idea behind them. When the amount of data is large, apply the idea of splitting to make the granularity finer. When the data is too fragmented, apply the idea of merging to make the granularity coarser.

1.1 Splitting

  Many technologies apply the idea of splitting. Here are a few examples:

  • From centralized services to distributed services
  • From Collections.synchronizedMap(x) to the JDK 1.7 ConcurrentHashMap and then the JDK 1.8 ConcurrentHashMap: the lock granularity gets finer while thread safety is still guaranteed
  • From AtomicInteger to LongAdder, and the size() method of ConcurrentHashMap: spreading updates across several cells reduces CAS contention and improves the throughput of multi-threaded accumulation
  • The JVM's G1 GC algorithm divides the heap into many Regions for memory management
  • HBase's RegionServer manages data as a number of Regions
  • In everyday development, isolating resources by giving them separate thread pools
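
  The AtomicInteger-to-LongAdder item above can be sketched as follows. The class name StripedCounterDemo and the thread counts are made up for illustration; AtomicLong and LongAdder are the standard java.util.concurrent.atomic classes. Both counters give the same total; the difference is that LongAdder spreads updates over several internal cells, so threads collide less often.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical demo class: one shared CAS target versus LongAdder's striped cells.
class StripedCounterDemo {

    // Starts `threads` threads that each call `increment` `perThread` times, then waits for them.
    static void runConcurrently(int threads, int perThread, Runnable increment) {
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) {
                    increment.run();
                }
            });
            workers[i].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting", e);
            }
        }
    }

    // Every thread does CAS on the same value: correct, but contended under load.
    static long countWithAtomic(int threads, int perThread) {
        AtomicLong counter = new AtomicLong();
        runConcurrently(threads, perThread, counter::incrementAndGet);
        return counter.get();
    }

    // LongAdder spreads updates over several cells and adds them up on read.
    static long countWithAdder(int threads, int perThread) {
        LongAdder counter = new LongAdder();
        runConcurrently(threads, perThread, counter::increment);
        return counter.sum();
    }

    public static void main(String[] args) {
        System.out.println(countWithAtomic(4, 100_000));
        System.out.println(countWithAdder(4, 100_000));
    }
}
```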

1.2 Merging

  Many technologies also apply the idea of merging. Here are a few examples:

  • TLAB (Thread Local Allocation Buffer): each thread allocates from its own buffer, which avoids contention between threads and makes object allocation more efficient
  • Escape analysis: an object that does not escape can be allocated on the stack instead of the heap and is reclaimed together with the stack frame, which reduces the number of temporary objects allocated on the heap
  • The CMS GC algorithm uses mark-sweep, but it can also be configured to compact memory fragmentation, for example with -XX:+UseCMSCompactAtFullCollection (compact during a Full GC; the Stop The World pause becomes longer) and -XX:CMSFullGCsBeforeCompaction (how many Full GCs to run before compacting)
  • Lock coarsening: when the JIT finds a series of consecutive operations that repeatedly lock and unlock the same object, it widens the scope of the lock
  • Kafka has settings that batch data before sending it over the network to reduce network overhead, such as batch.size and linger.ms
  • In everyday development, calling batch query interfaces instead of fetching items one at a time
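
  The batch interface item above can be sketched like this. Everything here is hypothetical: fetchNames stands in for a remote batch interface, and remoteCalls only counts simulated round trips so the saving is visible.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class: merges many single lookups into a few batch calls.
class BatchLookupDemo {
    // Number of simulated round trips made so far.
    static int remoteCalls = 0;

    // Stand-in for a remote batch interface: one round trip returns names for many ids.
    static List<String> fetchNames(List<Integer> ids) {
        remoteCalls++;
        List<String> names = new ArrayList<>();
        for (int id : ids) {
            names.add("user-" + id);
        }
        return names;
    }

    // Splits the ids into chunks of at most batchSize and issues one call per chunk.
    static List<String> fetchInBatches(List<Integer> ids, int batchSize) {
        List<String> result = new ArrayList<>();
        for (int from = 0; from < ids.size(); from += batchSize) {
            int to = Math.min(from + batchSize, ids.size());
            result.addAll(fetchNames(ids.subList(from, to)));
        }
        return result;
    }
}
```

  With ten ids and a batch size of four this makes three calls instead of ten.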

II. Partition

  Everything in this article is based on MySQL InnoDB.

  With that said, on to the main topic, starting with partitioning. I wrote a blog post on MySQL partitioning earlier, so I will not spend extra ink on it here; for details see: https://www.cnblogs.com/GrimMjx/p/10526821.html

2.1 Implementation

  How to create partitions is covered in the post linked above. Just remember that if the table has a primary key or a unique index, the partitioning column must be part of every such key.

  This is splitting at the database level. It is transparent to the application, and no code needs to change.
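
  The unique-key rule above can be checked mechanically. The sketch below is only an illustration of the rule, not anything MySQL exposes; the class PartitionKeyRule and the column names id and create_time are made up.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper that mirrors the rule: every unique key must contain all partitioning columns.
class PartitionKeyRule {

    // uniqueKeys holds the column set of the primary key and of each unique index.
    static boolean isValid(Set<String> partitionColumns, List<Set<String>> uniqueKeys) {
        for (Set<String> key : uniqueKeys) {
            if (!key.containsAll(partitionColumns)) {
                return false;
            }
        }
        return true;
    }

    // Small convenience for building a column set.
    static Set<String> cols(String... names) {
        return new HashSet<>(Arrays.asList(names));
    }

    public static void main(String[] args) {
        // PRIMARY KEY (id), partitioned by create_time: violates the rule, prints false.
        System.out.println(isValid(cols("create_time"), Arrays.asList(cols("id"))));
        // PRIMARY KEY (id, create_time), partitioned by create_time: prints true.
        System.out.println(isValid(cols("create_time"), Arrays.asList(cols("id", "create_time"))));
    }
}
```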

2.2 Internal files

  Go to the data directory. If you do not know where it is, you can query it, for example with SHOW VARIABLES LIKE 'datadir';

  Then look at the files inside. There are two types of files, .frm files and .ibd files:

  • .frm file: the table structure file
  • .ibd file: with InnoDB, the data and the indexes are stored together in the same .ibd file. (If you see .MYD data files and .MYI index files instead, that is fine; that is the MyISAM storage engine, and the InnoDB counterpart is the .ibd file.) Because the Order table is split into five partitions, there are five such files
  • .par file: you may or may not see a .par file. Note: starting with MySQL 5.7.6, .par partition definition files are no longer created. Partition definitions are stored in the internal data dictionary.

2.3 Data Processing

  Partitioning a table can improve MySQL performance. An unpartitioned table has a single .ibd file, one large B+ tree. A partitioned table is divided into partitions according to the partitioning rules, so one large B+ tree becomes several smaller trees.

  (PS: If you want to know how many rows of data a clustered-index B+ tree can hold, see: https://www.cnblogs.com/GrimMjx/p/10540263.html )

  Read efficiency certainly improves. If the query uses the partition key, it goes to the secondary-index B+ tree of the matching partition and then to the clustered-index B+ tree of that partition.

  If the query does not use the partition key, it is executed against every partition, which causes many logical I/Os. To check which partitions a SQL statement touches during development, run explain partitions select xxxxx (in MySQL 5.7 and later, plain explain already shows the partitions column). The partitions column lists the partitions the select statement goes through.

mysql> explain partitions select * from ClientActionTrack where startTime>'2016-08-25 00:00:00' and startTime<'2016-08-25 23:59:00';
+----+-------------+-------------------+------------+------+---------------+------+---------+------+-------+-------------+  
| id | select_type | table             | partitions | type | possible_keys | key  | key_len | ref  | rows  | Extra       |  
+----+-------------+-------------------+------------+------+---------------+------+---------+------+-------+-------------+  
|  1 | SIMPLE      | ClientActionTrack | p20160825  | ALL  | NULL          | NULL | NULL    | NULL | 33868 | Using where |  
+----+-------------+-------------------+------------+------+---------------+------+---------+------+-------+-------------+  
1 row in set (0.00 sec)
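
  The pruning visible in the explain output above can be modelled in a few lines. This is a simplified sketch of range pruning, not MySQL's actual optimizer logic. The partition name p20160825 comes from the output above; the neighbouring partitions and the class PartitionPruningSketch are assumptions.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of range partitions defined by exclusive upper bounds (VALUES LESS THAN).
class PartitionPruningSketch {

    // Partition name -> exclusive upper bound, in ascending order.
    static Map<String, LocalDate> dailyPartitions() {
        Map<String, LocalDate> partitions = new LinkedHashMap<>();
        partitions.put("p20160824", LocalDate.of(2016, 8, 25));
        partitions.put("p20160825", LocalDate.of(2016, 8, 26));
        partitions.put("p20160826", LocalDate.of(2016, 8, 27));
        return partitions;
    }

    // Returns the partitions whose range [previous bound, bound) overlaps [from, to].
    // Passing null for either end models a query that does not use the partition key.
    static List<String> prune(Map<String, LocalDate> partitions, LocalDate from, LocalDate to) {
        boolean noPartitionKey = from == null || to == null;
        List<String> hit = new ArrayList<>();
        LocalDate lower = LocalDate.MIN;
        for (Map.Entry<String, LocalDate> entry : partitions.entrySet()) {
            LocalDate upper = entry.getValue();
            // Overlap test: the query starts before this bound and ends at or after the previous one.
            if (noPartitionKey || (from.isBefore(upper) && !to.isBefore(lower))) {
                hit.add(entry.getKey());
            }
            lower = upper;
        }
        return hit;
    }
}
```

  A query for 2016-08-25 only touches p20160825, while a query without the partition key touches all three partitions.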

III. Sharding (splitting databases and tables)

  As time passes and the business grows, the tables hold more and more data, and the volume of data operations grows with it. A single physical machine has limited resources, so the amount of data it can hold and its processing capacity are ultimately limited. At that point sharding is used to carry the kind of very large table that does not fit on a single machine.

  Sharding is different from partitioning. Partitions usually live on a single machine, and range partitioning by time is the most common choice because it makes archiving easy. Sharding has to be implemented in code, whereas partitioning is implemented inside MySQL. The two do not conflict and can be used together.

3.1 Implementation

3.1.1 When to shard

  • Storage footprint of 100 GB or more
  • Daily data growth of 2 million rows or more
  • 100 million rows or more in a single table

3.1.2 Sharding field

  The choice of sharding field is very important. It should meet two conditions:

  1. It is a query field in most scenarios
  2. It is numeric

  userId is the usual choice, since it satisfies both conditions.
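
  A numeric sharding field makes routing a matter of simple arithmetic. Below is a minimal sketch of modulo routing, assuming a layout of dbCount databases with tablesPerDb tables each. The class ShardRouter and the db_NN naming are hypothetical; the base_info_NN table names follow the file listing in section 3.3. Real middleware may use other algorithms, such as ranges or consistent hashing.

```java
import java.util.Locale;

// Hypothetical router: maps a userId to one physical table by modulo.
class ShardRouter {

    // Returns "db_XX.base_info_YY" for the given userId.
    static String route(long userId, int dbCount, int tablesPerDb) {
        int slots = dbCount * tablesPerDb;
        // floorMod keeps the slot non-negative even for a negative id.
        int slot = (int) Math.floorMod(userId, (long) slots);
        int db = slot / tablesPerDb;
        int table = slot % tablesPerDb;
        return String.format(Locale.ROOT, "db_%02d.base_info_%02d", db, table);
    }

    public static void main(String[] args) {
        System.out.println(route(5L, 2, 2)); // db_00.base_info_01
        System.out.println(route(6L, 2, 2)); // db_01.base_info_00
    }
}
```

  Because most queries carry userId, the router can send each of them to exactly one table instead of querying every shard.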

3.2 Distributed Database Middleware

  Distributed database middleware comes in two kinds: proxy architecture and client architecture. Proxy-mode products include MyCat and DBProxy; client-architecture products include TDDL and Sharding-JDBC. So what is the difference between the two, and what are the advantages and disadvantages of each? The two modes work as follows.

  In proxy mode, our select and update statements are sent to the proxy, and the proxy operates on the underlying databases. This requires the proxy itself to be highly available. Otherwise the databases may be fine, but once the proxy goes down, the whole service goes down with it.

  Client mode usually adds a layer of wrapping around the connection pool. The client holds connections to the different databases internally, and the SQL is handed to the smart client for processing. It usually supports only one language; to use it from other languages, a client has to be developed for each of them.

  The advantages and disadvantages of each are as follows:

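3.3 Internal files

  Here is an example that combines sharding with partitioning. It is basically the same as a partitioned table, except that there are .ibd files for many more tables. The file types are explained above: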

[miaojiaxing@Grim testmydata]# ls | grep 'base_info'
base_info_00.frm
base_info_00#P#p_2018.ibd
base_info_00#P#p_2019.ibd
base_info_00#P#p_2020.ibd
base_info_00#P#p_2021.ibd
base_info_00#P#p_init.ibd
base_info_00#P#p_max.ibd
base_info_01.frm
base_info_01#P#p_2018.ibd
base_info_01#P#p_2019.ibd
base_info_01#P#p_2020.ibd
base_info_01#P#p_2021.ibd
base_info_01#P#p_init.ibd
base_info_01#P#p_max.ibd
base_info.frm
base_info.ibd

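3.4 Problems

3.4.1 Transactions

  Once the data is sharded, distributed transactions are unavoidable: how do we guarantee that multiple records inserted into different databases either all succeed or all fail? Some readers may think of XA, but XA performs poorly and needs MySQL 5.7. Flexible transactions are the mainstream approach today, and the TCC pattern is one of them.

  Every company has its own implementation for distributed transactions: Huawei uses saga, Alibaba uses TXC, and Ant Financial uses DTX, which supports both FMT mode and TCC mode.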
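
  To make the TCC (Try-Confirm-Cancel) idea concrete, here is a minimal in-memory sketch. It is not how TXC, DTX, or any other product is implemented. All names (TccParticipant, TccCoordinator, AccountShard) are hypothetical, and a real implementation also needs a persistent transaction log, retries, and idempotent confirm and cancel.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical participant contract for the TCC pattern.
interface TccParticipant {
    boolean tryReserve(); // reserve the resource without making the change final
    void confirm();       // make the reserved change permanent
    void cancel();        // release the reservation
}

// Hypothetical coordinator: confirm everything only if every try succeeded.
class TccCoordinator {
    static boolean execute(List<TccParticipant> participants) {
        List<TccParticipant> reserved = new ArrayList<>();
        for (TccParticipant participant : participants) {
            if (participant.tryReserve()) {
                reserved.add(participant);
            } else {
                // One try failed: undo the reservations made so far.
                for (TccParticipant done : reserved) {
                    done.cancel();
                }
                return false;
            }
        }
        for (TccParticipant participant : participants) {
            participant.confirm();
        }
        return true;
    }
}

// In-memory stand-in for an account row that lives in one shard.
class AccountShard implements TccParticipant {
    long balance;
    long frozen;
    final long amount;

    AccountShard(long balance, long amount) {
        this.balance = balance;
        this.amount = amount;
    }

    public boolean tryReserve() {
        if (balance < amount) {
            return false;
        }
        balance -= amount; // move the money into the frozen column
        frozen += amount;
        return true;
    }

    public void confirm() {
        frozen -= amount;  // the deduction becomes final
    }

    public void cancel() {
        frozen -= amount;  // give the frozen money back
        balance += amount;
    }
}
```

  If the second account cannot reserve its amount, the first one is cancelled and its balance returns to the original value, so the two shards either both change or both stay as they were.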

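3.4.2 Joins

  TDDL, MyCAT and similar middleware support cross-shard joins. Still, try hard to avoid cross-database joins, for example by storing redundant copies of fields.

  If a cross-shard join cannot be avoided and the middleware supports it, it can be used that way. If the middleware does not support it, query each side manually and combine the results in the application.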
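
  A manual join usually means: query one side, collect the join keys, fetch the other side in one batch, and combine the rows in memory. The sketch below uses plain collections in place of real shard queries; the class ManualJoinSketch and the order and user data are made up.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical application-level join across two shards.
class ManualJoinSketch {

    // Stand-in for "SELECT id, name FROM user WHERE id IN (...)" against the user shard.
    static Map<Long, String> findUserNames(Map<Long, String> userShard, Set<Long> ids) {
        Map<Long, String> found = new HashMap<>();
        for (Long id : ids) {
            String name = userShard.get(id);
            if (name != null) {
                found.put(id, name);
            }
        }
        return found;
    }

    // Each order is {orderId, userId}. Returns "orderId:userName" rows with inner-join semantics.
    static List<String> joinOrdersWithUsers(List<long[]> orders, Map<Long, String> userShard) {
        Set<Long> userIds = new LinkedHashSet<>();
        for (long[] order : orders) {
            userIds.add(order[1]);
        }
        // One batch lookup instead of one query per order.
        Map<Long, String> names = findUserNames(userShard, userIds);
        List<String> rows = new ArrayList<>();
        for (long[] order : orders) {
            String name = names.get(order[1]);
            if (name != null) {
                rows.add(order[0] + ":" + name);
            }
        }
        return rows;
    }
}
```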

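IV. Summary

  Sharding and partitioning serve different purposes. Sharding is for carrying very large tables, the kind that does not fit on a single machine. Partitions usually live on a single machine, and range partitioning by time is the most common choice because it makes archiving easy. In terms of performance and stability the two are about the same, since both end up as individual sub-tables. The difference is that partitioning is implemented inside MySQL, so it involves somewhat less data exchange than a sharding solution.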


Origin www.cnblogs.com/GrimMjx/p/11772033.html