Summary of Distributed Database Learning

This is a summary of my recent study of distributed databases (limited to relational databases).
  • When should a distributed database be considered?

    Although a distributed database may seem like a cool, perfect solution for all mass storage and read problems, it is inevitably more troublesome to build than a stand-alone database. So when do you actually need to consider one? Based on my past experience, if your system/application is smaller than the following scale, you do not need to consider a distributed database.

    ■ Enterprise-level applications

    The enterprise-level applications I have personally worked on, such as data warehouses, import data from dozens of systems every day and distribute data to dozens of systems at the same time. A system of this size, running for 7 or 8 years, needed nothing more than a single Oracle database; of course, its real-time requirements were not high. Other systems I have seen with large amounts of data also use an Oracle or DB2 cluster (usually 2 servers), and that has been enough for more than 10 years.

    ■ Internet applications

    Although I do not have much Internet experience, a site on the scale of iteye.com, according to an article shared by its former webmaster, only uses caching, not a distributed database. So if, in the long run, your system/application is smaller than or comparable to the above two, you do not need to consider a distributed database.

  • Enterprise-level database solutions

    Enterprise-level databases such as Oracle, SQL Server, and DB2 each have their own cluster solution. However, a cluster is not a distributed database in the true sense; it only solves database load balancing under high availability (HA). Its biggest disadvantage is that every node is redundant, meaning the data in each database is exactly the same. Because the data is identical, the load problem is easy to solve, but once the amount of data rises to a certain level, it still puts great pressure on every database in the cluster.
    Even so, just as with the "when should a distributed database be considered" question above, your system/application may run for 10 years without ever hitting this problem, because enterprise-level databases perform very well and their performance is also easy to extend through hardware. To exaggerate only slightly, some people will never encounter a situation that truly requires a distributed database in their entire IT career.

    The most important concept of a distributed database is sharding, the most complex and challenging of which is the horizontal splitting of table data. Such features are gradually appearing in enterprise-level databases.

    For example, SQL Server:
    http://www.infoq.com/cn/news/2011/02/SQL-Sharding
    but the cost is very high.

    Oracle:
    http://www.eygle.com/archives/2015/11/2015_oow_oracle_sharding.html
    is just getting started. As the article mentions, this requirement does not come up very often in Oracle's typical application scenarios.
    DB2: there is almost no information available; I will investigate later.

    One of the hallmarks of enterprise databases is that they are expensive, and with that expense comes very high performance. Therefore, if you are deploying enterprise-level applications, the general approach should be to improve the performance of a single database (database design + program optimization + hardware upgrades), and only use a cluster as a last resort.
    If you want to split table data horizontally, the number of databases will inevitably grow and the cost will increase greatly; at that point you should no longer be considering enterprise-level databases.

  • Currently popular distributed database solutions

    Distributed databases are currently used most widely by Internet companies -- and of course, only by large Internet companies. iteye.com, for example, is also an Internet company, but it has no need for a distributed database, while many BAT applications must use one because they serve hundreds of millions of users.

    When you need hundreds of servers, people generally choose free databases such as MySQL, so most of the articles on the Internet describe solutions for MySQL.

    For MySQL, the solutions generally fall into the following categories.
    The first category: multi-database redundancy solutions.
        Representative product: Galera. http://galeracluster.com/
        The main feature of this type of product is that it provides multiple masters, with real-time, delay-free synchronization between them. But every database holds all of the data, so it is fully redundant.
        This type of product does not provide horizontal table splitting, so it is similar to the enterprise-level clusters mentioned above; its biggest selling point is saving money. A small sketch of how a client might use such a cluster follows this paragraph.
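
        A minimal sketch (my own illustration, not from the article) of how a client could spread work across such a fully redundant multi-master cluster: every node holds all the data, so any node can serve any request. The JDBC URLs, credentials, and table name are hypothetical; real deployments usually put a proxy or load balancer in front instead.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Minimal round-robin access to a fully redundant multi-master (Galera-style) cluster.
// Every node holds all the data, so any node can serve any read or write.
public class MultiMasterRouter {
    private final List<String> jdbcUrls;                 // one URL per master node (placeholders)
    private final AtomicInteger next = new AtomicInteger();

    public MultiMasterRouter(List<String> jdbcUrls) {
        this.jdbcUrls = jdbcUrls;
    }

    /** Pick the next master in round-robin order and open a connection to it. */
    public Connection getConnection(String user, String password) throws SQLException {
        String url = jdbcUrls.get(Math.floorMod(next.getAndIncrement(), jdbcUrls.size()));
        return DriverManager.getConnection(url, user, password);
    }

    public static void main(String[] args) throws SQLException {
        MultiMasterRouter router = new MultiMasterRouter(List.of(
                "jdbc:mysql://node1:3306/app",   // hypothetical node addresses
                "jdbc:mysql://node2:3306/app",
                "jdbc:mysql://node3:3306/app"));
        try (Connection c = router.getConnection("app", "secret")) {
            c.createStatement().executeUpdate(
                    "INSERT INTO t_order(id, amount) VALUES (1, 99.9)"); // hypothetical table
        }
    }
}
```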

    The second category: horizontal table splitting
        When the amount of data grows so large that a cluster can no longer cope, vertical splitting -- that is, splitting into separate databases -- is generally applied first: different business functions are assigned to different databases (a minimal routing sketch follows this block).
        There is also a view that vertical splitting means splitting the columns of one table and assigning them to different tables.
        Although vertical splitting can separate database load according to business requirements, if the tables in any single database still grow too large, another approach is needed:
        horizontal splitting, in which the rows of one table are distributed, according to some rule, across tables in different databases. This is generally called sharding.
        MySQL provides its own sharding solution: MySQL Cluster (despite the "Cluster" in the name, it also provides sharding).
        http://www.mysql.com/products/cluster/features.html
        It is very powerful, but it has some limitations; for example, the InnoDB storage engine cannot be used -- tables must use the NDB engine.
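
        To make vertical splitting concrete, here is a rough sketch (my own example, not taken from any of the products above) that routes each business module to its own database; the module names and JDBC URLs are made up.

```java
import java.util.Map;

// Vertical splitting: each business module gets its own database.
// Horizontal splitting (sharding) would further split one table's rows across databases.
public class VerticalSplitRouter {
    // Hypothetical mapping from business module to the database that owns its tables.
    private static final Map<String, String> MODULE_TO_DB = Map.of(
            "user",    "jdbc:mysql://db-user:3306/user_db",
            "order",   "jdbc:mysql://db-order:3306/order_db",
            "payment", "jdbc:mysql://db-pay:3306/payment_db");

    /** Return the database that holds the tables of the given module. */
    public static String databaseFor(String module) {
        String url = MODULE_TO_DB.get(module);
        if (url == null) {
            throw new IllegalArgumentException("unknown module: " + module);
        }
        return url;
    }

    public static void main(String[] args) {
        System.out.println(databaseFor("order"));   // -> jdbc:mysql://db-order:3306/order_db
    }
}
```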

    The third category: each company's own custom sub-database / sub-table scheme
        Each company builds its own customized scheme, and they all follow almost the same idea: vertical splitting (separate databases) plus horizontal splitting (separate tables).
        There is a lot of material on the Internet, and many companies have shared theirs. But because this really is valuable technology, everyone's sharing only lets you "taste" it: the general idea is shared, but the details
        never are. So after reading a lot of material you will find the ideas are all similar, and the rest depends on how you implement them.
        This is a good share from the Yunqi community, with quite a lot of detail: https://yq.aliyun.com/edu/lesson/44?spm=5176.100242.lessonh1.36.loWn19

    The fourth category: using cloud services
        Most large cloud service providers offer distributed database services. If you want to save cost and reduce risk, this is definitely a good option.

  • My own thoughts

    If one day I face a distributed database scenario and MySQL's existing solutions cannot be used, how would I design the architecture?
    Roughly like this:

    [Architecture diagram: clients connect to a blue "distributed manager", which forwards requests to green "distributed managers", which in turn connect to the actual databases.]

    The blue "distributed manager" should be a simulation of a database connection, or a proxy. It has nothing to do with any specific programming language; in short, it behaves like a database, and a client (whether Java, C#, or PHP) connecting to it sees no difference from connecting to an ordinary database.

    The blue "distributed manager" then sends each request on to a green "distributed manager", which is responsible for connecting to the databases and for the actual data processing. A conceptual sketch of this two-level design follows.
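
    The following is only a conceptual sketch of that idea under my own assumptions: the blue manager exposes a database-like entry point to the client and delegates to a green manager, which is the component that would actually hold the database connections. The interface and class names are invented for illustration.

```java
import java.util.List;

// Conceptual sketch of the two-level "distributed manager" design.
// The blue manager looks like an ordinary database to the client;
// the green managers hold the real database connections.
interface DistributedManager {
    String execute(String sql);   // simplified: a real implementation would return result sets
}

// Green: connects to one physical database and runs the SQL there.
class GreenManager implements DistributedManager {
    private final String jdbcUrl;                 // hypothetical target database
    GreenManager(String jdbcUrl) { this.jdbcUrl = jdbcUrl; }

    @Override
    public String execute(String sql) {
        // In a real implementation: borrow a JDBC connection to jdbcUrl and run sql there.
        return "executed on " + jdbcUrl + ": " + sql;
    }
}

// Blue: accepts client requests as if it were a database, then delegates to green managers.
class BlueManager implements DistributedManager {
    private final List<DistributedManager> downstream;
    BlueManager(List<DistributedManager> downstream) { this.downstream = downstream; }

    @Override
    public String execute(String sql) {
        // Routing (which shard to hit) is discussed below; here we simply pick the first node.
        return downstream.get(0).execute(sql);
    }
}

public class ManagerDemo {
    public static void main(String[] args) {
        DistributedManager blue = new BlueManager(List.of(
                new GreenManager("jdbc:mysql://shard0:3306/app"),
                new GreenManager("jdbc:mysql://shard1:3306/app")));
        System.out.println(blue.execute("SELECT * FROM t_user WHERE id = 42"));
    }
}
```

    Because blue and green share one interface here, either role can sit behind the other, which is exactly the point made further below about the two being the same component in different configurations.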

    ▶ ACID Satisfaction
    A: Atomicity. Because data storage relies on off-the-shelf database products, atomicity is not a problem.
    C: Consistency, and I: Isolation. Because the data is spread across different databases, the databases' own transaction functionality cannot be used directly. With a large number of databases, even two-phase commit can increase load and instability. Therefore, the distributed manager must control transactions itself: every data record must carry some fields holding transaction-control information, such as a transaction ID and a transaction status, and reads must also consult this transaction information.
    If a transaction updates 3 databases and the data-preparation phase succeeds but an error occurs in the commit phase (e.g. a deadlock, a network failure, a machine failure), how do we ensure the integrity of the transaction?
    Personally, I like the approach from the Yunqi community talk mentioned above: when the green "distributed manager" finds that a commit to a database has failed, it simply submits it again and again until it succeeds. (A minimal sketch of both ideas follows this block.)
    D: Durability. Because data storage relies on off-the-shelf database products, durability is not a problem.
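
    A minimal sketch of those two ideas -- transaction-control columns on every row, and a green manager that retries a failed commit until it succeeds. The table, columns (txn_id, txn_status), credentials, and backoff policy are my own assumptions, not taken from the shared material.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

// Sketch of the two transaction ideas above:
//  1) every row carries transaction-control columns (txn_id, txn_status);
//  2) the green "distributed manager" retries a failed commit until it succeeds.
public class ShardCommitter {

    // Hypothetical schema: every sharded table has txn_id and txn_status columns.
    // Readers only treat rows with txn_status = 'COMMITTED' as visible.
    private static final String MARK_COMMITTED =
            "UPDATE t_order SET txn_status = 'COMMITTED' WHERE txn_id = ?";

    /** Keep retrying the commit on one shard until it succeeds, with a capped backoff. */
    public static void commitOnShard(String jdbcUrl, long txnId) throws InterruptedException {
        int attempt = 0;
        while (true) {
            attempt++;
            try (Connection c = DriverManager.getConnection(jdbcUrl, "app", "secret")) {
                try (var ps = c.prepareStatement(MARK_COMMITTED)) {
                    ps.setLong(1, txnId);
                    ps.executeUpdate();
                }
                return;                                          // success: stop retrying
            } catch (SQLException e) {
                // Deadlock, network failure, machine failure, etc.: wait and try again.
                Thread.sleep(Math.min(30_000, 500L * attempt));  // simple backoff, capped at 30 s
            }
        }
    }
}
```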

    ▶ What is the difference between blue and green "distributed managers"?
    In my view they should be the same component, simply playing different roles under different configurations.
    The blue "distributed manager" handles transaction-information control, connection-count control, and so on.
    The green "distributed manager" connects to the databases, sends the SQL, and returns the results.
    So, in theory, a green node could also be replaced by a blue one, and the hierarchy could be extended level by level (although of course too many levels would be pointless).

    ▶ How is sharding implemented?
    It is best to require every table to have a fixed, named field (or make the field configurable), and then the blue "distributed manager" distributes the data to different databases according to some rule, such as taking a modulo or splitting by numeric intervals. A small routing sketch follows.
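
    A small sketch of that routing rule under my own assumptions (a configurable shard field and a fixed number of databases); it shows both rules mentioned above, modulo and numeric intervals, with a made-up interval size of 10,000.

```java
// Two possible routing rules for the shard field: modulo and numeric intervals.
public class ShardRouter {
    private final int shardCount;
    public ShardRouter(int shardCount) { this.shardCount = shardCount; }

    /** Modulo rule: hash the shard-field value and take it modulo the number of databases. */
    public int byModulo(String shardFieldValue) {
        return Math.floorMod(shardFieldValue.hashCode(), shardCount);
    }

    /** Interval rule: split by numeric ranges, e.g. 10,000 ids per database (made-up size). */
    public int byRange(long numericShardField) {
        return (int) ((numericShardField / 10_000) % shardCount);
    }

    public static void main(String[] args) {
        ShardRouter router = new ShardRouter(4);
        System.out.println(router.byModulo("A001"));  // shard-field value such as "A001"
        System.out.println(router.byRange(23_456));   // id 23456 -> interval 2 -> database 2
    }
}
```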

    ▶ How to query data that is distributed across different databases?
    This is probably the most challenging part once the data has been sharded. One option is to require that tables belonging to related business functions use the same sharding field values, so that even across multiple tables the related rows always land in the same database.
    For example, for table A and table B, the first 10,000 rows of both tables carry sharding field value A001 and the next 10,000 rows carry value A002. This at least guarantees that the first 10,000 rows of both tables are allocated to one database, and the next 10,000 rows of both tables to another single database.

    When querying, if the query condition specifies the sharding field, the query can be routed directly to one database. If it does not, every database is queried and the partial results are aggregated at the green "distributed manager" holding the largest partial result. If the combined result set needs ORDER BY, GROUP BY, or similar processing, a temporary table is created, all results are put into it, and the ORDER BY / GROUP BY is applied there. Of course, this hurts performance. (A scatter-gather sketch follows this paragraph.)
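
    The sketch below shows that scatter-gather idea under some simplifications of my own: rows are reduced to plain numbers, and an in-memory sort stands in for the temporary-table ORDER BY / GROUP BY step described above.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Scatter-gather query across shards: route directly when the shard key is known,
// otherwise query every shard and merge the partial results before sorting/grouping.
public class ScatterGatherQuery {
    // Stand-in for one shard (a green "distributed manager"): returns rows matching a query.
    interface Shard {
        List<Long> query(String sql);
    }

    private final List<Shard> shards;
    private final ExecutorService pool = Executors.newFixedThreadPool(8);
    public ScatterGatherQuery(List<Shard> shards) { this.shards = shards; }

    /** Shard key known: only one database needs to be touched. */
    public List<Long> queryByShardKey(String sql, int shardIndex) {
        return shards.get(shardIndex).query(sql);
    }

    /** Shard key unknown: fan out to every shard, gather, then merge (here: a global sort). */
    public List<Long> queryAllShards(String sql) throws Exception {
        List<Future<List<Long>>> futures = new ArrayList<>();
        for (Shard s : shards) {
            futures.add(pool.submit(() -> s.query(sql)));
        }
        List<Long> merged = new ArrayList<>();
        for (Future<List<Long>> f : futures) {
            merged.addAll(f.get());
        }
        merged.sort(Comparator.naturalOrder());   // stands in for ORDER BY over the temp table
        return merged;
    }

    public static void main(String[] args) throws Exception {
        // Two fake shards standing in for green "distributed managers".
        ScatterGatherQuery q = new ScatterGatherQuery(List.of(
                sql -> List.of(3L, 1L),
                sql -> List.of(2L, 4L)));
        System.out.println(q.queryAllShards("SELECT id FROM t_order"));   // [1, 2, 3, 4]
        q.pool.shutdown();
    }
}
```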

    Therefore, if possible, try to avoid cross-database joins: let the client query the pieces separately and combine them itself, ideally under the client's own control. This also requires the business data to be well designed in advance.

    In short, the above are just some personal assumptions; a real implementation would still be very complicated, covering transaction management, connection management, coordination among multiple "distributed managers", and SQL parsing.
 http://yananay.iteye.com/blog/2288296
