SQL or NoSQL? Choosing a Database from the Perspective of Storage Architecture Evolution

I. Introduction

Has a sudden wave of traffic nearly maxed out your database's CPU, or are you troubled by high CPU usage day after day? Are you torn between the various NoSQL options, unsure which one is the best choice? Your today is my yesterday, and that is why I wrote this article.

I have wanted to write this article for months, and it is also a topic I have long wanted to study. As Internet practitioners, we need to recognize that relational databases (MySQL, Oracle) cannot meet all of our storage requirements, so understanding each storage engine is very important when selecting the underlying storage. My work over the past while has also given me more to think about here, so I am writing up my own summary to share with everyone.

II. Structured, unstructured, and semi-structured data

The article starts with structured, unstructured, and semi-structured data, because the characteristics of the data directly affect which storage engine is the right technical choice.

First, structured data. By definition, structured data is data that is logically expressed and implemented as a two-dimensional table, with a format and length that strictly follow a specification. It is also called row data. Its characteristics: data is organized in rows, each row describes one entity, and every row has the same attributes. For example, a user table in which every row carries the same set of columns.

 

 

Relational databases therefore fit structured data perfectly; they are the primary engine for storing and managing relational data.

Unstructured data is data whose structure is irregular or incomplete, that has no predefined data model, and that is inconvenient to express as a two-dimensional table: office documents (Word), text, images, HTML, all kinds of reports, video, audio, and so on.

Semi-structured data sits between structured and unstructured data. It is a form of structured data that does not conform to the two-dimensional table model, but it carries tags that separate semantic elements and layer records and fields. Common semi-structured formats are XML and JSON, for example:

<person>
    <name>Joe Smith</name>
    <age>18</age>
    <phone>12345</phone>
</person>

This kind of structure is also called a self-describing structure.

III. Architecture evolution with relational database storage

First, let us look at how a company's system architecture evolves, stage by stage, when a relational database is the storage (since this article is about SQL and NoSQL, storage is the only entry point; middleware such as MQ and ZK is not covered):

 

 

Phase One: the company is just starting out. The setup is the simplest possible: one application server with one relational database, and every read and write goes to that database.

Phase Two: whether you use MySQL, Oracle, or another relational database, the database is usually not the first performance bottleneck. As the business grows, a single application server can no longer carry the incoming traffic, and it is also a single point of failure. So more application servers are added, with Nginx at the traffic entry point doing load balancing so that traffic is spread evenly across them.

Phase Three: as the business keeps growing, reads and writes hitting the same database start to cause performance bottlenecks. The simple fix is read/write separation: every write goes to the master, every read goes to a replica, and data is synchronized between master and replica through the binlog. This largely solves the database performance problem at this stage.

Phase Four: the company does better and better, the business keeps expanding, and even with read/write separation the pressure on the database keeps rising. If one database cannot carry the load, split it into several: shard databases and tables, splitting tables vertically and databases horizontally. Take scaling out to two databases as an example: route by some business number (for example, the transaction order number) and some rule (for example, modulo). Order numbers whose remainder modulo 2 is 0 go to database 0, and those whose remainder is 1 go to database 1, so write traffic is spread evenly across the two databases. Sharding is generally done through middleware, which simplifies connection management and data monitoring and means clients do not need to know the database IPs.
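The modulo routing rule described above can be sketched in a few lines of Python. This is a minimal illustration only: the names `NUM_SHARDS`, `shard_for`, and the in-memory `shards` dictionary are assumptions made for the example, and a real deployment would normally delegate routing to sharding middleware.

```python
NUM_SHARDS = 2  # the two-database example from the text


def shard_for(order_no: int, num_shards: int = NUM_SHARDS) -> int:
    """Return the index of the database that should hold this order."""
    return order_no % num_shards


# Hypothetical in-memory stand-ins for the two physical databases.
shards = {i: [] for i in range(NUM_SHARDS)}

for order_no in (1001, 1002, 1003, 1004):
    shards[shard_for(order_no)].append(order_no)

print(shards)  # even order numbers land in shard 0, odd ones in shard 1
```

Note that changing `NUM_SHARDS` later changes where existing order numbers map, which is one reason data migration becomes a concern after sharding.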

1. Advantages of relational databases

The approach above appears to solve the problem (and in fact it really does solve a lot of problems). An ordinary relational database with read/write separation plus sharding can support 10,000+ read and write QPS without much trouble. But because of the limits of relational databases themselves, this architecture still has obvious shortcomings. Below I first analyze the advantages of using a relational database as the storage solution and then analyze its shortcomings; fully understanding the strengths and weaknesses of a technology is a prerequisite for selecting it.

1) Easy to understand

The logic of a two-dimensional table of rows and columns is very close to how we think about the real world, so the relational model is easier to understand than other models such as the network and hierarchical models.

2) Easy to operate

The common SQL language makes relational databases very convenient to operate, for example with support for complex queries such as joins.

3) Data consistency

Support for ACID properties keeps data consistent, and this is one of the most important reasons to use a database. Take a bank transfer: Joe Smith transfers 100 yuan to John Doe. Deducting 100 yuan from Joe Smith and adding 100 yuan to John Doe must both succeed or both fail; otherwise a user loses money.
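As a small illustration of the "both succeed or both fail" requirement, here is a sketch using Python's built-in sqlite3 module. The `account` table, the starting balances, and the `transfer` helper are assumptions made for the example; the non-negative balance check is just a convenient way to make the second step fail.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account (name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.executemany(
    "INSERT INTO account VALUES (?, ?)",
    [("Joe Smith", 150), ("John Doe", 0)],
)
conn.commit()


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst; both updates commit or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # the credit to dst was rolled back together with the debit


def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM account"))


first = transfer(conn, "Joe Smith", "John Doe", 100)   # enough money
second = transfer(conn, "Joe Smith", "John Doe", 100)  # would overdraw
print(first, second, balances(conn))
```

The second transfer fails on the debit, and the credit that had already been applied inside the same transaction is undone, so no money is created or lost.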

4) Durable data

Data is persisted to disk, so there is no risk of data loss, and massive data volumes can be stored.

5) Stable service

The most commonly used relational database products, MySQL and Oracle, offer excellent server performance and stable service, and they rarely go down unexpectedly.

2. Shortcomings of relational databases

Next, let us look at the shortcomings of relational databases, which are also fairly obvious.

1) High IO pressure under high concurrency

Data is stored by row, so even an operation on a single column reads the entire row from the storage device into memory, which leads to high IO.

2) Maintaining indexes is expensive

To provide rich query capabilities, a hot table usually has several secondary indexes. Once secondary indexes exist, every insert must also insert into all of them and every update must also update all of them, which inevitably lowers the read/write capacity of the relational database; the more indexes, the worse it gets. If you have the chance, look at your own company's databases: besides the space that data files inevitably take, indexes take up a lot of space too.

3) Maintaining data consistency is expensive

Data consistency is at the core of a relational database, but the price of maintaining it is also very high. The SQL standard defines several transaction isolation levels, from low to high: read uncommitted, read committed, repeatable read, and serializable. The lower the isolation level, the more concurrency anomalies can occur, but generally the more concurrency the database can provide.

To guarantee transactional consistency, the database has to provide two techniques, concurrency control and recovery. The former reduces concurrency anomalies; the latter ensures that transactions and database state are not corrupted when the system fails. The core idea of concurrency control is locking. Whether the locking is optimistic or pessimistic, the higher the isolation level, the worse read and write performance inevitably becomes.

4) Thorny problems after horizontal scaling

As mentioned earlier, one way to scale the database as the business grows is to split it into several databases. Once that is done, data migration (redistributing one database's data across two according to some rule), cross-database joins (order data and user data no longer live in the same database), and distributed transactions all have to be considered. Distributed transactions in particular still have no especially good solution in the industry.

5) Inconvenient to extend the table structure

Because the database stores structured data, the table schema is fixed and inconvenient to extend. Changing the table structure requires executing DDL (Data Definition Language) statements, and the table can be locked during the change, leaving part of the service unavailable.

6) Weak full-text search

For example, like "%China is really great%" can only find text such as "In 2019 China is really great, I love my motherland"; it cannot find "China is really so great", because it has no word segmentation capability. Moreover, a condition such as like "%China is really great" cannot hit the index, which greatly reduces search efficiency.

Having written this much, my understanding is that the core is still the first three points. They show that a relational database's capacity becomes a bottleneck under high concurrency, especially when writes and updates are frequent. The symptoms are high database CPU, slow SQL execution, and client errors such as an exhausted connection pool. That is why, in scenarios such as flash sales, we must never deduct inventory directly through the database.

Some readers might say: so the database is a bottleneck under high concurrency, but my company has money. Add CPUs, switch to SSDs, keep buying database servers and adding shards. Isn't that enough?

The problem is that this approach has very poor cost-effectiveness. It spends 10 million to achieve an effect that another approach might reach for 1 million. A leader who ignores the input-output ratio of people and servers is a failed leader. And because relational databases are limited by their own characteristics, even after the money is spent the desired effect may not be achieved.

So what is the approach that achieves the 10 million effect for 1 million? Read on; that is the NoSQL we are about to discuss.

IV. Architecture evolution combined with NoSQL storage

As analyzed above, a database is a storage engine for the relational model and stores relational data. It has advantages but also obvious shortcomings. So as a company keeps growing, it usually does not blindly count on beefing up the database to solve its storage problems; instead it introduces other storage, which is the NoSQL we are talking about.

NoSQL stands for Not Only SQL and refers to non-relational databases. It is a complement to relational databases. Pay particular attention to the word "complement": NoSQL and relational databases are not opposed to each other. Each has its own strengths and weaknesses, and the right approach is to pick the right storage engine for the right scenario.

The simplest form of NoSQL is a cache:

 

 

For data that is read far more often than it is written, introduce a cache layer. Every read goes to the cache first; if the data is not there, it is fetched from the database and then written to the cache. With an expiry mechanism on the cached data, this usually works without major problems. Generally speaking, caching is the first choice for performance optimization and the solution with the most visible effect.

However, a cache is usually KV-type and its capacity is limited (it is memory-based), so it cannot solve every problem. To optimize further, we go on to introduce other kinds of NoSQL:

 

 

The database, the cache, and the other NoSQL stores work side by side, each playing to its own strengths. Of course, NoSQL far outperforms relational databases, but this often comes with the loss of some features, most commonly transactions.

Below we look at the common kinds of NoSQL and their representative products, and analyze the advantages, disadvantages, and application scenarios of each, so that their characteristics are familiar when it is time to choose a technology.

1. KV-type NoSQL

Representative: Redis. KV-type NoSQL, as the name suggests, is a non-relational database that stores data as key-value pairs. It is the simplest, easiest to understand, and most familiar kind of NoSQL, so I will go through it quickly. Redis and MemCache are both representatives, and Redis is the most widely used KV-type NoSQL. Taking Redis as the example, I summarize the biggest advantages of KV databases in two points:

Data lives in memory, so reads and writes are efficient;

Data is KV-type, lookups are O(1), and queries are fast.

The biggest advantage of KV-type NoSQL is therefore high performance. Benchmarking with the benchmark tool that ships with Redis, TPS can reach the 100,000 level. Like Redis, all KV-type NoSQL also share some fairly obvious shortcomings:

You can only look up V by K, not K by V;

Queries take a single form, KV only. Conditional queries are not supported, and the only way to query by several conditions is data redundancy, which wastes a great deal of storage space;

Memory is limited, so massive data volumes cannot be stored;

Likewise, because KV-type NoSQL stores data in memory, there is a risk of data loss.

In summary, the best fit for KV-type NoSQL is the caching scenario:

Reads far outnumber writes;

Strong read capability is needed;

There is no persistence requirement and data loss can be tolerated; lost data can simply be queried again and written back.

For example, when querying user information by user ID, look in the cache by user ID first. If the data is found, return it directly; if not, query the relational database by ID and write the result to the cache.

2. Search-type NoSQL

Representative: ElasticSearch. Traditional relational databases rely mainly on indexes for fast queries, but in full-text search scenarios indexes are powerless. A like query cannot cover every kind of fuzzy matching, its usage is heavily restricted, and careless use easily produces slow queries. Search-type NoSQL was born precisely to solve the weak full-text search capability of relational databases, and ElasticSearch is its representative product.

Full-text search is built on the inverted index, so let us see what that is. Before the inverted index, look at the forward index first. A traditional forward index is a document -> keyword mapping. The sentence "Tom is my friend" is split into the four words "Tom", "is", "my", and "friend"; at search time the documents are scanned and the qualifying ones are picked out. The principle is very simple, but retrieval is so inefficient that it has basically no practical value.

An inverted index is exactly the opposite: it is a keyword -> document mapping. An example makes this clearer:

 

Suppose there are four sentences: "Tom is Tom", "Tom is my friend", "Thank you, Betty", and "Tom is Betty's husband". The search engine cuts each sentence into N keywords according to some segmentation rule and, per keyword, records how many times the keyword occurs in each text. When we then search for "Tom", the word appears in "Tom is Tom", "Tom is my friend", and "Tom is Betty's husband", so all three records are retrieved. Because "Tom" appears twice in "Tom is Tom", that record matches "Tom" best and is shown first. This is the basic principle of a search engine's inverted index. For each document in which a keyword appears, the inverted index stores two parts:

The document ID;

The positions at which the keyword occurs in that document.

By analogy, searching for the two words "Betty Tom" works the same way. The search engine cuts "Betty Tom" into the two words "Tom" and "Betty" and applies the match rate specified by the developer. With a match rate of 50%, for example, any record in which at least one of the two words appears is retrieved, and the results are then displayed in order of how well they match.
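The four-sentence example can be turned into a toy inverted index in Python. This is a sketch under simplifying assumptions, not how ElasticSearch or Lucene is implemented: the `tokenize` rule is a naive stand-in for real word segmentation, and the ranking (terms matched, then total occurrences) is a simplification chosen for the example.

```python
import re
from collections import defaultdict

docs = {
    1: "Tom is Tom",
    2: "Tom is my friend",
    3: "Thank you, Betty",
    4: "Tom is Betty's husband",
}


def tokenize(text):
    """Naive segmentation: lowercase, split on non-letters, drop a possessive 's'."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w != "s"]


def build_index(docs):
    """Map each keyword to {doc_id: [positions where it occurs]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(tokenize(text)):
            index[word].setdefault(doc_id, []).append(pos)
    return index


def search(index, query, min_match=0.5):
    """Return ids of docs matching at least `min_match` of the query terms, best first."""
    terms = tokenize(query)
    hits = defaultdict(lambda: [0, 0])  # doc_id -> [terms matched, total occurrences]
    for term in terms:
        for doc_id, positions in index.get(term, {}).items():
            hits[doc_id][0] += 1
            hits[doc_id][1] += len(positions)
    matched = [(d, m, n) for d, (m, n) in hits.items() if m / len(terms) >= min_match]
    matched.sort(key=lambda t: (-t[1], -t[2], t[0]))
    return [d for d, _, _ in matched]


index = build_index(docs)
print(search(index, "Tom"))        # document 1 ranks first: "Tom" occurs twice
print(search(index, "Betty Tom"))  # 50% match rate: one of the two words is enough
```

Raising `min_match` to 1.0 for the query "Betty Tom" leaves only the document that contains both words.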

Taking ElasticSearch as the example of search-type NoSQL, its advantages are:

It supports word segmentation and full-text search, which is the most important feature that sets it apart from relational databases;

It supports conditional queries and aggregation, similar to Group By in a relational database but more powerful, which makes it suitable for data analysis;

Data is written to files, so there is no risk of loss; it scales easily in a clustered environment and can carry PB-level data;

High availability: it automatically discovers new or failed nodes and reorganizes and rebalances data, keeping data safe and accessible.

Likewise, ElasticSearch has obvious disadvantages:

Its performance is bought with memory, and this is the point that most needs attention when using it. It is very hungry for hardware resources, memory above all. With large data volumes, 64G of memory plus SSDs is basically standard, which makes it the Hermes of databases. Why single out memory? Because memory is expensive: the same configuration with twice the memory costs almost a few hundred dollars more per month;

As for where ElasticSearch uses memory, it is roughly the following:

① Indexing Buffer: ElasticSearch is built on Lucene. Lucene first builds the inverted index in memory and then periodically flushes it to disk as Segment Files; each Segment File is in fact a complete inverted index;

② Segment Memory: as said above, the inverted index is organized by keyword. Since version 4.0, Lucene stores all keywords in an FST data structure and loads all of them into memory at startup to speed up queries. The official recommendation is to leave at least half of the system memory to Lucene;

③ Various caches: Filter Cache, Field Cache, Indexing Cache, and so on, used to improve query and analysis performance. Filter Cache, for example, caches the result sets of filters that have been used;

④ Cluster State Buffer: ElasticSearch is designed so that every Node can respond to user requests, so every Node keeps a copy of the cluster state in memory. For a very large cluster this state information can be very large.

There is a delay between writing and reading. Written data becomes readable after roughly 1s, which is normal: with so much written data being indexed automatically, performance is bound to be affected;

The data structure is not very flexible. In ElasticSearch, once a field has been created its type cannot be changed. If a field was created without full-text indexing and you want to add it later, the only option is to delete the whole table and rebuild it.

Therefore, the best fit for search-type NoSQL is conditional search, especially full-text search, as an alternative to relational databases.

Search databases also have another particularly important application scenario. Once the database has been sharded, do the aggregations and statistics that used to work on a single table all stop working? Say I split the orders table across 16 databases and 1024 tables, so order data is scattered over 1024 tables. How do I find which of yesterday's orders in Zhejiang Province had the highest transaction amount? How do I show all of yesterday's orders in chronological order, page by page? This is the other major role of search-type NoSQL: we can consolidate the sharded data into the search-type NoSQL store and use its search and aggregation capabilities to query across the full data set.

As for why search-type NoSQL is written second, right after KV-type NoSQL: like a cache, it is usually placed in front of the relational database as a layer that protects it.

3. Column-oriented NoSQL

Representative: HBase. Column-oriented NoSQL, represented by HBase, is one of the most representative technologies of the big data era.

Column-oriented NoSQL stores data by column. So what is columnar storage? Column-oriented NoSQL has the same concept of a primary key as a relational database. The difference is that a relational database organizes data by row:

 

 

Each row has the three fields name, phone, and address. This is row-oriented storage. Notice that the row with id = 2 has no data in the phone field, yet the field still takes up space.

Columnar storage is a completely different approach: it organizes data by column:

 

 

 

What are the benefits? Roughly the following:

A query reads only the columns it specifies, not all of them;

It saves storage space. Null values are not stored, and a column often contains a lot of repeated data (especially enumerated data such as gender or status), which compresses well. The compression ratio of a row-oriented database is typically 3:1 to 5:1, while that of a columnar database is generally around 8:1 to 30:1;

Data in a column is stored together, so a single disk IO can read a column's data into memory in one go.
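To make the difference between the two layouts concrete, here is a small Python sketch that pivots row records into per-column maps. The sample rows and the `to_columns` helper are invented for illustration (the original figure's data is not available), and this models only the logical layout, not how HBase lays out bytes on disk.

```python
# Row-oriented layout: every record carries every field, even when it is NULL.
rows = [
    {"id": 1, "name": "Joe Smith", "phone": "12345", "address": "Hangzhou"},
    {"id": 2, "name": "John Doe", "phone": None, "address": "Shanghai"},
    {"id": 3, "name": "Betty", "phone": "67890", "address": None},
]


def to_columns(rows, key="id"):
    """Pivot row records into one {key: value} map per column, skipping NULLs."""
    columns = {}
    for row in rows:
        for field, value in row.items():
            if field == key or value is None:
                continue  # NULL values are simply not stored
            columns.setdefault(field, {})[row[key]] = value
    return columns


columns = to_columns(rows)

# A query on one column touches only that column's data.
phones = columns["phone"]
print(phones)
```

The row with id 2 has no entry at all in the phone column, so the missing value occupies no space.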

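The second point mentions data compression. What does that mean? Take dictionary-table compression, a fairly common approach, as an example: each distinct value is stored once in a dictionary, and the column itself stores only short codes that point into it.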

 

Study the figure carefully and it should become clear.
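Dictionary compression of a repetitive column can be sketched in Python as follows. The `dict_encode` and `dict_decode` helpers and the sample gender column are assumptions made for the example; real columnar stores combine this with other encodings.

```python
def dict_encode(values):
    """Replace each value with a small integer code; return (dictionary, codes)."""
    dictionary = []  # code -> original value
    code_of = {}     # original value -> code
    codes = []
    for v in values:
        if v not in code_of:
            code_of[v] = len(dictionary)
            dictionary.append(v)
        codes.append(code_of[v])
    return dictionary, codes


def dict_decode(dictionary, codes):
    """Rebuild the original column from the dictionary and the codes."""
    return [dictionary[c] for c in codes]


gender = ["male", "female", "female", "male", "male", "female"]
dictionary, codes = dict_encode(gender)
print(dictionary, codes)
```

Each distinct string is stored once, and the column shrinks to a list of small integers. The fewer distinct values a column has, the better this works, which is why enumerated fields such as gender and status compress so well.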

Next, the advantages and disadvantages. Taking HBase as the representative of column-oriented NoSQL, its advantages are:

Massive data storage with practically no limit; PB-level data can be stored as a matter of course. It is built on HDFS (the Hadoop file system) underneath, so data is persisted;

Good read and write performance; as long as misuse does not create data hotspots, reads and writes are effortless;

Among relational and non-relational databases it is one of the easiest to scale out: adding new machines increases data capacity linearly, and it runs on cheap servers, which saves cost;

It has no single point of failure and offers high availability;

It can store structured or semi-structured data;

The number of columns is theoretically unlimited. HBase itself only has requirements on the number of column families; 1 to 3 is recommended.

Having covered the advantages of HBase, it is time for its disadvantages:

HBase is part of the Hadoop ecosystem, so it is a relatively heavyweight product that depends on many Hadoop components. There is no need to use it for small data volumes, and operating it is somewhat complicated;

It is KV-style. Conditional queries are unsupported, or at best very weak. HBase can scan a range of data with Scan, and the Scan API offers prefix matching, but querying by other conditions requires defining additional RowKeys and storing the data redundantly;

It does not support paged queries, because it cannot count the total number of records.

HBase is therefore better suited to KV-type scenarios in which future data growth cannot be predicted. Using HBase also takes some experience, mainly in RowKey design.
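Why RowKey design matters so much can be seen in a toy model. HBase keeps rows sorted by RowKey, so a prefix scan is cheap only when the RowKey begins with the value you want to query by. The `SortedKV` class below is an in-memory stand-in written for this example, not the HBase client API, and the `userId#date` key format is a hypothetical design.

```python
import bisect


class SortedKV:
    """Toy stand-in for a store that keeps rows sorted by key, as HBase does by RowKey."""

    def __init__(self):
        self._keys = []    # row keys, kept in sorted order
        self._values = {}  # row key -> value

    def put(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def get(self, key):
        return self._values.get(key)

    def scan_prefix(self, prefix):
        """Yield (key, value) pairs whose key starts with `prefix`, in key order."""
        start = bisect.bisect_left(self._keys, prefix)
        for key in self._keys[start:]:
            if not key.startswith(prefix):
                break  # keys are sorted, so no later key can match
            yield key, self._values[key]


store = SortedKV()
# Hypothetical RowKey design: "<userId>#<date>" for chat messages.
store.put("u001#20190901", "hello")
store.put("u002#20190901", "hi")
store.put("u001#20190902", "bye")

u001_messages = list(store.scan_prefix("u001#"))
print(u001_messages)
```

With this key design, all messages of one user sit next to each other and can be read with one prefix scan. A query by date across all users, however, would need a full scan or a second copy of the data under a date-first RowKey, which is the redundancy mentioned above.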

4. Document-type NoSQL

Representative: MongoDB. Frankly, my hands-on experience with document-type NoSQL is fairly shallow, so this part draws on earlier articles found online to give a general picture.

What is document-type NoSQL? It refers to NoSQL that stores semi-structured data as documents, usually in JSON or XML format. Document-type NoSQL therefore has no Schema, and because there is no Schema, we can store and read data freely. This is how document-type NoSQL solves the relational database's problem of inconvenient table structure changes.

MongoDB is the representative document-type NoSQL product and one of the star products among all NoSQL products, so it is the example here. As I understand it, MongoDB as a document-type NoSQL is a product positioned squarely against relational databases. Let us look at how it stores data:

 

As you can see, a relational database stores the data field by field, while MongoDB stores it as a JSON document. A relational database can build indexes on name and phone; MongoDB can likewise index fields with the createIndex command, and once the index exists query efficiency improves greatly. The other basic concepts are also largely similar between the two:

 

 

 

So we can simply think of MongoDB as a Schema-free relational database. With that, its advantages and disadvantages are clear. Advantages:

No fields need to be predefined, so fields are easy to extend;

Compared with relational databases, read and write performance is superior. Queries that hit a secondary index are no slower than in a relational database, and queries on non-indexed fields win across the board.

The disadvantages are:

Transactions are not supported. MongoDB claims to support transactions from version 4.0 onward, but how well that works remains to be seen;

Joins across multiple collections are not supported (although embedded documents offer a workaround); a join query still takes multiple operations;

It takes up more space. This comes from MongoDB's design: space is pre-allocated, and space is not released after data is deleted. It is only reclaimed after running db.repairDatabase;

I have not yet found an operations tool for MongoDB as mature as Navicat is for MySQL.

All in all, MongoDB's use cases largely correspond to those of relational databases, but it is better suited to data that needs no joins, has no strong consistency requirements, and whose table Schema changes frequently.

V. Comparing relational databases with NoSQL, and the NoSQL types with each other

In this last part, a conclusion. In the end, this article comes down to two topics:

When to choose a relational database and when to choose a non-relational database;

When choosing a non-relational database, which one to choose.

Start with the first topic. In choosing between relational and non-relational databases, in my understanding there are just two considerations:

 

The first needs little explanation. Non-relational databases gain performance by sacrificing ACID properties, so if two tables have a strong consistency requirement between them, that kind of data is not suitable for a non-relational database.

Second, core data such as the user table and the orders table should not go into a non-relational database. This rests on one premise: such core data is queried in many different ways. A user table with the four fields A, B, C, and D might be queried by A and B, by A and C, or by D. If the data is core but KV-shaped, such as user chat records, then storing it in HBase is fine.

From years of work experience: non-core data, especially logs and journal-style intermediate records, should never be written to a relational database. This kind of data usually has two characteristics:

Writes far outnumber reads;

The write volume is huge.

Storing it in a relational database greatly reduces the database's capacity. Core services whose own read and write QPS is not high end up being dragged down by the reads and writes of this data.

Next, the second question: if we do use a non-relational database as the storage engine, how do we choose? The article above has basically covered this, so here is just a summary (the lack of transactions is not listed among the shortcomings, since it is a problem shared by all NoSQL relative to relational databases):

 

One thing needs special emphasis here: the choice must be based on your actual situation rather than made by the book. For example:

In a company's early days, when a single relational database is plainly enough and can support the architecture for a year, there is no need to roll out a large, all-encompassing technical solution;

Some data has many query conditions and would suit ElasticSearch to take pressure off the relational database, but the company's budget is limited. In that case you can try continuing to store this data in the relational database;

Some data has a simple format, is KV-type, and is growing fast, but the company has no HBase expertise and operations could be difficult. For practical reasons, it may be sensible to get by with the relational database for a while first.

So even when some storage engine would be the better fit on paper, forcing it in without regard to your actual situation can backfire. In short, what suits you is best.


Origin www.cnblogs.com/shqnl/p/11479159.html