SQL or NoSQL? After Reading This Article, You Will Understand

This article is reproduced from May CJ: SQL or NoSQL? After Reading This Article, You Will Understand

Foreword

Are you troubled by a database whose CPU is nearly maxed out whenever a wave of heavy traffic arrives, and whose CPU usage stays high day to day? Are you torn between the various NoSQL options, unsure which one is best? You today are me yesterday, and that is why I wrote this article.

This is an article I have wanted to write for months, on a subject I have also long wanted to study. As Internet practitioners, we need to know that relational databases (MySQL, Oracle) cannot meet all of our storage requirements, so choosing the underlying storage and understanding each storage engine is very important. My work experience over the past while has also given me more to think about here, and I want to share my own summary with everyone.

 

Structured data, unstructured data and semi-structured data

I'll start the article with structured, unstructured, and semi-structured data, because the characteristics of the data directly affect the choice of storage engine.

The first is structured data. By definition, structured data is data that is logically expressed and implemented as a two-dimensional table. Its format and length strictly follow a specification, and it is also called row data. Its characteristic is that data is organized in rows: each row represents one entity, and every row has the same attributes. A user table in which every row has the same id, name, and age columns is an example.

Relational databases therefore fit structured data perfectly, and they are the most important engine for storing and managing relational data.

Unstructured data is data whose structure is irregular or incomplete, which has no predefined data model and is inconvenient to represent with two-dimensional logical tables: office documents (Word), text, images, HTML, reports of all kinds, video, and audio.

Between structured and unstructured data lies semi-structured data. It is a form of structured data that does not conform to the two-dimensional logical table model but carries tags that separate semantic elements and give records and fields a hierarchy. Common semi-structured formats are XML and JSON, for example:

<person>
    <name>John Doe</name>
    <age>18</age>
    <phone>12345</phone>
</person>

This kind of structure is also called a self-describing structure.
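As a rough illustration of what "self-describing" means, the sketch below reads the person record using only the tags carried in the data itself, with no table definition. The JSON rendering of the same record and the helper name xml_to_dict are assumptions added for illustration.

```python
import json
import xml.etree.ElementTree as ET

xml_text = """
<person>
    <name>John Doe</name>
    <age>18</age>
    <phone>12345</phone>
</person>
"""

# Hypothetical JSON rendering of the same record.
json_text = '{"name": "John Doe", "age": 18, "phone": "12345"}'


def xml_to_dict(text):
    """Turn a flat XML record into {tag: text} using only the tags it carries."""
    root = ET.fromstring(text.strip())
    return {child.tag: child.text for child in root}


person_from_xml = xml_to_dict(xml_text)
person_from_json = json.loads(json_text)

print(person_from_xml)
print(person_from_json)
```

Both forms carry their field names along with the values. The XML version yields every value as text, while JSON keeps the number 18 as a number.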

 

Architecture evolution with relational database storage

First, let's look at how a company's system architecture evolves through its stages of growth when storage is built on a relational database. (Because this article is about SQL and NoSQL, it looks only at storage and does not cover middleware such as MQ or ZK.)

Phase One: The company is just starting out and the architecture is at its simplest: one application server paired with one relational database, with every read and write going to that database.

Phase Two: Whether you use MySQL, Oracle, or another relational database, the database is usually not the first performance bottleneck. As the business grows, a single application server typically can no longer carry the incoming traffic, and it is also a single point of failure. So application servers are added, with an Nginx instance at the traffic entry point doing load balancing so that traffic is spread evenly across them.

Phase Three: As the business continues to grow, the database starts to show performance bottlenecks because reads and writes all hit the same instance. The simple fix is read/write splitting: every write goes to the primary database and reads go to replicas, with data synchronized from primary to replicas through the binlog. This largely solves the database performance problems of this stage.
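The routing idea behind read/write splitting can be sketched in a few lines. This is only a minimal illustration under assumptions: the class name ReadWriteRouter and the server labels are hypothetical, statements are classified by their first keyword, and real middleware does far more (transactions, failover, health checks).

```python
import itertools


class ReadWriteRouter:
    """Send SELECT statements to replicas in turn and everything else to the primary."""

    def __init__(self, primary, replicas):
        self.primary = primary
        self._replicas = itertools.cycle(replicas)  # simple round-robin over replicas

    def route(self, sql):
        first_word = sql.lstrip().split(None, 1)[0].upper()
        if first_word == "SELECT":
            return next(self._replicas)
        return self.primary


router = ReadWriteRouter("primary", ["replica-1", "replica-2"])
print(router.route("SELECT * FROM orders WHERE id = 1"))   # replica-1
print(router.route("UPDATE orders SET paid = 1"))          # primary
```

Because replicas apply the binlog after the primary commits, a read that immediately follows a write may still see old data on a replica.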

Phase Four: The company keeps doing better, the business keeps growing, and even with read/write splitting the pressure on the database keeps rising. What then? If one database cannot carry the load, split it into several: shard databases and tables, splitting tables vertically and databases horizontally. Taking database expansion as an example, suppose we expand to two databases and route by applying a rule (for example, modulo) to some business number (for example, the transaction order number). Orders whose number modulo 2 is 0 go to database 0, and orders whose number modulo 2 is 1 go to database 1, so write traffic is spread evenly across the two databases. Sharding is generally done through a middleware layer, which makes connection management and data monitoring easier and means clients do not need to know the database IPs.
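The modulo rule described above can be sketched as follows. The function name shard_for and the sample order numbers are made up for illustration.

```python
def shard_for(order_no: int, shard_count: int = 2) -> int:
    """Pick the target database by taking the order number modulo the shard count."""
    return order_no % shard_count


orders = [1001, 1002, 1003, 1004, 1005]

placement = {}
for order_no in orders:
    placement.setdefault(shard_for(order_no), []).append(order_no)

print(placement)  # {1: [1001, 1003, 1005], 0: [1002, 1004]}
```

Odd order numbers land in database 1 and even ones in database 0, so writes are spread across both.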

 

Advantages of relational databases

The approach above appears to solve the problem (and in fact it really does solve many problems). With read/write splitting plus sharding, a relational database setup can normally support 10,000+ read/write QPS without much trouble. But because of the inherent limits of relational databases, this architecture still has obvious shortcomings. Below I first analyze the advantages of using a relational database for storage and then the disadvantages. Fully understanding a technology's strengths and weaknesses is a prerequisite for choosing it.

  • Easy to understand

  The two-dimensional table logic of rows and columns is very close to how we think about the real world, so the relational model is easier to understand than other models such as the network and hierarchical models

  • Easy to operate

  The common SQL language makes relational databases very convenient to work with, for example through support for complex queries such as joins. SQL plus two-dimensional relations is the greatest advantage of relational databases, and this ease of use fits developers very well

  • Data consistency

  Support for ACID properties keeps data consistent, which is one very important reason to use a database. Take a bank transfer: Joe Smith transfers 100 yuan to John Doe, so 100 yuan is deducted from Joe Smith and 100 yuan is added to John Doe. Both steps must succeed together or fail together; otherwise users lose money

  • Stable data

  Data is persisted to disk, so it is not at risk of being lost, and mass data storage is supported

  • Stable service

  The most commonly used relational database products, MySQL and Oracle, offer strong server performance and stable service, and unexpected downtime is usually rare
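The bank transfer from the data consistency point can be sketched with Python's built-in sqlite3 module. SQLite stands in for MySQL or Oracle here, and the table name, starting balances, and the transfer helper are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account (name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany(
    "INSERT INTO account VALUES (?, ?)",
    [("Joe Smith", 150), ("John Doe", 0)],
)
conn.commit()


def transfer(conn, src, dst, amount):
    """Credit dst and debit src as one transaction; return True if it committed."""
    try:
        with conn:  # commits if the block succeeds, rolls back if it raises
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # the CHECK constraint rejected a negative balance


def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM account"))


first = transfer(conn, "Joe Smith", "John Doe", 100)   # enough funds: commits
second = transfer(conn, "Joe Smith", "John Doe", 100)  # only 50 left: rolls back
print(first, second, balances(conn))
```

In the second call the credit to John Doe has already been executed when the debit fails, yet the rollback undoes it, so the two balances never disagree.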

 

Disadvantages of relational databases

Next, let's look at the disadvantages of relational databases, which are also fairly obvious.

  • IO pressure under high concurrency

  Data is stored by row, so even when an operation touches only one column, the whole row is read from the storage device into memory, which leads to high IO

  • The high cost of maintaining indexes

  To provide rich query capabilities, a hot table usually has several secondary indexes. Once secondary indexes exist, every insert must also add an entry to every secondary index and every update must also update every secondary index, which inevitably lowers the read/write performance of the relational database; the more indexes, the worse it gets. If you get the chance, look at your own company's database: besides the space the data files inevitably occupy, the indexes also take up a considerable amount of space

  • The high cost of maintaining data consistency

  Data consistency is the core of a relational database, but the price of maintaining it is also very high. The SQL standard defines several transaction isolation levels, from low to high: read uncommitted, read committed, repeatable read, and serializable. The lower the isolation level, the more concurrency anomalies can occur, but the more concurrency the database can usually provide. To guarantee transactional consistency, the database has to provide both concurrency control and recovery: the former reduces concurrency anomalies, and the latter ensures that transactions and database state are not corrupted when the system fails. The core idea of concurrency control is locking, whether optimistic or pessimistic, and the higher the isolation level, the worse read/write performance inevitably becomes

  • Hard problems after horizontal scaling

  As mentioned earlier, as the business grows, one option is to shard the database. Once it is sharded, data migration (redistributing one database's data across two according to some rule), cross-database joins (order data and user data may not live in the same database), and distributed transactions all have to be dealt with. Distributed transactions in particular still have no especially good solution in the industry

  • Table structure is hard to extend

  Because the database stores structured data, the table schema is fixed and inconvenient to extend. Changing the table structure requires DDL (data definition language) statements, and the table can be locked while the change runs, making part of the service unavailable

  • Weak full-text search

  For example, like "%China is really great%" can only find text such as "In 2019 China is really great, I love my motherland"; it cannot find "China is really so great", because it has no word segmentation capability. In addition, a query such as like "%China is really great" with a leading wildcard cannot use the index, which greatly reduces query efficiency
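The full-text limitation in the last point is easy to reproduce. The sketch below uses Python's built-in sqlite3 module as a stand-in relational database, with a hypothetical post table and English renderings of the two sample texts.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE post (body TEXT)")
conn.executemany(
    "INSERT INTO post VALUES (?)",
    [
        ("In 2019 China is really great, I love my motherland",),
        ("China is really so great",),
    ],
)

# LIKE only matches the pattern as one contiguous substring.
rows = conn.execute(
    "SELECT body FROM post WHERE body LIKE ?",
    ("%China is really great%",),
).fetchall()
matched = [row[0] for row in rows]
print(matched)
```

Only the first row is returned. The second row says nearly the same thing, but the extra word breaks the substring match, which is exactly the gap that word segmentation in a search engine closes.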

Having written all this, my view is that the core is the first three points. They show that a relational database's capacity under high concurrency is a bottleneck, especially when writes and updates are frequent. The bottleneck shows up as high database CPU, slow SQL execution, and client errors such as an exhausted connection pool. That is why, in scenarios such as flash sales, we must never deduct inventory directly in the database.

Some might say: if the database is a bottleneck under high concurrency and my company has money, why not add CPUs, switch to SSDs, and keep buying servers and adding shards? The problem is that this is very poor value for money: it spends 10 million to achieve an effect that other approaches might reach for 1 million. A leader who ignores the input-output ratio of people and servers is a failed leader, and because relational databases are limited by their own characteristics, the money spent may still not achieve the desired effect. So what is the way to get the 10-million effect for 1 million? Read on; this is where NoSQL comes in.

 

Architecture evolution combining NoSQL storage

As analyzed above, a database is a storage engine for the relational model that stores relational data. It has advantages, but also obvious drawbacks. So as a company keeps growing, it usually does not count blindly on strengthening the database to solve its data storage problems, but introduces other kinds of storage: what we call NoSQL.

NoSQL stands for Not Only SQL. It refers to non-relational databases and is a supplement to relational databases. Note the word "supplement": NoSQL and relational databases are not opposed. Each has strengths and weaknesses, and the right approach is to choose the right storage engine for the scenario at hand.

The simplest way to bring in NoSQL is a cache:

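For data that is read far more often than it is written, introduce a cache layer: every read goes to the cache first; if the data is not in the cache, read it from the database and then write it into the cache. With a sound expiration mechanism for the cached data, this usually causes no major problems. Generally speaking, caching is the first choice for performance optimization and also the one with the most visible effect.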

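However, caches are usually KV stores with limited capacity (they are memory-based) and cannot solve every problem, so as a further optimization we go on to introduce other NoSQL stores.

The database, the cache, and the other NoSQL stores work side by side, each playing to its own strengths. Of course, while NoSQL greatly outperforms relational databases, it usually also lacks some features, most commonly transactions.

Below we look at the commonly used types of NoSQL and their representative products, and analyze the advantages, disadvantages, and suitable scenarios of each, so that you become familiar with their characteristics and can make technology choices more easily.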

 

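KV NoSQL (representative: Redis)

As the name suggests, a KV NoSQL is a non-relational database that stores data as key-value pairs. It is the simplest, the easiest to understand, and the most familiar kind of NoSQL, so I will go through it quickly. Redis and Memcached are representatives, and Redis is the most widely used KV NoSQL. Taking Redis as the example, I would sum up its biggest advantages in two points:

  • Data lives in memory, so reads and writes are efficient
  • KV lookups have O(1) time complexity, so queries are fast

So the biggest advantage of KV NoSQL is high performance. Benchmarking with the benchmark tool that ships with Redis, TPS can reach the order of 100,000, which is very strong. Redis also has the fairly obvious drawbacks shared by all KV NoSQL stores:

  • You can only look up V by K, not K by V
  • There is only one way to query, by key. Conditional queries are not supported, and the only way to query by several conditions is to store the data redundantly, which wastes a great deal of storage space
  • Memory is limited, so massive data storage is not supported
  • Likewise, because KV NoSQL storage is memory-based, there is a risk of losing data (Redis offers RDB and AOF persistence, but writes made since the last sync can still be lost)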
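The "data redundancy" workaround from the second drawback can be sketched with a dictionary standing in for the KV store. The key naming scheme and the save_user helper are hypothetical.

```python
import json

kv = {}  # stand-in for a KV store such as Redis: string key -> string value


def save_user(user):
    """Store the same record under two keys so it can be found by id or by phone."""
    payload = json.dumps(user)
    kv[f"user:id:{user['id']}"] = payload
    kv[f"user:phone:{user['phone']}"] = payload  # redundant copy for the second lookup


save_user({"id": 7, "name": "John Doe", "phone": "12345"})

by_id = json.loads(kv["user:id:7"])
by_phone = json.loads(kv["user:phone:12345"])
print(by_id == by_phone, len(kv))
```

Each additional way of looking a record up costs another full copy, and every update has to touch all copies, which is why this does not scale to many query conditions.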

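To sum up, the scenario that suits KV NoSQL best is caching:

  • Reads far outnumber writes
  • Strong read capability is needed
  • There is no persistence requirement and data loss can be tolerated, since lost data can simply be queried again and rewritten

For example, to look up user information by user id, query the cache by user id each time. If the data is found, return it directly; if not, query the relational database by id and write the result into the cache.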

 

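Search NoSQL (representative: ElasticSearch)

Traditional relational databases rely mainly on indexes for fast queries, but in full-text search scenarios indexes are powerless. LIKE queries cannot satisfy all fuzzy-matching needs, are heavily restricted in use, and easily cause slow queries when misused. Search NoSQL was born to solve the weak full-text search of relational databases, and ElasticSearch is its representative product.

Full-text search is built on the inverted index. Before explaining it, let's look at the forward index. A traditional forward index is a document --> keyword mapping. For example, the sentence "Tom is my friend" is split into the four words "Tom", "is", "my", and "friend"; a search scans the documents and returns those that match. The principle is very simple, but retrieval is so inefficient that it has little practical value.

An inverted index is the exact opposite: it is a keyword --> document mapping. Suppose we have the following four short phrases: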

  • "Tom is Tom"
  • "Tom is my friend"
  • "Thank you, Betty"
  • "Tom is Betty's husband"

 

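The search engine splits each phrase into N keywords according to certain segmentation rules and, for each keyword, records how many times it appears in each text. The next time someone searches for "Tom", the three phrases "Tom is Tom", "Tom is my friend", and "Tom is Betty's husband" are all retrieved because the word appears in each of them, and since "Tom" appears twice in "Tom is Tom", that record matches "Tom" best and is shown first. This is the basic principle of a search engine's inverted index. When a keyword appears in a document, the inverted index holds two pieces of information:

  • The document ID
  • The positions at which the keyword appears in that document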

 

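The same reasoning extends to a search for the two words "Betty Tom". The search engine splits "Betty Tom" into "Tom" and "Betty" and applies the match rate specified by the developer: with a match rate of 50%, for example, any record containing at least one of the two words is retrieved, and results are then shown in order of relevance.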
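A toy version of this inverted index, built over the four phrases above, might look like the following. It is a sketch under simplifying assumptions: the tokenizer is naive, and results are ranked by raw occurrence counts, whereas real search engines use more refined relevance scoring. The function names are made up for illustration.

```python
import re
from collections import defaultdict

DOCS = {
    1: "Tom is Tom",
    2: "Tom is my friend",
    3: "Thank you, Betty",
    4: "Tom is Betty's husband",
}


def tokenize(text):
    # Naive segmentation: lowercase, drop possessive 's, keep runs of letters.
    return re.findall(r"[a-z]+", text.lower().replace("'s", ""))


def build_index(docs):
    # keyword -> {document id: [positions of the keyword in that document]}
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(tokenize(text)):
            index[word].setdefault(doc_id, []).append(pos)
    return index


def search(index, query, min_match=0.5):
    """Return ids of documents containing at least min_match of the query terms."""
    terms = set(tokenize(query))
    scores = defaultdict(int)   # doc id -> total occurrences of query terms
    matched = defaultdict(int)  # doc id -> number of distinct query terms found
    for term in terms:
        for doc_id, positions in index.get(term, {}).items():
            scores[doc_id] += len(positions)
            matched[doc_id] += 1
    hits = [d for d in scores if matched[d] / len(terms) >= min_match]
    return sorted(hits, key=lambda d: (-scores[d], d))


index = build_index(DOCS)
print(search(index, "Tom"))               # [1, 2, 4]
print(search(index, "Betty Tom", 0.5))    # [1, 4, 2, 3]
print(search(index, "Betty Tom", 1.0))    # [4]
```

Searching "Tom" puts document 1 first because the word occurs twice there. With a match rate of 50% every phrase containing either word is returned, and with 100% only "Tom is Betty's husband" qualifies.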

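Taking ElasticSearch as the example of search NoSQL, its advantages are:

  • Supports word segmentation and full-text search, the biggest difference from relational databases
  • Supports conditional queries and aggregations, similar to a relational database's GROUP BY but more powerful, which suits data analysis
  • Data is written to files, so there is no risk of loss; in a cluster it is easy to scale horizontally and can carry PB-scale data
  • Highly available: it automatically discovers new or failed nodes and reorganizes and rebalances data to keep it safe and accessible

Likewise, ElasticSearch has fairly obvious drawbacks:

  • Performance depends entirely on memory, which is the point needing most attention in use. It is very hungry for hardware resources and memory; at large data volumes, 64 GB of memory plus SSDs is basically standard, which makes it something of the Hermès of databases. Why single out memory? Because memory is expensive: doubling the memory of an otherwise identical configuration costs roughly a few hundred yuan more per month. As for where ElasticSearch spends its memory, roughly as follows:
    • Indexing Buffer: ElasticSearch is built on Lucene, and Lucene's inverted index is first built in memory and then periodically flushed to disk as Segment Files; each Segment File is in effect a complete inverted index
    • Segment Memory: as said earlier, the inverted index is keyword-based. Since 4.0, Lucene loads all keywords, organized in the FST data structure, fully into memory at startup to speed up queries; the official recommendation is to leave at least half of system memory for Lucene
    • Various caches: Filter Cache, Field Cache, Indexing Cache, and so on, used to improve query and analysis performance; the Filter Cache, for example, caches the result sets of filters that have been used
    • Cluster State Buffer: ElasticSearch is designed so that every node can respond to user requests, so every node keeps a copy of the cluster state in memory; in a very large cluster this state can be very large
  • There is a delay between writing and reading: written data becomes readable after about 1s. That is to be expected, since automatically building so many indexes on write is bound to cost performance
  • The data structure is not very flexible: once a field is created in ElasticSearch, its type cannot be changed. If a field was created without full-text indexing and you want to add it, the only way is to delete the whole table and rebuild it

Search NoSQL is therefore best suited to conditional search scenarios, full-text search in particular, as an alternative to a relational database.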

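Search databases also have another especially important use. Consider: once a database has been sharded, do the aggregations and statistics that used to work on a single table all stop working? Suppose I split the order table across 16 databases and 1,024 tables, so that order data is scattered over 1,024 tables. How do I find yesterday's single highest-value order in Zhejiang Province? How do I show all of yesterday's orders sorted by time, page by page? This is the other major role of search NoSQL: we can write the sharded data into one search NoSQL store and use its search and aggregation capabilities to query the full data set.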
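To see why these queries are awkward without a single searchable copy, here is a rough sketch of what time-ordered paging over shards involves. The three small lists stand in for the 1,024 tables, and the page helper is hypothetical.

```python
import heapq
from itertools import islice

# Each shard holds (timestamp, order id) pairs, already sorted by timestamp.
shards = [
    [(1, "order-a"), (4, "order-d")],
    [(2, "order-b"), (5, "order-e")],
    [(3, "order-c")],
]


def page(shards, page_no, page_size):
    """Merge every shard's sorted rows and cut out one page of the global order."""
    merged = heapq.merge(*shards)
    start = (page_no - 1) * page_size
    return list(islice(merged, start, start + page_size))


print(page(shards, 1, 2))  # [(1, 'order-a'), (2, 'order-b')]
print(page(shards, 2, 2))  # [(3, 'order-c'), (4, 'order-d')]
```

Every page has to consult every shard, and deeper pages have to skip over more merged rows, which is the work that a single search store takes over.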

As for why I put it second, right after KV NoSQL: search NoSQL is often also used as a front cache layer to protect the relational database.

 

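Column NoSQL (representative: HBase)

Column NoSQL, represented by HBase, is one of the most representative technologies of the big-data era.

Column NoSQL is based on columnar storage. So what is columnar storage? Column NoSQL has the concept of a primary key just as relational databases do; the difference is that a relational database organizes data by row.

Picture a table in which each row has the three fields name, phone, and address. That is row storage, and in a row such as id = 2 where the phone field has no value, the field still takes up space.

Columnar storage is a completely different approach: it organizes data by column, storing the values of each column together.

What are the benefits? Roughly the following:

  • A query reads only the columns it names, not all columns
  • It saves storage space. Null values are not stored, and a column often contains many repeated values (especially enumerated data such as gender or status) that compress well. Row databases usually achieve compression ratios between 3:1 and 5:1, while column databases generally achieve around 8:1 to 30:1
  • A column's data is stored together, so one disk IO can read a whole column into memory at once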
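The difference between the two layouts can be sketched with plain Python structures. The sample rows (including the id = 2 row without a phone) and the to_columns helper are made up for illustration and do not reflect how any particular product lays data out on disk.

```python
# Row layout: every record carries every field, even when the value is missing.
rows = [
    {"id": 1, "name": "Alice", "phone": "111", "address": "Hangzhou"},
    {"id": 2, "name": "Bob", "phone": None, "address": "Shanghai"},
    {"id": 3, "name": "Carol", "phone": "333", "address": "Hangzhou"},
]


def to_columns(rows, key="id"):
    """Regroup the same data by column: column name -> {row key: value}."""
    columns = {}
    for row in rows:
        for field, value in row.items():
            if field == key or value is None:
                continue  # missing values are simply not stored
            columns.setdefault(field, {})[row[key]] = value
    return columns


columns = to_columns(rows)
print(columns["phone"])    # {1: '111', 3: '333'}
print(columns["address"])  # {1: 'Hangzhou', 2: 'Shanghai', 3: 'Hangzhou'}
```

Reading the phone column touches only phone values, and the missing phone of id 2 takes no space at all.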

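The second point mentioned data compression. What does that mean? Take the common dictionary encoding as an example: each distinct value in a column is stored once in a dictionary, and the column itself stores only small integer codes that refer to the dictionary entries.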
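Dictionary compression of a column with many repeated values can be sketched like this. The function names and the sample status column are assumptions for illustration.

```python
def dict_encode(values):
    """Replace each value with a small integer code; return (dictionary, codes)."""
    dictionary = []      # code -> distinct value
    code_of = {}         # distinct value -> code
    encoded = []
    for value in values:
        if value not in code_of:
            code_of[value] = len(dictionary)
            dictionary.append(value)
        encoded.append(code_of[value])
    return dictionary, encoded


def dict_decode(dictionary, encoded):
    return [dictionary[code] for code in encoded]


status = ["PAID", "PAID", "REFUNDED", "PAID", "CREATED", "PAID"]
dictionary, encoded = dict_encode(status)

print(dictionary)  # ['PAID', 'REFUNDED', 'CREATED']
print(encoded)     # [0, 0, 1, 0, 2, 0]
```

Six strings shrink to three dictionary entries plus six small integers. The more a column repeats itself, the bigger the saving, which is why storing a column's values together compresses so well.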

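Now for the advantages and disadvantages. The advantages of column NoSQL, represented by HBase, are:

  • Practically unlimited storage of massive data: PB-scale data is no problem. It sits on HDFS (the Hadoop file system), so data is persisted
  • Good read/write performance: as long as misuse does not create data hotspots, reads and writes are largely unconstrained
  • Horizontal scaling is among the easiest of any relational or non-relational database: adding machines grows data capacity linearly, and it runs on cheap servers, saving cost
  • No single point of failure in itself, so availability is high
  • Can store structured or semi-structured data
  • The number of columns is unlimited in theory; HBase itself constrains only the number of column families, with 1 to 3 recommended

Having covered so many of HBase's advantages, it is time for its disadvantages:

  • HBase is part of the Hadoop ecosystem, so it is a fairly heavy product that depends on many Hadoop components. It is not worth using when the data volume is small, and operations are somewhat complex
  • It is KV-style and does not support conditional queries, or rather its conditional queries are very, very weak. HBase does provide APIs such as prefix matching when a Scan reads a batch of data, but conditional queries are otherwise only possible by defining multiple RowKeys and storing the data redundantly
  • No paginated queries, because the total number of rows cannot be counted

HBase is therefore well suited to KV-style scenarios whose future data growth cannot be estimated. Using HBase well also takes some experience, mainly in RowKey design.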
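Why RowKey design matters can be sketched with a small sorted key-value structure. This is a plain-Python stand-in, not the HBase client API, and the class name SortedKV and the userId#date#sequence key layout are hypothetical.

```python
import bisect


class SortedKV:
    """Keeps keys in sorted order so that rows sharing a prefix sit next to each other."""

    def __init__(self):
        self._keys = []     # row keys, kept sorted
        self._values = {}   # row key -> value

    def put(self, row_key, value):
        if row_key not in self._values:
            bisect.insort(self._keys, row_key)
        self._values[row_key] = value

    def get(self, row_key):
        return self._values.get(row_key)

    def scan_prefix(self, prefix):
        """Yield (key, value) for every key starting with prefix, in key order."""
        start = bisect.bisect_left(self._keys, prefix)
        for key in self._keys[start:]:
            if not key.startswith(prefix):
                break
            yield key, self._values[key]


store = SortedKV()
store.put("u1#20191201#001", "hi")
store.put("u2#20191201#001", "yo")
store.put("u1#20191202#001", "bye")

print(list(store.scan_prefix("u1#")))
```

Putting the user id first makes "all messages of user u1" one contiguous range. A query by date alone would have to scan everything, which is the kind of trade-off RowKey design has to settle up front.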

 

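Document NoSQL (representative: MongoDB)

Frankly, in my work experience I have only fairly shallow experience with document NoSQL, so this part can only give a rough introduction based on my earlier use and on articles online.

What is a document NoSQL? It is a NoSQL that stores semi-structured data as documents, usually in JSON or XML format, so it has no schema. Because there is no schema, we can store and read data freely; document NoSQL arose to solve the problem that relational table structures are inconvenient to extend.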
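What "no schema" means in practice can be sketched with a list of dictionaries standing in for a document collection. This is not the MongoDB API; the find helper and the sample documents are assumptions for illustration.

```python
collection = []  # stand-in for a document collection

collection.append({"_id": 1, "name": "John Doe", "phone": "12345"})
# A later document adds a nested field and omits phone, with no DDL step in between.
collection.append({"_id": 2, "name": "Joe Smith", "address": {"city": "Hangzhou"}})


def find(collection, **criteria):
    """Return the documents whose top-level fields equal all the given criteria."""
    return [
        doc for doc in collection
        if all(doc.get(field) == value for field, value in criteria.items())
    ]


print(find(collection, name="Joe Smith"))
print(find(collection, phone="12345"))
```

Documents in the same collection can carry different fields, and a query on a field that some documents lack simply does not match them.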

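MongoDB is the representative document NoSQL product and one of the star products among all NoSQL products, so I use it as the example here. As I understand it, MongoDB, as a document NoSQL, is positioned squarely against relational databases. Look at storage first.

A relational database stores each field dutifully in its own column, whereas MongoDB stores the record as one JSON document. A relational database can build indexes on name and phone, and MongoDB can likewise build indexes on fields with the createIndex command, which greatly improves query efficiency. The broad basic concepts are also similar between the two: a table corresponds to a collection, a row to a document, and a column to a field.

So MongoDB can simply be understood as a schema-free relational database. Its advantages and disadvantages are fairly clear at a glance. Advantages:

  • No predefined fields, so adding fields is easy
  • Compared with relational databases, read/write performance is superior: queries that hit a secondary index are no slower than in a relational database, and queries on non-indexed fields win across the board

Disadvantages:

  • No transaction support. MongoDB has claimed transaction support since 4.0, but how well it works remains to be seen
  • Joins across collections are not supported (although documents can be embedded), so join-style queries still take multiple operations
  • High space usage. This is a MongoDB design issue: space is preallocated and is not released after data is deleted; it can only be reclaimed by repairing with db.repairDatabase()
  • So far I have not found a mature operations tool for MongoDB comparable to Navicat for relational databases such as MySQL

In short, MongoDB's use cases largely mirror those of relational databases, but it is better suited to data with no joins, no strong consistency requirement, and a table schema that changes often.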

 

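Summary: relational databases vs. NoSQL, and the NoSQL types compared

In this last part I'll summarize. This article ultimately comes down to two topics:

  • When to choose a relational database and when to choose a non-relational one
  • If a non-relational database is chosen, which one to use

First, the choice between relational and non-relational databases. As I understand it, it comes down to two considerations: whether the data needs strong consistency, and whether it is core data.

On the first point, little explanation should be needed: non-relational databases obtain higher performance by sacrificing ACID properties, so if two tables have a fairly strong consistency requirement, that kind of data is not suitable for a non-relational database.

On the second point, core data such as the user table and the order table should not go into a non-relational database. There is a precondition, though: this kind of core data has multiple query patterns. For example, a user table with four fields A, B, C, and D might be queried by A and B, by A and C, or by D. If the data is core data but purely KV in form, such as users' chat records, then simply storing it in HBase is enough.

From my work experience of the past few years, non-core data, especially intermediate data such as logs and transaction journals, should never be written to a relational database. This kind of data usually has two characteristics:

  • Writes far outnumber reads
  • The write volume is huge

Once a relational database is used to store it, the database's capacity is greatly reduced, and core services whose normal read/write QPS is not high get dragged down by the reads and writes of this data.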

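Next is the second question: if we use a non-relational database as the storage engine, how do we choose? The sections above have basically covered this, so here is just a summary. (None of the drawbacks mention transactions, because lacking them is a problem all NoSQL shares relative to relational databases.)

  • KV (Redis): in-memory, so reads and writes are very fast; can only look up V by K, capacity is limited by memory, and data can be lost; best used as a cache
  • Search (ElasticSearch): supports word segmentation, full-text search, conditional queries, and aggregation, and scales horizontally; hungry for memory and hardware, about 1s between a write and its visibility to reads, field types cannot be changed; best for conditional and full-text search and for querying sharded data as a whole
  • Column (HBase): stores massive data, reads and writes well, and scales out by adding machines; heavy to operate, very weak conditional queries, no paginated queries; best for KV-style data whose growth cannot be estimated
  • Document (MongoDB): no predefined fields, easy to extend, good read/write performance; no joins and high space usage; best for data with no joins, no strong consistency requirement, and a frequently changing schema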

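One thing to stress, though: the choice must be based on the actual situation rather than on the textbook. For example:

  • In the early days of a company, a single relational database would clearly be enough to support the architecture for a year, so there is no need to build a large, all-encompassing technical solution
  • Some data is queried by many conditions and would be better stored in ElasticSearch to take pressure off the relational database, but the company's budget is limited; in that case this data can stay in the relational database for now
  • Some data has a simple format, is purely KV, and grows quickly, but the company has nobody with HBase experience and operations would be difficult; given that reality, the relational database can carry it for a while

So if the actual situation is ignored, forcing in a storage engine that is more suitable on paper can backfire. In short, what suits you is what is best.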

Origin www.cnblogs.com/alimayun/p/12057102.html