Principles and Applications of Big Data Technology, Part II: Big Data Storage and Management (2), NoSQL Databases and Cloud Databases

Table of contents

Chapter 3, NoSQL Database

1. Characteristics of NoSQL database

1.1 NoSQL database and traditional relational database

1.2 Hybrid Architecture

2. Four types of NoSQL

3. Three cornerstones of NoSQL

3.1 CAP

3.2 BASE

3.3 For HBase database

Chapter 4, Cloud Database

1. Features of cloud database

2. UMP Architecture

2.1 Overview of UMP

2.2 UMP system architecture

2.3 UMP system function


Chapter 3, NoSQL Database

Traditional relational databases: Built on well-established relational algebra theory, they support structured data storage and management, follow strict standards, support ACID transactions, and answer queries efficiently with the help of indexes. However, their data model is inflexible, they scale out poorly, and their strong transaction mechanism and complex query support are ill suited to large-scale unstructured storage.

The four ACID properties:

Atomicity: A transaction is an indivisible unit of work; its operations either all succeed or all fail. For example, the SQL statements in one transaction either all execute successfully or none of them takes effect. If any statement fails, the database is rolled back to its state before the transaction began.

Consistency: The commonly cited definition is that a transaction must take the database from one consistent state to another consistent state.

Another way to understand it: when a transaction fails, it is not committed, so data consistency is preserved.

Isolation: When multiple users access the database concurrently, the transaction opened for each user must not be disturbed by the data operations of other transactions; concurrent transactions must be isolated from one another. Until a transaction commits, other users cannot see its uncommitted data.

Durability: Once a transaction is committed, its changes to the data in the database are permanent.
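
Atomicity is easy to see in a small experiment. The sketch below uses Python's built-in sqlite3 module; the accounts table, the names alice and bob, and the helper functions are made up for illustration. A transfer credits one account and then debits the other, and a CHECK constraint makes the debit fail when the balance would go negative, so the credit that already ran has to be rolled back.

```python
import sqlite3

def open_bank():
    """Create an in-memory database with two accounts (illustrative schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accounts ("
        "name TEXT PRIMARY KEY, "
        "balance INTEGER NOT NULL CHECK (balance >= 0))"
    )
    conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
    conn.commit()
    return conn

def transfer(conn, src, dst, amount):
    """Credit dst, then debit src, as a single transaction."""
    try:
        with conn:  # commits if the block succeeds, rolls back if it raises
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            # Violates the CHECK constraint if src would go negative.
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False

def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM accounts"))

conn = open_bank()
print(transfer(conn, "alice", "bob", 30), balances(conn))
print(transfer(conn, "alice", "bob", 500), balances(conn))
```

The second transfer fails on its second statement, and the balances afterwards are the same as before it started: the credit to bob does not survive on its own.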

1. Characteristics of NoSQL database

1.1 NoSQL database and traditional relational database

The two kinds of database compare as follows.

Scalability

NoSQL database: Good horizontal scalability; inexpensive, standardized blade servers are enough to meet scale-out needs.

Traditional relational database: Hard to scale horizontally; vertical scaling requires hardware upgrades, which is expensive.

Data model

NoSQL database: Uses non-relational data models such as key-value and column family, does not strictly satisfy the ACID properties, and allows different types of data to be stored in one data element.

Traditional relational database: Uses the relational data model, which rests on complete relational algebra theory, with standardized definitions and strict constraints. Satisfies the ACID properties.

Application scenario

NoSQL database: With good horizontal scalability, it serves as cloud computing infrastructure and underpins NoSQL-based cloud database services aimed at the Web 2.0 era.

Traditional relational database: Business systems in supermarkets, banks and similar fields depend heavily on relational databases to keep data consistent; for complex query and analysis applications, relational databases perform better.

Advantages

NoSQL database: Supports ultra-large-scale data storage, has a flexible data model that suits Web 2.0 applications well, and scales out strongly.

Traditional relational database: Rests on complete relational algebra theory, follows strict standards, supports the four ACID properties of transactions, and achieves efficient queries with the help of indexes. The technology is mature and has professional vendor support.

Disadvantages

NoSQL database: Lacks a mathematical foundation, performs poorly on complex queries, mostly cannot provide strong transactional consistency, makes data integrity hard to enforce, is not yet mature, lacks professional technical support, and is hard to maintain.

Traditional relational database: Scales poorly and cannot support massive data storage well. The data model is too rigid to suit Web 2.0 applications. The transaction mechanism reduces overall system performance.

Several strengths of relational databases are of little use in Web 2.0:

Web 2.0 websites usually do not require strict database transactions.

Web 2.0 does not require strict real-time reads and writes.

Web 2.0 applications usually do not contain many complex SQL queries.

1.2 Hybrid Architecture

Case: Amazon uses different types of databases to support its e-commerce applications. For temporary data such as "shopping baskets", key-value storage is more efficient. Current product and order information is suited to a relational database. The large volume of historical order information is suited to a document database such as MongoDB (a NoSQL database).

2. Four types of NoSQL

Typical NoSQL databases fall into four types: key-value, column family, document, and graph databases. They compare as follows.

Data model

Key-value database: Key/value pairs. The key is a string; the value can be data of any type.

Column family database: Column families, queried by row key.

Document database: A "document" is a data record that "self-describes" the type and content of the data it contains, in a format such as XML or JSON.

Graph database: Graph structure.

Scope of application

Key-value database: Content caches for applications with frequent reads and writes and a simple data model, for example mobile applications that store configuration and user data such as sessions, configuration files, parameters, and shopping carts.

Column family database: Distributed data storage and management; applications whose data is geographically distributed across multiple data centers; applications that can tolerate short-term inconsistency between replicas; applications with dynamic fields; applications with potentially very large data volumes, up to hundreds of terabytes.

Document database: Storing, indexing, and managing document-oriented or similar semi-structured data.

Graph database: Designed for highly interrelated data; well suited to social networks, pattern recognition, dependency analysis, recommendation systems, and path finding.

Advantages

Key-value database: Good scalability, in theory without limit. In-memory key-value databases keep data in memory for fast queries; persistent key-value databases store data on disk.

Column family database: Fast lookups, strong scalability, easy distributed scale-out, and low complexity.

Document database: Good performance (high concurrency), high flexibility, low complexity, and a flexible data structure. Embedded documents allow frequently queried data to be stored in the same document. Indexes can be built on keys or on content. The data need not be regular: each record contains all its relevant information without external references, so the record is "self-contained". A record can easily be moved to another server as a whole, because all its information travels with it and no other data has to be considered. During the move only the moved record (document) is operated on, whereas a relational database must lock every associated table to ensure consistency; ACID guarantees are therefore cheaper to provide, and read and write speed improves greatly.

Graph database: High flexibility; supports complex graph algorithms and can be used to build complex relationship graphs.

Shortcomings

Key-value database: Values cannot be queried by content, structured information cannot be stored, and conditional queries are inefficient. Multi-table join queries must be avoided and decomposed into single-table operations. Some databases do not support rollback and therefore cannot support transactions.

Column family database: Fewer functions; most do not support strongly consistent transactions.

Document database: Lacks a unified query syntax.

Graph database: High complexity; supports only a limited data scale.
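
To make the four data models concrete, here is a minimal sketch that mimics each of them with plain Python structures. The keys, field names and helper functions are invented for illustration and do not correspond to any particular product's API.

```python
from collections import deque

# Key-value: the store only understands "look up this key".
key_value = {"cart:u42": ["sku-1", "sku-9"]}

# Column family: row key -> column family -> column -> value.
column_family = {
    "u42": {
        "profile": {"name": "Ada", "city": "London"},
        "stats": {"logins": 17},
    }
}

# Document: self-describing records that can embed related data.
documents = [
    {"_id": 1, "user": "Ada", "items": [{"sku": "sku-1", "qty": 2}]},
    {"_id": 2, "user": "Bob", "items": []},
]

# Graph: nodes plus edges, stored here as an adjacency list.
graph = {"Ada": ["Bob"], "Bob": ["Cy"], "Cy": []}

def find_documents(docs, field, value):
    """Query documents by content rather than by key."""
    return [d for d in docs if d.get(field) == value]

def shortest_path(adjacency, start, goal):
    """Breadth-first search, the kind of path finding graph stores target."""
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbor in adjacency.get(path[-1], []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None

print(key_value["cart:u42"])
print(column_family["u42"]["profile"]["city"])
print(find_documents(documents, "user", "Ada"))
print(shortest_path(graph, "Ada", "Cy"))
```

Note how the key-value store can only be read by key, while the document store can be filtered on any field and the graph can be traversed along its edges.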

3. Three cornerstones of NoSQL

The three cornerstones of NoSQL are CAP, BASE, and eventual consistency.

3.1 CAP

The CAP theorem states that a distributed system cannot satisfy consistency (C), availability (A), and partition tolerance (P) all at once; at most two of the three can be met at the same time. This leaves three possible trade-offs.

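CA

Emphasize consistency (C) and availability (A) and give up partition tolerance (P). The simplest way to do this is to put everything related to a transaction on the same machine. Clearly, this severely limits scalability. Traditional relational databases all follow this design principle, which is why they scale relatively poorly.

CP

Emphasize consistency (C) and partition tolerance (P) and give up availability (A). To stay consistent, the system must wait for the disconnected partition to return data, and it cannot serve requests while it waits.

AP

Emphasize availability (A) and partition tolerance (P) and give up consistency (C). When a partition is disconnected, the system is allowed to return stale data.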
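
The difference between the CP and AP choices can be shown with a toy model. The classes below are hypothetical and deliberately simplified to one primary and one replica: during a partition the replica misses a write, and a read against it either fails (CP) or returns the old value (AP).

```python
class Unavailable(Exception):
    """Raised when a CP replica refuses to answer during a partition."""

class Replica:
    def __init__(self, mode):
        assert mode in ("CP", "AP")
        self.mode = mode
        self.value = None
        self.partitioned = False  # True while cut off from the primary

    def apply(self, value):
        """Called by the primary to replicate a write."""
        self.value = value

    def read(self):
        if self.partitioned and self.mode == "CP":
            # CP: better no answer than a possibly stale one.
            raise Unavailable("cannot confirm the latest value")
        return self.value  # AP: always answers, possibly with stale data

class Primary:
    def __init__(self, replica):
        self.value = None
        self.replica = replica

    def write(self, value):
        self.value = value
        if not self.replica.partitioned:
            self.replica.apply(value)

def read_during_partition(mode):
    replica = Replica(mode)
    primary = Primary(replica)
    primary.write("v1")          # replicated normally
    replica.partitioned = True   # the network splits
    primary.write("v2")          # the replica never sees this write
    try:
        return replica.read()
    except Unavailable:
        return "unavailable"

print(read_during_partition("CP"))
print(read_during_partition("AP"))
```

In CP mode the read is refused, so no client ever observes the outdated "v1"; in AP mode the read succeeds but returns "v1" even though "v2" has been written.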

3.2 BASE

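BASE stands for Basically Available, Soft state, and Eventual consistency.

Basically available: when part of a distributed system fails and becomes unavailable, the other parts can still be used normally; in other words, partition failures are allowed to occur. [Partition tolerance]

Soft state: state may be out of sync for a period of time, with a certain lag.

Eventual consistency: depending on the guarantees provided, it can be divided into the following kinds.

1. Causal consistency: if process A notifies process B that it has updated a data item, B's subsequent accesses will obtain the latest value written by A. Accesses by process C, which has no causal relationship with A, still follow the general eventual consistency rules.

2. Session consistency: processes that access the storage system are placed in the context of a session. As long as the session exists, the system guarantees "read-your-writes" consistency (a process always sees the values it wrote itself). If the session terminates because of some failure, a new session must be established, and the guarantee does not carry over to the new session.

3. Monotonic write consistency: the system guarantees that writes from the same process are executed in order. The system must provide at least this level of consistency; otherwise programming against it becomes very difficult.

4. Monotonic read consistency: once a process has seen a particular value of a data object, no subsequent access will return a value older than that one.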
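
One common way to obtain session-level guarantees on top of lagging replicas is to have the session remember the newest version it has written or seen and refuse anything older. The sketch below illustrates that idea only; the Replica and Session classes and the reachable parameter are assumptions made for the example, not part of any real client library.

```python
class Replica:
    def __init__(self):
        self.version = 0
        self.value = None

    def apply(self, version, value):
        if version > self.version:
            self.version, self.value = version, value

class Session:
    def __init__(self, replicas):
        self.replicas = replicas
        self.min_version = 0  # newest version this session wrote or read

    def write(self, value, reachable):
        """Write to the replicas whose indexes are in `reachable`; the rest lag."""
        version = max(r.version for r in self.replicas) + 1
        for index in reachable:
            self.replicas[index].apply(version, value)
        self.min_version = version  # floor for read-your-writes

    def read(self):
        """Return a value no older than anything this session wrote or read."""
        for replica in self.replicas:
            if replica.version >= self.min_version:
                self.min_version = replica.version  # floor for monotonic reads
                return replica.value
        raise RuntimeError("no sufficiently fresh replica available")

replicas = [Replica(), Replica()]
session = Session(replicas)
session.write("a", reachable=(0, 1))
session.write("b", reachable=(1,))    # replica 0 is now one version behind
print(session.read())                 # skips the stale replica 0
print(Session(replicas).read())       # a new session has no such floor
```

The original session reads its own latest write, while a freshly created session may still be served the older value from the lagging replica, which matches the statement that the guarantee does not carry over to a new session.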

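3.3 For HBase database

HBase relies on the underlying HDFS for redundant backup of its data.

HDFS provides a strong consistency guarantee: a write does not return successfully until the data has been fully synchronized to N nodes, whereas a read only needs to obtain a value from one node.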
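
The rule "write to all N, read from any one" can be modelled in a few lines. This is a toy model of the rule as stated above, not of the actual HDFS write pipeline; the class name and its fields are hypothetical.

```python
class ReplicatedValue:
    """N replicas: a write succeeds only if all N accept it, a read needs one."""

    def __init__(self, n):
        self.values = [None] * n
        self.reachable = [True] * n

    def write(self, value):
        if not all(self.reachable):
            return False  # cannot reach all N: report failure, change nothing
        self.values = [value] * len(self.values)
        return True

    def read(self):
        for ok, value in zip(self.reachable, self.values):
            if ok:
                return value  # every replica holds the last successful write
        raise RuntimeError("no replica reachable")

store = ReplicatedValue(3)
print(store.write("a"))
store.reachable[0] = False
print(store.write("b"), store.read())
```

Because a write is acknowledged only after every replica has it, reading a single replica is enough to see the last acknowledged value. The price is that one unreachable replica blocks writes in this simplified model.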

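Chapter 4, Cloud Database

Cloud computing (IaaS, PaaS, SaaS) is the foundation for the rise of cloud databases: it delivers scalable, inexpensive distributed computing power over the network, so that users can obtain the IT resources they need anytime and anywhere, as long as they have network access.

Cloud database: a database deployed and virtualized in a cloud computing environment. It is an emerging approach to shared infrastructure that developed against the backdrop of cloud computing, and it greatly enhances database storage capacity.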

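1. Features of cloud database

Cloud databases offer high scalability, high availability, multi-tenancy, and support for efficient resource distribution.

Feature: Description

Dynamic scalability: Unlimited in theory; capacity can be added promptly when demand arises and released immediately when demand subsides.

High availability: Redundant storage and data centers around the world provide high fault tolerance.

Low cost of use: A multi-tenant, on-demand service model serves many users and saves costs.

Ease of use: Users do not need to control the machine that runs the underlying database; a URL is enough to use the database, and migration is easy.

High performance: Large-scale distributed storage system.

Maintenance-free: Users do not need to perform maintenance.

Security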

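2. UMP Architecture

2.1 Overview of UMP

UMP (Unified MySQL Platform) classifies users into three types, which virtualizes resources and lowers overall cost.

User type: How the user is served

Users with small data volume and traffic: Several small-scale users share one MySQL instance.

Medium-scale users: Each medium-scale user has a MySQL instance to itself.

Users that need sharding (splitting databases and tables): The multiple MySQL instances of a large-scale user share the same physical machine.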

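The UMP architecture follows four principles:

Keep a single external entry point to the system and maintain a single resource pool inside it.

Eliminate single points of failure to ensure high service availability.

Scale well, so that compute and storage nodes can be added and removed dynamically.

Ensure that the resources allocated to users are also elastic and scalable, and that resources are isolated from one another to keep applications and data secure.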

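2.2 UMP system architecture

The entries below describe each UMP role, shown in [square brackets], and each open-source component it relies on, shown in (parentheses), together with its characteristics.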

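(Mnesia)

Mnesia is a distributed database management system that is tightly coupled with the Erlang programming language (efficient context switching, support for parallel computing, well suited to distributed, soft real-time systems).

It supports transparent data sharding and transactions, implements distributed transactions with two-phase locking, and scales linearly to at least 50 nodes.

The database schema can be reconfigured dynamically at run time, and tables can be migrated or replicated to multiple nodes to improve fault tolerance.

In the development of the cloud database, it is used to provide distributed database services.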

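(RabbitMQ)

RabbitMQ is an industrial-grade message queue product.

Used as message-passing middleware, it provides reliable message delivery.

Nodes in a UMP cluster communicate by reading and writing queue messages, without establishing dedicated connections.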

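(ZooKeeper)

ZooKeeper is an efficient and reliable coordination system.

It provides basic services such as distributed locks (along with unified naming, state synchronization, cluster management, and management of distributed application configuration items). It is used to build distributed applications and relieves them of coordination tasks.

As the global configuration server: configuration information is managed by ZooKeeper, and when it changes, all servers are notified by ZooKeeper.

Providing distributed locks: a cluster "manager" is elected and made responsible for initiating system tasks.

Monitoring all MySQL instances.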

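(LVS)

LVS (Linux Virtual Server) is a virtual server cluster system.

An LVS cluster uses IP load balancing and content-based request dispatching.

The UMP system relies on LVS for load balancing within the cluster.

The scheduler is the single entry point of the LVS cluster and has good throughput. It spreads requests evenly across servers and automatically masks failed servers, turning a group of servers into one high-performance, highly available virtual server.

The structure of the whole server cluster is transparent to clients, and neither client nor server programs need to be modified.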

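[Controller server]

Provides management services to the UMP cluster: cluster membership management, metadata storage, MySQL instance management, failure recovery, backup, migration, scale-out, and so on.

Runs a group of Mnesia distributed database services that store system metadata, including cluster members, user configuration and status information, and the mapping from user name to back-end MySQL instance address (the "routing table").

When other server components need user data, they send a request to the Controller server.

To avoid a single point of failure and ensure high availability, several Controller servers are deployed. ZooKeeper's distributed lock is used to elect one "manager", which schedules and monitors system tasks.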

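[Web console]

The Web console provides users with the system management interface.

[Proxy server]

Provides users with access to MySQL databases and fully implements the MySQL protocol.

Users can connect with their existing MySQL clients.

Using the user name, the Proxy server obtains the user's authentication information, resource quota limits, and the address of the back-end MySQL instance.

It forwards the user's SQL queries to the corresponding MySQL instance.

It can also mask MySQL instance failures, split reads from writes, shard databases and tables, isolate resources, and record user access logs.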

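[Agent server]

Deployed on the machines that run MySQL processes, to manage the MySQL instances on each physical machine.

Performs master-slave switchover, creation, deletion, backup, migration and similar operations, and also collects and analyzes MySQL process statistics, the slow query log, and the bin-log.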

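[Log analysis server]

Stores and analyzes the user access logs sent by the Proxy servers.

Supports real-time queries of slow logs and statistical reports for a given period.

[Statistics server]

Periodically aggregates the collected user connection counts, QPS values, and MySQL instance process status with RRDtool.

The results can be visualized in the Web interface, and they can also serve as the basis for future elastic resource allocation and automated MySQL instance migration.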

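[Yugong system]

A tool that combines full replication with incremental replication based on binlog analysis.

It allows dynamic scale-out, scale-in, and migration without downtime.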
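
The combination of a full copy with log-based incremental copy can be sketched as follows. This is only an illustration of the general technique, not Yugong's actual implementation: the Source class, the event tuples and the function names are hypothetical, and a Python list stands in for the binlog.

```python
def apply_event(table, event):
    """Apply one logged change to a table held as a dict."""
    op, key, value = event
    if op == "set":
        table[key] = value
    elif op == "delete":
        table.pop(key, None)
    else:
        raise ValueError(op)

class Source:
    """A running database that logs every change it makes."""

    def __init__(self):
        self.table = {}
        self.binlog = []

    def execute(self, op, key, value=None):
        event = (op, key, value)
        apply_event(self.table, event)
        self.binlog.append(event)

    def snapshot(self):
        """Full copy plus the log position that the copy corresponds to."""
        return dict(self.table), len(self.binlog)

def incremental_sync(source, target, position):
    """Replay the events logged after `position`; return the new position."""
    for event in source.binlog[position:]:
        apply_event(target, event)
    return len(source.binlog)

source = Source()
source.execute("set", "a", 1)
source.execute("set", "b", 2)
target, position = source.snapshot()       # full replication
source.execute("set", "a", 10)             # the source keeps serving writes
source.execute("delete", "b")
position = incremental_sync(source, target, position)
print(target == source.table, position)
```

The source never has to stop: changes made while the snapshot is being copied are picked up afterwards by replaying the log from the recorded position.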

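2.3 UMP system function

The UMP system is built on a large cluster. Its components cooperate to provide a range of functions that are transparent to users:

disaster recovery, read/write splitting, sharding (splitting databases and tables), resource management, resource scheduling, resource isolation, and data security.

Each function is described below, together with the process by which it is carried out.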

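Disaster recovery

Failure recovery that is completely transparent to users.

The UMP system creates two MySQL instances for each user, a master and a slave. The state of the master and the slave is maintained by ZooKeeper.

When the master goes down:

1. ZooKeeper detects the master failure and notifies the Controller server.

2. The Controller server starts a master-slave switchover and modifies the "routing table".

3. The master is marked as unavailable.

4. Through the RabbitMQ message middleware, all Proxy servers are notified to update the mapping from user name to back-end MySQL instance address.

When the failed master has been repaired and is brought back online:

1. During recovery, the master copies the updates made on the slave to itself.

2. When the master's state is close to that of the slave, the Controller server orders the slave to stop accepting updates; the slave becomes non-writable and user writes are blocked.

3. When the master has caught up completely with the slave, the Controller server initiates a master-slave switchover and marks the master as available in the routing table.

4. The Proxy servers are notified to switch writes back to the master, and user writes resume. The slave is then made writable again.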
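
The switchover flow can be sketched with two small classes. Everything here is a stand-in chosen for illustration: Python's queue.Queue plays the part of a RabbitMQ queue, the method names are invented, and the ZooKeeper failure detection is reduced to a direct method call.

```python
import queue

class Proxy:
    def __init__(self):
        self.inbox = queue.Queue()  # stands in for a RabbitMQ queue
        self.routes = {}            # user name -> address that receives writes

    def write_target(self, user):
        # Apply any pending routing updates before serving the request.
        while not self.inbox.empty():
            name, address = self.inbox.get()
            self.routes[name] = address
        return self.routes[user]

class Controller:
    def __init__(self, proxies):
        self.proxies = proxies
        self.route_table = {}

    def register(self, user, master, slave):
        self.route_table[user] = {
            "master": master, "slave": slave, "master_available": True,
        }
        self._publish(user, master)

    def on_master_failure(self, user):
        """Would be triggered by ZooKeeper reporting the failure."""
        entry = self.route_table[user]
        entry["master_available"] = False
        self._publish(user, entry["slave"])

    def on_master_caught_up(self, user):
        """Called once the recovered master matches the slave again."""
        entry = self.route_table[user]
        entry["master_available"] = True
        self._publish(user, entry["master"])

    def _publish(self, user, address):
        for proxy in self.proxies:
            proxy.inbox.put((user, address))

proxy = Proxy()
controller = Controller([proxy])
controller.register("alice", "db1:3306", "db2:3306")
print(proxy.write_target("alice"))
controller.on_master_failure("alice")
print(proxy.write_target("alice"))
controller.on_master_caught_up("alice")
print(proxy.write_target("alice"))
```

The user keeps connecting to the same Proxy throughout; only the address behind the user name changes, which is what makes the recovery transparent.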

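Read/write splitting

Makes full use of the master and the slave to separate users' read and write operations and balance the load.

When this function is enabled, the Proxy server, which provides users with access to MySQL, parses each SQL statement a user issues. A write is sent directly to the master; a read is distributed evenly across the master and the slave.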
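
A minimal sketch of such a routing decision follows. It classifies a statement by its first keyword, which is a simplification assumed for this example (a real proxy parses the statement fully), and it alternates reads between the two instances in round-robin fashion.

```python
import itertools

READ_KEYWORDS = {"SELECT", "SHOW"}

class ReadWriteSplitter:
    def __init__(self, master, slave):
        self.master = master
        # Reads alternate between the master and the slave.
        self._readers = itertools.cycle([master, slave])

    def route(self, sql):
        words = sql.split()
        if not words:
            raise ValueError("empty statement")
        if words[0].upper() in READ_KEYWORDS:
            return next(self._readers)
        return self.master  # anything that is not clearly a read goes to the master

splitter = ReadWriteSplitter("master:3306", "slave:3306")
for statement in ("INSERT INTO t VALUES (1)", "SELECT * FROM t", "select 1"):
    print(statement, "->", splitter.route(statement))
```

Sending every unrecognized statement to the master is the conservative choice: a misclassified read costs a little load, while a write sent to the slave would corrupt replication.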

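Sharding (splitting databases and tables)

UMP supports sharding that is transparent to users, but the user must specify the multi-instance type when creating the account and set the number of instances and the sharding rules.

With sharding in use, the system processes a user query as follows:

First, the Proxy server parses the user's SQL statement and extracts the information needed to rewrite and dispatch it.

Next, it rewrites the SQL statement into several sub-statements, one for each relevant MySQL instance, and dispatches each sub-statement to its instance for execution.

Finally, it receives the execution results from the MySQL instances and merges them into the final result.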
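
The three steps (extract, rewrite and dispatch, merge) can be sketched for a lookup by id. The sharding rule used here, id modulo the number of instances, is an assumption for the example, and each MySQL instance is represented by a plain dict.

```python
from collections import defaultdict

def plan(ids, shard_count):
    """Rewrite one lookup into per-shard sub-lookups (rule: id modulo shard count)."""
    sub_lookups = defaultdict(list)
    for row_id in ids:
        sub_lookups[row_id % shard_count].append(row_id)
    return dict(sub_lookups)

def execute(shards, ids):
    """Dispatch the sub-lookups and merge the partial results."""
    results = []
    for shard_no, sub_ids in plan(ids, len(shards)).items():
        table = shards[shard_no]  # dict id -> row, stands in for one MySQL instance
        results.extend(table[row_id] for row_id in sub_ids if row_id in table)
    return sorted(results, key=lambda row: row["id"])

shards = [{}, {}, {}]
for row_id in range(1, 7):
    shards[row_id % 3][row_id] = {"id": row_id, "name": f"user{row_id}"}

print(plan([5, 1, 3, 9], 3))
print(execute(shards, [5, 1, 3, 9]))
```

The caller passes one list of ids and receives one merged, sorted result, without knowing how many instances were involved.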

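Resource management

The UMP system manages computing resources through resource pools. All computing resources are placed in resource pools and allocated centrally; the resource pool is the basic unit from which resources are allocated to MySQL instances.

All servers in the cluster are divided into several resource pools according to factors such as machine model and machine room, and each server is added to the corresponding pool.

For each MySQL instance, the administrator specifies, according to resource conditions, the resource pools in which its master and slave are to be placed. The system's instance management service then applies the load-balancing principle and picks a lightly loaded server from the pool on which to create the MySQL instance.

Within each server, Cgroup is used to subdivide resources further, cap resource usage, and keep process groups isolated from one another.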
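
The placement step can be sketched as "pick the least-loaded server in the designated pool". The pool names, server names and the single numeric load value below are all made up for the example; a real scheduler would weigh several metrics.

```python
def place_instance(pools, pool_name, cost=1):
    """Create an instance on the least-loaded server of the given pool."""
    servers = pools[pool_name]              # server name -> current load
    server = min(servers, key=servers.get)  # load-balancing principle
    servers[server] += cost
    return server

def create_master_slave(pools, master_pool, slave_pool):
    """The administrator chooses the pools; the system chooses the servers."""
    return place_instance(pools, master_pool), place_instance(pools, slave_pool)

pools = {
    "room-a": {"a1": 3, "a2": 1},
    "room-b": {"b1": 0, "b2": 2},
}
print(create_master_slave(pools, "room-a", "room-b"))
print(pools)
```

Placing the master and the slave in pools from different machine rooms, as in this example, also keeps the pair from failing together.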

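Resource scheduling

Three types of users (see 2.1).

The Yugong system is used to scale out, scale in, and migrate dynamically without downtime.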

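Resource isolation

Several MySQL instances on the same server: Cgroup limits the resources of each.

Several users sharing one MySQL instance: the Controller server monitors the resource consumption of each user on the MySQL instance. If a user clearly exceeds its quota, the Controller notifies the Proxy server to limit that user's QPS (queries per second) by adding delay.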
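
The source does not say how the delay is chosen. One possible policy, assuming a client that issues its queries one after another, is to add just enough delay per query to bring the rate back to the quota: a query currently takes 1/observed seconds and should take 1/quota seconds. The function name and parameters are hypothetical.

```python
def throttle_delay(observed_qps, quota_qps):
    """Seconds of delay to add per query so a serial client falls back to its quota."""
    if quota_qps <= 0:
        raise ValueError("quota must be positive")
    if observed_qps <= quota_qps:
        return 0.0  # within quota: no throttling
    return 1.0 / quota_qps - 1.0 / observed_qps

for observed in (80, 200, 1000):
    print(observed, throttle_delay(observed, quota_qps=100))
```

For example, a user running at 200 QPS against a quota of 100 spends 5 ms per query and should spend 10 ms, so 5 ms of delay is added to each query.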

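Data security

The UMP system has several mechanisms to keep data secure:

SSL database connections: SSL is a protocol that provides communication security and data integrity.

IP whitelist for data access: connections from addresses not on the whitelist are refused.

User operation logging: all operations are recorded on the log server.

SQL interception: specified SQL statements are intercepted.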
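
Two of these checks, the IP whitelist and SQL interception, are simple enough to sketch with Python's standard ipaddress and re modules. The AccessPolicy class, the network range and the blocked pattern are illustrative assumptions, not UMP's configuration format.

```python
import ipaddress
import re

class AccessPolicy:
    def __init__(self, whitelist, blocked_patterns):
        self.networks = [ipaddress.ip_network(item) for item in whitelist]
        self.blocked = [re.compile(p, re.IGNORECASE) for p in blocked_patterns]

    def allow_connection(self, client_ip):
        """Accept only clients whose address falls inside a whitelisted network."""
        address = ipaddress.ip_address(client_ip)
        return any(address in network for network in self.networks)

    def allow_statement(self, sql):
        """Reject statements that match any blocked pattern."""
        return not any(pattern.search(sql) for pattern in self.blocked)

policy = AccessPolicy(
    whitelist=["10.0.0.0/8", "192.168.1.10"],
    blocked_patterns=[r"^\s*DROP\s+TABLE"],
)
print(policy.allow_connection("10.1.2.3"), policy.allow_connection("8.8.8.8"))
print(policy.allow_statement("SELECT 1"), policy.allow_statement("drop table users"))
```

Checking the address when the connection is opened and the statement when it is submitted means an untrusted client is rejected before it can send any SQL at all.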


Origin blog.csdn.net/CNDefoliation/article/details/127911542