SQL Server 2019 in Depth: The Ambitions of Microsoft's Data Platform

This article first appeared as the author's debut original piece on InfoQ. It was written mostly over a series of weekends, and finishing it fulfils something the author had been meaning to do for a while. It is now reposted to the author's own WeChat official account; your support is appreciated.

On November 4, Microsoft officially released SQL Server 2019, the next generation of its database product, bringing headline features such as big data clusters and data virtualization. The release arrived only two years after the previous major version, SQL Server 2017, which is a remarkable pace of iteration for something as complex as a database system. Two years ago, InfoQ published the long article "SQL Server 2017 Officially Released: What Does the Future Hold for Microsoft's Veteran Database?". This time we have again worked with its author to take readers through an in-depth look at what is new in SQL Server 2019.

Since the start of the 21st century, the data platform battlefield has been thick with smoke and full of spectacle. Every generation produces its own talents: NoSQL databases represented by MongoDB, Redis, and Neo4j, and Hadoop-family big data solutions such as Hive, Impala, and Presto, have each had their moment in the limelight. Under pressure from these younger challengers, relational databases, the backbone of data architecture, have not retreated; in recent years they have even shown signs of a comeback, growing stronger as the fight goes on. Today, relational databases still hold a central place in all kinds of architecture designs and critical systems, thanks to their stable performance and rich features.

SQL Server is an outstanding representative of relational databases and, together with Oracle and DB2, is one of the "Big Three" enterprise-grade commercial databases. Honed over decades of development, it is very mature and stable, and it keeps pace with the times by continually absorbing new technologies, so its feature set is very comprehensive. In particular, the previous version, SQL Server 2017, brought this storied database into the broad Linux world, further expanding its potential customer base and usage scenarios.

Let us first briefly review SQL Server's rich array of features. In the article two years ago we noted that SQL Server already combined traditional row storage, updatable column storage, in-memory tables, a graph database, and machine learning in a single product. Open-source databases are still working to catch up on some of these advanced features, or cannot yet integrate them cleanly in one database. This is where the value of a commercial database lies: it wins favor through high stability, high performance, and tight integration, helping customers solve critical business problems while also simplifying the technical architecture and reducing the maintenance burden.

In just two years, Microsoft has built a new SQL Server 2019 on the foundation of the previous generation. This is partly a result of the fast-paced, aggressive release strategies now common across the industry, but it still makes one curious: how much progress can a highly sophisticated commercial database system make in so short a time? And in what ways has it responded to a rapidly shifting market? This article explores those questions.

Based on the new features in SQL Server 2019, we will analyze and discuss three areas in turn: core engine enhancements, data virtualization, and the biggest highlight of this release, SQL Server Big Data Clusters.

Core engine enhancements

We start with the core engine. Hybrid transactional/analytical processing (HTAP) is a major trend in today's database world, and SQL Server is one of the industry leaders here: previous versions already supported OLTP and OLAP workloads at the same time by integrating rowstore and columnstore in a single engine. Users can query and join rowstore and columnstore tables together, and can even add a nonclustered columnstore index to a rowstore table, so that a single table serves both OLTP and OLAP query patterns well.
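As a minimal sketch of the single-table HTAP pattern described above, the following T-SQL adds a nonclustered columnstore index to an ordinary rowstore table. The table dbo.Orders, its columns, and the index name are hypothetical and chosen only for illustration; whether the optimizer actually uses the columnstore index depends on the data and the query.

```sql
-- Hypothetical OLTP table stored as a regular rowstore (clustered B-tree).
CREATE TABLE dbo.Orders (
    OrderId    INT IDENTITY(1,1) NOT NULL PRIMARY KEY CLUSTERED,
    CustomerId INT               NOT NULL,
    OrderDate  DATE              NOT NULL,
    Amount     DECIMAL(18,2)     NOT NULL
);

-- Add a nonclustered columnstore index covering the columns used for analytics.
CREATE NONCLUSTERED COLUMNSTORE INDEX NCCI_Orders_Analytics
    ON dbo.Orders (CustomerId, OrderDate, Amount);

-- OLTP-style point lookup: typically served by the clustered primary key.
SELECT OrderId, Amount
FROM dbo.Orders
WHERE OrderId = 42;

-- OLAP-style aggregate: the optimizer may choose the columnstore index here.
SELECT CustomerId, SUM(Amount) AS TotalAmount
FROM dbo.Orders
GROUP BY CustomerId;
```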

SQL Server 2019 continues to strengthen its mixed-workload support, with quiet, incremental improvements that make the relevant engines more mature and more convenient in everyday use. For columnstore indexes, for example, a clustered columnstore index can now be created or rebuilt (REBUILD) online. This greatly eases the maintenance of large columnstore tables in production, saving storage space and improving subsequent query performance. In production, the author often encounters columnstore tables that have become fragmented because of row updates, but to keep the online business running has so far only been able to use the relatively lightweight REORGANIZE command for basic maintenance. After a database upgrade, this problem should be solvable for good.
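The maintenance options discussed above can be sketched as follows. The table dbo.SalesFact and the index name are hypothetical; note also that online index operations are generally an Enterprise-edition capability, so treat this as an illustration rather than a recipe for every installation.

```sql
-- Hypothetical fact table, created as a heap for this sketch.
CREATE TABLE dbo.SalesFact (
    SaleId    BIGINT        NOT NULL,
    ProductId INT           NOT NULL,
    SaleDate  DATE          NOT NULL,
    Amount    DECIMAL(18,2) NOT NULL
);

-- Create the clustered columnstore index online (SQL Server 2019 and later).
CREATE CLUSTERED COLUMNSTORE INDEX CCI_SalesFact
    ON dbo.SalesFact
    WITH (ONLINE = ON);

-- Inspect rowgroup health: many deleted_rows relative to total_rows suggests fragmentation.
SELECT row_group_id, state_desc, total_rows, deleted_rows
FROM sys.dm_db_column_store_row_group_physical_stats
WHERE object_id = OBJECT_ID(N'dbo.SalesFact')
ORDER BY row_group_id;

-- Lightweight maintenance that was already possible before SQL Server 2019.
ALTER INDEX CCI_SalesFact ON dbo.SalesFact
    REORGANIZE WITH (COMPRESS_ALL_ROW_GROUPS = ON);

-- Full rebuild while the table stays available (new in SQL Server 2019).
ALTER INDEX CCI_SalesFact ON dbo.SalesFact
    REBUILD WITH (ONLINE = ON);
```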

The graph engine introduced in the previous generation, SQL Server 2017, has also been substantially enhanced in SQL Server 2019. At the storage level, graph tables and indexes can now be partitioned across multiple filegroups. Just as importantly, the release adds support for arbitrary length patterns, so users can finally express connectivity between nodes over any number of hops. Here is an official sample query against a graph of relationships between people:

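-- Find the shortest friendOf path from Jacob to Alice, however many hops it takes.
SELECT PersonName, Friends
FROM (
	SELECT
		Person1.name AS PersonName,
		-- Concatenate the names along the path, in traversal order.
		STRING_AGG(Person2.name, '->') WITHIN GROUP (GRAPH PATH) AS Friends,
		-- Name of the final node reached on the path.
		LAST_VALUE(Person2.name) WITHIN GROUP (GRAPH PATH) AS LastNode
	FROM
		Person AS Person1,
		friendOf FOR PATH AS fo,
		Person FOR PATH AS Person2
	-- The + quantifier lets the -(fo)->Person2 hop repeat one or more times.
	WHERE MATCH(SHORTEST_PATH(Person1(-(fo)->Person2)+))
	AND Person1.name = 'Jacob'
) AS JacobReach
WHERE JacobReach.LastNode = 'Alice';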

It is easy to see that this query determines whether Jacob and Alice are connected and returns the shortest relationship path between them, where the length of that path is not known in advance. The key piece of T-SQL syntax is the MATCH clause: it uses the SHORTEST_PATH function to find the shortest path between two nodes in the given graph. Note that the function's argument uses a regular-expression-like syntax to support variable-length patterns. The + sign neatly expresses path finding across several consecutive friendOf relationships: the part it quantifies, -(fo)->Person2, may be repeated multiple times.

This is one of the most common advanced query scenarios in graph database applications; in mathematics it is known as the transitive closure. With this addition, SQL Server 2019 finally reaches a higher level of graph query capability and begins to have the strength to compete with dedicated graph databases.
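To try the shortest-path query shown earlier, the node and edge tables have to exist first. The following is a minimal setup sketch: the table names Person and friendOf come from the sample query, while the ID column and the three sample people and their friendships are assumptions made up for illustration. With this data, the sample query should find a two-hop path from Jacob through Julie to Alice.

```sql
-- Node table: each row is a person.
CREATE TABLE Person (
    ID   INTEGER PRIMARY KEY,
    name VARCHAR(100)
) AS NODE;

-- Edge table: each row is a directed friendOf relationship between two Person nodes.
CREATE TABLE friendOf AS EDGE;

-- Hypothetical sample data.
INSERT INTO Person (ID, name)
VALUES (1, 'Jacob'), (2, 'Julie'), (3, 'Alice');

-- Jacob -> Julie -> Alice
INSERT INTO friendOf ($from_id, $to_id)
VALUES
    ((SELECT $node_id FROM Person WHERE name = 'Jacob'),
     (SELECT $node_id FROM Person WHERE name = 'Julie')),
    ((SELECT $node_id FROM Person WHERE name = 'Julie'),
     (SELECT $node_id FROM Person WHERE name = 'Alice'));

-- Variant of the pattern with a bounded hop count: everyone within 1 to 3 hops of Jacob.
SELECT
    Person1.name AS PersonName,
    STRING_AGG(Person2.name, '->') WITHIN GROUP (GRAPH PATH) AS Friends
FROM
    Person AS Person1,
    friendOf FOR PATH AS fo,
    Person FOR PATH AS Person2
WHERE MATCH(SHORTEST_PATH(Person1(-(fo)->Person2){1,3}))
AND Person1.name = 'Jacob';
```

The {1,3} quantifier is the bounded counterpart of +; it is useful when an unbounded search over a large graph would be too expensive.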

Commercial databases have always paid close attention to hardware developments and keep exploiting the latest hardware to maximize performance. Persistent memory (often abbreviated PMEM), with IO capability far beyond that of SSDs, has become one of the hot topics in server hardware, and Intel and other vendors are investing heavily in enterprise-grade products such as Optane DC persistent memory. SQL Server 2019 accordingly introduces the Hybrid Buffer Pool feature, which lets persistent memory act as a storage tier between DRAM and SSD and significantly accelerates access to performance-critical pages in the buffer pool. Once the feature is enabled, the buffer pool can extend onto the PMEM device, and SQL Server accesses data pages located on the PMEM device directly through memory-mapped IO. In many cases this avoids repeatedly copying data pages from disk into conventional DRAM, and it also bypasses the overhead of the operating system's storage stack when accessing pages, yielding a large performance gain.

Persistent memory and memory-mapped access (from Microsoft's official documentation)
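Turning the hybrid buffer pool on is a configuration step rather than a schema change. The sketch below shows the instance-level and database-level switches; the database name SalesDb is hypothetical, and the feature only has an effect when the database files actually reside on a suitably configured PMEM device.

```sql
-- Enable the hybrid buffer pool for the instance (requires an instance restart to take effect).
ALTER SERVER CONFIGURATION SET MEMORY_OPTIMIZED HYBRID_BUFFER_POOL = ON;

-- Enable (or disable with OFF) the feature for one database; SalesDb is a hypothetical name.
ALTER DATABASE [SalesDb] SET MEMORY_OPTIMIZED = ON;

-- Check the instance-level configuration.
SELECT * FROM sys.server_memory_optimized_hybrid_buffer_pool_configuration;

-- Check which databases have the setting enabled.
SELECT name, is_memory_optimized_enabled
FROM sys.databases;
```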

It is worth mentioning that Oracle 20c, the new generation that Oracle will release next year, will also support persistent memory. On this point, Oracle and SQL Server can be said to be thinking alike.

Next, programming language integration. Extending SQL Server used to be the exclusive preserve of C#/.NET; for example, users could call UDFs written in .NET through the CLR integrated into SQL Server. As Microsoft's open strategy has advanced in recent years, more languages have entered the SQL Server ecosystem. Two years ago we described how SQL Server 2017 integrated Python and R environments to support machine learning work; in SQL Server 2019, Java becomes the main target of integration and support. The new SQL Server Language Extensions framework allows Java classes and methods to execute locally on the SQL Server machine. Users only need to implement the abstract class AbstractSqlServerExtensionExecutor from the Microsoft Extensibility SDK for Java; their packaged Java code then runs in the database context and can be invoked from T-SQL through the stored procedure sp_execute_external_script.
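The T-SQL side of the Java integration can be sketched roughly as follows. Everything specific here is an assumption for illustration: the paths are placeholders, the library name regexSample, the class pkg.RegexSample (which would extend AbstractSqlServerExtensionExecutor), the input table dbo.testdata, and the @regexExpr parameter are all hypothetical, and FILE_NAME would be javaextension.so rather than javaextension.dll on Linux.

```sql
-- Allow external scripts to run on this instance.
EXEC sp_configure 'external scripts enabled', 1;
RECONFIGURE WITH OVERRIDE;

-- Register the Java language extension (placeholder path; Windows file name shown).
CREATE EXTERNAL LANGUAGE Java
FROM (CONTENT = N'<path-to>/java-lang-extension.zip',
      FILE_NAME = 'javaextension.dll');

-- Register the user's packaged Java code (hypothetical jar and library name).
CREATE EXTERNAL LIBRARY regexSample
FROM (CONTENT = N'<path-to>/regex-sample.jar')
WITH (LANGUAGE = 'Java');

-- Invoke the hypothetical class pkg.RegexSample, passing rows in and getting rows back.
EXEC sp_execute_external_script
      @language     = N'Java'
    , @script       = N'pkg.RegexSample'
    , @input_data_1 = N'SELECT id, text FROM dbo.testdata'
    , @params       = N'@regexExpr nvarchar(200)'
    , @regexExpr    = N'[Jj]ava'
WITH RESULT SETS ((ID INT, text NVARCHAR(100)));
```

Here @script names the Java class to run rather than containing source code, which is the main difference from the Python and R integrations.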

A related topic is the Java runtime itself. Because Oracle has continued to tighten Java's licensing and terms of use, and to avoid the restrictions and risks that embedding an Oracle Java environment in SQL Server would bring, Microsoft recently entered into a series of partnerships with Azul Systems, an open-source Java contributor and distributor. Azul's Zulu JRE/JDK (based on OpenJDK) is now the default Java option on Azure and in SQL Server. Azure and SQL Server users therefore get a free, supported Java runtime with security updates and bug fixes, which removes a source of worry. We expect similar arrangements to become the natural choice for other major vendors over time. The Java environment from Azul Systems not only helps extend SQL Server's functionality; it also plays a crucial supporting role in PolyBase and in SQL Server Big Data Clusters, both described below.

The features above are only part of what is new in the SQL Server 2019 engine. The new version contains many other notable improvements, such as the approximate aggregate function APPROX_COUNT_DISTINCT, Memory-Optimized TempDB Metadata, UTF-8 character encoding support, batch mode on rowstore, and row mode memory grant feedback. These features are spread across the storage and execution engines and further increase the capability and depth of SQL Server.
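A few of these engine features can be tried with very little T-SQL. The sketch below uses a throwaway table variable and a hypothetical table dbo.Utf8Demo; the approximate count trades a small error margin for lower memory use, so its result may differ slightly from the exact count on large data sets.

```sql
-- Approximate vs. exact distinct count on a small hypothetical data set.
DECLARE @Visits TABLE (UserId INT NOT NULL);
INSERT INTO @Visits (UserId) VALUES (1), (2), (2), (3), (3), (3);

SELECT COUNT(DISTINCT UserId)        AS ExactUsers,
       APPROX_COUNT_DISTINCT(UserId) AS ApproxUsers
FROM @Visits;

-- Memory-optimized tempdb metadata (takes effect after an instance restart).
ALTER SERVER CONFIGURATION SET MEMORY_OPTIMIZED TEMPDB_METADATA = ON;
SELECT SERVERPROPERTY('IsTempdbMetadataMemoryOptimized') AS IsTempdbMetadataMemoryOptimized;

-- UTF-8 storage in a VARCHAR column via a UTF-8 enabled collation.
CREATE TABLE dbo.Utf8Demo (
    Id   INT NOT NULL PRIMARY KEY,
    Note VARCHAR(200) COLLATE Latin1_General_100_CI_AS_SC_UTF8 NOT NULL
);
```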

Data virtualization

As mentioned earlier, supporting multiple models and paradigms has become one of the main goals of commercial databases, as a way to establish and maintain a core position in the overall enterprise data architecture. In reality, however, heterogeneous data sources always exist. From another angle, then, how well a product strengthens and eases interoperability between heterogeneous data sources has gradually become an important consideration and evaluation criterion for modern databases.

When it comes to connecting data, the first approach that comes to mind is scheduled data transfer with ETL tools such as SSIS and Azure Data Factory. This is certainly effective, but it has limitations in data freshness and data duplication. A more advanced idea than building ETL pipelines is data virtualization. As the name suggests, data virtualization means that data can be managed and accessed through a unified abstraction regardless of where it is stored or what format it is in. Technically, a database-centric data virtualization system relies mainly on declarative external tables that point to and define the underlying data.

In SQL Server 2019, Microsoft treats data virtualization as a core product concept and one of its main goals, and supports it at the technical level chiefly through the built-in PolyBase feature. PolyBase is not a newcomer: it first appeared in SQL Server 2012 Parallel Data Warehouse, a distributed MPP edition delivered as an integrated hardware and software appliance. The PolyBase component gives the database the ability to define external tables that point to Hadoop/HDFS data, and it became an important bridge between relational databases and the Hadoop big data ecosystem. PolyBase truly matured and became widely known in SQL Server 2016, when it officially appeared in the mainstream SQL Server product and reached a much larger audience.

PolyBase's ability to reach external sources is further enhanced in SQL Server 2019. In addition to the previously supported Hadoop and Azure Blob Storage, the new version adds support for SQL Server, Oracle, Teradata, MongoDB, and ODBC. If PolyBase used to be an unremarkable auxiliary feature, in SQL Server 2019, with its emphasis on data virtualization, it is a core capability in the spotlight.

Data virtualization capabilities (from Microsoft's official documentation)

Let us look at a simple example of using PolyBase in SQL Server 2019 to configure a remote MongoDB data source, to see what data virtualization looks like in practice.

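-- A database master key must already exist before a database scoped credential can be created.
-- Credential used to authenticate against MongoDB (replace the placeholder values).
CREATE DATABASE SCOPED CREDENTIAL MongoCredential
	WITH IDENTITY = 'username', SECRET = 'password';
-- External data source pointing at the MongoDB server.
CREATE EXTERNAL DATA SOURCE MongoDBSource
	WITH (
		LOCATION = 'mongodb://<server>[:<port>]',
		PUSHDOWN = ON,  -- allow eligible computation to run on the MongoDB side
		CREDENTIAL = MongoCredential
	);
-- External table mapping a MongoDB collection and its fields to relational columns.
CREATE EXTERNAL TABLE MyMongoCollection(
	[_id] NVARCHAR(24) NOT NULL,
	[column1] NVARCHAR(MAX) NOT NULL,
	[column2] INT NOT NULL
	-- ..., other columns to be mapped
)
	WITH (
		LOCATION='dbname.collectionname',
		DATA_SOURCE= MongoDBSource
	);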

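As you can see, by defining three core objects in T-SQL (the credential, the data source, and the external table), collections and fields in MongoDB can easily be mapped into SQL Server, after which the virtual external table can be queried. PolyBase even supports nested structures such as objects and arrays in MongoDB, allowing complex fields to be flattened in the external table definition. Although the example here targets MongoDB, the configuration steps for other types of data sources are broadly similar; only the meaning and form of the relevant parameters differ.

Notably, external tables backed by PolyBase behave just like ordinary tables and can take part in joins and other operations with other tables. This greatly eases integration between heterogeneous data sources and in many cases removes the need to move data at all. For scenarios where querying the source directly is impractical for performance reasons, a simple SQL statement can also copy the external table's data into SQL Server.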
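Both usage patterns just described, joining an external table with local data and copying its contents into SQL Server, can be sketched as follows. The external table MyMongoCollection and its columns come from the example above; the local table dbo.LocalCustomers, its columns, and the join condition are assumptions made for illustration.

```sql
-- Hypothetical local table to join against.
CREATE TABLE dbo.LocalCustomers (
    CustomerId   INT           NOT NULL PRIMARY KEY,
    CustomerName NVARCHAR(100) NOT NULL
);

-- Join the MongoDB-backed external table with the local table like any other table.
SELECT c.CustomerName, m.column1, m.column2
FROM dbo.LocalCustomers AS c
JOIN dbo.MyMongoCollection AS m
    ON m.column2 = c.CustomerId
WHERE m.column2 > 100;

-- Materialize the external data into a regular local table when direct querying is too slow.
SELECT [_id], column1, column2
INTO dbo.MyMongoCollection_Local
FROM dbo.MyMongoCollection;
```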

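At the implementation level, because PolyBase grew out of an MPP architecture, it scales out well in parallel. This matters a great deal when the remote data volume is very large, and it can greatly speed up query execution. Users can set up multiple SQL Server instances (divided into a head node and compute nodes) and group them into a PolyBase Scale-out Group that works together to read and process external big data in parallel. Seen this way, the PolyBase module already gives SQL Server some typical characteristics of a distributed analytical database.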

PolyBase Scale-out Group architecture (from Microsoft's official documentation)
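Forming a scale-out group is done with system stored procedures on the compute nodes. The sketch below assumes a head node whose machine name is HEADNODE01 (hypothetical), a default instance named MSSQLSERVER, and the default data movement service control channel port 16450; the PolyBase services on the compute node need to be restarted after joining.

```sql
-- Run on each compute node: join the group led by the (hypothetical) head node HEADNODE01.
EXEC sp_polybase_join_group N'HEADNODE01', 16450, N'MSSQLSERVER';

-- After restarting the PolyBase services on the compute node,
-- run on the head node to list the nodes that are part of the group.
SELECT * FROM sys.dm_exec_compute_nodes;

-- To take a compute node out of the group again, run on that node:
-- EXEC sp_polybase_leave_group;
```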

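Another characteristic of PolyBase is a degree of query pushdown. Where the remote side supports it, the query processor sends eligible predicates to the data source so they are evaluated close to the data, which both improves query performance and reduces network IO. For example, when reading from Hadoop, PolyBase will sometimes choose, based on statistics, to use MapReduce to read and filter the raw files, so that only part of the result rather than the full data set has to be sent back.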
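Pushdown is normally decided by the optimizer, but it can be influenced per query with hints, which is handy when comparing plans. This sketch reuses the MyMongoCollection external table from the earlier example; whether a given predicate is actually eligible for pushdown depends on the data source and the expression involved.

```sql
-- Ask the optimizer to push eligible computation down to the external source.
SELECT column1, column2
FROM dbo.MyMongoCollection
WHERE column2 > 100
OPTION (FORCE EXTERNALPUSHDOWN);

-- Same query with pushdown disabled: rows are fetched first and filtered inside SQL Server.
SELECT column1, column2
FROM dbo.MyMongoCollection
WHERE column2 > 100
OPTION (DISABLE EXTERNALPUSHDOWN);
```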

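In summary, the concept of data virtualization and the strengthening of PolyBase may help the new generation of SQL Server become the center of the data architecture. By bringing together and integrating multiple heterogeneous data sources, SQL Server 2019 can reduce the complexity of enterprise architectures, and it is well suited to scenarios such as hot/cold data tiering and building a unified data lake.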

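SQL Server Big Data Clusters

The most noteworthy headline feature of SQL Server 2019 is probably SQL Server Big Data Clusters. Its bold design brings open-source big data components such as Hadoop and Spark directly into SQL Server and integrates them seamlessly on Kubernetes, and it attracted wide attention as soon as it was announced and entered limited preview last year. Everyone was curious: how would hot technology buzzwords such as big data, Hadoop, Spark, containerization, and cloud native react with a traditional commercial database?

SQL Server Big Data Clusters is both a new feature of SQL Server 2019 and a new product form and deployment model. It has several important characteristics: (1) SQL Server is deployed as multiple cooperating instances to achieve distributed storage, processing, and computation of data; (2) SQL Server is fully containerized and uses Kubernetes as the infrastructure for orchestrating and managing the underlying compute resources; (3) on top of its own distributed storage, it additionally ships a standard HDFS distributed file system; (4) at the compute layer, it additionally provides standard Spark as a distributed compute engine. An overview of the architecture is shown below: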

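SQL Server Big Data Clusters architecture (from Microsoft's official documentation)

As can be seen, SQL Server Big Data Clusters represents the latest architectural thinking in Microsoft's data platform, moving from simple interoperability with the outside world to full fusion with open-source platform technologies, and from integration and compatibility to a design in which each side contains the other. It is a bold experiment and an impressive product idea. The benefits are clear. For enterprise customers, the all-in-one design greatly simplifies the architecture: users can build their own one-stop big data platform on it and get both open-source and commercial technology. For Microsoft, it ensures that open-source workloads run smoothly inside the SQL Server cluster and ecosystem, much like a commercial Hadoop distribution, which should help it continue to succeed commercially in the open-source era.

If you want to try SQL Server 2019 Big Data Clusters, the simplest way is to first create an Azure Kubernetes Service (AKS) cluster (other clouds and on-premises K8s clusters are also supported) and then use the azdata command-line tool to deploy a SQL Server Big Data Cluster to Kubernetes in one step. The author ran hands-on experiments and examined the architecture, and found a number of interesting points in the technical implementation, some of which are listed below: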

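  • The control, compute, storage, and other nodes are fully containerized; at deployment time the corresponding images are automatically downloaded from the Microsoft Container Registry and run.
  • The master instance of the big data cluster supports multi-node deployment and high availability, achieved by combining the underlying failure detection and failover capabilities of K8s with Availability Groups in SQL Server.
  • The underlying distributed storage is built from disks mounted on the VM cluster and exposes two options, Data Pool and Storage Pool, corresponding to proprietary and open-source technology respectively. They are mounted and accessed by defining external tables that point to addresses under the sqldatapool or sqlhdfs protocol. The two kinds of storage can be combined and complement each other.
  • The Data Pool provides SQL Server's own distributed storage capability. It is generally used with a round-robin data distribution policy and offers high data loading performance. In practice it can serve as the landing zone for ingested external data, or as persistent storage for the result sets of large queries.
  • The Pods of the Storage Pool tightly integrate Spark, an HDFS DataNode, and a SQL Server instance, and together expose a complete HDFS file system. It is fully compatible with open-source columnar formats such as Parquet, and through the HDFS tiering feature it can mount cloud storage services such as Amazon S3 and Azure Data Lake Storage Gen2. At query time, SQL Server can use the information provided by the NameNode to perform fast local reads that respect data locality, and in many cases it supports predicate pushdown.
  • Full integration of the Spark runtime into the big data cluster is significant: it means the standard Spark stack can read and write the Storage Pool and share the same copy of the data in place with SQL Server. The author verified that the Spark version integrated in this release is 2.4, the latest major version at the time of writing.
  • The big data cluster automatically installs Elasticsearch and Kibana to help monitor key metrics and the health of each part of the system.
  • In terms of tooling, the cross-platform Azure Data Studio can connect to a SQL Server Big Data Cluster, and the dedicated SQL Server 2019 extension greatly simplifies ad hoc querying, cluster management, external table creation, and similar tasks. The popular Jupyter Notebook can also be used inside Azure Data Studio to connect to the cluster for exploratory data analysis and machine learning model training with SQL, Python/PySpark, or Scala/Spark scripts.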
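The two storage options in the list above are reached through external data sources that use the sqlhdfs and sqldatapool protocols. The following sketch, run against the master instance of a big data cluster, shows the general shape; the controller address controller-svc/default follows the pattern used in Microsoft's tutorials, while the table names, columns, file format options, and the HDFS directory /clickstream_data are assumptions for illustration.

```sql
-- Storage Pool: external data source over the cluster's built-in HDFS.
CREATE EXTERNAL DATA SOURCE SqlStoragePool
WITH (LOCATION = 'sqlhdfs://controller-svc/default');

-- File format describing hypothetical CSV files stored in HDFS.
CREATE EXTERNAL FILE FORMAT csv_file
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = ',', STRING_DELIMITER = '"', FIRST_ROW = 2)
);

-- External table over a hypothetical HDFS directory.
CREATE EXTERNAL TABLE dbo.clicks_hdfs (
    user_id     BIGINT,
    category_id BIGINT,
    clicks      BIGINT
)
WITH (DATA_SOURCE = SqlStoragePool, LOCATION = '/clickstream_data', FILE_FORMAT = csv_file);

-- Data Pool: external data source over SQL Server's own distributed storage.
CREATE EXTERNAL DATA SOURCE SqlDataPool
WITH (LOCATION = 'sqldatapool://controller-svc/default');

-- Data pool table whose rows are spread across the pool instances round-robin.
CREATE EXTERNAL TABLE dbo.clicks_data_pool (
    user_id     BIGINT,
    category_id BIGINT,
    clicks      BIGINT
)
WITH (DATA_SOURCE = SqlDataPool, DISTRIBUTION = ROUND_ROBIN);

-- Combine the two: persist an aggregate over HDFS data into the data pool.
INSERT INTO dbo.clicks_data_pool (user_id, category_id, clicks)
SELECT user_id, category_id, SUM(clicks)
FROM dbo.clicks_hdfs
GROUP BY user_id, category_id;
```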

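For reasons of space, we will not go further here. Readers interested in the key details and hands-on steps can follow the author's WeChat official account "云间拾遗" for follow-up articles.

On pricing, although SQL Server Big Data Clusters is still a commercial database offering and uses a large number of CPU cores, users need not worry too much about high licensing costs. The SQL Server team has designed a cost-friendly pricing strategy: apart from the master instance, which requires an Enterprise or Standard edition license, the compute, data, and storage nodes that make up the majority are billed under a specially designed and much cheaper "Big Data Node" model. This greatly reduces the cost burden of adopting SQL Server Big Data Clusters.

Looking back, although SQL Server Big Data Clusters is a brand-new capability, Microsoft may well have started laying the groundwork long ago. It is easy to see that several achievements of earlier SQL Server versions are exactly the technical prerequisites that made big data clusters possible. The PolyBase technology mentioned earlier, built up over many years, is the key to SQL Server's seamless interaction with the big data stack; and the Linux version introduced in SQL Server 2017 is an important foundation for containerized packaging.

Having fully embraced open source in recent years, Microsoft is gradually reaping the rewards. Embracing open source brings it closer to the community and to users, and it also gives its latest products more design flexibility. SQL Server Big Data Clusters, thoroughly containerized, orchestrated with Kubernetes, and integrating open-source components such as Spark and HDFS, is clearly a successful example of this strategy of "reform and opening up" and "borrowing what works".

Of course, everything has two sides. Some people in the industry take a different view of an all-in-one architecture like SQL Server Big Data Clusters. They argue that heavy integration and packaging is not necessarily where architecture is heading in the cloud era, and they prefer architectures that separate compute from storage and let each data component focus on doing one thing well. This is a matter on which reasonable people will differ. Perhaps SQL Server Big Data Clusters was designed primarily for large customers with on-premises deployments, while also appealing to enterprise solution providers who care a great deal about portability and cross-cloud compatibility. For those scenarios, SQL Server Big Data Clusters is a highly competitive choice. The market will give us the final answer.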

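Summary

The world is about to enter a new decade. SQL Server 2019, released at the end of 2019, shows Microsoft's outlook and ambitions for the next ten years. The launch of SQL Server Big Data Clusters in particular should lead to a wave of new big data platforms, and it will also prompt the industry to think about future big data architectures and about how commercial technology and the open-source world can coexist.

It is worth mentioning that SQL Server 2019, like SQL Server 2017, has a Linux version and is officially supported together with Linux vendors. In fact, SQL Server's feature coverage on Linux has been quietly and steadily improving: the 2019 release brings important features to Linux such as replication, Active Directory integration, and PolyBase on Linux. If you were still on the fence about the first Linux version two years ago, note that Linux compatibility and the feature set in SQL Server 2019 are much more complete. It is a better SQL Server for Linux, and now may be the time to get on board.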

Microsoft is cloud-first, so beyond making SQL Server 2019 itself cloud native, it will certainly bring the new version's capabilities step by step to its PaaS services on Azure. In fact, Azure SQL Database already supports some of the new features in SQL Server 2019, such as APPROX_COUNT_DISTINCT; you only need to manually set the database's compatibility level to 150, the level that corresponds to the 2019 version. PolyBase is another example: it was previously supported only on Azure SQL Data Warehouse (mainly for accessing Blob Storage), and it is likely to be updated and enhanced in the cloud accordingly and may be extended to more data services such as SQL Database or SQL Managed Instance.
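Setting the compatibility level mentioned above is a one-line change. A minimal sketch, run in the context of the database to be changed:

```sql
-- Check the current compatibility level of the current database.
SELECT name, compatibility_level
FROM sys.databases
WHERE name = DB_NAME();

-- Switch the current database to level 150, the level that corresponds to SQL Server 2019.
ALTER DATABASE CURRENT SET COMPATIBILITY_LEVEL = 150;
```

Because the compatibility level also affects query optimizer behavior, it is usually worth testing a workload before changing it on a production database.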

Finally, the development strategy of SQL Server 2019 can be summarized briefly. First, it continues to consolidate its core of native support for multiple data models and paradigms. Second, it keeps improving PolyBase, its data virtualization technology, to strengthen connections to external sources. Third, it achieves overall integration by embracing and merging with the open-source big data stack. This is a steady, layer-by-layer approach to product evolution. As a user, are you tempted? Let us wish SQL Server 2019 good luck.

"云间拾遗" focuses on introducing cloud computing products and technologies from the user's perspective. It keeps hands-on experience at the core of what it publishes and combines it with in-depth analysis of product logic and application scenarios. Readers are welcome to follow the "云间拾遗" WeChat official account.

Origin www.cnblogs.com/yunjianshiyi/p/sql_server_2019_in_depth.html