VLDB 2019:

 

An overview of the papers at VLDB 2019, the top database conference: six developments we found

 

Author | Han Shuo

VLDB 2019, the annual top-tier conference of the database field, was held in Los Angeles, California, from August 26 to August 30 (local time) to explore the field's most cutting-edge technologies and development trends.

Tencent, in cooperation with Renmin University of China and the National University of Singapore, had two Industry Papers accepted at this conference. One of them, the TDSQL team's paper "A Lightweight and Efficient Temporal Database Management System in TDSQL", describes T-TDSQL, a full-temporal database system extended from the distributed transactional database TDSQL. While preserving OLTP performance, the system provides lightweight management of full-temporal data and transaction processing over it. In the production architecture it also pairs a cluster holding current-state data with a cluster holding historical-state data for analysis, which together form a complete solution for full-temporal data.

After the conference, the Tencent TDSQL team summarized this year's papers, and the highlights are shared with readers here.

 

About VLDB

 

VLDB stands for the Very Large Data Bases conference. Sponsored by the VLDB Endowment, it is a major international academic conference attended by database researchers, vendors, application developers, and users from around the world. Its purpose is to promote cutting-edge academic exchange on databases and related fields worldwide. VLDB, together with SIGMOD (sponsored by ACM) and ICDE (sponsored by IEEE), is regarded as one of the three top conferences in the database field. In terms of the difficulty of getting a paper published and the attention the papers receive, VLDB can be said to be on a par with SIGMOD.

It is worth mentioning that most computer science conferences have only one or two submission deadlines a year, whereas the VLDB Endowment established PVLDB (Proceedings of the VLDB Endowment) in 2008 and has since reviewed papers in the style of a journal, with one submission cycle per month: the 1st of each month is the deadline for the previous month's cycle, so authors have 12 opportunities a year to submit. The review cycle is shorter than that of traditional journals, and authors generally receive review feedback within one and a half to two months. The papers accepted by PVLDB are then presented together at the annual VLDB conference.

VLDB 2019

This year's VLDB conference was the 45th, held from August 26 to 30 in Los Angeles, the well-known city on the US West Coast. The agenda included 3 keynote speeches (Keynote), 28 research paper sessions (Research Session), 4 industry paper sessions (Industry Session), 2 invited industry talks (Invited Industry Talks), 2 system demonstration sessions (Demo Session), 7 tutorials (Tutorial), a PhD workshop (PhD Workshop), and a number of workshops (Workshop). The conference lasted 5 days: two days of workshops and three days of main conference.

 

 

This year a total of 128 Research Papers, 22 Industry Papers, and 48 Demo Papers were accepted. Compared with last year, the numbers of accepted Research Papers and Demo Papers remained stable, while the number of Industry Papers rose markedly, from 12 last year to 22 this year. In terms of submissions and acceptance rates, Research Papers had 677 submissions and an acceptance rate of 18.9%, Industry Papers 72 submissions and 30.6%, and Demo Papers 127 submissions and 37.8%. Compared with last year, the number of Research Paper submissions declined slightly and the acceptance rate was essentially flat.
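The acceptance rates quoted above follow directly from the accepted and submitted counts. A quick check in Python (the dictionary layout and function name are only for illustration):

```python
# (accepted, submitted) counts as reported in the paragraph above
tracks = {
    "Research Paper": (128, 677),
    "Industry Paper": (22, 72),
    "Demo Paper": (48, 127),
}


def acceptance_rate(accepted, submitted):
    """Acceptance rate as a percentage, rounded to one decimal place."""
    return round(100.0 * accepted / submitted, 1)


rates = {name: acceptance_rate(a, s) for name, (a, s) in tracks.items()}
for name, rate in rates.items():
    print(f"{name}: {rate}%")
```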

The increase in accepted industry papers shows that this year's VLDB conference further strengthened exchange and cooperation between academia and industry. Beyond the Industry Papers, many Research Papers were also completed by companies or jointly by companies and universities; Google, Microsoft, IBM, and China's Alibaba, for example, each had several Research Papers accepted. Many people from industry also serve on the conference's program committee as session chairs or reviewers.

Domestically, universities and companies from mainland China (excluding Hong Kong, Macao and Taiwan) led or participated in a total of 27 Research Papers this year, a slight increase over last year; Tsinghua University, Zhejiang University, and other universities each published several papers. The papers from mainland universities focus mainly on graph data and machine learning, including seven papers related to graph data. Judging from the papers that mainland universities have published at VLDB, SIGMOD, and other database conferences in previous years, graph data has long been a relatively strong research direction for Chinese scholars. Domestic universities also work on query optimization, privacy protection, spatial data, crowdsourcing, blockchain, and other topics. The domestic industry's participation in academic database conferences is rising as well: Tencent, Alibaba, Huawei, and other domestic companies published papers at this conference, with research focused on RDBMS and distributed systems.

Next, this article gives an overview of this year's VLDB papers in terms of paper distribution and technology trends.

The overall distribution of papers

 

To keep the length of each session uniform, the conference divided the papers by research direction roughly evenly into 28 Research Sessions and 4 Industry Sessions, with 4 to 5 papers presented per session.

 

Because papers are unevenly distributed across research directions, the more popular directions, such as transaction processing, query optimization, and distributed systems and data, are given several sessions, while small numbers of papers from different directions may be mixed into the same session. As a result, the boundaries and hierarchy among the sessions are not very clear.

We read all of the papers and, on top of the session division, classified each paper in more detail by its research area and the type of data it targets, which makes it easier to see how active research is in each area.

 

Figure 1. Distribution of VLDB 2019 papers across fields

 

 

Figure 2. Distribution of VLDB 2018 papers across fields

 

Because a paper may involve several fields, the per-field counts in Figure 1 add up to more than the total number of papers. As Figure 1 shows, research on relational databases (RDBMS) is still the mainstream, at about 1/4 of all papers, although the number is somewhat lower than last year (34 this year versus 42 last year; see Figure 2). Next comes research on graph data and graph database systems. The related papers cover classical algorithmic problems on large-scale graph data, such as subgraph matching, association discovery, and constrained shortest path queries, as well as graph partitioning in distributed environments. While the dominance of the relational data model remains unshaken, the graph data model has gradually been adopted in real business in recent years. Whether the data is relational, graph, or of some other type, query execution and query performance optimization have always been the core issues. With the rapid development of the mobile Internet and the Internet of Things in recent years, applications that depend on spatio-temporal information and real-time performance keep emerging, so papers on spatio-temporal data and streaming data also have a place at this conference. In addition, machine learning and databases are growing closer, and some papers try to use machine learning algorithms to optimize query algorithms.

Distribution of papers across RDBMS sub-fields

 

We further subdivided the RDBMS-related papers by the sub-field they address, as shown in Figure 3. The number of transaction-related papers increased markedly compared with last year (see Figure 4); distributed transaction processing is both a difficult problem and a hot topic. Query optimization, storage optimization, cache optimization, and other topics closely tied to performance are always core research areas of databases. In addition, researchers have come to realize that making database access more convenient and intuitive for users is an important problem that needs to be solved. Academia defines it as the data usability (Data Usability) problem, and in recent years many papers have studied interactive access interfaces and data visualization techniques around it.

 

Figure 3. Distribution of VLDB 2019 papers across RDBMS sub-fields

 

Figure 4. Distribution of VLDB 2018 papers across RDBMS sub-fields

 

Papers from industry

 

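The industry papers come from Google, Microsoft, IBM, Amazon, Facebook, SAP, and eBay, as well as from Chinese companies such as Tencent, Alibaba, and Huawei. Besides the 22 Industry Papers, by our count 11 of the Research Papers were completed independently or led by companies and 17 were collaborations between companies and universities, together about 1/5 of the Research Papers; among the Demo Papers, 14 were led by or involved companies. This shows how heavily industry participates in database research and how close the cooperation between companies and universities has become. A noticeable difference is that industry papers put more emphasis on system implementation and deployment in real business, whereas academic papers focus on tackling a specific technical difficulty or algorithmic problem. Combining the strengths of both is more likely to produce high-quality research.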

 

Trends in database technology

 

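From this year's VLDB papers we try to observe and summarize new trends in database technology. We offer these thoughts as a starting point and look forward to discussing them with readers. The following are some of the important topics covered by this year's papers.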

 

Distributed transaction processing

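As Moore's Law stalls, the growth of single-machine storage and computing capacity has hit a bottleneck, and modern database systems are moving toward distributed multi-machine clusters. The biggest technical challenge on that path is distributed transaction processing. How to keep distributed data consistent, and how to implement the different transaction isolation levels efficiently, both need further study. At this year's VLDB, the number of papers on transaction processing also increased noticeably.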

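For example, the paper "Adaptive Optimistic Concurrency Control for Heterogeneous Workloads" proposes AOCC (adaptive optimistic concurrency control), a simple and effective framework. Based on the number of records a query reads and the write-set sizes of concurrent update transactions, AOCC adaptively chooses a suitable validation strategy to reduce overhead, which improves performance on heterogeneous workloads without sacrificing serializability. The paper "Improving Optimistic Concurrency Control Through Transaction Batching and Operation Reordering" improves OCC performance by executing transactions in batches and reordering operations. As it happens, TDSQL's second-generation transaction processing mechanism is also based on OCC, and we look forward to an opportunity to discuss it in depth with everyone.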
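For readers less familiar with OCC, the sketch below shows the basic read, validate, write cycle that both papers build on. It is a minimal single-threaded illustration and not the AOCC framework or the batching technique from the papers: the `Store` and `Transaction` classes are hypothetical names, validation simply compares per-key version counters, and the validate-and-write step is assumed to run atomically.

```python
class Store:
    """In-memory key-value store that keeps a version counter per key."""

    def __init__(self):
        self.data = {}  # key -> (value, version)

    def read(self, key):
        # Missing keys read as (None, version 0).
        return self.data.get(key, (None, 0))


class Transaction:
    """Optimistic transaction: no locks are taken while it runs."""

    def __init__(self, store):
        self.store = store
        self.read_set = {}   # key -> version observed at first read
        self.write_set = {}  # key -> new value, buffered until commit

    def read(self, key):
        if key in self.write_set:          # read your own buffered write
            return self.write_set[key]
        value, version = self.store.read(key)
        self.read_set.setdefault(key, version)
        return value

    def write(self, key, value):
        self.write_set[key] = value

    def commit(self):
        # Validation phase: abort if anything we read has changed since.
        for key, seen_version in self.read_set.items():
            if self.store.read(key)[1] != seen_version:
                return False
        # Write phase: install buffered writes and bump versions.
        for key, value in self.write_set.items():
            _, version = self.store.read(key)
            self.store.data[key] = (value, version + 1)
        return True


store = Store()
setup = Transaction(store)
setup.write("x", 10)
setup.commit()

t1, t2 = Transaction(store), Transaction(store)
t1.write("x", t1.read("x") + 1)   # both transactions read version 1 of x
t2.write("x", t2.read("x") + 5)
first = t1.commit()    # succeeds and installs x = 11
second = t2.commit()   # fails validation because x changed after t2 read it
```

The conflicting transaction is only detected at commit time, which is cheap when conflicts are rare and wasteful when they are frequent; choosing the validation strategy per workload is the kind of trade-off the AOCC paper addresses.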

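The paper "SLOG: Serializable, Low-latency, Geo-replicated Transactions" points out that existing geo-replicated databases usually have to trade off among three properties: (1) strict serializability, (2) low-latency writes, and (3) high transaction throughput. The SLOG system proposed in the paper exploits the locality of physical partitions and is able to satisfy all three requirements at once.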
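The locality idea can be illustrated with a small sketch: every data item has a home region, and a transaction is cheap when all the items it touches live in the region where it is submitted. This is only an illustration under assumptions, not SLOG's implementation; the `HOME_REGION` table, the region names, and the category labels are hypothetical.

```python
# Hypothetical mapping from data item to the region that "owns" it.
HOME_REGION = {
    "acct:alice": "us-west",
    "acct:bob": "us-west",
    "acct:chen": "ap-east",
}


def regions_touched(keys):
    """Set of home regions for the data items a transaction accesses."""
    return {HOME_REGION[key] for key in keys}


def classify(keys, local_region):
    """Classify a transaction by how much cross-region work it implies."""
    regions = regions_touched(keys)
    if regions == {local_region}:
        return "local single-home"    # no cross-region communication
    if len(regions) == 1:
        return "remote single-home"   # must be sent to one other region
    return "multi-home"               # touches data homed in several regions


local = classify(["acct:alice", "acct:bob"], "us-west")
remote = classify(["acct:chen"], "us-west")
multi = classify(["acct:alice", "acct:chen"], "us-west")
```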

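In transaction processing, failure recovery is one of the most complicated parts. A traditional database implementation usually has to maintain both a WAL (Write Ahead Log) and the persistent storage of the data itself, and the recovery algorithm permeates every module of the system: each module has to take the correctness of recovery into account in its design and implementation in order to preserve transaction atomicity. The paper "FineLine: Log-structured Transactional Storage and Recovery" proposes FineLine, a transactional storage and recovery mechanism that drops the traditional design of a WAL plus separately persisted data pages and stores all data that must be persisted in a single data structure, which decouples the design of the persistent part of the database from that of the in-memory data.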
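Going by the excerpts of the paper quoted in the comments at the end of this post (log records sorted by node ID and a node-local sequence number, and a fetch operation that rebuilds a node from the log), the core idea can be sketched as below. This is a toy in-memory model under assumptions, not FineLine's actual data structures: the `IndexedLog` class, the record layout `(node_id, seq, key, value)`, and the use of `None` as a delete marker are all hypothetical.

```python
from collections import defaultdict


class IndexedLog:
    """Toy indexed log: the only 'persistent' structure in this sketch."""

    def __init__(self):
        # node_id -> list of (seq, key, value) records for that node
        self.by_node = defaultdict(list)

    def append_commit(self, records):
        """Append one commit's records, sorted by node ID, then sequence number."""
        for node_id, seq, key, value in sorted(records, key=lambda r: (r[0], r[1])):
            self.by_node[node_id].append((seq, key, value))

    def fetch(self, node_id):
        """Rebuild the latest state of a node by replaying its log records."""
        node = {}
        for _seq, key, value in sorted(self.by_node.get(node_id, []), key=lambda r: r[0]):
            if value is None:          # None marks a deletion in this sketch
                node.pop(key, None)
            else:
                node[key] = value
        return node


log = IndexedLog()
log.append_commit([(2, 1, "k", "v1"), (1, 1, "a", 1), (1, 2, "b", 2)])
log.append_commit([(1, 3, "a", None), (2, 2, "k", "v2")])

state1 = log.fetch(1)   # node 1 after inserting a and b, then deleting a
state2 = log.fetch(2)   # node 2 after k was overwritten
```

Because a node's state is always derived from the log, nothing in this model ever writes an in-memory node back to storage, which is the decoupling the paragraph above describes.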

 

Blockchain technology & Best Paper Award

Blockchain is also one of today's hot topics. This year's VLDB added a dedicated session on blockchain, with 4 papers accepted. Notably, this year's VLDB Best Paper Award went to the paper "Fine-Grained, Secure and Efficient Data Provenance on Blockchain Systems".

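The motivation of this best paper is that blockchain systems still lack a convenient way to trace the origin and evolution (lineage) of data; they can only replay transactions to reconstruct past states. That approach suits large-scale offline analysis but not online transaction processing systems. The paper gives a simple example: account A may transfer money to account B only if B's recent daily balances have stayed above some threshold, and an existing system has to replay B's transactions for each recent day before it can make the decision. To solve such problems, the paper proposes the LineageChain system, which can trace blockchain data in a fine-grained, secure, and efficient way. LineageChain is implemented on Hyperledger, with ForkBase as the underlying storage (a blockchain-oriented storage system developed by the same team, published at VLDB 2018 as "ForkBase: An Efficient Storage Engine for Blockchain and Forkable Applications"). The paper proposes a new type of index optimized for queries on the origin and evolution of blockchain data. While online transactions are running, LineageChain captures data changes in a fine-grained and secure manner and exposes a simple interface for accessing them.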
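The transfer example can be made concrete with a small sketch. Instead of replaying transactions, every state change is recorded together with its block number, so a past balance can be looked up with a binary search. This is a simplified illustration under assumptions and not LineageChain's index or interface: the `VersionedState` class, its method names, and the choice of "at least the threshold" as the rule are hypothetical.

```python
import bisect


class VersionedState:
    """Keeps every version of each key, tagged with the block that wrote it."""

    def __init__(self):
        self._blocks = {}  # key -> sorted list of block numbers
        self._values = {}  # key -> values, parallel to _blocks[key]

    def put(self, key, block, value):
        blocks = self._blocks.setdefault(key, [])
        if blocks and block < blocks[-1]:
            raise ValueError("versions must be appended in block order")
        blocks.append(block)
        self._values.setdefault(key, []).append(value)

    def get_at(self, key, block):
        """Value of key as of the given block, or None if it did not exist yet."""
        blocks = self._blocks.get(key, [])
        i = bisect.bisect_right(blocks, block)
        return self._values[key][i - 1] if i else None


def may_transfer(state, account, day_end_blocks, threshold):
    """Allow the transfer only if every recent daily balance meets the threshold."""
    return all(
        (state.get_at(account, block) or 0) >= threshold
        for block in day_end_blocks
    )


state = VersionedState()
state.put("B", 10, 500)   # balance of account B after block 10
state.put("B", 25, 80)
state.put("B", 40, 300)

day_ends = [20, 30, 50]   # hypothetical last block of each recent day
strict = may_transfer(state, "B", day_ends, threshold=100)  # balance was 80 at block 30
lenient = may_transfer(state, "B", day_ends, threshold=50)
```

Each lookup costs a binary search over the versions of one key, rather than a replay of every transaction in the period.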

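The paper mentions that "The management of that history, also known as data provenance or lineage, has been studied extensively in database systems." This is in fact a philosophy of managing historical data, whose core is the belief that "historical data has value". The idea extends the territory of data processing systems into the storage, management, and computation of historical data, which is very meaningful. As the Best Paper, it has much that is worth learning from. In the same spirit, the Tencent TDSQL paper accepted at this year's VLDB, "A Lightweight and Efficient Temporal Database Management System in TDSQL", systematically describes TDSQL's complete solution and main techniques for managing historical data: from the data life cycle to the full-temporal data model, from transaction processing to globally consistent reads in a distributed system, from query optimization to index construction, and an integrated architecture that spans the transactional production system and the analytical cluster for historical data without loss of data or performance. This demonstrates the completeness, maturity, and forward-looking nature of TDSQL's handling of historical data.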

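Similarly, QLDB (Quantum Ledger Database), which AWS released at the end of 2018, also aims to address the storage, management, and computation of historical-state data. For details, see 《论亚马逊QLDB与腾讯TDSQL对历史数据的管理和计算》 (on how Amazon QLDB and Tencent TDSQL manage and compute over historical data).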

 

New hardware

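New storage and computing hardware, such as NVM, SSD, NUMA, SIMD, multi-core CPUs, GPUs, and FPGAs, brings new opportunities for scaling up database performance. How to fully exploit new hardware to improve database performance has been one of the research hot spots in recent years. As many as 9 papers at this year's VLDB touch on this direction. They offer new ideas such as using GPUs and SIMD to accelerate parallel computation in RDBMSs or machine learning platforms, and using NUMA to build a high-availability data replication scheme for distributed databases.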

Machine learning platforms

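Machine learning and deep learning, the hottest research areas of the moment, have also attracted wide attention from database researchers. Machine learning and deep learning algorithms are usually compute-intensive, and in practice the training data usually far exceeds what a single machine can handle. How to use distributed big-data storage and computation to offer users a one-stop machine learning and deep learning platform is therefore where the two fields meet. One clear sign is that database conferences such as VLDB and SIGMOD have added machine learning tracks over the last three years.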

 

Using machine learning algorithms to optimize DBMS performance

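This is another point where machine learning and database technology meet. For example, the paper "Towards a Learning Optimizer for Shared Clouds" studies a multi-tenant cloud database environment in which execution statistics from past queries are used as training data to predict the cardinality of intermediate results of future queries, which in turn guides the optimizer toward better query plans. In addition, VLDB and SIGMOD in the last two years have also seen work that uses machine learning models to optimize index structures, storage, and automatic parameter tuning.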
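The general idea of learning cardinalities from past executions can be shown with a very small stand-in model. The sketch below fits a straight line in log space between the optimizer's estimate and the row count actually observed, then uses it to correct a new estimate. It is not the model from the paper, whose details are not covered here; the function names and the history values are made up for illustration.

```python
import math


def fit_log_linear(history):
    """Least-squares fit of log(actual) = slope * log(estimate) + intercept.

    history is a list of (optimizer_estimate, actual_rows) pairs taken
    from past query executions.
    """
    xs = [math.log(estimate) for estimate, _ in history]
    ys = [math.log(actual) for _, actual in history]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict(model, estimate):
    """Corrected cardinality for a new optimizer estimate."""
    slope, intercept = model
    return math.exp(slope * math.log(estimate) + intercept)


# Made-up history in which the optimizer underestimates by a factor of 4.
history = [(100, 400), (1000, 4000), (10000, 40000)]
model = fit_log_linear(history)
corrected = predict(model, 5000)
```

With this history the fitted model learns the constant factor, so an estimate of 5000 rows is corrected to roughly 20000.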

Graph databases and graph computing platforms

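Compared with relational tables, the graph model can represent relationships between entities more flexibly. With the spread and application of knowledge graphs, research on graph data has earned a place in the database field. Unlike the basic Lookup, Scan, and Join operations on relational tables, however, graph algorithms come in many varieties, and many of them have high complexity. Storing, querying, and analyzing large-scale graph data has become a new technical challenge. Related research covers the construction of graph databases and graph analytics platforms.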

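After all of the above, do you have a better picture of this year's VLDB? You are welcome to share your thoughts with us. In follow-up articles, 他二哥 will continue to bring you more on-site reports and technical write-ups, so stay tuned for more from this year's VLDB!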

About the author:

 

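Han Shuo received a bachelor's degree in engineering from Beijing University of Posts and Telecommunications in 2014 and a Ph.D. in science from Peking University in 2019. His main research areas during the Ph.D. were graph data management and knowledge graphs. After graduating he joined Tencent, where he works on database technology research and development.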

 

  • hellocode
    hellocode, 24 days ago

    The FineLine mentioned in the article has not abandoned WAL; it is still a log-structured design.

  • rot.cx
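    rot.cx, 33 minutes ago

    OCC is better suited to OLTP workloads with a very low conflict rate, short transactions, and few records touched. "locking may be necessary only in the worst case"[1]; if conflicts are rare, the overhead of locking is comparatively significant.

    OCC only supports transactions of the {read-only, update} kinds; it clearly does not support conversational transactions that involve multiple rounds of interaction.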

  • rot.cx
    rot.cx, 16 minutes ago

    SLOG:
    SLOG uses locality in access patterns to assign a home region to each data granule. Reads and writes to nearby data occur rapidly, without cross-region communication. However, reads and writes to remote data, along with transactions that access data from multiple regions, must pay cross-region communication costs. Nonetheless, SLOG uses a deterministic architecture to move most of this communication outside of conflict boundaries, thereby enabling these transactions to be processed at high throughput, even for high contention workloads.

     

    An order-of-magnitude improvement; I have a feeling this experimental result may not be repeatable.

     

    Also, dynamic remastering is not suitable for access patterns that change frequently.

     

     

  • rot.cx
    rot.cx, 5 minutes ago

    FineLine:

     

    The distinguishing feature of FineLine in contrast to existing approaches is that it provides persistence without mapping data structures directly to a persistent storage representation.

     

    Following the WAL rule, a log record must be written before the affected page is written. FineLine, on the other hand, never flushes nodes or any other part of an in-memory data structure. Instead, it relies on the log, which is indexed for fast retrieval, as the only form of propagation to persistent storage. In order to retrieve a node into main memory, its most recent state is reconstructed from the log with the fetch operation.

     

     

     

     

    • rot.cx
      rot.cx, 1 minute ago

      FineLine:

       

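      The sequential log becomes an indexed log; before a commit is actually written to disk, its log records are first merged.

      The key point is that processing the indexed log is "very efficient" (see below):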

       

      The commit queue is formatted as a log page that can be appended directly to the indexed log. Before the append occurs, the log records in this page are sorted primarily by node ID and secondarily by a node-local sequence number. This sort can be made very efficient if log pages are formatted as an array of keys (or key prefixes) and pointers to a payload region within the page.

 


Origin www.cnblogs.com/cx2016/p/11609300.html