Block as a Value for SQL over NoSQL

Authors

Yang Cao, Wenfei Fan, Tengfei Yuan

University of Edinburgh; Beihang University; SICS, Shenzhen University

{Yang.cao@, wenfei@inf., Tengfei.yuan@}ed.ac.uk

Translation: Chenxiang Chun

Abstract

This paper presents Zidian, a middleware for key-value (KV) stores that speeds up SQL query evaluation over NoSQL. As opposed to the common practice of taking a tuple id or primary key as the key and the entire tuple as the value, Zidian proposes a block-as-a-value model, BaaV. BaaV represents a relation as keyed blocks (k, B), where k is a key of a block (set) B of partial tuples. We extend relational algebra to BaaV. We show that under BaaV, Zidian substantially reduces data access and communication cost. We provide characterizations (sufficient and necessary conditions) for (a) result-preserving queries, i.e., queries covered by a BaaV store; (b) scan-free queries, i.e., queries that can be evaluated without scanning any table; and (c) bounded queries, i.e., queries that can be answered by accessing a bounded amount of data.

We show that in parallel processing, Zidian guarantees (a) no scans for scan-free queries; (b) bounded communication cost for bounded queries; and (c) parallel scalability, i.e., speedup when processors are added. Moreover, Zidian can be plugged into existing SQL-over-NoSQL systems while preserving their horizontal scalability. Using benchmark and real-life data, we empirically verify that Zidian improves existing SQL-over-NoSQL systems by two orders of magnitude on average.

1. Introduction

Key-value (KV) stores have been widely used in industry. By supporting dictionary-like access for retrieving and storing data by key, KV stores offer high scalability, fault tolerance and transparent sharding. To support SQL queries at scale, a number of SQL engines have been developed on top of KV stores. After all, 75% of business data is generated and stored as relations [43], and data analysis is typically carried out with SQL queries. These systems usually adopt an SQL-over-NoSQL architecture: data is stored persistently in a KV store cluster, and queries are answered in a computing cluster that forms a separate layer [29]. This architecture has been adopted by Google's Spanner [20,12], Facebook's MyRocks [25], Hive [8] and SparkSQL [11], among other systems. While these systems offer the advantages of KV stores, their performance is not as good as that of a traditional DBMS when evaluating SQL queries, for the following reasons.

(1) Costly scans. Most SQL-over-NoSQL systems adopt the tuple-as-a-value (TaaV) model. It stores a relation as a set of KV pairs (k, t), where k is an internal id or the primary key of tuple t. These KV pairs are organized in a distributed hash table (DHT). The DHT supports efficient point access via get: given a key k, get retrieves the entire tuple t. However, for most SQL queries we do not know the keys of the relevant tuples in advance. Hence we have to scan a table "blindly", issuing as many get calls as there are tuples in the table.

(2) Heavy communication load. As observed in [37], few SQL-over-NoSQL systems can push data retrieval operations, such as selection predicates, down to the storage layer, and none can execute scans efficiently. As a result, a large amount of data (or even entire relations) often has to be retrieved from the KV store into the computing layer. This leads to costly data shuffling in parallel execution. It gets worse under the common practice of denormalized databases [32, 36], i.e., when wide tables or universal relations are used.

Can we reduce the excessive cost of data access and communication, and make existing SQL-over-NoSQL systems as efficient as a DBMS in answering SQL queries?

Zidian. To overcome the limitations of SQL-over-NoSQL, we have developed Zidian, a middleware for KV stores. Underlying Zidian is the block-as-a-value model (BaaV). In contrast to the traditional TaaV model of KV stores, BaaV represents a relation as keyed blocks (k, B), where k is a key of a block B of partial tuples. Under BaaV, any attributes can serve as the key k, whereas under TaaV, k can only be a tuple id or the primary-key attributes.

Under the BaaV model, Zidian offers the following.

[1] Efficient SQL. Zidian speeds up SQL-over-NoSQL systems by reducing get invocations and the retrieval of irrelevant data, and hence both computation and communication cost.

(a) Keyed blocks bring data locality to relations stored in a DHT. With a single get, one can retrieve a set of relevant (partial) tuples.

(b) BaaV stores conveniently provide KV stores with indexing capability, which, as observed in [37], KV stores do not support well. By making explicit use of indices, we can make queries scan-free, i.e., answerable without scanning any table. A scan-free query Q fetches only the part of the data that is needed for answering Q, which also reduces computational cost.

(c) By reasoning about keyed blocks (k, B) and the sizes of B, we can check whether a query is bounded, i.e., whether it can be answered by accessing a bounded amount of data no matter how large the underlying dataset is, and hence with bounded computation and communication cost.

[2] Scalability. In parallel processing, Zidian guarantees (a) parallel scalability, i.e., speedup is guaranteed when nodes are added to the computing layer; and (b) bounded communication cost for bounded queries. Moreover, (c) Zidian retains the horizontal scalability of SQL-over-NoSQL systems, i.e., adding new nodes to the storage layer increases throughput, measured as the total number of tuples retrieved via get from all storage nodes per unit of time [37].

[3] Ease of use. Zidian can be built on top of any KV store and used with any SQL-over-NoSQL system, without hacking into the system or changing its underlying KV store. That is, Zidian can be "plugged into" SQL-over-NoSQL systems to speed up their answering of SQL queries.

Contributions and organization. This paper presents Zidian and justifies BaaV, from foundations to practice.

(1) Data model (Section 4). We introduce BaaV, which represents relations in KV stores as keyed blocks. We extend relational algebra to BaaV stores, to take advantage of the BaaV model when answering SQL queries. In addition, we define query plans over BaaV stores, as well as scan-free and bounded queries, to speed up SQL evaluation in SQL-over-NoSQL systems.

(2) Framework (Section 5). Based on BaaV, we propose Zidian, a framework for speeding up SQL evaluation in existing SQL-over-NoSQL systems. It maps a conventional database D to a BaaV store. Given an SQL query Q over D, it answers Q in the BaaV store that corresponds to D, to the extent possible. We study fundamental problems for the framework. In particular, we provide a characterization of result preservation, i.e., a sufficient and necessary condition for deciding whether a query Q posed on D can be answered in the BaaV store of D.

(3) Scan-free and bounded queries (Section 6). We characterize scan-free (resp. bounded) queries, i.e., we develop sufficient and necessary conditions for determining whether an SQL query is scan-free (resp. bounded) over a BaaV store. Although the problem is undecidable, the characterizations provide an effective syntax for such queries that can be checked efficiently. In addition, we provide an algorithm for generating query plans that are guaranteed to avoid scans for scan-free queries (resp. to access a bounded amount of data for bounded queries).

(4) Parallelization: boundedness and scalability (Section 7). We propose to interleave data access and parallel computation when answering queries, instead of first fetching all the data and then computing the answers. With this strategy, we show that Zidian needs no scans for scan-free queries and incurs bounded communication cost for bounded queries. Moreover, under BaaV, Zidian guarantees parallel scalability and retains the horizontal scalability of SQL-over-NoSQL systems.

(5) Implementation (Section 8). As a proof of concept, we have implemented Zidian and deployed it on SoH (SparkSQL over HBase [7]), SoK (SparkSQL over Kudu [1]) and SoC (SparkSQL over Cassandra [6]). In addition to the framework of Section 5, Zidian further includes (a) a module that helps design BaaV schemas under a storage constraint (Section 8.1); and (b) adapters for deploying Zidian on existing KV systems.

(6) Experiments (Section 9). Using the TPC-H benchmark [42] and real-life data, we evaluate the effectiveness of Zidian. We find the following on average. (a) Zidian outperforms SoH, SoK and SoC by 2.8 × 10^2, 1.7 × 10^2 and 8.1 × 10^2 times for scan-free queries, and by 2.0 × 10^2, 1.5 × 10^2 and 3.6 × 10^2 times for non-scan-free queries, respectively. (b) With Zidian, the systems incur stable computation and communication cost for bounded queries when datasets grow. (c) Zidian is parallel scalable and scales well with large datasets; e.g., with 8 workers over a 128GB dataset, scan-free and non-scan-free queries take 27.7 and 65.4 seconds, respectively, with Zidian on SoH, as opposed to 1.7 × 10^3 and 2.1 × 10^3 seconds on SoH without Zidian. (d) Zidian retains the horizontal scalability of the underlying KV systems for KV workloads.

We discuss related work in Section 2 and review SQL-over-NoSQL in Section 3. Proofs of the results can be found in [2].

2. Related Work

We categorize the related work as follows.

SQL over NoSQL. The SQL-over-NoSQL architecture is widely adopted by commercial systems to support scalable parallel processing of SQL, e.g., [34,12,40,19,35,41,25], using KV stores such as Apache Cassandra [6], HBase [7] and Kudu [1] as the storage layer. Spanner [20,12,40] initiated this line of work in order to support large-scale distributed transactions. It is built on BigTable [18], and stores relational tables in the KV store under TaaV. It was followed by the open-source systems CockroachDB [19], NuoDB [35], MyRocks [25] and the system of [41] (which supports SPJ queries). SparkSQL [11] and Hive [8] also provide SQL-like query interfaces for Spark and Hadoop over KV-based datasets. All of these systems follow the column-family design of BigTable [18], and treat each tuple as a value that is accessed with a designated row key.

Although these SQL-over-NoSQL systems scale well for OLTP transactions, their efficiency is hampered by scans over KV stores, as observed by recent attempts to support analytical (OLAP) queries [37,15,33]. To overcome these limitations, [37,15,33,1] improve scan performance by designing new KV systems. They explore the design space of KV systems, trading off scan efficiency against other system parameters (e.g., updates, versioning and query types). Among them, Tell [37] (the most recent KV system optimized for scans) and Apache Kudu [1] also explore column-based storage of relations in KV stores. These efforts do not help existing combinations of KV stores and SQL-over-NoSQL systems.

By proposing the BaaV model, this work takes a different approach. It aims to improve the performance of analytical queries on existing SQL-over-NoSQL systems without compromising their scalability. It explores a new logical model for representing relations that existing KV stores can readily support, and studies its impact on query evaluation, without requiring any change to the KV stores.

Secondary indices. The BaaV model gives KV stores secondary-index capability, but it is not limited to indexing. Few KV stores support indices. The few that do encode secondary indices on relations by padding sort keys [38], and hence are still confined to the TaaV model. More specifically, a secondary index on a non-key attribute A of relation R in a KV store is typically implemented as a set of KV pairs (k, v), where k is the A-value padded with an internal id (or the primary key) of R, so that the keys are distinct and can be used as keys under TaaV, and v is the entire tuple of R. This is inefficient, since (a) a point access on A still incurs many get calls, (b) it does not help scans, and (c) it introduces additional index maintenance cost.

In contrast, (a) BaaV supports indexing for KV systems by means of the DHT: a point access on A fetches, with a single get, only the set of values of the requested attributes. Moreover, it avoids fetching unneeded attributes and duplicates in tuples, and thus reduces data access and the size of intermediate relations, which otherwise grow rapidly with redundant joins. (b) It improves scans by increasing data locality, while retaining parallelism and horizontal scalability. (c) As a data model, BaaV exposes indices at the schema level, and allows users to optimize for the use of specific indices. (d) Better still, we can reason about scan-free and bounded queries, all at the query level. Together these substantially reduce computation and communication cost, and go beyond the scope of traditional secondary indices, which only aim to speed up data retrieval.
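The difference in get calls between the two index layouts can be illustrated with a small in-memory sketch. It is not Zidian's implementation: the relation R(id, A, C), its rows, and the function names `lookup_taav` and `lookup_baav` are hypothetical, and plain Python dictionaries stand in for the DHT.

```python
from bisect import bisect_left

# Hypothetical relation R(id, A, C); rows are illustrative only.
rows = [(1, "x", 10), (2, "y", 20), (3, "x", 30), (4, "x", 40)]

# TaaV-style secondary index on A: the key is the A-value padded with the
# tuple id (so keys stay distinct); the value is the entire tuple.
taav_index = {(a, rid): (rid, a, c) for rid, a, c in rows}
sorted_keys = sorted(taav_index)

def lookup_taav(a):
    """Point access on A: one get per padded key whose prefix is `a`."""
    gets, out = 0, []
    i = bisect_left(sorted_keys, (a,))  # first key with prefix a, if any
    while i < len(sorted_keys) and sorted_keys[i][0] == a:
        out.append(taav_index[sorted_keys[i]])  # counts as one get
        gets += 1
        i += 1
    return out, gets

# BaaV-style keyed blocks: the key is the A-value itself; the value is the
# block of partial tuples (id, C) sharing that A-value.
baav_index = {}
for rid, a, c in rows:
    baav_index.setdefault(a, []).append((rid, c))

def lookup_baav(a):
    """Point access on A: a single get returns the whole block."""
    return baav_index.get(a, []), 1

print(lookup_taav("x"))
print(lookup_baav("x"))
```

In this toy setting, the padded-key index needs one get per matching tuple and returns full tuples, while the keyed block is fetched with one get and carries only the partial tuples.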

Materialized views. Materialized views are used in DBMS to customize stored databases and speed up query evaluation [39]. In a sense, BaaV and Zidian provide KV stores with the functionality of "materialized views". However, there are key differences. (a) To our knowledge, no major SQL-over-NoSQL system or NoSQL store supports materialized views. (b) One might want to extend existing KV systems to support materialized views as a DBMS does. However, such extensions cannot provide the benefits of BaaV stores if the views are stored under the TaaV model. Indeed, a view in a DBMS is essentially a relation tailored to given queries. Hence views in KV stores (if supported) are subject to the same limitations that base relations suffer in KV stores under the conventional TaaV model. BaaV is thus an alternative and more effective way for KV stores to support materialized views (and base relations).

Bounded evaluation. Also related is the work on bounded evaluation [26] and scale independence under cardinality constraints [9,10,27,16]. That line of work uses hash indices associated with cardinality constraints, and decides by query rewriting under the constraints whether a relational query has a query plan that answers it by accessing only a bounded amount of data through those indices.

This work differs from bounded evaluation in the following respects.

(a) Bounded evaluation focuses on deciding which queries can be boundedly evaluated, given a set of cardinality constraints and their associated hash indices. In contrast, we do not require cardinality constraints.

(b) Bounded evaluation has been developed for DBMS only, whereas BaaV and Zidian are developed for SQL-over-NoSQL systems on KV stores.

(c) We study issues that were not considered for bounded evaluation, e.g., the algebra developed for Zidian, data mapping and parallelization.

3. Preliminaries

We review basic notations of SQL-over-NoSQL systems.

Key-value stores. A key-value (KV) store is a set of KV pairs (k, v), where k and v are referred to as the key and the value, respectively.

It supports (a) get(k), which retrieves the KV pair (k, v) whose key is k; (b) put(k, v), which adds the KV pair (k, v); and (c) next(), which iterates over all keys and fetches the next key.
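The three operations can be pictured with a minimal in-memory sketch. The class name `KVStore` and its internals are assumptions made for illustration; this is not the API of HBase, Kudu, Cassandra or any other real KV store.

```python
class KVStore:
    """Toy in-memory KV store exposing get, put and next (illustrative only)."""

    def __init__(self):
        self._data = {}       # key -> value
        self._cursor = None   # iterator over keys, created on first next()

    def put(self, k, v):
        """Add (or overwrite) the KV pair (k, v)."""
        self._data[k] = v

    def get(self, k):
        """Return the value stored under key k, or None if k is absent."""
        return self._data.get(k)

    def next(self):
        """Return the next key in insertion order, or None when exhausted."""
        if self._cursor is None:
            self._cursor = iter(list(self._data))
        return next(self._cursor, None)


kv = KVStore()
kv.put("k1", "v1")
kv.put("k2", "v2")
print(kv.get("k1"))
print(kv.next(), kv.next(), kv.next())
```

The sketch snapshots the key list on the first call to next(), so keys added afterwards are not visited; a real store would define its own iteration semantics.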

Relations in KV stores. Under TaaV (tuple-as-a-value), a tuple t of relation R is represented in a KV store as a KV pair (k, v), where k is the primary key or an id of t in R, and v is t. Relation R is stored as a set of KV pairs that share the same key and value attributes, one for each tuple of R. Such KV stores are commonly referred to as wide-column or column-family stores, since they provide an abstraction of relations. A scan of R is carried out by iterating over all keys of R with next(), and invoking get on each key extracted.
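As a rough illustration of why a selection under TaaV costs one get per tuple, the sketch below stores a hypothetical relation SUPPLIER(suppkey, name, nationkey) in a dictionary keyed by its primary key. The relation, its rows and the helper `scan_select` are assumptions for illustration; iterating over the dictionary stands in for repeated next() calls, and indexing stands in for get.

```python
# Hypothetical relation SUPPLIER(suppkey, name, nationkey) under TaaV:
# key = primary key (suppkey), value = the entire tuple.
taav_supplier = {
    1: (1, "s1", 10),
    2: (2, "s2", 20),
    3: (3, "s3", 10),
}

def scan_select(store, pred):
    """Full scan: visit every key, get its tuple, keep those satisfying pred."""
    gets = 0
    result = []
    for k in list(store):   # stands in for iterating keys with next()
        t = store[k]        # stands in for get(k)
        gets += 1
        if pred(t):
            result.append(t)
    return result, gets

# Selecting suppliers with nationkey = 10 touches every tuple in the table.
matches, gets = scan_select(taav_supplier, lambda t: t[2] == 10)
print(matches, gets)
```

The number of gets equals the table size regardless of how selective the predicate is, which is the "blind" scan the text refers to.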

SQL-over-NoSQL. In such a system, as shown in Figure 1a, a database D of relational schema R is stored in a KV store under the TaaV model, in the storage layer. The SQL-over-NoSQL system exposes R to users, who then issue queries Q over R. Q is evaluated by a cluster in a separate layer, called the SQL layer. The SQL layer consists of an SQL parser, a planner and an executor, and generates a query plan ξ for Q. Plan ξ accesses the data of D in the underlying KV store only via get operations.

An SQL-over-NoSQL system works as follows. Upon receiving an SQL query Q, it retrieves all relations involved in Q from the storage layer, and moves the data to the SQL layer. The SQL layer then generates a parallel query plan ξ for Q, and executes the plan in parallel on all computing nodes.

Separating storage from computation makes SQL-over-NoSQL (a) highly available, since heavy computation tasks do not affect storage; and (b) easy to scale, since each layer can be scaled out on demand. However, it comes at a price: it typically leads to slow data access by scanning entire relations, and hence to a heavy communication load on the SQL layer.

4. The BaaV Model

In this section, we first introduce the BaaV model, which represents relations in KV stores as keyed blocks (Section 4.1). We then present an algebra over BaaV stores (Section 4.2).

4.1 BaaV: Block-as-a-Value

BaaV stores. A KV schema has the form R<X, Y>, where X and Y are sets of attributes. A keyed block over R<X, Y> is a KV pair (k, B), where k is a tuple over attributes X and B is a set of tuples over attributes Y. A KV instance D of R<X, Y> is a set of keyed blocks over R with distinct keys. The degree of D, denoted by deg(D), is the maximum size of the keyed blocks (k, B) in D, i.e., deg(D) = max_{(k, B) ∈ D} |B|, where |B| is the number of tuples in B.

Note that under the TaaV model, nationkey and suppkey cannot serve as key attributes, since they are not the primary keys of the corresponding relations. In contrast, under BaaV they can be used as keys, since the values are blocks of tuples.
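The definitions of keyed blocks and degree can be made concrete with a short sketch. The relation SUPPLIER(suppkey, name, nationkey), its rows, and the helpers `to_baav` and `deg` are hypothetical names chosen for illustration; they are not taken from Zidian.

```python
from collections import defaultdict

def to_baav(rows, attrs, X, Y):
    """Group rows into keyed blocks (k, B): k is a tuple over X,
    B is the set of tuples over Y that share the key k."""
    xi = [attrs.index(a) for a in X]
    yi = [attrs.index(a) for a in Y]
    blocks = defaultdict(set)
    for r in rows:
        k = tuple(r[i] for i in xi)
        blocks[k].add(tuple(r[i] for i in yi))
    return dict(blocks)

def deg(store):
    """deg(D): the maximum block size |B| over all keyed blocks (k, B) in D."""
    return max((len(B) for B in store.values()), default=0)

# Illustrative data only.
attrs = ("suppkey", "name", "nationkey")
rows = [(1, "s1", 10), (2, "s2", 20), (3, "s3", 10)]

# BaaV instance keyed on the non-key attribute nationkey.
by_nation = to_baav(rows, attrs, X=("nationkey",), Y=("suppkey", "name"))
print(by_nation[(10,)], deg(by_nation))

# Keying on the primary key yields singleton blocks, i.e., the TaaV layout.
by_suppkey = to_baav(rows, attrs, X=("suppkey",), Y=("name", "nationkey"))
print(deg(by_suppkey))
```

Here one get on key (10,) returns both suppliers of nation 10, and the instance keyed on suppkey has degree 1, matching the observation that TaaV corresponds to blocks holding a single tuple.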

 

Properties. BaaV has the following properties.

(1) In contrast to TaaV, a BaaV store allows any attributes to serve as keys. In fact, TaaV is the special case of BaaV in which B is a single tuple in every keyed block (k, B).

(2) Each get call retrieves more data under BaaV than under TaaV, and hence data access is more efficient in BaaV stores.

(3) The degree of a BaaV store indicates the level of data locality of the KV data in the DHT. By adjusting the degree of a BaaV store, we can obtain bounded queries and balance query efficiency against update cost in the BaaV store.

Mappings between relational databases and KV stores. There is a straightforward correspondence between the two.

Table 1 summarizes the notations.

 

4.2 KBA: Keyed Block Algebra

Origin www.cnblogs.com/chenxiangchun/p/11925755.html