
Kudu + Impala Introduction

Overview

Kudu and Impala are both top-level Apache projects contributed by Cloudera to the Apache Foundation. Kudu is the underlying storage engine: it supports low-latency, highly concurrent key-value queries while still maintaining good Scan performance, a combination that in theory lets it handle both OLTP-style and OLAP-style queries. Impala is a veteran SQL parsing engine whose stability and speed on ad-hoc queries have been extensively validated in industry. Impala has no storage engine of its own; it is responsible for parsing SQL and connecting to underlying storage engines. At first Impala mainly supported HDFS; after Kudu was released, the two were deeply integrated.

Among the many big-data frameworks, Impala's positioning is similar to Hive's, but Impala focuses more on fast ad-hoc queries; for SQL that takes a long time to execute, Hive remains the better fit. For GROUP BY and other analytical SQL, Impala computes in memory, so its machine requirements are high: the official recommendation is 128GB of memory or more. Hive, whose underlying computation is the conventional MapReduce framework, is less efficient but very stable, and its machine requirements are low.

Impala's biggest selling point is efficiency. Even on data stored in HDFS, Impala parses and executes SQL much faster than Hive; with Kudu underneath it is more powerful still, and in some cases query execution can be up to a hundred times faster.

It is worth noting that kudu and impala are the English names of two species of African antelope; Cloudera likes to name its products after fast-running animals.

Background

OLTP and OLAP

OLTP (On-line Transaction Processing) targets highly concurrent, low-latency CRUD operations (INSERT, DELETE, UPDATE, SELECT, and so on).

OLAP (On-line Analytical Processing) targets analytical requests from BI, which tolerate higher latency and process far larger volumes of data than OLTP.

Traditionally, OLTP maps to MySQL and other relational databases, while OLAP maps to the data warehouse. OLTP and OLAP differ in how data is stored, how queries are executed, and how requests are processed, so the infrastructure each requires is also very different. As a result, data must be stored in at least two places, synchronized regularly or in real time, and kept consistent. This causes data engineers enormous distress; a great deal of time is wasted on data verification and synchronization. Kudu + Impala does not solve this problem perfectly, but it undeniably eases the conflict.

Note that the OLTP discussed here is not required to satisfy the four ACID properties of transactions. In fact, OLTP as a concept appeared much earlier than ACID. In this article, OLTP and OLAP are distinguished by data volume, concurrency, and latency requirements; transactions are not the concern.

The Origin of Kudu

Kudu was first developed by Cloudera and contributed to the Apache Foundation on December 3, 2015; on July 25, 2016 it graduated from incubation and was promoted to a top-level Apache project. It is worth noting that Kudu's development received strong support from the Chinese company Xiaomi, which participated deeply in the work and has a Kudu committer.

As the graduation date shows, Kudu is still young: many details need improvement, and some important features (such as transactions) remain to be developed. Even so, more and more companies are putting the Kudu + Impala combination into practice, and it is currently the new big-data solution that Cloudera promotes most heavily.

Other Terminology

  • HDFS is the most basic storage engine in the Hadoop ecosystem. Note that HDFS is designed for large-file storage and high-throughput sequential reads and writes; it is not suitable for storing small files, nor does it support large volumes of random reads and writes.
  • MapReduce is the most basic distributed computing framework. It decomposes a job into multiple Mappers and Reducers, and it handles common large-data batch tasks well.
  • Hadoop originally consisted of HDFS + MapReduce; with the 2.0 and 3.0 releases, Hadoop has been given more and more functionality.
  • HBase was inspired by Google's Bigtable paper, and the project first began at the company Powerset. HBase is built on HDFS and offers high random-access performance, making it a typical engine for OLTP-style requests. At the same time, HBase's Scan performance is respectable, so it can also handle some OLAP-style requests.
  • Spark is a next-generation computing engine that unifies iterative computation and stream computation. Compared with MapReduce, Spark provides richer distributed-computing primitives and completes distributed computing tasks more efficiently.
  • Ad-hoc query is an important concept in data warehousing. In data exploration and analytics applications, users compose arbitrary one-off SQL and expect reasonably fast answers; such queries are collectively called ad-hoc queries.
  • Columnar storage, in contrast to row storage, keeps the data of each column together. Because values in the same column repeat to a higher degree, columnar storage achieves higher compression ratios. And because most BI analyses read only some of the columns, a column store scans just the columns it needs and reads less data, so it answers queries faster. Parquet is a common columnar storage format.

Kudu Introduction

What is Kudu

Kudu is a storage engine built for the Hadoop ecosystem and shares the ecosystem's common design philosophy: it runs on ordinary servers, deploys as a distributed system at scale, and satisfies industrial high-availability requirements. Its design goal is Fast Analytics on Fast Data. Kudu's positioning overlaps HBase's in many scenarios, but its design gives up a little random-access performance in exchange for much better scanning: in most scenarios Kudu offers random read/write performance close to HBase's alongside scan performance far beyond HBase's.

Compared with HBase and other storage engines, Kudu has the following advantages:

  • Fast processing of OLAP-style queries.
  • High compatibility with MapReduce, Spark, and other common Hadoop-ecosystem systems; the connector drivers are supported and maintained officially.
  • Deep integration with Impala; compared with the traditional HDFS + Parquet + Impala stack, Kudu + Impala performs better in most scenarios.
  • A strong and flexible consistency model that lets the user choose consistency requirements per request, including the option of strict-serializable consistency.
  • The ability to serve OLTP and OLAP workloads at the same time, with good performance on both.
  • Integration with Cloudera Manager, which makes operations friendly.
  • High availability. The master uses the Raft consensus algorithm for leader elections after failures; even if an election fails, the data remains readable.
  • A structured data model with pure columnar storage, saving space while delivering faster queries.

Typical Kudu Use Cases

Real-time stream computing

Stream-computing scenarios typically involve a continuous stream of writes, while the same data must also support near-real-time reads, writes, and updates. Kudu is designed to handle exactly this scenario.

Time-series storage (TSDB)

Kudu's hash partitioning is well suited to avoiding the local hot spots typical of TSDB workloads, and its efficient Scan performance lets Kudu support time-series queries better than HBase.

Machine Learning & Data Mining

The intermediate results of data mining and machine learning often need high-throughput batch reads and writes together with a small amount of random access. Kudu's design satisfies the storage needs of these intermediate results.

Coexistence with legacy data

In real industrial production environments there is often a large amount of legacy data. Impala supports HDFS, Kudu, and other underlying storage engines, so adopting Kudu does not require migrating all existing data into it.
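
As a minimal sketch of this coexistence (all table names here are hypothetical), one Impala query can join a Kudu-backed table with a legacy Parquet table on HDFS:

-- user_events is assumed to be stored in Kudu;
-- legacy_users is assumed to be an existing Parquet table on HDFS.
-- Impala reads both through the same SQL interface.
SELECT k.user_id, k.event_type, u.user_name
FROM user_events k
JOIN legacy_users u ON k.user_id = u.user_id
WHERE k.event_type = 'purchase';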

Important Concepts in Kudu

Columnar storage

Kudu is without question a pure columnar storage engine. Whereas HBase merely groups data by column family, Kudu's columnar storage is closer to Parquet: it supports more efficient Scan operations and occupies less storage space. Columnar storage has these advantages mainly for two reasons: 1. An OLAP query in the usual sense accesses only some of the columns, and a column store can read just the columns it needs, whereas a row store must retrieve every row in full. 2. Data from the same column stored together generally compresses better, because values in the same column tend to be highly similar.

Table

All Kudu data lives in tables. Each table has a schema and a primary key, and rows are stored sorted by primary key. Because Kudu is designed for large data volumes, the data in a table is split into shards called tablets.

Tablet

A tablet groups adjacent rows together. As in other distributed storage systems, a tablet has multiple replicas placed on different servers, and at any moment exactly one replica acts as leader. Every replica can serve reads independently, but writes must be synchronized across replicas for consistency.

Tablet Server

A tablet server hosts tablets; reads and writes of a tablet are performed by its tablet server. For a given tablet, one server acts as leader and the others as followers; leader election and failover follow the Raft consensus algorithm, described in detail below. Note that a tablet server can carry only a limited number of tablets, so a Kudu table design must choose a sensible number of partitions: too few partitions hurt performance, while too many create an excessive number of tablets and put pressure on the tablet servers.
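
To make the relationship between tables, tablets, and partitions concrete, here is a sketch of creating a Kudu table through Impala; the table name, columns, and partition count are illustrative assumptions, not recommendations:

-- A Kudu table must declare a primary key; within each tablet,
-- rows are stored sorted by that key.
CREATE TABLE user_events (
  user_id BIGINT,
  event_time BIGINT,
  event_type STRING,
  payload STRING,
  PRIMARY KEY (user_id, event_time)
)
-- 16 hash partitions produce 16 tablets (each further replicated
-- across tablet servers); choose this number with care.
PARTITION BY HASH (user_id) PARTITIONS 16
STORED AS KUDU;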

Master

The masters store the meta-information of all other services. At most one master acts as leader at any time; if the leader goes down, a new one is elected according to the Raft consensus algorithm.

The master coordinates the meta-information reads and writes coming from clients. For example, when a new table is created, the client sends the request to the master, and the master forwards it to the catalog, tablet, and other services.

The master itself does not store table data; its data lives in a tablet and is replicated just like an ordinary tablet.

Tablet servers maintain a heartbeat connection to the master every second.

Raft Consensus Algorithm

Kudu uses the Raft consensus algorithm, which divides nodes into three roles: follower, candidate, and leader. When the leader goes down, followers become candidates and elect a new leader by majority vote; because a majority is required, there is at most one leader at any time. The leader receives data-modification commands from clients and distributes them to the followers; once a majority of replicas have applied a write, the leader considers it successful and acknowledges the client.

Catalog Table

The catalog table stores Kudu's metadata, including information about tables and tablets.

Kudu Architecture Overview

As the figure below shows, there are three masters, one being the leader and the other two followers.

There are four tablet servers, and the replicas of the tablets are distributed evenly across the four machines. Each tablet has one leader and two followers. Each table is split into multiple tablets according to its partition count.

[Figure: Kudu architecture overview]

Impala Introduction

What is Impala

Impala is an interactive SQL parsing engine built for the Hadoop ecosystem. Its SQL syntax is highly compatible with Hive's, and it provides standard ODBC and JDBC interfaces. Impala does not provide data storage itself; data can come from HDFS, Kudu, HBase, or even Amazon S3.

Impala was first developed by Cloudera and contributed to the Apache Foundation in December 2015; its official name is now Apache Impala (incubating).

Impala is not a complete replacement for Hive. For high-throughput, long-running requests, Hive remains the most stable and best option; even SparkSQL cannot match Hive's stability.

Impala's stability is not as good as Hive's, but in efficiency Impala beats Hive beyond any doubt. Impala uses an in-memory computation model with distributed shuffles, exploiting a modern machine's memory and CPU as fully as possible. Impala also has preprocessing and analysis capabilities: after data is loaded, running a COMPUTE STATS statement lets Impala analyze the table's rows and columns in depth.
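
For example, over the hypothetical user_events table sketched earlier:

-- Collect table and column statistics so the planner can choose
-- better join orders and execution strategies.
COMPUTE STATS user_events;

-- Inspect the statistics Impala gathered.
SHOW TABLE STATS user_events;
SHOW COLUMN STATS user_events;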

Impala's Advantages

  • SQL syntax highly similar to Hive's, so the learning cost is low.
  • The ability to parse SQL over large-scale data, using memory and CPU efficiently and returning query results quickly.
  • Integration of multiple underlying data sources: HDFS, Kudu, and HBase share data through Impala, removing the need for data synchronization.
  • Deep integration with Hue, providing visual SQL operations and workflows.
  • Standard JDBC and ODBC interfaces, giving downstream business systems seamless access.
  • Permission management down to column granularity, satisfying the data-security requirements of real production environments.

How compatible is Impala with Hive SQL?

Impala is highly compatible with Hive, but some Hive SQL features are not supported in Impala, including:

  • Some data types are not supported.
  • XML and JSON functions are not supported.
  • Multiple DISTINCT aggregates in one query block are not supported; a query that needs several DISTINCT counts must be rewritten, for example:

select v1.c1 result1, v2.c1 result2
from (select count(distinct col1) as c1 from t1) v1
cross join (select count(distinct col2) as c1 from t1) v2;

Impala's compatibility with Hive shows not only in the grammar; at the framework level the two also remain compatible to a considerable degree. Impala uses Hive's metadata database directly, so a company's existing Hive tables need no migration and can be used from Impala as they are.

What Kudu + Impala Means for Us

Kudu + Impala is a good solution for a real-time data warehouse. The architecture supports random reads and writes while keeping good Scan performance, and officially supported clients exist for Spark and other stream-computing frameworks. In practice this means data computed in real time by Spark can be written into Kudu in real time, the Impala layer above can serve the SQL queries of BI analysts, and data-mining and algorithm jobs can iterate directly over the underlying Kudu data through the Spark framework.
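
At the SQL layer, a rough sketch of this dual role (again over the hypothetical user_events table): the same Kudu table serves point writes and analytical reads.

-- OLTP-style point write: UPSERT inserts the row, or replaces it
-- if a row with the same primary key already exists.
UPSERT INTO user_events VALUES (42, 1563900000, 'click', '{"page": "home"}');

-- OLAP-style aggregation over the same, immediately visible data.
SELECT event_type, COUNT(*) AS events
FROM user_events
GROUP BY event_type;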

Shortcomings of Kudu and Impala

Kudu's primary key restrictions
  • After a table is created, its primary key columns cannot be changed;
  • The primary key of an existing row cannot be modified by an Update operation. To change a row's primary key value, the row must be deleted and re-inserted with new data, and the two steps cannot be made atomic (see the sketch after this list);
  • Primary key columns may not be of type DOUBLE, FLOAT, or BOOL, and must be non-null (NOT NULL);
  • Auto-generated primary keys are not supported;
  • The storage cells (CELLs) that make up a row's primary key are limited to 16KB in total.
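
The workaround for the second restriction looks like this in Impala SQL (a sketch over the hypothetical user_events table; note that the two statements are not atomic):

-- Kudu cannot UPDATE primary key columns; "changing" a key
-- means deleting the old row and inserting a new one.
DELETE FROM user_events WHERE user_id = 42 AND event_time = 1563900000;
INSERT INTO user_events VALUES (42, 1563900060, 'click', '{"page": "home"}');
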
Kudu's column restrictions
  • Some MySQL data types, such as DECIMAL, CHAR, VARCHAR, DATE, and ARRAY, are not supported;
  • A column's data type and nullability cannot be modified;
  • A table may have at most 300 columns.
Kudu's table restrictions
  • The number of replicas must be odd, with a maximum of 7;
  • The replication factor cannot be modified after it is set.
Kudu's cell (Cell) restrictions
  • A single cell may hold at most 64KB of data, measured before compression.
Kudu's partitioning restrictions
  • Only manually specified partitioning is supported; automatic partitioning is not;
  • Partitioning cannot be modified after table creation; changing it requires the sequence "create a new table, copy the data over, delete the old table";
  • Tablets that have lost a majority of their replicas must be repaired manually.
Kudu's capacity restrictions
  • The recommended maximum number of tablet servers is 100;
  • The recommended maximum number of masters is 3;
  • The recommended maximum amount of data stored per tablet server is 4TB (a puzzling point: why so small a limit as 4TB?);
  • The recommended number of tablets per tablet server is under 1000;
  • Each table may have at most 60 tablets on any single tablet server.
Other Kudu usage restrictions
  • Kudu is designed with analytical use in mind; rows that carry too much data may run into problems;
  • Only the primary key index exists; secondary indexes are not supported;
  • Multi-row transaction operations are not supported;
  • Some features of relational databases, such as foreign keys, are not supported;
  • Table names and column names must be valid UTF-8 and at most 256 bytes;
  • Deleting a row does not immediately free space; a Compaction must run first, and triggering Compaction manually is not supported;
  • A Drop Table operation does free space immediately.
Impala's stability
  • Impala is not suited to very long-running SQL requests;
  • Impala does not support high-concurrency reads and writes, even though the underlying Kudu does;
  • Some Hive syntax is not compatible with Impala.

FAQ

Does Impala support high-concurrency reads and writes?

No. Although Impala is designed as a platform for BI-style ad-hoc queries, the cost of executing a single SQL statement is high, so it does not fit low-latency, high-concurrency scenarios.

Can Impala replace Hive?

No. Impala is designed around an in-memory computation model; it is efficient but less stable than Hive. For long-running SQL requests, Hive remains the first choice.

How much memory does Impala need?

Like Spark, Impala pulls data into memory for computation wherever possible. It can fall back to the disk when memory runs short, but without doubt memory size determines Impala's efficiency and stability. Impala officially recommends at least 128GB of memory, with 80 percent of it allocated to Impala.

Does Impala have a cache?

Impala does not cache table data; it caches only metadata such as table schemas. Although in practice the second run of an identical query may be faster, this is not an Impala cache at work; it is the cache of the underlying storage system or of Linux.

Can custom functions be added to Impala?

Yes. Impala has supported UDFs since version 1.2, although adding a UDF to Impala is more complicated than in Hive.
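
As a sketch, registering a native (C++) UDF through Impala SQL looks like this; the library path and symbol name are hypothetical, and the compiled library is assumed to have been uploaded to HDFS already:

-- Register a function backed by a compiled shared library on HDFS.
CREATE FUNCTION my_lower(STRING) RETURNS STRING
LOCATION '/user/impala/udfs/libmyudf.so'
SYMBOL='MyLower';

-- Use it like any built-in function.
SELECT my_lower(event_type) FROM user_events LIMIT 10;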

Why is Impala so fast?

Impala was built for speed, and it optimizes many details of execution. At a high level, unlike Hive, Impala does not use MapReduce as its computation model. MapReduce is a great invention that solves many distributed computing problems, but unfortunately it was not designed for SQL. When SQL is translated into MapReduce primitives, it often takes several rounds of iteration, with data landing on disk between rounds, which wastes a great deal of work.

  • Impala keeps data cached in memory as much as possible, so a SQL query can complete without data touching disk; compared with MapReduce's design of writing to disk on every round, this improves efficiency enormously.
  • Impala's processes are resident, avoiding MapReduce's startup overhead; for ad-hoc queries, MapReduce task startup cost is a disaster.
  • Impala is designed specifically for SQL, so it avoids splitting every task into Mappers and Reducers, reduces the number of iterations, and avoids unnecessary Shuffle and Sort steps.

At the same time, Impala is a modern computing framework that makes good use of modern high-performance servers.

  • Impala uses LLVM to generate executable code dynamically.
  • Impala exploits the hardware where possible, including SSE4.1 instructions for prefetching data and the like.
  • Impala coordinates its own disk IO, controlling the throughput of each disk at fine granularity so that overall throughput is maximized.
  • At the level of code efficiency, Impala is written in C++ and pursues language-level details, including inline functions and loop unrolling, in its hunt for speed.
  • For memory usage, Impala enjoys C++'s natural advantage: its footprint is much smaller than that of JVM-based systems, and at the code level it also follows the principle of minimal memory usage, which frees more memory for caching data.

What advantages does Kudu have over HBase, and why?

Kudu and HBase are very similar in certain respects, so we compare them side by side. However, the two differ in nature in the following ways:

  • Kudu's data model is more like a traditional relational database's, while HBase is a thoroughly NoSQL design in which everything is bytes.
  • Kudu's on-disk storage model is true columnar storage; the storage structures of Kudu and HBase are designed very differently.

Overall, purely OLTP requests are better suited to HBase; combined OLTP and OLAP requests suit Kudu.

Is Kudu a pure in-memory database?

No. Kudu divides its data into MemRowSets and DiskRowSets, and most data is stored on disk.

Does Kudu have its own storage format, or does it follow Parquet's?

Kudu stores data in memory in row format and on disk in columnar format; the on-disk format is similar to Parquet, differing in the parts that support random read and write requests.

Do compactions need to be run manually?

No. Kudu's compactions are designed to execute automatically in the background, block by block and slowly; manual compaction is not currently supported.

Does Kudu support automatic expiry (TTL) deletion?

No. HBase supports this feature.

Does Kudu have the same local hot-spot problem as HBase?

Modern distributed storage designs usually store data sorted by primary key, which can create local hot spots. For example, in a real-time log store whose primary key is the log timestamp, writes always land at the tail of the key order; in HBase this causes severe local hot spots. Kudu has the same problem, but to a much lesser degree, because Kudu supports hash partitioning: a written row first finds its tablet by hash, and only within the tablet is it stored in primary-key order.
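
A sketch of this in Impala DDL: hashing on a high-cardinality column spreads the write load, while range partitioning on time keeps time-window scans cheap (the table, columns, and boundary values are illustrative):

CREATE TABLE app_logs (
  host STRING,
  log_time BIGINT,
  message STRING,
  PRIMARY KEY (host, log_time)
)
-- Hashing on host spreads concurrent writers across 8 buckets,
-- so "latest timestamp" traffic does not all hit one tablet.
PARTITION BY HASH (host) PARTITIONS 8,
RANGE (log_time) (
  PARTITION VALUES < 1560000000,
  PARTITION 1560000000 <= VALUES < 1570000000,
  PARTITION 1570000000 <= VALUES
)
STORED AS KUDU;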

Where does Kudu sit in the CAP theorem?

Like HBase, Kudu is a CP system in CAP terms. Once a write has succeeded for one client, all other clients read consistent data; if a node goes down, writes will suffer a certain delay.

Does Kudu support multiple indexes?

No. Kudu supports only the primary key index, although the primary key may be composed of multiple columns. Multi-index support, foreign keys, and other features of traditional databases are still under design and development in Kudu.

How well does Kudu support transactions?

Kudu does not support multi-row transactions, nor does it support transaction rollback, but it does guarantee the atomicity of single-row operations.


Source: www.cnblogs.com/maoxiangyi/p/11240263.html