You don't need a real-time data warehouse | what you need is a powerful OLAP database (part 2)

In the previous chapter we talked about building a real-time data warehouse. Internet technology has developed into today's big-data era, the foundational areas have matured, and there is a wide range of solutions to choose from.

In real-time data warehouse construction, some layers already have mature, nearly monopolistic solutions: for message queues and storage, Kafka, Redis, and HBase have few rivals. The choice that really shapes a real-time warehouse is the OLAP database. Thanks to today's open-source spirit, the OLAP databases available for selection and use are dazzling. In this chapter we pick some of the most popular open-source OLAP engines and analyze them, hoping to help you with technology selection and future infrastructure upgrades.

This article draws on a performance evaluation of commonly used open-source OLAP engines:
https://blog.csdn.net/oDaiLiDong/article/details/86570211

OLAP: a hundred schools contend

Introduction to OLAP

OLAP, online analytical processing (Online Analytical Processing), is sometimes also called a DSS (decision support system); it is what we commonly call a data warehouse. Its counterpart is OLTP (On-Line Transaction Processing), the online transaction processing system.

The concept of online analytical processing (OLAP) was first proposed in 1993 by E. F. Codd, the father of the relational database. The proposal drew a great response, and OLAP emerged as a class of products clearly distinct from online transaction processing (OLTP).

Codd believed that online transaction processing (OLTP) databases could not meet end users' query and analysis requirements: simple SQL queries over large databases simply cannot satisfy users' analytical needs. Decision analysis requires heavy computation over relational data to produce results, and the results of simple queries cannot answer the questions decision makers raise. Codd therefore proposed the concepts of multidimensional databases and multidimensional analysis, that is, OLAP.

The OLAP Council's definition of online analytical processing is: a class of software technology that transforms raw data into multidimensional "information data" that users can truly understand and that reflects the true nature of the business, enabling analysts, managers, and executives to access that information quickly, consistently, and interactively from many angles, thereby gaining deeper insight into the data. The goal of OLAP is to meet the specific query and reporting needs of decision support and multidimensional environments; its core concept is the "dimension", so OLAP can be said to be a collection of multidimensional data analysis tools.

OLAP criteria and characteristics

E. F. Codd proposed 12 guidelines for OLAP:

  • Guideline 1: Multidimensional conceptual view
  • Guideline 2: Transparency
  • Guideline 3: Accessibility
  • Guideline 4: Consistent reporting performance
  • Guideline 5: Client/server architecture
  • Guideline 6: Generic dimensionality (dimensional equivalence)
  • Guideline 7: Dynamic sparse matrix handling
  • Guideline 8: Multi-user support
  • Guideline 9: Unrestricted cross-dimensional operations
  • Guideline 10: Intuitive data manipulation
  • Guideline 11: Flexible report generation
  • Guideline 12: Unlimited dimensions and aggregation levels

In a nutshell:

OLTP systems emphasize database in-memory efficiency, memory/cache hit-rate metrics, bind variables, concurrent operation, and transactionality;
OLAP systems emphasize data analysis, long-running SQL execution, disk I/O, and partitioning.

Open-source OLAP engines

The mainstream open-source OLAP engines currently on the market include, but are not limited to: Hive, Hawq, Presto, Kylin, Impala, Spark SQL, Druid, ClickHouse, Greenplum, and so on. It is fair to say that no single engine is perfect in data volume, flexibility, and performance all at once; users need to make a selection according to their own requirements.

Components and characteristics

Hive

https://hive.apache.org/

Hive is a data warehousing tool built on Hadoop. It maps structured data files to database tables, provides full SQL query capability, and converts SQL statements into MapReduce jobs to run. Its advantage is a low learning cost: simple statistics jobs can be implemented quickly through SQL-like statements without developing dedicated MapReduce applications, which makes it very suitable for statistical analysis in a data warehouse.
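As a concrete sketch (the table name and HDFS path here are hypothetical), mapping raw files already sitting in HDFS to a table and aggregating over it looks like this in HiveQL:

```sql
-- Map tab-delimited log files in HDFS to an external table.
CREATE EXTERNAL TABLE page_views (
  user_id BIGINT,
  url     STRING,
  ts      STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/data/logs/page_views';

-- Hive compiles this into one or more MapReduce jobs.
SELECT url, COUNT(*) AS pv
FROM page_views
GROUP BY url
ORDER BY pv DESC
LIMIT 10;
```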

Hive is aimed primarily at OLAP applications, with the HDFS distributed file system underneath. Hive is generally used only for statistical analysis, not for routine CUD operations; data needs to be synchronized into HDFS from existing databases or log files, and real-time incremental synchronization is still quite difficult today.

Hive's advantages are complete SQL support, low learning cost, customizable data formats, and high scalability (it can easily be extended to thousands of nodes), and so on.

However, Hive does not do any processing of the data during loading, not even a scan, so there are no indexes on any keys. To access a particular value that meets some condition, Hive has to brute-force scan the entire dataset, so access latency is high.

And Hive really is slow. A large aggregation or multi-table join query can keep Hive computing for a very long time; at one point I even wanted to expel it from the OLAP "nationality", but I have to admit that Hive is still the most widely used OLAP engine in the Hadoop ecosystem.

Hawq

http://hawq.apache.org
https://blog.csdn.net/wzy0623/article/details/55047696
https://www.oschina.net/p/hawq

Hawq is a Hadoop-native massively parallel SQL analysis engine. Hawq uses an MPP architecture with an improved, cost-based query optimizer designed for Hadoop. Besides efficiently processing its own internal data, it can also access external data sources such as HDFS, Hive, HBase, and JSON via PXF. HAWQ is fully compatible with the SQL standard; you can write UDFs in SQL and even do simple data mining and machine learning in SQL. In both features and performance, HAWQ is well suited to building analytical data warehouse applications on Hadoop.
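A small sketch of the PXF external-table mechanism mentioned above (all names are hypothetical; the PXF host, port, and profile depend on your deployment):

```sql
-- Expose a CSV directory on HDFS to HAWQ through PXF.
CREATE EXTERNAL TABLE ext_orders (
  order_id BIGINT,
  amount   NUMERIC
)
LOCATION ('pxf://namenode:51200/data/orders?PROFILE=HdfsTextSimple')
FORMAT 'TEXT' (DELIMITER ',');

-- Query it like any other table.
SELECT COUNT(*), SUM(amount) FROM ext_orders;
```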

The components of a typical Hawq cluster are as follows:
[figure: components of a typical HAWQ cluster]

Some people online have compared Hawq and Hive query performance; on the whole, Hawq internal tables are much faster than Hive (4 to 50 times).
Original link: https://blog.csdn.net/wzy0623/article/details/71479539

Spark SQL

https://spark.apache.org/sql/

Spark SQL, formerly known as Shark, integrates SQL queries seamlessly with Spark programs and can query structured data as Spark RDDs. Spark SQL has continued to develop as part of the Spark ecosystem; it is no longer limited to Hive, but remains compatible with Hive.

Spark SQL's position in the overall Spark ecosystem is as follows:
[figure: Spark SQL's position in the Spark ecosystem]

The Spark SQL architecture is as follows:
[figure: Spark SQL architecture]

For students already familiar with Spark, Spark SQL is easy to understand and pick up:

  • Compared with the Spark RDD API, Spark SQL carries more information about the structured data and the operations on it, and it uses that information for additional optimizations, making operations on structured data more efficient and convenient.
  • SQL provides a common way to access a variety of data sources, including Hive, Avro, Parquet, ORC, JSON, and JDBC.
  • Excellent Hive compatibility.
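For illustration (the table name and path are hypothetical), the same SQL-over-files pattern in Spark SQL, runnable from the spark-sql shell or via spark.sql(...):

```sql
-- Expose a Parquet directory as a table in Spark SQL.
CREATE TABLE events
USING parquet
OPTIONS (path '/data/warehouse/events');

-- Executed by Spark's own optimized engine, not by MapReduce.
SELECT event_type, COUNT(*) AS cnt
FROM events
GROUP BY event_type;
```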

Presto

https://prestodb.github.io/

Presto is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes.
Presto allows querying data where it lives, including Hive, Cassandra, relational databases or even proprietary data stores. A single Presto query can combine data from multiple sources, allowing for analytics across your entire organization.
Presto is targeted at analysts who expect response times ranging from sub-second to minutes. Presto breaks the false choice between having fast analytics using an expensive commercial solution or using a slow "free" solution that requires excessive hardware.

That is the official introduction to Presto. Presto is an open-source distributed SQL query engine from Facebook for interactive big-data analytics. It supports a large number of data sources, including HDFS, RDBMSs, Kafka, and so on, and it also provides a very friendly interface for developing data-source connectors.

Presto supports standard ANSI SQL, including complex queries, aggregations, joins, and window functions. As an alternative to Hive and Pig (both of which execute queries as MapReduce pipelines over HDFS data), Presto does not store data itself; instead it can access many kinds of data sources and supports cascading queries across data sources.
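That cross-source capability looks like this in practice; a minimal sketch, assuming a 'hive' and a 'mysql' catalog have been configured as connectors (all schema, table, and column names are hypothetical):

```sql
-- One Presto query joining data that lives in two different systems.
SELECT o.order_id, o.amount, u.name
FROM hive.sales.orders AS o
JOIN mysql.crm.users AS u
  ON o.user_id = u.id
WHERE o.dt = '2019-09-01';
```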

https://blog.csdn.net/u012535605/article/details/83857079
Presto does not use MapReduce; it runs queries through a custom query execution engine, and all query processing happens in memory, which is the main reason for its high performance. In this respect Presto and Spark SQL are very similar, and it is their most fundamental difference from Hive.

However, while memory-based Presto is much faster than disk-reading Hive, computing in memory also means that joining many large tables can trigger out-of-memory errors.

[figure: Presto architecture]
https://www.cnblogs.com/tgzhu/p/6033373.html

Kylin

http://kylin.apache.org/cn/
https://www.infoq.cn/article/kylin-apache-in-meituan-olap-scenarios-practice/
Speaking of Kylin, we have to talk about ROLAP and MOLAP.

  • By data storage, traditional OLAP divides into ROLAP (relational OLAP) and MOLAP (multidimensional OLAP).

  • ROLAP stores the data to be analyzed in a relational model. Its advantages are small storage volume and flexible querying; the disadvantage is equally evident: every query has to compute the aggregations on the fly. To mitigate this, ROLAP engines employ columnar storage, parallel query execution, query optimization, bitmap indexes, and similar techniques.

  • MOLAP physically stores the data to be analyzed as multidimensional arrays, forming CUBE structures. Dimension attribute values map to array subscripts or subscript ranges, and the fact values are stored in the array cells. The advantage is fast queries; the disadvantages are that the data volume is hard to control and the dimension-explosion problem can occur.

Kylin itself is a MOLAP system: the (MOLAP) Cube is designed so that users can define a data model over datasets of ten billion rows and more and build pre-aggregated cubes in Kylin.

Apache Kylin™ is an open-source distributed analytical engine that provides a SQL query interface and multidimensional analysis (OLAP) capability on Hadoop/Spark for extremely large datasets. Originally developed and contributed to the open-source community by eBay Inc., it can query huge Hive tables at sub-second latency.

[figure: Kylin architecture]

Kylin's advantages are:

  • Provides an ANSI SQL interface
  • Interactive query capability
  • The MOLAP Cube concept
  • Seamless integration with BI tools

Typical scenarios for Kylin include:

  • The user's data sits in Hadoop HDFS and is accessed as relational data through Hive, with a huge data volume, at least 500 GB
  • Daily incremental imports of several GB, even tens of GB
  • Around 10 or so relatively fixed analysis dimensions

Briefly, the idea of Kylin's data cube is to trade space for time: given a defined set of dimensions, every combination of dimensions is precomputed and stored. With N dimensions there are 2^N combinations, so it is best to keep the number of dimensions under control, because storage grows explosively with the dimension count, with disastrous consequences.
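The cube definition itself is created through Kylin's web UI or REST API rather than in SQL, but once a cube is built you query it with plain ANSI SQL and Kylin answers from the precomputed cuboids. A sketch against Kylin's bundled sample model (the KYLIN_SALES sample cube; your own tables and measures will differ):

```sql
-- Served from the precomputed cuboid for (PART_DT, LSTG_SITE_ID)
-- instead of scanning the raw fact table.
SELECT PART_DT, LSTG_SITE_ID, SUM(PRICE) AS gmv
FROM KYLIN_SALES
GROUP BY PART_DT, LSTG_SITE_ID;
```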

Impala

https://impala.apache.org/

Impala is also a SQL-on-Hadoop query tool based on MPP technology, supporting fast interactive SQL queries, and it shares metadata storage with Hive. The impalad process is the core: it receives queries and distributes tasks across the data nodes. The statestored process monitors all impalad processes and reports the status of each node in the cluster to them. The catalogd process is responsible for broadcasting the latest metadata to the impalad processes.

Impala's architecture is as follows:
[figure: Impala architecture]

Impala features include:

  • Supports Parquet, Avro, Text, RCFile, SequenceFile, and other file formats
  • Supports data stored in HDFS, HBase, and Amazon S3
  • Supports multiple compression codecs: Snappy, Gzip, Deflate, Bzip2, LZO
  • Supports UDFs and UDAFs
  • Automatically chooses the most efficient join order for tables
  • Lets you define query-priority queuing policies
  • Supports multi-user concurrent queries
  • Supports data caching
  • Provides table and column statistics (COMPUTE STATS)
  • Provides window functions (aggregation OVER PARTITION, RANK, LEAD, LAG, NTILE, etc.) to support advanced analytics
  • Supports spilling joins and aggregations to disk when the operation overflows memory
  • Allows subqueries in the WHERE clause
  • Allows incremental statistics: statistics are computed only on new or changed data
  • Supports complex nested queries over maps, structs, and arrays
  • Can be used to insert into and update HBase

Likewise, Impala is often compared with Hive and Presto, and its disadvantages are also obvious:

  • Impala does not provide any support for custom serialization and deserialization.
  • Impala can only read text files; it cannot read custom binary files.
  • Whenever new records/files are added to a table's data directory in HDFS, the table needs to be refreshed; an executing SQL query that hits the refresh can hang and stop making progress (see the sketch below).
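For reference, the refresh and statistics maintenance just mentioned are plain Impala statements (the table name here is hypothetical):

```sql
-- Make files newly landed in the table's HDFS directory visible.
REFRESH sales_db.orders;

-- Recompute table/column statistics so the planner can pick
-- efficient join orders; the incremental form touches only
-- new or changed partitions.
COMPUTE STATS sales_db.orders;
COMPUTE INCREMENTAL STATS sales_db.orders;
```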

Druid

https://druid.apache.org/
https://blog.csdn.net/warren288/article/details/80629909

Druid is a data store that offers sub-second queries over historical and real-time data. Druid supports low-latency data ingestion, flexible exploratory data analysis, high-performance data aggregation, and easy horizontal scaling, making it suitable for analytical query systems with large data volumes and demanding scalability requirements.

The problems Druid addresses are fast data querying and fast data ingestion.
So to understand Druid, it helps to see it as two systems: an ingestion system and a query system.

Druid's architecture is as follows:
[figure: Druid architecture]

Druid features include:

  • Druid consumes data in real time: ingestion is truly real-time, and results are queryable in real time
  • Druid supports PB-scale data and fast processing of hundreds of billions of events, serving thousands of concurrent queries per second
  • Druid's core is the time series: data is stored in time-ordered batches, which makes it very suitable for time-based statistical analysis
  • Druid divides data columns into three categories: the timestamp column, dimension columns, and metric columns
  • Druid does not support multi-table joins
  • Druid's data is generally pre-aggregated well by another computation framework (Spark, etc.) before ingestion
  • Druid is not suitable for complex pivot-style queries over many dimensions
  • Druid specializes in relatively simple query types; common SQL constructs (group by, etc.) are generally fast in Druid
  • Druid supports low-latency data inserts and updates, but is slower than HBase and much slower than traditional databases

Like other time-series databases, Druid can run into performance problems when a query hits a large amount of data; its sorting and aggregation abilities are in general not great, and it lacks flexibility and extensibility, for example joins and subqueries.

My personal understanding of Druid: Druid guarantees real-time writes, but its SQL query support is imperfect (no joins). It suits ingesting cleaned records in real time and then quickly querying results that include history, which is not practical for our current business.
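For the simple, time-bucketed aggregations Druid is built for, a query can be written in Druid SQL (available since Druid 0.10; the datasource and column names here are hypothetical):

```sql
-- Hourly event counts per channel over the last day;
-- __time is Druid's built-in timestamp column.
SELECT
  TIME_FLOOR(__time, 'PT1H') AS hour_bucket,
  channel,
  COUNT(*) AS events
FROM clickstream
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' DAY
GROUP BY 1, 2
ORDER BY hour_bucket;
```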

For Druid applications, refer to:
"Druid's usage scenarios and applications at Youzan" https://blog.csdn.net/weixin_34273481/article/details/89238947

Greenplum

https://greenplum.org/

https://blog.csdn.net/yongshenghuang/article/details/84925941
https://www.jianshu.com/p/b5c85cadb362

Greenplum is an open-source massively parallel data analysis engine. With its MPP architecture, it executes complex SQL analysis over large datasets faster than many alternative solutions.

GPDB fully supports the ANSI SQL 2008 standard and the SQL OLAP 2003 extensions; on the programming-interface side it supports ODBC and JDBC. This comprehensive standards support greatly eases system development, maintenance, and management. It supports distributed transactions and ACID, guaranteeing strong data consistency, and as a distributed database it has good linear scalability. GPDB has a sound ecosystem and can integrate with many enterprise-class products, such as SAS, Cognos, Informatica, and Tableau, as well as a wide variety of open-source software such as Pentaho and Talend.
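The MPP architecture shows up directly in the DDL: rows are hash-distributed across segments by a chosen key, and each segment works on its own slice in parallel. A minimal sketch with hypothetical names:

```sql
-- The distribution key drives how evenly work spreads
-- across Greenplum segments.
CREATE TABLE sales (
  sale_id     BIGINT,
  customer_id BIGINT,
  amount      NUMERIC(12,2),
  sale_date   DATE
)
DISTRIBUTED BY (customer_id);

-- Each segment scans and aggregates its own rows in parallel.
SELECT customer_id, SUM(amount)
FROM sales
GROUP BY customer_id;
```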

The Greenplum architecture is as follows:
[figure: Greenplum architecture]

Greenplum's technical characteristics are as follows:

  • Supports massive data storage and processing
  • Supports Just-in-Time BI: through quasi-real-time and real-time data loading, the data warehouse is updated in real time, realizing a dynamic data warehouse (ADW) on which business users can run BI analysis against current data
  • Supports mainstream SQL syntax; very convenient to use, with a low learning cost
  • Extensible; supports custom functions and custom types in multiple languages
  • Offers many maintenance tools; easy to maintain and use
  • Supports linear scaling: with the MPP parallel-processing architecture, adding nodes linearly increases the system's storage capacity and processing power
  • Good concurrency support and high availability: besides hardware-level RAID, it provides database-layer protection through a Mirror mechanism, plus a Master/Standby mechanism for master-node fault tolerance, so that when the master node fails, service can switch to the Standby node
  • Supports MapReduce
  • In-database compression

An important note: Greenplum is based on PostgreSQL; in other words, Greenplum's positioning is similar to TiDB's, aiming to unify OLTP and OLAP.

ClickHouse

https://clickhouse.yandex/
https://clickhouse.yandex/docs/zh/development/architecture/
http://www.clickhouse.com.cn/
https://www.jianshu.com/p/a5bf490247ea

The official website describes ClickHouse as:

ClickHouse is an open source column-oriented database management system capable of real time generation of analytical data reports using SQL queries.

ClickHouse was developed by Yandex, a Russian search-engine company, and is designed specifically for online data analysis. According to the official documentation, ClickHouse processes on the order of a billion records per day.

Its characteristics: columnar storage; data compression; sharding support, where the same computing task can run in parallel on different shards and the results are aggregated after computation; SQL support; join support; real-time updates; automatic synchronization of multiple replicas; index support; distributed storage and querying.
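A minimal sketch of what using it looks like (table and column names hypothetical): the MergeTree family is ClickHouse's workhorse storage engine, with the ORDER BY key serving as its sparse primary index and PARTITION BY pruning whole data parts:

```sql
CREATE TABLE hits (
  dt      Date,
  user_id UInt64,
  url     String
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(dt)
ORDER BY (dt, user_id);

-- Single-table aggregations like this are ClickHouse's sweet spot.
SELECT dt, uniq(user_id) AS dau
FROM hits
GROUP BY dt
ORDER BY dt;
```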

We are all familiar with Nginx, another popular piece of open-source software from the same "fighting nation" (Russia), whose hallmark traits are: lightweight and fast.

ClickHouse's biggest feature is that it is fast, fast, fast; say the important thing three times!
Compared with giant components like Hadoop and Spark, ClickHouse is very lightweight. Its characteristics:

  • Columnar storage with data compression
  • Relational model with SQL support
  • Distributed parallel computing, squeezing single-machine performance to its limit
  • High availability
  • Supports data at PB scale
  • Real-time data updates
  • Indexes

Using ClickHouse also has its limitations, including:

  • Lacks the ability to modify or delete existing data at high frequency and low latency; data can only be deleted or modified in batches
  • No full transaction support
  • No secondary indexes
  • Limited SQL support; joins are implemented differently
  • No window functions
  • Metadata management requires manual intervention

Summary

The above covers some commonly used OLAP engines. Each has its own characteristics; we can group them as follows:

  • Hive, Hawq, Impala: SQL on Hadoop
  • Presto and Spark SQL (similar to each other): SQL parsed into an execution plan, computed in memory
  • Kylin: trades space for time, precomputation
  • Druid: supports real-time data ingestion
  • ClickHouse: the HBase of the OLAP field, with a huge performance advantage on single-table queries
  • Greenplum: the PostgreSQL of the OLAP field

If your scenario is offline computing tasks based on HDFS, then Hive, Hawq, and Impala are the ones to research;
if your scenario is distributed queries with some real-time requirements, then Presto and Spark SQL may better match your expectations;
if your aggregation dimensions are relatively fixed, real-time requirements are high, and user-configurable dimensions plus metrics can be precomputed, then give Kylin and Druid a try;
ClickHouse takes center stage in single-table query performance, far ahead of the other OLAP databases;
Greenplum, as a relational database product, scales performance linearly as the cluster expands, and is better suited to data analysis.

As Meituan put it in a research report on Kylin:

No single OLAP system can satisfy the query needs of every scenario.
The essential reason is that no system can be perfect in data volume, performance, and flexibility all at the same time; every system has to make trade-offs among these three at design time.

Big Data technology and architecture
Welcome to scan the QR code and follow my public account; reply [JAVAPDF] to get 200 autumn-recruitment interview questions!
