Isn't Big Data just writing SQL?

The fresh-grad "little ancestor" on my team sat in on a requirements review, came back, and repeated the one line from the product manager that he just couldn't swallow:

 

"Is not it written SQL, so long to do."

 

Come on, picking on my junior like that? I obviously couldn't let it slide, so I wrote an article and posted it on the company wiki:

[screenshot of the wiki post]

 

I'm reposting it here for everyone, with some sensitive content omitted.

Of course, the internal version was worded a little more mildly, hehe.


Where do you write the SQL?

 

A fancier way to ask it: which engine do you write the SQL for?

 

SparkSQL, Hive, Phoenix, Drill, Impala, Presto, Druid, Kylin (I'm using "SQL engine" loosely here, so let's not nitpick).

 

Let me sum each of these up in one sentence; it doesn't matter whether it all makes sense to you right now (there's a small EXPLAIN sketch after the list):

  • Hive: parses the SQL and runs it as MapReduce jobs

  • SparkSQL: parses the SQL and runs it on Spark, faster than Hive

  • Phoenix: a SQL layer that runs directly on HBase, bypassing MapReduce

  • Drill / Impala / Presto: interactive OLAP query engines, all more or less in the spirit of Google's Dremel; the differences between them aren't worth going into here

  • Druid / Kylin: also OLAP, but with an emphasis on pre-computation
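
A quick way to get a feel for this yourself is to ask the engine for its plan. The sketch below is Hive syntax (EXPLAIN also exists in SparkSQL, Impala and Presto, though the output looks different), and the table and columns are made up for illustration:

    -- Show the MapReduce (or Tez/Spark) stages Hive compiles this query into.
    EXPLAIN
    SELECT dt, count(DISTINCT user_id) AS uv
    FROM tbl_access_log            -- hypothetical access-log table
    WHERE dt = '2020-03-01'
    GROUP BY dt;

The stage graph in the output is the "parsed and run as MapReduce" part of the one-liners above, made visible.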

 

There are already a lot of questions hiding in here, and a teammate who isn't familiar with these components might spend more than a month just on the research.

For example: is the requirement real-time computation or offline analysis?

Is the data incremental or static?

How much data is there?

How long a response time can we tolerate?

In short, functionality, performance, stability, operational burden, and development effort all have to be weighed.


What data does the SQL run on?

 

You think that once the engine is picked you can just start writing? Too naive!

Most of the tools above are only query engines. What about the storage?

"What? I have to worry about storage too?"

If nobody worries about storage, is the PB-scale data supposed to just sit in MySQL...?

 

In a relational database like MySQL, the query engine and the storage are tightly coupled; that coupling actually helps with performance tuning, and you can't pull the two apart.

In big data systems, by contrast, the SQL engine is usually independent of the data storage system, which buys a lot of flexibility. That separation is a deliberate trade-off driven by data volume and performance.

 

This drags in even more questions: first you have to find out which storage systems the engine can connect to, and how the data should be stored so that queries stay convenient and efficient.

As for which tools can serve as the persistent store, here's a chart to give you a feel for it (and this is only a small slice):

[chart: SQL engines and the storage systems they can connect to]
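
To make the "engine and storage are separate" point concrete, here's a minimal Hive sketch: the CREATE EXTERNAL TABLE statement only registers metadata, while the actual files sit in HDFS and can just as well be read by SparkSQL, Presto or Impala through the same metastore. The table name, columns and path are invented:

    -- The table definition is metadata only; dropping it leaves the files in place.
    CREATE EXTERNAL TABLE tbl_test_ext (
      id      BIGINT,
      user_id STRING
    )
    STORED AS PARQUET
    LOCATION '/warehouse/demo/tbl_test_ext';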


What syntax do you write the SQL in?

 

You think that with storage and a query engine sorted out you can finally start writing? You think SQL is the same the whole world over? It is not!

Not every engine supports join;

Not every distinct gives you an exact count;

Not every engine supports limit for paging;

And complex scenarios often call for custom SQL functions (UDFs). How do you add those? By writing code, of course.
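
One small taste of how the dialects diverge, as a hedged sketch (table and column are made up): Hive's count(distinct ...) is exact but can be expensive, while Presto offers approx_distinct(), a HyperLogLog-based estimate that trades a small error for a lot of speed.

    -- Hive / SparkSQL: exact distinct count
    SELECT count(DISTINCT user_id) AS uv FROM tbl_access_log;

    -- Presto: approximate distinct count, usually far cheaper on large data
    SELECT approx_distinct(user_id) AS uv FROM tbl_access_log;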

 

A few simple, everyday examples:

 

Ever seen a SQL statement like this?

    select `user`["user_id"] from tbl_test;

 

How about this operation?

    insert overwrite table tbl_test select * from tbl_test where id > 0;

 

Damn, won't that lock itself up? In Hive it doesn't, but it's still not recommended.

You can also write it like this:

    from tbl_test insert overwrite table tbl_test select * where id > 0;
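
Incidentally, that FROM-first form is Hive's multi-insert syntax: one scan of the source can feed several outputs. A minimal sketch (tbl_a and tbl_b are made-up target tables):

    -- A single pass over tbl_test writes two different tables.
    FROM tbl_test
    INSERT OVERWRITE TABLE tbl_a SELECT * WHERE id > 0
    INSERT OVERWRITE TABLE tbl_b SELECT * WHERE id <= 0;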


How do you write SQL efficiently?

 

Alright, with all of that sorted out, we can finally get down to the pleasure of writing SQL.

The process of writing that SQL is best summed up by something the little ancestor said when he first joined the company:

"FML, this SQL has more than 100 lines!"

 

Fact tables and dimension tables joined back and forth, over and over, and once that's done you still have to join in data from different time ranges, and then $ # @% ^ $ # ^ ...

 

I'll spare you the rest; anyone who has done it knows how nauseating it gets (100+ lines omitted here).
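
Just for flavor, a tiny hypothetical sketch of the general shape of such a query (the real one was an order of magnitude longer, and every table and column name here is invented):

    SELECT d.region,
           u.age_group,
           sum(f.amount)             AS gmv,
           count(DISTINCT f.user_id) AS buyers
    FROM   tbl_fact_order f                               -- fact table
    JOIN   tbl_dim_user   u ON f.user_id = u.user_id      -- user dimension
    JOIN   tbl_dim_shop   d ON f.shop_id = d.shop_id      -- shop dimension
    WHERE  f.dt BETWEEN '2020-03-01' AND '2020-03-07'
    GROUP BY d.region, u.age_group;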

 

Finally done. Having slogged all the way to this point, you hit Enter with a heart full of joy...

One minute goes by...

10 minutes...

30 minutes...

1 hour...

2 hours...

......

 

Stop waiting; at this rate there's never going to be a result.

 

Go read the logs properly; reading logs is a discipline of its own.

 

First you have to figure out how this SQL actually runs: is the underlying layer MapReduce or Spark, or has the query been translated into some other system's put/get-style interfaces?

Then you have to figure out how the data is being fetched, whether data skew is occurring, and how to optimize it.

And all the while you have to keep an eye on resources: CPU, memory, IO, and so on.
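
If it turns out to be Hive on MapReduce choking on a skewed group-by, a few session settings are the usual first aid. A hedged sketch: these are real Hive parameters, but whether they help depends entirely on your data and cluster.

    -- Pre-aggregate on the map side to shrink the shuffled data.
    set hive.map.aggr=true;
    -- Add an extra randomized reduce stage when group-by keys are skewed.
    set hive.groupby.skewindata=true;
    -- Let the planner turn joins against small tables into map joins (no shuffle of the big table).
    set hive.auto.convert.join=true;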


Finally

 

Then the product manager comes with a new requirement, the existing system can't deliver it, and the four steps above get churned through all over again...

Yes, we're "just writing SQL".


Source: www.cnblogs.com/bonelee/p/12441530.html