The big data technology ecosystem: how Hadoop, Hive, and Spark differ and relate

Big data itself is a very broad concept, and the Hadoop ecosystem (or the broader ecosystem around it) was basically born to handle data processing beyond the scale of a single machine. You can compare it to all the tools a kitchen needs: pots, pans, and knives each have their own use, and their uses also overlap. You can drink soup straight from the soup pot, or peel with a cleaver. Each tool has its own character, though, and while odd combinations can work, they are not necessarily the best choice.

With big data, first of all you have to be able to store big data.

Traditional file systems are single-machine systems; they cannot span different machines. HDFS (Hadoop Distributed File System) is essentially designed so that large amounts of data can spread across hundreds or thousands of machines, while what you see is still a single file system rather than many. For example, when you say you want the data at /hdfs/tmp/file1, you refer to one file path, but the actual data is stored on many different machines. As a user you don't need to know that, just as on a single machine you don't care which tracks and sectors a file is scattered across. HDFS manages this data for you.
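
As a minimal sketch of that abstraction, the snippet below reads one logical HDFS path as if it were a local file, assuming pyarrow with libhdfs available and a hypothetical namenode host; only the path /hdfs/tmp/file1 comes from the example above.

    # Minimal sketch: reading one logical HDFS path, assuming pyarrow with
    # libhdfs available and a hypothetical namenode at "namenode-host".
    from pyarrow import fs

    hdfs = fs.HadoopFileSystem("namenode-host", port=8020)

    # One path, even though the blocks behind it live on many machines.
    with hdfs.open_input_stream("/hdfs/tmp/file1") as f:
        data = f.read()
    print(len(data), "bytes read")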

Once the data is stored, you start thinking about how to process it. Although HDFS can manage data spread over different machines for you as a whole, the data is simply too big. One machine reading terabytes or petabytes of data (really big data, say the combined size of all the high-definition movies ever filmed in Tokyo, or even more) might grind away for days or even weeks. For many companies single-machine processing is intolerable; Weibo, for example, has to update its 24-hour trending list within 24 hours. So if I use many machines to process the data, I face the problems of how to distribute the work, how to restart the corresponding task when a machine fails, how the machines communicate and exchange data with each other to complete complex computations, and so on. This is what MapReduce / Tez / Spark handle. MapReduce is the first-generation compute engine; Tez and Spark are the second generation. MapReduce adopts a very simplified computation model with only two primitives, Map and Reduce (connected by a Shuffle in between), yet with this model a large share of problems in the big-data field can already be handled.

What are Map and Reduce?

Consider the case where you want to count word frequencies in a huge text file stored on something like HDFS: you want to know how often each word appears. You launch a MapReduce job. In the Map stage, hundreds of machines simultaneously read different parts of the file and compute word frequencies for their own part, each producing pairs like (hello, 12100 times), (world, 15214 times), and so on (here I'm folding Map and Combine together to simplify). Each of those hundreds of machines produces such a set, and then hundreds of machines start the Reduce stage. Reducer machine A receives from every Mapper machine all the statistics for words beginning with A, machine B receives those beginning with B (in practice you don't really partition by first letter; you hash the word to pick the partition, because words starting with X are certainly far rarer than others and you don't want wildly uneven workloads across machines). Each Reducer then aggregates again: (hello, 12100) + (hello, 12311) + (hello, 345881) = (hello, 370292). Every Reducer does this, and together you get the word-frequency result for the whole file.
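
To make that concrete, here is a toy, single-process sketch of the Map, Shuffle, and Reduce steps just described. It is purely illustrative (real MapReduce runs each stage on many machines through the Hadoop APIs); the sample chunks stand in for file splits.

    # Toy single-process sketch of the Map -> Shuffle -> Reduce steps above;
    # real MapReduce runs each stage on many machines, this only shows the model.
    from collections import defaultdict

    def map_phase(chunk):
        # Each mapper counts words in its own part of the file (Map + Combine).
        counts = defaultdict(int)
        for word in chunk.split():
            counts[word] += 1
        return counts.items()

    def shuffle(mapper_outputs, num_reducers=2):
        # Route every word to one reducer by hashing it, so load stays balanced.
        partitions = [defaultdict(list) for _ in range(num_reducers)]
        for pairs in mapper_outputs:
            for word, count in pairs:
                partitions[hash(word) % num_reducers][word].append(count)
        return partitions

    def reduce_phase(partition):
        # Each reducer sums the partial counts it received for its words.
        return {word: sum(counts) for word, counts in partition.items()}

    chunks = ["hello world hello", "world hello big data"]   # pretend file splits
    partitions = shuffle([map_phase(c) for c in chunks])
    result = {}
    for p in partitions:
        result.update(reduce_phase(p))
    print(result)   # {'hello': 3, 'world': 2, 'big': 1, 'data': 1}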

This looks like a very simple model, yet many algorithms can be described with it.

The simple Map + Reduce model is blunt and brute-force: easy to use, but very heavy. The second generation, Tez and Spark, besides adding new features such as in-memory caching, is essentially about making the Map/Reduce model more general: the boundary between Map and Reduce gets blurrier, data exchange becomes more flexible, and there is less disk reading and writing, so that complex algorithms are easier to describe and higher throughput can be achieved.
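
For a feel of what that looks like, here is a sketch of the same word count expressed as a Spark pipeline, assuming pyspark is installed and a file exists at the hypothetical path below; the .cache() call is the in-memory caching mentioned above.

    # Sketch of the same word count on Spark (assuming pyspark is installed
    # and a file exists at the hypothetical path below).
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("wordcount").getOrCreate()

    counts = (spark.sparkContext.textFile("/hdfs/tmp/file1")
              .flatMap(lambda line: line.split())
              .map(lambda word: (word, 1))
              .reduceByKey(lambda a, b: a + b)
              .cache())          # keep the intermediate result in memory

    print(counts.take(5))
    spark.stop()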

With MapReduce, and later Tez and Spark, programmers found that writing MapReduce programs is a real hassle, and they wanted to simplify the process. It's like having assembly language: you can do almost anything with it, but it still feels cumbersome. You want a higher-level, more abstract language to describe algorithms and data-processing flows. So Pig and Hive appeared. Pig describes MapReduce in something close to a scripting language; Hive uses SQL. They translate the scripts and SQL into MapReduce programs and throw them at the compute engine, and you are freed from tedious MapReduce code, writing programs in a simpler and more intuitive language.

Once Hive existed, people found that SQL has a huge advantage over Java. One is that it's so easy to write: the word-frequency job takes only a line or two of SQL, versus roughly a hundred lines of MapReduce. More importantly, non-programmers finally felt the love: data analysts who could write SQL were freed from the awkwardness of begging engineers for help, and engineers were freed from writing strange one-off handlers for them. Everybody was happy. Hive gradually grew into a core component of the big-data warehouse. Even many companies' pipeline jobs are described entirely in SQL, because it's easy to write, easy to change, understandable at a glance, and easy to maintain.
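
For instance, the word-count job from earlier shrinks to a couple of lines of HiveQL. The sketch below submits it through PyHive to an assumed HiveServer2 endpoint; the documents table and its line column are hypothetical.

    # The word-count job as HiveQL (table and column names are hypothetical),
    # submitted here through PyHive to an assumed HiveServer2 endpoint.
    from pyhive import hive

    query = """
        SELECT word, COUNT(*) AS freq
        FROM (SELECT explode(split(line, ' ')) AS word FROM documents) w
        GROUP BY word
    """

    cursor = hive.connect(host="hive-server-host", port=10000).cursor()
    cursor.execute(query)
    for word, freq in cursor.fetchall():
        print(word, freq)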

Once data analysts started analyzing data with Hive, they found that Hive running on MapReduce is really, really slow. Pipeline jobs might not care, say the 24-hour recommendation update: it's fine as long as it finishes within 24 hours. But for data analysis, people always want things to run faster. For example, I want to see how many people stopped on an inflatable-doll page within the past hour and how long each of them stayed; on a giant website with massive data, this can take tens of minutes or even many hours. And this analysis may be only the first step of your long march: you also want to see how many of those people went on to look at a Rachmaninoff CD, so you can report to the boss whether our customers are more repressed geeks or more artsy young men and women. You can't endure the torture of waiting and can only tell the handsome engineer: faster, faster, faster!

So Impala, Presto, and Drill were born (and of course countless less famous interactive SQL engines I won't list). The core idea of these three systems is that the MapReduce engine is too slow because it is too general-purpose, too sturdy, too conservative. SQL needs something lighter, more aggressive about grabbing resources, and more specifically optimized for SQL, without needing so many fault-tolerance guarantees (because if a system error occurs, the worst case is simply restarting the task, which is acceptable as long as the total processing time is short, say within a few minutes). These systems let users handle SQL tasks much faster, at the cost of generality and stability. If MapReduce is a machete that isn't afraid to chop anything, then the three above are boning knives: nimble and sharp, but unable to take on things too big or too hard.

To be honest, these systems have never reached the popularity people expected, because by that time two more unorthodox things had been produced: Hive on Tez / Spark, and SparkSQL. Their design philosophy is: MapReduce is slow, but if I run SQL on a new-generation general-purpose compute engine such as Tez or Spark, I can run faster, and users don't need to maintain two systems. It's like this: if your kitchen is small, you are lazy, and your demands on fine dining are limited, you can buy a rice cooker that can boil, steam, and stew, and save yourself a lot of kitchen utensils.
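
As a sketch of what that looks like from the user's side, the snippet below runs a SQL query through SparkSQL, assuming pyspark is available and the existing Hive metastore is reused; the page_views table is hypothetical.

    # Sketch of a query through SparkSQL (assuming pyspark; the page_views
    # table name is hypothetical).
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("sql-on-spark")
             .enableHiveSupport()     # reuse the tables already defined in Hive
             .getOrCreate())

    top_pages = spark.sql("""
        SELECT page, COUNT(*) AS visits
        FROM page_views
        GROUP BY page
        ORDER BY visits DESC
        LIMIT 10
    """)
    top_pages.show()
    spark.stop()

Same SQL, one engine to maintain: the query is translated into Spark jobs instead of MapReduce jobs, which is exactly the pitch of Hive on Tez / Spark and SparkSQL.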

What is described above is the basic architecture of a data warehouse: HDFS at the bottom, MapReduce / Tez / Spark running on top of it, and Hive and Pig running on top of those; or Impala, Drill, and Presto running directly on HDFS. This covers the requirements of medium-to-low-speed data processing.

What if I want to process data faster than that?

If I were a company like Weibo, I wouldn't want to show a 24-hour trending list; I'd want a constantly changing one, updated with a delay of about a minute, and the methods above wouldn't be up to it. So yet another computation model was developed: Streaming (stream) computation, and Storm is the most popular stream-computing platform. The idea of stream computation is: if you want more real-time updates, why not process the data right as it flows in? Take word-frequency counting again: my data stream is words arriving one by one, and I just keep counting as they flow past. Stream computation is very fast, with almost no delay, but its weakness is inflexibility: you must know in advance what you want to count, because once the data has flowed past it is gone, and anything you didn't count cannot be computed afterwards. So it is a good thing, but it cannot replace the data warehouse and batch systems above.
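
Here is a conceptual, single-process sketch of that idea, updating counts the moment each word flows past. It is not an actual Storm topology (a real one would be written as spouts and bolts); it only illustrates the model and its limitation.

    # Conceptual sketch of stream-style counting (not a real Storm topology):
    # counts are updated as each word arrives, so the result is always current,
    # but only what you decided to count in advance is available afterwards.
    from collections import defaultdict

    running_counts = defaultdict(int)

    def on_word(word):
        # Called for every word as it arrives; state is updated immediately.
        running_counts[word] += 1

    for word in ["hello", "world", "hello"]:   # pretend this is an endless stream
        on_word(word)
        print(dict(running_counts))            # an always-up-to-date "hit list"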

There is also a somewhat independent group of modules: KV Stores, such as Cassandra, HBase, MongoDB, and many, many others (more than you can imagine). A KV Store says: I have a pile of keys, and I can very quickly fetch the data bound to any key. For example, given an ID number, I fetch your identity data. This operation could also be done with MapReduce, but it would likely scan the whole dataset, whereas a KV Store is dedicated to this operation, with all its storage and behavior optimized specifically for it. Looking up one ID number in several petabytes of data might take only a few tenths of a second. This makes some specialized operations at big-data companies vastly faster. For example, I have a page that looks up an order's contents by order number, and the orders for the whole website are too many for a single-machine database to store, so I consider using a KV Store. The philosophy of KV Stores is basically: no complex computations, mostly no JOIN, maybe no aggregation, and no strong consistency guarantees (the data is distributed across different machines, and you might read different results each time; they cannot handle operations like bank transfers that require strong consistency). But, well, it's fast. Extremely fast.
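
As a sketch, here is that kind of key-to-value lookup against MongoDB (one of the stores named above) through pymongo; the host, database, collection, and order key are all hypothetical.

    # Sketch of a key -> value lookup with MongoDB via pymongo (one of the
    # stores named above); host, database, and key are hypothetical.
    from pymongo import MongoClient

    client = MongoClient("mongodb://kv-host:27017")
    orders = client.shop.orders

    # Fetch one order by its key instead of scanning the whole dataset.
    order = orders.find_one({"_id": "order-20190612-000123"})
    print(order)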

Each KV Store's design involves different trade-offs: some are faster, some hold more data, some support more complex operations. There is bound to be one that suits you.

Besides these, there are some more specialized systems and components: for example, Mahout is a distributed machine-learning library, Protobuf is an encoding format and library for data exchange, ZooKeeper is a highly consistent distributed coordination system, and so on.

With so many messy tools all operating on the same cluster, everyone needs to respect one another and work in an orderly way. So another important component is the scheduling system, of which the most popular today is Yarn. You can think of it as the central manager, like your mom supervising in the kitchen: hey, your sister has finished cutting the vegetables, you can take the knife and go deal with the chicken. As long as everyone obeys mom's assignments, everyone can cook happily.

You can think of the big-data ecosystem as a kitchen-tool ecosystem. To make different dishes, Chinese food, Japanese food, French cuisine, you need all kinds of tools, and the guests' needs keep getting more complicated, so new kitchenware keeps being invented; since no single universal tool can handle every situation, the ecosystem will only grow more and more complex.


Origin blog.csdn.net/chengxvsyu/article/details/92206182