Talking about the connection between Hadoop, Hive and Spark

Big data itself is a very broad concept, and the Hadoop ecosystem (or, more loosely, the pan-Hadoop ecosystem) exists basically to handle data processing beyond the scale of a single machine. You can compare it to the various tools a kitchen needs: pots, pans and knives, each with its own purpose, and with plenty of overlap between them. You can eat and drink soup straight out of the soup pot as if it were a bowl, and you can peel vegetables with a kitchen knife instead of a peeler. But each tool has its own characteristics, and while odd combinations may work, they may not be the best choice.
Big data, first of all, means you must be able to store big data. A traditional file system lives on a single machine and cannot span machines. HDFS (Hadoop Distributed File System) is designed precisely so that a huge amount of data can span hundreds or thousands of machines, while what you see is still one file system rather than many. For example, when you ask for the data at /hdfs/tmp/file1, you refer to a single file path, but the actual data is stored on many different machines. As a user you don't need to know any of this, just as on a single computer you don't care which tracks and sectors a file is scattered across. HDFS manages this data for you.
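A minimal sketch of what "one logical path, many machines" looks like from the user's side, assuming a running Hadoop cluster whose default filesystem is HDFS and a local PySpark installation; the path is the example path from above:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-read-sketch").getOrCreate()

# You address the file by one logical path; HDFS decides which machines
# actually hold the blocks and streams them back to you.
lines = spark.read.text("hdfs:///hdfs/tmp/file1")
print(lines.count())  # number of lines, regardless of how many machines store them

spark.stop()
```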
After storing the data, you start to think about how to process it. Although HDFS can manage data spread across many machines as a whole for you, the data is still enormous. Reading terabytes or petabytes on one machine (a huge amount — think of something bigger than all the high-definition movies ever produced in Tokyo combined) could take days or even weeks. For many companies single-machine processing is unbearable: if Weibo wants to update its 24-hour trending posts, it has to finish that processing within 24 hours. But if I use many machines to process the data, I immediately face new problems: how to distribute the work, how to restart a task when a machine dies, how machines communicate and exchange data to complete complex computations, and so on. That is what MapReduce/Tez/Spark do. MapReduce is the first-generation computing engine; Tez and Spark are the second generation. MapReduce adopts a radically simplified computing model with only two processing phases, Map and Reduce (with a Shuffle in between), yet with this model a large part of the problems in the big data field can be solved.
So what is Map and what is Reduce?
Consider a huge text file stored on something like HDFS, and suppose you want to know the frequency of each word in it. You start a MapReduce job. In the Map stage, hundreds of machines read different parts of the file at the same time, each counts the word frequencies of the part it read, and each emits pairs like (hello, 12100), (world, 15214) and so on (for simplicity, Map and Combine are lumped together here). Each of these hundreds of machines produces such a set, and then hundreds of machines start the Reduce stage. Reducer machine A receives from all the Mappers the counts for words starting with A, and machine B receives the counts for words starting with B (of course it isn't really done by first letter; a hash function is applied to the key instead, to avoid data skew — words starting with X are far rarer than others, and you don't want the workload to vary wildly from machine to machine). Then each Reducer aggregates again: (hello, 12100) + (hello, 12311) + (hello, 345881) = (hello, 370292). Every Reducer does the same, and you get the word-frequency result for the entire file.
It seems like a very simple model, yet a great many algorithms can be described with it.
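To make the three stages concrete, here is a toy, single-machine sketch of the same flow in plain Python — map, shuffle by key, reduce. Real MapReduce runs the mappers and reducers on hundreds of machines; this is only an illustration of the model:

```python
from collections import defaultdict

def map_phase(chunk):
    # Each mapper sees one chunk of the file and emits (word, count) pairs.
    counts = defaultdict(int)
    for word in chunk.split():
        counts[word] += 1
    return counts.items()

def shuffle(mapper_outputs):
    # The shuffle routes all pairs with the same key to the same reducer
    # (in reality a hash of the key decides which reducer, to avoid skew).
    grouped = defaultdict(list)
    for output in mapper_outputs:
        for word, count in output:
            grouped[word].append(count)
    return grouped

def reduce_phase(grouped):
    # Each reducer sums the partial counts for the keys it owns.
    return {word: sum(counts) for word, counts in grouped.items()}

chunks = ["hello world hello", "world hello spark"]  # pretend each chunk sits on a different machine
word_freq = reduce_phase(shuffle(map_phase(c) for c in chunks))
print(word_freq)  # {'hello': 3, 'world': 2, 'spark': 1}
```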
The simple Map+Reduce model, though easy to use, is clumsy. Beyond new features such as in-memory caching, the second-generation engines Tez and Spark essentially make the Map/Reduce model more general: they blur the boundary between Map and Reduce, make data exchange more flexible, and reduce disk reads and writes, so that complex algorithms are easier to describe and throughput is higher.
With MapReduce, Tez and Spark in hand, programmers found that MapReduce programs are genuinely troublesome to write, and they wanted to simplify the process. It's like having assembly language: you can do almost anything with it, but it still feels tedious. You want a higher, more abstract layer of language to describe algorithms and data-processing flows. Hence Pig and Hive. Pig describes MapReduce in something close to a scripting language, while Hive uses SQL. They translate the script or the SQL into MapReduce programs and hand them to the computing engine to run, and you are freed from cumbersome MapReduce code to write in a simpler, more intuitive language.
With Hive, people found that SQL has a huge advantage over Java. One is that it is so easy to write: the word-frequency job above takes only one or two lines of SQL, while in MapReduce it takes dozens or hundreds of lines. More importantly, users without a computer-science background finally feel the love: I can write SQL too! So data analysts are finally freed from begging engineers for help, and engineers are freed from writing weird one-off handlers. Everyone is happy. Hive has gradually grown into a core component of the big data warehouse. Even the production pipeline job sets of many companies are described entirely in SQL, because it is easy to write and change, understandable at a glance, and easy to maintain.
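For example, the word-frequency job really does shrink to a couple of lines of SQL. A hedged sketch: the table name `docs` and its single string column `line` are made up for the illustration, and the HiveQL is run here through Spark's SQL entry point, though the same statement works in Hive itself:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount-sql").getOrCreate()

# Assumed input: a table `docs` with one string column `line`.
spark.createDataFrame([("hello world hello",), ("world hello spark",)], ["line"]) \
     .createOrReplaceTempView("docs")

# The whole word-count job as SQL: split each line, explode to one row per word, group and count.
word_freq = spark.sql("""
    SELECT word, COUNT(*) AS freq
    FROM (SELECT explode(split(line, ' ')) AS word FROM docs) words
    GROUP BY word
""")
word_freq.show()
```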
Once data analysts began using Hive to analyze data, they found that Hive on MapReduce is slow. Pipeline jobs may not mind, such as recommendations updated every 24 hours, but for data analysis people always want things to run faster. For example, I want to see how many users paused on some specific pages in the past hour and how long they stayed; for a huge website with massive data, that query may take tens of minutes or even hours. And this analysis may only be the first step of a long journey, with many more things still to analyze. You can't stand the torture of waiting.
So Impala, Presto and Drill were born (plus countless less famous interactive SQL engines not listed here). The core idea of the three systems is that the MapReduce engine is too slow because it is too general, too heavyweight, and too conservative; our SQL needs to be lighter, more aggressive in acquiring resources, more specifically optimized for SQL, and not burdened with so many fault-tolerance guarantees (if something fails, just restart the task — no big deal when the overall processing time is short, say within a few minutes). These systems let users handle SQL tasks much faster, at the expense of generality and stability. If MapReduce is a machete that can hack through anything, then these three are boning knives: nimble and sharp, but not meant for anything too big or too hard.
These systems, let's be honest, have never achieved the popularity people expected, because by then two different creatures had appeared: Hive on Tez/Spark, and SparkSQL. Their design philosophy is: MapReduce is slow, but if I run SQL on a new-generation general-purpose computing engine like Tez or Spark, I can run faster, and users don't have to maintain two separate systems. It's like having a small kitchen, being lazy, and not being too picky about food: you buy a rice cooker that can steam, boil and cook, and save yourself a lot of kitchen utensils.
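The same word count expressed through Spark's DataFrame API — a sketch, reusing the same assumed input as the SQL example above; SparkSQL plans the SQL string and this code down to the same distributed execution plan:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("wordcount-dataframe").getOrCreate()

# Same assumed input as before: rows with one string column `line`.
docs = spark.createDataFrame([("hello world hello",), ("world hello spark",)], ["line"])

# The same word count, written against the DataFrame API instead of a SQL string.
word_freq = (docs
             .select(F.explode(F.split(F.col("line"), " ")).alias("word"))
             .groupBy("word")
             .count())
word_freq.show()
```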
What's described above is basically the skeleton of a data warehouse: HDFS at the bottom, MapReduce/Tez/Spark running on top of it, and Hive and Pig running on top of those; or Impala, Drill and Presto running directly against HDFS. This covers the requirements for low- to medium-speed data processing.
What if I want faster processing?
Suppose I were a company like Weibo and what I wanted to show was not a 24-hour trending list but a constantly changing one, updated with a delay of under a minute. None of the methods above can do that, so another computing model was developed: stream computing. Storm is the most popular stream-computing platform. The idea of stream computing is: if I want more real-time results, why not process the data the moment it flows in? Take word-frequency counting again: my data stream is the words arriving one by one, and I simply count them as they flow past. Stream computing is very fast, with essentially no delay, but its weakness is inflexibility: you must know in advance what you want to count, because once the stream has flowed past, anything you didn't compute is gone and cannot be recovered. So it is a great thing, but not a replacement for the data warehouse and batch systems above.
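Storm itself is programmed against its own Java/Clojure APIs; to keep the examples in Python, here is the same always-on word count sketched with Spark Structured Streaming instead — an illustration of the streaming idea, not Storm's API. The socket source on localhost:9999 is an assumption for the demo (e.g. fed by `nc -lk 9999`):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("streaming-wordcount").getOrCreate()

# Assumed demo source: a text socket on localhost:9999.
lines = (spark.readStream
              .format("socket")
              .option("host", "localhost")
              .option("port", 9999)
              .load())

# Count words as they flow past; the counts are updated continuously,
# not recomputed in a nightly batch job.
words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

query = (counts.writeStream
               .outputMode("complete")
               .format("console")
               .start())
query.awaitTermination()
```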
Another somewhat independent module is the KV Store, such as Cassandra, HBase, MongoDB and many, many others (more than you can imagine). A KV Store means: I have a pile of key-value pairs, and I can quickly fetch the data bound to a given key. For example, I can fetch your identity data with your ID number. This could also be done with MapReduce, but it would likely scan the whole dataset; a KV Store is dedicated to this one operation, with all storage and retrieval optimized for it. Looking up one ID number in petabytes of data might take only a few tenths of a second. That makes certain specialized operations at big data companies vastly faster. For example, my website has a page that looks up order details by order number, but the site's total order volume is too large for a single-machine database, so I'll consider a KV Store to hold it. The philosophy of the KV Store is that it basically cannot handle complex computation: most cannot do JOINs, many cannot aggregate, and there is no strong-consistency guarantee (the data is distributed across different machines, and you may read different results on different reads; they also can't handle operations that need strong consistency, such as bank transfers). But they are fast. Extremely fast.
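For the order-lookup example, a minimal sketch with MongoDB (one of the stores named above) via pymongo — the connection string, database name and the `order_id` field are assumptions made up for the illustration:

```python
from pymongo import MongoClient

# Assumed local MongoDB instance; in production this would point at a sharded cluster.
client = MongoClient("mongodb://localhost:27017")
orders = client["shop"]["orders"]

# Index the key you look things up by, so a get-by-key never scans the whole dataset.
orders.create_index("order_id")

orders.insert_one({"order_id": "20240101-000042", "items": ["soup pot"], "total": 99.0})

# The whole point of a KV store: fetch one record by its key, in milliseconds,
# no matter how many machines the full order history is spread across.
print(orders.find_one({"order_id": "20240101-000042"}))
```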
Different KV Store designs make different trade-offs: some are faster, some hold more, some support more complex operations. There is always one that suits you.
In addition, there are some more specialized systems and components: Mahout is a distributed machine-learning library, Protobuf is a data-interchange format with code-generation libraries, ZooKeeper is a highly consistent distributed coordination service, and so on.