Big data: the relationship between Hadoop, Hive, and Spark

Big Data

Big data itself is a very broad concept, and the Hadoop ecosystem (or, more loosely, the pan-Hadoop ecosystem) was essentially born to handle data processing beyond what a single machine can manage. You can compare it to all the tools needed in a kitchen. Each pot and pan has its own use, and their uses overlap: you can drink soup straight from the stock pot as if it were a bowl, and you can peel vegetables with a knife instead of a peeler. But each tool has its own strengths, and although odd combinations can work, they may not be the best choice.

To work with big data, first you need to be able to store big data

A traditional file system lives on a single machine and cannot span different machines. HDFS (Hadoop Distributed File System) is designed in essence to hold large amounts of data spread across hundreds or thousands of machines, while what you see is still one file system rather than many. For example, when you ask for the data at /hdfs/tmp/file1, you refer to a single file path, but the actual data is stored on many different machines. As a user you don't need to know any of this, just as on a single machine you don't care about which tracks and sectors a file occupies. HDFS manages the data for you.
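As a quick illustration of that "one file system" view, here is a minimal sketch that reads an HDFS path through the standard `hdfs dfs` command-line tool from Python. The path comes from the example above; the cluster itself is hypothetical, and the sketch assumes a configured Hadoop client on the machine.

```python
import subprocess

# List a directory and read a file through the HDFS CLI.
# /hdfs/tmp/file1 is the hypothetical path from the text; this assumes
# a Hadoop client is installed and configured for your cluster.
listing = subprocess.run(
    ["hdfs", "dfs", "-ls", "/hdfs/tmp"],
    capture_output=True, text=True, check=True,
)
print(listing.stdout)

content = subprocess.run(
    ["hdfs", "dfs", "-cat", "/hdfs/tmp/file1"],
    capture_output=True, text=True, check=True,
)
print(content.stdout[:200])  # first 200 characters; the blocks may live on many machines
```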

Once the data is stored, you start to think about how to process it. Even though HDFS can manage data across machines as a whole for you, the data may be huge. For one machine to read terabytes or petabytes of data (really big data, say the combined size of every high-definition movie ever made, or more) could take days or even weeks. For many companies, single-machine processing is unbearable: if Weibo wants to update its 24-hour trending posts, it has to finish that processing within 24 hours. So if I have to use many machines, I face the problems of how to distribute the work, how to restart a task when a machine fails, how machines communicate and exchange data to complete complex computations, and so on. This is the job of MapReduce/Tez/Spark. MapReduce is the first-generation computing engine; Tez and Spark are the second generation. MapReduce adopts a radically simplified computing model with only two processing steps, Map and Reduce (connected in the middle by a Shuffle). With this model, a large part of the problems in the big data field can already be handled.

So what is Map and what is Reduce?

Suppose you want to analyze a huge text file stored on something like HDFS, and you want to know the frequency of each word in the text. You launch a MapReduce job. In the Map stage, hundreds of machines read different parts of the file at the same time, each counting the word frequencies of the part it read and emitting pairs like (hello, 12100 times), (world, 15214 times), and so on (for simplicity, Map and Combine are lumped together here). Each of these hundreds of machines produces such a set, and then hundreds of machines start the Reduce stage. Reducer machine A receives from the Mapper machines all the counts for words starting with A, and machine B receives the counts for words starting with B (in practice the split is not actually by first letter but by a hash function, to avoid data skew: words beginning with X are surely far fewer than others, and you don't want the workload to be very uneven across machines). Then these Reducers aggregate again: (hello, 12100) + (hello, 12311) + (hello, 345881) = (hello, 370292). Each Reducer does the same, and you get the word frequency counts for the entire file.
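Here is a minimal, single-process sketch of that flow in Python. It is not a real distributed job; it just imitates the Map, hash-based Shuffle, and Reduce steps described above, with the chunks and the number of "reducers" chosen arbitrarily.

```python
from collections import defaultdict

def map_phase(chunk: str):
    """Map: emit (word, 1) pairs for one chunk of the file."""
    for word in chunk.split():
        yield word, 1

def shuffle(pairs, num_reducers: int):
    """Shuffle: route each word to a reducer by hash, to avoid skew."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for word, count in pairs:
        partitions[hash(word) % num_reducers][word].append(count)
    return partitions

def reduce_phase(partition):
    """Reduce: sum the counts for each word in one partition."""
    return {word: sum(counts) for word, counts in partition.items()}

# Pretend these chunks live on different machines.
chunks = ["hello world hello", "world world hello", "big data hello"]
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
result = {}
for partition in shuffle(mapped, num_reducers=2):
    result.update(reduce_phase(partition))
print(result)  # {'hello': 4, 'world': 3, 'big': 1, 'data': 1}
```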

This seems like a very simple model, yet many algorithms can be described with it.

The simple Map+Reduce model is crude but effective. It is easy to use, yet very clumsy. The second-generation engines, Tez and Spark, besides new features such as in-memory caching, essentially make the Map/Reduce model more general: the boundary between Map and Reduce becomes blurrier, data exchange becomes more flexible, and there are fewer disk reads and writes, so that complex algorithms can be described more easily and higher throughput can be achieved.
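As a rough illustration of that extra flexibility, here is the same word count written against Spark's RDD API in Python (PySpark). The input path is a placeholder, and `cache()` is the in-memory caching mentioned above.

```python
from pyspark import SparkContext

sc = SparkContext(appName="word-count-sketch")

# The HDFS path is a placeholder; any text file works.
lines = sc.textFile("hdfs:///hdfs/tmp/file1")

counts = (
    lines.flatMap(lambda line: line.split())   # "map": one word per record
         .map(lambda word: (word, 1))
         .reduceByKey(lambda a, b: a + b)      # "reduce": sum counts per word
         .cache()                              # keep the result in memory for reuse
)

print(counts.take(10))
sc.stop()
```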

With MapReduce, Tez, and Spark in hand, programmers found that MapReduce programs are really troublesome to write, and they wanted to simplify the process. It is like having assembly language: you can do almost anything with it, but it is still cumbersome, and you want a higher-level, more abstract language to describe your algorithms and data processing flows. So Pig and Hive appeared. Pig describes MapReduce in something close to a scripting language, while Hive uses SQL. They translate the scripts and SQL into MapReduce programs and hand them to the computing engine to run, so you are freed from tedious MapReduce code and can write in a simpler, more intuitive language.

With Hive, people found that SQL has a huge advantage over Java. One is that it is far easier to write: the word count above takes only a line or two of SQL but dozens or hundreds of lines of MapReduce. More importantly, users without a computer science background finally felt the love: I can write SQL too! So data analysts were freed from begging engineers for help, and engineers were freed from writing strange one-off processing programs. Everyone was happy. Hive gradually grew into a core component of the big data warehouse. Even many companies' pipeline job sets are described entirely in SQL, because it is easy to write and modify, clear at a glance, and easy to maintain.
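For a sense of what "a line or two of SQL" means, here is a sketch that runs the same word count through the Hive command line from Python. The table name `docs` and its single string column `line` are assumptions for illustration, not something defined in this article.

```python
import subprocess

# HiveQL word count over an assumed table docs(line STRING).
# split() breaks each line into words; explode() turns the array into rows.
query = """
SELECT word, COUNT(*) AS freq
FROM docs
LATERAL VIEW explode(split(line, ' ')) words AS word
GROUP BY word
ORDER BY freq DESC
LIMIT 20;
"""

# Assumes the `hive` CLI is installed and pointed at your metastore.
subprocess.run(["hive", "-e", query], check=True)
```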

After data analysts started using Hive to analyze data, they found that Hive running on MapReduce is really slow! That may not matter for a pipeline job set, such as recommendations recomputed every 24 hours, as long as it finishes within the day. But for data analysis, people always want things to run faster. For example, I want to know how many people stopped on the inflatable doll page in the past hour and how long they stayed, and for a huge website with massive data this may take tens of minutes or even many hours. And this analysis may be only the first step of a long march: I also want to see how many people viewed the vibrator page and how many looked at Rachmaninoff CDs, so I can report to the boss whether our users are mostly dirty-minded men and women or mostly artsy young men and women. Unable to stand the torture of waiting, you can only tell the handsome engineer Grasshopper: faster, faster, faster!

So Impala, Presto, and Drill were born (plus countless less famous interactive SQL engines not listed here). The core idea of the three is that the MapReduce engine is too slow because it is too general, too sturdy, and too conservative. Our SQL engine should be more lightweight, grab resources more aggressively, and optimize specifically for SQL, and it does not need so many fault-tolerance guarantees (because if something goes wrong you just restart the task, which is no big deal when the whole job takes only a few minutes). These systems let users handle SQL tasks much faster, at the cost of generality and stability. If MapReduce is a machete that is not afraid of chopping anything, these three are boning knives: nimble and sharp, but unable to handle things that are too big or too hard.
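As a sketch of what "interactive SQL" feels like in practice, here is a short ad-hoc query against a Presto coordinator using the PyHive client; the host, port, username, and table name are all assumptions for illustration.

```python
from pyhive import presto  # pip install 'pyhive[presto]'

# Coordinator address and table name are placeholders for your own cluster.
conn = presto.connect(host="presto-coordinator.example.com", port=8080, username="analyst")
cur = conn.cursor()

# The same kind of ad-hoc question the analyst above keeps asking.
cur.execute("""
    SELECT page, COUNT(*) AS visits
    FROM pageviews
    WHERE ts > now() - interval '1' hour
    GROUP BY page
    ORDER BY visits DESC
    LIMIT 10
""")
for page, visits in cur.fetchall():
    print(page, visits)
```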

To be honest, these systems never reached the popularity people expected, because by then two strange new species had appeared: Hive on Tez/Spark and SparkSQL. Their design philosophy is: MapReduce is slow, but if I run SQL on a new-generation general-purpose computing engine like Tez or Spark, I can run faster, and users don't have to maintain two systems. It's like this: if your kitchen is small, you are lazy, and you are not too picky about how finely things are cooked, you can buy a rice cooker that can steam, boil, and stew, and save yourself a lot of kitchenware.
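A minimal SparkSQL sketch in Python, assuming a Hive-style table named `pageviews` already exists in the metastore (the table name is an assumption):

```python
from pyspark.sql import SparkSession

# enableHiveSupport() lets Spark read tables registered in the Hive metastore.
spark = (
    SparkSession.builder
    .appName("sparksql-sketch")
    .enableHiveSupport()
    .getOrCreate()
)

# Same SQL, but executed by the Spark engine instead of MapReduce.
top_pages = spark.sql("""
    SELECT page, COUNT(*) AS visits
    FROM pageviews
    GROUP BY page
    ORDER BY visits DESC
    LIMIT 10
""")
top_pages.show()
spark.stop()
```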

What's described above is basically the architecture of a data warehouse: HDFS at the bottom, MapReduce/Tez/Spark running on top of it, and Hive and Pig running on top of those; or Impala, Drill, and Presto running directly against HDFS. This covers the requirements for low- and medium-speed data processing.

What if I want higher-speed processing?

If I were a company like Weibo, I would want to show not a 24-hour trending list but a constantly changing one, updated with a delay of under a minute, and the methods above would no longer be adequate. So another computing model was developed: streaming computation, and Storm is the most popular streaming platform. The idea of stream computing is: if I want more real-time results, why not process the data right as it streams in? Take word counting again: my data stream is words arriving one by one, and I count them as they flow past. Stream computing is very powerful and has essentially no delay, but its weakness is inflexibility: you must know in advance what you want to count, because once the data has flowed past, anything you didn't count cannot be recovered. So it is a fine thing, but it cannot replace the data warehouse and batch systems above.
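A toy sketch of that idea in Python, not a real Storm topology: counts are updated incrementally as each word "flows past", and anything not counted at the time is simply gone.

```python
from collections import Counter
from typing import Iterable, Iterator

def word_stream() -> Iterator[str]:
    """Stand-in for an endless stream of incoming words."""
    for word in ["hello", "world", "hello", "storm", "hello"]:
        yield word

def streaming_word_count(stream: Iterable[str]) -> None:
    counts = Counter()
    for word in stream:
        counts[word] += 1                      # update the running total immediately
        print(f"{word}: {counts[word]}")       # the "hot list" is always up to date

streaming_word_count(word_stream())
```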

There is also a somewhat independent family of modules: KV Stores, such as Cassandra, HBase, MongoDB and many, many others (more than you can imagine). A KV Store means: I have a pile of key-value pairs, and I can quickly fetch the data bound to a given key. For example, given an ID number, I can fetch the corresponding identity data. This could also be done with MapReduce, but it would probably scan the whole data set, whereas a KV Store is dedicated to this operation and all reads and writes are optimized for it; finding one ID number in several petabytes of data may take only a few tenths of a second. This greatly speeds up some specialized operations at big data companies. For example, my website has a page that looks up an order by its order number, and the whole site's orders cannot fit in a single-machine database, so I consider storing them in a KV Store. The philosophy of a KV Store is that it basically cannot handle complex computations: most cannot do JOINs, some cannot aggregate, and there is no strong consistency guarantee (the data is spread across different machines, you may read different results each time, and it cannot handle operations with strong consistency requirements such as bank transfers). But it is fast. Extremely fast.
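Here is a sketch of that order lookup using HBase through the happybase Python client; the Thrift host, the table name `orders`, and the column names are assumptions for illustration.

```python
import happybase

# Assumes an HBase Thrift server is reachable at this host (placeholder).
connection = happybase.Connection("hbase-thrift.example.com")
table = connection.table("orders")

# Write one order keyed by its order number...
table.put(b"order-20230101-0001", {b"info:user": b"alice", b"info:total": b"42.50"})

# ...and fetch it back directly by key: no scan over the whole data set.
row = table.row(b"order-20230101-0001")
print(row[b"info:user"], row[b"info:total"])

connection.close()
```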

Each KV Store design makes different trade-offs: some are faster, some hold more data, and some support more complex operations. There is surely one that suits you.

In addition, there are some more specialized systems and components: for example, Mahout is a distributed machine learning library, Protobuf is a format and library for data interchange, ZooKeeper is a highly consistent distributed coordination system, and so on.

With so many messy tools all running on the same cluster, everyone needs to respect each other and work in an orderly way, so another important component is the scheduling system, and the most popular one is Yarn. You can think of it as the central administrator, like your mother supervising the kitchen: hey, your sister has finished cutting the vegetables, you can take the knife and go kill the chicken. As long as everyone obeys your mother's assignments, everyone can cook happily.

You can think of the big data ecosystem as a kitchen ecosystem. To cook different cuisines (Chinese, Japanese, French) you need a variety of different tools, and customers' demands keep getting more complicated, so new kitchenware keeps being invented. No single universal utensil can handle every situation, so the whole thing will only grow more and more complex.

End.

Reprinted from: https://www.cnblogs.com/jins-note/p/9513426.html


Origin blog.csdn.net/weixin_47580822/article/details/113854523