The Three Musketeers of Big Data Processing: Storm, Spark, and Hadoop

Big data is now a hot topic in the industry. As technology has developed, storing big data is no longer the hard problem; what to do with the data after it has been stored will be the focus of future competition, and that is where current interest lies. The three popular big data processing tools, Storm, Spark, and Hadoop, are all written in languages that run on the JVM.
Spark, written in Scala, is a general-purpose parallel computing framework similar to Hadoop MapReduce, open sourced by the UC Berkeley AMP Lab. Spark implements distributed computing based on the MapReduce model and retains the advantages of Hadoop MapReduce.
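To make that concrete, here is a minimal word-count sketch against Spark's Java RDD API (a Spark 2.x-style flatMap signature is assumed, and the HDFS paths are placeholders rather than anything from this article):

// SparkWordCount.java: a minimal sketch against Spark's Java RDD API.
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class SparkWordCount {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("SparkWordCount").setMaster("local[*]");
    JavaSparkContext sc = new JavaSparkContext(conf);

    // Load the input as a distributed collection (RDD) of lines.
    JavaRDD<String> lines = sc.textFile("hdfs:///tmp/input.txt"); // placeholder path

    // The whole map/reduce pipeline is a chain of transformations;
    // nothing runs until an action (saveAsTextFile) is called.
    JavaPairRDD<String, Integer> counts = lines
        .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
        .mapToPair(word -> new Tuple2<>(word, 1))
        .reduceByKey(Integer::sum);

    counts.saveAsTextFile("hdfs:///tmp/output"); // placeholder path
    sc.stop();
  }
}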
Storm is written in Java and Clojure. Storm's advantage is that it computes entirely in memory; because memory access is orders of magnitude faster than disk access, Storm is very fast compared to Hadoop.
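As a sketch of that record-at-a-time, in-memory model, here is a minimal topology in the style of the storm-starter examples; the spout, bolt, names, and parallelism settings are invented for illustration, and Storm's org.apache.storm package layout is assumed:

// ExclamationTopology.java: a toy Storm topology, not from the original article.
import java.util.Map;

import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;
import org.apache.storm.utils.Utils;

public class ExclamationTopology {

  // Spout: an endless source of tuples; real deployments read from Kafka, logs, etc.
  public static class WordSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;
    private final String[] words = {"storm", "spark", "hadoop"};
    private int i = 0;

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
      this.collector = collector;
    }

    @Override
    public void nextTuple() {
      Utils.sleep(100);
      collector.emit(new Values(words[i++ % words.length]));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
      declarer.declare(new Fields("word"));
    }
  }

  // Bolt: processes each tuple the moment it arrives, entirely in memory.
  public static class ExclaimBolt extends BaseBasicBolt {
    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
      collector.emit(new Values(tuple.getString(0) + "!"));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
      declarer.declare(new Fields("word"));
    }
  }

  public static void main(String[] args) throws Exception {
    TopologyBuilder builder = new TopologyBuilder();
    builder.setSpout("words", new WordSpout(), 1);
    builder.setBolt("exclaim", new ExclaimBolt(), 2).shuffleGrouping("words");

    // Run in-process for demonstration; a real topology is submitted
    // to a cluster with StormSubmitter and runs until it is killed.
    LocalCluster cluster = new LocalCluster();
    cluster.submitTopology("demo", new Config(), builder.createTopology());
    Utils.sleep(10_000);
    cluster.shutdown();
  }
}

Each tuple flows from the spout to the bolt as soon as it is emitted, without being staged on disk between steps.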
Hadoop implements the MapReduce idea and slices data into splits in order to process large amounts of offline data. The data Hadoop processes must already be stored in HDFS or in a database such as HBase, so Hadoop improves efficiency by moving the computation to the machines where the data is stored.
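For contrast with the Spark sketch above, here is the classic word count expressed as a Hadoop MapReduce job; the class names are illustrative, and the input and output paths are taken from the command line:

// WordCount.java: the classic Hadoop MapReduce example, shown as a sketch.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: split each line of the input split into words, emit (word, 1).
  public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (token.isEmpty()) continue;
        word.set(token);
        ctx.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts for each word.
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      ctx.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenMapper.class);
    job.setCombinerClass(SumReducer.class);   // local pre-aggregation before the shuffle
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory must not exist yet
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The map output is shuffled across the cluster and the reduce output is written back to HDFS, which is also why a chain of such jobs pays the disk cost between every stage.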

Spark makes up for Hadoop's deficiencies, so each has its own strengths and uses. Roughly, the three are applied as follows: Hadoop is often used for complex offline big data processing; Spark is often used for fast offline big data processing; Storm is often used for online, real-time big data processing.

So, what is the core of big data? In the author's humble opinion, there are three aspects: first, data, because without data everything else is useless; second, technology, because without big data processing technology the data is just a pile of disks; third, thinking, because even with data and processing technology you still need ideas about how to make the data generate greater value.
The core of big data is, first of all, its value. If the amount of data is large but has no value, then big data is nothing special. Therefore, the most important thing about big data is being able to analyze and mine, from a large amount of data, information that benefits the organization. Of course, whether that information is actually useful still has to be verified in practice.
In addition, speed matters: market opportunities are fleeting, so if it takes a week or a month to analyze that much data, the analysis may not mean much anymore.

Who will become the mainstream among the three big data processing tools, Storm, Spark, and Hadoop? In fact, these are only superficially different tools; the underlying ideas are the same. I believe more tools will emerge in the future, but the ideas are hard to change. For example, if you want to be fast, then from the point of view of computer architecture, use more memory and less hard disk, because the hard disk is simply too slow.
The approach to the problem is also the same: by throwing more resources at the data and processing it in a distributed fashion at the same time, the job will certainly be faster. Of course, the premise is that the cost of communication between the machines is less than the benefit gained.

Storm does real-time processing, while Spark and Hadoop do batch processing, so they are complementary. Comparing Spark with Hadoop, Spark mainly makes full use of in-memory computing and supports more operations than map/reduce alone, so iteration-heavy algorithms run more efficiently, whereas Hadoop may need several chained MapReduce jobs to finish the same work. Since version 2.0, Hadoop has used the new YARN framework, of which MapReduce is only one computing engine; Spark can also run on Hadoop's YARN framework, so the two are converging.
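As a rough illustration of why iteration-heavy work benefits from Spark's memory model, here is a toy loop over a cached RDD (Spark's Java API assumed; the data, learning rate, and iteration count are made up, and the computation is only a stand-in for a real iterative algorithm such as k-means):

// IterativeSketch.java: a toy iterative job over a cached Spark RDD.
import java.util.ArrayList;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class IterativeSketch {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("IterativeSketch").setMaster("local[*]");
    JavaSparkContext sc = new JavaSparkContext(conf);

    // Made-up data set; a real job would load it from HDFS.
    List<Double> values = new ArrayList<>();
    for (int i = 1; i <= 1_000_000; i++) values.add((double) i);

    // cache() keeps the data set in memory, so every iteration below reuses it
    // instead of re-reading it from storage.
    JavaRDD<Double> data = sc.parallelize(values).cache();
    long n = data.count();

    double estimate = 0.0;
    double learningRate = 0.5;
    for (int iter = 0; iter < 20; iter++) {
      final double current = estimate;
      // One full map + reduce pass over the cached data per iteration.
      double gradient = data.map(v -> v - current).reduce(Double::sum) / n;
      estimate = current + learningRate * gradient;
    }
    System.out.println("converged estimate of the mean: " + estimate);
    sc.stop();
  }
}

Every pass reuses the cached data in memory, whereas expressing the same loop as chained MapReduce jobs would write intermediate results to HDFS between passes.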

In the future, the development of big data can be summed up with a familiar slogan: faster, higher, stronger. However, it also needs to become more standardized. Right now tool A, tool B, and tool C still feel a bit like toys rather than mature products, so specialized companies may well appear to turn them into commercial, mature software. After a few more years of maturing, there should also be more applications: besides the Internet, there will be results among users in other industries as well.
