A Brief History of Big Data Technology Development

1. Introduction

  To learn big data technology, you should understand the ins and outs of its development: why new technologies and tools appeared, and what progress they made over the technologies that came before them.

  Technology develops according to its own trends and patterns. When you find yourself in the middle of such a trend, seize the opportunity and find a way to stand out. Even if you fail, you will gain a deeper sense of the pulse of the times and come away with valuable knowledge and experience.

2. The Troika of Big Data

  Big data technology originated with three papers published by Google around 2004, which are what we often call the "troika", namely:

  • The distributed file system GFS;
  • The big data distributed computing framework MapReduce;
  • The NoSQL database system BigTable.

  One is a file system, one is a computing framework, and one is a database system. These papers set the industry abuzz and opened the era of big data.

  Before that, most companies were still focused on single machines, thinking about how to improve single-machine performance and looking for ever more expensive and powerful servers. Google's idea was to deploy a large cluster of servers, store massive amounts of data across the cluster in a distributed manner, and then use all the machines in the cluster to compute over that data. Done this way, Google did not actually need to buy many expensive servers; it only needed to organize ordinary machines together to perform large-scale computation.

3. The Birth of Hadoop

  Doug Cutting, the founder of the open source Lucene project, was developing the open source search engine Nutch at the time. After reading Google's papers, he implemented functionality similar to GFS and MapReduce based on the principles they described. In 2006, Doug Cutting separated these big-data-related functions from Nutch and started an independent project to develop and maintain them. This became the famous Hadoop, consisting mainly of the Hadoop Distributed File System (HDFS) and the big data computing engine MapReduce.

  If you look at the source code of Hadoop, you will find that this software, implemented in pure Java, contains nothing technically extraordinary, yet it brought enormous progress to the industry.

  After Hadoop was released, Yahoo began to use it. In 2007, Baidu and Alibaba also began to use Hadoop for big data storage and computation. Hadoop became a top-level Apache project in 2008.

4. The Birth of Hive

  Engineers at Yahoo found programming big data jobs directly in MapReduce too cumbersome, so they developed Pig, a scripting language with SQL-like syntax. A Pig script is compiled into a MapReduce program that runs on Hadoop. Although writing Pig is easier than writing MapReduce, it is still a new language to learn, so Facebook released Hive, which supports SQL syntax for big data computation: you write a SELECT statement to query the data, and Hive converts the SQL into a MapReduce program.

  In this way, data analysts and engineers who are already familiar with databases can analyze and process big data with almost no barrier to entry.
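
  To give a feel for the contrast, here is a rough sketch, in Python, of what a hand-written MapReduce job looks like when expressed as a Hadoop Streaming mapper and reducer that count records per key. The input layout (tab-separated lines whose first field is the grouping key) and the script names are hypothetical; in Hive, the same result is roughly the one-line statement SELECT key, COUNT(*) FROM some_table GROUP BY key.

      # mapper.py -- run via Hadoop Streaming; emits "key<TAB>1" for every input record.
      # Hypothetical input: tab-separated lines whose first field is the grouping key.
      import sys

      for line in sys.stdin:
          fields = line.rstrip("\n").split("\t")
          if fields and fields[0]:
              print(fields[0] + "\t1")

      # reducer.py -- Hadoop Streaming delivers the mapper output sorted by key,
      # so equal keys arrive contiguously and can be summed with a running counter.
      import sys

      current_key, count = None, 0
      for line in sys.stdin:
          key = line.rstrip("\n").split("\t")[0]
          if key != current_key:
              if current_key is not None:
                  print(current_key + "\t" + str(count))
              current_key, count = key, 0
          count += 1
      if current_key is not None:
          print(current_key + "\t" + str(count))

  Two separate scripts, explicit sort semantics, and a job submission step, versus one SQL statement: that difference is exactly why Hive spread so quickly.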

  Hive greatly reduced the difficulty of using Hadoop and quickly became popular with developers and enterprises. Many products around Hadoop began to appear, including Sqoop, which imports and exports data between relational databases and the Hadoop platform; Flume, which performs distributed collection, aggregation, and transmission of large-scale logs; and Oozie, a MapReduce workflow scheduling engine.

5. The Birth of Yarn

  In the early days of Hadoop, MapReduce was not only an execution engine but also a resource scheduling framework: resource scheduling and management for the server cluster were handled by MapReduce itself. This was not conducive to sharing cluster resources with other frameworks and made MapReduce very bloated, so a new project was launched to separate resource scheduling from the MapReduce execution engine. That project is Yarn.

6. The Birth of Spark

  In 2012, Spark, developed at the UC Berkeley AMP Lab, began to emerge. Matei Zaharia found that machine learning workloads performed very poorly on MapReduce, because machine learning algorithms usually require many iterations, and each Map and Reduce pass in MapReduce has to start a new job, which incurs a lot of unnecessary overhead and hurts execution efficiency.

  MapReduce mainly uses disk as the storage medium for its intermediate results, but by 2012 memory had broken through its capacity and cost constraints and had become a viable main storage medium during computation. Once launched, Spark was immediately embraced by the industry and gradually replaced MapReduce in enterprise applications.
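
  The difference is easy to see in code. Below is a minimal PySpark sketch (the dataset and the ten-iteration loop are made up for illustration): the RDD is cached in memory once, and every subsequent pass reads it from memory, whereas a chain of MapReduce jobs would write to and re-read from disk between iterations.

      from pyspark import SparkContext

      sc = SparkContext(appName="iterative-sketch")

      # Hypothetical numeric dataset; a real ML job would load feature vectors from HDFS.
      data = sc.parallelize(range(1, 100001)).map(float).cache()  # keep the RDD in memory

      # Ten passes over the same data, as an iterative algorithm would make.
      # Because the RDD is cached, each pass reads from memory instead of
      # re-reading its input from disk the way chained MapReduce jobs would.
      total = 0.0
      for _ in range(10):
          total += data.map(lambda x: x * x).sum()

      print(total)
      sc.stop()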

7. Big Data Batch and Stream Processing

7.1 Batch processing

  The scenarios handled by computing frameworks such as MapReduce and Spark are called batch computing: a job usually runs once a day or on some other fixed schedule, and the data it processes is historical data. This is also called big data offline computing.

7.2 Stream processing

  There are also scenarios where large amounts of data generated in real time must be processed in real time, such as face recognition. This kind of calculation is called big data stream computing, and stream computing frameworks such as Storm, Flink, and Spark Streaming were built to handle it. It is also called big data real-time computing.
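
  As a small illustration, here is a word-count sketch using Spark Streaming's classic DStream API in Python (the socket source on localhost:9999 is just a stand-in for a real feed such as a message queue; newer Spark versions favor Structured Streaming for the same job):

      from pyspark import SparkContext
      from pyspark.streaming import StreamingContext

      sc = SparkContext(appName="stream-sketch")
      ssc = StreamingContext(sc, batchDuration=1)  # process the stream in 1-second micro-batches

      # Hypothetical source: a text stream on a local socket (e.g. started with `nc -lk 9999`).
      lines = ssc.socketTextStream("localhost", 9999)

      # Count words in each micro-batch as the data arrives, instead of waiting
      # for a daily batch job over historical data.
      counts = (lines.flatMap(lambda line: line.split(" "))
                     .map(lambda word: (word, 1))
                     .reduceByKey(lambda a, b: a + b))
      counts.pprint()

      ssc.start()
      ssc.awaitTermination()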

8. NoSQL

  NoSQL was very popular around 2011, and many excellent products such as HBase and Cassandra emerged. Among them, HBase is a NoSQL system built on top of HDFS that was split out from the Hadoop project.
