Meet Hadoop


So far I have only read up to section 3.2 of the book, so the blog will progress slowly; many of the problems and examples need to be worked through by hand before they become clear.


Why Hadoop


The problem it solves

Data storage and analysis

While hard drive storage capacity has grown steadily over the years, access speeds have not kept pace.

The workaround is to read from and write to many disks in parallel. But that creates new problems of its own. The two main ones are coping with the hardware failures that become inevitable at scale, and correctly combining data from different disks during analysis.
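To see why parallelism matters, here is a quick back-of-envelope sketch; the 1 TB dataset size and 100 MB/s single-disk transfer rate are illustrative assumptions of mine, not figures from this chapter:

```java
public class DiskReadEstimate {
    public static void main(String[] args) {
        double datasetBytes = 1e12;   // assumed dataset size: 1 TB
        double bytesPerSec = 100e6;   // assumed single-disk transfer rate: ~100 MB/s

        double oneDiskSecs = datasetBytes / bytesPerSec;  // full scan on one disk
        double parallelSecs = oneDiskSecs / 100;          // striped across 100 disks

        System.out.printf("1 disk:    %.1f hours%n", oneDiskSecs / 3600);   // ~2.8 hours
        System.out.printf("100 disks: %.1f minutes%n", parallelSecs / 60);  // ~1.7 minutes
    }
}
```

A scan that takes almost three hours on one disk finishes in under two minutes across 100 disks; that speedup is exactly what Hadoop's parallel storage is after.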

Hadoop gives us a reliable platform for big-data storage and analysis: HDFS provides the storage and MapReduce provides the analysis. These two components are the basic cores of Hadoop.
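To give a feel for what a MapReduce analysis looks like, below is a minimal sketch in Java modeled on the classic WordCount example from the Hadoop MapReduce tutorial; the class names and the input/output paths in args are placeholders for illustration:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: emits (word, 1) for every word in its input split
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer: sums all the counts emitted for each word
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory on HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory on HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Packaged into a jar, it would run with `hadoop jar wordcount.jar WordCount <input dir> <output dir>`, with both directories on HDFS. Note that the programmer writes only map and reduce; the framework handles splitting the input, moving data between tasks, and retrying failures.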


Advantages of Hadoop

RDBMS

  • Hard-disk seek times are improving far more slowly than transfer rates, so when reading a large dataset it is much faster to stream through it (the speed is then bounded mainly by the transfer rate). And when a large proportion of a dataset must be updated, the B-tree used by an RDBMS falls well behind MapReduce, because it has to rebuild the database with sort/merge operations;
  • MapReduce suits problems that need to analyze a whole dataset in batch fashion (especially for ad hoc analysis); an RDBMS suits point queries and updates on indexed, smaller-scale datasets;
  • MapReduce fits data that is written once and read many times, whereas an RDBMS fits datasets that are continually updated.


Grid computing

  • Grid computing suits compute-intensive jobs; once the nodes need to access large volumes of data, network bandwidth becomes the bottleneck. Hadoop instead tries to store the data on the compute nodes themselves so that access is local and fast. This data locality is at the heart of how Hadoop processes data;
  • MPI (Message Passing Interface) requires programmers to handle the flow of data explicitly, while in Hadoop the data flow is implicit for the programmer, who only needs to think about task execution in terms of the data model;
  • MapReduce automatically handles partial failure of the system, and the programmer need not worry about the order in which tasks execute; an MPI program must explicitly manage its own checkpointing and recovery, which makes it much harder to write.

Volunteer computing

Volunteer computing tackles problems that are highly CPU-intensive, where the computation takes far longer than transferring the work unit does, and that run for long periods on untrusted computers connected over the Internet, so data locality is not a requirement. MapReduce, by contrast, is designed for jobs that complete in a short time on trusted, dedicated machines connected by a single high-bandwidth internal network.


Little yellow elephant Hadoop?

Hadoop is the name that the son of Doug Cutting, the father of Hadoop, gave to his stuffed yellow elephant, and yes, it is the happy elephant on the official website.

The ideas behind HDFS and MapReduce were inspired by two Google papers (everything starts with Google): "The Google File System" and "MapReduce: Simplified Data Processing on Large Clusters".

Hadoop's story in those years could fill a book of its own, which is quite interesting; it is a pity that Yahoo! has since gone cold.


Learning route

This is the learning roadmap given in the book. I am still working through it chapter by chapter, but since Hadoop relies on Linux and Java, I will supplement those two topics along the way.

[Image: the learning roadmap from the book]


Simple summary

Chapter 1 is mainly about why Hadoop came into being. That cannot be explained by how Hadoop works alone; it is closely tied to the development of the era.

After this I will move into studying Hadoop itself, though not only Hadoop: related posts on Linux and Java will be published as well. I also went through considerable hardship to get Hadoop installed on Linux.

That's all for now.
