Introduction to Hadoop!

 

What is Hadoop?

With the rapid growth in the amount of data, the two most immediate problems are how to store it and how to compute on it (analyze and make use of it).

Hadoop is a distributed infrastructure framework developed by the Apache Foundation and implemented in Java. It can also be seen as a platform for developing and running distributed applications on large clusters built from general-purpose (commodity) hardware. Its two most important components, HDFS and MapReduce, solve distributed storage and distributed computation of massive data. Users can develop distributed programs without understanding the underlying distributed details, and take full advantage of the cluster for high-speed computation and storage.

Hadoop implements a distributed file system, the Hadoop Distributed File System (HDFS). HDFS is highly fault tolerant and is designed to be deployed on low-cost hardware; it provides high-throughput access to application data and is well suited to applications with large data sets. HDFS relaxes some POSIX requirements and allows data in the file system to be accessed in a streaming fashion.

HDFS has two kinds of nodes: the NameNode and the DataNode. DataNodes store the actual data blocks, while the NameNode manages the file system namespace and metadata and coordinates client access. Compared with an ordinary file system, HDFS is distinguished by its distributed mass storage and its replication (backup) mechanism.
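To make the client/NameNode/DataNode interaction concrete, here is a minimal sketch of writing and reading a file through the HDFS Java client API. The NameNode address (hdfs://namenode:8020) and the file path are placeholders assumed for illustration, not values from this article.

```java
// Minimal HDFS client sketch: write a file, then read it back.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.nio.charset.StandardCharsets;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address; replace with your cluster's fs.defaultFS.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");

        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/tmp/hello.txt");

            // Write: the client asks the NameNode where to place blocks,
            // then streams the bytes to DataNodes.
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
            }

            // Read: the NameNode returns block locations, and the data
            // itself is streamed directly from the DataNodes.
            try (FSDataInputStream in = fs.open(file)) {
                byte[] buf = new byte[32];
                int n = in.read(buf);
                System.out.println(new String(buf, 0, n, StandardCharsets.UTF_8));
            }
        }
    }
}
```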

The core design of the Hadoop framework is HDFS plus MapReduce: HDFS provides storage for massive data, and MapReduce provides computation over it. MapReduce is a parallel computing framework, a distributed computing model in which many machines work on one job together: a map phase processes input splits independently in parallel, and a reduce phase aggregates the intermediate results.
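The classic illustration of this model is word count: mappers turn each line of input into (word, 1) pairs, and reducers sum the counts for each word. The sketch below uses the standard Hadoop MapReduce Java API; the input and output paths come from the command line and are assumptions of this example.

```java
// Compact word-count job: mappers emit (word, 1), reducers sum the counts.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws java.io.IOException, InterruptedException {
            // Split each input line into words and emit (word, 1).
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws java.io.IOException, InterruptedException {
            // Sum all counts received for the same word.
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```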

 

Application scenarios of Hadoop:

Now that we have a rough idea of what Hadoop is, let's look at the scenarios it is generally suited to.

Hadoop is mainly used for offline processing of large data volumes; its two defining characteristics are large data volume and offline (batch) processing.

Large data volume: in production, Hadoop clusters typically range from hundreds to thousands of machines, and at that scale terabyte-level data is considered small.

Offline: the MapReduce framework is not well suited to real-time computation, so jobs are mainly offline batch jobs such as log analysis. In addition, a cluster usually has a large queue of jobs waiting to be scheduled, which keeps resources fully utilized.

In addition, because of HDFS's design (data is split into large blocks, typically 128 MB), Hadoop works best on large files. Using Hadoop to process a large number of small files is very inefficient, since the metadata for every file must be held in the NameNode's memory.
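As a sketch of one common workaround (not something prescribed by this article), many small files can be packed into a single SequenceFile keyed by file name, so the NameNode tracks one large file instead of thousands of tiny ones. The local source directory and the target HDFS path below are placeholders.

```java
// Pack many small local files into one SequenceFile: key = file name, value = bytes.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class PackSmallFiles {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed target path on HDFS; replace with your own.
        Path target = new Path("hdfs://namenode:8020/data/packed.seq");

        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(target),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class));
             Stream<java.nio.file.Path> files = Files.list(Paths.get("/local/small-files"))) {

            // Each small file becomes one (name, bytes) record in the SequenceFile.
            for (java.nio.file.Path p : (Iterable<java.nio.file.Path>) files::iterator) {
                byte[] bytes = Files.readAllBytes(p);
                writer.append(new Text(p.getFileName().toString()),
                              new BytesWritable(bytes));
            }
        }
    }
}
```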

 

Common scenarios for Hadoop are:

Large-scale data storage: distributed storage (Hadoop is used behind cloud-drive services such as Baidu and 360, and on various cloud platforms)

  1. Log processing
  2. Mass computing, parallel computing
  3. Data mining (such as ad recommendation, etc.)
  4. Behavior analysis, user modeling, etc.
  5. ……

 

More Hadoop courses: Alibaba Cloud University-Developer Class
