When we talk about Hadoop for big data, what exactly do we mean?

When it comes to big data, two problems are unavoidable: how to store massive amounts of data, and how to query and compute over that much data.

Fortunately, these problems have already been solved. Hadoop is one of the best solutions, and one of the most popular big data platforms on the market. What does it include, and what are its characteristics?

1. Introduction to Hadoop

What do you understand by the name Hadoop?

In a narrow sense, Hadoop refers to the open source software written in Java and maintained under the Apache Software Foundation. It lets users apply a simple programming model to process massive data sets in a distributed way across clusters of machines.

In a broad sense, Hadoop refers to the big data ecosystem built around it, as shown in the figure below. Hadoop is the foundation on which the rest of that ecosystem is built.

[Figure: the Hadoop big data ecosystem]

2. Three core components of Hadoop

Hadoop is mainly composed of 3 parts, commonly known as the Hadoop Three Musketeers:

1. Hadoop HDFS (distributed file storage system)
HDFS stands for Hadoop Distributed File System. It is essentially a file system. Because the volume of data is so large, it cannot all be stored on a single computer; no single machine has a disk that big. Instead, the files are stored across many computers, that is, distributed over different nodes. HDFS mainly solves the problem of storing massive data, and it sits at the bottom of the ecosystem in a core position.
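
To make the idea concrete, here is a minimal sketch in Python of how a large file could be cut into fixed-size blocks, with each block copied to several nodes. It is a toy model, not HDFS code: the function names `split_into_blocks` and `place_replicas`, the node names, and the round-robin placement are all assumptions made for illustration (real HDFS uses a rack-aware placement policy). The 128 MB block size and the replication factor of 3 correspond to the usual HDFS defaults.

```python
from typing import Dict, List

BLOCK_SIZE_MB = 128  # usual HDFS default block size
REPLICATION = 3      # usual HDFS default replication factor


def split_into_blocks(file_size_mb: int, block_size_mb: int = BLOCK_SIZE_MB) -> List[int]:
    """Return the size of each block; the last block may be smaller."""
    blocks = []
    remaining = file_size_mb
    while remaining > 0:
        blocks.append(min(block_size_mb, remaining))
        remaining -= block_size_mb
    return blocks


def place_replicas(num_blocks: int, nodes: List[str],
                   replication: int = REPLICATION) -> Dict[int, List[str]]:
    """Assign every block to `replication` distinct nodes.

    Round-robin placement is a hypothetical policy chosen for simplicity.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            nodes[(block_id + r) % len(nodes)] for r in range(replication)
        ]
    return placement


# A 300 MB file stored on a hypothetical four-node cluster.
blocks = split_into_blocks(300)
placement = place_replicas(len(blocks), ["node1", "node2", "node3", "node4"])
for block_id, holders in placement.items():
    print(f"block {block_id} ({blocks[block_id]} MB) -> {holders}")
```

No single node holds the whole file, and each block lives on three different nodes, so losing one node does not lose any data.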

2. Hadoop MapReduce (distributed computing framework)
MapReduce, the first-generation distributed computing framework in the big data ecosystem, mainly solves the problem of computing over massive data.

The traditional approach is to pull the data from every node onto one machine and compute over it there. Its biggest drawback is that the computation is very slow, because only one node does the work. With MapReduce, the computation is instead distributed to the nodes and run in parallel, and the partial results are merged at the end.
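
The classic illustration of this model is a word count. The sketch below imitates the three stages (map, shuffle, reduce) in plain Python. It is only a single-process imitation under simplifying assumptions: the function names and the sample text chunks are invented for this example, and in a real cluster each chunk would be mapped on a different node in parallel.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_phase(chunk: str) -> List[Tuple[str, int]]:
    """Map: turn one chunk of text into (word, 1) pairs."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all values that share the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(grouped: Dict[str, List[int]]) -> Dict[str, int]:
    """Reduce: merge the values of each key into one result."""
    return {key: sum(values) for key, values in grouped.items()}


# Each string stands for the part of the data held by one node.
chunks = ["big data hadoop", "hadoop stores big data", "hadoop computes"]

mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
word_counts = reduce_phase(shuffle(mapped))
print(word_counts)
```

The map step never needs to see more than its own chunk, which is what allows the work to be spread over many nodes.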

Note that MapReduce is just a computing framework, or programming model. It is not a standalone piece of software and does not need to be deployed separately.

3. Hadoop YARN (cluster resource management and task scheduling platform)
YARN is a distributed, general-purpose cluster resource management system and task scheduling platform. How should we understand this?

Big data computing tasks, such as MapReduce jobs or Spark jobs, need CPU, memory, disk, and other resources while they run. When many tasks run at the same time, something has to manage them by allocating resources and scheduling the work. That manager is YARN.
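
The role of such a manager can be sketched with a toy allocator. The code below is not YARN's algorithm or API; the `Node` and `Request` classes, the first-fit rule, and all the numbers are assumptions chosen to show the idea of matching resource requests against free capacity. YARN's real schedulers are considerably more sophisticated.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Node:
    name: str
    free_memory_gb: int
    free_cores: int


@dataclass
class Request:
    task: str
    memory_gb: int
    cores: int


def allocate(nodes: List[Node], requests: List[Request]) -> Dict[str, Optional[str]]:
    """First-fit: give each task the first node with enough free memory and cores."""
    assignments: Dict[str, Optional[str]] = {}
    for req in requests:
        assignments[req.task] = None  # None means the task has to wait
        for node in nodes:
            if node.free_memory_gb >= req.memory_gb and node.free_cores >= req.cores:
                # Reserve the resources so later tasks cannot claim them too.
                node.free_memory_gb -= req.memory_gb
                node.free_cores -= req.cores
                assignments[req.task] = node.name
                break
    return assignments


nodes = [Node("node1", 8, 4), Node("node2", 16, 8)]
requests = [
    Request("mapreduce-job", 6, 2),
    Request("spark-job", 12, 4),
    Request("another-job", 8, 4),
]
assignments = allocate(nodes, requests)
print(assignments)
```

In this run the first two jobs get a node each, while the third finds no node with enough free memory left and has to wait until resources are released.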

3. Advantages of Hadoop

Hadoop's popularity is inseparable from its many advantages.

  • Scalability
    Hadoop distributes data and computing tasks across the machines of a cluster, and a cluster can be expanded to thousands of nodes conveniently and flexibly.
  • Low cost
    A Hadoop cluster can be built from ordinary, inexpensive machines, so the cost of processing big data is very low. What matters is the overall capability of the cluster, not of any single machine.
  • High efficiency
    Hadoop processes data concurrently and can dynamically move data between nodes in parallel, which makes it very fast.
  • Reliability
    Hadoop automatically maintains multiple copies of the data and automatically redeploys computing tasks after a failure, so its ability to store and process data can be trusted.
  • Open source
    Because Hadoop is open source, the entire community is very active, and many companies build their big data platforms based on Hadoop.
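
The reliability point can be illustrated with a small sketch: because the data exists in several copies, a task that fails on one node can be run again on another node that holds a copy. The code below is a simplified illustration, not Hadoop's actual retry mechanism; `run_with_retry`, `count_records`, the node names, and the simulated failure are all hypothetical.

```python
from typing import Callable, List


def run_with_retry(task: Callable[[str], int], nodes: List[str],
                   max_attempts: int = 3) -> int:
    """Try the task on successive nodes until one succeeds or attempts run out."""
    last_error = None
    for attempt, node in enumerate(nodes[:max_attempts], start=1):
        try:
            return task(node)
        except RuntimeError as err:
            last_error = err
            print(f"attempt {attempt} on {node} failed: {err}")
    raise RuntimeError("task failed on every node tried") from last_error


def count_records(node: str) -> int:
    """Hypothetical task: node1 is simulated as broken, other nodes succeed."""
    if node == "node1":
        raise RuntimeError("node unavailable")
    return 1000


# The same data block is assumed to be replicated on all three nodes.
result = run_with_retry(count_records, ["node1", "node2", "node3"])
print(result)
```

The caller never notices that the first node failed; it simply receives the result from the second one.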

4. Hive and SQL Studio

Hadoop is really a general term for a set of tools. It consists of three parts, HDFS, YARN, and MapReduce, which provide distributed file storage, resource scheduling, and computing respectively.

In principle, this is enough to carry out big data analysis.

But the first problem is that it is cumbersome. The standard workflow is to have YARN schedule resources and read files from HDFS for the MapReduce computation, and that means writing Java code. Yet what is the best tool for working with data? SQL! Hive is, in effect, a SQL layer over this standard workflow.

Hive is a Hadoop-based data warehouse tool for data extraction, transformation, and loading. It provides a mechanism to store, query, and analyze large-scale data held in Hadoop.

The advantage of Hive is its low learning cost: SQL-like statements are enough to run MapReduce statistics quickly, without developing a dedicated MapReduce application. This makes Hive well suited to statistical analysis in data warehouses.
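
To show how much shorter the SQL route is, the sketch below expresses a word count as a single GROUP BY statement. As a loud caveat, it runs against Python's built-in `sqlite3` module, which is only a local stand-in and not Hive: the table name `words` and the sample rows are invented, and Hive would turn a statement of this shape into distributed jobs on the cluster rather than run it in one process.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for a Hive table (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (word TEXT)")
conn.executemany(
    "INSERT INTO words (word) VALUES (?)",
    [("big",), ("data",), ("hadoop",), ("hadoop",), ("big",), ("hadoop",)],
)

# One declarative statement replaces hand-written map, shuffle and reduce code.
query = "SELECT word, COUNT(*) AS cnt FROM words GROUP BY word ORDER BY word"
rows = conn.execute(query).fetchall()
print(rows)

conn.close()
```

The grouping and counting logic is stated in one line of SQL, which is the convenience the paragraph above describes.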

At present, there are not many SQL tools that support Hadoop. In addition to Hive, there is also SQL Studio, which recently added full support for Hadoop.
SQL Studio is a cross-platform database management tool that supports Linux, Mac, and Windows systems:

  1. The most noteworthy thing about SQL Studio is that it is free;
  2. Domestic (Chinese) databases are developing rapidly, but few SQL tools support them. SQL Studio is one of the few: it supports not only mainstream databases such as MySQL and Oracle, but also domestic databases such as Wuhan Dameng and Renda Jincang (Kingbase);
  3. It is a web-based tool: unzip it and it is ready to use, with no installation. It also lets a team share and coordinate code online in real time, which is more efficient;
  4. It can generate test data automatically: there is no need to write code, because SQL Studio will generate millions of rows of test data for you;
  5. It supports huge data volumes: it queries tens of millions of rows in milliseconds, exports 30 million rows faster than Navicat, and expands 10,000 tables smoothly without lagging.

You can download it and explore more of its features and advantages yourself.

Origin blog.csdn.net/ylguoguo6666/article/details/130357578