Big data architecture: an introduction to Hadoop, a tool for data collection, processing, and analysis

Hadoop is an open-source distributed computing platform under the Apache Software Foundation. It runs on computer clusters and provides reliable, scalable distributed computing. At the heart of Hadoop are the Hadoop Distributed File System (HDFS) and the parallel programming framework MapReduce.

History

Hadoop's origins are inseparable from three Google papers:

  • In 2003, Google published a paper on its distributed file system GFS, which addresses the problem of storing massive amounts of data.
  • In 2004, Google published a paper on MapReduce, which addresses the problem of computing over massive amounts of data.
  • In 2006, Google published a paper on BigTable, a distributed storage system that uses GFS as its underlying data store.

GFS, MapReduce, and BigTable are what we often call Google's "troika". The relationship between Hadoop and these three papers is as follows:

  • HDFS in Hadoop is an open source implementation of GFS;
  • MapReduce in Hadoop is an open source implementation of Google's MapReduce;
  • HBase in Hadoop is an open source implementation of Google's BigTable.

Hadoop Features

Hadoop's main features are:

  • Cross-platform: Hadoop is developed in Java, so it is highly portable and can run on Linux platforms;
  • High reliability: HDFS, Hadoop's distributed file system, stores massive data redundantly across different machine nodes, so even if one replica fails, the replicas on other machines keep working normally;
  • High fault tolerance: HDFS distributes files across many different machine nodes and automatically keeps multiple copies, so when a task on one node fails it can be automatically reassigned;
  • Efficiency: Hadoop's core components, HDFS and MapReduce, handle distributed storage and distributed processing respectively and can process data at the PB scale;
  • Low cost and high scalability: Hadoop runs on clusters of inexpensive commodity machines, which keeps costs low, and it can scale out to thousands of nodes to store and compute over massive amounts of data.

Components of the ecosystem

Related projects

The Hadoop ecosystem includes many subprojects. Some of the most common ones are introduced below.

HDFS: Distributed File System

HDFS is the Hadoop Distributed File System. It is one of the core projects in the Hadoop ecosystem and the foundation of data storage management in distributed computing. HDFS has a highly fault-tolerant data replication mechanism that can detect and respond to hardware failures, and it runs on low-cost commodity hardware. In addition, HDFS provides streaming data access with high throughput, making it well suited to applications with large data sets.
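
As a quick illustration, here is a minimal sketch of writing and reading a file through the HDFS Java API. The NameNode address (hdfs://localhost:9000) and the file path are placeholder assumptions; substitute your own cluster's values.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.net.URI;
import java.nio.charset.StandardCharsets;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Connect to the NameNode; the address below is a placeholder.
        FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000"), conf);

        Path file = new Path("/demo/hello.txt");
        // Write a small file; HDFS replicates its blocks across DataNodes automatically.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }
        // Read the file back as a stream.
        try (FSDataInputStream in = fs.open(file)) {
            byte[] buf = new byte[(int) fs.getFileStatus(file).getLen()];
            in.readFully(buf);
            System.out.println(new String(buf, StandardCharsets.UTF_8));
        }
        fs.close();
    }
}
```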

MapReduce: Distributed Computing Framework

MapReduce is a computing model for parallel processing of large-scale data sets (larger than 1 TB). "Map" applies a specified operation to each independent element of the data set and produces intermediate results in the form of key-value pairs; "Reduce" merges all "values" that share the same "key" in the intermediate results to obtain the final result. MapReduce's "divide and conquer" approach makes it easy for programmers to run their programs on a distributed system without writing any distributed parallel code themselves.
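
As a concrete sketch of this model, here is the classic word count written against Hadoop's Java MapReduce API: the map phase emits (word, 1) pairs and the reduce phase sums the counts for each word. The input and output paths come from the command line; everything else assumes default cluster settings.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Map: emit (word, 1) for every word in a line.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: sum all counts that share the same word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```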

Yarn: a resource management framework

Yarn (Yet Another Resource Negotiator) is the resource manager introduced in Hadoop 2.0. It provides unified resource management and scheduling for upper-layer applications, bringing major benefits to the cluster in terms of utilization, unified resource management, and data sharing.

Sqoop: data migration tool

Sqoop is an open source data import and export tool mainly used to move data between Hadoop and traditional databases. It can import data from a relational database (such as MySQL or Oracle) into Hadoop's HDFS, and it can also export HDFS data into a relational database, which makes data migration very convenient.

Mahout: A library of data mining algorithms

Mahout is an open source project under Apache. It provides scalable implementations of classic machine-learning algorithms, aiming to help developers create intelligent applications more conveniently and quickly. Mahout includes many implementations, covering clustering, classification, recommendation (collaborative filtering), and frequent itemset mining. Additionally, Mahout can scale efficiently to the cloud by building on the Apache Hadoop library.
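
As a rough sketch, the snippet below builds a user-based recommender with Mahout's classic "Taste" collaborative-filtering API (shipped with older Mahout releases). The ratings.csv file and its userID,itemID,preference line format are assumptions for illustration only.

```java
import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class MahoutRecommenderExample {
    public static void main(String[] args) throws Exception {
        // ratings.csv: one "userID,itemID,preference" triple per line (placeholder file).
        DataModel model = new FileDataModel(new File("ratings.csv"));
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

        // Recommend 3 items for user 1 based on similar users' preferences.
        List<RecommendedItem> items = recommender.recommend(1L, 3);
        for (RecommendedItem item : items) {
            System.out.println(item.getItemID() + " -> " + item.getValue());
        }
    }
}
```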

HBase: distributed storage system

HBase is an open source clone of Google's BigTable: a scalable, highly reliable, high-performance, distributed, column-oriented database with a dynamic schema for structured data. Unlike traditional relational databases, HBase adopts BigTable's data model: an enhanced sparse, sorted map (key/value), where the key consists of a row key, a column key, and a timestamp. HBase provides random, real-time read and write access to large-scale data. At the same time, the data stored in HBase can be processed with MapReduce, which combines data storage and parallel computing nicely.
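
The sketch below shows that (row key, column family, qualifier, timestamp) model through HBase's Java client, writing one cell and reading it back. The table name "user", column family "info", and ZooKeeper quorum address are assumptions.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "localhost"); // placeholder quorum

        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("user"))) {

            // Put: row key "row1", column family "info", qualifier "name".
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("alice"));
            table.put(put);

            // Get: read the same cell back; HBase keeps multiple versions by timestamp.
            Get get = new Get(Bytes.toBytes("row1"));
            Result result = table.get(get);
            byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(value));
        }
    }
}
```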

Zookeeper: Distributed Coordination Service

Zookeeper is a distributed, open source coordination service for distributed applications, an open source implementation of Google's Chubby, and an important component of Hadoop and HBase. It provides consistency services for distributed applications, including configuration maintenance, naming, distributed synchronization, and group services, and it is used to build distributed applications while reducing the coordination work those applications would otherwise have to handle themselves.
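
As a small sketch of the configuration-maintenance use case, the code below stores a piece of configuration in a znode and reads it back with the ZooKeeper Java client. The connect string localhost:2181 and the znode path are placeholder assumptions.

```java
import java.nio.charset.StandardCharsets;
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZooKeeperExample {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);
        // Session timeout 5s; the watcher fires once the connection is established.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 5000, (WatchedEvent event) -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();

        // Store a piece of configuration as a persistent znode.
        String path = "/app-config";
        if (zk.exists(path, false) == null) {
            zk.create(path, "debug=true".getBytes(StandardCharsets.UTF_8),
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }
        // Read it back; every client sees the same consistent view of this data.
        byte[] data = zk.getData(path, false, null);
        System.out.println(new String(data, StandardCharsets.UTF_8));
        zk.close();
    }
}
```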

Hive: Hadoop-based data warehouse

Hive is a distributed data warehouse tool based on Hadoop that can map structured data files onto database tables and convert SQL statements into MapReduce jobs for execution. Its advantages are simple operation and a low learning cost: simple MapReduce statistics can be produced quickly with SQL-like statements, without developing dedicated MapReduce applications, which makes Hive very well suited to statistical analysis of data warehouses.
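
As a brief sketch, the snippet below runs a HiveQL aggregation over JDBC against HiveServer2; Hive compiles the query into MapReduce jobs behind the scenes. The connection URL, credentials, and the web_logs table are assumed placeholders.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        // Load the HiveServer2 JDBC driver.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = conn.createStatement()) {

            // A SQL-like (HiveQL) aggregation; no hand-written MapReduce code is needed.
            ResultSet rs = stmt.executeQuery(
                "SELECT status, COUNT(*) AS cnt FROM web_logs GROUP BY status");
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}
```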

Flume: a log collection tool

Flume is a highly available, highly reliable, distributed system for collecting, aggregating, and transporting massive amounts of log data, originally provided by Cloudera. Flume supports customizing various data senders in the logging system to collect data; at the same time, it can perform simple processing on the data and write it to various (customizable) data receivers.
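
As a minimal sketch of a custom data sender, the snippet below uses Flume's client SDK to push one event to a running agent. It assumes an agent whose Avro source listens on localhost:41414 (a placeholder address); the event then flows through the agent's source, channel, and sink.

```java
import java.nio.charset.StandardCharsets;

import org.apache.flume.Event;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;

public class FlumeClientExample {
    public static void main(String[] args) throws Exception {
        // Connect to the agent's Avro source (placeholder host and port).
        RpcClient client = RpcClientFactory.getDefaultInstance("localhost", 41414);
        try {
            // Each log line becomes one Flume event.
            Event event = EventBuilder.withBody("user login ok", StandardCharsets.UTF_8);
            client.append(event);
        } finally {
            client.close();
        }
    }
}
```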

Pig: an abstraction for MapReduce

Apache Pig makes it possible to perform all kinds of data processing operations in Hadoop. For writing data analysis programs, Pig provides a high-level language called Pig Latin, which offers various operators that programmers can use to develop their own functions for reading, writing, and manipulating data. It allows programmers who are not proficient in Java to analyze and process big data.
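
As a short sketch, the code below embeds a few Pig Latin statements in Java via PigServer, running in local mode; on a cluster, Pig would translate the same statements into MapReduce jobs. The input file logs.txt and its tab-separated two-column layout are assumptions.

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigExample {
    public static void main(String[] args) throws Exception {
        PigServer pig = new PigServer(ExecType.LOCAL);
        // Pig Latin: load the log file, group by user, and count records per user.
        pig.registerQuery("logs = LOAD 'logs.txt' USING PigStorage('\\t') AS (user:chararray, url:chararray);");
        pig.registerQuery("by_user = GROUP logs BY user;");
        pig.registerQuery("counts = FOREACH by_user GENERATE group, COUNT(logs);");
        // Write the result to the (placeholder) output directory counts_out.
        pig.store("counts", "counts_out");
        pig.shutdown();
    }
}
```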

Spark: a general computing engine

Spark does not depend on MapReduce; it uses its own data processing framework and performs computations in memory, which makes it much faster. Spark is itself an ecosystem: in addition to the core API, it includes additional libraries that provide further capabilities for big data analysis and machine learning, such as Spark SQL, Spark Streaming, Spark MLlib, Spark GraphX, BlinkDB, and Tachyon.
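
For comparison with the MapReduce word count above, here is a compact sketch of the same job written against Spark's Java RDD API in local mode; the input path data.txt is a placeholder assumption.

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("word-count").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> lines = sc.textFile("data.txt"); // placeholder input path
            // Transformations are lazy and intermediate data stays in memory where possible.
            JavaPairRDD<String, Integer> counts = lines
                    .flatMap(line -> Arrays.asList(line.split(" ")).iterator())
                    .mapToPair(word -> new Tuple2<>(word, 1))
                    .reduceByKey(Integer::sum);
            counts.collect().forEach(t -> System.out.println(t._1() + ": " + t._2()));
        }
    }
}
```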

Impala: A New Query System

Impala is a query system developed by Cloudera. It provides SQL semantics and can query PB-scale big data stored in Hadoop's HDFS and HBase. Although Hive also provides SQL semantics, its underlying execution uses the MapReduce engine, so it remains a batch process and has difficulty satisfying interactive queries. By contrast, Impala's biggest feature and selling point is its speed. Impala can be used in combination with Hive and can directly use Hive's metadata database.

Kafka: Distributed message queue

Kafka is a distributed, publish/subscribe-based messaging system, similar in function to a message queue. It can receive data from producers (such as web services, files, HDFS, HBase, etc.), cache it, and then deliver it to consumers (of the same kinds), acting as a buffer and adapter between them.
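
As a minimal producer-side sketch, the code below publishes a few messages to a topic; Kafka buffers them until consumers read them at their own pace. The broker address localhost:9092 and the topic name "logs" are placeholder assumptions.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class KafkaProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        // Messages are appended to the "logs" topic and consumed later by subscribers.
        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 3; i++) {
                producer.send(new ProducerRecord<>("logs", "key-" + i, "event-" + i));
            }
        }
    }
}
```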

Ambari: big data cluster management system

Ambari is an open source big data cluster management system that can be used to create, manage, and monitor Hadoop clusters, and it provides a web-based visual interface for administration.

Oozie: Workflow Scheduling Engine

Oozie is an open source workflow scheduling engine for the Hadoop platform, used to manage Hadoop jobs. It is a web application consisting of two components, the Oozie client and the Oozie Server; the Oozie Server is a web program that runs in a Java servlet container (Tomcat).

Origin blog.csdn.net/weixin_29403917/article/details/128111987