[Basic Knowledge] Hadoop Ecosystem

Hadoop is an open-source distributed computing framework, mainly used for storing and processing big data. It is a comprehensive distributed system made up of many components that cooperate with each other to cover everything from data storage to computation and analysis.

Keywords: disaster recovery (fault tolerance)

Master-slave architecture, multiple replicas

Main features

  1. Distributed storage - Hadoop uses the HDFS file system to distribute big data across multiple servers in the cluster (see the HDFS sketch after this list).
  2. Distributed computing - Hadoop's computing framework, MapReduce, processes large amounts of data in parallel across the servers of the cluster.
  3. High fault tolerance - Hadoop automatically keeps multiple replicas of the data and, when a node fails, automatically transfers that node's work to another node.
  4. High scalability - Hadoop clusters can easily be expanded to thousands of nodes; computing and storage capacity scale linearly as new nodes are added.
  5. Low cost - Hadoop runs on inexpensive commodity servers, greatly reducing the cost of big data processing.
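
The replication behavior described above is easiest to see through the HDFS Java API. Below is a minimal sketch that writes a small file and prints its replication factor; the NameNode address, file path, and replication setting are assumptions made for illustration, not values taken from this article.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWriteExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; in a real cluster this comes from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode-host:9000");
        // Keep 3 replicas of every block (the usual HDFS default).
        conf.set("dfs.replication", "3");

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/data/example.txt");   // hypothetical path

        // HDFS splits the file into blocks and stores each block on several
        // DataNodes, which is what provides the fault tolerance described above.
        try (FSDataOutputStream out = fs.create(file, true)) {
          out.writeUTF("hello hadoop");
        }

        FileStatus status = fs.getFileStatus(file);
        System.out.println("replication factor: " + status.getReplication());
        fs.close();
      }
    }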

Component overview

Core components

  • HDFS (Hadoop Distributed File System): Hadoop's distributed file system, used to store and access large amounts of data.
  • YARN (Yet Another Resource Negotiator): Hadoop's resource management and job scheduling platform.
  • MapReduce: Hadoop's distributed parallel computing framework for batch processing of large-scale data sets (a minimal WordCount sketch follows this list).
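
To make the MapReduce programming model concrete, here is a minimal sketch of the classic WordCount job written against the org.apache.hadoop.mapreduce API. The map phase runs in parallel on the nodes holding the input splits and emits (word, 1) pairs, and the reduce phase sums the counts per word; the input and output paths come from command-line arguments and are assumptions for illustration.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Mapper: emits (word, 1) for every word in the input split assigned to it.
      public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
          }
        }
      }

      // Reducer: sums the counts produced for each word by all the mappers.
      public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation on each mapper
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. an HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

On a cluster such a job is typically packaged into a JAR and submitted with "hadoop jar", and YARN takes care of allocating containers for the map and reduce tasks.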

Functional components

  • Hive: A data warehouse built on Hadoop that provides an SQL-like query capability (HiveQL).
  • Sqoop: Used to import and export data between Hadoop and relational databases.
  • Flume: A system for collecting, aggregating and transmitting large amounts of log data in real time.
  • HBase: A distributed, column-oriented database built on Hadoop (see the client sketch after this list).
  • ZooKeeper: A coordination service for building distributed applications.
  • Ambari: Provisioning, management and monitoring tool for Hadoop clusters.
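
As a rough illustration of how an application talks to HBase, the sketch below writes and then reads a single cell with the HBase Java client. The ZooKeeper quorum address, the table name "user", and the column family "info" are assumptions made for the example; the table would have to exist already.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Hypothetical ZooKeeper quorum; the HBase client locates the cluster through ZooKeeper.
        conf.set("hbase.zookeeper.quorum", "zk-host");

        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("user"))) {

          // Write one cell: row key "row1", column family "info", qualifier "name".
          Put put = new Put(Bytes.toBytes("row1"));
          put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("alice"));
          table.put(put);

          // Read the cell back by row key.
          Result result = table.get(new Get(Bytes.toBytes("row1")));
          byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
          System.out.println(Bytes.toString(value));
        }
      }
    }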

Other components

  • Pig: A high-level data flow language based on Hadoop for analyzing large-scale data sets.
  • Common: Common tools and utilities for Hadoop, including IO, RPC, serialization, configuration, etc.
  • Oozie: Workflow scheduling and coordination system for Hadoop.
  • Avro: A data serialization system for Hadoop (see the sketch after this list).
  • Mahout: A machine learning algorithm library for Hadoop.
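
To show what Avro serialization looks like in practice, here is a short sketch using Avro's generic-record Java API; the record schema and the output file name are made up for the example. Because the schema is stored in the data file itself, any reader can deserialize the records later without extra metadata.

    import java.io.File;

    import org.apache.avro.Schema;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.io.DatumWriter;

    public class AvroExample {
      public static void main(String[] args) throws Exception {
        // Hypothetical record schema with a single string field.
        String schemaJson = "{\"type\":\"record\",\"name\":\"User\","
            + "\"fields\":[{\"name\":\"name\",\"type\":\"string\"}]}";
        Schema schema = new Schema.Parser().parse(schemaJson);

        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "alice");

        // Serialize the record into an Avro container file; the schema is written
        // into the file header alongside the data.
        DatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<>(schema);
        try (DataFileWriter<GenericRecord> fileWriter = new DataFileWriter<>(datumWriter)) {
          fileWriter.create(schema, new File("users.avro"));
          fileWriter.append(user);
        }
      }
    }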

Reprinted from: blog.csdn.net/weixin_44325637/article/details/134982158