Hadoop is an open-source distributed computing framework used mainly to store and process big data. It is a comprehensive distributed system made up of multiple components that cooperate to cover everything from data storage to computation and analysis.
Keywords: disaster recovery, master-slave architecture, multiple replicas
Main features
- Distributed storage - Hadoop's HDFS file system splits large data sets into blocks and spreads them across the servers in the cluster.
- Distributed computing - Hadoop's MapReduce framework processes large amounts of data in parallel across the cluster's servers.
- High fault tolerance - Hadoop automatically keeps multiple replicas of each data block and, when a node fails, transfers that node's work to healthy nodes.
- High scalability - Hadoop clusters can be expanded easily to thousands of nodes; compute and storage capacity grow roughly linearly as nodes are added.
- Low cost - Hadoop runs on commodity hardware, greatly reducing the cost of big data processing.
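The fault-tolerance feature above can be illustrated with a toy model: blocks are replicated across several nodes, and when a node fails, under-replicated blocks are copied to survivors. This is a minimal sketch in pure Python; the class and method names are hypothetical and greatly simplify what the HDFS NameNode actually does.

```python
import random

REPLICATION = 3  # HDFS replicates each block 3 times by default


class ToyCluster:
    """Toy sketch of HDFS-style block replication (not a real HDFS client)."""

    def __init__(self, nodes):
        self.nodes = set(nodes)
        self.blocks = {}  # block id -> set of nodes holding a copy

    def store(self, block_id):
        # Place REPLICATION copies on distinct nodes, as the NameNode would.
        self.blocks[block_id] = set(random.sample(sorted(self.nodes), REPLICATION))

    def fail_node(self, node):
        # Remove the node, then re-replicate any under-replicated block
        # from its surviving copies onto other healthy nodes.
        self.nodes.discard(node)
        for holders in self.blocks.values():
            holders.discard(node)
            candidates = self.nodes - holders
            while len(holders) < REPLICATION and candidates:
                holders.add(candidates.pop())


cluster = ToyCluster(["n1", "n2", "n3", "n4", "n5"])
cluster.store("blk_001")
cluster.fail_node("n1")
# Every block is back at full replication despite the failure.
assert all(len(h) == REPLICATION for h in cluster.blocks.values())
```

The key idea is that clients never need to know which node failed: as long as one replica survives, the system restores the replication factor on its own.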
Component overview
Core components
- HDFS (Hadoop Distributed File System): Hadoop's distributed file system, used to store and access large amounts of data.
- YARN (Yet Another Resource Negotiator): Hadoop's resource management and job scheduling platform.
- MapReduce: Hadoop's distributed parallel computing framework for batch computing of large-scale data sets.
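MapReduce's three phases (map, shuffle, reduce) can be demonstrated with a single-process word count. This is a didactic sketch in pure Python with hypothetical helper names; a real Hadoop job would implement Mapper and Reducer classes in Java and run distributed across the cluster.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    # Mapper: emit a (word, 1) pair for every word in an input split.
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    # Shuffle: group all values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    # Reducer: aggregate the grouped values for each key.
    return {word: sum(counts) for word, counts in groups.items()}


lines = ["hadoop stores data", "hadoop processes data"]
counts = reduce_phase(shuffle(chain.from_iterable(map(map_phase, lines))))
# counts == {"hadoop": 2, "stores": 1, "data": 2, "processes": 1}
```

Because each mapper and each reducer works independently on its own slice of data, the framework can spread them over many machines, which is exactly the parallelism described above.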
Functional components
- Hive: A Hadoop-based data warehouse that provides SQL-like (HiveQL) queries over data in HDFS.
- Sqoop: Used to import and export data between Hadoop and relational databases.
- Flume: A system for collecting, aggregating and transmitting large amounts of log data in real time.
- HBase: A distributed, column-oriented NoSQL database built on top of HDFS.
- ZooKeeper: A coordination service for building distributed applications.
- Ambari: Provisioning, management and monitoring tool for Hadoop clusters.
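HBase's data model, mentioned above, is worth a small illustration: each cell is addressed by row key, column family, column qualifier, and timestamp, and reads return the newest version by default. The sketch below models this in pure Python; the class and method names are illustrative only and are not the real HBase API.

```python
import bisect
from collections import defaultdict


class ToyHTable:
    """Toy sketch of HBase's row/family/qualifier/version data model."""

    def __init__(self):
        # row -> family -> qualifier -> sorted list of (timestamp, value)
        self.rows = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))

    def put(self, row, family, qualifier, value, ts):
        # Keep cell versions sorted by timestamp, oldest first.
        bisect.insort(self.rows[row][family][qualifier], (ts, value))

    def get(self, row, family, qualifier):
        # Like HBase, return the newest version of the cell by default.
        cell = self.rows[row][family][qualifier]
        return cell[-1][1] if cell else None


table = ToyHTable()
table.put("user1", "info", "name", "alice", ts=1)
table.put("user1", "info", "name", "alicia", ts=2)
assert table.get("user1", "info", "name") == "alicia"
```

The versioned, sparse layout is what lets HBase store billions of rows with wildly different columns per row, something a fixed-schema relational table handles poorly.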
Other components
- Pig: A high-level data flow language based on Hadoop for analyzing large-scale data sets.
- Hadoop Common: Shared utilities used by the other Hadoop modules, including IO, RPC, serialization, and configuration.
- Oozie: Workflow scheduling and coordination system for Hadoop.
- Avro: A data serialization system for Hadoop.
- Mahout: A machine learning algorithm library for Hadoop.