Big Data Technology and Architecture (2): The Big Data Processing Architecture Hadoop (Part 1)

1. Hadoop Overview

1.1 Introduction to Hadoop

  • Hadoop is an open-source distributed computing platform under the Apache Software Foundation. It provides users with a distributed infrastructure whose low-level system details are kept transparent to them.
  • Hadoop is developed in the Java language, has good cross-platform support, and can be deployed on clusters of inexpensive commodity machines.
  • Hadoop supports applications written in multiple programming languages, such as C, C++, Java, and Python.
  • Hadoop = HDFS (storage) + MapReduce (computation); the sketch after this list shows the two halves working together.
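To make the HDFS-plus-MapReduce split concrete, below is a minimal MapReduce job in Java, essentially the classic word-count program from the Apache Hadoop tutorials: the mapper emits a (word, 1) pair for every word in text stored on HDFS, and the reducer sums the pairs per word. The HDFS input and output paths are passed in as command-line arguments and are placeholders.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every word in the input split
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce phase: sum the counts emitted for each word
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on each mapper
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

HDFS holds the input and output files; MapReduce moves the computation to wherever the data blocks live, which is the division of labor the equation above describes.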

1.2 Brief history of Hadoop development

  • Founder: Doug Cutting
  • Nutch is an open-source search engine implemented in Java. It provides all the tools needed to run your own search engine, including full-text search and a web crawler.
  • In 2003, Google released the distributed file system GFS (Google File System)
  • In 2004, the Nutch project, following the GFS design, developed its own distributed file system, NDFS (Nutch Distributed File System), the predecessor of HDFS.
  • In 2004, Google released MapReduce, a distributed parallel programming framework
  • In February 2006, NDFS and the MapReduce implementation were split out of Nutch and became an independent sub-project of the Lucene project, named Hadoop.
  • In January 2008, Hadoop officially became an Apache top-level project.
  • Hadoop's rise to fame: in April 2008, Hadoop broke the world record as the fastest system to sort 1 TB of data, using a cluster of 910 nodes to finish the sort in just 209 seconds.
  • In May 2009, Hadoop cut the time to sort 1 TB of data to 62 seconds. Since then, Hadoop has become famous, rapidly developing into the most influential open-source distributed computing platform of the big data era and the de facto standard for big data processing.

1.3 Features of Hadoop

Hadoop is a software framework for distributed processing of large amounts of data in a reliable, efficient, and scalable way. It has the following characteristics:

  • High reliability: the cluster is made up of many machines; if some of them fail, the remaining machines continue to provide service.
  • High efficiency: hundreds or thousands of machines compute in parallel.
  • High scalability: machines can be added to the cluster continuously.
  • High fault tolerance: data written to one node is automatically replicated to other nodes in the cluster, so another copy is available if a failure occurs (see the configuration sketch after this list).
  • Low cost: Hadoop distributes storage and processing across a cluster of ordinary, inexpensive machines, keeping costs very low.
  • Runs on the Linux platform.
  • Supports multiple programming languages.
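The fault tolerance above comes from HDFS block replication, which is controlled by the standard dfs.replication property in hdfs-site.xml. A minimal configuration sketch, using the common default value:

```xml
<!-- hdfs-site.xml: how many DataNodes hold a copy of each HDFS block -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <!-- The default is 3: a block survives the loss of up to two of its replicas -->
    <value>3</value>
  </property>
</configuration>
```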

1.4 Current application status of Hadoop

  • With its outstanding advantages, Hadoop has been widely used in various fields, and the Internet field is its main application area.
  • As a world-renowned social networking site, Facebook mainly uses the Hadoop platform for log processing, recommendation systems and data warehouses.
  • In China, companies that use Hadoop include Baidu, Taobao, NetEase, Huawei, and China Mobile; among them, Taobao's Hadoop cluster is comparatively large.


  • Hadoop-related components support three types of upper-layer applications
  • Different Hadoop components enable different kinds of enterprise analytics
  • At the lowest level, HDFS meets the enterprise need to store large amounts of data (a short HDFS shell sketch follows this list)
  • Analysis is then performed on top of the stored data:
  • Offline analysis batch-processes the data, e.g., with MapReduce (MR); higher-level tools such as the Hive data warehouse and Pig can also be used
  • The HBase database serves real-time queries
  • Mahout is used for data mining
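All of these layers read from and write to HDFS, so getting data in and out is a routine first step. A few basic HDFS shell commands illustrate it; the /user/alice paths and the access.log file name are hypothetical:

```bash
hdfs dfs -mkdir -p /user/alice/logs         # create a directory in HDFS
hdfs dfs -put access.log /user/alice/logs   # upload a local file into HDFS
hdfs dfs -ls /user/alice/logs               # list the directory contents
hdfs dfs -cat /user/alice/logs/access.log   # read the file back
```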

1.5 Apache Hadoop version evolution

  • Apache Hadoop versions fall into two generations: we call the first generation Hadoop 1.0 and the second Hadoop 2.0.
    • The first generation included three major release lines, 0.20.x, 0.21.x, and 0.22.x. 0.20.x eventually evolved into the stable 1.0.x line, while 0.21.x and 0.22.x added major new features such as NameNode HA.
    • The second generation includes two release lines, 0.23.x and 2.x. They are completely different from Hadoop 1.0 and adopt an entirely new architecture.
  • Hadoop 1.0 has two cores: HDFS and MapReduce. In Hadoop 1.0, MapReduce does two jobs at once: data processing and cluster resource management (allocating cluster CPU and memory).
  • Changes from Hadoop 1.0 to Hadoop 2.0: Hadoop 2.0 separates resource management out of MapReduce into a new component, YARN. Batch computing (MapReduce) is built on top of YARN, which performs the resource scheduling; YARN is also responsible for resource scheduling of other workloads such as stream computing (see the submission sketch below).
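On Hadoop 2.0, a MapReduce program such as the WordCount example from section 1.1 is submitted as a YARN application, and YARN allocates containers for its map and reduce tasks. A sketch, with the jar name and HDFS paths as placeholders:

```bash
# Submit the job; YARN negotiates containers for the map/reduce tasks
hadoop jar wordcount.jar WordCount /user/alice/input /user/alice/output

# List applications currently known to the ResourceManager
yarn application -list
```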

1.6 Hadoop distributions (enterprise products)

[Figure: Hadoop distributions offered by various vendors]

2. Hadoop project structure

[Figure: Hadoop project structure]

Origin: blog.csdn.net/m0_63853448/article/details/126647762