Java Big Data Road -- Hadoop (1)

Hadoop

Table of contents

Hadoop

Introduction to Big Data (6V)

Hadoop overview

1. Development history:

2. Modules:

3. Versions:

4. Download and install


 

Introduction to Big Data (6V)

  1. Volume: the amount of data is large, covering collection, storage, and computation. Big data is measured starting at the terabyte (TB) level, and commonly in petabytes (PB, 1024 TB), exabytes (EB, 2^20 or about one million TB), or zettabytes (ZB, 2^30 or about one billion TB).
  2. Variety: data comes in many types and from many sources, including structured, semi-structured, and unstructured data such as web logs, audio, video, images, and geolocation information. This diversity places higher demands on data-processing capability.
  3. Value: the value density of the data is low, even though the data as a whole is valuable. With the wide adoption of the Internet and the Internet of Things, information is sensed everywhere and accumulates in huge volumes, yet its value density stays low. How to combine business logic with powerful machine-learning algorithms to mine the value in that data is one of the most pressing problems of the big data era.
  4. Velocity: data grows quickly and must also be processed quickly; timeliness requirements are high.
  5. Veracity: the accuracy and trustworthiness of the data, that is, data quality.
  6. Valence: the connectivity among big data.
  7. As big data has developed, further Vs have been proposed, such as Vitality (dynamism), Visualization, and Validity (legality).

Hadoop overview

Hadoop is an open-source, reliable, and scalable framework from Apache that stores and processes massive amounts of data on a distributed architecture. Note that Hadoop targets offline (batch) data, that is, scenarios where the data set is already known and real-time results are not required.

1. Development history:

Founders: Doug Cutting and Mike Cafarella

  1. In 2002, while Doug Cutting and Mike Cafarella were designing the search engine Nutch, they crawled the Internet and collected roughly one billion web pages. Because most data on the Internet is unstructured, it could not be stored in traditional relational databases.

  2. In 2003, Google published a paper on cluster system storage: "The Google File System" (referred to as GFS)

  3. In 2004, Cutting designed NDFS (Nutch Distributed File System) based on GFS

  4. In 2004, Google published another paper on cluster system computing: "MapReduce: Simplified Data Processing on Large Clusters"

  5. In 2005, Doug Cutting implemented MapReduce in Nutch based on Google's paper.

  6. After Nutch 0.8, the NDFS and MapReduce modules were split out into a new project named Hadoop, and NDFS was renamed HDFS (Hadoop Distributed File System).

  7. In 2006, Doug Cutting joined Yahoo, which dedicated a team and resources to developing Hadoop into a system that could operate at web scale.

  8. During his time at Yahoo, Doug Cutting designed Hive, Pig, HBase, and other projects.

  9. Later, Yahoo contributed Hadoop, Hive, Pig, HBase, and other projects to Apache.

2. Modules:

  1. Hadoop Common: the base module that supports the other modules
  2. Hadoop Distributed File System (HDFS): distributed storage
  3. Hadoop YARN: job scheduling and cluster resource management
  4. Hadoop MapReduce: distributed computation (see the word-count sketch after this list)
  5. Hadoop Ozone: object storage
  6. Hadoop Submarine: machine learning engine
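
To make the MapReduce module concrete, here is a minimal word-count sketch in Java against the org.apache.hadoop.mapreduce API, essentially the classic example from the official MapReduce tutorial. It assumes the hadoop-client libraries are on the classpath and that the input and output paths are passed as command-line arguments.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every token in the input line
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts collected for each word
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation before the shuffle
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Packaged as a JAR, it would be submitted with hadoop jar wordcount.jar WordCount <input> <output>; in standalone mode the paths are ordinary local directories, while on a cluster they refer to HDFS.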

3. Versions:

  1. Hadoop 1.0: Common, HDFS, MapReduce
  2. Hadoop 2.0: Common, HDFS, MapReduce, YARN. Versions 1.0 and 2.0 are not compatible with each other.
  3. Hadoop 3.0: Common, HDFS, MapReduce, YARN, Ozone; Submarine is included in the latest releases.

4. Download and install

  1. Download address: http://hadoop.apache.org/releases.html
  2. Hadoop can be installed in three modes: standalone mode, pseudo-distributed mode, and fully distributed mode.
  3. Standalone mode is Hadoop's default mode. When the Hadoop package is unpacked for the first time, Hadoop knows nothing about the hardware environment and conservatively chooses the minimal configuration: all three XML configuration files are empty. With empty configuration files, Hadoop runs entirely on the local machine. Because it does not need to interact with any other node, standalone mode does not use HDFS and loads no Hadoop daemons. This mode is mainly used for developing and debugging the application logic of MapReduce programs.
  4. In pseudo-distributed mode, the Hadoop daemons all run on the local machine, simulating a small cluster; HDFS and MapReduce can be used (a minimal configuration sketch follows this list).
  5. In fully distributed mode, the Hadoop daemons run on a cluster. All daemon processes are started and the full Hadoop feature set is available: HDFS, MapReduce, and YARN. Because the daemons run across cluster nodes, this mode can genuinely use the cluster for high performance, and it is the mode used in production.
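
As a concrete reference for pseudo-distributed mode, here is a minimal configuration sketch based on the Apache single-node setup guide. It assumes a recent Hadoop release; the file locations under etc/hadoop/ and the conventional NameNode port 9000 follow the official documentation and may need adjusting for your environment.

```xml
<!-- etc/hadoop/core-site.xml: point the default file system at a NameNode on this machine -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- etc/hadoop/hdfs-site.xml: there is only one DataNode, so keep a single replica per block -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
```

With these files in place, bin/hdfs namenode -format initializes the NameNode and sbin/start-dfs.sh starts the daemons; passwordless SSH to localhost is required so the startup scripts can launch each process.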

Pseudo-distributed installation tutorial


Reprinted from: blog.csdn.net/a34651714/article/details/102800144