Hadoop Learning, Part 1: A First Look at Big Data

Table of contents

1 What is big data?

2 Characteristics of big data

3 Distributed computing

4 What is Hadoop?

5 Hadoop history and distributions

6 Why use Hadoop?

7 Hadoop vs. RDBMS

8 The Hadoop ecosystem

9 Hadoop architecture


1 What is big data?

Big data refers to data sets that conventional software tools cannot capture, manage, and process within an acceptable period of time.

Big data technology addresses two problems: storing massive amounts of data and computing over massive amounts of data.

 

2 Characteristics of big data

  • The 4V characteristics
    • Volume (large amount of data): by a commonly cited estimate, about 90% of the world's data was generated in the past two years
    • Velocity (fast): data grows quickly and must be processed with high timeliness
    • Variety (diversification): data types and sources are diverse, covering structured data (such as tabular data), semi-structured data (such as JSON), and unstructured data (such as log information)
    • Value (low value density): the data must be mined to extract its value
  • Inherent features
    • Timeliness
    • Immutability
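To make the Variety point concrete, here is a minimal Python sketch that reads one sample of each kind of data. The sample values, the function names, and the log layout are all made up for illustration; real logs in particular rarely follow a single pattern.

```python
import csv
import io
import json
import re


def parse_structured(text):
    """Structured: every row has the same named columns, like a table."""
    return list(csv.DictReader(io.StringIO(text)))


def parse_semi_structured(text):
    """Semi-structured: self-describing JSON; fields may vary per record."""
    return json.loads(text)


# Hypothetical log layout: "<date> <time> <LEVEL> <free-text message>"
LOG_PATTERN = re.compile(r"(\S+) (\S+) (\w+) (.*)")


def parse_unstructured(line):
    """Unstructured: free text; structure has to be extracted, here by regex."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    date, time, level, message = match.groups()
    return {"date": date, "time": time, "level": level, "message": message}


if __name__ == "__main__":
    print(parse_structured("id,name\n1,Alice\n2,Bob\n"))
    print(parse_semi_structured('{"id": 3, "tags": ["admin", "ops"]}'))
    print(parse_unstructured("2023-08-22 10:15:42 ERROR disk full on node-7"))
```

Note how the amount of parsing effort grows from structured to unstructured data: the CSV and JSON readers know where the fields are, while the log line needs a hand-written pattern.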

3 Distributed computing

Distributed computing splits a large data set into smaller parts and processes those parts separately.

| Aspect | Traditional distributed computing | New distributed computing (Hadoop) |
| --- | --- | --- |
| Calculation method | Copy data to the compute nodes | Compute in parallel on the different data nodes |
| Amount of data that can be processed | Small amount of data | Large amount of data |
| CPU performance limit | Highly limited by CPU | Less limited by any single device |
| Improving computing power | Improve the computing power of a single machine | Scale out a cluster of low-cost servers |
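The divide-and-process idea can be sketched in a few lines of Python. This is only a single-machine illustration, not how Hadoop itself is implemented: threads stand in for cluster nodes, and the function names are invented for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(data, num_chunks):
    """Divide a list into at most num_chunks contiguous parts of similar size."""
    size = max(1, -(-len(data) // num_chunks))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def process_chunk(chunk):
    """Work done independently on one part (here: a partial sum)."""
    return sum(chunk)


def distributed_sum(data, num_workers=4):
    """Process every chunk in parallel, then combine the partial results."""
    chunks = split_into_chunks(data, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_sums = list(pool.map(process_chunk, chunks))
    return sum(partial_sums)


if __name__ == "__main__":
    print(distributed_sum(list(range(1, 101))))  # 5050
```

Adding workers here is the analogue of adding low-cost servers to a cluster: each worker only ever sees its own chunk, so no single machine has to hold or process the whole data set.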

 

4 What is Hadoop?

  • Hadoop is an open source distributed system architecture that solves the problems of massive data storage and massive data computing
  • It is a widely chosen architecture for handling massive amounts of data
  • It completes big data computing tasks quickly by running them in parallel across a cluster
  • It has grown into a whole ecosystem of related projects

5 Hadoop history and distributions

  • Hadoop originated from the search engine Apache Nutch
    • Founder: Doug Cutting
    • 2004 - Initial version implemented
    • 2008 - Became an Apache top-level project
  • Hadoop distributions
    • Community edition: Apache Hadoop
    • Cloudera distribution: CDH
    • Hortonworks distribution: HDP

6 Why use Hadoop?

  • High scalability
    • Task data is distributed across the cluster, which can easily be expanded to thousands of nodes
  • High reliability
    • The underlying layer of Hadoop maintains multiple copies of the data
  • High fault tolerance
    • The Hadoop framework automatically reassigns failed tasks
  • Low cost
    • The Hadoop architecture can be deployed on inexpensive machines
  • Flexible: it can store any type of data
  • Open source, with an active community
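The reliability and fault-tolerance points work together: because each piece of data has several copies, a failed task can be rerun on another node that holds a copy. Below is a minimal sketch of that idea, assuming a simple round-robin placement and invented function and node names. It is not Hadoop's actual placement or scheduling logic; the default of 3 copies mirrors HDFS's default replication factor.

```python
def place_replicas(block_id, nodes, replication=3):
    """Choose `replication` distinct nodes for one block (simple round-robin).

    Assumption: real placement policies also consider racks and free space.
    """
    count = min(replication, len(nodes))
    start = block_id % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(count)]


def run_with_reassignment(block_id, replicas, failed_nodes, task):
    """Try the task on each replica holder in turn, skipping failed nodes."""
    for node in replicas:
        if node in failed_nodes:
            continue  # node is down: reassign to the next replica holder
        return node, task(block_id)
    raise RuntimeError(f"all replicas of block {block_id} are unavailable")


if __name__ == "__main__":
    nodes = ["node-1", "node-2", "node-3", "node-4"]
    replicas = place_replicas(1, nodes)
    print(replicas)
    # node-2 has failed, so the task moves to the next node holding a copy.
    print(run_with_reassignment(1, replicas, {"node-2"}, lambda b: b * 10))
```

The task is only lost if every node holding a copy is down at the same time, which is why keeping multiple copies makes cheap, failure-prone machines acceptable.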

7 Hadoop vs. RDBMS

Comparison between Hadoop and relational databases

| Aspect | RDBMS | Hadoop |
| --- | --- | --- |
| Format | Required when writing data | Required when reading data |
| Speed | Reads data fast | Writes data fast |
| Data governance | Standard structured data | Arbitrarily structured data |
| Data processing | Limited processing power | Powerful processing capability |
| Type of data | Structured data | Structured, semi-structured, and unstructured data |
| Application scenarios | Interactive OLAP analysis; ACID transaction processing; enterprise business systems | Handling unstructured data; massive data storage and computing |
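The Format row describes what is commonly called schema-on-write (RDBMS) versus schema-on-read (Hadoop). The following toy sketch contrasts the two; the classes, the `SCHEMA` dictionary, and the sample records are hypothetical and do not correspond to any real database or Hadoop API.

```python
import json

# Hypothetical schema: column name -> required Python type.
SCHEMA = {"id": int, "name": str}


class SchemaOnWriteTable:
    """RDBMS-like: data is checked against the schema when it is written."""

    def __init__(self, schema):
        self.schema = schema
        self.rows = []

    def insert(self, row):
        if set(row) != set(self.schema):
            raise ValueError("columns do not match the schema")
        for column, expected in self.schema.items():
            if not isinstance(row[column], expected):
                raise TypeError(f"{column} must be {expected.__name__}")
        self.rows.append(row)


class SchemaOnReadStore:
    """Hadoop-like: raw lines are stored as-is; a schema is applied on read."""

    def __init__(self):
        self.raw_lines = []

    def append(self, line):
        self.raw_lines.append(line)  # no validation, so writes stay cheap

    def read(self, schema):
        for line in self.raw_lines:
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip lines that do not parse
            if not isinstance(record, dict):
                continue
            if all(isinstance(record.get(c), t) for c, t in schema.items()):
                yield {c: record[c] for c in schema}


if __name__ == "__main__":
    store = SchemaOnReadStore()
    store.append('{"id": 1, "name": "a", "extra": 2}')
    store.append("free-form text that is not JSON")
    print(list(store.read(SCHEMA)))
```

This also suggests why the Speed row reads the way it does: validating on write costs time up front but makes later reads simple, while accepting anything on write pushes the parsing cost to read time.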

 

8 The Hadoop ecosystem

 

9 Hadoop architecture

  • HDFS (Hadoop Distributed File System)
    • Distributed file system that handles distributed storage
  • MapReduce
    • Distributed computing framework
  • YARN
    • Distributed resource management system, introduced in Hadoop 2.x
  • Common
    • Common utilities that support all the other modules
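To give a feel for the MapReduce programming model, here is a word count written as three plain Python functions. It runs in a single process and does not use Hadoop's API; the function names are chosen for this example only. In a real cluster, the map and reduce steps would run on many nodes and the shuffle would move data between them.

```python
from collections import defaultdict


def map_phase(line):
    """Map: turn one input line into (word, 1) pairs."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all values that share the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


def word_count(lines):
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["Hadoop stores data", "Hadoop processes data"]))
```

Each call to `map_phase` depends only on its own line, and each call to `reduce_phase` only on its own key, which is what allows a framework to spread the work over many machines.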

     


Origin blog.csdn.net/jojo_oulaoula/article/details/132429748