Big Data Learning

Who said elephants can't dance?

Hadoop: easy storage and analysis of massive data

Massive Data:

Both the data volume and the number of records are large. The volume reaches the PB and ZB level, and the number of records reaches billions or tens of billions.

1 KB (kilobyte) = 1024 B

1 MB (megabyte) = 1024 KB

1 GB (gigabyte) = 1024 MB

1 TB (terabyte) = 1024 GB, where 1024 = 2^10 (2 to the 10th power)

1 PB (petabyte) = 1024 TB

1 EB (exabyte) = 1024 PB

1 ZB (zettabyte) = 1024 EB

1 YB (yottabyte) = 1024 ZB

1 BB (brontobyte, an informal and non-standard unit) = 1024 YB
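The unit ladder above is plain arithmetic: each step multiplies by 1024. A minimal Python sketch of converting between these units follows. The helper names `to_bytes` and `humanize` are made up for illustration, and the informal brontobyte is left out.

```python
# Binary unit ladder from the list above: each step is a factor of 1024 (2**10).
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

def to_bytes(value, unit):
    """Convert a value expressed in the given unit to bytes."""
    exponent = UNITS.index(unit.upper())
    return value * 1024 ** exponent

def humanize(num_bytes):
    """Express a byte count in the largest unit that keeps the value below 1024."""
    value = float(num_bytes)
    for unit in UNITS:
        if value < 1024 or unit == UNITS[-1]:
            return f"{value:.2f} {unit}"
        value /= 1024

print(to_bytes(1, "PB"))        # 1125899906842624, i.e. 2**50
print(humanize(3 * 1024 ** 5))  # 3.00 PB
```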

 

Storage:

Distributed and cluster-based, with management roles (master node, slave nodes): HDFS (Hadoop Distributed File System)

Analysis:

A distributed, parallel, offline computing framework, with management roles (master node, slave nodes): MapReduce

 

Apache Hadoop logo

 

 

Origin

Apache Lucene: an open source, high-performance full-text search toolkit

Apache Nutch: an open source web search engine

Google's three major papers: MapReduce, GFS, BigTable

Apache Hadoop: large-scale data processing

 

 

GFS -> HDFS: open source distributed file system

Google MapReduce -> Hadoop MapReduce: open source distributed parallel computing framework

BigTable -> HBase: open source distributed database

 

Big data and cloud computing

Big data:

The volume of data is large, the data has value, and that value is extracted through analysis and mining.

Cloud computing:

Cloud computing consists of three layers: IaaS, PaaS, and SaaS.

IaaS: Infrastructure as a Service. Typical implementations are Amazon EC2, OpenStack, CloudStack, Rackspace, etc.

OpenStack can be used to build a company's private cloud platform.

PaaS: Platform as a Service. Typical implementations are Google App Engine and Apache Hadoop.

SaaS: Software as a Service. A typical implementation is Google Apps.

 

 

Hadoop: Big Data Platform

Data storage

HDFS

- Distributed across "nodes"

- Natively redundant

- NameNode tracks block locations

Data processing

MapReduce

- Splits a task across processors "near" the data and assembles the results

- Self-healing, high-bandwidth clustered storage

 

 

Apache Hadoop Features

Scalable

Low cost (Economical)

Efficient

Reliable

 

What problems can Apache Hadoop solve?

Demands

Speed, depth, fixed assets

Problems

Disk I/O, not CPU, becomes the bottleneck

Network bandwidth is a scarce resource

Hardware failure becomes a major factor affecting stability

 

 

Hadoop development history

Classic versions: 0.20.2 -> 1.0.0 (the first official 1.0 release) -> 1.0.3 or 1.0.4, both of which work well

 

2.x versions:

2.2.0, 2.3.0, 2.4.0: official releases, for actual use

 

 

ETL

Extract  ->   Transform   ->   Load

Extract data from the database, clean and filter it, convert the qualifying records into a fixed format, and store the formatted data on HDFS, where the computing framework can analyze and mine it.

Data formats:

1. TSV format: the columns in each row are separated by a tab character (\t)

2. CSV format: the columns in each row are separated by a comma
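The transform step and the two output formats can be sketched in Python with the standard `csv` module. This is a minimal illustration rather than a real pipeline: the sample records, the column names (`id`, `name`, `city`), and the cleaning rules are all assumptions, and the result is kept in memory instead of being written to HDFS.

```python
import csv
import io

# Hypothetical raw records, standing in for rows extracted from a database.
raw_rows = [
    {"id": "1", "name": " Alice ", "city": "Beijing"},
    {"id": "2", "name": "", "city": "Shanghai"},       # missing name: filtered out
    {"id": "3", "name": "Bob", "city": "Shen\tzhen"},  # embedded tab: cleaned
]

def transform(rows):
    """Clean and filter: strip whitespace, replace embedded tabs, drop rows with no name."""
    for row in rows:
        cleaned = {key: value.strip().replace("\t", " ") for key, value in row.items()}
        if cleaned["name"]:
            yield cleaned

def dump(rows, delimiter):
    """Serialize rows as delimited text: a tab delimiter gives TSV, a comma gives CSV."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator="\n")
    for row in rows:
        writer.writerow([row["id"], row["name"], row["city"]])
    return buffer.getvalue()

clean_rows = list(transform(raw_rows))
tsv_text = dump(clean_rows, "\t")
csv_text = dump(clean_rows, ",")
print(tsv_text)
print(csv_text)
```

Replacing embedded tabs before writing TSV matters because a stray delimiter inside a field would shift every later column when the file is split on tabs downstream.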

Sqoop

Imports and exports data between relational databases and Hadoop storage (HDFS files, HBase tables, Hive).

Flume

Collects the logs of each application system and framework and writes them to the corresponding directory in the HDFS distributed file system.

 

 

The architecture of distributed systems and frameworks is generally divided into two parts:

Part 1: the management layer, which manages the application layer

Part 2: the application layer, which does the actual work

 

HDFS, the distributed file system

NameNode (metadata server): belongs to the management layer and manages how data is stored

Secondary NameNode (auxiliary metadata server): also belongs to the management layer and assists the NameNode

DataNodes (block storage): belong to the application layer and store the data. They are managed by the NameNode, report to it regularly, and carry out the tasks it assigns.

MapReduce, the distributed parallel computing framework

JobTracker (task scheduler): belongs to the management layer; it manages cluster resources, schedules tasks, and monitors their execution.

TaskTracker (task executor): belongs to the application layer; it executes the tasks assigned by the JobTracker and reports its status back to the JobTracker.
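The management-layer and application-layer split can be illustrated with a toy simulation of the JobTracker and TaskTracker roles. This is a sketch under simplifying assumptions, not Hadoop code: the class and method names are invented, each tracker has a fixed number of slots, and every task is assumed to finish within one heartbeat interval.

```python
from collections import deque

class JobTracker:
    """Management layer: holds the task queue and tracks task status."""

    def __init__(self, tasks):
        self.pending = deque(tasks)
        self.status = {task: "pending" for task in tasks}

    def heartbeat(self, tracker_name, free_slots, finished):
        """Record finished tasks, then hand out up to free_slots new ones."""
        for task in finished:
            self.status[task] = "done"
        assigned = []
        while self.pending and len(assigned) < free_slots:
            task = self.pending.popleft()
            self.status[task] = f"running on {tracker_name}"
            assigned.append(task)
        return assigned

class TaskTracker:
    """Application layer: runs assigned tasks and reports back."""

    def __init__(self, name, slots):
        self.name = name
        self.slots = slots
        self.running = []

    def tick(self, job_tracker):
        # Simplification: every running task completes before the next heartbeat.
        finished, self.running = self.running, []
        self.running = job_tracker.heartbeat(self.name, self.slots, finished)

job_tracker = JobTracker([f"task-{i}" for i in range(5)])
trackers = [TaskTracker("tt1", slots=2), TaskTracker("tt2", slots=1)]
rounds = 0
while any(state != "done" for state in job_tracker.status.values()):
    for tracker in trackers:
        tracker.tick(job_tracker)
    rounds += 1
print(rounds, job_tracker.status)
```

The point of the sketch is the direction of communication: the workers report in, and the manager answers each report with new assignments, which mirrors the "report status, receive tasks" relationship described above.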

 

 

Introduction to the HDFS framework

NameNode: stores file metadata

1) File name

2) Directory structure of the file

3) File attributes (permissions, replica count, creation time)

4) File -> (split into) blocks -> (stored on) DataNodes
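The four kinds of metadata listed above can be pictured as two lookup tables: one from path to file attributes and block list, and one from block to the DataNodes holding it. The Python sketch below is an assumed in-memory model, not the real HDFS API. `NameNodeSketch`, the tiny `BLOCK_SIZE`, the default attribute values, and the round-robin placement are all made up for illustration.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 64  # illustrative only; real HDFS blocks are far larger (64 MB or 128 MB by default)

@dataclass
class FileMeta:
    path: str          # file name plus directory structure
    permissions: str   # file attribute
    replication: int   # number of copies of each block
    created: str       # time of generation
    blocks: list = field(default_factory=list)  # ordered block IDs

class NameNodeSketch:
    """Hypothetical in-memory model of NameNode metadata; not the real HDFS API."""

    def __init__(self, datanodes):
        self.datanodes = list(datanodes)
        self.files = {}            # path -> FileMeta
        self.block_locations = {}  # block ID -> DataNode names holding a replica
        self._next_block = 0

    def add_file(self, path, size, replication=3,
                 permissions="rw-r--r--", created="2014-01-01"):
        meta = FileMeta(path, permissions, replication, created)
        num_blocks = max(1, -(-size // BLOCK_SIZE))  # ceiling division
        copies = min(replication, len(self.datanodes))
        for _ in range(num_blocks):
            block_id = f"blk_{self._next_block}"
            # Round-robin placement, a stand-in for the real placement policy.
            start = self._next_block % len(self.datanodes)
            nodes = [self.datanodes[(start + i) % len(self.datanodes)]
                     for i in range(copies)]
            self.block_locations[block_id] = nodes
            meta.blocks.append(block_id)
            self._next_block += 1
        self.files[path] = meta
        return meta

    def locate(self, path):
        """Resolve file -> blocks -> DataNodes."""
        return [(b, self.block_locations[b]) for b in self.files[path].blocks]

namenode = NameNodeSketch(["dn1", "dn2", "dn3", "dn4"])
namenode.add_file("/logs/app.log", size=150)
for block_id, nodes in namenode.locate("/logs/app.log"):
    print(block_id, nodes)
```

Note that the sketch stores only metadata. The block contents themselves would live on the DataNodes, which is why the NameNode can stay small enough to keep this mapping in memory.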

 

 

The MapReduce framework and how MapReduce works
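As a rough illustration of the principle, a MapReduce job can be reduced to three steps: a map function emits key/value pairs, a shuffle groups the values by key, and a reduce function aggregates each group. The single-process Python sketch below uses word counting as the example. The function names and input lines are assumptions, and nothing here is distributed or tied to the Hadoop API.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)

lines = ["Hadoop stores data", "Hadoop analyzes data"]
mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(counts)  # {'hadoop': 2, 'stores': 1, 'data': 2, 'analyzes': 1}
```

In a real cluster the map calls would run in parallel on the nodes that hold the input blocks, and the shuffle would move data across the network, which is why the design tries to keep computation close to the data.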

 

 

Apache Hadoop installation and deployment modes

Standalone (local) mode

Pseudo-distributed mode

Fully distributed mode
