Who said elephants can't dance
Hadoop: easy storage and analysis of massive data
Massive data:
Both the data volume and the record count are large. Volumes reach the PB and ZB scale, and record counts reach billions or tens of billions.
1 KB (kilobyte) = 1024 B
1 MB (megabyte) = 1024 KB
1 GB (gigabyte) = 1024 MB
1 TB (terabyte) = 1024 GB, where 1024 = 2^10 (2 to the 10th power)
1 PB (petabyte) = 1024 TB
1 EB (exabyte) = 1024 PB
1 ZB (zettabyte) = 1024 EB
1 YB (yottabyte) = 1024 ZB
1 BB (brontobyte) = 1024 YB
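The unit ladder can be checked with a few lines of Python. This is only a minimal sketch: the names UNITS, to_bytes and humanize are illustrative helpers, not part of Hadoop or any library.

```python
# Unit ladder from the list above: each step is a factor of 1024 (2**10).
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB", "BB"]

def to_bytes(value, unit):
    """Convert a value expressed in `unit` to a number of bytes."""
    exponent = UNITS.index(unit)
    return value * 1024 ** exponent

def humanize(num_bytes):
    """Express a byte count in the largest unit that keeps the value below 1024."""
    value = float(num_bytes)
    for unit in UNITS:
        if value < 1024 or unit == UNITS[-1]:
            return f"{value:.2f} {unit}"
        value /= 1024

print(to_bytes(1, "PB"))        # 1125899906842624, i.e. 2**50
print(humanize(5 * 1024 ** 4))  # 5.00 TB
```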
Storage:
Distributed and cluster-based, managed through master and slave nodes: HDFS (Hadoop Distributed File System)
Analysis:
A distributed, parallel, offline computing framework, also managed through master and slave nodes: MapReduce
Origin
Apache Lucene: an open source, high-performance full-text search toolkit
Apache Nutch: an open source web search engine
Google's three major papers: MapReduce, GFS, BigTable
Apache Hadoop: large-scale data processing
GFS -> HDFS, an open source distributed file system
Google MapReduce -> Hadoop MapReduce, an open source distributed parallel computing framework
BigTable -> HBase, an open source distributed database
Big data and cloud computing
Big data:
The data volume is large and the data has value, which is extracted through analysis and mining
Cloud computing:
Cloud computing consists of three layers: IaaS, PaaS, SaaS
IaaS: Infrastructure as a Service; typical implementations are Amazon EC2, OpenStack, CloudStack, Rackspace, etc.
OpenStack can be used to build a company's private cloud platform
PaaS: Platform as a Service; typical implementations are Google App Engine and Apache Hadoop
SaaS: Software as a Service; a typical implementation is Google Apps
Hadoop: a big data platform
Data storage
HDFS
-Distributed across "nodes"
-Natively redundant
-NameNode tracks locations
Data processing
MapReduce
-Splits a task across processors "near" the data and assembles the results
-Self-healing, high-bandwidth clustered storage
Apache Hadoop features
Scalable
Economical (low cost)
Efficient
Reliable
What problems can Apache Hadoop solve
Demands
speed, depth, fixed assets
Problems
Disk I/O, not CPU, becomes the bottleneck
Network bandwidth is a scarce resource
Hardware failure becomes a major factor affecting stability
Hadoop development history
Classic versions: 0.20.2 -> 1.0.0 (the first official 1.0 release) -> 1.0.3 or 1.0.4, both very good
2.x versions:
2.2.0, 2.3.0, 2.4.0 official versions, for actual
ETL :
Extract -> Transform -> Load
Obtain data from the database, clean and filter it, convert the qualifying records into a fixed format, and store the formatted data on the HDFS file system, where the computing framework can analyze and mine it.
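The three ETL stages can be sketched in plain Python. This is a minimal illustration under stated assumptions: the sample rows, the field names (user_id, page, ms) and the validation rule are all hypothetical, and the final copy onto HDFS is left out.

```python
# Hypothetical source rows, standing in for a database query result.
raw_rows = [
    {"user_id": "1001", "page": "/home ", "ms": "120"},
    {"user_id": "", "page": "/cart", "ms": "85"},      # missing user_id: rejected
    {"user_id": "1002", "page": "/pay", "ms": "n/a"},  # non-numeric ms: rejected
    {"user_id": "1003", "page": "/home", "ms": "45"},
]

def extract(rows):
    """Extract: yield records one at a time from the source."""
    yield from rows

def transform(rows):
    """Transform: clean each record and drop the ones that fail validation."""
    for row in rows:
        user_id = row["user_id"].strip()
        page = row["page"].strip()
        ms = row["ms"].strip()
        if user_id and ms.isdigit():
            yield (user_id, page, ms)

def load(records):
    """Load: format records as tab-separated lines, ready to be written out."""
    return ["\t".join(record) for record in records]

lines = load(transform(extract(raw_rows)))
print(lines)  # ['1001\t/home\t120', '1003\t/home\t45']
```

In a real pipeline the formatted lines would be written to a file and copied into HDFS, for example with a command such as hdfs dfs -put.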
Data formats:
1-TSV format: the columns in each row are separated by a tab character (\t)
2-CSV format: the columns in each row are separated by a comma
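The difference between the two formats shows up as soon as a value contains a comma. The sketch below uses Python's standard csv module; the sample rows and the render helper are made up for illustration.

```python
import csv
import io

# Hypothetical rows; the second one has a comma inside a value.
rows = [["1001", "/home", "120"], ["1002", "Smith, John", "45"]]

def render(rows, delimiter):
    """Write rows to an in-memory buffer using the given column delimiter."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()

tsv_text = render(rows, "\t")  # the comma is ordinary text here
csv_text = render(rows, ",")   # the writer quotes "Smith, John"
print(tsv_text)
print(csv_text)

# Reading the CSV text back recovers the original columns.
parsed = list(csv.reader(io.StringIO(csv_text), delimiter=","))
```

With TSV the comma needs no special handling, which is one reason tab-separated files are common for data that MapReduce jobs later split line by line.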
Sqoop:
Imports and exports data between relational databases and Hadoop storage (HDFS files, HBase tables, Hive)
Flume:
Collects the logs of application systems and frameworks and puts them in the corresponding directories of the HDFS distributed file system.
The architecture of a distributed system or framework is generally divided into two parts:
Part 1: the management layer, which manages the application layer
Part 2: the application layer, which does the work
HDFS, the distributed file system
NameNode (metadata server): belongs to the management layer and manages the storage of data
Secondary NameNode (auxiliary metadata server): also belongs to the management layer and assists the NameNode
DataNodes (block storage): belong to the application layer and store the data. They are managed by the NameNode, report to it regularly, and carry out the tasks it assigns.
MapReduce, the distributed parallel computing framework
JobTracker (task scheduler): belongs to the management layer; manages cluster resources, schedules tasks, and monitors their execution.
TaskTracker (task executor): belongs to the application layer; executes the tasks assigned by the JobTracker and reports their status back to it.
An introduction to the HDFS framework
NameNode: stores the metadata of files
1) File name
2) Directory structure of the file
3) File attributes (permissions, number of replicas, creation time)
4) File -> (split into) blocks -> (stored on) DataNodes
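The four kinds of metadata listed above can be modeled as a small data structure. This is a toy sketch, not the NameNode's actual implementation: the class and function names, the tiny block size, the timestamp and the round-robin replica assignment are all assumptions made for illustration (real HDFS uses much larger blocks and rack-aware placement).

```python
from dataclasses import dataclass, field

@dataclass
class FileMetadata:
    """What a NameNode-like service might record for one file (illustrative)."""
    path: str          # file name and its place in the directory tree
    permissions: str   # e.g. "rw-r--r--"
    replication: int   # number of copies kept of each block
    created_at: str    # creation time
    blocks: list = field(default_factory=list)  # ordered block IDs

# Block ID -> names of the DataNodes holding a replica of that block.
block_locations = {}

def add_file(namespace, meta, size, block_size, datanodes):
    """Split a file of `size` bytes into blocks and assign replicas round-robin."""
    num_blocks = max(1, -(-size // block_size))  # ceiling division
    for i in range(num_blocks):
        block_id = f"{meta.path}#blk_{i}"
        meta.blocks.append(block_id)
        block_locations[block_id] = [
            datanodes[(i + r) % len(datanodes)] for r in range(meta.replication)
        ]
    namespace[meta.path] = meta

namespace = {}
meta = FileMetadata("/logs/app.log", "rw-r--r--", 3, "2014-05-01T10:00:00")
add_file(namespace, meta, size=300, block_size=128,
         datanodes=["dn1", "dn2", "dn3", "dn4"])

for block_id in namespace["/logs/app.log"].blocks:
    print(block_id, "->", block_locations[block_id])
```

A 300-byte file with a 128-byte block size yields three blocks, each listed with the three DataNodes that hold its replicas.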
Explanation of the MapReduce framework and the principle of MapReduce
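The principle can be illustrated with the classic word count, written here as a single-process Python sketch. It only mimics the map, shuffle and reduce phases in memory; it does not use the Hadoop API, and the function names and sample lines are illustrative.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield (word, 1)

def shuffle(pairs):
    """Shuffle: group all values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return (key, sum(values))

lines = ["Hadoop stores data", "Hadoop analyzes data"]
mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)  # {'hadoop': 2, 'stores': 1, 'data': 2, 'analyzes': 1}
```

In a cluster, each map call would run on the node holding that part of the input, and the shuffle would move the grouped pairs across the network to the reducers.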
Apache Hadoop installation and deployment modes
Standalone Mode
Pseudo-Distributed Mode
Fully Distributed Mode