1、Hadoop ecosystem overview
Hadoop is a distributed computing framework developed under the Apache Foundation. It lets users write distributed programs without understanding the underlying details of distribution, taking full advantage of the power of a cluster for high-speed computation and storage. Its notable features are reliability, efficiency, and scalability.
The core of Hadoop consists of YARN, HDFS, and MapReduce.
2、HDFS
Derived from Google's GFS paper, published in October 2003, HDFS is an open-source clone of GFS. It is the foundation of data storage management in Hadoop: a highly fault-tolerant system that can detect and respond to hardware failures.
HDFS simplifies the file consistency model with streaming data access, providing high-throughput access to application data; it suits applications with large data sets. It offers a write-once, read-many access model, and stores data as blocks replicated across different physical machines in the cluster.
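The idea of storing a file as fixed-size blocks (128 MB by default in Hadoop 2.x) can be sketched with plain coreutils; the file name and the tiny 128-byte "block size" below are illustrative, not HDFS itself:

```shell
# HDFS-style fixed-size splitting, demonstrated with coreutils:
# a 300-byte file at a 128-byte "block size" yields 3 blocks (128 + 128 + 44)
head -c 300 /dev/zero > demo.dat
split -b 128 -d demo.dat demo.blk.
ls demo.blk.*    # demo.blk.00  demo.blk.01  demo.blk.02
```

In real HDFS, each such block would additionally be replicated (3 copies by default) to different DataNodes.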
3、MapReduce (distributed computing framework)
Derived from Google's MapReduce paper, it is designed for computing over large volumes of data. It hides the details of distributed computation from the programmer by abstracting the computation into two phases: map and reduce.
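The map/shuffle/reduce flow can be sketched with an ordinary Unix pipeline (a word count, the classic MapReduce example); this is only an analogy, not Hadoop code:

```shell
# word count as a pipeline: tr plays the "map" role (emit one word per line),
# sort plays the "shuffle" (group identical keys together),
# and uniq -c plays the "reduce" (count each group)
printf 'apple banana apple\nbanana apple\n' > words.txt
tr -s ' ' '\n' < words.txt | sort | uniq -c
# →   3 apple
#     2 banana
```

In Hadoop, the same roles are filled by user-written map and reduce functions, with the framework handling the sort/shuffle between them across machines.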
4、HBase (column-oriented distributed database)
Derived from Google's Bigtable paper, HBase is a scalable, highly reliable, high-performance, column-oriented, dynamic-schema distributed database for structured data, built on top of HDFS.
5、ZooKeeper
Solves data management problems in a distributed environment: unified naming, state synchronization, cluster management, configuration synchronization, and so on.
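As a sketch of the configuration-synchronization use case: shared settings can be stored in znodes and read back by any node in the cluster. This assumes a ZooKeeper server is already running on localhost:2181; the znode paths and value are hypothetical:

```shell
# requires a running ZooKeeper server; paths and values are illustrative
zkCli.sh -server localhost:2181 create /myapp ""
zkCli.sh -server localhost:2181 create /myapp/config "db_host=10.0.0.5"
zkCli.sh -server localhost:2181 get /myapp/config
```

Clients can also set watches on such znodes, so every node is notified when the shared configuration changes.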
6、Hive
Originated at Facebook, Hive defines a SQL-like query language and converts the SQL into MapReduce jobs that run on Hadoop.
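As a sketch, a HiveQL aggregation like the one below (the `page_views` table and its columns are hypothetical) would be compiled by Hive into one or more MapReduce jobs:

```shell
# requires a configured Hive installation; the table is hypothetical
hive -e "SELECT user_id, COUNT(*) AS views
         FROM page_views
         GROUP BY user_id;"
```

The GROUP BY here maps naturally onto MapReduce: map emits (user_id, 1), and reduce sums the counts per key.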
7、Flume
A log collection tool.
8、YARN (distributed resource manager)
YARN is the next-generation MapReduce framework, created mainly to fix the poor scalability of the original Hadoop and its lack of support for multiple computing frameworks.
9、Spark
Spark provides a faster and more general data processing platform. Compared with Hadoop, Spark can run your programs in memory.
10、Kafka
A distributed message queue, used mainly for processing streaming data.
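A minimal produce/consume round trip might look like the following; it assumes a running Kafka broker on localhost:9092 and ZooKeeper on localhost:2181, with the flag style of the 0.9/0.10-era tooling shipped around CDH 5.x (the topic name is illustrative):

```shell
# requires running Kafka + ZooKeeper; topic name is illustrative
kafka-topics.sh --create --zookeeper localhost:2181 \
  --replication-factor 1 --partitions 1 --topic clickstream
echo 'page=/home user=42' | \
  kafka-console-producer.sh --broker-list localhost:9092 --topic clickstream
kafka-console-consumer.sh --zookeeper localhost:2181 \
  --topic clickstream --from-beginning
```

Producers append to the topic's log and consumers read it at their own pace, which is what makes Kafka a good buffer in front of stream processors.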
11、Hadoop pseudo-distributed deployment
At present there are three free versions of Hadoop, all from foreign vendors, namely:
1、Apache, the original version
2、CDH, the version chosen by the vast majority of domestic users
3、HDP
Here we choose the CDH version, hadoop-2.6.0-cdh5.8.2.tar.gz. The environment is CentOS 7.1, and JDK 1.7.0_55 or higher is required.
[root@hadoop1 ~]# useradd hadoop
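With the hadoop user created, the downloaded tarball can be unpacked; the /opt install path below is an assumption, adjust to taste:

```shell
# unpack the CDH tarball; /opt as the install root is an assumption
tar -zxf hadoop-2.6.0-cdh5.8.2.tar.gz -C /opt/
```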
My system's default Java environment is as follows:
Add the following environment variables:
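The exact lines are not reproduced above; as an assumed example for this kind of install, the hadoop user's ~/.bashrc might contain:

```shell
# paths are assumptions; point JAVA_HOME at the actual JDK location
export JAVA_HOME=/usr/java/jdk1.7.0_55
export HADOOP_HOME=/opt/hadoop-2.6.0-cdh5.8.2
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
```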
Perform the following authorization:
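The authorization step presumably gives the hadoop user ownership of the install tree; a sketch, assuming the distribution was unpacked under /opt:

```shell
# give the hadoop user ownership of the Hadoop tree (path is an assumption)
chown -R hadoop:hadoop /opt/hadoop-2.6.0-cdh5.8.2
```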
Here the hadoop user is used to manage and start the various Hadoop services.
Check whether the services have started:
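A typical pseudo-distributed startup and check, run as the hadoop user (this assumes core-site.xml, hdfs-site.xml, and the YARN configuration have already been filled in):

```shell
su - hadoop
hdfs namenode -format   # only before the very first start
start-dfs.sh
start-yarn.sh
jps   # should list NameNode, DataNode, SecondaryNameNode, ResourceManager, NodeManager
```

If any of those daemons is missing from the jps output, its log under $HADOOP_HOME/logs is the first place to look.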