Artificial intelligence and big data are booming and being applied ever more widely, and the daily work of front-end developers increasingly touches big-data development. It is therefore worth learning some big-data fundamentals.
Basic concepts
First, data storage: distributed file system (distributed storage)
Second, data computation: distributed computation
Basics
Big data requires a foundation in both Java and Linux
Learning Path
(1) Java and Linux fundamentals
(2) Hadoop: architecture, principles, programming
Stage 1: HDFS, MapReduce, HBase (a NoSQL database)
Stage 2: Data analysis engines -> Hive, Pig
Data ingestion engines -> Sqoop, Flume
Stage 3: HUE: web administration tool
ZooKeeper: implements HA for Hadoop
Oozie: workflow engine
(3) Spark
Stage 1: the Scala programming language
Stage 2: Spark Core -> memory-based data computation
Stage 3: Spark SQL -> SQL statements similar to MySQL's
Stage 4: Spark Streaming -> stream computing; think of a waterworks processing a continuous flow of water
(4) Apache Storm: stream computing, similar to Spark Streaming
NoSQL: Redis, a memory-based database
HDFS
A distributed file system solves the following problems:
• Disks are not big enough: add more disks; capacity is theoretically unlimited
• Data is not safe enough: redundancy; HDFS's default replication factor is 3, and replicas also improve read efficiency. Data is transferred in blocks: 64 MB in Hadoop 1.x, 128 MB in Hadoop 2.x
• Roles: the NameNode acts as the administrator (metadata), the DataNodes act as the disks (block storage)
![image.png](http://ata2-img.cn-hangzhou.img-pub.aliyun-inc.com/8ca9f78b244c7f991e73f71fd1e56421.png)
MapReduce
Basic programming model: split a big task into smaller tasks, then summarize the results
• An MR task: Job = Map + Reduce
The output of Map is the input of Reduce; the input and output of the whole MR job live in HDFS
MapReduce data-flow analysis:
• Map's output is Reduce's input; Reduce's input is a collection of Map outputs
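The split-then-summarize model can be sketched with plain shell pipes, outside Hadoop entirely: the "map" step emits one key per line, `sort` stands in for the shuffle (grouping identical keys), and `uniq -c` plays the reducer. A toy word count, purely for illustration:

```shell
# Map: emit one word per line; Shuffle: sort groups identical keys;
# Reduce: uniq -c counts each group; final sort ranks by count.
printf 'apple banana apple\ncherry banana apple\n' \
  | tr ' ' '\n' \
  | sort \
  | uniq -c \
  | sort -rn
# → 3 apple / 2 banana / 1 cherry
```

Hadoop runs the same three phases, but with the map and reduce steps distributed across machines and the intermediate data shuffled over the network.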
HBase
What is BigTable? Save all the data in a single table, with redundancy ---> benefit: improved query efficiency
• HBase is a NoSQL database based on the BigTable idea
• HBase is built on Hadoop HDFS
• Description of the HBase table structure
The core idea: trade space for efficiency
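To make the "one big table, column families instead of joins" idea concrete, here is what a typical HBase shell session looks like (this assumes a running HBase instance; the table and column names are made up for illustration):

```
create 'users', 'info'                      # table with one column family
put 'users', 'row1', 'info:name', 'Tom'     # cell = row key + column + value
put 'users', 'row1', 'info:age', '20'
get 'users', 'row1'                         # fetch all cells of one row
scan 'users'                                # full table scan
```

Note there is no schema for the columns inside a family and no joins; related data is denormalized into the same row, which is exactly the space-for-efficiency trade-off described above.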
Building a Hadoop environment
Preparing the Environment
A Linux environment, the JDK, and the Hadoop 3.0.0 download: http://mirrors.shu.edu.cn/apache/hadoop/common/hadoop-3.0.0/hadoop-3.0.0-src.tar.gz
Installation
1. Install the JDK and configure the environment variables
vim /etc/profile and append at the end: ![image.png](http://ata2-img.cn-hangzhou.img-pub.aliyun-inc.com/a9bf2e19410f9b3d38c8b0ca64b2f264.png)
2. Extract hadoop-3.0.0.tar.gz and configure the environment variables
tar -zxvf hadoop-3.0.0.tar.gz -C /usr/local/ && mv /usr/local/hadoop-3.0.0/ /usr/local/hadoop
vim /etc/profile and append at the end
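The screenshot of the /etc/profile additions is not reproduced here; a typical set of entries looks like the following (the JAVA_HOME path is an assumption — point it at wherever your JDK actually lives):

```shell
# Append to /etc/profile, then run: source /etc/profile
export JAVA_HOME=/usr/local/jdk          # assumed JDK install path
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
```

Adding both `bin` and `sbin` of Hadoop to PATH lets you run `hdfs`/`hadoop` as well as the `start-all.sh` script used below.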
Configuration
Hadoop has three installation modes:
Local mode:
• 1 host
• no HDFS; can only be used to test MapReduce programs
Pseudo-distributed mode:
• 1 host
• simulates all the features of a distributed Hadoop environment on a single machine
• (1) HDFS: master node: NameNode; data nodes: DataNodes
• (2) YARN: the container in which MapReduce programs run
• master node: ResourceManager
• slave node: NodeManager
Fully distributed mode:
• at least three hosts
Taking pseudo-distributed mode as the example configuration:
Modify hdfs-site.xml: replication factor 1, permission checking disabled
Modify core-site.xml
Modify mapred-site.xml
Modify yarn-site.xml
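The concrete edits are not shown above; a minimal pseudo-distributed configuration (all four files live under $HADOOP_HOME/etc/hadoop/; the host IP and tmp path below are assumptions matching this walkthrough) typically looks like:

```xml
<!-- hdfs-site.xml: replication 1, permission checks off -->
<configuration>
  <property><name>dfs.replication</name><value>1</value></property>
  <property><name>dfs.permissions.enabled</name><value>false</value></property>
</configuration>

<!-- core-site.xml: NameNode address and working directory -->
<configuration>
  <property><name>fs.defaultFS</name><value>hdfs://192.168.56.102:9000</value></property>
  <property><name>hadoop.tmp.dir</name><value>/usr/local/hadoop/tmp</value></property>
</configuration>

<!-- mapred-site.xml: run MapReduce on YARN -->
<configuration>
  <property><name>mapreduce.framework.name</name><value>yarn</value></property>
</configuration>

<!-- yarn-site.xml: ResourceManager host and shuffle service -->
<configuration>
  <property><name>yarn.resourcemanager.hostname</name><value>192.168.56.102</value></property>
  <property><name>yarn.nodemanager.aux-services</name><value>mapreduce_shuffle</value></property>
</configuration>
```

`hadoop.tmp.dir` is what makes the NameNode format its storage under /usr/local/hadoop/tmp, the directory mentioned in the format step below.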
Format the NameNode
hdfs namenode -format
If the output contains "common.Storage: Storage directory /usr/local/hadoop/tmp/dfs/name has been successfully formatted", the format succeeded
Start up
start-all.sh
(*) HDFS: stores the data
(*) YARN: manages resources and runs the MapReduce programs
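After start-all.sh you can check with jps that the expected daemons are up (daemon names as in a standard pseudo-distributed install):

```shell
start-all.sh   # starts both HDFS and YARN daemons
jps            # should list: NameNode, DataNode, SecondaryNameNode,
               # ResourceManager, NodeManager (plus Jps itself)
```

If one of the daemons is missing, check its log under $HADOOP_HOME/logs before going further.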
Access
HDFS: http://192.168.56.102:50070 (note: in Hadoop 3.x the NameNode web UI defaults to port 9870 rather than 50070)
YARN: http://192.168.56.102:8088
Open the HDFS management interface and the YARN resource-management interface
Basic operation:
HDFS related commands
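The command list itself is not reproduced above; these are some of the most commonly used HDFS commands (the paths are just examples — they require the cluster started above):

```shell
hdfs dfs -mkdir -p /input          # create a directory in HDFS
hdfs dfs -put data.txt /input      # upload a local file
hdfs dfs -ls /input                # list directory contents
hdfs dfs -cat /input/data.txt      # print a file's contents
hdfs dfs -get /input/data.txt .    # download to the local disk
hdfs dfs -rm -r /input             # remove recursively
```

The `hdfs dfs` subcommands deliberately mirror the familiar Linux file commands (ls, cat, rm, mkdir), which makes them easy to pick up.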
MapReduce example
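A simple way to try MapReduce is the word-count example jar that ships with Hadoop (this assumes input files have already been uploaded to an HDFS directory /input; the jar path is as laid out in the Hadoop 3.0.0 distribution):

```shell
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.0.0.jar \
  wordcount /input /output
hdfs dfs -cat /output/part-r-00000   # view the resulting word counts
```

Note that the /output directory must not exist before the job runs; MapReduce refuses to overwrite an existing output directory.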
Result:
This simple example shows a MapReduce job executing successfully
Think
Hadoop is Java-based, while my daily front-end work uses PHP, so tracking down errors was still quite hard. Outside of work it is worth building up knowledge of other languages: the programming languages we develop in are tools for learning and building, and they should not become a bottleneck that limits our technical growth.