Training Notes 7.11

- 1. Motto
- 2. VMware steps to install the operating system
- 3. Basic network operations of the Linux operating system
- 4. Mutual login between multiple hosts (host-IP mapping, SSH password-free login)
- 5. In the era of big data, there are two main problems
- 5.1 The storage problem of massive data
- 5.2 The computation problem of massive data
- 5.3 Google's three papers
- 6. Hadoop technology - born from Google's three papers
- 6.1 Three core components inside Hadoop
- 6.2 An ecosystem born of Hadoop technology
- 6.3 The course mainly revolves around the Apache Hadoop distribution
- 6.4 Four installation modes of Hadoop
- 6.5 Pseudo-distributed installation process of Hadoop
- 7. Spark technology
- 8. Flink technology: computing framework
- 9. Storm technology: computing framework
1. Motto
You tell my story, I leave my words, you decide how much I am worth, and I choose where I go.
2. VMware steps to install the operating system
2.1 A virtual machine packages some of the Windows host's resources into a container
A virtual machine's network connection comes in three types:
2.1.1 Bridged network
The operating system installed in the VM can be accessed by other hosts on the same LAN.
The bridged network and our LAN are on the same network segment.
2.1.2 NAT network
The locally installed operating system can only be accessed by our host machine; other hosts on the same LAN cannot access it.
The network segment used in NAT mode is different from the LAN's network segment; it is provided by the vmnet8 virtual network card.
2.1.3 Host-only networking
The VM can communicate only with the host machine; no other host can access it.
2.2 Attach the image (ISO) to the virtual machine; after powering it on, the operating system can be installed
3. Basic network operations of the Linux operating system
3.1 Linux commands related to the network
ip addr : view the IP address of the Linux system
ping domain/IP : check whether a certain network address is reachable
All network configuration of Linux lives in one configuration file: /etc/sysconfig/network-scripts/ifcfg-ens33
ONBOOT="yes" : enable the current network card at boot
IPADDR="192.168.XX.XXX" : configure the static IP
GATEWAY="192.168.XX.2" : configure the gateway
DNS1="114.114.114.114" : configure the domain name resolution (DNS) server
3.2 After the network modification is complete, the Linux network card service needs to be restarted
systemctl restart network
3.3 Network Services
There is also a network service on Linux called NetworkManager. We don't need this service, but it is on by default, and leaving it running will interfere with our subsequent operations. Therefore it should be stopped and permanently disabled:
systemctl stop NetworkManager
systemctl disable NetworkManager
3.4 Firewall
There is also a service on Linux called firewalld (the firewall service). It is best to turn the firewall off so that the subsequent big-data software installation goes more smoothly.
systemctl stop firewalld
systemctl disable firewalld
3.5 Each node server running Linux generally needs a host name so the host can be uniquely identified in the cluster.
vim /etc/hostname
After the hostname change is complete, the virtual machine needs to be restarted: reboot
shutdown now : shut down the virtual machine
4. Mutual login between multiple hosts (host ip mapping, SSH password-free login)
When multiple hosts log in to each other, the ssh IP-address command is used, but this has two problems:
- A cluster may have many nodes, each with its own IP address, and IPs are hard to remember
- ssh asks for a password every time we log in to another node in the cluster
4.1 Hostname-to-IP mapping configuration: to put it bluntly, this is just domain name resolution configuration
To solve the problem that the IP addresses of multiple cluster nodes are hard to remember: when installing the operating systems we deliberately gave each node a unique hostname, so we can log in with the hostname and have the IP address resolved automatically.
Domain name resolution file: /etc/hosts
Each line has the format: IP hostname
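As a concrete sketch, a three-node cluster's /etc/hosts might look like this (the IP addresses and hostnames below are made-up placeholders; every node in the cluster gets the same mapping):

```
192.168.88.101 node1
192.168.88.102 node2
192.168.88.103 node3
```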
4.2 SSH password-free login configuration
The principle of password-free login is simple: generate a public key and a private key file on the current node, then distribute a copy of the public key file to the other nodes; after that the current node can connect to those nodes without a password.
- Generate the public and private key files
  - The key files live in the directory: ~/.ssh
  - Generate them with: ssh-keygen -t rsa
- Send the public key file to every node that the current node should log in to without a password:
  ssh-copy-id hostname/IP
5. In the era of big data, there are two main problems
5.1 The storage problem of massive data
5.2 The computation problem of massive data
5.3 Google's three papers
5.3.1 GFS (the Google File System)
5.3.2 MapReduce
5.3.3 BigTable
6. Hadoop technology - born from Google's three papers
Hadoop solves both core problems encountered in big data
6.1 Three core components inside Hadoop
6.1.1 HDFS: distributed file storage system
- Uses distributed thinking to solve the storage problem of massive data
- Composed of three core components:
  - NameNode: master node
    - Stores the metadata (directory structure) of the entire HDFS cluster
    - Manages the entire HDFS cluster
  - DataNode: data/slave node; stores the data, keeping files in the form of Block chunks
  - SecondaryNameNode: the "little secretary" that helps the NameNode merge the log data (metadata)
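To make the Block idea concrete, here is a small shell sketch that splits a hypothetical 300 MB file into chunks using the 128 MB default block size of Hadoop 2.x/3.x (the file size is made up for illustration):

```shell
# Split a 300 MB file into 128 MB HDFS-style blocks and list the chunk sizes.
BLOCK=$((128 * 1024 * 1024))   # default HDFS block size
FILE=$((300 * 1024 * 1024))    # hypothetical file size
RESULT=""
while [ "$FILE" -gt 0 ]; do
  if [ "$FILE" -ge "$BLOCK" ]; then CHUNK=$BLOCK; else CHUNK=$FILE; fi
  RESULT="$RESULT$((CHUNK / 1024 / 1024))MB "
  FILE=$((FILE - CHUNK))
done
echo "$RESULT"   # prints: 128MB 128MB 44MB
```

So the file's DataNode footprint is two full blocks plus one partial block; a partial block only occupies its actual size on disk.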
6.1.2 YARN: distributed resource scheduling system
Master-slave architecture software composed of two core components:
- ResourceManager: master node; manages the entire YARN cluster and is responsible for overall resource allocation
- NodeManager: slave node; actually provides the resources
6.1.3 MapReduce: distributed offline computing framework
Uses distributed thinking to solve the computation problem of massive data
6.1.4 Hadoop Common: the common utilities that support the other Hadoop modules
6.2 An ecosystem born of Hadoop technology
- Data collection and storage - Flume, Kafka, HBase, HDFS
- Data cleaning and preprocessing - MapReduce, Spark
- Data statistical analysis - Hive, Pig
- Data migration - Sqoop
- Data visualization - ECharts
- Distributed coordination - ZooKeeper
6.3 The course mainly revolves around the Apache Hadoop distribution
- Official website: https://hadoop.apache.org
- Apache Hadoop distribution versions:
  - hadoop1.x
  - hadoop2.x
  - hadoop3.x
    - hadoop3.1.4
6.4 Four installation modes of Hadoop
HDFS and YARN inside Hadoop are each a distributed system, and both are master-slave architecture software.
First: local installation mode - only MapReduce can be used; HDFS and YARN cannot
Second: pseudo-distributed installation mode - the master and slave components of HDFS and YARN are all installed on the same node
Third: fully distributed installation mode - the master and slave components of HDFS and YARN are installed on different nodes
(Both the second and third modes suffer from the single-point-of-failure problem.)
Fourth: HA (high-availability) installation mode - like fully distributed, but two or three extra master nodes are installed; only one master node serves requests at any given time
6.5 Pseudo-distributed installation process of Hadoop
- JDK needs to be installed on Linux first, since Hadoop's bottom layer is developed in Java
- Configure the current host's hostname mapping and SSH password-free login
- There are two main places to configure environment variables:
  1. /etc/profile : system environment variables
  2. ~/.bash_profile : user environment variables
  3. After the environment variable configuration is complete, the configuration file must be reloaded with the source command
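A minimal sketch of what the environment-variable configuration might look like. The install paths /usr/local/jdk1.8.0 and /usr/local/hadoop-3.1.4 are assumptions; adjust them to wherever the JDK and Hadoop were actually unpacked:

```shell
# Append to /etc/profile (system-wide) or ~/.bash_profile (per-user).
export JAVA_HOME=/usr/local/jdk1.8.0        # assumed JDK install directory
export HADOOP_HOME=/usr/local/hadoop-3.1.4  # assumed Hadoop install directory
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
```

Afterwards run `source /etc/profile` (or `source ~/.bash_profile`) so the current shell picks up the new variables.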
- Install a local-mode version of Hadoop
  - upload
  - decompress
  - configure environment variables
- Install the pseudo-distributed version of Hadoop - just modify the various Hadoop configuration files:
  - hadoop-env.sh : configures the path to Java
  - core-site.xml : configures common configuration items of HDFS and YARN
    - the HDFS NameNode address
    - the path where the HDFS cluster stores its files
  - hdfs-site.xml : configures HDFS components - the NameNode's web access path, the DataNode's web access address, the SecondaryNameNode's web access path, etc.
  - mapred-env.sh : configures the paths of the software (Java, YARN) associated with running MR programs
  - mapred-site.xml : configures the MR runtime environment so that MR programs run on YARN
  - yarn-env.sh : configures the paths of components associated with YARN
  - yarn-site.xml : configures YARN components - the ResourceManager's and NodeManager's web access paths, etc.
  - workers/slaves : configures the slave nodes of HDFS and YARN - on which nodes the DataNode and NodeManager are to be installed
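As an illustration, a minimal pseudo-distributed core-site.xml might contain just the two items mentioned above. The hostname node1 and the data directory are assumptions; fs.defaultFS and hadoop.tmp.dir are the standard Hadoop property names:

```xml
<!-- core-site.xml: minimal pseudo-distributed sketch -->
<configuration>
  <property>
    <!-- HDFS NameNode address; "node1" is an assumed hostname -->
    <name>fs.defaultFS</name>
    <value>hdfs://node1:9000</value>
  </property>
  <property>
    <!-- base directory where the HDFS cluster stores its files; assumed path -->
    <name>hadoop.tmp.dir</name>
    <value>/opt/hadoop/data</value>
  </property>
</configuration>
```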
- Format the HDFS cluster:
  hdfs namenode -format
- Start HDFS and YARN
  - HDFS
    start-dfs.sh
    stop-dfs.sh
    - Provides a web site that monitors the status of the entire HDFS cluster: http://ip:9870 (hadoop3.x) or http://ip:50070 (hadoop2.x)
  - YARN
    start-yarn.sh
    stop-yarn.sh
    - Provides a web site that monitors the status of the entire YARN cluster: http://ip:8088
7. Spark technology
Solves the computation problem of massive data