Training Notes 7.11

1. Motto

You tell my story, I leave my words, you decide how much I am worth, and I choose where I go.

2. VMware: steps to install the operating system

2.1 On Windows, some of the host's resources need to be packaged into a container (the virtual machine)

A virtual machine can connect to the network in one of three modes:

2.1.1 Bridged network

The guest operating system installed on this machine can be accessed by other hosts on the same LAN.

The bridged network and our LAN are on the same network segment.

2.1.2 NAT network

The locally installed guest operating system can only be accessed by our host machine; other hosts on the same LAN cannot reach it.

In NAT mode, the network segment used by the guest is not the same segment as the LAN; it is provided by the VMnet8 virtual network adapter.

2.1.3 Host-only networking

The guest can only communicate with the host machine itself; no other host can access it, and it has no access to the external network.

2.2 Attach the OS image (ISO) to the virtual machine's CD/DVD drive; after powering the machine on, you can install the operating system.

3. Basic network operations of the Linux operating system

3.1 Network-related Linux commands

ip addr: view the IP address of the Linux system

ping <domain name or IP address>: check whether a given host or network is reachable

All of Linux's network configuration lives in one configuration file: /etc/sysconfig/network-scripts/ifcfg-ens33

ONBOOT="yes": enable the current network card at boot

IPADDR="192.168.XX.XXX": configure a static IP address

GATEWAY="192.168.XX.2": configure the gateway

DNS1="114.114.114.114": configure the domain name resolution (DNS) server
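
Put together, a static-IP ifcfg-ens33 might look like the sketch below; the concrete addresses are placeholders, and the TYPE/NAME/DEVICE/BOOTPROTO/NETMASK lines are assumed extras beyond the four options listed above:

    # sketch of /etc/sysconfig/network-scripts/ifcfg-ens33 (placeholder values)
    TYPE="Ethernet"
    NAME="ens33"
    DEVICE="ens33"
    ONBOOT="yes"            # bring the card up at boot
    BOOTPROTO="static"      # use the static address below instead of DHCP
    IPADDR="192.168.88.101"
    NETMASK="255.255.255.0"
    GATEWAY="192.168.88.2"  # the NAT gateway provided by VMnet8
    DNS1="114.114.114.114"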

3.2 After the network configuration is modified, the Linux network service needs to be restarted

systemctl restart network

3.3 Network Services

There is another network service on Linux called NetworkManager. We do not need this service, but it is on by default, and leaving it running will interfere with our subsequent operations. Therefore this service must be stopped and permanently disabled.
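
Stopping and disabling it works the same way as for any systemd service:

    systemctl stop NetworkManager
    systemctl disable NetworkManager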

3.4 Firewall

There is also a service on Linux called firewalld (the firewall service). It is best to turn the firewall off, so that the later big data software installations go more smoothly.

systemctl stop firewalld

systemctl disable firewalld

3.5 For Linux node servers we generally assign a hostname, so that every host in the cluster can be uniquely identified.

vim /etc/hostname

After the hostname change is complete, the virtual machine needs to be restarted: reboot

shutdown now: shut down the virtual machine
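
A minimal sketch of the whole sequence, assuming we want to name this node node1 (a hypothetical name):

    echo "node1" > /etc/hostname   # the file holds the hostname as its single line
    reboot                         # restart so the new hostname takes effect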

4. Mutual login between multiple hosts (host-IP mapping, SSH password-free login)

When multiple hosts need to log in to each other, you use the ssh <ip-address> command, but this has two problems:

  1. A cluster may have many nodes, each with its own IP address, and IP addresses are hard to remember

  2. ssh asks for a password every time you log in to another node in the cluster

4.1 Hostname-to-IP mapping configuration: to put it bluntly, this is just configuring domain name resolution

This solves the problem that the IP addresses of the cluster's many nodes are hard to remember. When installing the operating systems we deliberately gave each node a unique hostname, so when logging in we can use the hostname and have it automatically resolved to the IP address.

Domain name resolution file:

/etc/hosts

Each line maps one IP to a hostname: <ip> <hostname>
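
A sketch of what the file might contain for a three-node cluster (all addresses and names are placeholders); the same entries should be replicated on every node:

    # /etc/hosts
    192.168.88.101  node1
    192.168.88.102  node2
    192.168.88.103  node3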

4.2 SSH password-free login configuration

The principle of password-free login is simple: generate a public key and private key file on the current node, then distribute a copy of the public key file to the other nodes; after that, the current node no longer needs a password to connect to those nodes.

  1. Generate the public and private key files

    1. Change to the directory that holds the key files: ~/.ssh
    2. Generate the key pair: ssh-keygen -t rsa
  2. Send the public key file to every other node that the current node should log in to without a password

ssh-copy-id <hostname or IP>
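
End to end the steps look like this; node2 is a placeholder for any target node (ideally one already mapped in /etc/hosts):

    cd ~/.ssh                 # the key files live here
    ssh-keygen -t rsa         # press Enter through the prompts
    ssh-copy-id node2         # appends our public key to node2's authorized_keys
    ssh node2                 # should now log in without asking for a password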

5. In the era of big data, there are two main problems

5.1 The storage problem of massive data

5.2 The computation problem of massive data

5.3 Google three papers

5.3.1 GFS (the Google File System)

5.3.2 Map-Reduce

5.3.3 BigTable

6. Hadoop technology - born from Google's three papers

It solves both of the two core problems encountered in big data.

6.1 Three core components inside Hadoop

6.1.1 HDFS: Distributed File Storage System

  1. Distributed thinking solves the storage problem of massive data

  2. Composed of three core components

    1. NameNode: master node
      1. Stores the metadata (directory structure) of the entire HDFS cluster
      2. Manages the entire HDFS cluster
    2. DataNode: data node / slave node; stores the actual file data in the form of blocks
    3. SecondaryNameNode: the NameNode's "little secretary" - helps the NameNode merge its log data (metadata)

6.1.2 YARN: Distributed Resource Scheduling System

Composed of two core components

  1. ResourceManager: master node

Manages the entire YARN cluster and is responsible for overall resource allocation

  2. NodeManager: slave node

The component that actually provides the resources on each node

Like HDFS, YARN is master-slave architecture software

6.1.3 MapReduce: Distributed Offline Computing Framework

Distributed thinking solves the computing problem of massive data

6.1.4 Hadoop Common: the shared utilities and libraries that support the other modules

6.2 An Ecosystem Born of Hadoop Technology

  1. Data collection and storage - Flume, Kafka, HBase, HDFS
  2. Data cleaning and preprocessing - MapReduce, Spark
  3. Data statistics and analysis - Hive, Pig
  4. Data migration - Sqoop
  5. Data visualization - ECharts
  6. Distributed coordination - ZooKeeper

6.3 The course is mainly built around the Apache Hadoop distribution

  1. Official website: https://hadoop.apache.org

  2. Apache Hadoop distribution

    1. hadoop1.x
    2. hadoop2.x
    3. hadoop3.x
    4. hadoop3.1.4

6.4 Four modes of hadoop installation

HDFS and YARN inside Hadoop are each a distributed system, and both are master-slave architecture software.

The first type: local installation mode - only MapReduce can be used; HDFS and YARN cannot be used

The second type: pseudo-distributed installation mode - the master and slave components of HDFS and YARN are all installed on the same node

The third type: fully distributed installation mode - the master and slave components of HDFS and YARN are installed on different nodes

The second and third types suffer from the single-point-of-failure problem: if the lone master node goes down, the whole cluster becomes unavailable

The fourth type: HA (high availability) installation mode - the master and slave components of HDFS and YARN are installed on different nodes, and two or three additional master nodes are installed as standbys, but only one master node serves requests at any given time

6.5 Pseudo-distributed installation process of Hadoop

  1. JDK needs to be installed on Linux first, since Hadoop's underlying layer is developed in Java

  2. Configure the current host's hostname mapping and SSH password-free login

There are two main places to configure environment variables:

1. /etc/profile: system-wide environment variables

2. ~/.bash_profile: per-user environment variables

3. After the environment variable configuration is complete, the file must be reloaded with the source command for the variables to take effect (see the sketch below)
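
A minimal sketch, assuming the JDK and Hadoop were unpacked under /opt (both paths are placeholders):

    # appended to /etc/profile (system-wide) or ~/.bash_profile (per user)
    export JAVA_HOME=/opt/jdk1.8.0_371        # placeholder JDK path
    export HADOOP_HOME=/opt/hadoop-3.1.4      # placeholder Hadoop path
    export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

    # reload so the variables take effect in the current shell
    source /etc/profile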

  3. Install a local-mode version of Hadoop

    1. Upload the archive
    2. Decompress it
    3. Configure the environment variables
  4. Install the pseudo-distributed version of Hadoop - just modify the various Hadoop configuration files (a core-site.xml sample is sketched at the end of this section)

    1. hadoop-env.sh: configures the path to Java
    2. core-site.xml: configures common configuration items shared by HDFS and YARN
      1. the HDFS NameNode address
      2. the path where HDFS cluster files are stored
    3. hdfs-site.xml: configures the HDFS components - the NameNode's web access address, the DataNode's web access address, the SecondaryNameNode's web access address, and so on
    4. mapred-env.sh: configures the paths of the software associated with a running MR program (Java, YARN)
    5. mapred-site.xml: configures the MR runtime environment so that MR programs run on YARN
    6. yarn-env.sh: configures the paths of the components associated with YARN
    7. yarn-site.xml: configures the YARN components - the RM and NM web access addresses, and so on
    8. workers/slaves: configures the slave nodes of HDFS and YARN, i.e. on which nodes the DataNodes and NodeManagers are installed
  5. Format the HDFS cluster (run once, before the first start)

    hdfs namenode -format

  6. Start HDFS and YARN

    1. HDFS
      1. start-dfs.sh
      2. stop-dfs.sh
      3. Provides a web UI for monitoring the status of the entire HDFS cluster: http://ip:9870 (hadoop 3.x) or http://ip:50070 (hadoop 2.x)
    2. YARN
      1. start-yarn.sh
      2. stop-yarn.sh
      3. Provides a web UI for monitoring the status of the entire YARN cluster: http://ip:8088
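
As promised above, a minimal core-site.xml sketch, written here as a shell heredoc; the hostname node1, the port, and the data directory are placeholder choices rather than values given in these notes:

    # hedged sketch: writing a minimal core-site.xml (placeholder values)
    cat > $HADOOP_HOME/etc/hadoop/core-site.xml <<'EOF'
    <configuration>
      <!-- the HDFS NameNode address -->
      <property>
        <name>fs.defaultFS</name>
        <value>hdfs://node1:9000</value>
      </property>
      <!-- the path where HDFS cluster files are stored -->
      <property>
        <name>hadoop.tmp.dir</name>
        <value>/opt/hadoop-3.1.4/data</value>
      </property>
    </configuration>
    EOF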
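
After both start scripts have run, the JDK's jps tool is a quick way to confirm that all five daemons of a pseudo-distributed install came up:

    jps
    # expected processes (besides Jps itself):
    #   NameNode, DataNode, SecondaryNameNode   <- HDFS
    #   ResourceManager, NodeManager            <- YARN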

7. Spark technology

Solves the computation problem of massive data

8. Flink Technology: Computing Framework

9. Storm Technology: Computing Framework
