Installing Hadoop in pseudo-distributed mode

Introduction to Hadoop

Hadoop, developed by the Apache Foundation, is a distributed system infrastructure: a software framework for the distributed processing and storage of large amounts of data on clusters. Users can easily develop and run applications that handle massive amounts of data on a Hadoop cluster. Hadoop offers high reliability, scalability, efficiency, and fault tolerance. The core of the framework consists of HDFS and MapReduce: HDFS provides storage for massive amounts of data, while MapReduce provides computation over that data. In addition, the Hadoop ecosystem includes Hive, HBase, ZooKeeper, Pig, Avro, Sqoop, Flume, Mahout, and other projects.

Hadoop can run in three modes: local (standalone) mode, pseudo-distributed mode, and fully distributed mode.

  1. Local (standalone) mode

    In this mode Hadoop runs on a single machine without the HDFS distributed file system; it reads and writes the local operating system's file system directly. No daemons exist in local mode; all processes run inside a single JVM. Standalone mode is suitable for running MapReduce programs during the development phase, and it is the least used of the three modes.

  2. Pseudo-distributed mode

    This mode simulates a fully distributed Hadoop deployment on a single server. It is not truly distributed; distribution is simulated with threads on one machine. In this mode, all daemons (NameNode, DataNode, ResourceManager, NodeManager, SecondaryNameNode) run on the same machine. Since a pseudo-distributed Hadoop cluster has only one node, HDFS blocks are limited to a single replica, and the secondary master and slave roles also run on the local host. Although this mode is not truly distributed, program execution logic is completely analogous to a fully distributed cluster, so it is commonly used by developers to test programs. This lab builds a pseudo-distributed deployment on one server.

  3. Fully distributed mode

    This mode is commonly used in production environments: N hosts make up a Hadoop cluster, with Hadoop daemons running on each host. One host runs the NameNode, other hosts run DataNodes, and another host runs the SecondaryNameNode. In a fully distributed environment, the master node and slave nodes are separated.

Lab environment

Ubuntu Linux 4.14

Installation steps

Create a new user and user group

This step is optional. It is recommended to create a new user and user group, since subsequent operations are mostly performed as this user. However, you may also operate as your current non-root user.

  1. Create a user named zhangyu and a home directory for this user; this also creates a default user group with the same name, zhangyu.

    sudo useradd -d /home/zhangyu -m zhangyu
  2. Set a password for the user zhangyu

    sudo passwd zhangyu
  3. Grant the zhangyu user sudo privileges, elevating it to superuser level; a quick verification follows after these steps.

    sudo usermod -aG sudo zhangyu
  4. Switch to the zhangyu user to complete the subsequent operations.

    su zhangyu
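
To confirm the account and its sudo membership, here is a quick optional check (not part of the original steps):

id zhangyu

The sudo group should appear in the list of the user's groups.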

Configuring SSH password-free login

To enable password-free SSH login, run the following command on the server to generate a public/private key pair.

Note: This command prompts several times, mainly for the key file location and an SSH passphrase. Just accept the defaults by pressing Enter at each prompt.

ssh-keygen -t rsa

The SSH public and private keys have now been generated and are stored in the ~/.ssh directory. Switch to the ~/.ssh directory; there, create an empty file named authorized_keys, then append the contents of the public key file id_rsa.pub to authorized_keys.

cd ~/.ssh
touch ~/.ssh/authorized_keys
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

Run ssh localhost to test whether SSH is configured correctly (on the first connection, ssh will ask whether to continue connecting; answer yes).

ssh localhost

If the console prints a welcome message, password-free SSH login is configured successfully; subsequent runs of ssh localhost will no longer ask for a password.
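
If ssh still prompts for a password, overly permissive file modes on ~/.ssh are a common cause, since sshd refuses keys whose files are writable by others; tightening the permissions usually resolves it (a general remedy, not part of the original steps):

chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys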

Hadoop installation

Preparation before installation

First, create two directories: /apps and /data. The /apps directory is used to store the installed frameworks, and the /data directory is used to store temporary data, HDFS data, and program code or scripts.

sudo mkdir /apps
sudo mkdir /data

Then change the owner and owning group of the /apps and /data directories to the zhangyu user and zhangyu group.

sudo chown -R zhangyu:zhangyu /apps
sudo chown -R zhangyu:zhangyu /data

Run the ls -l command in the root directory to check whether the owner and group of the /apps and /data directories have changed to zhangyu:zhangyu; if so, the ownership change succeeded.
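
The relevant lines should resemble the following (sizes and timestamps are illustrative):

ls -l / | grep -E 'apps|data'
# drwxr-xr-x  2 zhangyu zhangyu 4096 Sep 25 10:00 apps
# drwxr-xr-x  2 zhangyu zhangyu 4096 Sep 25 10:00 data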

Configure the Java and Hadoop environments

  1. Create the /data/hadoop1 directory, used to store the required installation packages, such as the JDK package jdk-7u75-linux-x64.tar.gz and the Hadoop package hadoop-2.6.0-cdh5.4.5.tar.gz.

    mkdir -p /data/hadoop1
  2. Change to the /data/hadoop1 directory and use the wget command to download the required packages, jdk-7u75-linux-x64.tar.gz and hadoop-2.6.0-cdh5.4.5.tar.gz.

    cd /data/hadoop1
    wget http://59.64.78.41:60000/allfiles/hadoop1/jdk-7u75-linux-x64.tar.gz  
    wget http://59.64.78.41:60000/allfiles/hadoop1/hadoop-2.6.0-cdh5.4.5.tar.gz
  3. Install the JDK. Extract jdk-7u75-linux-x64.tar.gz from /data/hadoop1 into the /apps directory. (Here tar -xzvf decompresses the archive, and -C places the extracted files into /apps.)

    tar -xzvf /data/hadoop1/jdk-7u75-linux-x64.tar.gz -C /apps

    Switch to the /apps directory and rename the jdk1.7.0_75 directory to java.

    mv /apps/jdk1.7.0_75/  /apps/java

    Next, modify the environment variables. We could modify either the system environment variables or the user environment variables; here we modify the user ones. Open the file that stores them (~/.bashrc) and append the Java environment variables on new lines at the end.

    # java
    export JAVA_HOME=/apps/java
    export PATH=$JAVA_HOME/bin:$PATH

    Make the environment variables take effect immediately:

    source ~/.bashrc

    Running source makes the Java environment variables take effect. Afterwards, you can run java to test that the variables are configured correctly.
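
    A quick check (the first line of output should match this JDK release; build details may differ):

    java -version
    # java version "1.7.0_75"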

  4. Install Hadoop. Switch to the /data/hadoop1 directory and extract hadoop-2.6.0-cdh5.4.5.tar.gz into the /apps directory.

    cd /data/hadoop1  
    tar -xzvf /data/hadoop1/hadoop-2.6.0-cdh5.4.5.tar.gz -C /apps/

    For convenience, rename hadoop-2.6.0-cdh5.4.5 to hadoop.

    mv /apps/hadoop-2.6.0-cdh5.4.5/ /apps/hadoop

    Modify the user environment variables to add Hadoop to the PATH. Open the user environment variable file (~/.bashrc) and append the Hadoop environment variables to it.

    vim ~/.bashrc
    # append the following lines to .bashrc
    #hadoop  
    export HADOOP_HOME=/apps/hadoop  
    export PATH=$HADOOP_HOME/bin:$PATH

    Make the environment variables take effect immediately:

    source ~/.bashrc

    Verify the Hadoop environment variable configuration; if the Hadoop version information is printed, the configuration is correct.

    hadoop version

Modify Hadoop's configuration

Edit the file hadoop-env.sh

Switch to the Hadoop configuration directory and edit the hadoop-env.sh file

cd /apps/hadoop/etc/hadoop
vim /apps/hadoop/etc/hadoop/hadoop-env.sh

Append the following JAVA_HOME setting to the hadoop-env.sh file.

export JAVA_HOME=/apps/java
Edit core-site.xml file

Open the core-site.xml configuration file

vim /apps/hadoop/etc/hadoop/core-site.xml

Add the following configuration between the <configuration> and </configuration> tags

<property>  
    <name>hadoop.tmp.dir</name>  
    <value>/data/tmp/hadoop/tmp</value>  
</property>  
<property>  
    <name>fs.defaultFS</name>  
    <value>hdfs://0.0.0.0:9000</value>  
</property>

Item Description:

  • hadoop.tmp.dir: the storage location for Hadoop's temporary files. The directory /data/tmp/hadoop/tmp specified here must be created in advance, as shown below
  • fs.defaultFS: the address of the Hadoop HDFS file system
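
Create the temporary directory now:

mkdir -p /data/tmp/hadoop/tmp
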
Edit hdfs-site.xml file

Open the hdfs-site.xml configuration file

vim /apps/hadoop/etc/hadoop/hdfs-site.xml

Add the following configuration between the <configuration> and </configuration> tags

<property>
    <name>dfs.namenode.name.dir</name>
    <value>/data/tmp/hadoop/hdfs/name</value>
</property>
<property>
    <name>dfs.datanode.data.dir</name>
    <value>/data/tmp/hadoop/hdfs/data</value>
</property>
<property>
    <name>dfs.replication</name>
    <value>1</value>
</property>
<property>
    <name>dfs.permissions.enabled</name>
    <value>false</value>
</property>

Item Description:

  • dfs.namenode.name.dir: where the NameNode stores its metadata
  • dfs.datanode.data.dir: where the DataNode stores data blocks
  • dfs.replication: the number of replicas of each block. Since we are using only one node, it is set to 1; if it were set to 2, running jobs would complain, because there is no second node to hold the replica
  • dfs.permissions.enabled: whether HDFS permission checking is enabled

In addition, the /data/tmp/hadoop/hdfs path needs to be created in advance, so run:

mkdir -p /data/tmp/hadoop/hdfs
Edit slaves file

Open the slaves configuration file

vim /apps/hadoop/etc/hadoop/slaves

Add the hostnames of the nodes that act as slaves in the cluster to the slaves file. At present there is only one node, so the slaves file contains:

localhost

Next, format the HDFS file system by executing:

hadoop namenode -format
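
If formatting succeeds, the log output contains a line like the following (the path comes from our dfs.namenode.name.dir setting):

INFO common.Storage: Storage directory /data/tmp/hadoop/hdfs/name has been successfully formatted.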

Change to the /apps/hadoop/sbin directory and start the Hadoop HDFS-related processes.

cd /apps/hadoop/sbin/
./start-dfs.sh

Run the jps command in a terminal to check whether the HDFS-related processes have started. If DataNode, NameNode, SecondaryNameNode, and Jps all appear, the HDFS processes are running.
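
For example, jps prints something like this (the process IDs will differ on your machine):

2642 NameNode
2771 DataNode
2943 SecondaryNameNode
3058 Jps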

Next, verify the operation of HDFS.

Create a directory on HDFS

hadoop fs -mkdir /myhadoop1

Run the following command to check whether the directory was created successfully

hadoop fs -ls -R /
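
If the directory exists, the recursive listing includes a line like this (owner and timestamp are illustrative):

drwxr-xr-x   - zhangyu supergroup          0 2019-09-25 10:30 /myhadoop1
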
Configuring MapReduce

Switch to the Hadoop configuration directory

cd /apps/hadoop/etc/hadoop

Rename the MapReduce configuration template mapred-site.xml.template to mapred-site.xml

mv /apps/hadoop/etc/hadoop/mapred-site.xml.template  /apps/hadoop/etc/hadoop/mapred-site.xml
Edit mapred-site.xml file
vim /apps/hadoop/etc/hadoop/mapred-site.xml

Add the MapReduce configuration between the <configuration> tags

<property>  
    <name>mapreduce.framework.name</name>  
    <value>yarn</value>  
</property>

This specifies that MapReduce uses YARN as its task-processing framework.

Edit yarn-site.xml file
vim /apps/hadoop/etc/hadoop/yarn-site.xml

Add the YARN configuration between the <configuration> tags

<property>  
    <name>yarn.nodemanager.aux-services</name>  
    <value>mapreduce_shuffle</value>  
</property>

Start YARN

Now start the computation-layer processes. Switch to the Hadoop startup directory

cd /apps/hadoop/sbin/

Execute the command to start YARN

./start-yarn.sh

Run the jps command again to check the processes. If DataNode, NameNode, SecondaryNameNode, Jps, ResourceManager, and NodeManager all appear, the relevant HDFS and YARN processes have started.
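
A sample jps listing at this point (process IDs are illustrative):

2642 NameNode
2771 DataNode
2943 SecondaryNameNode
3501 ResourceManager
3642 NodeManager
3755 Jps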

Hadoop test run

Switch to the /apps/hadoop/share/hadoop/mapreduce directory

cd /apps/hadoop/share/hadoop/mapreduce

Run one of the MapReduce example programs in this directory to check whether Hadoop operates normally

hadoop jar hadoop-mapreduce-examples-2.6.0-cdh5.4.5.jar pi 3 3

This program estimates the value of pi (the arguments 3 3 request 3 map tasks with 3 samples per map). Of course, we are not concerned with the accuracy of the result for now.

If the program successfully computes a value for pi, Hadoop is installed and configured correctly.
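
The tail of the job output looks like this (placeholders shown; the timing and the estimate depend on the run):

Job Finished in <seconds> seconds
Estimated value of Pi is <estimate>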

Origin www.cnblogs.com/finlu/p/11583783.html