Notes on building Hadoop and Spark on Linux

Just for fun, I needed to build a Spark cluster on three machines (192.168.1.10, 192.168.1.11, 192.168.1.12). Since Spark sits on top of Hadoop, Hadoop had to be set up first. After a couple of afternoons the setup was finally complete, and the steps are recorded below.

 


 

Preparation

1. Install the JDK.

2. Download the installation packages

  These include Scala, Hadoop, and Spark.

3. Passwordless SSH authentication

    The three machines need passwordless SSH access to one another:

 Step 1: generate an RSA key pair on each machine:

[root@jw01 .ssh]# ssh-keygen -t rsa
[root@jw02 .ssh]# ssh-keygen -t rsa
[root@kt01 .ssh]# ssh-keygen -t rsa
[root@kt02 .ssh]# ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/root/.ssh/id_rsa):  # press Enter for passwordless login
Enter passphrase (empty for no passphrase):               # press Enter
Enter same passphrase again:                              # press Enter
Your identification has been saved in /root/.ssh/id_rsa.      # the private key
Your public key has been saved in /root/.ssh/id_rsa.pub.      # the public key
The key fingerprint is:
04:45:0b:47:10:92:0c:b2:b9:d7:11:5b:49:05:e4:d9 root@jw01

Step 2: rename the public keys (id_rsa.pub) generated on 192.168.1.11 and 192.168.1.12 to id_rsa.pub_11 and id_rsa.pub_12 respectively, and transfer them to the /root/.ssh/ directory on 192.168.1.10.

Then, on 192.168.1.10, append all the public keys to the authorized_keys file used for authentication (if the file does not exist, the command below creates it):

cat ~/.ssh/id_rsa.pub* >> ~/.ssh/authorized_keys

Step 3: copy the authorized_keys file assembled on 192.168.1.10 to the /root/.ssh/ directory on the two machines 192.168.1.11 and 192.168.1.12.

Finally, test that each machine can log in to the others over ssh by IP address without being prompted for a password. A minimal sketch of steps 2 and 3 plus the final check is shown below.
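
A rough shell sketch of the key exchange, assuming root access on all three machines; the file names follow the id_rsa.pub_11 / id_rsa.pub_12 convention from step 2, and the first scp will still ask for a password since passwordless login is not yet in place:

# On 192.168.1.11 (use id_rsa.pub_12 on 192.168.1.12)
scp ~/.ssh/id_rsa.pub root@192.168.1.10:/root/.ssh/id_rsa.pub_11

# On 192.168.1.10: merge all public keys and push the result to the slaves
cat ~/.ssh/id_rsa.pub* >> ~/.ssh/authorized_keys
scp ~/.ssh/authorized_keys root@192.168.1.11:/root/.ssh/
scp ~/.ssh/authorized_keys root@192.168.1.12:/root/.ssh/

# From any machine, these should now log in without a password prompt
ssh 192.168.1.11
ssh 192.168.1.12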

Preparing the Environment

Modify the hostname

We will build a cluster with one master and two slaves. First, change the hostname by editing /etc/hostname with vi: on the master set it to master, on the first slave set it to slave1, and do the same for the other slave (slave2).
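
A small sketch, assuming a systemd-based distribution where hostnamectl is available (editing /etc/hostname and rebooting achieves the same thing on other systems):

# On 192.168.1.10
hostnamectl set-hostname master
# On 192.168.1.11
hostnamectl set-hostname slave1
# On 192.168.1.12
hostnamectl set-hostname slave2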

Configuring hosts

Modify the hosts file on every machine:

vi /etc/hosts

192.168.1.10      master
192.168.1.11      slave1
192.168.1.12      slave2

Hadoop installation

1. Extract

tar -zxvf hadoop-2.6.0.tar.gz

2. Modify the configuration files

  As shown in Reference [1]

On machine 192.168.1.10 (master), go into the Hadoop configuration directory. The following seven files need to be configured: hadoop-env.sh, yarn-env.sh, slaves, core-site.xml, hdfs-site.xml, mapred-site.xml, yarn-site.xml

  1. Configure JAVA_HOME in hadoop-env.sh

    # The java implementation to use.
    export JAVA_HOME=/home/spark/workspace/jdk1.7.0_75
    
  2. Configure JAVA_HOME in yarn-env.sh

    # some Java parameters
    export JAVA_HOME=/home/spark/workspace/jdk1.7.0_75
    
  3. In slaves, list the IP addresses or hostnames of the slave nodes:

    192.168.1.11
    192.168.1.12
  4. Modify core-site.xml

    <configuration>
        <property>
            <name>fs.defaultFS</name>
            <value>hdfs://master:9000/</value>
        </property>
        <property>
             <name>hadoop.tmp.dir</name>
             <value>file:/home/spark/workspace/hadoop-2.6.0/tmp</value>
        </property>
    </configuration>
    
  5. Modify hdfs-site.xml

    <configuration>
        <property>
            <name>dfs.namenode.secondary.http-address</name>
            <value>master:9001</value>
        </property>
        <property>
            <name>dfs.namenode.name.dir</name>
            <value>file:/home/spark/workspace/hadoop-2.6.0/dfs/name</value>
        </property>
        <property>
            <name>dfs.datanode.data.dir</name>
            <value>file:/home/spark/workspace/hadoop-2.6.0/dfs/data</value>
        </property>
        <property>
            <name>dfs.replication</name>
            <value>3</value>
        </property>
    </configuration>
    
  6. Modify mapred-site.xml

    <configuration>
        <property>
            <name>mapreduce.framework.name</name>
            <value>yarn</value>
        </property>
    </configuration>
    
  7. Modify yarn-site.xml

    <configuration>
        <property>
            <name>yarn.nodemanager.aux-services</name>
            <value>mapreduce_shuffle</value>
        </property>
        <property>
            <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
            <value>org.apache.hadoop.mapred.ShuffleHandler</value>
        </property>
        <property>
            <name>yarn.resourcemanager.address</name>
            <value>master:8032</value>
        </property>
        <property>
            <name>yarn.resourcemanager.scheduler.address</name>
            <value>master:8030</value>
        </property>
        <property>
            <name>yarn.resourcemanager.resource-tracker.address</name>
            <value>master:8035</value>
        </property>
        <property>
            <name>yarn.resourcemanager.admin.address</name>
            <value>master:8033</value>
        </property>
        <property>
            <name>yarn.resourcemanager.webapp.address</name>
            <value>master:8088</value>
        </property>
    </configuration>

3. Distribute the configured hadoop-2.6.0 folder to the slave machines 192.168.1.11 and 192.168.1.12, for example as sketched below.
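
One way to do the distribution, assuming the workspace path used in the configuration files (/home/spark/workspace) is the same on all machines and root SSH access is set up as above (rsync works equally well):

# Run on 192.168.1.10
scp -r /home/spark/workspace/hadoop-2.6.0 root@192.168.1.11:/home/spark/workspace/
scp -r /home/spark/workspace/hadoop-2.6.0 root@192.168.1.12:/home/spark/workspace/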

4. Start the cluster on 192.168.1.10:

cd ~/workspace/hadoop-2.6.0     # enter the hadoop directory
bin/hadoop namenode -format     # format the namenode
sbin/start-dfs.sh               # start dfs
sbin/start-yarn.sh              # start yarn

5. Test

On machine 192.168.1.10:

$ jps  #run on master
3407 SecondaryNameNode
3218 NameNode
3552 ResourceManager
3910 Jps

On machines 192.168.1.11 and 192.168.1.12:

$ jps   #run on slaves
2072 NodeManager
2213 Jps
1962 DataNode
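
As an extra check beyond jps (not part of the original steps), HDFS itself can be exercised with a few standard commands on the master; the /smoke-test directory name is arbitrary:

cd ~/workspace/hadoop-2.6.0
bin/hdfs dfsadmin -report        # should report the two DataNodes as live
bin/hdfs dfs -mkdir /smoke-test  # create a test directory in HDFS
bin/hdfs dfs -ls /               # the new directory should be listed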

Admin interface

Open http://192.168.1.10:8088 in a browser; the Hadoop management interface should appear and show the slave1 and slave2 nodes. The port is the one configured in yarn-site.xml:

    <property>
        <name>yarn.resourcemanager.webapp.address</name>
        <value>master:8088</value>
    </property>

Scala installation

As shown in Reference [1]

Perform the following on each of the three machines: 192.168.1.10, 192.168.1.11, 192.168.1.12.

Extract:

tar -zxvf scala-2.10.4.tgz

 

Then modify the environment variables with sudo vi /etc/profile and add the following:

export SCALA_HOME=$WORK_SPACE/scala-2.10.4
export PATH=$PATH:$SCALA_HOME/bin

 

As before, make the environment variables take effect and verify that Scala was installed successfully:

$ source /etc/profile   # make the environment variables take effect
$ scala -version        # if the following version information is printed, the installation succeeded
Scala code runner version 2.10.4 -- Copyright 2002-2013, LAMP/EPFL

Problems that may be encountered, and their solutions:

[1] Hadoop jps shows a "process information unavailable" message. Solution: see Reference [2].

After starting Hadoop, running the jps command to view the current Java processes displays:

hduser@jack:/usr/local/hadoop$ jps
18470 SecondaryNameNode
19096 Jps
12167 -- process information unavailable
19036 NodeManager
18642 ResourceManager
18021 DataNode
17640 NameNode


    In this case, go to the /tmp directory on the local file system, delete the folder named hsperfdata_{username}, and then restart Hadoop.
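
A rough sketch of that cleanup, assuming Hadoop was started as the user hduser shown in the listing above (run from the hadoop-2.6.0 directory and adjust the username to your own):

sbin/stop-yarn.sh && sbin/stop-dfs.sh    # stop Hadoop first
rm -rf /tmp/hsperfdata_hduser            # remove the stale per-user jvmstat directory
sbin/start-dfs.sh && sbin/start-yarn.sh  # restart Hadoop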

[2] Various permission problems

Solution: redo the passwordless SSH setup described in the preparation section.

 [3] "Incompatible clusterIDs" error when starting Hadoop HDFS Analysis 

  Solution: "Incompatible clusterIDs" is the cause of the error before performing the "hdfs namenode -format", no data directory empty DataNode node. Empty it.

Spark installation

As shown in Reference [1]

Extract on machine 192.168.1.10:

tar -zxvf spark-1.4.0-bin-hadoop2.6.tgz
mv spark-1.4.0-bin-hadoop2.6 spark-1.4    # the original name is too long, so shorten it

Modify the configuration:

cd ~/workspace/spark-1.4/conf           # enter the spark configuration directory
cp spark-env.sh.template spark-env.sh   # copy from the configuration template
vi spark-env.sh                         # add the configuration below

 

Append the following at the end of spark-env.sh (this is my configuration; adjust it for your own setup):

export SCALA_HOME=/home/spark/workspace/scala-2.10.4
export JAVA_HOME=/home/spark/workspace/jdk1.7.0_75
export HADOOP_HOME=/home/spark/workspace/hadoop-2.6.0
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
SPARK_MASTER_IP=master
SPARK_LOCAL_DIRS=/home/spark/workspace/spark-1.4
SPARK_DRIVER_MEMORY=1G

Modify the slaves file:

cp slaves.template slaves

  Add the slave nodes to it:

192.168.1.11
192.168.1.12

Distribute the configured Spark folder to 192.168.1.11 and 192.168.1.12 in the same way as the Hadoop folder.

Start on 192.168.1.10:

sbin/start-all.sh

Check whether everything started:

On the master:

$ jps
7949 Jps
7328 SecondaryNameNode
7805 Master
7137 NameNode
7475 ResourceManager

On slave2:

$jps
3132 DataNode
3759 Worker
3858 Jps
3231 NodeManager

Open the Spark web management page: http://192.168.1.10:8080

If port 8080 is already taken by another program, port 8081 is used instead.
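
As a final sanity check (not part of the original write-up), a bundled example job can be submitted to the standalone master. The examples jar name below is an assumption based on the spark-1.4.0-bin-hadoop2.6 package; adjust it to whatever is actually in the lib directory:

cd ~/workspace/spark-1.4
bin/spark-submit --class org.apache.spark.examples.SparkPi \
  --master spark://master:7077 \
  lib/spark-examples-1.4.0-hadoop2.6.0.jar 10
# A line like "Pi is roughly 3.14..." should appear in the driver output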

 


END

 



Origin blog.csdn.net/ShuYunBIGDATA/article/details/90762675