First, prepare the environment
- Apache Spark 2.2.0 is installed (this release supports only Spark 2.2.0; other Spark versions will be supported in subsequent releases)
- MySQL is installed, running, and configured to allow remote access
- Passwordless SSH login is configured between all nodes
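These prerequisites can be verified with a few quick checks before proceeding (a sketch; the hostnames and credentials are illustrative, not part of Moonbox):
# Spark must report version 2.2.0
$SPARK_HOME/bin/spark-submit --version
# MySQL must be reachable from other nodes (replace mysql_host and credentials)
mysql -h mysql_host -u root -p -e "SELECT VERSION();"
# SSH to each worker should succeed without a password prompt
ssh worker1 hostname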
Second, download
Download moonbox-0.3.0-beta from: https://github.com/edp963/moonbox/releases/tag/0.3.0-beta
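On a terminal-only server the release tarball can be fetched directly; the URL below assumes GitHub's standard release-asset layout for this tag:
wget https://github.com/edp963/moonbox/releases/download/0.3.0-beta/moonbox-assembly_2.11-0.3.0-beta-dist.tar.gz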
Third, unzip
tar -zxvf moonbox-assembly_2.11-0.3.0-beta-dist.tar.gz
Fourth, modify the configuration files
Configuration files are located in the conf directory
step 1: Modify slaves
mv slaves.example slaves
vim slaves
You will see the following:
localhost
Replace this with the addresses of the nodes where Worker processes should be deployed, one address per line.
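For example, a deployment with three worker nodes might use a slaves file like this (hypothetical hostnames):
worker1
worker2
worker3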
step 2: Modify moonbox-env.sh
mv moonbox-env.sh.example moonbox-env.sh
chmod u+x moonbox-env.sh
vim moonbox-env.sh
You will see the following:
export JAVA_HOME=path/to/installed/dir
export SPARK_HOME=path/to/installed/dir
export YARN_CONF_DIR=path/to/yarn/conf/dir
export MOONBOX_SSH_OPTS="-p 22"
export MOONBOX_HOME=path/to/installed/dir
# export MOONBOX_LOCAL_HOSTNAME=localhost
export MOONBOX_MASTER_HOST=localhost
export MOONBOX_MASTER_PORT=2551
Modify these values to match your actual environment.
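A filled-in version for a typical environment might look like the following (every path and hostname here is illustrative, not a default):
export JAVA_HOME=/usr/java/jdk1.8.0_112
export SPARK_HOME=/opt/spark-2.2.0-bin-hadoop2.7
export YARN_CONF_DIR=/etc/hadoop/conf
export MOONBOX_SSH_OPTS="-p 22"
export MOONBOX_HOME=/opt/moonbox
# export MOONBOX_LOCAL_HOSTNAME=localhost
export MOONBOX_MASTER_HOST=master
export MOONBOX_MASTER_PORT=2551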
step 3: Modify moonbox-defaults.conf
mv moonbox-defaults.conf.example moonbox-defaults.conf
vim moonbox-defaults.conf
You will see the following, where:
- catalog
Configures where the metadata is stored. This must be modified to match your environment.
- rest
Configures the REST service. Modify as needed.
- tcp
Configures the TCP (JDBC) service. Modify as needed.
- local
Configures Spark Local mode. This is an array; the number of elements determines how many Spark Local instances each Worker node starts. Delete it if Local mode is not needed (see the sketch after the configuration listing below).
- cluster
Configures Spark on YARN mode. This is an array; the number of elements determines how many Spark Yarn instances each Worker node starts. Delete it if YARN mode is not needed.
moonbox {
deploy {
catalog {
implementation = "mysql"
url = "jdbc:mysql://host:3306/moonbox?createDatabaseIfNotExist=true"
user = "root"
password = "123456"
driver = "com.mysql.jdbc.Driver"
}
rest {
enable = true
port = 9099
request.timeout = "600s"
idle.timeout= "600s"
}
tcp {
enable = true
port = 10010
}
}
mixcal {
pushdown.enable = true
column.permission.enable = true
spark.sql.cbo.enabled = true
spark.sql.constraintPropagation.enabled = false
local = [{}]
cluster = [{
spark.hadoop.yarn.resourcemanager.hostname = "master"
spark.hadoop.yarn.resourcemanager.address = "master:8032"
spark.yarn.stagingDir = "hdfs://master:8020/tmp"
spark.yarn.access.namenodes = "hdfs://master:8020"
spark.loglevel = "ERROR"
spark.cores.max = 2
spark.yarn.am.memory = "512m"
spark.yarn.am.cores = 1
spark.executor.instances = 2
spark.executor.cores = 1
spark.executor.memory = "2g"
}]
}
}
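As an illustration of the array semantics described above, the following local entry would start two Spark Local instances on each Worker, the second one overriding a Spark property (the override shown is just an example, not a required setting):
local = [{}, { spark.loglevel = "ERROR" }]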
- optional: if HDFS high availability (HA), HDFS Kerberos, YARN high availability (HA), or YARN Kerberos is configured
Add the relevant parts of the following configuration to the cluster element and modify the values to match your actual setup. The specific values can be found in your HDFS and YARN configuration files.
#### HDFS HA ####
spark.hadoop.fs.defaultFS="hdfs://service_name"
spark.hadoop.dfs.nameservices="service_name"
spark.hadoop.dfs.ha.namenodes.service_name="xxx1,xxx2"
spark.hadoop.dfs.namenode.rpc-address.service_name.xxx1="xxx1_host:8020"
spark.hadoop.dfs.namenode.rpc-address.service_name.xxx2="xxx2_host:8020"
spark.hadoop.dfs.client.failover.proxy.provider.service_name="org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
spark.yarn.stagingDir = "hdfs://service_name/tmp"
#### HDFS kerberos ####
dfs.namenode.kerberos.principal = ""
dfs.namenode.kerberos.keytab = ""
#### YARN HA ####
spark.hadoop.yarn.resourcemanager.ha.enabled=true
spark.hadoop.yarn.resourcemanager.ha.rm-ids="yyy1,yyy2"
spark.hadoop.yarn.resourcemanager.hostname.yyy1="yyy1_host"
spark.hadoop.yarn.resourcemanager.hostname.yyy2="yyy2_host"
#### YARN kerberos ####
spark.yarn.principal = ""
spark.yarn.keytab = ""
Fifth, distribute the installation package
Place the MySQL JDBC driver jar into the libs and runtime directories, then copy the entire moonbox installation directory to all nodes, keeping it at the same path as on the master node.
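A sketch of this step, assuming three workers and a MySQL 5.1.x driver jar (the jar name and node list are illustrative):
# place the JDBC driver where Moonbox can load it
cp mysql-connector-java-5.1.46.jar $MOONBOX_HOME/libs/
cp mysql-connector-java-5.1.46.jar $MOONBOX_HOME/runtime/
# copy the installation to every node, at the same path as on the master
for host in worker1 worker2 worker3; do
  scp -r $MOONBOX_HOME $host:$(dirname $MOONBOX_HOME)
done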
Sixth, start the cluster
Execute on the master node:
sbin/start-all.sh
Seventh, stop the cluster
Execute on the master node:
sbin/stop-all.sh
Eighth, check whether the cluster started successfully
Execute the following command on the master node; you should see the MoonboxMaster process:
jps | grep Moonbox
Execute the following command on each worker node; you should see the MoonboxWorker process:
jps | grep Moonbox
Execute the following command on each worker node; you should see a number of SparkSubmit processes equal to the number configured in moonbox-defaults.conf:
jps -m | grep Spark
Use the moonbox-cluster command to view cluster information:
bin/moonbox-cluster workers
bin/moonbox-cluster apps
If these checks pass, the cluster has started successfully; you can now go to the examples section to try it out. If a check fails, troubleshoot by inspecting the logs under the logs directory on the master or worker nodes.
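For example, recent errors can be surfaced like this (a sketch; the exact log file names depend on your deployment):
# run on the master or a worker node, from the moonbox installation directory
grep -i error logs/*.log | tail -n 20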
Open-source repository: https://github.com/edp963/moonbox
Source: CreditEase Institute of Technology