First, prepare the environment
- Apache Spark 2.2.0 is installed (this release supports only Spark 2.2.0; other Spark versions will be supported in subsequent releases)
- MySQL is installed, running, and configured to allow remote access
- Passwordless SSH login is configured between all nodes
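These prerequisites can be verified with a few quick checks before proceeding (a sketch; the hostnames and credentials are illustrative, not part of Moonbox):
# Spark must report version 2.2.0
$SPARK_HOME/bin/spark-submit --version
# MySQL must be reachable from other nodes (replace mysql_host and credentials)
mysql -h mysql_host -u root -p -e "SELECT VERSION();"
# SSH to each worker should succeed without a password prompt
ssh worker1 hostname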
Second, download
Download moonbox-0.3.0-beta from: https://github.com/edp963/moonbox/releases/tag/0.3.0-beta
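On a terminal-only server the release tarball can be fetched directly; the URL below assumes GitHub's standard release-asset layout for this tag:
wget https://github.com/edp963/moonbox/releases/download/0.3.0-beta/moonbox-assembly_2.11-0.3.0-beta-dist.tar.gz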
Third, unzip
tar -zxvf moonbox-assembly_2.11-0.3.0-beta-dist.tar.gz
Fourth, modify the configuration files
Configuration files are located in the conf directory
step 1: Modify slaves
mv slaves.example slaves
vim slaves
You will see the following:
localhost
Replace this with the addresses of the nodes where Worker processes should be deployed, one address per line.
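For example, a deployment with three worker nodes might use a slaves file like this (hypothetical hostnames):
worker1
worker2
worker3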
step 2: Modify moonbox-env.sh
mv moonbox-env.sh.example moonbox-env.sh
chmod u+x moonbox-env.sh
vim moonbox-env.sh
You will see the following:
export JAVA_HOME=path/to/installed/dir
export SPARK_HOME=path/to/installed/dir
export YARN_CONF_DIR=path/to/yarn/conf/dir
export MOONBOX_SSH_OPTS="-p 22"
export MOONBOX_HOME=path/to/installed/dir
# export MOONBOX_LOCAL_HOSTNAME=localhost
export MOONBOX_MASTER_HOST=localhost
export MOONBOX_MASTER_PORT=2551
Modify these values to match your actual environment.
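A filled-in version for a typical environment might look like the following (every path and hostname here is illustrative, not a default):
export JAVA_HOME=/usr/java/jdk1.8.0_112
export SPARK_HOME=/opt/spark-2.2.0-bin-hadoop2.7
export YARN_CONF_DIR=/etc/hadoop/conf
export MOONBOX_SSH_OPTS="-p 22"
export MOONBOX_HOME=/opt/moonbox
# export MOONBOX_LOCAL_HOSTNAME=localhost
export MOONBOX_MASTER_HOST=master
export MOONBOX_MASTER_PORT=2551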
step 3: Modify moonbox-defaults.conf
mv moonbox-defaults.conf.example moonbox-defaults.conf
vim moonbox-defaults.conf
You will see the following, where:
- catalog
Configures where the metadata is stored. This must be modified to match your environment.
- rest
Configures the REST service. Modify as needed.
- tcp
Configures the TCP (JDBC) service. Modify as needed.
- local
Configures Spark Local mode. This is an array; the number of elements determines how many Spark Local instances each Worker node starts. Delete it if Local mode is not needed (see the sketch after the configuration listing below).
- cluster
Configures Spark on YARN mode. This is an array; the number of elements determines how many Spark Yarn instances each Worker node starts. Delete it if YARN mode is not needed.
moonbox {
deploy {
catalog {
implementation = "mysql"
url = "jdbc:mysql://host:3306/moonbox?createDatabaseIfNotExist=true"
user = "root"
password = "123456"
driver = "com.mysql.jdbc.Driver"
}
rest {
enable = true
port = 9099
request.timeout = "600s"
idle.timeout= "600s"
}
tcp {
enable = true
port = 10010
}
}
mixcal {
pushdown.enable = true
column.permission.enable = true
spark.sql.cbo.enabled = true
spark.sql.constraintPropagation.enabled = false
local = [{}]
cluster = [{
spark.hadoop.yarn.resourcemanager.hostname = "master"
spark.hadoop.yarn.resourcemanager.address = "master:8032"
spark.yarn.stagingDir = "hdfs://master:8020/tmp"
spark.yarn.access.namenodes = "hdfs://master:8020"
spark.loglevel = "ERROR"
spark.cores.max = 2
spark.yarn.am.memory = "512m"
spark.yarn.am.cores = 1
spark.executor.instances = 2
spark.executor.cores = 1
spark.executor.memory = "2g"
}]
}
}
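As an illustration of the array semantics described above, the following local entry would start two Spark Local instances on each Worker, the second one overriding a Spark property (the override shown is just an example, not a required setting):
local = [{}, { spark.loglevel = "ERROR" }]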
- optional: if HDFS high availability (HA), HDFS Kerberos, YARN high availability (HA), or YARN Kerberos is configured
Add the relevant parts of the following configuration to the cluster element and modify the values to match your actual setup. The specific values can be found in your HDFS and YARN configuration files.
#### HDFS HA ####
spark.hadoop.fs.defaultFS="hdfs://service_name"
spark.hadoop.dfs.nameservices="service_name"
spark.hadoop.dfs.ha.namenodes.service_name="xxx1,xxx2"
spark.hadoop.dfs.namenode.rpc-address.service_name.xxx1="xxx1_host:8020"
spark.hadoop.dfs.namenode.rpc-address.service_name.xxx2="xxx2_host:8020"
spark.hadoop.dfs.client.failover.proxy.provider.service_name="org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
spark.yarn.stagingDir = "hdfs://service_name/tmp"
#### HDFS kerberos ####
dfs.namenode.kerberos.principal = ""
dfs.namenode.kerberos.keytab = ""
#### YARN HA ####
spark.hadoop.yarn.resourcemanager.ha.enabled=true
spark.hadoop.yarn.resourcemanager.ha.rm-ids="yyy1,yyy2"
spark.hadoop.yarn.resourcemanager.hostname.yyy1="yyy1_host"
spark.hadoop.yarn.resourcemanager.hostname.yyy2="yyy2_host"
#### YARN kerberos ####
spark.yarn.principal = ""
spark.yarn.keytab = ""
Fifth, distribute the installation package
Place the MySQL JDBC driver jar into the libs and runtime directories, then copy the entire moonbox installation directory to all nodes, keeping it at the same path as on the master node.
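A sketch of this step, assuming three workers and a MySQL 5.1.x driver jar (the jar name and node list are illustrative):
# place the JDBC driver where Moonbox can load it
cp mysql-connector-java-5.1.46.jar $MOONBOX_HOME/libs/
cp mysql-connector-java-5.1.46.jar $MOONBOX_HOME/runtime/
# copy the installation to every node, at the same path as on the master
for host in worker1 worker2 worker3; do
  scp -r $MOONBOX_HOME $host:$(dirname $MOONBOX_HOME)
done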
Sixth, start the cluster
Execute on the master node:
sbin/start-all.sh
Seventh, stop the cluster
Execute on the master node:
sbin/stop-all.sh
Eighth, check whether the cluster started successfully
Execute the following command on the master node; you should see the MoonboxMaster process:
jps | grep Moonbox
Execute the following command on each worker node; you should see the MoonboxWorker process:
jps | grep Moonbox
Execute the following command on each worker node; you should see a number of SparkSubmit processes equal to the number configured in moonbox-defaults.conf:
jps -m | grep Spark
Use the moonbox-cluster command to view cluster information:
bin/moonbox-cluster workers
bin/moonbox-cluster apps
If these checks pass, the cluster has started successfully; you can now go to the examples section to try it out. If a check fails, troubleshoot by inspecting the logs under the logs directory on the master or worker nodes.
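For example, recent errors can be surfaced like this (a sketch; the exact log file names depend on your deployment):
# run on the master or a worker node, from the moonbox installation directory
grep -i error logs/*.log | tail -n 20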
Open-source repository: https://github.com/edp963/moonbox
Source: CreditEase Institute of Technology