Apache Hadoop deployment (4): Hive/HBase/Storm/Spark/Flink configuration

Table of Contents

Hive configuration

Configuration

Startup and verification

Problems

HBase configuration

Configuration

Startup and verification

Problems

Storm configuration

Configuration

Startup and verification

Spark (on yarn) configuration

Configuration

Startup and verification

Problems

Flink (on yarn) configuration

Configuration

Startup and verification

Summary


Hive configuration

Hive is a data warehouse tool built on Hadoop that can be used to organize, query, and analyze data sets stored in Hadoop files. Hive has a low learning threshold because it provides HiveQL, a query language similar to the SQL of relational databases. Simple MapReduce statistics can be implemented quickly through HiveQL statements, and Hive automatically converts them into MapReduce jobs, so there is no need to develop against the MapReduce API. This makes it well suited to statistical analysis in a data warehouse.

Hive has three deployment modes: single-user, multi-user, and remote server mode. Single-user mode connects to an in-memory Derby database and is generally used for unit tests. Multi-user mode is the most common deployment: it connects to a database over the network, usually MySQL, to store the Metastore metadata. In remote server mode, the Metastore metadata is stored in a database on a remote server, and the client accesses the MetastoreServer through the Thrift protocol.

The deployment has two parts: first install MySQL and create the user and database, then configure Hive.

Configuration

MySQL part

1. Download the MySQL 5.6.33 tar package (64-bit Linux generic version) from the official MySQL website at https://dev.mysql.com/downloads/mysql/5.6.html#downloads ; unzip it under the installation path /usr/local and rename the directory to mysql.

Note: getconf LONG_BIT reports whether the system is 32-bit or 64-bit.

2. Create a mysql user (in group mysql), change the owner and group of the installation path /usr/local/mysql to mysql, create the data paths /var/lib/mysql and /var/lib/mysql/data, and change their owner and group to mysql as well:

groupadd mysql
useradd -r -g mysql mysql
mkdir -p /var/lib/mysql/data
chown -R mysql:mysql /usr/local/mysql
chown -R mysql:mysql /var/lib/mysql

3. Install the database, passing the data directory and installation directory as parameters:

sudo ./scripts/mysql_install_db --basedir=/usr/local/mysql --datadir=/var/lib/mysql/data --user=mysql

4. Modify the startup script and configuration file:

The files are ./support-files/mysql.server and my.cnf; the former is the script run at startup, and the latter is the MySQL configuration read at startup. If nothing is adjusted, the default basedir is /usr/local/mysql and datadir is /var/lib/mysql/data; if you change these two parameters, many other settings must be modified as well.

When starting the MySQL service, Linux searches for my.cnf in order: first in the /etc directory, and if it is not found there, in $basedir/my.cnf. Most Linux systems already ship a my.cnf in /etc; that file needs to be renamed to something else, such as /etc/my.cnf.bak, otherwise it will interfere with the intended configuration and prevent the service from starting.

sudo cp ./support-files/mysql.server /etc/init.d/mysqld
sudo chmod 755 /etc/init.d/mysqld
sudo cp ./support-files/my-default.cnf /etc/my.cnf

//Modify the data directory and installation directory in the startup script (and in /etc/my.cnf if needed):

sudo vi /etc/init.d/mysqld
basedir=/usr/local/mysql/
datadir=/var/lib/mysql/data

5. Start the service

sudo service mysqld start

//Close the mysql service

sudo service mysqld stop

#View mysql service running status

sudo service mysqld status
6. Set environment variables, test the connection, and configure login permissions:
#Set environment variables
export MYSQL=/usr/local/mysql
export PATH=${MYSQL}/bin:${PATH}
#Grant the root user access to all tables in all databases from any IP address or host
grant all privileges on *.* to 'root'@'%' identified by 'root' with grant option;
flush privileges;
#Change the root user's login password (stop the service first and restart it afterwards)
UPDATE user SET Password=PASSWORD('123456') where USER='root';
flush privileges;

7. Create the hive database and hive user to store the metadata of the Hive warehouse, and grant privileges to the hive user:

create database hive character set latin1;
create user hive;

Grant the hive user the required privileges (the exact grant statement is omitted here; a hedged sketch follows), then refresh:

flush privileges;
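As a hedged sketch, the omitted grant might look like the following, assuming the hive user connects from any host with the password 'hive' (both the host wildcard and the password are assumptions to adjust for your environment):

grant all privileges on hive.* to 'hive'@'%' identified by 'hive';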

Hive part

By default Hive stores its metadata in Derby. To switch the metadata database to MySQL, download the MySQL JDBC driver from https://dev.mysql.com/downloads/file/?id=480090 and copy the driver jar into Hive's lib directory.
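For example, a sketch of the copy, assuming the downloaded driver is version 5.1.46 and Hive is installed under ~/hive (both are assumptions):

# Assumed jar name and Hive install path; adjust to the actual download and installation
cp mysql-connector-java-5.1.46-bin.jar ~/hive/lib/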

Hive reads two configuration files: hive-default.xml and hive-site.xml. The large number of default items in the former do not need to be modified; settings in the latter override the former. Refer to the official Getting Started Guide.

1. Create the Hive metadata paths on HDFS:

 $HADOOP_HOME/bin/hadoop fs -mkdir -p /tmp
 $HADOOP_HOME/bin/hadoop fs -mkdir -p /user/hive/warehouse
 $HADOOP_HOME/bin/hadoop fs -chmod g+w   /tmp
 $HADOOP_HOME/bin/hadoop fs -chmod g+w   /user/hive/warehouse

2. Create and modify the hive-site.xml configuration file as follows (the MySQL user that reads and writes the metadata is root here, which grants more privileges than necessary and should be tightened later):

cp hive-default.xml.template hive-site.xml

//Detailed configuration of hive-site.xml:

<configuration>
  <property>
      <name>hive.metastore.local</name>
      <value>true</value>
   </property>
   <property>
      <name>javax.jdo.option.ConnectionUserName</name>
      <value>root</value>
  </property>
  <property>
      <name>javax.jdo.option.ConnectionPassword</name>
      <value>123456</value>
  </property>
  <property>
      <name>javax.jdo.option.ConnectionURL</name>
      <value>jdbc:mysql://172.19.52.155:3306/hive</value>
  </property>
  <property>
      <name>javax.jdo.option.ConnectionDriverName</name>
      <value>com.mysql.jdbc.Driver</value>
  </property>
</configuration>

3. Initialize the hive meta-database and verify whether the mysql setting is successful (after success, there will be many hive meta-data tables in the tables under the hive library);

schematool -dbType mysql -initSchema
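After initialization, the metadata tables can be checked in MySQL; a minimal sketch, assuming the root password configured above:

mysql -uroot -p123456 -e "use hive; show tables;"
# Tables such as TBLS, DBS and COLUMNS_V2 indicate the metastore schema was created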

Startup and verification

After completing the above configuration, start Hive directly with the hive command and create a test table;

//If a corresponding folder appears on hdfs, the configuration is successful; a hedged example is sketched below:
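A minimal sketch, assuming a test table named t_test (the table name and column are assumptions):

hive
hive> create table t_test(id int);
hive> show tables;
hive> quit;
$HADOOP_HOME/bin/hadoop fs -ls /user/hive/warehouse
# A t_test directory under /user/hive/warehouse indicates the configuration is working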

Problems

1. MySQL was configured and started successfully, but after modifying the login configuration in the mysql.user table, the root user on localhost could no longer read the mysql system database after a restart. Solution: stop the mysql process, delete the contents of the datadir data directory, and re-initialize.

2. Login permission problems when configuring the mysql database: grant the users and login hosts the required privileges.

HBase configuration

HBase is a highly reliable, high-performance, column-oriented, scalable distributed database, mainly used to store unstructured and semi-structured sparse data. HBase supports very large-scale data storage and, by scaling out on clusters of inexpensive hardware, can handle tables with more than a billion rows and millions of columns.

Configuration

Configuration reference on the official website: Example Configurations. Three configuration files need to be modified: hbase-env.sh, hbase-site.xml, and the regionservers file.

1. In hbase-env.sh, modify JAVA_HOME, HBASE_CLASSPATH, and HBASE_MANAGES_ZK. When using a separately installed ZooKeeper, set HBASE_MANAGES_ZK to false; the previously deployed ZooKeeper is reused here:

export JAVA_HOME=/home/stream/jdk1.8.0_144
export HBASE_CLASSPATH=/home/stream/hbase/conf
export HBASE_MANAGES_ZK=false

2. Set the zk address and other configurations in the hbase-site.xml file:

<configuration>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>172.19.72.155,172.19.72.156,172.19.72.157</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/home/stream/zk/zookeeper/dataDir</value>
  </property>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://172.19.72.155/hbase</value>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
    <property>
    <name>hbase.tmp.dir</name>
    <value>/home/stream/hbase/temp</value>
  </property>
</configuration>

3. Modify the regionservers file and add the region server hosts:

172.19.72.156
172.19.72.157
172.19.72.158
172.19.72.159

Startup and verification

Run the ./start-hbase.sh script in the bin directory and check the HMaster process with jps; use the hbase shell command to start the HBase command line, create a table, and test;

You can use zkCli.sh to enter ZooKeeper and view the node information registered by HBase; a hedged sketch follows:
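A minimal verification sketch; the table name t_user and column family cf are assumptions:

hbase shell
hbase(main):001:0> create 't_user','cf'
hbase(main):002:0> list
hbase(main):003:0> quit
# Check the znodes HBase registered in ZooKeeper
zkCli.sh -server 172.19.72.155:2181
ls /hbase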

Problems

1. Startup error: java.lang.ClassNotFoundException: org.apache.htrace.SamplerBuilder

Solution: cp $HBASE_HOME/lib/client-facing-thirdparty/htrace-core-3.1.0-incubating.jar $HBASE_HOME/lib/

2. Startup error: Please check the config value of 'hbase.procedure.store.wal.use.hsync' to set the desired level of robustness and ensure the config value of 'hbase.wal.dir' points to a FileSystem mount that can provide it.

Solution: add configuration in hbase-site.xml:

<property>
  <name>hbase.unsafe.stream.capability.enforce</name>
  <value>false</value>
</property>

Storm configuration

Storm is a distributed, highly available real-time computing framework. ZooKeeper handles communication between the Nimbus node and the Supervisor nodes and monitors the status of each node. Tasks are submitted on the Nimbus node; Nimbus distributes tasks through the ZooKeeper cluster, and the Supervisors are where tasks actually run.

The Nimbus node monitors the status of each Supervisor node through the ZooKeeper cluster. When a Supervisor node fails, Nimbus redistributes the tasks on that Supervisor to other Supervisor nodes through the ZooKeeper cluster.

If the Nimbus node fails, running tasks do not stop, but managing them is affected; in that case the Nimbus node only needs to be restored.

The Nimbus node does not support high availability, which is a long-standing issue for Storm. In general, however, the Nimbus node is under little load and usually has no problems.

Configuration

Unzip the Storm package directly to the /home/stream directory. The file /home/stream/apache-storm-0.9.5/conf/storm.yaml needs to be configured; common configuration items and values are as follows:

# ZooKeeper servers and port, and the Storm root directory in ZooKeeper
storm.zookeeper.servers:
 - "192.168.159.145"
 - "192.168.159.144"
 - "192.168.159.143"
storm.zookeeper.port: 21810
storm.zookeeper.root: /storm_new10
# Address of the Storm master node and the port of the web UI
nimbus.host: "192.168.159.145"
ui.port: 8989
# Memory used by each worker
worker.heap.memory.mb: 512
storm.local.dir: "/home/zyzx/apache-storm-0.9.5/data"
# Worker ports on the supervisor nodes. Each port corresponds to one worker process; in production the number of ports should be sized according to the physical resources and load of each node. Five ports are configured here as an example.
supervisor.slots.ports:
- 6700
- 6701
- 6702
- 6703
- 6704
nimbus.thrift.max_buffer_size: 204876
worker.childopts: "-Xmx1024m"

Startup and verification

Start nimbus and the Storm UI on the control node and the supervisor daemon on the other nodes; check with jps: the master node runs the nimbus process and the worker nodes run the supervisor process;

nohup storm ui &
nohup storm nimbus &
nohup storm supervisor &

Storm's configuration file storm.yaml is strict about format; extra spaces can prevent the configuration from being read. After starting all nodes, you can check the number of live supervisor nodes under the Storm root directory in ZooKeeper's zkCli client to verify whether startup succeeded:

Run ./storm list under ~/storm/bin to list the topologies submitted to the cluster; a hedged sketch is given below:
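A minimal sketch of checking ZooKeeper and submitting a sample topology; the storm-starter jar path and main class depend on the distribution and are assumptions:

# Count the live supervisors registered under the configured ZooKeeper root (/storm_new10)
zkCli.sh -server 192.168.159.145:21810
ls /storm_new10/supervisors
# Submit the word-count example and list running topologies (jar path is an assumption)
./storm jar ~/storm/examples/storm-starter/storm-starter-topologies-0.9.5.jar storm.starter.WordCountTopology wordcount
./storm list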

Spark (on yarn) configuration

Spark is written in the concise and elegant Scala language, offers an interactive programming experience based on Scala, and provides a variety of convenient, easy-to-use APIs. Spark follows the design philosophy of "one software stack for different application scenarios" and has gradually formed a complete ecosystem, including an in-memory computing framework, SQL ad-hoc queries (Spark SQL), stream computing (Spark Streaming), machine learning (MLlib), and graph computing (GraphX). Spark can be deployed on the YARN resource manager to provide a one-stop big data solution that supports batch processing, stream processing, and interactive queries at the same time.

The MapReduce computing model has high latency and cannot meet the needs of real-time, fast computation, so it is only suitable for offline scenarios. Spark borrows from the MapReduce model but has the following advantages:

  • Spark provides more types of data set operations, and its programming model is more flexible than MapReduce;
  • Spark provides in-memory computation and keeps intermediate results in memory, which reduces the IO overhead of iterative computation and gives higher efficiency;
  • Spark uses a DAG-based task scheduling and execution mechanism, which is more efficient for iterative jobs;

In actual development, MapReduce needs to write a lot of low-level code, which is not efficient enough. Spark provides a variety of high-level and concise APIs to implement applications with the same function, and the amount of implementation code is much less than MapReduce.

As a computing framework, Spark only replaces the MapReduce computing framework in the Hadoop ecosystem. It needs HDFS to implement distributed storage of data. Other components in Hadoop still play an important role in the enterprise big data system;

Spark in on-yarn mode only requires a few configuration changes and does not use Spark's own cluster start scripts; a task simply specifies yarn as the master when it is submitted (a hedged example is sketched below).
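For example, a hedged sketch of submitting the bundled SparkPi example in yarn cluster mode (the examples jar name depends on the Spark and Scala versions and is an assumption):

./bin/spark-submit --master yarn --deploy-mode cluster \
  --class org.apache.spark.examples.SparkPi \
  examples/jars/spark-examples_2.11-2.4.0.jar 100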

Configuration

Spark requires Scala. Download Scala and Spark and unzip them to the home directory, set the current user's environment variables (~/.bash_profile), add the SCALA_HOME and SPARK_HOME paths, and make them take effect immediately; run the scala command and the spark-shell command to verify that they work. The official documentation for modifying Spark's configuration files is not easy to follow, so the configuration here was completed with reference to blog posts and some debugging.

Spark needs two configuration files modified: spark-env.sh and spark-defaults.conf. The former specifies the paths to the Hadoop HDFS and YARN configuration files and the SPARK_MASTER_HOST address; the latter specifies the location of the jar packages;

The spark-env.sh configuration file is modified as follows:

export JAVA_HOME=/home/stream/jdk1.8.0_144
export SCALA_HOME=/home/stream/scala-2.11.12
export HADOOP_HOME=/home/stream/hadoop-3.0.3
export HADOOP_CONF_DIR=/home/stream/hadoop-3.0.3/etc/hadoop
export YARN_CONF_DIR=/home/stream/hadoop-3.0.3/etc/hadoop
export SPARK_MASTER_HOST=172.19.72.155
export SPARK_LOCAL_IP=172.19.72.155

The spark-defaults.conf configuration is modified as follows:

//Add the jar package location:

spark.yarn.jars=hdfs://172.19.72.155/spark_jars/*

This setting places the jars on HDFS: all jar packages under the ~/spark/jars path must be uploaded to the /spark_jars/ path on HDFS, otherwise an error is reported that the compiled jar packages cannot be found; a hedged sketch follows.
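A minimal sketch of creating the directory on HDFS and uploading the jars, assuming Spark is installed under ~/spark:

# Create the target directory and upload all Spark jars
hdfs dfs -mkdir -p /spark_jars
hdfs dfs -put ~/spark/jars/* /spark_jars/
hdfs dfs -ls /spark_jars | head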

Startup and verification

Start ./spark-shell with no parameters and it runs in local mode:

Start ./spark-shell --master yarn and it runs in yarn mode, provided YARN is configured and available:

Create the file README.md in the HDFS file system and read it into an RDD, then use the transformations that come with RDDs; by default an RDD treats each line as one value:

Use ./spark-shell --master yarn to start Spark and run val textFile=sc.textFile("README.md") to read the README.md file on HDFS into an RDD; testing with the built-in functions as sketched below indicates that Spark on yarn is configured successfully;
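A minimal sketch in the spark-shell, assuming README.md has been uploaded to the current user's HDFS home directory:

// Read README.md from HDFS into an RDD and apply a few built-in operations
val textFile = sc.textFile("README.md")
textFile.count()   // number of lines
textFile.first()   // first line
textFile.filter(line => line.contains("Spark")).count()   // lines containing "Spark"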

Problems

1. When starting spark-shell, an error is reported that the maximum allocatable memory configured in yarn-site.xml is insufficient. Increase this value to 2048 MB and restart yarn for it to take effect; a sketch of the likely property follows.
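Assuming the limit being hit is yarn.scheduler.maximum-allocation-mb (the exact property should be confirmed against the error message), the yarn-site.xml change would look like:

<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>2048</value>
</property>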

2. The configured HDFS addresses conflict: the setting in hdfs-site.xml has no port, while the spark.yarn.jars value in spark-defaults.conf includes a port. Modify the address in spark-defaults.conf to be consistent with the former:

Flink (on yarn) configuration

Flink is a distributed in-memory computing framework for streaming and batch data. Its design ideas draw mainly on Hadoop, MPP databases, and streaming computing systems. Flink is implemented mainly in Java and developed largely through contributions from the open source community. Flink's primary target is streaming data: by default all tasks are processed as streams, and batch data is just a special case of a stream. It supports fast local iteration and some loop-iteration tasks.

Flink builds a software stack in the form of a hierarchical system. The stacks of different layers are built on the basis of its lower layers. Its characteristics are as follows:

  • Provides a DataStream API for stream processing and a DataSet API for batch processing; the DataSet API supports Java, Scala and Python, and the DataStream API supports Java and Scala;
  • Provides a variety of deployment options, such as local mode (Local), cluster mode (Cluster) and cloud mode (Cloud); for clusters, standalone mode (Standalone) or yarn can be used;
  • Provides good Hadoop compatibility, supporting not only yarn but also data sources such as HDFS and HBase;

Flink supports incremental iteration and can optimize iterative jobs automatically, so tasks submitted on yarn perform slightly better than Spark. Flink processes data record by record, while Spark performs micro-batch processing on top of RDDs, so Spark inevitably adds some latency to stream processing and its real-time performance is not as good as Flink's. Flink and Storm can respond at the millisecond level, whereas Spark can only respond at the second level. However, Spark's market influence and community activity are significantly stronger than Flink's, which limits Flink's room for growth to some extent;

Configuration

Unzip the package, enter the bin directory, and run ./yarn-session.sh --help to view the help and verify that yarn is configured successfully; ./yarn-session.sh -q shows the resources of all yarn NodeManager nodes;

Flink provides two ways to submit tasks on yarn: starting a long-running YARN session (detached mode) and running a single Flink job on YARN (client mode). Flink only needs one configuration file, conf/flink-conf.yaml, to be modified; for detailed parameters refer to the official website:

General configuration: Configuration , HA configuration: High Availability (HA) 

//The fs.hdfs.hadoopconf parameter in conf/flink-conf.yaml must be set so that Flink can locate the YARN and HDFS configuration;

//In yarn mode, jobmanager.rpc.address does not need to be specified, because which container acts as the JobManager is decided by Yarn rather than by the Flink configuration. taskmanager.tmp.dirs does not need to be specified either; this parameter is set by yarn's tmp parameter and defaults to the /tmp directory, where jars and lib files to be uploaded to the ResourceManager are kept. parallelism.default also does not need to be specified, because the number of slots per TaskManager is given by -s when the yarn session is started.

//Modify yarn.resourcemanager.am.max-attempts in yarn-site.xml so that the number of ResourceManager connection retries is 4 (the default is 2); at the same time, add yarn.application-attempts: 4 in flink-conf.yaml;

//Flink-on-yarn cluster HA relies on Yarn's own recovery mechanism, but a Flink job relies on the snapshots produced by checkpoints when it recovers. These snapshots are stored on hdfs, while their metadata is kept in zookeeper, so the zookeeper HA information must still be configured. The recovery.zookeeper.path.namespace can also be overridden with the -z parameter when starting Flink on Yarn.

The complete configuration of flink-conf.yaml is as follows:

# The RPC port where the JobManager is reachable.
jobmanager.rpc.port: 6123
# The heap size for the JobManager JVM
jobmanager.heap.size: 1024m
# The heap size for the TaskManager JVM
taskmanager.heap.size: 1024m
# The number of task slots that each TaskManager offers. Each slot runs one parallel pipeline.
taskmanager.numberOfTaskSlots: 1
# The parallelism used for programs that did not specify and other parallelism.
parallelism.default: 1
# env
HADOOP_CONF_DIR: /home/stream/hadoop-3.0.3/etc/hadoop
YARN_CONF_DIR: /home/stream/hadoop-3.0.3/etc/hadoop
# Fault tolerance and checkpointing
state.backend: filesystem
state.checkpoints.dir: hdfs://172.19.72.155/yzg/flink-checkpoints
state.savepoints.dir: hdfs://172.19.72.155/yzg/flink-checkpoints
# hdfs
# The absolute path to the Hadoop File System's (HDFS) configuration directory
fs.hdfs.hadoopconf: /home/stream/hadoop-3.0.3/etc/hadoop
# The absolute path of Hadoop's own configuration file "hdfs-site.xml" (DEFAULT: null)
fs.hdfs.hdfssite: /home/stream/hadoop-3.0.3/etc/hadoop/hdfs-site.xml
#HA
high-availability: zookeeper
high-availability.zookeeper.quorum: 172.19.72.155:2181,172.19.72.156:2181,172.19.72.157:2181
high-availability.storageDir: hdfs:///yzg/flink/recovery
high-availability.zookeeper.path.root: /flink
yarn.application-attempts: 4

In HA mode, you need to configure zk and modify zoo.cfg under conf; the zoo.cfg configuration is as follows:

dataDir=/home/stream/zk/zookeeper/logs
# The port at which the clients will connect
clientPort=2181
# ZooKeeper quorum peers
server.1=172.19.72.155:2888:3888
server.2=172.19.72.156:2888:3888
server.3=172.19.72.157:2888:3888

Startup and verification

Start a Flink Yarn Session in detached mode. After submission, it reports that the yarn application has been submitted successfully and returns its id; use yarn application -kill application_id to stop a task submitted on yarn;

yarn-session.sh -n 3 -jm 700 -tm 700 -s 8 -nm FlinkOnYarnSession -d -st

You can directly submit the built-in word-count example to verify whether on-yarn mode is configured successfully:

~/bin/flink run -m yarn-cluster -yn 4 -yjm 2048 -ytm 2048 ~/flink/examples/batch/WordCount.jar

Summary

This basically completes the deployment of the basic components of the big data platform (covering both batch and stream processing). In summary, deploying components one by one on top of Apache Hadoop is fairly troublesome: you have to match component versions yourself, there are many components to configure, and the configuration is tedious. The current running state of the deployed components is shown below (Spark and Flink in on-yarn mode have no standing processes):
