HDFS + Spark (Hive on Spark) + Flume/Shell architecture for big data analysis

Foreword

The company needs to run operational analysis on its data. To meet the needs of operational big data analysis, we decided to use Hadoop for data analysis and querying.

After some research, we planned to adopt the following architecture.



The game servers send log messages to the BI server over HTTP, and the BI server writes the log information to files through log4j. The log files are then imported into HDFS, and statistical queries on the data are run through Spark.

There are two ways to import log files into HDFS:

1. Flume

Periodically copy the log files into the directory monitored by Flume; Flume then automatically imports them into HDFS.

The advantage of this method is that the size of the HDFS files can be configured, so it does not produce many small files. The disadvantages are that the import is relatively slow, and that moving a large file into Flume's monitored directory raises an exception (there are workarounds online) that stops the Flume agent.

Flume also has other strengths, such as distributed collection; its weaknesses are that it stops when it hits an exception and that it handles large files poorly. In a test, copying a file of more than 400 MB into the monitored directory took nearly 10 minutes (single machine) to import into HDFS when the Flume channel used memory, and timed out when the channel used file mode. The memory channel, however, cannot guarantee message consistency.
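For reference, here is a minimal sketch of what conf/test.conf (the agent configuration used by the flume-ng command in section 3) might look like for a spooling-directory source feeding an HDFS sink. The spool directory, channel capacities, and 128 MB roll size are assumptions, not values from the original setup:

# agent1 watches a spool directory and writes the collected files into HDFS
agent1.sources = src1
agent1.channels = ch1
agent1.sinks = sink1

# Source: files copied into this directory are picked up automatically (directory is assumed)
agent1.sources.src1.type = spooldir
agent1.sources.src1.spoolDir = /data/bi/flume-spool
agent1.sources.src1.channels = ch1

# Channel: memory is faster but loses events if the agent dies; use type = file for durability
agent1.channels.ch1.type = memory
agent1.channels.ch1.capacity = 100000
agent1.channels.ch1.transactionCapacity = 1000

# Sink: roll HDFS files by size (128 MB here) so HDFS is not filled with small files
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.channel = ch1
agent1.sinks.sink1.hdfs.path = hdfs://10.10.31.35:9000/flume/logs/%Y%m%d
agent1.sinks.sink1.hdfs.fileType = DataStream
agent1.sinks.sink1.hdfs.useLocalTimeStamp = true
agent1.sinks.sink1.hdfs.rollSize = 134217728
agent1.sinks.sink1.hdfs.rollCount = 0
agent1.sinks.sink1.hdfs.rollInterval = 0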

2. Shell

Log files can be imported into HDFS directly with hadoop fs -put from a shell script. The advantage of this method is that it is fast and simple; the disadvantages are that it is single-machine rather than distributed, the size of the log files has to be controlled by yourself, successfully imported files have to be marked by yourself, and small files may still need to be merged on HDFS afterwards.
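As an illustration, here is a minimal sketch of such an import script. The local log directory, the per-day HDFS target path, and the .imported marker files are conventions chosen for this example, not part of the original setup:

#!/bin/bash
# Import finished log files into HDFS and mark the ones that succeeded.
LOG_DIR=/data/bi/logs                 # local directory written by log4j (assumed)
HDFS_DIR=/logs/$(date +%Y%m%d)        # one HDFS directory per day (assumed)

hadoop fs -mkdir -p "$HDFS_DIR"

for f in "$LOG_DIR"/*.log; do
    [ -e "$f" ] || continue           # no matching files
    [ -e "$f.imported" ] && continue  # already imported, skip
    if hadoop fs -put "$f" "$HDFS_DIR"/; then
        touch "$f.imported"           # mark success so the file is not imported twice
    else
        echo "failed to import $f" >&2
    fi
done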

 

The software versions used in this installation are:

Hadoop 2.6

spark-1.6.1-bin-hadoop2.6

Flume 1.6

 

1. Hadoop installation and configuration

This section describes a single-machine pseudo-distributed setup. Detailed walkthroughs are easy to find online, so only the key points are covered here.

1. Unpack Hadoop

2. Install JDK 7

3. vim /etc/profile and configure the JAVA_HOME and HADOOP_HOME environment variables (the full configuration is listed at the end of this article)

4. Set up passwordless SSH login

cd ~/.ssh/                     # if this directory does not exist, run ssh localhost once first

ssh-keygen -t rsa              # press Enter at every prompt

cat ./id_rsa.pub >> ./authorized_keys  # add the key to the list of authorized keys
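If the keys were set up correctly, logging in to localhost should no longer ask for a password:

ssh localhost                  # should connect without prompting for a password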

5. Modify the Hadoop configuration files (/hadoop/hadoop-2.6.0/etc/hadoop)

5.1 vim hadoop-env.sh

Add: export JAVA_HOME=${JAVA_HOME}

5.2 vim core-site.xml

 

<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>file:/usr/local/hadoop/hadoop-2.6.0/tmp</value>
  </property>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://10.10.31.35:9000</value>
  </property>
  <property>
    <name>fs.hdfs.impl</name>
    <value>org.apache.hadoop.hdfs.DistributedFileSystem</value>
    <description>The FileSystem for hdfs: uris.</description>
  </property>
</configuration>

 5.3 vim hdfs-site.xml

 

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.permissions</name>
    <value>false</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/usr/local/hadoop/hadoop-2.6.0/tmp/dfs/name</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/usr/local/hadoop/hadoop-2.6.0/tmp/dfs/data</value>
  </property>
</configuration>
5.4 vim mapred-site.xml

 

 

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>10.10.31.35:9001</value>
  </property>
</configuration>
5.5 vim yarn-site.xml

 

 

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <!-- Site specific YARN configuration properties -->
</configuration>

6. Format the NameNode

./bin/hdfs namenode -format

7. Start the Hadoop processes

./sbin/start-dfs.sh
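To confirm that the HDFS daemons are running (a quick check, not part of the original steps), jps should list a NameNode, a DataNode, and a SecondaryNameNode:

jps
# e.g. (process ids will differ):
# 12001 NameNode
# 12135 DataNode
# 12310 SecondaryNameNode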

 

This completes the Hadoop configuration. For a more detailed walkthrough, see http://www.powerxing.com/install-hadoop/

2. Spark installation and configuration

For details of the Spark installation and configuration, refer to http://www.thebigdata.cn/Hadoop/28957.html. To start the thriftserver, refer to http://blog.csdn.net/wind520/article/details/44061563. Once the thriftserver is running, Hive on Spark (hereinafter referred to as the Hive database) can be accessed through JDBC.

There are several ways to access the Hive database: through spark-sql, through beeline, or through JDBC (a beeline sketch follows the /etc/profile listing at the end of this article).

Create table example:

create table test(id int,name string) row format delimited fields terminated by '\t' stored as textfile location 'hdfs://10.10.31.35:9000/user/hive/warehouse/temp.db/test';

3. Flume installation and configuration

For details of the Flume installation and configuration, refer to http://www.flybi.net/blog/lp_hadoop/1241

Run Flume:

bin/flume-ng agent --conf-file conf/test.conf --name agent1 -Dflume.root.logger=INFO,console

Notes:

1. To allow access from the external network, vim /etc/hosts and add:

127.0.0.1 ip-10-10-31-35

The /etc/profile configuration mentioned in step 3 of the Hadoop installation:

export JRE_HOME=/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.99.x86_64/jdk1.8.0_73/jre

export PATH=$JAVA_HOME/bin:$JRE_HOME/bin:$PATH

export CLASSPATH=.:$JAVA_HOME/lib:$JRE_HOME/lib:$CLASSPATH

export HADOOP_HOME=/usr/local/hadoop/hadoop-2.6.0

export HADOOP_INSTALL=$HADOOP_HOME

export HADOOP_MAPRED_HOME=$HADOOP_HOME

export HADOOP_COMMON_HOME=$HADOOP_HOME

export HADOOP_HDFS_HOME=$HADOOP_HOME

export YARN_HOME=$HADOOP_HOME

export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native

export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin

 

export SCALA_HOME=/usr/local/scala/scala-2.11.8

export PATH=$SCALA_HOME/bin:$PATH

 

export SPARK_HOME=/usr/local/spark/spark-1.6.1-bin-hadoop2.6

export PATH=$SPARK_HOME/bin:$PATH
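Following up on section 2, here is a minimal sketch of querying the Hive database through beeline, assuming the thriftserver is running on this host on its default port 10000 and the test table from the create table example above exists:

beeline -u jdbc:hive2://10.10.31.35:10000 -e "select count(*) from test"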

