Hadoop: Building a Simulated Log Collection System

Copyright notice: this is an original post by the author; do not reproduce without permission. https://blog.csdn.net/javaee_sunny/article/details/80304817

1. Technology Overview

  • Nginx: essentially a web server; in this pipeline it acts as a reverse proxy that distributes user requests, and in a cluster it can also provide load balancing.
  • Spawn-fcgi: provides a CGI gateway so that a server process can be quickly exposed to the outside. The exposed service speaks the FastCGI protocol, a request-passing protocol closely related to HTTP; plain HTTP is comparatively less secure, since it is more exposed to outside attacks.
  • Thrift RPC: running the thrift command quickly generates client and server stubs for us. Because the RPC is cross-language, the two ends can be written in entirely different languages as long as they follow the same RPC protocol and interface definition. In production, implementing the server in C++ or Java greatly improves performance under concurrent load.
  • Flume: a pipeline attached to the log files produced by the log server; events flow through a source and a channel and finally reach a custom HBaseSink. We use a custom HBaseSink so that unstructured log lines can be converted into structured data stored in separate table columns, whereas the stock HBaseSink only lets us target columns without parsing the payload.
  • HBase: stores the structured user-behavior logs; later, Hive can be used to run statistics and analysis over the data in HBase.
  • Glog: Google's open-source logging framework; it writes log files quickly and lets you configure the log directory, the maximum size of a single file, the rotation period, and so on. Its role is similar to log4j, commonly used in Java.
  • ab: a load-testing tool that simulates many concurrent user requests and reports the average response time across all requests.

2. Task Breakdown

Environment:

Python: 2.7
Java: 1.7.0_80
Linux: CentOS 7

2.1 [Task 1]: Install Thrift and get a standalone Thrift demo working (Python)

1. Thrift version

thrift-0.9.3

2. Download the source package

wget http://apache.fayea.com/thrift/0.9.3/thrift-0.9.3.tar.gz

Install wget if it is missing:

yum -y install wget

3. Install Thrift

3.1 Install dependencies

Thrift's core is written in C++ and uses the Boost library. Install the build dependencies:

yum install automake libtool flex bison pkgconfig gcc-c++ boost-devel libevent-devel zlib-devel python-devel ruby-devel 

3.2 Extract the archive

tar xzvf thrift-0.9.3.tar.gz

3.3 Configure language support

Enter the extracted directory and configure which language bindings to build:

./configure --with-cpp --with-boost --with-python --without-csharp --with-java --without-erlang --without-perl --with-php --without-php_extension --without-ruby --without-haskell  --without-go

Running configure may fail with the following error:

configure: error: "Error: libcrypto required."

Fix:

Install openssl and openssl-devel:

yum -y install openssl openssl-devel

3.4 Build

> Run make (g++ must be installed first)
> Run make install

3.5 Verify the installation

> Run thrift or thrift -help
> Locate the binary: which thrift

4. Define the interface data format for client/server communication:

Define the schema from which the thrift command will generate client and server stubs:

cat RecSys.thrift
service RecSys {
    string rec_data(1:string data)
}

5. Generate code (Python) from the interface definition (schema)

Run:

thrift --gen py RecSys.thrift

If this step fails, install whatever the error message asks for, for example:

pip install thrift==0.9.3

After generation succeeds, the client code:

cat client.py
#! /usr/bin/env python
# -*- coding: utf-8 -*-

import sys
# Append the gen-py directory so the generated modules can be found
sys.path.append("gen-py")

from thrift import Thrift
from thrift.transport import TSocket
from thrift.transport import TTransport
from thrift.protocol import TBinaryProtocol

from RecSys import RecSys
# from demo.ttypes import *

try:
    # Make Socket
    # Create the socket; host and port must match the server
    transport = TSocket.TSocket('localhost', 9900)

    # Buffering is critical. Raw sockets are very slow
    # Choose the transport layer; it must match the server's setting
    transport = TTransport.TBufferedTransport(transport)

    # Wrap in a protocol
    # Choose the wire protocol; it must also match the server, otherwise the two sides cannot communicate
    protocol = TBinaryProtocol.TBinaryProtocol(transport)

    client = RecSys.Client(protocol)

    # Connect!
    transport.open()

    # Call server services
    rst = client.rec_data("are you ok!")
    print rst

    # close transport
    transport.close()
except Thrift.TException, ex:
    print "%s" % (ex.message)

The server code:

cat server.py 
#! /usr/bin/env python
# -*- coding: utf-8 -*-

import sys
sys.path.append('gen-py')

from thrift.transport import TSocket
from thrift.transport import TTransport
from thrift.protocol import TBinaryProtocol
from thrift.server import TServer

from RecSys import RecSys
from RecSys.ttypes import *


class RecSysHandler(RecSys.Iface):
    def rec_data(self, a):
        print "Receive: %s" %(a)
        return "ok"


if __name__ == "__main__":

    # Instantiate the handler
    handler = RecSysHandler()

    # Create the processor
    processor = RecSys.Processor(handler)

    # Listen on the chosen host and port
    transport = TSocket.TServerSocket('localhost', port=9900)

    # Transport factory
    tfactory = TTransport.TBufferedTransportFactory()

    # Protocol factory
    pfactory = TBinaryProtocol.TBinaryProtocolFactory()

    server = TServer.TThreadedServer(processor, transport, tfactory, pfactory)

    print 'Starting the server...'
    server.serve()
    print 'done'

6. Start and test

Start the server first, then the client:

1. python server.py
2. python client.py

If a module is reported missing at startup, install the matching version with pip.

Example:

pip install thrift==0.9.3

6.1 Download pip

wget "https://pypi.python.org/packages/source/p/pip/pip-1.5.4.tar.gz#md5=834b2904f92d46aaa333267fb1c922bb"  --no-check-certificate

6.2 Install pip

# tar -xzvf pip-1.5.4.tar.gz
# cd pip-1.5.4
# python setup.py install

6.3 Install packages with pip

# pip install SomePackage
  [...]
  Successfully installed SomePackage


7. Smoke test

Once both sides are up, send a request to verify everything works.

With the steps above, a minimal C/S architecture (Python) is complete.

2.2 [Task 2]: Get a standalone Thrift demo working (C++)

For C++, thrift generates only the server skeleton for us. We use C++ because of its higher performance and better behavior under concurrency, which makes it the most common choice in production.

1. Run the thrift command to generate the server code

thrift --gen cpp RecSys.thrift

2. Install this package before compiling, or the build may fail (skip if it is already installed):

yum install boost-devel-static

3. Copy the headers into one place for easier management

cp -raf thrift/ /usr/local/include/

4. Compile

Compile the server:

The thrift command created a gen-cpp directory; enter it and compile:

g++ -g -Wall -I./ -I/usr/local/include/thrift RecSys.cpp RecSys_constants.cpp RecSys_types.cpp RecSys_server.skeleton.cpp -L/usr/local/lib/*.so -lthrift -o server

The client code has to be written by hand; an example:

cat client.cpp 
#include "RecSys.h"
#include <iostream>
#include <string>


#include <transport/TSocket.h>
#include <transport/TBufferTransports.h>
#include <protocol/TBinaryProtocol.h>


using namespace apache::thrift;
using namespace apache::thrift::protocol;
using namespace apache::thrift::transport;

using namespace std;
using std::string;
using boost::shared_ptr;

int main(int argc, char **argv) {

    boost::shared_ptr<TSocket> socket(new TSocket("localhost", 9090));
    boost::shared_ptr<TTransport> transport(new TBufferedTransport(socket));
    boost::shared_ptr<TProtocol> protocol(new TBinaryProtocol(transport));

    transport->open();

    RecSysClient client(protocol);

    string send_data = "are you ok?";
    string receive_data;

    client.rec_data(receive_data, send_data);
    cout << "receive server data: " << receive_data << endl;

    transport->close();
    return 0;
}

Compile the client:

g++ -g -Wall -I./ -I/usr/local/include/thrift RecSys.cpp client.cpp -L/usr/local/lib/*.so -lthrift -o client

5. Run

Run the server:

./server

(1) You may hit this error at startup:

./server: error while loading shared libraries: libthrift-0.9.3.so: cannot open shared object file: No such file or directory

Fix: edit the ld.so.conf file

(py27tf) [root@singler cgi_demo]# vim /etc/ld.so.conf
  1 include ld.so.conf.d/*.conf
  2 /usr/local/lib

Reload the configuration: ldconfig

(2) Running it a second time may raise the following:

[root@master gen-cpp]# ./server
Thrift: Tue May 15 12:49:33 2018 TServerSocket::listen() BIND 9090
terminate called after throwing an instance of 'apache::thrift::transport::TTransportException'
  what():  Could not bind: Transport endpoint is not connected
Aborted

Fix: find the instance that is already running:

ps aux

and kill it:

kill -9 <PID>

Run the client:

./client

Note:

After modifying the client or server code, delete the old client and server binaries and recompile; the delete command:

rm -rf server client

2.3 [Task 3]: Cross-language interop with Thrift

Thrift RPC is cross-language: as long as the client and server follow the same RPC protocol and interface definition, the two ends can be implemented in different languages.

client: python
server: c++

1. Run the thrift command to generate the server code

thrift --gen cpp RecSys.thrift

This creates the gen-cpp directory; the code in it implements only the server side.

2. Modify the server code as follows:

Tweak the generated server skeleton slightly to make testing easier:

void rec_data(std::string& _return, const std::string& data) {
  // Your implementation goes here
  printf("==============\n");
  std::cout << "receive client data: " << data << std::endl;

  std::string ack = "i am ok !!!";
  _return = ack;
}

3. The client is still the Python code from Task 1:

Only the port number needs to be updated to match the server.

4. Run

Run the server:

./server

If the port is already occupied at startup, find the occupying process and kill it:

ps aux
kill -9 <PID>

Run the client:

python client.py

2.4 [Task 4]: Set up the Nginx server

1. nginx版本

nginx-1.14.0

2. Download:

wget http://nginx.org/download/nginx-1.14.0.tar.gz

3. Extract:

tar xvzf nginx-1.14.0.tar.gz 

4. Configure the install path:

After extracting, enter the nginx-1.14.0 directory and set the install prefix:

./configure --prefix=/usr/local/nginx

5. Install dependencies

If configure fails, try installing these packages:

yum -y install pcre-devel zlib-devel

6. Build

Back in the nginx-1.14.0 source directory, compile and install:

make
make install

7. Start nginx

From the install directory /usr/local/nginx:

./sbin/nginx

8. Smoke test

Open the IP of the machine running nginx in a browser; if the nginx welcome page appears, the install succeeded (screenshot omitted).


9. Check the listening port

netstat -antup | grep -w 80
ps aux | grep nginx

10. Kill the nginx processes

killall -9 nginx

A minimal CentOS 7 install does not include the killall command; it is provided by:

yum install psmisc

2.5 [Task 5]: A standalone server with spawn-fcgi

Through the CGI gateway interface, your own server can be exposed for external requests. Think of it as a proxy: users on other machines can reach the application and send it requests.

1. Download spawn-fcgi

wget http://download.lighttpd.net/spawn-fcgi/releases-1.6.x/spawn-fcgi-1.6.4.tar.gz

2. Extract

tar xzvf spawn-fcgi-1.6.4.tar.gz

3. The usual three steps: configure, build, install

./configure
make
make install

4. Copy the binary so all binaries live in one place

cp src/spawn-fcgi /usr/local/nginx/sbin/

5. Install fcgi

fcgi is the library that CGI applications depend on.

5.1 Download

wget ftp://ftp.ru.debian.org/gentoo-distfiles/distfiles/fcgi-2.4.1-SNAP-0910052249.tar.gz

5.2 Extract

tar xzvf fcgi-2.4.1-SNAP-0910052249.tar.gz

5.3 Patch one line of code

In the extracted directory, locate and edit fcgio.h:

]# find . -name fcgio.h
./include/fcgio.h
]# vim ./include/fcgio.h

Below #include <iostream>, add a standard I/O header:

#include <cstdio>

5.4 Configure, build, install

./configure
make
make install

6. Create an fcgi demo:

(py27tf) [root@singler cgi_demo]# vim test.c

The code:

#include <stdio.h>
#include <stdlib.h>
#include <fcgi_stdio.h>
 
int main() {
 
    int count = 0;
    while(FCGI_Accept() >= 0) {
        printf("Content-type: text/html\r\n"
                "\r\n"
                ""
                "Hello Badou EveryBody!!!"
                "Request number %d running on host %s "
                "Process ID: %d\n ", ++count, getenv("SERVER_NAME"), getpid());
 
    }
 
    return 0;
}

7. Compile the code:

gcc -g -o test test.c -lfcgi

8. If the linker complains that the library cannot be found, edit the ld.so.conf file

(py27tf) [root@singler cgi_demo]# vim /etc/ld.so.conf
  1 include ld.so.conf.d/*.conf
  2 /usr/local/lib

Reload the configuration: ldconfig

9. Run the binary to test:

./test

10. Once the test passes, start spawn-fcgi as the proxy:

/usr/local/nginx/sbin/spawn-fcgi -a 127.0.0.1 -p 8088 -f /root/thrift_test/cgi_demo/test

Options:

-f path of the application binary to launch
-p port on which the launched (resident) application process accepts external requests

11. Check that the port is listening

netstat -antup |grep 8088 

12. Configure the nginx reverse proxy:

Configure nginx as a reverse proxy so that user requests hit the server exposed via CGI.

The reverse proxy configuration lives in nginx.conf:

location / {
    root   html;
    index  index.html index.htm;
}

location ~ /recsys$ {
    fastcgi_pass 127.0.0.1:8088;
    include fastcgi_params;
}

13. Start nginx

 ./nginx/sbin/nginx

14. Test (replace the IP with your own):

http://192.168.87.100/recsys

The test code above can be upgraded further: as written it is a read-only demo that cannot receive or parse parameters, which limits its extensibility. It should be hardened so that it can parse requests such as:

http://192.168.87.100/recsys?itemid=111&userid=012&action=click&ip=10.11.11.10
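The upgrade described above boils down to splitting QUERY_STRING into key/value pairs. A minimal Python sketch of that parsing step (the function name is made up for illustration; the real work would happen inside the FastCGI loop):

```python
def parse_action_log(query_string):
    """Split a raw QUERY_STRING such as 'a=1&b=2' into a dict."""
    params = {}
    for pair in query_string.split("&"):
        if "=" in pair:
            # partition keeps everything after the first '=' as the value
            key, _, value = pair.partition("=")
            params[key] = value
    return params

print(parse_action_log("itemid=111&userid=012&action=click&ip=10.11.11.10"))
```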

2.6 [Task 6]: Combine Thrift RPC and spawn-fcgi into a log server

1. The C++ client code:

cat client.cpp 
#include "RecSys.h"
#include <iostream>
#include <string>


#include <transport/TSocket.h>
#include <transport/TBufferTransports.h>
#include <protocol/TBinaryProtocol.h>


#include <fcgi_stdio.h>
#include <fcgiapp.h>


using namespace apache::thrift;
using namespace apache::thrift::protocol;
using namespace apache::thrift::transport;


using namespace std;
using std::string;
using boost::shared_ptr;


inline void send_response(
            FCGX_Request& request, const std::string& resp_str) {


    FCGX_FPrintF(request.out, "Content-type: text/html;charset=utf-8\r\n\r\n");
    FCGX_FPrintF(request.out, "%s", resp_str.c_str());
    FCGX_Finish_r(&request);
}


int main(int argc, char **argv) {
    // step 1. init fcgi
    FCGX_Init();
    FCGX_Request request;
    FCGX_InitRequest(&request, 0, 0);


    // step 2. connect server rpc
    boost::shared_ptr<TSocket> socket(new TSocket("localhost", 9090));
    boost::shared_ptr<TTransport> transport(new TBufferedTransport(socket));
    boost::shared_ptr<TProtocol> protocol(new TBinaryProtocol(transport));


    transport->open();
    RecSysClient client(protocol);


    while(FCGX_Accept_r(&request) >= 0) {


        // http page -> client
        std::string send_data = FCGX_GetParam("QUERY_STRING", request.envp);


        string receive_data;


        // client -> server
        // server -> client
        client.rec_data(receive_data, send_data);


        cout << "receive http params: " << send_data << std::endl;
        cout << "receive server data: " << receive_data << endl;


        // client -> http page
        send_response(request, receive_data);
    }


    transport->close();


    return 0;
}

Note:

Be careful with the header include paths here; I tripped over this myself.

The compile command:

g++ -g -Wall -I./ -I/usr/local/include RecSys.cpp client.cpp -L/usr/local/lib/*.so -lthrift -lfcgi -o client

Note that the include path here is /usr/local/include, and the includes in client.cpp were changed to match, otherwise the headers would not be found:

#include <thrift/transport/TSocket.h>
#include <thrift/transport/TBufferTransports.h>
#include <thrift/protocol/TBinaryProtocol.h>

#include <fcgi_stdio.h>
#include <fcgiapp.h>

When you are not sure where a header file lives, search the whole filesystem:

find / -name <file name>
Example: find / -name fcgiapp.h

Here / means search from the root, and -name matches by file name.

2. Putting the compile commands together

The Makefile becomes:

(py27tf) [root@singler gen-cpp]# cat Makefile 
G++ = g++
CFLAGS = -g -Wall
INCLUDES = -I./ -I/usr/local/include/thrift
LIBS = -L/usr/local/lib/*.so -lthrift -lfcgi


OBJECTS = RecSys.cpp RecSys_constants.cpp RecSys_types.cpp RecSys_server.skeleton.cpp
CLI_OBJECTS = RecSys.cpp client.cpp


server: $(OBJECTS)
        $(G++) $(CFLAGS) $(INCLUDES) $(OBJECTS) $(LIBS) -o server


client: $(CLI_OBJECTS)
        $(G++) $(CFLAGS) $(INCLUDES) $(CLI_OBJECTS) $(LIBS) -o client


.PHONY: clean
clean:
        rm -rf server client

3. Run

Run the server:

./server

Check that the supporting services are up:

nginx: netstat -antup | grep nginx
cgi proxy: netstat -antup | grep 8088

4. Start the CGI service

/usr/local/nginx/sbin/spawn-fcgi -a 127.0.0.1 -p 8088 -f /root/thrift_test/thrift_demo/gen-cpp/client

5. Start nginx

Set up the reverse proxy as in Task 5 (not repeated here).

/usr/local/nginx/sbin/nginx

Check whether nginx started successfully:

nginx:netstat -antup | grep nginx

6. Test from a browser

http://192.168.175.20/recsys?item=111&userid=123

If the request returns data and the server prints the corresponding log line, everything is working.

2.7 [Task 7]: ab load testing to simulate log traffic

ab is a very commonly used load-testing tool.

1. Install the ab load-testing tool with yum

This can be run from any directory:

yum -y install httpd-tools

2. Verify the installation

ab -V

3. Load-test the Task 6 service with ab

ab \
  -c 20 \
  -n 5000 \
  'http://192.168.87.100/recsys?itemids=111,222,333,444&userid=012&action=click&ip=10.11.11.10'

Options:

-c number of requests to issue concurrently (default: one at a time)
-n total number of requests

Quote the request URL; otherwise the shell treats everything after an & as a background job.
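What ab measures can be pictured with a small Python sketch: fire n requests across c worker threads and average the latencies. The request function here is a stub (time.sleep) standing in for a real HTTP call; this only illustrates the measurement, not ab itself.

```python
import time
from threading import Thread

def load_test(request_fn, n, c):
    """Issue n calls to request_fn across c threads; return mean latency in seconds."""
    latencies = []  # list.append is atomic in CPython, so safe across threads
    def worker(count):
        for _ in range(count):
            start = time.time()
            request_fn()
            latencies.append(time.time() - start)
    threads = [Thread(target=worker, args=(n // c,)) for _ in range(c)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(latencies) / len(latencies)

# Stub request: ~1 ms of simulated server work
print("mean latency: %.4f s" % load_test(lambda: time.sleep(0.001), n=100, c=5))
```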

The output reports the average response time across all requests (screenshot omitted).



2.8 [Task 8]: Writing logs with glog (Google's logging module)

glog plays the same role as log4j commonly does in Java.

1. Download glog:

git clone https://github.com/google/glog.git

2. Build glog:

./autogen.sh && ./configure && make && make install

Afterwards, the libglog* libraries appear under /usr/local/lib.

3. Extend the server code:

3.1 Include the header first:

#include <glog/logging.h>

3.2 Initialize glog at the start of the main flow

// Directory where the log files are written
FLAGS_log_dir = "/root/thrift_test/thrift_demo/gen-cpp/logs";
google::InitGoogleLogging(argv[0]);

3.3 Logging statements in code:

LOG(INFO) << data;
LOG(ERROR) << data;
LOG(WARNING) << data;
LOG(FATAL) << data;

Note: logging at FATAL level terminates the program. To watch continuous log output, use the INFO or WARNING level.

3.4 Run

Compile and run the server:

g++ -g -Wall -I./ -I/usr/local/include RecSys.cpp RecSys_constants.cpp RecSys_types.cpp RecSys_server.skeleton.cpp -L/usr/local/lib/*.so -lthrift -lglog -o server

Start the client via spawn-fcgi:

/usr/local/bin/spawn-fcgi -a 127.0.0.1 -p 8088 -f /usr/local/src/thrift_test/gen-cpp/client

Start nginx:

/usr/local/nginx/sbin/nginx

Note:

Bring up server -> client -> nginx in that order, checking each one.

4. Run the load test to exercise the logging

ab -c 2 -n 50 'http://192.168.175.20/recsys?userid=110&itemid=333&type=show&ip=10.11.1.2'

2.9 [Task 9]: Hook up Flume for a real-time stream

Two Flume agents work together: one lives on the log server, the other feeds HBase.

log server + flume    --  >   flume + hbase

Start the Flume agent on the log server:

./bin/flume-ng agent \
  -c conf \
  -f conf/flume-client.properties \
  -n a1 \
  -Dflume.root.logger=INFO,console

Start the Flume agent on the HBase machine:

./bin/flume-ng agent \
  -c conf \
  -f conf/flume-server.properties \
  -n a1 \
  -Dflume.root.logger=INFO,console

The log line (log data) format is as follows:

I0513 14:49:56.568998 33273 RecSys_server.skeleton.cpp:32] userid=110&itemid=333&type=show&ip=192.11.1.200
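The interesting part of each glog line is the key=value payload at the end. A small, purely illustrative Python sketch of extracting it (it assumes the payload is the last space-separated field, as in the sample line above):

```python
def parse_glog_line(line):
    """Pull the key=value payload out of a glog-formatted log line."""
    payload = line.strip().split(" ")[-1]  # e.g. "userid=110&itemid=333&..."
    fields = {}
    for pair in payload.split("&"):
        key, _, value = pair.partition("=")
        fields[key] = value
    return fields

line = ("I0513 14:49:56.568998 33273 RecSys_server.skeleton.cpp:32] "
        "userid=110&itemid=333&type=show&ip=192.11.1.200")
print(parse_glog_line(line))
```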

In practice, simplify first and test with a single Flume agent consuming the INFO-level log.

The Flume config file (note that it must point at the correct log file path):

[root@master conf]# vi log_exec_console.conf
# example.conf: A single-node Flume configuration

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /usr/local/tmp/logs/gen-cpp/server.INFO

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Start flume-ng with the following command:

flume-ng agent --conf $FLUME_HOME/conf --conf-file $FLUME_HOME/conf/log_exec_console.conf --name a1 -Dflume.root.logger=INFO,console

The output log looks like this:

2018-05-16 23:12:12,102 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{} body: 57 30 35 31 36 20 32 33 3A 31 32 3A 30 36 2E 30 W0516 23:12:06.0 }
2018-05-16 23:12:12,103 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{} body: 49 30 35 31 36 20 32 33 3A 31 32 3A 30 36 2E 30 I0516 23:12:06.0 }
2018-05-16 23:12:12,106 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{} body: 57 30 35 31 36 20 32 33 3A 31 32 3A 30 36 2E 30 W0516 23:12:06.0 }
2018-05-16 23:12:12,108 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{} body: 49 30 35 31 36 20 32 33 3A 31 32 3A 30 36 2E 30 I0516 23:12:06.0 }
(further events repeat in the same pattern and are omitted)

2.10 [Task 10]: Write the logs into HBase

Here the log server and HBase are simulated on different machines: a Flume client collects the logs and ships them to a Flume server running on the HBase machine.

HBase stack versions:

1. hadoop : hadoop-1.2.1
2. java : jdk1.7.0_80
3. hbase : hbase-0.98.24-hadoop1

Flume version:

4. flume : apache-flume-1.6.0-bin

Note: versions are called out explicitly because a mismatch here cost me several days of debugging.

1. Create the table in HBase

Run the following HBase shell commands:

create 'user_action_table','action_log'

put 'user_action_table', '111', 'action_log:userid', '2002xue'
put 'user_action_table', '111', 'action_log:itemid', '12345'
put 'user_action_table', '111', 'action_log:type', 'click'
put 'user_action_table', '111', 'action_log:ip', '11.10.5.27'

scan 'user_action_table'

truncate 'user_action_table'

describe 'user_action_table'
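To see what these commands build, here is a toy in-memory model of the table in Python: one row key mapping to "family:qualifier" columns. This is purely illustrative; no HBase client library is involved.

```python
# Toy model of an HBase table: {row_key: {"family:qualifier": value}}
table = {}

def put(row_key, column, value):
    """Mimic the shell's put: set one column of one row."""
    table.setdefault(row_key, {})[column] = value

put("111", "action_log:userid", "2002xue")
put("111", "action_log:itemid", "12345")
put("111", "action_log:type", "click")
put("111", "action_log:ip", "11.10.5.27")

# "scan": walk rows in key order, like `scan 'user_action_table'`
for row_key in sorted(table):
    print(row_key, table[row_key])
```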

2. The flume-server on the HBase machine

Config file contents:

[root@master conf]# cat flume-server.properties 
#agent1 name
a1.channels = c1
a1.sources = r1
a1.sinks = k1

#set channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# other node, slave to master
a1.sources.r1.type = avro
a1.sources.r1.bind = 192.168.175.10
a1.sources.r1.port = 52020

# set sink to hdfs
#a1.sinks.k1.type = logger

a1.sinks.k1.type = hbase
a1.sinks.k1.table = user_action_table
a1.sinks.k1.columnFamily = action_log
#a1.sinks.k1.serializer = org.apache.flume.sink.hbase.RegexHbaseEventSerializer
a1.sinks.k1.serializer = com.badou.hbase.FlumeHbaseEventSerializer
a1.sinks.k1.serializer.columns = userid,itemid,type,ip

a1.sources.r1.channels = c1
a1.sinks.k1.channel=c1

The hand-written serializer (modeled on Flume's RegexHbaseEventSerializer):

package com.badou.hbase;

import java.nio.charset.Charset;
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Date;
import java.util.List;
import java.util.Locale;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.regex.Pattern;

import org.apache.commons.lang.RandomStringUtils;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.FlumeException;
import org.apache.flume.conf.ComponentConfiguration;
import org.apache.flume.sink.hbase.HbaseEventSerializer;
import org.apache.hadoop.hbase.client.Increment;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Row;
import com.google.common.base.Charsets;
import com.google.common.collect.Lists;


public class FlumeHbaseEventSerializer implements HbaseEventSerializer {
  
	// Config vars  
    /** Regular expression used to parse groups from event data. */  
    public static final String REGEX_CONFIG = "regex";  
    public static final String REGEX_DEFAULT = " ";  
    /** Whether to ignore case when performing regex matches. */  
    public static final String IGNORE_CASE_CONFIG = "regexIgnoreCase";  
    public static final boolean INGORE_CASE_DEFAULT = false;  
    /** Comma separated list of column names to place matching groups in. */  
    public static final String COL_NAME_CONFIG = "colNames";  
    public static final String COLUMN_NAME_DEFAULT = "ip";  
    /** Index of the row key in matched regex groups */  
    public static final String ROW_KEY_INDEX_CONFIG = "rowKeyIndex";  
    /** Placeholder in colNames for row key */  
    public static final String ROW_KEY_NAME = "ROW_KEY";  
    /** Whether to deposit event headers into corresponding column qualifiers */  
    public static final String DEPOSIT_HEADERS_CONFIG = "depositHeaders";  
    public static final boolean DEPOSIT_HEADERS_DEFAULT = false;  
    /** What charset to use when serializing into HBase's byte arrays */  
    public static final String CHARSET_CONFIG = "charset";  
    public static final String CHARSET_DEFAULT = "UTF-8";  
    /* 
     * This is a nonce used in HBase row-keys, such that the same row-key never 
     * gets written more than once from within this JVM. 
     */  
    protected static final AtomicInteger nonce = new AtomicInteger(0);  
    protected static String randomKey = RandomStringUtils.randomAlphanumeric(10);  
    protected byte[] cf;  
    private byte[] payload;  
    private List<byte[]> colNames = Lists.newArrayList();  
    private boolean regexIgnoreCase;  
    private Charset charset;  
    @Override  
    public void configure(Context context) {  
        String regex = context.getString(REGEX_CONFIG, REGEX_DEFAULT);  
        regexIgnoreCase = context.getBoolean(IGNORE_CASE_CONFIG, INGORE_CASE_DEFAULT);  
        context.getBoolean(DEPOSIT_HEADERS_CONFIG, DEPOSIT_HEADERS_DEFAULT);  
        Pattern.compile(regex, Pattern.DOTALL + (regexIgnoreCase ? Pattern.CASE_INSENSITIVE : 0));  
        charset = Charset.forName(context.getString(CHARSET_CONFIG, CHARSET_DEFAULT));  
  
        String cols = context.getString("columns");  
        String colNameStr;  
        if (cols != null && !"".equals(cols)) {  
            colNameStr = cols;  
        } else {  
            colNameStr = context.getString(COL_NAME_CONFIG, COLUMN_NAME_DEFAULT);  
        }  
  
        String[] columnNames = colNameStr.split(",");  
        for (String s : columnNames) {  
            colNames.add(s.getBytes(charset));  
        }  
    }  
    
    @Override  
    public void configure(ComponentConfiguration conf) {}  
  
    @Override  
    public void initialize(Event event, byte[] columnFamily) {  
        event.getHeaders();  
        this.payload = event.getBody();  
        this.cf = columnFamily;  
    }  
	
    protected byte[] getRowKey(Calendar cal) {  
        // Row key = ip field of the log payload + an incrementing nonce
        String str = new String(payload, charset);  
        String tmp = str.replace("\"", "");  
        String[] arr = tmp.split(" ");  
        String log_data = arr[5];
        String[] param_arr = log_data.split("&");
        String ip_str = param_arr[3];
        
        String rowKey = ip_str + "-" + nonce.getAndIncrement();
        
        return rowKey.getBytes(charset);  
    }  
  
    protected byte[] getRowKey() {  
        return getRowKey(Calendar.getInstance());  
    }  

    @Override  
    public List<Row> getActions() throws FlumeException {  
        List<Row> actions = Lists.newArrayList();  
        byte[] rowKey;  
  
        String body = new String(payload, charset);  
        System.out.println("body===>"+body);
        String tmp = body.replace("\"", ""); 
        System.out.println("tmp===>"+tmp);
//        String[] arr = tmp.split(REGEX_DEFAULT); 
        String[] arr = tmp.split(" ");
        System.out.println("arr[1]===>"+arr[1].toString());
        System.out.println("arr[2]===>"+arr[2].toString());
        System.out.println("arr[3]===>"+arr[3].toString());
        System.out.println("arr[4]===>"+arr[4].toString());
        System.out.println("arr[5]===>"+arr[5].toString());
        String log_data = arr[5];
        String[] param_arr = log_data.split("&");
        
        String userid = param_arr[0].split("=")[1];
        String itemid = param_arr[1].split("=")[1];
        String type = param_arr[2].split("=")[1];
        String ip_str = param_arr[3].split("=")[1];
                      
        // Debug output for the parsed fields
        System.out.println("userid=" + userid + ", itemid=" + itemid
                + ", type=" + type + ", ip=" + ip_str);
         
        try {  
            rowKey = getRowKey();
            Put put = new Put(rowKey);  
            put.add(cf, colNames.get(0), userid.getBytes(Charsets.UTF_8));  
            put.add(cf, colNames.get(1), itemid.getBytes(Charsets.UTF_8));  
            put.add(cf, colNames.get(2), type.getBytes(Charsets.UTF_8));
            put.add(cf, colNames.get(3), ip_str.getBytes(Charsets.UTF_8));
            actions.add(put);  
        } catch (Exception e) {  
            throw new FlumeException("Could not get row key!", e);  
        }  
        return actions;  
    }  
  
    @Override  
    public List<Increment> getIncrements() {  
        return Lists.newArrayList();  
    }  
  
    @Override  
    public void close() {}  
  
    public static String getDate2Str(String dataStr) {  
        SimpleDateFormat formatter = null;  
        SimpleDateFormat format = null;  
        Date date = null;  
        try {  
            formatter = new SimpleDateFormat("dd/MMM/yyyy:hh:mm:ss", Locale.ENGLISH);  
            date = formatter.parse(dataStr);  
            format = new SimpleDateFormat("yyyy-MM-dd-HH:mm:ss");  
        } catch (Exception e) {  
            e.printStackTrace();  
        }  
  
        return format.format(date);  
    }  
}
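Stripped of the Flume plumbing, the serializer's job can be sketched in a few lines of Python: take the event body, grab the key=value payload, build an "ip-nonce" row key, and map each value onto a column in the action_log family. This mirrors the Java code above but is only an illustration (the Java getRowKey() keeps the raw "ip=..." token in the row key; here the bare IP is used for readability).

```python
import itertools

nonce = itertools.count()  # mirrors the AtomicInteger nonce in the Java code

def serialize(event_body, columns=("userid", "itemid", "type", "ip")):
    """Return (row_key, {column: value}) for one glog event body."""
    payload = event_body.split(" ")[-1]  # "userid=110&itemid=333&..."
    values = dict(pair.split("=", 1) for pair in payload.split("&"))
    row_key = "%s-%d" % (values["ip"], next(nonce))
    cols = dict(("action_log:" + c, values[c]) for c in columns)
    return row_key, cols

body = ("I0513 14:49:56.568998 33273 RecSys_server.skeleton.cpp:32] "
        "userid=110&itemid=333&type=show&ip=192.11.1.200")
print(serialize(body))
```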

Package the serializer class into a jar and place it under Flume's lib directory:

/usr/local/src/apache-flume-1.6.0-bin/lib

I also hit an exception at this step (an array index out of bounds); printing verbose debug output helped track down the cause.

Also note: the Flume and HBase jars referenced when compiling this Java code must match the versions on the server, and so must the Java version used to compile.

A Flume/HBase version mismatch can produce errors such as:

Exception in thread "SinkRunner-PollingRunner-DefaultSinkProcessor" java.lang.NoSuchMethodError: org.apache.hadoop.hbase.client.Put.setWriteToWAL(Z)V

See also: https://cloud.tencent.com/developer/article/1025430


Command to start the flume-server:

flume-ng agent --conf $FLUME_HOME/conf --conf-file $FLUME_HOME/conf/flume-server.properties --name a1 -Dflume.root.logger=INFO,console

3. The flume-client runs on a different VM:

Its config file contents:

[root@master conf]# cat flume_client.conf 
#a1 name
a1.channels = c1
a1.sources = r1
a1.sinks = k1

#set channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

a1.sources.r1.channels = c1
a1.sources.r1.type = exec
#a1.sources.r1.command = tail -F /tmp/1.log
a1.sources.r1.command = tail -F /usr/local/tmp/logs/gen-cpp/server.INFO

# set sink1
a1.sinks.k1.channel = c1
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = 192.168.175.10
a1.sinks.k1.port = 52020

a1.sources.r1.channels = c1
a1.sinks.k1.channel=c1

Start the flume_client:

flume-ng agent --conf $FLUME_HOME/conf --conf-file $FLUME_HOME/conf/flume_client.conf --name a1 -Dflume.root.logger=INFO,console

4. Start the server

[root@master gen-cpp]# pwd
/usr/local/src/thrift_test/gen-cpp
[root@master gen-cpp]# ./server

5. Start the client via spawn-fcgi

/usr/local/bin/spawn-fcgi -a 127.0.0.1 -p 8088 -f /usr/local/src/thrift_test/gen-cpp/client

6. Start nginx

/usr/local/nginx/sbin/nginx

7. Test from a browser

http://192.168.175.20/recsys?userid=111&itemid=123&type=click&ip=10.1.8.27

8. Query HBase

scan 'user_action_table'
