Analyzing application log data with Hadoop and Hive

Background on the overall pipeline:

1. Each application emits syslog messages in an agreed-upon format, and a syslog-ng server is deployed on the server side to collect all logs centrally (see the sketch after this list).

2. On the host running the syslog-ng server, the log files are sorted by type, and the log messages are forwarded to Storm for real-time stream statistics.

3. In addition, an rsync job runs in the early morning every day to ship the previous day's log files to the Hadoop/Hive servers for offline (non-real-time) analysis.
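As a reference for step 1, here is a minimal, hypothetical sketch of an application emitting one RFC 3164-style syslog line over UDP. The collector host, port, and the payload layout are assumptions for illustration, not the applications' actual format:

    import java.net.DatagramPacket;
    import java.net.DatagramSocket;
    import java.net.InetAddress;

    public class SyslogEmitter {
        public static void main(String[] args) throws Exception {
            String host = "syslog-ng.example.com"; // hypothetical collector host
            int port = 514;                        // standard syslog UDP port
            // <134> = facility local0 (16*8) + severity informational (6);
            // the payload after the header is only illustrative.
            String line = "<134>May 21 10:15:00 app01 mailworker: &QUEUE ...";
            byte[] data = line.getBytes("UTF-8");
            DatagramSocket socket = new DatagramSocket();
            try {
                socket.send(new DatagramPacket(data, data.length,
                        InetAddress.getByName(host), port));
            } finally {
                socket.close();
            }
        }
    }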


The detailed workflow for analyzing application log data with Hadoop and Hive:

1. Install Hadoop

Hadoop installation and configuration are described in detail in an earlier article of mine:

http://blog.csdn.net/jsjwk/article/details/8923999

2. Install Hive

Installing Hive is very simple; just download the release tarball:

wget http://mirrors.cnnic.cn/apache/hive/hive-0.10.0/hive-0.10.0.tar.gz

After unpacking it, you only need to adjust a couple of configuration entries that point Hive at the Hadoop cluster.


3. Create the table in Hive

    /**
     * Create the Hive mail-log table for the given date.
     * @param cal the date the table is created for
     * @return the name of the created table
     * @throws SQLException
     */
    public String createTable(Calendar cal) throws SQLException
    {
        String tableName = getTableName(cal.getTime());
        StringBuilder sql = new StringBuilder();
        sql.append("create table if not exists ");
        sql.append(tableName);
        sql.append("( ");
        sql.append("syslog_month string, ");
        // Presumably because syslog space-pads single-digit days ("May  5"),
        // splitting on ' ' yields an extra empty field on days 1-9, which this
        // extra column absorbs.
        if(cal.get(Calendar.DAY_OF_MONTH) < 10){
            sql.append("syslog_day_pre string, ");
        }
        sql.append("syslog_day string, ");
        sql.append("syslog_time string, ");
        sql.append("ip string, ");
        sql.append("source string, ");
        sql.append("message array<string>, ");
        sql.append("information1 string, ");
        sql.append("information2 string, ");
        sql.append("information3 string, ");
        sql.append("information4 string, ");
        sql.append("information5 string) ");
        sql.append("row format delimited fields terminated by ' ' ");
        sql.append("collection items terminated by ',' ");
        sql.append("map keys terminated by ':' ");

        LOG.info("[DDL for creating the Hive table] " + sql.toString());

        HiveUtil.createTable(sql.toString());
        return tableName;
    }
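
The getTableName and HiveUtil helpers referenced above are not shown in the original post. getTableName presumably derives a per-day table name from the date (e.g. something like maillog_20130521). Below is a minimal sketch of what HiveUtil might look like, assuming a HiveServer1 instance (started with "hive --service hiveserver", as shipped with hive-0.10.0) listening on localhost:10000 — the URL and port are assumptions:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.sql.Statement;

    public class HiveUtil {
        // Hive 0.10 ships the HiveServer1 JDBC driver; URL/port are assumptions.
        private static final String DRIVER = "org.apache.hadoop.hive.jdbc.HiveDriver";
        private static final String URL = "jdbc:hive://localhost:10000/default";

        private static Connection getConnection() throws SQLException {
            try {
                Class.forName(DRIVER);
            } catch (ClassNotFoundException e) {
                throw new SQLException("Hive JDBC driver not on the classpath", e);
            }
            return DriverManager.getConnection(URL, "", "");
        }

        // DDL returns no result set: execute and close.
        public static void createTable(String sql) throws SQLException {
            Connection conn = getConnection();
            try {
                conn.createStatement().execute(sql);
            } finally {
                conn.close();
            }
        }

        public static void loadData(String sql) throws SQLException {
            Connection conn = getConnection();
            try {
                conn.createStatement().execute(sql);
            } finally {
                conn.close();
            }
        }

        // The caller iterates the ResultSet, so the connection is left open here;
        // a production version would manage its lifecycle explicitly.
        public static ResultSet queryData(String sql) throws SQLException {
            return getConnection().createStatement().executeQuery(sql);
        }
    }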



4. Load the logs into Hive

    /**
     * Load a local file into the Hive mail-log table.
     * @param path the local path of the log file
     * @param tableName the target table
     * @throws SQLException
     */
    public void loadData(String path, String tableName) throws SQLException
    {
        StringBuilder sql = new StringBuilder();
        sql.append("load data local inpath ");
        sql.append("'");
        sql.append(path);
        sql.append("'");
//      sql.append(" overwrite into table "); // OVERWRITE would replace instead of append
        sql.append(" into table ");
        sql.append(tableName);

        LOG.info("[HiveQL for loading data into the Hive table] " + sql.toString());

        HiveUtil.loadData(sql.toString());
    }
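
Putting steps 3 and 4 together, a daily driver could create the previous day's table and load the file shipped over by rsync. This is a hypothetical sketch: LogAnalyzer stands in for whatever class holds createTable/loadData, and the file path is made up:

    import java.util.Calendar;

    public class DailyLoadJob {
        public static void main(String[] args) throws Exception {
            // LogAnalyzer is a stand-in name for the class containing
            // createTable() and loadData() from the sections above.
            LogAnalyzer analyzer = new LogAnalyzer();
            Calendar yesterday = Calendar.getInstance();
            yesterday.add(Calendar.DAY_OF_MONTH, -1); // the day rsync shipped
            String tableName = analyzer.createTable(yesterday);
            analyzer.loadData("/data/syslog/maillog.log", tableName); // hypothetical path
        }
    }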

5. Then you can run all kinds of simple queries according to your needs:

(1) Query all rows

    /**
     * Query all rows of the given table.
     */
    public ResultSet queryData(String tableName) throws SQLException
    {
        StringBuilder sql = new StringBuilder();
        sql.append("select syslog_month,syslog_day,syslog_time,ip,source,message,");
        sql.append("information1,information2,information3,information4,information5 ");
        sql.append("from ");
        sql.append(tableName);

        LOG.info("[HiveQL for querying all data] " + sql.toString());

        ResultSet res = HiveUtil.queryData(sql.toString());
        return res;
    }
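
A brief sketch of consuming that ResultSet, as a method that could sit next to queryData above; the column indexes follow the order of the SELECT list:

    public void printAllData(String tableName) throws SQLException {
        ResultSet res = queryData(tableName);
        while (res.next()) {
            // columns 1-5: syslog_month, syslog_day, syslog_time, ip, source
            System.out.println(res.getString(1) + " " + res.getString(2) + " "
                    + res.getString(3) + " " + res.getString(4) + " "
                    + res.getString(5));
        }
    }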

(2) Total delay grouped by UserId and CategoryId

    /**
     * Total and average delay grouped by UserId and CategoryId.
     */
    public List<Map<String,Object>> queryHiveDataForUserIdAndCategoryIdDelay(String tableName) throws SQLException
    {
        StringBuilder sql = new StringBuilder();
        // t1 holds the rows tagged &QUEUE (presumably enqueue events), t2 the
        // rows tagged &OUT/&WORKERERROR/&ERROR (presumably completions/failures).
        // The ON clause rebuilds the message id from t1's fields to pair each
        // enqueue with its completion; since the WHERE clause references both
        // sides, unmatched rows are dropped and the FULL OUTER JOIN effectively
        // behaves like an inner join.
        sql.append("select t2.message[4],t2.message[5],count(*),sum(t2.message[1]-t1.message[2]),sum(t2.message[1]-t1.message[2])/count(*) ");
        sql.append("from ");
        sql.append("(select * from " + tableName + " where message[0]='&QUEUE' ) t1 ");
        sql.append("FULL OUTER JOIN ");
        sql.append("(select * from " + tableName + " where message[0]='&OUT' OR message[0]='&WORKERERROR' OR message[0]='&ERROR' ) t2 ");
        sql.append("ON concat(t1.message[1],'0$',substring(t1.message[6],2,length(t1.message[6])-3))=t2.message[3] ");
        sql.append("where t2.message[1]>=t1.message[2] ");
        sql.append("group by t2.message[4],t2.message[5]");

        LOG.info("[HiveQL for the total delay grouped by UserId and CategoryId] " + sql.toString());

        HiveQueryResultSet res = (HiveQueryResultSet) HiveUtil.queryData(sql.toString());
        List<Map<String,Object>> list = new ArrayList<Map<String,Object>>();
        while(res.next())
        {
            int userId = res.getInt(1);         // t2.message[4]
            int categoryId = res.getInt(2);     // t2.message[5]
            int num = res.getInt(3);            // number of matched events
            double delay = res.getDouble(4);    // total delay
            double avgDelay = res.getDouble(5); // average delay

            Map<String,Object> map = new HashMap<String,Object>();
            map.put("userId", userId);
            map.put("categoryId", categoryId);
            map.put("num", num);
            map.put("delay", delay);
            map.put("avgDelay", avgDelay);
            list.add(map);
        }
        return list;
    }





