Hive directly reads HBase and MySQL data

0. Overview

Hive provides the StorageHandler interface for accessing data stored in a variety of external components. HBase ships with an HBaseStorageHandler, so Hive can access data in HBase by creating an external mapping table. However, the company's CDH cluster runs a relatively old Hive version that does not natively include the newer JdbcStorageHandler, so accessing JDBC data sources can only be done by adding a third-party library.

1. Hive access to HBase

use ods_sdb;
create external table if not exists ods_sdb.$v_table(
   ajbs string comment '标识',
   hytcyqdrq string comment '合议庭成员确定日期',
   splcbgkyy string comment '审判流程不公开原因',
   ajgyxx_stm string comment '实体码',
   bygksplc string comment '不宜公开审判流程',
   jbfy string comment '经办法院',
   labmbs string comment '立案部门标识',
   splcygk string comment '审判流程已公开',
   ajgyxx_ajbs string comment '案件标识',
   ajmc string comment '案件名称',
   stm string comment '实体码',
   cbbmbs string comment '承办部门标识'
) comment '概要信息'
stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
with serdeproperties (
    'hbase.columns.mapping' = ':key,f:anjiangaiyaoxinxi.heyitingchengyuanquedingriqi,f:anjiangaiyaoxinxi.shenpanliuchengbugongkaiyuanyin,f:anjiangaiyaoxinxi.shitima,f:anjiangaiyaoxinxi.buyigongkaishenpanliucheng,f:anjiangaiyaoxinxi.jingbanfayuan,f:anjiangaiyaoxinxi.lianbumenbiaozhi,f:anjiangaiyaoxinxi.shenpanliuchengyigongkai,f:anjiangaiyaoxinxi.anjianbiaozhi,f:anjiangaiyaoxinxi.anjianmingcheng,f:shitima,f:anjiangaiyaoxinxi.chengbanbumenbiaozhi'
) tblproperties ( 'hbase.table.name' = 'aj_15_baseinfo')

stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler': when the underlying data lives in HBase, this handler class must be specified to process it.
with serdeproperties specifies the mapping between the fields of the Hive external table and HBase: :key corresponds to the HBase RowKey, and the remaining entries follow the order of the fields defined in the external table, each written as column family:column name and separated by commas.

tblproperties specifies the name of the corresponding HBase table.
Note: do not run complex conditional queries against this table; ideally the only where condition should be on the field mapped to the RowKey.

Note: it is best to use this table only for data export, i.e. exporting the HBase data into a Hive managed (entity) table.
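
A minimal sketch of such an export, assuming a Hive managed target table dwd_sdb.aj_15_baseinfo with the same columns (that target table name is hypothetical, not part of the original setup):

-- Copy the HBase-backed data into a Hive managed (entity) table in one pass.
-- If any filtering is needed here, keep it on the RowKey column (ajbs) only.
insert overwrite table dwd_sdb.aj_15_baseinfo
select
    ajbs, hytcyqdrq, splcbgkyy, ajgyxx_stm, bygksplc, jbfy,
    labmbs, splcygk, ajgyxx_ajbs, ajmc, stm, cbbmbs
from ods_sdb.$v_table;

Any further filtering or joining is better done on the managed table after the export.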

2. Hive access to MySQL

As mentioned above, Hive ships with a built-in JdbcStorageHandler starting from HIVE-1555; on an older Hive version, directly accessing data over JDBC can only be done through a third-party JdbcStorageHandler.

Third Party Source: https://github.com/qubole/Hive-JDBC-Storage-Handler
Usage:

add jar /home/csc/20190729/qubole-hive-JDBC.jar;
add jar /home/csc/20190729/udf-1.0.jar;
use ods_sdb;
create external table if not exists ods_sdb.$v_table(
    id string comment 'id',
    fdm string comment '案件标识',
    cBh string comment '当事人主键',
    cCxm string comment '案件查询码',
    nBgrpxh string comment '被告人排序号',
    nFzje string comment '犯罪金额',
    nSf string comment '特殊身份',
    cSf string comment '特殊身份中文',
    nZy string comment '职业',
    cZy string comment '职业中文',
    create_time string comment '创建时间'
) comment '当事人情况'
stored by 'org.apache.hadoop.hive.jdbc.storagehandler.JdbcStorageHandler'
tblproperties (
  'mapred.jdbc.driver.class'='com.mysql.jdbc.Driver',
  'mapred.jdbc.url'='jdbc:mysql://ip:port/fb_data?characterEncoding=utf8',
  'mapred.jdbc.username'='username',
  'mapred.jdbc.input.table.name'='fb_15_dsr',
  'mapred.jdbc.password'='password',
  'mapred.jdbc.hive.lazy.split'= 'false'
);

The configuration options available in tblproperties are documented in the GitHub repository above.

Locating and solving a problem:

(The issue is tied to the specific machine's resources and environment and does not reproduce reliably; the code change below only works around it temporarily.)
In actual use, while processing the party (当事人) data, JDBC connection timeouts kept occurring, causing the data export task to fail.

The problem was located as follows:

1. The first run of the non-standard data export task was very time-consuming; after it succeeded, re-runs were comparatively fast;

2. show processlist showed that while the non-standard data export task was running, a COUNT(*) statement was executed first, and it was the time-consuming part. The non-standard data stored in MySQL is on the order of ten million rows, and the storage engine is InnoDB, so executing COUNT(*) requires a full table scan;

3. When the actual export work ran, it was split into two mappers, each executing a select xxx statement, which took comparatively little time.

Combining the MySQL logs with the data export task logs, the problem was basically pinpointed to a session timeout caused by the COUNT(*).
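
For reference, the checks behind points 2 and 3 boil down to the following MySQL statements (a sketch: the table name fb_15_dsr comes from the DDL in section 2, and the count statement mirrors the one issued by the storage handler):

-- Inspect the statements running in MySQL while the export task is active;
-- this is where the long-running count showed up.
show processlist;

-- The statement the handler issues before splitting the work across mappers.
-- InnoDB keeps no exact row counter, so this scans a table with tens of millions of rows.
Select Count(*) from fb_15_dsr;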

The problem was solved as follows:

1. First, read the JdbcStorageHandler source and locate where the COUNT(*) comes from:

/*
 * Copyright 2013-2015 Qubole
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
 
package org.apache.hadoop.hive.wrapper;
 
import java.io.IOException;
 
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
 
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.TaskAttemptID;
import org.apache.hadoop.hive.shims.ShimLoader;
 
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
 
import java.sql.*;
import org.apache.hadoop.hive.jdbc.storagehandler.Constants;
import org.apache.hadoop.hive.jdbc.storagehandler.JdbcDBInputSplit;
public class RecordReaderWrapper<K, V> implements RecordReader<K, V> {
 
    private static final Log LOG = LogFactory.getLog(RecordReaderWrapper.class);
 
    private org.apache.hadoop.mapreduce.RecordReader<K, V> realReader;
    private long splitLen; // for getPos()
 
    // expect readReader return same Key & Value objects (common case)
    // this avoids extra serialization & deserialazion of these objects
    private K keyObj = null;
    protected V valueObj = null;
 
    private boolean firstRecord = false;
    private boolean eof = false;
    private Connection conn = null;
    private String tblname = null;
    private DBConfiguration delegate = null;
    private long taskIdMapper = 0;
    private boolean lazySplitActive = false;
    private long count = 0;
    private int chunks = 0;
    public RecordReaderWrapper(InputFormat<K, V> newInputFormat,
            InputSplit oldSplit, JobConf oldJobConf, Reporter reporter)
            throws IOException {
     
        TaskAttemptID taskAttemptID = TaskAttemptID.forName(oldJobConf
                .get("mapred.task.id"));
 
        if (taskAttemptID !=null) {
            LOG.info("Task attempt id is >> " + taskAttemptID.toString());
        }
 
        if(oldJobConf.get(Constants.LAZY_SPLIT) != null &&
                (oldJobConf.get(Constants.LAZY_SPLIT)).toUpperCase().equals("TRUE")){
            lazySplitActive = true;
            ResultSet results = null; 
            Statement statement = null;
            delegate = new DBConfiguration(oldJobConf);
            try{   
                conn = delegate.getConnection();
            
                statement = conn.createStatement();
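                // This is the COUNT(*) issued against MySQL before any mapper starts;
                // on a large InnoDB table it is the slow query behind the session timeout.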
                results = statement.executeQuery("Select Count(*) from " + oldJobConf.get("mapred.jdbc.input.table.name"));
                results.next();
 
                count = results.getLong(1);
                chunks = oldJobConf.getInt("mapred.map.tasks", 1);
                LOG.info("Total numer of records: " + count + ". Total number of mappers: " + chunks );
                splitLen = count/chunks;
                if((count%chunks) != 0)
                    splitLen++;
                LOG.info("Split Length is "+ splitLen);
                results.close();
                statement.close();
                 
            }
            catch(Exception e){
                // ignore Exception
            }
        }
        org.apache.hadoop.mapreduce.InputSplit split;
         
        if(lazySplitActive){
             
            ((JdbcDBInputSplit)(((InputSplitWrapper)oldSplit).realSplit)).setStart(splitLen);
            ((JdbcDBInputSplit)(((InputSplitWrapper)oldSplit).realSplit)).setEnd(splitLen);
        }
 
        if (oldSplit.getClass() == FileSplit.class) {
            split = new org.apache.hadoop.mapreduce.lib.input.FileSplit(
                    ((FileSplit) oldSplit).getPath(),
                    ((FileSplit) oldSplit).getStart(),
                    ((FileSplit) oldSplit).getLength(), oldSplit.getLocations());
        } else {
            split = ((InputSplitWrapper) oldSplit).realSplit;
        }
 
 
        // create a MapContext to pass reporter to record reader (for counters)
        TaskAttemptContext taskContext = ShimLoader.getHadoopShims()
                .newTaskAttemptContext(oldJobConf,
                        new ReporterWrapper(reporter));
 
        try {
            realReader = newInputFormat.createRecordReader(split, taskContext);
            realReader.initialize(split, taskContext);
 
            // read once to gain access to key and value objects
            if (realReader.nextKeyValue()) {
                firstRecord = true;
                keyObj = realReader.getCurrentKey();
                valueObj = realReader.getCurrentValue();
            } else {
                eof = true;
            }
        } catch (InterruptedException e) {
            throw new IOException(e);
        }
    }
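
    // ... remaining RecordReader methods (next, createKey, createValue,
    // getPos, getProgress, close) are omitted in this excerpt ...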
}

results = statement.executeQuery("Select Count(*) from " + oldJobConf.get("mapred.jdbc.input.table.name"));

From this line we can see that, before the export actually runs, the handler first fetches the total row count of the table, which is then used to divide the work among the mappers. The trigger condition for this code path, however, is

'mapred.jdbc.hive.lazy.split' = 'true'
yet in practice, even with this option configured to false, the COUNT(*) operation was still executed.

2. Modify the source: read the table's approximate row count from a custom property (with a default threshold of 20,000,000) and comment out the COUNT(*) query:
                count = oldJobConf.getInt("mapred.jdbc.input.table.count", 20000000);

//                results = statement.executeQuery("Select Count("+ (key==null?"*":key) + ") from " + oldJobConf.get("mapred.jdbc.input.table.name"));
//                results.next();

//                count = results.getLong(1);

3. Modify the external table definition by adding the following property (see the sketch after this list):

'mapred.jdbc.input.table.count'='3000000'

4. Repackage the jar and upload it;

5. Problem solved!
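
Assuming the external table from section 2 already exists, step 3 can also be applied with a single alter statement instead of recreating the table (a sketch; 3000000 is simply the approximate row count of fb_15_dsr):

-- Attach the approximate row count as a table property so the patched handler
-- reads it from configuration instead of issuing Select Count(*) against MySQL.
alter table ods_sdb.$v_table
set tblproperties ('mapred.jdbc.input.table.count'='3000000');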
