CarbonData concurrent table modification problem

Recently, while using CarbonData, the business logic required multiple threads to write to the same table simultaneously, i.e., concurrent writes.
The official documentation has very little to say about concurrent operations on a table:
carbon.lock.type

This configuration specifies the type of lock to be acquired during concurrent operations on table. There are the following types of lock implementation:
- LOCALLOCK: Lock is created on local file system as file. This lock is useful when only one spark driver (thrift server) runs on a machine and no other CarbonData spark application is launched concurrently.
- HDFSLOCK: Lock is created on HDFS file system as file. This lock is useful when multiple CarbonData spark applications are launched and no ZooKeeper is running on cluster and HDFS supports file based locking.
Roughly translated:
This property specifies the type of lock to acquire when operating on a table concurrently. There are two implementations: LOCALLOCK and HDFSLOCK.
- LOCALLOCK is created as a file on the local file system. It can be used when only one Spark driver (thrift server) runs on a machine and no other CarbonData Spark application runs at the same time (my understanding: concurrent tasks submitted from a single machine).
- HDFSLOCK is created as a file on HDFS. It can be used when multiple Spark applications are running and no ZooKeeper manages the cluster (that is, concurrent tasks submitted from multiple clients).
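
Since in our case concurrent load tasks are submitted from multiple clients, HDFSLOCK is the appropriate choice. The lock type can be set in carbon.properties (carbon.lock.type=HDFSLOCK) or programmatically before the session is built; here is a minimal sketch, using the same CarbonProperties call that appears in the session-builder code below:

import org.apache.carbondata.core.constants.CarbonCommonConstants;
import org.apache.carbondata.core.util.CarbonProperties;

// Switch CarbonData to HDFS file-based locks before the session is created.
// CarbonCommonConstants.LOCK_TYPE is the constant for "carbon.lock.type".
CarbonProperties.getInstance().addProperty(CarbonCommonConstants.LOCK_TYPE, "HDFSLOCK");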

Here is the Java code used to build the SparkSession:

import org.apache.carbondata.core.constants.CarbonCommonConstants;
import org.apache.carbondata.core.util.CarbonProperties;
import org.apache.commons.configuration.ConfigurationException;
import org.apache.commons.configuration.PropertiesConfiguration;
import org.apache.commons.configuration.reloading.FileChangedReloadingStrategy;
import org.apache.commons.lang3.StringUtils;
import org.apache.spark.SparkConf;
import org.apache.spark.sql.CarbonSession;
import org.apache.spark.sql.SparkSession;

    // Initialization: load the configuration file
    private static PropertiesConfiguration cfg = null;
    private static String sparkIP;
    private static String hdfsIP;
    private static String hadoopUserName;
    private static String dataPath;
    private static String maxCores;

    static {
        try {
            cfg = new PropertiesConfiguration("cloud.properties");
            // Refresh the configuration object whenever the file contents change
            cfg.setReloadingStrategy(new FileChangedReloadingStrategy());
            sparkIP = cfg.getString("spark.ip");
            hdfsIP = cfg.getString("hdfs.ip");
            hadoopUserName = cfg.getString("hadoop.user.name");
            dataPath = cfg.getString("data.path");
            maxCores = cfg.getString("spark.cores.max");
        } catch (ConfigurationException e) {
            e.printStackTrace();
        }
    }

    /**
     * Get the SparkSession used to operate on the table store
     * @return the shared SparkSession
     */
    public static SparkSession getSparkSession() {
        // 1. Set HADOOP_USER_NAME to match the remote file system user name
        System.setProperty("HADOOP_USER_NAME", hadoopUserName);
        // 2. Create a SparkConf object with the relevant settings
        SparkConf conf = new SparkConf()
                .setAppName("carbon")
                .setMaster("spark://" + sparkIP + ":7077")
                .set("spark.cores.max", maxCores);

        // 3. Set the lock type
        CarbonProperties.getInstance().addProperty(CarbonCommonConstants.LOCK_TYPE, "HDFSLOCK");
        // 4. Build the SparkSession
        SparkSession sparkSession = CarbonSession
                .CarbonBuilder(
                        SparkSession
                                .builder()
                                .config(conf)
                                .config("hive.metastore.uris", "thrift://" + sparkIP + ":9083")
                )
                .getOrCreateCarbonSession("hdfs://" + hdfsIP + ":8020" + dataPath);
        return sparkSession;
    }
    /**
     * Load a CSV file into a table
     * @param csv the CSV file path on HDFS
     * @param tableName the target table name
     * @param delimiter the field delimiter (defaults to ",")
     * @return true once the LOAD DATA statement has been submitted
     */
    public static boolean loadDataToTable(String csv, String tableName, String delimiter) {
        SparkSession sparkSession = getSparkSession();
        if (StringUtils.isBlank(delimiter)) {
            delimiter = ",";
        }
        String options = " OPTIONS('DELIMITER'='" + delimiter + "')";
        sparkSession.sql("LOAD DATA INPATH 'hdfs://" + hdfsIP + ":8020" + csv + "' INTO TABLE " + tableName + options);
        // Do not close the session here: it is shared by all threads (see the note below)
//        sparkSession.close();
        return true;
    }
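
One prerequisite the code above assumes: the target table must already exist before LOAD DATA runs. The event_log schema is not shown in this article, but a hypothetical creation through the same session would look roughly like this (the column names are made up for illustration):

SparkSession session = CloudUtils.getSparkSession();
// Hypothetical columns; replace them with the real event_log schema.
session.sql("CREATE TABLE IF NOT EXISTS event_log ("
        + "event_id STRING, event_time STRING, content STRING) "
        + "STORED BY 'carbondata'");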

Worth noting: the SparkSession obtained this way is a singleton, so it must not be closed casually; closing it in one thread would break every other thread that shares it.
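
A quick way to see what this means in practice, assuming getOrCreateCarbonSession follows the usual getOrCreate semantics of reusing the existing session:

SparkSession first = CloudUtils.getSparkSession();
SparkSession second = CloudUtils.getSparkSession();
// Both references point to the same underlying session object.
System.out.println(first == second);  // true
// first.close();  // would also invalidate `second` and every other thread's session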

Test: five threads concurrently loading CSV data into the CarbonData table:

@Test
public void testConcurrentWriteToHDFS() throws InterruptedException {

    CountDownLatch latch = new CountDownLatch(5);
    for (int i = 0; i < 5; i++) {
        new Thread(() -> {
            CloudUtils.loadDataToTable("/opt/event_log_01.csv", "event_log", ",");
            String threadName = Thread.currentThread().getName();
            System.out.println("==============" + eventLogDao.countByWhereClause(""));
            System.out.println(threadName + " is finished");
            latch.countDown();
        }).start();
    }
    latch.await();
    System.out.println("all threads are finished");
    List<EventLog> eventLogs = eventLogDao.selectAll();
    System.out.println(eventLogs);
}

Test Results:

18/01/27 17:06:05 INFO LoadTable: Thread-22 Initiating Direct Load for the Table : (default.event_log)
18/01/27 17:06:05 INFO LoadTable: Thread-23 Initiating Direct Load for the Table : (default.event_log)
18/01/27 17:06:05 INFO LoadTable: Thread-20 Initiating Direct Load for the Table : (default.event_log)
18/01/27 17:06:05 INFO LoadTable: Thread-21 Initiating Direct Load for the Table : (default.event_log)
18/01/27 17:06:05 INFO LoadTable: Thread-19 Initiating Direct Load for the Table : (default.event_log)
18/01/27 17:06:05 INFO CarbonLockFactory: Thread-22 Configured lock type is: HDFSLOCK
18/01/27 17:06:05 INFO HdfsFileLock: Thread-22 HDFS lock path:hdfs://192.168.0.181:8020/opt/default/event_log/tablestatus.lock
18/01/27 17:06:05 INFO HdfsFileLock: Thread-21 HDFS lock path:hdfs://192.168.0.181:8020/opt/default/event_log/tablestatus.lock
18/01/27 17:06:05 INFO HdfsFileLock: Thread-23 HDFS lock path:hdfs://192.168.0.181:8020/opt/default/event_log/tablestatus.lock
18/01/27 17:06:05 INFO CarbonLoaderUtil: Thread-23 Acquired lock for tabledefault.event_log for table status updation
18/01/27 17:06:05 INFO HdfsFileLock: Thread-20 HDFS lock path:hdfs://192.168.0.181:8020/opt/default/event_log/tablestatus.lock
18/01/27 17:06:05 ERROR HdfsFileLock: Thread-20 failed to create file /opt/default/event_log/tablestatus.lock for DFSClient_NONMAPREDUCE_-720553362_71 for client 172.16.30.20 because current leaseholder is trying to recreate file.
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:3175)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFileInternal(FSNamesystem.java:3005)
...
18/01/27 17:06:05 INFO HdfsFileLock: Thread-19 HDFS lock path:hdfs://192.168.0.181:8020/opt/default/event_log/tablestatus.lock
18/01/27 17:06:05 ERROR HdfsFileLock: Thread-19 failed to create file /opt/default/event_log/tablestatus.lock for DFSClient_NONMAPREDUCE_-720553362_71 for client 172.16.30.20 because current leaseholder is trying to recreate file.
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:3175)

As the log shows, each thread tries to acquire the table status lock before updating the table; when the table is already locked, the attempt fails and an error is logged, but the acquisition is retried internally. In the end all five threads finish, and all five loads are written to the table.
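
The "failed to create file ... current leaseholder" errors also hint at how the HDFS lock works: the lock is just a file, and whoever manages to create tablestatus.lock holds it. Below is a rough conceptual sketch using plain Hadoop FileSystem calls (CarbonData's actual HdfsFileLock adds retry and cleanup logic on top of this idea):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Conceptual sketch only, not CarbonData's real implementation.
FileSystem fs = FileSystem.get(new Configuration());
Path lockFile = new Path("/opt/default/event_log/tablestatus.lock");
FSDataOutputStream lockStream = null;
try {
    // Creating the lock file fails while another client already holds it;
    // in the log above this shows up as the "current leaseholder" error.
    lockStream = fs.create(lockFile, false);
    // ... update the table status while holding the lock ...
} catch (IOException e) {
    // Lock held by another writer; CarbonData retries instead of giving up.
} finally {
    if (lockStream != null) {
        lockStream.close();          // release the HDFS lease
        fs.delete(lockFile, false);  // remove the lock file
    }
}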
