CarbonData cross-database query exception: bug analysis and solution

0x0 Background

A colleague recently discovered that CarbonData throws an exception when running a cross-database multi-table join query.

The Java code is as follows:

carbon.sql("select * from event.student as s left join test.user as u on u.name=s.name").show();

The exception is as follows:

java.io.FileNotFoundException: File does not exist: /opt\test\user
    at org.apache.hadoop.hdfs.server.namenode.FSDirectory.getContentSummary(FSDirectory.java:2404)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getContentSummary(FSNamesystem.java:4575)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getContentSummary(NameNodeRpcServer.java:1087)
    at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.getContentSummary(AuthorizationProviderProxyClientProtocol.java:563)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getContentSummary(ClientNamenodeProtocolServerSideTranslatorPB.java:873)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2217)

As the message shows, the root cause is a problem with the path separator: the job cannot locate the table directory on HDFS and therefore fails with an error.

File does not exist: /opt\test\user

0x1 Analysis

Looking at the full exception stack on the driver side:

    at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
    at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73)
    at org.apache.hadoop.hdfs.DFSClient.getContentSummary(DFSClient.java:2778)
    at org.apache.hadoop.hdfs.DistributedFileSystem$13.doCall(DistributedFileSystem.java:656)
    at org.apache.hadoop.hdfs.DistributedFileSystem$13.doCall(DistributedFileSystem.java:652)
    at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
    at org.apache.hadoop.hdfs.DistributedFileSystem.getContentSummary(DistributedFileSystem.java:652)
    at org.apache.carbondata.core.datastore.impl.FileFactory.getDirectorySize(FileFactory.java:534)
    at org.apache.spark.sql.hive.CarbonRelation.sizeInBytes(CarbonRelation.scala:207)
    at org.apache.spark.sql.CarbonDatasourceHadoopRelation.sizeInBytes(CarbonDatasourceHadoopRelation.scala:90)
    at org.apache.spark.sql.execution.datasources.LogicalRelation$$anonfun$statistics$2.apply(LogicalRelation.scala:77)
    at org.apache.spark.sql.execution.datasources.LogicalRelation$$anonfun$statistics$2.apply(LogicalRelation.scala:77)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.sql.execution.datasources.LogicalRelation.statistics$lzycompute(LogicalRelation.scala:76)
    at org.apache.spark.sql.execution.datasources.LogicalRelation.statistics(LogicalRelation.scala:75)
    at org.apache.spark.sql.catalyst.plans.logical.UnaryNode.statistics(LogicalPlan.scala:319)
    at org.apache.spark.sql.execution.SparkStrategies$JoinSelection$.canBroadcast(SparkStrategies.scala:117)
    at org.apache.spark.sql.execution.SparkStrategies$JoinSelection$.apply(SparkStrategies.scala:159)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:62)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:62)
    at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)

Two frames in this stack stand out:

    at org.apache.carbondata.core.datastore.impl.FileFactory.getDirectorySize(FileFactory.java:534)
    at org.apache.spark.sql.hive.CarbonRelation.sizeInBytes(CarbonRelation.scala:207)

They show that CarbonData is computing the size of the table's directory (Spark's JoinSelection.canBroadcast asks for the relation's statistics, as the frames further down show). Stepping into org.apache.carbondata.core.datastore.impl.FileFactory.getDirectorySize, the code is:

public static long getDirectorySize(String filePath) throws IOException {
    FileFactory.FileType fileType = getFileType(filePath);
    switch (fileType) {
        case LOCAL:
        default:
            filePath = getUpdatedFilePath(filePath, fileType);
            File file = new File(filePath);
            return FileUtils.sizeOfDirectory(file);
        case HDFS:
        case ALLUXIO:
        case VIEWFS:
            Path path = new Path(filePath);
            FileSystem fs = path.getFileSystem(configuration);
            return fs.getContentSummary(path).getLength();
    }
}
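
With the malformed path, the HDFS branch is taken: a Path is built from the string, and fs.getContentSummary(path) is the call that ends up failing, because the NameNode has no entry named /opt\test\user. A stripped-down sketch of that branch is below (the class name and hard-coded path are illustrative; an HDFS client configuration on the classpath is assumed, and whether Hadoop's Path normalizes backslashes depends on platform and scheme, but in this environment they reached the NameNode unchanged, as the error shows):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ContentSummaryCheck {
    public static void main(String[] args) throws IOException {
        // Same calls as the HDFS branch of getDirectorySize above. The backslashes
        // are ordinary characters in the HDFS path name, so no such directory exists
        // and getContentSummary throws FileNotFoundException.
        Configuration configuration = new Configuration();
        Path path = new Path("/opt\\test\\user");
        FileSystem fs = path.getFileSystem(configuration);
        System.out.println(fs.getContentSummary(path).getLength());
    }
}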

The filePath passed in is obviously the broken path from above: /opt\test\user. So where does it come from? Step into org.apache.spark.sql.hive.CarbonRelation.sizeInBytes:

 def sizeInBytes: Long = {
    val tableStatusNewLastUpdatedTime = SegmentStatusManager.getTableStatusLastModifiedTime(
      tableMeta.carbonTable.getAbsoluteTableIdentifier)

    if (tableStatusLastUpdateTime != tableStatusNewLastUpdatedTime) {
      val tablePath = CarbonStorePath.getCarbonTablePath(
        tableMeta.storePath,
        tableMeta.carbonTableIdentifier).getPath
      val fileType = FileFactory.getFileType(tablePath)
      if(FileFactory.isFileExist(tablePath, fileType)) {
        tableStatusLastUpdateTime = tableStatusNewLastUpdatedTime
        sizeInBytesLocalValue = FileFactory.getDirectorySize(tablePath)
      }
    }
    sizeInBytesLocalValue
  }

As you can see, tablePath is the argument passed to getDirectorySize, and it is built by the following statement:

val tablePath = CarbonStorePath.getCarbonTablePath(
        tableMeta.storePath,
        tableMeta.carbonTableIdentifier).getPath

Following up into CarbonStorePath.getCarbonTablePath:

public static CarbonTablePath getCarbonTablePath(String storePath, CarbonTableIdentifier tableIdentifier) {
    return new CarbonTablePath(tableIdentifier,
        storePath + File.separator + tableIdentifier.getDatabaseName()
            + File.separator + tableIdentifier.getTableName());
}

Sure enough, here is the problem:

storePath + File.separator + tableIdentifier.getDatabaseName() + File.separator + tableIdentifier.getTableName()

The path components are joined with File.separator, which defaults to \ on Windows and / on Linux. So when the driver code runs on Windows, the table path derived from the SQL statement is built with backslashes; once the job is submitted to Spark and the path is resolved against HDFS on the Linux cluster, it cannot be recognized, and the query fails.
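
0x2 Solution

On a Windows driver, File.separator is "\", so with a store path of /opt the concatenation above produces exactly the path from the error, /opt\test\user. The fix is therefore to stop joining the store path with the platform-dependent separator and always use "/", since the table ultimately lives on HDFS. A minimal sketch of the idea follows (not the upstream patch; a constant such as CarbonCommonConstants.FILE_SEPARATOR, which is "/", could be used instead of the literal if the CarbonData version in use provides it):

public static CarbonTablePath getCarbonTablePath(String storePath, CarbonTableIdentifier tableIdentifier) {
    // Always join with '/', regardless of the OS the driver runs on,
    // because the resulting path is resolved against HDFS.
    String tablePath = storePath + "/"
        + tableIdentifier.getDatabaseName() + "/"
        + tableIdentifier.getTableName();
    return new CarbonTablePath(tableIdentifier, tablePath);
}

With this change the driver builds /opt/test/user on Windows as well, the NameNode can resolve the table directory, and the cross-database join runs normally. As a workaround, running the driver on Linux (where File.separator is already "/") also avoids the problem without any code change.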
