Several solutions for writing data back to Oracle

Since we are on Spark 1.5.1, we ran into many unexpected bugs; they are recorded here for your reference.

First, our requirement: write a Hive table back to Oracle, and it has to be done through Spark SQL, so Sqoop was not considered (the cluster's big data platform does not have the Sqoop component anyway). The output must follow the target format exactly: whatever type and precision a column originally had in Oracle, it has to come back to Oracle with the same type and precision.
The difficulty is that on the big data platform dates are stored as strings in Hive, and Hive strings carry no length.

1. The first option:

Since we are not allowed to access Hive's metastore directly, the idea is: use sqlContext.sql to read the target table and take its schema, query Oracle's system tables to obtain the final data type and length of each column, reorganize the schema accordingly, convert the DataFrame to an RDD, rebuild a DataFrame from that RDD with the new schema, and write it out with the DataFrameWriter's write.jdbc method, adding
option("createTableColumnTypes", "name varchar(200)")
to control the column types of the table that gets created. After testing, however, this option does not exist in Spark 1.5.1; it is only available from Spark 2.2.0 onwards.
The code is as follows:

package test1

import org.apache.spark.{ SparkContext, SparkConf }
import org.apache.spark.sql._
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.SaveMode
import oracle.jdbc.driver.OracleDriver
import org.apache.spark.sql.types.StringType
import java.util.ArrayList
import org.apache.spark.sql.types._
import org.apache.spark.sql.types.DataTypes
import scala.collection.mutable.ArrayBuffer
import java.util.Properties
import org.apache.spark.sql.jdbc._
import java.sql.Types

object ojdbcTest {

  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("firstTry").setMaster("local");
    val sc = new SparkContext(conf);
    val sqlContext = new HiveContext(sc);

    // control schema optimization
    var df = sqlContext.sql("select * from  ****.BL_E01_REJECTACCOUNT")
    val df1 = df.schema.toArray

    val theJdbcDF = sqlContext.load("jdbc", Map(
      "url" -> "jdbc:oracle:thin:***/*****@//*****/*****",
      "dbtable" -> "( select column_name ,data_type,data_length,data_precision,data_scale from user_tab_cols where table_name ='BL_E01_REJECTACCOUNT' order by COLUMN_ID ) a ",
      "driver" -> "oracle.jdbc.driver.OracleDriver",
      "numPartitions" -> "5",
      "lowerBound" -> "0",
      "upperBound" -> "80000000"))

    val str = theJdbcDF.collect().toArray
    var dateArray = new ArrayBuffer[String]
    var stringArray = new ArrayBuffer[(String, Int)]

    var list = new ArrayList[org.apache.spark.sql.types.StructField]();

    var string = new ArrayList[String]

    for (j <- 0 until str.length) {
      var st = str(j)
      var column_name = st.get(0)
      var data_type = st.get(1)
      var data_length = st.get(2)
      var data_precision = st.get(3)
      var data_scale = st.get(4)
      println(column_name + ":" + data_type + ":" + data_length + ":" + data_precision + ":" + data_scale)

      if (data_type.equals("DATE")) {
        dateArray += (column_name.toString())
        string.add(column_name.toString() + " " + data_type.toString())
      }

      if (data_type.equals("NUMBER")) {
        if (data_precision != null) {
          string.add(column_name.toString() + " " + data_type.toString() + s"(${data_precision.toString().toDouble.intValue()},${data_scale.toString().toDouble.intValue()})")
        } else {
          string.add(column_name.toString() + " " + data_type.toString())
        }

      }
      if (data_type.equals("VARCHAR2")) {
        stringArray += ((column_name.toString(), data_length.toString().toDouble.intValue()))
        string.add(column_name.toString() + " " + data_type.toString() + s"(${data_length.toString().toDouble.intValue()})")
      }

    }
    for (i <- 0 until df1.length) {
      var b = df1(i)
      var dataName = b.name
      var dataType = b.dataType
      //          println("column name: " + dataName + ", column type: " + dataType)
      if (dateArray.exists(p => p.equalsIgnoreCase(s"${dataName}"))) {
        dataType = DateType

      }
      var structType = DataTypes.createStructField(dataName, dataType, true)

      list.add(structType)
    }

    val schema = DataTypes.createStructType(list)

    if (dateArray.length > 0) {

      for (m <- 0 until dateArray.length) {
        var mm = dateArray(m).toString()
        println("mm:" + mm)
        var df5 = df.withColumn(s"$mm", df(s"$mm").cast(DateType))
        df = df5
      }
    }

    val rdd = df.toJavaRDD
    val df2 = sqlContext.createDataFrame(rdd, schema);

    df2.printSchema()

    val url = "jdbc:oracle:thin:@//*******/***"
    val table = "test2"
    val user = "***"
    val password = "***"

    val url1="jdbc:oracle:thin:***/***@//***/***"
    val connectionProperties = new Properties()
    connectionProperties.put("user", user)
    connectionProperties.put("password", password)
    connectionProperties.put("driver", "oracle.jdbc.driver.OracleDriver")

    val a = string.toString()
    val option = a.substring(1, a.length() - 1)
    println(option)

    df2.option("createTableColumnTypes",s"${option}").write.jdbc(url, table, connectionProperties)

    sc.stop()
  }
} 

The code was written quickly; it is just a test class.
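For reference, on Spark 2.2.0 or later the intended write would look roughly like the sketch below (the option value must use valid Spark SQL column types, e.g. VARCHAR rather than VARCHAR2, and the column names here are only placeholders):

// Spark 2.2.0+ only: createTableColumnTypes controls the column types of the table created by write.jdbc
df2.write
  .option("createTableColumnTypes", "NAME VARCHAR(200), CREATEDATE DATE")
  .mode(SaveMode.Overwrite)
  .jdbc(url, table, connectionProperties)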

2. The second option:

Given the previous findings, the first approach is not usable on 1.5.1, so a new method is adopted:
override the three methods of the JdbcDialect class (canHandle, getCatalystType and getJDBCType), which are the hooks Spark SQL uses to map between JDBC database types and Spark SQL types when reading and writing. Overriding them implements simple type conversions.

package test1

import org.apache.spark.{ SparkContext, SparkConf }
import org.apache.spark.sql._
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.SaveMode
import oracle.jdbc.driver.OracleDriver
import org.apache.spark.sql.types.StringType
import java.util.ArrayList
import org.apache.spark.sql.types._
import org.apache.spark.sql.types.DataTypes
import scala.collection.mutable.ArrayBuffer
import java.util.Properties
import org.apache.spark.sql.jdbc._
import java.sql.Types

object ojdbcTest {



    def oracleInit(){

      val dialect:JdbcDialect= new JdbcDialect() {
        override def canHandle(url:String)={
          url.startsWith("jdbc:oracle");
        }
        // type conversion hook used when reading from Oracle
        override def getCatalystType(sqlType: Int, typeName: String, size: Int,
                                     md: MetadataBuilder): Option[DataType] = {
          None // not needed here; fall back to Spark's default mapping
        }
      // type conversion hook used when writing to Oracle
        override def getJDBCType(dt:DataType):Option[org.apache.spark.sql.jdbc.JdbcType]=

         dt match{
            case BooleanType => Some(JdbcType("NUMBER(1)", java.sql.Types.BOOLEAN))
            case IntegerType => Some(JdbcType("NUMBER(10)", java.sql.Types.INTEGER))
            case LongType    => Some(JdbcType("NUMBER(19)", java.sql.Types.BIGINT))
            case FloatType   => Some(JdbcType("NUMBER(19, 4)", java.sql.Types.FLOAT))
            case DoubleType  => Some(JdbcType("NUMBER(19, 4)", java.sql.Types.DOUBLE))
            case ByteType    => Some(JdbcType("NUMBER(3)", java.sql.Types.SMALLINT))
            case ShortType   => Some(JdbcType("NUMBER(5)", java.sql.Types.SMALLINT))
           case StringType  => Some(JdbcType("VARCHAR2(250)", java.sql.Types.VARCHAR))
            case DateType    => Some(JdbcType("DATE", java.sql.Types.DATE))
            case DecimalType.Unlimited => Some(JdbcType("NUMBER",java.sql.Types.NUMERIC))
            case _ => None
          }

      }
      JdbcDialects.registerDialect(dialect);
    }


  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("firstTry").setMaster("local");
    val sc = new SparkContext(conf);
    val sqlContext = new HiveContext(sc);

    // control schema optimization
    var df = sqlContext.sql("select * from  ****.BL_E01_REJECTACCOUNT")
    val df1 = df.schema.toArray

    val theJdbcDF = sqlContext.load("jdbc", Map(
      "url" -> "jdbc:oracle:thin:****/****@//********/claimamdb",
      "dbtable" -> "( select column_name ,data_type,data_length,data_precision,data_scale from user_tab_cols where table_name ='BL_E01_REJECTACCOUNT' order by COLUMN_ID ) a ",
      "driver" -> "oracle.jdbc.driver.OracleDriver",
      "numPartitions" -> "5",
      "lowerBound" -> "0",
      "upperBound" -> "80000000"))

    val str = theJdbcDF.collect().toArray
    var dateArray = new ArrayBuffer[String]
    var stringArray = new ArrayBuffer[(String, Int)]

    var list = new ArrayList[org.apache.spark.sql.types.StructField]();



    for (j <- 0 until str.length) {
      var st = str(j)
      var column_name = st.get(0)
      var data_type = st.get(1)
      var data_length = st.get(2)
      var data_precision = st.get(3)
      var data_scale = st.get(4)
      println(column_name + ":" + data_type + ":" + data_length + ":" + data_precision + ":" + data_scale)

      if (data_type.equals("DATE")) {
        dateArray += (column_name.toString())

      }


      if (data_type.equals("VARCHAR2")) {
        stringArray += ((column_name.toString(), data_length.toString().toDouble.intValue()))

      }

    }
    for (i <- 0 until df1.length) {
      var b = df1(i)
      var dataName = b.name
      var dataType = b.dataType
      //          println("column name: " + dataName + ", column type: " + dataType)
      if (dateArray.exists(p => p.equalsIgnoreCase(s"${dataName}"))) {
        dataType = DateType

      }
      var structType = DataTypes.createStructField(dataName, dataType, true)

      list.add(structType)
    }

    val schema = DataTypes.createStructType(list)

    if (dateArray.length > 0) {

      for (m <- 0 until dateArray.length) {
        var mm = dateArray(m).toString()
        println("mm:" + mm)
        var df5 = df.withColumn(s"$mm", df(s"$mm").cast(DateType))
        df = df5
      }
    }

    val rdd = df.toJavaRDD
    val df2 = sqlContext.createDataFrame(rdd, schema);

    df2.printSchema()

    val url = "jdbc:oracle:thin:@//********/claimamdb"
    val table = "test2"
    val user = "****"
    val password = "****"

    val url1="jdbc:oracle:thin:****/****@//********/claimamdb"
    val connectionProperties = new Properties()
    connectionProperties.put("user", user)
    connectionProperties.put("password", password)
    connectionProperties.put("driver", "oracle.jdbc.driver.OracleDriver")




    oracleInit()
    df2.write.jdbc(url, table, connectionProperties)

    sc.stop()



  }
}

This approach only handles simple type conversions. It cannot solve my real problem: a column that was originally an Oracle DATE is stored as a string in Hive and needs to go back to Oracle as a DATE, but the overridden method takes no extra parameters, so there is no way to tell which string column is really a date. You could go further by inheriting the Logging class and rewriting JdbcUtils, but reading the source code for that gets a bit complicated.
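For completeness, the read-side hook that simply falls back to the default in the code above could be extended roughly like this (a sketch assuming the Spark 1.5.x getCatalystType signature; the mappings are only examples):

// a minimal dialect sketch showing only the read-side conversion
val readDialect: JdbcDialect = new JdbcDialect() {
  override def canHandle(url: String) = url.startsWith("jdbc:oracle")
  // map JDBC types coming back from Oracle to Spark SQL types when reading
  override def getCatalystType(sqlType: Int, typeName: String, size: Int,
                               md: MetadataBuilder): Option[DataType] = sqlType match {
    case java.sql.Types.DATE    => Some(TimestampType)       // an Oracle DATE also carries a time part
    case java.sql.Types.NUMERIC => Some(DecimalType(38, 10)) // example precision and scale
    case _                      => None                      // fall back to Spark's default mapping
  }
}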

3. The third option

The code is the same as in the first option.
The approach was changed because the column types of the created table cannot be made accurate: without a length, every string column defaults to 255 when written to Oracle. To deal with this I switched to createJDBCTable plus insertIntoJDBC(url1, table, true). It turns out that insertIntoJDBC is buggy in this version, even though the official documentation says:

Save this DataFrame to a JDBC database at url under the table name table. Assumes the table already exists and has a compatible schema. If you pass true for overwrite, it will TRUNCATE the table before performing the INSERTs. 

The table must already exist on the database. It must have a schema that is compatible with the schema of this RDD; inserting the rows of the RDD in order via the simple statement INSERT INTO table VALUES (?, ?, ..., ?) should not fail.
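The deprecated 1.5.x DataFrame methods referred to above look roughly like this (a sketch; the credentials have to be embedded in url1 because these methods take no Properties argument):

// creates the table and writes the data; with allowExisting = false it should throw if the table already exists
df2.createJDBCTable(url1, table, false)
// assumes the table already exists; with overwrite = true it should TRUNCATE before the inserts
df2.insertIntoJDBC(url1, table, true)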

As a result, it reports an error that the table already exists. After searching on foreign sites, it turns out this is a known bug.

Well, after all of that, how do we get our data in accurately without using the methods above?

4. The fourth option

The maximum VARCHAR2 length in the Oracle databases I work with is 4000. So the plan is: in the overridden dialect's getJDBCType, map every string column to VARCHAR2(4000) so that no data gets truncated; use Oracle's JDBC driver directly to create the target table from a DDL string built from the target table's definition; write the DataFrame into an Oracle staging table (where every VARCHAR2 is 4000); and finally use a SELECT to move the data from the staging table into the target table.

Date columns are identified from the fields of the Oracle system table, cast to timestamp type in the intermediate DataFrame, and mapped to the underlying Oracle DATE type in the overridden getJDBCType, so dates are no longer truncated.

The code is as follows:

package test1

import org.apache.spark.{ SparkContext, SparkConf }
import org.apache.spark.sql._
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.SaveMode
import oracle.jdbc.driver.OracleDriver
import org.apache.spark.sql.types.StringType
import java.util.ArrayList
import org.apache.spark.sql.types._
import org.apache.spark.sql.types.DataTypes
import scala.collection.mutable.ArrayBuffer
import java.util.Properties
import org.apache.spark.sql.jdbc._
import java.sql.Types

import java.sql.Connection
import java.sql.DriverManager
object ojdbcTest {



      def oracleInit(){

        val dialect:JdbcDialect= new JdbcDialect() {
          override def canHandle(url:String)={
            url.startsWith("jdbc:oracle");
          }

//       override def getCatalystType(sqlType, typeName, size, md):Option[DataType]={
    //
    //
    //      }
          override def getJDBCType(dt:DataType):Option[org.apache.spark.sql.jdbc.JdbcType]=

            dt match{
              case BooleanType => Some(JdbcType("NUMBER(1)", java.sql.Types.BOOLEAN))
              case IntegerType => Some(JdbcType("NUMBER(10)", java.sql.Types.INTEGER))
              case LongType    => Some(JdbcType("NUMBER(19)", java.sql.Types.BIGINT))
              case FloatType   => Some(JdbcType("NUMBER(19, 4)", java.sql.Types.FLOAT))
              case DoubleType  => Some(JdbcType("NUMBER(19, 4)", java.sql.Types.DOUBLE))
              case ByteType    => Some(JdbcType("NUMBER(3)", java.sql.Types.SMALLINT))
              case ShortType   => Some(JdbcType("NUMBER(5)", java.sql.Types.SMALLINT))
              case StringType  => Some(JdbcType("VARCHAR2(4000)", java.sql.Types.VARCHAR))
              case DateType    => Some(JdbcType("DATE", java.sql.Types.DATE))
              case DecimalType.Unlimited => Some(JdbcType("NUMBER",java.sql.Types.NUMERIC))
              case TimestampType=> Some(JdbcType("DATE",java.sql.Types.DATE))
              case _ => None
            }

        }
         JdbcDialects.registerDialect(dialect);
      }

  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("firstTry").setMaster("local");
    val sc = new SparkContext(conf);
    val sqlContext = new HiveContext(sc);

    // control schema optimization
    var df = sqlContext.sql("select * from  ******.BL_E01_REJECTACCOUNT")
    val df1 = df.schema.toArray

    //val customSchema = sparkTargetDF.dtypes.map(x => x._1+" "+x._2).mkString(",").toUpperCase()
    val theJdbcDF = sqlContext.load("jdbc", Map(
      "url" -> "jdbc:oracle:thin:********/********//********/********",
      "dbtable" -> "( select column_name ,data_type,data_length,data_precision,data_scale from user_tab_cols where table_name ='BL_E01_REJECTACCOUNT' order by COLUMN_ID ) a ",
      "driver" -> "oracle.jdbc.driver.OracleDriver",
      "numPartitions" -> "5",
      "lowerBound" -> "0",
      "upperBound" -> "80000000"))

    val str = theJdbcDF.collect().toArray
    var dateArray = new ArrayBuffer[String]
    var stringArray = new ArrayBuffer[(String, Int)]

    var list = new ArrayList[org.apache.spark.sql.types.StructField]();

    var string = new ArrayList[String]

    for (j <- 0 until str.length) {
      var st = str(j)
      var column_name = st.get(0)
      var data_type = st.get(1)
      var data_length = st.get(2)
      var data_precision = st.get(3)
      var data_scale = st.get(4)
      println(column_name + ":" + data_type + ":" + data_length + ":" + data_precision + ":" + data_scale)

      if (data_type.equals("DATE")) {
        dateArray += (column_name.toString())
        string.add(column_name.toString() + " " + data_type.toString())
      }

      if (data_type.equals("NUMBER")) {
        if (data_precision != null) {
          string.add(column_name.toString() + " " + data_type.toString() + s"(${data_precision.toString().toDouble.intValue()},${data_scale.toString().toDouble.intValue()})")
        } else {
          string.add(column_name.toString() + " " + data_type.toString())
        }

      }
      if (data_type.equals("VARCHAR2")) {
        stringArray += ((column_name.toString(), data_length.toString().toDouble.intValue()))
        string.add(column_name.toString() + " " + data_type.toString() + s"(${data_length.toString().toDouble.intValue()})")
      }

    }
    for (i <- 0 until df1.length) {
      var b = df1(i)
      var dataName = b.name
      var dataType = b.dataType
      //          println("column name: " + dataName + ", column type: " + dataType)
      if (dateArray.exists(p => p.equalsIgnoreCase(s"${dataName}"))) {
        dataType = TimestampType

      }
      var structType = DataTypes.createStructField(dataName, dataType, true)

      list.add(structType)
    }

    val schema = DataTypes.createStructType(list)

    if (dateArray.length > 0) {

      for (m <- 0 until dateArray.length) {
        var mm = dateArray(m).toString()
        println("mm:" + mm)
        var df5 = df.withColumn(s"$mm", df(s"$mm").cast(TimestampType))
        df = df5
      }
    }

    val rdd = df.toJavaRDD
    val df2 = sqlContext.createDataFrame(rdd, schema);

    df2.printSchema()

    val url = "jdbc:oracle:thin:@//********/********"
    val table = "test2"
    val table1="test3"
    val user = "********"
    val password = "#EDC5tgb"

    val url1 = "jdbc:oracle:thin:********/********//********/********"
    val connectionProperties = new Properties()
    connectionProperties.put("user", user)
    connectionProperties.put("password", password)
    connectionProperties.put("driver", "oracle.jdbc.driver.OracleDriver")

    val a = string.toString()
    val option = a.substring(1, a.length() - 1)
    println(option)

    oracleInit()

    createJdbcTable(option,table)

    println("create table is finish!")

    df2.write.jdbc(url, table1, connectionProperties)

    insertTable(table,table1)
    println("已导入目标表!")


    sc.stop()
    //option("createTableColumnTypes", "CLAIMNO VARCHAR2(300), comments VARCHAR(1024)")
    //df2.select(df2("POLICYNO")).write.option("createTableColumnTypes", "CLAIMNO VARCHAR2(200)")
    //.jdbc(url, table, connectionProperties)
  }

  def createJdbcTable(option:String,table:String) = {

    val url = "jdbc:oracle:thin:@//********/********"
    // driver class name
    val driver = "oracle.jdbc.driver.OracleDriver"
    // username
    val username = "********"
    // password
    val password = "********"
    // initialize the database connection
    var connection: Connection = null
    try {
      // register the driver
      Class.forName(driver)
      // get the connection
      connection = DriverManager.getConnection(url, username, password)
      val statement = connection.createStatement
      // build and execute the create-table statement
      val sql = s"""
        create table ${table}
(
 ${option}
)
        """
      statement.execute(sql)
    } catch { case e: Exception => e.printStackTrace }
    finally {
      // close the connection and release resources
      if (connection != null) connection.close()
    }
  }

  def insertTable(table:String,table1:String){
    val url = "jdbc:oracle:thin:@//********/********"
    // driver class name
    val driver = "oracle.jdbc.driver.OracleDriver"
    // username
    val username = "********"
    // password
    val password = "*********"
    // initialize the database connection
    var connection: Connection = null
    try {
      // register the driver
      Class.forName(driver)
      // get the connection
      connection = DriverManager.getConnection(url, username, password)
      val statement = connection.createStatement
      // build and execute the insert statement
      val sql = s"""
        insert into ${table} select * from  ${table1}
        """
      statement.executeUpdate(sql)
    } catch { case e: Exception => e.printStackTrace }
    finally {
      // close the connection and release resources
      if (connection != null) connection.close()
    }

  }
}
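To make the two helper calls concrete: with hypothetical columns, the statements they end up running would look roughly like this (column names are made up for illustration):

// createJdbcTable(option, table) runs something like:
//   create table test2 ( CLAIMNO VARCHAR2(32), REJECTDATE DATE, AMOUNT NUMBER(19,4) )
// and insertTable(table, table1) then runs:
//   insert into test2 select * from test3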

There are plenty of pitfalls in this version. For example, the save mode passed to write.mode(...).jdbc() is ignored: whatever you specify, append or ignore, the table gets overwritten. Looking at the source code, the save mode is effectively hard-coded to overwrite.
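In practice this means that even an explicit mode does not help on this version; roughly:

// in Spark 1.5.1 the requested SaveMode is not honored by the JDBC writer; the table still ends up overwritten
df2.write.mode(SaveMode.Append).jdbc(url, table, connectionProperties)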
Refer to this question for details:

https://www.2cto.com/net/201609/551130.html

I hope this helps you avoid the same detours!
