[Spark] Optimizing Spark SQL writes to MySQL

Spark 2.2.0

Writing from Spark SQL to MySQL over JDBC can be very slow, and it also puts a relatively high load on MySQL.

The main Spark SQL JDBC options are as follows:

url: The JDBC URL to connect to, for example jdbc:mysql://ip:3306
dbtable: The JDBC table to read or write. A subquery in parentheses can be used instead of a table name, for example (select * from table_name) as t1; the subquery must be given an alias.
driver: The class name of the JDBC driver used to connect to this URL, for example com.mysql.jdbc.Driver

partitionColumn, lowerBound, upperBound

These three options must all be specified together. They control how the table is split when data is read from the database in parallel by multiple workers.

partitionColumn: the column used to partition the reads; it must be a numeric type (int, float, double, decimal).

lowerBound: the lower bound; must be an integer.

upperBound: the upper bound; must be an integer.

lowerBound and upperBound are only used to decide the partition stride, not to filter the rows of the table, so all rows of the table are partitioned and returned. (A read sketch using these options follows the parameter list below.)

 

numPartitions

The maximum number of partitions used when reading from and writing to the database in parallel, which also determines the maximum number of concurrent JDBC connections. If the number of partitions of the DataFrame being written exceeds this value, Spark calls coalesce(numPartitions) before writing to reduce the partition count to this value.

fetchsize: Applies only to reads. The JDBC fetch size determines how many rows are fetched per round trip, which can help tune JDBC driver performance; some drivers default to a small fetch size (Oracle, for example, fetches 10 rows at a time).

batchsize: Applies only to writes. The JDBC batch size determines how many rows are sent per insert batch, which can help tune JDBC driver performance. The default is 1000.

isolationLevel: Applies only to writes. The transaction isolation level of the current connection. It can be NONE, READ_COMMITTED, READ_UNCOMMITTED, REPEATABLE_READ, or SERIALIZABLE, corresponding to the standard isolation levels defined on the JDBC Connection object. The default is READ_UNCOMMITTED.

truncate: Applies only to writes. When SaveMode.Overwrite is used, this option makes Spark truncate the existing MySQL table instead of dropping and recreating it. This can be more efficient and prevents table metadata (for example, indexes) from being removed. However, it does not work in some cases, for example when the new data has a different schema. The default is false.

createTableOptions: Applies only to writes. Allows database-specific table and partition options to be set when the table is created (for example, CREATE TABLE t (name string) ENGINE=InnoDB).

createTableColumnTypes: Applies only to writes. Specifies the column data types to use when the table is created; the specified types must be compatible with the corresponding Spark SQL types.
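
For reference, a minimal sketch of a partitioned JDBC read that uses the read-side options above (partitionColumn, lowerBound, upperBound, numPartitions, fetchsize). The connection details (db_host, test_db, source_table, user, password) and the id bounds are placeholder assumptions, not values from this article.

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().appName("jdbc-read-demo").getOrCreate()

  // Each of the 8 partitions reads a contiguous range of `id`; lowerBound and
  // upperBound only shape those ranges, they do not filter any rows.
  val readDF = spark.read
    .format("jdbc")
    .option("url", "jdbc:mysql://db_host:3306/test_db")        // placeholder host/db
    .option("driver", "com.mysql.jdbc.Driver")
    .option("dbtable", "(select * from source_table) as t1")   // a subquery needs an alias
    .option("user", "user")
    .option("password", "password")
    .option("partitionColumn", "id")   // must be a numeric column
    .option("lowerBound", "1")
    .option("upperBound", "1000000")
    .option("numPartitions", "8")
    .option("fetchsize", "1000")
    .load()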

For writing, the key tuning points are:

url: append rewriteBatchedStatements=true to the JDBC URL so that the MySQL JDBC driver rewrites batched statements into multi-row inserts. This is the most important parameter for batch writing and can significantly improve performance.
batchsize: the number of rows the DataFrame writer sends to MySQL per batch; increase it to improve write throughput.
isolationLevel: the transaction isolation level. The DataFrame write does not need a transaction, so set it to NONE.
truncate: effective in Overwrite mode; the existing table is truncated and reused instead of dropped and recreated, so the table structure (for example, indexes) is preserved.
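
For reference, a minimal sketch of a DataFrame write that combines these options. The connection details (db_host, test_db, target_table, user, password) and the sample DataFrame are placeholder assumptions, not values from this article.

  import org.apache.spark.sql.{SaveMode, SparkSession}

  val spark = SparkSession.builder().appName("jdbc-write-demo").getOrCreate()
  val df = spark.range(0, 1000000).toDF("id")   // sample data to write

  df.write
    .format("jdbc")
    .mode(SaveMode.Overwrite)
    // rewriteBatchedStatements=true lets the MySQL driver turn batched inserts
    // into multi-row INSERT statements.
    .option("url", "jdbc:mysql://db_host:3306/test_db?rewriteBatchedStatements=true")
    .option("driver", "com.mysql.jdbc.Driver")
    .option("dbtable", "target_table")
    .option("user", "user")
    .option("password", "password")
    .option("batchsize", "10000")       // rows per executeBatch()
    .option("isolationLevel", "NONE")   // skip the per-partition transaction
    .option("truncate", "true")         // truncate instead of drop + recreate
    .option("numPartitions", "8")       // caps concurrent JDBC connections
    .save()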
 

How Spark SQL writes to the MySQL database over JDBC:

Conclusion: when a DataFrame is written to MySQL, each partition is written separately; the write goes through JDBC and uses a PreparedStatement to assemble and execute the MySQL INSERT statements.

https://blog.csdn.net/IT_xhf/article/details/85336074

https://www.jianshu.com/p/429e64663b0e
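
The write path starts in org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider#createRelation, which decides, based on the save mode and whether the table already exists, whether to truncate, drop and recreate, or simply append: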

  override def createRelation(
      sqlContext: SQLContext,
      mode: SaveMode,
      parameters: Map[String, String],
      df: DataFrame): BaseRelation = {
    val jdbcOptions = new JDBCOptions(parameters)
    val url = jdbcOptions.url
    val table = jdbcOptions.table
    val createTableOptions = jdbcOptions.createTableOptions
    val isTruncate = jdbcOptions.isTruncate

    val conn = JdbcUtils.createConnectionFactory(jdbcOptions)()
    try {
      val tableExists = JdbcUtils.tableExists(conn, url, table)
      if (tableExists) {
        mode match {
          case SaveMode.Overwrite =>
            if (isTruncate && isCascadingTruncateTable(url) == Some(false)) {
              // In this case, we should truncate table and then load.
              truncateTable(conn, table)
              saveTable(df, url, table, jdbcOptions)
            } else {
              // Otherwise, do not truncate the table, instead drop and recreate it
              dropTable(conn, table)
              createTable(df.schema, url, table, createTableOptions, conn)
              saveTable(df, url, table, jdbcOptions)
            }

          case SaveMode.Append =>
            saveTable(df, url, table, jdbcOptions)

          case SaveMode.ErrorIfExists =>
            throw new AnalysisException(
              s"Table or view '$table' already exists. SaveMode: ErrorIfExists.")

          case SaveMode.Ignore =>
            // With `SaveMode.Ignore` mode, if table already exists, the save operation is expected
            // to not save the contents of the DataFrame and to not change the existing data.
            // Therefore, it is okay to do nothing here and then just return the relation below.
        }
      } else {
        createTable(df.schema, url, table, createTableOptions, conn)
        saveTable(df, url, table, jdbcOptions)
      }
    } finally {
      conn.close()
    }

    createRelation(sqlContext, parameters)
  }

Finally, the data insertion is completed by the org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils#saveTable function:


  /**
   * Saves the RDD to the database in a single transaction.
   */
  def saveTable(
      df: DataFrame,
      tableSchema: Option[StructType],
      isCaseSensitive: Boolean,
      options: JDBCOptions): Unit = {
    val url = options.url
    val table = options.table
    val dialect = JdbcDialects.get(url)
    val rddSchema = df.schema
    val getConnection: () => Connection = createConnectionFactory(options)
    val batchSize = options.batchSize
    val isolationLevel = options.isolationLevel

    val insertStmt = getInsertStatement(table, rddSchema, tableSchema, isCaseSensitive, dialect)
    val repartitionedDF = options.numPartitions match {
      case Some(n) if n <= 0 => throw new IllegalArgumentException(
        s"Invalid value `$n` for parameter `${JDBCOptions.JDBC_NUM_PARTITIONS}` in table writing " +
          "via JDBC. The minimum value is 1.")
      case Some(n) if n < df.rdd.getNumPartitions => df.coalesce(n)
      case _ => df
    }
    repartitionedDF.foreachPartition(iterator => savePartition(
      getConnection, table, iterator, rddSchema, insertStmt, batchSize, dialect, isolationLevel)
    )
  }

  /**
   * Returns an Insert SQL statement for inserting a row into the target table via JDBC conn.
   */
  def getInsertStatement(
      table: String,
      rddSchema: StructType,
      tableSchema: Option[StructType],
      isCaseSensitive: Boolean,
      dialect: JdbcDialect): String = {
    val columns = if (tableSchema.isEmpty) {
      rddSchema.fields.map(x => dialect.quoteIdentifier(x.name)).mkString(",")
    } else {
      val columnNameEquality = if (isCaseSensitive) {
        org.apache.spark.sql.catalyst.analysis.caseSensitiveResolution
      } else {
        org.apache.spark.sql.catalyst.analysis.caseInsensitiveResolution
      }
      // The generated insert statement needs to follow rddSchema's column sequence and
      // tableSchema's column names. When appending data into some case-sensitive DBMSs like
      // PostgreSQL/Oracle, we need to respect the existing case-sensitive column names instead of
      // RDD column names for user convenience.
      val tableColumnNames = tableSchema.get.fieldNames
      rddSchema.fields.map { col =>
        val normalizedName = tableColumnNames.find(f => columnNameEquality(f, col.name)).getOrElse {
          throw new AnalysisException(s"""Column "${col.name}" not found in schema $tableSchema""")
        }
        dialect.quoteIdentifier(normalizedName)
      }.mkString(",")
    }
    val placeholders = rddSchema.fields.map(_ => "?").mkString(",")
    s"INSERT INTO $table ($columns) VALUES ($placeholders)"
  }
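
For example, for a DataFrame with columns id and name written to MySQL table t1, the generated statement would be INSERT INTO t1 (`id`,`name`) VALUES (?,?) (the MySQL dialect quotes identifiers with backticks). With rewriteBatchedStatements=true in the URL, the MySQL driver can rewrite the batched executions of this prepared statement into multi-row inserts on the wire.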

  /**
   * Saves a partition of a DataFrame to the JDBC database.  This is done in
   * a single database transaction (unless isolation level is "NONE")
   * in order to avoid repeatedly inserting data as much as possible.
   *
   * It is still theoretically possible for rows in a DataFrame to be
   * inserted into the database more than once if a stage somehow fails after
   * the commit occurs but before the stage can return successfully.
   *
   * This is not a closure inside saveTable() because apparently cosmetic
   * implementation changes elsewhere might easily render such a closure
   * non-Serializable.  Instead, we explicitly close over all variables that
   * are used.
   */
  def savePartition(
      getConnection: () => Connection,
      table: String,
      iterator: Iterator[Row],
      rddSchema: StructType,
      insertStmt: String,
      batchSize: Int,
      dialect: JdbcDialect,
      isolationLevel: Int): Iterator[Byte] = {
    val conn = getConnection()
    var committed = false

    var finalIsolationLevel = Connection.TRANSACTION_NONE
    if (isolationLevel != Connection.TRANSACTION_NONE) {
      try {
        val metadata = conn.getMetaData
        if (metadata.supportsTransactions()) {
          // Update to at least use the default isolation, if any transaction level
          // has been chosen and transactions are supported
          val defaultIsolation = metadata.getDefaultTransactionIsolation
          finalIsolationLevel = defaultIsolation
          if (metadata.supportsTransactionIsolationLevel(isolationLevel))  {
            // Finally update to actually requested level if possible
            finalIsolationLevel = isolationLevel
          } else {
            logWarning(s"Requested isolation level $isolationLevel is not supported; " +
                s"falling back to default isolation level $defaultIsolation")
          }
        } else {
          logWarning(s"Requested isolation level $isolationLevel, but transactions are unsupported")
        }
      } catch {
        case NonFatal(e) => logWarning("Exception while detecting transaction support", e)
      }
    }
    val supportsTransactions = finalIsolationLevel != Connection.TRANSACTION_NONE

    try {
      if (supportsTransactions) {
        conn.setAutoCommit(false) // Everything in the same db transaction.
        conn.setTransactionIsolation(finalIsolationLevel)
      }
      val stmt = conn.prepareStatement(insertStmt)
      val setters = rddSchema.fields.map(f => makeSetter(conn, dialect, f.dataType))
      val nullTypes = rddSchema.fields.map(f => getJdbcType(f.dataType, dialect).jdbcNullType)
      val numFields = rddSchema.fields.length

      try {
        var rowCount = 0
        while (iterator.hasNext) {
          val row = iterator.next()
          var i = 0
          while (i < numFields) {
            if (row.isNullAt(i)) {
              stmt.setNull(i + 1, nullTypes(i))
            } else {
              setters(i).apply(stmt, row, i)
            }
            i = i + 1
          }
          stmt.addBatch()
          rowCount += 1
          if (rowCount % batchSize == 0) {
            stmt.executeBatch()
            rowCount = 0
          }
        }
        if (rowCount > 0) {
          stmt.executeBatch()
        }
      } finally {
        stmt.close()
      }
      if (supportsTransactions) {
        conn.commit()
      }
      committed = true
      Iterator.empty
    } catch {
      case e: SQLException =>
        val cause = e.getNextException
        if (cause != null && e.getCause != cause) {
          // If there is no cause already, set 'next exception' as cause. If cause is null,
          // it *may* be because no cause was set yet
          if (e.getCause == null) {
            try {
              e.initCause(cause)
            } catch {
              // Or it may be null because the cause *was* explicitly initialized, to *null*,
              // in which case this fails. There is no other way to detect it.
              // addSuppressed in this case as well.
              case _: IllegalStateException => e.addSuppressed(cause)
            }
          } else {
            e.addSuppressed(cause)
          }
        }
        throw e
    } finally {
      if (!committed) {
        // The stage must fail.  We got here through an exception path, so
        // let the exception through unless rollback() or close() want to
        // tell the user about another problem.
        if (supportsTransactions) {
          conn.rollback()
        }
        conn.close()
      } else {
        // The stage must succeed.  We cannot propagate any exception close() might throw.
        try {
          conn.close()
        } catch {
          case e: Exception => logWarning("Transaction succeeded, but closing failed", e)
        }
      }
    }
  }
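
Note how the tuning options map onto this code: batchSize controls how often addBatch() is flushed with executeBatch(), and isolationLevel = NONE leaves supportsTransactions false, so setAutoCommit(false) and commit() are skipped and each batch is committed under the connection's default auto-commit behaviour.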

 

 

Examples of Spark SQL JDBC read parameters:

https://www.cnblogs.com/wwxbi/p/6978774.html

Parameter examples:

https://stackoverflow.com/questions/41085238/what-is-the-meaning-of-partitioncolumn-lowerbound-upperbound-numpartitions-pa

partitionColumn, lowerBound, upperBound usage example:

https://blog.csdn.net/wiborgite/article/details/84944596

https://mp.weixin.qq.com/s?__biz=MzA5MTc0NTMwNQ==&mid=2650718782&idx=1&sn=ecb0fc74876b67a811e17365274d6ab0&chksm=887ddf48bf0a565e8992fc290f7bbbdfa5c2df325377852463fa2c253341bad59745fc8a2cb3&scene=0&xtrack=1&key=4ac440fd252317b96ed0ee273315387dbf4abd3eca5e023c192f65c1b8ac2b5041c2a1112e2662608fc3460662d4f651e0812a687111a8262c18163a90e32e72856dcf600897e5b9b193fc7c07d630fa&ascene=1&uin=MjkxNzIzOTMxNA%3D%3D&devicetype=Windows+10&version=62060834&lang=zh_CN&pass_ticket=WcLK4S0l4KKMY6vw6GCXNpm%2BJYK%2FV70piO6l4aHvaxHXeJIzqMbwpMW0L4DdzlHq