【Spark】Spark SQL写入Mysql优化

Spark2.2.0

Spark SQL基于JDBC写入Mysql速度非常慢，且对Mysql造成的负载比较高。

SparkSQL JDBC部分参数如下

url	要连接的JDBC URL。列如：jdbc:mysql://ip:3306
dbtable	应该读取的JDBC表，可以使用括号中的子查询代替完整的表。例如，(select * from table_name) as t1,必须给查询结果加上别名）
driver	用于连接到此URL的JDBC驱动程序的类名,列如：com.mysql.jdbc.Driver
`partitionColumn, lowerBound,` `upperBound`	这三个字段必须都被指定。当读取数据库，从多个worker并行读取数据，如何分割表。 `partitionColumn：`分区字段必须是数值类型，int、float、double、decimal均可； `lowerBound：下界，必须是整数` `upperBound：上界，必须是整数` `lowerBound、upperBound用于决定分区的大小，不用于过滤表中的行。表中的所有行将被分割、返回。`
`numPartitions`	并行读写数据库的分区数，也决定了JDBC的连接并发数。如果RDD的分区数超过了此值，Spark会在写入数据库前调用coalesce(numPartitions)减少分区数到此值。
fetchsize	仅适用于read数据。JDBC提取大小，用于确定每次获取的行数。这可以帮助JDBC驱动程序调优性能，这些驱动程序默认具有较低的提取大小（例如，Oracle每次提取10行）。
batchsize	仅适用于write数据。JDBC批量大小，用于确定每次insert的行数。这可以帮助JDBC驱动程序调优性能。默认为1000。
isolationLevel	仅适用于write数据。事务隔离级别，适用于当前连接。它可以是一个NONE，READ_COMMITTED，READ_UNCOMMITTED，REPEATABLE_READ，或SERIALIZABLE，对应于由JDBC的连接对象定义，缺省值为标准事务隔离级别READ_UNCOMMITTED。
truncate	仅适用于write数据。当SaveMode.Overwrite启用时，此选项会truncate在MySQL中的表（而不是删除，再重建其现有的表）。这可以更有效，并且防止表元数据（例如，索引）被去除。但是，在某些情况下，例如当新数据具有不同的模式时，它将无法工作。它默认为false。
createTableOptions	仅适用于write数据。此选项允许在创建表（例如CREATE TABLE t (name string) ENGINE=InnoDB.）时设置特定的数据库表和分区选项。
`createTableColumnTypes`	仅适用于write数据。创建表时指定表字段的数据类型。指定的数据类型必须与sparkSQL的类型一致

url:在url后加上参数rewriteBatchedStatements=true表示MySQL服务开启批次写入，此参数是批次写入的一个比较重要参数，可明显提升性能
batchsize：DataFrame writer批次写入MySQL 的条数，也为提升性能参数
isolationLevel：事务隔离级别，DataFrame写入不需要开启事务，为NONE
truncate：overwrite模式时可用，表时在覆盖原始数据时不会删除表结构而是复用

SparkSQL基于JDBC方式写入Mysql数据库：

结论：df在写入Mysql时，按照分区的方式写入，调用JDBC，PrepareStatement方式组装Mysql的SQL

https://blog.csdn.net/IT_xhf/article/details/85336074

https://www.jianshu.com/p/429e64663b0e

  override def createRelation(
      sqlContext: SQLContext,
      mode: SaveMode,
      parameters: Map[String, String],
      df: DataFrame): BaseRelation = {
    val jdbcOptions = new JDBCOptions(parameters)
    val url = jdbcOptions.url
    val table = jdbcOptions.table
    val createTableOptions = jdbcOptions.createTableOptions
    val isTruncate = jdbcOptions.isTruncate

    val conn = JdbcUtils.createConnectionFactory(jdbcOptions)()
    try {
      val tableExists = JdbcUtils.tableExists(conn, url, table)
      if (tableExists) {
        mode match {
          case SaveMode.Overwrite =>
            if (isTruncate && isCascadingTruncateTable(url) == Some(false)) {
              // In this case, we should truncate table and then load.
              truncateTable(conn, table)
              saveTable(df, url, table, jdbcOptions)
            } else {
              // Otherwise, do not truncate the table, instead drop and recreate it
              dropTable(conn, table)
              createTable(df.schema, url, table, createTableOptions, conn)
              saveTable(df, url, table, jdbcOptions)
            }

          case SaveMode.Append =>
            saveTable(df, url, table, jdbcOptions)

          case SaveMode.ErrorIfExists =>
            throw new AnalysisException(
              s"Table or view '$table' already exists. SaveMode: ErrorIfExists.")

          case SaveMode.Ignore =>
            // With `SaveMode.Ignore` mode, if table already exists, the save operation is expected
            // to not save the contents of the DataFrame and to not change the existing data.
            // Therefore, it is okay to do nothing here and then just return the relation below.
        }
      } else {
        createTable(df.schema, url, table, createTableOptions, conn)
        saveTable(df, url, table, jdbcOptions)
      }
    } finally {
      conn.close()
    }

    createRelation(sqlContext, parameters)
  }

最后通过org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils#saveTable函数完成数据的插入

  /**
   * Saves the RDD to the database in a single transaction.
   */
  def saveTable(
      df: DataFrame,
      tableSchema: Option[StructType],
      isCaseSensitive: Boolean,
      options: JDBCOptions): Unit = {
    val url = options.url
    val table = options.table
    val dialect = JdbcDialects.get(url)
    val rddSchema = df.schema
    val getConnection: () => Connection = createConnectionFactory(options)
    val batchSize = options.batchSize
    val isolationLevel = options.isolationLevel

    val insertStmt = getInsertStatement(table, rddSchema, tableSchema, isCaseSensitive, dialect)
    val repartitionedDF = options.numPartitions match {
      case Some(n) if n <= 0 => throw new IllegalArgumentException(
        s"Invalid value `$n` for parameter `${JDBCOptions.JDBC_NUM_PARTITIONS}` in table writing " +
          "via JDBC. The minimum value is 1.")
      case Some(n) if n < df.rdd.getNumPartitions => df.coalesce(n)
      case _ => df
    }
    repartitionedDF.foreachPartition(iterator => savePartition(
      getConnection, table, iterator, rddSchema, insertStmt, batchSize, dialect, isolationLevel)
    )
  }


  /**
   * Returns an Insert SQL statement for inserting a row into the target table via JDBC conn.
   */
  def getInsertStatement(
      table: String,
      rddSchema: StructType,
      tableSchema: Option[StructType],
      isCaseSensitive: Boolean,
      dialect: JdbcDialect): String = {
    val columns = if (tableSchema.isEmpty) {
      rddSchema.fields.map(x => dialect.quoteIdentifier(x.name)).mkString(",")
    } else {
      val columnNameEquality = if (isCaseSensitive) {
        org.apache.spark.sql.catalyst.analysis.caseSensitiveResolution
      } else {
        org.apache.spark.sql.catalyst.analysis.caseInsensitiveResolution
      }
      // The generated insert statement needs to follow rddSchema's column sequence and
      // tableSchema's column names. When appending data into some case-sensitive DBMSs like
      // PostgreSQL/Oracle, we need to respect the existing case-sensitive column names instead of
      // RDD column names for user convenience.
      val tableColumnNames = tableSchema.get.fieldNames
      rddSchema.fields.map { col =>
        val normalizedName = tableColumnNames.find(f => columnNameEquality(f, col.name)).getOrElse {
          throw new AnalysisException(s"""Column "${col.name}" not found in schema $tableSchema""")
        }
        dialect.quoteIdentifier(normalizedName)
      }.mkString(",")
    }
    val placeholders = rddSchema.fields.map(_ => "?").mkString(",")
    s"INSERT INTO $table ($columns) VALUES ($placeholders)"
  }


  /**
   * Saves a partition of a DataFrame to the JDBC database.  This is done in
   * a single database transaction (unless isolation level is "NONE")
   * in order to avoid repeatedly inserting data as much as possible.
   *
   * It is still theoretically possible for rows in a DataFrame to be
   * inserted into the database more than once if a stage somehow fails after
   * the commit occurs but before the stage can return successfully.
   *
   * This is not a closure inside saveTable() because apparently cosmetic
   * implementation changes elsewhere might easily render such a closure
   * non-Serializable.  Instead, we explicitly close over all variables that
   * are used.
   */
  def savePartition(
      getConnection: () => Connection,
      table: String,
      iterator: Iterator[Row],
      rddSchema: StructType,
      insertStmt: String,
      batchSize: Int,
      dialect: JdbcDialect,
      isolationLevel: Int): Iterator[Byte] = {
    val conn = getConnection()
    var committed = false

    var finalIsolationLevel = Connection.TRANSACTION_NONE
    if (isolationLevel != Connection.TRANSACTION_NONE) {
      try {
        val metadata = conn.getMetaData
        if (metadata.supportsTransactions()) {
          // Update to at least use the default isolation, if any transaction level
          // has been chosen and transactions are supported
          val defaultIsolation = metadata.getDefaultTransactionIsolation
          finalIsolationLevel = defaultIsolation
          if (metadata.supportsTransactionIsolationLevel(isolationLevel))  {
            // Finally update to actually requested level if possible
            finalIsolationLevel = isolationLevel
          } else {
            logWarning(s"Requested isolation level $isolationLevel is not supported; " +
                s"falling back to default isolation level $defaultIsolation")
          }
        } else {
          logWarning(s"Requested isolation level $isolationLevel, but transactions are unsupported")
        }
      } catch {
        case NonFatal(e) => logWarning("Exception while detecting transaction support", e)
      }
    }
    val supportsTransactions = finalIsolationLevel != Connection.TRANSACTION_NONE

    try {
      if (supportsTransactions) {
        conn.setAutoCommit(false) // Everything in the same db transaction.
        conn.setTransactionIsolation(finalIsolationLevel)
      }
      val stmt = conn.prepareStatement(insertStmt)
      val setters = rddSchema.fields.map(f => makeSetter(conn, dialect, f.dataType))
      val nullTypes = rddSchema.fields.map(f => getJdbcType(f.dataType, dialect).jdbcNullType)
      val numFields = rddSchema.fields.length

      try {
        var rowCount = 0
        while (iterator.hasNext) {
          val row = iterator.next()
          var i = 0
          while (i < numFields) {
            if (row.isNullAt(i)) {
              stmt.setNull(i + 1, nullTypes(i))
            } else {
              setters(i).apply(stmt, row, i)
            }
            i = i + 1
          }
          stmt.addBatch()
          rowCount += 1
          if (rowCount % batchSize == 0) {
            stmt.executeBatch()
            rowCount = 0
          }
        }
        if (rowCount > 0) {
          stmt.executeBatch()
        }
      } finally {
        stmt.close()
      }
      if (supportsTransactions) {
        conn.commit()
      }
      committed = true
      Iterator.empty
    } catch {
      case e: SQLException =>
        val cause = e.getNextException
        if (cause != null && e.getCause != cause) {
          // If there is no cause already, set 'next exception' as cause. If cause is null,
          // it *may* be because no cause was set yet
          if (e.getCause == null) {
            try {
              e.initCause(cause)
            } catch {
              // Or it may be null because the cause *was* explicitly initialized, to *null*,
              // in which case this fails. There is no other way to detect it.
              // addSuppressed in this case as well.
              case _: IllegalStateException => e.addSuppressed(cause)
            }
          } else {
            e.addSuppressed(cause)
          }
        }
        throw e
    } finally {
      if (!committed) {
        // The stage must fail.  We got here through an exception path, so
        // let the exception through unless rollback() or close() want to
        // tell the user about another problem.
        if (supportsTransactions) {
          conn.rollback()
        }
        conn.close()
      } else {
        // The stage must succeed.  We cannot propagate any exception close() might throw.
        try {
          conn.close()
        } catch {
          case e: Exception => logWarning("Transaction succeeded, but closing failed", e)
        }
      }
    }
  }

SparkSQL读数据库参数案例

https://www.cnblogs.com/wwxbi/p/6978774.html

参数的案例

https://stackoverflow.com/questions/41085238/what-is-the-meaning-of-partitioncolumn-lowerbound-upperbound-numpartitions-pa

partitionColumn, lowerBound,upperBound使用案例

https://blog.csdn.net/wiborgite/article/details/84944596

https://mp.weixin.qq.com/s?__biz=MzA5MTc0NTMwNQ==&mid=2650718782&idx=1&sn=ecb0fc74876b67a811e17365274d6ab0&chksm=887ddf48bf0a565e8992fc290f7bbbdfa5c2df325377852463fa2c253341bad59745fc8a2cb3&scene=0&xtrack=1&key=4ac440fd252317b96ed0ee273315387dbf4abd3eca5e023c192f65c1b8ac2b5041c2a1112e2662608fc3460662d4f651e0812a687111a8262c18163a90e32e72856dcf600897e5b9b193fc7c07d630fa&ascene=1&uin=MjkxNzIzOTMxNA%3D%3D&devicetype=Windows+10&version=62060834&lang=zh_CN&pass_ticket=WcLK4S0l4KKMY6vw6GCXNpm%2BJYK%2FV70piO6l4aHvaxHXeJIzqMbwpMW0L4DdzlHq

我是旺领导

发布了61 篇原创文章 · 获赞 2 · 访问量 7303

私信关注

【Spark】Spark SQL写入Mysql优化

猜你喜欢