SparkSQL data source read process
1. Spark currently supports reading JDBC, Hive, text, ORC, and other data source types out of the box; to read HBase or any other unsupported source, you must implement a custom data source.
2. The reading process
(1) SparkSQL starts with session.read.text() or session.read.format("text").options(Map("a"->"b")).load("")
read: creates a DataFrameReader object
format method: sets the data source type on the DataFrameReader
options method: sets additional configuration options on the DataFrameReader
Stepping into the session.read.text() method, you can see the format is "text"
(2) Enter the load method
load ultimately calls sparkSession.baseRelationToDataFrame, the method that finally creates the DataFrame
(3) Enter the DataSource.resolveRelation() method
Here providingClass is the class that implements the data source interface; the logic splits into two branches depending on whether a schema was passed in
(4) providingClass is resolved from the incoming data source type (the format); the mapping of all the data sources Spark provides is defined earlier in the Spark source code
4. Conclusion: just write a class that implements RelationProvider with the following method, which returns a BaseRelation:
def createRelation(sqlContext: SQLContext, parameters: Map[String, String]): BaseRelation
We only need to implement our logic inside that BaseRelation.
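The conclusion above can be sketched as a minimal custom provider. DemoSourceProvider and its single-column schema are invented for illustration, not taken from Spark; Spark can locate such a provider by its fully qualified class name passed to format():

```scala
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Hypothetical provider: implements RelationProvider and hands back a
// BaseRelation that only declares a schema (the scan logic would come
// from a TableScan-style trait, shown later).
class DemoSourceProvider extends RelationProvider {
  override def createRelation(
      ctx: SQLContext,
      parameters: Map[String, String]): BaseRelation =
    new BaseRelation {
      override val sqlContext: SQLContext = ctx
      // Invented one-column schema for the sketch
      override val schema: StructType =
        StructType(Seq(StructField("value", StringType)))
    }
}
```

With this on the classpath, spark.read.format("DemoSourceProvider").load() would resolve the class and call createRelation with the options map.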
5. How Spark reads JDBC
You also need a class implementing one of the xxxScan traits. There are three: TableScan (full table scan), PrunedScan (column pruning), and PrunedFilteredScan (column pruning plus predicate pushdown).
Implement the buildScan method to return an RDD of Row; combined with the schema variable from BaseRelation, that is everything needed to assemble a DataFrame.
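Putting points 4 and 5 together, a minimal full-scan relation might look like the following sketch; DemoRelation, its two-column schema, and the hard-coded rows are all invented for illustration:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, TableScan}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Hypothetical relation: BaseRelation supplies the schema, TableScan's
// buildScan supplies the RDD[Row]; together they form a DataFrame.
class DemoRelation(ctx: SQLContext) extends BaseRelation with TableScan {
  override val sqlContext: SQLContext = ctx

  override val schema: StructType = StructType(Seq(
    StructField("id", IntegerType),
    StructField("name", StringType)))

  // Full scan: serve three hard-coded rows instead of reading a real store
  override def buildScan(): RDD[Row] =
    ctx.sparkContext.parallelize(Seq(
      Row(1, "a"), Row(2, "b"), Row(3, "c")))
}
```

As point (2) described, spark.baseRelationToDataFrame(new DemoRelation(spark.sqlContext)) would turn this relation into a three-row DataFrame.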
6. The JDBCRDD.scanTable method produces the RDD
7. Looking at JDBCRDD's compute method, the data is fetched by executing a SQL query over JDBC
RDD computation is lazy: a chain of transformations is only evaluated when an action is encountered, and the partition is the basic unit of computation. No matter how complex an RDD is, in the end it calls its internal compute function to compute the data of a single partition.
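The same laziness can be seen with plain Scala iterators, no Spark required; this is only an analogy for the transformation/action split described above:

```scala
// Records which elements have actually been computed
val log = scala.collection.mutable.Buffer[Int]()

// "Transformation": building the pipeline runs nothing yet
val pipeline = Iterator(1, 2, 3).map { x => log += x; x * 10 }

assert(log.isEmpty)                // nothing computed so far

// "Action": toList forces the whole pipeline to evaluate
val result = pipeline.toList

assert(result == List(10, 20, 30))
assert(log == Seq(1, 2, 3))        // side effects happened only now
```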
override def compute(thePart: Partition, context: TaskContext): Iterator[InternalRow] = {
  var closed = false
  var rs: ResultSet = null
  var stmt: PreparedStatement = null
  var conn: Connection = null

  def close() {
    if (closed) return
    try {
      if (null != rs) { rs.close() }
    } catch {
      case e: Exception => logWarning("Exception closing resultset", e)
    }
    try {
      if (null != stmt) { stmt.close() }
    } catch {
      case e: Exception => logWarning("Exception closing statement", e)
    }
    try {
      if (null != conn) {
        if (!conn.isClosed && !conn.getAutoCommit) {
          try {
            conn.commit()
          } catch {
            case NonFatal(e) => logWarning("Exception committing transaction", e)
          }
        }
        conn.close()
      }
      logInfo("closed connection")
    } catch {
      case e: Exception => logWarning("Exception closing connection", e)
    }
    closed = true
  }

  context.addTaskCompletionListener { context => close() }

  val inputMetrics = context.taskMetrics().inputMetrics
  val part = thePart.asInstanceOf[JDBCPartition]
  conn = getConnection()
  val dialect = JdbcDialects.get(url)
  import scala.collection.JavaConverters._
  dialect.beforeFetch(conn, options.asProperties.asScala.toMap)

  // H2's JDBC driver does not support the setSchema() method. We pass a
  // fully-qualified table name in the SELECT statement. I don't know how to
  // talk about a table in a completely portable way.

  // Build this partition's WHERE clause from its filter conditions
  val myWhereClause = getWhereClause(part)
  // The final SQL query
  val sqlText = s"SELECT $columnList FROM ${options.table} $myWhereClause"
  // Run the query over JDBC
  stmt = conn.prepareStatement(sqlText, ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)
  stmt.setFetchSize(options.fetchSize)
  rs = stmt.executeQuery()
  val rowsIterator = JdbcUtils.resultSetToSparkInternalRows(rs, schema, inputMetrics)
  // Return the iterator
  CompletionIterator[InternalRow, Iterator[InternalRow]](
    new InterruptibleIterator(context, rowsIterator), close())
}