Spark Core: data input and output

Text file input and output

Read a text file

scala> sc.textFile("./wc.txt")
res4: org.apache.spark.rdd.RDD[String] = ./wc.txt MapPartitionsRDD[5] at textFile at <console>:25

Save the text file

scala> res4.saveAsTextFile("./test")

JSON / CSV file input and output

Input and output for these structured formats are still handled through plain text file input and output; Spark Core has no built-in support for parsing or serializing JSON and CSV files, so that logic has to be implemented by the user according to their own needs. Note: if a JSON file is to be read with multiple partitions, the file should generally contain one JSON record per line. If your JSON records span multiple lines, the whole file has to be read in and parsed as a unit.
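As a minimal sketch (not from the original post), the CSV case can be handled in the spark-shell with plain string splitting on top of textFile and saveAsTextFile; the path ./people.csv and the name/age columns are made up for illustration:

// Assumed input: one CSV record per line, e.g. "tom,12" (path and columns are illustrative)
case class Person(name: String, age: Int)

val lines  = sc.textFile("./people.csv")
val people = lines.map(_.split(","))                        // split each line into fields
                  .map(f => Person(f(0), f(1).trim.toInt))  // parse the fields yourself

// Re-serialize before writing, since saveAsTextFile only writes strings
people.map(p => s"${p.name},${p.age}").saveAsTextFile("./people_out")

Line-delimited JSON follows the same pattern: map each line through whatever JSON library is on the classpath (for example json4s or Jackson) and serialize back to a string before saving.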

SequenceFile input and output

Save as a SequenceFile

scala> val rdd =sc.parallelize(List((1,"aa"),(2,"bb"),(4,"cc")))
rdd: org.apache.spark.rdd.RDD[(Int, String)] = ParallelCollectionRDD[7] at parallelize at <console>:24

scala> rdd.saveAsSequenceFile("./seq")

View the SequenceFile

scala> sc.sequenceFile[Int,String]("./seq").collect
res8: Array[(Int, String)] = Array((2,bb), (4,cc), (1,aa))

Object file input and output

Save object file

scala> val rdd =sc.parallelize(List((1,"aa"),(2,"bb"),(4,"cc")))
rdd: org.apache.spark.rdd.RDD[(Int, String)] = ParallelCollectionRDD[11] at parallelize at <console>:24

scala> rdd.saveAsObjectFile("./obj")

View object file

scala> sc.objectFile[(Int,String)]("obj").collect
res10: Array[(Int, String)] = Array((2,bb), (4,cc), (1,aa))

Hadoop input and output

Read from Hadoop

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.{SparkConf, SparkContext}

object ReadHadoopFile {
  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setMaster("local[2]").setAppName("HadoopFileApp")
    val sc = new SparkContext(sparkConf)

    // Read with the new (mapreduce) Hadoop API: keys are byte offsets, values are the lines
    val input = sc.newAPIHadoopFile[LongWritable, Text, TextInputFormat](
      "/output/part*",
      classOf[TextInputFormat],
      classOf[LongWritable],
      classOf[Text])

    println("Number of records: " + input.count)
    input.foreach(print(_))
    input.first
    sc.stop()
  }
}

Save to Hadoop

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextOutputFormat
import org.apache.spark.{SparkConf, SparkContext}

object WriteHadoopFile {
  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setMaster("local[2]").setAppName("HadoopFileApp")
    val sc = new SparkContext(sparkConf)

    val initialRDD = sc.parallelize(Array(("hadoop", 30), ("hive", 71), ("cat", 11)))

    // Write with the old (mapred) Hadoop API; TextOutputFormat calls toString on keys and values
    initialRDD.saveAsHadoopFile("/output/",
      classOf[Text],
      classOf[LongWritable],
      classOf[TextOutputFormat[Text, LongWritable]])
    sc.stop()
  }
}
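
Not part of the original post: if you prefer to stay on the new (mapreduce) Hadoop API that the read example uses, pair RDDs also offer saveAsNewAPIHadoopFile. A minimal sketch, assuming the same pairs as above and a made-up output path /output_new/:

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat

// Same (String, Int) pairs as above; the new-API TextOutputFormat also just calls toString
val pairs = sc.parallelize(Array(("hadoop", 30), ("hive", 71), ("cat", 11)))
pairs.saveAsNewAPIHadoopFile("/output_new/",
  classOf[Text],
  classOf[LongWritable],
  classOf[TextOutputFormat[Text, LongWritable]])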

MySQL input and output

Read data from MySQL

import java.sql.DriverManager
import org.apache.spark.rdd.JdbcRDD
import org.apache.spark.{SparkConf, SparkContext}

object ReadMysql {
  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setMaster("local[2]").setAppName("MysqlApp")
    val sc = new SparkContext(sparkConf)
    val rdd = new JdbcRDD(
      sc,
      () => {
        // Connection factory, invoked on the executors
        Class.forName("com.mysql.jdbc.Driver").newInstance()
        DriverManager.getConnection("jdbc:mysql://linux01:3306/company", "root", "123456")
      },
      // The query must contain exactly two '?' placeholders for the id bounds
      "select * from staff where id >= ? and id <= ?;",
      1,   // lower bound
      100, // upper bound
      1,   // number of partitions
      r => (r.getString(1), r.getString(2), r.getString(3))).cache()

    println(rdd.count())
    rdd.foreach(println(_))
    sc.stop()
  }
}

Write data to MySQL


import java.sql.DriverManager
import org.apache.spark.{SparkConf, SparkContext}

object WriteMysql {
  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setMaster("local[2]").setAppName("MysqlApp")
    val sc = new SparkContext(sparkConf)
    val data = sc.parallelize(List(("Irelia", "Female"), ("Ezreal", "Male"), ("Alistar", "Female")))

    // Open one connection per partition on the executor side
    data.foreachPartition(insertData)
    sc.stop()
  }

  def insertData(iterator: Iterator[(String, String)]): Unit = {
    val conn = DriverManager.getConnection("jdbc:mysql://linux01:3306/company", "root", "123456")
    val ps = conn.prepareStatement("insert into staff(name, sex) values (?, ?)")
    iterator.foreach(data => {
      ps.setString(1, data._1)
      ps.setString(2, data._2)
      ps.executeUpdate()
    })
    ps.close()
    conn.close()
  }
}
