1. Read the file
Read the file into an RDD with the sc.textFile() method.
val lines = sc.textFile("file://")  // local file path or HDFS file path
Local file path:
"file:///home/hadoop/spark-1.6.0-bin-hadoop2.6/examples/src/main/resources/people.json"
HDFS file path:
"hdfs://112.74.21.122:9000/user/hive/warehouse/hive_test"
2. Save the file
Save the RDD's contents to a file via rdd.saveAsTextFile("file://").
rdd.saveAsTextFile("file:///home/writeout.txt")  // write the RDD to /home/writeout.txt
But if we open the /home folder, we find that writeout.txt is not a text file but a folder; opening the folder shows the following structure:
Did we save it incorrectly? No, this is normal. part-00000 represents one partition; if the RDD has multiple partitions, there will be multiple part-xxxxx files.
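As a side note, the number of part files can be controlled before saving. Below is a minimal sketch, assuming a local Spark context; the output path writeout_single is illustrative:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SingleFileSave {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("SingleFileSave"))
    val rdd = sc.parallelize(Seq("a", "b", "c"), numSlices = 3)  // an RDD with 3 partitions
    // coalesce(1) merges all partitions into one without a full shuffle,
    // so saveAsTextFile writes a single part-00000 data file.
    // All records end up on one worker, so only do this for small results.
    rdd.coalesce(1).saveAsTextFile("file:///home/writeout_single")
    sc.stop()
  }
}
```

The part-file names follow the pattern part-NNNNN, zero-padded to five digits, one per partition.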
If we want to read the saved data back, we don't need to read the part files one by one; reading the output path directly is enough, and Spark will automatically load all partition data.
val rdd = sc.textFile("file:///home/writeout.txt/part-00000")  // no need to read the part files one by one
val rdd = sc.textFile("file:///home/writeout.txt")  // reading the output folder directly loads all partition data into the RDD
3. JSON data parsing
(1) Read JSON format file
Just use sc.textFile("file://") to read the .json file directly; each line of the file becomes one element of the RDD, as a JSON string.
(2) Parse the JSON data
Scala has a built-in JSON library scala.util.parsing.json.JSON that can parse JSON data.
Parse an input JSON string by calling the JSON.parseFull(jsonString: String) function.
If parsing succeeds, it returns Some(map: Map[String, Any]); if it fails, it returns None.
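The Some/None behavior can be seen without Spark at all; a minimal sketch (scala.util.parsing.json ships with the Scala 2.10 standard library bundled with Spark 1.6; later Scala versions moved it to the separate scala-parser-combinators module):

```scala
import scala.util.parsing.json.JSON

object ParseFullDemo {
  def main(args: Array[String]): Unit = {
    val good = """{"name":"Andy","age":30}"""
    val bad  = """{"name": Andy}"""  // unquoted value: invalid JSON

    // Success: Some(Map(...)); note that JSON numbers are parsed as Double,
    // so "age":30 comes back as the Double 30.0
    println(JSON.parseFull(good))
    // Failure: None
    println(JSON.parseFull(bad))
  }
}
```

Because every value is typed Any (and numbers become Double), downstream code has to pattern match or cast before using the fields.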
Example:
File content:
Each {...} is one piece of JSON-formatted data; a JSON file contains several such records, one per line.
Now let's parse the content of this JSON file.
1. Write a program
import org.apache.spark._
import scala.util.parsing.json.JSON

object JSONApp {
  def main(args: Array[String]): Unit = {
    // Initialize the configuration: set the master and the application name
    val conf = new SparkConf().setMaster("local").setAppName("JSONApp")
    // Create a SparkContext from the configuration
    val sc = new SparkContext(conf)
    val inputFile = "file:///usr/local/spark/examples/src/main/resources/people.json"  // path of the JSON file
    val jsonStr = sc.textFile(inputFile)
    val result = jsonStr.map(s => JSON.parseFull(s))  // parse the JSON strings one by one
    result.foreach(r => r match {
      case Some(map: Map[String, Any]) => println(map)
      case None => println("parsing failed!")
      case other => println("unknown data structure: " + other)
    })
  }
}
2. Package the entire program into a jar package
3. Run the program through spark-submit
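The submission command typically looks like the following; the class name matches the object that defines main() above, while the jar path depends on your build tool and is only illustrative:

```shell
# Submit the packaged application to a local Spark instance.
# --class must name the object containing main(); the jar path is
# whatever your sbt/maven build produced (shown here as an example).
spark-submit \
  --class "JSONApp" \
  --master local \
  /usr/local/spark/mycode/json/target/scala-2.10/jsonapp_2.10-1.0.jar
```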
4. View the results
After running the program, a lot of output appears on the screen; locate the following lines among the output, which show that the parsing succeeded.