[Spark] File reading, writing, and JSON data parsing

1. Read the file

Read the file into an RDD through the sc.textFile("file://...") method.

val lines = sc.textFile("file://...")  // local file address or HDFS file path

Local file address:

"file:///home/hadoop/spark-1.6.0-bin-hadoop2.6/examples/src/main/resources/people.json"

HDFS file address:

"hdfs://112.74.21.122:9000/user/hive/warehouse/hive_test"

2. Save the file

Save the RDD content to a file via rdd.saveAsTextFile("file://...").

rdd.saveAsTextFile("file:///home/writeout.txt");  // write the RDD to /home/writeout.txt

But when we open the /home directory, we find that writeout.txt is not a txt file but a folder. Opening the folder, its structure is as follows:

(The writeout.txt folder typically contains a part-00000 data file and a _SUCCESS marker.)

Did we save it wrong? No, this is normal. part-00000 corresponds to one partition of the RDD; if there are multiple partitions, there will be multiple part-xxxxx files.

If we want to read the saved data back, we do not need to read the partition files one by one; just read the output directory directly, and Spark will automatically load all partition data.

val rdd = sc.textFile("file:///home/writeout.txt/part-00000");  // no need to read each partition file like this
val rdd = sc.textFile("file:///home/writeout.txt");  // reading the output directory directly loads all partition data into the RDD
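If a single output file is preferred, one common approach is to reduce the RDD to one partition before saving (a sketch; coalesce(1) pulls all data into a single partition, so it is only suitable for data small enough to fit on one node, and the output path below is hypothetical):

rdd.coalesce(1).saveAsTextFile("file:///home/writeout_single");  // the folder will contain a single part-00000 file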

3. JSON data parsing

(1) Read JSON format file

  Just use sc.textFile("file://...") to read the .json file directly; each line is read in as one string.

(2) Parse the JSON data

   Scala has a built-in JSON library, scala.util.parsing.json.JSON, that can parse JSON data.

Parse an input JSON string by calling the JSON.parseFull(jsonString: String) function.

  If parsing succeeds, it returns Some(map: Map[String, Any]); if it fails, it returns None.
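For a quick check in the Scala REPL (a minimal sketch using hard-coded JSON strings):

import scala.util.parsing.json.JSON
val ok  = JSON.parseFull("{\"name\":\"Andy\", \"age\":30}")  // Some(Map(name -> Andy, age -> 30.0)); numbers are parsed as Double
val bad = JSON.parseFull("not valid json")                   // None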

Example:

Document content:
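The people.json in the Spark examples directory contains three single-line records similar to the following (shown here as a reference; your file may differ):

{"name":"Michael"}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}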

We see that each {...} is one JSON record, and a JSON file can contain several such records, one per line.

Let's parse the content of this JSON file.

1. Write a program

import org.apache.spark._
import scala.util.parsing.json.JSON
object JSONApp {
    def main(args:Array[String]): Unit ={
        //Initialize the configuration: set the master URL ("local" means run locally) and the application name
        val conf = new SparkConf().setMaster("local").setAppName("JSONApp");
        //Create the SparkContext from the configuration
        val sc = new SparkContext(conf);

        val inputFile = "file:///usr/local/spark/examples/src/main/resources/people.json"  //path of the JSON file to read
        val jsonStr = sc.textFile(inputFile);
        val result = jsonStr.map(s => JSON.parseFull(s));  //parse the JSON strings one by one
        result.foreach(
            {
                r => r match {
                case Some(map:Map[String,Any]) => println(map)
                case None => println("parsing failed!")
                case other => println("unknown data structure" + other)
                }
            }
        );
        sc.stop();  //stop the SparkContext when the job is finished
    }
}

2. Package the entire program into a jar package

3. Run the program through spark-submit
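A typical way to do steps 2 and 3 from the command line (a sketch; the sbt-based build and the jar name/path are assumptions that depend on your project layout and Scala version):

# package the program into a jar (assumes an sbt project)
sbt package
# submit the jar to Spark and run it locally
spark-submit --class "JSONApp" --master local /path/to/jsonapp_2.10-1.0.jar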

4. View the results

After running the program, you will see a lot of output on the screen. Among it, look for lines similar to the following, which indicate that parsing succeeded:

Map(name -> Michael)
Map(name -> Andy, age -> 30.0)
Map(name -> Justin, age -> 19.0)
