Spark loads data sources and runs Scala code from JAR packages

One, load the CSV file

1. Using SparkContext

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf().setAppName("CsvDemo").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val lines = sc.textFile("in/users.csv")
    lines.foreach(println)

The output looks like this:

user_id,locale,birthyear,gender,joinedAt,location,timezone
3197468391,id_ID,1993,male,2012-10-02T06:40:55.524Z,Medan  Indonesia,480
3537982273,id_ID,1992,male,2012-09-29T18:03:12.111Z,Medan  Indonesia,420
823183725,en_US,1975,male,2012-10-06T03:14:07.149Z,Stratford  Ontario,-240
1872223848,en_US,1991,female,2012-11-04T08:59:43.783Z,Tehran  Iran,210
3429017717,id_ID,1995,female,2012-09-10T16:06:53.132Z,,420
627175141,ka_GE,1973,female,2012-11-01T09:59:17.590Z,Tbilisi  Georgia,240
2752000443,id_ID,1994,male,2012-10-03T05:22:17.637Z,Medan  Indonesia,420
3473687777,id_ID,1965,female,2012-10-03T12:19:29.975Z,Medan  Indonesia,420
2966052962,id_ID,1979,male,2012-10-31T10:11:57.668Z,Medan  Indonesia,420
264876277,id_ID,1988,female,2012-10-02T07:28:09.555Z,Medan  Indonesia,420
1534483818,en_US,1992,male,2012-09-25T13:38:04.083Z,Medan  Indonesia,420
2648135297,en_US,1996,female,2012-10-30T05:09:45.592Z,Phnom Penh,420
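
With plain textFile, the header row and field splitting have to be handled by hand. A minimal sketch continuing from the lines RDD above (the column index follows the header shown):

    // Drop the header row, then split each record into its fields;
    // split(",", -1) keeps trailing empty fields such as a missing location
    val header = lines.first()
    val users = lines.filter(_ != header).map(_.split(",", -1))
    // e.g. count users per locale (column 1)
    users.map(fields => (fields(1), 1)).reduceByKey(_ + _).foreach(println)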

2. Using SparkSession

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("CSV").master("local[*]").getOrCreate()
    //val df = spark.read.format("csv").option("header","false").load("in/users.csv")    // equivalent to the line below
    val df = spark.read.csv("in/users.csv")
    df.show()

The output looks like this:

+----------+------+---------+------+--------------------+------------------+--------+
|       _c0|   _c1|      _c2|   _c3|                 _c4|               _c5|     _c6|
+----------+------+---------+------+--------------------+------------------+--------+
|   user_id|locale|birthyear|gender|            joinedAt|          location|timezone|
|3197468391| id_ID|     1993|  male|2012-10-02T06:40:...|  Medan  Indonesia|     480|
|3537982273| id_ID|     1992|  male|2012-09-29T18:03:...|  Medan  Indonesia|     420|
| 823183725| en_US|     1975|  male|2012-10-06T03:14:...|Stratford  Ontario|    -240|
|1872223848| en_US|     1991|female|2012-11-04T08:59:...|      Tehran  Iran|     210|
|3429017717| id_ID|     1995|female|2012-09-10T16:06:...|              null|     420|
| 627175141| ka_GE|     1973|female|2012-11-01T09:59:...|  Tbilisi  Georgia|     240|
|2752000443| id_ID|     1994|  male|2012-10-03T05:22:...|  Medan  Indonesia|     420|
|3473687777| id_ID|     1965|female|2012-10-03T12:19:...|  Medan  Indonesia|     420|
|2966052962| id_ID|     1979|  male|2012-10-31T10:11:...|  Medan  Indonesia|     420|
| 264876277| id_ID|     1988|female|2012-10-02T07:28:...|  Medan  Indonesia|     420|
|1534483818| en_US|     1992|  male|2012-09-25T13:38:...|  Medan  Indonesia|     420|
|2648135297| en_US|     1996|female|2012-10-30T05:09:...|        Phnom Penh|     420|
+----------+------+---------+------+--------------------+------------------+--------+
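
Because the header row was read as ordinary data, the columns come out as _c0 ... _c6 and the header appears as the first data row. Passing the header option makes Spark take the column names from the first line instead, for example:

    val df2 = spark.read.option("header", "true").csv("in/users.csv")
    df2.select("user_id", "locale", "gender").show(3)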

Two, load the JSON data source

1. Using SparkContext

    import org.apache.spark.{SparkConf, SparkContext}
    import scala.util.parsing.json._

    val conf = new SparkConf().setAppName("Json").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val lines = sc.textFile("in/users.json")
    val rdd = lines.map(x => JSON.parseFull(x))
    rdd.collect.foreach(println)

The output looks like this:

Some(Map(name -> Michael))
Some(Map(name -> Andy, Age -> 30.0))
Some(Map(name -> Justin, Age -> 19.0))
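
JSON.parseFull returns Option[Any], so the values still have to be unwrapped before use. A small sketch continuing from the rdd above that pulls out the name field:

    // Each element is Some(Map(...)) on success, None on a parse failure
    val names = rdd.flatMap {
      case Some(map: Map[String, Any] @unchecked) => map.get("name")
      case _ => None
    }
    names.collect.foreach(println)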

2. Using SparkSession

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("JSON").master("local[*]").getOrCreate()
    //val df = spark.read.format("json").load("in/users.json")    // equivalent to the line below
    val df = spark.read.json("in/users.json")
    df.show()

The output looks like this:

+----+-------+
| Age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+
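
Once the JSON is loaded as a DataFrame it can be queried directly, either with the DataFrame API or through a temporary SQL view, for example:

    // Filter and project with the DataFrame API
    df.filter("Age > 20").select("name", "Age").show()
    // Or run SQL against the same data
    df.createOrReplaceTempView("users")
    spark.sql("SELECT name FROM users WHERE Age IS NOT NULL").show()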

Three, run Scala code from a JAR package

1. Package the Spark code

  • Here the input and output paths are read from a configuration file, so they can be changed later without modifying the code
  • Note that the input file must be uploaded to the corresponding HDFS file system

Configuration file:

    loadfile:hdfs://192.168.8.99:9000/kb09File/world.txt
    outfile:hdfs://192.168.8.99:9000/kb09File/outfile

Scala code:

  def main(args: Array[String]): Unit = {
    import java.io.FileInputStream
    import java.util.Properties
    import org.apache.spark.{SparkConf, SparkContext}

    // setMaster can be omitted when the master is passed via spark-submit
    val conf = new SparkConf().setAppName("WordCount").setMaster("local[*]")
    val sc = new SparkContext(conf)
    // Read the input/output paths from the properties file on the local file system
    val properties = new Properties()
    properties.load(new FileInputStream("/opt/userset.properties"))
    val loadFilePath = properties.getProperty("loadfile")
    val outFilePath = properties.getProperty("outfile")
    val lines = sc.textFile(loadFilePath)
    // Word count; cache so the result is not recomputed by the second action
    val wordCounts = lines.flatMap(x => x.split(" ")).map(x => (x, 1)).reduceByKey(_ + _).cache()
    wordCounts.saveAsTextFile(outFilePath)
    wordCounts.foreach(println)
    sc.stop()
  }
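
How the JAR itself is built is not shown here; as one option, a minimal build.sbt sketch (the project name and versions are illustrative assumptions, so match them to your cluster), after which sbt package produces the JAR to upload:

    // build.sbt -- versions are illustrative; use your cluster's Spark/Scala versions
    name := "sparkdemo1"
    version := "0.1"
    scalaVersion := "2.11.12"
    // "provided": spark-core is supplied by the cluster at run time
    libraryDependencies += "org.apache.spark" %% "spark-core" % "2.4.5" % "provided"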

2. Remove the signature files

  • Upload the JAR package to the virtual machine
  • Run the following command to delete the signature files inside the JAR; otherwise running it may fail with an invalid-signature error
	zip -d /opt/kb09File/sparkdemo1.jar 'META-INF/*.DSA' 'META-INF/*.SF'

3. Run the JAR package

  • Upload the input file referenced in the configuration to the HDFS file system (the configuration file itself is read from the local path /opt/userset.properties)
  • Run the Scala code in the JAR with spark-submit
	spark-submit --class gaoji.WordCount --master local[1] ./sparkdemo1.jar
