One, load the CSV data source
1. Use SparkContext
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("CsvDemo").setMaster("local[*]")
val sc = new SparkContext(conf)
val lines = sc.textFile("in/users.csv")
lines.foreach(println)
The output is as follows:
user_id,locale,birthyear,gender,joinedAt,location,timezone
3197468391,id_ID,1993,male,2012-10-02T06:40:55.524Z,Medan Indonesia,480
3537982273,id_ID,1992,male,2012-09-29T18:03:12.111Z,Medan Indonesia,420
823183725,en_US,1975,male,2012-10-06T03:14:07.149Z,Stratford Ontario,-240
1872223848,en_US,1991,female,2012-11-04T08:59:43.783Z,Tehran Iran,210
3429017717,id_ID,1995,female,2012-09-10T16:06:53.132Z,,420
627175141,ka_GE,1973,female,2012-11-01T09:59:17.590Z,Tbilisi Georgia,240
2752000443,id_ID,1994,male,2012-10-03T05:22:17.637Z,Medan Indonesia,420
3473687777,id_ID,1965,female,2012-10-03T12:19:29.975Z,Medan Indonesia,420
2966052962,id_ID,1979,male,2012-10-31T10:11:57.668Z,Medan Indonesia,420
264876277,id_ID,1988,female,2012-10-02T07:28:09.555Z,Medan Indonesia,420
1534483818,en_US,1992,male,2012-09-25T13:38:04.083Z,Medan Indonesia,420
2648135297,en_US,1996,female,2012-10-30T05:09:45.592Z,Phnom Penh,420
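With SparkContext, each element of `lines` is a raw string, so the fields have to be split by hand. A minimal plain-Scala sketch of that step, using one of the sample rows above; note the `-1` limit passed to `split`, which keeps empty fields (such as the missing location in the row for user 3429017717) instead of silently dropping them:

```scala
// Split one CSV row into fields. The -1 limit keeps empty
// fields (e.g. a missing location) instead of dropping them.
object CsvSplitDemo {
  def parseRow(line: String): Array[String] = line.split(",", -1)

  def main(args: Array[String]): Unit = {
    val row = "3429017717,id_ID,1995,female,2012-09-10T16:06:53.132Z,,420"
    val fields = parseRow(row)
    println(fields.length)     // 7 fields, including the empty location
    println(fields(5).isEmpty) // true
  }
}
```

The same `parseRow` can be mapped over the RDD, e.g. `lines.map(CsvSplitDemo.parseRow)`.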
2. Use SparkSession
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("CSV").master("local[*]").getOrCreate()
//val df = spark.read.format("csv").option("header","false").load("in/users.csv") // equivalent to the line below
val df = spark.read.csv("in/users.csv")
df.show()
The output is as follows:
+----------+------+---------+------+--------------------+------------------+--------+
| _c0| _c1| _c2| _c3| _c4| _c5| _c6|
+----------+------+---------+------+--------------------+------------------+--------+
| user_id|locale|birthyear|gender| joinedAt| location|timezone|
|3197468391| id_ID| 1993| male|2012-10-02T06:40:...| Medan Indonesia| 480|
|3537982273| id_ID| 1992| male|2012-09-29T18:03:...| Medan Indonesia| 420|
| 823183725| en_US| 1975| male|2012-10-06T03:14:...|Stratford Ontario| -240|
|1872223848| en_US| 1991|female|2012-11-04T08:59:...| Tehran Iran| 210|
|3429017717| id_ID| 1995|female|2012-09-10T16:06:...| null| 420|
| 627175141| ka_GE| 1973|female|2012-11-01T09:59:...| Tbilisi Georgia| 240|
|2752000443| id_ID| 1994| male|2012-10-03T05:22:...| Medan Indonesia| 420|
|3473687777| id_ID| 1965|female|2012-10-03T12:19:...| Medan Indonesia| 420|
|2966052962| id_ID| 1979| male|2012-10-31T10:11:...| Medan Indonesia| 420|
| 264876277| id_ID| 1988|female|2012-10-02T07:28:...| Medan Indonesia| 420|
|1534483818| en_US| 1992| male|2012-09-25T13:38:...| Medan Indonesia| 420|
|2648135297| en_US| 1996|female|2012-10-30T05:09:...| Phnom Penh| 420|
+----------+------+---------+------+--------------------+------------------+--------+
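Because no header option was passed, Spark assigns the placeholder names `_c0`…`_c6` and treats the header row as ordinary data (note `user_id` showing up as the first row above); reading with `spark.read.option("header", "true").csv(...)` uses the first line as column names instead. At the RDD level the header must be stripped by hand. A plain-Scala sketch of that idea, which also works for an RDD of lines from `sc.textFile` (where `filter` against the header is the usual trick, since rows of a distributed file may not arrive in order):

```scala
// Strip the header line before parsing data rows -- the same idea
// applies to an RDD of CSV lines read with sc.textFile.
object HeaderDemo {
  def dropHeader(lines: Seq[String]): Seq[String] = {
    val header = lines.head
    lines.filter(_ != header) // works even when lines are unordered
  }

  def main(args: Array[String]): Unit = {
    val lines = Seq(
      "user_id,locale,birthyear",
      "3197468391,id_ID,1993",
      "3537982273,id_ID,1992")
    dropHeader(lines).foreach(println)
  }
}
```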
Two, load the JSON data source
1. Use SparkContext
import org.apache.spark.{SparkConf, SparkContext}
import scala.util.parsing.json._

val conf = new SparkConf().setAppName("Json").setMaster("local[*]")
val sc = new SparkContext(conf)
val lines = sc.textFile("in/users.json")
val rdd = lines.map(x => JSON.parseFull(x))
rdd.collect.foreach(println)
The output is as follows:
Some(Map(name -> Michael))
Some(Map(name -> Andy, Age -> 30.0))
Some(Map(name -> Justin, Age -> 19.0))
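`JSON.parseFull` returns `Option[Any]`, so individual fields have to be extracted by pattern matching on the underlying `Map`. A sketch of that extraction, assuming the parsed shapes shown above (`JsonFieldDemo` and `nameOf` are illustrative names, not part of any library):

```scala
// JSON.parseFull yields Option[Any]; pull out typed fields by
// pattern matching on the underlying Map.
object JsonFieldDemo {
  def nameOf(parsed: Option[Any]): Option[String] = parsed match {
    case Some(m: Map[_, _]) =>
      m.asInstanceOf[Map[String, Any]].get("name").map(_.toString)
    case _ => None
  }

  def main(args: Array[String]): Unit = {
    val parsed: Option[Any] = Some(Map("name" -> "Andy", "Age" -> 30.0))
    println(nameOf(parsed)) // Some(Andy)
  }
}
```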
2. Use SparkSession
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("JSON").master("local[*]").getOrCreate()
//val df = spark.read.format("json").load("in/users.json") // equivalent to the line below
val df = spark.read.json("in/users.json")
df.show()
The output is as follows:
+----+-------+
| Age| name|
+----+-------+
|null|Michael|
| 30| Andy|
| 19| Justin|
+----+-------+
Three, package Spark Scala code as a Jar and run it
1. Package the Spark code
- Here the input and output paths are read from a configuration file, so they can be changed later without modifying the code
- Note that the input file must be uploaded to the corresponding HDFS file system
Configuration file:
loadfile:hdfs://192.168.8.99:9000/kb09File/world.txt
outfile:hdfs://192.168.8.99:9000/kb09File/outfile
Scala code:
import java.io.FileInputStream
import java.util.Properties
import org.apache.spark.{SparkConf, SparkContext}

def main(args: Array[String]): Unit = {
  val conf = new SparkConf().setAppName("WordCount").setMaster("local[*]")
  val sc = new SparkContext(conf)

  // Read the input/output paths from the configuration file
  val properties = new Properties()
  properties.load(new FileInputStream("/opt/userset.properties"))
  val loadFilePath = properties.getProperty("loadfile")
  val outFilePath = properties.getProperty("outfile")

  // Build the word-count RDD once and reuse it for both actions
  val lines = sc.textFile(loadFilePath)
  val wordCounts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
  wordCounts.saveAsTextFile(outFilePath)
  wordCounts.foreach(println)
  sc.stop()
}
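On the `Properties` lookup above: `Properties.get` returns `AnyRef` (and `null` for a missing key), so `getProperty` is the safer call. A self-contained sketch of loading such a file, writing a temporary copy first so it runs anywhere; `PropsDemo` is an illustrative name, and the keys mirror the `loadfile`/`outfile` entries above:

```scala
import java.io.{FileInputStream, FileWriter}
import java.nio.file.Files
import java.util.Properties

// Write a temporary properties file, then read it back the way the
// WordCount job does. getProperty returns String (or null when the
// key is absent), avoiding the AnyRef result of Properties.get.
object PropsDemo {
  def load(path: String): Properties = {
    val props = new Properties()
    val in = new FileInputStream(path)
    try props.load(in) finally in.close()
    props
  }

  def main(args: Array[String]): Unit = {
    val tmp = Files.createTempFile("userset", ".properties")
    val w = new FileWriter(tmp.toFile)
    w.write("loadfile=hdfs://192.168.8.99:9000/kb09File/world.txt\n")
    w.write("outfile=hdfs://192.168.8.99:9000/kb09File/outfile\n")
    w.close()

    val props = load(tmp.toString)
    println(props.getProperty("loadfile"))
    println(props.getProperty("outfile"))
  }
}
```

Note that the `.properties` format accepts both `=` and `:` as key/value separators, so the `loadfile:hdfs://...` style in the configuration file above parses the same way.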
2. Delete the signature files
- Upload the Jar package to the virtual machine
- Execute the following command to delete the signature files from the Jar package; otherwise the code cannot be executed due to an invalid signature
zip -d /opt/kb09File/sparkdemo1.jar 'META-INF/*.DSA' 'META-INF/*.SF'
3. Run the Jar package
- Upload the input file named in the configuration to the HDFS file system
- Run the Jar's Scala code with spark-submit:
spark-submit --class gaoji.WordCount --master local[1] ./sparkdemo1.jar