Understanding Spark SQL (Part 3) - Spark SQL Program Examples

As mentioned in the previous article, in Spark 2.x SQLContext and HiveContext are already deprecated; instead, SQL statements are executed through the sql function of a SparkSession object. Before executing SQL with this function, you need to call createOrReplaceTempView on a DataFrame to register it as a temporary view, so the key step is converting an RDD into a DataFrame first. In fact, Spark declares:

type DataFrame = Dataset[Row]

So DataFrame is simply an alias for Dataset[Row]. RDDs provide the low-level API, while DataFrame/Dataset provide the high-level API (such as SQL, which is well suited to structured data).
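
To make the workflow concrete, here is a minimal sketch (the Person case class and its data are made up purely for illustration) that converts an RDD into a DataFrame, registers a temporary view, and queries it with SQL:

import org.apache.spark.sql.SparkSession

object MinimalSparkSQL {

    // hypothetical schema, only for illustration
    case class Person(name: String, age: Int)

    def main(args: Array[String]) {
        val spark = SparkSession.builder.appName("MinimalSparkSQL").getOrCreate()
        import spark.implicits._

        // RDD (low-level API) -> DataFrame (high-level API) via a case class
        val df = spark.sparkContext
          .parallelize(Seq(Person("Alice", 30), Person("Bob", 25)))
          .toDF()

        // register a temporary view so that spark.sql can query it
        df.createOrReplaceTempView("people")
        spark.sql("select name from people where age > 26").show()

        spark.stop()
    }
}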

Here are some examples of Spark SQL programs.

Example 1: SparkSQLExam.scala

package bruce.bigdata.spark.example

import org.apache.spark.sql.Row
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

object SparkSQLExam {

    case class offices(office:Int,city:String,region:String,mgr:Int,target:Double,sales:Double)

    def main(args: Array[String]) {

        val spark = SparkSession
          .builder
          .appName("SparkSQLExam")
          .getOrCreate()

        runSparkSQLExam1(spark)
        runSparkSQLExam2(spark)

        spark.stop()

    }

    private def runSparkSQLExam1(spark: SparkSession): Unit = {

        import spark.implicits._

        val rddOffices = spark.sparkContext.textFile("/user/hive/warehouse/orderdb.db/offices/offices.txt")
          .map(_.split("\t"))
          .map(p => offices(p(0).trim.toInt, p(1), p(2), p(3).trim.toInt, p(4).trim.toDouble, p(5).trim.toDouble))
        val officesDataFrame = spark.createDataFrame(rddOffices)

        officesDataFrame.createOrReplaceTempView("offices")
        spark.sql("select city from offices where region='Eastern'").map(t => "City: " + t(0)).collect.foreach(println)

    }

    private def runSparkSQLExam2(spark: SparkSession): Unit = {

        import spark.implicits._
        import org.apache.spark.sql._
        import org.apache.spark.sql.types._

        val schema = new StructType(Array(
          StructField("office", IntegerType, false),
          StructField("city", StringType, false),
          StructField("region", StringType, false),
          StructField("mgr", IntegerType, true),
          StructField("target", DoubleType, true),
          StructField("sales", DoubleType, false)))
        val rowRDD = spark.sparkContext.textFile("/user/hive/warehouse/orderdb.db/offices/offices.txt")
          .map(_.split("\t"))
          .map(p => Row(p(0).trim.toInt, p(1), p(2), p(3).trim.toInt, p(4).trim.toDouble, p(5).trim.toDouble))
        val dataFrame = spark.createDataFrame(rowRDD, schema)

        dataFrame.createOrReplaceTempView("offices2")
        spark.sql("select city from offices2 where region='Western'").map(t => "City: " + t(0)).collect.foreach(println)

    }

}

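Note that runSparkSQLExam1 lets Spark infer the schema from the offices case class, while runSparkSQLExam2 builds a StructType explicitly. As a side note, with import spark.implicits._ in scope the first variant could also be written a bit more compactly with toDF (a sketch equivalent in behavior to the createDataFrame call above):

// equivalent to spark.createDataFrame(rddOffices): the schema is
// inferred from the offices case class via spark.implicits._
val officesDataFrame = rddOffices.toDF()
officesDataFrame.createOrReplaceTempView("offices")
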
Use the following command to compile:

[root@BruceCentOS4 scala]# scalac SparkSQLExam.scala

Before compiling, the Spark and Hadoop jars need to be added to the CLASSPATH:

export CLASSPATH=$CLASSPATH:$SPARK_HOME/jars/*:$(/opt/hadoop/bin/hadoop classpath)

Then package it into a jar file:

[root@BruceCentOS4 scala]# jar -cvf spark_exam_scala.jar bruce

Then submit the program to the YARN cluster for execution with spark-submit. To make it easy to view the results on the client, yarn client mode is used here.

[root@BruceCentOS4 scala]# $SPARK_HOME/bin/spark-submit --class bruce.bigdata.spark.example.SparkSQLExam --master yarn --deploy-mode client spark_exam_scala.jar

Screenshot of the results:

Example 2: SparkHiveExam.scala (the Hive metastore needs to be started first)

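If the Hive metastore service is not already running, it can typically be started first with a command along the following lines (this command is an assumption based on a standard Hive installation, not part of the original steps):

[root@BruceCentOS4 scala]# nohup hive --service metastore &
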
package bruce.bigdata.spark.example

import org.apache.spark.sql.{SaveMode, SparkSession}

object SparkHiveExam {

    def main(args: Array[String]) {

        val spark = SparkSession
          .builder()
          .appName("Spark Hive Exam")
          .config("spark.sql.warehouse.dir", "/user/hive/warehouse")
          .enableHiveSupport()
          .getOrCreate()

        import spark.implicits._

        // use HQL to view Hive data
        spark.sql("show databases").collect.foreach(println)
        spark.sql("use orderdb")
        spark.sql("show tables").collect.foreach(println)
        spark.sql("select city from offices where region='Eastern'").map(t => "City: " + t(0)).collect.foreach(println)

        // save the result of an HQL query into a new Hive table
        // find products with order amounts greater than 10000
        spark.sql("""create table products_high_sales(mfr_id string,product_id string,description string)
                   ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n' STORED AS TEXTFILE""")
        spark.sql("""select mfr_id,product_id,description
                   from products a inner join orders b
                   on a.mfr_id=b.mfr and a.product_id=b.product
                   where b.amount>10000""").write.mode(SaveMode.Overwrite).saveAsTable("products_high_sales")

        // load HDFS file data into a Hive table
        spark.sql("""CREATE TABLE IF NOT EXISTS offices2 (office int,city string,region string,mgr int,target double,sales double )
                   ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n' STORED AS TEXTFILE""")
        spark.sql("LOAD DATA INPATH '/user/hive/warehouse/orderdb.db/offices/offices.txt' INTO TABLE offices2")

        spark.stop()
    }
}

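For comparison, the join that fills products_high_sales could also be written with the DataFrame API instead of an SQL string. The following is only a sketch assuming the same products and orders tables in the orderdb Hive database:

// a sketch: the same query expressed with the DataFrame API on the Hive tables
val products = spark.table("products")
val orders = spark.table("orders")

products.join(orders, products("mfr_id") === orders("mfr") &&
                      products("product_id") === orders("product"))
  .where(orders("amount") > 10000)
  .select("mfr_id", "product_id", "description")
  .write.mode(SaveMode.Overwrite)
  .saveAsTable("products_high_sales")
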
Use the following command to compile:

[root@BruceCentOS4 scala]# scalac SparkHiveExam.scala

Package it with the following command:

[root@BruceCentOS4 scala]# jar -cvf spark_exam_scala.jar bruce

Run the following command:

[root@BruceCentOS4 scala]# $SPARK_HOME/bin/spark-submit --class bruce.bigdata.spark.example.SparkHiveExam --master yarn --deploy-mode client spark_exam_scala.jar

Result of the program:

 

In addition, after the program above runs, there are two more tables in Hive:

Example 3: spark_sql_exam.py

from __future__ import print_function

from pyspark.sql import SparkSession
from pyspark.sql.types import *


if __name__ == "__main__":
    spark = SparkSession \
        .builder \
        .appName("Python Spark SQL exam") \
        .config("spark.some.config.option", "some-value") \
        .getOrCreate()

    schema = StructType([StructField("office", IntegerType(), False), StructField("city", StringType(), False),
        StructField("region", StringType(), False), StructField("mgr", IntegerType(), True),
        StructField("target", DoubleType(), True), StructField("sales", DoubleType(), False)])

    rowRDD = spark.sparkContext.textFile("/user/hive/warehouse/orderdb.db/offices/offices.txt").map(lambda p: p.split("\t")) \
        .map(lambda p: (int(p[0].strip()), p[1], p[2], int(p[3].strip()), float(p[4].strip()), float(p[5].strip())))

    dataFrame = spark.createDataFrame(rowRDD, schema)
    dataFrame.createOrReplaceTempView("offices")
    spark.sql("select city from offices where region='Eastern'").show()

    spark.stop()

Execute the following command to run the program:

[root@BruceCentOS4 spark]# $SPARK_HOME/bin/spark-submit --master yarn --deploy-mode client spark_sql_exam.py

Result of the program:

 

Example 4: JavaSparkSQLExam.java

package bruce.bigdata.spark.example;

import java.util.ArrayList;
import java.util.List;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;
import org.apache.spark.sql.AnalysisException;


public class JavaSparkSQLExam {
    public static void main(String[] args) throws AnalysisException {
        SparkSession spark = SparkSession
          .builder()
          .appName("Java Spark SQL exam")
          .config("spark.some.config.option", "some-value")
          .getOrCreate();

        List<StructField> fields = new ArrayList<>();
        fields.add(DataTypes.createStructField("office", DataTypes.IntegerType, false));
        fields.add(DataTypes.createStructField("city", DataTypes.StringType, false));
        fields.add(DataTypes.createStructField("region", DataTypes.StringType, false));
        fields.add(DataTypes.createStructField("mgr", DataTypes.IntegerType, true));
        fields.add(DataTypes.createStructField("target", DataTypes.DoubleType, true));
        fields.add(DataTypes.createStructField("sales", DataTypes.DoubleType, false));

        StructType schema = DataTypes.createStructType(fields);

        JavaRDD<String> officesRDD = spark.sparkContext()
          .textFile("/user/hive/warehouse/orderdb.db/offices/offices.txt", 1)
          .toJavaRDD();

        JavaRDD<Row> rowRDD = officesRDD.map((Function<String, Row>) record -> {
          String[] attributes = record.split("\t");
          return RowFactory.create(Integer.valueOf(attributes[0].trim()), attributes[1], attributes[2],
              Integer.valueOf(attributes[3].trim()), Double.valueOf(attributes[4].trim()),
              Double.valueOf(attributes[5].trim()));
        });

        Dataset<Row> dataFrame = spark.createDataFrame(rowRDD, schema);

        dataFrame.createOrReplaceTempView("offices");
        Dataset<Row> results = spark.sql("select city from offices where region='Eastern'");
        results.collectAsList().forEach(r -> System.out.println(r));

        spark.stop();
    }
}

After compiling and packaging, execute the following command:

[root@BruceCentOS4 spark]# $SPARK_HOME/bin/spark-submit --class bruce.bigdata.spark.example.JavaSparkSQLExam --master yarn --deploy-mode client spark_exam_java.jar

Result of the program:

 

The above are a few examples of Spark SQL programs written in Scala, Python, and Java. Besides these three languages, Spark also supports R, but since I am not familiar with it, no example is given here. Whatever the language, the API is essentially the same: the high-level DataFrame/Dataset API is used to run SQL. With this API, structured data can easily be manipulated with SQL, and working with Hive is also convenient.

Origin www.cnblogs.com/roushi17/p/spark_sql_examples.html