Understanding Spark SQL (Part 2) - SQLContext and HiveContext

Besides the approach described earlier, Spark SQL can also be used programmatically through SQLContext or HiveContext. SQLContext supports a standard SQL parser (SQL-92 syntax), while HiveContext supports both the SQL parser and the HiveQL parser, with HiveQL as the default; users can switch to the SQL parser by configuration in order to run syntax that HiveQL does not support, such as: select 1. HiveContext is actually a subclass of SQLContext, so apart from the functions and variables it overrides, all of SQLContext's functions and variables can also be used with HiveContext.
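For reference, here is a minimal sketch of how the two contexts were created in Spark 1.x style code (the application name is just an illustration; in Spark 2.x the SparkSession shown later replaces both):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.hive.HiveContext

// Hypothetical application name; any standalone Spark 1.x program would look similar.
val conf = new SparkConf().setAppName("SqlContextDemo")
val sc = new SparkContext(conf)

val sqlContext = new SQLContext(sc)    // standard SQL parser only
val hiveContext = new HiveContext(sc)  // subclass of SQLContext, adds HiveQL support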

Because spark-shell actually runs Scala program fragments, the following demonstrations use spark-shell for convenience.

First, let's look at SQLContext. Since it handles standard SQL, it does not need to rely on the Hive metastore, as in the following example (the Hive metastore has not been started):

[root@BruceCentOS4 ~]# $SPARK_HOME/bin/spark-shell --master yarn --conf spark.sql.catalogImplementation=in-memory

 

scala> case class offices(office:Int,city:String,region:String,mgr:Int,target:Double,sales:Double)
defined class offices

scala> val rddOffices=sc.textFile("/user/hive/warehouse/orderdb.db/offices/offices.txt").map(_.split("\t")).map(p=>offices(p(0).trim.toInt,p(1),p(2),p(3).trim.toInt,p(4).trim.toDouble,p(5).trim.toDouble))
rddOffices: org.apache.spark.rdd.RDD[offices] = MapPartitionsRDD[3] at map at <console>:26

scala> val officesDataFrame = spark.createDataFrame(rddOffices)
officesDataFrame: org.apache.spark.sql.DataFrame = [office: int, city: string ... 4 more fields]

scala> officesDataFrame.createOrReplaceTempView("offices")

scala> spark.sql("select city from offices where region='Eastern'").map(t=>"City: " + t(0)).collect.foreach(println)
City: NewYork                                                                   
City: Chicago
City: Atlanta

scala>

Executing the above commands actually starts a Spark application in yarn-client mode on the YARN cluster. The statements entered at the scala> prompt then build up RDD transformations, and the collect in the last command is an RDD action, which triggers execution of the program and submits the job.
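A minimal sketch of that laziness, using the temporary view registered above: the sql call only builds the plan, and nothing runs on the cluster until an action such as collect is invoked.

// Transformation only: builds a logical plan, no job is submitted yet.
val easternCities = spark.sql("select city from offices where region='Eastern'")

// Action: triggers job submission and returns the results to the driver.
easternCities.collect().foreach(println)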

The --conf spark.sql.catalogImplementation=in-memory option is added on the command line because the SparkSession object spark created by spark-shell enables Hive support by default. Without this option the program would try to connect to the Hive metastore, and since the metastore has not been started here, the call to createDataFrame would fail.
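In a standalone application the same choice is made when building the SparkSession; here is a sketch, assuming the in-memory catalog is wanted because no metastore is available (the application name is arbitrary):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("SqlContextDemo")                               // arbitrary name
  .config("spark.sql.catalogImplementation", "in-memory")  // do not require a Hive metastore
  .getOrCreate()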

The first line of the program is a case class declaration, which defines the schema of the data file that follows (besides this method there is another way to define the schema, described later). The second line reads a text file from HDFS and maps it onto that schema with map. The third line creates a DataFrame from the RDD of the second line, the fourth line registers the DataFrame of the third line as a logical temporary table, and the last line executes the SQL statement through the sql function of the SparkSession.
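As a variant of the same flow, the case-class RDD can also be converted with toDF instead of createDataFrame; a sketch assuming the same offices.txt path and tab-separated layout as above:

// Needed in a standalone program; spark-shell normally does this import for you.
import spark.implicits._

val officesDF = sc.textFile("/user/hive/warehouse/orderdb.db/offices/offices.txt")
  .map(_.split("\t"))
  .map(p => offices(p(0).trim.toInt, p(1), p(2), p(3).trim.toInt,
                    p(4).trim.toDouble, p(5).trim.toDouble))
  .toDF()  // infers the schema from the case class, like createDataFrame above

officesDF.createOrReplaceTempView("offices")
spark.sql("select city from offices where region='Eastern'").show()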

SQLContext was in fact the SQL entry point in Spark 1.x; Spark 2.x uses SparkSession as the SQL entry point instead. For backward compatibility, Spark 2.x still supports operating on SQL through SQLContext, but it is flagged as deprecated, so the example above uses the Spark 2.x style.
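A small sketch of that compatibility: the old entry point is still reachable from the SparkSession, although new code should call spark.sql directly.

// Spark 2.x keeps a SQLContext around for backward compatibility.
val sqlCtx = spark.sqlContext
sqlCtx.sql("select city from offices where region='Eastern'").show()

// Preferred Spark 2.x style: go through the SparkSession itself.
spark.sql("select city from offices where region='Eastern'").show()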

There is actually another way to run SQL over the same data, for example:

scala> import org.apache.spark.sql._
import org.apache.spark.sql._

scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._

scala> val schema = new StructType(Array(StructField("office", IntegerType, false), StructField("city", StringType, false), StructField("region", StringType, false), StructField("mgr", IntegerType, true), StructField("target", DoubleType, true), StructField("sales", DoubleType, false)))
schema: org.apache.spark.sql.types.StructType = StructType(StructField(office,IntegerType,false), StructField(city,StringType,false), StructField(region,StringType,false), StructField(mgr,IntegerType,true), StructField(target,DoubleType,true), StructField(sales,DoubleType,false))

scala> val rowRDD = sc.textFile("/user/hive/warehouse/orderdb.db/offices/offices.txt").map(_.split("\t")).map(p => Row(p(0).trim.toInt,p(1),p(2),p(3).trim.toInt,p(4).trim.toDouble,p(5).trim.toDouble))
rowRDD: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[3] at map at <console>:30

scala> val dataFrame = spark.createDataFrame(rowRDD, schema)
dataFrame: org.apache.spark.sql.DataFrame = [office: int, city: string ... 4 more fields]

scala> dataFrame.createOrReplaceTempView("offices")

scala> spark.sql("select city from offices where region='Eastern'").map(t=>"City: " + t(0)).collect.foreach(println)
City: NewYork                                                                   
City: Chicago
City: Atlanta

This example differs from the previous one in three main ways:

1. The previous example used a case class to define the schema, letting Spark infer it through reflection; this example uses a StructType object to define the schema, which takes an array of StructField objects as its members. Each StructField represents the definition of one field, consisting of the field name, the field type, and whether nulls are allowed;

2. For the RDD holding the data, the previous example split each line and mapped the fields directly into the case class type, while this example uses the Row type;

3. When generating the DataFrame with createDataFrame, the parameters differ: the previous example only needed to pass in the RDD (the schema being implicit in the case class objects), while this example has to pass in both the RDD and the schema definition;

In real programming the second approach is recommended because it is more flexible: the schema does not have to be hard-coded and can be generated while the program is running.
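For example, here is a sketch of building the schema at run time from a simple column description rather than hard-coding it (the column list is only an illustration):

import org.apache.spark.sql.types._

// The column descriptions could come from a config file, a header line, etc.
val columns = Seq(
  ("office", IntegerType, false),
  ("city",   StringType,  false),
  ("region", StringType,  false)
)

val dynamicSchema = StructType(columns.map {
  case (name, dataType, nullable) => StructField(name, dataType, nullable)
})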

 

Next, let's look at the usage of HiveContext. Before using HiveContext, make sure that:

  • Spark has been built with Hive support;
  • Hive's hive-site.xml configuration file has been placed in Spark's conf directory;
  • the Hive metastore has been started;

For example:

First, start the Hive metastore:

[root@BruceCentOS ~]# nohup hive --service metastore &

Then, still using spark-shell for the example, start it as shown below:

[root@BruceCentOS4 ~]# $SPARK_HOME/bin/spark-shell --master yarn

scala> spark.sql("show databases").collect.foreach(println)
[default]
[orderdb]

scala> spark.sql("use orderdb")
res2: org.apache.spark.sql.DataFrame = []

scala> spark.sql("show tables").collect.foreach(println)
[orderdb,customers,false]
[orderdb,offices,false]
[orderdb,orders,false]
[orderdb,products,false]
[orderdb,salesreps,false]

scala> spark.sql("select city from offices where region='Eastern'").map(t=>"City: " + t(0)).collect.foreach(println)
City: NewYork                                                                   
City: Chicago
City: Atlanta

scala>

You can see that spark-shell is started here without the option used earlier, because this time we intend to use HiveContext to operate on data in Hive, which requires Hive support. As mentioned above, spark-shell enables Hive support by default. As with SQLContext, Spark 2.x no longer requires a HiveContext object to run SQL; the SparkSession object can be used directly. You can also see that the tables can be operated on directly without defining a schema, because the schema is already defined in the Hive metastore: Spark reads each table's schema information by connecting to the metastore, so SQL can be run against the tables directly.
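For completeness, a sketch of the equivalent setup in a standalone application, where Hive support has to be enabled explicitly on the builder (spark-shell does this by default, as noted above):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("HiveDemo")   // arbitrary name
  .enableHiveSupport()   // requires hive-site.xml on the classpath and a running metastore
  .getOrCreate()

// Qualified table name, assuming the orderdb database from the example above.
spark.sql("select city from orderdb.offices where region='Eastern'").show()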

 

Besides using SQLContext to operate on ordinary files (which requires defining a schema separately) and using HiveContext to operate on Hive table data (which requires a running Hive metastore), SQLContext can also operate on JSON, Parquet, and other such files. Since these data files carry their own schema information, a DataFrame can be created from them directly, for example:

scala> val df = spark.read.json("file:///opt/spark/examples/src/main/resources/people.json")
df: org.apache.spark.sql.DataFrame = [age: bigint, name: string]                

scala> df.createOrReplaceTempView("people")

scala> spark.sql("select name,age from people where age>19").map(t=>"Name :" + t(0) + ", Age: " + t(1)).collect.foreach(println)
Name :Andy, Age: 30    
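Parquet files work the same way, since they also embed schema information. A sketch using the users.parquet file shipped with the Spark examples (assuming it exists at that path on your installation):

val usersDF = spark.read.parquet("file:///opt/spark/examples/src/main/resources/users.parquet")
usersDF.printSchema()  // the schema comes from the file itself

usersDF.createOrReplaceTempView("users")
spark.sql("select name from users").show()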

 

Finally, let's look at another way of using a DataFrame, known as the DSL (Domain Specific Language).

scala> val df = spark.read.json("file:///opt/spark/examples/src/main/resources/people.json")
df: org.apache.spark.sql.DataFrame = [age: bigint, name: string]                

scala> df.show()
+----+-------+                                                                  
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+


scala> df.select("name").show()
+-------+                                                                       
|   name|
+-------+
|Michael|
|   Andy|
| Justin|
+-------+


scala> df.select(df("name"), df("age") + 1).show()
+-------+---------+                                                             
|   name|(age + 1)|
+-------+---------+
|Michael|     null|
|   Andy|       31|
| Justin|       20|
+-------+---------+


scala> df.filter(df("age") > 21).show()
+---+----+
|age|name|
+---+----+
| 30|Andy|
+---+----+


scala> df.groupBy("age").count().show()
+----+-----+                                                                    
| age|count|
+----+-----+
|  19|    1|
|null|    1|
|  30|    1|
+----+-----+


scala>
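The same DSL queries can also be written with the $-column syntax and the functions package, which many find more readable; a short sketch (spark-shell usually imports spark.implicits._ already):

import spark.implicits._
import org.apache.spark.sql.functions._

df.select($"name", $"age" + 1).show()
df.filter($"age" > 21).show()
df.groupBy("age").agg(count("*").as("count")).show()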

The above is a basic summary of the usage of SQLContext and HiveContext in Spark SQL, with all examples run in the spark-shell tool. Since spark-shell actually runs Scala program fragments, the examples above could be turned into standalone applications. In the next blog post I will try writing standalone programs in Scala, Java, and Python to operate on the hive database orderdb used in the examples above, possibly with some more complex SQL to do statistical analysis of the data.
