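Spark SQL is compatible with Hive and can read many kinds of data sources, such as SQL databases.
1. Create a file named person.txt and save it with the following content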
1,zhangsan,18
2,lisi,19
3,wangwu,20
4,zhaoliu,21
2. Upload the file to HDFS
hdfs dfs -put person.txt /
3. Split each line with map
val rdd = sc.textFile("hdfs://node-1.itcast.cn:9000/person.txt").map(_.split(","))
4. Define a case class (it acts as the table schema)
case class Person(id: Long, name: String, age: Int)
5. Associate the RDD with the case class
val personRdd = rdd.map(fields => Person(fields(0).toLong, fields(1), fields(2).toInt))
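Steps 3 to 5 can be tried without a cluster, because the split-and-construct logic is ordinary Scala. The following is only a minimal sketch on a local collection instead of an RDD; the object name ParsePersonSketch and the helper parseLine are hypothetical and not part of the original notes.

```scala
// Same shape as the Person case class defined in step 4.
case class Person(id: Long, name: String, age: Int)

object ParsePersonSketch {
  // Hypothetical helper: turn one comma-separated line into a Person.
  // It assumes every line has exactly three well-formed fields.
  def parseLine(line: String): Person = {
    val fields = line.split(",")
    Person(fields(0).toLong, fields(1), fields(2).toInt)
  }

  def main(args: Array[String]): Unit = {
    // The four sample lines from person.txt, held in memory.
    val lines = Seq("1,zhangsan,18", "2,lisi,19", "3,wangwu,20", "4,zhaoliu,21")
    val people = lines.map(parseLine)
    people.foreach(println)
  }
}
```

On an RDD the same function could be passed to map, for example rdd-of-lines.map(parseLine), which does the splitting and the case class construction in one pass.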
6. Convert personRdd to a DataFrame
import sqlContext.implicits._ // can be omitted in spark-shell
val personDf = personRdd.toDF
7. Display the data
personDf.show()
personDf.select("id", "name").show
DSL-style syntax
// col(...) used below comes from this import
import org.apache.spark.sql.functions.col
// Show the contents of the DataFrame
personDf.show
// Show only some columns of the DataFrame
personDf.select(personDf.col("name")).show
personDf.select(col("name"), col("age")).show
personDf.select("name").show
// Print the DataFrame's schema
personDf.printSchema
// Select id and name for every row, together with age + 1
personDf.select(col("id"), col("name"), col("age") + 1).show
personDf.select(personDf("id"), personDf("name"), personDf("age") + 1).show
// Keep the rows whose age is greater than or equal to 18
personDf.filter(col("age") >= 18).show
// Group by age and count the people of each age
personDf.groupBy("age").count().show()
SQL-style syntax
To use SQL-style syntax, the DataFrame must first be registered as a table
personDf.registerTempTable("t_person")
// Query the two oldest people
sqlContext.sql("select * from t_person order by age desc limit 2").show
// Show the table's schema
sqlContext.sql("desc t_person").show
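The two styles answer the same questions. As a rough sketch, the program below runs the "two oldest people" query once through SQL and once through the DSL and returns the names from each. It targets the Spark 1.5-era SQLContext API used in these notes and builds the four sample rows in memory so that no HDFS is needed; the names SqlVsDslSketch and topTwoNames are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.functions.col

case class Person(id: Long, name: String, age: Int)

object SqlVsDslSketch {
  // Column 1 is "name" because toDF keeps the case class field order.
  private def names(df: DataFrame): List[String] =
    df.collect().map(_.getString(1)).toList

  // Returns (names found via SQL, names found via the DSL).
  def topTwoNames(sqlContext: SQLContext): (List[String], List[String]) = {
    import sqlContext.implicits._
    val personDf = sqlContext.sparkContext.parallelize(Seq(
      Person(1L, "zhangsan", 18), Person(2L, "lisi", 19),
      Person(3L, "wangwu", 20), Person(4L, "zhaoliu", 21))).toDF()

    // SQL style: register a temporary table, then query it.
    personDf.registerTempTable("t_person")
    val viaSql = sqlContext.sql("select * from t_person order by age desc limit 2")

    // DSL style: the same ordering and limit expressed as method calls.
    val viaDsl = personDf.orderBy(col("age").desc).limit(2)

    (names(viaSql), names(viaDsl))
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SqlVsDslSketch").setMaster("local"))
    try println(topTwoNames(new SQLContext(sc))) finally sc.stop()
  }
}
```

Because the ages in the sample data are all distinct, both queries should pick the same two rows in the same order.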
Sample program
The program below can be run directly in IntelliJ IDEA through its main method; to run it on a cluster, comment out setMaster("local")
package cn.itcast.spark.day4
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
/**
* Created by root on 2016/5/19.
*/
object SQLDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SQLDemo").setMaster("local")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    // Set the user name used for HDFS access (root here; use your own HDFS user, e.g. hadoop)
    System.setProperty("user.name", "root")
    // Parse each comma-separated line into a Person
    val personRdd = sc.textFile("hdfs://192.168.1.100:9000/person.txt").map(line => {
      val fields = line.split(",")
      Person(fields(0).toLong, fields(1), fields(2).toInt)
    })
    // Bring toDF into scope
    import sqlContext.implicits._
    val personDf = personRdd.toDF
    // Register a temporary table so the data can be queried with SQL
    personDf.registerTempTable("person")
    sqlContext.sql("select * from person where age >= 20 order by age desc limit 2").show()
    sc.stop()
  }
}
case class Person(id: Long, name: String, age: Int)
Package the program as a jar, upload it to the Spark cluster, and submit the Spark job
The last two arguments are passed to the program: the input path followed by the output path
/usr/local/spark-1.5.2-bin-hadoop2.6/bin/spark-submit \
--class cn.itcast.spark.sql.InferringSchema \
--master spark://node1.itcast.cn:7077 \
/root/spark-mvn-1.0-SNAPSHOT.jar \
hdfs://node1.itcast.cn:9000/person.txt \
hdfs://node1.itcast.cn:9000/out
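The class submitted above, cn.itcast.spark.sql.InferringSchema, is not shown in these notes. Purely as an illustration of how the two trailing arguments could be consumed, here is a hypothetical sketch (named InferringSchemaSketch to make clear it is not the original class). It assumes args(0) is the comma-separated input file and args(1) is an output directory that does not exist yet, and it writes the result as JSON, which is also an assumption.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

case class Person(id: Long, name: String, age: Int)

object InferringSchemaSketch {
  // Read `input`, keep the two oldest people, write them to `output` as JSON.
  def run(sc: SparkContext, input: String, output: String): Unit = {
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val personDf = sc.textFile(input)
      .map(_.split(","))
      .map(f => Person(f(0).toLong, f(1), f(2).toInt))
      .toDF()

    personDf.registerTempTable("t_person")
    // The default save mode fails if `output` already exists.
    sqlContext.sql("select * from t_person order by age desc limit 2")
      .write.json(output)
  }

  def main(args: Array[String]): Unit = {
    // No setMaster here: the master comes from spark-submit's --master option.
    val sc = new SparkContext(new SparkConf().setAppName("InferringSchemaSketch"))
    try run(sc, args(0), args(1)) finally sc.stop()
  }
}
```

Keeping the logic in run, separate from main, makes it possible to call the same code from a local SparkContext while developing.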
Spark SQL notes
// 1. Read the data and split each line on the column separator (a space here; the comma-separated person.txt created above would need ",")
val lineRDD = sc.textFile("hdfs://node1.itcast.cn:9000/person.txt", 1).map(_.split(" "))
// 2. Define a case class (it acts as the table schema)
case class Person(id: Int, name: String, age: Int)
// 3. Import the implicit conversions; in the current version of spark-shell this can be omitted
import sqlContext.implicits._
// 4. Convert lineRDD to personRDD
val personRDD = lineRDD.map(x => Person(x(0).toInt, x(1), x(2).toInt))
// 5. Convert personRDD to a DataFrame
val personDF = personRDD.toDF
// 6. Process personDF
// (SQL-style syntax)
personDF.registerTempTable("t_person")
sqlContext.sql("select * from t_person order by age desc limit 2").show
sqlContext.sql("desc t_person").show
val result = sqlContext.sql("select * from t_person order by age desc")
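// 7. Save the result (DataFrame.save is deprecated since Spark 1.4 in favor of result.write)
result.save("hdfs://hadoop.itcast.cn:9000/sql/res1")
result.save("hdfs://hadoop.itcast.cn:9000/sql/res2", "json")
// Overwrite the JSON files already on HDFS, again writing JSON
import org.apache.spark.sql.SaveMode._
result.save("hdfs://hadoop.itcast.cn:9000/sql/res2", "json", Overwrite)
// 8. Reload previously saved results (optional; sqlContext.load is likewise deprecated in favor of sqlContext.read)
sqlContext.load("hdfs://hadoop.itcast.cn:9000/sql/res1")
sqlContext.load("hdfs://hadoop.itcast.cn:9000/sql/res2", "json")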
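From Spark 1.4 onward the save and load steps are usually written with df.write and sqlContext.read. The following is a sketch of the JSON overwrite-and-reload round trip in that style, run in local mode against a temporary directory; the object name SaveLoadSketch, the helper roundTrip, and the local path are assumptions for illustration, and on a cluster the path would be an hdfs:// URI like the ones above.

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, SQLContext, SaveMode}

case class Person(id: Long, name: String, age: Int)

object SaveLoadSketch {
  // Write `df` as JSON, replacing anything already at `path`, then read it back.
  def roundTrip(sqlContext: SQLContext, df: DataFrame, path: String): DataFrame = {
    df.write.mode(SaveMode.Overwrite).format("json").save(path)
    sqlContext.read.format("json").load(path)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SaveLoadSketch").setMaster("local"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val personDf = sc.parallelize(Seq(
      Person(1L, "zhangsan", 18), Person(2L, "lisi", 19))).toDF()

    // Hypothetical local output directory, expressed as a file: URI.
    val out = Files.createTempDirectory("sql-res").resolve("res2").toUri.toString
    val reloaded = roundTrip(sqlContext, personDf, out)
    reloaded.show()
    sc.stop()
  }
}
```

Note that a DataFrame reloaded from JSON has its schema inferred from the files, so the column order and numeric types may differ from the original case class.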