10 Spark DataFrame Programming

Earlier we learned about RDD programming. RDDs have many advantages, but they carry no schema information: to get at the fields of each record you can only iterate over the data again and again. This article explains how to use the DataFrame, which can be thought of as an RDD that carries schema information.
An RDD (Resilient Distributed Dataset) is Spark's core distributed data abstraction; it is read-only and memory-based. RDD operators combine into a DAG (directed acyclic graph), and execution is lazy and deferred, which makes it very efficient. This article builds on that RDD programming.
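To make the schema point concrete, here is a minimal sketch (not part of the original walkthrough) typed into spark-shell, assuming the spark and sc objects that spark-shell provides and a hypothetical Person case class: the RDD itself knows nothing about the structure of its elements, while the DataFrame built from it carries an inferred schema.

// Hypothetical case class used only for this illustration
case class Person(name: String, age: Long)
// An RDD has no schema of its own
val personRdd = sc.parallelize(Seq(Person("Andy", 30), Person("Justin", 19)))
// Converting it to a DataFrame attaches the schema inferred from the case class
import spark.implicits._
val personDf = personRdd.toDF()
// Prints the inferred schema (name: string, age: long)
personDf.printSchema()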

1 System, software and prerequisite constraints

  • CentOS 7 64-bit workstation; the machine IP used here is 192.168.100.200, readers should adjust it to their actual environment
  • RDD programming has been completed
    https://www.jianshu.com/p/dd250c656c91
  • Hadoop has already been installed
    https://www.jianshu.com/p/b7ae3b51e559
  • To avoid the influence of permission issues, all operations are performed as root
  • Ensure that Hadoop, Spark, and the Hive metastore have been started, and that spark-shell has been run to enter the Scala command line

2 Operations

  • 1 Analyze people.json
    At the scala command line, enter the following commands:
// Import SparkSession
import org.apache.spark.sql.SparkSession
// Import the implicits that allow an RDD to be converted to a DataFrame
import spark.implicits._
// Create a SparkSession (the unified Spark SQL entry point)
val sparkSession = SparkSession.builder().getOrCreate()
// Load people.json into a DataFrame
val df = spark.read.json("file:///root/spark-2.2.1-bin-hadoop2.7/examples/src/main/resources/people.json")
// Show all rows
df.show()
// Print the schema
df.printSchema()
// Filter by condition
df.filter("name='Andy'").show()
df.filter("age>20").show()
// Select multiple columns
df.select("name","age").show()
// Sort
df.sort(df("age").desc).show()
// Group by
df.groupBy("age").count().show()
// Rename a column
df.select(df("name").as("用户名称")).show()
// Write the DataFrame to a local file
df.select("name","age").write.format("csv").save("file:///root/people.csv")
// Convert an RDD to a DataFrame
val rdd = sc.parallelize(Array("java","python","cpp"))
val df2 = rdd.toDF()
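The DataFrame produced from the plain String RDD above gets a single default column (named value in Spark 2.x). As a small follow-up sketch, still assuming the same spark-shell session, the column can be named explicitly, or a case-class RDD (the Lang class here is hypothetical) can be used so that field names become columns:

// Name the single column explicitly when converting
val langDf = rdd.toDF("language")
langDf.printSchema()
// Or build the DataFrame from a case-class RDD so field names become columns
case class Lang(name: String, kind: String)
val typedDf = sc.parallelize(Seq(Lang("java", "jvm"), Lang("python", "interpreted"))).toDF()
typedDf.show()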

These are the basic DataFrame operations carried out in Spark SQL.
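As one more sketch that goes slightly beyond the steps above (assuming the df loaded from people.json is still in the session), the same DataFrame can also be queried with plain SQL by registering it as a temporary view:

// Register the DataFrame as a temporary view
df.createOrReplaceTempView("people")
// Query it with Spark SQL; the result is again a DataFrame
spark.sql("SELECT name, age FROM people WHERE age > 20").show()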
