SQL queries in Spark/scala

Gerard :

I am brand new to Spark/Scala. I'm trying to import a CSV file into Spark and analyse the data within it. The CSV file has 5 columns (passengerid, flightid, from, to, date). The file loads successfully, but when I run queries on it, say to find the total flights per month, I keep getting errors, in particular 'Exception in thread "main" org.apache.spark.sql.AnalysisException: Table or view not found: df1; line 1 pos 14'. The data is loaded correctly, because I can see it in the output; the problem lies in querying the table. Any thoughts?

My code below:

'''
package GerardPRactice

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.SQLContext

object trial1 extends App {

  val sparkConf = new SparkConf().setAppName("trial1")
    .setMaster("local[2]") // set Spark configuration
  val sparkContext = new SparkContext(sparkConf) // make Spark context
  val sqlContext = new SQLContext(sparkContext) // make SQL context

  val spark = SparkSession
    .builder()
    .master("local")
    .appName("Question1")
    .getOrCreate()

  val df1 = sqlContext.read
    .format("com.databricks.spark.csv")
    .option("header", "true")
    .option("delimiter", "|")
    .option("inferSchema", "true")
    .load("C:/Users/Gerard/Documents/flightData.csv")
  // df1: org.apache.spark.sql.DataFrame = [passengerID: int, flightID: int, Departure: string, Destination: string, date: int]

  val df2 = sqlContext.read
    .format("com.databricks.spark.csv")
    .option("header", "true")
    .option("delimiter", "|")
    .option("inferSchema", "true")
    .load("C:/Users/Gerard/Documents/passengers.csv")

  df1.show()
  //val PassID = df1.select("passengerId")
  val totalflightJAN = spark.sql("SELECT * FROM df1 WHERE date>= '2017-01-01' & date<='2017-01-31'")
  totalflightJAN.collect.foreach(println)
}
'''
Marsellus Wallace :

Do yourself a favor and switch to DataFrame syntax rather than pure SQL! :)

Assuming that df1.show and df1.printSchema succeed (also, take a close look at your date data type), you can try the following:

df1.filter($"date" >= lit("2017-01-01") && $"date" <= lit("2017-01-31"))

You might have to wrap "date" with to_date($"date", "yyyy/MM/dd") (or another format).
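Putting the pieces together, a minimal sketch of both approaches might look like the following. This assumes Spark is on the classpath and reuses the asker's pipe-delimited CSV and file path; the object name is made up. Note that the original error arises because spark.sql can only resolve names that have been registered as tables or views, so the pure-SQL route needs createOrReplaceTempView first:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{lit, to_date}

object FilterByDate extends App {
  val spark = SparkSession.builder()
    .master("local[2]")
    .appName("FilterByDate")
    .getOrCreate()
  import spark.implicits._ // enables the $"colName" syntax

  // Same options as the question's code, via the built-in CSV reader.
  val df1 = spark.read
    .option("header", "true")
    .option("delimiter", "|")
    .option("inferSchema", "true")
    .csv("C:/Users/Gerard/Documents/flightData.csv")

  // DataFrame syntax: no temp view needed. Parse the column as a date
  // first so the comparison is chronological, not lexical.
  val janFlights = df1.filter(
    to_date($"date", "yyyy-MM-dd") >= lit("2017-01-01") &&
    to_date($"date", "yyyy-MM-dd") <= lit("2017-01-31"))
  janFlights.show()

  // Pure-SQL alternative: register the DataFrame under a name first,
  // otherwise spark.sql fails with "Table or view not found: df1".
  df1.createOrReplaceTempView("df1")
  val janFlightsSql = spark.sql(
    "SELECT * FROM df1 WHERE date >= '2017-01-01' AND date <= '2017-01-31'")
  janFlightsSql.show()

  spark.stop()
}
```

Note that SQL uses AND, not &, to combine predicates, and Scala string literals take double quotes (single quotes are Char literals).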
