I am brand new to Spark/Scala. I'm trying to import a CSV file into Spark and analyse the data in it. The CSV file has 5 columns (passengerid, flightid, from, to, date). I have successfully loaded the CSV file, but when I try to query it — say, to find the total flights per month — I keep getting errors, in particular 'Exception in thread "main" org.apache.spark.sql.AnalysisException: Table or view not found: df1; line 1 pos 14'. The data loads successfully, because I can see it as output; the problem lies in querying the table. Any thoughts?
My code below:
'''
package GerardPRactice

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.SQLContext

object trial1 extends App {
  val sparkConf = new SparkConf().setAppName("trial1")
    .setMaster("local[2]") // set spark configuration
  val sparkContext = new SparkContext(sparkConf) // make spark context
  val sqlContext = new SQLContext(sparkContext) // make sql context

  val spark = SparkSession
    .builder()
    .master("local")
    .appName("Question1")
    .getOrCreate()

  val df1 = sqlContext.read
    .format("com.databricks.spark.csv")
    .option("header", "true")
    .option("delimiter", "|")
    .option("inferSchema", "true")
    .load("C:/Users/Gerard/Documents/flightData.csv")
  // df1: org.apache.spark.sql.DataFrame = [passengerID: int, flightID: int, Departure: string, Destination: string, date: int]

  val df2 = sqlContext.read
    .format("com.databricks.spark.csv")
    .option("header", "true")
    .option("delimiter", "|")
    .option("inferSchema", "true")
    .load("C:/Users/Gerard/Documents/passengers.csv")

  df1.show()
  //val PassID = df1.select("passengerId")

  val totalflightJAN = spark.sql("SELECT * FROM df1 WHERE date>= '2017-01-01' & date<='2017-01-31'")
  totalflightJAN.collect.foreach(println)
}
'''
Do yourself a favor and switch to DataFrame syntax rather than pure SQL! :)
The AnalysisException happens because spark.sql only sees tables and views registered in the catalog; a Scala val named df1 is not one of them.
Assuming that df1.show and df1.printSchema succeed (also, take a close look at your date data type), you can try the following:
df1.filter($"date" >= lit("2017-01-01") && $"date" <= lit("2017-01-31"))
Note the double quotes: single quotes are Char literals in Scala, and the $"..." syntax requires import spark.implicits._. You might also have to wrap "date" with to_date($"date", "yyyy/MM/dd") (or another format), depending on how the column was parsed.
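Putting it together, here is a minimal sketch of both routes — the DataFrame filter, and (if you prefer to keep SQL) registering the DataFrame as a temp view first so spark.sql can find it. The file path and the "yyyy-MM-dd" date format are assumptions taken from the question; adjust them to match your actual data.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{lit, to_date}

object Trial1 extends App {
  // One SparkSession is enough; no separate SparkContext/SQLContext needed
  val spark = SparkSession
    .builder()
    .master("local[2]")
    .appName("Question1")
    .getOrCreate()
  import spark.implicits._ // enables the $"colName" syntax

  val df1 = spark.read
    .option("header", "true")
    .option("delimiter", "|")
    .option("inferSchema", "true")
    .csv("C:/Users/Gerard/Documents/flightData.csv") // assumed path from the question

  // Option 1: DataFrame API — no table registration required.
  // to_date format is an assumption; match it to how your dates are written.
  val janFlights = df1.filter(
    to_date($"date", "yyyy-MM-dd").between(lit("2017-01-01"), lit("2017-01-31"))
  )
  janFlights.show()

  // Option 2: keep SQL, but register the DataFrame as a view first,
  // otherwise spark.sql throws "Table or view not found: df1".
  df1.createOrReplaceTempView("df1")
  spark.sql(
    "SELECT * FROM df1 WHERE date >= '2017-01-01' AND date <= '2017-01-31'"
  ).show()
}
```

Note that the original query also used `&`, which is not valid SQL; the conjunction operator in Spark SQL is `AND` (`&&` in the DataFrame API).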