A Beginner :
I am trying to read a CSV file so that I can query it using Spark SQL. The CSV looks like this:
16;10;9/6/2018
The CSV file has no header, but we know the first column is a department code, the second is a building code, and the third is a date in M/d/yyyy format.
I wrote the following code to load the CSV file with a custom schema:
StructType sch = DataTypes.createStructType(new StructField[] {
    DataTypes.createStructField("department", DataTypes.IntegerType, true),
    DataTypes.createStructField("building", DataTypes.IntegerType, false),
    DataTypes.createStructField("date", DataTypes.DateType, true)
});
Dataset<Row> csvLoad = sparkSession.read().format("csv")
    .option("delimiter", ";")
    .schema(sch)
    .option("header", "false")
    .load(somefilePath);
csvLoad.show(2);
However, csvLoad.show(2) only prints the following:
+----------+--------+----+
|department|building|date|
+----------+--------+----+
|      null|    null|null|
|      null|    null|null|
+----------+--------+----+
Can anyone please tell me what is wrong with the code? I am using Spark 2.4.
TheWhiteRabbit :
The issue is with your date field: since it has a custom format, you need to pass that format to the reader via the dateFormat option (a SimpleDateFormat pattern in Spark 2.4). When the date fails to parse, the default permissive mode nulls out the whole record, which is why all three columns come back null rather than just the date:
Dataset<Row> csvLoad = sparkSession.read().format("csv")
    .option("delimiter", ";")
    .schema(sch)
    .option("header", "false")
    // SimpleDateFormat pattern: M = month, d = day of month, yyyy = year
    .option("dateFormat", "M/d/yyyy")
    .load(somefilePath);
This will result in the following output:
+----------+--------+----------+
|department|building|      date|
+----------+--------+----------+
|        16|      10|2018-09-06|
+----------+--------+----------+
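A note on the pattern letters, since they are easy to get wrong: in Java's SimpleDateFormat (which Spark 2.4 uses to interpret dateFormat), lowercase m means minutes and uppercase YYYY means week-based year, so a pattern like m/d/YYYY will not parse 9/6/2018 as September 6. A minimal standalone sketch of the correct pattern (the class name here is just for illustration):

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

public class DateFormatDemo {
    public static void main(String[] args) throws ParseException {
        // M = month, d = day of month, yyyy = calendar year
        SimpleDateFormat csvFormat = new SimpleDateFormat("M/d/yyyy");
        // Re-format as ISO to match what Spark's DateType displays
        SimpleDateFormat iso = new SimpleDateFormat("yyyy-MM-dd");

        Date parsed = csvFormat.parse("9/6/2018");
        System.out.println(iso.format(parsed)); // prints 2018-09-06
    }
}
```

You can verify the pattern this way on a single sample value before pointing Spark at the whole file.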