Data type conversion in PySpark


The data types Spark supports are documented at https://spark.apache.org/docs/latest/sql-reference.html

Spark Data Types

Spark SQL and DataFrames support the following data types:

  • Numeric types
    • ByteType: Represents 1-byte signed integer numbers. The range of numbers is from -128 to 127.
    • ShortType: Represents 2-byte signed integer numbers. The range of numbers is from -32768 to 32767.
    • IntegerType: Represents 4-byte signed integer numbers. The range of numbers is from -2147483648 to 2147483647.
    • LongType: Represents 8-byte signed integer numbers. The range of numbers is from -9223372036854775808 to 9223372036854775807.
    • FloatType: Represents 4-byte single-precision floating point numbers.
    • DoubleType: Represents 8-byte double-precision floating point numbers.
    • DecimalType: Represents arbitrary-precision signed decimal numbers. Backed internally by java.math.BigDecimal. A BigDecimal consists of an arbitrary precision integer unscaled value and a 32-bit integer scale.
  • String type
    • StringType: Represents character string values.
  • Binary type
    • BinaryType: Represents byte sequence values.
  • Boolean type
    • BooleanType: Represents boolean values.
  • Datetime type
    • TimestampType: Represents values comprising values of fields year, month, day, hour, minute, and second.
    • DateType: Represents values comprising values of fields year, month, day.
  • Complex types
    • ArrayType(elementType, containsNull): Represents values comprising a sequence of elements with the type of elementType. containsNull is used to indicate if elements in an ArrayType value can have null values.
    • MapType(keyType, valueType, valueContainsNull): Represents values comprising a set of key-value pairs. The data type of keys is described by keyType and the data type of values is described by valueType. For a MapType value, keys are not allowed to have null values. valueContainsNull is used to indicate if values of a MapType value can have null values.
    • StructType(fields): Represents values with the structure described by a sequence of StructFields (fields).
      • StructField(name, dataType, nullable): Represents a field in a StructType. The name of a field is indicated by name. The data type of a field is indicated by dataType. nullable is used to indicate if values of this field can have null values.

The corresponding types on the PySpark side live in the pyspark.sql.types module.
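
For illustration, here is a sketch of a schema assembled from pyspark.sql.types; the field names are hypothetical and only meant to exercise some of the types listed above:

from pyspark.sql.types import (StructType, StructField, StringType,
                               IntegerType, DateType, ArrayType)

# hypothetical schema exercising several of the types above
schema = StructType([
    StructField("name", StringType(), nullable=False),
    StructField("age", IntegerType(), nullable=True),
    StructField("last_update", DateType(), nullable=True),
    StructField("tags", ArrayType(StringType(), containsNull=True), nullable=True),
])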


Some common conversion scenarios:
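
The snippets below all assume this setup; column names such as Last_Update and TIME refer to the post's example DataFrame:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()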

1. date_format converts a date/timestamp/string column to a string in the format specified by the second argument:

# format Last_Update as a yyyy/MM/dd string in a new column
df.withColumn('test', F.date_format(col('Last_Update'), "yyyy/MM/dd")).show()


2. Once the column is a string, it can be cast to whatever type you want, for example to a date as follows:

# format to a string, then cast that string to DateType
df = df.withColumn('date', F.date_format(col('Last_Update'), "yyyy-MM-dd").cast("date"))


3. from_unixtime converts a Unix timestamp in seconds (counted from the start of 1970) into a date-formatted string.
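
A minimal sketch, assuming an epoch-seconds column named unix_ts:

# epoch seconds -> date-formatted string (rendered in the session time zone)
df = df.withColumn('date_str', F.from_unixtime(col('unix_ts'), 'yyyy-MM-dd HH:mm:ss'))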


4. unix_timestamp converts a date string into Unix timestamp seconds; it is the inverse of the operation above.
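
A matching sketch, again assuming a string column named date_str:

# date-formatted string -> epoch seconds
df = df.withColumn('unix_ts', F.unix_timestamp(col('date_str'), 'yyyy-MM-dd HH:mm:ss'))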


Because unix_timestamp does not keep milliseconds, you can use the following approach if milliseconds matter:

df1 = df.withColumn("unix_timestamp",F.unix_timestamp(df.TIME,'dd-MMM-yyyy HH:mm:ss.SSS z') + F.substring(df.TIME,-7,3).cast('float')/1000)


5. To get an actual TimestampType column from timestamp seconds, F.to_timestamp can be used.
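
A sketch following the Stack Overflow reference below: render the seconds as a string with from_unixtime, then parse that string with to_timestamp; a direct cast does the same job. unix_ts is again an assumed column name:

df = df.withColumn('ts', F.to_timestamp(F.from_unixtime(col('unix_ts'))))
# or, equivalently, cast the numeric seconds directly:
df = df.withColumn('ts', col('unix_ts').cast('timestamp'))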


Ref:

https://stackoverflow.com/questions/54337991/pyspark-from-unixtime-unix-timestamp-does-not-convert-to-timestamp


Origin: https://www.cnblogs.com/mashuai-191/p/12580628.html