Delta Lake 0.3.0 Released: DML on Large Data Sets

Delta Lake is an open-source storage layer that brings ACID transactions to Apache Spark and big data workloads. The Delta Lake 0.3.0 release adds support for several statements to update and delete data in Delta Lake tables, described below.

Delete data from the table

You can delete data that matches a predicate from a Delta Lake table. For example, to delete all events from before 2017, you can run the following commands:

Scala

import io.delta.tables._

val deltaTable = DeltaTable.forPath(spark, pathToEventsTable)

deltaTable.delete("date < '2017-01-01'")        // predicate using SQL formatted string

import org.apache.spark.sql.functions._
import spark.implicits._

deltaTable.delete($"date" < "2017-01-01")       // predicate using Spark SQL functions and implicits

 

Java

import io.delta.tables.*;
import org.apache.spark.sql.functions;

DeltaTable deltaTable = DeltaTable.forPath(spark, pathToEventsTable);

deltaTable.delete("date < '2017-01-01'");            // predicate using SQL formatted string

deltaTable.delete(functions.col("date").lt(functions.lit("2017-01-01")));   // predicate using Spark SQL functions

DELETE removes the data from the latest version of the Delta Lake table, but does not remove it from physical storage until the old versions are explicitly vacuumed.
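As a sketch, the old files can later be removed with the vacuum operation (the retention period shown is an illustrative value, not a recommendation):

```scala
import io.delta.tables._

val deltaTable = DeltaTable.forPath(spark, pathToEventsTable)

// Permanently delete files that are no longer referenced by the table
// and are older than the default retention threshold (7 days)
deltaTable.vacuum()

// Or pass an explicit retention period in hours
deltaTable.vacuum(168)
```

Note that vacuuming with a short retention period removes the files needed to time travel to older versions of the table.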

Update a table

Delta Lake lets you update data that matches a predicate in a table. For example, to fix a spelling mistake in the event type, you can run the following commands:

Scala

import io.delta.tables._

val deltaTable = DeltaTable.forPath(spark, pathToEventsTable)

deltaTable.updateExpr(            // predicate and update expressions using SQL formatted string
  "eventType = 'clck'",
  Map("eventType" -> "'click'"))

import org.apache.spark.sql.functions._
import spark.implicits._

deltaTable.update(                // predicate using Spark SQL functions and implicits
  $"eventType" === "clck",
  Map("eventType" -> lit("click")))

Java

import io.delta.tables.*;
import org.apache.spark.sql.Column;
import org.apache.spark.sql.functions;
import java.util.HashMap;

DeltaTable deltaTable = DeltaTable.forPath(spark, pathToEventsTable);

deltaTable.updateExpr(            // predicate and update expressions using SQL formatted string
  "eventType = 'clck'",
  new HashMap<String, String>() {{
    put("eventType", "'click'");
  }}
);

deltaTable.update(                // predicate using Spark SQL functions
  functions.col("eventType").equalTo("clck"),
  new HashMap<String, Column>() {{
    put("eventType", functions.lit("click"));
  }}
);

Upsert into a table using merge

You can upsert data from a Spark DataFrame into a Delta Lake table using the merge operation. This operation is similar to the SQL MERGE command, but additionally supports deletes and extra conditions in updates, inserts, and deletes.

Suppose you have a Spark DataFrame that contains new event data keyed by eventId. Some of these events may already be present in the events table, so when you merge the new data into the events table, you want to update the matching rows (that is, where the eventId already exists) and insert the new rows (that is, where the eventId does not exist). You can run the following:

Scala

import io.delta.tables._
import org.apache.spark.sql.functions._

val updatesDF = ...  // define the updates DataFrame[date, eventId, data]

DeltaTable.forPath(spark, pathToEventsTable)
  .as("events")
  .merge(
    updatesDF.as("updates"),
    "events.eventId = updates.eventId")
  .whenMatched
  .updateExpr(
    Map("data" -> "updates.data"))
  .whenNotMatched
  .insertExpr(
    Map(
      "date" -> "updates.date",
      "eventId" -> "updates.eventId",
      "data" -> "updates.data"))
  .execute()

Java

import io.delta.tables.*;
import org.apache.spark.sql.functions;
import java.util.HashMap;

Dataset<Row> updatesDF = ...  // define the updates DataFrame[date, eventId, data]

DeltaTable.forPath(spark, pathToEventsTable)
  .as("events")
  .merge(
    updatesDF.as("updates"),
    "events.eventId = updates.eventId")
  .whenMatched()
  .updateExpr(
    new HashMap<String, String>() {{
      put("data", "updates.data");
    }})
  .whenNotMatched()
  .insertExpr(
    new HashMap<String, String>() {{
      put("date", "updates.date");
      put("eventId", "updates.eventId");
      put("data", "updates.data");
    }})
  .execute();

You should add as many constraints to the merge condition as possible, both to reduce the amount of work done and to reduce the chance of transaction conflicts. For details on how to use merge in different scenarios, see the documentation.
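As a sketch of this advice, assuming the events table is partitioned by date, adding a date constraint to the merge condition (the cutoff below is a hypothetical value) lets Delta Lake skip partitions that cannot contain matches:

```scala
import io.delta.tables._

// Hypothetical example: the events table is assumed to be partitioned by
// date, and updates are assumed to concern only recent events. The extra
// date predicate narrows the search for matches to recent partitions,
// reducing both the work done and the window for conflicting transactions.
DeltaTable.forPath(spark, pathToEventsTable)
  .as("events")
  .merge(
    updatesDF.as("updates"),
    "events.date >= '2019-01-01' AND events.eventId = updates.eventId")
  .whenMatched
  .updateExpr(Map("data" -> "updates.data"))
  .whenNotMatched
  .insertExpr(Map(
    "date" -> "updates.date",
    "eventId" -> "updates.eventId",
    "data" -> "updates.data"))
  .execute()
```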

Documentation:

https://docs.delta.io/0.3.0/delta-update.html


Source: www.oschina.net/news/108762/delta-lake-0-3-0-released