Reading from and writing to ES with Spark

Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, and Python, and an optimized engine that supports general execution graphs.

Spark provides fast iterative/functional-like capabilities over large data sets, typically by caching data in memory. As opposed to the other libraries mentioned in this document, Apache Spark is a computing framework that is not tied to Map/Reduce itself; however, it does integrate with Hadoop, mainly with HDFS. elasticsearch-hadoop allows Elasticsearch to be used in Spark in two ways: through the dedicated support available since version 2.1, or through the Map/Reduce bridge available since version 2.0. Spark 2.0 is supported in elasticsearch-hadoop since version 5.0.

Just like the other libraries, elasticsearch-hadoop needs to be available on Spark's classpath. Since Spark has multiple deployment modes, this can mean deploying the jar to the target classpath on a single node (as is the case with local mode, which is used throughout this document) or on every node, depending on the desired infrastructure.
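
For example, when launching through spark-submit, the jar can be supplied via the --jars flag. A minimal sketch; the jar file name is a placeholder, so replace <version> with the actual elasticsearch-hadoop version in use:

$ ./bin/spark-submit --jars elasticsearch-hadoop-<version>.jar ...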

elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark in the form of an RDD (Resilient Distributed Dataset) (or, to be precise, a Pair RDD) that can read data from Elasticsearch. The RDD comes in two flavors: one for Scala (which returns the data as a Tuple2 of Scala collections) and one for Java (which returns the data as a Tuple2 containing java.util collections).

To configure elasticsearch-hadoop for Apache Spark, set the various properties described in the "Configuration" chapter on the SparkConf object:

Scala:

import org.apache.spark.SparkConf

val conf = new SparkConf().setAppName(appName).setMaster(master)
conf.set("es.index.auto.create", "true")

Java:

import org.apache.spark.SparkConf;

SparkConf conf = new SparkConf().setAppName(appName).setMaster(master);
conf.set("es.index.auto.create", "true");

Command line. For those who want to set the properties through the command line (either directly or by loading them from a file), note that Spark only accepts properties that start with the "spark." prefix and will ignore the rest (depending on the version, a warning might be thrown). To work around this limitation, define the elasticsearch-hadoop properties by prepending the spark. prefix (so they become spark.es.), and elasticsearch-hadoop will resolve them automatically:

$ ./bin/spark-submit --conf spark.es.resource=index/type ...
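
The same properties can also be loaded from a file through spark-submit's --properties-file flag. A minimal sketch, where es.conf is a hypothetical properties file containing the line spark.es.resource=index/type:

$ ./bin/spark-submit --properties-file es.conf ...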

Writing data to ES:
Using elasticsearch-hadoop, any RDD can be saved to Elasticsearch as long as its content can be translated into documents. In practice, this means the RDD type needs to be a Map (Scala or Java), a JavaBean, or a Scala case class. When this is not the case, one can easily transform the data in Spark or plug in their own custom ValueWriter.

import org.apache.spark.SparkContext    // Spark Scala imports
import org.apache.spark.SparkContext._

import org.elasticsearch.spark._        // brings in the implicit saveToEs method

...

val conf = ...
val sc = new SparkContext(conf)         // start Spark through its Scala API

val numbers = Map("one" -> 1, "two" -> 2, "three" -> 3)
val airports = Map("arrival" -> "Otopeni", "SFO" -> "San Fran")

// makeRDD creates an ad-hoc RDD from the given collection; any RDD (Java or Scala) works as well
sc.makeRDD(Seq(numbers, airports)).saveToEs("spark/docs")
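
Since Scala case classes are supported as well, here is a minimal sketch of the same write; the Trip class and its values are illustrative, not part of the original example:

case class Trip(departure: String, arrival: String)

val upcomingTrip = Trip("OTP", "SFO")
val lastWeekTrip = Trip("MUC", "OTP")

// each case class instance becomes one document
sc.makeRDD(Seq(upcomingTrip, lastWeekTrip)).saveToEs("spark/docs")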

Writing JSON to ES:
For cases where the data in the RDD is already in JSON, elasticsearch-hadoop allows direct indexing without applying any transformation; the data is sent to Elasticsearch as-is. Accordingly, in this case elasticsearch-hadoop expects an RDD containing String or byte arrays (byte[]/Array[Byte]), assuming each entry represents a JSON document. If the RDD does not have the proper signature, the saveJsonToEs methods cannot be applied (in Scala they will not even be available).

val json1 = """{"reason" : "business", "airport" : "SFO"}"""      
val json2 = """{"participants" : 5, "airport" : "OTP"}"""

new SparkContext(conf).makeRDD(Seq(json1, json2))
                      .saveJsonToEs("spark/json-trips") 
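
Byte arrays work the same way; a minimal sketch writing the same documents as Array[Byte] entries, assuming UTF-8 encoding:

import java.nio.charset.StandardCharsets

// each byte array is sent as-is, one JSON document per entry
new SparkContext(conf).makeRDD(Seq(json1.getBytes(StandardCharsets.UTF_8),
                                   json2.getBytes(StandardCharsets.UTF_8)))
                      .saveJsonToEs("spark/json-trips")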

Writing to dynamic/multi-resources:
For cases where the documents need to land in different buckets depending on their content, the resource can contain a pattern that is resolved from each document at runtime. In the example below, each document is saved under my-collection-{media_type}/doc, with {media_type} replaced by the value of that field in the document:

val game = Map("media_type"->"game","title" -> "FF VI","year" -> "1994")
val book = Map("media_type" -> "book","title" -> "Harry Potter","year" -> "2010")
val cd = Map("media_type" -> "music","title" -> "Surfing With The Alien")

sc.makeRDD(Seq(game, book, cd)).saveToEs("my-collection-{media_type}/doc") 

Reading data from ES:

import org.apache.spark.SparkContext    // Spark Scala imports
import org.apache.spark.SparkContext._

import org.elasticsearch.spark._        // brings in the implicit esRDD method

...

val conf = ...
val sc = new SparkContext(conf)         // start Spark through its Scala API

// returns a pair RDD of (document id, document as a Scala Map)
val rdd = sc.esRDD("radio/artists")
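
esRDD also accepts a query, so the read can be restricted at the source rather than filtered in Spark; a minimal sketch using a URI query (the query string itself is illustrative):

// only documents matching the query are returned
val filtered = sc.esRDD("radio/artists", "?q=me*")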
