Integrating and Testing Spark SQL with Elasticsearch

1. Prerequisites

  • Spark 1.4.1
  • Elasticsearch 1.7
  • Java 1.7

2. Dependency JAR

You need the elasticsearch-hadoop connector.
Download: http://mvnrepository.com/artifact/org.elasticsearch/elasticsearch-hadoop/2.2.0-m1
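If you are building a standalone application rather than using spark-shell, the connector can also be declared as a build dependency instead of downloading the JAR by hand. A minimal sbt sketch, using the same coordinates as the download link above:

libraryDependencies += "org.elasticsearch" % "elasticsearch-hadoop" % "2.2.0-m1"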


3. Configuration

Copy the downloaded elasticsearch-hadoop JAR into $SPARK_HOME/lib/:

ls -lh lib

-rw-r----- 1 wangyue wangyue 692K Nov  9 17:49 elasticsearch-hadoop-2.2.0-m1.jar
-rw-rw-r-- 1 wangyue wangyue 946K Sep 10 14:22 mysql-connector-java-5.1.35.jar

Then edit $SPARK_HOME/conf/spark-env.sh and add elasticsearch-hadoop-2.2.0-m1.jar to SPARK_CLASSPATH.
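A minimal sketch of the spark-env.sh change (it assumes the JAR sits in $SPARK_HOME/lib/ as above; passing the JAR to spark-shell with --jars at launch time also works):

export SPARK_CLASSPATH=$SPARK_CLASSPATH:$SPARK_HOME/lib/elasticsearch-hadoop-2.2.0-m1.jar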


4. Writing data from Spark SQL to Elasticsearch

Create /home/cluster/data/test/people.txt with the following test content:

zhang,san,20
li,si,30
wang,wu,40
li,bai,100
du,fu,101

Start spark-shell and run the following code:

import org.apache.spark.sql.SQLContext
import org.elasticsearch.spark.sql._

// create the SQLContext
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// define the Person case class
case class Person(name: String, surname: String, age: Int)

// build a DataFrame from the text file
val people = sc.textFile("file:///home/cluster/data/test/people.txt").map(_.split(",")).map(p => Person(p(0), p(1), p(2).trim.toInt)).toDF()

// write the DataFrame to the Elasticsearch index/type spark/people
people.saveToEs("spark/people")
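By default elasticsearch-hadoop connects to an Elasticsearch node on localhost:9200. If the cluster runs elsewhere, the connection settings can also be passed per call; a minimal sketch (the host name "es-host" is a placeholder):

// pass connection settings explicitly when Elasticsearch is not on localhost
people.saveToEs("spark/people", Map("es.nodes" -> "es-host", "es.port" -> "9200"))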

Query the data that was inserted into Elasticsearch:

GET /spark/people/_search

{
   "took": 1,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "hits": {
      "total": 5,
      "max_score": 1,
      "hits": [
         {
            "_index": "spark",
            "_type": "people",
            "_id": "AVDrt5MyJpFYJkWM4nwP",
            "_score": 1,
            "_source": {
               "name": "zhang",
               "surname": "san",
               "age": 20
            }
         },
         {
            "_index": "spark",
            "_type": "people",
            "_id": "AVDrt5SEJpFYJkWM4nwS",
            "_score": 1,
            "_source": {
               "name": "li",
               "surname": "bai",
               "age": 100
            }
         },
         {
            "_index": "spark",
            "_type": "people",
            "_id": "AVDrt5MyJpFYJkWM4nwR",
            "_score": 1,
            "_source": {
               "name": "wang",
               "surname": "wu",
               "age": 40
            }
         },
         {
            "_index": "spark",
            "_type": "people",
            "_id": "AVDrt5MyJpFYJkWM4nwQ",
            "_score": 1,
            "_source": {
               "name": "li",
               "surname": "si",
               "age": 30
            }
         },
         {
            "_index": "spark",
            "_type": "people",
            "_id": "AVDrt5SEJpFYJkWM4nwT",
            "_score": 1,
            "_source": {
               "name": "du",
               "surname": "fu",
               "age": 101
            }
         }
      ]
   }
}
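The _id values above are auto-generated by Elasticsearch. If a field in the data can serve as a natural key, elasticsearch-hadoop can use it as the document ID via the es.mapping.id setting; a hedged sketch (it assumes name values are unique, which is not actually true of this sample data):

// use the "name" field as the Elasticsearch document _id (assumes unique names)
people.saveToEs("spark/people", Map("es.mapping.id" -> "name"))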

5. Reading Elasticsearch data with Spark SQL

The Elasticsearch data source supports the following options:

Name       Default value   Description
path       (required)      Elasticsearch index/type
pushdown   true            Whether to translate (push down) Spark SQL into Elasticsearch queries
strict     false           Whether to use exact (not analyzed) matching or not (analyzed)
  • Reading with load or read:

    import org.apache.spark.sql.SQLContext    
    import org.apache.spark.sql.SQLContext._
    
    val sqlContext = new SQLContext(sc)
    // options for Spark 1.3 need to include the target path/resource
    val options13 = Map("path" -> "spark/people",
                        "pushdown" -> "true",
                        "es.nodes" -> "localhost","es.port" -> "9200")
    
    // Spark 1.3 style
    val spark13DF = sqlContext.load("org.elasticsearch.spark.sql", options13)
    
    // options for Spark 1.4 - the path/resource is specified separately
    val options = Map("pushdown" -> "true", "es.nodes" -> "localhost", "es.port" -> "9200")
    
    // Spark 1.4 style
    val spark14DF = sqlContext.read.format("org.elasticsearch.spark.sql").options(options).load("spark/people")
    
    Query name and age:
    spark14DF.select("name","age").collect().foreach(println(_))
    [zhang,20]
    [li,100]
    [wang,40]
    [li,30]
    [du,101]
    
    Register a temporary table and query name:
    spark14DF.registerTempTable("people")
    val results = sqlContext.sql("SELECT name FROM people")
    results.map(t => "Name: " + t(0)).collect().foreach(println)
    Name: zhang
    Name: li
    Name: wang
    Name: li
    Name: du
  • Reading Elasticsearch data by creating a temporary table myPeople:

    sqlContext.sql(
       "CREATE TEMPORARY TABLE myPeople    " + 
       "USING org.elasticsearch.spark.sql " + 
       "OPTIONS ( resource 'spark/people', nodes 'localhost:9200')" ) 
    
    sqlContext.sql("select * from myPeople").collect.foreach(println)
    
    [20,zhang,san]
    [100,li,bai]
    [40,wang,wu]
    [30,li,si]
    [101,du,fu]
    
    
    sqlContext.sql(
       "CREATE TEMPORARY TABLE myPeople    " + 
       "USING org.elasticsearch.spark.sql " + 
       "OPTIONS ( resource 'spark/people', nodes 'localhost:9200',scroll_size '20')" ) 
    Because using '.' in the option names causes a syntax exception here, use the '_' style instead; in this example es.scroll.size becomes scroll_size (the leading es. can be dropped when loading the data). Note that this only applies to Spark 1.3/1.4, which have a stricter SQL parser.
  • Reading Elasticsearch data with esDF (a query-DSL variant is sketched after this list):

    import org.apache.spark.sql.SQLContext       
    import org.elasticsearch.spark.sql._ 
    
    val sqlContext = new SQLContext(sc)
    val people = sqlContext.esDF("spark/people") 
    // check the associated schema
    println(people.schema.treeString) 
    
    root
     |-- age: long (nullable = true)
     |-- name: string (nullable = true)
     |-- surname: string (nullable = true)
    
    
    // get only the wang
    val wangs = sqlContext.esDF("spark/people","?q=wang" )
    
    wangs.show()
    
    +---+----+-------+
    |age|name|surname|
    +---+----+-------+
    | 40|wang|     wu|
    +---+----+-------+

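The query argument to esDF is not limited to URI-style queries; a full query DSL string can be passed as well. A hedged sketch against the same index (the range threshold is arbitrary):

// esDF with a query DSL body instead of a URI query (sketch)
val elders = sqlContext.esDF("spark/people", """{"query": {"range": {"age": {"gte": 100}}}}""")
elders.show()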
Reference:
https://www.elastic.co/guide/en/elasticsearch/hadoop/master/spark.html#spark-sql

Author: stark_summer
Source: http://blog.csdn.net/stark_summer/article/details/49743687
