CarbonData + Spark Environment Setup and Testing

0x0 Introduction

CarbonData: Apache CarbonData is an indexed columnar data format for fast analytics on big data platforms, e.g. Apache Hadoop, Apache Spark, etc.

In other words, CarbonData is an indexed columnar data format for fast analytics on big data platforms such as Hadoop and Spark. Put simply: it is a data format!

Spark: Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.

In other words, Spark is a fast, general-purpose cluster computing system that provides high-level APIs in Java, Scala, Python, and R... Put simply: a computing framework.

0x1 Spark Environment Setup

1. Download the appropriate Spark release

2. Extract the archive
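
A sketch of steps 1 and 2, assuming the Spark 2.1.0 build for Hadoop 2.6 used later in this guide; swap in whatever version and mirror you need:

# download and unpack a prebuilt Spark release
wget https://archive.apache.org/dist/spark/spark-2.1.0/spark-2.1.0-bin-hadoop2.6.tgz
tar -zxvf spark-2.1.0-bin-hadoop2.6.tgz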

3. For cluster mode (Standalone mode is recommended)

  • 3.1 In the spark/conf directory, edit the slaves file and add the IP of each node in the cluster (an example follows after step 3.2)
  • 3.2 In the same spark/conf directory, edit spark-env.sh and add the following:
# Here, 192.168.0.1 is the Spark master host
export SPARK_MASTER_HOST=192.168.0.1 
export HADOOP_CONF_DIR=/etc/hadoop/conf
export YARN_CONF_DIR=/etc/hadoop/conf.cloudera.yarn/
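
For example, a minimal slaves file (the worker IPs below are placeholders), followed by the script that starts the master and every worker listed in it:

# spark/conf/slaves: one worker host or IP per line
192.168.0.2
192.168.0.3

# run on the master node to start the master and all listed workers
sbin/start-all.sh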

4. (Optional) To set up environment variables, point SPARK_HOME at the Spark installation directory and add $SPARK_HOME/bin to PATH
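
A minimal sketch, appended to ~/.bashrc; the /opt/spark install path is an assumption, adjust it to your layout:

# assumed install location; change to your actual Spark directory
export SPARK_HOME=/opt/spark
export PATH=$SPARK_HOME/bin:$PATH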

0x2 CarbonData Environment Setup

1. Install Thrift 0.9.3

  • 1.1 First install the tools needed to compile Thrift:
 sudo apt-get install automake bison flex g++ git libboost-all-dev libevent-dev libssl-dev libtool make pkg-config

Then enter the Thrift source directory and compile and install it:

./configure
make
sudo make install
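
If the build and install succeeded, the Thrift compiler should report its version:

thrift -version
# expected: Thrift version 0.9.3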

2. Install Maven
Download a release online and extract it; this step is straightforward, so it is not covered in detail here.
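
After extracting, put Maven's bin directory on the PATH and verify it runs; the /opt/maven path here is an assumption:

export PATH=/opt/maven/bin:$PATH
mvn -v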

3. Build CarbonData

  • 3.1 Download and extract apache-carbondata-1.2.0-source-release

  • 3.2 In the CarbonData source directory, run the following command:

 mvn -DskipTests -Pspark-2.1 -Phadoop-2.6.0 -Dspark.version=2.1.0 clean package
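
When the build finishes, the assembly jar used in the next step should exist under assembly/target; the exact Scala-version subdirectory and jar name depend on the build, so a glob is used here:

ls assembly/target/scala-2.1*/carbondata_*.jar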

4. Integrate Spark and CarbonData
After the build completes:

  • 4.1 Copy the assembly jar (assembly/target/scala-2.1x/carbondata_xxx.jar in the CarbonData source tree) into the carbonlib directory under the Spark installation directory (carbonlib must be created manually); see the sketch below.
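
A sketch of this step, assuming the CarbonData source tree was extracted to ~/carbondata:

mkdir -p $SPARK_HOME/carbonlib
cp ~/carbondata/assembly/target/scala-2.1*/carbondata_*.jar $SPARK_HOME/carbonlib/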

  • 4.2 Copy the ./conf/carbon.properties.template file from the CarbonData repository into the $SPARK_HOME/conf/ directory and rename it to carbon.properties.
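
The copy and rename can be done in one command (again assuming the source tree is at ~/carbondata):

cp ~/carbondata/conf/carbon.properties.template $SPARK_HOME/conf/carbon.properties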

  • 4.3 On the Spark master node, configure the following properties in $SPARK_HOME/conf/spark-defaults.conf:

spark.driver.extraJavaOptions="-Dcarbon.properties.filepath=/opt/cloudera/parcels/SPARK2/lib/spark2/conf/carbon.properties"

spark.executor.extraJavaOptions="-Dcarbon.properties.filepath=/opt/cloudera/parcels/SPARK2/lib/spark2/conf/carbon.properties"
  • 4.4 Edit $SPARK_HOME/conf/spark-env.sh and set SPARK_CLASSPATH to $SPARK_HOME/carbonlib/*
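
For example, appended to $SPARK_HOME/conf/spark-env.sh:

# make the CarbonData assembly jar visible to the driver and executors
export SPARK_CLASSPATH=$SPARK_HOME/carbonlib/*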

  • 4.5 Add the following property to $SPARK_HOME/conf/carbon.properties:

carbon.storelocation=hdfs://HOSTNAME:PORT/Opt/CarbonStore

0x3 Testing

Start spark-shell in standalone mode:

./bin/spark-shell

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.CarbonSession._
import org.apache.spark.SparkConf

// point the shell at the standalone master and cap the cores this session uses
val conf = new SparkConf().setMaster("spark://192.168.0.1:7077").set("spark.cores.max", "4")
// build a CarbonSession backed by the Hive metastore; the argument is the CarbonData store path on HDFS
val carbon = SparkSession.builder().config(conf).config("hive.metastore.uris","thrift://192.168.0.1:9083").getOrCreateCarbonSession("hdfs://192.168.0.1:8020/opt")

carbon.sql("show tables").show
carbon.sql("select * from event_log").show


Reposted from blog.csdn.net/gx304419380/article/details/79182150