Chapter 1 Spark Overview
1.1 What is Spark
1.2 Spark Built-in Modules
Spark Core:
implements the basic functions of Spark, including task scheduling, memory management, error recovery, and interaction with storage systems. Spark Core also contains the API definition of the Resilient Distributed Dataset (RDD).
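As a quick taste of the RDD API, here is a minimal sketch for the spark-shell, where the SparkContext sc is created automatically:

// Distribute a local collection as an RDD, then transform and collect it.
val nums = sc.parallelize(1 to 10)
val squares = nums.map(n => n * n)            // lazy transformation
val evenSquares = squares.filter(_ % 2 == 0)  // another lazy transformation
println(evenSquares.collect().mkString(", ")) // action: results return to the driver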
Spark SQL:
a package for working with structured data through Spark. With Spark SQL, we can query data using SQL or the Apache Hive dialect of SQL (HQL). Spark SQL supports a variety of data sources, such as Hive tables, Parquet, and JSON.
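A minimal Spark SQL sketch, assuming a Spark 2.x spark-shell where spark is the pre-created SparkSession, and using the sample people.json that ships in the Spark distribution:

// Load JSON into a DataFrame, register it as a view, and query it with SQL.
val df = spark.read.json("examples/src/main/resources/people.json")
df.createOrReplaceTempView("people")
spark.sql("SELECT name, age FROM people WHERE age > 18").show()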
Spark Streaming:
a component provided by Spark for computing over real-time data streams. It provides APIs for manipulating data streams that correspond closely to the RDD API of Spark Core.
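A sketch of the DStream API, assuming text arriving on local port 9999 (fed, for example, by nc -lk 9999); note how closely the operations mirror the RDD API:

import org.apache.spark.streaming.{Seconds, StreamingContext}

// Micro-batches of 5 seconds over a socket text stream.
val ssc = new StreamingContext(sc, Seconds(5))
val lines = ssc.socketTextStream("localhost", 9999)
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()          // print each batch's word counts
ssc.start()
ssc.awaitTermination()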
Spark MLlib:
a library providing common machine learning (ML) functionality, including classification, regression, clustering, and collaborative filtering. It also provides additional support such as model evaluation and data import.
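For instance, clustering with MLlib's DataFrame-based API might look like the following sketch, run in a spark-shell where spark is the SparkSession (the toy data points are invented for illustration):

import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.linalg.Vectors

// Four 2-D points forming two obvious clusters.
val data = spark.createDataFrame(Seq(
  Tuple1(Vectors.dense(0.0, 0.0)),
  Tuple1(Vectors.dense(0.1, 0.1)),
  Tuple1(Vectors.dense(9.0, 9.0)),
  Tuple1(Vectors.dense(9.1, 9.1))
)).toDF("features")

val model = new KMeans().setK(2).setSeed(1L).fit(data)
model.clusterCenters.foreach(println)  // one center near each cluster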
Cluster Manager:
Spark is designed to scale computation efficiently from one compute node up to thousands of compute nodes. To meet this requirement while keeping maximum flexibility, Spark can run on a variety of cluster managers, including Hadoop YARN, Apache Mesos, and a simple scheduler that ships with Spark itself, called the Standalone scheduler.
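The choice of cluster manager is expressed through the master URL an application is given. A sketch of the common options (host names below are placeholders):

import org.apache.spark.SparkConf

// The cluster manager is selected by the master URL passed to SparkConf
// (or to spark-submit via --master).
val conf = new SparkConf().setAppName("demo")
conf.setMaster("local[*]")                    // local mode: one JVM, all cores
// conf.setMaster("spark://master-host:7077") // Spark Standalone scheduler
// conf.setMaster("yarn")                     // Hadoop YARN
// conf.setMaster("mesos://mesos-host:5050")  // Apache Mesos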
Spark is supported by a number of big data companies, including Hortonworks, IBM, Intel, Cloudera, MapR, Pivotal, Baidu, Alibaba, Tencent, JD.com, Ctrip, and Youku Tudou. Baidu currently applies Spark in its big search, Zhida Hao, and Baidu big data services; Alibaba uses GraphX to build large-scale graph computation and graph mining systems and has put many recommendation algorithms into production; Tencent's Spark cluster has reached 8,000 nodes, currently the largest known Spark cluster in the world.
1.3 Spark Features
As the official site summarizes them, Spark's headline features are speed (DAG-based, in-memory execution), ease of use (APIs in Scala, Java, Python, and R), generality (SQL, streaming, machine learning, and graph processing in one stack), and the ability to run everywhere (Standalone, YARN, Mesos, with access to HDFS and many other data sources).
Chapter 2 Spark Running Modes
2.1 Spark Installation Addresses
1. Official website address
http://spark.apache.org/
2. Documentation
https://spark.apache.org/docs/2.1.1/
3. Download link
https://spark.apache.org/downloads.html
2.3 Local mode
2.3.1 Overview
Local mode runs all of Spark inside a single JVM on one machine, without any cluster manager; it is mainly used for development, testing, and learning.
2.3.2 Installation
1) Upload the Spark installation package and extract it
[lxl@hadoop102 sorfware]$ tar -zxvf spark-2.1.1-bin-hadoop2.7.tgz -C /opt/module/
[lxl@hadoop102 module]$ mv spark-2.1.1-bin-hadoop2.7 spark
2) Run the official Pi example
[lxl@hadoop102 spark]$ bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--executor-memory 1G \
--total-executor-cores 2 \
./examples/jars/spark-examples_2.11-2.1.1.jar \
100
(1) Basic syntax
bin/spark-submit \
--class <main-class> \
--master <master-url> \
--deploy-mode <deploy-mode> \
--conf <key>=<value> \
... # other options
<application-jar> \
[application-arguments]
(2) Parameter descriptions:
--master: specifies the address of the Master; defaults to local
--class: the entry class of your application (e.g. org.apache.spark.examples.SparkPi)
--deploy-mode: whether to deploy your driver on a worker node (cluster) or locally as an external client (client) (default: client)
--conf: an arbitrary Spark configuration property in key=value format; if the value contains spaces, wrap it in quotes: "key=value"
application-jar: the packaged application jar, including dependencies. The URL must be globally visible inside the cluster, e.g. an hdfs:// path on a shared storage system; with a file:// path, every node must hold the same jar at that path
application-arguments: the arguments passed to the main() method
--executor-memory 1G: sets the available memory of each executor to 1G
--total-executor-cores 2: sets the total number of CPU cores used across all executors to 2
3) Result
This example estimates Pi using the Monte Carlo method.
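The idea: sample random points in the square [-1, 1] x [-1, 1] and count the fraction that lands inside the unit circle, which approaches Pi/4. A minimal sketch in the spirit of SparkPi (not the exact shipped source), runnable at the scala> prompt:

// Monte Carlo estimate of Pi: throw random darts and count the hits.
val n = 1000000
val inside = sc.parallelize(1 to n).map { _ =>
  val x = math.random * 2 - 1
  val y = math.random * 2 - 1
  if (x * x + y * y <= 1) 1 else 0
}.reduce(_ + _)
println(s"Pi is roughly ${4.0 * inside / n}")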
4) Prepare input files
[lxl@hadoop102 spark]$ mkdir input
Create files 1.txt and 2.txt under the input directory, each with the following content:
hello atguigu
hello spark
5) Start spark-shell
[lxl@hadoop102 spark]$ bin/spark-shell
scala>
Open another terminal window (e.g. a second CRT session) and check the JVM processes:
[lxl@hadoop102 spark]$ jps
3627 SparkSubmit
4047 Jps
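With the shell running and the input files in place, the word count that this setup leads toward might look like the following sketch, run at the scala> prompt:

// Read all files under input/, split lines into words, count each word.
sc.textFile("input").flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).collect()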