Spark - Overview and Installation

Chapter 1 Spark Overview

1.1 What is Spark

Spark is a fast, general-purpose, in-memory computing engine designed for large-scale data processing.

1.2 Spark Built-in Modules

 

 

 

  Spark Core: implements the basic functionality of Spark, including task scheduling, memory management, fault recovery, and interaction with storage systems. Spark Core also contains the API definition of the Resilient Distributed Dataset (RDD); a short sketch of the RDD and Spark SQL APIs follows this list.
  Spark SQL: a package for working with structured data in Spark. Through Spark SQL we can query data with SQL or with the SQL dialect of Apache Hive (HQL). Spark SQL supports a variety of data sources, for example Hive tables, Parquet, and JSON.
  Spark Streaming: a Spark component for processing real-time data streams. It provides an API for manipulating data streams that corresponds closely to Spark Core's RDD API.
  Spark MLlib: a library of common machine learning (ML) functionality, including classification, regression, clustering, and collaborative filtering. It also provides additional support such as model evaluation and data import.
  Cluster Manager: Spark is designed to scale computation efficiently from one compute node up to thousands of compute nodes. To meet this requirement while keeping maximum flexibility, Spark supports a variety of cluster managers, including Hadoop YARN, Apache Mesos, and a simple scheduler that ships with Spark itself, called the Standalone scheduler.
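
As referenced in the Spark Core item above, here is a minimal sketch of the RDD and Spark SQL APIs in Scala. It assumes a spark-shell session for Spark 2.1.1, where sc (SparkContext) and spark (SparkSession) are predefined; the people.json path points at the sample data shipped in the Spark distribution's examples directory and may differ in your setup.

val nums = sc.parallelize(1 to 100)            // Spark Core: distribute a collection as an RDD
val sumOfSquares = nums.map(n => n * n).sum()  // transformations are lazy; sum() triggers the job

val people = spark.read.json("examples/src/main/resources/people.json")  // Spark SQL: load structured data
people.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 20").show()               // query it with SQL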
  Spark is backed by many big data companies, including Hortonworks, IBM, Intel, Cloudera, MapR, Pivotal, Baidu, Alibaba, Tencent, JD.com, Ctrip, and Youku Tudou. Baidu currently applies Spark to its big search, Zhida Hao, and Baidu Big Data services; Alibaba uses GraphX to build large-scale graph computation and graph mining systems and runs many recommendation algorithms in production; Tencent's Spark cluster has reached a scale of 8,000 nodes and is currently the largest known Spark cluster in the world.

 

 

1.3 Spark Features

  Fast: by computing in memory and scheduling through a DAG execution engine, Spark runs workloads far faster than Hadoop MapReduce.
  Easy to use: Spark offers APIs in Scala, Java, Python, and R, more than 80 high-level operators, and interactive Scala and Python shells.
  General: Spark combines SQL, streaming, machine learning, and graph processing in a single engine.
  Compatible: Spark runs on Hadoop YARN, Apache Mesos, or its own Standalone scheduler, and can access data in HDFS, HBase, Hive, and many other data sources.

Chapter 2 Spark Running Modes

2.1 Spark Installation Addresses

1. Official website address
  http://spark.apache.org/
2. Documentation address
  https://spark.apache.org/docs/2.1.1/
3. Download link
  https://spark.apache.org/downloads.html

 

2.3 Local mode

2.3.1 Overview

Local mode runs Spark on a single machine and is typically used for local practice, development, and testing. The master URL can be local (one worker thread), local[K] (K worker threads), or local[*] (one thread per CPU core on the machine).

2.3.2 Installation
1) Upload the Spark installation package and extract it
[lxl@hadoop102 sorfware]$ tar -zxvf spark-2.1.1-bin-hadoop2.7.tgz -C /opt/module/
[lxl@hadoop102 module]$ mv spark-2.1.1-bin-hadoop2.7 spark

 

2) Run the official SparkPi example
[lxl@hadoop102 spark]$ bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--executor-memory 1G \
--total-executor-cores 2 \
./examples/jars/spark-examples_2.11-2.1.1.jar \
100

 

(1) Basic syntax
bin/spark-submit \
--class <main-class> \
--master <master-url> \
--deploy-mode <deploy-mode> \
--conf <key>=<value> \
... # other options
<application-jar> \
[application-arguments]

 

(2) Parameter description:
--master: the master URL of the cluster; defaults to local (a programmatic local-mode sketch follows this parameter list)
--class: the entry class of your application (e.g. org.apache.spark.examples.SparkPi)
--deploy-mode: whether to deploy your driver on a worker node (cluster) or locally as an external client (client) (default: client)
--conf: an arbitrary Spark configuration property in key=value format; if the value contains spaces, quote it as "key=value"
application-jar: the path to your packaged application jar, including its dependencies. The URL must be globally visible inside the cluster, e.g. an hdfs:// path on shared storage, or a file:// path that exists at the same location on every node
application-arguments: the arguments passed to the main() method of your main class
--executor-memory 1G: sets the available memory of each executor to 1 GB
--total-executor-cores 2: sets the total number of CPU cores used across all executors to 2
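
The parameter list above notes that --master defaults to local. As an illustration only (not part of the original walkthrough), an application can also select local mode programmatically instead of through spark-submit; the object name LocalModeExample and the master URL local[*] are assumptions of this sketch.

import org.apache.spark.{SparkConf, SparkContext}

object LocalModeExample {
  def main(args: Array[String]): Unit = {
    // local[*] runs the driver and executors in a single JVM, one thread per CPU core
    val conf = new SparkConf().setAppName("LocalModeExample").setMaster("local[*]")
    val sc = new SparkContext(conf)

    val evens = sc.parallelize(1 to 1000).filter(_ % 2 == 0).count()  // small sample job
    println(s"even numbers: $evens")

    sc.stop()
  }
}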

 

3) Result

This example estimates π with a Monte Carlo method.
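
To show the idea behind the Monte Carlo estimate, here is a rough sketch that could be typed at the scala> prompt of spark-shell; it is an illustration of the technique, not the source of org.apache.spark.examples.SparkPi, and the sample count n is arbitrary. Random points are thrown into the unit square, and the fraction landing inside the quarter circle approximates π/4.

val n = 1000000
val inside = sc.parallelize(1 to n).filter { _ =>
  val x = math.random              // random point in the unit square
  val y = math.random
  x * x + y * y <= 1.0             // keep points inside the quarter circle of radius 1
}.count()
println(s"Pi is roughly ${4.0 * inside / n}")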

 

4) Prepare input files
[lxl@hadoop102 spark]$ mkdir input
Create the files 1.txt and 2.txt under input and put the following content in them:
hello atguigu
hello spark
 
5) Start spark-shell
[lxl@hadoop102 spark]$ bin/spark-shell
scala>

 

 

Open another CRT (terminal) window and check the running processes:
[lxl@hadoop102 spark]$ jps
3627 SparkSubmit
4047 Jps
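
A natural next step is to run a job on the input directory prepared in step 4. The following word-count sketch at the scala> prompt is an illustration under that assumption, not text reproduced from the original post:

sc.textFile("input")              // read every file under the local input directory
  .flatMap(_.split(" "))          // split each line into words
  .map((_, 1))                    // pair each word with an initial count of 1
  .reduceByKey(_ + _)             // add up the counts for each word
  .collect()                      // gather the (word, count) pairs to the driver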

 

 
