Spark Overview

The following content is directly translated from http://spark.apache.org/docs/latest/index.html.
There will be a small amount of expansion and supplementation; if anything is misunderstood or mistranslated, corrections are welcome. The official text begins below.

Apache Spark is a fast and general cluster computing system. It provides high-level APIs in Java, Scala, Python, and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools, including Spark SQL (SQL and structured data processing), MLlib (machine learning), GraphX (graph processing), and Spark Streaming.

Downloading and Installation
Spark can be obtained from the downloads page of the project website ( http://spark.apache.org/downloads.html ). The content in this series of documents is based on Spark 1.5.2. Spark uses Hadoop's client libraries for HDFS and YARN, and the download packages are prebuilt against a few popular Hadoop versions. We can also download any version of Hadoop ourselves and run Spark against it by configuring Spark's classpath ( http://spark.apache.org/docs/latest/hadoop-provided.html ). If you prefer to build Spark from source, see http://spark.apache.org/docs/latest/building-spark.html .

Spark can run on both Windows and UNIX-like systems (Linux, Mac OS). Running it locally on one machine is also very simple: all you need is Java installed and on your system PATH, or the JAVA_HOME environment variable pointing to a Java installation.
Spark runs on Java 7+, Python 2.6+, and R 3.1+. For the Scala API, Spark 1.5.2 uses Scala 2.10, so you will need to use a compatible Scala version (2.10.x).
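With those prerequisites in place, a small standalone program gives a feel for how the higher-level libraries mentioned in the overview (Spark SQL, MLlib, GraphX, Spark Streaming) share one engine. The following is a minimal, hedged Scala sketch, not part of the official documentation: the object name StackSketch and the sample data are made up for illustration, and it assumes the Spark 1.5.2 spark-core and spark-sql artifacts are on the classpath. It builds an RDD with the core API and then hands the same data to Spark SQL as a DataFrame.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Illustrative sketch: the core RDD API and Spark SQL sharing one SparkContext.
object StackSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StackSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Core API: distribute a local collection as an RDD.
    val tools = sc.parallelize(Seq("Spark SQL", "MLlib", "GraphX", "Spark Streaming"))

    // Spark SQL: view the same data as a DataFrame and query it.
    val df = tools.map(name => (name, name.length)).toDF("tool", "name_length")
    df.filter($"name_length" > 6).show()

    sc.stop()
  }
}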

Examples
Spark comes with several example programs for Scala, Java, Python, and R in the examples/src/main directory.
To run one of the Java or Scala examples, execute bin/run-example <class> [params] in the top-level Spark directory. (Internally it calls the more general spark-submit to launch the application.)
For example:
./bin/run-example SparkPi 10
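The bundled SparkPi example estimates π by Monte Carlo sampling. As a rough Scala sketch of the same idea (this is not the bundled source; the object name PiSketch and the sample count are illustrative), a standalone program might look like this:

import scala.math.random
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative sketch of the Monte Carlo idea behind SparkPi, not the bundled source.
object PiSketch {
  def main(args: Array[String]): Unit = {
    val slices = if (args.length > 0) args(0).toInt else 2
    val n = 100000 * slices                      // total random points to sample
    val sc = new SparkContext(new SparkConf().setAppName("PiSketch"))

    // Count how many random points in the unit square land inside the unit circle.
    val inside = sc.parallelize(1 to n, slices).map { _ =>
      val x = random * 2 - 1
      val y = random * 2 - 1
      if (x * x + y * y < 1) 1 else 0
    }.reduce(_ + _)

    println("Pi is roughly " + 4.0 * inside / n)
    sc.stop()
  }
}

Packaged into a jar, a program like this would be launched with spark-submit (which supplies the master URL), the same mechanism that run-example uses under the hood.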

We can also run Spark interactively through the Scala shell:
./bin/spark-shell --master local[2]

The --master option specifies the master URL for a distributed cluster; local means run locally with a single thread, and local[N] means run locally with N threads. For simple local testing, local is enough to start with. For a full list of options, run the Spark shell with the --help option.
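Inside the shell, a SparkContext is already created and exposed as sc, so you can experiment right away. For instance, an illustrative snippet typed at the scala> prompt (the numbers are arbitrary):

val nums = sc.parallelize(1 to 1000)   // distribute a local collection as an RDD
val evens = nums.filter(_ % 2 == 0)    // transformations are lazy
evens.count()                          // an action runs the job and returns 500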

Spark also provides a Python API. To run Spark interactively in a Python interpreter, use bin/pyspark:
./bin/pyspark --master local[2]
Example applications are also provided in Python. For example:
./bin/spark-submit examples/src/main/python/pi.py 10

The same goes for R: Spark can be run interactively through bin/sparkR, and example applications are provided in R as well.
./bin/sparkR --master local[2]

./bin/spark-submit examples/src/main/r/dataframe.R
