Recommended Spark cluster hardware configuration


Computing and storage:

Most Spark jobs read their input data from an external storage system (e.g. Cassandra, HDFS, or HBase), so place the Spark compute engine as close to the data persistence layer as possible.
If HDFS is used as the storage cluster, Spark can be deployed on the same nodes; just configure the memory and CPU usage of Hadoop and Spark so that they do not interfere with each other. Our production storage is a Cassandra cluster: the Spark master service is deployed on its own node, while every other node runs Cassandra plus a Spark worker, so each worker can read data from the local node and compute aggregations quickly; a minimal resource-split sketch follows.
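As an illustration only, the resource split on such a co-located node can be expressed in conf/spark-env.sh. The variable names are the standard standalone-mode settings; the numbers below are made-up examples for a hypothetical 16-core, 64 GB node, not values from the original deployment:

  # conf/spark-env.sh on a co-located Cassandra + Spark worker node (illustrative values)
  export SPARK_WORKER_CORES=12      # leave the remaining cores to Cassandra
  export SPARK_WORKER_MEMORY=32g    # leave headroom for the Cassandra heap and the OS page cache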

Disk:

Although Spark performs a large amount of its computation in memory, it still spills data that does not fit in RAM to local disk. It is recommended to configure 4-8 disks per node, without RAID (disk arrays); disk prices keep falling, and SSDs are worth considering because they can improve performance significantly. In addition, mount the disks in Linux with the noatime option to reduce unnecessary write operations. In Spark, the spark.local.dir property can be set to multiple local disk paths, separated by commas, as sketched below.
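A minimal sketch of both settings; the device names and mount points are placeholders, not values from the original post:

  # /etc/fstab: mount the local scratch disks with noatime (example devices)
  /dev/sdb1  /data1  ext4  defaults,noatime  0 0
  /dev/sdc1  /data2  ext4  defaults,noatime  0 0

  # conf/spark-defaults.conf: list every local disk in spark.local.dir, comma-separated
  spark.local.dir  /data1/spark,/data2/spark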

RAM

It is recommended to allocate no more than 75% of a machine's total memory to Spark; make sure to leave enough memory for the operating system and the buffer cache. Estimate how much memory is actually needed from the characteristics of your workload.
Note that the Java virtual machine becomes unstable with heaps of more than about 200 GB. If a node has more than 200 GB of RAM, run multiple worker JVMs on it. In Spark's standalone mode, set the number of worker processes per node with the SPARK_WORKER_INSTANCES variable in conf/spark-env.sh, and set the number of CPU cores available to each worker with the SPARK_WORKER_CORES variable.
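For example, a hedged sketch of conf/spark-env.sh for a hypothetical node with 256 GB of RAM and 16 cores (the variables are the standard standalone-mode settings; the numbers are illustrative only):

  # conf/spark-env.sh: split one large node into two worker JVMs
  export SPARK_WORKER_INSTANCES=2   # two workers instead of one oversized JVM heap
  export SPARK_WORKER_CORES=8       # CPU cores assigned to each worker
  export SPARK_WORKER_MEMORY=96g    # memory per worker; 2 x 96 GB stays under 75% of 256 GB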

Network

Once data is cached in memory, the performance bottleneck of many Spark applications becomes the network transfer rate. A 10 Gb/s or faster network is recommended as a minimum.

CPU

The more aggregation and compute tasks Spark runs, the more CPU cores it benefits from, and the performance gain is quite noticeable; configure at least 8 to 16 cores per machine, and more depending on the CPU cost of the workload. Once the data is in memory, most applications are limited by either the CPU or the network.

Reference Documents

http://spark.apache.org/docs/latest/hardware-provisioning.html
