【Translation】Choosing the right hardware for a new Hadoop cluster (3)

Continued from the previous article: https://my.oschina.net/u/234661/blog/855913

Other Considerations

It is important to remember that the Hadoop ecosystem is designed with a parallel environment in mind. When purchasing processors, we do not recommend getting the highest-GHz chips, which draw high watts (130+). This will cause two problems: higher power consumption and greater heat expulsion. The mid-range models tend to offer the best bang for the buck in terms of GHz, price, and core count.

When we encounter applications that produce large amounts of intermediate data (outputting data on the same order as the amount read in), we recommend two ports on a single Ethernet card or two channel-bonded Ethernet cards to provide 2 Gbps per machine. Bonded 2 Gbps is tolerable for up to about 12 TB of data per node. Once you move above 12 TB, you will want to move to bonded 4 Gbps (4x1 Gbps). Alternatively, for customers that have already moved to 10 Gigabit Ethernet or InfiniBand, those solutions can be used to address network-bound workloads. Confirm that your operating system and BIOS are compatible if you're considering switching to 10 Gigabit Ethernet.

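To make the bandwidth rule of thumb above concrete, here is a minimal Python sketch; the function name and thresholds are illustrative and taken from this article's numbers, not from any Cloudera tool.

```python
def recommended_bonding(data_tb_per_node, link_gbps=1):
    """Suggest how many 1 Gbps links to bond per node, following the rule of
    thumb above: ~2 Gbps is tolerable up to ~12 TB per node, and above that
    you move to ~4 Gbps (4 x 1 Gbps) or a faster interconnect."""
    links = 2 if data_tb_per_node <= 12 else 4
    return links, links * link_gbps

links, gbps = recommended_bonding(data_tb_per_node=24)
print(f"Bond {links} x 1 Gbps ports (~{gbps} Gbps) per node")
```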

When computing memory requirements, remember that Java uses up to 10 percent of it for managing the virtual machine. We recommend configuring Hadoop to use strict heap size restrictions in order to avoid memory swapping to disk. Swapping greatly impacts MapReduce job performance and can be avoided by configuring machines with more RAM, as well as setting appropriate kernel settings on most Linux distributions.

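As a rough sketch of that budgeting exercise: the OS reserve, slot count, and 10% JVM overhead figure below are illustrative assumptions taken from this article; only vm.swappiness is a standard Linux kernel setting.

```python
def per_task_heap_mb(node_ram_gb, task_slots, os_reserve_gb=4, jvm_overhead=0.10):
    """Split a node's RAM across concurrent tasks, leaving ~10% per JVM for
    non-heap management overhead as noted above, so the -Xmx values never
    add up to more than physical memory (which would force swapping)."""
    usable_gb = node_ram_gb - os_reserve_gb
    heap_gb = (usable_gb / task_slots) * (1 - jvm_overhead)
    return int(heap_gb * 1024)

# Example: a 64 GB node running 16 concurrent map/reduce tasks.
print(per_task_heap_mb(64, 16), "MB heap per task")
# On most Linux distributions, lowering the vm.swappiness kernel setting
# further discourages the kernel from swapping JVM pages to disk.
```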

It is also important to optimize RAM for the memory channel width. For example, when using dual-channel memory, each machine should be configured with pairs of DIMMs. With triple-channel memory, each machine should have triplets of DIMMs. Similarly, quad-channel memory should have DIMMs installed in groups of four.

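A trivial sketch of the DIMM-grouping rule, purely for illustration:

```python
def dimms_to_install(channels, desired_dimms):
    """Round a DIMM count up to a multiple of the channel count, so memory is
    populated in pairs (dual-channel), triplets (triple-channel), or groups
    of four (quad-channel)."""
    remainder = desired_dimms % channels
    return desired_dimms if remainder == 0 else desired_dimms + channels - remainder

print(dimms_to_install(channels=3, desired_dimms=4))  # triple-channel: install 6, not 4
```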

Not just MapReduce

Hadoop is far bigger than HDFS and MapReduce; it's an all-encompassing data platform. For that reason, CDH includes many different ecosystem products (and, in fact, is rarely used solely for MapReduce). Additional software components to consider when sizing your cluster include Apache HBase, Cloudera Impala, and Cloudera Search. They should all be run alongside the DataNode process to maintain data locality.

HBase is a reliable, column-oriented data store that provides consistent, low-latency, random read/write access. Cloudera Search addresses the need for full-text search on content stored in CDH, simplifying access for new types of users and opening the door to new types of data storage inside Hadoop. Cloudera Search is based on Apache Lucene/Solr Cloud and Apache Tika and extends valuable functionality and flexibility for search through its wider integration with CDH. The Apache-licensed Impala project brings scalable parallel database technology to Hadoop, enabling users to issue low-latency SQL queries to data stored in HDFS and HBase without requiring data movement or transformation.

HBase users should be aware of heap-size limits due to garbage collector (GC) timeouts. Other JVM column stores also face this issue. Thus, we recommend a maximum of ~16GB heap per Region Server. HBase does not require too many other resources to run on top of Hadoop, but to maintain real-time SLAs you should use schedulers such as the Fair Scheduler and Capacity Scheduler, along with Linux cgroups.

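A minimal sketch of applying that heap ceiling when planning nodes; the 16 GB figure is this article's rule of thumb, not a limit enforced by HBase, and the example node sizes are made up.

```python
def region_server_heap_mb(node_ram_gb, requested_heap_gb, gc_ceiling_gb=16):
    """Clamp a proposed RegionServer heap to the ~16 GB ceiling suggested
    above, to keep garbage-collection pauses from breaking real-time SLAs."""
    heap_gb = min(requested_heap_gb, gc_ceiling_gb, node_ram_gb)
    return int(heap_gb * 1024)

# The chosen value would then typically be applied via HBASE_HEAPSIZE in
# hbase-env.sh (or the equivalent Cloudera Manager configuration).
print(region_server_heap_mb(node_ram_gb=96, requested_heap_gb=32), "MB")
```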
Impala uses memory for most of its functionality, consuming up to 80 percent of available RAM resources under default configurations, so we recommend at least 96GB of RAM per node. Users that run Impala alongside MapReduce should consult our recommendations in “Configuring Impala and MapReduce for Multi-tenant Performance.” It is also possible to specify a per-process or per-query memory limit for Impala.

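A rough sketch of that planning, assuming a hypothetical node that Impala shares with MapReduce: the 40% cap is an illustrative choice, while the 80% default and the existence of per-process and per-query memory limits come from the paragraph above.

```python
def impala_memory_plan(node_ram_gb, default_fraction=0.80, shared_cap_fraction=0.40):
    """Estimate Impala's default memory ceiling (~80% of RAM) and a smaller
    explicit cap you might set on a node that also runs MapReduce or HBase."""
    default_gb = node_ram_gb * default_fraction
    capped_gb = node_ram_gb * shared_cap_fraction
    return default_gb, capped_gb

default_gb, capped_gb = impala_memory_plan(96)
print(f"Default ceiling ~{default_gb:.0f} GB; a shared node might limit the "
      f"Impala daemon to ~{capped_gb:.0f} GB via its process memory limit, "
      f"with per-query limits for individual statements.")
```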

Search is the most interesting component to size. The recommended sizing exercise is to purchase one node, install Solr and Lucene, and load your documents. Once the documents are indexed and searched in the desired manner, scalability comes into play. Keep loading documents until the indexing and query latency exceed the values required by the project; this gives you a baseline for the maximum documents per node based on available resources, and a baseline count of nodes not including any desired replication factor.

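The exercise above reduces to simple arithmetic once you have measured a per-node document ceiling; this sketch assumes made-up corpus and measurement numbers.

```python
import math

def search_cluster_estimate(total_docs, max_docs_per_node, replication_factor=1):
    """Turn the single-node loading test above into a cluster estimate:
    max_docs_per_node is whatever you measured before indexing or query
    latency exceeded the project's targets; replication is layered on top
    of the baseline node count."""
    baseline_nodes = math.ceil(total_docs / max_docs_per_node)
    return baseline_nodes, baseline_nodes * replication_factor

baseline, with_replicas = search_cluster_estimate(
    total_docs=500_000_000, max_docs_per_node=50_000_000, replication_factor=2)
print(baseline, "nodes baseline;", with_replicas, "nodes with replication")
```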

Summary

Purchasing appropriate hardware for a Hadoop cluster requires benchmarking and careful planning to fully understand the workload. However, Hadoop clusters are commonly heterogeneous, and Cloudera recommends deploying initial hardware with balanced specifications when getting started. It is important to remember that when using multiple ecosystem components, resource usage will vary, and focusing on resource management will be your key to success.

We encourage you to chime in about your experience configuring production Hadoop clusters in comments!

Kevin O’Dell is a Systems Engineer at Cloudera.
