Early-Stage CDH Cluster Planning Recommendations

Cluster sizing
Cluster size depends on the user's data and application requirements. The final size is the maximum of the minimum cluster sizes implied by each of the following factors (see the sketch after this list):
• capacity requirements
- relatively easy to estimate accurately
- in the majority of cases, capacity alone determines the cluster size
• computing requirements
- computing resources can only be estimated accurately through small-scale tests plus reasonable extrapolation
• other resource constraints
- for example, a user's MapReduce application may have special requirements for memory or other resources; since a single node can only be configured with limited resources, the minimum cluster size is the one that supplies enough of those resources in aggregate
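The sizing rule above can be written down directly: compute the minimum node count each requirement implies, then take the maximum. A minimal Python sketch; all numbers are hypothetical placeholders, not figures from this proposal:

```python
# Final cluster size = max over the per-requirement minimum node counts.
# All numbers here are hypothetical placeholders, not values from the proposal.
import math

def nodes_for_capacity(raw_tb, replicas=3, usable_tb_per_node=30):
    # Minimum nodes needed to hold the replicated data set.
    return math.ceil(raw_tb * replicas / usable_tb_per_node)

def nodes_for_compute(vcores_needed, vcores_per_node=24):
    # Minimum nodes for the compute estimate (which should itself
    # come from a small-scale test, as noted above).
    return math.ceil(vcores_needed / vcores_per_node)

def nodes_for_memory(ram_gb_needed, ram_gb_per_node=256):
    # Minimum nodes when aggregate memory demand is the constraint.
    return math.ceil(ram_gb_needed / ram_gb_per_node)

cluster_size = max(
    nodes_for_capacity(raw_tb=200),        # -> 20
    nodes_for_compute(vcores_needed=300),  # -> 13
    nodes_for_memory(ram_gb_needed=2000),  # -> 8
)
print(cluster_size)  # 20: capacity dominates, as it usually does
```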

Network recommendations
• We recommend Gigabit Ethernet or faster
- to take full advantage of the aggregate bandwidth of parallel disk operations, Gigabit is the minimum (see the sketch after this list)
- even when bandwidth is already sufficient, a higher-bandwidth network can still improve performance
• bandwidth-sensitive scenarios:
- ETL-type or other MapReduce jobs with large input and output volumes
- environments with limited space or power, where fewer high-bandwidth, high-capacity nodes can be used instead
- HBase and other latency-sensitive applications that place requirements on network transfer speed
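A rough arithmetic check shows why Gigabit is only the floor once disks operate in parallel. The throughput figures below are typical ballpark assumptions, not measurements from this proposal:

```python
# Aggregate disk bandwidth of one data node vs. its network link.
# Throughputs are ballpark assumptions, not measured values.

disks_per_node = 12
disk_mb_per_s = 100                       # one SATA drive, sequential
aggregate_disk_mb_per_s = disks_per_node * disk_mb_per_s   # ~1200 MB/s

gigabit_mb_per_s = 1_000 / 8              # 1 GbE  ~ 125 MB/s
ten_gigabit_mb_per_s = 10_000 / 8         # 10 GbE ~ 1250 MB/s

print(aggregate_disk_mb_per_s / gigabit_mb_per_s)      # 9.6: 1 GbE bottlenecks
print(aggregate_disk_mb_per_s / ten_gigabit_mb_per_s)  # 0.96: 10 GbE keeps up
```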

Traditional tree networks
• network oversubscription
- a tree network scales by adding tiers, but this creates problems:
- the network distance between nodes increases
- the oversubscription problem gets worse (see the sketch after this list)
• where possible, grow port count with multi-port switch replacement or switch backplane port expansion rather than adding tiers
- small and medium networks can use a two-tier tree architecture
- interact with external systems only through the uplink ports of the top-layer switch
- this keeps Hadoop's internal transfer storms from flooding the external network
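Oversubscription can be quantified as the ratio of downlink to uplink bandwidth at each switch tier; every added tier multiplies the effect. A small sketch with illustrative port counts and speeds (assumptions, not values from this proposal):

```python
# Oversubscription at one access switch: bandwidth offered to nodes
# (downlinks) divided by bandwidth toward the core (uplinks).
# Port counts and speeds are illustrative assumptions.

downlink_ports, downlink_gbps = 48, 1    # 48 x 1 GbE to data nodes
uplink_ports, uplink_gbps = 4, 10        # 4 x 10 GbE to the core

ratio = (downlink_ports * downlink_gbps) / (uplink_ports * uplink_gbps)
print(ratio)  # 1.2 -- and each extra tier multiplies the ratio again
```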

Component architecture (a hypothetical layout sketch follows the list below)
• management nodes (head / master nodes): host the NameNode, YARN ResourceManager, HBase Master, etc.
- provide critical, centralized, non-replaceable cluster management services
- if a management service stops, the corresponding Hadoop cluster service stops
- require highly reliable hardware
• data nodes (data / worker / slave nodes)
- handle the actual work, such as data storage and sub-task execution
- co-locate multiple services on the same node to preserve data locality
- if a service stops, another node automatically takes its place
- any hardware component may fail, but the node is easy to replace
• edge nodes
- provide proxying and packaging of Hadoop services to the outside world
- access the actual services as Hadoop clients
- require highly reliable hardware
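A hypothetical role-to-host layout following this three-tier split; the host names and exact role placement below are illustrative assumptions, not a prescription from this proposal:

```python
# Hypothetical role layout for the three node types; host names and
# exact placement are illustrative assumptions, not part of the proposal.

cluster_layout = {
    # management nodes: critical, non-replaceable services, HA pairs
    "master01": ["NameNode (active)", "ResourceManager (active)"],
    "master02": ["NameNode (standby)", "ResourceManager (standby)"],
    # edge nodes: clients / gateways only, no internal cluster services
    "edge01": ["Hadoop client configs", "gateway and proxy services"],
}
# data nodes: storage and compute co-located for data locality
for i in range(1, 31):
    cluster_layout[f"worker{i:02d}"] = [
        "DataNode", "NodeManager", "HBase RegionServer",
    ]
```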

Management node hardware requirements
• management node roles include the NameNode, Secondary NameNode, and YARN ResourceManager
- the Hive Metastore server and HiveServer are also typically deployed on management node servers
- ZooKeeper Server and HMaster can be placed on data node servers, since their load is generally light and they have no special requirements
- all HA candidate servers (active and standby) should have identical configurations
- memory requirements are usually high, but storage requirements are low
• we recommend high-end PC servers, or even minicomputer-class servers, for better performance and reliability
- dual power supplies, redundant fans, NIC bonding, RAID, ...
- system disks on RAID 1
- since management nodes are few in number and highly important, a high-end configuration is generally not a cost problem

Data node configuration policy recommendations
• a few high-performance nodes vs. a larger cluster of nodes with lower single-node performance
- in general, use more machines rather than upgrading individual server configurations
- procuring servers with mainstream, "cost-effective" configurations reduces overall cost
- distributing data across more nodes gives better parallel scale-out performance and reliability
- physical space, network size, and other ancillary equipment must also be factored in
• weigh the total number of servers in the cluster
- compute-intensive applications should consider better CPUs and more memory

Memory requirements calculation
• node roles that require large memory:
- NameNode, Secondary NameNode, YARN ResourceManager, HBase RegionServer
• node memory algorithm:
- sum the memory of the large-memory roles placed on the node (see the sketch below)
- memory-hungry computing applications such as Spark / Impala warrant at least 256GB of RAM
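A minimal sketch of the "sum the large-memory roles" rule for a single node; the per-role figures are placeholder assumptions, not recommendations from this proposal:

```python
# "Sum the large-memory roles" for one node; per-role figures are
# placeholder assumptions, not recommendations from the proposal.

role_ram_gb = {
    "OS and miscellaneous daemons": 16,
    "DataNode": 4,
    "NodeManager": 2,
    "HBase RegionServer": 32,
    "YARN containers (task memory)": 192,
}
node_ram_gb = sum(role_ram_gb.values())
print(node_ram_gb)  # 246 -> a 256GB node covers this mix
```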

Hard drive capacity selection
• a greater number of hard drives is generally recommended
- better parallelism: different tasks can access different disks
- eight 1.5TB drives outperform six 2TB drives
- beyond the space needed for permanent data, reserve 20-30% of capacity for temporary data
- MapReduce intermediate task data is the main consumer of that temporary space
• twelve hard drives per server is very common in actual deployments
- keep the maximum storage capacity of a single node under 48TB (see the sketch below)
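A quick sketch of usable per-node capacity once the 20-30% temporary-data reserve is applied; the drive count and size are illustrative assumptions, chosen to stay under the 48TB node cap:

```python
# Usable permanent-storage capacity of one data node after reserving
# 20-30% for temporary data (mostly MapReduce intermediate output).
# Drive count and size are illustrative, kept under the 48TB node cap.

drives, drive_tb = 12, 2            # 12 x 2TB = 24TB raw
raw_tb = drives * drive_tb
for reserve in (0.20, 0.30):
    print(raw_tb * (1 - reserve))   # 19.2TB / 16.8TB for permanent data
```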

Storage service requirements (reproduced in the sketch after this list)

• raw file storage on HDFS (detail data only; no intermediate tables, no MapReduce temp space)
- data volume: 625TB
- physical capacity: 625TB × 3 (replicas) × 0.3 (compression ratio) / 80% (disk utilization) ≈ 703TB
- data nodes at 30TB per node: 703TB / 30 × 1.05 (redundancy) ≈ 25 nodes
• HBase / Cassandra data service: 2.6TB of historical data, 55GB daily increment, 365-day retention
- physical capacity: (2.6TB + 0.055TB × 365) × 1.3 (3 replicas, compressed) × 1.2 (key overhead) / 70% (disk utilization) ≈ 51TB
- data nodes at 30TB per node: 51TB / 30 × 1.3 (redundancy) ≈ 3 nodes
- if WAL is enabled, add the RegionServer WAL size (typically less than half of RegionServer memory)
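The two worked examples above can be checked mechanically; this sketch reproduces the arithmetic using the factors straight from the list:

```python
# Reproduces the two storage examples above; every factor (replication,
# compression, key overhead, disk utilization, redundancy) is taken
# directly from the list.
import math

# Raw file storage on HDFS
raw_tb = 625 * 3 * 0.3 / 0.8                   # ~703 TB physical
raw_nodes = math.ceil(raw_tb / 30 * 1.05)      # 25 nodes at 30TB each

# HBase / Cassandra data service
hbase_tb = (2.6 + 0.055 * 365) * 1.3 * 1.2 / 0.7   # ~51 TB physical
hbase_nodes = math.ceil(hbase_tb / 30 * 1.3)       # 3 nodes at 30TB each

print(round(raw_tb), raw_nodes)        # 703 25
print(round(hbase_tb), hbase_nodes)    # 51 3
```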

Server configuration recommendations

            Management server          Data server                     Edge server
CPU         2 × E5-2620v4              2 × E5-2620v4                   2 × E5-2620v4
Hard disk   4 × 600GB SAS, RAID 0+1    2 × 600GB SAS; 15 × 2TB SATA    2 × 600GB SAS; 15 × 2TB SATA
RAM         256GB ECC                  256GB ECC                       256GB ECC
Network     dual Gigabit Ethernet      dual Gigabit Ethernet           dual Gigabit Ethernet
Quantity    3                          30                              3

Source: blog.51cto.com/14086291/2404635