[Hadoop in China 2011] He Peng: Hadoop application analysis in massive web search

He Peng is an engineer in Jike Search's platform R&D group. Jike Search, a search engine under People's Daily that grew out of People Search, formally launched on June 20 this year. According to He Peng, Jike Search currently stores more than 20 billion documents, and its whole system is built on a Hadoop massive data analysis platform whose applications have been modified for Jike's particular environment. In this presentation, engineer He Peng analyzes Jike Search's Hadoop-based massive web analytics practice.

▲ He Peng, engineer on Jike Search's platform R&D team

  Jike Search's overall architecture uses the complete Hadoop massive-data analytics platform; middleware has been added, removed, and modified for Jike's specific environment, and parts of the applications have been improved to raise performance. The figure below shows Jike Search's overall framework:

▲ Jike Search's overall architecture diagram

  In the figure above, HDFS is Hadoop's massive data processing platform; Hdfs_Bridge is newly added middleware, JikeSpider is an application newly developed by Jike Search's engineers, and parts of the stock code were modified as well.

  Hdfs_Bridge is middleware for Jike Search's massive data processing platform. It mainly serves the crawler's need for fast writes, and it provides automatic flushing of documents into SSTable files: memory writes are converted into direct DFS flush writes, replacing several unnecessary rounds of HDFS serialization and deserialization.
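
  No code was shown in the talk, but the general technique (writing crawler output straight to DFS and flushing it, instead of passing through extra serialization layers) can be sketched with the stock HDFS Java API. This is a minimal illustration, not Hdfs_Bridge itself; the output path and page bytes are invented, and hflush() is the Syncable call from later Hadoop releases (the 0.20-era equivalent was sync()):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CrawlerHdfsWriter {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical output file for a batch of crawled pages.
        Path out = new Path("/jike/crawl/pages-00001.sst");
        FSDataOutputStream stream = fs.create(out);

        byte[] page = "<html>...</html>".getBytes("UTF-8");

        // Write the raw bytes once and flush straight to the DataNodes,
        // skipping intermediate serialize/deserialize round trips.
        stream.write(page);
        stream.hflush();   // make the bytes visible to readers now

        stream.close();
        fs.close();
    }
}
```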

  Jike Search has also improved Hadoop Pipes. The Hadoop Pipes communication code was modified to turn the single input and output path into multiple inputs and outputs. Local debugging was also made possible, and parts of the code were optimized.
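
  Hadoop Pipes is the C++ MapReduce interface, and the modified communication code was not shown in the talk. As a rough analogue, the stock Java API already expresses the "single path to multiple inputs and outputs" idea through MultipleInputs and MultipleOutputs; the sketch below assumes hypothetical paths and mapper classes:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class MultiPathJob {
    // Hypothetical mappers: each tags its records by source.
    static class WebPageMapper extends Mapper<Object, Text, Text, Text> {
        protected void map(Object key, Text value, Context ctx)
                throws IOException, InterruptedException {
            ctx.write(new Text("web"), value);
        }
    }
    static class NewsPageMapper extends Mapper<Object, Text, Text, Text> {
        protected void map(Object key, Text value, Context ctx)
                throws IOException, InterruptedException {
            ctx.write(new Text("news"), value);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "multi-path-demo");
        job.setJarByClass(MultiPathJob.class);

        // Two input paths, each with its own mapper (paths are hypothetical).
        MultipleInputs.addInputPath(job, new Path("/jike/crawl/web"),
                TextInputFormat.class, WebPageMapper.class);
        MultipleInputs.addInputPath(job, new Path("/jike/crawl/news"),
                TextInputFormat.class, NewsPageMapper.class);

        // A named side output in addition to the default one.
        MultipleOutputs.addNamedOutput(job, "stats",
                TextOutputFormat.class, Text.class, Text.class);

        FileOutputFormat.setOutputPath(job, new Path("/jike/out"));
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```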

  According to engineer He Peng, Jike Search's massive data processing platform still has some shortcomings, which the team is continually optimizing. For example, in some large jobs multiple tasks are assigned to the same machine, so that machine's load becomes too high and the whole job slows down; in extreme cases the machine can even exhaust its memory. He Peng believes the main cause is unreasonable task-scheduling assignment, and his technical team is developing middleware that allocates tasks rationally across the machines in the cluster.

  He Peng said the initial idea is to have each TaskTracker collect CPU, memory, disk, and network information and report it to the JobTracker. When the scheduler receives this information, it takes the CPU, memory, disk, and network figures into account while scheduling and distributing tasks.
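
  The scheduling middleware was still under development at the time, so no interface was described. Purely as an illustration of the reported idea (nodes report resource usage, and the scheduler prefers the least-loaded machine instead of stacking tasks on one host), here is a hypothetical, self-contained sketch; every name, weight, and threshold in it is invented:

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

public class ResourceAwareScheduler {
    // Hypothetical heartbeat payload a TaskTracker would report.
    static class NodeReport {
        final String host;
        final double cpuLoad;   // 0.0 .. 1.0
        final double memUsed;   // fraction of RAM in use
        final double diskBusy;  // fraction of disk bandwidth in use
        final double netBusy;   // fraction of network bandwidth in use

        NodeReport(String host, double cpu, double mem, double disk, double net) {
            this.host = host; this.cpuLoad = cpu; this.memUsed = mem;
            this.diskBusy = disk; this.netBusy = net;
        }

        // Single score combining the four signals; weights are arbitrary.
        double load() {
            return 0.4 * cpuLoad + 0.3 * memUsed + 0.2 * diskBusy + 0.1 * netBusy;
        }
    }

    /** Pick the least-loaded node that still has headroom for one more task. */
    static Optional<NodeReport> pickNode(List<NodeReport> reports) {
        return reports.stream()
                .filter(r -> r.load() < 0.8)  // skip overloaded machines
                .min(Comparator.comparingDouble(NodeReport::load));
    }

    public static void main(String[] args) {
        List<NodeReport> cluster = List.of(
                new NodeReport("dn01", 0.95, 0.90, 0.70, 0.40),  // overloaded
                new NodeReport("dn02", 0.30, 0.40, 0.20, 0.10),
                new NodeReport("dn03", 0.55, 0.50, 0.30, 0.20));

        pickNode(cluster).ifPresent(n ->
                System.out.println("assign next task to " + n.host));
        // prints: assign next task to dn02
    }
}
```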

  Further, because data centers consume a great deal of energy, He Peng hopes to find technology that can reduce the data center's power consumption, for example cluster energy management: when the CPU, I/O, and disk have been idle for a long time, the whole machine can enter a power-saving mode, and modules idle even longer can be shut down outright.
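
  As a rough illustration of that policy (long idle leads to power-saving mode, much longer idle leads to shutting the module down), a hypothetical decision rule might look like the sketch below; the thresholds are invented, not from the talk:

```java
import java.time.Duration;

public class PowerPolicy {
    // Hypothetical thresholds: 10 min idle -> power-save, 2 h idle -> shutdown.
    static final Duration SAVE_AFTER = Duration.ofMinutes(10);
    static final Duration SHUTDOWN_AFTER = Duration.ofHours(2);

    enum Action { NONE, POWER_SAVE, SHUTDOWN }

    /** Decide what to do given how long CPU, I/O, and disk have all been idle. */
    static Action decide(Duration idle) {
        if (idle.compareTo(SHUTDOWN_AFTER) >= 0) return Action.SHUTDOWN;
        if (idle.compareTo(SAVE_AFTER) >= 0) return Action.POWER_SAVE;
        return Action.NONE;
    }

    public static void main(String[] args) {
        System.out.println(decide(Duration.ofMinutes(5)));   // NONE
        System.out.println(decide(Duration.ofMinutes(30)));  // POWER_SAVE
        System.out.println(decide(Duration.ofHours(3)));     // SHUTDOWN
    }
}
```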

Reproduced from: https://www.cnblogs.com/licheng/archive/2011/12/05/2276405.html

Origin: blog.csdn.net/weixin_33779515/article/details/92627742