How to make good use of Hadoop for big data: nine questions and answers explained

As the most widely used platform for big data processing, Hadoop is easy to use but hard to use well. The following nine detailed questions and answers about making good use of Hadoop are intended to help.

Over the past decade, the emergence and continuous improvement of Hadoop has made mining massive data sets possible, bringing a revolutionary storm to scientific research and the IT industry. The Internet companies at the center of that storm have pushed the tool to its limits, not only running countless offline services on the platform, but also, with an increasingly open attitude, nurturing a community of learners.

Q: Is Hadoop suitable for e-government systems? Why or why not?

A: E-government uses Internet technology to restructure and optimize government organizational structures and work processes, building a streamlined, efficient, transparent, and fair government information service platform. E-government therefore generates a large amount of data along with the corresponding computing requirements. Once the data and computation involved exceed a certain scale, traditional system architectures can no longer keep up, and a massive data processing platform such as Hadoop is needed; an e-government cloud platform can thus be built on Hadoop technology.

In summary, no system is absolutely suitable or unsuitable; it depends on whether the need actually arises. A very small e-government system with no large-scale data processing or analytical computing requirements does not need a technology like Hadoop. In practice, however, commercial e-government platforms often involve large-scale data analysis and heavy computation, and Hadoop is needed to handle them.

Q: Does Hadoop have any advantage for real-time, online processing?

A: Hadoop used directly has no advantage for real-time processing, because Hadoop mainly solves batch computation over massive data. However, you can use HBase, the distributed NoSQL system built on Hadoop, together with related systems to meet real-time processing needs:

1. HBase, built on Hadoop, can handle real-time processing and real-time computing needs, mainly queries and similar operations over massive data sets.

2. For computation, consider Spark. Spark is built around in-memory resilient distributed datasets (RDDs), and for iterative computations such as data mining and machine learning algorithms it runs much faster than Hadoop MapReduce (a minimal sketch follows this list).

3. There is also Storm. Storm is a free, open-source, distributed, highly fault-tolerant real-time computing system, often used for real-time analytics, online machine learning, continuous computation, distributed ETL, remote procedure calls, and similar workloads.

4. Consider S4 as well. S4 is a general-purpose, distributed, scalable, partially fault-tolerant, pluggable platform open-sourced by Yahoo! in October 2010. It makes it easy for application developers to build programs that process streaming data (continuous, unbounded streams).

You can select the appropriate system based on your actual needs.
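To make the Spark point above concrete, here is a minimal sketch (not from the original article) of an iterative job using Spark's Java API; the toy data, the local master setting, and the simple gradient-step computation are all illustrative assumptions. The only point is that the cached RDD stays in memory across iterations, which is where Spark gains over a chain of MapReduce jobs for iterative algorithms.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import java.util.Arrays;

public class IterativeSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("iterative-sketch").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Toy data set; in practice this would be loaded from HDFS.
        // cache() keeps it in memory, so every iteration rereads RAM, not disk.
        JavaRDD<Double> points = sc.parallelize(Arrays.asList(1.0, 2.0, 3.0, 4.0)).cache();
        long n = points.count();

        double estimate = 0.0;
        for (int i = 0; i < 20; i++) {
            final double current = estimate;
            // One gradient step toward the mean of the cached data set.
            double gradient = points.map(p -> p - current).reduce(Double::sum) / n;
            estimate = current + 0.5 * gradient;
        }

        System.out.println("converged estimate: " + estimate); // approaches the mean, 2.5
        sc.stop();
    }
}
```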

Q: Hadoop has no problem storing huge amounts of data, but how can we achieve real-time retrieval over that data? Any good suggestions? We currently retrieve through a script, and with our data volume we have to wait a long time for results.

A: For real-time retrieval over massive data, consider HBase. The suggestion is to build the data in Hadoop into data sets keyed by the field you need to query, then write them into an HBase table; HBase automatically indexes by row key, so even at the scale of billions or tens of billions of rows, a lookup by key value should respond within roughly 10 milliseconds (see the sketch below).

If the retrieval conditions involve combinations of multiple fields, it may be appropriate to design several HBase tables so that retrieval stays fast; HBase can also support secondary indexes. For conditional queries, HBase supports MapReduce as well, and if the response-time requirements are relaxed, consider combining Hive with HBase.

If the data volume is not that large, you can also consider NoSQL systems that support SQL-like queries.
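As a minimal sketch of the key-as-row-key pattern described above (the table name "records", column family "d", and row key format are hypothetical, and a running HBase cluster reachable from the client configuration is assumed), each record is written under its retrieval key, so a lookup becomes a single-row Get rather than a scan:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class KeyLookupSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("records"))) {

            // Write: the retrieval key becomes the row key, so HBase keeps it indexed.
            Put put = new Put(Bytes.toBytes("user#10001"));
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("payload"), Bytes.toBytes("..."));
            table.put(put);

            // Read: a point lookup by row key, no full table scan involved.
            Result result = table.get(new Get(Bytes.toBytes("user#10001")));
            System.out.println(Bytes.toString(
                    result.getValue(Bytes.toBytes("d"), Bytes.toBytes("payload"))));
        }
    }
}
```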

Q: If a large file is split into many small files, how can Hadoop handle these small files efficiently, and how can the load be balanced across the nodes?

A: 1. How can Hadoop handle these small files efficiently?

The question you raise is a good one. Hadoop is very effective at processing large-scale data, but it becomes much less efficient when handling a large number of small files, because the per-file system overhead grows too large. For this kind of problem, small files can be packed into large files, for example using the SequenceFile format: use a file signature (such as the file name) as the key and the file content itself as the value, writing one SequenceFile record per small file. In this way many small files are merged into one large SequenceFile, with each original small file mapped to a single record.
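A minimal sketch of the packing step described above (the input directory and output path are placeholder assumptions): each small file under the input directory becomes one (file name, file bytes) record in a single SequenceFile.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SmallFilePacker {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path inputDir = new Path("/data/small-files"); // directory of small files (placeholder)
        Path packed = new Path("/data/packed.seq");    // the single large output file (placeholder)

        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(packed),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class))) {

            for (FileStatus status : fs.listStatus(inputDir)) {
                // Files are assumed small enough to buffer whole in memory.
                byte[] content = new byte[(int) status.getLen()];
                try (FSDataInputStream in = fs.open(status.getPath())) {
                    in.readFully(content);
                }
                // One record per original small file: (file name, file bytes).
                writer.append(new Text(status.getPath().getName()),
                              new BytesWritable(content));
            }
        }
    }
}
```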

2. How can the load be balanced across the nodes?

Load balancing in a Hadoop cluster is critical. The situation you describe is often caused by an uneven distribution of user data while the compute slots are evenly distributed across the nodes; when a job runs, its non-local tasks then generate a lot of data transfer, leaving the cluster unbalanced. The fix in that case is to redistribute the user data evenly, which you can do with Hadoop's built-in balancer command.

For imbalance caused by resource scheduling, you need to look at the specific scheduler and job allocation mechanism.

Q: If I want to work in big data later, how well do I need to master algorithms? Do algorithms make up the major part of the work?

A: First, if you want to work in a big-data-related field, Hadoop is used as a tool, so you first need to learn how to use it; you do not have to dig into the source-code-level details of Hadoop itself.

Next comes understanding of algorithms. Data mining algorithms often need to be redesigned as distributed implementations, but you still need to understand the algorithm itself, for example the commonly used k-means clustering (a minimal sketch follows).
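As an illustration of what redesigning an algorithm for a distributed setting can look like, here is a minimal sketch (not from the original article) of the assignment step of k-means expressed as a Hadoop Mapper over one-dimensional points; the hard-coded centroids are a stand-in for values a real job would load from the previous iteration's output, and a reducer would then average the points per centroid to produce the next centroids.

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

public class KMeansAssignMapper extends Mapper<LongWritable, Text, IntWritable, Text> {

    // Hypothetical 1-D centroids; normally read from the previous iteration's output.
    private static final double[] CENTROIDS = {0.0, 5.0, 10.0};

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        double point = Double.parseDouble(value.toString().trim());

        // Assignment step: find the nearest centroid for this point.
        int nearest = 0;
        double best = Math.abs(point - CENTROIDS[0]);
        for (int i = 1; i < CENTROIDS.length; i++) {
            double d = Math.abs(point - CENTROIDS[i]);
            if (d < best) {
                best = d;
                nearest = i;
            }
        }
        // Emit (centroid id, point) so the reducer can recompute that centroid.
        context.write(new IntWritable(nearest), value);
    }
}
```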

Q: Big data processing software is installed on servers anyway, so what does it change for the programs? Do clustering and big data operations belong to the engineer's job or to operations and maintenance?

A: A traditional program can only run on a single machine, whereas big data processing programs are usually written with a distributed programming framework such as Hadoop MapReduce and can only run on a Hadoop cluster platform.

Operations and maintenance responsibilities: ensuring the stability and reliability of the cluster and its machines.

Hadoop system development itself: improving the performance of Hadoop clusters and adding new features.

Big data application development: using Hadoop as a tool to implement massive data processing and related requirements.

Q: For a Hadoop cluster of more than 100 nodes, how are development and operations generally handled, how are resources allocated to tasks, and is the task execution order controlled by timed scripts or by something else?

A: 1. First, how big data applications are developed has nothing to do with the scale of the Hadoop cluster. If you mean building and operating the cluster itself, a commercial Hadoop deployment involves many things; see the "Chapter 10 Hadoop Cluster Setup" section in the practice part of "Hadoop Core Technology".

2. Task allocation is decided by the scheduler's scheduling policy. Hadoop's default scheduler is FIFO; production clusters generally use a multi-user, multi-queue scheduler. See the "Chapter 9 Hadoop Job Scheduling System" section in the advanced part of "Hadoop Core Technology".

3. The execution order of tasks is controlled by the user; you can start them on a schedule or start them manually.

Q: Can a big data project be done without Hadoop?

A: Whether a big data project must use Hadoop comes down to whether it needs massive data storage, computation, analysis, and mining. If the existing system already meets current needs well, there is no need to use Hadoop; and "no need" does not mean you cannot use it, since much of what Hadoop does can also be done by traditional systems. For example, Linux NFS can be used instead of HDFS, a single server can run statistical analysis tasks instead of MapReduce, and the relational database MySQL can be used instead of HBase, and so on. With small data volumes, a Hadoop cluster usually consumes more resources than a conventional system.

Q: How can Hadoop MapReduce be integrated with a third-party resource management and scheduling system?

A: Hadoop's scheduler is designed as a pluggable scheduler framework, so third-party schedulers are easy to integrate, for example the fair scheduler (FairScheduler) and the capacity scheduler (CapacityScheduler). You set the mapreduce.jobtracker.taskscheduler parameter in mapred-site.xml, plus the scheduler's own configuration parameters; for example, the fair scheduler's control parameters are configured by editing fair-scheduler.xml. For details, see the section on configuring a third-party scheduler in Chapter 10 "Hadoop Cluster Setup" of the practice part of my book "Hadoop Core Technology"; for deeper study, Chapter 9 "Hadoop Job Scheduling System" describes the various third-party schedulers and their configuration methods in detail.
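A minimal sketch of what the mapred-site.xml entry described above might look like for the MRv1-era fair scheduler; the class name and the allocation-file path are typical values, not taken from the original article, and they assume the fair scheduler jar is on the JobTracker's classpath.

```xml
<!-- mapred-site.xml: plug in the fair scheduler -->
<property>
  <name>mapreduce.jobtracker.taskscheduler</name>
  <value>org.apache.hadoop.mapred.FairScheduler</value>
</property>

<!-- Fair-scheduler-specific settings (pools, weights) live in their own file. -->
<property>
  <name>mapred.fairscheduler.allocation.file</name>
  <value>/etc/hadoop/conf/fair-scheduler.xml</value>
</property>
```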


Source: blog.csdn.net/yuidsd/article/details/92431761