Hadoop features

Introduction to the advantages and disadvantages of Hadoop:
Advantages:
(1) High reliability. Hadoop's bit-level storage and processing of data can be trusted.
(2) High scalability. Hadoop distributes data and computation across clusters of machines, and these clusters can easily scale to thousands of nodes.
(3) Efficiency. Hadoop can move data dynamically between nodes and keeps each node's load balanced, so processing is very fast.
(4) High fault tolerance. Hadoop automatically keeps multiple copies of data and automatically reassigns failed tasks.
Disadvantages:
(1) Not suitable for low-latency data access.
(2) Cannot efficiently store a large number of small files.
(3) Does not support multiple concurrent writers or arbitrary modification of files.

 

(1) Is Hadoop suitable for e-government? Why?
E-government uses Internet technology to reorganize and optimize government structures and workflows, and to build a streamlined, efficient, clean, and fair information service platform for government operations. E-government therefore generates a large amount of data along with corresponding computing requirements; once the data volume and computation reach a certain scale, a traditional system architecture can no longer cope and a massive data processing platform is needed, such as Hadoop. Hadoop can therefore be used to build an e-government cloud platform.
To sum up, no system is absolutely suitable or unsuitable; it can only be decided by the actual requirements. A very small e-government system with no data processing or analytical computing needs does not require technologies like Hadoop. In practice, commercial e-government platforms often involve large-scale data and heavy computational analysis, so technologies such as Hadoop are needed to handle them.


(2) Does Hadoop have advantages for real-time online processing?
Using Hadoop directly for real-time processing has no advantage, because Hadoop is designed for massive batch-processing jobs. However, HBase, the Hadoop-based distributed NoSQL system, and related real-time processing systems can be used:
1. HBase, built on Hadoop, can meet real-time processing and real-time computing needs; it mainly addresses massive <key, value> lookup and computation workloads.
2. Spark can be considered. Spark is built on in-memory RDDs; it is faster than Hadoop MapReduce and well suited to iterative computation such as data mining and machine learning algorithms.
3. There is also Storm. Storm is a free, open-source, distributed, fault-tolerant real-time computing system, often used for real-time analysis, online machine learning, continuous computation, distributed RPC, and ETL.
4. S4 can be considered. S4 is a general-purpose, distributed, scalable, partially fault-tolerant, pluggable platform open-sourced by Yahoo! in October 2010, aimed at developers building applications that process continuous, unbounded data streams.
You can choose a suitable system according to your actual needs.

(3) There is no problem with Hadoop storing massive data, but how can real-time retrieval of massive data be achieved?

1. Hadoop can be combined with open-source search engines such as Apache Lucene, Solr, or Elasticsearch.
2. HBase can be considered for real-time retrieval of massive data. A common approach is to use Hadoop to build the data into a <key, value> collection with the query field as the key, and then write it into an HBase table; HBase automatically indexes rows by key. Even at the scale of billions of records or more, looking up the value for a given key typically responds within about 10 milliseconds (see the lookup sketch after this answer).
If the retrieval conditions combine multiple fields, several HBase tables can be designed accordingly, and such retrieval is still very fast. HBase also supports secondary indexes, and MapReduce can be used for queries over records that satisfy a condition; if the response-time requirement is not strict, Hive can be used together with HBase.
If the data volume is not very large, a NoSQL system that supports SQL can also be considered.
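As a rough illustration only (the table name, column family, and qualifier below are hypothetical), a single-key lookup against such an HBase table might look like the following sketch, written against the classic HBase Java client API of the Hadoop 1.x era:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

// Point lookup by row key: the query field was written as the row key,
// so HBase can answer a single-key query without scanning.
public class KeyValueLookup {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();       // reads hbase-site.xml from the classpath
        HTable table = new HTable(conf, "query_index");          // hypothetical table name
        try {
            Get get = new Get(Bytes.toBytes(args[0]));           // the query key to look up
            Result result = table.get(get);
            byte[] value = result.getValue(Bytes.toBytes("d"),   // hypothetical column family
                                           Bytes.toBytes("v"));  // hypothetical qualifier
            System.out.println(value == null ? "not found" : Bytes.toString(value));
        } finally {
            table.close();
        }
    }
}
```

Because the row key itself is the index here, this kind of lookup stays fast even as the table grows very large.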

(4) Can you suggest some Hadoop learning methods and a learning plan? The Hadoop system is rather large, and I feel I cannot get through it all.
First, figure out what Hadoop is and what it can be used for.
Then start with the classic word-frequency (word count) program to get a first feel for the basic idea of MapReduce and the way it processes data; a minimal sketch is given after this list.
Next, formally learn the basic principles of Hadoop, including HDFS and MapReduce, starting from the overall, macro-level core principles rather than from the source code.
Further on, go deep into the details of the HDFS and MapReduce modules; at this point, reading the source code helps you understand the implementation mechanisms.
Finally, practice: complete some Hadoop-related applications based on your own projects or requirements.
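For reference, here is a minimal sketch of that word-frequency program, written against the MapReduce Java API of the Hadoop 1.x era; the input and output HDFS paths are passed on the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Classic word count: the mapper emits <word, 1> for every token,
// and the reducer sums the counts for each word.
public class WordCount {

    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "word count");   // Hadoop 1.x-style constructor
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory on HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory (must not exist yet)
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```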

(5) After a large file is split into many small files, how can these small files be processed efficiently with Hadoop? And how can the load on each node be kept as balanced as possible?

1. How to efficiently process these small files with Hadoop?
This is a good question. Hadoop is very efficient on large-scale data, but a large number of small files causes excessive system overhead and poor efficiency. The usual remedy is to pack the small files into one large file, for example using the SequenceFile format: use each file's signature (such as its name) as the key and the file content as the value, and write one record per small file. In this way, many small files are converted into a single large SequenceFile, with each small file mapped to one record (a sketch follows).
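A minimal sketch of this packing step is shown below, assuming the small files are read from the local disk and written into a single SequenceFile on HDFS; the paths are hypothetical and passed on the command line.

```java
import java.io.File;
import java.nio.file.Files;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Packs many local small files into one SequenceFile on HDFS:
// key = file name (the "signature"), value = raw file content.
public class SmallFilePacker {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path out = new Path(args[0]);  // e.g. /data/packed.seq (hypothetical output path)
        SequenceFile.Writer writer =
                SequenceFile.createWriter(fs, conf, out, Text.class, BytesWritable.class);
        try {
            for (int i = 1; i < args.length; i++) {          // remaining args: local small files
                File f = new File(args[i]);
                byte[] content = Files.readAllBytes(f.toPath());
                writer.append(new Text(f.getName()), new BytesWritable(content));
            }
        } finally {
            IOUtils.closeStream(writer);
        }
    }
}
```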
2. How to keep the load on each node as balanced as possible?
Load balancing is critical in a Hadoop cluster. Imbalance is often caused by uneven distribution of user data while the compute slots are distributed evenly across the nodes; when a job runs, the resulting non-local tasks transfer large amounts of data and the cluster load becomes unbalanced. The key to solving this is therefore to balance the distribution of the user's data, which can be done with Hadoop's built-in balancer script.
For imbalance caused by resource scheduling, the specific scheduling algorithm and job allocation mechanism need to be considered.


(6) How can C/C++ programmers get started with Hadoop, develop a deep understanding, and deploy and use it on a Linux server? Is there any directional guidance?

For C/C++ users, Hadoop provides the Hadoop Streaming interface and the Pipes interface. Hadoop Streaming uses standard input and standard output as the channel between the user program and the Hadoop framework, while Pipes is an interface designed specifically for C/C++ and uses sockets as the channel between the user program and the framework.
From a usage point of view, it is recommended to start with Streaming; compared with Streaming, Pipes has more problems and is harder to debug.

(7) Is Hadoop version 1.x or 2.x mainly used in enterprises now?
At present, large Internet companies such as Baidu, Tencent, and Alibaba use Hadoop 1.x as the baseline version; of course, each company carries out custom secondary development to meet the needs of its own clusters.
2.x has not been officially adopted at Baidu, where 1.x is still the mainstay; however, Baidu has developed the HCE system (Hadoop C++ Expand system) to make up for shortcomings of 1.x. Hadoop 2.x is widely used at other companies, such as JD.com.


(8) I want to work on big data in the future. To what extent do algorithms need to be mastered, and do algorithms account for the main part of the work?
First of all, if you want to work in big-data-related fields, Hadoop is used as a tool, so you first need to master how to use it; you do not need to dig into Hadoop at the source-code level.
Then comes the understanding of algorithms. The work often involves distributed implementations of data mining algorithms, so you still need to understand the algorithms themselves, such as the commonly used k-means clustering.


(9) Spark and Storm are becoming more and more popular, and Google has also released Cloud Dataflow. Should Hadoop learners mainly focus on HDFS and YARN in the future? Will Hadoop programmers mainly end up packaging these systems and just providing interfaces so that ordinary programmers can use them, the way Cloudera and Google do?

You are worrying too much. Hadoop, Spark, and Storm solve different problems; none is simply better or worse. To learn Hadoop, the mainstream Hadoop 1.x is still a good baseline version; for 2.x, the most important thing is to understand the YARN framework well.

If you plan to do research and development on Hadoop itself, I suggest studying both; if you do Hadoop application development, reading the mainstream 1.x is enough. My book "Hadoop Core Technology" is explained against the mainstream 1.x version; you can take a look if you are interested.


(10) A beginner's question: does big data processing just mean installing the relevant software on servers? What impact does it have on programs? Are clusters and big data the job of operations staff or of developers?

Traditional programs can only run on a single machine, while big data processing is usually written with a distributed programming framework such as Hadoop MapReduce, which can only run on a Hadoop cluster platform.
The responsibility of operations: ensure the stability and reliability of the cluster and its machines.
Development of the Hadoop system itself: improve the performance of the Hadoop cluster and add new features.
Big data application development: use Hadoop as a tool to implement massive data processing or related requirements.

(11) How should I start learning Hadoop? What kind of projects should I do?
You can refer to my answers above: start with the simplest word-frequency program, then learn and understand the basic principles and core mechanisms of HDFS and MapReduce. If you only use Hadoop as a tool, that is enough. The most important thing is practice: try using Hadoop to process some data, such as log analysis, data statistics, sorting, inverted indexes, and other typical applications.

(12) How do you develop, operate, and maintain a Hadoop cluster of more than 100 nodes? How are task resources allocated when there are many tasks? Is the execution order of tasks controlled by timed scripts or by other means?
1. First of all, big data application development has nothing to do with the size of the Hadoop cluster; do you mean the construction, operation, and maintenance of the cluster? A commercial Hadoop system involves many things; I recommend the practical part of "Hadoop Core Technology", Chapter 10, "Hadoop Cluster Setup".
2. Task assignment is determined by the scheduling policy of the Hadoop scheduler. The default is FIFO scheduling; commercial clusters generally use multi-queue, multi-user schedulers. See the advanced part of "Hadoop Core Technology", Chapter 9, "Hadoop Job Scheduling System".
3. The execution order of tasks is controlled by the user; you can start them on a schedule or manually.

(13) For development based on Hadoop, is it necessary to use Java? Can other development languages integrate well into the overall Hadoop development ecosystem?

Development based on Hadoop can be done in any language, because Hadoop provides the Streaming programming framework and the Pipes programming interface. Under the Streaming framework, users can develop Hadoop applications in any language that can read standard input and write standard output; a sketch of this contract follows.
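As an illustration of that contract only (not a specific Hadoop API), the sketch below is a stand-alone mapper that reads lines from standard input and writes tab-separated <word, 1> pairs to standard output. The same program could just as well be written in Python, C++, or any other language and submitted through the hadoop-streaming jar.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

// Hadoop Streaming contract: read input lines from stdin, write
// "key<TAB>value" lines to stdout. Here the mapper emits <word, 1>.
public class StreamingWordMapper {
    public static void main(String[] args) throws Exception {
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String line;
        while ((line = in.readLine()) != null) {
            for (String token : line.trim().split("\\s+")) {
                if (!token.isEmpty()) {
                    System.out.println(token + "\t1");  // emit <word, 1>
                }
            }
        }
    }
}
```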


(14) In the reduce stage, the job always gets stuck in the final phase for a long time. Searching online suggests this may be data skew. Is there any solution for this?

1. This is data skew: a large amount of data is concentrated in one reduce task, while the other reduce tasks receive much less. Which data goes to which reduce task is determined by the number of reduce tasks and the partition function; by default the key is hashed.

2. Reduce is divided into three sub-stages: shuffle, sort, and reduce. If the whole reduce process takes a long time, first check on the monitoring interface which stage it is stuck in. If it is stuck in the shuffle stage, the cause is often network congestion; if the amount of data in one reduce task is too large, that is the data skew you mentioned, usually caused by a single key having too many values. The solutions: first, the default partition may not suit your needs, so you can customize the partitioner (see the sketch below); second, cut the data down on the map side and try to make the data sent to each reduce task as evenly distributed as possible.
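As one possible sketch of "customizing the partition" (the hot key name and salt factor below are hypothetical), the partitioner scatters records of a single known hot key across several reduce tasks instead of sending them all to one; this assumes a later step can merge the partial results for that key.

```java
import java.util.Random;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Skew-aware partitioner (new MapReduce API): normal keys are hashed as
// usual, while a known hot key is spread over several reduce tasks so that
// no single reducer receives all of its records. The hot key's partial
// results then need to be merged in a follow-up aggregation step.
public class SkewAwarePartitioner extends Partitioner<Text, IntWritable> {
    private static final String HOT_KEY = "hot_key";  // hypothetical hot key
    private static final int SALT = 4;                // spread the hot key over up to 4 reducers
    private final Random random = new Random();

    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        if (HOT_KEY.equals(key.toString())) {
            return random.nextInt(Math.min(SALT, numPartitions));
        }
        // default-style behavior: hash the key
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
```

It is plugged into a job with job.setPartitionerClass(SkewAwarePartitioner.class).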


(15) Can Hadoop be used for non-big data projects?
The key question is whether there is a need for massive data storage, computing, analysis, mining, and so on. If the existing system already meets the current needs well, there is no need to use Hadoop; but having no need does not mean Hadoop cannot be used. Much of what traditional systems do can also be done with Hadoop: HDFS can replace a Linux NFS share, MapReduce can replace statistical analysis tasks on a single server, and HBase can replace relational databases such as MySQL. When the data volume is small, however, a Hadoop cluster will certainly consume more resources than a traditional system.

(16) How can Hadoop MapReduce be integrated with a third-party resource management and scheduling system?
One of Hadoop's scheduler design principles is a pluggable scheduler framework, so it is easy to integrate third-party schedulers such as the Fair Scheduler (FairScheduler) and the Capacity Scheduler (CapacityScheduler). You configure mapreduce.jobtracker.taskscheduler in mapred-site.xml, plus the configuration parameters of the scheduler itself; for example, the Fair Scheduler's control parameters are configured by editing fair-scheduler.xml (a configuration sketch follows). For details, see my new book "Hadoop Core Technology", practical part, Chapter 10, section 10.11 "Cluster Building Example" and section 10.10.9 "Configure third-party schedulers"; you can further study the Hadoop job scheduling system in Chapter 9, which introduces various third-party schedulers and their configuration in detail.
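For reference, a minimal mapred-site.xml fragment for plugging in the Fair Scheduler on a Hadoop 1.x cluster might look like the sketch below; the property names can differ between versions and distributions, so verify them against your own documentation, and the allocation file path is hypothetical.

```xml
<!-- mapred-site.xml: switch the JobTracker to the Fair Scheduler (Hadoop 1.x-style names) -->
<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.FairScheduler</value>
</property>
<!-- The Fair Scheduler's own control parameters live in fair-scheduler.xml -->
<property>
  <name>mapred.fairscheduler.allocation.file</name>
  <value>/etc/hadoop/fair-scheduler.xml</value> <!-- hypothetical path -->
</property>
```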

