Getting into big data: some advice for friends just starting to play with Hadoop


Ever since CCTV's midday news ran segments on big data, many people have started paying attention to big data, Hadoop, data mining, and data visualization. In my current startup I run into many traditional enterprises and individuals who want to move their data onto Hadoop, and they ask a lot of questions — mostly the same ones. So I would like to write up answers to some of the questions that seem to concern a lot of people.

Which Hadoop version should you choose?

If you have only just set half a foot inside the Hadoop door, I suggest you use Hadoop 1.x. Many people will say: Hadoop is already at 2.4, why use 1.x? Anyone who says that has never actually run Hadoop.

Reason one: Hadoop 1.x and 2.x are two completely different things. This is not like upgrading a standalone web server from 1.0 to 2.0, and it is not like running MySQL 5.0 and simply building the new version to migrate seamlessly to 5.5. From 1.0 to 2.0 the overall architecture of Hadoop was completely rewritten; from the implementation down to the user-facing interfaces they are two entirely different systems, not a simple in-place upgrade like nginx going from 0.8 to 1.4. So my advice is: run 1.x in production, and deploy 2.x in a test environment to get familiar with it.

Reason two: again, Hadoop is not a web server. Even though it is a distributed system, it is still a very complex one, and HDFS holds your data. To upgrade from Hadoop 0.20.2 to 0.20.203, you first had to deploy the new version on every node, then stop all services across the entire cluster, back up the metadata, and only then perform the HDFS upgrade — and even then there was no guarantee the HDFS upgrade would succeed. The cost of such an upgrade is enormous: never mind the service downtime, if the upgrade fails there is no guarantee the metadata remains complete and correct. It is far more trouble than you think. Do not assume that with Cloudera Manager or other management software you can truly automate operations; deployment is only the first step of Hadoop's long march.
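To make the cost concrete, here is a rough sketch of what that upgrade procedure looked like with the classic 0.20.x/1.x tooling. This is an illustration from the old Apache command set, not a runbook: paths are placeholders, and exact commands may differ on your distribution.

```shell
# Sketch of a classic Hadoop 0.20.x/1.x HDFS upgrade.
# All paths below are placeholders; adapt to your cluster layout.

# 1. Stop the whole cluster (downtime starts here).
stop-all.sh

# 2. Back up the NameNode metadata before touching anything.
cp -r /data/dfs/name /backup/dfs-name-$(date +%Y%m%d)

# 3. Deploy the new Hadoop version on every node (rsync, packages, etc.),
#    then start HDFS in upgrade mode so it converts the on-disk layout.
start-dfs.sh -upgrade

# 4. Watch the upgrade and verify the data; do NOT finalize yet.
hadoop dfsadmin -upgradeProgress status
hadoop fsck /

# 5. Only after a period of stable operation, make the upgrade permanent.
#    Finalizing removes the pre-upgrade state and rules out a rollback.
hadoop dfsadmin -finalizeUpgrade
```

Note that every step between stopping the cluster and finalizing is a window where things can go wrong, which is exactly why such upgrades were treated as major operations rather than routine maintenance.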

Reason three: Hadoop 2.x is currently quite unstable, has many bugs, and iterates too quickly. If you want to choose 2.x, think it through before deciding — picking the newest version of something like this is not foolproof. OpenSSL had been around for many years and still produced the Heartbleed vulnerability, let alone Hadoop 2, which has been out for less than a year. Keep in mind that it took Hadoop almost 7-8 years to reach 1.0, and it only stabilized after countless updates and patches from large companies including Yahoo, Facebook, and the BAT companies. Hadoop 2 has been out for less than a year and has not been through long-term testing and operation; just look at the recent upgrade from 2.3 to 2.4, which took a month and a half and fixed more than 400 bugs.

So I do not suggest you put 2.x on a production cluster right now. Wait and see; moving over once it has stabilized is not too late. If you follow the Apache JIRA, you can see that bug tracking for Hadoop 3.0 has already begun internally.

What about Hadoop talent?

I think companies need to consider two kinds of Hadoop talent: development talent and operations talent.

Developers are currently fairly scarce and concentrated mostly in Internet companies, but this is a problem that can be solved in a relatively short time as Hadoop training spreads and becomes popular. Hadoop's own interfaces are also maturing, so more and more people will learn it.

As for operations talent, I think that outside the Internet industry, for quite some time, the problem will not be scarcity but outright absence. The final battleground of Hadoop and cloud computing is operations, and operations engineers for large-scale distributed systems are extremely hard to cultivate. DevOps people in particular are scarce to begin with, and most of that scarce talent does web operations with Puppet or Fabric; moving from web operations to distributed-system operations is still a real jump in difficulty. Such people are hard to recruit and hard to train.

Next you need to define what kind of talent you want to develop. By analogy, Hadoop is like the Windows or Linux operating system: on top of it you can draw with Photoshop, animate with 3ds Max, or edit spreadsheets with Office, but the goals those applications serve are all different. This still requires the CTO and CIO to have a basic understanding of big data, Hadoop, and the surrounding applications. Do not liken Hadoop to MySQL, PHP, or traditional J2EE and think "it can't be that hard — worst case we outsource it." It simply does not work that way.


What about Hadoop training?


After doing internal Hadoop training at several companies, I found that enterprises just starting their transition share one problem: they want too much. They want a single training session to cover Hadoop and everything around it thoroughly. A typical example is a company in Shanghai I recently trained, which wanted to hear everything from Hadoop to HBase to Mahout to Spark to Storm in one go. A training provider can only respond by finding several different teachers to cover the different topics, and I think such training means little to the business — at most it gives the staff a chance to get together and take a nap.

First, Hadoop is not something you can thoroughly understand from one or two lectures; beyond the theory, it takes a great deal of hands-on experience.

Second, every component in the Hadoop ecosystem is a complex thing. They are genuinely simple to use, but truly understanding each component is not so easy. Mahout, Spark, and R in particular involve a large amount of statistical and mathematical theory. If you summon a roomful of product people with no background in statistics or programming to the lecture, they really can only nap through it. I feel it is cruel to make them sit through a Hadoop lecture: they clearly understand nothing, but with the leadership sitting nearby they have to fight hard not to fall asleep.

Third, everyone's field of expertise is different. There is no single teacher who can cover Windows server operations, advanced Excel techniques, 3ds Max animation, and Photoshop drawing all at once. Yet training providers, in order to win the deal, often promise companies they can round up a few such teachers, and companies often feel that for the same price, hearing everything at once is a great deal. In reality, each teacher's lecture style, knowledge level, and course design are different; throwing chicken, wheat flour, and vegetables together does not necessarily give you big-plate chicken with belt noodles — it may well turn out to be instant noodles, bland and not worth eating. So when a company arranges training it must be targeted; going "big and comprehensive" not only wastes resources but achieves nothing. Training can be split into several directions and handled by different, highly specialized providers. Of course, this again requires the CTO and CIO to have some ideas and vision. As a leader, you should at least know a bit more than everyone else — not every technical detail, but your sense of technical direction should be more accurate than your employees'.

How do you integrate with traditional business systems?

This is also a concern for many people, especially traditional enterprises: they have been using Oracle with a large amount of data stored in it, and replacing it with Hadoop all at once is simply not possible. I think this is overthinking it. To put it bluntly, Hadoop is an offline analytical processing tool; it is not intended to replace your database, and in fact it cannot replace a relational database. It does the dirty work that relational databases cannot do, complementing the existing business architecture rather than replacing it.

And this complementing and gradual replacement happens step by step; it cannot be done overnight. Within the scope of my knowledge, no company has succeeded by declaring "drop MySQL, go straight to Hadoop." When I run into this, I first praise their determination, then refuse to give them a plan, and tell them clearly: this will not work.

Hadoop provides a variety of tools for connecting to traditional database systems. Besides Sqoop, you can also write your own: Hadoop's interfaces are very simple, and so is JDBC.
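As a concrete illustration of the Sqoop route, a minimal import of one table from a relational database into HDFS might look like the sketch below. The host, database, credentials, table name, and paths are all made up for the example; adjust them to your environment.

```shell
# Hypothetical example: pull an `orders` table from MySQL into HDFS.
# Host, database, credentials, table, and paths are placeholders.
sqoop import \
  --connect jdbc:mysql://db.example.com:3306/shop \
  --username reporting \
  --password-file /user/etl/.db_password \
  --table orders \
  --target-dir /warehouse/orders \
  --num-mappers 4
```

Sqoop splits the table across mappers and writes the rows into HDFS, after which the data can be processed offline without touching the production database again — which is exactly the complementary role described above.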

Origin blog.csdn.net/mnbvxiaoxin/article/details/104869087