Big Data Frameworks: The Hadoop Ecosystem

Chapter 1: Big Data and the Hadoop Ecosystem

This chapter covers:

► Understanding the challenges of big data

► Getting to know the Hadoop ecosystem

► Getting to know Hadoop distributions

► Using Hadoop-based enterprise applications

You may have heard people say that we live in a "big data" environment. Technology drives today's world: computing power is growing rapidly, electronic devices are increasingly common, and Internet access is easier than ever. As a result, more data is being transmitted and collected than ever before.

Enterprises are generating data at an alarming rate. Facebook alone collects 250 TB of data per day. According to Thompson Reuters News Analytics, the total amount of digital data has more than tripled since 2009, when it stood at roughly 1 ZB (1 ZB is equivalent to one million PB); it is likely to reach 7.9 ZB by 2015 and 35 ZB by 2020. Other research firms have made even higher predictions.

As the amount of data companies generate and collect grows, they begin to realize the importance of data analysis. First, however, they must effectively manage the large volumes of information they own. This creates new challenges: How can such large amounts of data be stored? How can they be processed? How can the data be analyzed efficiently? And since the data will only keep growing, how can a scalable solution be built?

Data scientists and researchers are not the only ones facing the challenges of big data. A few years ago, in a Google+ conversation, computer book publisher Tim O'Reilly cited Alistair Croll's observation that companies which generate lots of data but have no clue what it means will be displaced by newer companies that produce relatively little data but understand it. In short, Croll's point is that unless your business "understands" the data it owns, it cannot compete with businesses that "understand" theirs.

Companies have come to realize that big data is closely tied to business competition, situational awareness, productivity, science, and innovation, and that analyzing this data can yield enormous benefits. Because business competition is driving big data analysis, most companies agree with O'Reilly's and Croll's point of view. They believe that the survival of today's businesses depends on their ability to store, process, and analyze massive amounts of information, and on whether they can master the challenges that big data brings.

If you are reading this book, you are probably familiar with these challenges, familiar with Apache Hadoop, and aware of the kinds of problems Hadoop can solve. This chapter describes the promise and the challenges of big data and gives an overview of the Hadoop ecosystem and its components, which can be used to build scalable, distributed data analysis solutions.

1.1 When Big Data Meets Hadoop

As the "human capital" is an invisible, factors critical to success, so the majority of companies believe that their employees are their most valuable assets. In fact, there is another key factor - owned enterprise "Information." Information credibility, informativeness and information accessibility can enhance the ability of enterprise information, enabling businesses to make better decisions.

It is hard to grasp how much digital information enterprises generate. IBM has pointed out that 90 percent of the world's data was produced in the past two years alone. For businesses that collect, process, and store it, this data can become a strategic resource. Ten years ago, Michael Daconta, Leo Obrst, and Kevin T. Smith wrote in "The Semantic Web: A Guide to the Future of XML, Web Services, and Knowledge Management" (Indianapolis: Wiley, 2004) the maxim that only the enterprise that has the best information, knows how to find it, and can make use of it fastest will be invincible.

Knowledge is power. The problem is that, as more and more data is collected, traditional database tools can no longer manage it or process it quickly. This leaves businesses "drowning" in their own data: unable to use the data effectively, unable to understand the connections within it, and unable to grasp its enormous potential.

People with "big data" to describe overly large data sets, these data sets are generally unable to use traditional tools used in the process of storing, managing, searching, and analysis to deal with. There are numerous large data sources may be structured, and may be non-structural type; by processing and analysis of large data, internal rules and patterns can be found, in order to make informed choices.

So what is the big data challenge? How can such huge volumes of data be stored, processed, and analyzed in order to extract useful information from them?

Analyzing big data requires enormous storage space and supercomputing-level processing power. Over the past decade, researchers have tried various approaches to the problems caused by the growth of digital information. At first, the focus was on giving a single computer more storage, processing power, and memory, only to find that a single computer's analytical capability could not solve the problem. Over time, many organizations implemented distributed systems (spreading a task across multiple computers), but distributed data analysis solutions tend to be complex and error-prone, and often still not fast enough.

In 2002, Doug Cutting and Mike Cafarella started a project called Nutch (a search engine project focused on crawling and indexing the web) to process large amounts of information. While solving Nutch's storage and processing problems, they realized they needed a reliable, distributed computing approach to collect data for Nutch's huge number of web pages.

A year later, Google published papers on MapReduce and the Google File System (GFS); MapReduce is a programming model and platform for processing large data sets on distributed clusters. Recognizing the promise of distributed processing and distributed storage on clusters, Cutting and Cafarella used these papers as the basis for building a distributed platform for Nutch, developing what we now know as the Hadoop Distributed File System (HDFS) and MapReduce.
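
To make the MapReduce model concrete, the sketch below shows the canonical word-count job written against Hadoop's Java MapReduce API (org.apache.hadoop.mapreduce). It is an illustrative example added here, not part of the original text: the mapper emits a (word, 1) pair for every token in its input split, and the reducer sums the counts for each distinct word.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: for every line of input, emit (word, 1) for each token.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts emitted for each distinct word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  // Driver: configure the job and point it at input/output paths
  // passed on the command line (typically HDFS paths).
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation on the map side
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

HDFS stores the input files in blocks across the cluster, the framework runs one mapper per input split, and the shuffle phase groups all values for the same key before handing them to a reducer, so the application code never has to manage the distribution of work itself.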

In 2006, while struggling with the "big data" challenge of building indexes over huge amounts of information for its search engine, Yahoo! saw the promise of the Nutch project, hired Doug Cutting, and quickly decided to adopt Hadoop as its distributed framework to solve its search engine problems. Yahoo! spun the storage and processing parts out of the Nutch project to form Hadoop, an open source Apache Foundation project, while the Nutch web crawler project remained independent. Shortly thereafter, Yahoo! began using Hadoop to analyze a variety of product applications. The platform proved so effective that Yahoo! merged its search and advertising businesses into a single unit in order to better take advantage of Hadoop technology.

Over the past 10 years, Hadoop has evolved from a platform associated with search engines into the most popular general-purpose computing platform for solving the challenges brought by big data. It is fast becoming the foundation of the next generation of data-based applications. Market research firm IDC predicts that by 2016 the Hadoop-driven big data market will exceed $2.3 billion. Since Cloudera, the first Hadoop-focused company, was founded in 2008, dozens of Hadoop-based startups have attracted hundreds of millions of dollars in venture capital. In short, Hadoop gives businesses an effective way to analyze big data.

