The relationship between Hadoop and big data: to learn big data, you must learn Hadoop!

Search for big data, and one of the related terms is bound to be Hadoop.

So,

why does learning big data require learning Hadoop?

What are the characteristics of big data, and what is its relationship with Hadoop?

First of all, what is big data?

So-called big data is, in essence, the ability to quickly extract valuable information from many types of data. Big data is often defined as information assets of such massive volume, high growth rate, and diversity that new processing models are needed to gain stronger decision-making power, insight, and process optimization. In other words, these data sets fall outside the size and scope that normal processing can handle, forcing users to adopt non-traditional approaches.

Unlike the merely large data volumes of the past, the characteristics of big data can be summarized as four V's: Volume, Variety, Value, and Velocity, that is, large volume, great variety, low value density, and high velocity.

First, large volume. Big data generally refers to data sets above roughly 10 TB (1 TB = 1024 GB), and the scale in practice has now jumped to the PB level (1 PB = 1024 TB). Not only is the storage volume large, but so is the amount of computation.
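To make these unit relationships concrete, here is a minimal sketch (the constants simply encode the 1024-based conversions stated above; working in gigabytes is an arbitrary choice for readability):

```java
public class DataScale {
    // 1024-based storage units, expressed in gigabytes.
    static final long GB = 1;
    static final long TB = 1024 * GB;   // 1 TB = 1024 GB
    static final long PB = 1024 * TB;   // 1 PB = 1024 TB

    public static void main(String[] args) {
        System.out.println("10 TB in GB: " + 10 * TB);  // 10240
        System.out.println("1 PB in GB:  " + PB);       // 1048576
    }
}
```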

Second, many data types. Besides numerical data, there are text, audio, video, and other formats, including all kinds of web logs, videos, pictures, and location information. Because the data comes from many sources, its types and formats are increasingly rich, breaking the earlier boundaries of structured data to encompass semi-structured and unstructured data as well.

 


Third, low value density. Take surveillance video as an example: in hours of continuous footage, the valuable data may amount to only a second or two. Finding valuable information is like panning for gold in sand, but what is found is highly valuable.

Fourth, fast processing speed. Even with very large amounts of data, real-time processing is possible. This is fundamentally different from traditional data mining techniques.


Big data technology refers to the techniques for quickly extracting valuable information from many types of extremely large data sets; this is the core problem of big data. When people speak of big data, they mean not only the scale of the data itself but also the tools, platforms, and analysis systems used to collect and process it. The purpose of big data R&D is to develop big data technology and apply it to relevant fields, promoting breakthroughs by solving the problems of processing enormous volumes of data. The challenge of the big data era therefore lies not only in how to process these volumes and extract valuable information from them, but also in how to strengthen the development of big data technology itself. The key technologies of big data generally span six areas: data acquisition and management, distributed storage and parallel computing, big data application development, data analysis and mining, big data front-end applications and presentation, and data services.

Next, big data and Hadoop.

Big data technologies are penetrating every industry. As a typical representative, the Hadoop distributed data processing system has become the de facto standard in the field. But Hadoop does not equal big data: it is merely one successful distributed system for offline data processing, and the big data field contains many other kinds of processing systems.

With the popularity of big data technology, the open-source Hadoop has become today's darling thanks to its excellent performance, and some have even suggested that big data simply is Hadoop. This is in fact a misunderstanding: Hadoop is only a distributed system for storing and processing offline data. Besides Hadoop, there are Storm for stream data processing, Oracle for relational data, Splunk for real-time machine data, and so on. There are many mainstream big data systems, and Hadoop is just one representative.

1. Hadoop core modules

Hadoop Common: the common utilities module of Hadoop and the core of the Hadoop project. It provides various tools for the Hadoop subprojects, such as handling configuration files and logs; the other Hadoop subprojects are developed on this basis.

Hadoop Distributed File System (HDFS): a distributed file system that provides high-throughput access to application data and has high fault tolerance. To an external client, HDFS looks like a traditional hierarchical file system on which files can be created, read, updated, deleted, and renamed in the usual way. Internally, however, HDFS splits each file into blocks and replicates those blocks across multiple machines, which is very different from a traditional RAID architecture. HDFS is particularly well suited to write-once, read-many applications over very large data sets.
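The split-and-replicate idea can be illustrated with a plain-Java sketch. This is not the HDFS API, just a single-process model of the concept; the tiny block size, replica count placement (simple round-robin), and node names are all illustrative (real HDFS defaults to 128 MB blocks, 3 replicas, and rack-aware placement):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class HdfsBlockSketch {
    static final int BLOCK_SIZE = 8;   // bytes; tiny so the demo is readable
    static final int REPLICATION = 3;  // copies kept of each block

    /** Split a file's bytes into fixed-size blocks, as HDFS does conceptually. */
    static List<byte[]> splitIntoBlocks(byte[] file) {
        List<byte[]> blocks = new ArrayList<>();
        for (int off = 0; off < file.length; off += BLOCK_SIZE) {
            int len = Math.min(BLOCK_SIZE, file.length - off);
            blocks.add(Arrays.copyOfRange(file, off, off + len));
        }
        return blocks;
    }

    /** Assign each block to REPLICATION distinct nodes (round-robin placement). */
    static Map<Integer, List<String>> placeReplicas(int numBlocks, List<String> nodes) {
        Map<Integer, List<String>> placement = new LinkedHashMap<>();
        for (int b = 0; b < numBlocks; b++) {
            List<String> replicas = new ArrayList<>();
            for (int r = 0; r < REPLICATION; r++) {
                replicas.add(nodes.get((b + r) % nodes.size()));
            }
            placement.put(b, replicas);
        }
        return placement;
    }

    public static void main(String[] args) {
        byte[] file = "a very large file, conceptually".getBytes();
        List<byte[]> blocks = splitIntoBlocks(file);
        Map<Integer, List<String>> placement =
                placeReplicas(blocks.size(), List.of("node1", "node2", "node3", "node4"));
        System.out.println("blocks = " + blocks.size());
        placement.forEach((b, ns) -> System.out.println("block " + b + " -> " + ns));
    }
}
```

Because every block lives on several nodes, losing one machine loses no data; a real NameNode would notice the failure and re-replicate the affected blocks elsewhere.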

Hadoop YARN: a framework for cluster job scheduling and resource management.

Hadoop MapReduce: a YARN-based distributed parallel programming model and execution framework for large data sets, and an open-source implementation of Google's MapReduce. It helps users write programs that process large data sets in parallel. MapReduce hides the underlying details of distributed parallel programming: developers only write the business-logic code and need not worry about how the program is executed in parallel, which greatly improves development efficiency.
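The programming model can be sketched with the classic word-count example. Note this is a single-process plain-Java illustration of the map / shuffle / reduce phases, not the actual Hadoop `Mapper`/`Reducer` API; on a real cluster the framework would run the map and reduce calls in parallel on different nodes and perform the shuffle over the network:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class MapReduceSketch {
    /** Map phase: each input line emits (word, 1) pairs. */
    static List<Map.Entry<String, Integer>> map(String line) {
        return Arrays.stream(line.toLowerCase().split("\\s+"))
                .filter(w -> !w.isEmpty())
                .map(w -> Map.entry(w, 1))
                .collect(Collectors.toList());
    }

    /** Reduce phase: sum all the counts emitted for one word. */
    static int reduce(String word, List<Integer> counts) {
        return counts.stream().mapToInt(Integer::intValue).sum();
    }

    /** Runs map over every line, groups by key (the shuffle), then reduces each group. */
    static Map<String, Integer> wordCount(List<String> lines) {
        Map<String, List<Integer>> shuffled = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> kv : map(line)) {
                shuffled.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        shuffled.forEach((word, counts) -> result.put(word, reduce(word, counts)));
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("big data needs hadoop", "hadoop processes big data");
        System.out.println(wordCount(lines));
        // {big=2, data=2, hadoop=2, needs=1, processes=1}
    }
}
```

The developer writes only `map` and `reduce`; everything else here (grouping, iteration) stands in for what the framework does automatically and in parallel.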

There are many other Apache Hadoop-related projects.

2. Hadoop features

As a typical representative of distributed computing, Hadoop has advantages over other distributed frameworks.

Scalability: Hadoop can distribute data and computation across the available machines in a cluster without stopping the cluster or its services, and such clusters can easily be extended to thousands of nodes.

Simplicity: Hadoop implements a simple parallel programming model, so users can write and run distributed applications over large data sets on a cluster without knowing the underlying details of distributed storage and computing. Hadoop users can therefore easily build their own distributed platforms.

Efficiency: the Hadoop distributed file system is designed for efficient data exchange and speeds up processing through parallelism. Hadoop is scalable and can move data dynamically between nodes, keeping each node in dynamic balance, so processing is very fast.

Reliability: the Hadoop distributed file system stores data in blocks, and each block is redundantly stored on cluster nodes according to a placement strategy; blocks on failed nodes are redistributed, ensuring the reliability of the data.

Low cost: Hadoop runs on inexpensive commodity servers, so its cost is relatively low and anyone can use it.

In the big data era, Hadoop's excellent performance has drawn widespread attention in the industry, and it has become the de facto standard in the field of big data processing. Today, Hadoop shows its strengths in many fields. With the continued commitment of the open-source community and the active support of numerous international technology vendors for open-source technology, Hadoop will surely be extended to even more applications in the near future.

Finally, why learn Hadoop?

1. The main reason is that Hadoop is open source. Compared with closed-source products (such as Microsoft's), whose source code we cannot obtain and which are therefore inconvenient to build on, Hadoop solves this problem easily. It is much like Android versus iOS: many phone makers choose to do secondary development on the basis of open-source Android. Whether for Android or for Hadoop, it must be said that Google has made no small contribution.

2. Hadoop's history is not long, only about ten years, but it has evolved into a huge ecosystem. Wherever massive data is involved, you will find Hadoop's figure, and each of its branches, such as Spark, is also very popular at the moment. In addition, developers around the world contribute code to it. Hadoop can fairly be called an extremely open platform.

3. Of course, Hadoop is not the only big data base platform today; many big data companies have developed their own platforms, such as Transwarp, MapR, Hortonworks, and Cloudera. These are said to be the four largest in the world, three of them in Silicon Valley and one in Shanghai, yet all four were developed on the basis of open-source Hadoop. So Hadoop has made its contribution and spawned many new things.

4. From the standpoint of convenience, Hadoop has the most learning resources; universities and teachers also like to publish papers based on Hadoop, so the learning curve is much gentler.

5. Academia has a special fondness for Google-related products, which undoubtedly gives Hadoop, with its Google pedigree, an extra halo.


Origin blog.csdn.net/spark798/article/details/93351803