Differences and Relationships Between Big Data and Hadoop

With the growing popularity of information technology, the wide adoption of broadband networks, and the rapid rise of new-generation technologies such as cloud computing, mobile Internet, and the Internet of Things, the growth of global data has accelerated further. At the same time, technologies for data collection, storage, processing, and application have developed rapidly and gradually converged. Increasingly sophisticated software, combined with ever-greater computing power, has significantly improved the ability to extract valuable information from data. Massive amounts of data are no longer disordered and worthless, and big data was born.

1 Recognizing Big Data

So-called big data refers to the ability to quickly obtain valuable information from various types of data. Big data is a massive, fast-growing, and diverse collection of information assets that requires new processing modes in order to deliver stronger decision-making power, insight, and process optimization. It can also be defined as data sets whose size falls outside the range of normal processing, forcing users to adopt non-traditional approaches.


Different from the vast amounts of data in the past, the characteristics of big data can be summarized as four Vs: Volume, Variety, Value, and Velocity, that is, large volume, many varieties, low value density, and high speed.

First, large data volume. Big data generally refers to data volumes of 10 TB (1 TB = 1024 GB) or more, and is currently jumping to the PB (1 PB = 1024 TB) level. Not only is the storage volume large, but the amount of computation is also large.

Second, many data types. Besides numerical data, there are text, audio, video, and other formats, including various kinds of web logs, videos, pictures, and location information. Because data comes from multiple sources, data types and formats are increasingly rich, breaking out of the previously defined category of structured data to encompass semi-structured and unstructured data.

Third, low value density. Take surveillance video as an example: in hours of continuous footage, the valuable data may amount to only a second or two. Finding valuable information is like panning for gold in sand, yet that information is extremely valuable.

Fourth, fast processing speed. Even with very large amounts of data, the data can be processed in real time. This is essentially different from traditional data mining techniques.

Big data technology refers to the technology for quickly obtaining valuable information from various types of massive data; this is the core problem of big data. When we talk about big data, we refer not only to the data itself but also to the tools, platforms, and analysis systems used to collect and process it. The purpose of developing big data technology and applying it to relevant fields is to promote breakthrough development by solving the problem of processing massive data. Therefore, the challenge of the big data era lies not only in how to process massive data and derive valuable information from it, but also in how to strengthen the development of big data technology. The key technologies of big data generally cover six areas: data acquisition and data management, distributed storage and parallel computing, big data application development, data analysis and mining, big data front-end applications and presentation, and data services.

  2 Big Data and Hadoop

Big data technology is penetrating all walks of life. As a typical representative of distributed data processing systems, Hadoop has become the de facto standard in the field. But Hadoop does not equal big data; it is just a successful distributed system for offline data processing, and the big data field contains many other kinds of processing systems.

Along with the popularity of big data technology, open-source Hadoop has become the darling of the moment because of its superior performance, and some even suggest that big data simply means Hadoop; in fact, this is a misunderstanding. Hadoop is just a distributed system for offline storage and processing of data. Besides Hadoop, there are Storm for stream data processing, Oracle for relational data, Splunk for processing real-time machine data, and so on. At present there are many mainstream big data systems, and Hadoop is only one representative of them.

  2.1 Hadoop core modules

Hadoop Common: the common utility module at the core of the Hadoop project. It provides a variety of tools for the Hadoop subprojects, such as handling configuration files and logs, and the other Hadoop subprojects are built on top of it.

Hadoop Distributed File System (HDFS): the Hadoop distributed file system, which provides high-throughput access to application data and has high fault tolerance. To external clients, HDFS looks like a traditional hierarchical file system on which files can be created, read, updated, deleted, or renamed as usual. In fact, however, HDFS splits each file into blocks and replicates them across multiple machines, which is very different from traditional RAID architectures. HDFS is particularly suitable for applications that write ultra-large data sets once and read them many times.
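As a rough illustration of this write-once, read-many pattern, here is a minimal sketch against the standard org.apache.hadoop.fs.FileSystem API. The NameNode address and the /demo/events.log path are made-up placeholders; a real deployment would normally pick up fs.defaultFS from core-site.xml.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteOnceReadMany {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Illustrative NameNode address; usually taken from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/demo/events.log");   // hypothetical path

        // Write once: HDFS files are written sequentially, not updated in place.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("first record\n");
        }

        // Read many times: any client can stream the blocks back.
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }

        fs.close();
    }
}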

  Hadoop YARN: a cluster job scheduling and resource management framework.

Hadoop MapReduce: a YARN-based distributed parallel programming model and program execution framework for big data, and an open-source implementation of Google's MapReduce. It helps users write programs that process large data sets in parallel. MapReduce hides the underlying details of distributed parallel programming, so developers only need to write the business logic without worrying about how the program is executed in parallel, which greatly improves development efficiency.
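To make this programming model concrete, below is a minimal word-count example written against the standard org.apache.hadoop.mapreduce API, essentially the canonical example from the Hadoop documentation; the input and output paths are taken from the command line.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every token in an input line.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    // Driver: the job is submitted to YARN, which schedules the parallel tasks.
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

A typical invocation, assuming the class is packaged into a jar named wordcount.jar, would be "hadoop jar wordcount.jar WordCount /input /output", where /input and /output are HDFS paths.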

  There are many other Apache Hadoop-related projects.

  2.2 Hadoop characteristics

As a typical distributed computing framework, Hadoop has several advantages over other distributed frameworks.

Scalability: Hadoop can distribute data and computation across the available machines in a cluster without stopping cluster services, and these clusters can easily be extended to thousands of nodes.

Simplicity: Hadoop implements a simple parallel programming model, so users can write and run distributed applications that process large data sets on clusters without knowing the details of the underlying distributed storage and computing. As a result, users can easily build their own distributed platforms with Hadoop.

Efficiency: the Hadoop distributed file system is designed for efficient data exchange and speeds up processing through parallelism. Hadoop can dynamically move data between nodes and keep each node dynamically balanced, so processing is very fast.

Reliability: the Hadoop distributed file system stores data in blocks, and each block is redundantly stored on cluster nodes according to certain placement strategies. When a node fails, its blocks can be re-replicated on other nodes, ensuring data reliability (a small replication sketch follows this list of features).

Low cost: Hadoop runs on clusters of low-cost commodity servers, so its cost is relatively low and anyone can use it.
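Returning to the reliability point above, the following minimal sketch (reusing the hypothetical /demo/events.log path from the earlier HDFS example) inspects and raises a file's replication factor through the org.apache.hadoop.fs.FileSystem API; the actual block placement and re-replication after a failure are handled by the NameNode.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationCheck {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/demo/events.log");   // hypothetical path

        // Each HDFS block of this file is stored on `replication` different nodes.
        FileStatus status = fs.getFileStatus(file);
        System.out.println("current replication factor: " + status.getReplication());

        // Ask the NameNode to keep three copies of every block of this file;
        // if a node fails, missing copies are re-created on healthy nodes.
        fs.setReplication(file, (short) 3);

        fs.close();
    }
}

The same adjustment can also be made from the command line with "hdfs dfs -setrep 3 /demo/events.log".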

In the era of big data, Hadoop has attracted widespread attention in the industry for its superior performance and has become the de facto standard in the big data processing field. Today, Hadoop shows its strengths in many areas. With the continued commitment of the open-source community and the active support of numerous international technology vendors, I believe that in the near future Hadoop technology will be extended to even more applications.

Source: blog.csdn.net/yyu000001/article/details/90521695