From big data to Hadoop, Spark, and Storm

Big data is officially defined as data sets so large and complex that they cannot be stored, managed, and processed with a traditional database. The main characteristics of big data are large data volume (Volume), complex data types (Variety), fast processing speed (Velocity), and high data veracity (Veracity), collectively referred to as the 4Vs.

The volume of big data is enormous, reaching the petabyte level. This huge body of data includes not only structured data (such as numbers and symbols) but also unstructured data (such as text, images, sound, and video), which makes it difficult to store, manage, and process big data with a traditional relational database. Valuable information is often hidden within big data, so processing must be very fast in order to extract valuable information from a large volume of complex data in a short time. A large, complex data set also usually contains false data mixed in with the real data, so big data processing must remove the false data and analyze the real data to obtain accurate results.

Big Data Analysis

On the surface, big data is just a large volume of complex data whose raw value is not high; it is only by analyzing and processing this data that valuable information can be extracted. Big data analysis can be divided into five areas: visual analysis (Analytic Visualization), data mining algorithms (Data Mining Algorithms), predictive analytics (Predictive Analytic Capabilities), semantic engines (Semantic Engines), and data quality management (Data Quality Management).

Visual analysis is the presentation of analysis results that consumers of big data most often see; a typical example is Baidu's Spring Festival migration map, "Baidu Migrate", built on big data. Visual analysis automatically converts a large volume of complex data into visual charts, making the data easier for consumers to understand and accept.

Data mining algorithms are the theoretical core of big data analysis. Based on the nature of the data, a predefined algorithm built on a mathematical formula takes the collected data as its parameters and variables, making it possible to extract valuable information from a large volume of complex data. The famous "beer and diapers" story is a classic case of data mining: by analyzing purchase data for beer and diapers, Wal-Mart uncovered a previously unknown link between the two and used it to boost sales of both. Amazon's recommendation engine and Google's advertising system make heavy use of data mining algorithms.
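
To make the idea concrete, here is a minimal sketch in plain Python of the kind of co-occurrence counting behind association rules like "beer and diapers". The transaction log is made up, and real systems use far more scalable algorithms (such as Apriori or FP-Growth), but the principle is the same: count how often items appear together and turn the counts into rules.

```python
from itertools import combinations
from collections import Counter

# Toy transaction log: each basket is the set of items in one purchase.
baskets = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"diapers", "milk"},
    {"beer", "chips"},
    {"beer", "diapers", "milk"},
]

pair_counts = Counter()   # how often each item pair appears together
item_counts = Counter()   # how often each single item appears

for basket in baskets:
    item_counts.update(basket)
    pair_counts.update(combinations(sorted(basket), 2))

# Confidence of the rule "diapers -> beer": P(beer | diapers)
pair = ("beer", "diapers")
confidence = pair_counts[pair] / item_counts["diapers"]
print(f"confidence(diapers -> beer) = {confidence:.2f}")  # 0.75
```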

Predictive analytics is among the most important application areas of big data analysis. By mining patterns from a large volume of complex data, one can build a scientific model of events; feeding new data into the model then makes it possible to predict future events. Predictive analytics is often used in financial analysis and scientific research, for example in stock forecasting or weather forecasting.
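
As a toy illustration of the "fit a model on past data, then feed it new data" workflow, the sketch below fits a straight line y = ax + b by least squares and uses it to predict the next point. The numbers are invented, and real predictive systems use far richer models over far more data.

```python
# Minimal sketch: fit y = a*x + b by least squares on past observations,
# then use the fitted model to predict a future point. Data is made up.
xs = [1, 2, 3, 4, 5]             # e.g., day number
ys = [2.1, 4.0, 6.2, 7.9, 10.1]  # e.g., observed metric

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
    sum((x - mean_x) ** 2 for x in xs)
b = mean_y - a * mean_x

print(f"model: y = {a:.2f}x + {b:.2f}")
print(f"prediction for day 6: {a * 6 + b:.2f}")
```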

The semantic engine is one of the fruits of machine learning. In the past, a computer's understanding of user input stopped at the character level; it could not grasp the meaning of what was typed and therefore often failed to understand what the user actually needed. By analyzing a large volume of complex data and letting the computer learn from it, a semantic engine enables the computer to try to understand the precise meaning of user input, grasp the user's needs, and provide a better user experience. Apple's Siri and Google's Google Now both use semantic engines.

Data quality management is an important application of big data in the enterprise. To ensure the accuracy of big data analysis results, false data must be weeded out and only accurate data retained. This requires establishing an effective data quality management system that can pick out real, valid data from the large volume of complex data collected.
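
A minimal sketch of this idea, with made-up field names and validation rules: records that fail basic sanity checks are dropped before analysis. Production data quality systems add schema checks, deduplication, cross-source reconciliation, and auditing on top of simple filters like this.

```python
# Minimal data-quality filter: drop records that fail basic validity
# checks before analysis. Field names and rules are hypothetical.
records = [
    {"user_id": 101, "age": 34, "purchase": 19.99},
    {"user_id": 102, "age": -5, "purchase": 4.50},    # invalid age
    {"user_id": None, "age": 28, "purchase": 7.25},   # missing id
    {"user_id": 104, "age": 41, "purchase": 12.00},
]

def is_valid(r):
    return (
        r["user_id"] is not None
        and 0 <= r["age"] <= 130
        and r["purchase"] >= 0
    )

clean = [r for r in records if is_valid(r)]
print(f"kept {len(clean)} of {len(records)} records")  # kept 2 of 4
```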

Distributed Computing

In computer science there are two broad approaches to handling big data. The first is centralized computing: enhance the computing power of a single computer by adding processors, thereby increasing its data processing speed. The second is distributed computing: connect a group of computers to each other over a network to form a distributed system, split the large data set to be processed into many parts, let the computers in the distributed system work on those parts in parallel, and finally combine the partial results into the final result. Although no single computer in a distributed system needs to be powerful, because each one handles only part of the data and all of them compute at the same time, the distributed system as a whole processes data much faster than a single computer can.
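
The split-process-combine pattern can be shown on a single machine with Python's multiprocessing module. This is only a sketch of the principle: a real distributed system would spread the chunks across networked machines rather than local processes.

```python
# Sketch of the distributed-computing idea on one machine:
# split the data, process the chunks in parallel, combine the results.
from multiprocessing import Pool

def partial_sum(chunk):
    # Each "worker" handles only its own slice of the data.
    return sum(chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    n_workers = 4
    size = len(data) // n_workers
    chunks = [data[i * size:(i + 1) * size] for i in range(n_workers)]

    with Pool(n_workers) as pool:
        partials = pool.map(partial_sum, chunks)  # compute in parallel

    print(sum(partials))  # combine partial results: 499999500000
```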

In the past, distributed computing theory was complex and the technology hard to implement, so centralized computing remained the mainstream solution for processing big data. IBM's mainframes are typical centralized computing hardware, and many banks and government agencies use them to process big data. For the Internet companies of the time, however, IBM mainframes were too expensive, so those companies focused their research on distributed computing that could run on cheap, ordinary computers.

Server Clusters

A server cluster is a solution for raising the overall computing power of servers. It is a parallel or distributed system composed of servers connected to each other, and the servers in the cluster run the same computing tasks. From the outside, therefore, the group of servers appears as a single virtual server providing a unified service.

Although the computing power of a single server is limited, once hundreds of servers are composed into a cluster, the whole system gains powerful computing capability and can support the computational load of big data analysis. The server clusters in the computing centers of Google, Amazon, and Alibaba have reached scales of around 5,000 servers.

The technical foundations of big data: MapReduce, Google File System, and BigTable

Between 2003 and 2006, Google published three technical papers, on GFS (Google File System, 2003), MapReduce (2004), and BigTable (2006), presenting a new approach to distributed computing.

MapReduce is a distributed computing framework, GFS (Google File System) is a distributed file system, and BigTable is a data storage system built on top of the Google File System. Together, these three components make up Google's distributed computing model.
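
The heart of the model is the MapReduce programming pattern. The sketch below runs it in a single Python process on made-up data: a map phase emits key-value pairs, a shuffle phase groups them by key, and a reduce phase aggregates each group. In the real framework, each phase runs in parallel across many machines.

```python
# Single-process sketch of the MapReduce programming model.
from collections import defaultdict

documents = ["big data is big", "data beats opinions"]

# Map: each input record becomes a list of (word, 1) pairs.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group all values belonging to the same key.
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: aggregate the values of each key independently.
word_counts = {word: sum(counts) for word, counts in groups.items()}
print(word_counts)  # {'big': 2, 'data': 2, 'is': 1, 'beats': 1, 'opinions': 1}
```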

Compared with traditional distributed computing models, Google's model has three major advantages. First, it simplifies traditional distributed computing theory and lowers the technical difficulty, making practical applications feasible. Second, it can run on low-cost computing devices, and overall computing power can be raised simply by adding more devices, so the cost of deploying it is very low. Finally, Google had already applied it in its own computing centers and achieved good results, so it was proven in practical applications.

Later, Internet companies everywhere began building their own distributed computing systems following Google's distributed computing model, and Google's three papers became the core technology of the big data era.

Three mainstream distributed computing systems: Hadoop, Spark and Storm

Because Google did not open-source the implementation of its distributed computing model, other Internet companies could only build their own distributed computing systems based on the principles described in Google's three technical papers.

Yahoo engineers Doug Cutting and Mike Cafarella developed the Hadoop distributed computing system in 2005. Hadoop was later contributed to the Apache Foundation and became one of its open source projects, and Doug Cutting went on to become chairman of the Apache Foundation, presiding over Hadoop's development.

Hadoop uses MapReduce as its distributed computing framework, along with HDFS, a distributed file system developed according to GFS, and HBase, a data storage system developed according to BigTable. Although Hadoop is built on the same principles as Google's internal distributed computing system, its computing speed still falls short of the figures reported in Google's papers.
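
Hadoop's Streaming interface lets any program that reads stdin and writes stdout serve as a mapper or reducer, which makes Python a convenient way to sketch a Hadoop job. The word-count script below (its file name and invocation are made up for the example) relies on Hadoop sorting the mapper's output by key before the reducer sees it.

```python
#!/usr/bin/env python3
# Word-count sketch in the style of Hadoop Streaming: run as
# `wordcount.py map` for the map stage or `wordcount.py reduce`
# for the reduce stage. Input to the reducer is sorted by key.
import sys

def mapper():
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")            # emit (word, 1) pairs

def reducer():
    current, total = None, 0
    for line in sys.stdin:                 # arrives sorted by key
        word, count = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = word, 0
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```

Locally the pipeline can be simulated with `cat input.txt | python3 wordcount.py map | sort | python3 wordcount.py reduce`; on a cluster, Hadoop Streaming runs the same two commands as the map and reduce stages of a distributed job.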

However, Hadoop's open source nature has made it the de facto international standard for distributed computing systems. Yahoo, Facebook, Amazon, and Chinese Internet companies such as Baidu and Alibaba have all built their own distributed computing systems based on Hadoop.

Spark, also an Apache Foundation open source project, was developed by a lab at the University of California, Berkeley, and is another important distributed computing system. It makes several improvements on Hadoop's architecture. The biggest difference between Spark and Hadoop is that Hadoop uses hard disks to store working data while Spark uses memory, so Spark can deliver computing speeds up to 100 times those of Hadoop. However, because data in memory is lost when power is cut, Spark is not suited to workloads whose data requires long-term storage.
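
A minimal PySpark sketch of that in-memory model: a word-count RDD is computed once, cached in memory, and then reused without being recomputed from disk. It assumes a local Spark installation, and the input file name is made up.

```python
# Minimal PySpark sketch: build an RDD, cache it in memory, reuse it.
from pyspark import SparkContext

sc = SparkContext("local[*]", "wordcount-sketch")

lines = sc.textFile("input.txt")          # hypothetical input file
words = lines.flatMap(lambda line: line.split())
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)

counts.cache()                            # keep the RDD in memory
print(counts.take(5))                     # first action computes it
print(counts.count())                     # reuses the cached result

sc.stop()
```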

Storm is a distributed computing system promoted chiefly by Twitter. It was developed by the BackType team and became an Apache Foundation incubator project. Storm adds real-time computation on top of what Hadoop provides, processing big data streams as they arrive. Unlike Hadoop and Spark, Storm does not collect and store the data it works on; it accepts data directly over the network in real time, processes it in real time, and returns the results directly over the network in real time.
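
The following is a conceptual sketch of that receive-process-emit loop in plain Python, not Storm's actual API (a real Storm topology is built from spouts and bolts): each event is handled the moment it arrives, running state is updated, and a result is emitted immediately, with nothing written to storage.

```python
# Conceptual stream processing: handle events one at a time as they
# arrive and emit a running result immediately. Data is simulated.
import random
import time
from collections import Counter

def event_stream():
    # Stand-in for a live network feed of events.
    while True:
        yield random.choice(["click", "view", "purchase"])
        time.sleep(0.1)

counts = Counter()
for i, event in enumerate(event_stream()):
    counts[event] += 1                        # update state per event
    print(f"running counts: {dict(counts)}")  # emit result immediately
    if i >= 9:                                # stop the demo after 10 events
        break
```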

Hadoop, Spark, and Storm are the three most important distributed computing systems: Hadoop is commonly used for offline processing of large, complex data sets; Spark for fast offline processing of big data; and Storm for online, real-time processing of big data streams.
