Features of Hadoop, Spark, and Storm


Big Data

Big data, by its common definition, refers to datasets whose volume and variety are so great that traditional databases cannot store, manage, or process them. Its main characteristics are large data volume (Volume), complex data categories (Variety), fast processing speed (Velocity), and high data authenticity (Veracity), collectively known as the 4Vs. The volume involved is enormous, reaching the petabyte level, and it includes not only structured data (numbers, symbols, and so on) but also unstructured data (text, images, audio, video, and so on), which makes it difficult to store, manage, and process with traditional relational databases. The valuable information is often buried deep in the data, so processing has to be fast enough to extract it from a large amount of complex data in a short time. The data also usually mixes false records in with real ones, so processing must filter out the false data and base the analysis, and the results, on the real data.

Big Data Analysis

On the surface, big data is just a large amount of complex data whose raw value is low; only by analyzing and processing it can valuable information be extracted. Big data analysis covers five main areas: visual analysis (Analytic Visualization), data mining algorithms (Data Mining Algorithms), predictive analytics capabilities (Predictive Analytic Capabilities), semantic engines (Semantic Engines), and data quality management (Data Quality Management).

Visual analysis is the form of big data analysis that ordinary consumers see most often; Baidu's "Baidu Map Spring Festival Population Migration Big Data" is a typical example. It automatically converts a large amount of complex data into intuitive charts that ordinary consumers can accept and understand more easily.

Data mining algorithms are the theoretical core of big data analysis. In essence, an algorithm is a set of mathematical formulas defined in advance; the collected data is fed into it as parameters, and valuable information is extracted from the large amount of complex data. The famous "beer and diapers" story is a classic example: by analyzing purchase data, Walmart uncovered a previously unknown link between beer and diaper sales and used it to boost sales. Amazon's recommendation engine and Google's advertising system also make heavy use of data mining algorithms (a minimal association-rule sketch appears below).

Predictive analytics is the most important application area of big data analysis. Rules are mined from a large amount of complex data to build a model of the underlying events; feeding new data into the model then makes it possible to predict future events. Predictive analytics is often used in financial analysis and scientific research, for example in stock forecasting or weather forecasting.
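To make the "beer and diapers" idea concrete, here is a minimal association-rule sketch in Python. The baskets and item names are invented for illustration; real data mining systems work on far larger transaction logs and use dedicated algorithms such as Apriori or FP-Growth.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets; each set is one transaction.
baskets = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"diapers", "milk"},
    {"beer", "chips"},
    {"beer", "diapers", "milk"},
]

n = len(baskets)
item_counts = Counter(item for basket in baskets for item in basket)
pair_counts = Counter(
    frozenset(pair) for basket in baskets for pair in combinations(sorted(basket), 2)
)

# Support: how often a pair occurs together; confidence: P(b bought | a bought).
for pair, count in pair_counts.items():
    support = count / n
    for a, b in (tuple(pair), tuple(pair)[::-1]):
        confidence = count / item_counts[a]
        print(f"{a} -> {b}: support={support:.2f} confidence={confidence:.2f}")
```

Rules with high support and high confidence, such as "diapers -> beer" (support 0.60, confidence 0.75 in this toy data), are the kind of previously unknown link the Walmart story describes.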
A semantic engine is one of the fruits of machine learning. In the past, a computer's understanding of user input stopped at the character level; it could not grasp what the input meant and therefore often failed to understand what the user actually needed. By analyzing a large amount of complex data and letting the computer learn from it, the computer can understand the meaning of the user's input as accurately as possible, grasp the user's needs, and provide a better user experience. Apple's Siri and Google's Google Now both use a semantic engine.

Data quality management is an important application of big data in the enterprise. To guarantee the accuracy of analysis results, false data must be removed and only the most accurate data retained. This requires an effective data quality management system that examines the large amount of complex data collected and selects the records that are real and valid.

Distributed Computing

Computer science offers two major directions for handling big data. The first is centralized computing: keep adding processors to a single computer to increase its computing power and therefore the speed at which it processes data. The second is distributed computing: connect a group of computers over a network into a distributed system, split the data to be processed into multiple parts, have the computers in the system work on those parts at the same time, and finally merge the partial results into the final result (see the sketch below). Although a single computer in such a system is not especially powerful, each machine only handles part of the data and all of them compute simultaneously, so the system as a whole processes data far faster than a single computer could.

In the past, the theory of distributed computing was complicated and the technology was difficult to implement, so centralized computing was the mainstream approach to big data. IBM's mainframe is a typical piece of centralized-computing hardware, used by many banks and government agencies to process large volumes of data. But IBM mainframes were too expensive for the Internet companies of the time, so those companies focused their research on distributed computing that could run on cheap machines.

Server Clusters

A server cluster is a way to raise the overall computing power of servers: a parallel or distributed system made up of interconnected servers. The servers in a cluster run the same computing task, so from the outside the group behaves as a single virtual server providing a unified service. Although the computing power of a single server is limited, a cluster of hundreds or thousands of servers gives the whole system powerful computing capacity, enough to carry the load of big data analysis. The server clusters in the computing centers of Google, Amazon, and Alibaba have reached the scale of 5,000 servers.
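The following is a minimal sketch of the split-compute-merge pattern described under Distributed Computing above, using Python's multiprocessing module so that worker processes on a single machine stand in for the machines of a cluster. The task (summing squares) and the worker count are arbitrary choices for illustration.

```python
from multiprocessing import Pool

def partial_sum(chunk):
    # Each worker computes only over its own slice of the data.
    return sum(x * x for x in chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    n_workers = 4

    # Split the dataset into roughly equal parts, one per worker.
    chunks = [data[i::n_workers] for i in range(n_workers)]

    with Pool(n_workers) as pool:
        partials = pool.map(partial_sum, chunks)  # compute the parts in parallel

    total = sum(partials)                         # merge the partial results
    print(total)
```

Each worker sees only its own chunk, and the partial results are merged at the end; this is the shape of computation that frameworks such as MapReduce automate across many physical machines.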
The Technical Basis of Big Data: MapReduce, the Google File System, and BigTable

Between 2003 and 2004, Google published three technical papers, on MapReduce, GFS (the Google File System), and BigTable, proposing a new approach to distributed computing. MapReduce is a distributed computing framework, GFS is a distributed file system, and BigTable is a data storage system built on top of GFS; together they make up Google's distributed computing model. Compared with traditional distributed computing models, Google's model has three advantages. First, it simplifies traditional distributed computing theory and lowers the difficulty of implementation, so it can be applied in practice. Second, it runs on cheap commodity hardware, and overall computing power can be raised simply by adding machines, so the cost of deployment is very low. Third, Google used it in its own computing centers with very good results, so it came with proof from practical application. Internet companies subsequently built their own distributed computing systems on Google's model, and these three papers became the technical core of the big data era.

The Three Mainstream Distributed Computing Systems: Hadoop, Spark, and Storm

Because Google did not open-source its implementation of the model, other Internet companies could only build their own distributed computing systems from the principles described in the three papers. In 2005, Doug Cutting and Mike Cafarella developed the distributed computing system Hadoop, which was later contributed to the Apache Foundation and became one of its open source projects; Doug Cutting went on to become chairman of the Apache Foundation and presided over Hadoop's development. Hadoop adopts the MapReduce distributed computing framework, implements the HDFS distributed file system based on GFS, and implements the HBase data storage system based on BigTable. Although it follows the same principles as the system Google uses internally, Hadoop still falls short of the performance reported in Google's papers. Its open source nature, however, has made it the de facto international standard for distributed computing systems: Yahoo, Facebook, Amazon, and Chinese Internet companies such as Baidu and Alibaba have all built their own distributed computing systems on Hadoop.

Spark is also an Apache Foundation open source project. Developed by a laboratory at the University of California, Berkeley, it is another important distributed computing system, with architectural improvements over Hadoop. The biggest difference between the two is that Hadoop stores intermediate data on disk while Spark keeps it in memory, so Spark can offer computing speeds up to 100 times faster than Hadoop. Because data held in memory is lost when power is cut, however, Spark is not suited to data that needs to be kept for long periods.
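As a rough illustration of the programming style involved, here is a word count written against Spark's RDD API in PySpark, assuming PySpark is installed and run in local mode; the application name and input lines are made up. The flatMap/map/reduceByKey chain mirrors the map and reduce stages of the MapReduce model described above, and cache() asks Spark to keep the intermediate result in memory, which is where its speed advantage over disk-based Hadoop MapReduce comes from.

```python
from pyspark.sql import SparkSession

# Local-mode session; on a real cluster the master URL would point at the cluster.
spark = SparkSession.builder.master("local[*]").appName("wordcount-sketch").getOrCreate()
sc = spark.sparkContext

lines = sc.parallelize(["to be or not to be", "that is the question"])
counts = (lines.flatMap(lambda line: line.split())   # map stage: break lines into words
               .map(lambda word: (word, 1))          # emit (word, 1) pairs
               .reduceByKey(lambda a, b: a + b)       # reduce stage: sum the counts per word
               .cache())                              # keep the result in memory for reuse

print(counts.collect())
spark.stop()
```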
Storm is a distributed computing system promoted by Twitter. Originally developed by the BackType team, it became an incubation project of the Apache Foundation. It adds real-time computation on top of what Hadoop-style batch processing offers and can process big data streams in real time. Unlike Hadoop and Spark, Storm does not collect and store data first: it receives data over the network in real time, processes it in real time, and returns the results over the network in real time.

Hadoop, Spark, and Storm are currently the three most important distributed computing systems. Hadoop is typically used for offline processing of complex big data, Spark for offline processing where speed matters, and Storm for online, real-time processing of streaming big data.
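To make the batch-versus-stream contrast concrete, here is a purely illustrative stream-processing loop in plain Python. It is not the Storm API, just the shape of the computation: each event is counted the moment it arrives and results are emitted immediately, with nothing written to long-term storage. The event source and page names are invented for the example.

```python
import random
import time
from collections import Counter

def event_stream():
    """Stand-in for a network source: yields one event at a time, indefinitely."""
    pages = ["/home", "/cart", "/checkout"]
    while True:
        yield {"page": random.choice(pages), "ts": time.time()}
        time.sleep(0.1)

# Running counts are updated as each event arrives and emitted immediately,
# rather than after a whole batch has been collected and stored.
counts = Counter()
for i, event in enumerate(event_stream()):
    counts[event["page"]] += 1
    print(f"event {i}: {dict(counts)}")
    if i >= 19:  # stop the demo after 20 events
        break
```

In a real Storm topology, spouts play the role of the event source and bolts play the role of the counting step, distributed across many machines.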

