Five core technologies of big data: a little introductory reference for big data researchers

In the 21st century the world has entered an era of data explosion: the Big Data era has arrived. From the management and operational data generated inside companies, to the personal and social data produced by mobile devices and consumer electronics, to the vast amounts of information generated on the Internet, the volume of information created worldwide every day is growing rapidly. In 2009 the volume of information reached 800 billion GB, and in 2011 it reached 1.8 ZB. Turing Award winner Jim Gray's "new Moore's Law", that "every 18 months, the world's new information equals the sum of all information in the history of computing", has been borne out.

Big Data "big" is not only reflected in the mass of data, the data type is still in its complexity. With reports, billing, images, office documents and other widely used in commercial companies, Internet video, music, online gaming continues to develop, more and more unstructured data further promote the digital universe explosion. Massive and complex data, which is the interpretation of large data. Compared with traditional data, big data has the scale (Volume), diversity (Variety), high speed (Velocity) and low density value (Value) of 4V features. Scale and high speed data processing has been studied and discussed issues of low value diversity and density of current data processing development continue to appear out of the question, but in the foreseeable future, with the wisdom of the city, the wisdom of the Earth and other kinds of new ideas continue to become a reality, the above 4 issues will become more prominent, but also had to face the problem.

Data generation has passed through three stages: passive, active, and automatic. The rapid development of big data is the inevitable result of the exponential growth in computing power and in the number of deployed digital devices in the information age. To solve the research problems of big data, research must start from the background in which big data is generated. The data is large in scale, and this scale brings great challenges to storing, managing, and analyzing it; changes in data management are brewing. Large-scale data demands that storage and computation schemes also be designed with scale in mind. Relying solely on the vertical growth of a single device's processing power can no longer satisfy big data's storage and processing requirements. Companies representative of large-scale data processing, Google above all, have solved the problems caused by the data explosion through horizontal scaling: distributed file storage, distributed data processing, and distributed data analysis.

Big Data is the direction of future development. It challenges our analytical abilities and the way we understand the world, so we must advance with the times, embrace change, and keep studying and building up our big data fundamentals.

 

1 Key technologies of big data

1.1 Big data system architecture


However complex the architecture of a big data processing system, and however widely the technologies used may vary, it can in general always be divided into several important parts.

As can be seen from the general flow of data processing, the key technologies required in the big data environment are computation over massive data and storage of massive data. After nearly 40 years of development, the traditional relational database has become a mature and still evolving data management and analysis technology; the Structured Query Language (SQL) has been standardized as the access language for relational databases, and its functionality and expressive power keep growing. However, the scalability of relational database management systems faces unprecedented obstacles in the Internet environment and cannot meet the requirements of big data analysis. The relational data management model pursues a high degree of consistency and accuracy. Scaling such a system vertically, that is, expanding a single node by adding or replacing CPUs, memory, and disks, eventually hits a bottleneck.

Big data mainly comes from large companies that rely on data for commercial gain. Google, as the world's largest information retrieval company, is at the forefront of big data research. Facing the explosive growth of Internet information, relying solely on improving server performance was far from meeting business needs. If we regard the various big data applications as "cars", the "highway" that carries these "cars" is cloud computing. It is the support of cloud computing for data storage, management, and analysis that makes big data useful. Google scaled out horizontally: by using clusters of inexpensive commodity machines and rewriting its software to run in parallel on those clusters, it solved the storage and retrieval of massive data. In 2006 Google first proposed the concept of cloud computing. What underpins Google's various big data applications is the series of cloud computing technologies and tools it developed itself. Google's three key technologies for big data processing are the Google File System GFS [4], MapReduce [5], and Bigtable [6]. Google's technical solutions provided a good reference for other companies, and major companies have since put forward their own big data processing platforms using broadly similar technologies. The following sections introduce the key technologies of big data systems from the aspects of the distributed file systems needed to support big data, distributed data processing technology, distributed database systems, and the open-source big data system Hadoop.

1.2 Distributed file systems


The file system is the foundation on which big data applications are built. Google was the earliest large company that needed to process such huge volumes of data. For Google, existing solutions could no longer store data at that scale, so Google proposed a distributed file management system, GFS.

GFS shares many goals with conventional distributed file systems, such as performance, scalability, reliability, and availability. GFS's success, however, lies in where it departs from traditional file systems. Its design rests on the following assumption: in a system of this kind, component failure is the norm rather than the exception. GFS is a scalable distributed file system built on large numbers of inexpensive servers and organized in a master-slave configuration. Through data chunking, incremental updates, and similar techniques it achieves efficient storage of massive data; the GFS architecture is shown in the accompanying figure. However, as business volumes changed further, GFS became increasingly unable to meet demand, and on the basis of GFS Google designed and implemented the Colossus system, which addresses GFS's single point of failure and the storage of massive numbers of small files.
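As a rough illustration of the chunk-based design, the sketch below splits a hypothetical large file into fixed-size chunks and records, on a "master", only which chunkserver holds which chunk. This is a conceptual sketch only, not Google's implementation; the 64 MB chunk size follows the GFS paper, while the file name, server names, and round-robin placement are assumptions made purely for illustration (real GFS also replicates every chunk).

```java
import java.util.*;

/** Conceptual sketch of GFS-style chunking: a "master" keeps only metadata
 *  (which chunkserver holds which chunk); file contents live on chunkservers. */
public class ChunkingSketch {
    static final long CHUNK_SIZE = 64L * 1024 * 1024;   // 64 MB, as in the GFS paper

    public static void main(String[] args) {
        long fileSize = 500L * 1024 * 1024;              // a hypothetical 500 MB file
        List<String> chunkServers = Arrays.asList("cs-01", "cs-02", "cs-03");

        // Master metadata: chunk handle -> chunkserver holding it
        Map<String, String> chunkLocation = new LinkedHashMap<>();
        int chunkCount = (int) ((fileSize + CHUNK_SIZE - 1) / CHUNK_SIZE);
        for (int i = 0; i < chunkCount; i++) {
            String handle = "bigfile.dat#chunk-" + i;
            // Simple round-robin placement; a real system also replicates each chunk
            chunkLocation.put(handle, chunkServers.get(i % chunkServers.size()));
        }
        chunkLocation.forEach((handle, server) ->
                System.out.println(handle + " -> " + server));
    }
}
```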

Besides Google's GFS, many companies and scholars have studied file systems that meet big data storage needs from different angles. Microsoft developed Cosmos to support its search and advertising business. HDFS, FastDFS, OpenAFS, and CloudStore are similar open-source implementations of GFS. GFS-like distributed file systems are designed primarily for large files, but in scenarios such as image storage the file system mainly stores massive numbers of small files; for this, Facebook launched Haystack, a file system specialized for massive small files, which effectively solves massive small-file storage by letting multiple logical files share the same physical file, adding a cache layer, loading part of the metadata into memory, and so on. Lustre is a large-scale, secure, and highly reliable cluster file system developed and maintained by Sun. The main goal of the project is to develop a next-generation cluster file system that can support more than 10,000 nodes and storage systems of tens of petabytes.


1.3 Distributed data processing system

Big data processing modes fall into two kinds: stream processing and batch processing. Stream processing processes the data directly as it flows in; batch processing stores the data first and processes it afterwards.

Stream processing treats data as a stream: continuously arriving data forms a data stream, and as soon as new data arrives it is processed immediately and the required result is returned. Real-time processing of big data is an extremely challenging task, because the data is both large-scale and continuously arriving. Therefore, if big data must be processed in real time, a distributed approach is inevitably required; in that case, besides the consistency problems of distributed systems, the impact of network latency in a distributed system must also be considered, all of which adds to the complexity of big data stream processing. The most representative open-source stream processing systems at present include Twitter's Storm, Yahoo's S4, and LinkedIn's Kafka.
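To make the "process each record as soon as it arrives" idea concrete, here is a minimal conceptual sketch of stream processing that keeps a running count continuously up to date. It is not the API of Storm, S4, or Kafka; the event names and the simulated input array are assumptions used only for illustration.

```java
import java.util.*;

/** Conceptual sketch of stream processing: each arriving event is handled
 *  immediately and a running aggregate is kept up to date. */
public class StreamCountSketch {
    private final Map<String, Long> counts = new HashMap<>();

    /** Called once per arriving record; no batching, no prior storage. */
    public void onEvent(String word) {
        counts.merge(word, 1L, Long::sum);
        System.out.println("count(" + word + ") = " + counts.get(word));
    }

    public static void main(String[] args) {
        StreamCountSketch processor = new StreamCountSketch();
        // Simulated unbounded stream; in a real system these records arrive over the network
        String[] incoming = {"error", "ok", "error", "ok", "ok"};
        for (String event : incoming) {
            processor.onEvent(event);   // immediate processing, result available at once
        }
    }
}
```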

The MapReduce programming model, proposed by Google in 2004, is the most representative batch processing model. Programs written for the MapReduce architecture can be parallelized across large numbers of ordinary commodity machines. At runtime the system itself takes care of how the input data is split, how work is scheduled across the cluster, how machine failures in the cluster are handled, and how the necessary communication between machines in the cluster is managed.

For some computations the input data is so large that the only way to finish within an acceptable time is to distribute the computation across hundreds or thousands of hosts. In this computing mode, handling parallelization, distributing the data, and dealing with failures requires a great deal of code, which turns originally simple computations into something hard to manage. MapReduce is a new design model aimed precisely at these problems.

The main contribution of the MapReduce model is that it achieves automatic parallelization and large-scale distributed computation through a simple interface; using the MapReduce interface, high-performance computing can be obtained on large numbers of ordinary PCs.

The principle of the MapReduce programming model: a set of input key/value pairs is used to produce a set of output key/value pairs. A user of the MapReduce library expresses the computation with two functions, Map and Reduce. The user-defined Map function takes an input key/value pair and produces a set of intermediate key/value pairs. The MapReduce library gathers together all values associated with the same intermediate key and passes them to the Reduce function. The user-defined Reduce function receives an intermediate key and the associated set of values, and merges these values to form a smaller set of values.
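To make the Map and Reduce functions concrete, below is the classic word-count example written against the open-source Hadoop MapReduce API (introduced in section 1.5). Google's own implementation is not public, so this sketch should be read as the usual open-source analogue rather than Google's code: Map emits an intermediate (word, 1) pair for every word, and Reduce sums the values the library has grouped under the same key; input and output paths are taken from the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    /** Map: for each input line, emit an intermediate (word, 1) pair per word. */
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    /** Reduce: the library has grouped all values for one word; sum them. */
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation on each node
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Note that the program says nothing about how the input is split, how tasks are scheduled, or how failures are recovered; exactly as described above, the framework handles those concerns.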

When MapReduce was first proposed it drew a series of criticisms. Database expert Michael Stonebraker argued that MapReduce was a huge step backwards, pointing out problems such as its lack of access optimization and its reliance on brute force for data processing. But as MapReduce kept succeeding in practice, the big data processing technologies it represents have nonetheless received wide attention. Researchers have also studied MapReduce in depth; current work on improving MapReduce performance falls mainly into the following areas: performance improvements on multi-core hardware and GPUs; optimization of indexing and join techniques; and scheduling optimization. On the usability side, researchers are developing higher-level languages and systems with stronger expressive power, including Yahoo's Pig, Microsoft's LINQ, and Hive.

Besides Google's MapReduce, Yunhong Gu et al. designed and implemented the Sector and Sphere cloud computing platform [18], which consists of two parts, Sector and Sphere. Sector is a distributed system deployed over a wide-area network, and Sphere is a computation service built on Sector. Sphere is a compute cloud constructed on top of Sector that provides distributed processing of large-scale data. Sphere's basic data processing model is shown in Figure 4.

Different applications produce different data, and Sphere uniformly takes them in as data streams. To enable large-scale parallel computation, the data is first partitioned, and the partitioned data is handed to SPEs for execution. The SPE (Sphere Processing Engine) is Sphere's basic computation unit. Besides processing data, SPEs also play a load-balancing role: since the volume of data is generally far larger than the number of SPEs, an SPE that is currently heavily loaded takes on less further data, and a lightly loaded one takes on more, which balances load across the system.


1.4 Distributed database systems

Traditional distributed databases based on the relational model have difficulty meeting the requirements of the big data era, mainly for the following reasons:

(1) The pressure of scale. Data in the big data era far exceeds the processing capacity of a single machine, so distributed technology is the inevitable choice. Traditional databases tend to scale vertically, and under that approach performance grows far more slowly than the data does. A database system for big data should instead scale horizontally, which offers better scalability.

(2) Diversity of data types and low value density. Traditional databases suit data with a clear structure and a definite application purpose, whose value density is relatively high. In the big data era data exists in many forms, and semi-structured and unstructured data of all kinds make up an important part of big data. How to exploit such diverse, massive, low-value-density data is one of the major challenges facing databases in the big data era.

(3) A clash of design philosophies. Relational databases pursue "one size fits all", but in the big data era different application domains differ enormously in data characteristics, data processing methods, and processing-time requirements. In practice there cannot be one uniform data storage approach that fits every scenario.

Facing these challenges, Google proposed Bigtable as its solution. Bigtable was designed to handle petabyte-scale data reliably and to be deployable across thousands of machines. Bigtable has achieved several goals: wide applicability, scalability, high performance, and high availability. It has been used in more than 60 Google products and projects, which place very different demands on performance and cluster configuration, and Bigtable satisfies them all well. Bigtable does not support a full relational data model; it provides users with a simple data model through which clients can dynamically control the layout and format of the data. Users can also reason themselves about the locality of the underlying stored data. Data is indexed by row and column names, which can be arbitrary strings. Bigtable treats all stored data as strings but does not interpret those strings itself; client programs usually serialize various structured or semi-structured data into them. By choosing the data schema carefully, clients can control data locality. Finally, Bigtable schema parameters let clients control whether data is kept in memory or on disk. The Bigtable data model is shown in Figure 5, which gives an example of Bigtable storing a large amount of web page information.
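The data model just described, a sparse, sorted, multi-dimensional map from (row, column, timestamp) to an uninterpreted string, can be sketched in a few lines. The sketch below is a conceptual illustration only, not Bigtable's implementation; the row and column names ("com.cnn.www", "contents:", "anchor:cnnsi.com") follow the web-page example of the Bigtable paper.

```java
import java.util.*;

/** Conceptual sketch of the Bigtable data model:
 *  row -> column -> timestamp -> value, with every value an uninterpreted string. */
public class BigtableModelSketch {
    // TreeMap keeps rows (and columns, and timestamps) sorted, as Bigtable sorts by row key
    private final TreeMap<String, TreeMap<String, TreeMap<Long, String>>> table = new TreeMap<>();

    public void put(String row, String column, long timestamp, String value) {
        table.computeIfAbsent(row, r -> new TreeMap<>())
             .computeIfAbsent(column, c -> new TreeMap<>())
             .put(timestamp, value);
    }

    /** Read the most recent version of a cell, or null if the cell is absent. */
    public String getLatest(String row, String column) {
        TreeMap<String, TreeMap<Long, String>> columns = table.get(row);
        if (columns == null) return null;
        TreeMap<Long, String> versions = columns.get(column);
        return (versions == null || versions.isEmpty()) ? null : versions.lastEntry().getValue();
    }

    public static void main(String[] args) {
        BigtableModelSketch webtable = new BigtableModelSketch();
        // Row key is the (reversed) URL; the stored value is just a string (here, HTML)
        webtable.put("com.cnn.www", "contents:", 1L, "<html>old version</html>");
        webtable.put("com.cnn.www", "contents:", 2L, "<html>new version</html>");
        webtable.put("com.cnn.www", "anchor:cnnsi.com", 1L, "CNN");
        System.out.println(webtable.getLatest("com.cnn.www", "contents:"));
    }
}
```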

Besides Google's well-known Bigtable, other large Internet content providers have also put forward big data systems. Representative systems include Amazon's Dynamo [19] and Yahoo's PNUTS [20]. Dynamo combines key/value storage, an improved distributed hash table (DHT), vector clocks, and other techniques to build a fully distributed, decentralized, highly available system. PNUTS is a distributed database system that by design uses weak consistency to achieve high availability; it mainly serves relatively small records, such as high-volume online reads and writes of single records or small ranges of records, and it is not suited to storing large files or streaming media.
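Of the techniques Dynamo combines, the vector clock is simple enough to sketch: each node keeps one counter per node, increments its own counter when it handles a write, and two versions of a value are compared entry by entry to decide whether one descends from the other or whether they conflict. The sketch below is a generic, hypothetical implementation for illustration; it is not Dynamo's code.

```java
import java.util.HashMap;
import java.util.Map;

/** Minimal vector clock: one counter per node; Dynamo-style systems use it
 *  to detect whether two versions of a value are ordered or in conflict. */
public class VectorClock {
    private final Map<String, Long> counters = new HashMap<>();

    /** Record a local event (e.g. a write handled by this node). */
    public void increment(String nodeId) {
        counters.merge(nodeId, 1L, Long::sum);
    }

    /** True if this clock is <= the other for every node, i.e. the other descends from this one. */
    public boolean happenedBefore(VectorClock other) {
        for (Map.Entry<String, Long> e : counters.entrySet()) {
            if (e.getValue() > other.counters.getOrDefault(e.getKey(), 0L)) {
                return false;
            }
        }
        return true;
    }

    /** Two versions conflict when neither descends from the other. */
    public boolean conflictsWith(VectorClock other) {
        return !this.happenedBefore(other) && !other.happenedBefore(this);
    }

    public static void main(String[] args) {
        VectorClock a = new VectorClock();
        VectorClock b = new VectorClock();
        a.increment("node-A");      // write handled by node A
        b.increment("node-B");      // concurrent write handled by node B
        System.out.println("conflict: " + a.conflictsWith(b));   // true: concurrent versions
    }
}
```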

The success of Bigtable, Dynamo, PNUTS, and similar technologies prompted researchers to rethink the relational database and produced a batch of databases that abandon the relational model; these solutions are collectively known as NoSQL (not only SQL). NoSQL databases share the following characteristics: schema-free design, support for easy replication, a simple application programming interface, eventual consistency, and support for massive data. Typical non-relational databases currently fall into the following categories.


1.5 An open-source implementation platform for big data systems

Hadoop

Besides commercial big data processing solutions, a number of open-source projects are also actively joining big data research. Hadoop [22] is an open-source distributed computing platform and the vehicle of the MapReduce computing model. With Hadoop, software developers can easily write distributed parallel programs and perform computations over massive data on clusters of machines. Intel has published an open-source Hadoop deployment, shown in Figure 6.

HDFS is a distributed file system similar to GFS; it can be built from clusters of hundreds to thousands of ordinary servers and provides high aggregate input/output bandwidth for file reads and writes. HBase [23] is a Bigtable-like distributed, column-oriented real-time database with a multi-dimensional table structure; it provides highly concurrent read and write operations over large amounts of structured and unstructured data. Hive [24] is a distributed data warehouse engine for big data built on Hadoop; it can store data in distributed file systems or distributed databases and use an SQL-like language to run statistics, queries, and analyses over vast amounts of information. ZooKeeper [25] is a reliable coordination system for large-scale distributed systems, offering configuration maintenance, naming services, distributed synchronization, and group services; it can maintain system configuration, user and group names, and similar information. Sqoop [26] is a connector component that provides efficient bidirectional data transfer between Hadoop and structured data sources; it turns a transfer task into distributed Map tasks and can also transform the data during transfer. Flume [27] is a distributed, highly reliable, highly available log collection system used to collect, aggregate, and move large amounts of log data from different sources into a centralized data store.
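As a small usage sketch of the components just listed, the program below writes a file to HDFS and reads it back through Hadoop's FileSystem API. The path /tmp/bigdata-demo.txt is a hypothetical example, and the program assumes a reachable HDFS whose configuration (core-site.xml) is on the classpath.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWriteDemo {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS from core-site.xml on the classpath
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path path = new Path("/tmp/bigdata-demo.txt");   // hypothetical demo path

        // Write: HDFS splits the file into blocks and replicates them transparently
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read the file back
        try (FSDataInputStream in = fs.open(path);
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(in, StandardCharsets.UTF_8))) {
            System.out.println(reader.readLine());
        }

        fs.close();
    }
}
```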

This article has introduced the current worldwide progress in big data technology, mainly from the perspectives of background, the demands of big data, and system architecture. As the analysis shows, big data system solutions will ultimately rest on existing cloud computing platforms: the distributed file systems, distributed computing models, and distributed database management technologies of cloud computing provide both the ideas and ready-made platforms for solving big data problems.

The analysis also shows that big data research is necessarily driven by commercial interest. Large companies that rely on big data for profit are bound to produce a large body of big data applications, and big data will become a key research area.

Reproduced from: https://www.cnblogs.com/wuxiaoxia888/p/11002908.html
