A Study of the Big Data Technology Infrastructure of Foreign Internet Companies

Exploring Google's Big Data Technical Architecture

One, Google

Google is a founder of the big data era. Its big data technology architecture has long been an object of keen study by other Internet companies, and it serves as the industry's benchmark and model for big data architecture.

1, Google's data centers

Google has built some of the world's fastest, most powerful, highest-quality data centers. Its eight major data centers all sit away from its Mountain View, California headquarters: six in the United States, in Berkeley County, South Carolina; Council Bluffs, Iowa; Douglas County, Georgia; Mayes County, Oklahoma; Lenoir, North Carolina; and The Dalles, Oregon; and two outside the United States, in Hamina, Finland and St. Ghislain, Belgium. In addition, Google has established data centers in Taiwan, Hong Kong, Singapore, and Chile.

2, The new-generation search platform and its core big data analysis technologies

Google created GFS, MapReduce, and BigTable, but its new-generation search engine platform is gradually replacing these older systems with more capable ones. The next-generation search platform rests on several core technology systems:

First, the incremental indexing system Percolator replaces the MapReduce-based batch indexing system. The new indexing system built on it is called Caffeine, and it builds the search index faster than the old MapReduce batch pipeline (a conceptual batch-versus-incremental sketch appears after this list).


Second, Colossus, the distributed storage system designed for BigTable and also known as GFS2 (the second-generation Google File System); it was built specifically to support the Caffeine search indexing system.

Third, the column-oriented database BigTable remains, but to better support interactive analysis of large data sets, Google introduced Dremel and PowerDrill. Dremel is designed to manage very many very large data sets (large both in the size of each data set and in their number), while PowerDrill is designed to deliver even faster analysis over a smaller number of large data sets (the second sketch after this list illustrates the columnar idea behind such engines).

Fourth, the real-time storage and analysis infrastructure that serves the Google Instant search engine.

Fifth, Pregel, Google's system for efficient graph and network algorithms.
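To make the batch-versus-incremental contrast in the first point concrete, here is a toy Python sketch. It is a conceptual illustration only, not Google's actual API; the documents are invented.

```python
# Toy sketch (not Google's API): batch reindexing rebuilds everything,
# while Percolator-style incremental updates fold in one document at a time.

documents = {"d1": "big data systems", "d2": "google percolator paper"}

def batch_reindex(docs):
    """Batch mode: discard the old index and rebuild it from every document."""
    index = {}
    for doc_id, text in docs.items():
        for word in text.split():
            index.setdefault(word, set()).add(doc_id)
    return index

def incremental_update(index, doc_id, text):
    """Incremental mode: fold one newly crawled document into the live index."""
    for word in text.split():
        index.setdefault(word, set()).add(doc_id)

index = batch_reindex(documents)                 # expensive: touches the whole corpus
incremental_update(index, "d3", "new page about big data")  # cheap: touches one document
print(sorted(index["big"]))                      # ['d1', 'd3']
```

And for the third point, a small sketch of why a column-oriented layout speeds up the scans that engines like Dremel and PowerDrill perform: an aggregate over one field touches only that field's array, not whole rows. The records here are hypothetical.

```python
# Conceptual sketch: the same records in row-oriented and column-oriented form.
rows = [
    {"user": "u1", "country": "US", "latency_ms": 120},
    {"user": "u2", "country": "FI", "latency_ms": 95},
    {"user": "u3", "country": "US", "latency_ms": 210},
]
columns = {
    "user": ["u1", "u2", "u3"],
    "country": ["US", "FI", "US"],
    "latency_ms": [120, 95, 210],
}

# Row store: every whole record is walked even though one field is needed.
avg_row = sum(r["latency_ms"] for r in rows) / len(rows)

# Column store: exactly one compact array is scanned -- far less I/O at scale.
col = columns["latency_ms"]
avg_col = sum(col) / len(col)

assert avg_row == avg_col
print(avg_col)  # 141.66...
```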

On this new-generation search platform, Google serves 4 billion hours of video per month and 425 million Gmail users, and maintains a 150,000,000 GB web index, yet still returns search results in 0.25 seconds.

3, The foundation of Google's cloud services

Based on Colossus, Google provides users with compute, storage, and application cloud services. Compute services include Compute Engine and App Engine; storage services include Cloud Storage, Cloud SQL, Cloud Datastore, Persistent Disk, and other services; cloud application services include BigQuery, Cloud Endpoints, caching, and queueing.
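As a concrete taste of these services, here is a minimal sketch of uploading an object with the google-cloud-storage Python client. The bucket and file names are hypothetical placeholders, and credentials are assumed to be configured in the environment.

```python
from google.cloud import storage  # pip install google-cloud-storage

# Minimal sketch: upload a local file to a Cloud Storage bucket.
client = storage.Client()                  # uses application-default credentials
bucket = client.bucket("my-demo-bucket")   # hypothetical bucket name
blob = bucket.blob("reports/daily.csv")    # object path inside the bucket
blob.upload_from_filename("daily.csv")     # local file, assumed to exist
print("uploaded to", blob.public_url)
```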

4, Google's intelligent big data application services

Google's intelligent big data applications cover customer sentiment analysis, transaction risk (fraud) analysis, product recommendation, message routing, diagnostics, customer churn prediction, legal document classification, e-mail content filtering, political orientation prediction, species identification, and many other areas. Allegedly, big data brings Google $23 million in revenue every day.

For example, some typical applications include:

(1) Based on MapReduce, Google's traditional applications include data storage, data analysis, log analysis, search quality, and other data analysis applications.

(2) Based on the Dremel system, Google launched its powerful data analysis software and service, BigQuery, which Google also uses internally as part of its own search service. Google has begun selling this online data analysis service, trying to compete with cloud computing offerings such as Amazon Web Services; the service can help business users scan terabytes of data within seconds (see the first sketch after this list).

(3) Based on statistical algorithms, Google launched spelling correction for search queries, statistical machine translation, and other services (see the second sketch after this list).

(4) Google Trends applications. By tracking which terms users search for, one can quickly learn what society is paying attention to. For advertisers, the commercial value is knowing quickly what users care about and where ads should be placed. Accordingly, Google has developed a number of big data products, such as "Brand Lift in AdWords" and "Active GRP", to help advertisers analyze and evaluate the effectiveness of their campaigns.

(5) Google Instant. As the user types a query, Google Instant predicts and displays likely search results on the fly (see the third sketch after this list).
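For item (2), a minimal sketch of querying BigQuery from the Python client library. It runs against a real public sample dataset, but the aggregation is just an example, and project credentials are assumed to come from the environment.

```python
from google.cloud import bigquery  # pip install google-cloud-bigquery

client = bigquery.Client()  # project and credentials come from the environment

# Scan a public sample table and aggregate it server-side.
query = """
    SELECT name, SUM(number) AS total
    FROM `bigquery-public-data.usa_names.usa_1910_2013`
    GROUP BY name
    ORDER BY total DESC
    LIMIT 5
"""
for row in client.query(query).result():  # blocks until the job finishes
    print(row.name, row.total)
```

For item (3), a classic statistical spelling corrector in the style popularized by Peter Norvig: pick the most frequent dictionary word within one edit of the input. This illustrates the statistical idea only, not Google's production algorithm, and the corpus path is hypothetical.

```python
import re
from collections import Counter

# Build a word-frequency model from any large text corpus (path is hypothetical).
WORDS = Counter(re.findall(r"[a-z]+", open("corpus.txt").read().lower()))

def edits1(word):
    """All strings one edit (delete, swap, replace, insert) away from word."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    swaps = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in letters]
    inserts = [l + c + r for l, r in splits for c in letters]
    return set(deletes + swaps + replaces + inserts)

def correct(word):
    """Pick the most frequent known candidate; fall back to the word itself."""
    candidates = ({word} & WORDS.keys()) or (edits1(word) & WORDS.keys()) or {word}
    return max(candidates, key=WORDS.get)
```

For item (5), a toy type-ahead sketch: keep queries sorted and binary-search for the typed prefix. Google Instant's real system is vastly more sophisticated; this only illustrates prefix prediction.

```python
import bisect

class PrefixIndex:
    """Toy type-ahead index: suggest completions for a typed prefix."""
    def __init__(self, queries):
        self.sorted_queries = sorted(queries)

    def suggest(self, prefix, k=3):
        # Binary-search to the first query >= prefix, then take the matches.
        i = bisect.bisect_left(self.sorted_queries, prefix)
        out = []
        while i < len(self.sorted_queries) and self.sorted_queries[i].startswith(prefix):
            out.append(self.sorted_queries[i])
            if len(out) == k:
                break
            i += 1
        return out

index = PrefixIndex(["big data", "bigtable", "bigquery", "google instant"])
print(index.suggest("big"))  # ['big data', 'bigquery', 'bigtable']
```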

Google's big data platform architecture is still evolving; its goal is to handle ever larger data sets with faster, more accurate analysis and computation. It will continue to lead the direction of big data technology development.


Two, Yahoo

Hadoop is the most popular big data technology architecture, and many big data applications are built on top of the Hadoop platform. Many people know that Hadoop is a top-level open source project of the Apache Foundation, but not everyone knows that about 70% of the contributions to Hadoop's evolution came from Yahoo. Yahoo is the largest user of the Hadoop platform, an important promoter and the strongest supporter of its application and commercialization. Hadoop has long been the core of Yahoo's cloud computing platform: Yahoo's largest single Hadoop cluster consists of 4,000 nodes, and its recommendation system, advertising analysis, and other applications are all built on the Hadoop distributed computing platform. Through its developer forums, Yahoo trains a large number of professional engineers who master the Hadoop platform every year, and Hortonworks, the company Yahoo spun off to develop and invest in Hadoop technology, is currently one of the fastest-growing commercial Hadoop companies. Yahoo has not rested complacently on these achievements; instead it is actively promoting Hadoop 2.0, its new generation of big data technology infrastructure.

Yahoo's new-generation big data technology architecture consists of the following components:

1, The core: YARN

YARN, also called MapReduce 2.0, is the core of the new technology architecture and can be seen as the operating system of Yahoo's next-generation big data platform. To resolve the performance bottleneck of Hadoop 1.0, YARN separates the two main functions that MapReduce's single JobTracker had combined, resource management and job scheduling/monitoring, chiefly by creating a global ResourceManager (RM) plus a per-application ApplicationMaster (AM). With this change, YARN's scalability is greatly improved (supporting MapReduce on clusters of 10,000+ machines), performance improves, and computing frameworks other than Hadoop MapReduce, such as low-latency stream computing frameworks, are supported as well.
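A toy Python sketch of this separation (conceptual only, not YARN's real API): one global ResourceManager hands out containers, while each application brings its own ApplicationMaster that schedules work into them.

```python
class ResourceManager:
    """Global arbiter: tracks free containers, knows nothing about job logic."""
    def __init__(self, total_containers):
        self.free = total_containers

    def allocate(self, n):
        granted = min(n, self.free)
        self.free -= granted
        return granted

class ApplicationMaster:
    """Per-application scheduler: asks the RM for containers, runs its tasks."""
    def __init__(self, name, tasks):
        self.name, self.tasks = name, tasks

    def run(self, rm):
        granted = rm.allocate(len(self.tasks))
        for task in self.tasks[:granted]:
            print(f"[{self.name}] running {task} in its own container")

rm = ResourceManager(total_containers=4)
ApplicationMaster("mapreduce-job", ["map-0", "map-1", "reduce-0"]).run(rm)
ApplicationMaster("storm-topology", ["spout-0", "bolt-0"]).run(rm)  # a non-MapReduce app shares the same cluster
```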

Computing and processing frameworks. In addition to supporting Hadoop batch processing, the architecture integrates peer computing frameworks such as Spark and Storm: Hadoop is used for offline data analysis, Spark for multi-iteration batch analysis, and Storm for real-time analysis and prediction over streaming data. With YARN, Yahoo's big data platform unifies offline, near-line, and real-time processing.

2, Storm

Storm was originally Twitter's stream computing tool. In practicing its next-generation computing architecture, Yahoo integrated Storm with YARN as Storm-YARN to support real-time stream computing. The Storm framework is designed for real-time computation over streaming data: it analyzes the moving, changing data in flight, captures information that may be useful to users, and sends out results quickly. For example, to support personalized search advertising, the real-time processing system needs to handle tens of millions of queries per second from several million unique users, analyzing user session features in real time to improve the accuracy of ad relevance and prediction models.
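A toy Python sketch of the spout/bolt stream model that Storm popularized (conceptual only; real Storm topologies are written against Storm's Java/Clojure API): a spout emits query tuples, and a stateful bolt maintains per-user session counts as the stream flows.

```python
import random

def query_spout(n=10):
    """Source of the stream: emits (user, query) tuples as they 'arrive'."""
    users, queries = ["u1", "u2", "u3"], ["shoes", "flights", "news"]
    for _ in range(n):
        yield random.choice(users), random.choice(queries)

def session_bolt(stream):
    """Stateful operator: counts queries per user while the stream flows."""
    counts = {}
    for user, query in stream:
        counts[user] = counts.get(user, 0) + 1
        yield user, query, counts[user]

for user, query, seen in session_bolt(query_spout()):
    print(f"{user} searched '{query}' ({seen} queries this session)")
```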

3, Spark

Spark originated at the AMPLab at the University of California, Berkeley, as a cluster computing platform and has formally applied to join the Apache Incubator; Yahoo's next-generation architecture integrates it with YARN. Spark is based on in-memory computing and accommodates multiple computing paradigms, including multi-iteration batch processing, data warehousing, and stream processing, making it lightweight and fast. Built on the Scala language, Spark is a much lighter-weight system than Hadoop, yet also very fast, achieving sub-second latency on small data sets. For typical large-data-set workloads, such as iterative machine learning, ad hoc queries, and graph computing, Spark-based implementations run ten to a hundred times faster than MapReduce, Hive, and Pregel.
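A minimal PySpark sketch of the in-memory, iterative style described above; the job is a trivial placeholder, and a local PySpark installation is assumed.

```python
from pyspark.sql import SparkSession  # pip install pyspark

spark = SparkSession.builder.appName("iterative-sketch").getOrCreate()
sc = spark.sparkContext

data = sc.parallelize(range(1, 1001)).cache()  # cache() pins the data in memory

# Each pass reuses the cached partitions instead of re-reading from disk,
# which is where Spark's speedup on iterative workloads comes from.
for i in range(5):
    total = data.map(lambda x: x * 2).sum()
    print(f"iteration {i}: total = {total}")

spark.stop()
```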

4, The storage layer

The underlying storage is still based on the Hadoop HDFS file system and the NoSQL database HBase.

Although the next-generation architecture centered on YARN still has many aspects in need of improvement, its strategic position at Yahoo has basically been established. Yahoo's customary open-source strategy will, in turn, be a boon to the big data industry.

Three, Amazon

Big data analysis typically relies on extensive distributed computing infrastructure: distributed computing frameworks and storage systems. But not every user is in a position to build such infrastructure. The contradiction between huge market demand and users' limited computing resources has become increasingly prominent, and in this context big data cloud services came into being. Internet companies such as Amazon and Google have set their sights on the big data cloud services market and have launched paid big data analysis web services for users.

1, Amazon Elastic MapReduce (EMR)

Amazon Elastic MapReduce (EMR) is the big data analysis cloud service provided by Amazon. It is a commercial Hadoop infrastructure service: based on the distributed computing capability it provides, businesses, researchers, data analysts, and developers can easily process and analyze large amounts of data according to their needs. Customers use it to analyze massive data by submitting analysis jobs to Hadoop clusters of virtual servers running in the Amazon cloud. Since 2009, thousands of customers worldwide have used Amazon EMR to launch millions of clusters. Open source projects that run on top of the Hadoop framework, such as Hive, Pig, HBase, DistCp, Ganglia, Mahout, and R, are integrated with Amazon EMR. With Amazon EMR, users can immediately and flexibly provision whatever capacity they need to run data-intensive computing applications: web indexing, data mining, log file analysis, machine learning, financial analysis, scientific simulation, bioinformatics research, and other tasks.
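A hedged sketch of launching an EMR cluster with the boto3 SDK; the names, region, release label, and instance sizing are hypothetical placeholders, and the default IAM roles are assumed to exist.

```python
import boto3  # pip install boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="log-analysis-demo",
    ReleaseLabel="emr-6.15.0",           # EMR release bundling Hadoop/Hive
    Applications=[{"Name": "Hadoop"}, {"Name": "Hive"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,              # 1 master + 2 core nodes
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    LogUri="s3://my-demo-bucket/emr-logs/",  # hypothetical bucket
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print("Started cluster:", response["JobFlowId"])
```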

(Figure: big data analysis services shared through the cloud)

2, Amazon EC2 and S3

Amazon EMR is a big data analysis infrastructure service built on the web-scale Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3) technologies. The EMR service achieves a high level of integration with AWS's other web services. A Hadoop cluster running on Amazon EMR uses Amazon EC2 Linux server instances as virtual master and slave nodes, Amazon S3 for bulk storage of input and output data, and Amazon CloudWatch for cluster performance monitoring and alarms; Amazon EMR with Hive can also be used to move data into and out of Amazon DynamoDB. All of these operations are coordinated by the Hadoop cluster control software that Amazon EMR starts and manages. Of course, most use of these integrated web services requires separate fees. Judging from current EMR pricing, cost is basically calculated according to compute time; specific prices can be looked up on the official website.

3, New big data services

In 2012, AWS launched two new big data services to supplement the previously released Elastic MapReduce (EMR, an online Hadoop engine for analyzing data). One service is called DynamoDB, a solid-state-drive-backed NoSQL database managed by Amazon with a high degree of scalability and fault tolerance; Amazon has run it internally since 2007, where it optimizes the Amazon consumer website. The other service is Redshift, an online data warehouse. Redshift combines with Amazon's other data storage products, the best known of which is the Simple Storage Service (S3). Earlier in the year, Amazon also launched Glacier as a long-term, low-cost storage option.
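A small boto3 sketch of the DynamoDB programming model (put an item, read it back); the table name, key schema, and fields are hypothetical, and the table is assumed to already exist.

```python
import boto3  # pip install boto3

dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
table = dynamodb.Table("user-sessions")  # hypothetical, pre-created table

# Write one item, then fetch it back by its partition key.
table.put_item(Item={"user_id": "u123", "last_query": "big data", "hits": 42})
item = table.get_item(Key={"user_id": "u123"}).get("Item")
print(item)
```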

Four, Facebook

Facebook is among the most active adopters of big data technology, because the volume of data it holds is extremely large: one report from 2011 put its compressed data at 25 PB (150 PB uncompressed), with 400 TB of uncompressed new data generated every day. At Facebook, big data technology is widely used in advertising, news feed, messaging/chat, search, site security, ad hoc analysis, reporting, and other areas. Facebook is also one of the largest contributors to Apache's open source big data projects. It officially turned to the Hadoop computing framework in 2007, went on to donate Hive, ZooKeeper, Scribe, Cassandra, and other famous open source tools to the Apache Foundation, and is still actively advancing this open source process today. Facebook's big data technology architecture has gone through three stages of evolution.

1, Facebook's early big data technology architecture

Facebook's early big data architecture was built on a foundation of open source tools: Hadoop, HBase, Hive, Scribe, and so on. Log data streams generated by the HTTP servers were transferred by the Scribe log collection system, within seconds, to the NFS shared network file system; then hourly-stage Copier/Loader (i.e., MapReduce) jobs uploaded the data files to Hadoop. Daily data summaries were produced by pipelines developed in Hive, the SQL-like language, and the results were regularly pushed to the front-end MySQL servers so that OLTP tools could generate reports. The Hadoop cluster had 3,000 nodes, so scalability and fault tolerance were well handled; the main problem of this early system was its large overall processing latency, taking 1 to 2 days from log generation to final report.

2, Facebook's current big data technology architecture

Facebook's current big data architecture optimizes the data transmission channel and the data processing systems of the earlier architecture. It is divided into the distributed log system Scribe, the distributed storage systems HDFS and HBase, the distributed computing and analysis systems (MapReduce, Puma, and Hive), and so on.

Here, the Scribe log system is used to aggregate log data from large numbers of HTTP servers. Thrift is a software framework provided by Facebook for cross-language service development, enabling seamless interoperation among C++, Java, PHP, Python, and Ruby; Scribe uses Thrift RPC calls to provide the log collection service. The Scribe Policy node is the log management node: it transmits metadata to the Scribe clients and to Scribe HDFS, and the log data collected by Scribe is stored in HDFS.

Facebook's optimization of the data path over the earlier system is called Data Freeway; it can process peaks of 9 GB/s with end-to-end latency within 10 s and supports more than 2,500 log categories. Data Freeway mainly comprises four components: Scribe, Calligraphus, Continuous Copier, and PTail. Scribe runs on the clients and transmits data via Thrift RPC; Calligraphus, the middle layer, sorts the data and writes it to HDFS, providing log-category management with the help of ZooKeeper; Continuous Copier copies files from one HDFS to another; and PTail tails multiple HDFS directories in parallel, writing the file data to standard output. In the current architecture, part of the data is still processed in batch mode by hour-level MapReduce jobs, stored in central HDFS, and analyzed daily through Hive; another portion of near-real-time data streams is processed at minute granularity by Puma. Facebook provides the Peregrine (HiPal) tool for ad hoc analysis, and the Nocron tool for periodic analysis.

3, Facebook's next-generation big data technology architecture

A prototype of Facebook's next-generation big data architecture has already emerged. First is Corona, an open source system that may replace Hadoop MapReduce, similar to Yahoo's YARN. One of Corona's biggest advances is that its cluster manager performs on-demand resource management of CPU, memory, and other resources for job processing, which allows Corona to handle both MapReduce and non-MapReduce jobs and widens the applications of the Hadoop cluster. Second is Presto, Facebook's latest interactive big data query system, which, like Cloudera's Impala and Hortonworks' Stinger, answers the fast-query needs of Facebook's rapidly expanding massive data warehouse. According to Facebook, simple queries in Presto take only a few hundred milliseconds, and even very complex queries complete within minutes; it runs in memory and does not write to disk. Third is the Wormhole stream computing system, similar to Twitter's Storm and Yahoo's Storm-YARN. The fourth important project is Prism, which can run a large, globally linked network of Hadoop clusters across data centers and can redistribute data in real time when a data center goes down; this is a project similar to Google's Spanner.

The evolutionary path of Facebook's big data technology architecture represents a roadmap for big data technology. What is commendable is that, like Yahoo and other companies, Facebook follows its customary open source route and has thereby made a great contribution to the development of big data technology.

Five, Twitter

Twitter's impending IPO has once again drawn the world's attention: the tweet it invented ushered the Internet into the era of micro-innovation. Although Twitter never entered China, Sina Weibo and Tencent Weibo, which it inspired, have become a striking feature of China's Internet. Twitter has a huge user base around the world, and the big data technology infrastructure that supports its massive flow of social information is likewise worth attention.

Twitter's big data architecture falls mainly into two types, both developed on open source projects: Hadoop-based batch processing and Storm-based real-time stream computing.

1, The batch processing architecture

Twitter collects data with Scribe, the logging tool open-sourced by Facebook; stores and analyzes batch data with Hadoop and MapReduce; and performs fast analysis on big data with Pig. Pig is a high-level parallel programming language based on Hadoop that provides an SQL-like textual data analysis language called Pig Latin; its compiler translates SQL-like data analysis requests into a series of optimized MapReduce operations. Pig is mainly used to support data analysis such as grouping, filtering, and joining (a conceptual sketch of this style follows).
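A conceptual Python sketch of the group-and-filter style that Pig Latin expresses; in real use Pig compiles such logic into MapReduce jobs, and the log records here are invented.

```python
from collections import defaultdict

logs = [
    ("u1", "search", 3), ("u2", "click", 1),
    ("u1", "click", 2), ("u3", "search", 7),
]

# FILTER: keep only search events -- like `FILTER logs BY action == 'search'`.
searches = [(user, n) for user, action, n in logs if action == "search"]

# GROUP + SUM: total searches per user -- like `GROUP ...; FOREACH ... SUM(n)`.
totals = defaultdict(int)
for user, n in searches:
    totals[user] += n

print(dict(totals))  # {'u1': 3, 'u3': 7}
```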

2, The stream computing architecture

Storm is the stream computing platform open-sourced by Twitter. Through a simple API, Storm lets developers reliably process continuous, unbounded streams of data for real-time computation; it is developed in Clojure and Java. Storm suits many scenarios, such as real-time analysis, online machine learning, and continuous computation.

3, NoSQL databases

Twitter uses many storage tools, reflecting both its needs at different stages of development and its trials of different scenarios. Its NoSQL databases include at least HBase, Cassandra, and FlockDB. HBase is used for batch analysis and data set generation; Cassandra serves the online systems, supporting dynamic reads; and FlockDB stores the real-time distributed social graph (a small access sketch follows).
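A hedged sketch of basic HBase access from Python via the happybase library (which talks to an HBase Thrift server); the host, table, and cells are hypothetical.

```python
import happybase  # pip install happybase

# Requires a running HBase Thrift server; host name is a placeholder.
connection = happybase.Connection("hbase-thrift-host")
table = connection.table("user_events")  # assumes the table already exists

# HBase stores cells under (row key, column family:qualifier).
table.put(b"user1", {b"cf:last_query": b"big data", b"cf:count": b"42"})
row = table.row(b"user1")
print(row[b"cf:last_query"])  # b'big data'
```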

4, The Mesos operating system

In the era of Big Data 2.0, Twitter's big data technology infrastructure keeps absorbing the best of open source technology, remaining inclusive and continually evolving. For example, Twitter introduced Mesos, an operating system for its distributed big data architecture that can sensibly schedule compute and storage resources for frameworks like Hadoop.

(Figure: Twitter's big data technology architecture)

5, Summingbird

Finally, the just open-sourced Summingbird integrates the batch and real-time stream computing architectures in one platform. Developers can implement MapReduce jobs on Summingbird in code very close to native Scala or Java, and can use Summingbird for batch processing, for real-time processing, or for a mix of the two modes: write the logic once and be done with it (a conceptual sketch follows). Further Summingbird work includes support for platforms such as Akka, Spark, and Tez, which will help Twitter fold more platforms and tools into its big data technology architecture.
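A conceptual Python sketch of the "write once" idea (real Summingbird programs are written in Scala against its API): the same pure counting logic is applied first to a whole batch, then tuple by tuple to a simulated stream.

```python
def count_words(counts, tweet):
    """The single piece of business logic: fold one tweet into word counts."""
    for word in tweet.split():
        counts[word] = counts.get(word, 0) + 1
    return counts

tweets = ["big data", "big storm", "storm on yarn"]

# Batch mode: run the logic over the whole historical data set at once.
batch_counts = {}
for tweet in tweets:
    count_words(batch_counts, tweet)
print("batch:", batch_counts)

# Streaming mode: the same logic, applied tweet by tweet as each arrives.
stream_counts = {}
for tweet in tweets:  # imagine these arriving one at a time
    count_words(stream_counts, tweet)
    print("stream update:", stream_counts)
```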

On the basis of this powerful big data technology architecture, Twitter is transforming into a big data analysis and service provider: more and more Twitter-based analysis tools, applications, and app business models are being unearthed, a data analysis industry ecosystem is gradually being built up, and the room for imagination it brings is enormous. I expect those watching Weibo must hope its big data technology infrastructure projects will likewise advance in this direction.
