Big Data: the Hadoop Architecture of the Era

When big data analytics platforms come up, the Hadoop ecosystem is unavoidable. Hadoop now has more than ten years of history, and a great deal has changed as it evolved from the 0.x releases to the current version 2.6. I define the "post-Hadoop era" as the platform landscape after 2012 — which does not mean there is no Hadoop, but rather that, much like NoSQL (Not Only SQL), complementary alternatives now exist alongside it. To set the stage, let's briefly walk through some of the related open-source components.

Background

Hadoop: an open-source data analysis platform that addresses the reliable storage and processing of big data (data too large for a single machine to store, or to process within the required time). It is well suited to unstructured data and includes the basic components HDFS and MapReduce.

HDFS: provides a resilient system for storing data across a cluster of servers.
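To make the storage model concrete, here is a toy, single-process sketch of the idea behind HDFS: a file is split into fixed-size blocks, and each block is replicated onto several servers so the loss of one machine does not lose data. The block size, server names, and round-robin placement are illustrative assumptions — real HDFS defaults to 128 MB blocks and places replicas with rack awareness.

```python
BLOCK_SIZE = 8      # HDFS uses 128 MB by default; 8 bytes keeps the demo small
REPLICATION = 3     # the default HDFS replication factor

servers = ["node1", "node2", "node3", "node4"]

def place_blocks(data, servers, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Split data into fixed-size blocks and assign each block to several servers."""
    placement = {}
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    for idx, block in enumerate(blocks):
        # Simple round-robin placement; real HDFS also considers racks and free space.
        replicas = [servers[(idx + r) % len(servers)] for r in range(replication)]
        placement[idx] = {"block": block, "replicas": replicas}
    return placement

layout = place_blocks(b"hello hadoop distributed file system", servers)
for idx, info in layout.items():
    print(idx, info["replicas"])
```

Because every block lives on three different servers, any single node can fail and every block is still readable from its surviving replicas.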

MapReduce: provides a data-locality-aware, standardized data-processing flow: read the data, map it to key-value pairs (Map), rearrange the data by key, and reduce it (Reduce) to produce the final output.
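The flow above can be sketched as a toy, single-process word count in plain Python — a conceptual model of the Map → shuffle → Reduce stages, not the actual Hadoop Java API, which runs these phases in parallel across the cluster:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    """Map: emit a (word, 1) key-value pair for every word in the input."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle_phase(pairs):
    """Shuffle: sort and group the pairs by key, as the framework would."""
    for key, group in groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0)):
        yield key, [value for _, value in group]

def reduce_phase(grouped):
    """Reduce: combine the values for each key into a final count."""
    for key, values in grouped:
        yield key, sum(values)

lines = ["big data on hadoop", "hadoop stores big data"]
result = dict(reduce_phase(shuffle_phase(map_phase(lines))))
print(result)  # {'big': 2, 'data': 2, 'hadoop': 2, 'on': 1, 'stores': 1}
```

The value of the real framework is that the Map and Reduce functions you write look just this simple, while Hadoop handles partitioning, shuffling, and fault tolerance across many machines.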

Amazon Elastic MapReduce (EMR): a hosted solution that runs on Amazon's web-scale infrastructure, built on Elastic Compute Cloud (EC2) and Simple Storage Service (S3). If you need one-off or infrequent big-data processing, EMR may save you money. However, EMR is highly optimized for working with data in S3 and comes with higher latency.

Hadoop also comes with a family of extension projects, chiefly Sqoop, Flume, Hive, Pig, Mahout, DataFu, Hue, and the like.

Pig: a platform for analyzing large data sets. It consists of a high-level language for expressing data-analysis programs, together with the infrastructure for evaluating those programs.

Hive: a data-warehouse system for Hadoop that provides a SQL-like query language. Using that language, you can easily aggregate data and run ad-hoc queries and analyses.
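As an illustration of the kind of aggregation Hive makes easy, the comment below shows a typical HiveQL query (the table and column names are made up for the example), with the equivalent computation sketched in plain Python over an in-memory "table" — Hive itself would compile such a query down to MapReduce jobs over data in HDFS:

```python
from collections import Counter

# A typical HiveQL aggregation (hypothetical table):
#   SELECT page, COUNT(*) AS hits FROM logs GROUP BY page;
logs = [
    {"page": "/home", "user": "u1"},
    {"page": "/cart", "user": "u2"},
    {"page": "/home", "user": "u3"},
]

hits = Counter(row["page"] for row in logs)
print(dict(hits))  # {'/home': 2, '/cart': 1}
```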

HBase: a distributed, scalable big-data store that supports random, real-time read/write access.
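HBase's data model is essentially a sparse, sorted map of maps: row key → column family → qualifier → timestamped value. The tiny in-memory class below sketches that model only (the table and column names are invented for the demo); it does not reproduce HBase's distribution of rows across region servers or its persistence:

```python
import time
from collections import defaultdict

class ToyTable:
    """A toy of HBase's model: row -> column family -> qualifier -> (timestamp, value)."""

    def __init__(self, families):
        self.families = set(families)   # column families are declared at table creation
        self.rows = {}

    def put(self, row, family, qualifier, value, ts=None):
        if family not in self.families:
            raise KeyError("column family must be declared up front")
        cells = self.rows.setdefault(row, defaultdict(dict))
        cells[family][qualifier] = (ts if ts is not None else time.time(), value)

    def get(self, row, family, qualifier):
        _, value = self.rows[row][family][qualifier]
        return value

table = ToyTable(families=["info"])
table.put("user#1001", "info", "name", "alice")
table.put("user#1001", "info", "city", "beijing")
print(table.get("user#1001", "info", "name"))  # alice
```

Because rows are sparse maps, two rows in the same table can hold entirely different qualifiers without wasting space — one reason HBase suits semi-structured data.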

Sqoop: a tool designed for efficiently transferring bulk data between Apache Hadoop and structured data stores such as relational databases.

Flume: a distributed, reliable, highly available service for efficiently collecting, aggregating, and moving large amounts of log data.

ZooKeeper: a centralized service for maintaining configuration information and naming, and for providing distributed synchronization and group services.

Cloudera: the most mature Hadoop distribution, with the largest number of deployments. It provides powerful tools for deployment, management, and monitoring. Cloudera also developed and contributed Impala, which can process big data in real time.

Hortonworks: a provider of 100% open-source Apache Hadoop. It has developed and contributed many enhancements to the core trunk, making it possible to run Hadoop natively on platforms including Windows Server and Azure.

MapR: offers better performance and ease of use, and supports a native Unix file system instead of HDFS. It provides high-availability features such as snapshots, mirroring, and stateful failover. MapR leads the Apache Drill project, an open-source implementation of Google's Dremel, whose goal is to provide real-time processing for SQL-like queries.


Origin blog.csdn.net/chengxvsyu/article/details/92430933