Big Data concepts

A big data platform typically involves:

  1. Data transfer and synchronization tools: Sqoop, OGG
  2. Distributed computing frameworks: MapReduce, Spark, Spark Streaming, Flink
  3. Data storage and messaging: Hive, HBase, Kafka
  4. The core: Hadoop (HDFS + MapReduce + YARN)

Sqoop

Sqoop is a tool for transferring data between relational databases and Hadoop. It is essentially a command-line tool (its commands are translated into MapReduce programs) that performs imports and exports between databases such as MySQL and Oracle on one side, and HDFS, Hive, and HBase on the other.

OGG

Oracle GoldenGate (OGG) is log-based structured data replication software. An extract process on the source database reads the online redo log or archive log and parses out only the change information, such as DML operations (inserts, deletes, and updates). The extracted changes are converted into GoldenGate's custom intermediate format and stored in queue files (trail files). A data pump process then transfers the trail files to the target system over TCP/IP, where they are applied.
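The extract-and-apply flow can be sketched in a few lines of plain Python. This is a conceptual illustration only, not OGG's actual implementation or API: DML change records are extracted from a source "log" into a trail queue, then replayed against a target table.

```python
# Conceptual sketch of log-based change replication (hypothetical, not OGG's
# API): an extract step pulls only DML change records from a source log into
# a "trail" queue, and an apply step replays them on the target.

source_log = [
    {"op": "INSERT", "table": "users", "key": 1, "row": {"name": "alice"}},
    {"op": "UPDATE", "table": "users", "key": 1, "row": {"name": "bob"}},
    {"op": "DELETE", "table": "users", "key": 1, "row": None},
    {"op": "INSERT", "table": "users", "key": 2, "row": {"name": "carol"}},
]

def extract(log):
    """Extract process: keep only DML change records (the trail)."""
    return [rec for rec in log if rec["op"] in ("INSERT", "UPDATE", "DELETE")]

def apply_trail(trail):
    """Apply process: replay trail records against a target table."""
    target = {}
    for rec in trail:
        if rec["op"] == "DELETE":
            target.pop(rec["key"], None)
        else:  # INSERT or UPDATE
            target[rec["key"]] = rec["row"]
    return target

trail = extract(source_log)
print(apply_trail(trail))  # {2: {'name': 'carol'}}
```

The key idea the sketch captures is that only the changes travel over the wire, not full table copies.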

Hadoop

Hadoop is a distributed system infrastructure that harnesses the combined computing and storage power of a cluster. It solves two problems: storing big data and analyzing big data. These correspond to Hadoop's two cores: HDFS and MapReduce.

  • HDFS (Hadoop Distributed File System) is a scalable, fault-tolerant, high-throughput distributed file system with asynchronous replication and a write-once, read-many access model; it is responsible for storage.
  • MapReduce is a distributed computing framework consisting of a map phase and a reduce phase; it is responsible for computation over data in HDFS.
  • YARN (Yet Another Resource Negotiator) is the resource-management architecture, comprising the ResourceManager, ApplicationMaster, and NodeManager.
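The map and reduce phases above can be made concrete with the classic word-count example, simulated here in plain Python (illustrative only; a real job would be distributed across a Hadoop cluster):

```python
# Word count expressed in the MapReduce model, simulated in plain Python.
from collections import defaultdict

def map_phase(line):
    # map: emit a (word, 1) pair for every word in a line
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # shuffle: group all values by key, as the framework does between phases
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # reduce: sum the counts for one word
    return key, sum(values)

lines = ["big data big cluster", "data storage"]
mapped = [pair for line in lines for pair in map_phase(line)]
result = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(result)  # {'big': 2, 'data': 2, 'cluster': 1, 'storage': 1}
```

In a real cluster, map tasks run in parallel on the nodes holding the HDFS blocks, and the shuffle moves grouped data across the network to the reduce tasks.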

Hive

Hive is data-warehouse software built on top of Hadoop. It imposes structure on data already stored in HDFS and provides a SQL-like query language, HiveQL, for analysis and processing. Hive translates HiveQL statements into a series of MapReduce jobs and executes them.
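To make the "HiveQL becomes MapReduce" point concrete, here is a plain-Python sketch (not Hive internals) of what a simple aggregate query conceptually compiles down to:

```python
# Conceptual sketch (not Hive internals): a HiveQL aggregate such as
#   SELECT dept, COUNT(*) FROM emp GROUP BY dept;
# becomes a MapReduce job: map emits (dept, 1), reduce sums per key.
from collections import Counter

emp = [
    {"name": "a", "dept": "sales"},
    {"name": "b", "dept": "sales"},
    {"name": "c", "dept": "hr"},
]

mapped = [(row["dept"], 1) for row in emp]   # map phase
group_by_count = Counter()                   # shuffle + reduce
for dept, one in mapped:
    group_by_count[dept] += one

print(dict(group_by_count))  # {'sales': 2, 'hr': 1}
```

This is why Hive queries have batch-job latency: each query pays the cost of launching and running MapReduce jobs.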

Spark

Spark is a data processing and analytics framework built around speed, ease of use, and sophisticated analytics; it is faster than MapReduce. Spark itself is written in Scala and provides APIs in four languages: Java, Scala, Python, and R.
On top of Spark you can develop with operators in Java and the other supported languages. Operators fall into two kinds: transformations (intermediate processing steps) and actions (which trigger SparkContext to submit a job and produce output).
You can also use Spark SQL, another SQL-on-Hadoop tool, which offers a SQL-style declarative programming interface. Think of it as a layer on top of Spark core: on the RDD computation model it provides the DataFrame API and a built-in SQL execution-plan optimizer, Catalyst.
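The transformation/action distinction is essentially lazy evaluation. The following is an analogy in plain Python using generators, not the actual PySpark API: transformations only describe work, and nothing runs until an action asks for a result.

```python
# Illustrative analogy (not the PySpark API): Spark transformations are lazy
# descriptions of a computation; an action triggers the whole pipeline.
# Python generators behave the same way.

data = range(1, 6)                       # pretend this is an RDD of 1..5
doubled = (x * 2 for x in data)          # "transformation": nothing computed yet
big_only = (x for x in doubled if x > 4) # another lazy transformation

total = sum(big_only)                    # "action": triggers the whole pipeline
print(total)  # 6 + 8 + 10 = 24
```

Laziness lets Spark see the whole chain of transformations before running anything, so it can plan and optimize the job as a unit.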

Spark Streaming

Spark Streaming is a micro-batch (near-real-time) stream computing framework. Its basic principle is to process the input data in batches at a fixed time interval; when the batch interval is shortened to the second level, it can be used to process real-time data streams.
It supports ingesting data from many sources, including Kafka, Flume, Twitter, ZeroMQ, Kinesis, and TCP sockets. Once data is obtained, it can be processed with high-level functions such as map, reduce, and join to implement complex algorithms. Finally, the results can be stored in file systems and databases such as HDFS and HBase.
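The micro-batch principle can be sketched in plain Python (this is the idea only, not the Spark Streaming API): records arriving on a stream are grouped into fixed time-interval batches, and each batch is then processed with ordinary batch logic.

```python
# Micro-batch sketch (not the Spark Streaming API): group streaming records
# into fixed time-interval batches, then process each batch like normal data.

# (timestamp_seconds, value) records from a hypothetical stream
stream = [(0.2, 1), (0.9, 2), (1.1, 3), (1.8, 4), (2.5, 5)]
BATCH_INTERVAL = 1.0  # seconds

def batches(records, interval):
    """Group records into consecutive micro-batches by arrival time."""
    out = {}
    for ts, value in records:
        out.setdefault(int(ts // interval), []).append(value)
    return [out[k] for k in sorted(out)]

# Process each micro-batch with ordinary batch logic (here: sum per batch).
per_batch_sums = [sum(b) for b in batches(stream, BATCH_INTERVAL)]
print(per_batch_sums)  # [3, 7, 5]
```

Shrinking `BATCH_INTERVAL` trades throughput for lower latency, which is exactly the tuning knob Spark Streaming exposes as its batch duration.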

Flink

Flink is a distributed stream-processing framework whose core is a distributed streaming dataflow engine written in Java and Scala. Flink executes arbitrary dataflow programs in a data-parallel and pipelined manner, and its pipelined runtime can execute both stream-processing and batch programs.
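Pipelined execution means each record flows through the whole operator chain as soon as it arrives, instead of waiting for a full batch at every stage. The following plain-Python sketch (not Flink's API) shows that record-at-a-time interleaving:

```python
# Pipelined execution sketch (not Flink's API): each record moves through the
# entire operator chain immediately, so stages interleave per record.

events = []  # trace of which operator touched which record, in order

def source():
    for x in [1, 2, 3]:
        events.append(f"source {x}")
        yield x

def double(stream):
    for x in stream:
        events.append(f"double {x}")
        yield x * 2

def sink(stream):
    return [x for x in stream]

result = sink(double(source()))
print(result)  # [2, 4, 6]
# The trace interleaves per record -- pipelining -- rather than running all
# of source before any of double:
print(events)  # ['source 1', 'double 1', 'source 2', 'double 2', ...]
```

Contrast this with the micro-batch model above, where an entire batch is collected before the next stage starts.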

HBase

HBase is a highly reliable, high-performance, column-oriented, scalable distributed storage system. With HBase, large-scale structured-storage clusters can be built on inexpensive commodity PC servers. HBase uses Hadoop HDFS as its underlying file storage system.
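HBase's logical data model can be sketched as nested maps: row key → column family → column qualifier → versioned values. This plain-Python sketch illustrates the model only (function names like `put`/`get` mirror HBase's operation names but this is not the HBase client API):

```python
# Sketch of HBase's logical data model (not the HBase client API):
# table: row key -> column family -> qualifier -> list of (timestamp, value).

table = {}

def put(row_key, family, qualifier, value, timestamp):
    """Write one cell version, keeping newest versions first."""
    cell = (table.setdefault(row_key, {})
                 .setdefault(family, {})
                 .setdefault(qualifier, []))
    cell.append((timestamp, value))
    cell.sort(reverse=True)  # newest version first

def get(row_key, family, qualifier):
    """Read the latest version of a cell, like a default HBase Get."""
    versions = table[row_key][family][qualifier]
    return versions[0][1]

put("user#1", "info", "name", "alice", timestamp=100)
put("user#1", "info", "name", "alice2", timestamp=200)  # newer version
print(get("user#1", "info", "name"))  # alice2
```

Rows are sorted by row key and cells are versioned by timestamp, which is why row-key design matters so much in real HBase schemas.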

Origin www.cnblogs.com/lknny/p/11242075.html