Introduction to Big Data - Overview of Big Data Technology (1)

Big Data Getting Started Series Articles

  1. Getting Started with Big Data - What is Big Data

1. Concept

Big data technology refers to the set of technologies needed to build a big data platform, including storage systems, databases, data warehouses, resource scheduling, query engines, real-time frameworks, and so on. Below is a brief introduction to some of the technologies I have worked with so far; for now, only the basic concepts are covered.

2. Technical details

1. Infrastructure: Hadoop

1. Architecture

2. Introduction

Hadoop is a distributed system infrastructure developed under the Apache Software Foundation. It lets users write distributed programs without having to understand the low-level details of distribution, making full use of a cluster's capacity for high-speed computation and storage.
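
To make the programming model concrete, here is a minimal sketch of the classic word-count job written against the Hadoop MapReduce Java API. The class name and the command-line input/output paths are illustrative assumptions, not from the original article.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every token in its input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sums the counts emitted for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Packaged into a jar and submitted with `hadoop jar`, the framework handles input splitting, shuffling, and fault tolerance; the user only writes the map and reduce logic.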

2. Distributed file system: HDFS
1. HDFS architecture


2. Introduction

HDFS (Hadoop Distributed File System) is a distributed file system designed to run on general-purpose, commodity hardware.

3. Features

HDFS is highly fault-tolerant and designed to be deployed on inexpensive hardware. It provides high-throughput access to application data, making it well suited to applications with very large data sets.
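
As a small illustration of the API, the following sketch writes a file to HDFS and reads it back. The NameNode address hdfs://namenode:9000 and the path /tmp/hello.txt are placeholder assumptions; substitute your cluster's fs.defaultFS value.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://namenode:9000"); // assumed address

    try (FileSystem fs = FileSystem.get(conf)) {
      Path path = new Path("/tmp/hello.txt");

      // Write: HDFS favors large, streaming writes over random updates.
      try (FSDataOutputStream out = fs.create(path, true)) {
        out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
      }

      // Read the file back and copy its contents to stdout.
      try (FSDataInputStream in = fs.open(path)) {
        IOUtils.copyBytes(in, System.out, 4096, false);
      }
    }
  }
}
```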

3. Data Warehouse: Hive
1. Architecture

2. Introduction

Hive is a data warehouse tool built on top of Hadoop for data extraction, transformation, and loading (ETL). It provides a mechanism for storing, querying, and analyzing large-scale data sets kept in Hadoop.

3. Features

Because queries are executed as MapReduce jobs, execution is relatively slow, but Hive handles data at very large scale and is highly scalable. It uses a schema-on-read loading model. MapReduce will be explained in detail in a later article.
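
For a feel of how Hive is queried in practice, here is a minimal sketch that connects to HiveServer2 over JDBC. The host hiveserver2:10000, the database "default", and the logs table are assumptions made up for illustration.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
  public static void main(String[] args) throws Exception {
    // Register the driver (requires hive-jdbc on the classpath;
    // explicit registration is only needed on older JDBC setups).
    Class.forName("org.apache.hive.jdbc.HiveDriver");

    String url = "jdbc:hive2://hiveserver2:10000/default"; // assumed endpoint
    try (Connection conn = DriverManager.getConnection(url);
         Statement stmt = conn.createStatement()) {

      // HiveQL looks like SQL but compiles to MapReduce (or Tez/Spark)
      // jobs, which is why queries feel slow compared to an RDBMS.
      try (ResultSet rs = stmt.executeQuery(
          "SELECT page, COUNT(*) AS views FROM logs GROUP BY page")) {
        while (rs.next()) {
          System.out.println(rs.getString("page") + "\t" + rs.getLong("views"));
        }
      }
    }
  }
}
```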

4. Storage engine: Kudu
1. Architecture

2. Introduction

Apache Kudu is an open-source storage engine originally developed at Cloudera that provides both low-latency random reads and writes and efficient data analysis. Kudu scales horizontally, uses the Raft consensus protocol to guarantee consistency, and integrates closely with popular big data query and analysis tools such as Cloudera Impala and Apache Spark.

3. Features

Kudu supports random reads and writes as well as OLAP-style analysis, and its table model resembles that of a relational database, although performance drops when a query touches too many columns. Its storage files do not live on HDFS; Kudu manages its own on-disk storage.
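
Below is a minimal sketch using the Kudu Java client: it creates a strongly typed table and inserts one row. The master address kudu-master:7051 and the table and column names are illustrative assumptions.

```java
import java.util.Arrays;
import java.util.Collections;

import org.apache.kudu.ColumnSchema;
import org.apache.kudu.Schema;
import org.apache.kudu.Type;
import org.apache.kudu.client.CreateTableOptions;
import org.apache.kudu.client.Insert;
import org.apache.kudu.client.KuduClient;
import org.apache.kudu.client.KuduSession;
import org.apache.kudu.client.KuduTable;

public class KuduExample {
  public static void main(String[] args) throws Exception {
    try (KuduClient client =
             new KuduClient.KuduClientBuilder("kudu-master:7051").build()) {

      // Define a schema: Kudu tables are typed, much like an RDBMS.
      Schema schema = new Schema(Arrays.asList(
          new ColumnSchema.ColumnSchemaBuilder("id", Type.INT32).key(true).build(),
          new ColumnSchema.ColumnSchemaBuilder("name", Type.STRING).build()));

      client.createTable("metrics", schema,
          new CreateTableOptions()
              .addHashPartitions(Collections.singletonList("id"), 4));

      // Random writes go through a session and are applied with low latency.
      KuduTable table = client.openTable("metrics");
      KuduSession session = client.newSession();
      Insert insert = table.newInsert();
      insert.getRow().addInt("id", 1);
      insert.getRow().addString("name", "first row");
      session.apply(insert);
      session.close(); // flushes any pending operations
    }
  }
}
```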

5. Distributed database: HBase
1. Architecture

2. Introduction

HBase is an open-source, non-relational, distributed database modeled after Google's BigTable and implemented in Java. It is part of the Apache Software Foundation's Hadoop project and runs on top of HDFS, providing Hadoop with BigTable-like capabilities at scale. It can therefore store massive, sparse data sets in a fault-tolerant way.

3. Features

High reliability, high performance, column-oriented storage, and easy scalability.
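
The following is a minimal sketch of the HBase Java client, writing a single cell and reading it back by row key. It assumes an hbase-site.xml (with the ZooKeeper quorum) on the classpath and a pre-created table "user" with column family "info"; those names are illustrative.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("user"))) {

      // Write: cells are addressed by (row key, column family, qualifier).
      Put put = new Put(Bytes.toBytes("row1"));
      put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"),
          Bytes.toBytes("alice"));
      table.put(put);

      // Read the cell back by row key.
      Result result = table.get(new Get(Bytes.toBytes("row1")));
      byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
      System.out.println(Bytes.toString(value));
    }
  }
}
```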

6. Real-time framework: Flink
1. Architecture

2. Introduction

Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams. Flink is designed to run in all common cluster environments and to perform computations at in-memory speed and at any scale.

3. Features

Native stream processing, rich APIs, library support, and broad ecosystem integration.
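
To show what a stateful streaming job looks like, here is a minimal Flink word-count sketch over a socket source. The host and port localhost:9999 are an assumption (a source can be started with, e.g., `nc -lk 9999`).

```java
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class FlinkWordCount {
  public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env =
        StreamExecutionEnvironment.getExecutionEnvironment();

    // Unbounded source: lines arriving on a socket.
    DataStream<Tuple2<String, Integer>> counts = env
        .socketTextStream("localhost", 9999)
        .flatMap(new Tokenizer())
        .keyBy(value -> value.f0) // stateful: one running count per word
        .sum(1);

    counts.print();
    env.execute("socket word count");
  }

  // Splits each incoming line into (word, 1) pairs.
  public static class Tokenizer
      implements FlatMapFunction<String, Tuple2<String, Integer>> {
    @Override
    public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
      for (String word : line.toLowerCase().split("\\s+")) {
        if (!word.isEmpty()) {
          out.collect(new Tuple2<>(word, 1));
        }
      }
    }
  }
}
```

The per-word running count kept by keyBy/sum is exactly the "state" that Flink manages and checkpoints for fault tolerance.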

3. Others
The above are some of the technologies I currently work with. The next article will cover ZooKeeper, YARN, Spark, Impala, Kafka, and Flume.