Big Data Getting Started Series Articles
- Getting Started with Big Data - What is Big Data
1. Concept
Big data technology refers to the technologies needed to build a big data platform, including storage systems, databases, data warehouses, resource scheduling, query engines, and real-time frameworks. Below is a brief introduction to some of the technologies I have learned so far; for now, only the basic concepts are covered.
2. Technical details
1. Infrastructure: Hadoop
1. Architecture
2. Introduction
Hadoop is a distributed system infrastructure developed by the Apache Software Foundation. It allows users to develop distributed programs without knowing the underlying details of the distributed system, making full use of the cluster's power for high-speed computation and storage.
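The computation model at the heart of Hadoop is MapReduce. As a rough illustration (a pure-Python toy sketch of the map/shuffle/reduce phases, not the actual Hadoop API), here is the classic word count:

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle_phase(pairs):
    """Shuffle: group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: sum the counts emitted for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}

lines = ["big data big platform", "data platform"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts)  # {'big': 2, 'data': 2, 'platform': 2}
```

In a real Hadoop job, the map and reduce functions run in parallel on many machines, and the shuffle moves data between them over the network; the logic, however, is the same.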
2. Distributed file system: HDFS
1. HDFS architecture
2. Introduction
HDFS (Hadoop Distributed File System) is a distributed file system designed to run on general-purpose commodity hardware.
3. Features
HDFS is highly fault-tolerant and designed to be deployed on inexpensive hardware. It provides high-throughput access to application data, making it suitable for applications with very large data sets.
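Fault tolerance in HDFS comes from splitting each file into fixed-size blocks (128 MB by default) and replicating each block on several DataNodes (3 by default). The toy sketch below only illustrates the idea; the round-robin placement policy and node names are made up for the example, not HDFS's real rack-aware policy:

```python
def split_into_blocks(data: bytes, block_size: int):
    """Split a file's bytes into fixed-size blocks, as HDFS does on write."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(num_blocks: int, datanodes: list, replication: int = 3):
    """Assign each block to `replication` distinct DataNodes (toy round-robin policy)."""
    placement = {}
    for b in range(num_blocks):
        placement[b] = [datanodes[(b + r) % len(datanodes)] for r in range(replication)]
    return placement

blocks = split_into_blocks(b"x" * 300, block_size=128)
print(len(blocks))  # 3 blocks: 128 + 128 + 44 bytes
print(place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"]))
```

If one DataNode fails, every block it held still exists on other nodes, which is why cheap, failure-prone hardware is acceptable.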
3. Data Warehouse: Hive
1. Architecture
2. Introduction
Hive is a Hadoop-based data warehouse tool used for data extraction, transformation, and loading (ETL). It provides a mechanism to store, query, and analyze large-scale data stored in Hadoop.
3. Features
Because queries are executed through MapReduce, Hive is relatively slow, but it handles very large data volumes and scales well. It uses a schema-on-read loading model. MapReduce will be explained in detail in a later article.
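Schema-on-read means raw files are loaded into storage untouched, and the table schema is applied only when a query reads the data. Here is a minimal sketch of that idea in plain Python (the sample data and `read_with_schema` helper are invented for illustration; they are not Hive APIs):

```python
import csv
import io

# Raw data is stored as-is at load time; nothing validates it yet (schema-on-read).
raw_file = "1,alice,30\n2,bob,not_a_number\n"

def read_with_schema(raw: str, schema):
    """Apply the column schema only at read time; malformed values
    become None instead of failing the load, similar to Hive returning NULL."""
    rows = []
    for record in csv.reader(io.StringIO(raw)):
        row = {}
        for (name, cast), value in zip(schema, record):
            try:
                row[name] = cast(value)
            except ValueError:
                row[name] = None
        rows.append(row)
    return rows

schema = [("id", int), ("name", str), ("age", int)]
print(read_with_schema(raw_file, schema))
```

The trade-off: loads are fast because no validation happens, but errors surface only at query time.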
4. Storage engine: Kudu
1. Architecture
2. Introduction
Apache Kudu is an open-source storage engine originally developed by Cloudera that provides both low-latency random reads and writes and efficient data analytics. Kudu supports horizontal scaling, uses the Raft consensus protocol for consistency guarantees, and integrates closely with popular big data query and analysis tools such as Cloudera Impala and Apache Spark.
3. Features
Supports random reads and writes as well as OLAP-style analysis, though performance drops when too many columns are queried. Its table model resembles that of a relational database. Its storage files are not kept on HDFS; Kudu has its own storage layer.
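The combination Kudu targets, fast point updates plus fast analytic column scans, can be sketched with a toy table (the `ToyTable` class is invented for illustration and says nothing about Kudu's actual columnar file format or Raft replication):

```python
class ToyTable:
    """Toy model of a store supporting both random access by primary key
    and per-column analytic scans, the workload mix Kudu is built for."""
    def __init__(self, columns):
        self.columns = columns
        self.rows = {}  # primary key -> row dict (supports random read/write)

    def upsert(self, key, row):
        self.rows[key] = dict(row)  # low-latency point write or in-place update

    def get(self, key):
        return self.rows.get(key)   # low-latency point read

    def scan_column(self, name):
        # Analytic scan touching only one column; reading fewer columns is cheaper,
        # which hints at why querying many columns hurts performance.
        return [r[name] for r in self.rows.values()]

t = ToyTable(["id", "metric"])
t.upsert(1, {"id": 1, "metric": 10})
t.upsert(2, {"id": 2, "metric": 20})
t.upsert(1, {"id": 1, "metric": 15})  # random update of an existing row
print(t.get(1)["metric"], sum(t.scan_column("metric")))  # 15 35
```

HDFS files are append-only, which makes in-place updates like the one above awkward; that is one reason Kudu manages its own storage rather than sitting on HDFS.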
5. Distributed database: HBase
1. Architecture
2. Introduction
HBase is an open-source, non-relational, distributed database modeled after Google's Bigtable and implemented in Java. It is part of the Apache Software Foundation's Hadoop project and runs on top of the HDFS file system, providing Bigtable-like capabilities for Hadoop. It can therefore store massive amounts of sparse data fault-tolerantly.
3. Features
High reliability, high performance, column-oriented, scalable.
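HBase's data model is essentially a sparse, versioned map from (row key, column family:qualifier, timestamp) to value. The toy class below is a hypothetical sketch of that model only, not the HBase client API:

```python
import time
from collections import defaultdict

class ToyHBaseTable:
    """Toy sketch of HBase's data model:
    (row key, 'family:qualifier') -> {timestamp: value}."""
    def __init__(self):
        self.cells = defaultdict(dict)

    def put(self, row, column, value, ts=None):
        ts = ts if ts is not None else time.time_ns()
        self.cells[(row, column)][ts] = value  # every write is a new version

    def get(self, row, column):
        versions = self.cells.get((row, column))
        if not versions:
            return None  # sparse: absent cells consume no storage
        return versions[max(versions)]  # newest timestamp wins

t = ToyHBaseTable()
t.put("user1", "info:name", "alice", ts=1)
t.put("user1", "info:name", "alicia", ts=2)
print(t.get("user1", "info:name"))  # alicia
print(t.get("user2", "info:name"))  # None
```

The "column-oriented" feature above refers to this grouping of columns into families, which HBase stores and reads together.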
6. Real-time framework: Flink
1. Architecture
2. Introduction
Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams. Flink is designed to run in all common cluster environments and to perform computations at in-memory speed and at any scale.
3. Features
Rich stream-processing features, layered API support, library support, and broad integration support.
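"Stateful computation over a stream" means the operator remembers something across events rather than treating each one in isolation. A minimal pure-Python sketch of a keyed running count (the generator below is invented for illustration; Flink's real keyed state is partitioned, checkpointed, and fault-tolerant):

```python
from collections import defaultdict

def stateful_counter(events):
    """Toy stateful stream operator: keeps a running count per key and
    emits an updated result for every incoming event, like a keyed count in Flink."""
    state = defaultdict(int)  # the operator's state, keyed by event type
    for key in events:
        state[key] += 1
        yield (key, state[key])  # continuous output as the stream flows

stream = ["click", "view", "click", "click"]
print(list(stateful_counter(stream)))
# [('click', 1), ('view', 1), ('click', 2), ('click', 3)]
```

On an unbounded stream this generator never finishes, which is exactly the point: results are emitted continuously instead of at the end of a batch job.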
3. Others
The above are some of the technologies I am currently working with. The next article will cover ZooKeeper, YARN, Spark, Impala, Kafka, and Flume.
————————————————
Copyright statement: This is an original article by CSDN blogger "Shui Jian Shi Qing", licensed under the CC 4.0 BY-SA agreement. Please include the original source link and this statement when reprinting.
Original link: https://blog.csdn.net/helongqiang/article/details/119282811