Introduction to Big Data and Classification of Technical Systems

1. Introduction to Big Data

1. Basic concepts

Big data refers to collections of data that cannot be captured, managed, and processed with conventional software tools within an acceptable time frame. It is a massive, fast-growing, and diverse information asset that requires new processing modes in order to deliver stronger decision-making power, insight discovery, and process optimization. Big data technology is mainly concerned with the storage and analysis of massive data.

2. Characteristic analysis

The 5V characteristics of big data (proposed by IBM): Volume (large scale), Velocity (high speed), Variety (diverse types), Value (low value density), and Veracity (authenticity).

3. Development process

Between 2003 and 2006, Google published three papers describing the distributed file system GFS, the computing framework MapReduce, and the NoSQL database system Bigtable. Together they established the basic principles and ideas of big data: how to store, analyze, and compute over massive data files.

Doug Cutting, the programmer who started the Lucene and Nutch projects, implemented functionality similar to GFS and MapReduce based on the principles in Google's papers; this work later grew into the famous Hadoop.

After a period of rapid development, Hadoop has formed an ecosystem. On top of Hadoop sit a series of projects for real-time computing, offline computing, NoSQL storage, data analysis, machine learning, and more.

The development of this series of projects illustrates a pattern in how technology evolves: Google's business practice creatively produced the foundational papers, and continued business growth and demand kept forcing the technology to be updated. Business, in other words, is the key driver of continuous technological development.

2. Hadoop Framework

1. Introduction to Hadoop

Note that this description is based on Hadoop 2.x; unless otherwise stated, version 2.7 is assumed throughout.

Hadoop is a distributed system infrastructure developed by the Apache Software Foundation;

It provides storage capabilities for massive data, as well as analytical computing capabilities;

As an Apache top-level project, it forms an ecosystem with many sub-projects;

2. Framework features

Reliability: Hadoop maintains multiple copies of the data it stores, providing reliable service;

Scalability: Hadoop uses computer clusters to distribute data and complete computing tasks, which can be easily expanded to thousands of nodes;

High efficiency: Based on the MapReduce idea, it provides efficient parallel computing for massive data;

Fault tolerance: data is automatically saved in multiple copies, and failed tasks are automatically redistributed;

3. Composition structure

HDFS storage

  • NameNode

Stores file system metadata, such as file names, the directory structure, creation times, and the number of replicas of each file.

  • DataNode

Stores the actual file block data, together with the mapping between block IDs and the stored blocks.
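
Both roles stay invisible to client code, which talks to HDFS through a single API. Below is a minimal sketch using the HDFS Java client to write and read a file; the NameNode address hdfs://namenode:8020 and the path /tmp/hello.txt are placeholder assumptions.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode address; adjust to the actual cluster.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        // Write a small file: the NameNode records the metadata,
        // while the blocks themselves are stored on DataNodes.
        Path path = new Path("/tmp/hello.txt");
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.writeBytes("hello hdfs\n");
        }

        // Read the file back through the same API.
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(fs.open(path)))) {
            System.out.println(reader.readLine());
        }
        fs.close();
    }
}
```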

Yarn scheduling

YARN is responsible for resource management and job scheduling: it allocates system resources to the applications running in the Hadoop cluster and schedules their tasks for execution on different cluster nodes.

MapReduce calculation

MapReduce divides the calculation process into two stages: the Map stage processes the input data in parallel, and the Reduce stage summarizes the Map results.
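
To make the two stages concrete, here is the classic word-count example written against the Hadoop MapReduce Java API; the input and output paths come from the command line and the job name is arbitrary. The map stage emits a (word, 1) pair for every token, and the reduce stage sums the counts per word.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map stage: process each input split in parallel, emitting (word, 1).
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce stage: summarize the Map output by summing the counts per word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```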

3. Big Data Technology Stack

1. Kafka middleware

Open source organization: Apache Software

Application scenarios:

Kafka is a high-throughput distributed publish-subscribe messaging system. It persists messages in an on-disk data structure that maintains stable performance over long periods, even with terabytes of stored messages. High throughput: even on very ordinary hardware, Kafka can support millions of messages per second. It supports partitioning messages across the Kafka servers and consumer machine clusters, and it supports parallel data loading into Hadoop.
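
For a sense of the producer side of the API, here is a minimal sketch using the Kafka Java client; the broker address localhost:9092 and the topic name "events" are placeholder assumptions.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SimpleProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Broker address is a placeholder; point it at a real Kafka cluster.
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 10; i++) {
                // Records with the same key always land in the same partition.
                producer.send(new ProducerRecord<>("events", "key-" + i, "message-" + i));
            }
        }
    }
}
```

A matching consumer would subscribe to the same topic and poll for records, typically as part of a consumer group so that partitions are shared across machines.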

2. Flume log system

Open source organization: Cloudera

Application scenarios:

Flume is a highly available, highly reliable, distributed system for massive log collection, aggregation, and transmission provided by Cloudera. Flume supports customizing the various data senders in a logging system in order to collect data; at the same time, Flume can perform simple processing on the data and write it to various (customizable) data receivers.

3. Sqoop synchronization tool

Open source organization: Apache Software

Application scenarios:

Sqoop is an open source tool mainly used to transfer data between Hadoop/Hive and traditional databases such as MySQL. It can import data from a relational database (such as MySQL or Oracle) into HDFS, and it can also export data from HDFS back into a relational database.

4. HBase database

Open source organization: Apache Software

Application scenarios:

HBase is a distributed, column-oriented open source database that provides Bigtable-like capabilities on top of Hadoop. HBase is a sub-project of the Apache Hadoop project. Unlike a general relational database, HBase is suited to storing unstructured data and uses a column-based rather than row-based storage model.
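
A minimal sketch of the HBase Java client follows; it assumes a table named "user" with a column family "info" already exists, and the ZooKeeper quorum address, row key, and column names are placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // ZooKeeper quorum address is a placeholder.
        conf.set("hbase.zookeeper.quorum", "localhost");

        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("user"))) {

            // Write one cell: row key "row1", column info:name.
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("alice"));
            table.put(put);

            // Read the same cell back by row key and column.
            Get get = new Get(Bytes.toBytes("row1"));
            Result result = table.get(get);
            byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(value));
        }
    }
}
```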

5. Storm real-time computing

Open source organization: Apache Software

Application scenarios:

Storm is used for real-time computation: it continuously processes incoming data streams and outputs results to users as a stream while the computation is running. Storm is relatively simple to use and can work with any programming language.
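
Storm's core abstraction is a topology of spouts (stream sources) and bolts (stream processors). Below is a minimal local-mode sketch in Java; the word list, component names, and timings are made up for illustration.

```java
import java.util.Map;
import java.util.Random;

import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;
import org.apache.storm.utils.Utils;

public class WordTopology {

    // Spout: a toy source that keeps emitting random words as a stream.
    public static class WordSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;
        private final String[] words = {"hadoop", "storm", "kafka", "spark"};
        private final Random random = new Random();

        @Override
        public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
            this.collector = collector;
        }

        @Override
        public void nextTuple() {
            Utils.sleep(100);
            collector.emit(new Values(words[random.nextInt(words.length)]));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word"));
        }
    }

    // Bolt: consumes the stream and prints each word as it arrives.
    public static class PrintBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple input, BasicOutputCollector collector) {
            System.out.println("got word: " + input.getStringByField("word"));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // This bolt emits nothing downstream.
        }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("word-spout", new WordSpout());
        builder.setBolt("print-bolt", new PrintBolt()).shuffleGrouping("word-spout");

        // Run locally for ten seconds, then shut the local cluster down.
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("word-topology", new Config(), builder.createTopology());
        Utils.sleep(10000);
        cluster.shutdown();
    }
}
```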

6. Spark computing engine

Open source organization: Apache Software

Application scenarios:

Spark is a fast, general-purpose computing engine designed for large-scale data processing. It has the advantages of Hadoop MapReduce, but unlike MapReduce the intermediate output of a job can be kept in memory, eliminating the need to read and write HDFS between stages. This makes Spark better suited to algorithms that require iteration, such as data mining and machine learning. Spark is implemented in the Scala language and uses Scala as its application framework.
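
For comparison with the MapReduce example above, here is the same word count written against Spark's Java RDD API; the local[*] master, the application name, and the input path are placeholder assumptions for a local run. The intermediate (word, 1) pairs stay in memory rather than being written to HDFS between stages.

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        // Master and application name are placeholders for a local run.
        SparkConf conf = new SparkConf().setAppName("spark-word-count").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Input path is a placeholder.
            JavaRDD<String> lines = sc.textFile("hdfs:///tmp/input.txt");

            JavaPairRDD<String, Integer> counts = lines
                    .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                    .mapToPair(word -> new Tuple2<>(word, 1))
                    .reduceByKey((a, b) -> a + b); // intermediate results stay in memory

            counts.collect().forEach(t -> System.out.println(t._1() + ": " + t._2()));
        }
    }
}
```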

7. R language

Open source organization: the R Foundation (R is part of the GNU project)

Application scenarios:

R is a language and environment for statistical analysis and graphics. It is free (both gratis and open source) software belonging to the GNU project, and it is an excellent tool for statistical computation and statistical graphics.

8. Hive data warehouse tool

Open source organization: Facebook

Application scenarios:

Hive is a Hadoop-based data warehouse tool used for data extraction, transformation, and loading; it provides a mechanism for storing, querying, and analyzing large-scale data stored in Hadoop. Hive can map structured data files onto database tables and provides SQL query functionality, converting SQL statements into MapReduce tasks for execution.
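
One common way to run such queries from Java is through the HiveServer2 JDBC driver; in the sketch below the connection URL, the credentials, and the events table are placeholder assumptions, and Hive turns the SQL into MapReduce jobs behind the scenes.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver; URL, user, and table name are placeholders.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT category, COUNT(*) FROM events GROUP BY category")) {
            // Hive compiles the SQL above into one or more MapReduce jobs.
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}
```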

9. Oozie components

Open source organization: Apache Software

Application scenarios:

Oozie is a workflow scheduling and management system used to manage Hadoop jobs.

10. Azkaban components

Open source organization: LinkedIn

Application scenarios:

Azkaban is a batch workflow task scheduler used to run a set of jobs and processes in a specific order within a workflow. Azkaban defines a key-value file format to establish dependencies between tasks and provides an easy-to-use web user interface to maintain and track workflows.

11. Mahout components

Open source organization: Apache Software

Application scenarios:

Mahout provides scalable implementations of classic algorithms in the field of machine learning, aiming to help developers create intelligent applications more conveniently and quickly. Mahout includes many implementations, covering clustering, classification, collaborative filtering (recommendation), and frequent itemset mining.

12. ZooKeeper components

Open source organization: Apache Software

Application scenarios:

ZooKeeper is a distributed, open source coordination service for distributed applications, an open source implementation of Google's Chubby, and an important component of Hadoop and HBase. It provides consistency services for distributed applications, including configuration maintenance, naming services, distributed synchronization, and group services.
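
As an illustration of the configuration-maintenance use case, here is a minimal sketch using the ZooKeeper Java client; the connection string, the znode path, and the stored value are placeholder assumptions.

```java
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkConfigExample {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);
        // Connection string is a placeholder; block until the session is established.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 5000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();

        // Store a piece of configuration as a persistent znode, then read it back.
        String path = "/demo-config";
        if (zk.exists(path, false) == null) {
            zk.create(path, "max_connections=100".getBytes(),
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }
        byte[] data = zk.getData(path, false, null);
        System.out.println(new String(data));

        zk.close();
    }
}
```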

4. Technology Stack Classification

Storage system: Hadoop-HDFS, HBase, MongoDB, Cassandra

Computing system: Hadoop-MapReduce, Spark, Storm, Flink

Data synchronization: Sqoop, DataX

Resource scheduling: YARN, Oozie, Zookeeper

Log collection: Flume, Logstash, Kibana

Analysis engine: Hive, Impala, Presto, Phoenix, SparkSQL

Cluster monitoring: Ambari, Ganglia, Zabbix
