1. Introduction to Big Data
1. Basic concepts
Big data refers to data collections that cannot be captured, managed, and processed within an acceptable time frame using conventional software tools. It is a massive, fast-growing, and diverse information asset that requires new processing modes to deliver stronger decision-making power, insight discovery, and process optimization. Big data technology is mainly concerned with storing and analyzing such massive data sets.
2. Characteristic analysis
The 5V characteristics of big data (proposed by IBM): Volume (massive scale), Velocity (high speed), Variety (diverse types), Value (low value density), and Veracity (authenticity).
3. Development process
Around 2004, Google published three landmark papers: on the file system GFS, the computing framework MapReduce, and the NoSQL database system Bigtable. Together, these papers established the basic principles and ideas of big data: storing, analyzing, and computing over massive data files.
Doug Cutting, a gifted programmer and the initiator of the Lucene and Nutch projects, initially implemented functionality similar to GFS and MapReduce based on the principles in Google's papers; this work later developed into the famous Hadoop.
After rapid development, Hadoop has since formed an ecosystem: built on top of it are real-time computing, offline computing, NoSQL storage, data analysis, machine learning, and more.
This history illustrates a pattern in how technology evolves: Google's business practice creatively produced the papers that served as the foundation, and business growth and demand keep forcing technology to be updated. Business, in other words, is the key driver of continuous technological development.
2. Hadoop framework
1. Introduction to Hadoop
Note that this description is based on Hadoop 2.x; unless otherwise stated, all later references assume version 2.7.
Hadoop is a distributed system infrastructure developed by the Apache Foundation;
It provides massive data storage capabilities as well as analytical computing capabilities;
As a top-level Apache project, it is an ecosystem with many sub-projects;
2. Framework features
Reliability: Hadoop maintains multiple copies of each piece of data, so it can continue to provide reliable service even when individual nodes fail;
Scalability: Hadoop uses computer clusters to distribute data and complete computing tasks, which can be easily expanded to thousands of nodes;
High efficiency: Based on the MapReduce idea, it provides efficient parallel computing for massive data;
Fault tolerance: multiple copies of data are saved automatically, and failed tasks are automatically redistributed;
3. Composition structure
HDFS storage
- NameNode
Stores file-related metadata, such as the file name, directory structure, creation time, and replica count.
- DataNode
Stores the actual file block data, along with the mapping between block IDs and block data.
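The NameNode/DataNode split above can be sketched as a toy in-memory model. This is a minimal stdlib-only illustration with made-up class and field names, not the real HDFS API: the NameNode keeps only metadata and the block layout of each file, while DataNodes hold the block bytes.

```python
# Toy sketch of the HDFS metadata/data split (illustrative names, not the real API).

class NameNode:
    def __init__(self):
        self.metadata = {}  # path -> {"replicas": int, "blocks": [block_id, ...]}

    def create_file(self, path, blocks, replicas=3):
        # The NameNode records only metadata and the block layout, never the bytes.
        self.metadata[path] = {"replicas": replicas, "blocks": blocks}

    def get_blocks(self, path):
        return self.metadata[path]["blocks"]

class DataNode:
    def __init__(self):
        self.blocks = {}  # block_id -> block bytes

    def store(self, block_id, data):
        self.blocks[block_id] = data

# A file is split into blocks; the NameNode records the layout,
# a DataNode stores the block contents.
nn = NameNode()
dn = DataNode()
nn.create_file("/logs/app.log", blocks=["blk_1", "blk_2"])
dn.store("blk_1", b"first chunk ")
dn.store("blk_2", b"remainder")

# Reading a file means asking the NameNode for the block list,
# then fetching each block from the DataNodes.
data = b"".join(dn.blocks[b] for b in nn.get_blocks("/logs/app.log"))
print(data)
```

In real HDFS each block is also replicated across several DataNodes according to the replica count, which is what gives the reliability described earlier.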
Yarn scheduling
Responsible for resource management and job scheduling, allocating system resources to various applications running in the Hadoop cluster, and scheduling tasks to be executed on different cluster nodes.
MapReduce calculation
MapReduce divides the calculation process into two stages: the Map stage processes the input data in parallel, and the Reduce stage summarizes the Map results.
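The two stages can be demonstrated with the classic word-count example. This is a minimal single-process sketch of the idea, not distributed code: the map phase emits (word, 1) pairs per input split, a shuffle groups pairs by key, and the reduce phase sums the counts.

```python
from collections import defaultdict

# Map stage: process one input split, emitting (word, 1) pairs.
def map_phase(line):
    return [(word, 1) for word in line.split()]

# Reduce stage: summarize all counts for one word.
def reduce_phase(word, counts):
    return word, sum(counts)

splits = ["big data big ideas", "data drives decisions"]

# Shuffle: group the intermediate pairs by key before reducing.
grouped = defaultdict(list)
for line in splits:
    for word, count in map_phase(line):
        grouped[word].append(count)

result = dict(reduce_phase(w, c) for w, c in grouped.items())
print(result)  # {'big': 2, 'data': 2, 'ideas': 1, 'drives': 1, 'decisions': 1}
```

In a real Hadoop job, the map tasks run in parallel on different cluster nodes, and the framework performs the shuffle across the network before the reduce tasks run.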
3. Big data technology stack
1. Kafka middleware
Open source organization: Apache Software
Application scenarios:
Kafka is a high-throughput distributed publish-subscribe messaging system. It persists messages through an on-disk log structure that maintains stable performance even with terabytes of stored messages. Even on very ordinary hardware, Kafka can support millions of messages per second. It supports partitioning messages across Kafka servers and consuming them with clusters of consumer machines, and it supports parallel data loading into Hadoop.
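The partitioned publish-subscribe model can be sketched with a toy in-memory log. The class and method names below are illustrative, not the Kafka client API: a topic is a set of append-only partition logs, producers route messages to a partition by key, and consumers read a partition sequentially from an offset they track themselves.

```python
# Toy sketch of Kafka's core data model (illustrative names, not the real API).

class TopicLog:
    def __init__(self, num_partitions=3):
        # A topic is a set of independent append-only logs (partitions).
        self.partitions = [[] for _ in range(num_partitions)]

    def produce(self, key, value):
        # The same key always maps to the same partition,
        # which preserves per-key message order.
        p = hash(key) % len(self.partitions)
        self.partitions[p].append(value)
        return p

    def consume(self, partition, offset):
        # A consumer polls one partition sequentially from its own offset.
        return self.partitions[partition][offset:]

topic = TopicLog()
p = topic.produce("user-42", "login")
topic.produce("user-42", "click")
print(topic.consume(p, 0))  # ['login', 'click']
```

Because each partition is an independent log, throughput scales by adding partitions and consumer machines, which is the basis of the high-throughput claim above.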
2. Flume log system
Open source organization: Cloudera
Application scenarios:
Flume is a highly available, highly reliable, distributed system for massive log collection, aggregation, and transmission, provided by Cloudera. Flume supports customizing the various data senders in a logging system to collect data; at the same time, it can perform simple processing on the data and write it to various (customizable) data receivers.
3. Sqoop synchronization tool
Open source organization: Apache Software
Application scenarios:
Sqoop is an open source tool mainly used to transfer data between Hadoop/Hive and traditional databases such as MySQL. It can import data from a relational database (such as MySQL or Oracle) into HDFS, and it can also export HDFS data back into a relational database.
4. HBase database
Open source organization: Apache Software
Application scenarios:
HBase is a distributed, column-oriented open source database that provides Bigtable-like capabilities on top of Hadoop. It is a sub-project of the Apache Hadoop project. Unlike a typical relational database, HBase is suited to unstructured data storage and uses a column-based rather than row-based storage model.
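HBase's data model can be illustrated with a toy cell store. This is a sketch under stated assumptions, not the HBase client API: each cell is addressed by a row key plus a column (written `family:qualifier`), values are versioned by timestamp, and rows are sparse, so absent columns cost nothing.

```python
# Toy sketch of HBase's versioned, sparse column model (illustrative names only).

class ToyTable:
    def __init__(self):
        # (row key, "family:qualifier") -> list of (timestamp, value), newest first.
        self.cells = {}

    def put(self, row, column, value, ts):
        # A put adds a new version; older versions remain readable.
        self.cells.setdefault((row, column), []).insert(0, (ts, value))

    def get(self, row, column):
        # A get returns the newest version, or None for a missing (sparse) cell.
        versions = self.cells.get((row, column))
        return versions[0][1] if versions else None

t = ToyTable()
t.put("user1", "info:name", "Ada", ts=1)
t.put("user1", "info:name", "Ada L.", ts=2)  # newer version shadows the old one
print(t.get("user1", "info:name"))   # Ada L.
print(t.get("user1", "info:email"))  # None  (the column was never written)
```

Real HBase additionally sorts rows by key and splits the table into regions distributed across the cluster; the toy above only shows the cell-addressing and versioning idea.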
5. Storm real-time computing
Open source organization: Apache Software
Application scenarios:
Storm is used for real-time computation: it continuously queries data streams and outputs results to users as a stream while the computation runs. Storm is relatively simple to use and can work with any programming language.
6. Spark computing engine
Open source organization: Apache Software
Application scenarios:
Spark is a fast, general-purpose computing engine designed for large-scale data processing. It retains the advantages of Hadoop's MapReduce, but unlike MapReduce, intermediate job output can be kept in memory, eliminating the need to read and write HDFS. Spark is therefore better suited to algorithms that require iteration, such as data mining and machine learning. Spark is implemented in the Scala language and uses Scala as its application framework.
7. R language
Open source organization: R Foundation
Application scenarios:
R is a language and environment for statistical analysis and graphics. It is free, open source software belonging to the GNU project, and an excellent tool for statistical computation and statistical graphics.
8. Hive data warehouse tool
Open source organization: Facebook
Application scenarios:
Hive is a data warehouse tool built on Hadoop for data extraction, transformation, and loading (ETL). It provides a mechanism to store, query, and analyze large-scale data stored in Hadoop: it maps structured data files onto database tables and offers SQL query functionality, converting SQL statements into MapReduce tasks for execution.
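The SQL-to-MapReduce translation can be sketched conceptually. The query and table below are made-up examples, and the code is a single-process illustration rather than anything Hive actually generates: a `GROUP BY` aggregation becomes a map stage that emits the group key and a reduce stage that aggregates per key.

```python
from collections import defaultdict

# Conceptual translation of:  SELECT dept, COUNT(*) FROM employees GROUP BY dept;
# (hypothetical table and data, for illustration only)
rows = [{"name": "a", "dept": "eng"},
        {"name": "b", "dept": "eng"},
        {"name": "c", "dept": "sales"}]

# Map stage: for every row, emit (group key, 1).
pairs = [(row["dept"], 1) for row in rows]

# Shuffle + reduce stage: sum the counts per key.
counts = defaultdict(int)
for dept, n in pairs:
    counts[dept] += n

print(dict(counts))  # {'eng': 2, 'sales': 1}
```

This is why Hive queries that scan large tables work at all: the SQL is never executed row-by-row on one machine, but compiled into distributed map and reduce tasks over the files in HDFS.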
9. Oozie components
Open source organization: Apache Software
Application scenarios:
Oozie is a workflow scheduling and management system for managing Hadoop jobs.
10. Azkaban components
Open source organization: Linkedin
Application scenarios:
Azkaban is a batch workflow task scheduler, used to run a set of jobs and processes in a specific order within a workflow. It defines a KV file format to establish dependencies between tasks and provides an easy-to-use web user interface to maintain and track workflows.
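The KV job files and their dependency wiring can be sketched as follows. The job names and contents are illustrative examples (Azkaban `.job` files really do use `key=value` lines with a `dependencies` key); the stdlib-only parser and ordering function below are a hypothetical sketch, not Azkaban code.

```python
# Illustrative Azkaban-style .job files: key=value lines,
# with "dependencies" wiring jobs into a workflow DAG.
job_files = {
    "extract.job":   "type=command\ncommand=echo extract",
    "transform.job": "type=command\ncommand=echo transform\ndependencies=extract",
    "load.job":      "type=command\ncommand=echo load\ndependencies=transform",
}

def parse(text):
    # Parse one KV job file into a dict.
    return dict(line.split("=", 1) for line in text.splitlines() if "=" in line)

jobs = {name.removesuffix(".job"): parse(body) for name, body in job_files.items()}

def run_order(jobs):
    # Topologically order jobs so each one runs after its dependencies.
    order, done = [], set()
    def visit(name):
        if name in done:
            return
        for dep in jobs[name].get("dependencies", "").split(","):
            if dep:
                visit(dep.strip())
        done.add(name)
        order.append(name)
    for name in jobs:
        visit(name)
    return order

print(run_order(jobs))  # ['extract', 'transform', 'load']
```

The scheduler's job is essentially this ordering plus execution, retries, and the web UI for tracking each run.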
11. Mahout components
Open source organization: Apache Software
Application scenarios:
Mahout provides scalable implementations of classic machine learning algorithms, aiming to help developers create intelligent applications more conveniently and quickly. Mahout includes many implementations, covering clustering, classification, recommendation filtering, and frequent itemset mining.
12. ZooKeeper components
Open source organization: Apache Software
Application scenarios:
ZooKeeper is a distributed, open source coordination service for distributed applications, an open source implementation of Google's Chubby, and an important component of Hadoop and HBase. It is software that provides consistency services for distributed applications, including configuration maintenance, naming services, distributed synchronization, and group services.
4. Technology stack classification
Storage system: Hadoop-HDFS, HBase, MongoDB, Cassandra
Computing system: Hadoop-MapReduce, Spark, Storm, Flink
Data synchronization: Sqoop, DataX
Resource scheduling: YARN, Oozie, Zookeeper
Log collection: Flume, Logstash, Kibana
Analysis engine: Hive, Impala, Presto, Phoenix, SparkSQL
Cluster monitoring: Ambari, Ganglia, Zabbix