2 Minutes of Plain Talk: What Is a Big Data Architecture? Everyone Can Understand It

Problem Background

In the reader community (50+) of the 40-year-old architect Nien, Nien has been guiding readers in writing resumes and preparing for interviews, and the highest annual salary achieved is nearly 100W (about 1,000,000 RMB).

Yesterday, I guided an Ali P6 engineer in writing his resume and preparing for an interview. While helping him dig out the highlights of his resume, we found that his project has no technical highlights on the Java side; the core highlight of his project lies in big data.

That's right, big data. But many friends are afraid of big data, worrying that they cannot master it.

Nien wants to say that big data is actually very simple, so you don’t need to be afraid.

So Nien has written this article for everyone: 2 minutes of plain talk explaining what a big data architecture is.

  • From now on, you will no longer fear big data
  • You may even, starting from this article, fall in love with big data

What Is Big Data?

For example, a large-scale big data governance platform project led by Nien once had a huge data scale:

  • Data volume: 10 PB
  • More than 10 billion data records

For example, another large-scale search platform project led by Nien:

  • Data volume: 1 PB
  • More than 100 million data records

Once the data scale grows this large, tasks take a long time to execute. For example, on the large-scale search platform project, how long do you think the execution cycle is for a high-frequency, routine, and important index refresh task?

  • You might think it takes 1 hour
  • It actually takes 10 days

The core of the big data framework

Faced with such a huge amount of data, how to store it and how to use large-scale server clusters to process and compute it is the core of big data technology.

Big data technology discusses how to use more computers to meet large-scale data computing requirements.

The core of the big data framework is to divide and conquer, both in terms of storage and computing.

Therefore, Nien divides the core of the big data framework into two aspects:

  • An ultra-large-scale distributed storage framework
  • An ultra-large-scale distributed computing framework

Ultra-large-scale distributed storage framework: HDFS distributed file storage architecture

How to store hundreds of terabytes or hundreds of petabytes of data and manage them uniformly through a file system is a great challenge in itself.

Large-scale data computing must first solve the problem of large-scale data storage.

The architecture of HDFS has two major core roles, as shown in the figure below:

The NameNode server acts like a file control block: it manages file metadata, that is, it records information such as file names, access permissions, and the storage addresses of data blocks, while the actual file data is stored on the DataNode servers.

DataNodes store file data in blocks. In Hadoop 2.x and later, the block size can be specified with a configuration parameter (dfs.blocksize); the default is 128 MB, and some deployments raise it, for example to 256 MB, for workloads dominated by very large files.

How do NameNode and DataNode work together?

The details are as follows:

▲Figure 31-1 HDFS architecture

  • All block information, such as the block ID and the IP address of the server holding the block, is recorded on the NameNode server;
  • The actual block data is stored on the DataNode servers.

HDFS can combine thousands of servers into a unified file storage system. Among these thousands of servers, the NameNode and the DataNodes cooperate with each other, and DataNodes of course account for the vast majority.

How does HDFS achieve high storage reliability? To ensure that files are not lost when a hard disk or server fails, HDFS replicates data blocks: each block is stored on multiple servers, and even across multiple racks.
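
To make this division of labor concrete, here is a minimal client-side sketch in Java using the Hadoop FileSystem API: the client asks the NameNode only for metadata (block size, replication factor, and which DataNodes hold each block), while the block contents themselves would be read from the DataNodes. The NameNode address and the file path are placeholders for this example.

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockInspector {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode address; metadata requests go to the NameNode
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        Path file = new Path("/data/sample.log");  // hypothetical file
        FileStatus status = fs.getFileStatus(file);
        System.out.println("block size  : " + status.getBlockSize());
        System.out.println("replication : " + status.getReplication());

        // For every block, the NameNode returns the DataNodes that hold a replica
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}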

Ultra-large-scale distributed computing framework

An ultra-large-scale distributed computing framework can be classified along many dimensions. Here, Nien simply divides it into:

  • File-based ultra-large-scale computing
  • Memory-based small- and medium-scale computing

File-based ultra-large-scale computing is simple in concept: intermediate data produced during the computation is written out to files.

  • The advantage is that the computing scale can grow almost without limit, because disk capacity can, in theory, be expanded indefinitely
  • The disadvantage is that it is slow, because disk I/O has low performance

Memory-based computing is equally simple in concept: intermediate data produced during the computation is kept in memory.

  • The disadvantage is that the computing scale cannot grow without limit, because memory is comparatively small
  • The advantage is that it is fast, because memory I/O has high performance

Ultra-large-scale distributed computing framework: MapReduce

The ultimate purpose of storing data on HDFS is still computation: useful results are obtained through data analysis or machine learning. However, if an application treats HDFS like an ordinary file system, reads the data out of the files, and then computes on it, then for big data scenarios that need to process hundreds of terabytes in one pass, there is no telling when the computation would ever finish.

What is MapReduce?

The core idea of MapReduce is to split the data into slices and compute on each slice.

MapReduce starts the same computing program on many servers; the program process on each server is responsible for the data blocks stored on that server, so a large amount of data can be computed in parallel.

The data in each block is independent, however. What if the blocks need to be combined in a calculation? MapReduce handles this with the shuffle step.

The responsibility of shuffle is actually simple: shuffle is the transmission and routing of data. Intuitively, shuffle moves the intermediate data produced during the computation to the nodes that need it for the subsequent calculation.

Therefore, generally speaking, a MapReduce big data processing flow consists of three stages:

  • map
  • shuffle
  • reduce

MapReduce divides the computation process into three parts

Specifically:

  • The map stage: multiple map processes are started across the servers. Each map first reads its local data blocks, computes on them, and outputs a set of <key, value> pairs;
  • The shuffle stage: shuffle sends pairs with the same key to the same reduce process, so that the related data can be combined in the reduce stage;
  • The reduce stage: MapReduce starts multiple reduce processes across the servers, which aggregate the <key, value> sets output by all of the maps.

Processing example of WordCount

Let's take the classic WordCount, which counts how often each word appears across all the data, as an example to understand the map and reduce processing, as shown in Figure 31-2.

Assume the original data is split into two blocks; the MapReduce framework then starts two map processes, each of which reads in one block.

  • The map function splits the input data into words and outputs a <key, value> pair such as <word, 1> for each word.
  • The shuffle step then sends pairs with the same key to the same reduce process.
  • Finally comes the reduce step. Its input has the form <key, value list>, that is, all the values that share a key are combined into a value list.

In this example, the value list is a list of 1s. Reduce sums these 1s to get the word frequency of each word.

A specific MapReduce program is as follows:

(In fact, big data is very simple; many readers are just too intimidated by it.)

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Mapper: splits each input line into words and emits <word, 1>
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer: sums the 1s of each word to obtain its frequency
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }
}
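
The class above only defines the map and reduce logic. As a minimal sketch of how it would be submitted to the cluster using the standard Hadoop Job API (the class name WordCountDriver and the HDFS input and output paths are placeholders, not part of the original article):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);

        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setCombinerClass(WordCount.IntSumReducer.class);  // optional map-side pre-aggregation
        job.setReducerClass(WordCount.IntSumReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Placeholder HDFS input and output paths
        FileInputFormat.addInputPath(job, new Path("/input/words"));
        FileOutputFormat.setOutputPath(job, new Path("/output/wordcount"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Setting the reducer as the combiner is an optional optimization: it pre-sums the 1s on the map side before the shuffle, reducing the amount of data transferred.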

The above describes how the map and reduce processes cooperate to complete data processing. So how are these processes started on a distributed server cluster?

The close relationship between HDFS and MapReduce

Both HDFS and MapReduce are components of Hadoop, and the two have a very close relationship, as the figure below shows.

▲Figure 31-3 MapReduce1 calculation process

Task scheduling in MapReduce (version 1) mainly involves two process roles: JobTracker and TaskTracker. The JobTracker plays the master role, and there is only one per MapReduce cluster; the TaskTracker plays the worker role, and one starts alongside the DataNode on every server in the cluster.

After the MapReduce application's JobClient starts, it submits the job to the JobTracker (master). The JobTracker analyzes the job's input file path to determine which DataNode servers should start map processes, and then sends task commands to the TaskTrackers (workers) on those servers.

After a TaskTracker (worker) receives a task, it starts a TaskRunner process to download the program for that task, loads the map function from the program via reflection, reads the data blocks assigned in the task, and performs the map computation.

After the map computation is completed, the TaskTracker (worker) shuffles the map output, and the target workers then start TaskRunners again to load the reduce function for the subsequent computation.

Memory-based small- and medium-scale computing: Spark, a fast big data computing architecture

MapReduce mainly uses hard disks to store intermediate data during computation. Although its reliability is relatively high, its performance is poor.

In addition, MapReduce can only use map and reduce functions for programming. Although it can complete various big data calculations, the programming is more complicated.

Moreover, due to the relatively simple programming model of map and reduce, complex calculations must be completed by combining multiple MapReduce jobs, further increasing the difficulty of programming.

In short, MapReduce is file-based ultra-large-scale computing. It is very simple: the intermediate data produced during the computation is written out to files.

  • The advantage is that the computing scale can grow almost without limit, because disk capacity can, in theory, be expanded indefinitely
  • The disadvantage is that it is slow, because disk I/O has low performance

Therefore, memory-based computing evolved.

Memory-based computing is equally simple in concept: intermediate data produced during the computation is kept in memory.

  • The disadvantage is that the computing scale cannot grow without limit, because memory is comparatively small
  • The advantage is that it is fast, because memory I/O has high performance

Spark evolved on the basis of MapReduce as memory-based computing.

Spark mainly keeps intermediate data in memory, which shortens execution time; in some scenarios, performance can improve a hundredfold.

Spark's RDD data model

Instead of files, Spark abstracts its own data model, called the RDD.

RDD stands for Resilient Distributed Dataset. It is the most basic data abstraction in Spark, representing an immutable, partitionable collection whose elements can be computed in parallel.

In Spark, all operations on data boil down to creating RDDs, transforming existing RDDs, and calling actions on RDDs to evaluate them.

Each RDD is divided into partitions, which are processed on different nodes in the cluster.

RDD can contain objects of any type in Python, Java, and Scala, and even user-defined objects.

RDDs have the characteristics of a dataflow model: automatic fault tolerance, location-aware scheduling, and scalability.

RDD allows users to explicitly cache the working set in memory when executing multiple queries, and subsequent queries can reuse the working set, which greatly improves the query speed.

RDDs support two kinds of operations: transformations and actions.

A transformation is an operation that returns a new RDD, such as map() and filter(); an action is an operation that returns a result to the driver program or writes the result to an external system, such as count() and first().

Spark adopts lazy evaluation: an RDD is only actually computed the first time it is used in an action.

This lets Spark optimize the whole computation. By default, Spark recomputes an RDD each time you run an action on it. If you want to reuse the same RDD across multiple actions, you can call RDD.persist() to ask Spark to cache it.
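
Here is a minimal sketch of transformations, actions, and RDD.persist() using Spark's Java API; the HDFS path and the ERROR filter are made-up examples, not part of the original article.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.storage.StorageLevel;

public class RddCacheDemo {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("RddCacheDemo");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Transformations: they lazily describe new RDDs, nothing runs yet
        JavaRDD<String> lines = sc.textFile("hdfs://namenode:8020/logs/access.log");
        JavaRDD<String> errors = lines.filter(line -> line.contains("ERROR"));

        // Cache the working set in memory so the two actions below can reuse it
        errors.persist(StorageLevel.MEMORY_ONLY());

        // Actions: these trigger the actual computation
        long errorCount = errors.count();
        String firstError = errors.first();

        System.out.println(errorCount + " error lines, first: " + firstError);
        sc.stop();
    }
}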

Spark's lightweight MapReduce-style functions

The main programming model of Spark is the RDD, the Resilient Distributed Dataset. Many common big data computing functions are defined on RDDs, and these functions can express fairly complex big data computations with very little code.

In Spark, the RDD takes the place that files occupy in MapReduce.

Spark's processing model is similar to MapReduce's and can be understood as a lightweight set of MapReduce-style functions.

If the WordCount from the earlier example is written with Spark, only three lines of code are needed:

val textFile = sc.textFile("hdfs://...")
val counts = textFile.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")

First, data is read from HDFS to construct the RDD textFile. Then three operations are performed on this RDD:

  • First, each line of the input text is split into words by spaces;
  • Second, each word is converted into a <Key, Value> structure, for example word → (word, 1);
  • Third, the pairs with the same Key are aggregated, and the aggregation is a sum over the Values. Finally, the RDD counts is written back to HDFS to complete the output.

In the code above, flatMap, map, and reduceByKey are all Spark RDD transformation functions. The result of a transformation function is still an RDD, so the three calls can be chained together, and the final result is still an RDD.

Spark generates an execution plan for the computing job from the transformation functions in the program, and this execution plan is a DAG (directed acyclic graph). Spark can complete a very complex big data computation in a single job. An example of a Spark DAG is shown in Figure 31-8.

▲Figure 31-8 Spark RDD directed acyclic graph DAG example

In Figure 31-8, A, C, and E are RDDs loaded from HDFS. A is transformed by the groupBy grouping function to obtain RDD B; C is transformed by the map function to obtain RDD D; D and E are merged by the union transformation to obtain RDD F; and B and F are joined by the join transformation to obtain the final result, RDD G.

How to handle big data as if writing SQL: the Hive data warehouse architecture

The problem:

Whether you use file-based MapReduce or memory-based Spark, you have to write long chains of processing operators and manage the complex ordering dependencies between them, so development is relatively complicated.

Traditionally, SQL is the main tool for data analysis. If MapReduce jobs could be generated automatically from SQL, the barrier to applying big data technology in data analysis would be greatly lowered.

Hive is one such tool.

Translate SQL into MapReduce program code

What Hive does is exactly that: translate SQL into MapReduce program code.

Let's see how Hive converts the following common SQL statement into MapReduce calculation.

SELECT pageid, age, count(1) FROM pv_users GROUP BY pageid, age;

This is a common SQL statistical analysis statement, which is used to count the interests and preferences of users of different ages when visiting different web pages. The specific data input and execution results are shown in the figure.

▲Figure 31-4 Examples of SQL statistical analysis input data and execution results

Looking at this example, you will find that this computation scenario is very similar to WordCount.

Indeed it is. We can use MapReduce to process this SQL, as shown in the figure.

▲Figure 31-5 Example of MapReduce completing SQL processing

The key output by the map function is the row record of the table, and the value is 1. The reduce function counts identical rows, that is, it sums the value sets that share the same key, and thereby produces the SQL output result.
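
To make the analogy with WordCount explicit, here is a simplified, hand-written sketch of the map side. This is not the code Hive actually generates; it assumes, purely for illustration, that each input line is a pv_users row in the form "pageid<TAB>age".

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Simplified sketch, not Hive's generated code: the map side of the GROUP BY above,
// assuming each input line is a pv_users row in the form "pageid<TAB>age".
public class PvUsersGroupByMapper
        extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private final Text groupKey = new Text();

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split("\t");
        groupKey.set(fields[0] + "," + fields[1]);  // key = (pageid, age), value = 1
        context.write(groupKey, one);
    }
}

The reduce side can reuse the same summing logic as IntSumReducer above: the sum of the 1s for each (pageid, age) key is exactly count(1).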

In fact, Hive has many built-in Operators, and each Operator completes a specific calculation process.

Hive assembles these Operators into a directed acyclic graph (DAG), wraps them into map or reduce functions depending on whether a shuffle is needed between them, and then submits them to MapReduce for execution.

The DAG composed of Operators is shown in the figure

▲Figure 31-6 MapReduce directed acyclic graph DAG of sample SQL

This is a SQL statement containing a WHERE condition; the WHERE condition corresponds to a FilterOperator.

Overall architecture of Hive

The overall architecture of Hive is shown in the figure.

▲Figure 31-7 Hive overall architecture

Hive table data is stored in HDFS.

The structure of a table, such as the table name, the field names, and the delimiter between fields, is stored in the Metastore.

Users submit SQL to the Driver through a Client. The Driver asks the Compiler to compile the SQL into a DAG execution plan like the one above, and then hands it over to Hadoop for execution.
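
From the client's point of view, submitting SQL to Hive looks much like using a regular database. Below is a hedged sketch using Hive's JDBC client against HiveServer2; the host, port, user, and database are placeholders for this example.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryDemo {
    public static void main(String[] args) throws Exception {
        // Hive's JDBC client talks to HiveServer2; host/port/database are placeholders
        String url = "jdbc:hive2://hiveserver2-host:10000/default";
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        try (Connection conn = DriverManager.getConnection(url, "user", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT pageid, age, count(1) FROM pv_users GROUP BY pageid, age")) {
            while (rs.next()) {
                // Behind this query, Hive compiles the SQL into a DAG of Operators
                // and runs it as MapReduce jobs on Hadoop
                System.out.printf("%s\t%s\t%d%n",
                        rs.getString(1), rs.getString(2), rs.getLong(3));
            }
        }
    }
}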

Big Data Flow Computing Architecture

Big data has two directions: offline (batch) computing and stream computing. Everything discussed above is offline computing; now let's look at stream computing.

How to process big data in seconds

Although Spark is much faster than MapReduce, in most scenarios a job still takes minutes to finish. This kind of computation is generally called big data batch processing.

In practice, it is sometimes necessary to process continuously arriving massive data within seconds or even milliseconds.

There are many such second- or millisecond-level scenarios:

  • For example, Toutiao's personalized recommendation
  • For example, analyzing the data collected by surveillance cameras in real time

This is the so-called big data stream computing.

The best-known streaming big data computing engine in the early days was Storm, and Toutiao's personalized recommendation runs on a very large Storm cluster.

Later, with the popularity of Spark, Spark Streaming, the streaming computing engine on Spark, also gradually became popular.

The architectural principle of Spark Streaming is to split the continuously arriving real-time data into small batches and hand these small batches to Spark for execution.

Because each batch of data is relatively small, and Spark Streaming stays resident in the system without restarting for every batch, a batch can be processed within milliseconds, which looks like real-time computation, as shown in the figure.

▲Figure 31-9 Spark Streaming stream computing converts real-time streaming data into small batch computing
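
As a minimal sketch of this micro-batch model using the Spark Streaming Java API (the 1-second batch interval and the socket source on localhost:9999 are assumptions made for the example, not part of the original article):

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

import scala.Tuple2;

public class StreamingWordCount {
    public static void main(String[] args) throws Exception {
        SparkConf conf = new SparkConf().setAppName("StreamingWordCount");
        // Every 1 second, the data received so far is packaged into a small batch (an RDD)
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));

        // Assumed source: a text stream arriving on a local socket
        JavaDStream<String> lines = jssc.socketTextStream("localhost", 9999);

        JavaPairDStream<String, Integer> counts = lines
                .flatMap(line -> Arrays.asList(line.split(" ")).iterator())
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey(Integer::sum);

        counts.print();  // each micro-batch is processed by the ordinary Spark engine
        jssc.start();
        jssc.awaitTermination();
    }
}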

The architectural principle of Flink, a big data engine that has become popular in recent years, is very similar to Spark Streaming's. Depending on the data source, the data volume, and the computing scenario, it can flexibly adapt to both stream computing and batch computing.

As for learning Flink, the Nien team will later write a super awesome "Flink Study Bible" for everyone.

We will not expand on it here.

After the 2-minute plain talk

After reading this article's 2-minute plain talk, you basically know what a big data architecture is.

So big data is actually very simple, and there is no need to be afraid of it.

From now on, you no longer need to fear big data. You may even, starting from this article, fall in love with big data.

If you run into problems while learning, you can talk them over with Nien.

If you want to learn more about big data, you can read the "Nin's Big Data Interview Collection" soon to be launched by Nien's team. For details, see Nien's official account, listed at the end of the article: Technology Freedom Circle.

Recommended related reading

" Nin's Big Data Interview Collection Topic 1: The Most Complete Hadoop Interview Questions in History "

" Nin's Big Data Interview Collection Topic 2: Top Secret 100 Spark Interview Questions, Memorized 100 Times, Get a High Salary "

" Nin's Big Data Interview Collection Topic 3: The Most Complete Hive Interview Questions in History, Continuously Iterating and Continuously Upgrading "

" Starting from 0, Handwriting Redis "

" Starting from 0, Handwriting MySQL Transaction ManagerTM "

" Starting from 0, Handwriting MySQL Data Manager DM "

"Nin's Architecture Notes", "Nin's High Concurrency Trilogy" and "Nin's Interview Collection" PDF, please go to the following official account [Technical Freedom Circle] to take it↓↓↓


Origin blog.csdn.net/crazymakercircle/article/details/132030127