2019/08/16 Hadoop basics (01)

Big data: data is commonly divided into three types:
structured data: has a fixed schema and constraints (typical of relational databases)
semi-structured data: xml, json, yaml; carries some structure but no rigid, predefined data model
unstructured data: no metadata describing it, e.g. log text

Search engine: made up of a search component and an index component (the index is generally kept in distributed storage),
plus a spider (crawler) program; the crawled data is unstructured or semi-structured.
A search engine builds an inverted index for retrieval.
ELK works the same way: any document to be searched must first be analyzed; the analysis is done by an analyzer, which performs word segmentation (tokenization) and normalization, and the normalized terms are used to generate the index.
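As a rough illustration of that idea (a minimal sketch, not how ELK or Lucene actually implement it), an inverted index can be built by tokenizing and normalizing each document and mapping every term to the documents that contain it:

```python
from collections import defaultdict

def analyze(text):
    """Toy analyzer: tokenize on whitespace and normalize to lowercase."""
    return [token.strip(".,!?").lower() for token in text.split()]

def build_inverted_index(docs):
    """Map each normalized term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index

docs = {1: "Hadoop stores data in HDFS",
        2: "MapReduce processes data stored in HDFS"}
index = build_inverted_index(docs)
print(index["hdfs"])   # {1, 2} -> documents that contain the term
```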

Two problems to solve:
Storage: (GFS / HDFS, below)
Analysis and processing: (MapReduce, below)

2003: Google published "The Google File System".
GFS does not support random real-time access to data; it is only suitable for storing a small number of huge files.
2004: "MapReduce: Simplified Data Processing on Large Clusters" (a framework for parallel processing of PB-scale data: a job is split into task units that run in parallel on the nodes, the results of each node are collected, and a second round of processing merges them into the final output).
2006: "BigTable: A Distributed Storage System for Structured Data"
(a distributed storage system for structured data; it does not follow the traditional relational paradigm and stores data in a key-value-like format).
HDFS + MapReduce = Hadoop: HDFS provides the underlying distributed storage, MapReduce provides the complete distributed data processing framework;
HDFS is Hadoop's distributed file system.
HBase is Hadoop's database.
Nutch is a web crawler written specifically to feed data to Lucene, the open-source search engine library: the crawled data is imported into Lucene, and Lucene builds the index. But as the data grew, storage became the problem, and then Google published its papers.
Hadoop has a weakness: MapReduce is a batch-processing framework, limited by the storage mechanism, so its processing speed is poor; storage goes through a central node.
When a user wants to process data on the distributed storage system with MapReduce, they have to program against the MapReduce development framework themselves, write a program using this library, and then submit it to the MapReduce framework to run.
MapReduce therefore means three things: a development API, a running framework, and a runtime environment.
The programmer writes the code, and one program run is split into multiple instances that run at the same time, one copy on each node; each instance only processes the data that is local to its node. By default each block gets 2 extra copies, so the data exists in 3 copies in total, and the distributed framework has to look up which nodes hold the data.
After the data is processed on each node, the partial results have to be merged.
However, one node may finish its block quickly while another node is slow.
Map, then reduce: the second phase can only start when the slowest of the many map instances has finished, and a failed instance is restarted, so the overall speed is slow. Hadoop's MapReduce cannot give real-time feedback; a job submitted today may only return results the next day.
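A minimal sketch of that barrier (illustration only, not Hadoop's scheduler): the reduce/merge step cannot start until every map instance, including the slowest one, has returned.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def map_task(block_id, delay):
    """Pretend to process one local data block; delay simulates a slow node."""
    time.sleep(delay)
    return f"partial result of block {block_id}"

with ThreadPoolExecutor(max_workers=3) as pool:
    futures = [pool.submit(map_task, i, d) for i, d in enumerate([0.1, 0.2, 2.0])]
    # The barrier: we must wait for ALL map instances before reducing,
    # so the slowest block (2.0s here) dictates when the reduce phase starts.
    partials = [f.result() for f in futures]

merged = "; ".join(partials)   # the merge / "reduce" step
print(merged)
```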
NAS and SAN are shared-storage solutions.
Their defect: there is only one storage system. When large amounts of data have to be accessed, many front-end nodes need to do the processing, and network and disk IO become a serious bottleneck. That is why distributed storage appeared later; NAS/SAN is the centralized, traditional solution.
Distributed file systems with a central node: HDFS, GFS. One node is dedicated to being the metadata server (it must be highly available; the metadata lives in memory and needs to be persisted). In HDFS the metadata server is called the name node, NN (namenode),
and the data nodes are called DN (datanode).
There are also distributed file systems without a central node.
As with InnoDB storage, durability of the metadata depends on a transaction (edit) log.
NN: the name node stores, for each file name, the data blocks the file consists of and where they are located. Each file is split into chunks (blocks) that are stored as independent storage units; each chunk is stored on some node and replicas of it are also created. Consistency is checked much like in an ordinary file system: if there is metadata but no corresponding data, the metadata is deleted; data that has no metadata is marked as orphaned blocks. After a crash the name node has to reload the on-disk files and then wait for each data node to report the list of blocks it holds before the file-system check is complete. If the data reaches the TB level this can take more than half an hour,
which means that once the metadata server crashes, it takes on the order of half an hour before it can serve again.
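A toy picture of the two mappings the name node keeps in memory (purely illustrative; real HDFS metadata structures are far richer, and the names below are made up):

```python
# Hypothetical, simplified view of namenode metadata:
# file name -> ordered list of block ids, and block id -> nodes holding a replica.
file_to_blocks = {
    "/logs/app.log": ["blk_001", "blk_002"],
}
block_to_datanodes = {
    "blk_001": {"dn1", "dn3", "dn4"},   # 3 replicas by default
    "blk_002": {"dn2", "dn3", "dn5"},
}

def locate(path):
    """Answer a client's question: which nodes do I read each block from?"""
    return [(blk, sorted(block_to_datanodes[blk])) for blk in file_to_blocks[path]]

print(locate("/logs/app.log"))
```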
HDFS uses a further mechanism: an extra node, the secondary name node (SNN). Every metadata update made by the name node is appended to an edit log, and the edit log periodically has to be merged with the image file;
that merging is done by the secondary name node using the content on shared storage. If the master name node goes down, the secondary node loads the merged metadata from shared storage into memory and then waits for the block reports, so part of the recovery time is saved. These are the problems HDFS1 ran into.
With massive amounts of data, even if the metadata itself is not huge and not accessed extremely often, a single name node can still become a bottleneck, which is why the federation mechanism was studied later.
Federation: a file system is basically a tree structure; in federation each name node is responsible for one branch of the whole file system and only sees that fragment. If, say, four branches are added, the branch on the right is maintained by the server on the right and the branch on the left by the server on the left; user requests for data on the left go to the left server, and requests for the right side go to the other server.
But even this is not a perfect solution.
HDFS2 solves it: the NN becomes highly available.
The metadata is no longer stored only in the local memory of one node; instead a piece of shared storage holds it. The common choice for this shared storage was NFS; later ZooKeeper was used (ZooKeeper is particularly good at supporting distributed applications and can resolve their dependencies automatically). So the metadata can be stored on ZooKeeper, the update operations of the second node also go through ZooKeeper, and every client gets a consistent view from ZooKeeper, so accessing either of the two nodes is not a problem. With ZooKeeper the two recorded name nodes run as one active and one standby, and if the active one has a problem the other replaces it.
This is how HDFS2 provides a good solution for high availability of the name node.

Data nodes mainly store the chunks (blocks). Their data is not protected by making the node itself highly available but by data replicas: every chunk stored on one node is also stored on other nodes, three copies by default, one primary and two backups. The client writes one copy, and HDFS decides where to place the other two; replication is done in a chain, the first node passes the data to the second and the second to the third, which is why it is called chained replication. After each replication step completes, the node reports back, and the metadata node updates its list of which nodes hold the block.
Any node failure leaves some data blocks with too few replicas, and the metadata node has to learn this as soon as possible, because the missing replicas must be made up; otherwise another failure could lose the data for good. The data nodes therefore keep reporting their state and their block lists to the metadata node. The metadata node maintains two lists: for each data block, which nodes hold it, and for each node, which data blocks it holds.
If a node has not reported within the timeout, or the block list it reports is inconsistent with what the metadata node has recorded, the node's data is considered lost; the replication chain is then restarted to bring the under-replicated blocks back up to the required count. This is how block availability is maintained, and these are the basic operating rules of HDFS.
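A toy sketch of the chained replication idea described above (illustration only; the node names and the in-memory "cluster" are made up, not HDFS code):

```python
# Pretend cluster: node name -> set of blocks stored on that node.
cluster = {"dn1": set(), "dn2": set(), "dn3": set()}

def chained_write(block_id, pipeline):
    """Write a block along a replication chain: dn1 -> dn2 -> dn3.
    Each node stores the block, then forwards it to the next node in the chain;
    acknowledgements flow back once the last node has stored its copy."""
    if not pipeline:
        return []                      # end of chain, start acking backwards
    head, *rest = pipeline
    cluster[head].add(block_id)        # this node persists its replica
    acks = chained_write(block_id, rest)
    return acks + [head]               # ack propagates back toward the client

acks = chained_write("blk_001", ["dn1", "dn2", "dn3"])
print("replicas on:", [n for n, blocks in cluster.items() if "blk_001" in blocks])
print("ack order:", acks)              # last node acks first, then back up the chain
```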
With storage solved, how is the data processed? That is what MapReduce is for. MapReduce also works as a cluster: it treats the nodes that store the data as nodes of the compute cluster, and each of them can be used to run programs.
MapReduce is three things: 1. a development API, 2. a running framework, 3. a runtime environment.
How is a task mapped? After a job is submitted, it has to be divided into several map tasks, which is controlled by the MapReduce framework, so a master control node is needed.
The MapReduce cluster also needs a node that accepts user job submissions and schedules them: when a user wants to run a job, they ask this head node to launch the tasks. The map tasks have to run on multiple nodes, ideally the nodes that hold the data, but it is not always possible to run exactly where the data is, because the number of tasks one node can run is limited and some nodes may already be running tasks when the job starts. The tasks each node runs are Java programs, and a Java program usually needs a lot of system resources, so each node defines in advance how many tasks it can run.
That scheduling node is the jobtracker. The jobs the jobtracker tracks all run on the datanodes, but in the jobtracker's context those nodes are not called datanodes; they are called tasktrackers (task trackers).
So every data node has to run two kinds of processes: datanode, responsible for data management operations such as storing or deleting data, and tasktracker, responsible for completing the compute tasks, which belongs to MapReduce.
So Hadoop is the combination of two kinds of clusters that share the same data-storing nodes: every node that stores data also processes data. How does this processing model differ from the traditional one?
Traditional: the program decides where it runs, and the data has to be loaded to wherever the program is.
Today: wherever the data is, that is where the program runs.
A small-scale Hadoop cluster may look like this: N nodes each running datanode and tasktracker, and another node running the jobtracker and the namenode. tasktracker and datanode cannot be separated. The tasktrackers used for one job are not necessarily all the nodes in the cluster; they may be only part of them, or even a single one. To keep a single datanode from executing too many tasks, task slots determine how many tasks a node may run at once. This is Hadoop's distributed processing framework.
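A minimal sketch of the data-locality idea (hypothetical names and logic, not the jobtracker's real algorithm): prefer a node that already holds the block and still has a free task slot, otherwise fall back to any node with a free slot.

```python
def schedule(block_locations, free_slots):
    """block_locations: block id -> nodes holding a replica.
    free_slots: node -> number of task slots still available.
    Returns a block -> node assignment that prefers local execution."""
    assignment = {}
    for block, nodes in block_locations.items():
        # First choice: a node that stores the block locally and has a free slot.
        local = [n for n in nodes if free_slots.get(n, 0) > 0]
        # Fallback: any node with a free slot (the data then travels over the network).
        candidates = local or [n for n, s in free_slots.items() if s > 0]
        node = candidates[0]
        assignment[block] = node
        free_slots[node] -= 1
    return assignment

print(schedule({"blk_1": {"dn1", "dn2"}, "blk_2": {"dn2", "dn3"}},
               {"dn1": 1, "dn2": 1, "dn3": 0}))
```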
To understand MapReduce, first understand functional programming: a function can take other functions as parameters.

Functional programming:
Lisp, ML are functional programming languages; they use higher-order functions;
map (one task is mapped into multiple tasks), fold (folding)

map: (Python supports functional programming)
map(f, list)
map: accepts a function as a parameter and applies it to every element of the list, thereby generating a result list;
i.e. it runs the function once on each element of the list, for example on
["Ou Yangfeng", "Dongfang Bubai", "Saodi Seng", "Dugu Qiubai"]

fold: fold then folds over the result list above.
It accepts two parameters: a function and an initial value.
The function is applied to one element of the list together with the accumulated result, and that result is carried to the next element;
fold is given a function g and an initial value init:
fold(g, init)
["Hamo Gong", "Kuihua Baodian", "Yijin Jing", "Dugu Jiujian"]
intermediate results: second, third, ...
g is applied to the initial value and the first element, and the result obtained is treated as the new init
(when the first element is processed, it is handled by g with the initial value as a parameter; the result obtained is the second value, "second"; using second as the accumulated value, processing the next element generates the third value, "third"; g is then applied to the next element with third as its parameter, and so on until the final result is generated; see the sketch below)

Final result: Kuihua Baodian
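A minimal Python sketch of fold using functools.reduce; the "stronger skill wins" comparison and the ranking table are invented stand-ins for g, just to show how the accumulator flows:

```python
from functools import reduce

# Invented ranking, only to give g something to compare; higher means "stronger".
rank = {"Hamo Gong": 2, "Kuihua Baodian": 4, "Yijin Jing": 3, "Dugu Jiujian": 1}

skills = ["Hamo Gong", "Kuihua Baodian", "Yijin Jing", "Dugu Jiujian"]

def g(acc, skill):
    """g(accumulator, element): keep whichever skill ranks higher.
    The result becomes the accumulator for the next element."""
    return skill if rank[skill] > rank[acc] else acc

# reduce(g, skills, init): g(init, 1st), then g(that, 2nd), then g(that, 3rd), ...
result = reduce(g, skills, "Dugu Jiujian")   # init value, chosen arbitrarily here
print(result)   # Kuihua Baodian
```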

MapReduce borrows from mechanisms such as map and fold.
mapreduce:
mapper (an instance running on a tasktracker; you have to wait for all the mappers to finish, and the intermediate result can be very large, so how is it processed next?), reducer

Count the number of occurrences of each word in a book:
mapper: one unit per 100 pages, so 5 mappers are used;
each mapper splits its unit into words; say 10,000,000 words in total, and the split-out words form the result list
reducer:
split the 10,000,000 words between two reducers
(all occurrences of the same word must go to the same reducer)
reducer1, reducer2
this 500
is 20

how 301
do 32
The mappers need to distribute the words among the reducers, but the same word may only be distributed to the same reducer;
this process is shuffle and sort: transmission and sorting.

shuffle and sort

mapper: cuts the whole book into words and converts them into key-value (kv) data;
the same key appears multiple times:
this 1, is 1, this 1, how 1

The same key can only be sent to the same reducer, and a key can only live on one of the reducers; the reducer in turn generates key-value data:
reducer:
this 500
is 20

Sometimes the result produced by one MapReduce pass is only partial and the final result needs to be processed again, so a job may run through MapReduce N times.

The large data set is cut into multiple data blocks, converted into key-value data, and sent to the mappers; a mapper takes key-value input and, after processing, outputs key-value data again.
The number of reducers does not have to match the number of mappers, but the same key can only be sent to one reducer;
the process in the middle is shuffle and sort.
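A compact, self-contained sketch of the whole word-count flow described above: map each chunk to (word, 1) pairs, shuffle so that all pairs with the same key land in the same group, then reduce each group by summing. This is illustration only, not Hadoop's API.

```python
from collections import defaultdict
from itertools import chain

def mapper(chunk):
    """Emit a (word, 1) pair for every word in this chunk of the book."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(mapped_pairs):
    """Group pairs by key so that the same word always goes to the same reducer."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    """Sum the counts for one word."""
    return key, sum(values)

chunks = ["this is how this works", "how do this and this work"]
mapped = chain.from_iterable(mapper(c) for c in chunks)           # map phase
counts = dict(reducer(k, v) for k, v in shuffle(mapped).items())  # shuffle + reduce
print(counts)   # e.g. {'this': 4, 'is': 1, 'how': 2, ...}
```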

Origin blog.csdn.net/qq_42227818/article/details/99688053