Basic knowledge of big data --- Hadoop ecosystem

Basic knowledge points of big data:
Java
List features: elements keep the order in which they were inserted, and duplicate elements are allowed. Set features: elements have no guaranteed order, and duplicate elements are not allowed.
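A minimal Java sketch of the difference (class name and values are made up for illustration):

    import java.util.ArrayList;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    public class ListSetDemo {
        public static void main(String[] args) {
            List<String> list = new ArrayList<>();
            list.add("a");
            list.add("a");                    // duplicate is kept, insertion order is preserved
            System.out.println(list);         // [a, a]

            Set<String> set = new HashSet<>();
            set.add("a");
            boolean added = set.add("a");     // duplicate is rejected
            System.out.println(set + " second add returned " + added);  // [a] second add returned false
        }
    }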
The three normal forms of database design: first normal form (every column is atomic), second normal form (every non-key column fully depends on the primary key), third normal form (no non-key column depends transitively on the primary key through another non-key column).
Object and reference: a reference variable that has only been declared does not yet point to an object; only after it is initialized (for example with new) does it refer to an actual object.
ArrayList and Vector: both store their elements in an array and automatically grow the internal array as elements are added or inserted. Both allow direct access to elements by index, but inserting data involves shifting elements in the array and other memory operations, so indexed access is fast while insertion is slow.
The difference between them is that Vector's methods are synchronized (using synchronized).
   LinkedList uses a doubly linked list for storage. Accessing an element by index requires traversing forward or backward, but inserting an element only requires updating the links of its neighbours, so insertion is faster.
   If you only look up elements at specific positions, or only add and remove elements at the end of the collection, either a Vector or an ArrayList will do. If you frequently insert and delete at other positions, LinkedList is the better choice.
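A minimal sketch of the trade-off (class name and sizes are made up for illustration):

    import java.util.ArrayList;
    import java.util.LinkedList;
    import java.util.List;

    public class InsertDemo {
        public static void main(String[] args) {
            List<Integer> arrayList = new ArrayList<>();
            LinkedList<Integer> linkedList = new LinkedList<>();
            // Inserting at the head of an ArrayList shifts every existing element,
            // while a LinkedList only rewires the neighbouring nodes.
            for (int i = 0; i < 100_000; i++) {
                arrayList.add(0, i);      // O(n) per insert
                linkedList.addFirst(i);   // O(1) per insert
            }
            // Indexed access is the opposite trade-off: O(1) for ArrayList vs O(n) for LinkedList.
            System.out.println(arrayList.get(50_000) + " " + linkedList.get(50_000));
        }
    }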
The difference between HashMap and Hashtable and their advantages and disadvantages:
     The methods of Hashtable are synchronized, while the methods of HashMap are not, so in a multi-threaded environment HashMap needs an additional synchronization mechanism.
    Hashtable allows neither null keys nor null values, while HashMap allows a null key and null values, so with HashMap you should use containsKey() to check whether a key exists.
Hashtable uses Enumeration to traverse its elements, while HashMap uses Iterator.
     Hashtable is a subclass of Dictionary; HashMap is an implementation class of the Map interface.
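A minimal sketch of the null-handling and synchronization differences (in multi-threaded code, Collections.synchronizedMap() or ConcurrentHashMap is usually preferred over Hashtable):

    import java.util.Collections;
    import java.util.HashMap;
    import java.util.Hashtable;
    import java.util.Map;

    public class MapDemo {
        public static void main(String[] args) {
            Map<String, String> hashMap = new HashMap<>();
            hashMap.put(null, "ok");          // HashMap accepts one null key and null values
            hashMap.put("k", null);
            // Because values may be null, check existence with containsKey() rather than get() != null.
            System.out.println(hashMap.containsKey("k"));   // true

            // A synchronized alternative to HashMap for multi-threaded use:
            Map<String, String> syncMap = Collections.synchronizedMap(new HashMap<>());
            syncMap.put("k", "v");

            Map<String, String> hashtable = new Hashtable<>();
            try {
                hashtable.put(null, "boom");  // Hashtable rejects null keys and values
            } catch (NullPointerException e) {
                System.out.println("Hashtable does not allow null");
            }
        }
    }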
When you need to modify strings repeatedly, use StringBuffer instead of String. String is read-only (immutable): every modification creates a new temporary object, whereas StringBuffer is mutable and does not generate temporary objects.
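A minimal sketch of the difference (the loop count is arbitrary):

    public class ConcatDemo {
        public static void main(String[] args) {
            // Each += on a String builds a new temporary String object.
            String s = "";
            for (int i = 0; i < 10_000; i++) {
                s += i;
            }
            // StringBuffer (synchronized) or StringBuilder (unsynchronized) mutate one buffer in place.
            StringBuffer sb = new StringBuffer();
            for (int i = 0; i < 10_000; i++) {
                sb.append(i);
            }
            System.out.println(s.length() == sb.length());   // true
        }
    }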

Hadoop
core: HDFS and MapReduce (MR)
RPC--remote procedure call protocol
HDFS (Hadoop Distributed File System) Hadoop distributed file system.
Features: ① Multiple copies of each block are kept, providing fault tolerance: if a copy is lost or a node goes down, it is recovered automatically. 3 copies are saved by default.
    ② Runs on cheap commodity machines.
    ③ Suitable for big-data processing. HDFS splits files into blocks, 64 MB per block by default. The blocks are stored across the cluster, while the block-to-file mapping (metadata) is kept in the NameNode's memory, so if there are too many small files the memory load becomes heavy.
HDFS: master/slave structure (NameNode, SecondaryNameNode, DataNode)
NameNode: the Master node, the boss. It manages the block mapping, handles client read and write requests, configures the replication strategy, and manages the HDFS namespace.
SecondaryNameNode: the assistant, sharing part of the NameNode's workload; it is a cold backup of the NameNode; it merges fsimage and edits and sends the result back to the NameNode.
DataNode: the Slave node, the worker. It stores the data blocks sent by clients and performs block read and write operations.
         Hot backup: b is a hot backup of a; if a fails, b immediately takes over a's work.
         Cold backup: b is a cold backup of a; if a fails, b cannot immediately take over, but some of a's information is stored on b to reduce the loss when a breaks.
         fsimage: the metadata image file (the directory tree of the file system).
         edits: the operation log of the metadata (records of the modifications made to the file system).
         What the NameNode holds in memory = fsimage + edits.
The SecondaryNameNode periodically (every hour by default) fetches the fsimage and edits from the NameNode, merges them, and sends the result back, reducing the NameNode's workload.
hdfs: namenode(metadata)
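A minimal sketch of how a client program might write and read a file through the HDFS Java API; the NameNode address and paths are placeholders:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // The NameNode address below is a placeholder for illustration.
            conf.set("fs.defaultFS", "hdfs://namenode-host:9000");
            FileSystem fs = FileSystem.get(conf);

            // Write a file: the client asks the NameNode for metadata,
            // then streams the blocks to the DataNodes.
            Path path = new Path("/demo/hello.txt");
            try (FSDataOutputStream out = fs.create(path, true)) {
                out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
            }

            // Read it back.
            try (BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(path)))) {
                System.out.println(in.readLine());
            }
            fs.close();
        }
    }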

Fault monitoring mechanism
(1) Node failure detection: each DataNode sends a heartbeat to the NameNode at a fixed interval (3 seconds) to prove it is working normally; if no heartbeat arrives for a certain time (10 minutes), the DataNode is considered dead.
(2) Communication failure detection: whenever data is sent, the receiver returns an acknowledgement.
(3) Data corruption detection: data is sent together with a checksum, and all DataNodes periodically report the status of the blocks they store to the NameNode.
The storage strategy of HDFS is to store one replica on a node in the local rack and the other two replicas on different nodes of another rack.
This allows the cluster to survive the complete loss of a rack. At the same time, because the blocks are stored on only two different racks, this strategy reduces data transmission between racks, improves the efficiency of write operations, and lowers the total network bandwidth needed to read data. In this way data safety and network transmission overhead are both taken into account to a certain extent.
MapReduce job running process:
1. Start a job on the client.
2. Request a Job ID from the JobTracker.
3. Copy the resources needed to run the job to HDFS, including the JAR file of the MapReduce program, the configuration file, and the input split information computed by the client.
These files are stored in a folder created by the JobTracker specifically for the job, named after the Job ID. The JAR file has 10 replicas by default (controlled by the mapred.submit.replication property);
the input split information tells the JobTracker how many map tasks should be started for this job.
4. After the JobTracker receives the job, it puts it into a job queue and waits for the job scheduler to schedule it. When the job scheduler schedules the job according to its own algorithm,
it creates a map task for each input split and assigns the map tasks to TaskTrackers for execution.
Each TaskTracker has a fixed number of map slots and reduce slots, determined by the number of cores and the memory size of the host.
It must be emphasized that map tasks are not assigned to TaskTrackers at random; there is the concept of data locality:
a map task is assigned to a TaskTracker that holds the data block it will process, and the program JAR is copied to that TaskTracker to run. This is called "moving the computation instead of moving the data".
Data locality is not considered when assigning reduce tasks.
5. The TaskTracker sends a heartbeat to the JobTracker periodically, telling it that it is still alive; the heartbeat also carries a lot of information, such as the progress of the current map tasks.
When the JobTracker receives the last task-completion message for a job, it marks the job as "successful". When the JobClient polls the status, it learns that the job is complete and displays a message to the user.
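To make the submission steps above concrete, here is a minimal driver sketch on the classic org.apache.hadoop.mapred API of that era; the input/output paths come from the command line, and the job simply copies its input because the default identity mapper and reducer are used (a full mapper/reducer example appears later, after the combiner note):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.RunningJob;

    public class PassThroughDriver {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(PassThroughDriver.class);  // step 1: build the job on the client
            conf.setJobName("pass-through");
            conf.setOutputKeyClass(LongWritable.class);           // identity mapper/reducer are the defaults,
            conf.setOutputValueClass(Text.class);                 // so this job just copies its input
            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));

            // steps 2-3: get a Job ID, copy the JAR/config/split info to HDFS, then submit;
            // runJob() polls the JobTracker and prints progress until the job finishes.
            RunningJob job = JobClient.runJob(conf);
            System.out.println("job successful: " + job.isSuccessful());
        }
    }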
The above analyses the working principle of MapReduce at the level of the client, JobTracker and TaskTracker. Now let's go into a bit more detail, from the level of the map task and the reduce task.
Map side:
1. Each input split is processed by one map task. By default the size of one HDFS block (64 MB by default) is used as a split, although the block size can also be configured.
The map output is first placed in a circular memory buffer (100 MB by default, controlled by the io.sort.mb property). When the buffer is close to overflowing
(80% of its size by default, controlled by the io.sort.spill.percent property), a spill file is created on the local file system and the data in the buffer is written to this file.
2. Before writing to disk, the thread first divides the data into as many partitions as there are reduce tasks, so that each reduce task gets the data of one partition.
This avoids the awkward situation where some reduce tasks are given a large amount of data while others get little or none; in fact, partitioning is just hashing the data.
The data in each partition is then sorted, and if a Combiner is configured, the combine operation is run on the sorted result, so that as little data as possible is written to disk.
3. When the map task writes its last record there may be many spill files, and these files need to be merged. During merging, sorting and combining are performed repeatedly, for
    two purposes: 1. minimise the amount of data written to disk each time; 2. minimise the amount of data transferred over the network in the following copy phase. The spill files are finally merged into one partitioned and sorted file.
To further reduce network traffic, the map output can be compressed here by setting mapred.compress.map.output to true.
4. The data in each partition is copied to the corresponding reduce task.
Some people may ask: how does the data in a partition know which reduce it corresponds to? The map task keeps in touch with its parent TaskTracker, and the TaskTracker keeps a heartbeat with the JobTracker,
so the JobTracker holds the global view of the cluster; a reduce task only has to ask the JobTracker for the locations of the map outputs it needs.
At this point, the map side is analyzed.
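As an illustration of the properties named above, a small sketch of how they might be set on a JobConf; the values are examples only, and newer Hadoop releases use renamed equivalents under the mapreduce.* prefix:

    import org.apache.hadoop.mapred.JobConf;

    public class MapSideTuning {
        public static void main(String[] args) {
            JobConf conf = new JobConf();
            // Ring buffer size for map output in MB (100 MB by default).
            conf.set("io.sort.mb", "200");
            // Spill to disk when the buffer is this full (80% by default).
            conf.set("io.sort.spill.percent", "0.80");
            // Compress map output to cut shuffle traffic.
            conf.setBoolean("mapred.compress.map.output", true);
            System.out.println(conf.get("io.sort.mb"));
        }
    }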
So what exactly is Shuffle? Shuffle literally means "shuffling the cards". Looking at it this way: the data produced by a map is partitioned by hashing and distributed to different reduce tasks.
Isn't that a process of shuffling the data?
Reduce side:
1. The reduce side receives data from different map tasks, and the data from each map is sorted. If the amount of data received by the reduce side is fairly small, it is kept in memory
(the buffer size is controlled by the mapred.job.shuffle.input.buffer.percent property, the percentage of heap space used for this purpose); if the amount of data exceeds a certain proportion of the buffer
(determined by mapred.job.shuffle.merge.percent), the data is merged and then spilled to disk.
2. As spill files accumulate, background threads merge them into a larger, sorted file, in order to save time in later merges. In fact, whether on the map side or the reduce side, MapReduce performs sorting and
merging over and over again; now you can see why some people say that sorting is the soul of Hadoop.
3. Many intermediate files are written to disk during merging, but MapReduce writes as little as possible, and the result of the last merge is not written to disk; it is fed directly to the reduce function.
Shuffle phase: starting from the output of the Map, including the system performing sorting and sending the Map output to the Reduce as input.
Sort stage: refers to the process of sorting the keys output by the Map side. Different Maps may output the same Key, and the same Key must be sent to the same Reduce side for processing.
The Shuffle phase can be divided into Shuffle on the Map side and Shuffle on the Reduce side.
1. Shuffle on the Map side
        When the Map function starts to produce output, it does not simply write the data to disk, because frequent disk operations would seriously degrade performance. The processing is more elaborate: the data is first written to a buffer in memory, and some pre-sorting is done there to improve efficiency.
        Each MapTask has a circular memory buffer (100 MB by default) for its output data. When the amount of data in the buffer reaches a threshold (80% by default), a background thread starts writing the contents of the buffer to disk (the spill stage).
        While the spill is in progress, the Map output continues to be written to the buffer, but if the buffer fills up during this time, the Map blocks until the spill completes.
        Before writing to disk, the thread first divides the data into the partitions corresponding to the Reducers they will be sent to. Within each partition the background thread sorts by key (quick sort), and if a Combiner (i.e. a mini reducer) is configured, it runs on the sorted output.
        Each time the buffer reaches the spill threshold a new spill file is created, so after the MapTask writes its last output record there are usually several spill files. Before the MapTask completes, the spill files are merged into one index file and one data file (multi-way merge sort) (the Sort stage).
        After the spill files are merged, the Map deletes all temporary spill files and informs the TaskTracker that the task is complete. As soon as one MapTask completes, the ReduceTasks start copying its output (the Copy stage).
        The Map output file is placed on the local disk of the TaskTracker running the MapTask; it is the input needed by the TaskTrackers running the ReduceTasks. The Reduce output, however, is generally written to HDFS (the Reduce phase).
2. Shuffle on the Reduce side
Copy stage: the Reduce process starts some data-copy threads and fetches the MapTasks' output files over HTTP from the TaskTrackers where they ran.
Merge stage: the data copied from the Map side is first put into a memory buffer. Merge has three forms: memory-to-memory, memory-to-disk, and disk-to-disk. The first form is not enabled by default; the second runs continuously (like the spill stage) until copying ends, and then the third, disk-to-disk merge, produces the final file.
Reduce stage: the final file may be on disk or in memory, but by default it is on disk. Once the Reduce input file is determined, the whole Shuffle is over; the Reduce function is then executed and the result is written to HDFS.
Serialization converts the state of in-memory objects into a byte sequence for storage (persistence) and network transmission.
Deserialization converts a received byte sequence, or data persisted on disk, back into in-memory objects.
Hadoop serialization features: 1. compact; 2. objects can be reused; 3. extensible; 4. interoperable.
Hadoop's native serialization types implement an interface called Writable, which plays a role similar to Java's Serializable interface. To
implement it you must provide two methods: write(DataOutput out) and readFields(DataInput in).
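A minimal sketch of a custom Writable; the PageView type and its fields are invented for illustration:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.Writable;

    // A simple custom type implementing Hadoop's Writable interface.
    public class PageView implements Writable {
        private String url;
        private long count;

        public PageView() { }                      // a no-arg constructor is required for deserialization

        public PageView(String url, long count) {
            this.url = url;
            this.count = count;
        }

        @Override
        public void write(DataOutput out) throws IOException {    // serialize fields in a fixed order
            out.writeUTF(url);
            out.writeLong(count);
        }

        @Override
        public void readFields(DataInput in) throws IOException { // deserialize in the same order
            url = in.readUTF();
            count = in.readLong();
        }
    }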
Combiner: a local reducer. It reduces the amount of data transferred to the reducers and can also be used to filter data.
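The classic word-count job illustrates both the map/reduce flow described above and the use of the reducer as a combiner; this is a sketch of the standard example on the old mapred API:

    import java.io.IOException;
    import java.util.Iterator;
    import java.util.StringTokenizer;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class WordCount {

        public static class Map extends MapReduceBase
                implements Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            public void map(LongWritable key, Text value,
                            OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
                StringTokenizer tokens = new StringTokenizer(value.toString());
                while (tokens.hasMoreTokens()) {
                    word.set(tokens.nextToken());
                    output.collect(word, ONE);          // emit <word, 1>
                }
            }
        }

        public static class Reduce extends MapReduceBase
                implements Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce(Text key, Iterator<IntWritable> values,
                               OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
                int sum = 0;
                while (values.hasNext()) {
                    sum += values.next().get();
                }
                output.collect(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(WordCount.class);
            conf.setJobName("wordcount");
            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(IntWritable.class);
            conf.setMapperClass(Map.class);
            conf.setCombinerClass(Reduce.class);        // combiner = local reducer, shrinks shuffle data
            conf.setReducerClass(Reduce.class);
            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));
            JobClient.runJob(conf);
        }
    }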
ZooKeeper -- a distributed coordination service for Hadoop that provides a general distributed lock mechanism and can be used for data synchronization. Typical application scenarios: unified naming service, configuration management, cluster management.
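A minimal sketch of the distributed-lock idea on the raw ZooKeeper client API; the connect string and znode path are placeholders, the parent /locks node is assumed to exist, and in practice higher-level recipes (e.g. Apache Curator) are usually used:

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ZkLockSketch {
        public static void main(String[] args) throws Exception {
            // Connect string is a placeholder; 3000 ms session timeout, no-op watcher.
            ZooKeeper zk = new ZooKeeper("zk-host:2181", 3000, event -> { });
            try {
                // Whoever creates this ephemeral znode first holds the "lock";
                // it disappears automatically if the owner's session dies.
                zk.create("/locks/my-job", new byte[0],
                          ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
                System.out.println("lock acquired, doing work...");
            } catch (KeeperException.NodeExistsException e) {
                System.out.println("someone else holds the lock");
            } finally {
                zk.close();
            }
        }
    }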
YARN: ResourceManager, NodeManager, ApplicationMaster and Container
Resource Manager: RM is a global resource manager responsible for resource management and allocation of the entire system. It consists of two components: scheduler and application manager.
YARN work steps:
Step 1 The user submits the application to YARN, including the ApplicationMaster program, the command to start the ApplicationMaster, and the user program.
Step 2 ResourceManager allocates the first Container for the application, and communicates with the corresponding Node-Manager, asking it to start the ApplicationMaster of the application in this Container.
Step 3 The ApplicationMaster first registers with the ResourceManager, so that the user can view the running status of the application directly through the ResourceManager, and then it will apply for resources for each task and monitor its running status
       until the end of the operation, that is, repeat steps 4~7.
Step 4 The ApplicationMaster applies for and receives resources from the ResourceManager through the RPC protocol in a polling manner.
Step 5 Once the ApplicationMaster has obtained resources, it communicates with the corresponding NodeManager and asks it to start the task.
Step 6 After NodeManager sets the running environment (including environment variables, JAR packages, binary programs, etc.) for the task, it writes the task startup command into a script, and starts the task by running the script.
Step 7 Each task reports its status and progress to the ApplicationMaster through an RPC protocol, so that the ApplicationMaster can keep track of the running status of each task, so that the task can be restarted when the task fails.
       During the running process of the application, the user can query the current running status of the application to the ApplicationMaster through RPC at any time.
Step 8 After the application runs, the ApplicationMaster logs out of the ResourceManager and shuts itself down.
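A hedged sketch of step 1 (submitting an application to YARN) using the YarnClient API; the AM launch command here is just a placeholder, so a real ApplicationMaster that registers with the RM and requests containers (steps 3-7) is not shown:

    import java.util.Collections;
    import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
    import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
    import org.apache.hadoop.yarn.api.records.Resource;
    import org.apache.hadoop.yarn.client.api.YarnClient;
    import org.apache.hadoop.yarn.client.api.YarnClientApplication;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;
    import org.apache.hadoop.yarn.util.Records;

    public class YarnSubmitSketch {
        public static void main(String[] args) throws Exception {
            YarnClient yarnClient = YarnClient.createYarnClient();
            yarnClient.init(new YarnConfiguration());
            yarnClient.start();

            // Ask the ResourceManager for a new application (step 1).
            YarnClientApplication app = yarnClient.createApplication();
            ApplicationSubmissionContext ctx = app.getApplicationSubmissionContext();
            ctx.setApplicationName("demo-app");

            // Describe the container that will run the ApplicationMaster (step 2).
            // The command is a placeholder; a real AM would register with the RM afterwards.
            ContainerLaunchContext amContainer = Records.newRecord(ContainerLaunchContext.class);
            amContainer.setCommands(Collections.singletonList("/bin/date"));
            ctx.setAMContainerSpec(amContainer);
            ctx.setResource(Resource.newInstance(512, 1));   // 512 MB, 1 vcore for the AM

            yarnClient.submitApplication(ctx);
            System.out.println("submitted " + ctx.getApplicationId());
        }
    }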
Sqoop: transfers data between Hadoop and relational databases.
Pig: a lightweight scripting language for operating Hadoop. It was not used much at first; it is a data-flow language for processing huge amounts of data quickly and easily.
It can process HDFS and HBase data very conveniently. Like Hive, Pig can handle what it needs to do very efficiently; by writing Pig queries directly, a lot of labour and time can be saved.
Hive: those who are familiar with SQL can use Hive for offline data processing and analysis.
Note that Hive is suited to offline data operations; it is not suitable for real-time online queries in a production environment, because in one word it is "slow".
Hive originated at Facebook and plays the role of the data warehouse in Hadoop. Built on top of the Hadoop cluster, it provides a SQL-like interface to operate on the data stored in the cluster.
You can do select, join and other operations with HiveQL.
If you have data-warehouse needs, are good at writing SQL and do not want to write MapReduce jobs, you can use Hive instead.
The execution entry of Hive is the Driver: a SQL statement is submitted to the Driver, which calls the compiler to interpret it, and finally it is translated into MapReduce jobs for execution.
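A minimal sketch of querying Hive through the HiveServer2 JDBC driver; the host, port, credentials and the employees table are placeholders:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQuerySketch {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            // Host, port and database are placeholders for a running HiveServer2 instance.
            try (Connection conn = DriverManager.getConnection(
                         "jdbc:hive2://hive-host:10000/default", "user", "");
                 Statement stmt = conn.createStatement()) {
                // The Driver/compiler turns this HiveQL into one or more MapReduce jobs.
                ResultSet rs = stmt.executeQuery(
                        "SELECT dept, COUNT(*) FROM employees GROUP BY dept");
                while (rs.next()) {
                    System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
                }
            }
        }
    }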
HBase
HBase runs on top of HDFS as a column-oriented database. HDFS lacks random read and write operations, and HBase exists for that reason. HBase is modelled on Google BigTable and stores data in the form of key-value pairs.
The goal of the project is to quickly locate and access the desired data among billions of rows.
HBase is a database, a NoSQL database. Like other databases, it provides random read and write. Hadoop alone cannot meet real-time needs, but HBase can; if you need to access some data in real time, store it in HBase.
You can use Hadoop as a static data warehouse and HBase as the store for data that will be modified by operations.

Pig VS Hive
Hive is more suitable for data-warehouse tasks; it is mainly used for static structures and work that requires frequent analysis. Hive's similarity to SQL makes it an ideal intersection point between Hadoop and other BI tools.
Pig gives developers more flexibility in the field of large data sets and allows the development of concise scripts for transforming data streams for embedding into larger applications.
Pig is relatively lightweight compared to Hive, and its main advantage is that it can greatly reduce the amount of code compared to directly using Hadoop Java APIs. Because of this, Pig still attracts a large number of software developers.
Both Hive and Pig can be used in combination with HBase. Hive and Pig also provide high-level language support for HBase, making it very simple to perform data statistical processing on HBase.
Hive VS HBase
Hive is a batch system built on top of Hadoop to reduce the work of writing MapReduce jobs, and HBase is to support projects that make up for Hadoop's shortcomings in real-time operations.
Imagine you are operating an RDBMS: if it is a full-table scan, use Hive+Hadoop; if it is indexed access, use HBase+Hadoop.
A Hive query is a MapReduce job that can take from 5 minutes to several hours. HBase is very efficient, definitely much more efficient than Hive.

HBase -- a distributed column-oriented storage system built on HDFS that stores data by table, row and column.
Features of HBase tables:
Large: a table can have billions of rows and millions of columns;
No schema: each row has a sortable primary key and any number of columns; columns can be added dynamically as needed, and different rows in the same table can have completely different columns;
Column-oriented: storage and permission control are per column (family), and column (families) are retrieved independently;
Sparse: empty (null) columns take up no storage space, so a table can be designed to be very sparse;
Multi-versioned data: the data in each cell can have multiple versions; by default the version number is assigned automatically and is the timestamp at which the cell was inserted;
Single data type: the data in HBase are all stored as uninterpreted strings (byte arrays) with no type.
Basic concepts:
RowKey: a byte array; it is the "primary key" of each record in the table and makes lookups fast. The design of the RowKey is very important.
Column Family: a column family has a name (a string) and contains one or more related columns.
Column: belongs to a column family and is addressed as familyName:columnName; columns can be added dynamically to each record.
Version Number: of type long; the default value is the system timestamp, and it can be customised by the user.
Value (Cell): a byte array.
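A small sketch of these concepts with the HBase Java client; the table name "user", the column family "info" and the row key are made up, and the table is assumed to already exist:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBasePutGetSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml / ZooKeeper quorum
            try (Connection connection = ConnectionFactory.createConnection(conf);
                 // The "user" table with column family "info" is assumed to exist already.
                 Table table = connection.getTable(TableName.valueOf("user"))) {

                // RowKey + column family + qualifier + (implicit) timestamp identify one cell.
                Put put = new Put(Bytes.toBytes("row-0001"));
                put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("alice"));
                table.put(put);

                Get get = new Get(Bytes.toBytes("row-0001"));
                Result result = table.get(get);
                byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
                System.out.println(Bytes.toString(value));       // alice
            }
        }
    }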
HBase physical model
Each column family is stored in a separate file on HDFS, and null values are not saved.
The Key and the Version number are kept once per column family;
HBase maintains a multi-level index for each value, namely: <key, column family, column name, timestamp>.
Physical storage:
1. All rows in a Table are arranged in the lexicographic order of the row key;
2. A Table is split into multiple Regions in the row direction;
3. Regions are split by size: each table has only one Region at the beginning, and as data grows the Region grows; when it reaches a threshold the Region splits into two new Regions;
4. The Region is the smallest unit of distributed storage and load balancing in HBase, and different Regions are distributed to different RegionServers;
5. Although the Region is the smallest unit of distribution, it is not the smallest unit of storage.
A Region consists of one or more Stores, and each Store holds one column family;
each Store consists of one MemStore and zero or more StoreFiles; a StoreFile is stored as an HFile;
MemStores are kept in memory, and StoreFiles are stored on HDFS.
Description of basic HBase components:
Client
 Contains the interface to access HBase, and maintains the cache to speed up access to HBase, such as the location information of the region
Master
 Assigns regions to the Region server
 Responsible for the load balancing of the Region server
 Find the failed Region server and reassign the region on it
 Manage user's addition, deletion, modification and query operations on the table
Region Server
 Maintains the Regions assigned to it and handles I/O requests to these Regions
 Splits Regions that grow too large during operation
Zookeeper
 Through elections, it is guaranteed that there is only one master in the cluster at any time, and the Master and RegionServers will register with ZooKeeper when they are started
 Store the addressing entries of all Regions
 Monitor the online and offline information of Region servers in real time. And notify the Master in real time
 Store HBase schema and table metadata
 By default, HBase manages ZooKeeper instances, such as starting or stopping ZooKeeper
 The introduction of Zookeeper makes the Master no longer a single point of failure
Finding a RegionServer
ZooKeeper --> -ROOT- (a single Region) --> .META. --> user table
-ROOT-
 This table records the list of Regions of the .META. table; the -ROOT- table has only one Region;
 The location of the -ROOT- table is recorded in ZooKeeper
.META.
 This table records the list of Regions of all user tables, along with the server addresses of the RegionServers.
Flume -- a complete data collection tool that collects data from data sources and delivers it to a destination.
        To guarantee delivery, the data is cached before it is sent, and the cached data is deleted only after it has actually reached the destination.
Flume is a highly available, highly reliable open-source distributed log collection system originally provided by Cloudera. Through Flume, log data can flow to the terminal destination where it needs to be stored.
"Log" here is a general term covering many kinds of data, such as files and operation records.
The core of Flume is the agent: a Java process that runs at the log-collection end, receives the logs, stores them temporarily, and sends them on to the destination.
Three core components:
  ①Source: It is dedicated to collecting logs and can process log data of various types and formats, including avro, thrift, spooling directory, netcat, http, legacy, custom, etc.
  ②Channel: dedicated to temporary storage of data, which can be stored in memory, jdbc, file, database, custom, etc. The data stored in it will only be deleted after the sink is sent successfully.
  ③Sink: It is dedicated to sending data to the destination point, including hdfs, logger, avro, thrift, ipc, file, null, hbase, solr, custom, etc.
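A hedged sketch of an agent definition wiring together a netcat source, a memory channel and an HDFS sink (the agent/component names a1, r1, c1, k1 and the HDFS path are placeholders):

    # agent a1: netcat source -> memory channel -> HDFS sink
    a1.sources = r1
    a1.channels = c1
    a1.sinks = k1

    a1.sources.r1.type = netcat
    a1.sources.r1.bind = 0.0.0.0
    a1.sources.r1.port = 44444
    a1.sources.r1.channels = c1

    a1.channels.c1.type = memory
    a1.channels.c1.capacity = 1000

    a1.sinks.k1.type = hdfs
    a1.sinks.k1.hdfs.path = hdfs://namenode-host:9000/flume/events/%Y-%m-%d
    a1.sinks.k1.hdfs.fileType = DataStream
    a1.sinks.k1.hdfs.useLocalTimeStamp = true
    a1.sinks.k1.channel = c1

Such a configuration would typically be started with something like: flume-ng agent --name a1 --conf-file demo.conf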











