Basic knowledge of the Hadoop ecosystem

Java
List features: elements keep the order in which they were inserted, and duplicate elements are allowed.
Set features: elements have no guaranteed order, and duplicate elements are not allowed.

The three normal forms of database design: 1NF - each column is atomic; 2NF - every non-key column fully depends on the primary key (no partial dependency); 3NF - no transitive dependency on the primary key (no redundancy).
Object vs. object reference: declaring a variable of a class type only creates a reference; the object itself exists only once it has been initialized (with new), and a reference that has not been assigned an object points to nothing (null).

ArrayList and Vector: both use an array to store their data and access elements by index. Both grow the internal array automatically as needed, so elements can be appended and inserted, and both allow direct index-based access. However, inserting data involves moving array elements and other memory operations, so indexed reads are fast while inserts are slow. The biggest difference between them is that Vector's methods are synchronized while ArrayList's are not.
LinkedList uses a doubly linked list for storage. Accessing an element by index requires traversing forward or backward from one end, but inserting an element only requires updating the links of its neighbours, so insertion is faster!
If you only look up elements at specific positions, or add and remove elements only at the end of the collection, either Vector or ArrayList will do. For insertion and deletion at arbitrary positions, LinkedList is the better choice, as the sketch below illustrates.
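A quick, self-contained sketch of that trade-off:

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

public class ListDemo {
    public static void main(String[] args) {
        // ArrayList: fast random access by index, slower inserts in the middle
        List<String> array = new ArrayList<>();
        array.add("a");
        array.add("c");
        array.add(1, "b");                  // shifts "c" one slot to the right
        System.out.println(array.get(1));   // "b" - O(1) indexed read

        // LinkedList: cheap insertion once the position is known,
        // but get(i) must walk the links from one end
        List<String> linked = new LinkedList<>();
        linked.add("a");
        linked.add("c");
        linked.add(1, "b");                 // only relinks the neighbours
        System.out.println(linked.get(1));  // "b" - O(n) traversal
    }
}
```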

Differences between HashMap and Hashtable and their advantages and disadvantages:
The methods of Hashtable are synchronized, while the methods of HashMap are not, so extra synchronization is needed when HashMap is used in a multi-threaded environment.
Hashtable does not allow null for either keys or values, while HashMap allows one null key and any number of null values.
Because HashMap can hold null values, get() returning null does not necessarily mean a key is absent; use containsKey() to test whether a key exists.
Hashtable traverses with Enumeration, while HashMap uses Iterator.
Hashtable is a subclass of the legacy Dictionary class, while HashMap is an implementation of the Map interface.
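A small sketch of the null-handling and synchronization differences (Collections.synchronizedMap is shown as one common way to use HashMap from multiple threads):

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Hashtable;
import java.util.Map;

public class MapDemo {
    public static void main(String[] args) {
        Map<String, String> hashMap = new HashMap<>();
        hashMap.put(null, "ok");            // one null key is allowed
        hashMap.put("k", null);             // null values are allowed
        // get() == null is ambiguous here, so test presence explicitly:
        System.out.println(hashMap.containsKey("k"));   // true

        Map<String, String> table = new Hashtable<>();
        // table.put(null, "x");            // would throw NullPointerException
        // table.put("k", null);            // would also throw NullPointerException
        table.put("k", "v");                // Hashtable methods are synchronized

        // HashMap needs external synchronization for multi-threaded use, e.g.:
        Map<String, String> syncMap = Collections.synchronizedMap(hashMap);
        System.out.println(syncMap.get("k"));           // null (the stored value was null)
    }
}
```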
When you need to modify strings repeatedly, use StringBuffer instead of String. String is read-only (immutable): every modification creates a temporary object, while StringBuffer is mutable and does not generate temporary objects.
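For example (StringBuilder is the unsynchronized sibling of StringBuffer; either one avoids the temporary String objects):

```java
public class ConcatDemo {
    public static void main(String[] args) {
        // String is immutable: each += creates a new temporary String object
        String s = "";
        for (int i = 0; i < 5; i++) {
            s += i;
        }

        // StringBuffer modifies one internal buffer instead
        StringBuffer sb = new StringBuffer();
        for (int i = 0; i < 5; i++) {
            sb.append(i);
        }
        System.out.println(s.equals(sb.toString())); // true, but sb needed no temporaries
    }
}
```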

Hadoop
Core: HDFS and MapReduce (MR)
RPC - remote procedure call protocol
HDFS (Hadoop Distributed File System) Hadoop distributed file system.
Features: ① It keeps multiple copies of the data and provides a fault-tolerance mechanism: replicas that are lost, or that sit on a crashed node, are recovered automatically. Three replicas are kept by default.
② It runs on cheap commodity machines.
③ It is suited to big-data processing. HDFS splits files into blocks; by default one block is 64 MB. The blocks themselves are stored on HDFS, while the block mapping (metadata) is kept in memory on the NameNode, so too many small files put a heavy load on that memory.
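As an illustration of these defaults, here is a minimal sketch of writing a file through the HDFS Java client API; the NameNode URI, the path, and the explicit replication/block-size settings are assumptions for illustration (normally they come from hdfs-site.xml):

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // assumed NameNode address
        conf.set("dfs.replication", "3");                 // 3 copies (the default)
        conf.set("dfs.blocksize", "67108864");            // 64 MB blocks, as in older Hadoop

        // Write a small file; the client splits larger files into blocks automatically
        try (FileSystem fs = FileSystem.get(conf);
             FSDataOutputStream out = fs.create(new Path("/demo/hello.txt"))) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }
    }
}
```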
HDFS: Master and Slave structure (NameNode, SecondaryNameNode, DataNode)
NameNode: the Master node, the leader. It manages the block mapping, handles client read and write requests, applies the replication policy, and manages the HDFS namespace.
SecondaryNameNode: the "younger brother" that shares part of the NameNode's workload; it is a cold backup of the NameNode; it merges fsimage and edits and sends the result back to the NameNode.
DataNode: the Slave node, the worker. It stores the data blocks sent by clients and performs block read and write operations.
Hot backup: b is a hot backup of a; if a fails, b immediately takes over a's work.
Cold backup: b is a cold backup of a; if a fails, b cannot immediately take over, but some of a's information is stored on b, reducing the loss after a breaks.
fsimage: Metadata image file (directory tree of the file system.)
edits: Metadata operation log (records of modification operations made to the file system)
What the NameNode keeps in memory = fsimage + edits.
The SecondaryNameNode periodically (every hour by default) fetches the fsimage and edits from the NameNode, merges them, and sends the result back to the NameNode, reducing the NameNode's workload.
hdfs: namenode(metadata)

Fault monitoring mechanism
(1) Node failure monitoring: each DataNode sends a heartbeat to the NameNode at a fixed interval (3 seconds) to prove that it is working normally; if no heartbeat arrives for a certain time (10 minutes), the DataNode is considered to be down.
(2) Communication failure monitoring: whenever data is sent, the receiver returns an acknowledgement code.
(3) Data corruption monitoring: data is stored together with a checksum, and all DataNodes regularly report their block storage status to the NameNode.

The default HDFS placement strategy stores one replica on a node in the local rack and the other two replicas on two different nodes of another rack.
This lets the cluster survive the complete loss of an entire rack. At the same time, the strategy reduces data transfer between racks and
improves the efficiency of write operations: because the blocks are stored on only two different racks, the total network bandwidth needed to read the data is reduced. In this way, data safety and network transfer overhead are both taken into account to a certain extent.

MapReduce job running process:
1. Start a job on the client.
2. Request a Job ID from the JobTracker.
3. Copy the resource files required to run the job to HDFS, including the JAR file of the MapReduce program, the configuration file, and the input split information computed by the client.
These files are stored in a folder that the JobTracker creates specifically for the job, named after the job's Job ID. The JAR file has 10 replicas by default (controlled by the mapred.submit.replication property);
the input split information tells the JobTracker how many map tasks should be started for this job.
4. After the JobTracker receives the job, it puts it in a job queue and waits for the job scheduler to schedule it. When the job scheduler picks the job according to its scheduling algorithm,
it creates one map task for each input split and assigns the map tasks to TaskTrackers for execution.
For map and reduce tasks, TaskTracker has a fixed number of map slots and reduce slots according to the number of host cores and memory size.
What needs to be emphasized here is that a map task is not assigned to a TaskTracker at random; there is a concept here called data locality (Data-Local).
It means: assign the map task to a TaskTracker that holds the data block the map will process, and copy the program JAR to that TaskTracker to run. This is called "moving the computation rather than moving the data".
Data localization is not considered when assigning reduce tasks.
5. The TaskTracker will send a heartbeat to the JobTracker every once in a while, telling the JobTracker that it is still running, and the heartbeat also carries a lot of information, such as the progress of the current map task completion.
When the JobTracker receives the job's last task completion message, it sets the job to "success". When the JobClient queries the status, it will know that the task is complete and display a message to the user.
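For reference, a minimal driver that triggers the submission flow above might look like the following sketch (WordCountMapper and WordCountReducer are assumed user-defined classes, and the paths are placeholders):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);      // the JAR that gets copied to HDFS on submit
        job.setMapperClass(WordCountMapper.class);     // assumed user-defined Mapper
        job.setReducerClass(WordCountReducer.class);   // assumed user-defined Reducer
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/input"));   // splits are computed from here
        FileOutputFormat.setOutputPath(job, new Path("/output"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);        // submit the job and wait
    }
}
```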
So far we have analyzed how MapReduce works at the level of the client, JobTracker, and TaskTracker. Let's go into a bit more detail and look at it from the level of the map task and the reduce task.

Map side:
1. Each input split is processed by one map task. By default the size of one HDFS block (64 MB by default) is used as a split, although the block size can of course be configured.
The map output is temporarily placed in a circular memory buffer (100 MB by default, controlled by the io.sort.mb property). When the buffer is close to overflowing
(by default at 80% of the buffer size, controlled by the io.sort.spill.percent property), a spill file is created in the local file system and the data in the buffer is written to that file.
2. Before writing to disk, the thread first divides the data into as many partitions as there are reduce tasks, so that one reduce task corresponds to the data of one partition.
This avoids the awkward situation where some reduce tasks are allocated a large amount of data while others get little or none; in fact, partitioning is just hashing the keys.
The data in each partition is then sorted, and if a Combiner has been configured, it is run over the sorted output so that as little data as possible is written to disk.
3. By the time the map task writes its last record there may be many spill files, and these files need to be merged. During the merge, sorting and combining are performed repeatedly, for
two purposes: 1. minimize the amount of data written to disk each time; 2. minimize the amount of data transferred over the network in the following copy phase. The spills are finally merged into a single partitioned and sorted file.
To further reduce the amount of data transferred over the network, the map output can be compressed by setting mapred.compress.map.output to true.
4. The data in each partition is copied to the corresponding reduce task.
Some may ask: how does the data in a partition know which reduce it belongs to? In fact, the map task keeps in touch with its parent TaskTracker, and the TaskTracker keeps a heartbeat with the JobTracker,
so the JobTracker holds the global picture of the whole cluster. A reduce task only needs to ask the JobTracker for the locations of the corresponding map outputs.
At this point, the map side has been covered.
So what exactly is Shuffle? Shuffle literally means "to shuffle cards". Look at it this way: the data produced by a map is partitioned by hashing and assigned to different reduce tasks.
Isn't that a process of shuffling the data?
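The hashing just described is what the default partitioner does; here is a sketch of an equivalent custom Partitioner (the same idea, not the Hadoop source itself):

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes each map output key to one of numReduceTasks partitions by hashing,
// mirroring what Hadoop's default hash partitioner does.
public class DemoHashPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Mask off the sign bit so the result is non-negative, then take the modulo
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```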

Reduce side:
1. The reduce side receives data from different map tasks, and the data coming from each map is already sorted. If the amount of data the reduce side receives is fairly small, it is kept directly in memory (the
buffer size is controlled by the mapred.job.shuffle.input.buffer.percent property, the percentage of heap space used for this purpose); if the amount of data exceeds a certain proportion of the buffer size
(determined by mapred.job.shuffle.merge.percent), the data is merged and then spilled to disk.
2. As spill files accumulate, a background thread merges them into larger, sorted files to save time in later merges. In fact, whether on the map side or the reduce side, MapReduce performs sorting and
merging over and over again; now it is clear why some people say sorting is the soul of Hadoop.
3. Many intermediate files are written to disk during merging, but MapReduce tries to write as little to disk as possible, and the result of the final merge is not written to disk but fed directly to the reduce function.

Shuffle phase: starts from the Map output and includes the system sorting the data and sending the Map output to the Reduce side as its input.
Sort stage: refers to the process of sorting the keys output by the Map side. Different Maps may output the same Key, and the same Key must be sent to the same Reduce side for processing.

The Shuffle phase can be divided into Shuffle on the Map side and Shuffle on the Reduce side.
1. Shuffle on the Map side
When the Map function starts to produce output, it does not simply write the data to disk, because frequent disk operations would cause a serious performance loss. The process is more involved: the data is first written to a buffer in memory, where some pre-sorting is done for efficiency.
Each MapTask has a circular memory buffer (100 MB by default) for its output data. When the amount of data in the buffer reaches a threshold (80% by default), a background thread starts writing the buffer contents to disk (the spill stage).
While the spill is in progress, the Map output keeps being written to the buffer, but if the buffer fills up during this period the Map blocks until the spill to disk completes.
Before writing to disk, the thread first divides the data into partitions corresponding to the Reducers that will receive it. Within each partition, the background thread sorts by key (quick sort), and if a Combiner (i.e. a mini Reducer) is configured, it is run on the sorted output.
Each time the memory buffer reaches the spill threshold, a new spill file is created, so after the MapTask writes its last output record there may be several spill files. Before the MapTask completes, the spill files are merged into one index file and one data file (multi-way merge sort) (the Sort stage).
After the spill files are merged, the Map deletes all temporary spill files and informs the TaskTracker that the task is complete. As soon as one MapTask completes, the ReduceTasks start copying its output (the Copy stage).
The Map output files are placed on the local disk of the TaskTracker that ran the MapTask; they are the input data for the TaskTrackers that run the ReduceTasks. The Reduce output is different: it is generally written to HDFS (the Reduce phase).
2. Shuffle on the Reduce side
Copy stage: the Reduce process starts some data-copy threads that fetch the MapTask output files over HTTP from the TaskTrackers where the MapTasks ran.
Merge stage: the data copied from the Map side is first placed in a memory buffer. Merge has three forms: memory-to-memory, memory-to-disk, and disk-to-disk. The first form is disabled by default; the second keeps running (like the spill stage) until copying ends, and then the third, disk-to-disk merge, is started to produce the final file.
Reduce stage: the final file may sit on disk or in memory, but by default it is on disk. Once the Reduce input file is determined, the whole Shuffle is over; then the Reduce runs and its result is written to HDFS.
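The buffer and compression properties quoted in this section can also be set programmatically; here is a sketch using the legacy (pre-YARN) property names as they appear above, with illustrative values:

```java
import org.apache.hadoop.conf.Configuration;

public class ShuffleTuning {
    public static Configuration tuned() {
        Configuration conf = new Configuration();
        conf.setInt("io.sort.mb", 200);                       // map-side sort buffer in MB (default 100)
        conf.setFloat("io.sort.spill.percent", 0.80f);        // spill threshold (default 0.80)
        conf.setBoolean("mapred.compress.map.output", true);  // compress map output before the copy phase
        conf.setFloat("mapred.job.shuffle.input.buffer.percent", 0.70f); // reduce-side copy buffer share of heap
        conf.setFloat("mapred.job.shuffle.merge.percent", 0.66f);        // merge/spill threshold on the reduce side
        return conf;
    }
}
```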

Serialization converts the state of objects in memory into a byte sequence for storage (persistence) and network transmission.
Deserialization converts a received byte sequence, or data persisted on disk, back into objects in memory.
Hadoop serialization features: 1. Compact; 2. Object reusability; 3. Scalability; 4. Interoperability
Hadoop's native serialization requires implementing an interface called Writable, which plays a role similar to java.io.Serializable. To
implement it you must provide two methods: write(DataOutput out) and readFields(DataInput in).
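A minimal custom Writable following that two-method contract (the class and its fields are made up for illustration):

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

// A simple pair of fields serialized in a fixed order; readFields must
// read them back in exactly the order write() emitted them.
public class PageViewWritable implements Writable {
    private String url;
    private long views;

    public PageViewWritable() { }                  // required no-arg constructor

    public PageViewWritable(String url, long views) {
        this.url = url;
        this.views = views;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(url);
        out.writeLong(views);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        url = in.readUTF();
        views = in.readLong();
    }
}
```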

Combiner: a local reducer that runs on the map side; it reduces the amount of data transferred to the reducers and can also be used to filter data, as in the sketch below.
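A combiner is simply a Reducer applied locally to the map output; here is a sketch of a sum reducer that can double as the combiner, since summation is associative and commutative:

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sums the counts for one key; usable both as the Combiner (map-side, local)
// and as the final Reducer.
public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}
// In the driver: job.setCombinerClass(SumReducer.class);
```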

ZooKeeper: provides distributed coordination services for Hadoop, including a general distributed lock mechanism used to achieve data synchronization. Typical application scenarios: unified naming service, configuration management, cluster management.

YARN: ResourceManager, NodeManager, ApplicationMaster and Container
Resource Manager: RM is a global resource manager responsible for resource management and allocation of the entire system. It consists of two components: scheduler and application manager.
YARN work steps:
Step 1 The user submits the application to YARN, including the ApplicationMaster program, the command to start the ApplicationMaster, and the user program.
Step 2 ResourceManager allocates the first Container for the application, and communicates with the corresponding Node-Manager, asking it to start the ApplicationMaster of the application in this Container.
Step 3 The ApplicationMaster first registers with the ResourceManager, so that the user can view the running status of the application directly through the ResourceManager, and then it will apply for resources for each task and monitor its running status
until the end of the operation, that is, repeat steps 4~7.
Step 4 The ApplicationMaster applies for and receives resources from the ResourceManager through the RPC protocol in a polling manner.
Step 5 Once the ApplicationMaster applies for the resource, it communicates with the corresponding NodeManager and asks it to start the task.
Step 6 After NodeManager sets the running environment (including environment variables, JAR packages, binary programs, etc.) for the task, it writes the task startup command into a script, and starts the task by running the script.
Step 7 Each task reports its status and progress to the ApplicationMaster through an RPC protocol, so that the ApplicationMaster can keep track of the running status of each task, so that the task can be restarted when the task fails.
During the running process of the application, the user can query the current running status of the application to the ApplicationMaster through RPC at any time.
Step 8 After the application runs, the ApplicationMaster logs out of the ResourceManager and shuts itself down.

Sqoop: transfers data between Hadoop and relational databases.

Pig: a lightweight scripting language for operating on Hadoop. It was not used much at first; it is a data-flow language for processing huge data sets quickly and easily.
It can process HDFS and HBase data very conveniently. Like Hive, Pig handles its work efficiently: by writing Pig queries directly, a lot of labor and time can be saved.

Hive: those who are familiar with SQL can use Hive for offline data processing and analysis.
Note that Hive is suited to offline data operations; it is not suitable for real-time online queries or operations in a real production environment, in a word because it is "slow".
Hive originated at Facebook and plays the role of the data warehouse in Hadoop. Built on top of a Hadoop cluster, it provides a SQL-like interface for operating on the data stored in the cluster.
You can do select, join, and similar operations with HiveQL.
If you have data-warehousing needs, are good at writing SQL, and don't want to write MapReduce jobs, you can use Hive instead.
The execution entry point of Hive is the Driver: an executed SQL statement is first submitted to the Driver, which calls the compiler to interpret it, and it is finally translated into MapReduce tasks for execution.
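For illustration, HiveQL can be submitted through the HiveServer2 JDBC driver; the host, port, and table name below are assumptions:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryDemo {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC endpoint; host/port/database are placeholders
        String url = "jdbc:hive2://hiveserver:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement();
             // The query is compiled by the Hive Driver into MapReduce jobs behind the scenes
             ResultSet rs = stmt.executeQuery(
                     "SELECT category, COUNT(*) FROM page_views GROUP BY category")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}
```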

HBase
HBase runs on top of HDFS as a column-oriented database. HDFS lacks random read and write operations, and HBase exists for exactly this reason. HBase is modeled on Google BigTable and stores data in the form of key-value pairs.
The goal of the project is to quickly locate and access the desired data among billions of rows.
HBase is a database, a NoSQL database. Like other databases, it provides random read and write operations. Hadoop cannot meet real-time needs, but HBase can; if you need to access some data in real time, store it in HBase.
You can use Hadoop as a static data warehouse and HBase as a data store for data that will be changed by some operations.
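A minimal sketch of the random read and write access HBase provides, using the standard Java client; the table name, column family, and values are assumptions:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml from the classpath
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("user_profile"))) {

            // Random write: one cell, addressed by rowkey + column family + qualifier
            Put put = new Put(Bytes.toBytes("row-001"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("alice"));
            table.put(put);

            // Random read of the same cell
            Get get = new Get(Bytes.toBytes("row-001"));
            Result result = table.get(get);
            byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(value));       // "alice"
        }
    }
}
```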

Pig VS Hive
Hive is better suited to data-warehouse tasks; it is mainly used for static structures and work that requires frequent analysis. Hive's similarity to SQL makes it an ideal intersection point between Hadoop and other BI tools.
Pig gives developers more flexibility in the field of large data sets and allows the development of concise scripts for transforming data streams for embedding into larger applications.
Pig is relatively lightweight compared to Hive, and its main advantage is that it can greatly reduce the amount of code compared to directly using Hadoop Java APIs. Because of this, Pig still attracts a large number of software developers.
Both Hive and Pig can be used in combination with HBase. Hive and Pig also provide high-level language support for HBase, making it very simple to perform data statistical processing on HBase.

Hive VS HBase
Hive is a batch system built on top of Hadoop to reduce the work of writing MapReduce jobs, while HBase is a project that makes up for Hadoop's shortcomings in real-time operation.
Imagine you are working with an RDBMS: for a full table scan, use Hive + Hadoop; for indexed access, use HBase + Hadoop.
A Hive query runs as MapReduce jobs and can take anywhere from a few minutes to several hours; HBase access is very efficient, far more so than Hive.

HBase - a distributed, column-oriented storage system built on HDFS that stores data in tables of rows and columns.
Features of HBase tables:
Large: a table can have billions of rows and millions of columns;
No fixed schema: each row has a sortable primary key and any number of columns; columns can be added dynamically as needed, and different rows of the same table can have completely different columns;
Column-oriented: storage and permission control are per column (family), and column families are retrieved independently;
Sparse: empty (null) columns take up no storage space, so tables can be designed to be very sparse;
Multi-version data: the data in each cell can have multiple versions; by default the version number is assigned automatically and is the timestamp of the cell at insert time;
Single data type: all data in HBase is stored as untyped byte strings.
Basic concepts:
RowKey: a byte array; it is the "primary key" of each record in the table and makes fast lookup possible. The design of the RowKey is very important.
Column Family: a column family has a name (a string) and contains one or more related columns.
Column: belongs to a column family, addressed as familyName:columnName; columns can be added dynamically for each record.
Version Number: of type long; the default value is the system timestamp, and it can be customized by the user.
Value (Cell): a byte array.
HBase physical model
Each column family is stored in a separate file on HDFS, and null values are not saved.
The key and the version number are stored once in each column family.
HBase maintains a multi-level index for each value, namely: <key, column family, column qualifier, timestamp>.
