Summary of distributed storage principles

JVM startup

Most big data technologies are built on top of the Java Virtual Machine (JVM), so we need a basic understanding of how a JVM starts. We do not have to go very deep here; we only need to grasp two key points:
  1. A JVM is normally started with the java command. For example, running java com.twq.HelloWorld starts a JVM, which then executes the code in the main method of com.twq.HelloWorld
  2. Parameters that the JVM needs at startup are passed as options of the java command. For example, -Xmx300M sets the maximum heap size of the JVM, -cp sets the class path (the additional Java bytecode files that the program depends on), and -D sets system properties that our own program can read

 

java -cp C:\\bigdata-course\\workspace\\hdfs-course\\target\\hdfs-course-1.0-SNAPSHOT.jar -Dname=yellow -DsleepDuration=5 -Xmx300m com.twq.basic.launcher.JvmLauncherTest

  When a JVM is started with the java command, it executes the main method of the main class we specify. The code in that method can be as simple as printing Hello World or as complex as any real application; that depends on the business scenario.
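To make the effect of the options concrete, here is a minimal sketch of a main class that reads them back. The class name LauncherSketch and its helper methods are assumptions for illustration; this is not the source of com.twq.basic.launcher.JvmLauncherTest, which the article does not show.

```java
// Hypothetical main class; the name LauncherSketch is an assumption for illustration.
class LauncherSketch {

    // Reads a -D system property, falling back to a default when it is not set.
    static String property(String key, String fallback) {
        return System.getProperty(key, fallback);
    }

    // Converts the heap limit reported by the JVM (influenced by -Xmx) from bytes to megabytes.
    static long maxHeapMb() {
        return Runtime.getRuntime().maxMemory() / (1024 * 1024);
    }

    public static void main(String[] args) throws InterruptedException {
        String name = property("name", "unknown");
        long sleepSeconds = Long.parseLong(property("sleepDuration", "0"));

        System.out.println("name = " + name);
        System.out.println("max heap is about " + maxHeapMb() + " MB");

        // Keep the JVM alive for a while so the running process can be observed.
        Thread.sleep(sleepSeconds * 1000);
        System.out.println("done");
    }
}
```

After compiling, it could be started with java -Dname=yellow -DsleepDuration=5 -Xmx300m LauncherSketch. Note that the options must come before the class name; anything after the class name is passed to main as program arguments.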

RPC

A JVM can run arbitrarily complex business logic. When a program on a single JVM can no longer meet the business needs, we may have to start several JVM programs. These programs need to communicate with each other to coordinate the work, and the technology commonly used for communication between JVMs is RPC.
RPC stands for Remote Procedure Call, that is, invoking a procedure that is executed by a remote program. RPC is not tied to any programming language, and most languages support it. In the Java world, RPC means communication between one JVM and another.
RPC is built on Socket network programming, which Java also supports. Take a server program and a client program running in two different JVMs: the client JVM sends a message to the server JVM over a Socket; the server processes the message after receiving it and can choose whether to send a result back to the client. This is the classic scenario of communication between JVMs.
When JVMs communicate over a Socket, they of course need to agree on a protocol: the client must not send messages that the server cannot handle, and the server must not reply with messages that the client cannot handle. So an RPC client and server always agree on a protocol.
Of course, a real RPC implementation does not use raw Socket programming directly. We use a mature, industrial-grade framework that wraps the network programming for us, such as Netty.
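The request/response exchange described above can be sketched with plain java.net sockets. Everything here is an assumption made for illustration: the class name SocketRpcSketch, the one-line text protocol ("add <a> <b>" answered by the sum), and the fact that server and client run as two threads in one JVM rather than in two JVMs, which keeps the example short and runnable.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.io.UncheckedIOException;
import java.net.ServerSocket;
import java.net.Socket;

// Hypothetical sketch of RPC over raw sockets (not an industrial-grade framework).
class SocketRpcSketch {

    // The agreed protocol (an assumption): request line "add <a> <b>",
    // response line holding the sum, or a line starting with "error".
    static String handle(String request) {
        if (request == null) return "error: empty request";
        String[] parts = request.trim().split("\\s+");
        if (parts.length == 3 && parts[0].equals("add")) {
            try {
                return String.valueOf(Long.parseLong(parts[1]) + Long.parseLong(parts[2]));
            } catch (NumberFormatException e) {
                return "error: bad number";
            }
        }
        return "error: unknown request";
    }

    // Server side: accept one connection, read one request, write one response.
    static void serveOnce(ServerSocket server) throws IOException {
        try (Socket socket = server.accept();
             BufferedReader in = new BufferedReader(new InputStreamReader(socket.getInputStream()));
             PrintWriter out = new PrintWriter(socket.getOutputStream(), true)) {
            out.println(handle(in.readLine()));
        }
    }

    // Client side: send one request line and wait for the one-line response.
    static String call(String host, int port, String request) throws IOException {
        try (Socket socket = new Socket(host, port);
             PrintWriter out = new PrintWriter(socket.getOutputStream(), true);
             BufferedReader in = new BufferedReader(new InputStreamReader(socket.getInputStream()))) {
            out.println(request);
            return in.readLine();
        }
    }

    // Starts a one-shot server on a free port, calls it once, and returns the reply.
    static String demo(String request) {
        try (ServerSocket server = new ServerSocket(0)) {
            Thread serverThread = new Thread(() -> {
                try {
                    serveOnce(server);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
            serverThread.start();
            String reply = call("localhost", server.getLocalPort(), request);
            serverThread.join();
            return reply;
        } catch (IOException | InterruptedException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo("add 2 3")); // prints 5
    }
}
```

The handle method is where the protocol lives: a request the server does not understand gets an explicit error reply instead of being silently dropped, which is the kind of agreement between client and server that the text calls for.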

Distributed storage principle

Distributed storage solves the problem of storing very large amounts of data, typically at the TB or PB level:
1 PB = 1024 TB; 1 TB = 1024 GB; 1 GB = 1024 MB
If a file is small, a single machine can store it. When a file grows so large that it no longer fits on the disks of one machine, distributed storage is needed.
For example, suppose we have a large file of 5 PB. One machine certainly cannot store it. We can split the 5 PB file into small blocks. Assuming each block is 256 MB, the file is divided into 20,971,520 blocks, which we can distribute over 1000 machines (assuming each machine has 10 TB of disk capacity), so each machine stores a little more than 20,000 blocks.
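The arithmetic in this example can be checked with a few lines of Java. The class and method names are assumptions for illustration; the numbers come from the text.

```java
// Hypothetical helper that reproduces the block arithmetic from the text.
class BlockMath {
    static final long MB = 1024L * 1024L;
    static final long TB = MB * 1024L * 1024L;
    static final long PB = TB * 1024L;

    // Number of blocks needed to hold fileBytes, rounding a partial last block up.
    static long blockCount(long fileBytes, long blockBytes) {
        return (fileBytes + blockBytes - 1) / blockBytes;
    }

    // Average number of blocks per machine, rounded up.
    static long blocksPerMachine(long blocks, long machines) {
        return (blocks + machines - 1) / machines;
    }

    public static void main(String[] args) {
        long blocks = blockCount(5 * PB, 256 * MB);
        System.out.println(blocks);                         // prints 20971520
        System.out.println(blocksPerMachine(blocks, 1000)); // prints 20972
    }
}
```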
Splitting data into blocks and storing the blocks across multiple machines is the first characteristic of distributed storage.
Now suppose one of the 1000 machines goes down. The data blocks on that machine can no longer be served, so the 5 PB file is no longer complete. To solve this problem, we keep a backup of every data block and store the two identical copies on different machines. If the machine holding one copy goes down, the same block on the other machine can still be served. This provides fault tolerance and improves the availability of the data blocks.
Storing redundant copies of data blocks on multiple machines to improve their availability is the second characteristic of distributed storage.
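The effect of keeping two copies on different machines can be sketched as follows. The round-robin placement rule (primary on one machine, backup on the next) is an assumption chosen for simplicity, not a rule from the text; the only property that matters here is that the two copies never share a machine.

```java
// Hypothetical sketch of placing two copies of every block on different machines.
class ReplicaPlacementSketch {

    // Returns the two machine indexes holding a block. Round-robin placement is
    // an assumption; it guarantees only that the two indexes differ.
    static int[] place(long blockId, int machineCount) {
        if (machineCount < 2) {
            throw new IllegalArgumentException("need at least 2 machines");
        }
        int primary = (int) (blockId % machineCount);
        int backup = (primary + 1) % machineCount;
        return new int[] {primary, backup};
    }

    // A block stays readable as long as at least one of its machines is alive.
    static boolean readable(long blockId, int machineCount, int failedMachine) {
        for (int machine : place(blockId, machineCount)) {
            if (machine != failedMachine) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        int machines = 1000;
        int failed = 7; // pretend this machine went down
        boolean allReadable = true;
        for (long block = 0; block < 20_971_520L; block++) {
            allReadable &= readable(block, machines, failed);
        }
        System.out.println("all blocks readable: " + allReadable); // prints all blocks readable: true
    }
}
```

With a single copy per block, every block whose only machine is the failed one would be lost; with two copies on different machines, a single machine failure never removes both.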
Now a new question arises: how do we manage so many data blocks and so many storage machines? We can start a JVM process on another server, and that JVM process is responsible for managing all the storage machines and all the data blocks stored on them, as follows:
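
[Figure: a Storage master holding Node Info and Block Info, and the Storage slaves that store the data blocks]
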
The Storage master in the figure above is responsible for managing all the storage machines and all the data blocks. It keeps two kinds of information: machine node information (Node Info) and data block information (Block Info).
The Storage slaves in the figure above are responsible for storing the data blocks. When a Storage slave starts, it reports information such as its disk capacity to the Storage master. Whenever a Storage slave stores a data block, it also reports that block's information to the Storage master.
Therefore, the third characteristic of distributed storage is that the cluster has a master/slave structure.
Here we must understand three points:
  1. Both the Storage master and each Storage slave are JVM processes. The JVM process on the Storage master machine manages the machine nodes and the data blocks; the JVM process on each Storage slave stores and serves the data blocks
  2. The JVM process on the Storage master and the JVM processes on the Storage slaves communicate via RPC. JVM processes on two different Storage slave machines can also communicate via RPC (for example, when a data block needs to be backed up, the backup copy is transmitted to another slave machine via RPC)
  3. So distributed storage rests on the two foundations we have already covered: JVM startup and RPC. The other big data technologies we will meet later rest on the same two foundations
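The bookkeeping done by the Storage master can be sketched with two maps, one for Node Info and one for Block Info. The class name, method names, and host names below are assumptions for illustration; in a real system the two reporting methods would be invoked by the slaves over RPC rather than called directly.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the metadata a Storage master keeps in memory.
class StorageMasterSketch {

    // Node Info: slave host -> disk capacity in bytes, reported at startup.
    private final Map<String, Long> nodeInfo = new HashMap<>();
    // Block Info: block id -> hosts of the slaves that reported storing it.
    private final Map<Long, List<String>> blockInfo = new HashMap<>();

    // Called (over RPC in a real system) when a Storage slave starts up.
    void registerNode(String host, long capacityBytes) {
        nodeInfo.put(host, capacityBytes);
    }

    // Called (over RPC in a real system) after a Storage slave has stored a block.
    void reportBlock(String host, long blockId) {
        if (!nodeInfo.containsKey(host)) {
            throw new IllegalStateException("unknown node: " + host);
        }
        blockInfo.computeIfAbsent(blockId, id -> new ArrayList<>()).add(host);
    }

    // Returns the hosts known to store the block, or an empty list if none.
    List<String> locations(long blockId) {
        return Collections.unmodifiableList(
                blockInfo.getOrDefault(blockId, Collections.emptyList()));
    }

    int nodeCount() {
        return nodeInfo.size();
    }

    public static void main(String[] args) {
        long tenTb = 10L * 1024 * 1024 * 1024 * 1024;
        StorageMasterSketch master = new StorageMasterSketch();
        master.registerNode("slave-1", tenTb);
        master.registerNode("slave-2", tenTb);
        master.reportBlock("slave-1", 42L);
        master.reportBlock("slave-2", 42L); // the backup copy
        System.out.println(master.locations(42L)); // prints [slave-1, slave-2]
    }
}
```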

 

Summary of the characteristics of distributed storage

  1. Data is split into blocks, and the blocks are stored across multiple machines
  2. Redundant copies of each data block are stored on multiple machines to improve the availability of the data blocks
  3. The distributed storage cluster has a master/slave structure

Distributed Storage Files

In a distributed storage cluster with a master/slave structure, there are in fact two kinds of files:
  1. Files that store the real data. These files are stored on the slaves, and we call them physical files
  2. Relative to the files stored on the slaves, the master also has the concept of a file, but this file stores no data. It is a logical file, represented by a full file path name, and that path name maps to the information about the file's data blocks (where each block is stored)

If the difference between physical files and logical files in a distributed storage system is not clear yet, it does not matter; we will come back to it when we discuss HDFS.
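A logical file can be pictured as nothing more than a map entry on the master. The sketch below is a hypothetical illustration (the class name, the path /logs/app.log, and the host names are all assumptions): the master stores only the path and the block locations, while the bytes themselves stay in physical files on the slaves.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the master's view of files: a logical file is only a
// full path name that maps to block storage locations. No file data lives here.
class LogicalFileSketch {

    // Where one block of a file is physically stored (on which slaves).
    static final class BlockLocation {
        final long blockId;
        final List<String> hosts;

        BlockLocation(long blockId, List<String> hosts) {
            this.blockId = blockId;
            this.hosts = hosts;
        }

        @Override
        public String toString() {
            return "block " + blockId + " on " + hosts;
        }
    }

    // Logical files: full path name -> ordered list of block locations.
    private final Map<String, List<BlockLocation>> namespace = new HashMap<>();

    // Appends one block to the logical file at the given path.
    void addBlock(String path, long blockId, String... hosts) {
        namespace.computeIfAbsent(path, p -> new ArrayList<>())
                 .add(new BlockLocation(blockId, Arrays.asList(hosts)));
    }

    // Resolves a path to its block locations; unknown paths resolve to nothing.
    List<BlockLocation> resolve(String path) {
        return Collections.unmodifiableList(
                namespace.getOrDefault(path, Collections.emptyList()));
    }

    public static void main(String[] args) {
        LogicalFileSketch master = new LogicalFileSketch();
        master.addBlock("/logs/app.log", 1L, "slave-1", "slave-2");
        master.addBlock("/logs/app.log", 2L, "slave-2", "slave-3");
        for (BlockLocation location : master.resolve("/logs/app.log")) {
            System.out.println(location);
        }
    }
}
```

A client that wants to read the file would first ask the master to resolve the path and would then fetch each block from one of the listed slaves.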

 

Origin www.cnblogs.com/tesla-turing/p/11488049.html