Big Data - the basics

1. What is big data?

"5V" characteristic data set forth:

1. Volume: the amount of data involved in collection, storage, and computation is very large. Big data typically starts at the petabyte level (1 PB ≈ 1,000 TB) and can reach exabytes (1 EB ≈ 1 million TB) or zettabytes (1 ZB ≈ 1 billion TB).

2. Variety: data comes in many types and from many sources, including structured, semi-structured, and unstructured data such as web logs, audio, video, images, and location information. This diversity places higher demands on data processing capability.

3. Value: the value density of the data is relatively low; the valuable information is like gold panned out of sand, scarce but precious. With the widespread use of the Internet and the Internet of Things, information is sensed everywhere and arrives in floods, yet its value density is low. How to combine business logic with powerful data-mining algorithms to extract that value is one of the central problems of the big data era.

4. Velocity: data grows quickly, must be processed quickly, and is time-sensitive. For example, a search engine must make news from a few minutes ago available to user queries, and a personalized recommendation algorithm should complete its recommendations in as close to real time as possible. This is a significant difference from traditional data mining.

5. Veracity: the accuracy and trustworthiness of the data, i.e. data quality.

 

2. What is the Hadoop cluster installation process?

  1. Prepare the basic cluster environment, including:

System configuration: set the default run level, create a regular user and configure sudoer permissions, turn off the firewall and SELinux

Network configuration: gateway, IP address, hostname, host mapping

Time configuration: unify the time zone and synchronize the clocks

Security configuration: passwordless SSH login

  2. Install the software:

Obtain the installation package: upload it and extract it to the target directory

Configure the environment variables: JAVA_HOME and HADOOP_HOME

Modify the six Hadoop configuration files:

hadoop-env.sh, core-site.xml, hdfs-site.xml, mapred-site.xml, yarn-site.xml, slaves

3. Distribute the installation package to every node so that each node in the cluster has Hadoop installed

4. On the HDFS master node, format (initialize) the NameNode (hdfs namenode -format)

5. Start HDFS on the HDFS master node (start-dfs.sh)

6. Start YARN on the YARN master node (start-yarn.sh); YARN must be started from its master node

7. Verify that the cluster was installed successfully (for example with jps and the web UIs)

 

3. What are the roles of edits and fsimage in HDFS?

1) The fsimage file is a permanent checkpoint of the Hadoop file system metadata; it contains the serialized inode information of all directories and files in the file system.

2) The edits file stores all update operations on file system paths: every write operation performed by a client is first logged to the edits file.

fsimage and edits are both serialized files. When the NameNode starts, it loads the contents of fsimage into memory and then replays the operations recorded in edits, so that the metadata in memory is complete and up to date; this in-memory metadata then serves client read requests.

After the NameNode is up, update operations on HDFS are again written to the edits file, because the fsimage file is usually large (GB-sized files are very common); if every update were applied directly to fsimage, the system would run very slowly.

Writing to the edits file does not have this problem: after each write operation is executed, and before a success code is returned to the client, the edits file is synchronously updated.

If a file is large and a write operation has to touch multiple machines, the write only returns success after all of the operations have completed. The benefit is that a failure on any single machine cannot leave the metadata out of sync.

 

4. How do you merge small files when there are too many of them?

When each small file holds relatively little data, the files can be merged with a command, for example:

hadoop fs -cat hdfs://cdh5/tmp/lxw1234/*.txt | hadoop fs -appendToFile - hdfs://cdh5/tmp/hdfs_largefile.txt

When the amount of data to be merged is large, it is recommended to merge the small files with a MapReduce job.
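
The same merge can also be done from a small client program using the HDFS FileSystem Java API. Below is a minimal sketch; the input pattern and output path mirror the command above and are purely illustrative. For very large volumes, a MapReduce job (or packing the files into SequenceFiles/HAR archives) is the better choice.

import java.io.InputStream;
import java.io.OutputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class MergeSmallFiles {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();            // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        Path merged = new Path("/tmp/hdfs_largefile.txt");   // illustrative output path
        try (OutputStream out = fs.create(merged)) {
            FileStatus[] smallFiles = fs.globStatus(new Path("/tmp/lxw1234/*.txt")); // illustrative input pattern
            if (smallFiles != null) {
                for (FileStatus st : smallFiles) {
                    try (InputStream in = fs.open(st.getPath())) {
                        IOUtils.copyBytes(in, out, 4096, false);   // append this small file, keep the output stream open
                    }
                }
            }
        }
    }
}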

 

5. How do you deal with missing blocks in Hadoop?

First locate where the blocks were lost: check the log files (or run a file system check such as hdfs fsck) to find the missing blocks and rule out the cause. If the affected file is not important, simply delete it and copy it to the cluster again; if it cannot be deleted, restore it from a backup, since every cluster should keep backups.

 

6. What are the Hadoop daemon processes? What are the NameNode's responsibilities?

Five daemons:

SecondaryNameNode

ResourceManager

NodeManager

NameNode

DataNode

NameNode: the master node. It stores file metadata (file name, directory structure, and file attributes such as creation time, number of replicas, and permissions), plus the block list of each file and the DataNodes on which each block is located. It receives periodic heartbeats and block status reports (the list of all blocks held by a DataNode) from the DataNodes.

As long as heartbeats keep arriving, the NameNode considers the DataNode to be working; if no heartbeat has been received within the timeout (10 minutes 30 seconds by default), the NameNode considers the DataNode to be down.

At that point the NameNode schedules re-replication of the blocks that were stored on the dead DataNode.

A block status report contains the list of all blocks on a DataNode; block reports are sent once every hour by default.

 

7. Which HDFS commands do you use at work?

For example: cat, count, get, ls, put, rm -r (formerly rmr), mv, cp, and so on.

8. Which command displays the health status of all DataNodes?

hadoop dfsadmin -report

 

9. How do you leave safe mode?

hadoop dfsadmin -safemode leave

 

10. How do you quickly kill a job?

1. Run hadoop job -list to get the job-id

2. Run hadoop job -kill job-id

You can also use the YARN command: yarn application -kill appId

 

11. What is the HDFS write data flow?

1. The client uses the Client object provided by HDFS to initiate an RPC request to the remote NameNode;

2. The NameNode checks whether the file to be created already exists and whether the creator has permission to operate; if the checks pass it creates a record for the file, otherwise an exception is thrown back to the client;

3. When the client starts writing the file, it splits the file into multiple packets, manages these packets internally in a "data queue", and applies to the NameNode for blocks, obtaining a suitable list of DataNodes for storing the replicas; the size of the list depends on the replication setting on the NameNode;

4. The packets are written to all replicas in the form of a pipeline. The client library writes each packet as a stream to the first DataNode; after that DataNode stores the packet, it forwards it to the next DataNode in the pipeline, and so on until the last DataNode, so the data is written in a pipelined fashion;

5. After the last DataNode stores the packet successfully, an ack packet is returned and passed back along the pipeline to the client. The client library internally maintains an "ack queue"; when the ack returned by the DataNodes is received, the corresponding packet is removed from the ack queue;

6. If a DataNode fails during transmission, the current pipeline is closed and the failed DataNode is removed from it; the remaining data continues to be transmitted through the remaining DataNodes in pipeline form, and the NameNode assigns a new DataNode to keep the configured number of replicas;

7. After the client finishes writing the data, it calls close() on the data stream to close it (a minimal client-side sketch follows).
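
From the application's point of view all of this pipelining is hidden behind the FileSystem API. A minimal write sketch; the path and content are purely illustrative:

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);                                   // DistributedFileSystem when fs.defaultFS points at HDFS
        try (FSDataOutputStream out = fs.create(new Path("/tmp/demo.txt"))) {   // steps 1-2: RPC to the NameNode happens here
            out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));         // steps 3-5: data is packetized and pipelined to the DataNodes
            out.hflush();                                                       // push buffered packets out to the pipeline
        }                                                                       // step 7: close() completes the file
    }
}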

 

12. There is a very large file that does not fit in memory; how do you deduplicate its contents?

Compute a hash value for each line of the file and distribute the lines into smaller files according to the hash. Suppose the content needs to be split into 100 smaller files: the lines can be distributed according to (hash % 100), and deduplication can then be done within each small file.
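
A minimal single-machine sketch of this idea in Java; the file names and the 100-way split are illustrative, and on a cluster the same hash-partition-then-dedup pattern maps naturally onto a MapReduce job:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.HashSet;
import java.util.Set;

public class ExternalDedup {
    static final int BUCKETS = 100;

    public static void main(String[] args) throws IOException {
        // Phase 1: scatter every line into one of 100 bucket files by hash,
        // so identical lines always land in the same (small) bucket.
        PrintWriter[] buckets = new PrintWriter[BUCKETS];
        for (int i = 0; i < BUCKETS; i++) {
            buckets[i] = new PrintWriter(new FileWriter("bucket_" + i + ".txt"));
        }
        try (BufferedReader in = new BufferedReader(new FileReader("big_input.txt"))) {
            String line;
            while ((line = in.readLine()) != null) {
                int b = (line.hashCode() & Integer.MAX_VALUE) % BUCKETS;   // non-negative hash mod 100
                buckets[b].println(line);
            }
        }
        for (PrintWriter w : buckets) w.close();

        // Phase 2: each bucket now fits in memory, so dedup it with a HashSet.
        try (PrintWriter out = new PrintWriter(new FileWriter("deduped_output.txt"))) {
            for (int i = 0; i < BUCKETS; i++) {
                Set<String> seen = new HashSet<>();
                try (BufferedReader in = new BufferedReader(new FileReader("bucket_" + i + ".txt"))) {
                    String line;
                    while ((line = in.readLine()) != null) {
                        if (seen.add(line)) out.println(line);   // write each distinct line once
                    }
                }
            }
        }
    }
}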

 

13. Which data compression algorithms are used with HDFS?

(1) Gzip compression

Advantages: relatively high compression ratio and fast compression/decompression; Hadoop supports it natively, so applications can process gzip files exactly like plain text; most Linux systems ship with the gzip command, so it is easy to use.

Disadvantages: splitting is not supported.

Scenario: suitable when each compressed file is within roughly 130 MB (about one block in size). For example, the logs for one day or one hour can each be compressed into a gzip file, and a MapReduce program then achieves concurrency through multiple gzip files. Hive programs, streaming programs, and MapReduce programs written entirely in Java can handle the compressed data just like plain text, and the original program needs no changes after compression.

(2) Bzip2 compression

Pros: supports splitting; high compression ratio, higher than gzip; supported by Hadoop itself (though without a native implementation); the bzip2 command ships with Linux, so it is easy to use.

Cons: compression/decompression is slow; no native library support.

Scenario: suitable when speed requirements are low but a high compression ratio is needed, for example as the output format of a MapReduce job; or when the output data is relatively large and needs to be compressed and archived to reduce disk space while being accessed rarely afterwards; or when a single large text file must be compressed to save storage space while still supporting splits, and the existing applications must remain compatible (i.e. need no modification).

(3) Lzo compression

Advantages: fast compression/decompression with a reasonable compression ratio; supports splitting and is the most popular compression format in Hadoop; the lzop command can be installed on Linux and is easy to use.

Disadvantages: the compression ratio is somewhat lower than gzip; Hadoop itself does not support it, so it must be installed; LZO files need some special handling in applications (an index must be built to support splitting, and the LZO input format must be specified).

Scenario: large text files that are still bigger than about 200 MB after compression; the larger the single file, the more obvious LZO's advantage.

(4) Snappy compression

Pros: high compression speed and a reasonable compression ratio.

Disadvantages: splitting is not supported; the compression ratio is lower than gzip; Hadoop itself does not support it, so it must be installed.

Scenario: when the map output of a MapReduce job is large, as the compression format for the intermediate data between map and reduce; or as the output of one MapReduce job that serves as the input of another MapReduce job.
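
For example, a job driver could compress the intermediate map output with Snappy and the final output with gzip. A minimal sketch, assuming the Snappy native library is available on the cluster; it uses the default (identity) map and reduce just to show the configuration calls:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompressedOutputDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Snappy for the intermediate map output (fast, a good fit for shuffle data)
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.setClass("mapreduce.map.output.compress.codec", SnappyCodec.class, CompressionCodec.class);

        Job job = Job.getInstance(conf, "compressed-output-demo");
        job.setJarByClass(CompressedOutputDemo.class);
        // Identity map and reduce: the text input is simply copied through
        job.setNumReduceTasks(1);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Gzip for the final output files
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Package it into a jar and run it with hadoop jar, passing the input and output paths as arguments.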

 

14. Under what circumstances does a DataNode not create replicas?

If the replication factor is set to 1, no additional replicas are made.

 

15. With three DataNodes, what happens when one of them fails?

When the failed DataNode can no longer communicate with the NameNode, the NameNode does not immediately conclude that it is dead; only after a timeout does it confirm that the DataNode is down. The HDFS default timeout is 10 minutes 30 seconds (2 × the heartbeat recheck interval of 5 minutes plus 10 × the heartbeat interval of 3 seconds). Once the DataNode is confirmed down, the number of available replicas drops by one, and the NameNode copies the data that was on the failed DataNode to other machines.

 

16. Where does Hadoop use caching mechanisms, and what are they for?

Jar packages are cached onto the classpath of the nodes that execute tasks;

ordinary files are likewise cached onto the classpath of the nodes where tasks run (the distributed cache);

and the ring (circular) buffer: when the map phase spills files to local disk, a ring buffer sits between the two, which improves efficiency.

 

17. When a 200 MB file is written to HDFS, is the first 128 MB written and then replicated before the remaining 72 MB is written, or is the whole file written first and then replicated?

When data is written to HDFS it is first split into blocks, and a pipeline is formed from the client to the DataNodes. Only after the file has been written to HDFS is the write considered successful, and the replica backup operations then complete; so the whole file is written first and then replicated.

 

18. Briefly describe the RPC protocol in Hadoop and the layers of the underlying framework.

Serialization layer: converts a user's request parameters or responses into a byte stream so they can be transmitted across machines.

Function call layer: its main job is to locate the function to be called and execute it; Hadoop implements the function call layer using Java reflection and dynamic proxies.

Network transport layer: describes how messages are transmitted between client and server; Hadoop uses a socket mechanism based on TCP/IP.

Server processing framework: can be abstracted as the network I/O processing model. It describes how client and server exchange information, and its design directly determines the server's capacity for concurrent processing. Common network I/O models include blocking I/O, non-blocking I/O, and event-driven I/O; Hadoop uses an event-driven I/O model based on the Reactor design pattern.
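
The "dynamic proxy" point in the function call layer can be illustrated with plain JDK classes. This toy sketch only shows the idea of intercepting an interface call so it could be serialized and shipped to a server; it is not Hadoop's actual RPC code, and the protocol interface is made up for illustration:

import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;

public class RpcProxyDemo {
    // A "protocol" interface, as the client sees it
    interface GreetingProtocol {
        String hello(String name);
    }

    public static void main(String[] args) {
        // The proxy intercepts every call; a real RPC framework would serialize the
        // method name and arguments here and send them to the server over a socket.
        InvocationHandler handler = (proxy, method, methodArgs) -> {
            System.out.println("would send over the network: " + method.getName());
            return "hello, " + methodArgs[0];   // pretend this came back from the server
        };
        GreetingProtocol client = (GreetingProtocol) Proxy.newProxyInstance(
                GreetingProtocol.class.getClassLoader(),
                new Class<?>[]{GreetingProtocol.class},
                handler);
        System.out.println(client.hello("hdfs"));   // looks like a local call, behaves like a remote one
    }
}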

 

19. What are the architectural differences between Hadoop 1.x and 2.x?

(1) Hadoop 1.0

Hadoop 1.0 is the first generation of Hadoop. It consists of the distributed storage system HDFS and the distributed computing framework MapReduce, where HDFS is composed of one NameNode and multiple DataNodes, and MapReduce is composed of one JobTracker and multiple TaskTrackers. The corresponding versions are Apache Hadoop 0.20.x, 1.x, 0.21.x, 0.22.x, and CDH3.

(2) Hadoop 2.0

Hadoop 2.0 is the second generation of Hadoop, proposed to overcome the problems of HDFS and MapReduce in Hadoop 1.0. To address the scalability limits imposed by the single NameNode in Hadoop 1.0's HDFS, it introduces HDFS Federation, which lets multiple NameNodes each manage a different part of the directory tree, providing access isolation and horizontal scaling; Hadoop 2.0 also resolves the NameNode single point of failure. To address MapReduce 1.0's shortcomings in scalability and multi-framework support, the resource management and job control functions of the JobTracker are separated into two components, implemented by the ResourceManager and the ApplicationMaster: the ResourceManager is responsible for resource allocation for all applications, while each ApplicationMaster is responsible for managing only one application. This gave birth to the new general-purpose resource management framework YARN. On top of YARN, users can run various types of applications (no longer restricted to the MapReduce class as in 1.0), from offline computation with MapReduce to online/stream computation with Storm and others. The corresponding versions are Apache Hadoop 0.23.x, 2.x, and CDH4.

(3) MapReduce 1.0 (MRv1)

The MapReduce 1.0 computing framework consists of three parts: a programming model, a data processing engine, and a runtime environment. Its programming model abstracts a job into two phases, Map and Reduce: the Map phase parses the input data into key/value pairs, processes them by calling the map() function iteratively, and writes the results to a local directory as key/value pairs; the Reduce phase aggregates the values that share the same key and writes the final result to HDFS (a minimal WordCount example of this model is sketched at the end of this answer). Its data processing engine consists of MapTask and ReduceTask, which handle the logic of the Map and Reduce phases respectively. Its runtime environment consists of two kinds of services, one JobTracker and several TaskTrackers: the JobTracker is responsible for resource management and the control of all jobs, while the TaskTrackers receive commands from the JobTracker and execute them. The framework's weaknesses in scalability, fault tolerance, and multi-framework support led to the creation of MRv2.

(4) MRv2

MRv2 has the same programming model and data processing engine as MRv1; the only difference is the runtime environment. MRv2 is the MapReduce computing framework running on top of the resource management framework YARN, obtained after reworking MRv1. Its runtime environment no longer consists of JobTracker and TaskTracker services, but of the general-purpose resource management system YARN and the per-job control process ApplicationMaster: YARN is responsible for resource management and scheduling, while the ApplicationMaster is responsible only for managing a single job. In short, MRv1 is a standalone offline computing framework, while MRv2 is MapReduce running on YARN.

(5) YARN

YARN is the resource management system of Hadoop 2.0. It is a general-purpose resource management module that can manage and schedule resources for all kinds of applications. YARN is not limited to MapReduce; it can also serve other frameworks, such as Tez, Spark, and Storm. YARN is similar to the earlier resource management systems Mesos and Torque. Because of YARN's generality, the core of next-generation MapReduce has shifted from the simple MapReduce computing framework, which supports a single type of application, to the general-purpose resource management system YARN.

(6) HDFS Federation

Hadoop 2.0 improves HDFS so that multiple NameNodes can scale out horizontally, each NameNode managing part of the directory tree; this is HDFS Federation. The mechanism not only improves the scalability of HDFS but also provides isolation within HDFS.
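
As mentioned under (3), the programming model abstracts a job into a Map phase and a Reduce phase; the classic WordCount program is about the smallest complete example of that model. A minimal sketch against the org.apache.hadoop.mapreduce API, with input and output paths taken from the command line:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Map phase: parse each input line into (word, 1) pairs
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (token.isEmpty()) continue;
                word.set(token);
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: all values for the same key are aggregated into one count
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Package it into a jar and run it with hadoop jar, passing the input and output directories as arguments.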

 

20. The HDFS trash (protection against accidental deletion)

It is off by default and must be enabled manually by modifying the core-site.xml configuration.

Add:

<property>
    <name>fs.trash.interval</name>
    <value>1440</value>
    <!-- How long deleted files stay in the trash, in minutes -->
</property>

<property>
    <name>fs.trash.checkpoint.interval</name>
    <value>1440</value>
    <!-- Trash checkpoint interval; should be less than or equal to the value above -->
</property>

If the trash is enabled, HDFS creates a recycle bin for each user. When a user deletes a file, the file does not disappear completely;

instead it is moved (mv) to the /user/username/.Trash/ folder, and for a period of time the user can restore the deleted file.

If the user does not restore it, the system deletes the file once the configured retention time has elapsed. The user can also empty the recycle bin manually,

in which case the deleted files can no longer be recovered.

Java API:

Trash trash = new Trash(fs, conf);
trash.moveToTrash(new Path("/xxxx"));

Shell: if you want to delete a file directly instead of moving it to the recycle bin, use the -skipTrash option.

For example: hadoop fs -rm -r -skipTrash /test

 

View the recycle bin: hadoop fs -ls /user/hadoop/.Trash/Current
