Big Data exam questions: key points

Disclaimer: This is an original article by the blogger, released under the CC 4.0 BY-SA license. Please include a link to the original source and this statement when reproducing it.
Original link: https://blog.csdn.net/pingsha_luoyan/article/details/97750251

(Short-answer questions: each question is worth 5 points, 100 points in total)

  1. What is the difference between a fully distributed Hadoop cluster and a pseudo-distributed cluster?
  1. A fully distributed cluster requires more than one machine: one machine acts as the master, and the other machines act as slaves that store the data.
  2. A pseudo-distributed cluster can be built on a single machine; you only need to add your own hostname to the slaves file, and everything else is configured the same way as in a fully distributed cluster.
  2. Hadoop's core is divided into three parts. What are they, and what is the function of each?
  1. The three core components are HDFS, MapReduce, and YARN.
  2. HDFS is the distributed file system, MapReduce is the distributed computing framework, and YARN is the resource scheduler.
  3. Functions:
     HDFS: a number of servers work together to provide its functionality, and each server in the cluster has its own role. It is used to store files, and files are located through a directory tree.

     MapReduce: uses the "divide and conquer" idea to process large-scale data. The data is split into multiple parts, multiple nodes in the cluster process those parts at the same time (in parallel), the intermediate results of each node are then aggregated, and a further computation produces the final result.

  3. In a distributed Hadoop cluster, what is the default number of replicas, and what is the default size of each data block? In which configuration file can the replica count and the block size be modified?

By default each block has 3 replicas and each data block is 128 MB. Both the block size and the replication factor can be modified in the hdfs-site.xml configuration file:

<property>
    <name>dfs.blocksize</name>
    <value>134217728</value>  <!-- 128 MB, expressed in bytes -->
</property>
<property>
    <name>dfs.replication</name>
    <value>3</value>
</property>

 

  4. In a distributed Hadoop cluster, what do NN, DN, and SN refer to? What is the role of each?

NN (NameNode): manages the HDFS namespace, manages the block replication policy and the block mapping information, and handles read and write requests from clients.

DN (DataNode): stores the actual data blocks and performs read/write operations on them.

SN (SecondaryNameNode): assists the NameNode and shares part of its workload, for example by periodically merging the FsImage and Edits files and pushing the result to the NameNode; in an emergency it can help recover the NameNode.

  5. In a distributed Hadoop cluster, what happens if the NN goes down? If a DN goes down? If the SN goes down? Will data be lost?

If the NN goes down, data is lost and the distributed cluster stops working. If a DN goes down, the data is not lost because other machines hold replicas, so under normal circumstances it is not a big problem for the cluster. If the SN goes down, no data is lost, but the backup of the NameNode's metadata mapping is no longer maintained; to guard against an accident (the NN going down), the SN should be restored as soon as possible.

  6. What is the YARN module in a Hadoop cluster? What role does it play in distributed computing?

YARN is the resource scheduler.

The ResourceManager in YARN is the cluster manager: it is responsible for coordinating and managing the resources of the entire cluster (all NodeManagers), and for receiving, parsing, scheduling, and monitoring the different types of applications submitted by users.

The NodeManager runs on each slave machine. It provides the containers in which applications actually execute, monitors the applications' resource usage (CPU, memory, disk, network), and reports to the cluster resource scheduler, the ResourceManager, through heartbeats to update its health status. It also manages the container lifecycle, monitors the resource usage (memory, CPU, etc.) of each container, tracks node health, manages logs, and provides the auxiliary services used by different applications.

  7. Describe the MapReduce process of counting the words in a text file. (Describe it in text and with a flow chart.)

The process is divided into four stages: split, map, shuffle, and reduce.

MapReduce process

1. Split stage (input splitting)

Suppose there are two files. After the splitting step they are divided into three splits (split1, split2, split3), which are used in turn as the input of the map stage.

Each of the three splits produced is a line of words, and each split is used as the input of one map task.

2. Map stage

The output of the split stage is the input of the map stage, and one split corresponds to one map task. In the map stage, each split is read and turned into key/value pairs: the key is a word and the value is 1.

The main design decision in the map stage is what the key and the value should be, because the key is used later as the basis for the reduce. Example output: <Deer, 1>, <River, 1>, <Bear, 1>, <Bear, 1>.

The output of the map stage is used as the input of the shuffle stage.

3. Shuffle stage

Shuffle is the process that turns the map output into the reduce input, and it involves network transfer.

Map outputs with the same key are grouped together and passed to one reduce task as its input.

4. Reduce stage

The values for the same key are accumulated. Example output: <Bear, 2>.
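As a side note, the same word-count logic can be sketched in HiveQL (Hive is covered later in these questions). This is only an illustrative analogy, assuming a hypothetical table docs with a single STRING column named line: splitting each line into words plays the role of the map stage, GROUP BY plays the role of the shuffle, and count(1) plays the role of the reduce.

-- word count as a HiveQL query, assuming a hypothetical table docs(line STRING)
SELECT word, count(1) AS cnt
FROM docs
LATERAL VIEW explode(split(line, ' ')) t AS word   -- "map": break each line into words
GROUP BY word;                                     -- "shuffle" + "reduce": group and count identical words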

 

  8. In a Hadoop cluster, what are RM and NM? What do they do?

RM and NM belong to YARN, the Hadoop resource scheduler.

RM (ResourceManager): 1) handles client requests; 2) starts and monitors the MRAppMaster; 3) monitors the NodeManagers; 4) allocates and schedules resources.

NM (NodeManager): 1) manages the resources on a single node; 2) processes commands from the ResourceManager; 3) processes commands from the MRAppMaster.

  9. Besides RM and NM, which other modules does YARN in a Hadoop cluster contain, and what is the role of each module?

YARN also contains two modules: the ApplicationMaster (AM) and the Container.

ApplicationMaster (AM):

         Each application submitted by a user contains one AM, which is in charge of monitoring the application, tracking its execution state, and restarting failed tasks. The ApplicationMaster is the application's framework component: it is responsible for negotiating resources with the ResourceManager and for working together with the NodeManagers to execute and monitor the tasks.

Container:

        The Container is YARN's abstraction of resources. It encapsulates the multi-dimensional resources of a node, such as memory, CPU, disk, and network. When the AM requests resources from the RM, the resources the RM returns to the AM are represented as Containers. YARN assigns one Container to each task, and a task can only use the resources described by its Container.

 

  10. What is ETL? What is Kettle used for? Which core objects do you commonly use, and what are their roles?

ETL describes the process in which data is extracted, transformed, and loaded from a source to a destination. The term ETL is most commonly used in data warehousing, but its scope is not limited to data warehouses. Kettle is an open-source ETL tool used to carry out this extraction, transformation, and loading.

Core objects: BI and the data warehouse.

Roles: BI: effectively integrates an enterprise's existing data (raw data, business data, and so on), provides fast and accurate reports, and supplies the basis for decision making, helping the company make well-informed business decisions.

     Data warehouse: gives the enterprise a certain BI (business intelligence) capability and guides business process improvement, with control over time, cost, and quality. The input side of the data warehouse is a wide variety of data sources, and its final output is the data used for business analysis, data mining, and reporting.

  11. In a Hadoop cluster, what do port 50070 and port 8088 refer to?

50070: the web UI of the NameNode on the master node;

8088: the web UI of the RM in the YARN resource scheduler, where all running and completed applications can be seen.

  12. For the Hive data warehouse, which commands start the Hive servers, and what does each mean? What is the difference between internal tables and external tables? What do partitioning and bucketing refer to?

Commands: hive --service metastore starts the metadata service.

hive: runs the Hive command line locally.

hiveserver2: starts the remote service, which listens on port 10000 by default.

Internal tables: when an internal (managed) table is dropped, its data is deleted as well.

External tables: an external table must be created with the EXTERNAL keyword; when the table is dropped, the table's data is still stored in Hadoop and is not lost. A minimal example of the difference is sketched below.

Partitioning: splits a table into sub-directories (sub-folders), so that a large data set can be divided into smaller data sets according to the business.

Bucketing: splits the data itself: the data set is decomposed into several parts (bucket files), which makes it easier to manage.
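The sketch below illustrates the internal/external difference in HiveQL; the table names and the HDFS location are hypothetical.

-- managed (internal) table: DROP TABLE also deletes its data files
CREATE TABLE student_managed (id INT, name STRING);

-- external table: DROP TABLE only removes the metadata; the files stay on HDFS
CREATE EXTERNAL TABLE student_ext (id INT, name STRING)
    LOCATION '/data/student_ext';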

  13. In MySQL, what are the left join, right join, inner join, and Cartesian product? Use a student table (student id, student name, enrollment time) and a score table (score id, student id, score, subject id) as an example (fill in some test data) and write the SQL statements.

Student table (student_id, student_name, enroll_time):

    1    ie       2019
    2    zhang    2018

Score table (score_id, student_id, score, subject_id):

    1    1    100    1
    2    1     90    2
    3    2     80    1
    4    2     70    2
    5    3     60    1
    6    4     50    2

Left join: SELECT student.*, score.* FROM student LEFT JOIN score ON student.student_id = score.student_id;

    1    ie       2019    1    1    100    1
    1    ie       2019    2    1     90    2
    2    zhang    2018    3    2     80    1
    2    zhang    2018    4    2     70    2

Right join: SELECT student.*, score.* FROM student RIGHT JOIN score ON student.student_id = score.student_id;

    1       ie       2019    1    1    100    1
    1       ie       2019    2    1     90    2
    2       zhang    2018    3    2     80    1
    2       zhang    2018    4    2     70    2
    NULL    NULL     NULL    5    3     60    1
    NULL    NULL     NULL    6    4     50    2

Inner join: SELECT student.*, score.* FROM student INNER JOIN score ON student.student_id = score.student_id;

    1    ie       2019    1    1    100    1
    1    ie       2019    2    1     90    2
    2    zhang    2018    3    2     80    1
    2    zhang    2018    4    2     70    2

 

Cartesian product: SELECT student.*, score.* FROM student CROSS JOIN score;

    1    ie       2019    1    1    100    1
    1    ie       2019    2    1     90    2
    1    ie       2019    3    2     80    1
    1    ie       2019    4    2     70    2
    1    ie       2019    5    3     60    1
    1    ie       2019    6    4     50    2
    2    zhang    2018    1    1    100    1
    2    zhang    2018    2    1     90    2
    2    zhang    2018    3    2     80    1
    2    zhang    2018    4    2     70    2
    2    zhang    2018    5    3     60    1
    2    zhang    2018    6    4     50    2

 

 

  14. What is BI, and what is a data warehouse? Why do we need the Hive data warehouse?

BI: business intelligence. It is a complete set of solutions that effectively integrates an enterprise's existing data (raw data, business data, and so on), provides fast and accurate reports, and supplies the basis for decision making, helping the company make well-informed business decisions.

Data warehouse: a large collection of stored data, created by filtering and integrating diverse business data for the purpose of reporting, analysis, and enterprise decision support.

Why: Hive gives the enterprise a certain BI (business intelligence) capability and guides business process improvement, with control over time, cost, and quality. The input side of the data warehouse is a wide variety of data sources, and its final output is the data used for business analysis, data mining, and reporting.

  15. What is the relationship between the Hive data warehouse and a Hadoop cluster? What are the benefits of using Hive?

Relationship: Hive is a data warehousing tool built on top of Hadoop. It can map structured data files to tables and provides SQL-like query functionality. Hadoop is used to store the data, while Hive is used to operate on the data: Hive's data files are stored on HDFS, and their location is managed by HDFS as part of Hadoop.

Benefits: 1) The interface uses SQL-like syntax, which enables rapid development (simple and easy to use).
2) It avoids having to write MapReduce jobs, which reduces the learning cost for developers.
3) Hive's execution latency is relatively high, so Hive is commonly used for data analysis and for applications without strict real-time requirements.
4) Hive's advantage lies in processing big data; it has no particular advantage for processing small data, because its execution latency is relatively high.
5) Hive supports user-defined functions, so users can implement their own functions according to their needs.

  16. In a Hadoop cluster, can hadoop namenode -format be executed repeatedly on any node? If the command is executed multiple times, what happens? If it has been executed again on a node, how do you fix it?

No. Reformatting deletes the NameNode's metadata, so the existing files in the cluster are lost. To fix it, delete the name (and data) directories on all the machines in the cluster and then restart Hadoop.

  17. How do you create an external table in Hive (including partitions and buckets)? What issues do you need to pay attention to when creating the table and when importing data afterwards? How do you import the local file /data.txt and the HDFS data file data.csv into an external table in Hive?

When creating the table, use the EXTERNAL keyword. The number and types of the fields must match the data, and the field separator must be the same as the one used in the data files. When importing data, pay attention to whether existing data should be overwritten: with OVERWRITE the existing data is replaced, and without OVERWRITE it is not overwritten (the new data is appended).

Local file: load data local inpath '/data.txt' into table <table name>;

HDFS file: load data inpath '<hdfs path>/data.csv' into table <table name>;
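A minimal HiveQL sketch of creating such an external table with a partition and buckets; the table name, columns, partition key, and HDFS location are hypothetical.

CREATE EXTERNAL TABLE student_score (
    student_id INT,
    name       STRING,
    score      DOUBLE
)
PARTITIONED BY (subject STRING)                  -- partition: one sub-directory per subject value
CLUSTERED BY (student_id) INTO 4 BUCKETS         -- buckets: rows split into 4 files per partition
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','    -- separator must match the data files
LOCATION '/warehouse/student_score';             -- external location: data survives DROP TABLE

-- loading a file into one partition; note that LOAD DATA only copies the file, so to actually
-- distribute rows into buckets, data is normally written with INSERT ... SELECT instead
LOAD DATA LOCAL INPATH '/data.txt' INTO TABLE student_score PARTITION (subject = 'math');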

  18. What collection data types does Hive have? What are they for? Under what circumstances does Hive need the collection types?

Data types: there are six basic types: integer, boolean, floating-point, string, timestamp, and binary (byte array).

               There are also three collection (complex) types: struct, map, and array.

Collection types are used in cases where the basic data types cannot express the data.
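A minimal sketch of a table that uses the three collection types; the table name, columns, and separators are hypothetical.

CREATE TABLE employee (
    name     STRING,
    hobbies  ARRAY<STRING>,                        -- array: a list of values of the same type
    scores   MAP<STRING, INT>,                     -- map: key/value pairs, e.g. subject -> score
    address  STRUCT<city:STRING, street:STRING>    -- struct: a group of named fields
)
ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
    COLLECTION ITEMS TERMINATED BY '_'
    MAP KEYS TERMINATED BY ':';

-- accessing the collection fields
SELECT name, hobbies[0], scores['math'], address.city FROM employee;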

  19. In Hive, what is the difference between bucketing and partitioning? What is the difference between external tables and internal tables?
  20. Which row-oriented and column-oriented storage formats in Hive do you know? What are their differences? How do you understand columnar storage?

Row storage: TextFile, SequenceFile, Avro.

Columnar storage: RCFile, ORC, Parquet.

Differences: columnar formats store the data in binary, so opening the files directly shows unreadable content.

Understanding: columnar storage makes queries more convenient and faster. For example: select * from student where age = 19;

With columnar storage, the age column can be read directly to find the rows where age = 19, instead of reading the other columns of every row to look for age = 19, which greatly speeds up the query.
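A minimal sketch showing how the storage format is chosen when a table is created; the table and column names are hypothetical.

-- row-oriented storage: plain text, one row per line
CREATE TABLE student_text (id INT, name STRING, age INT)
    STORED AS TEXTFILE;

-- column-oriented storage: the same columns stored as ORC, so a query that
-- filters on 'age' only needs to read that column's data
CREATE TABLE student_orc (id INT, name STRING, age INT)
    STORED AS ORC;

-- columnar tables are usually populated from an existing table
INSERT INTO TABLE student_orc SELECT id, name, age FROM student_text;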
