Hadoop knowledge points (1)

Why is Hadoop faster than traditional technical solutions?

1. Distributed storage
2. Distributed parallel computing
3. Horizontal scaling of nodes
4. Moving the computation to where the data resides (data locality)
5. Multiple data replicas

What are the characteristics of big data?

(1) Massive: a large amount of data
(2) Diversified: structured, semi-structured, and unstructured data
(3) Rapid: a fast data growth rate
(4) High value: massive data holds high value

What do the HDFS shell client commands mean?

(1) ls - display file and directory information
(2) mkdir - create a directory on HDFS; -p creates all parent directories in the path as needed
(3) put - copy a single src, or multiple srcs, from the local file system to the target file system
(4) get - copy a file from HDFS to the local file system
(5) appendToFile - append a file to the end of an existing file
(6) cat - display the contents of a file
(7) tail - display the last part of a file
(8) chmod - change file permissions; -R applies the change recursively through the directory structure
(9) copyFromLocal - copy files from the local file system to an HDFS path
(10) copyToLocal - copy files from HDFS to the local file system
(11) cp - copy from one HDFS path to another HDFS path
(12) mv - move files within an HDFS directory
(13) rm - delete the specified files (without -r it deletes files only); -r deletes directories recursively
(14) df - display the available space of the file system
(15) du - display the sizes of all files in a directory, or the size of a single file when only one file is specified
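
A few of these in action, as a minimal sketch (the local file words.txt and the HDFS path /data are hypothetical examples):

hdfs dfs -mkdir -p /data             # create /data, including any missing parent directories
hdfs dfs -put words.txt /data        # copy a local file into HDFS
hdfs dfs -ls /data                   # list files and directories under /data
hdfs dfs -cat /data/words.txt        # print the file contents
hdfs dfs -get /data/words.txt ./     # copy the file back to the local file system
hdfs dfs -rm -r /data                # delete the directory recursively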

What can big data do?

(1) Fast querying of massive data
(2) Storage of massive data (large data volume, large single files)
(3) Fast computation over massive data (compared with traditional tools)
(4) Real-time computation over massive data (immediate results)
(5) Data mining (uncovering valuable data that had not been discovered before)

What are the main functions of HDFS?

The main function of HDFS is to store a large amount of data in a distributed manner

In which file is Hadoop's trash mechanism configured?

It is configured in the core-site.xml file.

What are the trash configuration parameters?

fs.trash.interval
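
As a sketch, enabling the trash in core-site.xml might look like this (the 1440-minute value is only an example; fs.trash.interval is measured in minutes, and 0 disables the trash):

<property>
    <name>fs.trash.interval</name>
    <!-- keep deleted files in the trash for 1440 minutes (one day); example value -->
    <value>1440</value>
</property>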

What are the commands to start and stop the JobHistoryServer service process?

Start: mr-jobhistory-daemon.sh start historyserver
Stop: mr-jobhistory-daemon.sh stop historyserver

What is the default port of the JobHistoryServer web UI?

The default port is 19888
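
So, assuming a JobHistoryServer running on a host named node01 (a hypothetical hostname), the web UI would typically be reachable at:

http://node01:19888/jobhistory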

What are the files that need to be configured when installing hadoop?

(1)hadoop-env.sh
(2)core-site.xml
(3)hdfs-site.xml
(4)mapred-site.xml
(5)yarn-site.xml
(6)slaves
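
For illustration, a minimal sketch of the two central files (the node01 hostname and the 8020 port are hypothetical example values):

core-site.xml:
<property>
    <name>fs.defaultFS</name>
    <!-- URI of the default file system; node01 is a hypothetical NameNode host -->
    <value>hdfs://node01:8020</value>
</property>

hdfs-site.xml:
<property>
    <name>dfs.replication</name>
    <!-- number of replicas kept for each block; 3 is the Hadoop default -->
    <value>3</value>
</property>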

Which command must be run to format HDFS before it is started for the first time?

bin/hdfs namenode -format or bin/hadoop namenode -format

What folders are included in the Hadoop installation directory, and what are their functions?

(1) bin: the directory containing Hadoop's most basic management and usage scripts
(2) etc: the directory containing the Hadoop configuration files
(3) include: header files for the programming libraries provided to external users
(4) lib: the dynamic and static programming libraries that Hadoop provides externally
(5) libexec: the directory containing the shell configuration files used by each service
(6) sbin: the directory containing the Hadoop management scripts
(7) share: the directory containing the compiled jar packages of each Hadoop module, including the bundled official examples

What are Hadoop's characteristic advantages?

(1) Scalability (easy capacity expansion)
(2) Low cost
(3) High efficiency
(4) Reliability

What are the ways to deploy Hadoop?

(1) Standalone mode (local, independent mode)
(2) Pseudo-distributed mode
(3) Cluster mode (fully distributed mode)

What is the command for network time synchronization?

ntpdate cn.pool.ntp.org (ntpdate followed by an NTP server address)

In which file is the hostname set?

/etc/sysconfig/network

Which file is used to configure IP and hostname mapping?

/etc/hosts
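
As a sketch, the mapping entries might look like this (the IP addresses and hostnames node01 through node03 are hypothetical examples):

192.168.1.101   node01
192.168.1.102   node02
192.168.1.103   node03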

Command to start HDFS NameNode?

hadoop-daemon.sh start namenode

Start HDFS DataNode on a single node?

hadoop-daemon.sh start datanode

Start YARN ResourceManager on a single node?

yarn-daemon.sh start resourcemanager

What are the one-click startup and shutdown script commands for HDFS clusters?

Start script: start-dfs.sh
Stop script: stop-dfs.sh
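
A typical session might look like the following sketch; jps lists the running Java processes, which is a quick way to verify that the daemons came up:

start-dfs.sh    # starts the NameNode, DataNodes, and SecondaryNameNode
jps             # NameNode / DataNode / SecondaryNameNode should appear in the output
stop-dfs.sh     # stops the whole HDFS cluster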

A brief overview of the difference between Hadoop's combiner and partitioner

Both the combiner and the partitioner are functions that run in the shuffle stage between map and reduce. The combiner runs on the map side (and again during the merge on the reduce side); its job is to merge key-value pairs that share the same key ahead of time, and it can be customized. The partitioner divides the output of each map node, mapping each record to a reducer according to its key; it can also be customized. In effect, it can be understood as a classification step.

What does HBase rely on to provide a message communication mechanism?

ZooKeeper

Please describe in detail the structure of a cell in HBase

The storage unit determined by a row and a column in HBase is called a cell. A cell is uniquely determined by {row key, column (= <family> + <qualifier>), version}. The data in a cell has no type and is stored entirely as raw bytes.
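
A quick illustration in the HBase shell (the table t1, row key r1, and column cf:q1 are hypothetical names; cf must be an existing column family):

put 't1', 'r1', 'cf:q1', 'value1'                    # write one cell
get 't1', 'r1', {COLUMN => 'cf:q1', VERSIONS => 3}   # read up to 3 versions of the cell, if the family keeps them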

When is a compaction triggered in HBase?

1) After a MemStore flush, HBase checks whether a compaction is needed
2) The CompactionChecker thread polls periodically

The difference between HBase and MySQL

MySQL stores data by row: the data of an entire row forms a unit and is stored together.
HBase stores data by column: the data of an entire column forms a unit and is stored together, which facilitates compression and statistics.

The role of compaction in HBase

1. Merge store files
2. Clean up expired data
3. Improve the efficiency of reading and writing data
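
Besides the automatic triggers mentioned earlier, a compaction can also be requested manually from the HBase shell (t1 is a hypothetical table name):

compact 't1'          # request a minor compaction of table t1
major_compact 't1'    # request a major compaction of table t1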

Big data processing flow

Data production → data collection → data storage → requirements analysis → data preprocessing → data computation → result data storage → result data presentation

How does HBase deal with downtime?

Downtime falls into two cases: HMaster downtime and HRegionServer downtime. If an HRegionServer goes down, the HMaster redistributes the regions it managed to other live RegionServers; because the data and logs are persisted in HDFS, this operation causes no data loss, so data consistency and safety are guaranteed. The HMaster has no single point of failure: multiple HMasters can be started in HBase, and ZooKeeper's Master Election mechanism ensures that one Master is always running to provide service externally.
