Touge Big Data Assignment 4: HBase

Extracurricular Homework 4: HBase

  • Assignment details

Content

I. Install and configure pseudo-distributed HBase

Prerequisite: pseudo-distributed Hadoop has already been installed and started by following the experiment "Building a Hadoop Environment". The shutdown order is the reverse of the startup order: stop HBase first, then Hadoop.

• HBase official download page (downloads from the official site are slow): Index of /hbase
• Domestic mirror for hbase-2.4.16: https://mirrors.tuna.tsinghua.edu.cn/apache/hbase/2.4.16/hbase-2.4.16-bin.tar.gz
• Download: wget --no-check-certificate https://mirrors.tuna.tsinghua.edu.cn/apache/hbase/2.4.16/hbase-2.4.16-bin.tar.gz
• Unpack: tar -zxvf hbase-2.4.16-bin.tar.gz -C /opt/
• Rename: mv /opt/hbase-2.4.16 /opt/hbase
• Configure environment variables:
  echo 'export HBASE_HOME=/opt/hbase' >> /etc/profile
  echo 'export PATH=$PATH:$HBASE_HOME/bin' >> /etc/profile
  source /etc/profile
• Configuration files:

  1. Configure /opt/hbase/conf/hbase-env.sh by entering the following on the Linux command line:
     echo 'export JAVA_HOME=/usr/java8' >> /opt/hbase/conf/hbase-env.sh
     echo 'export HBASE_CLASSPATH=/opt/hbase/conf' >> /opt/hbase/conf/hbase-env.sh
     echo 'export HBASE_MANAGES_ZK=true' >> /opt/hbase/conf/hbase-env.sh
  2. Edit the configuration file with vim /opt/hbase/conf/hbase-site.xml and add the following properties inside <configuration> (the <property> wrappers are required):
     <property>
       <name>hbase.rootdir</name>
       <value>hdfs://localhost:9000/hbase</value>
     </property>
     <property>
       <name>hbase.cluster.distributed</name>
       <value>true</value>
     </property>
  3. Start HBase (start Hadoop first): start-hbase.sh. Afterwards jps should show, in addition to the Hadoop processes: HMaster, HRegionServer, HQuorumPeer. Visit the web UI at http://EIP:16010.
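A minimal sketch of the whole start/stop sequence, assuming the pseudo-distributed Hadoop from "Building a Hadoop Environment" is managed with start-dfs.sh and stop-dfs.sh:

```shell
# startup: Hadoop first, then HBase
start-dfs.sh      # HDFS: NameNode, DataNode, SecondaryNameNode
start-hbase.sh    # HBase: HMaster, HRegionServer, HQuorumPeer
jps               # verify that all processes are running

# shutdown: the reverse order — HBase first, then Hadoop
stop-hbase.sh
stop-dfs.sh
```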

• Operate HBase: enter hbase shell on the Linux client to open the HBase shell and work with data tables; type exit to leave the shell.
• Stop HBase (and then stop Hadoop): stop-hbase.sh

Experiment requirements:

  1. Practice common HBase shell commands (see textbook section 4.6.1 or the Huawei Cloud experiments below): create a student-information table and add data, with no fewer than 5 columns and no fewer than 10 data records. The table name is your full name with the last four digits of your student number appended, and the table data contains your own name. A sketch is given after this list.
  2. Display all the data in the table on the command line; take a screenshot.
  3. Show the data table on the HDFS web page; take a screenshot.
• Compatible versions of Hadoop and HBase:
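A minimal HBase shell sketch for requirement 1; the table name zhangsan1771 and the Info columns are placeholders (substitute your own name, the last four digits of your student number, and at least 10 rows):

```shell
hbase shell                       # open the HBase shell, then:
create 'zhangsan1771', 'Info'
put 'zhangsan1771', '001', 'Info:Name',  'zhangsan'
put 'zhangsan1771', '001', 'Info:Age',   '20'
put 'zhangsan1771', '001', 'Info:Sex',   'male'
put 'zhangsan1771', '001', 'Info:Major', 'Big Data'
put 'zhangsan1771', '001', 'Info:Phone', '13800000000'
# ... repeat the puts for row keys '002' through '010' ...
scan 'zhangsan1771'               # requirement 2: display all data in the table
```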
II. Huawei Cloud KooLabs experiments
  4. "Basic Operations of HBase Data Tables" KooLabs Cloud Experiment_Online Experiment_Cloud Practice_Cloud Computing Experiment_AI Experiment_Huawei Cloud Official Experiment Platform-Huawei Cloud
  5. "HBase Data Warehouse Loading" KooLabs Cloud Experiment_Online Experiment_Cloud Practice_Cloud Computing Experiment_AI Experiment_Huawei Cloud Official Experiment Platform-Huawei Cloud III. Briefly answer the content of "Classroom Assessment"
  1. Where did you download HBase from? https://mirrors.tuna.tsinghua.edu.cn/apache/hbase/2.4.16/hbase-2.4.16-bin.tar.gz
  2. What does the HBase pseudo-distributed configuration file hbase-site.xml configure? It configures where HBase (and ZooKeeper) write their data: hbase.rootdir points HBase's storage at HDFS (hdfs://localhost:9000/hbase), and hbase.cluster.distributed enables distributed mode.
  3. Do I need to start Hadoop before starting HBase? Why? Yes. HBase stores its data on HDFS, so it depends on Hadoop being up.
  4. What process nodes do you see after HBase is started, and what do they do?
• HMaster: the management node of the HBase cluster, responsible for coordinating and managing the RegionServer processes. It handles administrative work such as client requests, load balancing, and Region assignment, merging, and deletion.
• HRegionServer: the data storage node, responsible for storing and serving data. Each Region's data is stored on one RegionServer, and multiple Regions can share the same RegionServer. The RegionServer also handles read and write requests, maintains the WAL (Write-Ahead Log), and splits and merges Regions.
• ZooKeeper: the coordination service of the HBase cluster, mainly used to manage metadata about the cluster, such as which Regions are in use and whether new Regions need to be split. ZooKeeper also coordinates election among multiple HMasters and monitors and manages the whole cluster.
• HQuorumPeer: the ZooKeeper server process that HBase itself starts (because HBASE_MANAGES_ZK=true). It handles session requests and maintains the connections between the machines in the cluster.
  5. Under which HDFS path are the tables created by HBase placed? Each Region has its own directory, which contains the data files and metadata files of that Region. By default these directories live under /hbase/data/<namespace>/<table-name>/<region-name>, where user tables go into the default namespace. For example, a table named "employee" has its data under /hbase/data/default/employee.
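This can be checked from the command line; the commands below assume the example table employee in the default namespace:

```shell
hdfs dfs -ls /hbase/data/default            # one directory per user table
hdfs dfs -ls /hbase/data/default/employee   # one directory per Region
```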
  6. How does HBase read data? How does it write data?
• Reading data: the client first locates the Region containing the requested row key (via ZooKeeper and the meta table; the location is cached for later requests) and sends the read request to the RegionServer hosting that Region. The RegionServer looks for the requested row in its MemStore and Block Cache; if the data is not in memory, it reads it from the HFiles on disk and loads the blocks into the Block Cache for the next query. The RegionServer then returns the query result to the client.
• Writing data: the client locates the RegionServer responsible for the target row in the same way and sends it the new (or modified) row key and column values packaged in a request. The RegionServer first appends the mutation to its write-ahead log (HLog) and then writes it into the Region's MemStore. When the MemStore is full, all the data in it is saved as a newly generated HFile and the MemStore is emptied before new data is written. Once the write is durable, the client receives a confirmation and the write operation ends.

IV. Exercises
• Textbook exercises 4.8
  1. Describe the relationship between HBase and the other components of the Hadoop architecture. HBase uses Hadoop MapReduce to process its massive data and achieve high-performance computing; it uses ZooKeeper as a coordination service for stable operation and failure recovery; it uses HDFS as highly reliable underlying storage, providing massive storage capacity on cheap clusters; Sqoop provides bulk data import for HBase; and Pig and Hive provide high-level language support on top of HBase, which itself is an open-source implementation of BigTable.
  2. Explain the correspondence between the underlying technologies of HBase and BigTable.

| Aspect | BigTable | HBase |
| --- | --- | --- |
| File storage system | GFS | HDFS |
| Massive data processing | MapReduce | Hadoop MapReduce |
| Coordination service management | Chubby | ZooKeeper |
  3. Explain the differences between HBase and a traditional relational database.

| Aspect | Traditional relational database | HBase |
| --- | --- | --- |
| Data model | Relational model | Simpler column-family data model |
| Data operations | Insert, delete, update, query, multi-table joins | Insert, query, delete, truncate; no joins between tables |
| Storage mode | Row-based: tuples (rows) are stored contiguously on disk | Column-based: each column family is saved in several separate files |
| Data index | Complex: multiple indexes built over different columns | Only one index, on the row key |
| Data maintenance | An update replaces the old value in the record with the latest value | An update does not delete the old version of the data but generates a new version |
| Scalability | Hard to scale horizontally; vertical scaling is also limited | Scales easily by adding or removing machines in the cluster |
  4. What types of access interfaces does HBase support? HBase provides the Native Java API, HBase Shell, Thrift Gateway, REST Gateway, Pig, Hive, and other access interfaces.
  5. Illustrate the HBase data model with an example.

| Row key | Info:Name | Info:Age | Info:Sex |
| --- | --- | --- | --- |
| 2022611771 | zhangqiuling | 20 | 02 |
| 2022611770 | zhoushuangfeng | 18 | 02 |

Here Info is the column family, and Name, Age, and Sex are columns within it.
  6. Explain the concepts of row key, column key, and timestamp in HBase.
• Row key: unique within a table, appearing at most once; writing the same row key again updates that row. A row key can be an arbitrary byte array.
• Column key: column families must be defined when the table is created, and their number should not be too large; a column family name must consist of printable characters. The columns themselves do not need to be defined at table-creation time.
• Timestamp: assigned by the system by default, but the user can also set it explicitly; different timestamps distinguish different versions of a cell.
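A small shell sketch of timestamps and versions; the table and column-family names are assumptions, and VERSIONS => 3 tells the column family to retain three versions of each cell:

```shell
create 'notes', {NAME => 'cf', VERSIONS => 3}
put 'notes', 'row1', 'cf:msg', 'first'                # system assigns the timestamp
put 'notes', 'row1', 'cf:msg', 'second'               # same cell, newer timestamp
put 'notes', 'row1', 'cf:msg', 'third', 1234567890    # explicit timestamp (hypothetical value)
get 'notes', 'row1', {COLUMN => 'cf:msg', VERSIONS => 3}   # returns up to 3 versions
```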
  7. Give an example to illustrate the difference between the conceptual view and the physical view in HBase.

Conceptual view (the row appears as one sparse row spanning all column families):

| Row key | Timestamp | Column family anchor | Column family contents |
| --- | --- | --- | --- |
| "com.cnn.www" | T5 | Anchor:cnnsi.com="CNN" | |
| "com.cnn.www" | T4 | Anchor:my.look.ca="CNN" | |
| "com.cnn.www" | T3 | | Content:html="..." |
| "com.cnn.www" | T2 | | Content:html="..." |
| "com.cnn.www" | T1 | | Content:html="..." |

Physical view (each column family is stored separately):

| Row key | Timestamp | Column family anchor |
| --- | --- | --- |
| "com.cnn.www" | T5 | Anchor:cnnsi.com="CNN" |
| "com.cnn.www" | T4 | Anchor:my.look.ca="CNN" |

| Row key | Timestamp | Column family contents |
| --- | --- | --- |
| "com.cnn.www" | T3 | Content:html="..." |
| "com.cnn.www" | T2 | Content:html="..." |
| "com.cnn.www" | T1 | Content:html="..." |
  8. Describe the functional components of HBase and their roles.
• Library functions: linked into every client.
• One Master server: the Master is mainly responsible for managing tables and Regions.
• Many Region servers: the Region server is the core module of HBase; it maintains the Regions assigned to it and responds to the users' read and write requests.
  9. Explain the data partitioning mechanism of HBase. HBase uses partitioned storage: a large table is split into many Regions, and these Regions are distributed across different servers to implement distributed storage.
  10. How are Regions located in HBase? HBase builds a mapping table in which each entry contains two items: a Region identifier and a Region server identifier. Each entry records the correspondence between a Region and a Region server, so looking up a Region in this table tells you which Region server stores it.
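As a practical aside: in the HBase 2.x used in this assignment the mapping table is the hbase:meta system table (the -ROOT- table of the textbook's three-tier scheme was removed in later HBase versions), and it can be inspected from the shell:

```shell
# each row of hbase:meta describes one Region; the info:server column
# records the Region server currently hosting that Region
scan 'hbase:meta', {COLUMNS => 'info:server'}
```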
  11. Describe the names and functions of each level in HBase's three-tier addressing structure.

| Level | Name | Function |
| --- | --- | --- |
| First level | ZooKeeper file | Records the location of the -ROOT- table |
| Second level | -ROOT- table | Records the Region locations of the .META. table; the -ROOT- table can have only one Region, and through it the data in the .META. table can be accessed |
| Third level | .META. table | Records the Region locations of the user data tables; the .META. table can have many Regions and holds the Region location information of all user tables in HBase |

  12. Explain how a client accesses data under HBase's three-tier structure. The client first accesses ZooKeeper to obtain the location of the -ROOT- table, then reads the -ROOT- table to obtain the location of the .META. table, then reads the .META. table to find out which Region server holds the required Region, and finally goes to that Region server to read the data.
  13. Describe the basic architecture of the HBase system and the role of each component.
• Client: contains interfaces for accessing HBase and caches the Region locations it has visited to speed up subsequent data access.
• ZooKeeper server: ZooKeeper helps elect one Master as the cluster manager and guarantees that exactly one Master is running at any time, avoiding a Master "single point of failure".
• Master: mainly responsible for managing tables and Regions: handling users' create, delete, alter, and query operations on tables; balancing load across Region servers; re-adjusting the distribution of Regions after splits or merges; and migrating the Regions of failed Region servers.
• Region server: the core module of HBase, responsible for maintaining the Regions assigned to it and responding to users' read and write requests.
  14. Explain the basic principle of how a Region server reads and writes data to HDFS. A Region server internally manages a set of Region objects and one HLog file. The HLog is a record file on disk that logs all update operations. Each Region object consists of multiple Stores, and each Store corresponds to one column family of the table. Each Store contains one MemStore and several StoreFiles, where the MemStore is an in-memory cache.
  15. Describe how HStore works. Each Store corresponds to the storage of one column family in the table and consists of one MemStore cache and several StoreFile files. The MemStore is a sorted in-memory buffer: when the user writes data, the system first puts it into the MemStore, and when the MemStore is full its contents are flushed to a StoreFile on disk. When a single StoreFile exceeds a certain size threshold, a split operation is triggered.
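A small sketch for observing this behavior, assuming the student table from the earlier examples; flush is a standard shell command that forces the MemStore to be written out:

```shell
# in the HBase shell: force the table's MemStores to be written out as StoreFiles
flush 'student'

# back in the Linux shell: the new StoreFiles appear under the Region's
# column-family directories on HDFS
hdfs dfs -ls -R /hbase/data/default/student
```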
  16. Describe how HLog works. The HBase system configures one HLog file for each Region server; it is a write-ahead log (WAL). Data updated by the user must be written to the log before it can be written to the MemStore cache, and the MemStore contents may be flushed to disk only after the corresponding log entry has been written to disk.
  17. In HBase, each Region server maintains one HLog rather than each Region maintaining its own HLog. Describe the advantages and disadvantages of this approach.
• Advantage: log entries produced by update operations on many Region objects are simply appended to a single log file; there is no need to open and write to many log files at the same time.
• Disadvantage: if a Region server fails, in order to recover its Regions the HLog on that server must first be split according to the Region objects the log records belong to and then distributed to other Region servers to perform the recovery.
  18. When a Region server terminates unexpectedly, how does the Master detect it, and what does the Master do to recover the Regions on that server (including how HLog is used)? ZooKeeper monitors the status of every Region server in real time; when a Region server fails, ZooKeeper notifies the Master. The Master first processes the HLog file left behind on the failed server. Because this HLog contains log records from multiple Region objects, the system splits the HLog data according to the Region each record belongs to and places the pieces in the directories of the corresponding Regions. The failed Regions are then reassigned to available Region servers, and the relevant HLog records are shipped to those servers as well. When a Region server receives the Regions assigned to it and the related HLog records, it replays the operations in the log, writing the logged data into the MemStore cache and then flushing it to StoreFiles on disk, completing the data recovery.
  19. List several commonly used HBase shell commands and explain their use.
• create: create a table
• list: list all tables in HBase
• put: add data to the cell specified by table, row, and column
• get: retrieve the value of a cell by table name, row, column, timestamp, time range, and version number
• scan: browse the contents of a table
• alter: modify the column family schema
• count: count the number of rows in a table
• describe: show information about a table
• enable/disable: enable or disable a table
• delete: delete the data in a specified cell
• drop: delete a table
• exists: check whether a table exists
• truncate: disable a table, drop it, and then re-create it
• exit: exit the HBase shell
• shutdown: shut down the HBase cluster
• version: print HBase version information
• status: print HBase cluster status information
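A brief illustrative session using some of these commands; the table name student and column family Info are assumptions carried over from the earlier examples:

```shell
# run inside the HBase shell
list                                                 # show all tables
describe 'student'                                   # show the table's schema
count 'student'                                      # count the rows
alter 'student', {NAME => 'Info', VERSIONS => 3}     # modify the column family schema
delete 'student', '2022611770', 'Info:Age'           # delete one cell
exists 'student'                                     # check that the table exists
disable 'student'                                    # a table must be disabled before dropping
drop 'student'
status                                               # cluster status
version                                              # HBase version
exit
```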
