Interview review: Huawei operations and maintenance education project

Suppose you are a big data expert familiar with development languages such as Hive SQL, Python, and shell. Please give a solution for dimensional modeling of a data warehouse.

Ok, I will provide you with a simple solution for dimensional modeling of a data warehouse.

  1. Determine business requirements and goals: define the business scope of the data warehouse and the requirements it must support, giving a clear direction for the dimensional modeling work.
  2. Identify fact tables: based on business needs, select the fact tables, which hold the measurable business data, typically metrics such as quantity, amount, and time.
  3. Identify dimension tables: for each fact table, define the dimensions related to it. Dimension tables describe the attributes or characteristics of the facts, for example time, location, product, and customer tables.
  4. Design dimension relationships: identify the relationships and hierarchies among dimensions, such as the relationship between products, subcategories, and categories.
  5. Design the dimension table schemas: decide how far to normalize dimension tables; normalization reduces redundant data, while denormalized (star-style) dimensions keep queries simpler.
  6. Design aggregation levels: identify the levels of granularity for each dimension and create corresponding aggregation tables to improve query performance.
  7. Design the data model: use the above design information to create a data model, for example an ER model, star schema, or snowflake schema.
  8. Implement the data model: following the design, implement it with a development language such as Hive SQL, Python, or shell (a minimal Hive SQL sketch is shown after this answer).
  9. Test and verify: test the newly created data model to check whether the data is accurate and complete and whether it meets business requirements.
  10. Maintain and optimize: based on business requirements and data warehouse usage, maintain and optimize the data model, improve query performance, and ensure the data quality of the warehouse.

The above is a relatively simple data warehouse dimensional modeling scheme, and I hope it can be helpful to you.
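
As a minimal sketch of step 8, under assumed table and column names (a sales fact table plus date and product dimensions), the star schema could be created with Hive SQL invoked from the shell:

```
# Create a simple star schema in Hive (table and column names are illustrative)
hive -e "
CREATE TABLE IF NOT EXISTS dim_date (
  date_key       INT,
  calendar_date  DATE,
  year           INT,
  month          INT,
  day            INT
);

CREATE TABLE IF NOT EXISTS dim_product (
  product_key    INT,
  product_name   STRING,
  category       STRING,
  subcategory    STRING
);

-- Fact table referencing the dimensions by surrogate keys
CREATE TABLE IF NOT EXISTS fact_sales (
  date_key       INT,
  product_key    INT,
  quantity       BIGINT,
  amount         DECIMAL(18,2)
)
PARTITIONED BY (dt STRING);
"
```

Queries then join fact_sales to dim_date and dim_product on the surrogate keys, and the aggregation tables from step 6 can be built on top of these base tables.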

How to use the awk command to extract the first and second lines of a file directly

The awk command makes it easy to extract specified lines directly. Suppose we have a file named file.txt and we need its first and second lines; we can use the following command:

```
awk 'NR==1 || NR==2' file.txt
```

Where:

  • NR: the current line number (Number of Record).
  • ==: the equality comparison operator.
  • 1, 2: the first and second lines.

awk tracks the current line number as it reads each line; the condition NR==1 || NR==2 tells it to act only on the first and second lines. When a line's number is 1 or 2, awk prints that line (printing is the default action when no action block is given).

In this way, the first and second lines are output directly. If you need to send the output to another file, use the output redirection operator to write it to the specified file:

```
awk 'NR==1 || NR==2' file.txt > output.txt
```

This will write the first and second lines of data from file.txt to output.txt.
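
A minor variant stops reading the file as soon as the second line has been printed, which can save time on very large files:

```
# Print lines 1 and 2, then stop scanning the rest of the file
awk 'NR<=2 { print } NR==2 { exit }' file.txt
```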

How to use awk to split text on vertical bars and display only the first and second columns

Suppose we have a text file file.txt with the following content:

```
apple|red|round
orange|orange|round
banana|yellow|long
grape|purple|round
```

We can use awk to split each line on the vertical bar and display only the first and second columns:

```
awk -F "|" '{print $1,$2}' file.txt
```

Where:

  • -F "|": use the vertical bar as the field separator.
  • '{print $1,$2}': print only the first and second fields.

After running the command, the output is:

```
apple red
orange orange
banana yellow
grape purple
```

Explanation:

  • $1 refers to the first column.
  • $2 refers to the second column.

If you need to save the results to a file, use output redirection:

```
awk -F "|" '{print $1,$2}' file.txt > output.txt
```

This saves the data in the first column and the second column to the output.txt file.
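
By default the two fields are separated by a space in the output. If the vertical bar should be kept as the separator, the output field separator OFS can be set as well, as in this small sketch:

```
# Keep "|" between the two output columns instead of a space
awk -F "|" 'BEGIN { OFS = "|" } { print $1, $2 }' file.txt
```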

CentOS 7 default file system format

CentOS 7 uses the XFS file system by default. XFS is a high-performance, 64-bit journaling file system; it is scalable and reliable and manages large volumes of data and metadata efficiently. XFS was originally designed and developed by Silicon Graphics, was contributed to the Linux kernel community around 2001, and has been included in the Linux kernel since version 2.4.25. Although CentOS 7 uses XFS by default, other file systems such as ext4, Btrfs, and ReiserFS can also be used.
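
To confirm the file system type on a given machine, standard commands such as these can be used (the root mount point is used here as an example):

```
# Show the file system type of the root mount point
df -Th /

# Or query just the type with findmnt
findmnt -n -o FSTYPE /
```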

What is the difference between ext4 and XFS?

ext4 and XFS are two common high-performance Linux file systems. Their functionality largely overlaps, but each has its own advantages and characteristics in specific situations.

The difference is as follows:

  1. XFS supports larger file systems and files. The maximum file size on ext4 is 16 TB, while XFS, a fully 64-bit file system, supports files and file systems up into the exabyte range, which makes it especially suitable for managing very large volumes of data and very large files.
  2. XFS is optimized for large files and avoids the scalability limits of ext4, so it is the better choice when large files need to be managed.
  3. ext4 is better suited to workloads with many small files; XFS carries a higher metadata load, which can become a bottleneck when handling large numbers of small files.
  4. XFS supports fast recovery. After a system crash, an XFS file system can be remounted almost immediately, while ext4 may need more time to check integrity and repair any problems.
  5. When data corruption occurs, ext4 can often repair the damaged file system more quickly, whereas XFS may need a full scan of the file system (xfs_repair), which can take longer.

To sum up, if you need to manage large volumes of data and handle large files, XFS is a better choice. If you need to manage smaller files, ext4 is a better choice.
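
As a hypothetical illustration (the device name /dev/sdb1 and the mount point are placeholders), creating and inspecting an XFS file system looks like this:

```
# Format a partition as XFS (destructive; /dev/sdb1 is a placeholder device)
mkfs.xfs /dev/sdb1

# Mount it and inspect the file system geometry
mkdir -p /mnt/data
mount /dev/sdb1 /mnt/data
xfs_info /mnt/data
```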

Flume components

Apache Flume is a distributed, reliable, and fault-tolerant system for collecting, moving, and processing large volumes of data. Flume gathers large amounts of log data from different data sources into a central store and can perform any necessary processing along the way. The components of Flume include:

  1. Source: receives data from an external data source and passes it into the Flume topology. Flume supports multiple Source types, such as Avro, Exec, JMS, and Netcat.
  2. Channel: a queue that buffers data between the Source and the Sink. Flume supports multiple Channel types, such as Memory and File.
  3. Sink: the data destination, which delivers the data buffered in the Channel to the target storage. Flume supports multiple Sink types, such as HDFS, Kafka, Elasticsearch, and HBase.
  4. Interceptor: transforms, processes, or filters events on their way into the Channel. Flume supports multiple Interceptor types, such as the Regex Filtering, Host, and Timestamp interceptors.
  5. Channel Selector: chooses which Channel to use when a Source writes to multiple Channels. Flume supports multiple Channel Selector types, such as the Replicating and Multiplexing selectors.
  6. Sink Processor: groups several Sinks together and defines how they are used, including batching behavior, to ensure high-throughput delivery.
  7. Monitoring and error handling: Flume provides status information for Sources and Sinks and offers Failover and Load Balancing sink processors for error handling.

These components form the core architecture of Flume and can be combined and configured as needed; with flexible configuration, a reliable data pipeline can be built quickly.
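
As a minimal sketch (the agent name a1, the port, and the file name are assumptions), a Flume agent that wires a Netcat source to a logger sink through a memory channel could be configured and started like this:

```
# Write a minimal agent configuration (names, ports, and paths are illustrative)
cat > example.conf <<'EOF'
a1.sources  = r1
a1.channels = c1
a1.sinks    = k1

a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

a1.channels.c1.type     = memory
a1.channels.c1.capacity = 1000

a1.sinks.k1.type = logger

a1.sources.r1.channels = c1
a1.sinks.k1.channel    = c1
EOF

# Start the agent with Flume's launcher script
bin/flume-ng agent --conf conf --conf-file example.conf --name a1 -Dflume.root.logger=INFO,console
```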

How many types of channels does Flume have?

There are three main types of Channels in Flume:

  1. Memory Channel: caches data in memory and is a lightweight Channel type. Because the data lives in memory, it can run out of memory when handling large volumes of data, and any data still cached in memory is lost when Flume restarts. Memory Channel is therefore mainly suitable for scenarios with demanding real-time transmission requirements and low data-safety requirements.
  2. JDBC Channel: caches data in a relational database and is a persistent Channel type. Because the data is stored in a database, it handles large volumes of data better and offers better data safety and durability than Memory Channel, but it requires additional database configuration and may bring a corresponding performance cost.
  3. File Channel: caches data on local disk and is a persistent Channel type. It can handle large volumes of data and provides good data safety and reliability, and data cached on disk is not lost when Flume restarts. However, compared with Memory Channel, the performance of File Channel is somewhat lower.

Flume also supports custom Channel types, and the most suitable Channel type can be selected according to actual scenarios to meet specific needs.

What can I use if I have 10 TB of data per day and cannot lose any of it?

For scenarios that must process 10 TB of data every day without losing any of it, the Apache Hadoop HDFS distributed file system is recommended. HDFS is the distributed file system of the Apache Hadoop ecosystem; it offers high fault tolerance, high scalability, and high reliability, and it is widely used for big data storage and processing.

HDFS consists of two basic kinds of nodes, the NameNode and the DataNode:

  1. NameNode: stores the metadata and records which DataNodes hold each data block. It manages the files and directories in the file system and maintains metadata such as file permissions, owners, and access times. A single NameNode is a single point of failure; with high availability configured, Hadoop's failover mechanism automatically switches to a standby NameNode.
  2. DataNode: stores the actual data blocks and serves read and write requests from clients. When a file is written to HDFS, it is split into blocks that are stored across multiple DataNodes.

HDFS uses a replication mechanism to ensure that data is not lost. By default it stores three copies of each block, and a higher replication factor can be set for greater reliability. At the same time, HDFS scales horizontally, so DataNodes can be added quickly to handle massive amounts of data.

In addition, given the large amount of data processed every day, it is recommended to use compressed file formats to reduce storage and network overhead.
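
As an illustrative sketch (the paths and the replication factor are assumptions), loading data into HDFS and adjusting replication might look like this:

```
# Create a target directory and upload a local file (paths are placeholders)
hdfs dfs -mkdir -p /data/logs
hdfs dfs -put access.log.gz /data/logs/

# Raise the replication factor of everything under the directory to 4
hdfs dfs -setrep -R 4 /data/logs

# Check overall cluster capacity and DataNode status
hdfs dfsadmin -report
```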

How many kinds of Channels does Flume have, and which one should be used if 10 TB of data per day cannot be lost?

Flume is mainly divided into three types of Channels: Memory Channel, File Channel, and JDBC Channel.

For a requirement of 10 TB per day with no data loss, and to support high-concurrency, high-throughput reads, File Channel is recommended.

File Channel supports large data caches with strong data safety and reliability, and it supports multi-threaded read and write operations, which meets high-concurrency, high-throughput requirements; the cached data is held reliably in files. Because File Channel caches data on local disk, it offers persistent storage, high availability, and good data safety, which makes it well suited to scenarios with large data volumes, high-value data, and strict data-safety requirements.

Of course, File Channel also has disadvantages. Its read and write efficiency is limited by disk I/O, and it requires additional disk space to store the cached data, so a reasonable disk quota needs to be set and maintained according to the actual situation.

To sum up, for the data requirement of 10TB per day, it is recommended to use File Channel to ensure high data reliability and security.
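
A minimal File Channel configuration sketch follows (the agent name, directories, and capacities are assumptions; type, checkpointDir, and dataDirs are the standard File Channel properties):

```
# File Channel section of a Flume agent configuration (illustrative values)
cat > file-channel.conf <<'EOF'
a1.channels = c1
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /data/flume/checkpoint
a1.channels.c1.dataDirs = /data/flume/data
a1.channels.c1.capacity = 1000000
a1.channels.c1.transactionCapacity = 10000
EOF
```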

Kafka creates a topic

In Kafka, to create a new Topic, run the following command:

```
bin/kafka-topics.sh --create --bootstrap-server <broker-host>:<broker-port> --replication-factor <replicas> --partitions <partitions> --topic <topic-name>
```

This command has the following parameters:

  • --create: indicates that a new Topic is to be created.
  • --bootstrap-server: the address and port of a Kafka broker, used to initialize the client and connect to the Kafka cluster, for example localhost:9092.
  • --replication-factor: the number of replicas for each Partition. For fault tolerance this is usually 2 or more, and it cannot exceed the number of Brokers in the cluster (with 2 Brokers, at most --replication-factor=2).
  • --partitions: the number of Partitions for the Topic. Each Partition is handled by one Broker at a time, so the number of Brokers in the cluster is a common starting point.
  • --topic: the Topic name.

For example, to create a topic named test with 4 Partitions and 2 replicas per Partition, execute the following command:

```
bin/kafka-topics.sh --create --bootstrap-server localhost:9092 --replication-factor 2 --partitions 4 --topic test
```

After executing the above command, Kafka will create a Topic named test.

It should be noted that kafka-topics.sh has many other command line options, such as listing topics, adding partitions, and deleting topics; you can run the command with --help to get more information.
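
For instance, the new Topic can be checked with the same script (the broker address is an assumption):

```
# List all topics, then show the partition and replica layout of the new one
bin/kafka-topics.sh --list --bootstrap-server localhost:9092
bin/kafka-topics.sh --describe --bootstrap-server localhost:9092 --topic test
```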

Are you familiar with Spark or Flink?

I have a certain understanding of Spark and Flink, two big data processing frameworks.

In big data scenarios, both Spark and Flink are in-memory computing frameworks that can handle large volumes of data. Spark performs very well for memory-based computation and provides a rich API, making it a preferred framework for large-scale data processing. Flink's advantage is its stream processing capability: it can process data as soon as it is generated, achieving lower latency and faster response times.

Spark and Flink are similar in concept and function. Both frameworks scale data processing vertically and horizontally, so they can meet high-concurrency, high-scalability, and high-performance requirements.

It should be noted that Spark and Flink each have their applicable scenarios. Spark is better suited to large-scale offline processing, such as batch processing, machine learning, and data mining, while Flink is better suited to real-time stream processing. Therefore, choosing the framework that fits the actual scenario can effectively improve data processing efficiency.

Question: daily work

Origin: blog.csdn.net/qq_58432443/article/details/129818301