[Complete collection of big data interview questions] Big data real interview questions (continuously updated)

1) Flink

1.1. A brief introduction to Flink

Flink is a distributed stream-processing engine that performs stateful computations over bounded and unbounded data. It provides several APIs that make it convenient to write distributed jobs: the DataSet API (deprecated in newer versions and scheduled for removal), the DataStream API, and the Table API, which is still maturing and, compared with Spark SQL, still has a noticeable gap. Flink also provides fault-tolerance mechanisms, real-time alerting with Flink CEP, and other features.

1.2. What is the difference between Flink and SparkStreaming

1) From an architecture perspective: a Spark Streaming job runs on a Driver, Executors, and Workers, while a Flink job mainly relies on the JobManager and TaskManagers.

2) From a data-processing perspective: Spark Streaming is micro-batch processing and requires a micro-batch interval to be specified, while Flink is event-driven and is stream processing in the true sense.

3) From the time mechanism: Flink provides event time, ingestion time, and processing time, event time being the most important, and it supports watermarks and handling of late data. Spark Streaming only supports processing time; Structured Streaming supports event time and watermarks, but Flink's support is more complete.

4) In terms of checkpointing: Flink's checkpoint mechanism supports exactly-once semantics, ensuring each record is reflected in the results exactly once, while Spark may process data more than once.

1.3. How does Flink ensure that data is not lost

1. Checkpoint

  • This is the most important point, and it leads to many related questions. First, be clear about what a checkpoint is:

    During data processing, Flink periodically saves the processing state and processing position, much like taking a snapshot. If the system fails, the job restarts and resumes computation from the position/state recorded in the checkpoint, so data is not lost.

  • When it comes to checkpoints, we have to talk about the barrier mechanism:

    The JobManager starts a checkpoint coordinator thread (if you can't remember the name, just say it starts a thread), which periodically injects barriers into the data stream. A barrier flows through the operators in order: Source, Transformation, Sink. When the barrier reaches the Source, the Source pauses very briefly (so briefly that it hardly affects processing, since Flink is distributed), snapshots its current state, and stores the snapshot temporarily (for example in HDFS); the barrier then moves on, the same snapshot step is repeated for each Transformation, and finally for the Sink. When the barrier acknowledgements return to the checkpoint coordinator and it sees that every operator succeeded, the checkpoint is considered successful; if any operator fails along the way, the checkpoint fails. A minimal configuration sketch follows.
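A minimal sketch, assuming Flink's Java DataStream API (the interval and timeout values are arbitrary), of enabling checkpointing in exactly-once mode:

import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointConfigExample {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Take a checkpoint every 5 seconds with exactly-once semantics.
        env.enableCheckpointing(5000, CheckpointingMode.EXACTLY_ONCE);

        // Discard a checkpoint if it takes longer than 1 minute.
        env.getCheckpointConfig().setCheckpointTimeout(60_000);

        // Allow only one checkpoint in flight at a time.
        env.getCheckpointConfig().setMaxConcurrentCheckpoints(1);
    }
}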

2. Checkpoint execution mode (consistency):

  • at-most-once: at most once. After a failure occurs, the calculation results may be lost;

  • at-least-once: at least once. Even if a failure occurs, it is guaranteed that each data will be processed at least once, so repeated processing may occur;

  • exactly-once: exactly once. Regardless of whether the system fails or not, the result will only appear exactly once, which is exactly once;

    (In practice, exactly-once is the checkpoint consistency mode most commonly used.)

3. State:

Flink state is kept in the memory of each TaskManager (i.e. each node, such as node1, node2, node3); memory is obviously not as safe as disk.

Notice: a checkpoint is the persistent storage of each operator's state (persisted to disk, e.g. on HDFS).

1.4. Checkpointing is configured in a Flink job; if someone later changes the code, will it affect recovery from the checkpoint?

After the code is changed, as long as the data structures stored in the checkpoint do not change, data recovery is not affected.

1.5. How exactly-once is guaranteed in Flink

1. Begin the transaction: create a temporary folder and write the data into it.

2. Pre-commit: flush the data cached in memory into the temporary file and close it.

3. Commit: move the previously written temporary files into the target directory; this represents the final data.

4. Once the final commit succeeds, all temporary files are deleted.

5. If a failure happens after the pre-commit but before the commit, the pre-committed data can either be committed or deleted according to the recorded state.
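A rough sketch of a file-based sink following the steps above, assuming Flink's TwoPhaseCommitSinkFunction base class; the target directory and the Txn class are hypothetical, and this is not Flink's built-in Kafka sink:

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

import org.apache.flink.api.common.ExecutionConfig;
import org.apache.flink.api.common.typeutils.base.VoidSerializer;
import org.apache.flink.api.java.typeutils.runtime.kryo.KryoSerializer;
import org.apache.flink.streaming.api.functions.sink.TwoPhaseCommitSinkFunction;

public class FileTwoPhaseCommitSink
        extends TwoPhaseCommitSinkFunction<String, FileTwoPhaseCommitSink.Txn, Void> {

    /** Transaction handle: only remembers the temporary file of this transaction. */
    public static class Txn {
        public String tmpPath;
    }

    public FileTwoPhaseCommitSink() {
        super(new KryoSerializer<>(Txn.class, new ExecutionConfig()), VoidSerializer.INSTANCE);
    }

    @Override
    protected Txn beginTransaction() throws Exception {
        // 1. Begin transaction: create a temporary file that collects this checkpoint's data.
        Txn txn = new Txn();
        txn.tmpPath = Files.createTempFile("flink-2pc-", ".tmp").toString();
        return txn;
    }

    @Override
    protected void invoke(Txn txn, String value, Context context) throws Exception {
        // Append each record to the transaction's temporary file.
        Files.write(Paths.get(txn.tmpPath), (value + "\n").getBytes(), StandardOpenOption.APPEND);
    }

    @Override
    protected void preCommit(Txn txn) throws Exception {
        // 2. Pre-commit: data has already been flushed to the temp file, nothing left in memory.
    }

    @Override
    protected void commit(Txn txn) {
        // 3./4. Commit: move the temp file into the (hypothetical) target directory.
        try {
            Path target = Paths.get("/tmp/target-dir", Paths.get(txn.tmpPath).getFileName().toString());
            Files.createDirectories(target.getParent());
            Files.move(Paths.get(txn.tmpPath), target, StandardCopyOption.ATOMIC_MOVE);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    @Override
    protected void abort(Txn txn) {
        // 5. Abort: failure before the commit, so delete the pre-committed temporary file.
        try {
            Files.deleteIfExists(Paths.get(txn.tmpPath));
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}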

1.6. What to do if the amount of Flink data is too large

1. Consider the use of processing time.

2. Add machines.

3. Use tumbling (rolling) windows to reduce duplicate data.

4. Use Kafka as a buffer.

1.7. What is the relationship between Flink's slot and parallelism

A slot is a resource slot on a TaskManager, i.e. the resource provider;

parallelism describes how many parallel subtasks run on those slots, i.e. the resource consumer.

1.8. Flink's restart strategy

Restart strategies only take effect when checkpointing is enabled; if no checkpoint is configured, there is no restart strategy. The default (with checkpointing) is the fixed-delay strategy: restart a fixed number of times at a fixed interval, then give up. There are also the fallback restart strategy (use the cluster-level configuration) and the failure-rate restart strategy. For example:
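A small sketch, assuming Flink's Java API, of configuring the fixed-delay (and, commented out, the failure-rate) restart strategy; the retry counts and delays are arbitrary example values:

import java.util.concurrent.TimeUnit;

import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RestartStrategyExample {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Restart at most 3 times, waiting 10 seconds between attempts.
        env.setRestartStrategy(
                RestartStrategies.fixedDelayRestart(3, Time.of(10, TimeUnit.SECONDS)));

        // Alternative: allow at most 3 failures within 5 minutes, restarting 10 seconds apart.
        // env.setRestartStrategy(RestartStrategies.failureRateRestart(
        //         3, Time.of(5, TimeUnit.MINUTES), Time.of(10, TimeUnit.SECONDS)));
    }
}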

1.9. Flink's broadcast variables

Tasks in Flink run in parallel in slots. If tasks in multiple slots all need a copy of the same dataset, we can use a broadcast variable to ship that dataset to every TaskManager (node); each slot then reads the data directly from its own TaskManager instead of fetching it remotely, which reduces the performance cost of remote transfers. Note that only one copy of a given broadcast variable exists on each TaskManager. A small sketch follows.
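A small sketch, assuming the DataSet API's withBroadcastSet/getBroadcastVariable; the data and the variable name "numbers" are made up:

import java.util.List;

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.configuration.Configuration;

public class BroadcastVariableExample {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        DataSet<Integer> toBroadcast = env.fromElements(1, 2, 3);
        DataSet<String> data = env.fromElements("a", "b");

        data.map(new RichMapFunction<String, String>() {
            private List<Integer> broadcastList;

            @Override
            public void open(Configuration parameters) {
                // Each slot reads the broadcast variable from its own TaskManager.
                broadcastList = getRuntimeContext().getBroadcastVariable("numbers");
            }

            @Override
            public String map(String value) {
                return value + ":" + broadcastList;
            }
        }).withBroadcastSet(toBroadcast, "numbers")  // register the broadcast variable
          .print();
    }
}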

1.10. Session windows in Flink

A session window has no fixed size; it closes when no data arrives within the configured session gap (i.e. when the user's session ends), and the data collected during that session is then aggregated. For example:
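A minimal sketch, assuming the DataStream API; the 10-minute gap and the sample data are arbitrary, and an event-time version would additionally need timestamps and watermarks assigned:

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.ProcessingTimeSessionWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class SessionWindowExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.fromElements(Tuple2.of("user1", 1), Tuple2.of("user1", 1))
           .keyBy(t -> t.f0)
           // The window for a key closes after 10 minutes without new data for that key.
           .window(ProcessingTimeSessionWindows.withGap(Time.minutes(10)))
           .sum(1)
           .print();

        env.execute("session window example");
    }
}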

1.11. Flink's state storage

Three: MemoryStateBackend, FsStateBackend, RocksDBStateBackend. For example:
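A small sketch, assuming the classic state-backend API (FsStateBackend); the HDFS path is made up, and the RocksDB variant needs the flink-statebackend-rocksdb dependency:

import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class StateBackendExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Keep state snapshots on a filesystem (e.g. HDFS) instead of JobManager memory.
        env.setStateBackend(new FsStateBackend("hdfs://namenode:8020/flink/checkpoints"));

        // RocksDB variant (uncomment if the dependency is on the classpath):
        // env.setStateBackend(new RocksDBStateBackend("hdfs://namenode:8020/flink/checkpoints"));
    }
}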

1.12. How to solve data skew in Flink windows

1. Pre-aggregate before windows.

2. Re-set or modify the key.

1.13. How Flink handles backpressure

Flink has its own backpressure mechanism: the JobManager and TaskManagers communicate, and once downstream processing becomes slow, the source stage reduces the amount of data it pulls. If this automatic mechanism is not enough to relieve the pressure, you can increase the parallelism (setParallelism) or increase the number of slots to speed up data processing.

1.14. operatorChain in Flink

Operator chaining reduces thread switching and message serialization/deserialization, which lowers latency and raises throughput. When does an operator chain appear? When consecutive tasks are connected in a one-to-one (forwarding) pattern, Flink automatically chains them into a single operator chain. What is one-to-one? The upstream and downstream operators have the same parallelism and the data is not redistributed or reordered; that is considered the one-to-one pattern. During code debugging, chaining is usually disabled so that the data flow can be observed conveniently in the Flink dashboard, as in the sketch below.
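A small sketch, assuming the DataStream API, of disabling chaining either for the whole job or for a single operator:

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class OperatorChainExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Disable chaining for the whole job (useful when debugging in the dashboard).
        env.disableOperatorChaining();

        env.fromElements("a", "b", "c")
           .map(String::toUpperCase)
           .disableChaining()          // or: break the chain only at this operator
           .print();

        env.execute("operator chain example");
    }
}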

1.15. How to solve the hotspot problem in groupby and keyby during aggregation in Flink

1. First, avoid the problem at the business level. For example, if Shanghai and Beijing have far more orders than other cities, process the Shanghai and Beijing data separately.

2. Rework the key (for example, add a prefix/salt to it).

3. Parameter tuning: buffer a certain amount of data before triggering the aggregation, which reduces access to state and thus the short-term data transfer caused by the skew.

1.16. The concept of task slot in Flink

The TaskManager is the worker that actually executes tasks in Flink; each task runs in its own thread inside a TaskManager. Slots divide a TaskManager's resources evenly so that multiple tasks can run at the same time without competing for resources. Note that slots only isolate memory, not CPU.


2)Spark

3)Java

As the most popular development language today, Java is a basic skill for big data developers, and many big data components are developed in Java, for example Flink.

3.1. Collections in Java

[Java-Java Collection] Detailed Explanation and Difference of Java Collection

3.2. How to implement multithreading in Java

1. Inherit the Thread class.

2. Implement the Runnable interface.

3. Implement the Callable interface.

4. Thread pool: provides a queue that holds all waiting tasks and reuses threads, which avoids the overhead of repeatedly creating and destroying threads and improves response speed.

Details: https://www.cnblogs.com/big-keyboard/p/16813151.html
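A brief sketch of the four approaches listed above (class and message names are made up):

import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ThreadDemo {
    public static void main(String[] args) throws Exception {
        // 1. Extend Thread.
        Thread t1 = new Thread() {
            @Override
            public void run() { System.out.println("extend Thread"); }
        };
        t1.start();

        // 2. Implement Runnable.
        new Thread(() -> System.out.println("implement Runnable")).start();

        // 3. Implement Callable (has a return value and can throw checked exceptions).
        Callable<Integer> task = () -> 1 + 1;

        // 4. Thread pool: reuses threads instead of creating/destroying them each time.
        ExecutorService pool = Executors.newFixedThreadPool(2);
        Future<Integer> result = pool.submit(task);
        System.out.println("callable result = " + result.get());
        pool.shutdown();
    }
}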

3.3. How to deduplicate JavaBean in Java

1. Use a HashSet whose generic type is the JavaBean we created and rely on its add() method to deduplicate (this requires the JavaBean to override equals() and hashCode(); see the sketch after the second example).

public static void main(String[] args) {
    Date date = new Date();
    // Create a JavaBean object and set its fields
    JavaBean t1 = new JavaBean();
    t1.setLat("121");
    t1.setLon("30");
    t1.setMmsi("11");
    t1.setUpdateTime(date);
    // Create a second JavaBean object with the same values
    JavaBean t2 = new JavaBean();
    t2.setLat("121");
    t2.setLon("30");
    t2.setMmsi("11");
    t2.setUpdateTime(date);
    // Use a HashSet
    HashSet<JavaBean> hashSet = new HashSet<JavaBean>();
    hashSet.add(t1);
    hashSet.add(t2);
    System.out.println(hashSet);
    System.out.println();
    for (JavaBean t : hashSet) {
        // Only one element is printed
        System.out.println(t);
    }
}

2. Use an ArrayList whose generic type is the JavaBean we created and check with the contains() method before adding, which also deduplicates (contains() relies on an overridden equals()).

// Use a List
List<JavaBean> lists = new ArrayList<JavaBean>();
if (!lists.contains(t1)) {
    lists.add(0, t1);
}
// contains() relies on equals(), so equals must be overridden
if (!lists.contains(t2)) {
    lists.add(0, t2);
}
System.out.println("size: " + lists.size());

3.4. What is the difference between == and equals in Java

== compares whether two references point to the same object in memory, while equals() compares whether the contents of the objects are equal (when properly overridden). For example:
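A tiny illustration with String objects:

public class EqualsDemo {
    public static void main(String[] args) {
        String a = new String("flink");
        String b = new String("flink");

        System.out.println(a == b);        // false: different objects in memory
        System.out.println(a.equals(b));   // true: same content
    }
}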

3.5. Task Timing Scheduler in Java

1. Timer: create a Timer and call timer.schedule() with a TimerTask whose run() method is overridden.

// Approach 1:
Timer timer = new Timer();
// TimerTask task: the task to execute on a schedule
// long delay: how long to wait before the first execution
// long period: the interval between executions
timer.schedule(new TimerTask() {
    @Override
    public void run() {
        System.out.println("executed every 1s");
    }
}, 5000, 1000);

2. ScheduledExecutorService: Executors.newScheduledThreadPool(poolSize) returns a scheduler whose scheduleAtFixedRate() takes a Runnable with an overridden run() method.

// Approach 2:
ScheduledExecutorService executorService = Executors.newScheduledThreadPool(3);
executorService.scheduleAtFixedRate(new Runnable() {
    @Override
    public void run() {
        System.out.println("executed every 1s");
    }
}, 5, 1, TimeUnit.SECONDS);

3. Spring Boot provides support for scheduled tasks, which is very convenient to use; trigger rules are defined with cron expressions.

@Component // the class is a Spring component, created and managed by Spring
@EnableScheduling // enable scanning for scheduled tasks
public class TestTimedTask2 {
    // cron expression helper: https://cron.qqe2.com/
    @Scheduled(cron = "0/3 * * * * ? ")
    public void task() {
        System.out.println("executed every 3s");
    }
}

4)SQL

As one of the most fundamental languages in the computer industry, SQL must also be understood.

4.1. Aggregate functions in SQL

avg(), max(), min(), sum(), count(), etc.

Notice

Aggregate functions cannot be used in a WHERE clause. They are generally used together with GROUP BY (grouping goes hand in hand with aggregation) and can be used in the HAVING clause.

4.2. Various joins and differences in SQL

inner join (inner join): returns the intersection of the two tables, i.e. only the matching rows.

left join (left outer join): the left table is the main one; where the right table has no match, the columns are filled with NULL.

right join (right outer join): the right table is the main one; the reverse of the above.

full join (full outer join): available in Hive but not in MySQL; all rows from both tables are kept, and missing matches on either side are filled with NULL.

4.3. Briefly talk about the data structure in MySQL

MySQL's index data structure is the B+ tree.

[Basic Principles of Computer - Data Structure] Eight Data Structure Classifications

[Basic Principles of Computer - Data Structure] Detailed Explanation of Tree in Data Structure

4.4. What is the difference between a relational database and a NoSQL database in big data

1. Features of relational database:

  • structured storage

  • using structured query language sql

  • Operations must be strongly consistent, e.g. transaction operations

  • Complicated queries such as joins can be performed

  • Cannot handle highly concurrent reads and writes of massive data

2. Features of nosql database:

  • Unstructured storage

  • Strong read/write capability under high concurrency and big data

  • Weak transaction support

  • Weak support for complex operations such as joins

5)Linux

Installing and deploying a big data environment and submitting jar packages all happen on the Linux operating system, so you need to know the common Linux commands.

[Linux-Linux Common Commands] Summary of Linux Common Commands

6)Hadoop

6.1.Yarn

6.1.1. Yarn submit job process

1. The client submits a job (MR, Spark, ...) to the RM.

2. The RM receives the job, randomly picks an NM for it, and starts the AppMaster there, notifying it in the form of a container.

container: a resource container (node, memory, and CPU information) in which the AppMaster runs.

3. The designated NM starts the AppMaster, which then maintains a heartbeat with the RM, reporting that it has started and exchanging related information through the heartbeat.

4. Based on the job information given by the RM, the AppMaster plans the tasks: how many map and how many reduce tasks to start, and how much resource each map and reduce needs. It then sends the resource request to the RM (via heartbeat).

5. After the RM receives the resource request, it hands it to the internal resource scheduler, which allocates resources according to the configured scheduling policy. If no resources are available at the moment, the request waits here.

Notice:

Resources are not handed to the AppMaster all at once. Usually the scheduler tries to satisfy as much of the request as possible; if it cannot satisfy everything, it grants part of the resources so the job can start running first. If the free resources are not even enough for a single container, the request is suspended until resources are sufficient.

6. Through the heartbeat, the AppMaster keeps asking the RM whether the resources are ready; once they are, it obtains the resource information directly.

7. According to the resource description, container resources are started on the specified NMs and the related tasks begin to run.

8. After receiving the start request, the NM launches the tasks and maintains heartbeat connections with both the AppMaster and the RM.

The RM passes task-related information to the AppMaster via heartbeat; the AppMaster reports resource-usage information to the RM via heartbeat.

9. When the NM finishes running its tasks, it notifies the AppMaster and tells the RM that the resources are no longer in use.

10. The AppMaster informs the RM that the job is complete; the RM reclaims the resources and tells the AppMaster to shut itself down.

Notice:

When the NM is running, if an error occurs, the RM will immediately recycle the resources, and the AppMaster needs to apply for resources from the RM again.

details: [Hadoop-Yarn] Yarn's running process

6.1.2. Yarn resource scheduling

1. FIFO scheduler (first in, first out)

When jobs enter the scheduler, it first satisfies all the resource requests of the first job. That job may grab all the resources, and even though the following jobs might only need a short time to run, they all have to wait because one job has taken everything.

This scheduler is generally not used in production, because the YARN cluster in production is shared rather than owned by a single user.

2. Fair scheduler

Multiple queues can be allocated in advance, which is equivalent to pre-dividing resources.

3. Capacity scheduler

This scheduler was contributed by Yahoo and is the default scheduler of Hadoop in the current Apache versions. Each queue can be given a percentage of the resources, ensuring that large jobs have a dedicated queue to run in while small jobs can still run normally.

6.1.3. Yarn members

ResourceManager, NodeManager, ApplicationMaster.

6.2.HDFS

6.2.1. HDFS read and write process

[Hadoop-HDFS] HDFS read and write process & SNN data writing process

6.2.2. What is the impact of too many small files in HDFS

HDFS is good at storing large files. Every file in HDFS has its own metadata, so a large number of small files causes a metadata explosion and puts heavy memory pressure on the node that manages the metadata (the NameNode).

6.2.3. How to deal with too many small files in HDFS

1. Use the official tool parquet-tools to merge the specified parquet files.

# Merge parquet files on HDFS
hadoop  jar  parquet-tools-1.9.0.jar  merge  /tmp/a.parquet  /tmp/b.parquet
# Merge local parquet files
java  -jar  parquet-tools-1.9.0.jar  merge  /tmp/a.parquet  /tmp/b.parquet

2. Merge local small files and upload them to HDFS (merge and upload small files through the appendToFile command of the HDFS client)

hdfs dfs -appendToFile user1.txt user2.txt /test/upload/merged_user.txt

3. Merge the small files of HDFS and download them locally. You can use the getmerge command on the HDFS client to merge many small files into one large file, then download them locally, and finally re-upload them to HDFS.

hdfs dfs -getmerge /test/upload/user*.txt ./merged_user.txt

4. Hadoop Archives (HAR files) was introduced into HDFS in version 0.18.0. Its appearance is to alleviate the problem of a large number of small files consuming NameNode memory.

HAR files work by building a hierarchical file system on top of HDFS. HAR files are created by the hadoop archive command, which actually runs a MapReduce job to package small files into a small number of HDFS files (merge small files into several large files)

# Usage: hadoop archive -archiveName name -p <parent> <src>  <dest>
# Notes on the har command
# "-p" is the prefix of the src path; multiple src paths can be given
# Create the archive:
hadoop archive -archiveName  m3_monitor.har -p /tmp/test/archive_test/m3_monitor/20220809 /tmp/test/archive

# Delete the source directory:
hdfs dfs -rm -r /tmp/test/archive_test/m3_monitor/20220809
# List the archive contents:
hdfs dfs -ls -R  har:///tmp/test/archive/m3_monitor.har
# Un-archive: copy the archive contents to another directory
hdfs dfs -cp har:///tmp/test/archive/m3_monitor.har/part-1-7.gz  /tmp/test/

6.2.4. HDFS members

NameNode, DataNode, SecondaryNameNode; the NameNode has active and standby roles.

6.2.5. Differences and connections between NameNode and SecondaryNameNode

The SecondaryNameNode is not a backup of the NameNode; its main job is to periodically merge the edit log (Edits) with the Fsimage file on disk.

6.2.6. Detailed Explanation of Fsimage and Edits in HDFS

[Hadoop-HDFS] Detailed explanation of Fsimage and Edits in HDFS

6.3.MapReduce

6.3.1. Working mechanism of map stage

[Hadoop-MapReduce] MapReduce programming steps and working principle (see Title 4: Working Mechanism of Map Phase)

6.3.2. Working mechanism of the reduce stage

[Hadoop-MapReduce] MapReduce programming steps and working principle (see Title 5: Working Mechanism of the Reduce Phase)

6.3.3. Advantages and disadvantages of MR

Both the map stage and the reduce stage perform a large amount of disk-to-memory and memory-to-disk IO; MR's main purpose is to solve the problem of computing over massive data.

  • Advantages: able to process massive amounts of data.

  • Disadvantages: a lot of disk IO, so efficiency is relatively low.

6.3.4. Related configuration of MR

  • mapreduce.task.io.sort.mb (default 100): size in MB of the ring (sort) buffer
  • mapreduce.map.sort.spill.percent (default 0.8): spill threshold ratio
  • mapreduce.cluster.local.dir (default ${hadoop.tmp.dir}/mapred/local): directory for spilled data
  • mapreduce.task.io.sort.factor (default 10): how many spill files are merged at a time

7)Hive

7.1. Storage location of Hive related data

The metadata of Hive is stored in mysql (the default is derby, which does not support multi-client access), the data is stored in HDFS, and the execution engine is MR.

7.2. The difference between internal and external tables in Hive

1. Use external to distinguish when creating a table.

2. When an external table is dropped, only the metadata is deleted; when an internal table is dropped, both the metadata and the stored data are deleted.

3. The storage location of an external table is chosen by the user; an internal table is stored under /user/hive/warehouse.

7.3. How Hive implements partitioning

1. When creating a table, specify a field for partitioning.

2. Use alter table to add and delete partitions on the created table.

# Create the table:
create table tablename(col1 string) partitioned by(col2 string);
# Add a partition:
alter table tablename add partition(col2=202101);
# Drop a partition:
alter table tablename drop partition(col2=202101);

3. Modify the partition.

alter table db.tablename
set location '/warehouse/tablespace/external/hive/test.db/tablename'

7.4. Hive loading data

alter table db.tablename add if not exists partition (sample_date='20220102',partition_name='r') location '/tmp/db/tablename/sample_date=20220102/partition_name=r';

7.5. Hive repair partition data

msck repair table test; 

7.6. Sorting and comparison in Hive

1. The sorting methods in Hive are: order by, sort by, distribute by, cluster by.

2. The difference between the four sorting methods:

  • order by: sorts all the data globally; only one reducer does the work

  • sort by: generally used together with distribute by; when there is only one reduce task, the effect is the same as order by

  • distribute by: partitions the data by key; used together with sort by to sort the data within each partition

  • cluster by: when the sort by and distribute by fields are the same, cluster by can be used instead, but it only supports ascending order

Notice

In production, order by is rarely used because it easily causes OOM; sort by together with distribute by is used more often.

7.7. The difference between row_number(), rank() and dense_rank():

All three assign a sequence number (rank) to sorted data:

  • row_number(): sequence numbers are consecutive with no gaps or repeats; rows with equal values still get different, consecutive numbers

  • rank(): rows with equal values get the same rank, and the following rank is skipped, so the total count is preserved (e.g. 1, 1, 3)

  • dense_rank(): rows with equal values get the same rank and no rank is skipped, so the overall sequence is compressed (e.g. 1, 1, 2)

7.8. How Hive implements data import and export

  • Import Data:

    ① The load data method can import data locally or on HDFS

    ②Location method

    create external table if not exists stu2 like student location '/user/hive/warehouse/student/student.txt';
    

    ③ sqoop mode

  • Export data: generally use sqoop

7.9. Use of over() in Hive

8) Sqoop

8.1. Sqoop common commands

[Sqoop-command] Sqoop related understanding and commands

8.2. How does Sqoop handle null values

When exporting data, use the two parameters --input-null-string and --input-null-non-string.

When importing data, use --null-string and --null-non-string.

8.3. How Sqoop handles special characters

Special characters encountered by Sqoop can be dropped with --hive-drop-import-delims, or handled with --hive-delims-replacement, which replaces the special characters with the characters we specify.

8.4. Does the Sqoop task have a reduce phase?

There is only the map stage, and there are no tasks in the reduce stage. The default is 4 MapTasks.

9) Oozie

1. Oozie is generally not used alone, because it is troublesome to configure the xml file.

2. Oozie is generally used together with Hue to schedule various tasks such as Shell, MR, Hive, etc.

3. Whether it is used alone or integrated with Hue, the most important point in Oozie is the workflow configuration.

4. When integrating with Hue, configure the time of the scheduled task in the Schedule, and configure the relevant location information of the task in the workflow.

10) Azkaban

1. Metadata is stored in Mysql.

2. There are three deployment modes: solo-server (all services on one server), two-server (the web server and executors on different servers), and multiple-executor-server (generally less commonly used).

3. Task scheduling of azkaban:

  • Hive script: test.sql

    use default;
    drop table aztest;
    create table aztest(id int,name string) row format delimited fields terminated by ',';
    load data inpath '/aztest/hiveinput' into table aztest;
    create table azres as select * from aztest;
    insert overwrite directory '/aztest/hiveoutput' select count(1) from aztest;
    
  • hive.job(name.job)

    type=command  # fixed
    dependencies=xx  # add this when the job depends on another job
    command=/home/hadoop/apps/hive/bin/hive -f 'test.sql'
    
  • Pack all the files into a zip package and upload it to Azkaban, then click summary, and then select schedule to configure the time.

11)Flume

Flume is a distributed data-collection system used to collect streaming data in real time, with strong fault tolerance and high reliability.

11.1. Architecture components of Flume

(source,channel,sink)

1. Agent in Flume: a general term including source, channel, and sink.

2. Source: the component that collects data. It can monitor changes to a file, new files appearing in a directory, content changes of all files in a directory, or a custom data source; the source's name is specified in the configuration.

3. Channel: There are two ways to configure data caching, one is in memory, and the other is in the way of generating files.

4. Sink: supports HDFS, Kafka, custom target sources, etc., and can also support the next agent.

5. Event: Flume encapsulates the collected data into an event for transmission, which is essentially a byte array.

11.2. Multiple Architectures of Flume

1. Flume can be connected in series in the form of an agent.

2. Flume can be connected in parallel to transfer the sinks of multiple agents to the source of a new agent.

3. Flume can be connected in series + parallel + multiple sinks, etc.

11.3. Related configuration of Flume

1. The name of source, channel, and sink.

2. Whether the channel is memory-based or file-based.

3. The channel corresponding to the sink.

4. If the sink goes to Kafka, you need to configure the topic, port number, ack, etc. of Kafka.

12)Kafka

12.1. Why is Kafka so fast

1. Fast query speed

  • Partition, file segmentation.

  • Binary search locates which segment (file) the message is in.

2. Fast writing speed

  • written sequentially.

  • Zero copy: data is copied directly from the disk file to the network card device without passing through the application. Zero copy uses DMA to copy the file content into the kernel-space ReadBuffer and hands the data straight from the kernel to the network card, so the "zero copy" happens in kernel mode.

  • Batch sending: Set the batch submission data size through the batch.size parameter. The default is 16K. When the data backlog reaches this value, it will be sent uniformly, and the data will be sent to a partition.

  • Data compression: Producer-side compression, Broker-side retention, Consumer-side decompression.

12.2. How does Kafka avoid repeated consumption

Can use two-phase transaction commit (Flink) or container deduplication (HashSet, Redis, Bloom filter)

12.3. How does Kafka guarantee sequential consumption

Kafka is globally unordered but ordered within a partition. As long as we send related messages to the same partition (for example by giving them the same key), they stay in order; a consumer can also be assigned a specific partition to consume. For example:
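A small sketch (the broker address, topic, and keys are made up) of keeping related messages in order by sending them with the same key so they land in the same partition:

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class OrderedProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        KafkaProducer<String, String> producer = new KafkaProducer<>(props);

        // Same key -> same partition -> per-key ordering is preserved.
        producer.send(new ProducerRecord<>("orders", "user-1", "create"));
        producer.send(new ProducerRecord<>("orders", "user-1", "pay"));

        producer.close();
    }
}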

12.4. What is the function of Kafka partition

Improve read and write efficiency + facilitate cluster expansion + consumer load balancing

12.5. How does Kafka ensure that data is not lost

1. On the producer side, set acks (0, 1, -1/all); a producer configuration sketch follows this list.

0: The producer will not wait for the Broker to return ack and continue to produce data.

1: When the Leader in the Broker receives data, it returns ack.

-1 / all: When the Leader and all the Followers in the Broker have received the data, the ack is returned.

2. The consumer side adopts the method of consuming first and then submitting, and would rather consume repeatedly than lose data.

3. Broker side: There is a copy mechanism to ensure data security.
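A small sketch (the broker address and topic are made up) of producer settings that help prevent data loss: acks=all plus retries:

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;

public class NoLossProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringSerializer");

        // Wait until the leader and all in-sync replicas have acknowledged the write.
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        // Retry sends that fail transiently instead of dropping them.
        props.put(ProducerConfig.RETRIES_CONFIG, 3);

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("events", "key", "value"));
        }
    }
}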

12.6. Relationship between consumers and consumer groups

Within one consumer group, a message can only be consumed by one consumer of that group at a time and cannot be consumed by the other consumers in the same group. However, different consumer groups can each consume the same message.

12.7. Kafka Architecture and Basic Principles

[Kafka-architecture and basic principles] Kafka producer, consumer, Broker principle analysis & Kafka principle flow chart

13)HBase

13.1. Architecture of HBase

  • HMaster

  • HRegionServer

  • Region

  • zookeeper

HBase: Client -> ZooKeeper -> HRegionServer -> HLog -> Region -> Store -> MemStore -> StoreFile -> HFile

13.2. HBase read and write process

[HBase-reading and writing process] HBase reading and writing process and internal execution mechanism

13.3. Design of rowkey in HBase

  • Hash

  • reverse timestamp

13.4. Region partition and prepartition

  • HBase can specify pre-partition rules when creating a table.

  • After the HBase table is created, the partition can be changed through the split command.

  • HBase can also use the information in the split.txt file to pre-partition through the SPLIT_FILE command when building a table.

13.5. Advantages and disadvantages of HBase

  • advantage:

    • Support for storage of unstructured data

    • Compared with relational databases, HBase uses columnar (column-family) storage and writes are very fast

    • Null values in HBase are not stored, which saves space and improves read/write performance

    • Support high concurrent reading and writing

    • Supports the storage of large amounts of data

  • shortcoming:

    • It does not support sql query itself

    • Not suitable for large scan queries

14)ClickHouse

14.1. What should be paid attention to when creating a table in ClickHouse

CH needs to specify the engine when creating the table.

14.2. SQL statement when ClickHouse updates or deletes table data

Update: alter table tablename update col = 'xx' where <primary key> = ?

Delete: alter table tablename delete where <primary key> = ?

Insert: insert into tablename (...) values (...)

14.3. ClickHouse Engine Classification and Features

  • MergeTree:

    • Common features of MergeTree engine:

      In CH, data updates are expressed as inserts, because the primary key column is not unique; it only builds an index to speed up queries, so inserting again leaves duplicate rows. A background process in the MergeTree engine merges data periodically, but when it runs is unpredictable.

    • Notice:

      If you run select * from tablename immediately after inserting a row, the query will not reflect the merged (deduplicated) result; only after the background merge does the query return the latest data. To make every query return the latest data, there are two options:

      1. Manually trigger the background merge: optimize table tablename final. But merging a large amount of data is slow and the table is unavailable while merging (not recommended).

      2. Use group-by aggregation in every query, which achieves the deduplication effect.

  • Log engine:

    • TinyLog:

      TinyLog is the smallest and most lightweight engine. For quickly writing small tables (within about 1 million rows) and reading them back as a whole, this engine is the most efficient; data is written append-only.

    • TinyLog Disadvantages:

      No concurrency control: reading while writing throws an exception; writing to the same table from multiple queries at the same time corrupts the data; and indexes are not supported.

  • Integration Engine:

    Kafka, MySQL, HDFS, ODBC, JDBC, used when integrating with other storage databases or storage media.

    • The reason why the project does not use the integrated MySQL engine:

      Because CH is a table mapped to MySQL, it is essentially a select query in MySQL, causing access pressure on the database.

    • Notice:

      If the MySQL engine is used to associate with the specified library in MySQL, then a library will appear in CH, and this library is operated with MySQL's sql syntax.

  • Other specific function engines (less used)

14.4. MergeTree engine classification

  • MergeTree

    1. ORDER BY is generally used to define the sorting key, which also serves as the primary key; if ORDER BY is specified, PRIMARY KEY does not need to be.

    2. After the table is created, a folder is created per partition under the table's directory. Inside a partition, each column has a .bin file (the compressed column data), the .idx file holds the primary-key index, and the .mrk file stores block offsets.

  • ReplacingMergeTree

    1. Use this engine if the table will be modified, but it only guarantees that duplicates are eventually merged away; it cannot guarantee that the primary key appears only once while querying.

    2. ENGINE = ReplacingMergeTree([ver]); ver is the version column. If ver is not specified, the last inserted row is kept; if ver is specified, the row with the largest version is kept.

  • SummingMergeTree

    1. The aggregation function can only perform sum aggregation.

    2. SummingMergeTree([columns]): the specified columns must be numeric and must not be part of the primary key; if no columns are specified, all numeric non-key columns are summed automatically.

    3. If you do not manually refresh, you have to wait for the background to automatically aggregate, or use the sql query of group aggregation to get the refreshed results.

  • AggregatingMergeTree

    1. The aggregation function is more comprehensive, and there are other aggregation functions besides sum.

    2. It is generally used together with MergeTree. Store the detailed data in MergeTree, then insert into the AggregatingMergeTree table, and then use the grouping aggregation query to get the aggregated results.

  • CollapsingMergeTree

    CH does not support standard update and delete syntax, so the approach CH recommends is CollapsingMergeTree: add a sign field to the table, where 1 marks a state row and -1 marks a cancel row; a cancel row indicates that the data in the corresponding state row has been deleted. When a merge is triggered, the 1 and -1 rows are folded (collapsed), i.e. both rows are removed.

    Cases where the state row and the cancel row are not collapsed:

    1. Since the merge mechanism occurs in the background, the specific execution time cannot be predicted, so data redundancy may occur.

    2. If the cancel line appears first and then the status line appears, it cannot be folded.

    3. CH cannot guarantee that the same primary key falls on the same node when data is inserted, and data on different nodes cannot be folded.

    Solution: When querying data, use sum to sum the sign fields, and just have the data greater than 0.

    Note: a cancel row is a copy of the state row except for the sign column. The state-row data can come from OGG or Canal, but do we need to query the CH table by primary key and copy its data to build the SQL that inserts the cancel row?

    Answer: No. For a delete, OGG has a before field and Canal has a data field, either of which can be inserted into the CH table directly as the cancel row.

  • VersionedCollapsingMergeTree

    Compared with CollapsingMergeTree, it has no strict requirement on the insertion order of the state row and the cancel row, but besides the sign column a version column must also be added; the update time of the latest row is used as the version value.

    1. Queries still use group-by aggregation.

    2. Alternatively, use select * from tablename final; this does not actually trigger a merge but returns the data as if the background merge had been applied (it is a very inefficient way to select data, so avoid it on tables with a large amount of data).

14.5. Classification of ClickHouse tables

Local table: deployed in a stand-alone environment

Cluster table: deployed in a cluster environment

14.6. Update and Delete operations in ClickHouse

You cannot use regular sql statements to perform update and delete operations in CH, but use alter table to perform operations.

  • How to view the data of the operation:

    The system.mutations table in CH records these modification operations.

  • The specific process of mutation:

    The WHERE condition first locates the affected partitions; each such partition is rebuilt and the new partition overwrites the old one. Once a partition has been replaced it cannot be rolled back (even modifying a single row rewrites the whole partition, so mutations are relatively slow).

14.7. OLAP Component Classification

  • ROLAP: (real-time aggregation calculation query on the data of the detailed layer)

    impala、soon

  • MOLAP: (multi-dimensional pre-aggregation processing of data, and then real-time calculation and query of pre-aggregation data)

    Druid、Kylin

  • HybridOLAP: (ROLAP + MOLAP)

    TiDB

14.8. Does ClickHouse belong to OLAP or OLTP

ClickHouse belongs to ROLAP, but it can also do MOLAP (pre-aggregation by materialized view)

The difference between CH and other OLAP:

Druid

Real-time query, but does not support complex SQL query, does not support data update operations.

Kylin

Sub-second queries and support for complex SQL, but it pre-builds cubes, which is prone to dimension explosion; the number of dimensions should be kept at around 10, and data update operations are not supported.

ClickHouse

1. Real-time OLAP analysis can be performed on detailed data, and pre-aggregated calculations can also be performed on data.

2. Single-table query has huge advantages.

3. Support data update operation.

4. Support sql query, multi-table joint query and other operations (does not support window functions and related subqueries)

15) Kudu

15.1. Kudu table creation

The Kudu table must have a primary key.

15.2. Does Kudu depend on Hadoop?

Kudu does not need to rely on Hadoop cluster, Kudu has its own storage mechanism.

15.3. There are several partition methods of Kudu

There are three methods: hash, range, hash&range (mixed partition), and the hash method is used for partitioning in the project.

15.4. Does Kudu's storage depend on HDFS?

Kudu does not depend on HDFS.

15.5. Kudu's architecture system (role)

  • master:

    Responsible for managing the slave nodes and storing the location of the data in the slave nodes, that is, metadata.

  • catalog:

    Stores the metadata; it cannot be accessed directly and can only be queried through the API. It records the table structure, table location, state, and the start/end keys (partitions) of each tablet.

    • effect:

      Manage metadata, load balance, monitor tablet-server status.

  • tablet-server:

    It is only responsible for storing data, similar to datanode.

15.6. Kudu's architecture system (features)

1. Kudu’s storage is in the form of a table, with partitions, column names, column types, and column values. It is similar to a relational database, but it is not yet a relational database.

2. Kudu is a master-slave architecture with master and tablet-server roles; the master nodes are divided into active and standby.

3. In Kudu's table partition, a copy of the partition exists on each node, similar to Kafka.

4. The partition of Kudu is similar to the Region of HBase. A table is divided into several partitions according to certain interval rules.

5. Kudu's master-slave node division is similar to datanode and namenode in HDFS.

6. A Kudu table is split into multiple tablets stored on the tablet servers; each partition corresponds to a tablet. There are three partitioning methods: hash, range, and hash+range.

15.7. How to create Kudu table in the project (Spark)

Build a table based on DataFrame:

  • Create KuduContext object (DataFrame, SparkConf)

  • kuduContext.tableExists(tableName)

    • dataFrame.schema returns the table structure as a StructType.

    • StructType.fields.map(field=>{StructField(field.name, field.dataType, true/false[judging whether the field is the primary key])})

    • Create the changed StructField into a new table structure.

    • new CreateTableOptions(): specify that the table is partitioned by the primary-key fields and set the number of replicas.

  • kuduContext.createTable(tableName, kuduStructType, List(primaryFieldName), options)

Kudu needs to create a new createOptions object to create a table. Through this object, specify the number of copies and the partition method. When specifying the partition method, you must specify the primary key field.

16)Ogg

16.1. Introduction to OGG

OGG is a log-based structured data replication software that obtains data additions, deletions, and changes by parsing the online logs or archived logs of the source database.

OGG can realize the acquisition, transformation and transmission of a large amount of transaction data.

16.2. OGG realizes the real-time synchronization process of Oracle

  • First of all, Oracle needs to have the archive log (archivelog) and supplemental logging features enabled.

  • ogg source: (start management process mgr, extract process and pump process)

    1. The mgr process is responsible for managing two child processes.

    2. The extract process is to pull logs such as additions, deletions, and changes from oracle, and then convert the logs into an intermediate file (LocalTrail)

    3. The pump process asks the mgr process on the target side to start the collector process, then compresses the intermediate files and sends them to the collector process.

  • ogg target: (starts the management process mgr, the collector process and the replicat process)

    1. The mgr process is responsible for managing two sub-processes and completing the communication with the remote pump process.

    2. The collector process receives the intermediate files transmitted by the pump process. The collector may be absent; in that case replicat takes its place and receives the files.

    3. The replicat process is used to receive files and convert the files into SQL corresponding to the target end or corresponding insert statements, which can be executed on the target end to achieve data synchronization.

17)Canal

17.1. Canal Implementation Principle

  • Canal is a master-slave architecture that imitates MySQL. The master-slave principle:

    In the MySQL master-slave setup, the slave sends a dump command to the master; the master checks its configuration to confirm the requester is one of its slave nodes, and if so, every time MySQL updates a piece of data the slave is updated accordingly as well.

  • Canal implementation principle:

    Canal is like a pretender, pretending to be a MySQL slave node, and also sending dump commands to the MySQL master, so as to achieve data synchronization between MySQL and Canal.

17.2. Canal realizes the real-time synchronization process of MySQL

  • First of all, MySQL needs to enable the bin-log log (binary-log) function. As the name suggests, all binary data is recorded in this log.

    (canal is a cs architecture with canal-server and canal-client respectively)

  • canal-server:

    canal-server has 1~n instances, and each instance has four processes to coordinate and synchronize data. Usually, an instance corresponds to a database, which is convenient for sending data to different targets.

    • EventParse: Pull data from bin-log logs.

    • MetaManager: Record the position where the bin-log is read, that is, the offset, and record the offset at which the client pulls data from the EventStore.

    • EventSink: Get data from EventParse, perform parsing and conversion, and then write the data to EventSotre.

    • EventStore: client pulls data from here (it is a memory ring buffer)

  • canal-client:

    • canal-client pulls data from the server, how much it pulls each time, and where it is pulled, is managed by the MetaManager in the canal-server.

    • The pulled data is processed and finally written to the Kafka cluster.

18)Druid

18.1. What are the characteristics of Druid (why use Druid in the project)

Druid supports real-time millisecond-level query of massive data, which is suitable for our market time-sharing data business (applied in index projects)

18.2. Why does Druid support real-time millisecond-level queries, and how does it work at the bottom?

1. Excellent distributed architecture design, each role performs its own duties and cooperates with each other, providing overall efficiency.

2. The underlying storage supports Chunk sub-folders and Segment/Partition sub-files.

3. Pre-aggregation to speed up query efficiency.

4. Bitmap index (just understand, you only need to know that space is exchanged for time, and table scanning is avoided by binary bit operations)

19) Kylin

19.1. What are the characteristics of Kylin

1. Kylin supports sql, but does not support data update.

2. Kylin will pre-calculate the data, that is, build a cube.

3. Kylin consumes more time and space during precomputation, but the query efficiency improves after the cube is built.

19.2. Kylin cube

1. Stored in HBase.

2. A cube is composed of multiple cuboids, and the calculation method of the cuboid is 2 to the power of (dimension -1).

19.3. Kylin's Expansion Rate

Expansion rate = (data size after the cube is built) ÷ (original data size); an expansion rate below 1000% is considered normal.

19.4. How to optimize cube (pruning optimization)

1. Using derived dimensions will exclude the non-primary key dimensions of the table.

2. The specified field does not perform cube construction.

3. The cube will be built only when the specified multi-dimensional combination appears at the same time, and the cube will not be built when it appears alone.

19.5. Application scenarios of Kylin

For a large table with low concurrent visits, you can use Kylin if you want to achieve sub-second query.

20)Doris

20.1. Architecture of Doris

FE、BE

  • FE has three roles: leader, follower, and observer. Leader and follower form a master-standby pair to avoid a single point of failure; the observer extends query capacity when query pressure is high and also keeps a backup of the metadata, so observers only read and never write.

  • BE is the data storage node. It is distributed, so queries can run in parallel across multiple BE nodes; BE also stores data in multiple replicas, which can be configured for the data.

  • Broker is a stateless process that can help Doris access external data sources, and a Broker instance is usually deployed on each node.

20.2. Doris port number

Mainly used port numbers:

  • 8030: FE HTTP server

  • 8040: BE HTTP server

  • 9030: FE MySQL server

20.3. Features of Doris

1. Doris can answer data queries with millisecond-level response times, but it is not very friendly to highly concurrent writes; for example, with stream load it is best to issue a load no more often than about once per second.

2. Doris supports the jdbc protocol.

3. Both FE and BE nodes of Doris can be expanded and reduced.

20.4. Problems in the use of Doris

1. After Doris is installed, it will use its own jdk. If we install it by ourselves, it will conflict with Doris and cause Doris to crash.

2. Log in with the MySQL client to check FE status: mysql -h ip -P 9030 -u root -p xxxxx.

3. When mapping a MySQL varchar column to Doris, the length needs to be multiplied by 3.

4. The text type in Doris is String.
